not a beautiful or unique snowflake (nothings) wrote,
not a beautiful or unique snowflake

You know what next file systems should support? Or really, current ones should, given current hard drive sizes.

Maybe a Linux one does, but NTFS definitely doesn't.

  1. Recursive directory sizes. Currently, scanning a directory tree to determine its recursive size is incredibly slow on a drive with a lot of files. This is because drives have gotten larger more than they've gotten faster. If you're using proportionally larger files, that's not a problem though.

    The file system doesn't actually need to propogate file-sizes all the way up the tree; but if the file system had an API requesting the hierarchical size, the file system could take steps to optimize it, like tracking the cumulative size of all files in a given directory (but not its descendents), and making sure it's efficient to look at all the subdirectories (without having to check every file in a directory to see if it's a subdirectory).

  2. File hashes. See, the normal way to find all the duplicate files on a hard drive is to scan it and record all the files and their sizes. Sort by size, and then go through comparing all files of the same size to each other to see if they're really the same. If you start getting a lot of files with the same size, you can speed this up by hashing the whole file, or even just hashing its first cluster, and comparing those before doing the whole-file-compare.

    But as hard drives get bigger and file sizes don't increase, you end up with more files in the same range of sizes as before. Due to a birthday-paradox kind of effect, the odds of two files having the same size increase, to the point where this gets slower _disproportionate_ to the increasing file count--it slows down worse than linear, until you reach the point where you're checking every file. Thus you have to hash a lot of files. (If you have, say, 200,000 files of about 100K each, you probably have to hash almost all your files, even if none are duplicates.)

    Unfortunately, even though you only need to hash a little of the file to get a useful comparison checksum, you still have to load a whole disk cluster (or whatever) to do so. The end result is that, even when there are no duplicates, the duplicate scanning programs I've tried grind the drive for a long time generating these checksums.

    Now, even though we're lazily only computing checksums for files that have some identically-sized other file, we'd be perfectly happy if every file had a checksum; we're just optimizing out computing something we never need. The reality is that we're really just using the checksums and even the original size as hashes on the content--saying 'any two files that are identical must have identical hashes'. And we've just reached the point where size is no longer a sufficient hash.

    Now, it might sound crazy to ask the file system to keep a hash of the entire file. I bet that would be possible if done cleverly, but I don't even want that. Just hash the first cluster, or the first sector--whatever the smallest actual disk transfer unit is. It doesn't matter that it's not a hash of the whole content; some kind of hash is better than none. You'll only need to update it if that sector is written, and you'll always have all the data available.

    Admittedly, there are limited applications for this hash. But the point is, it's something the filesystem can do, and nobody else can do. Java implementations (as well as Squeak smalltalk) do the same thing: because you can't hash using pointers if you have relocating-garbage-collection, these implementations store a hash value (not content derived, in this case; just a random number) so that if you should happen to want to hash the object, you can. They pay this cost even if you never hash any of the objects. Of course, arguably, hashing of objects is a lot more common than hashing of file contents, but I still think it's worth thinking about.

  • Post a new comment


    default userpic

    Your reply will be screened

    Your IP address will be recorded 

    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.
  • 1 comment