Using MD5 of the file contents as a proxy for modification

igor-makarov · February 16, 2020, 12:45pm

Hi,

I've been working on a large Xcode project for a while now, and one of the things that I've run into is rebuilds getting triggered when a file is touched, but its contents are not modified.

The reason is that llbuild considers several fstat values when determining whether a file was modified.

This is also the practice in several other well-known build systems, such as make, but some have claimed that it might be harmful.

Cases when using modification time produces a false positive:

Git operations
Code getting regenerated by a script
Accidentally typing a character and then deleting it

What do you think? I've done a quick experiment and it seems that using an MD5 hash of the file instead is possible!

ddunbar · February 16, 2020, 3:52pm

Yes, using hashes of the file contents to track modifications is indeed a completely reasonable thing to do, and one that has always been intended as the long term direction for llbuild. There are some challenges to be solved, for example in some cases the build does or may care about the non-content attributes (for example, suppose they file is being copied and downstream dependencies cares about the other attributes). We would need to find a strategy for how to handle such things. Also, we need to take some care about the performance overhead of hashing.

If you want to potentially work on these issues, I’m happy to brainstorm with you and @David_M_Bryson.

scanon · February 16, 2020, 5:17pm

One note: please don't use MD5; in fact, I don't see any reason why this would use any fixed hash function, rather than whatever is fastest with suitable guarantees on each platform, since this is data that never needs to move across systems.

MD5, in particular, is significantly slower than other options that we might use instead--this use case doesn't require a cryptographic hash function (and if it did require a cryptographic hash, MD5 is no longer considered safe to use, and is also slower than other options on many platforms)--it requires only a hash that is sufficient to detect if a file has changed, which is a much weaker requirement.

igor-makarov · February 16, 2020, 5:34pm

Thanks for the comment, @scanon!
I picked MD5 merely because there was an easy way to use it in the context of my experiment. What hash function do you suggest? Alternate question, what portable way is there to determine file content modification?

@ddunbar - thanks for taking an interest. I didn't quite understand the non-content remark and the example. Can you elaborate on this and other possible corner cases?

Also, does this go through Swift Evolution?

Edit: I'm not sure this data necessarily stays on the same machine - distributed builds are part of the plan eventually, I hope

lukasa · February 17, 2020, 8:19am

Cryptographic hash functions are still fairly reasonable choices in this kind of environment: they're still very fast! They certainly aren't as fast as other choices, but cryptographic hash functions are definitely designed for speed. The advantage of using a cryptographic hash here is that there is no risk that the hash function will be incorrectly used in a situation where a cryptographic hash function should have been used instead.

That said, if you were going to use a non-cryptographic hash function, Murmur seems to be the go-to solution in the space. If you were going to use a cryptographic hash, anything in the SHA2 family is fine, with SHA-256 a reasonable choice. And @scanon's advice of making this parameterised is also extremely sound.

ddunbar · February 17, 2020, 2:47pm

Non-content meaning that sometimes the file system attributes (user, group, permissions, even time stamp), could be relevant on files that are copied wholesale, not just which have their contents read.

Nope, doesn’t need swift evolution; just happens as an llbuild design discussion.

Daniel

scanon · February 17, 2020, 2:51pm

Fair, though MD5 isn't particularly so (partially because it's no longer considered secure, so it isn't ever going to get HW assist instructions, like the various SHA hashes have on arm64 and some x86 implementations, and a number of CS hashes that came later were more explicitly designed with speed as a goal).

igor-makarov · February 17, 2020, 8:03pm

@ddunbar any examples of these non-content cases?

igor-makarov · February 18, 2020, 11:13am

Update: I've been doing some more prototyping, and I've realized I can make the hash calculation lazy.

Rely on mtime first, and only calculate the hash if the mtime changed.

This means that the change will not incur a performance penalty in any existing cases!

Also, I've been thinking about the hash function bikeshed and I can't find any other function that's readily available in the llbuild project hierarchy. We can use the one available, since it doesn't seem to slow anything down much.

I'm thinking I'll open a PR so that the discussion can become more concrete.

abertelrud · February 19, 2020, 1:00am

Thanks for working on this! Using the mtime as a quick check before the hash seems very reasonable. There is a potential case of the content changing and the mtime getting restored, masking the change to the content, but that seems like quite an edge case.

As for the question of what constitutes "content": in addition to metadata such as permissions, there are also things like extended attributes on some platforms. Whether or not these matter actually depends on the task to which the node is an input — a task that compiles source code wouldn't care, but a task that copies a file might care.

igor-makarov · February 19, 2020, 6:30am

I might be wrong here, but wouldn't something like this require actually sourcing the rule validity function to the actual node?

I mean, surely ExternalCommand cannot know about all the possible cases?

abertelrud · February 19, 2020, 5:00pm

It certainly would complicate things, and is not a distinction that most build systems pay attention to. It may be possible to abstract out various aspects of file contents and to vend a hash for each one (e.g. one for data content, one for permissions, and another for xattrs). Each rule could then specify which aspects it cares about, and hashes would only be computed as needed.

It's also likely that it doesn't end up mattering much in practice: many SCM systems have trouble properly storing most kinds of metadata (I think Git stores permissions but not, for example, xattrs). Therefore it's probably not reliable for a build system to depend on a copied file always having the right metadata set at the source anyway, so this may be a conceptual distinction that doesn't make a difference in practice.

I raised it mostly as something to think about when considering what the "contents" of a file really includes, and as answer to the question about examples of the "non-content cases" that Daniel had pointed out.

stevenhepting · June 1, 2020, 4:42pm

You would likely want to integrate with watchman as a related feature so that you don't ever need to scan files that haven't changed since the original scan. This is how Buck is so fast while still verifying the contents of files.

ddunbar · June 1, 2020, 4:44pm

The time to actually stat is often overinflated. On modern filesystems (APFS) and with a proper implementation, it is possible to scan many hundreds of thousands of files in a very short order. APFS even supports some very fancy ways of being able to prune the scan when subdirectories are unmodified, in certain situations.

Not saying we shouldn't do this, but there is a lot of FUD in this area. Whenever I've written actual benchmarks the results are very different from what would expect only based on anecdotal work.

stevenhepting · June 1, 2020, 9:37pm

Is this just for a plain stat though? Or getting a full SHA (or other hash) of the file contents? I think the hash calculation is often what folks attempt to avoid.

DavidGoldman · June 2, 2020, 12:18am

Integrating with a tool like watchman makes sense since it will effectively handle the file watching (e.g. inotify/FSEvents) and re-stating - the other option is implementing a similar solution for all platforms. Not sure if it also handles re-reading the file (to get the hash), but that makes sense to confirm that the file has changed before re-running things.

Not sure of the performance of stat though, I'm guessing results may vary based on your drive, FS type, and # of input files. And of course this may largely vary by OS - if i recall correctly I think I remember reading that Windows has quite slow stat performance.

ddunbar · June 2, 2020, 2:22am

It’s easy to drive the hashing off the stat, so files are only rehashed if modified, otherwise the hash is cached.