How does DocC mitigate FNV-1 hash collisions?

as is well-known, DocC relies on truncated FNV-1 hashes to disambiguate colliding symbol names. FNV-1 hash collisions are uncommon but trivial to produce in the general case, and while it’s harder for the subset of swift USR strings to collide, my suspicion is that it is quite possible for two compiler-generated USRs to collide.

does DocC have a mechanism to mitigate this?

whelp, i did some investigation and it turns out FNV-1 collisions are astonishingly common, even within the standard library!

hash symbols
6TDP8 [s:12RegexBuilder7CaptureV2as_9transformACySs_qd_7_qd_0_qd_1_qd_2_qd_3_qd_4_qd_5_qd_6_tGAA9ReferenceVyqd_7_G_qd_8_yXEqd_7_qd__KctcSs_qd_7_qd_0_qd_1_qd_2_qd_3_qd_4_qd_5_qd_6_tRsz17_StringProcessing0A9ComponentRd_8_qd___qd_0_qd_1_qd_2_qd_3_qd_4_qd_5_qd_6_t0A6OutputRtd_8_r8_lufc, s:10Foundation4rintyAA7CGFloatVADF]
ZJEM [s:s21UnicodeDecodingResultO5erroryA2BmF, s:10Foundation12UserDefaultsC14volatileDomain7forNameSDySSypGSS_tF]
6ZUHS [s:s16IndexingIteratorV11SubSequencea, s:10Foundation16FileAttributeKeyV21groupOwnerAccountNameACvpZ]
2ZV9P [s:10Foundation40ObjectiveCConvertibleAttributedStringKeyP15objectiveCValue3for0bH0Qz5ValueQz_tKFZ, s:20FoundationNetworking10HTTPCookieC10isHTTPOnlySbvp]
90YA4 [s:s7UnicodeO12_RegexParserE6ScriptO4totoyA2EmF, s:10Foundation12NSDictionaryC23enumerateKeysAndObjects7options5usingyAA20NSEnumerationOptionsV_yyp_ypSpyAA8ObjCBoolVGtXEtF]
9S1CO [s:17_StringProcessing23RegexRepetitionBehaviorV, s:12RegexBuilder011AlternationB0V17buildPartialBlock11accumulated4nextAA8ChoiceOfVySs_q_q0_q2_Sgq3_Sgq4_SgtGq5__q6_t17_StringProcessing0A9ComponentR5_AmNR6_x_q_q0_t0A6OutputRt5_q1__q2_q3_q4_tAORt6_r7_lFZ]
4EN3E [s:ScT12_Concurrencys5NeverORszACRs_rlE5yieldyyYaFZ, s:s7UnicodeO12_RegexParserE5BlockO10warangCitiyA2EmF]

how did this go unnoticed for so long?

Do they have to be unique, or just unique across symbols with the same full-name? That really changes the scope of the problem.

4 Likes

for disambiguation they only need to be unique for symbols with the same full name after case-folding, but if you want to use the hash as an ID for the symbol in a database (since symbol strings can grow to be very long), it needs to be unique at least at the module level.

1 Like

We're aware that the hash collisions are possible but for what DocC uses the truncated FNV-1 hashes they're unique enough.

DocC only uses the truncated FNV-1 hash as disambiguation for symbols that have the same name in the same scope (and the same module). This significantly narrows the scope and reduces the risk of collisions.

Using one example from the collisions you found: "2ZV9P" could be the hash for either of these symbols or possible another symbol in the developers own code:

  • static Foundation.ObjectiveCConvertibleAttributedStringKey.objectiveCValue(for: A.Value) throws -> A.ObjectiveCValue
  • FoundationNetworking.HTTPCookie.isHTTPOnly : Swift.Bool

These are in different modules, so that already avoids the link collision, but even if they were in the same module, the first is named objectiveCValue(for:) and the other is named isHTTPOnly so this hash collision isn't an issue for DocC because DocC doesn't use the hash without also using the symbol name.

We haven't explored how unique these hashes are within a module because we've haven't needed to use them as unique identifiers in DocC.

2 Likes

still, seeing as there is a non-negligible probability of the compiler generating USRs with the same truncated hash, it is only a matter of time until a hash collision occurs between two symbols with the same name across the vast multitude of swift packages in existence. what is the fallback mechanism available when FNV-1 hashes cannot be used to disambiguate between two symbols?

If they are different kinds of symbols, then their links can be disambiguated by the symbol kind. The only other fallback would be to add more leading path components until the link starts with the module name.

For example; even in a relative link the isHTTPOnly symbol above could also be written like HTTPCookie/isHTTPOnly or like FoundationNetworking/HTTPCookie/isHTTPOnly.

If two symbols in the same module have all the same path components, are the same kind of symbol, and have a collision in the truncated FNV-1 hash then a couple of things would break;

  • DocC wouldn't be able to know which of the two symbols an authored link was referring to
  • Both symbols would have the same web URL and same file path within the .doccarchive—likely resulting in only one of them getting a page.

DocC operates under the assumption that within a module, the combination of full symbol path, symbol kind, and truncated symbol hash is always unique. There isn't a fallback mechanism today if that assumption is proven wrong.

1 Like

yes, this is the edge case i am concerned about. my feeling is that certain scopes can get quite crowded with overloads (due to Encoder/Decoder conformances, gybbed interfaces, and soon, macros) and that drives up the probability of a hash collision.

gotcha.

1 Like