What locale does String.lowercased() use?

taylorswift · October 25, 2024, 10:26pm

it is surprisingly hard to find the answer to this question! from reading the source code of the standard library, it seems String.lowercased calls Unicode.Scalar.Properties.lowercaseMapping

is this the same as the en_US_POSIX locale?

Joe_Groff · October 25, 2024, 10:31pm

It doesn't use any locale. It uses the default lowercase mapping defined by the Unicode standard, which is independent of locale.
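
To make that concrete, here is a minimal sketch. It assumes the reading of the standard library source in the original post is right, and `scalarwiseLowercased` is a hypothetical helper written for illustration, not a standard library API. It rebuilds the lowercase string one Unicode scalar at a time and compares the result with `lowercased()`.

```swift
// Hypothetical helper: lowercase a string by concatenating the Unicode
// lowercase mapping of each scalar. No Locale is involved at any point.
func scalarwiseLowercased(_ string: String) -> String {
    var result = ""
    for scalar in string.unicodeScalars {
        // lowercaseMapping is a String, because a single scalar may map
        // to more than one scalar.
        result += scalar.properties.lowercaseMapping
    }
    return result
}

let samples = ["HELLO", "Straße", "ÄPFEL", "I"]
for sample in samples {
    let viaString = sample.lowercased()
    let viaScalars = scalarwiseLowercased(sample)
    print(sample, "->", viaString, "| same as scalar-wise mapping:", viaString == viaScalars)
}
```

Only the standard library is used here, so there is no way for the process locale to influence the result. Foundation's `lowercased(with:)` is a separate API that does take a `Locale`.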

taylorswift · October 25, 2024, 10:41pm

if i have a database with a unique index on a string field, and i want to validate from Swift that all keys are unique under String.lowercased before performing an insert, what collation should i select when creating the index?

the closest thing i could think of is en_US_POSIX with secondary ICU comparison level, but i have no idea if this is actually correct.
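
For reference, the Swift-side check described above could look roughly like the sketch below. `collisions(in:)` is a hypothetical helper name, and the sketch says nothing about which database collation would agree with it.

```swift
// Hypothetical helper: group candidate keys by their lowercased form and
// return only the groups that contain more than one original key.
func collisions(in keys: [String]) -> [String: [String]] {
    var groups: [String: [String]] = [:]
    for key in keys {
        // Swift compares and hashes String by canonical equivalence, so two
        // differently encoded but canonically equivalent keys land in the
        // same group. A database index may or may not behave the same way.
        groups[key.lowercased(), default: []].append(key)
    }
    return groups.filter { $0.value.count > 1 }
}

let candidates = ["Swift", "swift", "Rust"]
let duplicates = collisions(in: candidates)
print(duplicates.isEmpty ? "all keys unique" : "colliding keys: \(duplicates)")
```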

Karl · October 25, 2024, 10:58pm

Just be careful that Swift and your database engine may disagree about whether a key is unique or a duplicate, as their data tables may be at different versions. And each may further change its answer as they are updated with new tables.

See the recent pitch thread for Unicode normalisation APIs, and in particular the discussion around stable normalisations.

taylorswift · October 25, 2024, 11:07pm

i suppose this is probably unavoidable. but if i wanted to minimize the probability of disagreement (e.g., to minimize the number of HTTP redirect cycles in a URL router), what is the best way to align both sides?

for example, is there any advantage to using String.lowercased(with:) with the language set to en?

Karl · October 25, 2024, 11:34pm

I would check from the Swift side whether the string contains unassigned code-points.

If it does, Swift can’t say for certain whether the strings are unique or duplicates. The uncertainty is that strings which are considered unique (today) may be duplicates (tomorrow). No strings ever go from duplicates -> unique.

If it does not, Swift can definitely say whether the strings are unique or duplicates. So if they are unique, you can put them in the DB knowing that nobody will ever think those keys are duplicates.
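
A minimal sketch of that check, using the standard library's Unicode scalar properties. `containsUnassignedScalars` is a hypothetical helper name, and the answer reflects whichever Unicode version the running Swift runtime ships with.

```swift
// Hypothetical helper: true if any scalar in the string has the general
// category "unassigned" (Cn) in the Unicode data known to this runtime.
func containsUnassignedScalars(_ string: String) -> Bool {
    string.unicodeScalars.contains { scalar in
        scalar.properties.generalCategory == .unassigned
    }
}

let key = "example-key"
if containsUnassignedScalars(key) {
    // The uniqueness verdict for this key could change with newer Unicode data.
    print("key contains unassigned code points; uniqueness is not stable")
} else {
    print("key uses only assigned code points")
}
```

A key that fails the check could be rejected outright or flagged for review, depending on how strict the uniqueness guarantee needs to be.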

As for the locale-related aspect, I don’t know. But the normalisation stability issue remains regardless, so I figured it was worth mentioning.