You answered about it being mostly ASCII while I was in the middle of writing. That means this post is probably moot with regard to your specific test, but I'll leave it here for others who come by wondering something similar.
P.S. You can check for sure how many non-ASCII scalars are in your 4,095,547-character string by doing the following:
print(theBigString.unicodeScalars.filter({ $0.value >= 0x80 }).count)
(You said you created the list from your home directory, which could easily include things like the Music library, where an application is itself happily creating files with all sorts of Unicode, named after things you never consciously typed in and never touched in Finder.)
[Original post:]
If you happen to be as French as your name, and have accented characters all throughout your file system...
...then for your particular test, Swift is severely handicapped because of how smart it is. Swift will never be as fast as the other languages in the list, but it will also never be as incorrect.
Swift respects Unicode canonical equivalence, whereas the languages you are comparing against do not. That means Swift considers "a" + "\u{300}" (an a followed by a combining grave accent) and "\u{E0}" (the precomposed à) to be equal.
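For example, this little check (the variable names are just for illustration) shows the equality holding at the String level even though the scalar sequences differ:

let precomposedForm = "\u{E0}"    // LATIN SMALL LETTER A WITH GRAVE
let combiningForm = "a\u{300}"    // LATIN SMALL LETTER A + COMBINING GRAVE ACCENT
print(precomposedForm == combiningForm)    // true: String equality is canonical
print(precomposedForm.unicodeScalars.elementsEqual(combiningForm.unicodeScalars))    // false: the scalar sequences differ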
That has implications for its conformance to Comparable as well, since canonically equivalent but distinct scalar sequences need to compare the same. As such, String sorts lexicographically after normalization. I couldn't find it stated in the documentation, so it might just be an implementation detail, but at the moment it happens to use NFC (the composed form). You can verify that with this code:
let composed = "\u{E0}"
let decomposed = "a\u{300}"
let strings = [composed, decomposed, composed, decomposed, "a", "b", "c"]
let sorted = strings.sorted()
let descriptions: [String] = sorted.map { string in
    let scalars = string.unicodeScalars
        .map({ String($0.value, radix: 16, uppercase: true) })
        .joined(separator: ", ")
    return "\(string): \(scalars)"
}
for entry in descriptions {
    print(entry)
}
a: 61
b: 62
c: 63
à: E0
à: 61, 300
à: E0
à: 61, 300
So to do things right, Swift first has to normalize the string before it can lexicographically compare the bytes. The other languages are being negligent and just feeding the raw bytes directly into a lexicographical comparison. You can see the mess they make by sorting the same strings I listed above: the four instances of à will be split into two groups far away from each other in the list.
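You can even reproduce that mess in Swift by bypassing String's comparison and sorting the raw UTF-8 bytes. Reusing the strings array from the snippet above, something like this should show the split:

// Mimic a byte-wise sort by comparing UTF-8 bytes directly instead of using <.
let byteOrdered = strings.sorted { $0.utf8.lexicographicallyPrecedes($1.utf8) }
for string in byteOrdered {
    let scalars = string.unicodeScalars
        .map({ String($0.value, radix: 16, uppercase: true) })
        .joined(separator: ", ")
    print("\(string): \(scalars)")
}
// The two "61, 300" entries land right after "a", while the two "E0" entries
// end up after "c", so the four à's split into two separate groups.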
To make matters worse for your particular test, the macOS file system also respects canonical equivalence, but it does so by forcing all file names into NFD (the decomposed form). Swift does fast-path NFC strings by marking them as already normal so that normalization can be skipped. But by sourcing your list from the file system, you have ensured that none of the strings hit that fast path, because they will all be in the opposite form.
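If you want to confirm that on your own data, here is a rough diagnostic (assuming fileNames holds the strings you sorted) that counts how many names are literally stored in each form. Note that the scalar views have to be compared, because == itself applies canonical equivalence and would report everything as equal:

import Foundation

// `fileNames` is a placeholder for whatever array of names you are sorting.
// Count how many names are stored scalar-for-scalar in NFC versus NFD, using
// Foundation's canonical-mapping properties to produce each form for comparison.
let nfcCount = fileNames.filter {
    $0.unicodeScalars.elementsEqual($0.precomposedStringWithCanonicalMapping.unicodeScalars)
}.count
let nfdCount = fileNames.filter {
    $0.unicodeScalars.elementsEqual($0.decomposedStringWithCanonicalMapping.unicodeScalars)
}.count
print("already NFC: \(nfcCount), already NFD: \(nfdCount)")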
So since Swift is doing a lot more work (for good reason), it will quite expectedly finish your sort more slowly than the other languages. To make a fairer comparison, you could do one of two things:
- Make the other platforms do all the work Swift is doing, by swapping out their standard implementation of < with something that compares the strings' normalized forms instead.
- Abandon Unicode correctness in Swift, by circumventing the String type and turning each one into a plain array of scalars (Array($0.unicodeScalars)) before you start the sort; a rough sketch follows this list.
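As a sketch of that second option (with fileNames again standing in for your actual array of strings), it could look something like this:

// `fileNames` is a placeholder for whatever array you are sorting.
// Sort plain arrays of scalars by scalar value, trading Unicode correctness for speed.
let scalarArrays = fileNames.map { Array($0.unicodeScalars) }
let sortedArrays = scalarArrays.sorted { lhs, rhs in
    lhs.lexicographicallyPrecedes(rhs) { $0.value < $1.value }
}
// Convert back to String afterwards if you still need the text.
let sortedNames = sortedArrays.map { String(String.UnicodeScalarView($0)) }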
(None of this negates the observations made by the others. It stacks with the sorts of things they have already mentioned.)
[End of original post.]
String(contentsOf:encoding:) comes from Objective-C. It is really an NSString under the hood, and NSString uses UTF-16 internally. So what you have loads UTF-8 from the file system, converts it to UTF-16 for the sake of NSString, and then bridges it to String. The bridging leaves it in UTF-16 on the assumption that if it came from Objective-C it is likely to go back again. Each time you compare such bridged strings, Swift first has to convert them to UTF-8 (since their order in Swift is defined as bytewise according to UTF-8). That is a lot of repetitive, unnecessary work, all triggered because the strings were fetched from Objective-C. String instances that begin their existence in native Swift do not suffer from this.
Adding makeContiguousUTF8() at the beginning forced each string into UTFâ8 before starting the sort. Since it now only happens once instead of repeatedly, the sort goes much faster.
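For anyone else hitting this, a minimal sketch of that fix might look like the following (listURL and the one-name-per-line format are assumptions about how the list is stored):

import Foundation

// `listURL` is a placeholder for wherever the list file actually lives.
// Load the list via Foundation, then force each resulting string into native,
// contiguous UTF-8 storage exactly once before sorting. makeContiguousUTF8()
// is a no-op for strings that are already contiguous UTF-8.
var names = try String(contentsOf: listURL, encoding: .utf8)
    .split(separator: "\n")
    .map { String($0) }
for index in names.indices {
    names[index].makeContiguousUTF8()
}
let sortedList = names.sorted()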