So I've been investigating why some code that iterates over a large(ish) string was a lot slower than expected:
let fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
for c in fileText {
// Do something
}
For my ~1MB file, the above loop (with no actual useful code) averages out at about 0.4 seconds just to iterate the text.
Now this is where it gets weird. If I do:
let fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
let oneCharacterLess = String(fileText.prefix(fileText.count - 1))
for c in oneCharacterLess {
// Do something
}
then the loop takes ~0.046 seconds! So nearly 10 times faster iteration.
Going one step further and wanting to parse the entire string, I did:
var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
fileText.append("A")
fileText.removeLast()
for c in fileText {
// Do something
}
This also took ~ 0.046 seconds. So what is going on?
I have also measured the time taken by the append/removeLast shenanigans above, and that accounted for ~0.02 seconds.
In real life, the origin of the string I am iterating is not under my control, nor is its length. My function just takes a String parameter that could have come from anywhere, i.e. sometimes the string may already be in a good state to be iterated quickly, while sometimes it may not. I don't really want to have to do that append/removeLast trick just to ensure it is.
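For anyone wanting to reproduce this, a harness along these lines shows the same effect (sketch only: the path is a placeholder and Date-based timing is coarse, but fine at this scale):
import Foundation

// Sketch: time a plain iteration of the string as loaded vs. a natively
// re-copied one (placeholder path).
let path = "<some 1MB utf8 text file>"
let bridgedText = try! String(contentsOfFile: path)

func time(_ label: String, _ body: () -> Void) {
    let start = Date()
    body()
    print(label, Date().timeIntervalSince(start))
}

time("as loaded") {
    for _ in bridgedText { /* Do something */ }
}

var nativeText = bridgedText
nativeText.append("A")
nativeText.removeLast()
time("after append/removeLast") {
    for _ in nativeText { /* Do something */ }
}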
String(contentsOfFile:) is from Foundation and will create an NSString that is then bridged to a Swift String. Iterating such lazily bridged strings is much slower. String(fileText.prefix(fileText.count - 1)) happens to copy the contents into a native Swift String.
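You can see the difference with the isContiguousUTF8 property (a quick sketch; the path is a placeholder):
import Foundation

// Sketch: the lazily bridged string reports non-contiguous storage, while the
// prefix-based copy has been re-encoded into native UTF-8.
let bridged = try! String(contentsOfFile: "<some 1MB utf8 text file>")
print(bridged.isContiguousUTF8)        // typically false for a large NSString-backed string

let nativeCopy = String(bridged.prefix(bridged.count - 1))
print(nativeCopy.isContiguousUTF8)     // true: the contents were copied into native storage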
As others have suggested, if you do
var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
fileText.makeContiguousUTF8()
then it should be fast, because that also makes sure it's contiguous UTF-8.
It's possible I can fix this properly at some point for String(contentsOfFile: …, encoding: .utf8), but the non-encoding-taking version will be much trickier (it does a bunch of work to try to guess the encoding that would be delicate to replicate in the overlay).
So my suggestion would be: regardless of what workarounds you use (e.g. fileText.makeContiguousUTF8()), switch to the encoding-taking version of the initializer as well.
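Put together, that would look something like this (a sketch; the path is a placeholder):
import Foundation

// Sketch: prefer the encoding-taking initializer, then force contiguous UTF-8
// before heavy iteration (a no-op if the string is already contiguous).
var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>", encoding: .utf8)
fileText.makeContiguousUTF8()
for c in fileText {
    // Do something
    _ = c
}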
My initial observations were made for strings returned to me via WKWebView.evaluateJavaScript(...), not String(contentsOfFile:). The latter was only for testing purposes, but the WKWebView strings exhibit the same behaviour of being non-contiguous and backed by NSString deep down.
The tricky part in all this is given a String of unknown origin as to whether it will actually be beneficial to make it contiguous (which itself costs time) before manipulating it, or just manipulate it directly.
That would really need some kind of hinting mechanism so that user code can let the OS know the kind of things it's about to do.
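For now, the best I can see is to check isContiguousUTF8 at the boundary and decide there. Something like this sketch (the parse(_:) function and the assumption that the whole string will be walked are just illustrative):
// Sketch: normalise a String of unknown origin before walking all of it.
func parse(_ text: String) {
    var text = text
    if !text.isContiguousUTF8 {
        // Pay the one-off O(n) conversion up front, since we're about to
        // touch every character anyway.
        text.makeContiguousUTF8()
    }
    for c in text {
        // Do something
        _ = c
    }
}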
If the user only needs to loop through the string once, and doesn't need to keep indices or anything like that, then there's no reason to call makeContiguousUTF8(), which will itself loop through the string.
let fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
//Start timing
for c in fileText {
// Do something
}
On average, takes about 0.4 seconds. Whereas:
var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
//Start timing
fileText.makeContiguousUTF8()
for c in fileText {
// Do something
}
In general, if you're going to look at any significant portion of a String, you really want it to be in native UTF-8. Lazy bridging is a win for the (surprisingly common) case where some Objective-C component passed you a string that you won't end up inspecting before handing it off to some other Objective-C component.
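For example (a sketch; the forward function and the idea of handing the string off untouched are just illustrative):
import Foundation

// Sketch: a string received from Objective-C and passed straight back to
// Objective-C never needs its contents transcoded to UTF-8.
func forward(_ ns: NSString, to receiver: (NSString) -> Void) {
    let s = ns as String        // typically a lazy bridge, still backed by the NSString
    receiver(s as NSString)     // handed back without ever inspecting the characters
}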
String(contentsOf: …, encoding: .utf8) (or .ascii) should be much faster in the new OS betas. It now directly creates a native Swift String, rather than going through NSString and bridging to Swift.
“Contiguous strings” are strings that are capable of providing a pointer and length to validly encoded UTF-8 contents in constant time.
If the string is not in native UTF-8 (say, because it is bridged from NSString, which represents string contents as a sequence of UTF-16 code units), then it cannot provide a pointer and length to validly encoded UTF-8 contents in constant time.
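Concretely, "providing a pointer and length to UTF-8 contents" looks something like this (a sketch; the path is a placeholder):
import Foundation

// Sketch: once a string is contiguous UTF-8, its entire contents can be
// exposed as a single pointer + length buffer.
var text = try! String(contentsOfFile: "<some 1MB utf8 text file>", encoding: .utf8)
text.makeContiguousUTF8()
text.withUTF8 { buffer in
    // buffer is an UnsafeBufferPointer<UInt8> over the whole string; for a
    // contiguous string this is available without any transcoding.
    print(buffer.count)
}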
I understand the definition given in the doc, and also UTF-8 and UTF-16. My question is about the word "contiguous". Why is the UTF-8 Swift String called "contiguous"?
Compared to it, how is a UTF-16 NSString non-contiguous? Are there holes in its memory layout, while there are none in a UTF-8 string?
I suppose you could infer from that definition that conceptually Swift only considers UTF-8-encoded codepoints to be “strings”, and that UTF-16-encoded codepoints are not strings (but can be converted to strings, lazily or eagerly). It would fit; UTF-16 codepoints would not be “contiguous strings”, even if their code-units are stored in contiguous memory.
Discontiguous strings may put the data in different regions of memory, e.g., the first half is in one buffer while the latter half is in another.
The APIs actually use the term contiguousUTF8 to avoid such confusion, though "contiguous string" seems to include the UTF-8 requirement in most colloquial Swift contexts.