Confused by String Iteration performance

I've been investigating why some code that iterates over a large(ish) string was a lot slower than expected:

let fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
for c in fileText {
    // Do something
}

For my ~1MB file, the above loop (with no actual useful code) averages out at about 0.4 seconds just to iterate the text.

Now this is where it gets weird. If I do:

let fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
let oneCharacterLess = String(fileText.prefix(fileText.count - 1))
for c in oneCharacterLess {
    // Do something
}

then the loop takes ~0.046 seconds! That is nearly 10 times faster.

Going one step further and wanting to parse the entire string, I did:

var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
fileText.append("A")
fileText.removeLast()
for c in fileText {
    // Do something
}

This also took ~ 0.046 seconds. So what is going on?

I have also measured the time it takes for the append and remove shenanigans above, and that accounted for ~0.02 seconds.

In real life, neither the origin of the string I am iterating nor its length is under my control. My function just takes a String parameter that could have come from anywhere, i.e. sometimes the string may already be in a good state to be iterated quickly, and sometimes it may not. I don't really want to have to do that append/removeLast trick just to ensure it is.

I can't be the first to notice this??


Sounds like a bridging issue. Does fileText.makeContiguousUTF8() also make iteration much faster?

Also the usual caveat: are you running this in an optimized release build (-O)?

Yes. That is a more sensible way than the append/removeLast hack. In both cases, fileText is not contiguous beforehand but is afterwards.

Yes, release build

String(contentsOfFile:) is from Foundation and will create an NSString that is then bridged to a Swift String. Bridged strings are much slower to iterate. String(fileText.prefix(fileText.count - 1)) happens to copy the String natively into Swift.

As others have suggested, if you do

var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
fileText.makeContiguousUTF8()

then it should be fast because it also makes sure it's contiguous UTF-8.


Is there any way to know which String methods are bridged from NSString/Foundation?

Usually, it's more important to know if the string isContiguousUTF8, since that's really where the performance gap is.

Otherwise, I'd just check the definition of NSString. There may be exceptions, but it should be good enough as a rule of thumb.
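For anyone who wants to inspect this on their own strings, here is a minimal sketch of reading the isContiguousUTF8 property. The variable names are made up for illustration. A string literal is stored natively, so it reports true; a string lazily bridged from NSString may report false until it is copied.

```swift
// A string literal is stored natively, so it is already contiguous UTF-8.
let nativeText = "hello, world"
print(nativeText.isContiguousUTF8)

// Strings of unknown origin (e.g. bridged from NSString) may report false.
// Calling makeContiguousUTF8() on a mutable copy guarantees the property afterwards.
var normalized = nativeText
normalized.makeContiguousUTF8()
print(normalized.isContiguousUTF8)
```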

It's possible I can fix this properly at some point for String(contentsOfFile: …, encoding: .utf8), but the non-encoding taking version will be much trickier (it does a bunch of work to try to guess the encoding that would be delicate to replicate in the overlay).

So my suggestion would be regardless of what workarounds you use (e.g. fileText.makeContiguousUTF8()), switch to the encoding-taking version of the initializer as well.
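A rough sketch of that suggestion, assuming a throwaway file in the temporary directory (the file name and contents here are hypothetical, purely so the example is self-contained):

```swift
import Foundation

// Hypothetical demo file: write a small UTF-8 file into the temp directory.
let demoURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("contiguous-demo-\(UUID().uuidString).txt")
try "line one\nline two\n".write(to: demoURL, atomically: true, encoding: .utf8)

// State the encoding explicitly instead of letting Foundation guess it.
var fileText = try String(contentsOfFile: demoURL.path, encoding: .utf8)

// Keep the workaround as well: it copies a bridged string into native storage
// and does nothing if the string is already contiguous UTF-8.
fileText.makeContiguousUTF8()

// Clean up the demo file.
try? FileManager.default.removeItem(at: demoURL)
```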


My initial observations were made for strings returned to me via WKWebView.evaluateJavaScript(...) and not String(contentsOfFile:). The latter was only for testing purposes, but the WKWebView strings exhibit the same behaviour of being non-contiguous and based on NSString deep down.

The tricky part in all this is deciding, given a String of unknown origin, whether it will actually be beneficial to make it contiguous (which itself costs time) before manipulating it, or whether to just manipulate it directly.

That would really need some kind of hinting mechanism so that user code can let the OS know what kind of things it's about to do.

You can just call makeContiguousUTF8(). If it's already contiguously stored in UTF-8, this will be a no-op.
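For a function that receives a String of unknown origin, that could look something like the sketch below. countLines(in:) is a hypothetical helper, not anything from the standard library; the point is only the mutable local copy followed by makeContiguousUTF8() before the loop.

```swift
// Hypothetical helper: counts "\n" characters in a string of unknown origin.
func countLines(in input: String) -> Int {
    var text = input           // parameters are immutable, so take a mutable copy
    text.makeContiguousUTF8()  // no-op if the string is already contiguous UTF-8
    var lines = 0
    for c in text where c == "\n" {
        lines += 1
    }
    return lines
}

print(countLines(in: "a\nb\nc\n"))  // prints 3
```

If the caller passes a native string, the call costs next to nothing; if it passes a bridged one, the cost of the copy is paid once, up front, rather than during iteration.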


If the user only needs to loop through the string once, and doesn't need to keep indices or anything like that, then there's no reason to call makeContiguousUTF8(), which will itself loop through the string.

But that isn't the case. For example:

let fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
//Start timing
for c in fileText {
    // Do something
}

On average, takes about 0.4 seconds. Whereas:

var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
//Start timing
fileText.makeContiguousUTF8()
for c in fileText {
    // Do something
}

Takes about 0.06 seconds.
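In case anyone wants to reproduce this kind of measurement, here is a minimal timing sketch. measure(_:) is a hypothetical helper, and the sample string is generated natively, so it will not show the bridged slowdown by itself; substitute a string from String(contentsOfFile:) or WKWebView to compare. Wall-clock timing with Date is crude, so run it several times in a release build.

```swift
import Foundation

// Hypothetical helper: runs `body` once and returns the elapsed wall-clock seconds.
func measure(_ body: () -> Void) -> TimeInterval {
    let start = Date()
    body()
    return Date().timeIntervalSince(start)
}

// Sample input, built natively for illustration only.
var sampleText = String(repeating: "some sample text\n", count: 10_000)
var characterCount = 0

// Include makeContiguousUTF8() inside the timed region, as in the post above.
let seconds = measure {
    sampleText.makeContiguousUTF8()
    for _ in sampleText {
        characterCount += 1
    }
}
print("iterated \(characterCount) characters in \(seconds) s")
```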


In general, if you're going to look at any significant portion of a String, you really want it to be in native UTF-8. Lazy bridging is a win for the (surprisingly common) case where some Objective-C component passed you a string that you won't end up inspecting before handing it off to some other Objective-C component.


String(contentsOf: …, encoding: .utf8) (or .ascii) should be much faster in the new OS betas. It now directly creates a native Swift String, rather than going through NSString and bridging to Swift.


A lot of people mentioned "contiguous" strings. I also read SE-0247 (swift-evolution/0247-contiguous-strings.md in the apple/swift-evolution repository on GitHub), but it is still not clear to me what exactly "contiguous" means. What does this contiguousness describe? Memory layout? What makes a non-contiguous string not contiguous?

The document you linked to gives the definition:

“Contiguous strings” are strings that are capable of providing a pointer and length to validly encoded UTF-8 contents in constant time.

If the string is not in native UTF-8 (say, because it is bridged from NSString, which represents string contents as a sequence of UTF-16 code units), then it cannot provide a pointer and length to validly encoded UTF-8 contents in constant time.
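To make "pointer and length" concrete, here is a small sketch using withContiguousStorageIfAvailable on the string's utf8 view. The variable names are illustrative. For a contiguous UTF-8 string the closure receives a buffer over the UTF-8 bytes; for a string that cannot provide one, the call returns nil instead.

```swift
// "caf\u{E9}" is 4 characters but 5 UTF-8 bytes (the e-acute takes two bytes).
let word = "caf\u{E9}"

// The closure runs only if the UTF-8 bytes are available as one contiguous buffer.
let byteCount = word.utf8.withContiguousStorageIfAvailable { buffer in
    buffer.count
}

if let byteCount = byteCount {
    print("contiguous UTF-8, \(byteCount) bytes")
} else {
    print("no contiguous UTF-8 storage available")
}
```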

I understand the definition given in the doc, and also UTF-8 and UTF-16. My question is about the word "contiguous". Why is the UTF-8 Swift String called "contiguous"?
Compared to it, how is a UTF-16 NSString non-contiguous? Are there holes in its memory layout while there are none in a UTF-8 string?

I suppose you could infer from that definition that conceptually Swift only considers UTF-8-encoded codepoints to be “strings”, and that UTF-16-encoded codepoints are not strings (but can be converted to strings, lazily or eagerly). It would fit; UTF-16 codepoints would not be “contiguous strings”, even if their code-units are stored in contiguous memory.

That may be reading too much into it, though.

Discontiguous strings may put the data in different regions of memory, e.g., the first half is in one buffer while the latter half is in another.

The APIs actually use the term contiguousUTF8 to avoid such confusion, though in most colloquial Swift contexts "contiguous string" seems to imply the UTF-8 requirement as well.
