So I've been investigating why some code that iterates over a large(ish) string was a lot slower than expected:
let fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
for c in fileText {
// Do something
}
For my ~1MB file, the above loop (with no actual useful code) averages out at about 0.4 seconds just to iterate the text.
Now this is where it gets weird. If I do:
let fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
let oneCharacterLess = String(fileText.prefix(fileText.count - 1))
for c in oneCharacterLess {
// Do something
}
then the loop takes ~0.046 seconds! So nearly 10 times faster iteration.
Going one step further and wanting to parse the entire string, I did:
var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
fileText.append("A")
fileText.removeLast()
for c in fileText {
// Do something
}
This also took ~ 0.046 seconds. So what is going on?
I have also measured the time taken by the append/removeLast shenanigans above, and that accounted for ~0.02 seconds.
In real life, the origin of the string I am iterating is not under my control, nor is its length. My function just takes a String parameter that could have come from anywhere, i.e. sometimes the string may already be in a good state to be iterated quickly, while sometimes it may not. I don't really want to have to do that append/removeLast trick just to ensure it is.
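For anyone wanting to reproduce this, a harness along these lines shows the same effect (sketch only: the path is a placeholder and Date-based timing is coarse, but fine at this scale):
import Foundation

// Sketch: time a plain iteration of the string as loaded vs. a natively
// re-copied one (placeholder path).
let path = "<some 1MB utf8 text file>"
let bridgedText = try! String(contentsOfFile: path)

func time(_ label: String, _ body: () -> Void) {
    let start = Date()
    body()
    print(label, Date().timeIntervalSince(start))
}

time("as loaded") {
    for _ in bridgedText { /* Do something */ }
}

var nativeText = bridgedText
nativeText.append("A")
nativeText.removeLast()
time("after append/removeLast") {
    for _ in nativeText { /* Do something */ }
}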
String(contentsOfFile:) is from Foundation and will create an NSString that is then bridged to a Swift String. Iterating such lazily bridged strings is much slower. String(fileText.prefix(fileText.count - 1)) happens to copy the contents into a native Swift String.
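You can see the difference with the isContiguousUTF8 property (a quick sketch; the path is a placeholder):
import Foundation

// Sketch: the lazily bridged string reports non-contiguous storage, while the
// prefix-based copy has been re-encoded into native UTF-8.
let bridged = try! String(contentsOfFile: "<some 1MB utf8 text file>")
print(bridged.isContiguousUTF8)        // typically false for a large NSString-backed string

let nativeCopy = String(bridged.prefix(bridged.count - 1))
print(nativeCopy.isContiguousUTF8)     // true: the contents were copied into native storage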
As others have suggested, if you do
var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
fileText.makeContiguousUTF8()
then it should be fast, because that also makes sure it's contiguous UTF-8.
It's possible I can fix this properly at some point for String(contentsOfFile: …, encoding: .utf8), but the non-encoding-taking version will be much trickier (it does a bunch of work to try to guess the encoding that would be delicate to replicate in the overlay).
So my suggestion would be: regardless of what workarounds you use (e.g. fileText.makeContiguousUTF8()), switch to the encoding-taking version of the initializer as well.
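Put together, that would look something like this (a sketch; the path is a placeholder):
import Foundation

// Sketch: prefer the encoding-taking initializer, then force contiguous UTF-8
// before heavy iteration (a no-op if the string is already contiguous).
var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>", encoding: .utf8)
fileText.makeContiguousUTF8()
for c in fileText {
    // Do something
    _ = c
}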
My initial observations were made for strings returned to me via WKWebView.evaluateJavaScript(...), not String(contentsOfFile:). The latter was only for testing purposes, but the WKWebView strings exhibit the same behaviour of being non-contiguous and backed by NSString deep down.
The tricky part in all this is given a String of unknown origin as to whether it will actually be beneficial to make it contiguous (which itself costs time) before manipulating it, or just manipulate it directly.
That would really need some kind of hinting mechanism so that user code can let the OS know the kind of things it's about to do.
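For now, the best I can see is to check isContiguousUTF8 at the boundary and decide there. Something like this sketch (the parse(_:) function and the assumption that the whole string will be walked are just illustrative):
// Sketch: normalise a String of unknown origin before walking all of it.
func parse(_ text: String) {
    var text = text
    if !text.isContiguousUTF8 {
        // Pay the one-off O(n) conversion up front, since we're about to
        // touch every character anyway.
        text.makeContiguousUTF8()
    }
    for c in text {
        // Do something
        _ = c
    }
}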
If the user only needs to loop through the string once, and doesn't need to keep indices or anything like that, then there's no reason to call makeContiguousUTF8(), which will itself loop through the string.
let fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
//Start timing
for c in fileText {
// Do something
}
On average, takes about 0.4 seconds. Whereas:
var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
//Start timing
fileText.makeContiguousUTF8()
for c in fileText {
// Do something
}
In general, if you're going to look at any significant portion of a String, you really want it to be in native UTF-8. Lazy bridging is a win for the (surprisingly common) case where some Objective-C component passed you a string that you won't end up inspecting before handing it off to some other Objective-C component.
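For example (a sketch; the forward function and the idea of handing the string off untouched are just illustrative):
import Foundation

// Sketch: a string received from Objective-C and passed straight back to
// Objective-C never needs its contents transcoded to UTF-8.
func forward(_ ns: NSString, to receiver: (NSString) -> Void) {
    let s = ns as String        // typically a lazy bridge, still backed by the NSString
    receiver(s as NSString)     // handed back without ever inspecting the characters
}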
String(contentsOf: …, encoding: .utf8) (or .ascii) should be much faster in the new OS betas. It now directly creates a native Swift String, rather than going through NSString and bridging to Swift.
“Contiguous strings” are strings that are capable of providing a pointer and length to validly encoded UTF-8 contents in constant time.
If the string is not in native UTF-8 (say, because it is bridged from NSString, which represents string contents as a sequence of UTF-16 code units), then it cannot provide a pointer and length to validly encoded UTF-8 contents in constant time.
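Concretely, "providing a pointer and length to UTF-8 contents" looks something like this (a sketch; the path is a placeholder):
import Foundation

// Sketch: once a string is contiguous UTF-8, its entire contents can be
// exposed as a single pointer + length buffer.
var text = try! String(contentsOfFile: "<some 1MB utf8 text file>", encoding: .utf8)
text.makeContiguousUTF8()
text.withUTF8 { buffer in
    // buffer is an UnsafeBufferPointer<UInt8> over the whole string; for a
    // contiguous string this is available without any transcoding.
    print(buffer.count)
}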
I understand the definition given in the doc, and also UTF-8 and UTF-16. My question is about the word "contiguous". Why is the UTF-8 Swift String called "contiguous"?
Compared to it, how is a UTF-16 NSString non-contiguous? Are there holes in its memory layout, while there are none in a UTF-8 string?
I suppose you could infer from that definition that conceptually Swift only considers UTF-8-encoded codepoints to be “strings”, and that UTF-16-encoded codepoints are not strings (but can be converted to strings, lazily or eagerly). It would fit; UTF-16 codepoints would not be “contiguous strings”, even if their code-units are stored in contiguous memory.
Discontiguous strings may put the data in different regions of memory, e.g., the first half is in one buffer while the latter half is in another.
The APIs actually use the term contiguousUTF8 to avoid such confusion, though "contiguous string" seems to include the UTF-8 requirement in most colloquial Swift contexts.