Confused by String Iteration performance

Also the usual caveat: are you running this in an optimized release build (-O)?

Yes. That is a more sensible way than the append/removeLast hack. In both cases, fileText is not contiguous beforehand but is afterwards.

Yes, release build

String(contentsOfFile:) is from Foundation and will create an NSString that is then bridged to a Swift String. Bridged strings like that are much slower to iterate. String(fileText.prefix(fileText.count - 1)) happens to copy the contents into a native Swift String.

As others have suggested, if you do

var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
fileText.makeContiguousUTF8()

then it should be fast because it also makes sure it's contiguous UTF-8.

Is there any way to know which String methods are bridged from NSString/Foundation?

Usually, it's more important to know if the string isContiguousUTF8, since that's really where the performance gap is.

Otherwise, I'd just check the definition of NSString. There may be exceptions, but it should be good enough as a rule of thumb.

It's possible I can fix this properly at some point for String(contentsOfFile: …, encoding: .utf8), but the non-encoding taking version will be much trickier (it does a bunch of work to try to guess the encoding that would be delicate to replicate in the overlay).

So my suggestion would be, regardless of what workarounds you use (e.g. fileText.makeContiguousUTF8()), to switch to the encoding-taking version of the initializer as well.
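
A minimal sketch of combining the two suggestions, the encoding-taking initializer plus makeContiguousUTF8(). The helper name loadUTF8Text(atPath:) is made up for illustration and is not a Foundation or standard library API:

```swift
import Foundation

// Hypothetical helper (not part of Foundation or the standard library).
func loadUTF8Text(atPath path: String) throws -> String {
    // Encoding-taking initializer: no encoding guessing is needed.
    var text = try String(contentsOfFile: path, encoding: .utf8)
    // Does nothing if the string is already stored as contiguous UTF-8.
    text.makeContiguousUTF8()
    return text
}
```

Whether the second step still matters depends on the OS and Foundation version, so treat it as a cheap safety net rather than a requirement.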

My initial observations were made for strings returned to me via WKWebView.evaluateJavaScript(...) and not String(contentsOfFile:). The latter was only for testing purposes, but the WKWebView strings exhibit the same behaviour of being non-contiguous and based on NSString deep down.

The tricky part in all this is knowing, given a String of unknown origin, whether it will actually be beneficial to make it contiguous (which itself costs time) before manipulating it, or whether to just manipulate it directly.

That would really need some kind of hinting mechanism so that user code can let the OS know what kind of things it's about to do.

You can just call makeContiguousUTF8(). If it's already contiguously stored in UTF-8, this will be a no-op.
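
For a string of unknown origin, that could look something like the sketch below. The function name countNewlines(in:) is hypothetical and only stands in for whatever scanning you actually do:

```swift
// Hypothetical helper: normalise the storage once, then scan.
func countNewlines(in source: String) -> Int {
    var text = source            // local mutable copy of the parameter
    text.makeContiguousUTF8()    // no-op when already contiguous UTF-8
    var newlines = 0
    for c in text where c == "\n" {
        newlines += 1
    }
    return newlines
}

// A native Swift string already reports contiguous UTF-8 storage,
// so the call above costs essentially nothing for it.
let native = "first\nsecond\nthird"
print(native.isContiguousUTF8, countNewlines(in: native))
```

No isContiguousUTF8 check is needed before the call; the property is mainly useful for diagnosing where a slow string came from.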

If the user only needs to loop through the string once, and doesn't need to keep indices or anything like that, then there's no reason to call makeContiguousUTF8(), which itself has to loop through the string.

But that isn't the case, e.g.:

let fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
//Start timing
for c in fileText {
    // Do something
}

On average, takes about 0.4 seconds. Whereas:

var fileText = try! String(contentsOfFile: "<some 1MB utf8 text file>")
//Start timing
fileText.makeContiguousUTF8()
for c in fileText {
    // Do something
}

Takes about 0.06 seconds.
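
For anyone who wants to repeat the comparison, here is a rough timing sketch. The helper timeIteration(of:makeContiguousFirst:) is a made-up name, and the stand-in string below is native, so by itself it will not show the gap; substitute a string that really came from NSString (a file load, WKWebView, etc.) and build with -O:

```swift
import Foundation

// Hypothetical helper: times one full pass over the characters,
// optionally including the cost of makeContiguousUTF8() in the timing.
func timeIteration(of original: String,
                   makeContiguousFirst: Bool) -> (characters: Int, seconds: TimeInterval) {
    var text = original
    let start = Date()
    if makeContiguousFirst {
        text.makeContiguousUTF8()
    }
    var characters = 0
    for _ in text {
        characters += 1
    }
    return (characters, Date().timeIntervalSince(start))
}

// Stand-in input (assumption): replace with the string you are investigating.
let standIn = String(repeating: "some sample text\n", count: 50_000)
let direct = timeIteration(of: standIn, makeContiguousFirst: false)
let prepared = timeIteration(of: standIn, makeContiguousFirst: true)
print("direct: \(direct.seconds) s, contiguous first: \(prepared.seconds) s")
```

Wall-clock timing of a single run is noisy, so average over several runs before drawing conclusions.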

In general, if you're going to look at any significant portion of a String, you really want it to be in native UTF-8. Lazy bridging is a win for the (surprisingly common) case where some Objective-C component passed you a string that you won't end up inspecting before handing it off to some other Objective-C component.

String(contentsOf: …, encoding: .utf8) (or .ascii) should be much faster in the new OS betas. It now directly creates a native Swift String, rather than going through NSString and bridging to Swift.

A lot of people mentioned "contiguous" strings. I also read swift-evolution/0247-contiguous-strings.md in the apple/swift-evolution repository on GitHub. But it is still not clear to me what exactly "contiguous" means. What does this contiguousness describe? Memory layout? What makes a non-contiguous string not contiguous?

The document you linked to gives the definition:

“Contiguous strings” are strings that are capable of providing a pointer and length to validly encoded UTF-8 contents in constant time.

If the string is not in native UTF-8 (say, because it is bridged from NSString, which represents string contents as a sequence of UTF-16 code units), then it cannot provide a pointer and length to validly encoded UTF-8 contents in constant time.
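
To make "a pointer and length" concrete, here is a small sketch using the two standard library entry points from that proposal. The variable names are just for illustration:

```swift
var greeting = "h\u{E9}llo"   // "héllo": five Characters, six UTF-8 bytes

// withUTF8 is mutating: if the string is not contiguous UTF-8 yet,
// it is converted first, then the closure gets a pointer and length.
let byteCount = greeting.withUTF8 { buffer in
    buffer.count
}

// The non-mutating variant returns nil when the string cannot hand out
// its UTF-8 storage directly (for example, a lazily bridged string).
let fastPath = greeting.utf8.withContiguousStorageIfAvailable { buffer in
    buffer.count
}

print(byteCount, fastPath as Any)
```

The buffer is only valid inside the closure, so copy out whatever you need rather than letting the pointer escape.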

I understand the definition given in the doc, and also UTF-8 and UTF-16. My question is about the word "contiguous". Why is the UTF-8 Swift String called "contiguous"?
Compared to it, how is a UTF-16 NSString non-contiguous? Are there holes in its memory layout while there are none in a UTF-8 string?

I suppose you could infer from that definition that conceptually Swift only considers UTF-8-encoded codepoints to be “strings”, and that UTF-16-encoded codepoints are not strings (but can be converted to strings, lazily or eagerly). It would fit; UTF-16 codepoints would not be “contiguous strings”, even if their code-units are stored in contiguous memory.

That may be reading too much into it, though.

Discontiguous strings may put the data in different regions of memory, e.g., the first half is in one buffer while the latter half is in another.

The APIs actually use the term contiguousUTF8 to avoid such confusion, though in most colloquial Swift usage "contiguous string" seems to imply the UTF-8 requirement as well.

You should also consider checking out Piercing the String Veil to help provide further clarity on the various String representations. Note that the above reference to "contiguous UTF-8" is important: a String will be opaque if it is incapable of providing a pointer to contiguous UTF-8 bytes. It doesn't matter which of those two constraints it can't satisfy: it may store contiguous UTF-16 code units, or it may store discontiguous UTF-8; either will force the String type to be opaque.

The way I think of what “contiguous” means in this context is less that it is contiguous, and more that it provides access to its contiguousness. So a random NSString may or may not be contiguous in practice, but if we can’t ask it (via CFStringGetCStringPtr or its private ObjC counterpart) for that storage, then it may as well not be for our purposes.
