Strings, UTF-8, and C interop

I understand that Swift strings are now stored as UTF-8 internally. Do these UTF-8 strings also exist in the platform C libraries, or do things like CFString and NSString still always use UTF-16 storage? Is there a version of those objects that is "toll-free bridged" to Swift Strings? And does calling APIs like CoreText cause all String arguments to be converted into UTF-16 strings anyway?

CFString and NSString still use a UTF-16 encoding in most cases.

These types are all bridged to Swift String. Bridging one of them into Swift produces a Swift String that is effectively encoded as UTF-16.

Okay. So going the other direction, passing Swift strings to older non-Swift APIs, I guess you don't get CFStrings that are encoded as UTF-8?

I'm concerned that if I have a bunch of text in Swift Strings that I'm passing to CoreText, I will be constantly reconverting the UTF-8 to UTF-16, so I need to be aware of that.

I checked the CFString API to see if any of the CFStringCreate... functions looked new, like a CFStringCreateWithUTF8BackingStore(...) ... but I don't see anything.

Yup, this can be an issue. We have it set up so that String caches offsets to allow for faster subsequent UTF-8 offset -> UTF-16 offset calculations, but there can still be overhead.

Strings containing only the ASCII subset of UTF-8 don't have this issue, though, since CFString knows how to handle ASCII directly.
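
A quick way to see whether a given string falls into the ASCII-only case is to look at its UTF-8 code units. The sketch below is only an illustration: isAllASCII is a hypothetical helper name, not a standard library API, and passing this check says nothing definite about which path the bridging layer actually takes on a given OS release.

```swift
// Hypothetical helper (not part of the standard library):
// true when every UTF-8 code unit is in the ASCII range.
func isAllASCII(_ s: String) -> Bool {
    return s.utf8.allSatisfy { $0 < 0x80 }
}

let plain = "hello"
let accented = "h\u{E9}llo"   // U+00E9 takes two bytes in UTF-8, one unit in UTF-16

// For ASCII text the UTF-8 and UTF-16 code unit counts are identical,
// so offsets in one view map directly onto the other.
print(isAllASCII(plain), plain.utf8.count, plain.utf16.count)          // true 5 5
print(isAllASCII(accented), accented.utf8.count, accented.utf16.count) // false 6 5
```

Once a string contains non-ASCII characters, the two counts diverge, which is why offset translation between the views needs extra work.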

As always, please file bugs if you run into performance issues in practice. Almost every release of Swift has contained optimizations to the bridging layer based on things users ran into.

How do I check the underlying String / NSString encoding / representation? This is for debugging purposes, so I could use some dirty / semi-private API if nothing better is available.

For most NSString subclasses (including bridged Swift Strings), the fastestEncoding property tells you which encoding the string can provide without transcoding. See fastestEncoding in the Apple Developer Documentation.

You can check the class name to see which kind of NSString you got: "__StringStorage" and "__SharedStringStorage" are bridged Swift Strings, while "__NSCFString" and "__NSCFConstantString" are the most common kinds of non-Swift NSString. (Of course, never rely on getting a particular kind of NSString in any given situation. As the OS changes, which string types are used in which places shifts.)
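
For debugging, both checks can be combined in a few lines. This is a rough sketch assuming an Apple platform with Foundation available. describeStorage is a hypothetical helper name, and the class name and encoding it prints will vary by OS version and string contents, so treat the output as informational only.

```swift
import Foundation

// Hypothetical debugging helper: reports the dynamic NSString class
// and the encoding the string claims it can provide fastest.
func describeStorage(of ns: NSString) -> String {
    let className = NSStringFromClass(type(of: ns))
    let encoding = String.Encoding(rawValue: ns.fastestEncoding)
    return "\(className), fastestEncoding: \(encoding)"
}

// A longer, non-ASCII string, bridged from Swift to NSString.
// Short ASCII strings may come back as a different representation.
let bridged = "h\u{E9}llo, bridged from a Swift String" as NSString
print(describeStorage(of: bridged))
```

No expected output is shown on purpose, since the concrete class is an implementation detail that can change between releases.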

Checking what kind of String you have is somewhat trickier. I guess two things you could do are:

  • Try calling makeContiguousUTF8() on it and see if what the debugger says about it changes
  • Try bridging it to NSString and see if you get one of the Swift types. If it's a bridged NSString to begin with, it will just give you the original NSString back instead.
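
As a programmatic variation on the first suggestion, the standard library also exposes isContiguousUTF8 next to makeContiguousUTF8(), so you can compare the value before and after instead of watching the debugger. This is a minimal sketch, and reportContiguity is a hypothetical helper name.

```swift
// Hypothetical helper: reports whether a string has contiguous UTF-8
// storage before and after asking for it explicitly.
func reportContiguity(_ s: String) -> (before: Bool, after: Bool) {
    var copy = s
    let before = copy.isContiguousUTF8
    // Converts the storage to contiguous UTF-8 if it isn't already.
    copy.makeContiguousUTF8()
    return (before, copy.isContiguousUTF8)
}

let result = reportContiguity("h\u{E9}llo")
print(result.before, result.after)
```

A string for which before is false is a candidate for being lazily bridged from an NSString. After the call, isContiguousUTF8 is true.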

Oh, also, there is one edge case fastestEncoding doesn't cover: in a couple of situations, Swift Strings that have been bridged to NSString will accidentally double-transcode UTF-8 -> UTF-16 -> UTF-8 instead of just using the data as-is. This should be fixed at some point but is tricky due to the internal structure of NSString. If you see transcoding overhead in situations where you think it shouldn't be needed, you may be running into that.