Why don't UTF8Views conform to RangeReplaceableCollection?

stephencelis · November 3, 2020, 4:21pm

String, Substring, and UnicodeScalarView conform to RangeReplaceableCollection, so is there a reason why their associated UTF8View and UTF16View don't as well? Couldn't it be helpful for working with a string's underlying representation in a performant way? Or is there a "better" way to modify a string's bytes?

If this conformance does seem like a good addition, what is the path forward? A Swift Evolution pitch?

jayton · November 3, 2020, 4:54pm

Because they don’t conform to MutableCollection. :-)

The views on a string are direct views on the string’s storage, and in the case of UnicodeScalarView can be used to mutate the string:

var str = "😮😨"
str.unicodeScalars.replaceSubrange(...str.startIndex, with: [UnicodeScalar(0xff)])
str // ÿ😨

This is safe because it maintains String’s core invariant, that it’s a sequence of Unicode scalars. If we did the same with UTF-8, it would be possible to break the string:

var str = "😮😨" // UTF-8: [f0 9f 98 ae f0 9f 98 a8]
str.utf8.replaceSubrange(...str.startIndex, with: [0xff])
str // Invalid UTF-8: [ff 9f 98 ae f0 9f 98 a8]

stephencelis · November 3, 2020, 5:04pm

Ah, so this design decision is based on the fact that UTF8Views should be safe to traverse but are not safe to modify? Is there any reasoning behind the ability to create broken strings using String.init(decoding:as:)?

Regardless, is there a recommended way to "unsafely" modify a string's bytes in a performant way (avoiding copies and having the speed of random access)?

xwu · November 3, 2020, 5:13pm

There’s withUTF8:

https://developer.apple.com/documentation/swift/string/3201135-withutf8

Note that not all strings have a mutable backing UTF-8 storage, so you can’t avoid the possibility of copying in those circumstances if your intention is to mutate, but this API is guaranteed to avoid unnecessary copies.
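As a rough sketch of what calling withUTF8 looks like (the variable names are made up for illustration): the closure receives a read-only UnsafeBufferPointer<UInt8>, and the call needs a var because the method is mutating.

```swift
// "h\u{E9}llo" is "héllo"; U+00E9 encodes as two UTF-8 bytes.
var greeting = "h\u{E9}llo"

// withUTF8 is mutating: it may first move the string into contiguous
// UTF-8 storage. The buffer itself is read-only and must not escape the closure.
let byteCount = greeting.withUTF8 { (buffer: UnsafeBufferPointer<UInt8>) -> Int in
    buffer.count
}
let firstByte = greeting.withUTF8 { $0.first }

print(byteCount)          // 6
print(firstByte as Any)   // Optional(104), the byte for "h"
```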

Lantua · November 3, 2020, 5:27pm

You can also use isContiguousUTF8 to manually check whether the string is already stored as contiguous UTF-8, and use makeContiguousUTF8 to forcefully convert it to UTF-8 for later consumption.
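A minimal sketch of that check-then-convert pattern, assuming a hypothetical helper named utf8ByteSum. Native Swift strings are normally contiguous already, so the conversion branch mostly matters for strings bridged from elsewhere.

```swift
// Hypothetical helper: sums the UTF-8 bytes of a string, paying for the
// conversion to contiguous UTF-8 up front if it is needed.
func utf8ByteSum(of string: inout String) -> Int {
    if !string.isContiguousUTF8 {
        // May copy the contents; invalidates previously obtained indices.
        string.makeContiguousUTF8()
    }
    return string.withUTF8 { buffer in
        buffer.reduce(0) { sum, byte in sum + Int(byte) }
    }
}

var sample = "abc"
print(utf8ByteSum(of: &sample))   // 294 (97 + 98 + 99)
```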

scanon · November 3, 2020, 6:28pm

Also, note that it is not currently possible to conform an existing type to an existing protocol while maintaining ABI stability, so this isn't a change we could make with the existing language support for availability.

Michael_Ilseman · November 3, 2020, 6:41pm

RangeReplaceableCollection unfortunately does not support situations where the collection can temporarily be in an invalid state before an eventual validation. RRC conformance would add append(_:UInt8) to the UTF8View, and validation would convert this to U+FFFD for every byte of a multi-byte scalar. An alternative approach could be something like a closure (or coroutine) that allows you to append bytes and only does the error correction at the very end.
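To illustrate the difference, here is a small simulation, not how the standard library would implement anything. It uses String(decoding:as:) as the validating step and a plain [UInt8] array (named pending here, purely for illustration) as the "temporarily invalid" buffer.

```swift
// UTF-8 bytes of "😮" (U+1F62E), as in the example earlier in the thread.
let bytes: [UInt8] = [0xf0, 0x9f, 0x98, 0xae]

// Validating after every single appended byte: each lone byte is
// ill-formed on its own, so each one is repaired to U+FFFD.
var perByte = ""
for byte in bytes {
    perByte += String(decoding: [byte], as: UTF8.self)
}

// Accumulating first and validating once at the end keeps the scalar intact.
var pending: [UInt8] = []
for byte in bytes {
    pending.append(byte)
}
let atEnd = String(decoding: pending, as: UTF8.self)

print(perByte.unicodeScalars.count)   // 4 replacement characters
print(atEnd)                          // 😮
```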

I wrote much more about this here: String Mutations

Karl · November 3, 2020, 11:27pm

You shouldn’t be able to create a “broken string”. String.init(decoding:as:) replaces invalid UTF-8 sequences with the Unicode replacement character (U+FFFD).

If you are able to create a broken string through that initializer, I think that would be a bug.
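For instance, feeding String(decoding:as:) the invalid byte sequence from the earlier example should give back a valid string in which the bad bytes have been repaired. A quick sketch (the variable names are just for illustration):

```swift
// The "broken" bytes from earlier in the thread: 0xff can never appear in
// UTF-8, and the three bytes after it are continuation bytes with no lead byte.
let brokenBytes: [UInt8] = [0xff, 0x9f, 0x98, 0xae, 0xf0, 0x9f, 0x98, 0xa8]

let repaired = String(decoding: brokenBytes, as: UTF8.self)

// The ill-formed prefix shows up as U+FFFD; the trailing well-formed
// sequence still decodes to "😨" (U+1F628).
let hasReplacement = repaired.unicodeScalars.contains { $0.value == 0xFFFD }
let lastScalar = repaired.unicodeScalars.last?.value
print(hasReplacement)   // true
```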

withUTF8 gives you an immutable buffer. However, the function itself is mutating, as it may copy the String’s contents to a new backing store. You should never mutate the String’s content via the provided pointer. Just in case that wasn’t clear to everyone.
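If the goal is byte-level edits, one safe fallback is to copy the bytes out, mutate the copy with random access, and decode it again. This is only a sketch and it does pay for the copies the original question hoped to avoid, but the result always goes back through validation.

```swift
var word = "hello"

// Copy the UTF-8 bytes into an array, which supports random-access mutation.
var utf8Bytes = Array(word.utf8)
utf8Bytes[0] = UInt8(ascii: "j")

// Decoding validates the bytes again and repairs anything ill-formed.
word = String(decoding: utf8Bytes, as: UTF8.self)
print(word)   // jello
```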

xwu · November 3, 2020, 11:38pm

Ah, yes, the ask was to mutate the bytes. My bad.