Why don't UTF8Views conform to RangeReplaceableCollection?

String, Substring, and UnicodeScalarView conform to RangeReplaceableCollection, so is there a reason why their associated UTF8View and UTF16View don't as well? Couldn't it be helpful for working with a string's underlying representation in a performant way? Or is there a "better" way to modify a string's bytes?

If this conformance does seem like a good addition, what is the path forward? A Swift Evolution pitch?

Because they don't conform to MutableCollection. :-)

The views on a string are direct views on the string's storage, and in the case of UnicodeScalarView can be used to mutate the string:

var str = "😮😨"
str.unicodeScalars.replaceSubrange(...str.startIndex, with: [UnicodeScalar(0xff)])
str // ÿ😨

This is safe because it maintains String's core invariant, that it's a sequence of Unicode scalars. If we did the same with UTF-8, it would be possible to break the string:

var str = "😮😨" // UTF-8: [f0 9f 98 ae f0 9f 98 a8]
str.utf8.replaceSubrange(...str.startIndex, with: [0xff])
str // Invalid UTF-8: [ff 9f 98 ae f0 9f 98 a8]

Ah, so this design decision is based on the fact that UTF8Views should be safe to traverse but are not safe to modify? Is there any reasoning behind the ability to create broken strings using String.init(decoding:as:)?

Regardless, is there a recommended way to "unsafely" modify a string's bytes in a performant way (avoiding copies and having the speed of random access)?

There's withUTF8:

https://developer.apple.com/documentation/swift/string/3201135-withutf8

Note that not all strings have a mutable backing UTF-8 storage, so you can't avoid the possibility of copying in those circumstances if your intention is to mutate, but this API is guaranteed to avoid unnecessary copies.

You can also use isContiguousUTF8 to check manually whether the string is already stored as UTF-8, and makeContiguousUTF8 to force a conversion to UTF-8 for later consumption.
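As a rough sketch of how isContiguousUTF8, makeContiguousUTF8, and withUTF8 fit together, something like the following should work. The sample string and the variable names are made up for illustration, and the buffer is only read, never written to.

```swift
var greeting = "h\u{E9}llo"   // "héllo"; the é takes two bytes in UTF-8

// Native Swift strings are normally already contiguous UTF-8. Bridged
// strings may not be, in which case this call copies into native storage.
if !greeting.isContiguousUTF8 {
    greeting.makeContiguousUTF8()
}

// withUTF8 passes the closure a read-only UnsafeBufferPointer<UInt8>.
// The pointer must not escape the closure.
let byteCount = greeting.withUTF8 { buffer in buffer.count }
let firstByte = greeting.withUTF8 { buffer in buffer.first }

print(byteCount)  // 6
```

Calling makeContiguousUTF8 up front means any copy happens once, rather than on each later withUTF8 call.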

Also, note that it is not currently possible to conform an existing type to an existing protocol while maintaining ABI stability, so this isn't a change we could make with the existing language support for availability.

RangeReplaceableCollection unfortunately does not support situations where the collection can temporarily be in an invalid state before an eventual validation. RRC conformance would add append(_:UInt8) to the UTF8View, and validation would convert this to U+FFFD for every byte of a multi-byte scalar. An alternative approach could be something like a closure (or coroutine) that allows you to append bytes and only does the error correction at the very end.
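To illustrate the difference, here is a minimal sketch that contrasts validating after every byte with validating once at the end. The function appendUTF8(to:_:) is a hypothetical helper written for this example, not a standard library API, and it simply buffers bytes in an array rather than writing into the string's storage.

```swift
// Hypothetical helper (not a standard library API): collect raw bytes in a
// scratch buffer and validate them only once, after the closure returns.
func appendUTF8(to string: inout String, _ body: (inout [UInt8]) -> Void) {
    var scratch: [UInt8] = []
    body(&scratch)
    string += String(decoding: scratch, as: UTF8.self)
}

let encoded: [UInt8] = [0xf0, 0x9f, 0x98, 0xa8]  // UTF-8 for U+1F628

// Validating after every byte: each byte is invalid on its own, so each
// one is repaired to U+FFFD.
var eager = "a"
for byte in encoded {
    eager += String(decoding: [byte], as: UTF8.self)
}

// Validating once at the end: the four bytes decode as a single scalar.
var deferred = "a"
appendUTF8(to: &deferred) { scratch in
    for byte in encoded { scratch.append(byte) }
}

print(eager.unicodeScalars.count)     // 5
print(deferred.unicodeScalars.count)  // 2
```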

I wrote much more about this here: String Mutations

You shouldn't be able to create a "broken string". That particular initializer replaces invalid UTF8 sequences/code-points with the Unicode replacement character.

If you are able to create a broken string through that initializer, I think that would be a bug.
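A small sketch of that repair behavior, feeding String.init(decoding:as:) the invalid byte sequence from earlier in the thread (the variable names are only for illustration):

```swift
// The "broken" bytes from the earlier example: 0xff can never appear in
// UTF-8, and the three bytes after it are orphaned continuation bytes.
let broken: [UInt8] = [0xff, 0x9f, 0x98, 0xae, 0xf0, 0x9f, 0x98, 0xa8]

let repaired = String(decoding: broken, as: UTF8.self)

let replacement: Unicode.Scalar = "\u{FFFD}"
let hasReplacement = repaired.unicodeScalars.contains(replacement)
let keepsValidTail = repaired.hasSuffix("\u{1F628}")

// The resulting string holds valid UTF-8, so its bytes differ from the input.
let roundTrips = Array(repaired.utf8) == broken

print(hasReplacement, keepsValidTail, roundTrips)  // true true false
```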

That gives you an immutable buffer. However, the function itself is mutating as it may copy the String's contents to a new backing store. You should never mutate the String's content via the provided pointer. Just in case that wasn't clear to everyone.
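If the goal is to change bytes, one safe (though copying) alternative is to edit a separate array and decode it again. This is only a sketch with made-up names, and it gives up the "no copies" goal in exchange for never touching the string's storage directly:

```swift
var message = "hello \u{1F628}"

// Copy the bytes out; this array is ours to mutate freely.
var bytes = Array(message.utf8)

// Uppercase ASCII letters only. The bytes of multi-byte scalars are all
// >= 0x80, so they are left untouched.
for i in bytes.indices where (0x61...0x7a).contains(bytes[i]) {
    bytes[i] -= 0x20
}

// Decoding validates the result, so the string can never end up broken.
message = String(decoding: bytes, as: UTF8.self)

print(message)  // HELLO 😨
```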

Ah, yes, the ask was to mutate the bytes. My bad.