Conforming CollectionDifference<Character> to LosslessStringConvertible

Jens · April 9, 2022, 8:21pm

I'd like to (reasonably efficiently) serialize diffs of Strings (lines of text) to and from Strings.

For this purpose, would it make sense to conform CollectionDifference<Character> to LosslessStringConvertible, i.e.:

```swift
extension CollectionDifference: CustomStringConvertible where ChangeElement == Character {
    public var description: String {
        fatalError("Not yet implemented")
    }
}
extension CollectionDifference: LosslessStringConvertible where ChangeElement == Character {
    public init?(_ description: String) {
        fatalError("Not yet implemented")
    }
}
```

?

SDGGiesbrecht · April 9, 2022, 8:55pm

Because CollectionDifference.Change relies on index offsets, round‐trip serialization becomes a crash risk as soon as the serialized data passes outside the application’s memory. Most use cases for a LosslessStringConvertible conformance would be dangerous. See the following thread and its links for more information:

If you want to communicate string differences safely, you should split on some stable character like a line break, and then perform the differences on the resulting array of strings. Then even if the representation of any of the component strings changes, your overall data remains intact.
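
A minimal sketch of the line-based approach described above, using only standard library calls (`split`, `difference(from:)`, `applying(_:)`). The sample texts and the helper name `lines(of:)` are made up for illustration; they are not from the thread.

```swift
// Hypothetical sample texts, for illustration only.
let oldText = "alpha\nbeta\ngamma"
let newText = "alpha\nbeta (edited)\ngamma\ndelta"

// Split on a stable separator so that offsets count lines, not Characters.
// `lines(of:)` is a hypothetical helper, not a standard library function.
func lines(of text: String) -> [String] {
    text.split(separator: "\n", omittingEmptySubsequences: false).map(String.init)
}

let oldLines = lines(of: oldText)
let newLines = lines(of: newText)

// CollectionDifference<String>: each change carries a line offset and the whole line.
let lineDiff = newLines.difference(from: oldLines)

// Applying the difference to the old lines reproduces the new lines.
let rebuilt = oldLines.applying(lineDiff)
```

Because every offset refers to a line rather than to a position inside a string, a change in how an individual line is encoded or normalized does not shift any of the offsets.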

Jens · April 9, 2022, 9:30pm

If I understand you correctly, this would not be an issue if I used
CollectionDifference<String.UTF8View.Element>
instead of
CollectionDifference<Character>
i.e., serializing to and from UInt8 arrays (that are then converted to Strings only within my application).
Is that correct?

SDGGiesbrecht · April 9, 2022, 9:52pm

If the strings themselves are never serialized, then it is safe. Two instances of your application will construct the same UTF‐8 sequences with the same indices, regardless of how much time or space separates the devices on which they are running (which could not be said of the character boundaries).
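
One way the byte-level variant might look, as a sketch only. The text format (`+offset:byte` and `-offset:byte`, one change per line) and the function names `serialize` and `deserialize` are assumptions of this example, not something proposed in the thread; the `associatedWith` information is deliberately dropped.

```swift
// Hypothetical text format: "+<offset>:<byte>" for an insertion,
// "-<offset>:<byte>" for a removal, one change per line.
func serialize(_ diff: CollectionDifference<UInt8>) -> String {
    diff.map { change -> String in
        switch change {
        case let .insert(offset, element, _): return "+\(offset):\(element)"
        case let .remove(offset, element, _): return "-\(offset):\(element)"
        }
    }.joined(separator: "\n")
}

func deserialize(_ text: String) -> CollectionDifference<UInt8>? {
    var changes: [CollectionDifference<UInt8>.Change] = []
    for line in text.split(separator: "\n") {
        // The first character is the sign; the rest is "<offset>:<byte>".
        guard let sign = line.first else { return nil }
        let fields = line.dropFirst().split(separator: ":")
        guard fields.count == 2,
              let offset = Int(fields[0]),
              let byte = UInt8(fields[1]) else { return nil }
        switch sign {
        case "+": changes.append(.insert(offset: offset, element: byte, associatedWith: nil))
        case "-": changes.append(.remove(offset: offset, element: byte, associatedWith: nil))
        default: return nil
        }
    }
    // The failable initializer returns nil if the changes do not form a valid difference.
    return CollectionDifference(changes)
}

// Diff the UTF-8 bytes, never the Characters.
let oldBytes = Array("kitten".utf8)
let newBytes = Array("sitting".utf8)
let byteDiff = newBytes.difference(from: oldBytes)

// Round-trip the difference through the text format and apply it.
let wire = serialize(byteDiff)
let decoded = deserialize(wire)
let patched = decoded.flatMap { oldBytes.applying($0) }
```

The sketch is only meaningful when both sides hold byte-identical base strings, which is the condition stated above.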

However, if the strings themselves are also serialized, you will need to be careful to use an opaque format. If the serialization is done in a way that a middleman (file system, code library, etc.) might recognize as text and normalize “e” + “◌́” to “é” or vice versa, then even the UTF‐8 bytes of the string might have changed by the time you load it again. In that case your offsets no longer point at what you think, and they might even reach out of bounds.
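
The normalization hazard can be seen directly in a small sketch: the decomposed and precomposed spellings of “é” compare equal as Swift Strings, yet their UTF-8 encodings have different lengths, so byte offsets computed against one spelling do not line up with the other. The variable names are illustrative only.

```swift
let decomposed = "e\u{301}"   // "e" followed by U+0301 COMBINING ACUTE ACCENT
let precomposed = "\u{E9}"    // U+00E9, the single precomposed scalar

// Swift compares Strings by canonical equivalence, so these are equal,
let sameText = (decomposed == precomposed)

// and each is exactly one Character,
let characterCounts = (decomposed.count, precomposed.count)

// but the underlying UTF-8 bytes differ in both length and content.
let decomposedBytes = Array(decomposed.utf8)    // 0x65, 0xCC, 0x81
let precomposedBytes = Array(precomposed.utf8)  // 0xC3, 0xA9
```

A byte-level difference recorded against the three-byte form therefore refers to offsets that may not exist in the two-byte form.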