This is very exciting!
I've been working on a UTF-8 rope, similar to BigString. Its internals are closer to xi-editor's rope, but its grapheme breaking implementation is similar to BigString's. Relevant code: Rope.swift, AttributedRope.swift, BTree.swift.
My primary reasons for building custom string storage were:
- Performance:
  - O(log n) subscripting and mutation.
  - O(1) counting of characters, scalars, etc.
  - Efficiently count and iterate over larger segments: words, lines, etc.
There's a lot in this pitch that would have been useful to me, but most of the issues I've run into would be best solved by higher level abstractions that would let custom string types take advantage of existing code in the stdlib. Much of this could be built on top of the "collection-of-buffers" abstraction outlined in the pitch.
Some of these are definitely out of scope, but I think they might inform the design of in-scope APIs.
N.b. I haven't thought through the example APIs below. They probably don’t work and all names are strawpeople. For now I'm just trying to articulate problems I've had.
Streaming validation/repair during allocation
When allocating a string stored in a tree, the first step is to split the underlying flat byte buffer into chunks that are stored in the tree's leaves. I like the String(repairing:), String(normalizing:), and String(validating:) initializers, but for very large (i.e. multi-gigabyte) strings, allocating a String just to get the repairing: behavior before re-allocating your custom storage can be expensive. An API that can do this efficiently on the fly would be better.
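The scalar-level half of this is already expressible with the stdlib's low-level Unicode.Parser machinery; here's a rough sketch of streaming repair built on it (it accumulates scalars for clarity – a real implementation would write repaired bytes directly into leaves as they fill up):

```swift
// Streaming UTF-8 repair: feed bytes one at a time, substituting U+FFFD for
// invalid sequences, without ever materializing a String.
func repairedScalars(_ bytes: some Sequence<UInt8>) -> [Unicode.Scalar] {
    var parser = Unicode.UTF8.ForwardParser()
    var input = bytes.makeIterator()
    var output: [Unicode.Scalar] = []
    loop: while true {
        switch parser.parseScalar(from: &input) {
        case .valid(let encoded):
            output.append(Unicode.UTF8.decode(encoded))
        case .error:
            output.append("\u{FFFD}") // the same substitution String(repairing:) makes
        case .emptyInput:
            break loop
        }
    }
    return output
}
```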
Grapheme breaking
To get the O(log n) indexing behavior from storing a string in a balanced tree, leaves must be fixed size and Characters must be allowed to span multiple leaves. If you force leaves to be Character-aligned, it becomes impossible to have fixed-size leaves, and in pathological cases – e.g. a single character with tons of combining codepoints – you can end up back at O(n).
This means each leaf has to store at minimum the following extra state:
- prefixCount – the number of bytes at the beginning of the leaf that continue a character from the previous leaf.
- lastCharContinues – whether nextLeaf().prefixCount > 0. Not technically necessary, but having to regularly load the next leaf to check whether chunk.bytes.endIndex is a character boundary isn't ideal.
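Concretely, that makes the leaf shape look something like this (illustrative; field names follow the description above, sizes vary by implementation):

```swift
// Illustrative leaf layout for a UTF-8 rope with fixed-size leaves.
struct Leaf {
    var bytes: [UInt8]          // up to the max leaf size; not Character-aligned
    var prefixCount: Int        // bytes at the start that continue a Character from the previous leaf
    var lastCharContinues: Bool // cached nextLeaf().prefixCount > 0
}
```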
Furthermore, this state has to be updated when mutating your BigString or appending two BigStrings:
s1 = "a" * 1000
s2 = "´" + "b" * 999
s3 = s1 + s2
s1.count == 1000
s2.count == 1000
s3.count == 1999 // the last "a" and "´" have combined, so there's one less character.
s1.root.height == 0 // a single node, which is a leaf
s2.root.height == 0
// s3 has two leaves
s3.root.height == 1
s3.root.children.count == 2
l1 = s1.root
l2 = s2.root
l3a = s3.root.children[0]
l3b = s3.root.children[1]
// s3's leaves have the same bytes as s1 and s2's, but their extra state has changed.
l1.bytes == l3a.bytes
l2.bytes == l3b.bytes
l1.prefixCount == 0; l1.lastCharContinues == false
l2.prefixCount == 0; l2.lastCharContinues == false
l3a.prefixCount == 0; l3a.lastCharContinues == true // lastCharContinues is different
l3b.prefixCount == 2; l3b.lastCharContinues == false // prefixCount is different
The proposed GraphemeFormer is too low level to deal with mutations like this without a lot of complex logic, which is easy to get wrong. BigString+Managing Breaks.swift gives a sense of what the logic looks like. The technique used in BigString also requires each leaf to store another piece of state – a GraphemeFormer that has consumed everything up to the first byte of the leaf (this assumes the leaves are scalar-aligned).
It would be great to have a higher level API that makes this easier. One option would be an API like GraphemeCursor from the unicode_segmentation Rust crate. It lets you start at an arbitrary Unicode scalar offset and ask for the next or previous grapheme boundary, requesting next and previous chunks from you as necessary until it can return a boundary.
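To make the shape concrete, here's a rough Swift transliteration of that interface (entirely hypothetical, and the names are strawpeople like everything else here):

```swift
// A chunk-feeding grapheme cursor, modeled on unicode_segmentation's API.
enum GraphemeCursorStep: Equatable {
    case boundary(Int?)  // absolute UTF-8 offset of the boundary; nil if there is none in this direction
    case needPrevChunk   // feed the chunk before the current one, then retry
    case needNextChunk   // feed the chunk after the current one, then retry
}

protocol GraphemeCursorProtocol {
    // Start at an arbitrary scalar-aligned UTF-8 offset into the whole string.
    init(offset: Int, totalUTF8Count: Int)
    // chunk is any scalar-aligned slice of the string; chunkStart is its absolute UTF-8 offset.
    mutating func nextBoundary(in chunk: Substring, chunkStart: Int) -> GraphemeCursorStep
    mutating func previousBoundary(in chunk: Substring, chunkStart: Int) -> GraphemeCursorStep
}
```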
This does have some problems: a pathological string with gigabytes of regional indicators (the codepoints used in flag emoji) requires an O(n) scan back to the first regional indicator, but I think this is true of String.index(before:) now, so maybe it's not so bad. I also don't know how easy it would be to keep character counts up to date using this API. Perhaps it's enough, or perhaps there's a better high level API without the downsides.
I don't know how this problem applies to other string data structures like gap buffers or piece tables.
Indexing APIs
Replicating String's indexing behavior is difficult. It would be great if there were a way for a custom string type to use the same codepaths that String itself uses.
Consider the string "aé" (using the non-combining scalar U+00E9 LATIN SMALL LETTER E WITH ACUTE) and an index i at UTF-8 offset 2, which is not character-aligned – it falls inside "é".
- Subscripting a character with a non-aligned index should round down to the nearest character boundary: s[i] == "é".
- index(before:) of an aligned index moves to the left by one boundary, but of an unaligned index moves to the left by two boundaries: s.index(before: i) == s.startIndex (skipping the boundary between "a" and "é").
- index(after:) always moves to the right by a single boundary, regardless of whether the index is aligned: s.index(after: i) == s.endIndex (not skipping a boundary).
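These rules are easy to spot-check against String itself; the asserts below encode the behavior just described:

```swift
let s = "aé" // "a" + U+00E9; "é" occupies UTF-8 offsets 1..<3
let i = s.utf8.index(s.utf8.startIndex, offsetBy: 2) // inside "é": not character-aligned

assert(s[i] == "é")                        // subscript rounds down
assert(s.index(before: i) == s.startIndex) // unaligned index: moves left past two boundaries
assert(s.index(after: i) == s.endIndex)    // always moves right by exactly one boundary
```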
It’s hard to know all the rules that String follows, let alone reimplement them.
Additionally, String.Index has performance improvements enabled by properties like _isCharacterAligned, _isScalarAligned, etc. It should be easy to apply these to a custom string type without reimplementing String's logic. Supporting these optimizations might also reduce the requirements on the grapheme breaking API, but I'm not sure.
It's possible all of this could be solved with a Unicode.Index type that can be embedded in your custom string's Index type, as well as a protocol for moving between boundaries. Here's a rough sketch.
```swift
extension Unicode {
    struct Index {
        var utf8Offset: Int { get }
    }
}

extension Unicode {
    enum Boundary {
        case utf8
        case utf16
        case scalar
        case character
        // Maybe these from UAX #29 too? How to make them optional?
        case word
        case sentence
        case line
    }
}

protocol UnicodeIndexWrapping {
    init(_ unicodeIndex: Unicode.Index)
    var unicodeIndex: Unicode.Index { get }
}

protocol UnicodeIndexable {
    associatedtype Index: UnicodeIndexWrapping
    func unicodeIndex(ofBoundaryBefore i: Index, boundary: Unicode.Boundary) -> Index
    func unicodeIndex(ofBoundaryAfter i: Index, boundary: Unicode.Boundary) -> Index
    func unicodeIndex(ofBoundaryRoundingDown i: Index, boundary: Unicode.Boundary) -> Index
}

struct BigString: UnicodeIndexable {
    struct Index: UnicodeIndexWrapping {
        let unicodeIndex: Unicode.Index
        init(_ unicodeIndex: Unicode.Index) { self.unicodeIndex = unicodeIndex }
    }
    // Boundary-navigation requirements elided.
}
```
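If something like this existed, view logic could be written once against the protocol. For example, a character view could encode String's round-down-then-step rule for index(before:) (same strawman names as above):

```swift
// Sketch: a generic character view over any UnicodeIndexable storage.
struct CharacterView<Storage: UnicodeIndexable> {
    let storage: Storage

    func index(after i: Storage.Index) -> Storage.Index {
        storage.unicodeIndex(ofBoundaryAfter: i, boundary: .character)
    }

    func index(before i: Storage.Index) -> Storage.Index {
        // Round an unaligned index down to the character boundary first, then
        // step back one boundary – which is why an unaligned index appears to
        // move left by two boundaries.
        let aligned = storage.unicodeIndex(ofBoundaryRoundingDown: i, boundary: .character)
        return storage.unicodeIndex(ofBoundaryBefore: aligned, boundary: .character)
    }
}
```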
Actually sharing indexing logic with String is complicated by the fact that String.Index is frozen, but I'm hopeful there's a clever way around that.
UAX #29 word and sentence breaking
Swift has an implementation of UAX #29's word and sentence breaking algorithms, including locale-specific variants – enumerateSubstrings(in:options:_:) with .byWords or .bySentences, optionally combined with .localized. Being able to use the stdlib's implementation with custom string types would be great.
Even better, there are times when you might want to customize this behavior (UAX #29 notes this). Some examples: treating a number with internal punctuation, like "1,234.56", as a single word with no internal boundaries; or applying different word boundary rules for different programming languages in a text editor – e.g. for a camel-case language like Swift, breaking "fooBarBaz" with word boundaries between "foo" and "Bar", and between "Bar" and "Baz". Extension points to allow this would be great.
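As a standalone pass, the camel-case rule is simple enough to sketch; the open question is how a rule like this would compose with the UAX #29 state machine:

```swift
// Add a word boundary at each lowercase-to-uppercase transition, so that
// "fooBarBaz" breaks as foo | Bar | Baz.
func camelCaseBoundaries(in s: Substring) -> [Substring.Index] {
    var boundaries: [Substring.Index] = []
    var previous: Character?
    var i = s.startIndex
    while i < s.endIndex {
        let c = s[i]
        if let p = previous, p.isLowercase, c.isUppercase {
            boundaries.append(i)
        }
        previous = c
        i = s.index(after: i)
    }
    return boundaries
}

// camelCaseBoundaries(in: "fooBarBaz") → the indices of "B" in "Bar" and "B" in "Baz".
```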
A generic UTF16View from a collection of UTF-8 buffers
Building a UTF-16 view on top of UTF-8 storage is hard. I haven’t even attempted it. But it’s necessary to efficiently integrate with system frameworks that use UTF-16. It would be great if Unicode.Index (or another API) were able to deal with trailing surrogate indices, which don't actually exist in the underlying UTF-8 storage, and make it easy to return both leading and trailing surrogates when subscripting.
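The stdlib does already have the scalar-to-surrogate helpers such a subscript would bottom out in; the hard parts – index translation and representing trailing-surrogate indices that have no byte of their own – are what's missing. A minimal sketch of the easy half:

```swift
// Produce a UTF-16 code unit for a scalar decoded from UTF-8 storage, given
// whether the (synthesized) index points at the trailing surrogate.
func utf16CodeUnit(for scalar: Unicode.Scalar, trailingSurrogate: Bool) -> UTF16.CodeUnit {
    if UTF16.width(scalar) == 1 {
        return UTF16.CodeUnit(scalar.value) // BMP scalar: one code unit
    }
    return trailingSurrogate ? UTF16.trailSurrogate(scalar) : UTF16.leadSurrogate(scalar)
}
```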
Similarly, if String’s UTF-16 breadcrumbing code could somehow be extracted and made generic, that would be even better.
Tools for implementing other views
It’s possible something like UnicodeIndexable would allow Swift to provide APIs for implementing other views on top of custom string storage, including views for words, sentences, and lines.
Indexing into a LinesView is particularly gnarly because s.lines[s.endIndex] should be valid and return "" for strings ending in an empty line. This breaks all sorts of Collection invariants/expectations, and I haven't found a good solution for it.
Custom Unicode tables
This is a good idea. Here’s an article with more context.
Streaming normalization and denormalization
If you want to be able to present a good find/replace UI, you need to deal with normalization. You don’t necessarily want to normalize the string as you read it in because your app may need to write the string out to disk later, with minimal edits.
It would be great if Swift provided APIs to do streaming normalization and denormalization of custom string types.
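By analogy with the streaming-repair sketch earlier, the shape might be a push-style interface like the one below. Everything in it is hypothetical – there is no Unicode.NFCNormalizer – the point is only that it consumes chunks and emits output incrementally:

```swift
// Hypothetical push-style normalizer: feed scalars chunk by chunk; it emits
// NFC output only once reordering can no longer affect it, so no leaf ever
// needs to be revisited.
var normalizer = Unicode.NFCNormalizer()  // hypothetical type
for leaf in rope.leaves {                 // rope/leaves are illustrative, too
    let settled = normalizer.push(leaf.unicodeScalars)
    // ... compare the settled output against the (normalized) search needle ...
}
let remainder = normalizer.flush()        // anything still buffered at the end
```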
Fixed-size arrays
This is pretty far afield, and would require a language change (there are ongoing discussions), but it’s still a pain point.
Right now, in my Rope, each leaf stores a String. I'd like to be able to store a fixed-size array of bytes. Partly this is for performance – storing the bytes inline in the leaf eliminates unnecessary pointer chasing to the String's storage. But it's also important for control, so that I can interop better with zero-copy C APIs (see below).
Interop with zero-copy C APIs with temporarily escaping buffers
This is a problem with String, but because I build on top of String (see “Fixed-size arrays”), it’s still a problem for me.
Consider this (simplified) API from Tree-sitter:
```c
typedef struct {
    void *payload;
    const char *(*read)(void *payload, uint32_t byte_index, uint32_t *bytes_read);
} TSInput;

TSTree *ts_parser_parse(TSParser *self, TSInput input);
```
ts_parser_parse is designed to be zero-copy. During the execution of ts_parser_parse, your read function is called N times, giving you the ability to provide a pointer to a portion of your input, whatever size is convenient. In the case of BigString, the bytes in a single leaf are the obvious choice. While the char pointer escapes read, it's guaranteed not to escape ts_parser_parse.
Because String.withUTF8 et al. do not allow their pointer to escape the passed-in closure, there's no way to implement this in Swift without copying the bytes of each leaf into a struct that wraps TSInput and (IIRC) manually allocating and freeing the memory yourself. This is a big hole in Swift – it's not just that you can't use this zero-copy API without resorting to unsafe constructs; it's that you can't express it at all. There's no way to use ts_parser_parse from Swift without eventually doing a memcpy of the entire storage of your string.
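For concreteness, here's roughly what the forced workaround looks like against the simplified TSInput above (ParseBuffer is a hypothetical name; this assumes the C declarations are imported as written):

```swift
// The copy-everything workaround: flatten the string into one allocation that
// outlives ts_parser_parse, because no withUTF8-style API lets a pointer
// survive across multiple read() callbacks.
final class ParseBuffer {
    let bytes: UnsafeMutableBufferPointer<CChar>

    init(copying string: String) {
        let utf8 = Array(string.utf8) // the memcpy we'd like to avoid
        bytes = .allocate(capacity: utf8.count)
        for (i, b) in utf8.enumerated() { bytes[i] = CChar(bitPattern: b) }
    }

    deinit { bytes.deallocate() }

    // The caller must keep this object alive for the duration of ts_parser_parse.
    func makeInput() -> TSInput {
        TSInput(
            payload: Unmanaged.passUnretained(self).toOpaque(),
            read: { payload, byteIndex, bytesRead in
                let buf = Unmanaged<ParseBuffer>
                    .fromOpaque(payload!).takeUnretainedValue()
                let remaining = buf.bytes.count - Int(byteIndex)
                guard remaining > 0 else {
                    bytesRead!.pointee = 0
                    return nil
                }
                bytesRead!.pointee = UInt32(remaining)
                return UnsafePointer(buf.bytes.baseAddress! + Int(byteIndex))
            }
        )
    }
}
```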
From what I can remember of the BufferView vision doc, BufferViews would not solve this problem because supporting unsafe pointers is an explicit non-goal. I have not had a chance to read the StorageView proposal that came out of that, so maybe this has changed. Perhaps some of the escape analysis stuff that goes along with it might help too.
At minimum, if I had fixed-size arrays, I could do this unsafely with my own rope (I can guarantee that any unsafe pointer will live as long as the rope it’s derived from), but this is a general problem, and it would be great to have a general solution.