Using Character literals results in very poor parser performance

compared to Unicode.Scalar literals, the machine code generated for an otherwise identical parser using Character literals just seems dreadful:

enum Terminal<Generic> 
{
    static 
    var character:CollectionOfOne<Character> { .init("a") }

    static 
    var scalar:CollectionOfOne<Unicode.Scalar> { .init("a") }
}
public
func useUnicodeScalarConstant() -> Int 
{
    var result:Int = 0 
    for scalar:Unicode.Scalar in Terminal<String.Index>.scalar 
    {
        result += scalar == Unicode.Scalar.init(97) ? 1234 : 5678
    }
    return result
}
output.useUnicodeScalarConstant() -> Swift.Int:
        movl    $1234, %eax
        retq
public
func useCharacterConstant() -> Int 
{
    var result:Int = 0 
    for character:Character in Terminal<String.Index>.character 
    {
        result += character == Character.init(Unicode.Scalar.init(97)) ? 
            0xabcd : 0x0000
    }
    return result
}
output.useCharacterConstant() -> Swift.Int:
        pushq   %rbp
        pushq   %r15
        pushq   %r14
        pushq   %rbx
        pushq   %rax
        movabsq $-2233785415175766016, %r14
        movq    $97, (%rsp)
        movq    %rsp, %rdi
        movl    $1, %esi
        callq   ($sSS18_uncheckedFromUTF8ySSSRys5UInt8VGFZ)@PLT
        movq    %rdx, %rbx
        cmpq    $97, %rax
        jne     .LBB2_4
        cmpq    %r14, %rbx
        jne     .LBB2_4
        movq    %r14, %rdi
        callq   swift_bridgeObjectRelease@PLT
        movl    $43981, %eax
        jmp     .LBB2_3

(i set godbolt to use -O -whole-module-optimization.)

the compiler is able to constant-fold useUnicodeScalarConstant in its entirety, but the useCharacterConstant version actually calls into the runtime.

i know why swift must call into the runtime for arbitrary Character literals, since that depends on ICU. but i don’t know why it can’t detect that the "a" Character literal is entirely ASCII and optimize it the way it optimizes the Unicode.Scalar-based code.

this is just killing the JSON parser from the other thread, since it parses at the grapheme cluster level.

If you go back to Swift 4.2 on Godbolt, the Character version does also optimize to a constant:

output.useCharacterConstant() -> Swift.Int:
        push    rbp
        mov     rbp, rsp
        mov     eax, 43981
        pop     rbp
        ret

It looks like this stopped working starting with Swift 5.0. The documentation for one of the Character initializers around here specifically says that they want to be able to fold simple Characters to constants when possible, so something must have caused it to take a different code path and lose the optimization. It'd be worth filing an issue for that regression.

That being said, why parse JSON at the grapheme cluster level instead of the UTF-8 or Unicode.Scalar level? All the delimiters are ASCII, and in the one place where you might be interested in graphemes (string literals), there won't be graphemes containing a double-quote that need to be distinguished from a double-quote as a delimiter. So even if Character.init was optimized, your parsing loop still takes the performance hit of doing grapheme cluster checking to find the next Character in the input. Iterating UTF-8 or Unicode.Scalar avoids that.
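
To make that concrete, here is a minimal sketch (not from the parser in question; nextDelimiter and the delimiter set are just illustrative) of scanning a UTF-8 view for the next ASCII structural character. The loop only ever touches fixed-size UInt8 code units, so it never pays for grapheme cluster segmentation:

// hypothetical helper: find the next ASCII structural delimiter in a UTF-8 view.
// every element is a fixed-size UInt8, so no grapheme-breaking work is needed.
func nextDelimiter(in utf8: Substring.UTF8View) -> String.Index?
{
    utf8.firstIndex
    {
        switch $0
        {
        case    0x7b, 0x7d, // '{' '}'
                0x5b, 0x5d, // '[' ']'
                0x3a, 0x2c, // ':' ','
                0x22:       // '"'
            return true
        default:
            return false
        }
    }
}

// e.g. nextDelimiter(in: #"{"a": 1}"#[...].utf8) points at the opening '{'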

i’m lowering it to the unicode scalar level right now, but it just seems unfortunate, because a String is a Collection of Characters, and not Unicode.Scalars.

I agree that it is best to parse formats with ASCII delimiters at the UTF-8 level. There are a lot of benefits:

  • You can accept inputs using types other than String (see the sketch at the end of this post). For a JSON parser, it would likely be important to support parsing binary data (e.g. loaded from a file or received from the network). Requiring that data first be copied into a String and go through UTF-8 validation adds overhead for no gain.

  • It is easier to support non-UTF-8 encodings (if that's important). Almost all common legacy encodings maintain that byte values in the ASCII range represent the same code-points as they do in ASCII. See SHIFT-JIS vs. Windows-31J for an idea of what happens when an encoding deviates from that - long story short, Japanese people using Windows spent decades writing filesystem paths like:
    C:¥Windows¥notepad.exe

    This means you only need to care about looking for ASCII delimiters to ascertain the structure, and can essentially treat the elements as opaque bytes until you actually need to do something text-related with them. You get a lot more flexibility.

  • Character is variable-length. Internally, it is actually just stored as a String (yes, a full string!), the only difference being that it is known to have a count of 1. Thanks to small-string optimisations, Character will generally not cause a heap allocation, but it will still require ARC because theoretically it could be backed by a heap allocation.

Unicode.Scalar is better than Character, but for this purpose, neither is as good as UTF-8.
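
As a sketch of that first point (hypothetical names, not any particular parser's API): a single entry point generic over collections of UTF-8 code units accepts [UInt8], Foundation.Data, or a String's utf8 view alike, with no copy into String and no re-validation pass:

// a minimal sketch: any Collection of UInt8 works as parser input.
func skipWhitespace<Source>(_ utf8: Source, from index: Source.Index) -> Source.Index
    where Source: Collection, Source.Element == UInt8
{
    var index: Source.Index = index
    while index < utf8.endIndex
    {
        switch utf8[index]
        {
        case 0x20, 0x09, 0x0a, 0x0d: // ' ', '\t', '\n', '\r'
            index = utf8.index(after: index)
        default:
            return index
        }
    }
    return index
}

// the same function runs on raw bytes or on a String's UTF-8 view:
// skipWhitespace([0x20, 0x20, 0x7b], from: 0)
// let json: String = "  {" ; skipWhitespace(json.utf8, from: json.utf8.startIndex)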

how would i avoid accumulating a [UInt8] buffer of UTF-8 data for string literals, and then sending it through String.init(decoding:as:)? string literals usually make up the largest share of a json document.

If you have a slice of the input collection, you can send that through String.init(decoding:as:) directly. I'm not sure why you'd need to accumulate it in a buffer first.
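
For instance (a toy example; the slice bounds would come from wherever the parser found the quotes):

// a minimal sketch: slice the input bytes between the quotes and decode
// that slice directly, with no intermediate [UInt8] accumulation.
let input: [UInt8] = Array(#"{"key": "value"}"#.utf8)
let (start, end): (Int, Int) = (9, 14) // the bytes of `value`
let literal: String = .init(decoding: input[start ..< end], as: Unicode.UTF8.self)
// literal == "value"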

right, i forgot that API is generic over Collection. it’s a little more subtle than “lazily convert a slice of the input buffer” though, because the literal can contain escape sequences. it is even possible to encode UTF-16 surrogate pairs in escape sequences.

so everything after the first escape sequence would be subject to a second copy: one to parse the UTF-8, and one to concatenate it into the result String. i don’t know how bad this would be though; most JSON string literals are object keys and don’t have escape sequences.
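
the surrogate-pair case looks something like this (a sketch, with a hypothetical helper name): JSON encodes U+1F600 as \uD83D\uDE00, two UTF-16 code units that have to be recombined into one scalar before they can be appended to the output String:

// a minimal sketch of recombining an escaped UTF-16 surrogate pair.
func combineSurrogates(high: UInt16, low: UInt16) -> Unicode.Scalar?
{
    guard   0xd800 ... 0xdbff ~= high,
            0xdc00 ... 0xdfff ~= low
    else
    {
        return nil
    }
    let value: UInt32 = 0x1_0000
        + (UInt32(high - 0xd800) << 10)
        +  UInt32(low  - 0xdc00)
    return Unicode.Scalar(value)
}

// combineSurrogates(high: 0xd83d, low: 0xde00) == "\u{1F600}" (😀)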

For that, I would recommend creating a lazy wrapper to decode the escape sequences.

That's what I did for percent-decoding, and there are a lot of tweaks you can make to optimise it. For example:

  1. Scanning the contents and avoiding decoding if there are no escape sequences. For contiguous source data, I'm using bit-twiddling hacks to process chunks of 8 bytes in only 5 instructions per chunk. It's incredibly fast, handles the common case well, and is only really possible at the UTF-8 level. (Godbolt)

  2. Calculating the decoded string size without overflow checking, then using String(unsafeUninitializedCapacity:) to write the data into String's buffer manually. For contiguous source data, that first step will result in essentially optimal code - a very small loop with only a handful of instructions. You don't need overflow checking when calculating the buffer size, because that pass isn't actually writing to memory, and you can detect whether overflow occurred during the second pass, when you do actually write to memory.

  3. We could even optimise that further by writing to a stack buffer. But we need a way to get a guaranteed stack buffer without falling back to the heap (Pitch).

It may/may not be worth doing all of those things in your case, at least not at first. I would definitely recommend scanning before decoding, though. In my case, I saw some operations take 30-40% less time as a result.
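
Roughly, that scan can look like this on contiguous bytes (a sketch, not the code linked above; the byte-in-word test is the classic SWAR "word contains a byte equal to n" trick, and loadUnaligned requires Swift 5.7):

// true if any byte of `word` equals `byte` (classic SWAR bit trick).
func containsByte(_ word: UInt64, _ byte: UInt8) -> Bool
{
    let lows:  UInt64 = 0x0101_0101_0101_0101
    let highs: UInt64 = 0x8080_8080_8080_8080
    let x:     UInt64 = word ^ (lows &* UInt64(byte))
    return (x &- lows) & ~x & highs != 0
}

// scan for a '\' escape eight bytes at a time; only fall back to the slow
// escape-decoding path if one is actually present.
func containsEscape(_ utf8: UnsafeRawBufferPointer) -> Bool
{
    var offset: Int = 0
    while offset + 8 <= utf8.count
    {
        let word: UInt64 = utf8.loadUnaligned(fromByteOffset: offset, as: UInt64.self)
        if containsByte(word, 0x5c) // '\'
        {
            return true
        }
        offset += 8
    }
    return utf8[offset...].contains(0x5c)
}

// usage on a String's contiguous UTF-8 storage, if available:
// string.utf8.withContiguousStorageIfAvailable { containsEscape(UnsafeRawBufferPointer($0)) }

When the scan comes back clean, the slice can go straight through String.init(decoding:as:) as mentioned earlier; otherwise the two-pass approach from point 2 sizes the result and writes the unescaped bytes into String(unsafeUninitializedCapacity:).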

just an update, I lowered the parser from the Character level to the Unicode.Scalar level, which surprisingly did not affect performance that much. However, I lowered it from Unicode.Scalar to the UInt8 (UTF-8) level today, and it resulted in a 3x performance gain. wow.

i don’t know what is wrong with Unicode.Scalar, or why Character doesn’t apply additional penalties on top of that, but combined with other improvements, ss-json is now running 12x faster than before. the benchmarks still run on String.utf8, so this does not include time saved by not having to convert ByteBuffer to String.

(the tall narrow column on the left is the new UTF8-based implementation)

the only downside is the parser code has become much more unreadable, since all the terminals have to be written as hex literals. yikes.

enum Whitespace:Grammar.TerminalClass 
{
    typealias Terminal      = UInt8
    typealias Construction  = Void 
    static 
    func parse(terminal:UInt8) -> Void? 
    {
        switch terminal 
        {
        case    0x20, // ' '
                0x09, // '\t'
                0x0a, // '\n'
                0x0d: // '\r'
            return ()
        default:
            return nil
        }
    }
}

Great stuff!

You can use UInt8(ascii: "a") to help readability a bit. Alternatively just define them all as constants.
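
For example, here is a sketch of the Whitespace terminal from above with the hex literals swapped for UInt8(ascii:), which should still fold to the same constants under -O:

enum Whitespace:Grammar.TerminalClass 
{
    typealias Terminal      = UInt8
    typealias Construction  = Void 
    static 
    func parse(terminal:UInt8) -> Void? 
    {
        switch terminal 
        {
        case    UInt8(ascii: " "),
                UInt8(ascii: "\t"),
                UInt8(ascii: "\n"),
                UInt8(ascii: "\r"):
            return ()
        default:
            return nil
        }
    }
}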
