Optimal way to create a static lookup table?

I have a need to encode/decode hex characters without using the standard library's built-in utilities (for performance - I'm working on the level of a single, known-to-be-ASCII character as a UInt8, rather than a full-blown UTF8 String). The generally-accepted fastest way to do this is using a lookup table, like this:

/// Returns the ASCII value corresponding to the low nibble of `number`, encoded as hex.
func ascii_getHexDigit(_ number: UInt8) -> UInt8 {
    let table: StaticString = "0123456789ABCDEF"
    return table.withUTF8Buffer { table in
        table[Int(number & 0x0F)]
    }
}

/// Returns the numerical value of a hex digit, or 99 if the character is not a hex digit.
func ascii_parseHexDigit(ascii: UInt8) -> UInt8 {
    let (index, underflow) = ascii.subtractingReportingOverflow(0x30)
    guard underflow == false else { return 99 }
    let table = [
                 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, // numbers 0-9 
                 99, 99, 99, 99, 99, 99, 99,   // 7 invalid chars from ':' to '@'
                 10, 11, 12, 13, 14, 15,       // uppercase A-F
                 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, // 20 invalid chars G-Z
                 99, 99, 99, 99, 99, 99,       // 6 invalid chars from '[' to '`'
                 10, 11, 12, 13, 14, 15,       // lowercase a-f
                 99 as UInt8 // for type inference
                ]
    guard index < table.count else { return 99 }
    return table[Int(index)]
}

Godbolt link

I really need to micro-optimise this code as part of a parser I'm building. For the StaticString case, this looks good. The produced assembly is basically optimal:

output.ascii_getHexDigit(Swift.UInt8) -> Swift.UInt8:
        push    rbp
        mov     rbp, rsp
        and     edi, 15
        lea     rax, [rip + .L__unnamed_1]
        mov     al, byte ptr [rdi + rax]
        pop     rbp
        ret

However, for the second case (a let array of bytes), the result isn't so amazing. There are 4 runtime calls, including retain/release:

output.ascii_parseHexDigit(ascii: Swift.UInt8) -> Swift.UInt8:
        push    rbp
        mov     rbp, rsp
        push    r15
        push    r14
        push    rbx
        push    rax
        mov     ebx, edi
        mov     r15b, 99
        sub     bl, 48
        jb      .LBB2_4
        lea     rdi, [rip + (demangling cache variable for type metadata for Swift._ContiguousArrayStorage<Swift.UInt8>)]
        call    __swift_instantiateConcreteTypeFromMangledName
        lea     rsi, [rip + (outlined variable #0 of output.ascii_parseHexDigit(ascii: Swift.UInt8) -> Swift.UInt8)+8]
        mov     rdi, rax
        call    swift_initStaticObject@PLT
        mov     r14, rax
        mov     rdi, rax
        call    swift_retain@PLT
        mov     r15b, 99
        cmp     bl, 55
        ja      .LBB2_3
        movzx   eax, bl
        mov     r15b, byte ptr [r14 + rax + 32]
.LBB2_3:
        mov     rdi, r14
        call    swift_release@PLT
.LBB2_4:
        mov     eax, r15d
        add     rsp, 8
        pop     rbx
        pop     r14
        pop     r15
        pop     rbp
        ret

I've tried rewriting this as a tuple, but that also isn't amazing. It avoids the runtime calls, which is nice, but since tuples don't have subscript operations, I need to use pointers and offsets to access the elements. Unfortunately, this means copying the tuple to a temporary inside the function (see Godbolt link).
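
For reference, here's roughly the shape of that tuple version, shortened to the decimal digits so the tuple fits on one line (the real one covered '0' through 'f'):

func parseDecimalDigit_tuple(ascii: UInt8) -> UInt8 {
    let (index, underflow) = ascii.subtractingReportingOverflow(0x30)
    guard underflow == false, index < 10 else { return 99 }
    let table: (UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8) =
        (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
    // Tuples have no subscript, so the lookup drops down to raw bytes - and this
    // is where the compiler ends up copying the tuple to a temporary on the stack.
    return withUnsafeBytes(of: table) { $0[Int(index)] }
}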

Does anybody know the optimal way to create a numerical lookup table in Swift? What I'd like is something that results in essentially the same assembly as the StaticString example. I could encode the table as a String, but that just seems kind of ridiculous.

1 Like

It helps to have a test case for developing this kinda thing, so I made one:

let testCases: [(input: UInt8, expectedOutput: UInt8)] = [
		(input: "0", expectedOutput: 0),
		(input: "1", expectedOutput: 1),
		(input: "2", expectedOutput: 2),
		(input: "3", expectedOutput: 3),
		(input: "4", expectedOutput: 4),
		(input: "5", expectedOutput: 5),
		(input: "6", expectedOutput: 6),
		(input: "7", expectedOutput: 7),
		(input: "8", expectedOutput: 8),
		(input: "9", expectedOutput: 9),
		(input: "A", expectedOutput: 10), (input: "a", expectedOutput: 10),
		(input: "B", expectedOutput: 11), (input: "b", expectedOutput: 11),
		(input: "C", expectedOutput: 12), (input: "c", expectedOutput: 12),
		(input: "D", expectedOutput: 13), (input: "d", expectedOutput: 13),
		(input: "E", expectedOutput: 14), (input: "e", expectedOutput: 14),
		(input: "F", expectedOutput: 15), (input: "f", expectedOutput: 15),
	].map { (input: UnicodeScalar, expectedOutput: UInt8) in
		(input: UInt8(input.value), expectedOutput: expectedOutput) 
	}


for i in UInt8.min ... UInt8.max {
	let parsedValue = AsciiUtils.parseHexDigit(fromAscii: i)
	
	let formattedParseValue = (parsedValue == 99) ? "x" : String(parsedValue)
	print("\(i) => \(formattedParseValue)")
	
	let expectedValue = testCases.first(where: { $0.input == i })?.expectedOutput ?? 99
	assert(parsedValue == expectedValue, "Expected ascii code \(i) ('\(UnicodeScalar(i))') to result in \(expectedValue), but got: \(parsedValue)")
}

What happens if you use a bigger lookup table, to get rid of the subtraction, overflow check, index check, etc.? Granted, this bigger table will use more cache, which might be worse overall, but it's worth playing with.

I also made some other changes:

  1. Namespaced the functions in an AsciiUtils enum, rather than prefixing them with ascii_
  2. Extracted the magic 99 into an invalidHexCharacter constant
  3. Made a local variable x to reference invalidHexCharacter more succinctly, which has the nice benefit of making the real digits stand out more visually.
enum AsciiUtils {
	static let invalidHexCharacter: UInt8 = 99
	
	/// Returns the ASCII value corresponding to the low nibble of `number`, encoded as hex.
	static func getHexDigit(for number: UInt8) -> UInt8 {
		let table: StaticString = "0123456789ABCDEF"
		return table.withUTF8Buffer { table in
			table[Int(number & 0x0F)]
		}
	}

	/// Returns the numerical value of a hex digit, or 99 if the character is not a hex digit.
	static func parseHexDigit(fromAscii index: UInt8) -> UInt8 {
		let x = invalidHexCharacter
		
		let table: ContiguousArray<UInt8> = [
			x, x, x, x, x, x, x, x, x, x, // first 48 ASCII characters (everything before '0')
			x, x, x, x, x, x, x, x, x, x, //
			x, x, x, x, x, x, x, x, x, x, //
			x, x, x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			0, 1, 2, 3, 4, 5, 6, 7, 8, 9, // numbers 0-9 
			x, x, x, x, x, x, x,   // 7 invalid chars from ':' to '@'
			10, 11, 12, 13, 14, 15,       // uppercase A-F
			x, x, x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x, x, x, // 20 invalid chars G-Z
			x, x, x, x, x, x,       // 6 invalid chars from '[' to '`'
			10, 11, 12, 13, 14, 15,       // lowercase a-f
			// Remaining possible values, grouped 8 per line after the first line, to reach all 256 possible values
			x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
			x, x, x, x, x, x, x, x,
		]
		assert(table.count == (UInt8.min ... UInt8.max).count,
			"The table wasn't big enough to support all possible UInt8 values! table.count was: \(table.count)")
		return table[Int(index)]
	}
}

Moving your lookup table array from a local variable to a global/static let should get the object initialization overhead out of your function and into a one-time initializer, at least.

1 Like

When I tried out my code in Godbolt, it got outlined (the opposite of inlined; is that an established phrase?) when using -O.

1 Like

Yes! (For the curious, Jessica has given a few nice talks on the subject, e.g. https://www.youtube.com/watch?v=yorld-WSOeU).

1 Like

Oh, and here's a potentially neater way to build the table:

enum AsciiUtils {
	static let invalidHexCharacter: UInt8 = 99
	
	/// Returns the ASCII value corresponding to the low nibble of `number`, encoded as hex.
	static func getHexDigit(for number: UInt8) -> UInt8 {
		let table: StaticString = "0123456789ABCDEF"
		return table.withUTF8Buffer { table in
			table[Int(number & 0x0F)]
		}
	}

	private static let hexitAsciiLookupTable: ContiguousArray<UInt8> = {
		let x = invalidHexCharacter
		
		let array: [UInt8] = [
			Array(repeating: x, count: 48),
			[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], // digits 0-9 
			Array(repeating: x, count: 7),  // 7 invalid chars from ':' to '@'
			[10, 11, 12, 13, 14, 15],       // uppercase A-F
			Array(repeating: x, count: 26), // 26 invalid chars from 'G' to 'Z', '[' to '`'
			[10, 11, 12, 13, 14, 15],       // lowercase a-f
			Array(repeating: x, count: 153)
		].flatMap { $0 }
		
		assert(array.count == (UInt8.min ... UInt8.max).count,
			"The table wasn't big enough to support all possible UInt8 values! Its size was: \(array.count)")
			
		return ContiguousArray(array)
	}()
	
	/// Returns the numerical value of a hex digit, or 99 if the character is not a hex digit.
	static func parseHexDigit(fromAscii index: UInt8) -> UInt8 {
		return AsciiUtils.hexitAsciiLookupTable[Int(index)]
	}
}
1 Like

The generally-accepted fastest way to do this is using a lookup table, like this:

I'm leery of this technique. While it was definitely valuable on the 6502, modern CPUs tend to be fast at computation and slow at I/O [1].

Have you thought about doing this using SIMD? There will be more instructions in your loop but each iteration will process more results.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

[1] As someone who focuses on Core OS issues — files, networking, and so on — it was a revelation when, back in the day, I worked with an AltiVec specialist who used the term I/O to refer to memory (-:

3 Likes

Slightly faster parseHexDigit (the same approach for getHexDigit didn't make any difference):

  /// Returns the numerical value of a hex digit, or 99 if the character is not a hex digit.
  static func parseHexDigit(fromAscii index: UInt8) -> UInt8 {
    enum S {
      static let x: UInt64 = 0x6363636363636363
      static var lut: (
        UInt64, UInt64, UInt64, UInt64, UInt64, UInt64, UInt64, UInt64,
        UInt64, UInt64, UInt64, UInt64, UInt64, UInt64, UInt64, UInt64,
        UInt64, UInt64, UInt64, UInt64, UInt64, UInt64, UInt64, UInt64,
        UInt64, UInt64, UInt64, UInt64, UInt64, UInt64, UInt64, UInt64
        ) = (x, x, x, x, x, x, 0x0706050403020100, 0x6363636363630908,
             0x630f0e0d0c0b0a63, x, x, x, 0x630f0e0d0c0b0a63, x, x, x,
             x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x)
    }
    let i = Int(truncatingIfNeeded: index)
    return withUnsafeBytes(of: &S.lut) { $0[i] }
  }

BTW, I wonder why the "new" let-taking version of `withUnsafeBytes` (SE-0205) is *much* slower than the "old" var-requiring one used above... probably a bug/opportunity for improvement?

That is, the above method becomes much slower if we declare lut as static let (instead of static var) and use withUnsafeBytes(of: S.lut) { $0[i] } (without the &) from SE-0205.
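
To make the comparison concrete, the two call forms look like this side by side (the type and property names here are made up for illustration; the table contents don't matter for the slowdown):

enum Tables {
    static var mutableLUT: (UInt64, UInt64) = (0, 0)
    static let immutableLUT: (UInt64, UInt64) = (0, 0)
}

func viaInout(_ i: Int) -> UInt8 {
    // Original form: the value is passed inout, so no temporary copy is needed.
    withUnsafeBytes(of: &Tables.mutableLUT) { $0[i] }
}

func viaValue(_ i: Int) -> UInt8 {
    // SE-0205 form: the value is passed by value - the variant whose overhead
    // shows up in the profiling below.
    withUnsafeBytes(of: Tables.immutableLUT) { $0[i] }
}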

Profiling shows that almost all of this extra overhead is caused by _platform_bzero$VARIANTS$Haswell and _platform_memset$VARIANTS$Haswell.

cc @Joe_Groff, @Ben_Cohen: Is this (see the profiling above) a known issue with SE-0205?
(I guess there's no actual reason why the method for lets should be slower than the old one for vars.)
Filed SR-12716.

Oh, that's much better! I guess I just saw the retain/release calls and assumed it was due to using an Array, so I immediately switched to a tuple instead. That's also why I added a 99 as UInt8 element at the end (the idea was that if I didn't explicitly ask for an Array, the compiler might be able to see that I only want to subscript into the array literal and might be free to make something more optimal than a true Swift Array - just a wild punt).

Is there a reason the optimiser can't automatically turn non-trivial constant locals into globals?


That's certainly true. I was trying to go for a sort of middle ground: having the lookup table only handle the interesting part to reduce code size and cache pressure, and bounds-checking against the table region as a compromise. It's possible that by using both approaches I actually end up with the benefits of neither.

SIMD is a really great idea. This is actually supposed to be part of a parser for IPv6 addresses, so I think that would suit SIMD really well. I never thought of using that - thanks a lot!
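
For anyone following along, here's a rough sketch of what a SIMD version of the parse might look like, using the standard library's SIMD16<UInt8> (purely illustrative, not benchmarked):

// Parses 16 ASCII bytes at once; 99 marks invalid characters, as in the scalar versions above.
func parseHexDigits(_ chars: SIMD16<UInt8>) -> SIMD16<UInt8> {
    // '0'...'9'
    let isDigit = (chars .>= 0x30) .& (chars .<= 0x39)
    // Fold to lowercase, then check 'a'...'f'
    let folded = chars | 0x20
    let isHexLetter = (folded .>= 0x61) .& (folded .<= 0x66)

    var result = SIMD16<UInt8>(repeating: 99)
    result.replace(with: chars &- 0x30, where: isDigit)
    result.replace(with: folded &- 0x61 &+ 10, where: isHexLetter)
    return result
}

// Example: the 16 ASCII bytes of "20010db820010db8" parse to 2, 0, 0, 1, 0, 13, 11, 8, ...
var chunk = SIMD16<UInt8>()
for (i, byte) in "20010db820010db8".utf8.enumerated() {
    chunk[i] = byte
}
print(parseHexDigits(chunk))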

If you want to eliminate memory access at the expense of a few extra instructions, the following may be faster for conversion to ASCII. You can store half the table (every other character) in a UInt64:

func hexToChar(_ num: UInt8) -> UInt8 {
    let hexChars: UInt64 = 0x4543413836343230 // "ECA86420"
    let odd = num & 1
    let idx = Int(num & 0xe) << 2
    let ch = (hexChars >> idx) 
    return UInt8(truncatingIfNeeded: ch) + odd
}

Which compiles to:

output.hexToChar(Swift.UInt8) -> Swift.UInt8:
        push    rbp
        mov     rbp, rsp
        lea     ecx, [4*rdi]
        mov     edx, edi
        and     dl, 1
        and     cl, 56
        movabs  rax, 4990904521740005936
        shr     rax, cl
        add     al, dl
        pop     rbp
        ret
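
As a quick sanity check, the packed-UInt64 encoder should agree with the StaticString table from the first post for every nibble (this assumes ascii_getHexDigit from the original post is in scope):

for n: UInt8 in 0..<16 {
    assert(hexToChar(n) == ascii_getHexDigit(n), "mismatch for nibble \(n)")
}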

And for parsing, the following uses only one branch and a conditional move:

public func parseHex(_ char: UInt8) -> UInt8 {
    if char >= 0x30 && char < 0x3A { return char - 0x30 }
    let lowercased = char | 0x20
    if lowercased >= 0x61 && lowercased <= 0x66 { return lowercased - 0x61 + 0x0A }
    return 99
}

Which compiles to:

output.parseHex(Swift.UInt8) -> Swift.UInt8:
        push    rbp
        mov     rbp, rsp
        lea     eax, [rdi - 48]
        cmp     al, 10
        jb      .LBB1_2
        or      dil, 32
        lea     eax, [rdi - 97]
        add     dil, -87
        cmp     al, 5
        movzx   ecx, dil
        mov     eax, 99
        cmovbe  eax, ecx
.LBB1_2:
        pop     rbp
        ret

Obviously both require benchmarking to know if they are actually faster than a table lookup.
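
For anyone who wants to try, a very rough harness along these lines should be enough to compare them (this assumes parseHex above and AsciiUtils.parseHexDigit from earlier in the thread are in scope; a proper comparison would control for compiler flags and warm-up):

import Foundation

// Very rough timing helper; prints a checksum so the optimizer can't drop the work.
func measure(_ label: String, _ body: () -> UInt64) {
    let start = Date()
    let checksum = body()
    let ms = Date().timeIntervalSince(start) * 1000
    print("\(label): \(ms) ms (checksum: \(checksum))")
}

measure("parseHex (branches)") {
    var sum: UInt64 = 0
    for _ in 0..<100_000 {
        for c in UInt8.min ... UInt8.max { sum &+= UInt64(parseHex(c)) }
    }
    return sum
}

measure("parseHexDigit (table)") {
    var sum: UInt64 = 0
    for _ in 0..<100_000 {
        for c in UInt8.min ... UInt8.max { sum &+= UInt64(AsciiUtils.parseHexDigit(fromAscii: c)) }
    }
    return sum
}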

1 Like

Circling back around to this for posterity and the benefit of future searchers.

When comparing lookup tables to bit-shifted magic numbers, it seems that ARM processors respond much more kindly to the latter. I've been trying to optimise a percent-encoding algorithm, which needs to determine if a particular ASCII character should be escaped. I initially went with a bitshift, like this:

//                 FEDCBA98_76543210_FEDCBA98_76543210_FEDCBA98_76543210_FEDCBA98_76543210
let lo: UInt64 = 0b01010000_00000000_00000000_00000101_11111111_11111111_11111111_11111111
let hi: UInt64 = 0b10000000_00000000_00000000_00000001_00000000_00000000_00000000_00000000
if character.codePoint < 64 {
  return lo & (1 &<< character.codePoint) != 0
} else {
  return hi & (1 &<< (character.codePoint &- 64)) != 0
}

But recently I changed it to a lookup table. The version with the table benchmarked about 10% faster on x86-64 (which is significant), but ARM performance (tested on a Raspberry Pi 4 8GB) plummeted by as much as 90%. I double-checked the results, and they hold - so for now I'm including both the table and the bitshifting method and running the appropriate one for the architecture.
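
For the curious, the dispatch is just a compile-time architecture check. Here's a self-contained sketch: the table is derived from the same bitmask as the snippet above, so the two branches agree by construction, and treating all non-ASCII bytes as needing escaping is an assumption of the sketch.

enum PercentEscaping {
    //                        FEDCBA98_76543210_FEDCBA98_76543210_FEDCBA98_76543210_FEDCBA98_76543210
    static let lo: UInt64 = 0b01010000_00000000_00000000_00000101_11111111_11111111_11111111_11111111
    static let hi: UInt64 = 0b10000000_00000000_00000000_00000001_00000000_00000000_00000000_00000000

    // 256-entry table built from the bitmask, so both paths give identical answers.
    static let table: [Bool] = (UInt8.min ... UInt8.max).map { shouldEscapeBitmask($0) }

    static func shouldEscapeBitmask(_ codePoint: UInt8) -> Bool {
        if codePoint < 64 { return lo & (1 &<< codePoint) != 0 }
        if codePoint < 128 { return hi & (1 &<< (codePoint &- 64)) != 0 }
        return true // assumption: non-ASCII always gets escaped
    }

    static func shouldEscape(_ codePoint: UInt8) -> Bool {
        #if arch(x86_64) || arch(i386)
        return table[Int(codePoint)]        // table lookup benchmarked faster on x86-64
        #else
        return shouldEscapeBitmask(codePoint) // bitmask benchmarked far faster on the Pi 4
        #endif
    }
}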

FWIW, this version of percent encoding is about an order of magnitude faster than Foundation's .addingPercentEncoding method (which I believe uses a lookup table) on ARM in my benchmarks (200ms vs 20ms). I don't know if the findings from the Raspberry Pi would transfer to other ARM platforms, such as Apple Silicon, but I think it could be worth exploring to see if it could improve Foundation's performance on ARM. It's fascinating that we have new avenues to consider for optimisation as personal computers become more architecturally diverse.

If this result holds more generally, I wonder if compilers could be taught to transform a lookup table + mask into a magic number + shift for ARM?

3 Likes

This is going to be extremely sensitive to surrounding context and uArch; it's not a general property of either architecture. For example, on any recent Intel core, you'll be able to make either the lookup table or the bitshift method faster by changing the instruction mix of the rest of the program. If the rest of the program can saturate the load ports (two loads / cycle) then using shifts will win; if the rest of the program can saturate the ALUs (three-four operations / cycle) while leaving load slots available (harder, but still possible), then the lookup table will win. The situation is pretty similar on ARM, except that there's greater diversity in execution resources available across designs.

The actual optimal implementation of this particular algorithm in most cases is going to be a SIMD lookup table in-register, because you can do this determination for 16 or more characters at once. In pseudocode, that would look like:

v0 <- load 16 characters
index <- v0 >> 3
v1 <- table lookup with index
needsEscape <- (v1 >> (v0 & 7)) & 1

This lets you do 16 characters at once, with one load, two shifts, a table lookup (single instruction), and two ands.

8 Likes