Fast binary parsing

JetForMe · July 29, 2020, 7:10am

I'm trying to write a simple binary stream reader for Data that will let me pick off basic FixedWidthInteger values. Unfortunately, the resulting code makes a LOT of calls that I think should be avoidable. This is the basic code:

struct
BinaryReader
{
	init(data inData: Data)
	{
		self.data = inData
	}
	
	mutating
	func
	get<T>()
		-> T where T : FixedWidthInteger
	{
		let size = MemoryLayout<T>.size
		let v: T = self.data.subdata(in: self.idx..<self.idx + size).withUnsafeBytes { $0.load(as: T.self) }
		self.idx += size
		if self.bigEndian
		{
			return T(bigEndian: v)
		}
		else
		{
			return T(littleEndian: v)
		}
	}
	
	mutating
	func
	seek(by inDelta: Int)
	{
		self.idx += inDelta
		precondition(self.idx >= 0 && self.idx < self.data.count, "seek(by: \(inDelta)) out of bounds")
	}
	
	let data			:	Data
	var idx							=	0
	var bigEndian					=	true
}

I had hoped that at least the init(bigEndian:)/init(littleEndian:) calls would optimize away, but they don’t. Looking at assembly using xcrun -sdk macosx swiftc [-O] -emit-assembly BinaryReader.swift, I get this:

Without -O:

...
	testb	$1, 24(%rdx)
	jne	LBB7_10
	jmp	LBB7_11
LBB7_10:
	movq	-160(%rbp), %rdi
	movq	-184(%rbp), %rsi
	movq	-112(%rbp), %rdx
	movq	-136(%rbp), %rax
	callq	*16(%rax)
	movq	-88(%rbp), %rcx
	movq	%rax, -288(%rbp)
	movq	%rcx, %rax
	movq	-160(%rbp), %rdi
	movq	-112(%rbp), %r13
	movq	-112(%rbp), %rsi
	movq	-104(%rbp), %rdx
	callq	_$ss17FixedWidthIntegerP9bigEndianxx_tcfCTj
	movq	-184(%rbp), %rdi
	movq	-112(%rbp), %rsi
	movq	-136(%rbp), %rax
	callq	*8(%rax)
	jmp	LBB7_12
LBB7_11:
	movq	-160(%rbp), %rdi
	movq	-184(%rbp), %rsi
	movq	-112(%rbp), %rdx
	movq	-136(%rbp), %rax
	callq	*16(%rax)
	movq	-88(%rbp), %rcx
	movq	%rax, -296(%rbp)
	movq	%rcx, %rax
	movq	-160(%rbp), %rdi
	movq	-112(%rbp), %r13
	movq	-112(%rbp), %rsi
	movq	-104(%rbp), %rdx
	callq	_$ss17FixedWidthIntegerP12littleEndianxx_tcfCTj
	movq	-184(%rbp), %rdi
	movq	-112(%rbp), %rsi
	movq	-136(%rbp), %rax
	callq	*8(%rax)
LBB7_12:
	leaq	-24(%rbp), %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%rbp
	retq
...

And with -O:

...
	movq	-88(%rbp), %rax
	jne	LBB4_5
	movq	-48(%rbp), %rdi
	movq	-56(%rbp), %rsi
	movq	%rsi, %r13
	movq	%r15, %rdx
	callq	_$ss17FixedWidthIntegerP9bigEndianxx_tcfCTj
	jmp	LBB4_6
LBB4_5:
	movq	-48(%rbp), %rdi
	movq	-56(%rbp), %rsi
	movq	%rsi, %r13
	movq	%r15, %rdx
	callq	_$ss17FixedWidthIntegerP12littleEndianxx_tcfCTj
LBB4_6:
	leaq	-40(%rbp), %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
...

There are more calls for the withUnsafeBytes and load calls. It seems all of that stuff should be able to be optimized away. Is there a different way to implement this that would result in better performance? I’ve got to parse a lot of big files.

lukasa · July 29, 2020, 11:57am

Based on the demangled names here, it seems that this is the unspecialised generic version of the code. That code cannot elide the call to init(bigEndian:) because for arbitrary conformances to FixedWidthInteger that could do anything: it's impossible to know.

I recommend creating a test function (in the same Swift module) that instantiates this with a concrete type and actually tries to use it. That will cause the compiler to specialise the code and you can see what it looks like for an actual integer type.

Lantua · July 29, 2020, 6:20pm

On an unrelated note:

You don't need to make copy of subdata, $0.load only copy data at the beginning. The caveat is that you don't go out-of-bound, but (I believe) that's the same with subdata anyway.

JetForMe · July 29, 2020, 6:20pm

Ah, I guess I thought the compiler created specializations automatically. Is there a way to emit assembly with symbol names demangled? Is there a way to demangle symbols on the command line, maybe?

FWIW, here's the full disassembly (with -O) of the class.

JetForMe · July 29, 2020, 6:28pm

I’m sorry, I don’t quite understand this, could you elaborate? Thanks!

JetForMe · July 29, 2020, 6:29pm

Also, I'm not sure how to do this. Do you mean write a different method altogether like getUInt32()? Or do you mean to use @_specialize or write something like get<T where T : UInt32>()? I'm not very well versed in Swift generics (if that wasn't obvious!).

Lantua · July 29, 2020, 6:36pm

Right now, you're making a copy of the first T.size bytes by calling subdata(in:), you then load the first T.size bytes into v.

You can just load the first T.size bytes into v. The compiler might be good enough to optimize it out (I haven't checked the disasm), but it's also easier to read:

let v = self.data.withUnsafeBytes { $0.load(as: T.self) }

Say, BinaryReader is in Lib module and you're calling get from Main module, either move BinaryReader.get to Main module, or make this specialised function call in the Lib module:

extension BinaryReader {
  getUInt32() -> UInt32 {
    get()
  }
}

just to see if the specialization works.

One other things, after fixing the problem with the specialisation, if bigEndian is a type-wide constant, you could try to make it a generic parameter:

protocol Endian { }
enum LittleEndian: Endian { }
enum BigEndian: Endian { }

struct BinaryReader<E: Endian> {
  func get() {
    ...
    if E.self == LittleEndian.self {
      ...
    } else {
      ...
    }
  }
}

From my experience so far, the compiler will readily remove the branching on specialization, but it is quite costly (compared to Bool) when not specialized. Further testing for ones need is advised.

JetForMe · July 29, 2020, 6:46pm

Lantua:

Right now, you're making a copy of the first T.size bytes by calling subdata(in:) , you then load the first T.size bytes into v .

You can just load the first T.size bytes into v . The compiler might be good enough to optimize it out (I haven't checked the disasm), but it's also easier to read:
let v = self.data.withUnsafeBytes { $0.load(as: T.self) }

Ah, I don't think I can do that, since I'm working on a subrange of the Data. I tried this just now and I got a misaligned load exception:

let v: T = self.data[self.idx..<self.idx + size].withUnsafeBytes { $0.load(as: T.self) }

Fatal error: load from misaligned raw pointer: file /AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Sources/swiftlang/swiftlang-1103.8.25.8/swift/stdlib/public/core/UnsafeRawPointer.swift, line 354

I was trying to read 4 bytes (UInt32) at byte 14 of the Data.

The endianness is (in my application, reading TIFF files) doesn’t change, but it’s unknown until the first couple of bytes are read. I could examine those bytes first and then instantiate the BinaryReader, which presumably would avoid the test in each get()?

Lantua · July 29, 2020, 6:49pm

Ah, so that's why you need copy. Fair point (load doesn't allow misaligned read).

You could, the compiler only needs to statically know BinaryReader.E at each line. It could double the code size though. You probably need to test it to see if it works better or worse. I'm just saying it's an option.

JetForMe · July 29, 2020, 6:50pm

Here’s the disassembly of a version of the file with two specializations:

	mutating
	func
	getUInt16()
		-> UInt16
	{
		get()
	}
	mutating
	func
	getUInt23()
		-> UInt32
	{
		let size = MemoryLayout<UInt32>.stride
		let v: UInt32 = self.data[self.idx..<self.idx + size].withUnsafeBytes { $0.load(as: UInt32.self) }
		self.idx += size
		if self.bigEndian
		{
			return UInt32(bigEndian: v)
		}
		else
		{
			return UInt32(littleEndian: v)
		}
	}

JetForMe · July 29, 2020, 6:53pm

Is there no way to do a misaligned read? I could read the bytes and assemble them into the int myself, I suppose. It would certainly be better than copying the data. Frustrating, though.

Lantua · July 29, 2020, 6:58pm

(Explicit) misalign load is most definitely missing. You need to DIY in the meantime. Your current implementation should be fine though since (IIRC) newly created Data will be 16byte align.

Karl · July 29, 2020, 6:58pm

It’s on the list.

[SR-10273] Add an UnsafeRaw[Buffer]Pointer API for loading and storing unaligned/packed data · Issue #52673 · apple/swift · GitHub - Add an UnsafeRaw[Buffer]Pointer API for loading and storing unaligned/packed data

JetForMe · July 29, 2020, 7:00pm

Yeah, I'd just like to avoid the overhead of the copy. The TIFF format is a series of 2- and 4-byte values (and other types like strings), so I do this a lot. The bulk of the data will be 2-byte pixel values (really heights in a height map), hopefully they're aligned in the file!

Lantua · July 29, 2020, 7:15pm

Line 49: get<UInt16>
Line 285: generic get
Line: getUInt23

Looks like the inits get elided in get<UInt16> and getUInt23 just fine.

It seems the caller and get are supposed to be in a different module, annotating get with @inlinable (requiring you to annotate data, idx, bigEndian with @usableFromInline) should help.

JetForMe · July 29, 2020, 7:28pm

So, I re-wrote it like this:

	mutating
	func
	get()
		-> UInt16
	{
		let v: UInt16
		if self.bigEndian
		{
			v = UInt16(self.data[self.idx]) << 8
							| UInt16(self.data[self.idx + 1])
		}
		else
		{
			v = UInt16(self.data[self.idx]) << 0
							| UInt16(self.data[self.idx + 1]) << 8
		}
		self.idx += 2
		return v
	}

and I still get an insane number of calls:

$ xcrun -sdk macosx swiftc -O -emit-assembly BinaryReader.swift
<snip>
	.private_extern	_$s12BinaryReaderAAV3gets6UInt16VyF
	.globl	_$s12BinaryReaderAAV3gets6UInt16VyF
	.p2align	4, 0x90
_$s12BinaryReaderAAV3gets6UInt16VyF:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r12
	pushq	%rbx
	.cfi_offset %rbx, -48
	.cfi_offset %r12, -40
	.cfi_offset %r14, -32
	.cfi_offset %r15, -24
	movb	24(%r13), %r12b
	movq	(%r13), %rbx
	movq	8(%r13), %r14
	movq	16(%r13), %r15
	movq	%rbx, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOy
	movq	%r15, %rdi
	movq	%rbx, %rsi
	movq	%r14, %rdx
	callq	_$s10Foundation4DataVys5UInt8VSicig
	movl	%eax, %r15d
	movq	%rbx, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOe
	movzbl	%r15b, %ecx
	movq	(%r13), %r15
	movq	8(%r13), %r14
	movq	16(%r13), %rbx
	incq	%rbx
	seto	%al
	cmpb	$1, %r12b
	jne	LBB7_3
	testb	%al, %al
	jne	LBB7_9
	movzwl	%cx, %r12d
	shll	$8, %r12d
	movq	%r15, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOy
	movq	%rbx, %rdi
	movq	%r15, %rsi
	movq	%r14, %rdx
	callq	_$s10Foundation4DataVys5UInt8VSicig
	movl	%eax, %ebx
	movq	%r15, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOe
	movzbl	%bl, %eax
	orl	%r12d, %eax
	jmp	LBB7_5
LBB7_3:
	testb	%al, %al
	jne	LBB7_7
	movl	%ecx, %r12d
	movq	%r15, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOy
	movq	%rbx, %rdi
	movq	%r15, %rsi
	movq	%r14, %rdx
	callq	_$s10Foundation4DataVys5UInt8VSicig
	movl	%eax, %ebx
	movq	%r15, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOe
	movzbl	%bl, %ecx
	shll	$8, %ecx
	movzwl	%r12w, %eax
	orl	%ecx, %eax
LBB7_5:
	movq	16(%r13), %rcx
	addq	$2, %rcx
	jo	LBB7_8
	movq	%rcx, 16(%r13)
	popq	%rbx
	popq	%r12
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
LBB7_8:
	## InlineAsm Start
	## InlineAsm End
	ud2
LBB7_9:
	## InlineAsm Start
	## InlineAsm End
	ud2
LBB7_7:
	## InlineAsm Start
	## InlineAsm End
	ud2
	.cfi_endproc
<snip>

Digging into this a little bit, I see

$ xcrun swift-demangle s10Foundation4DataV15_RepresentationOWOy
$s10Foundation4DataV15_RepresentationOWOy ---> outlined copy of Foundation.Data._Representation

That it says “outlined copy” gives me pause. Am I not calling swiftc with the right optimization options? In any case, I feel like this should boil down to a few lines of assembly (some moves, shifts, and ors, with a branch).

JetForMe · July 29, 2020, 7:36pm

Ideally this code would be in a separate module, but it’s currently in the same one as the caller.

Lantua · July 29, 2020, 8:12pm

While I'm at it, might as well,

struct BinaryReader {
  private(set) var data: Data
  var bigEndian = true

  init(data: Data) { self.data = data }

  mutating func get<T>() -> T where T : FixedWidthInteger {
    var v = T()
    withUnsafeMutableBytes(of: &v) { out in
      data.prefix(MemoryLayout<T>.size).withUnsafeBytes { data in
        out.copyMemory(from: data)
      }
    }

    data.removeFirst(MemoryLayout<T>.stride)
    return bigEndian ? T(bigEndian: v) : T(littleEndian: v)
  }

  mutating func seek(by length: Int) {
    precondition(data.count >= length, "seek(by: \(length)) out of bounds")
    data.removeFirst(length)
  }
}

JetForMe · July 30, 2020, 1:32am

This discards read data, does it not? You wouldn't be able to seek back to the beginning, for example.

Lantua · July 30, 2020, 11:16am

Hmm, since get seems to generate a pretty decent binary (from my 30sec skimming), might as well stick to it:

struct BinaryReader {
  let base: Data
  private(set) var data: Data

  ...

  mutating func seek(by offset: Int) {
    guard let index = base.index(data.startIndex, offsetBy: offset, limitedBy: offset > 0 ? base.endIndex : base.startIndex) else {
      preconditionFailure("seek(by: \(offset)) out of bounds")
    }
        
    data = base.suffix(from: index)
  }
}