Fast binary parsing

I’m sorry, I don’t quite understand this, could you elaborate? Thanks!

Also, I'm not sure how to do this. Do you mean write a different method altogether like getUInt32()? Or do you mean to use @_specialize or write something like get<T where T : UInt32>()? I'm not very well versed in Swift generics (if that wasn't obvious!).

Right now you're making a copy of the first T.size bytes by calling subdata(in:), and then loading those bytes into v.

You can just load the first T.size bytes into v. The compiler might be good enough to optimize it out (I haven't checked the disasm), but it's also easier to read:

let v = self.data.withUnsafeBytes { $0.load(as: T.self) }

Say BinaryReader is in a Lib module and you're calling get from a Main module: either move BinaryReader.get into the Main module, or add this specialised function in the Lib module:

extension BinaryReader {
  mutating func getUInt32() -> UInt32 {
    get()
  }
}

just to see if the specialization works.


One other thing: after fixing the problem with the specialisation, if bigEndian is a type-wide constant, you could try to make it a generic parameter:

protocol Endian { }
enum LittleEndian: Endian { }
enum BigEndian: Endian { }

struct BinaryReader<E: Endian> {
  func get() {
    ...
    if E.self == LittleEndian.self {
      ...
    } else {
      ...
    }
  }
}

In my experience so far, the compiler will readily remove the branching on specialization, but the check is quite costly (compared to a plain Bool) when not specialized. Further testing against your own needs is advised.
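
For concreteness, here is a fuller, self-contained sketch of that idea. It is only a sketch: it reuses the copy-based read from earlier in the thread (so it also works at unaligned offsets and on slices), and the property names are assumptions, not the original reader's code:

import Foundation

protocol Endian {}
enum LittleEndian: Endian {}
enum BigEndian: Endian {}

struct BinaryReader<E: Endian> {
  var data: Data
  var idx: Int

  init(data: Data) {
    self.data = data
    self.idx = data.startIndex
  }

  mutating func get<T: FixedWidthInteger>() -> T {
    let size = MemoryLayout<T>.size
    var v = T()
    withUnsafeMutableBytes(of: &v) { out in
      data[idx..<idx + size].withUnsafeBytes { src in
        out.copyMemory(from: src)
      }
    }
    idx += size
    // E is known at compile time, so a specialized BinaryReader<BigEndian>
    // or BinaryReader<LittleEndian> keeps only one side of this branch.
    if E.self == BigEndian.self {
      return T(bigEndian: v)
    } else {
      return T(littleEndian: v)
    }
  }
}

The caller picks the concrete type once, e.g. BinaryReader<BigEndian>(data: d), and every get() on that value can then compile without the endianness test.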

Ah, I don't think I can do that, since I'm working on a subrange of the Data. I tried this just now and I got a misaligned load exception:

let v: T = self.data[self.idx..<self.idx + size].withUnsafeBytes { $0.load(as: T.self) }
Fatal error: load from misaligned raw pointer: file /AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Sources/swiftlang/swiftlang-1103.8.25.8/swift/stdlib/public/core/UnsafeRawPointer.swift, line 354

I was trying to read 4 bytes (UInt32) at byte 14 of the Data.

The endianness (in my application, reading TIFF files) doesn't change, but it's unknown until the first couple of bytes are read. I could examine those bytes first and then instantiate the BinaryReader, which presumably would avoid the test in each get()?

Ah, so that's why you need the copy. Fair point (load doesn't allow a misaligned read).

You could; the compiler only needs to statically know BinaryReader.E at each call site. It could double the code size, though. You'd need to test it to see whether it works better or worse; I'm just saying it's an option.
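
To make "statically known E" concrete, here is one hedged way the TIFF case could be wired up, building on the BinaryReader<E: Endian> sketch above. readTIFF and parseTIFF are hypothetical names; the byte-order marker values ("II" = little-endian, "MM" = big-endian) are standard TIFF:

import Foundation

// Inspect the byte-order marker, then hand off to a function that is generic
// over the endianness, so every get() inside it sees a concrete E.
func readTIFF(_ data: Data) {
  if data.prefix(2).elementsEqual([0x4D, 0x4D]) {       // "MM": big-endian
    parseTIFF(BinaryReader<BigEndian>(data: data))
  } else {                                              // "II": little-endian
    parseTIFF(BinaryReader<LittleEndian>(data: data))
  }
}

// Both instantiations of this function get compiled (this is where the code
// size can roughly double), but inside each one the endianness branch is gone.
func parseTIFF<E: Endian>(_ reader: BinaryReader<E>) {
  var reader = reader
  let _: UInt16 = reader.get()            // skip past the byte-order marker
  let magic: UInt16 = reader.get()        // 42 for classic TIFF
  let firstIFDOffset: UInt32 = reader.get()
  _ = (magic, firstIFDOffset)
}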

Here’s the disassembly of a version of the file with two specializations:

	mutating
	func
	getUInt16()
		-> UInt16
	{
		get()
	}
	mutating
	func
	getUInt23()
		-> UInt32
	{
		let size = MemoryLayout<UInt32>.stride
		let v: UInt32 = self.data[self.idx..<self.idx + size].withUnsafeBytes { $0.load(as: UInt32.self) }
		self.idx += size
		if self.bigEndian
		{
			return UInt32(bigEndian: v)
		}
		else
		{
			return UInt32(littleEndian: v)
		}
	}

Is there no way to do a misaligned read? I could read the bytes and assemble them into the int myself, I suppose. It would certainly be better than copying the data. Frustrating, though.

(Explicit) misaligned load is most definitely missing. You need to DIY in the meantime. Your current implementation should be fine, though, since (IIRC) a newly created Data will be 16-byte aligned.
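
If you do roll it by hand, a generic sketch of that DIY read could look like this (assuming the reader's existing data, idx and bigEndian properties; getAssembled is a made-up name):

extension BinaryReader {
  mutating func getAssembled<T: FixedWidthInteger>() -> T {
    var v: T = 0
    for i in 0..<MemoryLayout<T>.size {
      // Plain byte reads have no alignment requirement; assemble as if the
      // bytes were little-endian (byte 0 is least significant).
      v |= T(truncatingIfNeeded: self.data[self.idx + i]) << (8 * i)
    }
    self.idx += MemoryLayout<T>.size
    // A big-endian stream gives the byte-reversed value.
    return self.bigEndian ? v.byteSwapped : v
  }
}

Note this still goes through Data's subscript once per byte, which has its own cost (more on that further down the thread).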

It’s on the list.

https://bugs.swift.org/browse/SR-10273 - Add an UnsafeRaw[Buffer]Pointer API for loading and storing unaligned/packed data
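
(For readers coming to this later: that request eventually landed as SE-0349 in Swift 5.7, which added loadUnaligned(fromByteOffset:as:) to the raw pointer and raw buffer pointer types, so the byte-14 read above can be done without a copy on a new enough toolchain. It wasn't available at the time of this thread.)

// Swift 5.7+ only (SE-0349); shown for reference.
let v: UInt32 = data.withUnsafeBytes {
  $0.loadUnaligned(fromByteOffset: 14, as: UInt32.self)   // no alignment requirement
}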

Yeah, I'd just like to avoid the overhead of the copy. The TIFF format is a series of 2- and 4-byte values (and other types like strings), so I do this a lot. The bulk of the data will be 2-byte pixel values (really heights in a height map), hopefully they're aligned in the file!

  • Line 49: get<UInt16>
  • Line 285: generic get
  • Line: getUInt23

Looks like the inits get elided in get<UInt16> and getUInt23 just fine.


Since it seems the caller and get are supposed to end up in different modules, annotating get with @inlinable (which requires annotating data, idx, and bigEndian with @usableFromInline) should help.
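
For illustration, the cross-module shape would be roughly this (a sketch, not the actual reader; the body reuses the subdata-based get from earlier in the thread):

import Foundation

public struct BinaryReader {
  // @usableFromInline lets the @inlinable body below touch these internal
  // stored properties even when it is inlined into another module.
  @usableFromInline var data: Data
  @usableFromInline var idx: Int
  @usableFromInline var bigEndian: Bool

  public init(data: Data, bigEndian: Bool = true) {
    self.data = data
    self.idx = data.startIndex
    self.bigEndian = bigEndian
  }

  // @inlinable ships the body with the module interface, so a caller in
  // another module can specialize and inline get<UInt16>, get<UInt32>, etc.
  @inlinable
  public mutating func get<T: FixedWidthInteger>() -> T {
    let size = MemoryLayout<T>.size
    let v: T = data.subdata(in: idx..<idx + size).withUnsafeBytes { $0.load(as: T.self) }
    idx += size
    return bigEndian ? T(bigEndian: v) : T(littleEndian: v)
  }
}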

So, I re-wrote it like this:

	mutating
	func
	get()
		-> UInt16
	{
		let v: UInt16
		if self.bigEndian
		{
			v = UInt16(self.data[self.idx]) << 8
							| UInt16(self.data[self.idx + 1])
		}
		else
		{
			v = UInt16(self.data[self.idx]) << 0
							| UInt16(self.data[self.idx + 1]) << 8
		}
		self.idx += 2
		return v
	}

and I still get an insane number of calls:

$ xcrun -sdk macosx swiftc -O -emit-assembly BinaryReader.swift
<snip>
	.private_extern	_$s12BinaryReaderAAV3gets6UInt16VyF
	.globl	_$s12BinaryReaderAAV3gets6UInt16VyF
	.p2align	4, 0x90
_$s12BinaryReaderAAV3gets6UInt16VyF:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r12
	pushq	%rbx
	.cfi_offset %rbx, -48
	.cfi_offset %r12, -40
	.cfi_offset %r14, -32
	.cfi_offset %r15, -24
	movb	24(%r13), %r12b
	movq	(%r13), %rbx
	movq	8(%r13), %r14
	movq	16(%r13), %r15
	movq	%rbx, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOy
	movq	%r15, %rdi
	movq	%rbx, %rsi
	movq	%r14, %rdx
	callq	_$s10Foundation4DataVys5UInt8VSicig
	movl	%eax, %r15d
	movq	%rbx, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOe
	movzbl	%r15b, %ecx
	movq	(%r13), %r15
	movq	8(%r13), %r14
	movq	16(%r13), %rbx
	incq	%rbx
	seto	%al
	cmpb	$1, %r12b
	jne	LBB7_3
	testb	%al, %al
	jne	LBB7_9
	movzwl	%cx, %r12d
	shll	$8, %r12d
	movq	%r15, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOy
	movq	%rbx, %rdi
	movq	%r15, %rsi
	movq	%r14, %rdx
	callq	_$s10Foundation4DataVys5UInt8VSicig
	movl	%eax, %ebx
	movq	%r15, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOe
	movzbl	%bl, %eax
	orl	%r12d, %eax
	jmp	LBB7_5
LBB7_3:
	testb	%al, %al
	jne	LBB7_7
	movl	%ecx, %r12d
	movq	%r15, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOy
	movq	%rbx, %rdi
	movq	%r15, %rsi
	movq	%r14, %rdx
	callq	_$s10Foundation4DataVys5UInt8VSicig
	movl	%eax, %ebx
	movq	%r15, %rdi
	movq	%r14, %rsi
	callq	_$s10Foundation4DataV15_RepresentationOWOe
	movzbl	%bl, %ecx
	shll	$8, %ecx
	movzwl	%r12w, %eax
	orl	%ecx, %eax
LBB7_5:
	movq	16(%r13), %rcx
	addq	$2, %rcx
	jo	LBB7_8
	movq	%rcx, 16(%r13)
	popq	%rbx
	popq	%r12
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
LBB7_8:
	## InlineAsm Start
	## InlineAsm End
	ud2
LBB7_9:
	## InlineAsm Start
	## InlineAsm End
	ud2
LBB7_7:
	## InlineAsm Start
	## InlineAsm End
	ud2
	.cfi_endproc
<snip>

Digging into this a little bit, I see

$ xcrun swift-demangle s10Foundation4DataV15_RepresentationOWOy
$s10Foundation4DataV15_RepresentationOWOy ---> outlined copy of Foundation.Data._Representation

That it says “outlined copy” gives me pause. Am I not calling swiftc with the right optimization options? In any case, I feel like this should boil down to a few lines of assembly (some moves, shifts, and ors, with a branch).

Ideally this code would be in a separate module, but it’s currently in the same one as the caller.

While I'm at it, might as well, :stuck_out_tongue:

struct BinaryReader {
  private(set) var data: Data
  var bigEndian = true

  init(data: Data) { self.data = data }

  mutating func get<T>() -> T where T : FixedWidthInteger {
    var v = T()
    withUnsafeMutableBytes(of: &v) { out in
      data.prefix(MemoryLayout<T>.size).withUnsafeBytes { data in
        out.copyMemory(from: data)
      }
    }

    data.removeFirst(MemoryLayout<T>.stride)
    return bigEndian ? T(bigEndian: v) : T(littleEndian: v)
  }

  mutating func seek(by length: Int) {
    precondition(data.count >= length, "seek(by: \(length)) out of bounds")
    data.removeFirst(length)
  }
}
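
Usage would be along these lines (fileData stands in for a Data loaded elsewhere; the comments just describe what each call consumes):

var reader = BinaryReader(data: fileData)
reader.bigEndian = false              // e.g. a little-endian stream
let a: UInt16 = reader.get()          // reads 2 bytes and consumes them
let b: UInt32 = reader.get()          // reads the next 4 bytes
reader.seek(by: 8)                    // skip 8 bytes we don't care about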

This discards read data, does it not? You wouldn't be able to seek back to the beginning, for example.

Hmm, since get seems to generate a pretty decent binary (from my 30sec skimming), might as well stick to it:

struct BinaryReader {
  let base: Data
  private(set) var data: Data

  ...

  mutating func seek(by offset: Int) {
    guard let index = base.index(data.startIndex, offsetBy: offset, limitedBy: offset > 0 ? base.endIndex : base.startIndex) else {
      preconditionFailure("seek(by: \(offset)) out of bounds")
    }
        
    data = base.suffix(from: index)
  }
}

Outlined copy is a good example of the optimiser actually working. The "outliner" is a piece of the compiler that spots repeated code sequences which can lead to a smaller binary if, instead of being inlined everywhere, they are moved into a function that is called. These smaller binaries are often a perf win, not least because the actual cost of calling an x86 function is very low, especially when the call is direct. Function calls themselves are not a huge performance cost, except where they act as optimisation boundaries.

I understand the feeling, but that's not quite an accurate representation. There are a few reasons why.

First, checks. You have a few checked operations here: self.idx += 2 inserts a branch for overflow checking, and self.idx + 1 (repeated twice) also introduces such a branch. These branches are again extremely cheap, as they will never be taken in real code, but they increase the complexity of the generated code.

Second, and more importantly, there's the subscript work, which looks like this:

	callq	outlined copy of Foundation.Data._Representation
	movq	%r15, %rdi
	movq	%rbx, %rsi
	movq	%r14, %rdx
	callq	Foundation.Data.subscript.getter : (Swift.Int) -> Swift.UInt8
	movl	%eax, %r15d
	movq	%rbx, %rdi
	movq	%r14, %rsi
	callq	outlined consume of Foundation.Data._Representation

The dance here is that we take a copy of the backing representation of Data, then call the subscript to get a UInt8, and then consume the copy. The reason this has to be done is twofold. Firstly, Data's backing storage is reference counted, and if the compiler cannot prove this operation does not need to CoW it needs to appropriately manage the refcounting (that's what the copy/consume operations will be doing).

The other reason is the Data subscript itself. While this subscript is inlinable, I suspect it exceeds the inlining threshold and doesn't get inlined. This is because Data is a very complex data type with a wide range of possible representations, so the performance benefit of inlining the code would almost certainly be outweighed by the increase in code size and complexity.

You can see this by considering this compiler explorer project that replaces your use of Data with Array<UInt8>, a vastly less complex data type. Here we get the following generated assembly:

output.Buffer.get() -> Swift.UInt16:
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [r13]
        mov     rcx, qword ptr [r13 + 8]
        mov     rdx, qword ptr [rax + 16]
        cmp     byte ptr [r13 + 16], 1
        jne     .LBB14_4
        cmp     rcx, rdx
        jae     .LBB14_10
        lea     rsi, [rcx + 1]
        cmp     rsi, rdx
        jae     .LBB14_11
        movzx   eax, word ptr [rax + rcx + 32]
        rol     ax, 8
        jmp     .LBB14_7
.LBB14_4:
        cmp     rcx, rdx
        jae     .LBB14_8
        lea     rsi, [rcx + 1]
        cmp     rsi, rdx
        jae     .LBB14_9
        movzx   eax, word ptr [rax + rcx + 32]
.LBB14_7:
        add     rcx, 2
        mov     qword ptr [r13 + 8], rcx
        pop     rbp
        ret
.LBB14_10:
        ud2
.LBB14_11:
        ud2
.LBB14_8:
        ud2
.LBB14_9:
        ud2

This is much closer to where you wanted to end up. We have a few extra ud2 instructions due to the subscripts getting inlined, as they need bounds checking now, but otherwise you get exactly what you'd expect to see: bounds checks, followed by loads and (in the big-endian case) a rol.
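
For reference, the Array<UInt8>-backed source behind that listing is presumably something close to the following (a reconstruction from the mangled Buffer.get symbol, not the actual compiler explorer source):

struct Buffer {
  var data: [UInt8]
  var idx = 0
  var bigEndian = true

  mutating func get() -> UInt16 {
    let v: UInt16
    if bigEndian {
      v = UInt16(data[idx]) << 8 | UInt16(data[idx + 1])
    } else {
      v = UInt16(data[idx]) | UInt16(data[idx + 1]) << 8
    }
    idx += 2
    return v
  }
}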

This is great!

It'd be cool if I could avoid the bounds check, too (from an optimization-exercise PoV), since I can ensure the bounds are valid at a higher level. There's no way to write a non-bounds-checked subscript, is there? Something like []&, in the spirit of the overflow operators, or some such.

You can drop down to UnsafeBufferPointer if it's important to you to remove bounds checks (it checks in debug builds, but not in optimised ones). If you do that and use wrapping arithmetic (you'd have to ensure idx is never equal to Int.max), I think you get the assembly you expected.
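
A sketch of that approach (UnsafeBinaryReader and readAll are made-up names; the caller has to guarantee the pointer stays valid and the indices stay in bounds, which is the whole point of dropping the checks):

import Foundation

struct UnsafeBinaryReader {
  let bytes: UnsafeRawBufferPointer   // validity and bounds are the caller's problem
  var idx = 0
  var bigEndian = true

  // The raw-buffer subscript only bounds-checks in debug builds, and &+ / &+=
  // skip the overflow-check branches.
  mutating func get() -> UInt16 {
    let v = UInt16(bytes[idx]) << 8 | UInt16(bytes[idx &+ 1])
    idx &+= 2
    return bigEndian ? v : v.byteSwapped
  }
}

// Keep everything inside withUnsafeBytes so the pointer never outlives the
// Data that backs it.
func readAll(from data: Data) -> [UInt16] {
  data.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> [UInt16] in
    var reader = UnsafeBinaryReader(bytes: raw)
    var out: [UInt16] = []
    out.reserveCapacity(raw.count / 2)
    while reader.idx &+ 2 <= raw.count {
      out.append(reader.get())
    }
    return out
  }
}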

Some thoughts:

• You might get more speed using the SWIFT_DISABLE_SAFETY_CHECKS flag.
• If you want this to work with different size integers you might look into pre-specializing it with @_specialize, which is documented a bit on the web (there's a sketch just after this list).
• You might want to try -Osize and see if that’s better than -O.
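
As a hedged illustration of the @_specialize suggestion above (it's an underscored, unofficial attribute, so the exact spelling, and whether other modules pick the specializations up, can vary with compiler version and extra parameters such as exported:):

import Foundation

struct BinaryReader {
  var data: Data
  var idx: Int
  var bigEndian = true

  // Asks the compiler to emit concrete specializations alongside the generic
  // entry point for the listed types.
  @_specialize(where T == UInt16)
  @_specialize(where T == UInt32)
  mutating func get<T: FixedWidthInteger>() -> T {
    let size = MemoryLayout<T>.size
    let v: T = data.subdata(in: idx..<idx + size).withUnsafeBytes { $0.load(as: T.self) }
    idx += size
    return bigEndian ? T(bigEndian: v) : T(littleEndian: v)
  }
}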

And, as usual, the hill I will die on:

• I always tell people to try NOT to optimize code they haven’t used in situ. It’s usually a huge waste of your time, since you don’t actually know how much time this function will take relative to the rest of your program until you actually have the program written. You could spend days worrying about this and find out it’s 0.0001% of the total time to read a file, or that files read in 0.00001 seconds even ‘unoptimized.’ I know it can be a fun game to see how small your code can become (Woz was famous for that, and I learned it from him) but nowadays that’s the opposite of ‘shipping.’ But if you’re just doing this for fun, go for it!

-Wil
