Data.popFirst(), removeFirst() adjust indices

I'm finding a behavior I wouldn't expect that's undocumented, but surely it's commonly known (by people smarter than me).

I'm parsing some incoming data. I accumulate it in a Data(), and wanted to discard bytes off the front using popFirst() or removeFirst(). But I find if I do that, then try to access the new first byte with data[0], I get EXC_BAD_INSTRUCTION. It seems the new first byte lives at data[1], and there is no data[0].

Neither the header file nor the Xcode docs describe this behavior. And I can't even find docs for Data.dropFirst(), which also compiles (I mean, I can't find it because Xcode's code completion is so broken).

In any case, how is this behavior useful? I can work around it by creating new Data objects based on the subrange I want, but it just feels clunky.

1 Like

Collection types are not required to use integer indices, and when they do, the first element is not required to have index 0 (and when it does, the second element is not required to have index 1). Unless a type documents additional guarantees, it is not correct to make that assumption. You can get the first element of an instance of any collection type using first.

Wow, I did some googling before posting and couldn't find anything (Google really fails sometimes with commonly overloaded terms).

I still question this. It's great that it's mentioned in some far-off place in the docs that you can't assume the first index is 0, I guess, but it goes against decades of programming convention. So, at the very least, I'd argue that the docs should be littered with notes to this effect.

But it also seems to make code less concise. I can't write [0], I have to write some really hard to discover [<first>] (I still couldn't figure out how to write that for Data), then make all my references relative to that.

And none of those references explains why it was done this way.

A fundamental requirement of the Collection protocol is that slices share indices with the parent collection. It is incredibly useful, because it means that for any valid index, slice[index] == collection[index], and many generic collection algorithms are much more efficient because of it.

When you want to reset the indices for some reason, just create a fresh Data instance from the Data.SubSequence:

let slice = myData.dropFrist(20)
let reset = Data(slice)
assert(reset.startIndex == 0) // ✓

But it is usually much better to just write your code in terms of a generic Collection as much as possible. That way the compiler prevents you from making mistakes and by definition also steers your design in a more performant direction. For example, write extension Collection or extension Collection where Element == UInt8 instead of extension Data. A bonus is that you get the same methods for free on Data, Data.SubSequence, [UInt8], [UInt8].SubSequence and even String.UTF8View and String.UTF8View.SubSequence.

The term for what Data does is that it is “self-slicing”. that is, Data.SliceType == Data. As noted by @SDGGiesbrecht, it is required that slice indices are shared across the slice and the sliced entry.

This is basically fine. It’s only really a problem when you have both an integer Index type and a self-slicing type. If you break either of those criteria, the problem goes away: if your Index isn’t an integer then there is no temptation to write [0], and if your type isn’t self-slicing then you hopefully spotted that you got a new type out of the slicing operation and dove into its rules.

Data having this behaviour is ultimately just an awkwardness that managed to make its way into the language that is very hard to fix now, as Data is frozen.

In your case, collection[0] is always able to be spelled collection.first.

(FWIW, NIO’s “so you’re building a collection” tribal knowledge says never to use Integer index types because they only cause confusion: we use opaque index types pervasively.)

4 Likes

In the spirit of helping me understand this better, how should I have written this serial data parser? I reference the first few indices of my data throughout, and frequently pop off the first byte (using Data(self.data[1...]) in order to try the parse again if it fails. Then I wait for more data to come in and try again, until I have a complete packet.

func
serialPort(_ inPort: ORSSerialPort, didReceive inData: Data)
{
    if inData.count == 0 { return }                 //   Probably unnecessary
    
    if qVerboseLogging
    {
        debugLog("didReceive")
    }
    
    //  Note: This is not the most efficient parser, as it
    //  repeats work done when new data comes in, but
    //  it’s a bit easier to code.
    
    //  Accumulate the new data…
    
    self.data += inData
    if qVerboseLogging
    {
        self.data.hexCharDump(grouping: 16).forEach { debugLog($0) }
    }
    while self.data.count > 0
    {
        if self.parseState == .idle
        {
            if let start = FrameStart(rawValue: self.data[0])
            {
                //  Is it one of the single-byte frames?
                
                let singles: [FrameStart] = [.ack, .nak, .syn, .can]
                if singles.contains(start)
                {
                    received(start)
                    self.data = Data(self.data[1...])
                    continue
                }
                else if start == .sof
                {
                    //  It's the start of a packet…
                    
                    self.parseState = .parsing
                    
                    self.data = Data(self.data[1...])
                    continue
                }
            }
            else
            {
                debugLog("Error parsing packets: unexpected byte in idle state (\(inData[0]))")
                self.data = Data(self.data[1...])
                //  (TODO: send NAK?)
                continue
            }
        }
        else                                            //  We're in the middle of parsing a packet
        {
            if self.data.count < 4                                                  //  Not enough bytes to form a full packet, just wait for more
            {
                return
            }
            if self.data[1] != 0x00 && self.data[1] != 0x01                         //  Not a request or response packet, so something is broken
            {                                                                       //  Discard the first byte and try again (TODO: send NAK?)
                debugLog("not resp or req (\(self.data[1]))")
                self.data = Data(self.data[1...])
                continue
            }
            let length = self.data[0]
            if self.data.count < length + 1                                         //  Incomplete packet, wait for more data
            {
                return
            }
            
            //  We’ve got enough data, validate the checksum…
            
            let checksum = self.data[0 ..< length].reduce(0xff, ^)
            if checksum != self.data[Data.Index(length)]                            //  Bad checksum, discard the first byte and try again (TODO: send NAK?)
            {
                self.data = Data(self.data[1...])
                continue
            }
            
            //  We have a valid packet! Peel it off and process it…
            
            let packet = self.data[0 ..< length + 1]
            self.data = Data(self.data[(length + 1)...])
            received(packet: packet)
            
            self.parseState = .idle
            self.data = Data()
        }
    }   //  while true
//      self.data.hexCharDump().forEach { debugLog($0) }
}

Firstly... what? You extracted length = data[0] which should be UInt8, then compare it against data.count which should be Int. Either you have custom subscript or custom data type. That alone is enough to void a lot of the discussion in this thread. Not to mention that having APIs that looks like Collection but doesn't exactly behave like one pose a danger in and of itself. Nonetheless, we'd still suggest that you work with index instead of offset. The < operator did throw me off (see replies below)

Anyhow, here's my take:

func serialPort(_ inPort: ORSSerialPort, didReceive inData: Data) {
  guard !inData.isEmpty else {
    return // Probably unnecessary
  }

  //  Note: This is not the most efficient parser, as it
  //  repeats work done when new data comes in, but
  //  it’s a bit easier to code.

  //  Accumulate the new data…
  data += inData

  if qVerboseLogging {
    debugLog("didReceive")
    data.hexCharDump(grouping: 16).forEach { debugLog($0) }
  }

  while !data.isEmpty {
    if parseState == .idle {
      let first = data.removeFirst()

      guard let start = FrameStart(rawValue: first) else {
        debugLog("Error parsing packets: unexpected byte in idle state (\(inData[0]))")
        //  (TODO: send NAK?)
        continue
      }

      //  Is it one of the single-byte frames?
      switch start {
      case .ack, .nak, .syn, .can:
        received(start)
      case .sof:
        //  It's the start of a packet…
        parseState = .parsing
      default: break
      }
    } else {  //  We're in the middle of parsing a packet
      guard data.count >= 4 else {
        //  Not enough bytes to form a full packet, just wait for more
        return
      }

      let length = Int(data.first! + 1)
      let flag = data.dropFirst().first!

      guard flag == 0x00 || flat == 0x01 else {
        //  Not a request or response packet, so something is broken
        //  Discard the first byte and try again (TODO: send NAK?)
        debugLog("not resp or req (\(self.data[1]))")
        data.removeFirst()
        continue        
      }

      guard data.count >= length else {
        //  Incomplete packet, wait for more data
        return
      }

      let endIndex = data.index(data.startIndex, offsetBy: length)

      //  We’ve got enough data, validate the checksum...
      let packet = data[..<endIndex]
      let checksum = packet.reduce(0xff, ^)
      guard checksum == 0 else {
        //  Bad checksum, discard the first byte and try again (TODO: send NAK?)
        data.removeFirst()
        continue
      }

      //  We have a valid packet! Peel it off and process it…
      data = data[endIndex...]
      received(packet: packet)

      parseState = .idle
      data = Data()
    }
  }
  // self.data.hexCharDump().forEach { debugLog($0) }
}

One could come up with different ways of course, but the important thing is that:

  • Uses index api to get index,
  • Treat Index like opaque type, never assume begin, end, interval, unless it's guaranteed by the collection itself, like Array,
  • Invalidate any index post collection mutation.
3 Likes

Awesome, thank you! This is way better, and obvious now that I see it. And no, I don't have any custom subscripts, and there's no warning or other problem from the compiler.

< is supported across any pair of BinaryInteger types, so it probably is just comparing UInt8 to Int.

(Admittedly, I didn’t know it was possible until just now either.)

how should I have written this serial data parser?

I generally implement this stuff by adding my own ‘read bytes’ abstraction on top of Data. Something like this:

extension Data {

    fileprivate mutating func readByte() -> UInt8? {
        guard let first = self.first else { return nil }
        self.removeFirst()
        return first
    }

    fileprivate mutating func readBytes(count: Int) -> Data? {
        guard self.count >= count else { return nil }
        defer { self.removeFirst(count) }
        return self.prefix(count)
    }
}

My parser code then works in terms of those primitives, and I find that makes it much easier to read.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

1 Like

Interesting, never noticed that before.

I almost suggest him this too. Then notice that there’s non-destructive read/seek in the code.

Then notice that there’s non-destructive read/seek in the code.

Can you point me at that specifically? I had a look through the code and I don’t see anything I’d classify as a seek.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

Not quite seek, but he reads length & flags in the else case, that is sometimes not removed.