Decoding UTF-8 tagged with number of code points

I'm dealing with a protocol that, for reasons unbeknownst to me, uses a variable-length string encoding consisting of the number of Unicode code points followed by the string's UTF-8 encoding.

Can anyone think of a less verbose way of implementing this? data below is the decoding buffer.

func decode(_ type: String.Type) throws -> String {
    struct IntrospectableIterator<T: Collection>: IteratorProtocol {
        typealias Element = T.Element
        var iterator: T.Iterator
        var position = 0

        init(_ elements: T) { iterator = elements.makeIterator() }

        mutating func next() -> Element? {
            position += 1
            return iterator.next()
        }
    }

    let scalarCount = try Int(decodeInteger(UInt16.self))
    guard scalarCount > 0 else { return String() }
    var iterator = IntrospectableIterator(data)
    var scalars = [Unicode.Scalar]()
    var utf8Decoder = UTF8()

    for _ in 0..<scalarCount {
        switch utf8Decoder.decode(&iterator) {
        case let .scalarValue(v):
            scalars.append(v)
        case .emptyInput:
            throw Ocp1Error.pduTooShort
        case .error:
            throw Ocp1Error.stringNotDecodable([UInt8](data))
        }
    }

    data.removeFirst(iterator.position - 1)

    return String(String.UnicodeScalarView(scalars))
}
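
For reference, the encoding side of this wire format amounts to something like the following counterpart (my sketch, not part of the protocol code; the big-endian byte order is an assumption, while the UInt16 width matches the decodeInteger(UInt16.self) call above):

func encode(_ string: String) -> [UInt8] {
    // Length prefix: the number of Unicode scalars, assumed to be a big-endian UInt16.
    let scalarCount = UInt16(string.unicodeScalars.count)
    var bytes: [UInt8] = withUnsafeBytes(of: scalarCount.bigEndian) { Array($0) }
    // Payload: the UTF-8 code units of the string.
    bytes.append(contentsOf: string.utf8)
    return bytes
}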

UTF-8 is an interesting data format. Each Unicode scalar in UTF-8 is encoded as a byte sequence whose first byte is easy to recognize. All you need to do is count those leading bytes. Once you reach the right count, you skip past the last sequence's remaining bytes and you're done.

Here is a loop I made (very untested, but should probably work):

var currentScalarCount = 0
var endIndex = data.startIndex // past-the-end index of the final code point's bytes
for byteIndex in data.indices {
    if (data[byteIndex] & 0b1000_0000) == 0 // one-byte sequence
        || (data[byteIndex] & 0b1100_0000) == 0b1100_0000 // start of a multi-byte sequence
    {
        // found the start of a code point
        currentScalarCount += 1
        if currentScalarCount == expectedScalarCount { // the count from the length prefix
            // reached the start of the last sequence; its leading byte tells us its length
            if (data[byteIndex] & 0b1000_0000) == 0 {
                endIndex = byteIndex + 1
            } else if (data[byteIndex] & 0b1110_0000) == 0b1100_0000 {
                endIndex = byteIndex + 2
            } else if (data[byteIndex] & 0b1111_0000) == 0b1110_0000 {
                endIndex = byteIndex + 3
            } else if (data[byteIndex] & 0b1111_1000) == 0b1111_0000 {
                endIndex = byteIndex + 4
            }
            // anything else is an invalid UTF-8 leading byte
            break
        }
    }
}

You could try doing more validation, but String is going to revalidate the UTF-8 anyway, so it's unnecessary in this case.
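
Assuming the loop above found endIndex, the remaining steps could then be as simple as this (my sketch, not part of the original reply):

// Build the String straight from the UTF-8 prefix; String(decoding:as:)
// validates the bytes and substitutes U+FFFD for any invalid sequences.
let utf8Bytes = data[..<endIndex]
let result = String(decoding: utf8Bytes, as: UTF8.self)
data.removeFirst(utf8Bytes.count)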


Edit: I had mixed up the leading-byte and continuation-byte patterns. Corrected now, but the earlier version of this code was wrong.


Your code is using a deprecated UnicodeCodec.decode(_:) method. The replacement Unicode.Encoding and Unicode.Parser APIs from SE-0163 allow you to parse and decode separately. For example:

  // Count the number of valid UTF-8 code units.
  var utf8Count = 0
  var utf8Parser = Unicode.UTF8.ForwardParser()
  for _ in 0..<scalarCount {
    switch utf8Parser.parseScalar(from: &iterator) {
    case .valid(let utf8Buffer):
      utf8Count += utf8Buffer.count
    case .emptyInput:
      throw Ocp1Error.pduTooShort
    case .error:
      throw Ocp1Error.stringNotDecodable([UInt8](data))
    }
  }

  // Decode and remove the code units.
  let utf8Prefix = data.prefix(utf8Count)
  data.removeFirst(utf8Count)
  return String(unsafeUninitializedCapacity: utf8Count) {
    _ = $0.initialize(fromContentsOf: utf8Prefix)
    return utf8Count
  }

Both the deprecated and replacement APIs might advance your iterator too far, because the UTF-8 parser has a buffering mode for non-ASCII. So your existing data.removeFirst(…) call might remove too many elements.
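
If you would rather keep the original scalar-decoding loop, one way around that (a sketch on my part, using a hypothetical utf8Width helper) is to derive the number of consumed bytes from the scalars that were actually decoded instead of from the wrapper iterator's position:

// UTF-8 width of a scalar, following the standard encoding ranges.
func utf8Width(_ scalar: Unicode.Scalar) -> Int {
    switch scalar.value {
    case ..<0x80: return 1
    case ..<0x800: return 2
    case ..<0x1_0000: return 3
    default: return 4
    }
}

// After the decode loop: remove exactly the bytes occupied by the decoded
// scalars, no matter how far ahead the parser happened to read.
let consumed = scalars.reduce(0) { $0 + utf8Width($1) }
data.removeFirst(consumed)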

This topic seems relevant to the Unicode Processing APIs pitch by @Michael_Ilseman.


Thank you so much for the detailed reply and solution, much appreciated (and merged).