Decoding UTF8 tagged with number of code points

benrimmington · January 16, 2024, 1:04am

Your code is using a deprecated UnicodeCodec.decode(_:) method. The replacement Unicode.Encoding and Unicode.Parser APIs from SE-0163 allow you to parse and decode separately. For example:

  // Count the number of valid UTF-8 code units.
  var utf8Count = 0
  var utf8Parser = Unicode.UTF8.ForwardParser()
  for _ in 0..<scalarCount {
    switch utf8Parser.parseScalar(from: &iterator) {
    case .valid(let utf8Buffer):
      utf8Count += utf8Buffer.count
    case .emptyInput:
      throw Ocp1Error.pduTooShort
    case .error:
      throw Ocp1Error.stringNotDecodable([UInt8](data))
    }
  }

  // Decode and remove the code units.
  let utf8Prefix = data.prefix(utf8Count)
  data.removeFirst(utf8Count)
  return String(unsafeUninitializedCapacity: utf8Count) {
    _ = $0.initialize(fromContentsOf: utf8Prefix)
    return utf8Count
  }
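
To try this outside the original project, here is a minimal self-contained sketch of the same approach. The function name decodePrefix, the DecodeError type, and the use of [UInt8] in place of the original data value are assumptions for illustration only. It also swaps String(unsafeUninitializedCapacity:) for String(decoding:as:) to keep the example short.

```swift
// Hypothetical error type standing in for the thread's Ocp1Error.
enum DecodeError: Error {
  case tooShort
  case notDecodable
}

// Parses `scalarCount` Unicode scalars from the front of `bytes`,
// then removes exactly the code units that those scalars occupy.
func decodePrefix(scalarCount: Int, from bytes: inout [UInt8]) throws -> String {
  var iterator = bytes.makeIterator()
  var parser = Unicode.UTF8.ForwardParser()
  var utf8Count = 0
  for _ in 0..<scalarCount {
    switch parser.parseScalar(from: &iterator) {
    case .valid(let utf8Buffer):
      // Count the code units of the scalar itself, not the iterator's position.
      utf8Count += utf8Buffer.count
    case .emptyInput:
      throw DecodeError.tooShort
    case .error:
      throw DecodeError.notDecodable
    }
  }
  let utf8Prefix = bytes.prefix(utf8Count)
  bytes.removeFirst(utf8Count)
  // The prefix was validated above, so no repair should be needed here.
  return String(decoding: utf8Prefix, as: UTF8.self)
}

// Usage: "h" is 1 code unit and U+00E9 is 2, so 3 code units are removed.
var remaining: [UInt8] = Array("h\u{E9}llo!".utf8)
let decoded = try decodePrefix(scalarCount: 2, from: &remaining)
print(decoded, remaining.count)
```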

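Both the deprecated and replacement APIs might advance your iterator too far, because the UTF-8 parser switches to a buffering mode for non-ASCII input and can read code units beyond the scalar it returns. So your existing data.removeFirst(…) call might remove too many elements. That is why the example above sums utf8Buffer.count instead of relying on the iterator's position.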
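
One way to observe the read-ahead yourself is to wrap the iterator in a counter. The CountingIterator type below is a hypothetical helper written for this illustration, not a standard library type. How far the parser reads ahead is an implementation detail, so the sketch prints both numbers without assuming a particular value.

```swift
// Hypothetical wrapper that counts how many elements the parser pulls.
struct CountingIterator<Base: IteratorProtocol>: IteratorProtocol {
  var base: Base
  var consumed = 0

  mutating func next() -> Base.Element? {
    let element = base.next()
    if element != nil { consumed += 1 }
    return element
  }
}

// U+00E9 encodes as 2 code units, followed by 3 ASCII code units.
let sampleBytes = Array("\u{E9}abc".utf8)
var countingIterator = CountingIterator(base: sampleBytes.makeIterator())
var sampleParser = Unicode.UTF8.ForwardParser()

var scalarLength = 0
if case .valid(let utf8Buffer) = sampleParser.parseScalar(from: &countingIterator) {
  scalarLength = utf8Buffer.count
}

// If `consumed` is larger than `scalarLength`, the parser has buffered ahead,
// and removing `consumed` elements would drop bytes of the following scalars.
print("scalar length:", scalarLength)
print("code units pulled from iterator:", countingIterator.consumed)
```

Any code units the parser has already buffered stay inside the parser, so the next parseScalar(from:) call on the same parser still sees them. Only the iterator's position is misleading.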

This topic seems relevant to the Unicode Processing APIs pitch by @Michael_Ilseman.