The more direct translation of Unicode's maximal subpart of an ill-formed subsequence would be something like:
(codeUnitOffset: Int, length: Int)
Where length could be 1, 2, or 3. Range<Int> is a more Swift-canonical representation and I'm open to either formulation.
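To make the comparison concrete, here is a rough sketch of how the two formulations carry the same information. The type name `MaximalSubpart` and both initializers are made up for illustration; only `codeUnitOffset`, `length`, and `codeUnitOffsets` come from the discussion above.

```swift
// Hypothetical type for illustration only; not the proposal's API.
struct MaximalSubpart {
    var codeUnitOffset: Int
    var length: Int  // 1, 2, or 3 for UTF-8

    // Tuple-style formulation: (codeUnitOffset: Int, length: Int)
    init(codeUnitOffset: Int, length: Int) {
        self.codeUnitOffset = codeUnitOffset
        self.length = length
    }

    // Range-style formulation: a half-open range of code unit offsets
    init(_ codeUnitOffsets: Range<Int>) {
        self.codeUnitOffset = codeUnitOffsets.lowerBound
        self.length = codeUnitOffsets.count
    }

    // The same information expressed as Range<Int>
    var codeUnitOffsets: Range<Int> {
        codeUnitOffset ..< (codeUnitOffset + length)
    }
}

let subpart = MaximalSubpart(codeUnitOffset: 4, length: 2)
let asRange = subpart.codeUnitOffsets          // 4..<6
let roundTripped = MaximalSubpart(asRange)     // offset 4, length 2
```

Either form converts losslessly to the other, so the choice is mostly about which reads better at the use site.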
If we stick with Range<Int>, then I agree it should be codeUnitOffsets or byteOffsets. After a little bit more discussion, I'll update the proposal and ask the LSG to pick a name if we don't settle on one ourselves.
What about func skip(by nScalars: Int) and func skip(by nCharacters: Int)?
I had picked n as that's used elsewhere in the stdlib. In those uses, there's only one Element type in context, but here there's also the concept of code unit offsets.
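As a sketch of why the unit matters here: skipping n scalars advances the code unit offset by a different amount. The `ScalarCursor` type below is hypothetical and assumes its input is already valid UTF-8; it is not the proposed iterator, just an illustration of the two units side by side.

```swift
// Hypothetical cursor for illustration; assumes `bytes` is valid UTF-8.
struct ScalarCursor {
    let bytes: [UInt8]
    var codeUnitOffset = 0

    init(_ bytes: [UInt8]) { self.bytes = bytes }

    // Advances past up to `nScalars` scalars and returns how many
    // were actually skipped (fewer if the end of input is reached).
    @discardableResult
    mutating func skip(by nScalars: Int) -> Int {
        var skipped = 0
        while skipped < nScalars, codeUnitOffset < bytes.count {
            // Consume the leading byte of the scalar.
            codeUnitOffset += 1
            // Consume any continuation bytes (0b10xxxxxx).
            while codeUnitOffset < bytes.count,
                  bytes[codeUnitOffset] & 0xC0 == 0x80 {
                codeUnitOffset += 1
            }
            skipped += 1
        }
        return skipped
    }
}

// "a" is 1 code unit, U+00E9 is 2 code units, U+20AC is 3 code units.
var cursor = ScalarCursor(Array("a\u{E9}\u{20AC}".utf8))
let firstSkip = cursor.skip(by: 2)
let offsetAfterFirstSkip = cursor.codeUnitOffset
let secondSkip = cursor.skip(by: 5)
let offsetAfterSecondSkip = cursor.codeUnitOffset
```

Skipping two scalars moves the offset by three code units, which is the ambiguity a label like nScalars is meant to remove.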
Hmm, the analogy with Codable is a bit unfortunate, but it is an error in how the content is encoded rather than in the encoding process. It could also be named UTF8.InvalidEncoding or similar. Thoughts?
Those, like Codable's, are a bit more operational in nature. The APIs take a single byte at a time and have an emptyInput case instead of reporting a truncated scalar at the known end of the content.
As to the discussion with @Karl and @scanon, handling unfinished input is better served by a more dedicated API than by catching this error and diagnosing whether it was due to insufficient buffering.
Not a bad idea to precondition on these. The equivalent (codeUnitOffset: Int, length: Int) would have similar checks, and we could have an unchecked equivalent to Range(uncheckedBounds:).
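Roughly, the checked and unchecked pair could look like the sketch below. The type name `EncodingErrorRange`, the `unchecked:` label, and the exact conditions are assumptions for illustration; only `Range(uncheckedBounds:)` is existing stdlib API.

```swift
// Hypothetical type for illustration only; not the proposal's API.
struct EncodingErrorRange {
    let codeUnitOffsets: Range<Int>

    // Checked: traps on values that cannot describe a UTF-8 error.
    init(_ codeUnitOffsets: Range<Int>) {
        precondition(codeUnitOffsets.lowerBound >= 0,
                     "offset must be non-negative")
        precondition((1...3).contains(codeUnitOffsets.count),
                     "length must be 1, 2, or 3")
        self.codeUnitOffsets = codeUnitOffsets
    }

    // Unchecked: the caller guarantees the conditions above,
    // in the spirit of Range(uncheckedBounds:).
    init(unchecked codeUnitOffsets: Range<Int>) {
        self.codeUnitOffsets = codeUnitOffsets
    }
}

let checked = EncodingErrorRange(2..<5)
let unchecked = EncodingErrorRange(
    unchecked: Range(uncheckedBounds: (lower: 0, upper: 1)))
```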
I guess a question is whether we want the raw value to be public. I thought it might be helpful for tooling, such as logging errors, etc., which is a beneficiary of error classification and diagnosis.
If we want it public, then I agree with making it failable, as that's supported by RawRepresentable.
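A minimal sketch of what a public, failable raw-value initializer might look like. The type name `ErrorKind`, the case names, and the raw values are placeholders chosen for illustration, not the proposal's definitions.

```swift
// Hypothetical classification type; names and raw values are placeholders.
struct ErrorKind: RawRepresentable, Hashable {
    let rawValue: UInt8

    // RawRepresentable's requirement is failable, so unknown raw
    // values can be rejected instead of trapping.
    init?(rawValue: UInt8) {
        guard rawValue <= 2 else { return nil }
        self.rawValue = rawValue
    }

    static let unexpectedContinuationByte = ErrorKind(rawValue: 0)!
    static let surrogateCodePointByte = ErrorKind(rawValue: 1)!
    static let truncatedScalar = ErrorKind(rawValue: 2)!
}

// Tooling such as a logger could round-trip through the raw value.
let logged = ErrorKind.truncatedScalar.rawValue
let restored = ErrorKind(rawValue: logged)
let unknown = ErrorKind(rawValue: 200)
```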
Hmm, it seems like the proposed classification provides a little more granularity by answering "what kind of invalid start byte" without needing the context of the rest of the input. The only case where Unicode's maximal subpart of an ill-formed subsequence (we need an acronym or something...) cares about context is truncated scalars, which is their special case (e.g. error length >=1).