Int32 initialiser for Unicode.Scalar

Some UNIX functions such as getchar() return an Int32, but Unicode.Scalar only accepts UInt32 in its init. Would it make sense to add a failable init that accepts Int32 values? Or should it be the programmer's job to do the Int32 -> UInt32 conversion themselves? The code seems more verbose than it needs to be when you have to check this yourself instead of having the nice optionality of a failable init. What's your opinion about this?
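For reference, a minimal sketch of the manual conversion in question. The helper name scalar(fromCInt:) is hypothetical and not part of the standard library; it only combines two existing failable initializers.

```swift
// Hypothetical helper for illustration; not a standard library API.
func scalar(fromCInt value: Int32) -> Unicode.Scalar? {
    // UInt32(exactly:) returns nil for negative values such as EOF (-1).
    guard let bits = UInt32(exactly: value) else { return nil }
    // Unicode.Scalar(UInt32) is itself failable: it returns nil for
    // surrogate code points and for values above 0x10FFFF.
    return Unicode.Scalar(bits)
}
```

Note that this treats the Int32 as a complete scalar value, which is only true of getchar() output for 7-bit ASCII input.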

Thank you

I don't think Unicode.Scalar(getchar()) would be useful, because it doesn't handle UTF-8 multi-byte sequences or EOF (end-of-file) conditions.

What would be a better way to convert a value from getchar() to a Swift type?
I thought EOF would have been handled by returning nil from the init?
About the UTF-8 multi-byte sequences, what would be an example where there would be an issue? Are combining characters an example of a multi-byte sequence? I thought that a scalar had the same "granularity" as the characters returned by getchar()?

getchar returns UTF-8 code units, not scalars or characters, so they're only equivalent if you're dealing with 7-bit ASCII.

For example, if you enter a, then getchar() returns 0x61 like you'd expect. That's fine.

But if you enter á, then you have to call getchar() twice to get the full scalar. á is represented in UTF-8 by two code units 0xC3 0xA1, so the first call to getchar() returns 0xC3 and the second returns 0xA1. So just calling it once and passing that to the UnicodeScalar initializer will give you the wrong result.
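The mismatch can be checked without any I/O. This is a small sketch; the variable names are only for illustration.

```swift
// U+00E1 (á) encoded as UTF-8 is two code units.
let codeUnits = Array("\u{E1}".utf8)
// codeUnits is [0xC3, 0xA1]

// Treating the first code unit as a whole scalar yields U+00C3 (Ã), not á.
let misread = Unicode.Scalar(codeUnits[0])
print(misread)  // prints "Ã"
```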

To use getchar() properly with Swift, depending on what your use case is, you'll need to do something like collect code units into an array and use one of the String methods that can decode a UTF-8 string, or possibly something a little more on-demand like the Unicode.UTF8 codec.
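A minimal sketch of the "collect code units into an array" approach, assuming the input is UTF-8. The function name decodeUTF8(from:) and its closure parameter are hypothetical; the closure stands in for getchar() so the logic can be exercised without a terminal.

```swift
// Hypothetical helper: pulls Int32 values from a getchar()-like source
// until it sees a value that is not a valid byte (for example EOF, -1),
// then decodes everything collected as UTF-8.
func decodeUTF8(from nextByte: () -> Int32) -> String {
    var units: [UInt8] = []
    while true {
        let c = nextByte()
        // UInt8(exactly:) is nil for negative values and values above 255.
        guard let unit = UInt8(exactly: c) else { break }
        units.append(unit)
    }
    // Invalid sequences are repaired with U+FFFD rather than rejected.
    return String(decoding: units, as: UTF8.self)
}

// With Foundation (or Glibc/Darwin) imported, usage would look like:
//     let input = decodeUTF8(from: { getchar() })
```

String(decoding:as:) never fails, so if malformed input should be rejected instead of repaired, a validating decoder would be needed.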


Thank you! :) I hadn't understood that something like á would be a single Unicode.Scalar!

I should mention that "á" may not have been the best example because, depending on Unicode normalization, the character could be represented as one or more scalars:

  • U+00E1 LATIN SMALL LETTER A WITH ACUTE, or
  • U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT
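The two representations listed above can be inspected directly in Swift. This is a small sketch with illustrative variable names:

```swift
let precomposed = "\u{E1}"        // U+00E1
let decomposed = "a\u{301}"       // U+0061 followed by U+0301

// One scalar versus two scalars.
let precomposedScalars = precomposed.unicodeScalars.count  // 1
let decomposedScalars = decomposed.unicodeScalars.count    // 2

// UTF-8 code units for each form.
let precomposedUnits = Array(precomposed.utf8)  // [0xC3, 0xA1]
let decomposedUnits = Array(decomposed.utf8)    // [0x61, 0xCC, 0x81]

// String comparison uses canonical equivalence, so both forms compare
// equal and each is a single Character.
let sameString = (precomposed == decomposed)    // true
```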

But in either case, there's a scalar present that would be represented by more than one UTF-8 code unit, so the general idea about needing to call getchar() and decode until you have a full scalar still holds.
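A sketch of the on-demand variant using the standard library's Unicode.UTF8 codec, which pulls as many code units as it needs to produce each scalar. ByteSource and nextScalar(decoder:source:) are hypothetical names; the closure stands in for getchar(), and substituting U+FFFD on malformed input is one possible policy, not the only one.

```swift
// Hypothetical adapter: exposes a getchar()-like function as a byte iterator.
struct ByteSource: IteratorProtocol {
    let nextByte: () -> Int32
    mutating func next() -> UInt8? {
        // Any value that is not a valid byte (such as EOF, -1) ends the stream.
        return UInt8(exactly: nextByte())
    }
}

// Decodes exactly one scalar, or returns nil once the input is exhausted.
// The same decoder must be reused across calls because it may buffer
// code units it has already read.
func nextScalar(decoder: inout UTF8, source: inout ByteSource) -> Unicode.Scalar? {
    let replacement: Unicode.Scalar = "\u{FFFD}"
    switch decoder.decode(&source) {
    case .scalarValue(let scalar):
        return scalar
    case .emptyInput:
        return nil
    case .error:
        return replacement  // assumption: substitute rather than abort
    }
}

// With Foundation (or Glibc/Darwin) imported, usage would look like:
//     var decoder = UTF8()
//     var source = ByteSource(nextByte: { getchar() })
//     while let scalar = nextScalar(decoder: &decoder, source: &source) { ... }
```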
