Add Latin1 to Unicode codecs

taylorswift · June 28, 2020, 8:59pm

The Latin1 encoding seems to be missing from the list of text encodings in the standard library. i need this encoding to support PNG text chunks as defined in the format standard. i thought about extending Swift.Unicode to vend this codec so that text chunks could use an API spelled like

.parse(textChunk, as: Unicode.Latin1.self)

but this would mean that my extension would conflict with any other framework that also vends its own support for Latin1, which given how common a text encoding it is, seems like a real possibility. that tells me that this codec really belongs in the standard library, not in third-party extensions. i could namespace the text encoding to a framework-specific namespace like PNG.Latin1, but that just wouldn’t make any sense given that ASCII lives in Swift.Unicode.

as for the argument that goes “well if we support Latin1 then we have to support every text encoding that exists”, Swift already supports ASCII in the standard library, so there is precedent and justification for privileging the more common text encodings. And there is no reason why additional regional encodings can’t be added later.

benrimmington · June 30, 2020, 12:38am

This was accepted (but not implemented) as part of the SE-0163 proposed solution:

The standard library currently lacks a Latin1 codec, so a enum Latin1: Unicode.Encoding type will be added.

taylorswift · June 30, 2020, 1:04am

that proposal is 3 years old, is there a reason why it hasn’t been implemented?

benrimmington · July 9, 2020, 7:50am

https://github.com/apple/swift/pull/32782

lorentey · July 29, 2020, 2:32am

Also missing are Latin-2 to -16, EBCDIC-1 to -1337 and a million other more-or-less popular encodings. I would argue all of these belong to a nice text transcoding package rather than the Standard Library. Singling out ISO-8859-1 as the single legacy 8-bit encoding that has built-in support feels arbitrary to me.

Instead of selectively adding specific obsolete encodings to the stdlib, I think it would make sense to rather implement these in a standalone package that isn't constrained by the rigidly generic approach of SE-0163.

We've had three years to practice using and implementing the Unicode.Encoding APIs -- in hindsight, they don't seem like the most practical way to implement transcoding. (Beyond the unusual design choice of having to spell out Unicode.UTF8.self, the protocol APIs prescribe implementation details that make it almost impossible to efficiently implement them in practice.)

I think the transcoding APIs introduced in SE-0163 need to be left as is, and we should rather work on their eventual replacement. Adding a dramatically slow Latin-1 transcoder only to deprecate it in a couple years' time seems like a bad idea to me.

jrose · July 29, 2020, 3:11am

Latin-1 is "the" single legacy 8-bit encoding because it's 1-for-1 compatible with Unicode codepoints and UTF-16, but that may not be enough to make it worth putting in the stdlib.
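
To make the 1-for-1 claim concrete, here is a minimal sketch (the sample bytes are made up for illustration, and none of this is a proposed API). Each Latin-1 byte is numerically equal to the code point it denotes, so widening the bytes produces valid Unicode scalars and valid UTF-16 code units:

```swift
// Hypothetical sample: "café" in Latin-1, where é is the single byte 0xE9.
let sampleBytes: [UInt8] = [0x63, 0x61, 0x66, 0xE9]

// Unicode.Scalar has a non-failable initializer from UInt8, because every
// value in 0x00...0xFF is a valid code point.
let sampleScalars: [Unicode.Scalar] = sampleBytes.map { Unicode.Scalar($0) }

// Zero-extending each byte to 16 bits yields the matching UTF-16 code units.
let sampleUTF16: [UInt16] = sampleBytes.map { UInt16($0) }
let viaUTF16 = String(decoding: sampleUTF16, as: UTF16.self)
```

The same trick does not work for UTF-8, where bytes at or above 0x80 must be expanded to two code units.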

lorentey · July 29, 2020, 4:42am

Yeah, but the simplicity of its implementation doesn't seem a particularly convincing reason for adding it at this point.

The need to slap an @available attribute on the encoding type largely negates the convenience of having it in the stdlib -- I don't expect there are that many use cases that would benefit from a very slow Latin-1 transcoder with limited availability.

On the other hand, adding Latin-1 to the rarefied list of stdlib-supported encodings could be easily misread as encouragement for using it.

benrimmington · July 29, 2020, 5:41am

The Swift PNG library could use String(decodingLatin1: textChunk) by adding the following API.

extension String {

  public init<Latin1>(
    decodingLatin1 latin1: Latin1
  ) where Latin1: RandomAccessCollection, Latin1.Element == UInt8 {

    // Compute the *exact* number of UTF-8 code units.
    let utf8Capacity = latin1.reduce(into: 0) { $0 += ($1 < 0x80 ? 1 : 2) }

    // Define a nested function for the SE-0263 and SE-0245 APIs.
    func utf8Initializer(_ utf8: UnsafeMutableBufferPointer<UInt8>) -> Int {
      var utf8Index = utf8.startIndex
      func utf8Append(_ utf8Element: UInt8) {
        utf8[utf8Index] = utf8Element
        utf8.formIndex(after: &utf8Index)
      }

      // Transcode from Latin-1 to UTF-8.
      for latin1Element in latin1 {
        switch latin1Element {
        case 0xC0 ... 0xFF:
          utf8Append(0xC3)
          utf8Append(latin1Element - 0x40)
        case 0x80 ... 0xBF:
          utf8Append(0xC2)
          utf8Append(latin1Element)
        default: // ASCII
          utf8Append(latin1Element)
        }
      }
      return utf8Capacity
    }

    // [SE-0263] Access the string's uninitialized storage buffer.
    #if compiler(>=5.3)
    if #available(macOS 11.0, iOS 14.0, tvOS 14.0, watchOS 7.0, *) {
      self.init(
        unsafeUninitializedCapacity: utf8Capacity,
        initializingUTF8With: utf8Initializer
      )
      return // Prevent fallthrough.
    }
    #endif

    // [SE-0245] Access a temporary array's uninitialized storage buffer.
    let utf8 = [UInt8].init(
      unsafeUninitializedCapacity: utf8Capacity,
      initializingWith: { $1 = utf8Initializer($0) }
    )
    self.init(decoding: utf8, as: Unicode.UTF8.self)
  }
}

michelf · July 29, 2020, 11:28am

That sounds complicated when you can do the same with a one-liner for each direction:

let latin1 = str.unicodeScalars.map { UInt8($0.value) }
let str = String(latin1.map { Character(UnicodeScalar(UInt32($0))!) })

I guess this could be taken as an argument against adding it to the standard library since it's somewhat trivial stuff. On the other hand it'd be hard to justify adding a dependency to a transcoding library for that.

Error handling could be better on the first line though. And I wonder how well the second line gets optimized.
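
One possible way to tighten the error handling on the encoding side, as a sketch only: the helper name `latin1Bytes(from:)` is hypothetical, and returning nil is just one choice (throwing would work as well). It uses `UInt8(exactly:)` so that scalars above U+00FF are reported instead of trapping:

```swift
/// Hypothetical helper: encodes `string` as Latin-1, or returns nil if any
/// scalar lies outside U+0000...U+00FF.
func latin1Bytes(from string: String) -> [UInt8]? {
    var bytes: [UInt8] = []
    for scalar in string.unicodeScalars {
        // UInt8(exactly:) returns nil when the value does not fit in 8 bits.
        guard let byte = UInt8(exactly: scalar.value) else { return nil }
        bytes.append(byte)
    }
    return bytes
}
```

Note that this sketch rejects a decomposed é (e followed by U+0301); handling that would require normalizing the string first.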

jrose · July 29, 2020, 7:02pm

Slightly simpler: let str = String(String.UnicodeScalarView(latin1.map { UnicodeScalar($0) }))

taylorswift · July 29, 2020, 8:55pm

i think this discussion has drifted away from the original issue, which is that all the transcoding APIs in the standard library are spelled with the as: Unicode.{Codec}.self pattern, which means that related framework APIs ought to follow the same naming convention. But Latin1 is such a commonly used standard that I can’t vend Latin1 as an extension to Swift.Unicode without potentially conflicting with another framework the user has imported. Whereas if it’s a more esoteric text encoding i could reasonably vend it as an extension without worrying about namespace conflicts.

for what it’s worth, i don’t think it made sense to namespace any of the standard library text codecs under Swift.Unicode in the first place except for Swift.Unicode.UTF8, 16, 32, but i really doubt changing this would meet the threshold for a source-breaking change, so we have to work with what we have.

code doesn’t exist in a vacuum. no one is choosing a particular text encoding on a whim; people choose text encodings based on what the thing they’re interacting with expects. If the PNG standard says that the text has to be encoded in Latin1, you can’t just turn around and say “well i think UTF8 is the Superior text encoding, so i’m just going to store all this text as UTF8.”

lorentey · July 30, 2020, 1:00am

I think most people just type things like String(decoding: foo, as: UTF8.self), because it works and it's less pointlessly verbose than Unicode.UTF8.self. I know I do!

I sometimes even use Foundation's String(bytes:encoding:), just because it has less awkward syntax.

Unicode.Latin1 makes little sense to me as a name -- Latin-1 is not a Unicode encoding. (Neither is ASCII.) But if for some reason you feel you need a Unicode.Latin1 type, you should feel fully empowered to define internal extensions on the Unicode enum in your libraries!
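
A minimal sketch of what such an internal extension might look like. The nested type and its `decode` function are hypothetical and deliberately do not conform to Unicode.Encoding. Because the declaration is internal, it is not visible to clients, so it neither clashes with another module's extension nor helps with a public API spelled `as: Unicode.Latin1.self`:

```swift
// Internal by default: visible only inside the defining module.
extension Unicode {
    /// Hypothetical namespace; not part of the standard library.
    enum Latin1 {
        /// Decodes Latin-1 bytes by mapping each byte to the scalar of equal value.
        static func decode<Bytes: Sequence>(_ bytes: Bytes) -> String
        where Bytes.Element == UInt8 {
            var scalars = String.UnicodeScalarView()
            for byte in bytes {
                scalars.append(Unicode.Scalar(byte))
            }
            return String(scalars)
        }
    }
}

let internallyDecoded = Unicode.Latin1.decode([0x63, 0x61, 0x66, 0xE9])
```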

Agreed! Which is why nobody has said that in this thread. Why are you bringing this up?

lorentey · July 30, 2020, 3:13am

Purely out of curiosity, why do you need to encode Latin-1 strings into PNGs? Chunk types aren’t strings, keywords, palette and color profile names are encoded in a strict subset of Latin-1, and for regular text there is an iTXt chunk type that uses the One Correct Encoding. I don’t really see how adding a stdlib-provided Unicode.Encoding for Latin-1 would simplify any of this. Do you intend to use the new type solely for decoding tEXt chunks? How can this be worth changing the stdlib? I don’t get it.

taylorswift · July 30, 2020, 3:48am

this discussion is not about library implementation, it’s about the front-facing public API.

You said that adding Unicode.Latin1 to the standard library would encourage people to use Latin1 instead of UTF8, and i said that this has never been my thought process towards anything in the standard library.

I don’t, the library strongly encourages users to use UTF8, so the encoder only emits iTXt chunks. (because the library represents all text chunks with a single type PNG.Text which uses String as its backing storage.) However, the decoder still has to handle the Latin1-type text chunks, and the per-chunk parsing APIs are public because it was a design goal of the library to provide this feature. The library in fact uses a single parser for the tEXt and zTXt chunks because it’s pretty easy to distinguish one from the other from the chunk contents, but the iTXt one is still separate. Right now they are spelled like:

static 
func parse(_ data:[UInt8]) throws -> Self 

static 
func parse(latin1 data:[UInt8]) throws -> Self

but they really “should” look like

static 
func parse<Encoding>(_ data:[UInt8], as _:Encoding.Type) throws -> Self
    where Encoding:PNG.Text.Encoding

since that’s what the standard library looks like. (in fact come to think of it, it really should be an init(parsing:as:).)

It is a relatively small addition with a lot of utility given how common Latin1 is. also, the review for SE-0163 already decided that it was worth changing the stdlib for, so the question is kind of moot at this point.

lorentey · July 30, 2020, 5:32am

taylorswift:

I don’t, the library strongly encourages users to use UTF8, so the encoder only emits iTXt chunks. However, the decoder still has to handle the Latin1-type text chunks, and the per-chunk parsing APIs are public because it was a design goal of the library to provide this feature. The library in fact uses a single parser for the tEXt and zTXt chunks because it’s pretty easy to distinguish one from the other from the chunk contents, but the iTXt one is still separate. Right now they are spelled like:

static 
func parse(_ data:[UInt8]) throws -> Self 

static 
func parse(latin1 data:[UInt8]) throws -> Self

I see. So these take an array that holds the chunk data and return a parsed keyword/text chunk? That seems absolutely reasonable — but I wouldn’t want to go anywhere near Unicode.Encoding if I’d need to implement these functions. (Shouldn’t these take something more flexible than full arrays, though?)

taylorswift:

but they really “should” look like

static 
func parse<Encoding>(_ data:[UInt8], as _:Encoding.Type) throws -> Self
    where Encoding:PNG.Text.Encoding

since that’s what the standard library looks like. (in fact come to think of it, it really should be an init(parsing:as:).)

Personally, I’d strongly advise against this — SE-0163’s transcoding APIs are not by any means an API design success story, and I don’t think they should be emulated. For one thing, PNG’s list of supported encodings is extremely small and it’s never going to be extended: it uses a custom printable subset of Latin-1, the actual Latin-1 and UTF-8 — that’s all.

Modeling these with a protocol pays an abstraction cost that is very unlikely to provide any benefits. These will never be justifiably used in a generic context, and people won’t ever need to implement their own PNG.Text.Encodings. So I’d just use labeled overloads like you already did above; or if I felt extremely ceremonial, I’d list them in an enum. (But even that seems overkill — it’s not like encoding ids need to be serialized or used in dictionary keys or passed around through multiple function layers!)
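
For comparison, a sketch of the enum alternative. `TextPayload` is a hypothetical stand-in, not the library's actual PNG.Text, and only the two payload encodings are modeled (the keyword subset is left out for brevity):

```swift
/// Hypothetical stand-in for the library's text type; not the real PNG.Text.
struct TextPayload {
    /// Closed set of payload encodings: no protocol, no generics.
    enum Encoding {
        case latin1
        case utf8
    }

    var string: String

    init(parsing bytes: [UInt8], as encoding: Encoding) {
        switch encoding {
        case .latin1:
            // Every Latin-1 byte equals its Unicode code point.
            let scalars: [Unicode.Scalar] = bytes.map { Unicode.Scalar($0) }
            string = String(String.UnicodeScalarView(scalars))
        case .utf8:
            // Repairs invalid sequences with U+FFFD; a real parser would
            // more likely report an error here.
            string = String(decoding: bytes, as: UTF8.self)
        }
    }
}

let fromLatin1 = TextPayload(parsing: [0xE9], as: .latin1)
let fromUTF8 = TextPayload(parsing: [0xC3, 0xA9], as: .utf8)
```

The call sites read much like the generic version, but adding a case is a library-internal change and nothing has to conform to anything.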

The actual conversions from/to these three encodings and String are trivial to implement — the most troublesome is the custom Latin-1 subset, and even that can be done with a one-liner. (The biggest difficulty would probably be in the implementation of well-considered error handling. In my experience, the stdlib’s high-level encoder APIs aren’t particularly helpful for that, either.)
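
A sketch of the keyword case, assuming (from my reading of the PNG specification) that the printable subset is byte values 32...126 and 161...255. The function name is hypothetical, and the specification's other keyword rules on length and spacing are not checked here:

```swift
/// Hypothetical helper: decodes a PNG keyword, or returns nil if any byte
/// falls outside the assumed printable ranges 32...126 and 161...255.
func decodeKeyword(_ bytes: [UInt8]) -> String? {
    guard bytes.allSatisfy({ (32...126).contains($0) || $0 >= 161 }) else {
        return nil
    }
    let scalars: [Unicode.Scalar] = bytes.map { Unicode.Scalar($0) }
    return String(String.UnicodeScalarView(scalars))
}
```

Returning nil throws away which byte was invalid, which is where the error-handling design work mentioned above would come in.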

—

As for Latin-1 in general, I do earnestly believe that it has absolutely no place alongside UTF-8/16/32 in the default namespace of every Swift program. It’s not at all in common use today as a text encoding, and it’s neither more nor less important than all the other defunct encodings from the dark ages before Unicode.

At the same time, I also strongly believe that Swift obviously needs to be able to efficiently work with these legacy encodings. In my view, this means the ability to directly decode/encode such data to/from native String values. (E.g., there is probably little reason to implement direct conversion between, say, EBCDIC 254 and Latin-10.) There is no reason these need to be in the stdlib; it would be far better to allow these to evolve in a package, at least for a while. We should not be in a hurry to add more half-baked transcoding APIs to the stdlib; that area is already extensively covered by Unicode.Encoding.

With the recent addition of String’s unsafe-uninitialized UTF-8 initializer, all puzzle pieces are in place to allow the creation of a well-designed and efficient String encoding package.

We should not forget that Foundation also provides a battle-tested transcoding facility. It does support Latin-1.
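
For reference, a short sketch of the Foundation route using String.Encoding.isoLatin1. The sample bytes are made up for illustration:

```swift
import Foundation

// Hypothetical sample: "café" in Latin-1.
let foundationBytes: [UInt8] = [0x63, 0x61, 0x66, 0xE9]

// Decoding: the initializer is failable, so the result is an optional String.
let foundationDecoded = String(bytes: foundationBytes, encoding: .isoLatin1)

// Encoding: returns nil when a character has no Latin-1 representation and
// lossy conversion is not requested (the default).
let foundationEncoded = "caf\u{E9}".data(using: .isoLatin1)
let unencodable = "\u{20AC}".data(using: .isoLatin1)
```

This pulls in Foundation as a dependency, which may or may not be acceptable for a given library.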