High friction for String handling in Swift 4

I find string handling in Swift 4 a bit too weird. I don't want this to be a generic complaint, so I will give two specific examples.

Note that neither of these has anything to do with Index or Substring; it's just about the way Unicode code points are handled.

Concrete example #1

I want to create an attributed string with an icon attachment. To insert an attachment, the string must contain a specific character whose code point is given by NSAttachmentCharacter.

In general you don't have to deal with that character yourself, because the API provides the following convenience initializer:

NSAttributedString(attachment: attachment)
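For context, here is a minimal sketch of that path, assuming UIKit and a hypothetical image asset named "icon" (the attachment constant is reused below):

    import UIKit

    // Wrap an icon image in a text attachment.
    let attachment = NSTextAttachment()
    attachment.image = UIImage(named: "icon")

    // The convenience initializer builds the attachment string for you.
    let iconString = NSAttributedString(attachment: attachment)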

But I want a bit more control; for example, I want to add other attributes alongside the attachment:

    let iconAttrs: [NSAttributedStringKey : Any] = [
        .attachment: attachment,
        my_custom_key: my_custom_value  // stand-ins for a custom attribute
    ]

So I need to create the string manually, but I can't just do this:

NSAttributedString(string: " ", attributes: iconAttrs) // won't work!

The string has to be the attachment character! So how do I put it in there?

I don't know, but the shortest thing I could get to work was this:

String(Character(Unicode.Scalar(NSAttachmentCharacter)!))

I don't know about you, but this seems insane.
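So the full version of example #1, as a sketch combining the pieces above, ends up as:

    // Build the one-character attachment string, then apply the attributes.
    let attachmentString = String(Character(Unicode.Scalar(NSAttachmentCharacter)!))
    let attributedIcon = NSAttributedString(string: attachmentString, attributes: iconAttrs)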

Concrete example #2

Given a Character, how do we get its Unicode value?

This is what I managed to come up with:

func charCode(char a: Character) -> Int {
    // Take the first scalar; a Character may actually contain several.
    return Int(a.unicodeScalars.first!.value)
}
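A quick sanity check that it returns the raw code point:

    print(charCode(char: "あ"))   // 12354, i.e. U+3042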

Why is Character not mapped directly to a Unicode code point?

Why do I care about this? Because I want to convert between "cases" of non-Latin text (specifically, convert Japanese text from Hiragana <-> Katakana back and forth).

I need to take the Unicode code point for a character, check whether it's within a certain Unicode block, and if so, convert it to the other block by adding the "offset" between the blocks.

But somehow there's too much ceremony to do what should otherwise be very simple.

For example, once I have the code point that I want to offset, how do I add the offset to it?

This is what I managed to come up with:

func scalar(_ a: UnicodeScalar, add b: Int) -> UnicodeScalar {
    // Go through Int so the offset can be negative.
    let newCode = Int(a.value) + b
    return UnicodeScalar(UInt32(newCode))!
}
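For what it's worth, the offset between the Katakana and Hiragana blocks is 0x60, so:

    // ア is U+30A2; shifting down by 0x60 gives あ (U+3042).
    print(scalar("ア", add: -0x60))   // あ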

This function should not even exist. I'm just adding two numbers together.

Why isn't UnicodeScalar just a number?

Now, part of the boilerplate here is that I'm converting the UInt32 to an Int, but that's because the thing I want to add is an offset between two Unicode blocks, and it could be a negative number.

So, putting it all together, my solution looks like this:

let hiraganaMinusKatakana = charCode(char: "あ") - charCode(char: "ア")

func katakanaToHiragana(_ c: UnicodeScalar) -> UnicodeScalar {
    return scalar(c, add: hiraganaMinusKatakana)
}

func hiraganaToKatakana(_ c: UnicodeScalar) -> UnicodeScalar {
    return scalar(c, add: -hiraganaMinusKatakana)
}

// Katakana letters with Hiragana counterparts, U+30A1 (ァ) through U+30F6 (ヶ).
// (An assumed definition; it wasn't shown in the original post.)
let KatakanaBlock: ClosedRange<Unicode.Scalar> = "ァ"..."ヶ"

func normalize_to_hiragana(_ input: String) -> String {
    var out = "".unicodeScalars
    for c in input.unicodeScalars {
        if KatakanaBlock.contains(c) {
            out.append(katakanaToHiragana(c))
        } else {
            out.append(c)
        }
    }
    return String(out)
}
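With KatakanaBlock defined as above, mixed text converts as expected:

    print(normalize_to_hiragana("カタカナ and ひらがな"))   // かたかな and ひらがな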

I'm probably doing some wasteful temporary allocations here somewhere.

The API is too weird, and not because it's trying to help you write performant code; it's just weird. So when you do something like this, the most obvious solution is probably not the most performant one.

Yeah, the framework team at Apple could probably provide a more Swift-friendly interface for NSAttachmentCharacter to make it easier to use correctly here. If you have a moment, you could file a bug at bugreport.apple.com to ask them to improve the interface. Your conversion chain can at least be reduced to String(Unicode.Scalar(NSAttachmentCharacter)!).
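Applied to the original example, that reduced chain would look something like this (a sketch reusing your iconAttrs):

    // Unicode.Scalar converts straight to String; no Character wrapper needed.
    let attachmentString = String(Unicode.Scalar(NSAttachmentCharacter)!)
    let attributed = NSAttributedString(string: attachmentString, attributes: iconAttrs)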

Swift's Character represents an extended grapheme cluster; a "character" as a user perceives it does not always map to a single Unicode scalar in the general case. If you want to work at the level of Unicode scalars, you can use Unicode.Scalar directly in most of the same situations you can use Character; for instance, Unicode.Scalar is itself usable with string literals. Instead of using your charCode function to extract a scalar from a composed Character, you can write ("あ" as Unicode.Scalar).value to get the integer value directly:

let firstHiragana: Unicode.Scalar = "あ"
let firstKatakana: Unicode.Scalar = "ア"

func katakanaToHiragana(_ c: UnicodeScalar) -> Unicode.Scalar {
  return Unicode.Scalar(c.value + firstHiragana.value - firstKatakana.value)!
}
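For instance:

    print(katakanaToHiragana("ア"))   // あ (U+3042)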

Why do I care about this? Because I want to convert between "cases" of non-Latin text (specifically, convert Japanese text from Hiragana <-> Katakana back and forth).

While this doesn’t address your general concerns, you can achieve this specific goal using the .hiraganaToKatakana string transform. For example:

let hiragana = "\u{3041}"   // U+3041 HIRAGANA LETTER SMALL A
let katakana = hiragana.applyingTransform(.hiraganaToKatakana, reverse: false)!
let hiragana2 = katakana.applyingTransform(.hiraganaToKatakana, reverse: true)!
print(hiragana)             // -> U+3041 HIRAGANA LETTER SMALL A
print(katakana)             // -> U+30A1 KATAKANA LETTER SMALL A
print(hiragana2)            // -> U+3041 HIRAGANA LETTER SMALL A
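This also means a whole string can be converted in one call, so something like the earlier normalize_to_hiragana function could collapse to a one-liner. A sketch, assuming text outside the transform should pass through unchanged:

    import Foundation

    func normalizeToHiragana(_ input: String) -> String {
        // reverse: true runs the Katakana -> Hiragana direction;
        // characters the transform doesn't cover pass through untouched.
        return input.applyingTransform(.hiraganaToKatakana, reverse: true) ?? input
    }

    print(normalizeToHiragana("カタカナ and ひらがな"))   // かたかな and ひらがな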

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware
let myEmail = "eskimo" + "1" + "@apple.com"


It's perhaps not quite as readable, but this works:

// U+FFFC is the "OBJECT REPLACEMENT CHARACTER", represented
// in Cocoa as 'NSAttachmentCharacter'
NSAttributedString(string: "\u{fffc}", attributes: iconAttrs)