Best way to append a UnicodeScalar to a String

Is there a better way to do this?

let aaaa = "a"  // want to append a UnicodeScalar to this string
let addThisScalar = UnicodeScalar(0x33)!
let bbbb = String(String.UnicodeScalarView([aaaa.unicodeScalars.first!, addThisScalar]))
print(bbbb) // "a3"

If the source string is immutable and you'd like to create a new string from both elements, you can create a String directly out of the scalar and use String.appending(_:) to concatenate them:

let aaaa = "a"
let addThisScalar = UnicodeScalar(0x33)!
let bbbb = aaaa.appending(String(addThisScalar))

If the source string is mutable and you'd like to append the scalar in-place, you can also create a Character out of the scalar and call String.append(_:) directly:

var aaaa = "a"
let addThisScalar = UnicodeScalar(0x33)!
aaaa.append(Character(addThisScalar))

Whether either of these is "better" depends on your definition of "better", but they might be a little easier to read and comprehend.


These work identically and as expected with things like combining characters:

let scalar = UnicodeScalar("\u{0301}")
var string = "Cafe"
print(string) // => Cafe

print(string.appending(String(scalar))) // => Café

string.append(Character(scalar))
print(string) // => Café

If the source string is mutable, you can do:

var x = ""
let scalar: Unicode.Scalar = "a"
x.unicodeScalars.append(scalar)

May I ask why you want to do this? Either you know the scalar in question is not combining (in which case it might as well be a Character or a String), or it might be combining (in which case appending it to a String doesn't seem like a good idea!).

The one use case I can think of is an input method (keyboard or character viewer).

When dealing with bidirectional and Arabic-script text, I have run into use cases where I needed to append or prepend direction or joiner code points to a string.
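For illustration, here is a minimal sketch of that kind of wrapping using only the standard library. The helper name `wrapped(_:prefix:suffix:)` is hypothetical, and RIGHT-TO-LEFT MARK (U+200F) and ZERO WIDTH JOINER (U+200D) are just example code points; which ones you actually need depends on the text.

```swift
// Example code points only; pick whichever direction/joiner scalars your text needs.
let rightToLeftMark: Unicode.Scalar = "\u{200F}"
let zeroWidthJoiner: Unicode.Scalar = "\u{200D}"

// Hypothetical helper: returns a copy of `text` with one scalar
// inserted at the front and one appended at the end.
func wrapped(_ text: String, prefix: Unicode.Scalar, suffix: Unicode.Scalar) -> String {
    var result = text
    // String.UnicodeScalarView is a RangeReplaceableCollection,
    // so both insert(_:at:) and append(_:) are available.
    result.unicodeScalars.insert(prefix, at: result.unicodeScalars.startIndex)
    result.unicodeScalars.append(suffix)
    return result
}

let marked = wrapped("abc", prefix: rightToLeftMark, suffix: zeroWidthJoiner)
print(marked.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) })
// ["200F", "61", "62", "63", "200D"]
```

Working on the scalar view avoids having to decide whether the extra code point forms its own Character or merges into a neighbouring grapheme cluster.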

I am dealing with emoji. The scalar sequences from Unicode.org (https://unicode.org/Public/emoji/14.0/emoji-test.txt) are sometimes not the same as what the Apple emoji keyboard produces. For example:

1F21A ; fully-qualified # 🈚 E0.6 Japanese “free of charge” button

U-1F21A is already "fully qualified", but from the Emoji keyboard, this emoji is U-1F21A U-FE0E (an extra qualifier is added)

I need to look up emoji by either form; both are the same emoji even though the scalar sequences differ. So I add a "synonym" by appending the scalar U-FE0E.

I don't know why the Apple emoji keyboard adds U-FE0E when it's not necessary, since isEmojiPresentation is true for those scalars.

Edit: U-FE0E should be U-FE0F
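A rough sketch of the "synonym" idea, using the corrected selector U+FE0F. The function name `synonyms(for:)` is hypothetical, and this only handles a single trailing selector; ZWJ sequences and other multi-scalar emoji would need more care.

```swift
// Hypothetical helper: returns the emoji as given, plus a variant with
// VARIATION SELECTOR-16 (U+FE0F) appended if it is not already there.
func synonyms(for emoji: String) -> Set<String> {
    let emojiSelector: Unicode.Scalar = "\u{FE0F}"
    var forms: Set<String> = [emoji]
    if emoji.unicodeScalars.last != emojiSelector {
        var withSelector = emoji
        withSelector.unicodeScalars.append(emojiSelector)
        forms.insert(withSelector)
    }
    return forms
}

let forms = synonyms(for: "\u{1F21A}")
for form in forms {
    // Each form is still a single Character; only the scalars differ.
    print(form.count, form.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) })
}
```

The two forms are not canonically equivalent, so String's == treats them as different values. That is why both have to be registered as lookup keys.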


This is a good use case, thank you!

Note: As of Swift 5.6, this may unnecessarily trigger a CoW copy. The fix for this bug is currently on track for 5.7.


+1 to Jordan's comment: this is a reasonable use case.


This can be because U+1F21A 🈚 has emoji presentation by default, and U+FE0E VARIATION SELECTOR-15 is requesting the non-emoji presentation of the character: U+1F21A 🈚 + U+FE0E VARIATION SELECTOR-15 = 🈚︎. It is a bit surprising that entry directly from the emoji keyboard would request non-emoji presentation explicitly, but the specifics are dependent on where the text is being entered into, and some contextual detail.

(Unless U+FE0E is a typo and you meant U+FE0F, which would be requesting explicit emoji presentation, which would be okay too; see below.)

Aside about qualification

To be a bit more precise about qualification, too: a "fully-qualified" emoji character is an emoji sequence made up of 1 or more qualified emoji characters. UTS#51 gives a definition of what that means:

An emoji character in a string that (a) has default emoji presentation or (b) is the first character in an emoji modifier sequence or (c) is not a default emoji presentation character, but is the first character in an emoji presentation sequence.

U+1F21A 🈚 is fully-qualified because it is a single qualified emoji character, and is qualified because it has default emoji presentation. In other words, it doesn't require a variant selector in order to be fully-qualified, because it already has emoji presentation.

This is opposed to a character like U+261D ☝ which does not have default emoji presentation, and requires an explicit variation selector (U+261D ☝ + U+FE0F VARIATION SELECTOR-16 = 261D FE0F☝️) or skin tone modifier (U+261D ☝ + U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 🏾 = 261D 1F3FE ☝🏾) to become fully-qualified.

So, an emoji character might not require a variation selector, but variation selectors after an already-fully-qualified emoji without an emoji selector are fully valid, even if redundant. Producing an explicit variation selector can help keep text consistent if the default ever changes. In the case of the above, U+1F21A + U+FE0F VARIATION SELECTOR-16 would also be acceptable: although U+1F21A already has default emoji presentation, requesting a forced emoji presentation is fine, because it's just redundant. If the default presentation for this character ever changed, the currently-redundant variation selector would actually help keep the presentation consistent.
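As a sketch of how condition (a) versus (c) can be checked from Swift: `Unicode.Scalar.Properties` exposes `isEmoji` and `isEmojiPresentation`. The helper name `needsEmojiSelector(_:)` is hypothetical, the results depend on the Unicode data available to the runtime, and the check deliberately ignores emoji modifier sequences (condition (b)).

```swift
// Hypothetical helper: true when a scalar is an emoji character that does
// NOT have default emoji presentation, i.e. it would need U+FE0F
// (or an emoji modifier) after it to become fully-qualified.
func needsEmojiSelector(_ scalar: Unicode.Scalar) -> Bool {
    scalar.properties.isEmoji && !scalar.properties.isEmojiPresentation
}

let squaredFree: Unicode.Scalar = "\u{1F21A}"   // has default emoji presentation
let indexPointingUp: Unicode.Scalar = "\u{261D}" // has default text presentation

print(needsEmojiSelector(squaredFree))      // false
print(needsEmojiSelector(indexPointingUp))  // true
```

Note that `isEmoji` is also true for characters such as ASCII digits, so a real lookup would probably want an additional filter before treating every such scalar as an emoji candidate.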


Yes, it should be U-FE0F.

I found an odd case in the Unicode.org data file:

1F6F3 FE0F ; fully-qualified # 🛳️ E0.7 passenger ship
1F6F3 ; unqualified # 🛳 E0.7 passenger ship

The emoji keyboard produces U-1F6F3, but isEmojiPresentation is false for this scalar.

So I guess even when isEmojiPresentation is false, it can still default to emoji presentation.

It seems the emoji keyboard is not consistent: it should either always add U-FE0F or only add when needed.
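For comparing the data file against what the keyboard produces, lines like the two quoted above can be turned into strings with something like the following. This is only a sketch: `parseEmojiTestLine(_:)` is a hypothetical name, and it assumes the "code points ; status # comment" layout seen in those lines.

```swift
// Hypothetical parser for one emoji-test.txt data line.
// Returns nil for comment lines, blank lines, or anything it cannot read.
func parseEmojiTestLine(_ line: String) -> (emoji: String, status: String)? {
    // Keep only the part before the first "#" (the trailing comment).
    let data = line.split(separator: "#", maxSplits: 1, omittingEmptySubsequences: false)[0]
    // The remainder should look like "1F6F3 FE0F ; fully-qualified".
    let fields = data.split(separator: ";")
    guard fields.count == 2 else { return nil }

    var emoji = ""
    for hex in fields[0].split(separator: " ") {
        guard let value = UInt32(hex, radix: 16),
              let scalar = Unicode.Scalar(value) else { return nil }
        emoji.unicodeScalars.append(scalar)
    }
    guard !emoji.isEmpty else { return nil }

    // Trim the spaces around the status field.
    let status = fields[1].split(separator: " ").joined(separator: " ")
    return (emoji, status)
}

if let entry = parseEmojiTestLine("1F6F3 FE0F ; fully-qualified # E0.7 passenger ship") {
    print(entry.status, entry.emoji.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) })
    // fully-qualified ["1F6F3", "FE0F"]
}
```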


Yeah, without knowing more about the implementation of emoji in the keyboard, it's tough to know why this is the case; it could be a bug, or intentional for some reason. We can only guess.

Yes, this is always the case. UTS#51 Section 4 Presentation Style goes into a lot more detail on presentation, but to quote:

presentation style is never guaranteed

Unicode presentation is effectively a request to display characters in a certain way, but that request may be ignored, or it might be impossible to honor. For instance, although U+1F6F3 PASSENGER SHIP defaults to a text presentation, in order to display as such, some font in use on your system needs to be able to render the character as text. On my system running macOS Monterey 12.3.1, there is no font which has a textual glyph for this character... But Apple Color Emoji does have a glyph, so the system falls back to that representation:

(Note how Font Variation only shows the Apple Color Emoji font with a glyph.)

This is opposed to a character like U+2602 UMBRELLA ☂, which is also unqualified on its own without a variation selector, but for which my system does have various glyphs:

So even with and without explicit variation selectors, you are at the whim of the rendering system in use, and the fonts it has available at the time of rendering. These Unicode properties are effectively advisories on how to treat the text, but they may also be ignored depending on context, or even impossible to fulfill.

Edit: just for comparison's sake, on my Windows 10 machine, I do see a difference between the two representations for this character:

(I really scaled up the font size so if you open the image at native size you'll see a lot more detail on the textual ship!)


So PASSENGER SHIP can have text presentation, but it's shown as emoji because no system font has a textual glyph for it.

For UMBRELLA, the keyboard produces U-2602 U-FE0F, so it's fully qualified for emoji presentation.

It seems PASSENGER SHIP is like UMBRELLA, so the keyboard should produce the fully qualified U-1F6F3 U-FE0F to request emoji presentation. I think this is a bug in the keyboard, and I have to account for this "bug" in my code by treating U-1F6F3 (and the like) alone as an emoji.
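One way to account for this, sketched under the assumption that the lookup never needs to distinguish forms that differ only by U+FE0F: build the lookup key from the scalars with every U+FE0F dropped. The name `lookupKey(_:)` is hypothetical.

```swift
let emojiSelector: Unicode.Scalar = "\u{FE0F}"

// Hypothetical helper: the scalar values of `text` with every U+FE0F removed,
// so "1F6F3" and "1F6F3 FE0F" map to the same key.
func lookupKey(_ text: String) -> [UInt32] {
    text.unicodeScalars
        .filter { $0 != emojiSelector }
        .map { $0.value }
}

// Table keyed by the selector-free form.
let names: [[UInt32]: String] = [
    lookupKey("\u{1F6F3}\u{FE0F}"): "passenger ship",
    lookupKey("\u{2602}\u{FE0F}"): "umbrella",
]

print(names[lookupKey("\u{1F6F3}")] ?? "not found")          // passenger ship
print(names[lookupKey("\u{1F6F3}\u{FE0F}")] ?? "not found")  // passenger ship
```

This treats the bare scalar and the fully-qualified sequence as the same entry, at the cost of no longer telling a text-presentation request apart from an emoji one.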


Indeed.

It should, yes. Likely worth reporting as Feedback.

Aside

This is pure conjecture on my part, but I can also imagine that what could be happening here is that the keyboard process [either statically by assuming a known list of fonts, or dynamically by checking all loaded fonts] could be making the decision to not output U+FE0F based on the fact that there's no glyph for the textual presentation, so it doesn't bother outputting U+FE0F. As opposed to U+1F21A and U+2602, which do have other possible presentations. If this is the case, it's not a good assumption to make, and it's not really interoperable with other OSes. But it could also purely be a bug.


I filed a feedback: FB9987401
