robnik
July 12, 2018, 3:45pm
1
It looks like String's padding method is broken. Its docs say it will pad to characters, not Unicode scalars.
Here's a simple example. This string uses a combining character.
var str = "аа́а"
str.count ==> 3
str.unicodeScalars.count ==> 4
str.padding(toLength: 3, withPad: " ", startingAt: 0) ==> "aá"
Rob
lancep
(Lance Parker)
July 12, 2018, 4:35pm
2
The padding method is actually a method provided by NSString, so you'll get this behavior in Objective-C as well.
It also depends on how your string is formatted:
func printStuff(str: String) {
    let nsstr = NSString(string: str)
    print(str.count, nsstr.length, str.unicodeScalars.count)
    print(str.padding(toLength: 3, withPad: " ", startingAt: 0))
}
let s1 = "a\u{e1}a"
let s2 = "aa\u{301}a"
printStuff(str: s1)
printStuff(str: s2)
Prints:
3 3 3
aáa
3 4 4
aá
jrose
(Jordan Rose)
July 12, 2018, 5:02pm
3
Can you file a documentation bug with Apple (https://bugreport.apple.com) to refer to Unicode scalars rather than characters?
robnik
July 12, 2018, 5:14pm
4
Sure, I can do that. Working by character rather than code point seems more useful and more consistent with Swift strings, but I guess if it's an old Objective-C method, the behavior can't be changed now. Fortunately this is simple to implement.
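For illustration, here is a minimal sketch of a Character-based version. The name `paddingCharacters(toLength:withPad:)` is hypothetical and not part of Foundation, and the sketch leaves out the `startingAt` parameter of the original method. It assumes the pad is an ordinary character that does not combine with the preceding one.

```swift
extension String {
    /// Hypothetical helper (not a Foundation API): pads or truncates the
    /// string to `length` Characters (extended grapheme clusters).
    /// Assumes `pad` does not combine with the preceding character.
    func paddingCharacters(toLength length: Int, withPad pad: Character = " ") -> String {
        precondition(length >= 0, "length must be non-negative")
        let current = count
        if current >= length {
            // Truncate on Character boundaries so combining marks stay attached.
            return String(prefix(length))
        }
        // Append enough pad characters to reach the requested length.
        return self + String(repeating: pad, count: length - current)
    }
}

let sample = "aa\u{301}a"                              // 3 Characters, 4 Unicode scalars
print(sample.paddingCharacters(toLength: 3))           // keeps all three Characters
print(sample.paddingCharacters(toLength: 5).count)     // 5
```

Counting Characters makes this O(n) in the length of the string, which seems an acceptable trade for not splitting grapheme clusters.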
jrose
(Jordan Rose)
July 12, 2018, 5:15pm
5
Oof, now I wonder if it really is Unicode scalars or if it's UTF-16 code units.
robnik
July 12, 2018, 5:27pm
6
I think it's UTF-16. I tried some emoji strings that have different numbers for unicodeScalars.count and utf16.count, and then padding(...) destroys the emojis.
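As a small sketch of such a string, using only the standard library: a flag emoji is one Character built from two Unicode scalars, each of which needs a surrogate pair in UTF-16, so all three counts differ. The variable name `flag` is just for illustration.

```swift
// U+1F1FA + U+1F1F8: two regional-indicator scalars forming one flag emoji.
let flag = "\u{1F1FA}\u{1F1F8}"
print(flag.count)                 // 1 (Characters)
print(flag.unicodeScalars.count)  // 2 (Unicode scalars)
print(flag.utf16.count)           // 4 (UTF-16 code units)
```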
Martin
(Martin R)
July 12, 2018, 5:28pm
7
It is UTF-16 code units (i.e. the unichars of NSString):
let s = "🏁"
print(Array(s.unicodeScalars)) // ["\u{0001F3C1}"]
print(Array(s.utf16)) // [55356, 57281]
let p = s.padding(toLength: 3, withPad: " ", startingAt: 0)
print(Array(p.unicodeScalars)) // ["\u{0001F3C1}", " "]
print(Array(p.utf16)) // [55356, 57281, 32]
Good analysis! This looks like a good argument for treating Strings as extended grapheme clusters (most of the time).
berik
(Berik Visschers)
August 1, 2020, 1:12pm
9
str.padding(toLength: 3, withPad: " ", startingAt: 0)
is still broken for Unicode strings. Is there anything I can help with to improve this?