Opinions on the best way to improve String ergonomics by restricting allowed characters

It’s a really simple and surely incredibly common situation in Swift, but I realized that I don’t have any fully satisfying answers. I’m turning to the community for insight because I always learn something new and useful.

Let’s say I have a type that represents a social security number:

struct SSN {
    let rawValue: String
    init?(_ rawValue: String) { ... }
}

and I enforce that the rawValue is stripped of any hyphens. Now I want to provide a method that returns the standard textual format: 012-34-5678. It’s great that String is Unicode-correct, but there are plenty of situations where one knows those considerations aren’t necessary, and in those situations one should be able to turn to a simpler textual type that lets one skip the pain of rawValue.index(rawValue.startIndex, offsetBy: 2).
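
For concreteness, here’s roughly what that looks like with String.Index today (a sketch of the verbosity I’d like to avoid; the property name is just for illustration):

extension SSN {
    var standardWrittenFormViaIndices: String {
        // Fixed offsets: 3 digits, 2 digits, 4 digits.
        let areaEnd = rawValue.index(rawValue.startIndex, offsetBy: 3)
        let groupEnd = rawValue.index(areaEnd, offsetBy: 2)
        return "\(rawValue[..<areaEnd])-\(rawValue[areaEnd..<groupEnd])-\(rawValue[groupEnd...])"
    }
}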

What do you think is the best way to implement this formatting method?

let rawValue: [Character]? Something with String.UTF8View? Some other API that I so far only know from a distance?

I’m looking for something like this:

extension SSN {
    var standardWrittenForm: String {
        rawValue[...2] + "-" + rawValue[3...4] + "-" + rawValue[5...]
    }
}

I'd go simpler: let rawValue: [9 of UTF8.CodeUnit] (aka [9 of UInt8]) would suffice.
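
Formatting from that could look roughly like this (a sketch; it assumes the [9 of UInt8] InlineArray sugar and stores the digits as ASCII code units):

struct SSN {
    let rawValue: [9 of UInt8]   // ASCII digit code units, no hyphens

    var standardWrittenForm: String {
        // Illustrative helper: decode a fixed range of digits back into a String.
        func digits(_ range: Range<Int>) -> String {
            String(decoding: range.map { rawValue[$0] }, as: Unicode.ASCII.self)
        }
        return "\(digits(0..<3))-\(digits(3..<5))-\(digits(5..<9))"
    }
}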

2 Likes

I tried this out and was very surprised to see that an InlineArray with Hashable elements doesn’t conform to Hashable (or Equatable). Why doesn’t it conform?

It hasn't been done yet.

1 Like

In that case I’m wondering what the answer to my question would have been as of last year. Has Swift’s answer generally just been that one must accept the pain of String's API in all situations? I’ve been swimming in Swift non-stop for 10 years and never ended up developing a huge resentment against String's indexing APIs, so I guess that’s a decent testament to the fact that indexing strings with integers really isn’t as necessary as many might assume. But still, the need does come up from time to time and I feel like Swift should offer a solution.

What do you guys think about an ASCII type along these lines?

struct ASCII {
    private(set) var bytes: [UInt8]
    
    private init(uncheckedASCIIBytes: some Sequence<UInt8>) {
        self.bytes = Array(uncheckedASCIIBytes)
    }
    
    init?(asciiBytes: some Sequence<UInt8>) {
        self.bytes = []
        self.bytes.reserveCapacity(asciiBytes.underestimatedCount)
        for byte in asciiBytes {
            guard byte < 128 else { return nil }
            self.bytes.append(byte)
        }
    }
}

extension ASCII: ExpressibleByStringLiteral {
    init(stringLiteral value: String) {
        self.init(asciiBytes: value.utf8)!
    }
}

extension ASCII {
    static func + (lhs: Self, rhs: Self) -> Self {
        .init(uncheckedASCIIBytes: lhs.bytes + rhs.bytes)
    }
    
    static func += (lhs: inout Self, rhs: Self) {
        lhs.bytes += rhs.bytes
    }
}
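
Usage would then look something like this, with the round trip back to String done via the decoding initializer:

var greeting: ASCII = "Hello"
greeting += ", world"
String(decoding: greeting.bytes, as: Unicode.ASCII.self) // "Hello, world"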

FWIW, I rolled my own ASCIIString (and slice) type as well. My use case is known ASCII-only string processing, and I wanted to have convenient API tailored for that use case.

I think the existing Sequence APIs do an adequate job for slicing strings based on integer offsets. Like this:

extension SSN {
    var standardWrittenForm: String {
        rawValue.prefix(3) + "-" + rawValue.dropFirst(3).prefix(2) + "-" + rawValue.dropFirst(5)
        // Or use rawValue.suffix(4) for the last component
    }
}

Yes, this is a bit longer than your subscript-based code and probably a little less efficient because the middle component must create an intermediate Substring, but it's not that bad IMO. And if efficiency is extremely important, any integer-based subscripting API on String would also be a problem.

4 Likes

There'll be a fast O(1) path taken for a string known to be ASCII when doing an integer-based index access like string[string.index(string.startIndex, offsetBy: n)], no? The latter could be easily wrapped into some convenient string[n] subscript.
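
Something like this, just as a sketch (the argument label is arbitrary):

extension StringProtocol {
    // Convenience wrapper over index(_:offsetBy:); only O(1) if the
    // implementation can actually take a known-ASCII fast path.
    subscript(offset n: Int) -> Character {
        self[index(startIndex, offsetBy: n)]
    }
}

"012345678"[offset: 4] // "4"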

Note that this fits comfortably into inline string representation.

1 Like

There certainly are many fast paths in the String implementation for known-ASCII contents. Unfortunately, knowing that your string is ASCII is not quite good enough for O(1) integer indexing into Characters because the ASCII string "\r\n" (CR+LF = 2 ASCII chars) becomes a single grapheme cluster according to the Unicode rules:

let str = "\r\n"
str.count // 1
str.utf8.count // 2

I don't know how String handles this internally. Maybe @David_Smith can answer this?

But regardless of the String-internal optimizations, what I wanted to say was that any fast paths the implementation can take for the hypothetical str[3...4], it should also be able to take for str.dropFirst(3).prefix(2), so both variants should be equally fast (except for the creation of the intermediate Substring due to the chaining of two operations).

1 Like

Very good catch!

what I wanted to say was that any fast paths the implementation can take for the hypothetical str[3...4], it should also be able to take for str.dropFirst(3).prefix(2), so both variants should be equally fast

Yep, I agree. And those are what an integer-based subscript implementation could be built upon.
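
A range flavor could be little more than this (again just a sketch, label arbitrary):

extension String {
    subscript(offsets range: Range<Int>) -> Substring {
        dropFirst(range.lowerBound).prefix(range.count)
    }
}

"012345678"[offsets: 3..<5] // "34"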

Another option to consider is Regex:

struct SSN {
  let rawValue = "123456789"

  var standardWrittenForm: String {
    let m = try! #/(...)(..)(....)/#.wholeMatch(in: rawValue)!
    return "\(m.1)-\(m.2)-\(m.3)"
  }
}

However, I would probably forgo String entirely and store the digits in an [Int8] or something, as others have already suggested.

The reason I was motivated to write my example ASCII type is that I started with the [UInt8] approach and quickly got to the natural question of “Okay, so what’s the most natural way to convert these code points back into my desired string 012-34-5678?” The most straightforward answer seemed to be String(bytes: rawValue, encoding: .ascii)!, but that doesn’t feel nearly straightforward enough.

My question wasn’t really aimed at finding a more efficient approach (either time-wise or memory-wise), but rather at finding one that is more natural/human. I was putting myself in the shoes of a novice Swift developer who just wants to insert some hyphens at fixed offsets. All of the approaches so far, including the String one, feel quite far from that ideal. @ole’s suggestion to use .prefix(), .dropFirst() and .suffix() feels to me like the only one that noticeably moves the needle towards a more ergonomic way for novice-to-intermediate Swift developers to express this conceptually simple operation, but it’s still a little unintuitive.

Is the reason we don’t provide @tera’s suggested Int-based subscripts on String that they would be surprisingly non-performant and we don’t like allowing those kinds of pitfalls in the standard library? Or is it to avoid overloading the subscript that takes String.Index?

1 Like

I don't think it's the latter, as we could have used a differently named subscript – it's the former. Imagine we did have an easily reachable Int-based subscript on String: people would use it without thinking, and we'd have many accidentally quadratic algorithms in user apps. If provided on String in the standard library at all, this subscript would have to look scary, something like string[atIndex_but_think_twice_before_using_it: n].
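
To illustrate the trap with the hypothetical subscript sketched above:

let str = "some long ascii text"
var reversedCopy = ""
// Looks linear, but each str[offset: i] walks from startIndex again,
// so the whole loop is O(n²) in the general (non-fast-path) case.
for i in (0..<str.count).reversed() {
    reversedCopy.append(str[offset: i])
}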

In some parallel universe where Swift actively uses StringProtocol as a currency type, there would be an AsciiString type that could hold nothing but ASCII characters (and not even "\r\n" sequences, to allow for quick O(1) integer-based indexing). There would be an obvious way to go from AsciiString to String; the reverse operation would be less trivial (and failable). The jury is out on whether we need it here.

BTW, is this recommendation still applicable or not?

From the docs:

Do not declare new conformances to StringProtocol. Only the String and Substring types in the standard library are valid conforming types.

Found this ~8-year-old message that hints it might not be.

Better alternatives IMO:

let bytes: [UInt8] = Array("Hello".utf8) // [0x48, 0x65, 0x6c, 0x6c, 0x6f]
let str1 = String(decoding: bytes, as: Unicode.ASCII.self) // "Hello"
// or
let str2 = String(validating: bytes, as: Unicode.ASCII.self) // "Hello" (Optional<String>)

The difference between init(decoding:as:) and init(validating:as:) is that the latter returns an Optional and nil if the input cannot be decoded.

Conversely, init(decoding:as:) will always return a valid non-Optional string. It will replace any non-decodable bytes with "�" (U+FFFD aka the Unicode replacement character).
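
For example, with a byte that isn't valid ASCII:

let mixed: [UInt8] = [0x48, 0x69, 0xFF]           // "Hi" plus an invalid byte
String(decoding: mixed, as: Unicode.ASCII.self)   // "Hi�"
String(validating: mixed, as: Unicode.ASCII.self) // nil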

1 Like

I’ve contemplated burning one of our remaining flag bits on String for “contains no CRLFs”, but they’re a precious and finite resource, and I haven’t seen this showing up in traces in the wild. I’m also mildly concerned about slowing down initialization to look for them.

2 Likes

Even for a non-inline string? This bit could be stored in the internal reference object, without being exposed in the 16-byte String struct, no?

(edited, misread initially)

Flag bits are less scarce in that case, but we still don’t want to expand the allocation unnecessarily. Very small strings are the most common, so the most critical to fit tightly into malloc buckets.

I meant one of the reserved bits:

b48-58: Reserved for future usage.
...

  • This typically means that these bits can only be used as optional
    performance shortcuts, e.g. to signal the availability of a potential fast
    path. ...

Ah, no, obviously without lengthening the string object... The smallest non-inline string is already at least 64 bytes including overhead (on macOS).

Yeah, we’ve got 10 of those and have to make them last the next few decades :slight_smile:

Something to consider if it becomes apparent that it’s a significant problem in the wild. Not something to do speculatively imo.

2 Likes