Enum with `Substring` raw value?

this is really stupid:

extension Codelink
{
    enum Keyword:String
    {
        case  actor
        case `associatedtype`
        case `case`
        case `class`
        case `enum`
        case `func`
        case `import`
        case  macro
        case `protocol`
        case `static`
        case `struct`
        case `typealias`
        case `var`
    }
}
extension Codelink.Keyword
{
    init?(rawValue string:Substring)
    {
        switch string
        {
        case "actor":           self = .actor
        case "associatedtype":  self = .associatedtype
        case "case":            self = .case
        case "class":           self = .class
        case "enum":            self = .enum
        case "func":            self = .func
        case "import":          self = .import
        case "macro":           self = .macro
        case "protocol":        self = .protocol
        case "static":          self = .static
        case "struct":          self = .struct
        case "typealias":       self = .typealias
        case "var":             self = .var
        case _:                 return nil
        }
    }
}

one way to work around the issue is to make the RawValue type Substring and not String. but that messes with a lot of things that expect a RawRepresentable<String> conformance.

any better ways to avoid parroting the enum cases?

init?(substring: Substring) {
  self.init(rawValue: String(substring))
}

works if you can afford to copy the substring out into a new string.

alas, that is exactly what i was trying to avoid: the substring might be very long (all remaining parser input!), and copying the entire thing just to check if it is a keyword defeats the purpose.

You can use any ExpressibleByStringLiteral type as the enum raw value. It Just Works.
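
For instance, a minimal sketch of the original enum with a Substring raw type (Substring conforms to ExpressibleByStringLiteral, so the implicit raw values are still derived from the case names):

extension Codelink
{
    enum Keyword:Substring
    {
        case  actor
        case `associatedtype`
        // ... same cases as before; the synthesized init?(rawValue:)
        // now takes a Substring directly
    }
}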

This I don't quite understand. If the substring is long (e.g. 1000 or even 30 characters) it can't possibly match one of the above short keywords. And if the substring is short – creating a string from it is quick (it would take O(substring.length) time).

you wouldn’t know that a long substring doesn’t match any keywords until after you copy it to a String and pass it to init(rawValue:) and get nil as a result. you can’t really short-circuit that without knowing the length of the longest keyword beforehand.

right, like i said:

Which you do know (or it is trivial to pre-calculate), no?

Example
extension Codelink {
    enum Keyword: String, CaseIterable {
        case  actor
        // ...
    }
}

extension Codelink.Keyword {
    static let maxKeywordLen: Int = {
        Self.allCases.reduce(0) { maxLen, keyword in
            max(maxLen, keyword.rawValue.count)
        }
    }()
    
    init?(substring: Substring) {
        if substring.count > Self.maxKeywordLen {
            return nil
        }
        self.init(rawValue: String(substring))
    }
}

Note that substring.count could be slow to evaluate for a long substring as it has to decode grapheme clusters to count them. Count of UTF-8 units would be constant time though, assuming you aren't dealing with a bridged string (with UTF-16 storage) and don't have to care about Unicode normalization differences.
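
As a sketch, the same early-out could be expressed over UTF-8 code units instead (maxKeywordUTF8Len and init?(checkingSubstring:) are hypothetical names; the keywords here are all ASCII, so code-unit counts and character counts agree):

extension Codelink.Keyword {
    static let maxKeywordUTF8Len: Int = allCases.reduce(0) {
        max($0, $1.rawValue.utf8.count)
    }

    init?(checkingSubstring substring: Substring) {
        // utf8.count reads a stored length in O(1) for native Swift strings
        guard substring.utf8.count <= Self.maxKeywordUTF8Len else { return nil }
        self.init(rawValue: String(substring))
    }
}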

Good to know.

Is it possible to implement `var quickCount: Int?` on String / Substring that would only take O(1) time, or fail with nil if it can't do so? Could be useful, and not just in this case.

For this case, I suppose one could make a constant time count(upTo:) that will stop counting after a fixed limit. This should be O(maxLength); constant time if maxLength is a constant:

extension Substring {
    /// Counts characters, bailing out with nil once the count exceeds `maxLength`.
    func count(upTo maxLength: Int) -> Int? {
        var count = 0
        for _ in self {
            count += 1
            if count > maxLength { return nil }
        }
        return count
    }
}
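
Hypothetical usage, combining it with the maxKeywordLen from the earlier example (parserInput being some Substring of parser input):

if parserInput.count(upTo: Codelink.Keyword.maxKeywordLen) != nil {
    // the input is short enough that copying it into a String
    // and trying init(rawValue:) is cheap
}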

There's probably a slightly more performant version possible by advancing an index with index(after:) instead of materializing each character, since a character can theoretically be of arbitrary length.

Edit: but can it really be constant time if a character can be of arbitrary length? :face_with_monocle: There's a hidden loop here where decoding each character of the substring can require reading an unlimited number of combining code points to form a grapheme cluster. I suppose the real question is whether you need it or not. You don't need it if you know all the words you want to match against the substring are ASCII (in which case you can work in terms of UTF-8 code units).

*facepalm* I completely missed that. I think you're stuck, then. The compiler will synthesize a conformance to RawRepresentable; it won't make up arbitrary other initializers for you that happen to be similar. And the compiler-synthesized conformance isn't anything special anyway; it looks just like what you wrote explicitly. I think this is (now) a job for Macros.

Indeed.

Hopefully we could get the UTF-8 view out of a string / substring in O(1), or nil if that's not possible (is that doable?), and then get utf8.count, again in O(1).

The UTF-8 view is (I think) always lazily decoded, so if you count "up to" n where n is a constant, then it's unambiguously constant time to get that count. That said, since native Swift strings are internally UTF-8 (unless bridged) counting in UTF-8 units will incur less conversion overhead compared to the UTF-16 or UnicodeScalar view. This overhead is still constant time though since for each of these views there's an upper bound of four bytes to process per element.
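
As a concrete sketch, a bounded count over the UTF-8 view might look like this (utf8Count(upTo:) is a hypothetical name):

extension Substring {
    /// Counts UTF-8 code units, giving up once the count exceeds `limit`.
    /// At most limit + 1 code units are visited, however long the substring is.
    func utf8Count(upTo limit: Int) -> Int? {
        let counted = self.utf8.prefix(limit + 1).count
        return counted <= limit ? counted : nil
    }
}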

This is an interesting angle. So even with some innocent-looking:

enum E: String { case foo, bar }

The call to the built-in E.init(rawValue: userInput) could take arbitrary time, which is not even O(userInput.count) but potentially O(the length of a single character of user input), and that length is potentially unbounded... Unless we opt out of the compiler-generated init(rawValue:) and use our own version similar to what we discussed above? Hmm, that doesn't sound right. Could the built-in implementation be amended instead to address this angle?

Unless I'm missing something, in order to hit a degenerate case here you would have already needed to construct userInput from arbitrarily-long user input, meaning you've already spent arbitrarily long elsewhere, no? (You are validating user input, right? :wink:)

(I also can't think of an obvious use-case where user input would be passed in to an init(rawValue:), but maybe I'm not thinking hard enough.)

Let's say I find my app spending too much time hashing strings because some rough player passes it ridiculously long strings. To combat that, I do a validation: "if string.count < reasonable limit". But that, in turn, could take a long time, now in the count call itself...

That count has documented O(n) complexity is not the whole story, as this simple app shows:

import Foundation // for Date

@discardableResult func eqTest(_ a: String, _ b: String, _ length: Int) -> Bool {
    let start = Date()
    let result = a == b
    let elapsed = Date().timeIntervalSince(start)
    precondition(b.count == 1)
    print("count: \(b.count), len: \(length), utf8 len: \(b.utf8.count), \(elapsed * 1000) msec")
    return result
}

func test() {
    let a = "some"
    var count = 1
    var b = "e"
    while count < Int.max / 100 {
        b += String.init(repeating: "\u{0300}", count: count)
        eqTest(a, b, count)
        count *= 10
    }
}
test()

output:

count: 1, len: 1, utf8 len: 3, 0.12695789337158203 msec
count: 1, len: 10, utf8 len: 23, 0.0019073486328125 msec
count: 1, len: 100, utf8 len: 223, 0.0050067901611328125 msec
count: 1, len: 1000, utf8 len: 2223, 0.027894973754882812 msec
count: 1, len: 10000, utf8 len: 22223, 0.24199485778808594 msec
count: 1, len: 100000, utf8 len: 222223, 2.5069713592529297 msec
count: 1, len: 1000000, utf8 len: 2222223, 25.845050811767578 msec
count: 1, len: 10000000, utf8 len: 22222223, 250.23198127746582 msec
count: 1, len: 100000000, utf8 len: 222222223, 2517.67897605896 msec

Obviously "some" could not be equal to "e" plus any number of "COMBINING GRAVE ACCENT" characters, but current EQ implementation wants to first fully determine that first character (spending an unbound amount of time and space whilst doing so) before it could finally bail out with "not equal".

Sorry for jumping between String.init(rawValue:), String.== and String.count - the first depends on the second and we are trying to use the third to optimise the first two.

Just about any simple parser, like the one in the headline post of this topic.

Sorry, to be clearer about what I was hoping to express: string.count / string.utf8.count may already be too late to be performing validation; in many scenarios, it's likely that you can (and if so, should) validate user input before you have a String to work with:

  • If you're reading user input from a file, there may be a reasonable size you can check a file against prior to reading it in from disk
  • If you're receiving user input from a stream/network, there may be a reasonable limit to the number of bytes you're willing to accept before refusing further input
  • If you're receiving user input from UI, there are likely controls you can place to limit the type of input the user can provide

Once you have a fully-formed String, you already have the entirety of input in memory, which may be something you need to guard against.

It's not common in parser design to pass the entirety of string input into a matcher like this; it's almost never an efficient operation. Instead, it would be much more common to match in the opposite direction: getting the String value of your fixed Keyword to check against your input string.

That would be my recommended solution to this problem.

(And that's avoiding discussion of parsing bytes vs. whole grapheme clusters at a time — you likely want to be doing the former, not the latter. String may simply not be the right tool for the job here.)
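
For instance, a sketch of matching in that direction at the byte (UTF-8) level, reusing the CaseIterable conformance from the earlier example (matches(_:) and init?(parsing:) are hypothetical names):

extension Codelink.Keyword {
    /// Compares the fixed keyword's UTF-8 bytes against the input.
    /// elementsEqual stops at the first mismatch or length difference,
    /// so this touches at most rawValue.utf8.count + 1 code units of the
    /// input, no matter how long it is. Byte-wise comparison is fine here
    /// because the keywords are all ASCII.
    func matches(_ input: Substring) -> Bool {
        self.rawValue.utf8.elementsEqual(input.utf8)
    }

    init?(parsing input: Substring) {
        guard let keyword = Self.allCases.first(where: { $0.matches(input) })
        else { return nil }
        self = keyword
    }
}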

When the documentation for count says it is "O(n), where n is the length of the collection", by "length" it means code-unit length, not the number of characters. Obviously it must visit every code-unit to determine the length of the string, and indeed it scales linearly as your tests show.

It's similar to how, in String.reserveCapacity(Int), the integer parameter is a UTF-8 code-unit count rather than a character count, even though String's elements are characters.

It may be worth filing a documentation bug to make this clearer.

Got you :+1:

This would be the most lacking among the three you mentioned, but that would be down to the limitations of UI frameworks.

I don't quite get this: are you implying that keyword == inputString would have different timing to inputString == keyword?

You are right, although I've seen many parsers that do work at the string level. The likes of String(contentsOf: url) even push you toward that avenue.

I think an interesting observation is that String.first, String.index(after:), String.Iterator.next, etc. are all O(n) in a degenerate case with n combining code units. You could theoretically have a one-megabyte string and need to run over the whole megabyte to figure out the first character. Unlikely in a real world scenario, but potentially problematic if someone wants to cause trouble with bad input.

If you're just iterating over the string, no problem: you're running over all the bytes anyway so it's still O(n). But if you're repetitively checking the first character, like for instance comparing it with all the other characters, now it becomes O(n^2).

Can you see it? Take this simple example:

// does string1 contain the first character of string2?
func containsFirstCharacter(of string2: String, in string1: String) -> Bool {
    for character in string1 {
        if character == string2.first {
            return true
        }
    }
    return false
}

This is O(n^2) because string2.first may have to iterate the whole string to gather all the combining code points forming one character. This should be O(n) though:

func containsFirstCharacterHoisted(of string2: String, in string1: String) -> Bool {
    let string2First = string2.first // computed once, outside the loop
    for character in string1 {
        if character == string2First {
            return true
        }
    }
    return false
}

I wonder if equality testing and substring matching in the standard library have this unfortunate quadratic behavior.
