Enum with `Substring` raw value?

itaiferber · October 12, 2023, 2:11am

No — just that what you really want to be doing is the equivalent of inputString.startsWith(keyword) (except more like at the byte level), then trimming off the prefix. Taylor mentions that the input string is

which is therefore much more likely to contain more input than just the keyword. In context, it doesn't seem to make much sense (to me, at least) to try to match in the other direction.
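
A minimal sketch of that prefix-then-trim approach, working on the UTF-8 view rather than on characters. The `Keyword` cases and the `matchKeywordPrefix(of:)` helper are made up for illustration and are not anything from the stdlib. It also ignores word boundaries (so "variable" would match `var`), which a real lexer would need to handle.

```swift
// Hypothetical keyword set, for illustration only.
enum Keyword: String, CaseIterable {
    case `func`, `struct`, `var`
}

/// Returns the first keyword whose UTF-8 bytes are a prefix of `input`,
/// together with the remaining input after that prefix.
func matchKeywordPrefix(of input: Substring) -> (keyword: Keyword, rest: Substring)? {
    let bytes = input.utf8
    for keyword in Keyword.allCases {
        let expected = keyword.rawValue.utf8
        // Byte-wise comparison: stops at the first mismatch or at the end
        // of the keyword, so it never walks the whole input.
        if bytes.starts(with: expected) {
            // Trim the matched prefix off the input.
            let end = bytes.index(bytes.startIndex, offsetBy: expected.count)
            return (keyword, input[end...])
        }
    }
    return nil
}
```

Because the comparison is byte-wise, it does not apply Unicode canonical equivalence the way `String` equality does. For ASCII keywords that is usually what a parser wants anyway.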

tera · October 12, 2023, 2:18am

Right... Just note that there are tons of textbooks that suggest doing it this "classical" way:

1. Do the lexical pass by converting file contents to a sequence of lexemes, e.g. by breaking them at the whitespace, etc.
2. Convert lexemes further (to symbols, numbers, etc).

Point (2) is where we reach for tools like Keyword.init(rawValue: string) to know if a lexeme is a keyword (likewise Number.init(rawValue: string) to know if it is a number, etc). As per what you've just said above, it is too late at this point, because the user input has already been converted to strings.
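
For concreteness, the two-pass "classical" shape might look roughly like this. The `Token` type, the `lex(_:)` function, and the two-case `Keyword` enum are hypothetical names for the sketch; `Int` stands in for the `Number` type mentioned above.

```swift
// Hypothetical types, for illustration only.
enum Keyword: String {
    case `func`, `var`
}

enum Token: Equatable {
    case keyword(Keyword)
    case number(Int)
    case identifier(Substring)
}

func lex(_ source: String) -> [Token] {
    // Pass 1: break the input into lexemes at whitespace.
    let lexemes = source.split(whereSeparator: { $0.isWhitespace })

    // Pass 2: classify each lexeme. Note the String(lexeme) copy that the
    // synthesized init(rawValue:) forces on us.
    return lexemes.map { lexeme -> Token in
        if let keyword = Keyword(rawValue: String(lexeme)) {
            return .keyword(keyword)
        }
        if let number = Int(lexeme) {
            return .number(number)
        }
        return .identifier(lexeme)
    }
}
```

Nothing in pass 1 limits how long a lexeme can be, which is the concern being raised here.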

itaiferber · October 12, 2023, 2:21am

It would be interesting to explore some stdlib API for performing some "user input → validated String" parsing that would allow you to set some reasonable constraints on what could be considered valid data; e.g., if the input contains, say, more than 30 combining characters, yeah, that seems kind of suspicious.

I've seen other languages and libraries discuss the idea of separating UserInput and Text into two separate types, where UserInput is a black box until validation turns it into Text, but I haven't seen an implementation of this in practice, likely because the surface area is just so large. Still, some interesting concepts to think about in this space.
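
To make the UserInput / Text idea a bit more tangible, here is a toy sketch. Neither type exists in the stdlib. The limit of 30 is just the number floated above, and "combining character" is interpreted here as a scalar in one of the Unicode mark categories, which is an assumption on my part.

```swift
/// Opaque wrapper: the raw string is not reachable from outside this file.
struct UserInput {
    fileprivate let raw: String
    init(_ raw: String) { self.raw = raw }
}

/// Text can only be produced by validating a UserInput.
struct Text: Equatable {
    let value: String

    init?(validating input: UserInput, maxCombiningMarks: Int = 30) {
        var marks = 0
        for scalar in input.raw.unicodeScalars {
            switch scalar.properties.generalCategory {
            case .nonspacingMark, .spacingMark, .enclosingMark:
                marks += 1
                // Bail out as soon as the limit is exceeded.
                if marks > maxCombiningMarks { return nil }
            default:
                break
            }
        }
        self.value = input.raw
    }
}
```

A real design would need many more constraints (length, control characters, normalization policy), which is presumably the "surface area" problem.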

itaiferber · October 12, 2023, 2:25am

At point (2), it should not be possible for the string you pass to init(rawValue:) to be arbitrarily long, because you validated your lexemes for reasonable input in step (1), right?

(But again, if we're talking about real-world parsers here: you typically want to be working at a much lower level here anyway. It's faster and cheaper to compare byte sequences, and depending on your domain, you may need to avoid things like string normalization in order to preserve user input exactly. Again, if you're writing some serious parsing code and this is a concern for you, String is unlikely to be the correct tool.)

And to add to this a bit: yes, this sounds exhausting, and it's why writing truly secure code is so difficult. If you really need to worry about adversarial input, you likely need to validate, re-validate, and validate some more before you get to the interesting parts of the code; and there are plenty of libraries and APIs you won't be able to use because they're not written with these constraints in mind.
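
As a rough illustration of validating lexemes in step (1), something like the following would reject oversized lexemes before any of them reaches init(rawValue:). The `boundedLexemes(in:maxBytes:)` name and the 64-byte default are arbitrary choices for the sketch.

```swift
/// Splits `source` at whitespace and returns the lexemes as Strings,
/// or nil if any lexeme is longer than `maxBytes` UTF-8 code units.
func boundedLexemes(in source: String, maxBytes: Int = 64) -> [String]? {
    var result: [String] = []
    for lexeme in source.split(whereSeparator: { $0.isWhitespace }) {
        // Measure in bytes, not Characters, so a single grapheme cluster
        // with many combining scalars still counts at its full size.
        guard lexeme.utf8.count <= maxBytes else { return nil }
        result.append(String(lexeme))
    }
    return result
}
```

With a check like this in place, the strings passed along to step (2) have a known upper bound on their size.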

tera · October 12, 2023, 2:47am

With that in mind the following stated complexities look questionable:

    /// - Complexity: O(1)
    @inlinable public var last: Character? { get }

    /// - Complexity: O(1) on average, over many calls to `append(_:)` on the
    ///   same collection.
    @inlinable public mutating func append(_ newElement: Character)

    /// - Complexity: O(1)
    @inlinable public mutating func popLast() -> Character?

    /// - Complexity: O(1)
    @discardableResult
    @inlinable public mutating func removeLast() -> Character

(I picked up a few examples from the header, there are more).
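
A small example of why these look questionable: a single Character can be made of arbitrarily many scalars, so finding the last Character means walking back over all of them. The count of 1,000 is arbitrary.

```swift
// One base letter followed by 1,000 combining acute accents.
let stacked = "e" + String(repeating: "\u{301}", count: 1_000)

// All the combining marks attach to the base letter, so the whole
// string is a single Character...
let characterCount = stacked.count

// ...and producing `last` has to cover every scalar in that Character.
let scalarsInLast = stacked.last!.unicodeScalars.count

print(characterCount, scalarsInLast)
```

So the cost of `last` here depends on the size of the final grapheme cluster, not on a constant.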

taylorswift · October 12, 2023, 2:54am

i feel this thread has diverged from the original issue being discussed, which is that the compiler’s init(rawValue:) synthesis can only generate an initializer that takes the type RawValue and no others.

to me, @jrose has the right idea:

i wish i had more time to learn macros, as they were not available on linux for some time and i am a bit behind the curve in adopting them. but i would much rather have a macro that generates a second Substring-taking initializer than apply a lot of length heuristics onto substring lexemes.
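
for reference, this is roughly what such a macro might expand to, written by hand. this is only a guess at the shape of the generated code, not the output of any existing macro, and the two-case `Keyword` enum is a stand-in.

```swift
// Stand-in enum, for illustration only.
enum Keyword: String {
    case `func`
    case `var`
}

// What a hypothetical macro might generate next to the synthesized
// init?(rawValue: String): one switch arm per case, no String allocation.
extension Keyword {
    init?(rawValue substring: Substring) {
        switch substring {
        case "func": self = .`func`
        case "var":  self = .`var`
        default:     return nil
        }
    }
}
```

the literals in the switch are inferred as `Substring`, so the lexeme is compared directly without being copied into a new `String`.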

tera · October 12, 2023, 3:22am

It did so "organically". You rejected @bbrk24's solution above on the grounds that String(substring) could take a while because the substring in question could be arbitrarily large. We noted that, equally, merely instantiating an enumeration from a string (or a substring) that's arbitrarily long could take a while on its own, regardless of whether the code doing that instantiation is written manually, generated by the compiler, made with a macro, etc.

sveinhal · October 12, 2023, 12:04pm

I haven't had the same need as you, but a similar one, in the sense that I've needed to construct a RawRepresentable value from something that wasn't the exact raw value.

In my use case, I was able to use CaseIterable and iterate through the values. Would that be possible for you?

E.g.

    enum Keyword: String, CaseIterable {
        case  actor
        case `associatedtype`
        case `case`
        case `class`
        case `enum`
        case `func`
        case `import`
        case  macro
        case `protocol`
        case `static`
        case `struct`
        case `typealias`
        case `var`
    }

    extension Keyword {
        init?(rawValue string: Substring) {
            guard
                let value = Self.allCases.first(where: { $0.rawValue == string })
                else { return nil }
            self = value
        }
    }

bbrk24 · October 13, 2023, 6:56pm

Is there a string-substring equality operator? I would have written $0.rawValue[...] to convert it to a substring, but if both work I suppose it's neither here nor there.

tera · October 13, 2023, 10:33pm

All of these should be identical:

substring == string
string == substring
Substring(string) == substring

If they are not, it's a bug, IMHO.
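
A quick way to check this, also covering the `[...]` spelling mentioned above. The sample strings are arbitrary.

```swift
let string = "enum"
let source = "enum Keyword"
let substring = source.prefix(4)   // a Substring of `source`

// Mixed String/Substring comparisons in both directions, plus the two
// ways of converting the String to a Substring first.
let checks = [
    substring == string,
    string == substring,
    Substring(string) == substring,
    string[...] == substring,
]

print(checks)
```

The mixed comparisons type-check because `==` is available between any two `StringProtocol` types.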

taylorswift · October 13, 2023, 10:34pm

yes there is: StringProtocol.==(_:_:)

you can find it on the inherited features list for String. i wish the site supported HTML anchors for the individual symbols in the members list.