For some reason, I always end up parsing things. Currently, Swift's string APIs are not well suited for this. Ideally, at some future point, we'd have something like Perl 6 grammars (I would love this).
Today, I write most of my parsing methods as mutating funcs on Substring. I think the stdlib could benefit from having more of these. Things like func removePrefix(_ str: String) -> Bool, etc. Basically, a lot of functionality similar to Foundation's Scanner.
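To make this concrete, here's a sketch of the kind of helpers I mean (the names and signatures are just suggestions, not a concrete proposal):

```swift
extension Substring {
    /// Removes `prefix` from the front if present; returns whether it did.
    mutating func removePrefix(_ prefix: String) -> Bool {
        guard hasPrefix(prefix) else { return false }
        self = dropFirst(prefix.count)
        return true
    }

    /// Consumes and returns the leading run of characters matching `predicate`.
    mutating func removeWhile(_ predicate: (Character) -> Bool) -> Substring {
        let end = firstIndex { !predicate($0) } ?? endIndex
        let consumed = self[startIndex..<end]
        self = self[end...]
        return consumed
    }
}

var line: Substring = "let answer = 42"[...]
line.removePrefix("let ")                     // true; line is now "answer = 42"
let name = line.removeWhile { $0.isLetter }   // "answer"; line is now " = 42"
```

Since a Substring is just a range into its base String, these only move indices around; the underlying storage is never touched.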
I'm wondering if, as a first step towards better string processing, we could make a list of these kinds of methods that mutate substrings. They might also be defined on protocols instead (so that we can also easily parse things like Data). I imagine that, even if we someday arrive at having embedded grammars, these methods would still be useful.
What do y'all think of this? Which kind of methods do you keep writing?
Thank you! I love parser combinators, but I'm not sure if they're the way forward for Swift's standard library. I'm proposing something that's way more modest (and less controversial, I think).
I think Substring is the right place for this. I don't think there's any value in an additional type unless you want additional stored properties, user-specified transform functions, etc. At least in my experience of writing parsers, those are usually highly use-specific. Tailored solutions* can easily be built on top of a Substring with consume.
(* puns involving "tailored" are my only unintended ones)
One advantage of a scanner-like type is that it could be generic over Character. Even when working with strings, this can be an advantage (e.g. when parsing binary formats you can often work with UTF-16 code points). Instead, these methods could also be defined on a protocol which allows efficient peek and popFirst.
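A sketch of what that protocol might look like, assuming peek and popFirst are the core requirements (the name and requirements here are placeholders):

```swift
/// Hypothetical protocol for things that can be consumed from the front.
protocol Consumable: Collection where SubSequence == Self {
    /// Look at the next element without consuming it.
    func peek() -> Element?
}

extension Consumable {
    func peek() -> Element? { first }
    // A mutating `popFirst()` already comes for free from the stdlib's
    // `Collection where SubSequence == Self` extension.
}

// Substring covers the text case; a Data slice could cover binary formats.
extension Substring: Consumable {}

var input: Substring = "ab"[...]
input.peek()      // "a", without consuming
input.popFirst()  // consumes and returns "a"
```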
I wonder what's the best way to go about this. We could look at existing parsing libraries and see what we can take. We could write a bunch of sample programs that should be implemented using these APIs and see what's useful and what isn't. I mostly have experience with parser combinators; it would be good to have a broader view on this.
Is this also a question of Substring's API size? It's quite big as it is, and I imagine there are usability downsides to that. That would also favour the scanner approach.
I'm not arguing strongly against a new type, I just want to understand what benefit a new type introduces.
To accommodate peek, we have to make a generic Collection consumer instead of a generic Sequence consumer. If it's a new type, what does it hold? Well, it probably just stores a range over the wrapped Collection to represent its current position. That's basically what a SubSequence is, and literally what Slice and Substring are.
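To illustrate the point: if we did write the new type, its stored state would look something like this (names are illustrative), which is essentially a reimplementation of Slice:

```swift
/// What a hypothetical scanner type would have to store.
struct Consumer<Base: Collection> {
    let base: Base
    var position: Base.Index   // advances as input is consumed
    let end: Base.Index        // fixed upper bound

    /// Consume and return the next element, or nil when exhausted.
    mutating func popFirst() -> Base.Element? {
        guard position < end else { return nil }
        defer { base.formIndex(after: &position) }
        return base[position]
    }
}
```

A `(position, end)` pair over a base collection is exactly the `bounds` that Slice stores, so the new type buys us nothing over a SubSequence here.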
If we decide to make this interface into a protocol called, e.g. Collectionvore, I think we'd want to conform Substring and probably Slice to it.
Yes. Probably any random access collection could be used, as long as we store the index (and can increment it in O(1)). Removing the first element is then the same as incrementing the index...
Are activities like this a (Sub)String thing? Or a general Collection thing? I've wanted to parse sources that weren't character-based/strings in the past.
It would be nice if this could work for parsing at least Strings and binary formats, so I think a generic type for parsing would be appropriate. It would also provide a reasonable type for users to possibly extend with their own parsing methods, as opposed to extending something like Substring for that.
May I ask to what end your parser code mutates substrings? When I parse things, I usually start with a tokenizer, and from there the actual text is not very relevant anymore... and my tokenizers never change the text they are interpreting.
Btw., in my experience a formal grammar is nice for describing and verifying code, but it is not a very good starting point for parsing it. Parsing is best done in a hierarchical fashion: after the tokenizer, the general structure is interpreted from brackets, with indentation as an additional hint in case the brackets are off. This way, local errors in the code remain local and the general structure can still be recovered, which is extremely helpful for writing out sensible error messages. A formal grammar will very often misplace the source of a problem, especially when brackets are off...
You're not mutating the underlying storage, you're mutating the Substring's range. Something like mySubstr.eat(Int.self) would advance mySubstr's start to beyond the first parsed Int.
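One way eat might look, as a sketch that only handles non-negative ASCII integers (the method name comes from the thread; the implementation details are my assumptions):

```swift
extension Substring {
    /// Parses a leading integer and advances self past it.
    /// Returns nil (leaving self unchanged) when no digits are present.
    mutating func eat(_ type: Int.Type) -> Int? {
        let digits = prefix(while: { $0.isASCII && $0.isNumber })
        guard let value = Int(digits) else { return nil }
        self = self[digits.endIndex...]
        return value
    }
}

var mySubstr: Substring = "42 apples"[...]
mySubstr.eat(Int.self)  // 42; mySubstr is now " apples"
```

Note that only the substring's start index moves; "42 apples" itself is untouched.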
Your tokenizer could be built on top of these facilities.
If parsing was a Collection and/or Sequence thing, instead of a (Sub)String thing, then we just reuse the parsing routines for each level.
I was thinking for the parsing routines being for any Collection/Sequence with an Equatable/Comparable/Hashable Element, but with higher level parsing, we would get objects with accounting information (error codes, line numbers, etc.) we wouldn't necessarily want to consider during comparisons, so routines that take general comparison closures are needed (maybe with Equatable overloads).
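A sketch of the closure-taking variant: matching a prefix with a caller-supplied equivalence, so accounting information like line numbers can be ignored during comparison (all names here are illustrative):

```swift
extension Collection where SubSequence == Self {
    /// Removes `prefix` from the front if every element matches under
    /// `areEquivalent`; returns whether it did.
    mutating func removePrefix<P: Sequence>(
        _ prefix: P,
        by areEquivalent: (Element, P.Element) -> Bool
    ) -> Bool {
        var index = startIndex
        for expected in prefix {
            guard index < endIndex, areEquivalent(self[index], expected) else {
                return false
            }
            formIndex(after: &index)
        }
        self = self[index...]
        return true
    }
}

// Tokens carry accounting info we don't want to compare against.
struct Token { var text: String; var line: Int }

var tokens: ArraySlice<Token> = [Token(text: "let", line: 1),
                                 Token(text: "x", line: 1)][...]
tokens.removePrefix(["let"]) { $0.text == $1 }  // true; line numbers ignored
```

An Equatable overload could then just forward to this with `==` as the closure.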
Thinking a bit more about it, I think we get all the benefits of a scanner-like type if we implement this on Slice (as Michael noted earlier). It also stores the end index, but that's not a big problem, I think.
A SubSequence doesn't necessarily have to be a Slice; any type with the same Sequence/Collection/etc. rank as the enclosing type is allowed. That's why parsing should be more general.