String parsing

Chris_Eidhof · February 22, 2018, 6:32pm

For some reason, I always end up parsing things. Currently, the APIs are not well suited for this. Ideally, at some future point, we'd have something like Perl 6 grammars (I would love this).

Today, I write most of my methods as mutating func on Substring. I think the stdlib could benefit from having more of these. Things like func removePrefix(_ str: String) -> Bool, etc. Basically, a lot of stuff similar to Foundation's Scanner.

I'm wondering if, as a first step towards better string processing, we could make a list of these kinds of methods that mutate substrings. They might also be defined on protocols instead (so that we can also easily parse things like Data). I imagine that, even if we someday arrive at having embedded grammars, these methods would still be useful.

What do y'all think of this? Which kind of methods do you keep writing?

zoul · February 22, 2018, 6:53pm

My tangential idea is that I would love to see a simple parser combinator library being developed in your Swift Talks!

Chris_Eidhof · February 22, 2018, 9:29pm

Thank you! I love parser combinators, but I’m not sure if they’re the way forward for Swift’s standard library. I’m proposing something that’s way more modest (and less controversial, I think).

Michael_Ilseman · February 22, 2018, 10:03pm

I think there is a clear need for mutating consume-from-the-front methods on Substring for use in lexing/scanning.

Total strawman (preferring amusing names):

extension Substring {
  mutating func eat<T: SubstringEdible>(_ ofType = T.Self) -> T? { ... }
  mutating func eat<T: SubstringEdible>(oneOf: Set<T>) -> T? { ... }
  mutating func eat<T: SubstringEdible>(manyOf: Set<T>) -> [T]? { ... }
  ...
  mutating func eat(`while`: (Character) -> Bool) -> Substring { ... }
  mutating func eat(until: (Character) -> Bool) -> Substring { ... }
  mutating func eat(morsels numCharacters: Int = 1) -> Substring { ... }

  func peek<T: SubstringPeekable>(_ ofType = T.Self) -> T?
  ...
  func peek(ahead: Int = 1) -> Character { ... }
  func peek(until: (Character) -> Bool) -> Substring { ... }
}

extension Numeric/StringProtocol/... : SubstringEdible/SubstringPeekable { ... }

edit: Added more straw

nnnnnnnn · February 22, 2018, 10:06pm

Do you think String is the right place for this, or would you use a Scanner-like type if it weren't so crazy awkward in Swift?

Michael_Ilseman · February 22, 2018, 10:12pm

I think Substring is the right place for this. I don't think there's any value to an additional type unless you want additional stored properties, use-specified transform functions, etc. At least in my experience of writing parsers, those are usually highly use-specific. Tailored solutions* can easily be built on top of a Substring with consume.

(* puns involving "tailored" are my only unintended ones)

Chris_Eidhof · February 23, 2018, 7:20am

One advantage of a scanner-like type is that it could be generic over Character. Even when working with strings, this can be an advantage (e.g. when parsing binary formats you can often work with UTF-16 code points). Instead, these methods could also be defined on a protocol which allows efficient peek and popFirst.

I wonder what's the best way to go about this. We could look at existing parsing libraries and see what we can take. We could write a bunch of sample programs that should be implemented using these APIs and see what's useful and what isn't. I mostly have a experience with parser combinators, it would be good to have a broader view on this.

zoul · February 23, 2018, 7:52am

Is this also a question of the Substring API size? It’s quite big as it is, and I imagine there are usability downsides to it. That would also favour the scanner approach.

Michael_Ilseman · February 23, 2018, 5:01pm

I'm not arguing strongly against a new type, I just want to understand what benefit a new type introduces.

To accommodate peek, we have to make a generic Collection consumer instead of a generic Sequence consumer. If it's a new type, what does it hold? Well, it probably just stores a range over the wrapped Collection to represent its current position. That's basically what a SubSequence is, and literally what Slice and Substring are.

If we decide to make this interface into a protocol called, e.g. Collectionvore, I think we'd want to conform Substring and probably Slice to it.

Chris_Eidhof · February 23, 2018, 5:44pm

Yes. Probabky any random access collection could be used as long as we store the index (and can increment in O(1). Removing the first element is then the same as incrementing the index...

CTMacUser · February 23, 2018, 11:09pm

Are activities like this a (Sub)String thing? Or a general Collection thing? I've wanted to parse sources that weren't character-based/strings in the past.

ahti · February 24, 2018, 8:15am

It would be nice if this could work for parsing at least Strings and binary formats, so I think a generic type for parsing would be appropriate. It would also provide a reasonable type for users to possibly extend with their own parsing methods, as opposed to extending something like Substring for that.

sbohmann · February 24, 2018, 4:08pm

May I ask to what end your parser code mutates substrings? When I parse things, I usually start with a tokenizer and from there the actual text is not very relevant anymore... And my tokenizers never change the text they are interpreting

sbohmann · February 24, 2018, 4:12pm

Btw., I've made the experience that a formal grammar is nice to describe and verify code but not a very good starting point for parsing it. Parsing is best done in a hierarchical fashion, where after the tokenizer the general structure is interpreted from brackets, with indentation as an additional hint in case the brackets are off. This way, local errors in the code remain local and the general structure can be interpreted, which is extremely helpful for writing out sensible error messages - a formal grammar will very often misinterpret the location of the source of a problem, especially when brackets are off...

Michael_Ilseman · February 24, 2018, 4:23pm

You're not mutating the underlying storage, you're mutating the Substring's range. Something like mySubstr.eat(Int.self) would advance mySubstr's start to beyond the first parsed Int.

Your tokenizer could be built on top of these facilities.

sbohmann · February 24, 2018, 4:25pm

Ah, that makes a LOT more sense

sbohmann · February 24, 2018, 4:30pm

I just looked at the last Tokenizer I've written and found the most terribly method imaginable

It works, though, so I will definitely not replace it with something smarter inside that project

Here's just that method:

github.com

sbohmann/CodeGeneration/blob/master/Tokenizer.java

	public List<Token<?>> run()
	throws IOException
	{
		try (Reader reader = IoTools.createReader(file))
		{
			List<Token<?>> tokens = new ArrayList<>();
			
			int line = 1;
			int col = 1;
			
			TokenBuffer tokenBuffer = new TokenBuffer(tokens);
			
			boolean insideComment = false;
			
			boolean insideStringLiteral = false;
			boolean insideNumberLiteral = false;
			boolean escaped = false;
			
			while (true)
			{

This file has been truncated. show original

CTMacUser · February 25, 2018, 10:52pm

If parsing was a Collection and/or Sequence thing, instead of a (Sub)String thing, then we just reuse the parsing routines for each level.

I was thinking for the parsing routines being for any Collection/Sequence with an Equatable/Comparable/Hashable Element, but with higher level parsing, we would get objects with accounting information (error codes, line numbers, etc.) we wouldn't necessarily want to consider during comparisons, so routines that take general comparison closures are needed (maybe with Equatable overloads).

Chris_Eidhof · February 26, 2018, 7:22am

In thinking a bit more about it, I think we get all the benefits of a scanner-like type if we implement this on Slice (as Michael noted earlier). It also stores the end index, but that’s not a big problem, I think.

CTMacUser · February 27, 2018, 2:49pm

A SubSequence doesn’t necessarily have to be a Slice; any type with the same Sequence/Collection/etc. rank as the enclosing type is allowed. That’s why parsing should be more general.