Declarative String Processing Overview

hooman · October 4, 2021, 7:37pm

This should be doable, but may need a bigger change in the parser. I personally don't like textual prefixes, but I can live with that as well. In that case, double quotes would work too, and we wouldn't need to spend the single quote on it.

Saklad5 · October 4, 2021, 7:38pm

I proposed that earlier, with slight modifications.

wowbagger · October 4, 2021, 7:39pm

(The end of) this line just made me realise that there is another potential conflict with using / for literal delimiter: multiline comments.

let x = foo(/*bar*/)

Is /*bar*/ a comment or a regex literal?
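For reference, under Swift's current lexing rules the sequence is a block comment, so the call takes no arguments. A minimal sketch, assuming a hypothetical zero-argument function `foo`:

```swift
// Hypothetical function, used only to illustrate the ambiguity.
func foo() -> Int {
    return 1
}

// Today the lexer treats /*bar*/ as a block comment, so this is foo().
let x = foo(/*bar*/)
print(x)
```

If `/` became a regex delimiter, the same characters could also be read as a regex literal containing `*bar*`, which is the conflict being raised.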

Saklad5 · October 4, 2021, 7:39pm

I think we can all agree using / should be off the table at this point.

Saklad5 · October 4, 2021, 7:44pm

Using '/regex/' just kicks the issue down the road, I think. We should rely on type information to identify what a literal is, not increasingly obscure delimiters.

We do not use 3.14d for a Double, nor [s 1, 2, 3 s] for a Set.

benlings · October 4, 2021, 8:14pm

My understanding is that the main reason for a specific literal is to allow the regex syntax to be checked at compile time, rather than an opaque string that can only be checked at runtime. I could envisage similar usage for other literals that are currently failable initialisers from strings (eg URL).
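To illustrate the runtime-only checking described here, a minimal sketch using Foundation's `NSRegularExpression`, whose initializer throws when the pattern is malformed. The pattern strings are made up for illustration:

```swift
import Foundation

// The compiler sees only opaque strings, so both lines compile.
let valid = try? NSRegularExpression(pattern: "([0-9A-F]+)")
let invalid = try? NSRegularExpression(pattern: "([0-9A-F]+")  // unbalanced parenthesis

// The mistake surfaces only when the initializer actually runs.
print(valid != nil)
print(invalid != nil)
```

A dedicated literal would let the compiler reject the second pattern before the program ever runs.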

Saklad5 · October 4, 2021, 8:28pm

Exactly. That’s why I’m advocating for a special 'customizable' literal whose compile-time parsing is handled by the type being initialized.

patrickgoley · October 4, 2021, 8:33pm

It seems like we've veered off topic to compile time validation of regex and other string literals when the essence of this proposal is to enable string processing using constructs within the Swift language (structs, protocols, etc) whose usage would be naturally verified by the compiler to begin with. They also have the advantage of being more approachable and easier to read than the highly compacted and symbolic regex syntax.

Compile-time validation of string literals would need to be baked into the compiler for each use case (regex, URLs, etc) unless we want to enable full-on dependent typing which is a much more complicated feature than this proposal is suggesting.

Saklad5 · October 4, 2021, 8:34pm

I tried to put a pin in the whole literal issue for now, but the original poster explicitly said they wanted to discuss it here.

ben-cohen · October 4, 2021, 9:18pm

We very much can't all agree that – I for one disagree.

/ is the term of art delimiter for regular expressions. This, and the associated at-a-glance clarity it brings, strongly recommend it.

There is definitely a need for an alternate in cases where / is problematic like when dealing with paths etc, but these should be a fallback IMO. / would ideally be the standard convention.

There is also a need to figure out the parsing implications of this new meaning for /, and what tactical heuristics we might apply in Swift 5, and what potential consequences it might have in Swift 6 to make it fit seamlessly. But only after a full exploration of these, and deciding the problems it causes are too difficult to overcome in Swift, should we consider ruling out / in favor of something less appealing.

For example, the uncommon but possible use of postfix / could well be enough to limit use of / to Swift 6 mode only, and require fallback of something else like #r"" in Swift 5 (and Swift 6, when the literal makes / inconvenient). But it's not enough of a reason to rule it out.

Compile-time interpretation of code for validation is an awesome feature for building user-friendly libraries, and one I very much hope we get in the fullness of time. But it's a long way off, and should not hold up implementation of some regex handling directly in the compiler in the meantime.

Saklad5 · October 4, 2021, 9:19pm

But we wouldn’t need to hold up regex handling. We’d just need to hold up compile-time checking of regex literals. Which already have unavoidable readability issues, so we shouldn’t be encouraging them anyway.

ben-cohen · October 4, 2021, 9:24pm

Familiarity is a strong recommendation but it's not the only one.

Simple regexes are wonderfully concise. A switch statement with regex cases when ripping through a text file, where the regexes are short and simple, maybe with some captures bound with let, is very clear and obvious in intent compared to building several expressions then using them.
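A rough approximation of that switch style is possible today by overloading `~=` on top of Foundation's `NSRegularExpression`. This is only a sketch: `LinePattern` and `classify` are hypothetical names, the patterns are invented, and nothing here binds captures the way a real regex literal might.

```swift
import Foundation

// Hypothetical wrapper so a regex can appear as a `case` pattern.
struct LinePattern {
    let regex: NSRegularExpression

    init(_ pattern: String) {
        // try! traps at run time if the pattern is malformed.
        regex = try! NSRegularExpression(pattern: pattern)
    }
}

// `switch` uses ~= to test each expression pattern against the value.
func ~= (pattern: LinePattern, value: String) -> Bool {
    let range = NSRange(value.startIndex..., in: value)
    return pattern.regex.firstMatch(in: value, options: [], range: range) != nil
}

func classify(_ line: String) -> String {
    switch line {
    case LinePattern(#"^#"#):
        return "comment"
    case LinePattern(#"^[0-9A-F]+\s*;"#):
        return "entry"
    default:
        return "other"
    }
}

print(classify("# header"))
print(classify("0041 ; Latin"))
print(classify("hello"))
```

The wrapper and raw-string noise in each `case` is the kind of overhead a concise literal would remove.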

I see them as having a similar rule to simple closure literals combined with map and filter. There comes a tipping point where actually you don't want map or reduce, you need a for loop and maybe to define a function, because your logic is more complex and the more verbose form is clearer. People will misuse regexes just like they misuse these high-order functions, using them to write obfuscated over-compressed code. But that shouldn't be used as a reason why we can't have nice things.

Saklad5 · October 4, 2021, 9:31pm

If we really need to implement bespoke compile-time regex checking (and I’m very doubtful about that), I think there’s a more appropriate precedent:

let regex = #regex("([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s(\w+).*")
print(type(of: regex))
// Prints Regex<(Substring, Substring?, Substring)>

If this is added, I want it deprecated the instant it is no longer necessary.

Abawell · October 4, 2021, 9:51pm

We could use the same technique as the extended string delimiters:
" ... " simple string
""" ... """ multiline string
#" ... "# string with extended delimiters
#""" ... """# multiline string with extended delimiters
/" ... "/ regex
/""" ... """/ multiline regex
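Only the first four forms exist in Swift today; the `/" ... "/` forms are the suggestion. A small sketch of why the existing extended delimiters already suit regex-like text, since they switch off backslash escapes (the pattern itself is just an example):

```swift
// Ordinary string: every backslash must be escaped.
let escaped = "\\d+\\.\\d+"

// Extended delimiters: backslashes are taken literally.
let raw = #"\d+\.\d+"#

// Multiline string with extended delimiters.
let multilineRaw = #"""
\d+
\.\d+
"""#

print(escaped == raw)
print(multilineRaw)
```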

Karl · October 4, 2021, 10:18pm

It isn't just for checking - I'm sure we'd want the compiler to interpret the regex and generate native code for it.

I don't think regex literals and URL literals are comparable. Regex literals define parsers (i.e. executable code), while URL literals define values which are interpreted by a single parser.

Also, with regard to other comments about the / delimiter, I definitely agree that we should support it, even if we have to deprecate ambiguous uses in custom operators. That's just the way regular expressions are, ahem, regularly expressed.

Saklad5 · October 4, 2021, 10:22pm

I think that is an extremely bad idea. It’d be better to use a source code generator if you really want that syntax.

And regular expressions (in the form of a literal) are not parsers: they are at most instructions, which are only useful when interpreted by a parser. That applies to most things, including URLs.

Karl · October 4, 2021, 10:29pm

I said they define parsers. Yes, they are a set of instructions - as much as any other source code is a set of instructions. As with other source code, we would want the compiler to understand and check them, but also to produce optimised native code from them.

URLs are just values. You can run the parser at compile-time, and the result is a static value. For regular expressions, the result of compiling them should be executable code.

Saklad5 · October 4, 2021, 10:46pm

But that doesn’t mean it should be Swift code. It should be parsed into Swift code, which is then optimized according to the normal rules. Like result builders, and like property wrappers. And it shouldn’t be regex-only.

What about NSPredicate, for example?

1-877-547-7272 · October 5, 2021, 8:51am

I agree that / is probably the best delimiter for regular expressions (or at least most regular expressions) — I support using / for regex literals as long as the behavior concerning prefix/postfix / is clear.

Perhaps, in Swift 6, / could be treated as a keyword. So if there were an ambiguous expression like let x = /5; let y = 5/, it would default to being interpreted as defining a regex but you could use ` delimiters to clarify that you want to use the / operator (i.e. let x = `/`5; let y = 5`/`). This would make the transition smoother for code that currently uses prefix/postfix / (such as code that relies on swift-case-paths).

A @evaluateAtCompileTime attribute could help with both the implementation of Pattern and compile-time literal processing (though its implementation would be pretty complex and should be discussed further).

@evaluateAtCompileTime func match(@PatternBuilder content: () -> Pattern) { ... }
@evaluateAtCompileTime init(stringLiteral: StaticString) { ... }

That being said, I think that discussion of this attribute (as well as further discussion of compile-time–checked string literals, for that matter) should occur on a separate forum topic.

Can you elaborate on why you don’t think we should have compiled regexes? One of the main goals of Swift is high-performance Unicode-correct string processing, so I think compiled regexes are a natural direction for Swift.

Saklad5 · October 5, 2021, 12:30pm

As I said before, I think compile-time literal parsing should be generalized beyond regex specifically. Swift generally shies away from prescribing formats (see Codable, distributed actors, etc.), and I think we should continue that trend.

@evaluateAtCompileTime would be way too general, I think. I’d rather we had ExpressibleByCustomLiteral, and an initializer with as many limitations as necessary to make compile-time running feasible.

Result builders and property wrappers also run at compile-time to a certain degree, and for very similar reasons.
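For comparison, a minimal sketch of what the existing `ExpressibleByStringLiteral` protocol offers. `ExpressibleByCustomLiteral` is only a suggestion in this thread and does not exist; `HexCode` below is a hypothetical type invented for illustration.

```swift
// Hypothetical type: a hexadecimal code point written as a string literal.
struct HexCode: ExpressibleByStringLiteral {
    let value: UInt32?

    // Runs at run time; the compiler does not inspect the literal's contents.
    init(stringLiteral: String) {
        value = UInt32(stringLiteral, radix: 16)
    }
}

let good: HexCode = "1F600"
let bad: HexCode = "xyz"  // compiles, but validation fails only at run time

print(good.value as Any)
print(bad.value as Any)
```

A compile-time-evaluated initializer of the kind discussed above would presumably turn the second declaration into a compiler error instead.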