Regex is very heavily geared towards textual data, as nearly every construct, concept, and option beyond the core operations only makes sense for text. Even the core operations themselves are still pretty heavily geared towards search within text and require a concept of position (i.e. Collection.Index).
A question regarding the DSL representation is whether the DSL should provide a radically different model than the literal. Since the DSL is producing Regex instances and can embed literals, it is very much a Regex DSL. Significant divergence would be uncanny.
The original overview from way back when discussed a Pattern<T>, implying more of a parser combinator system with recursively nested structural captures and history preservation. That's future work and difficult to incorporate into result builders (and the type system) as they are today. Trying to make the DSL be both a Regex DSL and a Pattern DSL leads into an uncanny valley that fails to serve the needs of either. Instead, a future Pattern can interoperate with Regex through the custom components interfaces (see CustomMatchingRegexComponent).
Regexes excel at finding needles in haystacks, but they are poor at recognizing the recursive structure of stacks of hay. FWIW, we're developing the fundamental capabilities that underlie both and treating regex as a particular presentation of those capabilities. PEG-like systems (including parser combinators) would have a different presentation style, ideally with significant improvements to Swift's DSL story (either better result builders or something to supersede them). This is outside the scope of these proposals, but I am always happy to discuss the topic.
While data processing can use the same underlying capabilities of the matching engine powering Regex as presented, it would benefit from a very different presentation than Regex. Most notably, it would want local backtracking within a moving window over asynchronous sources. This is outside of the scope of these proposals.
I'll leave it up to the review manager to decide if we should spin off a thread for further discussions around non-String processing.
As for the semantics of processing Unicode scalars or bytes with Regex, here are some clarifications:
Models of string
The primary, or default, model would be String's model in which characters are extended grapheme clusters comparable under canonical equivalence. Thus . behaves the same as dropFirst().
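To make the grapheme-cluster model concrete, here is a small sketch using only standard String API (no Regex involved). The variable names are mine, chosen for illustration; the point is just that String compares under canonical equivalence and that dropFirst() removes a whole Character, which is what . would consume in this mode.

```swift
// "éclair" spelled two ways: with a combining accent and with a precomposed é.
let decomposed = "e\u{301}clair"   // U+0065 U+0301 followed by "clair"
let precomposed = "\u{E9}clair"    // U+00E9 followed by "clair"

// String equality uses canonical equivalence, so the two spellings are equal.
let sameText = decomposed == precomposed

// Both spellings have six Characters (extended grapheme clusters).
let characterCount = decomposed.count

// dropFirst() removes one whole Character, i.e. the "e" together with its accent.
let rest = String(decomposed.dropFirst())
```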
Degenerate grapheme clusters are a nuanced topic, but a good rule is that if the original input did not contain any degenerates, no "ordinary" operation on String would produce one. You'd have to drop down to a scalar or lower view or otherwise request sub-grapheme-cluster processing. For example, this means that while we may not require a grapheme break to be present between two adjacent scalars in a regex, we would require a grapheme break around any captured subpatterns (including the overall match), at least when in grapheme-semantic mode.
String also provides a UnicodeScalarView and regex can operate over that using scalar semantics, which is commonly referred to as "Unicode mode" in other engines. There, . corresponds to unicodeScalars.dropFirst() and comparison uses the raw value. This can be presented as API in a variety of ways, for example:
myString.unicodeScalars.firstMatch(of: regex)
myString.firstMatch(of: regex.scalarSemantic)
myString.firstMatch(of: regex, options: .scalarSemantic)
These are just different ways to surface the semantic modes as API.
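The difference between the two models can be observed on the string views themselves, independent of whichever API spelling ends up being used. This is only a sketch over String.UnicodeScalarView with illustrative variable names; it does not call any of the candidate Regex APIs listed above.

```swift
let combining = "e\u{301}clair"   // e + U+0301 COMBINING ACUTE ACCENT
let single = "\u{E9}clair"        // precomposed U+00E9

// Grapheme semantics: canonical equivalence makes these equal.
let equalAsStrings = combining == single

// Scalar semantics: comparison uses raw scalar values, so they differ.
let equalAsScalars = combining.unicodeScalars.elementsEqual(single.unicodeScalars)

// Advancing by one scalar stops between the "e" and its accent,
// so the next scalar seen is the combining mark itself.
let afterOneScalar = combining.unicodeScalars.dropFirst()
let nextScalarValue = afterOneScalar.first?.value
```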
Another thing discussed would be a so-called byte-semantic mode, corresponding to String.UTF8View for validly encoded content and perhaps including other input sources permitting invalidly encoded content. There, ASCII literals would be interpreted as their encoded values, non-ASCII literals as a series of their UTF-8 encoded values, and byte literals would be used to match everything else.
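As a rough illustration of what byte semantics would mean for literals, the sketch below lowers literals to their UTF-8 bytes and searches a byte buffer that is not valid UTF-8. The helper firstByteRange(of:in:) is hypothetical and written here only for the example; it is a naive search, not anything from the proposals.

```swift
// Hypothetical helper: naive search for a byte sequence inside a byte buffer.
func firstByteRange(of needle: [UInt8], in haystack: [UInt8]) -> Range<Int>? {
    guard !needle.isEmpty, needle.count <= haystack.count else { return nil }
    for start in 0...(haystack.count - needle.count) {
        let candidate = start..<(start + needle.count)
        if haystack[candidate].elementsEqual(needle) {
            return candidate
        }
    }
    return nil
}

// An ASCII literal lowers to a single byte; a non-ASCII literal lowers to
// the sequence of its UTF-8 code units.
let asciiLiteral = Array("a".utf8)          // [0x61]
let nonASCIILiteral = Array("\u{E9}".utf8)  // [0xC3, 0xA9]

// Input that a String could not hold as-is: 0xFF never appears in valid UTF-8.
let input: [UInt8] = [0xFF] + Array("caf\u{E9}".utf8)

let accentRange = firstByteRange(of: nonASCIILiteral, in: input)
// A "byte literal" matches content that has no textual spelling.
let invalidByteRange = firstByteRange(of: [0xFF], in: input)
```

Searching [UInt8] rather than String.UTF8View is a deliberate choice for the sketch: a String cannot hold the 0xFF byte, so invalidly encoded input needs some other source type.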
All of this is currently being developed as part of the Unicode for String Processing proposal. Unfortunately I do not know the current status or details of that proposal, especially regarding byte-semantic mode. @nnnnnnnn, anything to clarify here?