Pattern matching in Unicode text is difficult, which is why we introduced the new Regex APIs. I believe they are the recommended way to perform this kind of processing. For example, we have CharacterClass:
A character class can represent individual characters, a group of characters, the set of characters that match some set of criteria, or a set algebraic combination of all of the above.
The regex builder can create character classes for you if you supply a closed range, and you can tell the resulting regex to match at the Unicode scalar level:
import RegexBuilder

let isAllowedString = Regex {
    Anchor.startOfSubject
    OneOrMore {
        ChoiceOf {
            "a"..."z"
            "="
            "\u{007F}"..."\u{009F}"
        }
    }
    Anchor.endOfSubject
}.matchingSemantics(.unicodeScalar)

func check(_ str: String) {
    if str.wholeMatch(of: isAllowedString) != nil {
        print(str, "allowed")
    } else {
        print(str, "not allowed")
    }
}

check("a")      // allowed
check("hell=o") // allowed
check("å")      // not allowed
check("á")      // not allowed
check("α")      // not allowed
check("hEll=o") // not allowed
check("A")      // not allowed
check("9")      // not allowed
Alternatively, you can express your pattern using higher-level text characteristics as defined by Unicode. For example, if you want to allow the lowercase letter a plus any combining characters, you can use the .lowercaseLetter general category:
let isAllowedString = Regex {
    Anchor.startOfSubject
    OneOrMore {
        ChoiceOf {
            CharacterClass.generalCategory(.lowercaseLetter) // <-----
            "="
            "\u{007F}"..."\u{009F}"
        }
    }
    Anchor.endOfSubject
}.matchingSemantics(.unicodeScalar)

check("a")      // allowed
check("hell=o") // allowed
check("å")      // allowed <---
check("á")      // allowed <---
check("α")      // allowed <---
check("hEll=o") // not allowed
check("A")      // not allowed
check("9")      // not allowed
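To see why those results come out the way they do, you can ask the standard library for the general category of each scalar directly. This is only a small sketch using Unicode.Scalar.Properties; the helper name isLowercaseLetter is made up for illustration and is not part of any API.

```swift
// Hypothetical helper (not a library API): reports whether a single
// Unicode scalar has the general category "Lowercase Letter" (Ll).
func isLowercaseLetter(_ scalar: Unicode.Scalar) -> Bool {
    scalar.properties.generalCategory == .lowercaseLetter
}

print(isLowercaseLetter("\u{0061}")) // a - true
print(isLowercaseLetter("\u{00E5}")) // å (precomposed) - true
print(isLowercaseLetter("\u{00E1}")) // á (precomposed) - true
print(isLowercaseLetter("\u{03B1}")) // α - true
print(isLowercaseLetter("\u{0041}")) // A - false (uppercase letter)
print(isLowercaseLetter("\u{0039}")) // 9 - false (decimal number)
```

Note that the accented letters here are written as single precomposed scalars, which is why they pass a scalar-level check.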
This creates an interesting issue - let's check our old friend, é, and whether both precomposed and decomposed forms are accepted:
check("\u{00E9}") // precomposed - allowed
check("e\u{0301}") // decomposed - not allowed (!)
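They are not! Because we've applied scalar semantics to the entire pattern, the combining acute accent (U+0301) is matched as its own scalar, and it is not a lowercase letter.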
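If it helps, here is a rough illustration of what the two forms look like at each level, using only standard library String and Unicode.Scalar properties. The variable names are just for this sketch.

```swift
let precomposed = "\u{00E9}"  // é as a single scalar
let decomposed = "e\u{0301}"  // e followed by COMBINING ACUTE ACCENT

// String comparison uses canonical equivalence, so the two are equal,
// and each is a single Character (grapheme cluster).
print(precomposed == decomposed)         // true
print(precomposed.count)                 // 1
print(decomposed.count)                  // 1

// At the scalar level they differ: one scalar versus two.
print(precomposed.unicodeScalars.count)  // 1
print(decomposed.unicodeScalars.count)   // 2

// The second scalar of the decomposed form is a nonspacing mark,
// not a lowercase letter, so a scalar-level .lowercaseLetter class
// does not match it.
let accent = Array(decomposed.unicodeScalars)[1]
print(accent.properties.generalCategory == .lowercaseLetter) // false
print(accent.properties.generalCategory == .nonspacingMark)  // true
```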
No matter, we can fix this - by composing scalar-level patterns with character-level patterns:
let isAllowedString = Regex {
    Anchor.startOfSubject
    OneOrMore {
        ChoiceOf {
            // Grapheme cluster semantics.
            CharacterClass.generalCategory(.lowercaseLetter)
            "="
            // Additional character classes using scalar semantics.
            Regex {
                ChoiceOf {
                    "\u{007F}"..."\u{009F}"
                }
            }.matchingSemantics(.unicodeScalar)
        }
    }
    Anchor.endOfSubject
}

check("\u{00E9}")  // precomposed - allowed
check("e\u{0301}") // decomposed - allowed <---
In your particular example, U+007F-U+009F are control characters, so I'm pretty sure they never compose with anything, and matching them at scalar or grapheme cluster level doesn't matter. But what I'm trying to show is that the new Regex APIs offer some powerful tools for pattern matching in Unicode text, and that they compose so you can express even complex patterns.
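As a quick sanity check on the claim that the whole U+007F-U+009F range consists of control characters, something like the following should work. It is only a sketch using Unicode.Scalar.Properties, and the names controlRange and scalars are made up for this example.

```swift
// Build every scalar in U+007F...U+009F. The UInt32 initializer is
// failable, so compactMap drops any value that is not a valid scalar
// (none are dropped in this range).
let controlRange: ClosedRange<UInt32> = 0x7F...0x9F
let scalars = controlRange.compactMap { Unicode.Scalar($0) }

print(scalars.count) // 33

// Every scalar in the range has the general category "Control" (Cc).
let allControl = scalars.allSatisfy {
    $0.properties.generalCategory == .control
}
print(allControl) // true
```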
SE-0363: Unicode for String Processing has more information about CharacterClass, including some of the nuances when expressing character classes using ranges.