When I use word boundaries in a regular expression pattern, words containing a quote are lost.
For example, with the following code in an Xcode playground:
import Foundation

var pattern = "\\b\\w+\\b"
var sentence = "Let's go!"
print("sentence: \(sentence)")
print("pattern: \"\(pattern)\"")
var range = NSRange(sentence.startIndex..., in: sentence)
var regex = try! NSRegularExpression(pattern: pattern, options: [.useUnicodeWordBoundaries])
// `matches` has to be declared before it can be reassigned further down.
var matches = regex.matches(in: sentence, options: [], range: range)
matches.forEach({ match in
    guard let subrange = Range(match.range(at: 0), in: sentence) else {
        return
    }
    print(sentence[subrange])
})

pattern = "\\w+"
print("pattern: \"\(pattern)\"")
regex = try! NSRegularExpression(pattern: pattern, options: [.useUnicodeWordBoundaries])
matches = regex.matches(in: sentence, options: [], range: range)
matches.forEach({ match in
    guard let subrange = Range(match.range(at: 0), in: sentence) else {
        return
    }
    print(sentence[subrange])
})
With the pattern \b\w+\b, the "Let's" part of the sentence is lost; with the pattern \w+, it is not. I don't understand why "Let's" is lost with the first pattern. I have no such problem with Python regexes.
(macOS 11.6 20G16, Xcode 12.5.1 12E507, Swift 5.4.2)
cukr
You opted in to the useUnicodeWordBoundaries option, which uses Unicode rules for word breaking. One of those rules says not to break a word at an ASCII apostrophe between two letters; see UAX #29: Unicode Text Segmentation.
If you want the apostrophe to be treated as a word boundary, don't pass the useUnicodeWordBoundaries option.
If you want the apostrophe to be matched together with \w, use a different pattern:
var pattern = "\\b[\\w']+\\b"
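To make the difference concrete, here is a minimal sketch that runs the patterns from this thread against the same sentence, with and without the option. The helper name `words(in:pattern:options:)` is made up for this illustration and is not part of Foundation.

```swift
import Foundation

// Hypothetical helper (not from the thread): returns every match of
// `pattern` in `text` as an array of strings.
func words(in text: String,
           pattern: String,
           options: NSRegularExpression.Options = []) -> [String] {
    // Return an empty array for an invalid pattern instead of crashing.
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return []
    }
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: range).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

let sentence = "Let's go!"

// Default boundaries: \b sits at every transition between \w and \W,
// so the apostrophe splits the word.
print(words(in: sentence, pattern: "\\b\\w+\\b"))

// UAX #29 boundaries: there is no boundary on either side of the
// apostrophe, so \w+ cannot end after "Let" or start at "s".
print(words(in: sentence, pattern: "\\b\\w+\\b",
            options: [.useUnicodeWordBoundaries]))

// UAX #29 boundaries with the apostrophe allowed inside the match.
print(words(in: sentence, pattern: "\\b[\\w']+\\b",
            options: [.useUnicodeWordBoundaries]))
```

The second call is the case from the question: neither "Let" nor "s" begins and ends on a Unicode word boundary, so only "go" is left.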
Thank you for this information. I'll adapt my code.
Jon_Shier
(Jon Shier)
Unless you need the regex here, or are on a non-Apple platform, I'd suggest using NLTokenizer instead of a regex. It should be faster and more accurate, it supports multiple languages, and it is easier to use.
import NaturalLanguage

let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = yourString
tokenizer.enumerateTokens(in: yourString.startIndex..<yourString.endIndex) { range, attributes in
    // Do something.
    return true
}
I don't need the regex. I just want to play with regexes and Swift.
