Some words are lost with regex.matches when the pattern uses word boundaries

smuroftfiws · September 20, 2021, 10:12am

When I use word boundaries in a regular expression pattern, words containing a quote are lost.

For example, with the following code in an Xcode playground:

import Foundation

var pattern = "\\b\\w+\\b"
var sentence = "Let's go!"
print("sentence: \(sentence)")
print("pattern: \"\(pattern)\"")
var range = NSRange(sentence.startIndex..., in: sentence)
var regex = try! NSRegularExpression(pattern: pattern, options: [.useUnicodeWordBoundaries])
var matches = regex.matches(in: sentence, options: [], range: range)
matches.forEach({ match in
    guard let subrange = Range(match.range(at: 0), in: sentence) else {
        return
    }
    print(sentence[subrange])
})
pattern = "\\w+"
print("pattern: \"\(pattern)\"")
regex = try! NSRegularExpression(pattern: pattern, options: [.useUnicodeWordBoundaries])
matches = regex.matches(in: sentence, options: [], range: range)
matches.forEach({ match in
    guard let subrange = Range(match.range(at: 0), in: sentence) else {
        return
    }
    print(sentence[subrange])
})

With the pattern \b\w+\b, the "Let's" part of the sentence is lost, but not with the pattern \w+. I don't understand why "Let's" is lost with the first pattern. I have no such problem with Python regex.

(macOS 11.6 20G16, Xcode 12.5.1 12E507, Swift 5.4.2)

cukr · September 20, 2021, 11:57am

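You opted in to the useUnicodeWordBoundaries option, which uses Unicode rules for word breaking. One of those rules says that a word shouldn't be broken when there's an ASCII quote between two letters (see UAX #29: Unicode Text Segmentation). With that option, there is no boundary on either side of the apostrophe in "Let's", so \b\w+\b can match neither "Let" nor "s".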

If you want the apostrophe to be treated as a word boundary, don't pass the useUnicodeWordBoundaries option.
If you want the apostrophe to be matched together with \w, use a different pattern:
var pattern = "\\b[\\w']+\\b"
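
To compare the two options side by side, here is a minimal sketch. The helper `words(in:pattern:options:)` is a hypothetical name introduced only for this illustration, not something from the code above; it just wraps the same NSRegularExpression calls used in the question.

```swift
import Foundation

// Hypothetical helper (not part of the original post): returns every
// substring of `text` matched by `pattern`.
func words(in text: String,
           pattern: String,
           options: NSRegularExpression.Options = []) -> [String] {
    // try! is tolerable here only because the patterns are hard-coded literals.
    let regex = try! NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: range).compactMap { match in
        // Convert the NSRange back to a String range before slicing.
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

let sentence = "Let's go!"

// Option 1: default boundaries, so the apostrophe separates two words.
print(words(in: sentence, pattern: "\\b\\w+\\b"))

// Option 2: Unicode boundaries, with the apostrophe allowed inside a word.
print(words(in: sentence,
            pattern: "\\b[\\w']+\\b",
            options: [.useUnicodeWordBoundaries]))
```

Which one fits depends on whether you want contractions such as "Let's" kept whole or split at the apostrophe.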

smuroftfiws · September 20, 2021, 12:12pm

Thank you for this information. I'll adapt my code.

Jon_Shier · September 20, 2021, 3:29pm

Unless you need the regex, or are on non-Apple platforms, I'd suggest using NLTokenizer instead. It should be faster and more accurate, it supports multiple languages, and it is easier to use.

import NaturalLanguage

let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = yourString
tokenizer.enumerateTokens(in: yourString.startIndex..<yourString.endIndex) { range, attributes in
  // Do something.
  return true
}

smuroftfiws · September 20, 2021, 5:35pm

I don't need the regex. I just want to play with regexes and Swift.