False match reported by NSRegularExpression.firstMatchInString()

Hi Philippe, other interested people:

Here is the detailed description of an apparent bug in NSRegularExpression along with a possible solution. Request you to comment on the solution, in the context of correctness and performance.

This is long - please bear with me! Alternatively, we could discuss this over a pull request if you concur.

--->

The following test case has been borrowed from some of those tests in TestNSRegularExpression that aren’t exercised (test_complexRegularExpression) as of today:

import Foundation

let searchStr = "This this is the theway."

let testRegex = try NSRegularExpression(pattern: "\b(th[a-z]+) \1\b", options: )

let fm = testRegex.firstMatchInString(searchStr, options: .WithTransparentBounds, range: NSMakeRange(0,20))

let str = NSString.init(string: searchStr)

if let match = fm {

print("Test failed")

print(str.substringWithRange(match.range))

} else {

print("Test passed")

}

The test fails on Linux - a false match is reported. The substring “the the” matches pattern “\b(th[a-z]+) \1\b” which is wrong because the second “the” does not occur on a word boundary. Note that we are using the option: WithTransparentBounds. This means the matcher can look beyond the search range, for word boundaries.

The question is why the word boundary metacharacter at the end of the patter isn’t being honoured. I studied the functions _CFRegularExpressionEnumerateMatchesInString() and prepareRegularExpression() from CFRegularExpression.c and these are my findings that provide an explanation for a possible reason:

  1. We first try to get the search text - a UniChar* - using CFStringGetCharactersPtr(). I guess this is done to try improve performance. We copy the entire search text and set regionStart and regionLimit to the search range.

  2. Alternatively, if CFStringGetCharactersPtr() fails, we try to fill the UniChar buffer using CFStringGetCharacters(). Here we try to reduce the size of the search text so that it matches the search range. We use an “enclosingRange” for this. For “transparentBounds” we use the entire text. For nonAnchoringBounds we just take the searchRange plus one character to the left (to match ^) plus one to the right (to match $) : enclosingRange = range;

if (transparentBounds) {

enclosingRange = CFRangeMake(0, length);

} else if (nonAnchoringBounds) {

if (enclosingRange.location > 0) {

    enclosingRange.location--;

    enclosingRange.length++;

}

if (enclosingRange.location + enclosingRange.length < length) enclosingRange.length++;

}

We then set regionStart and regionLimit to the search range.

  1. Further we set the search text using the ICU function uregex_setText() like this:

uregex_setText(regex, (const UChar *)stringBuffer, (int32_t)regionLimit, &errorCode);

Note that we use “regionLimit” for the “textLength”, which seems questionable. This truncates the search text down to a substring matching the specified search range. So, in the above case where the search text is : "This this is the theway." , using a search range of {0,20} the search text that we actually pass to ICU is “This this is the the” which matches “\b(th[a-z]+) \1\b”. Though this may not be a problem in most searches, the WithTransparentBounds option fails to take effect.

Proposed solution

···

————————

The straightforward solution is to simply set the length of the search text to the actual length in prepareRegularExpression() of the search string as follows:

**int32_t textLength = length; //proposed fix**

if (range.location + range.length <= INT_MAX) stringBuffer = (UniChar *)CFStringGetCharactersPtr(string);

if (stringBuffer) {

regionStart = (int64_t)range.location;

regionLimit = (int64_t)(range.location + range.length);

*offset = 0;

} else {

enclosingRange = range;

if (transparentBounds) {

enclosingRange = CFRangeMake(0, length);

} else if (nonAnchoringBounds) {

if (enclosingRange.location > 0) {

enclosingRange.location--;

enclosingRange.length++;

}

if (enclosingRange.location + enclosingRange.length < length) enclosingRange.length++;

}

regionStart = (int64_t)(range.location - enclosingRange.location);

regionLimit = (int64_t)((range.location + range.length) - enclosingRange.location);

*offset = enclosingRange.location;

if (enclosingRange.length <= STACK_BUFFER_SIZE) {

stringBuffer = stackBuffer;

if (enclosingRange.length > 0) {

CFStringGetCharacters(string, enclosingRange, stringBuffer);

}

} else {

stringBuffer = (UniChar *)malloc(sizeof(UniChar) * enclosingRange.length);

if (stringBuffer) {

CFStringGetCharacters(string, enclosingRange, stringBuffer);

*bufferToFree = stringBuffer;

}

}

textLength = enclosingRange.length; //proposed fix

}

if (stringBuffer) {

regex = checkOutRegularExpression(internal, checkout, checkedOutRegex);

uregex_setText(regex, (const UChar *)stringBuffer, textLength, &errorCode); //proposed fix

}

Thanks!

Pushkar N Kulkarni,

IBM Runtimes

"Any sufficiently advanced technology is indistinguishable from magic." - Arthur Clarke

Hi Pushkar,

Yes, let’s get this into a PR and we can finish reviewing it there. Thanks!

- Tony

···

On Mar 3, 2016, at 12:23 PM, Pushkar N Kulkarni via swift-corelibs-dev <swift-corelibs-dev@swift.org> wrote:

Hi Philippe, other interested people:

Here is the detailed description of an apparent bug in NSRegularExpression along with a possible solution. Request you to comment on the solution, in the context of correctness and performance.

This is long - please bear with me! Alternatively, we could discuss this over a pull request if you concur.

--->

The following test case has been borrowed from some of those tests in TestNSRegularExpression that aren’t exercised (test_complexRegularExpression) as of today:

import Foundation

let searchStr = "This this is the theway."

let testRegex = try NSRegularExpression(pattern: "\\b(th[a-z]+) \\1\\b", options: )

let fm = testRegex.firstMatchInString(searchStr, options: .WithTransparentBounds, range: NSMakeRange(0,20))

let str = NSString.init(string: searchStr)

if let match = fm {

    print("Test failed")

    print(str.substringWithRange(match.range))

} else {

    print("Test passed")

}

The test fails on Linux - a false match is reported. The substring “the the” matches pattern “\b(th[a-z]+) \1\b” which is wrong because the second “the” does not occur on a word boundary. Note that we are using the option: WithTransparentBounds. This means the matcher can look beyond the search range, for word boundaries.

The question is why the word boundary metacharacter at the end of the patter isn’t being honoured. I studied the functions _CFRegularExpressionEnumerateMatchesInString() and prepareRegularExpression() from CFRegularExpression.c and these are my findings that provide an explanation for a possible reason:

1. We first try to get the search text - a UniChar* - using CFStringGetCharactersPtr(). I guess this is done to try improve performance. We copy the entire search text and set regionStart and regionLimit to the search range.

2. Alternatively, if CFStringGetCharactersPtr() fails, we try to fill the UniChar buffer using CFStringGetCharacters(). Here we try to reduce the size of the search text so that it matches the search range. We use an “enclosingRange” for this. For “transparentBounds” we use the entire text. For nonAnchoringBounds we just take the searchRange plus one character to the left (to match ^) plus one to the right (to match $) : enclosingRange = range;

if (transparentBounds) {

    enclosingRange = CFRangeMake(0, length);

} else if (nonAnchoringBounds) {

    if (enclosingRange.location > 0) {

        enclosingRange.location--;

        enclosingRange.length++;

    }

    if (enclosingRange.location + enclosingRange.length < length) enclosingRange.length++;

}

We then set regionStart and regionLimit to the search range.

3. Further we set the search text using the ICU function uregex_setText() like this:

  uregex_setText(regex, (const UChar *)stringBuffer, (int32_t)regionLimit, &errorCode);

Note that we use “regionLimit” for the “textLength”, which seems questionable. This truncates the search text down to a substring matching the specified search range. So, in the above case where the search text is : "This this is the theway." , using a search range of {0,20} the search text that we actually pass to ICU is “This this is the the” which matches “\b(th[a-z]+) \1\b”. Though this may not be a problem in most searches, the WithTransparentBounds option fails to take effect.

Proposed solution

————————

The straightforward solution is to simply set the length of the search text to the actual length in prepareRegularExpression() of the search string as follows:

    int32_t textLength = length; //proposed fix

    …

    if (range.location + range.length <= INT_MAX) stringBuffer = (UniChar *)CFStringGetCharactersPtr(string);

    if (stringBuffer) {

        regionStart = (int64_t)range.location;

        regionLimit = (int64_t)(range.location + range.length);

        *offset = 0;

    } else {

        enclosingRange = range;

        if (transparentBounds) {

            enclosingRange = CFRangeMake(0, length);

        } else if (nonAnchoringBounds) {

            if (enclosingRange.location > 0) {

                enclosingRange.location--;

                enclosingRange.length++;

            }

            if (enclosingRange.location + enclosingRange.length < length) enclosingRange.length++;

        }

        …

        regionStart = (int64_t)(range.location - enclosingRange.location);

        regionLimit = (int64_t)((range.location + range.length) - enclosingRange.location);

        *offset = enclosingRange.location;

        if (enclosingRange.length <= STACK_BUFFER_SIZE) {

            stringBuffer = stackBuffer;

            if (enclosingRange.length > 0) {

                CFStringGetCharacters(string, enclosingRange, stringBuffer);

            }

        } else {

            stringBuffer = (UniChar *)malloc(sizeof(UniChar) * enclosingRange.length);

            if (stringBuffer) {

                CFStringGetCharacters(string, enclosingRange, stringBuffer);

                *bufferToFree = stringBuffer;

            }

        }

        textLength = enclosingRange.length; //proposed fix

    }

    if (stringBuffer) {

        regex = checkOutRegularExpression(internal, checkout, checkedOutRegex);

        uregex_setText(regex, (const UChar *)stringBuffer, textLength, &errorCode); //proposed fix

    }

Thanks!

Pushkar N Kulkarni,
IBM Runtimes

"Any sufficiently advanced technology is indistinguishable from magic." - Arthur Clarke

_______________________________________________
swift-corelibs-dev mailing list
swift-corelibs-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-corelibs-dev