False match reported by NSRegularExpression.firstMatchInString()

So there is something else going on here that might be of interest: the objective-c version has two distinct build paths, one goes down a very similar path as the code snippet from CFRegularExpression and the other takes a “shorter” path to uregex_setUText so it might be that this code path has some dust. From what I can tell the major differential between those two code-paths from a portability standpoint is CFAttributedString support (which previously did not exist in linux builds but now does). I think that if you can make a provably correct unit test to show this range differential it would be interesting to test against the objective-c version as well to verify the behavior. In the mean time I can investigate on what other portions we might be missing. It is worth noting CFRegularExpression has never existed before swift-corelibs-foundation so there may very well be mistakes in my re-write.

For the record, the tests are nearly a direct transliteration of some of the objective-c tests we have; so those probably are “more correct”.

A few clarifications inline:

Hi Philippe, other interested people:

Here is the detailed description of an apparent bug in NSRegularExpression along with a possible solution. Request you to comment on the solution, in the context of correctness and performance.

This is long - please bear with me! Alternatively, we could discuss this over a pull request if you concur.

--->

The following test case has been borrowed from some of those tests in TestNSRegularExpression that aren’t exercised (test_complexRegularExpression) as of today:

import Foundation

let searchStr = "This this is the theway."

let testRegex = try NSRegularExpression(pattern: "\\b(th[a-z]+) \\1\\b", options: )

let fm = testRegex.firstMatchInString(searchStr, options: .WithTransparentBounds, range: NSMakeRange(0,20))

let str = NSString.init(string: searchStr)

if let match = fm {

    print("Test failed")

    print(str.substringWithRange(match.range))

} else {

    print("Test passed")

}

The test fails on Linux - a false match is reported. The substring “the the” matches pattern “\b(th[a-z]+) \1\b” which is wrong because the second “the” does not occur on a word boundary. Note that we are using the option: WithTransparentBounds. This means the matcher can look beyond the search range, for word boundaries.

The question is why the word boundary metacharacter at the end of the patter isn’t being honoured. I studied the functions _CFRegularExpressionEnumerateMatchesInString() and prepareRegularExpression() from CFRegularExpression.c and these are my findings that provide an explanation for a possible reason:

1. We first try to get the search text - a UniChar* - using CFStringGetCharactersPtr(). I guess this is done to try improve performance. We copy the entire search text and set regionStart and regionLimit to the search range.

It is always best practice to expect CFStringGetCharactersPtr to potentially fail. This is where either a) the encoding cannot support returning a direct pointer (e.g. tagged pointer strings) or if somehow the encoding cannot be represented as that storage type.

2. Alternatively, if CFStringGetCharactersPtr() fails, we try to fill the UniChar buffer using CFStringGetCharacters(). Here we try to reduce the size of the search text so that it matches the search range. We use an “enclosingRange” for this. For “transparentBounds” we use the entire text. For nonAnchoringBounds we just take the searchRange plus one character to the left (to match ^) plus one to the right (to match $) : enclosingRange = range;

if (transparentBounds) {

    enclosingRange = CFRangeMake(0, length);

} else if (nonAnchoringBounds) {

    if (enclosingRange.location > 0) {

        enclosingRange.location--;

        enclosingRange.length++;

    }

    if (enclosingRange.location + enclosingRange.length < length) enclosingRange.length++;

}

We then set regionStart and regionLimit to the search range.

3. Further we set the search text using the ICU function uregex_setText() like this:

  uregex_setText(regex, (const UChar *)stringBuffer, (int32_t)regionLimit, &errorCode);

Note that we use “regionLimit” for the “textLength”, which seems questionable. This truncates the search text down to a substring matching the specified search range. So, in the above case where the search text is : "This this is the theway." , using a search range of {0,20} the search text that we actually pass to ICU is “This this is the the” which matches “\b(th[a-z]+) \1\b”. Though this may not be a problem in most searches, the WithTransparentBounds option fails to take effect.

I have a feeling that this is potentially an artifact of not using uregex_setUText

Proposed solution

————————

The straightforward solution is to simply set the length of the search text to the actual length in prepareRegularExpression() of the search string as follows:

    int32_t textLength = length; //proposed fix

    …

    if (range.location + range.length <= INT_MAX) stringBuffer = (UniChar *)CFStringGetCharactersPtr(string);

    if (stringBuffer) {

        regionStart = (int64_t)range.location;

        regionLimit = (int64_t)(range.location + range.length);

        *offset = 0;

    } else {

        enclosingRange = range;

        if (transparentBounds) {

            enclosingRange = CFRangeMake(0, length);

        } else if (nonAnchoringBounds) {

            if (enclosingRange.location > 0) {

                enclosingRange.location--;

                enclosingRange.length++;

            }

            if (enclosingRange.location + enclosingRange.length < length) enclosingRange.length++;

        }

        …

        regionStart = (int64_t)(range.location - enclosingRange.location);

        regionLimit = (int64_t)((range.location + range.length) - enclosingRange.location);

        *offset = enclosingRange.location;

        if (enclosingRange.length <= STACK_BUFFER_SIZE) {

            stringBuffer = stackBuffer;

            if (enclosingRange.length > 0) {

                CFStringGetCharacters(string, enclosingRange, stringBuffer);

            }

        } else {

            stringBuffer = (UniChar *)malloc(sizeof(UniChar) * enclosingRange.length);

            if (stringBuffer) {

                CFStringGetCharacters(string, enclosingRange, stringBuffer);

                *bufferToFree = stringBuffer;

            }

        }

        textLength = enclosingRange.length; //proposed fix

    }

    if (stringBuffer) {

        regex = checkOutRegularExpression(internal, checkout, checkedOutRegex);

        uregex_setText(regex, (const UChar *)stringBuffer, textLength, &errorCode); //proposed fix

I might be worried that this has some other side effect that does not behave as the darwin version. Is there a way we can prove this by unit test that this hits all of the expected range extrema?

···

On Mar 3, 2016, at 12:23 PM, Pushkar N Kulkarni <pushkar.nk@in.ibm.com> wrote:

    }

Thanks!

Pushkar N Kulkarni,
IBM Runtimes

"Any sufficiently advanced technology is indistinguishable from magic." - Arthur Clarke

I have created a PR with the proposed fix - https://github.com/apple/swift-corelibs-foundation/pull/278

@Philippe - I have a few questions mentioned inline.

Regards,

Pushkar N Kulkarni,

IBM Runtimes

"Any sufficiently advanced technology is indistinguishable from magic." - Arthur Clarke

Thanks Philippe and Tony for your responses.

So there is something else going on here that might be of interest: the objective-c version has two distinct build paths, one goes down a very similar path as the code snippet from CFRegularExpression and the other takes a “shorter” path to uregex_setUText so it might be that this code path has some dust. From what I can tell the major differential between those two code-paths from a portability standpoint is CFAttributedString support (which previously did not exist in linux builds but now does). I think that if you can make a provably correct unit test to show this range differential it would be interesting to test against the objective-c version as well to verify the behavior. In the mean time I can investigate on what other portions we might be missing. It is worth noting CFRegularExpression has never existed before swift-corelibs-foundation so there may very well be mistakes in my re-write.

Could you please describe the two paths in greater detail ?

Could you please elaborate more on what is expected from the unit test? Are you talking about the range difference created by the two paths?

For the record, the tests are nearly a direct transliteration of some of the objective-c tests we have; so those probably are “more correct”.

A few clarifications inline:

Hi Philippe, other interested people:

Here is the detailed description of an apparent bug in NSRegularExpression along with a possible solution. Request you to comment on the solution, in the context of correctness and performance.

This is long - please bear with me! Alternatively, we could discuss this over a pull request if you concur.

--->

The following test case has been borrowed from some of those tests in TestNSRegularExpression that aren’t exercised (test_complexRegularExpression) as of today:

import Foundation

let searchStr = "This this is the theway."

let testRegex = try NSRegularExpression(pattern: "\b(th[a-z]+) \1\b", options: )

let fm = testRegex.firstMatchInString(searchStr, options: .WithTransparentBounds, range: NSMakeRange(0,20))

let str = NSString.init(string: searchStr)

if let match = fm {

print("Test failed")
print(str.substringWithRange(match.range))

} else {

print("Test passed")

}

The test fails on Linux - a false match is reported. The substring “the the” matches pattern “\b(th[a-z]+) \1\b” which is wrong because the second “the” does not occur on a word boundary. Note that we are using the option: WithTransparentBounds. This means the matcher can look beyond the search range, for word boundaries.

The question is why the word boundary metacharacter at the end of the patter isn’t being honoured. I studied the functions _CFRegularExpressionEnumerateMatchesInString() and prepareRegularExpression() from CFRegularExpression.c and these are my findings that provide an explanation for a possible reason:

  1. We first try to get the search text - a UniChar* - using CFStringGetCharactersPtr(). I guess this is done to try improve performance. We copy the entire search text and set regionStart and regionLimit to the search range.

It is always best practice to expect CFStringGetCharactersPtr to potentially fail. This is where either a) the encoding cannot support returning a direct pointer (e.g. tagged pointer strings) or if somehow the encoding cannot be represented as that storage type.

  1. Alternatively, if CFStringGetCharactersPtr() fails, we try to fill the UniChar buffer using CFStringGetCharacters(). Here we try to reduce the size of the search text so that it matches the search range. We use an “enclosingRange” for this. For “transparentBounds” we use the entire text. For nonAnchoringBounds we just take the searchRange plus one character to the left (to match ^) plus one to the right (to match $) : enclosingRange = range;

if (transparentBounds) {

enclosingRange = CFRangeMake(0, length);

} else if (nonAnchoringBounds) {

if (enclosingRange.location > 0) {
    enclosingRange.location--;
    enclosingRange.length++;
}
if (enclosingRange.location + enclosingRange.length < length) enclosingRange.length++;

}

We then set regionStart and regionLimit to the search range.

  1. Further we set the search text using the ICU function uregex_setText() like this:

uregex_setText(regex, (const UChar *)stringBuffer, (int32_t)regionLimit, &errorCode);

Note that we use “regionLimit” for the “textLength”, which seems questionable. This truncates the search text down to a substring matching the specified search range. So, in the above case where the search text is : "This this is the theway." , using a search range of {0,20} the search text that we actually pass to ICU is “This this is the the” which matches “\b(th[a-z]+) \1\b”. Though this may not be a problem in most searches, the WithTransparentBounds option fails to take effect.

I have a feeling that this is potentially an artifact of not using uregex_setUText

Proposed solution

————————

The straightforward solution is to simply set the length of the search text to the actual length in prepareRegularExpression() of the search string as follows:

**int32_t textLength = length; //proposed fix**

if (range.location + range.length <= INT_MAX) stringBuffer = (UniChar *)CFStringGetCharactersPtr(string);

if (stringBuffer) {

regionStart = (int64_t)range.location;

regionLimit = (int64_t)(range.location + range.length);

*offset = 0;

} else {

enclosingRange = range;

if (transparentBounds) {

enclosingRange = CFRangeMake(0, length);

} else if (nonAnchoringBounds) {

if (enclosingRange.location > 0) {

enclosingRange.location--;

enclosingRange.length++;

}

if (enclosingRange.location + enclosingRange.length < length) enclosingRange.length++;

}

regionStart = (int64_t)(range.location - enclosingRange.location);

regionLimit = (int64_t)((range.location + range.length) - enclosingRange.location);

*offset = enclosingRange.location;

if (enclosingRange.length <= STACK_BUFFER_SIZE) {

stringBuffer = stackBuffer;

if (enclosingRange.length > 0) {

CFStringGetCharacters(string, enclosingRange, stringBuffer);

}

} else {

stringBuffer = (UniChar *)malloc(sizeof(UniChar) * enclosingRange.length);

if (stringBuffer) {

CFStringGetCharacters(string, enclosingRange, stringBuffer);

*bufferToFree = stringBuffer;

}

}

textLength = enclosingRange.length; //proposed fix

}

if (stringBuffer) {

regex = checkOutRegularExpression(internal, checkout, checkedOutRegex);

uregex_setText(regex, (const UChar *)stringBuffer, textLength, &errorCode); //proposed fix

I might be worried that this has some other side effect that does not behave as the darwin version. Is there a way we can prove this by unit test that this hits all of the expected range extrema?

···

On Mar 3, 2016, at 12:23 PM, Pushkar N Kulkarni pushkar.nk@in.ibm.com wrote:

I have opened up 35 out of 37 tests in the test_complexRegularExpression() function in TestNSRegularExpression. These tests include most if the cases I could think of. Do you think these offer enough coverage? Or would you advise me to work on more?

}

Thanks!

Pushkar N Kulkarni,

IBM Runtimes

"Any sufficiently advanced technology is indistinguishable from magic." - Arthur Clarke

To: Pushkar N Kulkarni/India/IBM@IBMIN
From: Philippe Hausler
Sent by: phausler@apple.com
Date: 03/04/2016 02:10AM
Cc: swift-corelibs-dev@swift.org
Subject: Re: False match reported by NSRegularExpression.firstMatchInString()

-----phausler@apple.com wrote: -----