Pathological slow sorting due to "infection" from NSString?

I've been working on this for several hours, and I still don't believe it, but I'm pretty sure of what I've found. It's hard to post a complete example, but I'll distill it down to the relevant parts.

   // read in about 6K different file names, all of the form
   //     /some/long/prefix/<fileNameA>
   //     /some/long/prefix/<fileNameB> (etc)
   let pathNames: [String] = <READ IN A WHOLE BUNCH OF FILE NAMES>

   // dirName has the directory prefix of each of the read in files,
   // with a trailing slash.
   let dirName "/some/long/prefix/"

   let slowFileNames = pathNames.map { ($0 as NSString).lastPathComponent }
   let fastFileNames = pathNames.map { $0.replacingOccurrencesOf(dirName, "") }

   // both of the above now look like [<fileNameA>, <fileNameB>, ...]
   for i in 0..<slowFileNames.count {
      if slowFileNames[i] != fastFileNames[i] {
          fatalError("horrible death")
     }
   }

  // ok, we got here, so i'm not an idiot: the arrays are the same.
  // note: both of the arrays are of type [String].  slowFileNames is NOT [NSString].
  let x1 = fastFileNames.sorted()         // takes 0.02 seconds
  let x2 = slowFileNames.sorted()          // takes 0.82 seconds.  WTF??

The only difference in the arrays fastFileNames and slowFileNames is how they were built, but at any level I can measure, they are identical. However, one sorts 40 times faster than the other.

I can reproduce this all day long, on iOS or macOS with Xcode 11.3.1.

The clear implication is that at a functional level, the two arrays have the same content, but under the hood, the strings in slowFileNames have some awfullness that the strings in fastFileNames do not.

Why is this is happening? Can someone suggest how I could poke at the internals of the strings in the debugger or infer in some other way what the difference is?

This is frightening, that merely messing with the NSString API can hurt you this badly, since there are lots of spots we depend on it.

1 Like

Native String instances are UTF‐8 and fast. But a String instance can also have other memory representations under the hood. Bridged NSString instances are the most notorious (which are natively UTF‐16).

If you are going back and forth between the String and NSString, you want the conversion to be as cheap as possible, not the resulting memory representation. That is what String attempts to do by default. When you want the opposite trade off, then you can use makeContiguousUTF8() to force a more‐expensive, but one‐time conversion to a native, fast String.

However in your toy example, it may not help much, since you are only doing a single comparison anyway. Basically the difference you are observing is the cost of converting from UTF‐8 to UTF‐16 and back exactly once, including all the memory allocation that implies. [Edit: The scroll hadn’t shown me the last two lines with sorted().]

1 Like

makeContiguousUTF8() worked like a charm.
In fact, the sorting results were even 2x faster.

Moral: any place I make use of NSString in my code, I shall be certain to convert back to a fast String, before storing it. thanks so much!

btw If you’re curious how this works internally, the UTF-8 String Swift blog post is most enlightening.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

4 Likes