Optimal String conversion to/from UTF-8 w/o null-termination?

snej · November 3, 2017, 7:41pm

I’m working with a C API that represents strings as UTF-8 data tagged with a length but **without a trailing NUL byte**. In other words, its string type is basically a tuple {const char*, size_t}. I need to convert this representation to and from Swift 4 strings.

This needs to be efficient, as these calls will occur in some areas of my project that are known to be performance-critical. (Equivalent conversions in my Obj-C code have already shown up as hot-spots and been carefully optimized.)

For String-to-UTF-8, I’m using String.withCString():
  _ = str.withCString { bytes in c_function(bytes, strlen(bytes)) }
An alternative is
  let bytes = [UInt8](str.utf8)
  c_function(bytes, bytes.count)
Any idea which of these is more efficient? The former has to call strlen, but I suspect the latter may incur more heap allocation.
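
For comparison, here is a minimal sketch of the two approaches side by side, plus a third that later Swift releases offer (`String.withUTF8`, which as far as I know arrived in Swift 5.1 and so was not available when this was asked). The `c_function` below is a hypothetical stand-in, assumed to import as taking `(const void*, size_t)`; the real API's signature may differ. The first variant takes the length from `utf8.count` rather than `strlen`, which also avoids stopping early at an embedded NUL byte.

```swift
// Hypothetical stand-in for the C function. It is assumed to import into Swift
// as taking (const void*, size_t); here it just sums the bytes it is handed.
func c_function(_ bytes: UnsafeRawPointer?, _ length: Int) -> Int {
    let buffer = UnsafeRawBufferPointer(start: bytes, count: length)
    return buffer.reduce(0) { $0 + Int($1) }
}

// Variant 1: withCString, with the length taken from the UTF-8 view
// instead of strlen.
func callViaCString(_ str: String) -> Int {
    let length = str.utf8.count
    return str.withCString { c_function($0, length) }
}

// Variant 2: copy the UTF-8 bytes into an array (one heap allocation) and
// pass the array directly; no `&` is needed for a read-only pointer parameter.
func callViaArray(_ str: String) -> Int {
    let bytes = [UInt8](str.utf8)
    return c_function(bytes, bytes.count)
}

// Variant 3 (assumes Swift 5.1 or later): withUTF8 hands over a buffer of the
// string's UTF-8 bytes. It is a mutating method, hence the local copy.
func callViaWithUTF8(_ str: String) -> Int {
    var copy = str
    return copy.withUTF8 { c_function(UnsafeRawPointer($0.baseAddress), $0.count) }
}
```

Which of these is fastest depends on the Swift version and on what the strings look like, so this is only a starting point for measurement.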

For UTF-8-to-String I use this, where `stringPointer` is an UnsafeRawPointer and stringLen is an Int:
  let data = Data(bytes: stringPointer, count: stringLen)
  return String(data: data, encoding: String.Encoding.utf8)
I’m unhappy about this because it incurs both heap allocation and copying the string bytes. But Data doesn’t seem to have the “noCopy” options that NSData does. Any way to pass the bytes directly to String without an intermediate copy?
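
On the “noCopy” point: Foundation's `Data` does have an initializer that wraps existing memory, `Data(bytesNoCopy:count:deallocator:)`. The sketch below uses it with the `.none` deallocator so the memory stays owned by the C side. The function name `makeString(noCopyFrom:count:)` is made up for illustration. This only removes the intermediate copy into `Data`; the `String` initializer still copies the bytes into its own storage.

```swift
import Foundation

// Builds a String from `count` UTF-8 bytes at `pointer` without first copying
// them into a Data. Returns nil if the bytes are not valid UTF-8.
func makeString(noCopyFrom pointer: UnsafeRawPointer, count: Int) -> String? {
    // Data wants a mutable pointer; the bytes are never written through it.
    let mutablePointer = UnsafeMutableRawPointer(mutating: pointer)
    // `.none` tells Data not to free the memory, which the caller still owns.
    let data = Data(bytesNoCopy: mutablePointer, count: count, deallocator: .none)
    return String(data: data, encoding: .utf8)
}

// Usage with bytes borrowed from an array for the duration of the call.
let rawBytes: [UInt8] = Array("héllo".utf8)
let decoded = rawBytes.withUnsafeBytes {
    makeString(noCopyFrom: $0.baseAddress!, count: $0.count)
}
```

The pointer must stay valid until the call returns, since the `Data` does not own the bytes.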

—Jens

PS: I’m aware this is an FAQ, but I’ve already put in time searching. Most of the hits are obsolete because the damn String API keeps changing, or else they assume NUL-terminated C strings; and the remainder don’t consider performance.

Ryan_Walklin · November 4, 2017, 12:10am

Why not just profile it? Set up a loop of 100,000 or so with each method and time it.

Ryan

_______________________________________________
swift-users mailing list
swift-users@swift.org
https://lists.swift.org/mailman/listinfo/swift-users

eskimo · November 6, 2017, 2:58pm

You can do this with `UnsafeBufferPointer`. For example:

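import Foundation

extension String {
    // Decodes `count` UTF-8 bytes starting at `bytes`; fails if they are not valid UTF-8.
    init?(bytes: UnsafePointer<UInt8>, count: Int) {
        // Wrap the raw pointer and length without copying the bytes.
        let bp = UnsafeBufferPointer(start: bytes, count: count)
        self.init(bytes: bp, encoding: .utf8)
    }
}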

I don’t know if this is faster.
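
A variant that stays within the standard library is `String(decoding:as:)`, sketched below. The helper name `decodeUTF8` is made up for illustration. Note the difference in behaviour: this initializer never fails, and instead replaces invalid UTF-8 sequences with U+FFFD, so it only suits callers that can accept repaired output. Whether it is faster is, again, something to measure.

```swift
// Decodes `count` UTF-8 bytes at `bytes` into a String using only the
// standard library. Invalid sequences become U+FFFD rather than causing failure.
func decodeUTF8(_ bytes: UnsafePointer<UInt8>, count: Int) -> String {
    let buffer = UnsafeBufferPointer(start: bytes, count: count)
    return String(decoding: buffer, as: UTF8.self)
}

// Usage with bytes borrowed from an array for the duration of the call.
let sampleBytes = Array("naïve".utf8)
let decodedSample = sampleBytes.withUnsafeBufferPointer {
    decodeUTF8($0.baseAddress!, count: $0.count)
}
```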

There are lots of different ways to achieve your two goals and I wouldn’t even start optimising this without a realistic model of what your strings look like in practice, and a performance test based on that model.

Share and Enjoy

On 3 Nov 2017, at 19:42, Jens Alfke via swift-users <swift-users@swift.org> wrote:

Any way to pass the bytes directly to String without an intermediate copy?

--
Quinn "The Eskimo!" <http://www.apple.com/developer/>
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

taylorswift · November 6, 2017, 5:21pm

Doesn’t the compiler like to optimize the loop out of the benchmarking code? I’ve always had a hard time writing benchmarks in Swift.


eskimo · November 6, 2017, 5:40pm

Yep. I would resolve this by reading the input from a file. For example, in the data-to-String case, you could:

1. Read lots of individual chunks of data from the file

2. Run each chunk through the `String` initialiser

3. Count how many chunks work and how many fail, so the output depends on the input

4. Print that number, which prevents the compiler optimising the whole thing away
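
The four steps above could be sketched roughly as follows. Everything here is illustrative: the function names are made up, the newline-delimited chunking is an assumption about the input file, and `Date` is a coarse timer that is only adequate for runs long enough to swamp its overhead.

```swift
import Foundation

// Step 1: split the file's bytes into chunks. Newline-delimited chunks are an
// assumption here; use whatever matches the real data.
func readChunks(atPath path: String) -> [[UInt8]] {
    guard let data = FileManager.default.contents(atPath: path) else { return [] }
    return data.split(separator: 0x0A).map { [UInt8]($0) }
}

// Steps 2 and 3: run each chunk through the initialiser and count how many
// chunks decode and how many fail.
func countDecodes(_ chunks: [[UInt8]]) -> (ok: Int, failed: Int) {
    var ok = 0
    var failed = 0
    for chunk in chunks {
        let string = chunk.withUnsafeBufferPointer {
            String(bytes: $0, encoding: .utf8)
        }
        if string != nil { ok += 1 } else { failed += 1 }
    }
    return (ok, failed)
}

// Step 4: time the run and print the counts, so the result depends on the
// input and the work cannot be discarded.
func runBenchmark(path: String) {
    let chunks = readChunks(atPath: path)
    let start = Date()
    let result = countDecodes(chunks)
    let elapsed = Date().timeIntervalSince(start)
    print("ok: \(result.ok), failed: \(result.failed), seconds: \(elapsed)")
}
```

Reading the file before starting the timer keeps I/O out of the measurement; swapping a different initialiser into `countDecodes` lets the candidates be compared on the same input.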

Share and Enjoy

On 6 Nov 2017, at 17:21, Kelvin Ma via swift-users <swift-users@swift.org> wrote:

doesn’t the compiler like to optimize the loop out of the benchmarking code?

--
Quinn "The Eskimo!" <http://www.apple.com/developer/>
Apple Developer Relations, Developer Technical Support, Core OS/Hardware