SE-0247: Contiguous Strings


(Douglas Gregor) #1

The review of SE-0247: Contiguous Strings begins now and runs through Monday, March 18th, 2019.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to me as the review manager via email or direct message on the forums. If you send me email, please put "SE-0247" somewhere in the subject line.

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

  • What is your evaluation of the proposal?
  • Is the problem being addressed significant enough to warrant a change to Swift?
  • Does this proposal fit well with the feel and direction of Swift?
  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Thank you for contributing to Swift!

Doug Gregor
Review Manager


Call for Users (and Authors): Offset indexing pitch
(Andreas Jönsson) #2

Effort: Quick reading.

Great addition!

Only thing I can think of is the difference in naming of the View’s ... contiguousStorage ... and the String’s isContiguous, but that probably makes sense and I can’t think of a way to improve it.

+1


(Johannes Weiss) #3
  • What is your evaluation of the proposal?

+1

  • Is the problem being addressed significant enough to warrant a change to Swift?

yes

  • Does this proposal fit well with the feel and direction of Swift?

yes

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

n/a

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I've hit many of the described problems myself in the development phase of SwiftNIO 2 and I believe that these additions will really make writing performance sensitive String code much more convenient.


(Lily Ballard) #4

This looks like exactly what I need today. Huge +1 on the general idea. I haven't had a chance to review the specifics yet though.


(Lily Ballard) #5

After having read the proposal:

Overall +1. However I would recommend isContiguousUTF8 and makeContiguousUTF8() as the names. I recognize that the Swift ABI blesses UTF-8, but the API shouldn't assume the reader understands the ABI. This also makes the connection to withUTF8 more obvious.


(Brent Royal-Gordon) #6

I might actually come at this from the other direction and rename withUTF8(_:) to withContiguousUTF8(_:). My main worry with this proposal is that users might see withUTF8(_:) and assume that it, like -[NSString utf8String], returns a null-terminated C string. The additional word in the name may give users enough pause for them to discover withCString(_:) and use it instead if appropriate.

Adding the word "contiguous" to the with function's name will suffice to connect it to the related members; you only need to worry about it being UTF-8 when you're actually working with the pointer, so it makes sense to only mention it on that method.


(Lily Ballard) #7

I disagree that this is sufficient. Going by the naming alone, isContiguous doesn't say it's contiguous UTF-8, just that it's contiguous, so there's no reason to believe that isContiguous implies that withContiguousUTF8 will be O(1).

As for withUTF8 vs withContiguousUTF8, the fact that it's an UnsafeBufferPointer already implies contiguous, so withContiguousUTF8 is redundant. As for null-termination, I'm not concerned about that. String does not vend utf8String, it vends utf8CString instead, where the "CString" part implies the null-termination. And similarly it already has withCString which gives you a null-terminated UTF-8 string. Which is to say, the existing precedent in the String API is methods/properties that give null-terminated buffers have "CString" in the name.


(Michael Ilseman) #8

Do you have any thoughts on the public var contiguousUTF8: String { get } as an alternative to makeContiguousUTF8(), and non-mutating withUTF8?


(Lily Ballard) #9

I'm concerned that this approach would make it too easy to have poor performance where you end up copying the same string repeatedly to make it contiguous each time you do an operation that wants to work with the UTF-8. Having the "convert to contiguous UTF-8" be mutating makes it obvious to the author that if they copy the string into a temporary variable in order to call a mutating method, then this won't affect the original String and therefore doing this multiple times in a row will have performance implications. This guides the author into checking isContiguous to avoid this cost if UTF-8 is an optimization instead of a requirement and repeated operations is a concern.


(Michael Ilseman) #10

That was definitely my initial feeling. Trying to explore this a little bit:

// Mutating withUTF8
func parse(_ str: String) -> T {
  var copy = str
  return copy.withUTF8(foo)
}
// Non-mutating withUTF8
func parse(_ str: String) -> T {
  return str.withUTF8(foo)
}

Both of these trigger a separate allocation and copy if str is not contiguous, and the mutating version is a little more annoying.

But, this is more fluid for a library that wants to propagate contiguity back:

// Mutating withUTF8
func parse(_ str: inout String) -> T {
  str.withUTF8(foo)
}
// Nonmutating withUTF8
func parse(_ str: inout String) -> T {
  str = str.contiguousUTF8
  str.withUTF8(foo)
}

Simple example:

// Mutating
func doThing() {
  var str = foo()
  str.makeContiguous()
  // ...  
}
// Non-mutating
func doThing() {
  let str = foo().contiguousUTF8
  // ...
}

Non-mutating lets us use a let binding if there are no further modifications.

Repeated access:

// Mutating
func doManyThings() {
  var str = foo()
  str.makeContiguous()
  bar(str) // does withUTF8 and doesn't take argument inout
  baz(str) // does withUTF8 and doesn't take argument inout
}
// Non-mutating
func doManyThings() {
  let str = foo().contiguousUTF8
  bar(str) // does withUTF8 and doesn't take argument inout
  baz(str) // does withUTF8 and doesn't take argument inout
}

Both have a performance trap: the former forgets makeContiguous and the latter forgets contiguousUTF8. Bar/baz will only operate on copies anyways, so the caller should establish contiguity. What do you feel is more/less obvious/dangerous between these two formulations?

(The good news at least is that isContiguous[UTF8] is available for usage in pre/post condition checking, even if it cannot be reflected well in the type system.)


(Lily Ballard) #11

In your final example, my concern with the non-mutating approach is bar and baz look like

func bar(str: String) {
    str.withUTF8 { /* ... */ }
}
func baz(str: String) {
    str.withUTF8 { /* ... */ }
}

It's easy to forget here that withUTF8 is potentially doing a copy. Even worse would be if one of them called str.withUTF8 twice without using .contiguousUTF8 first.

The mutating approach makes it slightly more awkward to work with, as you have to copy your string into a mutable variable (assuming it's not inout), but this awkwardness is the reminder that this is potentially mutating the string and therefore has performance implications.

/// - Complexity: O(1) if `str` is already contiguous,
///   otherwise incurs a string copy.
func bar(str: String) {
    var str = str
    str.withUTF8 { /* ... */ }
}

This approach also automatically solves the issue of calling withUTF8 twice without considering performance.

func baz(str: String) {
    var str = str
    str.withUTF8 { /* ... */ }
    str.withUTF8 {
        // this second call is free because the first
        // call made the string contiguous
    }
}

(Jonas) #12

Generally positive, a few questions/doubts:

  1. Proposal says makeContiguous on non-contiguous will invalidate pre-existing indices. A major use-case of the API in this proposal is ensuring bridged Cocoa strings, that you need to do heavy-duty work on, are contiguous.
    Typically when you work with Cocoa APIs, you also have some NSRange, or index that indicate a UTF-16 offset, that you need to convert to a String.Index. My question now is, when making a String contiguous from bridged Cocoa string, will the String also in the same pass build the UTF-16 "breadcrumbs" mentioned here: String’s ABI and UTF-8 ?
    If not, I believe there should be an option to do so, or the following would seemingly be inefficient:
    string.makeContiguous()
    let range = Range(nsRange, in: string)
  1. Not sure if makeContigous operation on a Substring ever makes sense? If not already contiguous that would cause a copy. Isn't it better that such a copy is done explicitly?
    let string = String(substring)
    string.makeContigous()
  1. Wouldn't a (very) performance-minded user interested in getting direct access to the underlying code units usually want to:
  • Check if the encoding UTF-8 -> access UTF-8 buffer
  • Check if the encoding UTF-16 -> access UTF-16 buffer, and perhaps directly transcode to UTF-8 in a destination buffer

(Lily Ballard) #13

Using withUTF8 is not unreasonable on a Substring, and if we have withUTF8 then we need makeContiguous to allow you to avoid performance problems when calling withUTF8 multiple times.

Yes, makeContiguous on a non-contiguous Substring would cause a copy, but then again, it would cause a copy on a non-contiguous String as well.