Why does String.CharacterView have reserveCapacity(_:)?


(^) #1

I’m wondering why the String.CharacterView structure has a
reserveCapacity(_:) member? And even more strangely, why String itself has
the same method?

It’s even weirder that String.UnicodeScalarView has this method, but it
reserves `n` `UInt8`s of storage, instead of `n` `UInt32`s of storage. Also,
why do String.UTF8View and String.UTF16View not have this method, when it
would make more sense for them to have it than for String itself and
String.CharacterView to have it?


(Brent Royal-Gordon) #2

> I’m wondering why the String.CharacterView structure has a reserveCapacity(_:) member?

Because it conforms to the RangeReplaceableCollection protocol, which requires `reserveCapacity(_:)`.

More broadly, because you can append characters to the collection, and so you might want to pre-size it to reduce the amount of reallocating you might need to do in the future.
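As a rough illustration of why the requirement lives on the protocol, here is a minimal sketch of generic code that pre-sizes any range-replaceable collection. It assumes Swift 4.1 or later, where `String` itself conforms to `RangeReplaceableCollection` (in Swift 3 the same call would go through `String.CharacterView`); the helper name `repeated(_:count:)` is made up for this example.

```swift
// Hypothetical helper: builds any range-replaceable collection by
// repeating one element, reserving storage before appending.
func repeated<C: RangeReplaceableCollection>(_ element: C.Element, count: Int) -> C {
    var result = C()
    // Protocol requirement. It is only a sizing hint and never changes
    // the collection's contents.
    result.reserveCapacity(count)
    for _ in 0..<count {
        result.append(element)
    }
    return result
}

let dashes: String = repeated("-", count: 5)
let zeros: [Int] = repeated(0, count: 3)
print(dashes)  // -----
print(zeros)   // [0, 0, 0]
```

The same generic function works for `String` and `Array` only because `reserveCapacity(_:)` is declared on the protocol rather than on individual types.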

> And even more strangely, why String itself has the same method?

Because it has duplicates of those `CharacterView` methods which don't address individual characters. (In Swift 4, it will be merged with CharacterView.)

> It’s even weirder that String.UnicodeScalarView has this method, but it reserves `n` `UInt8`s of storage, instead of `n` `UInt32`s of storage.

Because the views are simply different wrappers around a single underlying buffer type, which stores the string in 8-bit units (if all characters are ASCII) or 16-bit units (if some are non-ASCII). That means that `UnicodeScalarView` isn't backed by a UTF-32 buffer; it's backed by an ASCII or UTF-16 buffer, but it only generates and accepts indices corresponding to whole Unicode scalars, never the second half of a surrogate pair.
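To make the "different wrappers, one string" point concrete, here is a small sketch that measures the same string through each view. It assumes Swift 4 or later (in Swift 3, `s.count` would be spelled `s.characters.count`), and it says nothing about the internal storage layout, which differs between Swift versions.

```swift
// "a", U+00E9 (e with acute accent), and U+1F600, which lies outside
// the Basic Multilingual Plane and needs a UTF-16 surrogate pair.
let s = "a\u{E9}\u{1F600}"

print(s.count)                 // 3 Characters
print(s.unicodeScalars.count)  // 3 Unicode scalars
print(s.utf16.count)           // 4 UTF-16 code units (1 + 1 + 2)
print(s.utf8.count)            // 7 UTF-8 code units (1 + 2 + 4)
```

The scalar view reports 3 where the UTF-16 view reports 4 because it steps over the surrogate pair as one unit, so a "capacity of `n`" cannot mean the same amount of storage for every view.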

Why not allocate a larger buffer anyway? Most strings use no extraplanar characters, and many strings use only ASCII characters. (Even when the user works in a non-ASCII language, strings representing code, file system paths, URLs, identifiers, localization keys, etc. are usually ASCII-only.) By reserving only `n` `UInt8`s, Swift avoids wasting memory, at the cost of sometimes having to reallocate and copy the buffer when a string contains relatively rare characters. I believe Swift doubles the buffer size on each allocation, so we're talking no more than one reallocation for a non-ASCII string and two for an extraplanar string. That's quite acceptable.

> Also why String.UTF8View and String.UTF16View do not have this method, when it would make more sense for them to have it than for String itself and String.CharacterView to have it.

Because UTF8View and UTF16View are immutable. They don't conform to RangeReplaceableCollection and cannot be used to modify the string (since you could modify them to generate an invalid string).
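A short sketch of the difference, assuming Swift 4 or later: the scalar view can be mutated in place, while raw code units have to go through an initializer that validates them. The commented-out line is there only to mark what the code-unit views do not offer.

```swift
var s = "hi"

// UnicodeScalarView conforms to RangeReplaceableCollection, so it can
// be appended to directly; every scalar is valid on its own.
s.unicodeScalars.append("!")
// s.utf8.append(0x21)  // would not compile: UTF8View has no append(_:)
print(s)  // hi!

// Raw bytes go through a decoding initializer instead, which replaces
// ill-formed sequences with U+FFFD rather than storing them.
let valid = String(decoding: [0x68, 0x69] as [UInt8], as: UTF8.self)
let repaired = String(decoding: [0x68, 0xFF] as [UInt8], as: UTF8.self)
print(valid)                    // hi
print(repaired == "h\u{FFFD}")  // true
```

Routing bytes through `String(decoding:as:)` is what keeps an arbitrary byte such as `0xFF` from ever ending up inside a string's storage.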


On May 2, 2017, at 12:35 PM, Kelvin Ma via swift-users <swift-users@swift.org> wrote:

--
Brent Royal-Gordon
Architechies


(^) #3

Okay, I understand most of that, but I still feel it’s misleading to put
`reserveCapacity()` on `CharacterView` and `UnicodeScalarView`.
`reserveCapacity()` should live in a type where its meaning matches up with
the meaning of the `.count` property, ideally the `UTF8View`. Otherwise it
should at least be removed from `CharacterView` and `UnicodeScalarView` and
only live in the parent `String` type.


On Thu, May 4, 2017 at 6:33 AM, Brent Royal-Gordon <brent@architechies.com> wrote:
