We know that:
The cumbersome complexity of current Swift String handling
and programming is caused by the fact that Unicode characters
are stored and processed as streams/arrays of variable-width
code units (1 to 4 bytes per character in UTF-8).
Because of that, direct subscripting of string elements, e.g. str[2..<18],
is not possible. Therefore it was, and still is, not implemented in Swift,
much to the unpleasant surprise of many new Swift programmers
coming from other PLs, like me. They miss plain direct subscripting
so much that the first thing they do before using Swift intensively is
to implement the following or similar dreadful code (at least for direct
subscripting) and bury it deep in a String extension, once written,
hopefully never to be seen again, like in this example:
extension String
{
    // Integer subscripts that look O(1) but traverse the string internally.
    subscript(i: Int) -> String
    {
        guard i >= 0 && i < count else { return "" }
        return String(self[index(startIndex, offsetBy: i)])
    }

    subscript(range: Range<Int>) -> String
    {
        let lowerIndex = index(startIndex, offsetBy: max(0, range.lowerBound), limitedBy: endIndex) ?? endIndex
        let upperIndex = index(lowerIndex, offsetBy: range.upperBound - range.lowerBound, limitedBy: endIndex) ?? endIndex
        return String(self[lowerIndex..<upperIndex])
    }

    subscript(range: ClosedRange<Int>) -> String
    {
        let lowerIndex = index(startIndex, offsetBy: max(0, range.lowerBound), limitedBy: endIndex) ?? endIndex
        let upperIndex = index(lowerIndex, offsetBy: range.upperBound - range.lowerBound + 1, limitedBy: endIndex) ?? endIndex
        return String(self[lowerIndex..<upperIndex])
    }
}
[splendid jolly good Earl Grey tea is now being served to help those flabbergasted to recover as quickly as possible.]
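For illustration, here is what such an extension buys its author: plain integer subscripting, at the cost of a hidden traversal on every access. (A minimal single-subscript version is repeated here so the snippet stands alone.)

```swift
extension String {
    // Same indirect machinery as above: walk from startIndex on every call.
    subscript(range: Range<Int>) -> String {
        let lower = index(startIndex, offsetBy: max(0, range.lowerBound), limitedBy: endIndex) ?? endIndex
        let upper = index(lower, offsetBy: range.upperBound - range.lowerBound, limitedBy: endIndex) ?? endIndex
        return String(self[lower..<upper])
    }
}

let str = "Hello, Swift!"
print(str[7..<12])   // prints "Swift"
```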
This rather indirect and clumsy way of working with string data exists because
(with the exception of UTF-32) the Unicode encodings use variable-width
code units: 1 to 4 bytes per character in UTF-8, and 2 or 4 bytes in UTF-16.
As we know, this makes positional string handling for UTF-8 and UTF-16
complex and inefficient: e.g. to isolate a substring it is necessary to
traverse the string sequentially instead of accessing it directly.
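To make that cost concrete: in today's Swift, even reaching a single character at a known integer position requires deriving a String.Index by walking from startIndex:

```swift
let s = "Hello, Swift!"
// There is no O(1) s[7]: a String.Index must be derived by
// walking the underlying variable-width code units from startIndex.
let i = s.index(s.startIndex, offsetBy: 7)   // sequential traversal
print(s[i])                                   // prints "S"
```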
However, that is not the case with UTF-32, because with UTF-32 encoding
each character has a fixed width and always occupies exactly 4 bytes (32 bits).
Ergo, the problem can be easily solved: the simple solution is to always,
and without exception, use UTF-32 encoding as Swift's internal
string format, because it contains only fixed-width Unicode characters.
Unicode strings in whatever UTF encoding, as read into the program, would
be converted automatically to the 32-bit UTF-32 format. Note that explicit
conversion, e.g. back to UTF-8, could be specified or defaulted when writing
Strings to a storage medium, URL, etc.
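A minimal sketch of the idea is possible in today's Swift: materialize the string as an array of 32-bit Unicode scalars (effectively UTF-32 code units), so integer subscripting becomes O(1). The type name UTF32String is hypothetical, chosen here for illustration, and note that this indexes scalars, not grapheme clusters:

```swift
struct UTF32String {
    // Each Unicode.Scalar is a fixed-width 32-bit value (a UTF-32 code unit).
    private var scalars: [Unicode.Scalar]

    init(_ s: String) { scalars = Array(s.unicodeScalars) }

    var count: Int { return scalars.count }

    // O(1) direct subscripting, as proposed.
    subscript(i: Int) -> Unicode.Scalar { return scalars[i] }

    subscript(range: Range<Int>) -> String {
        return String(String.UnicodeScalarView(scalars[range]))
    }
}

let t = UTF32String("Hello, Swift!")
print(t[7..<12])                      // prints "Swift"
// Explicit conversion back to UTF-8 on output, as suggested above:
let utf8Bytes = Array(t[0..<5].utf8)  // bytes of "Hello"
```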
Possible, but imho not recommended: the current String system could be pushed
down and kept alive (e.g. as a type StringUTF8?) as a secondary alternative, to
accommodate those who need to process very large quantities of text in core.
What y'all think?
Kind regards
TedvG
www.tedvg.com
www.ravelnotes.com