Pitch: Character and String properties

I see your motivation here and don't necessarily disagree with it, but let's not attempt to take on the task of redesigning character sets as part of this pitch. Foundation's CharacterSet has a number of known issues that really need to be resolved on their own. It's not a trivial task that we want to just include as an aside in another proposal.

Agreed, but if we are talking about introducing a new asciiClasses enum type that effectively fulfils the same role as CharacterSet but in a narrower capacity, I wonder if it makes sense to include that in this proposal or if it would be better to move it out into its own thing.

If the discussion starts moving toward an agreement that ASCII property queries are better suited for an API more like CharacterSet, IMO we should leave it out of this proposal entirely and move it into a new proposal (not necessarily immediately) that aims to overhaul CharacterSet. There's no harm in leaving it on the floor now if we think we have a chance at a better API down the road without taking a huge bite out of CharacterSet-related things too early.

An alternative approach that may be more in the spirit of the original proposal would just be to require that users make two calls:

if character.isDigit && character.isASCII {
    // handle an ASCII digit here
}

This works because, in all cases, the ASCII variant of a given character set should be a subset of the Unicode variant.

The only problems are that it's not particularly efficient, and that it's not obvious to an English-native developer that isDigit might include characters outside of the ASCII range in the first place.
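To make the two-call approach concrete, here is a minimal sketch. It assumes `isNumber`, the spelling available on `Character` in current Swift toolchains, as a stand-in for the pitched `isDigit`; the sample characters are only for illustration.

```swift
// Arabic-Indic digit three (U+0663) is numeric but lies outside ASCII.
let candidates: [Character] = ["7", "\u{0663}", "x"]

// The numeric check alone accepts both digits.
let numeric = candidates.filter { $0.isNumber }

// Adding the ASCII check narrows the result to "7".
let asciiDigits = candidates.filter { $0.isNumber && $0.isASCII }
```

The surprise described above is visible in `numeric`: a developer who only ever tests with "0" through "9" has no hint that it can contain anything else.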

Hi Michael,

Could you go into a bit more detail about what the intended use-cases are for the words property? The Unicode document you linked talks about searching for whole words and providing smart-selection behaviour, but I'm not clear whether a collection of word substrings is the right way to solve either of those problems. For example, whole-word search also includes things like "\Wthis is more than one word\W", which seems a bit awkward to do with a collection of substrings. And smart-selection seems like it wants to start from a particular index and walk both forwards and backwards to the nearest word boundary, rather than having to break all the preceding text.
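As a rough illustration of the "walk outwards from an index" shape, here is a sketch. The function name `wordRange(in:around:)` is hypothetical, and it treats a run of `isLetter` characters as a word, which is a crude stand-in for real Unicode word-boundary rules.

```swift
// Hypothetical helper: expands from `index` to the surrounding run of letters.
// Returns nil when `index` does not sit on a letter.
func wordRange(in text: String, around index: String.Index) -> Range<String.Index>? {
    guard index < text.endIndex, text[index].isLetter else { return nil }

    // Walk backwards until the previous character is not a letter.
    var lower = index
    while lower > text.startIndex {
        let previous = text.index(before: lower)
        guard text[previous].isLetter else { break }
        lower = previous
    }

    // Walk forwards until the next character is not a letter.
    var upper = text.index(after: index)
    while upper < text.endIndex, text[upper].isLetter {
        upper = text.index(after: upper)
    }
    return lower..<upper
}

let text = "hello brave world"
let cursor = text.firstIndex(of: "a")!
let selection = wordRange(in: text, around: cursor).map { String(text[$0]) }
```

Note that nothing before "brave" had to be segmented to answer the query, which is the point of the objection to a collection of all words.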

I'm in favor of the properties on Character; those seem really useful. Ideally I'd love to see them implemented in terms of my ContainmentSet proposal...

I'm kinda torn on the lines and words stuff... It doesn't seem to me like that belongs on String directly. It seems weird to want to take a url.absoluteString and ask it for its lines or words, or to read in the contents of a JSON file and ask it for all the words in the file.

Update regarding String.Words: I think we should split it out from this pitch into something more comprehensive for the future.

Thank you to @allevato for the excellent write up earlier in this thread. I think that's a good approach for the eventual solution.

--

I think that enumerateSubstrings is an API that could use some love, either on the stdlib or Foundation side. Such an overhaul should accommodate word, sentence, and paragraph breaking, alongside choices concerning localization. Doing so is outside the scope of this pitch, but could be an interesting future direction.

Lines, however, have a specific and useful meaning in a programmer context:

Right, these are not use cases well accommodated by String.Words as pitched. From a programmer's view of strings, I can't think of much use. From a linguistic view of strings, this is currently insufficient.

Err, I don't understand this reasoning. Could you elaborate? You probably also wouldn't want to ask for a url.absoluteString's titlecased representation or what version of Unicode its leading Unicode.Scalar was introduced in.

I agree that String.Words is not as generally useful as I thought it might be. Lines is useful, because a newline is one of the most common terminators in computing.

You're right! It is weird to ask for those things. This is the philosophical problem with methods/properties, and the over-use of String as the base of API: you end up with lots of weird properties that kinda don't really fit. Another example of this is how both NSString and NSURL kinda sorta look like a Path if you squint, but aren't really either. As such, you get this API baggage on the types that sometimes is relevant but usually isn't.

This is all to say that I think it'd be better to make something like a LineSequence(_: StringProtocol) and WordSequence(_: StringProtocol) instead of making String.Lines and String.Words.
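A minimal sketch of what such a standalone type could look like follows. `LineSequence` is the name suggested above, but everything about its behavior here is an assumption: it splits on any character for which `isNewline` is true, and a trailing terminator does not produce a final empty line.

```swift
// Hypothetical type sketching the "new type outside of String" spelling.
struct LineSequence<S: StringProtocol>: Sequence {
    let base: S

    init(_ base: S) { self.base = base }

    func makeIterator() -> AnyIterator<S.SubSequence> {
        var rest = base[base.startIndex...]
        var finished = base.isEmpty
        return AnyIterator {
            guard !finished else { return nil }
            if let newline = rest.firstIndex(where: { $0.isNewline }) {
                let line = rest[..<newline]
                rest = rest[rest.index(after: newline)...]
                // A trailing terminator does not start another (empty) line.
                finished = rest.isEmpty
                return line
            }
            // No terminator left: the remainder is the last line.
            finished = true
            return rest
        }
    }
}

let lines = LineSequence("first\nsecond\r\nthird").map { String($0) }
```

Because "\r\n" is a single Character in Swift, the sketch handles CRLF input without special casing.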

From what I see, many of the issues and doubts come from String's deficiencies in handling localised text. But in my view it doesn't have to support that: I'm in favour of keeping String capable of being used in every logic computation, and having a separate type (LocalizedString?) that would be able to deal with the various complexities of user-facing text. Similarly to how we don't want Path-related utilities in String, even though a path is still text, we might want to have further distinctions for different use-cases.
This would still not stop people from using NSString-defined utilities for these other use-cases though...


I'm not entirely clear what that means for this problem space. Is it doing the kind of additional analysis you've said you don't want to do to be universally accurate?

I generally agree. I ask about the relationship with Foundation primarily in terms of impedance mismatch of the two APIs. If Swift's will be more Unicode-correct but less locale-correct, it will be non-obvious when I should reach for Swift words/lines vs. Foundation words (f.ex.), and that's assuming I already know both exist.

I will note w.r.t. your self-doubt elsewhere in the thread that line boundaries and word boundaries both seem of notable importance as building blocks for higher-level text processing, so the rest is picking the color of this particular bikeshed.

I think I caused a misconception of what String.Lines is about. I regret conflating it with linguistic analysis functionality by also including String.Words in this pitch.

String.Lines is not about what Unicode thinks of as lines, but rather what a programmer processing machine output / input thinks of as lines. E.g., a more robust version of gets.
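For comparison, this is roughly what the programmer's notion of lines looks like with API that already exists. The sketch uses `split(whereSeparator:)` with `Character.isNewline`; the sample input is invented, and this is not the pitched String.Lines itself.

```swift
// Machine output with mixed terminators, as a tool might emit it.
let output = "name=swift\r\nversion=5\nstatus=ok\n"

// split drops empty subsequences by default, so the trailing
// newline does not produce an extra empty line.
let records = output.split(whereSeparator: { $0.isNewline })

// Each record can then be processed on its own.
let keys = records.map { String($0.prefix(while: { $0 != "=" })) }
```

One difference worth keeping in mind: with the default arguments, blank lines in the middle of the input are dropped as well, which a dedicated lines view might not want.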

When Unicode talks about line-breaking, they're talking about places in the text suitable to insert soft line breaks, i.e. word-wrapping. Unicode also talks about sentence-breaking and word-breaking.

Swift's String does perform some non-localized textual segmentation/analysis with grapheme breaking. Graphemes give the closest universal approximation to what a character is, and hey, they handle emoji. However, they're not sufficient for implementing a truly robust internationalized text processing system. For example, "ch" is a single grapheme in Czech: when a Czech user advances the cursor it should skip over "ch". But at least they handle emoji!
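A small sketch of what grapheme breaking does and does not give you. The strings are illustrative; the counts follow from Unicode's default grapheme cluster rules, which are not locale-aware.

```swift
// "e" followed by COMBINING ACUTE ACCENT: one Character, two scalars.
let accented = "e\u{0301}"

// Regional indicators C + Z (the Czech flag): one Character, two scalars.
let flag = "\u{1F1E8}\u{1F1FF}"

// The Czech digraph "ch" is still two Characters, because default
// grapheme breaking knows nothing about the user's language.
let digraph = "ch"

let characterCounts = [accented.count, flag.count, digraph.count]
let scalarCounts = [accented.unicodeScalars.count,
                    flag.unicodeScalars.count,
                    digraph.unicodeScalars.count]
```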

When requesting word boundaries, sentence boundaries, or paragraph boundaries, you're much more likely to be in one of two situations. Either you're in a linguistic processing context, where I think we will need a more complete solution, or you're in a UI context where you want to get the boundaries surrounding a cursor, which is likely outside the scope of the standard library.

Do you have other use cases of wanting a collection of the words or sentences that comprise a string? What kinds of higher level text processing do you do? I want to understand this area quite a bit more before wading in with casual additions to String.

edit: stray grammar

I would love to discuss philosophy with you one day, just not in this thread :-)

Yes, I am generally opposed to the proliferation of stringly-typed entities. This is a long road, but there's been progress on the ObjC -> Swift side as well as developer awareness. Eventually, I hope for more composition-oriented and newtype-like functionality in the future. Way out of scope for this thread :-)

So, your vote is for spelling this LineSequence<S: StringProtocol> instead of String.Lines and LineSequence.init(_:S) instead of StringProtocol.lines, correct?


If I understand what you're saying, then yes. My vote is to not make this behavior intrinsic to String, but instead have it be a new type that can be created from String.

What do you mean by "intrinsic to String"? For the type, under both spellings they are a new type: mine is a nested type inside of String and yours is at global scope. For the construction of an instance of that type, both spellings are functions generic over an argument conforming to StringProtocol: mine is a computed property, yours is an initializer.

When I see String.Subtype or String.aProperty, I see these as implying that the functionality/behavior described by these things are intrinsic to all strings.

However, as established above, there are definitely cases where asking strings for their .words or .lines doesn't make semantic sense. Because of these, I don't think that the notion of having "words" or "lines" is a fundamental part of Stringiness. For that reason, I think that the solution to breaking text (where "text" implies String, but String does not imply "text") up into words and lines should be a new type outside of the String "namespace".


Just to be clear, do you agree that this is a choice of spelling and you're providing an argument for your choice of spelling? I.e., you're not interested in discussing different semantics or behavior.

I'm not trying to trivialize this by calling it "spelling". Spelling is important and there are tradeoffs to these decisions. I just want to make sure that when you talk of behavior, you're talking about justifications for user intuition about where an API "belongs" and what it should look like. This is definitely important; spelling is a specific choice of various tradeoffs, such as discoverability vs API bloat.

Yes, I'm only talking about the spelling.

And I recognize that "spelling" is a term of art :-)


OK, I think I understand better. Just Lines makes more sense, then; that makes for a stronger proposal for now.

“Higher-level” was just a cute way for me to refer to future regex-alike things. Pay no mind.

I would actually be ok with including half-width numerals and numerals and all of those. Combining it with isASCII accomplishes the narrowing when needed.

My use case is a parser combinator library that is also used to parse chord symbols typed as plain text.

[I meant to weigh in on this topic sooner, but I just got back from vacation so I'm still catching up on stuff.]

I'm a bit concerned with the possibility of this simplifying a world of complexity which cannot be reasonably simplified.

One issue here is that we would be defining our own semantics on Characters, which may or may not come back to haunt us. Right now, it's possible for us to extrapolate from the Unicode spec, but what would we do if the Unicode spec changes to make its own decisions about these properties that are incompatible with ours?

More so, although I think that exposing properties on Unicode scalars directly is a good thing to do, I don't know if I agree with generalizing these. I'm uncomfortable with even just the first entry in the FAQ here:

Should "\u{0020}\u{0301}" (SPACE followed by COMBINING ACUTE ACCENT) really be considered whitespace? I think this really (but subtly) depends on the intent, which is lost with the simple name of isWhitespace. It certainly has whitespace, but it isn't invisible (which is an assumption that many could be prone to making). This could be taken either way, but however it's taken, someone will be assuming incorrectly about what this does.
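The ambiguity is easier to see at the scalar level. This sketch uses `Unicode.Scalar.Properties.isWhitespace` and deliberately makes no claim about what a Character-level isWhitespace should answer; that is exactly the open question.

```swift
// SPACE followed by COMBINING ACUTE ACCENT forms a single Character.
let spaced = "\u{0020}\u{0301}"
let clusterCount = spaced.count

// Per scalar, only the base is whitespace; the combining mark is not.
let perScalar = spaced.unicodeScalars.map { $0.properties.isWhitespace }

// Two defensible Character-level answers, depending on intent:
let hasWhitespace = perScalar.contains(true)      // "it has whitespace"
let isOnlyWhitespace = !perScalar.contains(false) // "it is invisible"
```

The two derived values disagree for this input, so whichever rule a simple isWhitespace picks, some callers will have assumed the other.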

Similarly, too, goes the discussion regarding casing. I think there's a danger here of being correct in a way that many would find unexpected. Consider some naïve code which attempts to loop through a string to find the first uppercased/lowercased letter (if any):

var firstUppercase: Character? = nil
var firstLowercase: Character? = nil
for char in str {
    if char.isUppercased {
        if firstUppercase == nil { firstUppercase = char }
    } else {
        if firstLowercase == nil { firstLowercase = char }
    }
}

Ignoring the inefficiency of the above, having characters which are considered both uppercased and lowercased would be surprising to many. I think many people assume upper- and lower-caseness to be mutually exclusive, so the above code (and variants of it) runs rampant in the wild. Given str = "Hello", (firstUppercase, firstLowercase) == ("H", "e"); given str = "ₕᵢ", (firstUppercase, firstLowercase) == ("ₕ", nil). This might seem like a really weird edge case, but this would actually be much more common; including digits in this definition leads to more confusion: str = "abc123" results in ("1", "a"). (For many, too, the concept of casedness is tied to "is this thing a letter or not?", which is a complex question in and of itself, but things which deviate from this definition feel wrong.)
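To reproduce the surprise without relying on the pitched properties, here is a sketch with two hypothetical stand-ins. They encode one reading of the definition under discussion, "unchanged by uppercasing" and "unchanged by lowercasing"; the names are invented for this example.

```swift
extension Character {
    // Hypothetical: true when uppercasing leaves the character unchanged.
    var isUnchangedByUppercasing: Bool {
        String(self).uppercased() == String(self)
    }
    // Hypothetical: true when lowercasing leaves the character unchanged.
    var isUnchangedByLowercasing: Bool {
        String(self).lowercased() == String(self)
    }
}

// Under this reading, a digit is "uppercased" and "lowercased" at once.
let digit: Character = "1"
let digitIsBoth = digit.isUnchangedByUppercasing && digit.isUnchangedByLowercasing

// So a search for the first "uppercased" character finds the digit.
let firstUpper = "abc123".first(where: { $0.isUnchangedByUppercasing })
```

Any code that treats the two properties as mutually exclusive breaks on uncased characters such as digits, spaces, and punctuation.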

Even with less naïve code in mind, this is an easy trap to fall into:

"123hi".first(where: { $0.isLowercased }) == "1" // => true
"H i".first(where: { $0.isLowercased }) == " " // => true
"H ́i".first(where: { $0.isLowercased }) == " ́" // => true

I think this is similar to the following concern:

I have to be honest, decimalDigits has tripped me up in the past too: there's a danger in using common names which have loaded history to mean something wider than what they might more commonly be thought of.


In all, I think what I'm trying to get at is that given simplifying properties on Character, I think many would unwittingly turn to simplifying complex concepts into what they already know: ASCII. There's a lot of discussion above about ASCII (which I have to admit I have not had the time to review, so apologies if I'm repeating past discussion here), and I think properties like this help turn the complexities of dealing with Unicode into too easily pretending that Characters are ASCII. The properties which hold for ASCII are clearly not true when generalized, but this trap is easy to fall into.

Yes, working with Unicode scalars directly would be less ergonomic, for sure. But the unfamiliar interface exposes users to the world of complexity that is Unicode; I think simplifying over that complexity here might be dangerous, unless we are able to express the complexity somehow (e.g. isUppercased being Bool?, to represent "yes", "no", "N/A").
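One possible shape for the Bool? idea, as a sketch only. The property name `uppercaseStatus` is hypothetical, and the rule used here (look only at the scalars Unicode marks as cased) is an assumption; it builds on `Unicode.Scalar.Properties.isCased` and `isUppercase`.

```swift
extension Character {
    // Hypothetical tri-state: true = uppercase, false = not uppercase,
    // nil = casing does not apply to this character at all.
    var uppercaseStatus: Bool? {
        let cased = unicodeScalars.filter { $0.properties.isCased }
        guard !cased.isEmpty else { return nil }
        return cased.allSatisfy { $0.properties.isUppercase }
    }
}

let statuses = "H e1".map { $0.uppercaseStatus }
```

The optional forces callers to decide what "N/A" means for them, at the cost of a less convenient call site such as `uppercaseStatus == true`.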
