It may also be worth exploring whether we can be more selective about which Unicode data we include. I took a brief look at the data files, and here's what I found. These are just rough calculations and guesses, so take them with a grain of salt (perhaps @Alejandro could say more about which data is required for what):
Adding up the grapheme data, we have (621*4)+(165*2)+(166*8) bytes = 4142 bytes. I think that's all we need for String's collection conformance (?) - I don't think it needs to perform normalisation, or case-folding, or access scalar properties.
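As a quick check of that arithmetic, here is a minimal sketch that recomputes the figure. Only the element counts and byte widths come from the numbers above; the labels and the choice of `UInt32`, `UInt16` and `UInt64` as element types are my assumptions for illustration.

```swift
// Recompute the grapheme data footprint from the counts quoted above.
// The labels and integer types are illustrative assumptions; only the
// element counts and per-element byte widths come from the post.
let graphemeTables: [(label: String, count: Int, elementSize: Int)] = [
    ("4-byte entries", 621, MemoryLayout<UInt32>.size),
    ("2-byte entries", 165, MemoryLayout<UInt16>.size),
    ("8-byte entries", 166, MemoryLayout<UInt64>.size),
]

let graphemeBytes = graphemeTables.reduce(0) { $0 + $1.count * $1.elementSize }
print("Grapheme data: \(graphemeBytes) bytes")  // prints "Grapheme data: 4142 bytes"
```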
Using the same process to total the normalisation data (needed for things like string comparison), I get 27422 bytes. I wonder if there are alternative ways to pack this data which prioritise compactness over performance.
Then we get to scalar properties, which are just enormous - the header file is 2.3MB and is so massive GitHub doesn't even bother to render it. Just taking a look at some of the larger tables:
- `_swift_stdlib_scalar_binProps` is (4*4855) = 19420 bytes
- `_swift_stdlib_mappings_data_indices` is (4*2879) = 11516 bytes
- `_swift_stdlib_words` is 78151 bytes
- `_swift_stdlib_word_indices` is (4*12866) = 51464 bytes
- `_swift_stdlib_names` is 215884 bytes
- `_swift_stdlib_names_scalars` is (4*39040) = 156160 bytes
- `_swift_stdlib_names_scalar_sets` is (2*8704) = 17408 bytes
- `_swift_stdlib_ages` is (8*1659) = 13272 bytes
- `_swift_stdlib_generalCategory` is (8*3968) = 31744 bytes
- Total: 595019 bytes
(And there's more in other files - stuff like script information, word-breaking, and case data)
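To make it easier to see where the bytes go, here is a small sketch that re-adds the sizes listed above and ranks the tables by size. It uses only the figures from the list and does not read the actual header, so it inherits any mistakes in my rough counts.

```swift
// Sizes in bytes, copied from the list above.
let scalarPropertyTables: [(name: String, bytes: Int)] = [
    ("_swift_stdlib_scalar_binProps", 4 * 4855),
    ("_swift_stdlib_mappings_data_indices", 4 * 2879),
    ("_swift_stdlib_words", 78151),
    ("_swift_stdlib_word_indices", 4 * 12866),
    ("_swift_stdlib_names", 215884),
    ("_swift_stdlib_names_scalars", 4 * 39040),
    ("_swift_stdlib_names_scalar_sets", 2 * 8704),
    ("_swift_stdlib_ages", 8 * 1659),
    ("_swift_stdlib_generalCategory", 8 * 3968),
]

let totalBytes = scalarPropertyTables.reduce(0) { $0 + $1.bytes }

// Print the tables largest-first, with each one's rough share of the total.
let rankedTables = scalarPropertyTables.sorted { $0.bytes > $1.bytes }
for table in rankedTables {
    let percent = table.bytes * 100 / totalBytes  // integer division, rounds down
    print("\(table.name): \(table.bytes) bytes (~\(percent)%)")
}
print("Total: \(totalBytes) bytes")
```

The three `_swift_stdlib_names*` tables alone come to 389452 bytes, so scalar names look like the first thing to try stripping.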
Firstly, it would be great if this stuff could just be dead-stripped automatically. I suspect the vast majority of applications don't care about scalar names, ages, word-breaking, or even the general category - but we know that dead-code elimination (DCE) is a bit weak in Swift currently, so perhaps the compiler can't prove these things are never used.
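For reference, these are the kinds of calls that scalar-property data serves. The sketch only uses public `Unicode.Scalar.Properties` members; which table backs which call is my guess from the table names, not something I have verified.

```swift
let scalar: Unicode.Scalar = "a"
let properties = scalar.properties

// Public standard library APIs. The table named in each comment is an
// assumption based on the table names, not a verified mapping.
let scalarName = properties.name                // presumably the names tables
let category = properties.generalCategory       // presumably _swift_stdlib_generalCategory
let age = properties.age                        // presumably _swift_stdlib_ages
let isAlphabetic = properties.isAlphabetic      // presumably the binary properties table

print(scalarName ?? "<unnamed>")  // prints "LATIN SMALL LETTER A"
print(category)
if let age = age {
    print("Added in Unicode \(age.major).\(age.minor)")
}
print(isAlphabetic)
```

An application that never touches `properties` would, ideally, pay for none of these tables.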
If that's the case, it seems to me like we could still ship an excellent String experience just with grapheme and normalisation data (31564 bytes). I think this would still give us String's collection views, with proper `count` and iteration behaviour as we're used to, as well as canonical equivalence for String comparison.
It's not totally minimal, but it would be a big improvement and it's enough that I expect most Swift applications and libraries would continue to work as normal.
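To illustrate the behaviour I mean, here is a minimal sketch using two spellings of the same word. That grapheme breaking drives `count` and iteration, and that normalisation drives comparison, matches my understanding described above; the example strings are my own.

```swift
let precomposed = "caf\u{E9}"    // "é" as the single scalar U+00E9
let decomposed = "cafe\u{301}"   // "e" followed by U+0301 COMBINING ACUTE ACCENT

// Grapheme breaking: both spellings have four characters, even though
// the scalar counts differ.
print(precomposed.count, decomposed.count)                                // 4 4
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)  // 4 5

// Canonical equivalence: comparison and hashing treat them as the same string.
print(precomposed == decomposed)              // true
print(Set([precomposed, decomposed]).count)   // 1

// Iteration yields whole grapheme clusters; the last one has two scalars.
for character in decomposed {
    print(character, character.unicodeScalars.count)
}
```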