Safe memory optimization: explicit stack allocation, static arrays, ASCII

The default implementation of operator new calls malloc.

@Andrew_Trick, is it possible in pure Swift to vend subranges of a large UnsafeMutableRawBufferPointer as heterogeneously-bound buffers?

I didn't say malloc itself is bad; I meant that hidden, impossible-to-control use of malloc is generally considered bad for real-time code.

I don’t disagree. I’m just saying that C++ defaults to hidden allocations that (on Darwin, at least) take the malloc() lock. You do have control over that, but it’s rather cumbersome to add the allocator argument to every std::vector, and the language doesn’t do anything to prevent you from forgetting.

That is, in fact, the only legit use case of binding memory in pure Swift--writing your own allocator. You can allocate one large buffer then bind disjoint subranges to whatever type they will hold. You just need to pull out a pointer to the region of interest and call bindMemory with the capacity of the subrange. If the subrange is only temporary, and storage may hold some other type later, then you could use withMemoryRebound instead, which is just a wrapper around two calls to bindMemory.

5 Likes
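
To make that concrete, here is a minimal sketch of the sub-allocation pattern described above. The sizes, offsets, and element types are made up for illustration; a real allocator would track regions and alignment itself.

let poolSize = 1 << 16
let pool = UnsafeMutableRawBufferPointer.allocate(
    byteCount: poolSize,
    alignment: MemoryLayout<Int>.alignment
)
defer { pool.deallocate() }

// Bind two disjoint subranges of the pool to the types they will hold.
let intRegion = UnsafeMutableRawBufferPointer(rebasing: pool[0 ..< 1024])
let ints = intRegion.bindMemory(to: Int32.self)        // 256 Int32 values
ints.initialize(repeating: 0)

let doubleRegion = UnsafeMutableRawBufferPointer(rebasing: pool[1024 ..< 5120])
let doubles = doubleRegion.bindMemory(to: Double.self) // 512 Double values
doubles.initialize(repeating: 0)

ints[0] = 42
doubles[0] = 3.14

// If a region only needs to hold another type temporarily, withMemoryRebound
// (available on raw buffers since Swift 5.7) restores the previous binding on exit.
intRegion.withMemoryRebound(to: UInt8.self) { bytes in
    // use `bytes` here; afterwards the region is bound to Int32 again
    _ = bytes.count
}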

It's important to clarify that C-style "stack allocation" (a concept that doesn't actually exist in the C language, but I think everyone knows what is intended based on the behavior of mainstream C implementations) is very much not a complete solution to the constrained-environments problem; there are lots of such environments where significant stack allocation is also not acceptable, and binding memory out of a provided work buffer is the typical approach even in C.

6 Likes

you probably do not want to use a promoted literal type of Character; you probably want Unicode.Scalar, which is the minimally-promoted literal type of ExpressibleByUnicodeScalarLiteral. (but this probably isn’t your performance bottleneck. update: it is, see below)

with code unit literals, you could also get literals restricted to the ASCII range at compile time.

2 Likes

update: rather than speculate, i godbolted it, and it turns out i was drastically wrong.

version 1 (Character.asciiValue)

extension CChar:ExpressibleByUnicodeScalarLiteral
{
    @inlinable
    public init(unicodeScalarLiteral:Character)
    {
        self.init(bitPattern: unicodeScalarLiteral.asciiValue!)
    }
}

public
func test() -> [CChar]
{
    [
        "a",
        "b",
        "c",
        "\0",
    ]
}

output.test() -> [Swift.Int8]:
        push    rbp
        push    r14
        push    rbx
        sub     rsp, 16
        lea     rdi, [rip + (demangling cache variable for type metadata for Swift._ContiguousArrayStorage<Swift.Int8>)]
        call    __swift_instantiateConcreteTypeFromMangledName
        mov     esi, 36
        mov     edx, 7
        mov     rdi, rax
        call    swift_allocObject@PLT
        mov     r14, rax
        movaps  xmm0, xmmword ptr [rip + .LCPI3_0]
        movups  xmmword ptr [rax + 16], xmm0
        mov     qword ptr [rsp + 8], 97
        lea     rdi, [rsp + 8]
        mov     esi, 1
        call    ($sSS18_uncheckedFromUTF8ySSSRys5UInt8VGFZ)@PLT
        mov     rbx, rdx
        mov     rdi, rax
        mov     rsi, rdx
        call    ($sSJ10asciiValues5UInt8VSgvg)@PLT
        mov     ebp, eax
        test    ebp, 256
        jne     .LBB3_5
        mov     rdi, rbx
        call    swift_bridgeObjectRelease@PLT
        mov     byte ptr [r14 + 32], bpl
        mov     qword ptr [rsp + 8], 98
        lea     rdi, [rsp + 8]
        mov     esi, 1
        call    ($sSS18_uncheckedFromUTF8ySSSRys5UInt8VGFZ)@PLT
        mov     rbx, rdx
        mov     rdi, rax
        mov     rsi, rdx
        call    ($sSJ10asciiValues5UInt8VSgvg)@PLT
        mov     ebp, eax
        test    ebp, 256
        jne     .LBB3_6
        mov     rdi, rbx
        call    swift_bridgeObjectRelease@PLT
        mov     byte ptr [r14 + 33], bpl
        mov     qword ptr [rsp + 8], 99
        lea     rdi, [rsp + 8]
        mov     esi, 1
        call    ($sSS18_uncheckedFromUTF8ySSSRys5UInt8VGFZ)@PLT
        mov     rbx, rdx
        mov     rdi, rax
        mov     rsi, rdx
        call    ($sSJ10asciiValues5UInt8VSgvg)@PLT
        mov     ebp, eax
        test    ebp, 256
        jne     .LBB3_7
        mov     rdi, rbx
        call    swift_bridgeObjectRelease@PLT
        mov     byte ptr [r14 + 34], bpl
        mov     qword ptr [rsp + 8], 0
        lea     rdi, [rsp + 8]
        mov     esi, 1
        call    ($sSS18_uncheckedFromUTF8ySSSRys5UInt8VGFZ)@PLT
        mov     rbx, rdx
        mov     rdi, rax
        mov     rsi, rdx
        call    ($sSJ10asciiValues5UInt8VSgvg)@PLT
        mov     ebp, eax
        test    ebp, 256
        jne     .LBB3_8
        mov     rdi, rbx
        call    swift_bridgeObjectRelease@PLT
        mov     byte ptr [r14 + 35], bpl
        mov     rax, r14
        add     rsp, 16
        pop     rbx
        pop     r14
        pop     rbp
        ret
.LBB3_5:
        ud2
.LBB3_6:
        ud2
.LBB3_7:
        ud2
.LBB3_8:
        ud2

version 2 (numeric cast)

extension CChar:ExpressibleByUnicodeScalarLiteral
{
    @inlinable
    public init(unicodeScalarLiteral:Unicode.Scalar)
    {
        self.init(unicodeScalarLiteral.value)
    }
}

public
func test() -> [CChar]
{
    [
        "a",
        "b",
        "c",
        "\0",
    ]
}

output.test() -> [Swift.Int8]:
        push    rax
        lea     rdi, [rip + (demangling cache variable for type metadata for Swift._ContiguousArrayStorage<Swift.Int8>)]
        call    __swift_instantiateConcreteTypeFromMangledName
        lea     rsi, [rip + (outlined variable #0 of output.test() -> [Swift.Int8])+8]
        mov     rdi, rax
        pop     rax
        jmp     swift_initStaticObject@PLT

__swift_instantiateConcreteTypeFromMangledName:
        push    rbx
        mov     rax, qword ptr [rdi]
        test    rax, rax
        js      .LBB4_1
        pop     rbx
        ret
.LBB4_1:
        mov     rbx, rdi
        mov     rsi, rax
        sar     rsi, 32
        neg     rsi
        movsxd  rdi, eax
        add     rdi, rbx
        xor     edx, edx
        xor     ecx, ecx
        call    swift_getTypeByMangledNameInContext@PLT
        mov     qword ptr [rbx], rax
        pop     rbx
        ret

this was surprising to me, because i had assumed the compiler could at least infer the roundtrippable-ness of Unicode.Scalar -> Character -> Unicode.Scalar, but it appears i was mistaken.

1 Like

Counterexample:

1> "\r\n" as Character
$R0: Character = "\r\n"
  2> ("\r\n" as Character).asciiValue
$R1: UInt8? = 10
  3> UnicodeScalar(("\r\n" as Character).asciiValue!)
$R2: UnicodeScalar = U'\n'
1 Like

almost! but '\r\n' isn’t a unicode scalar, it’s a grapheme cluster of two unicode scalars. because @TeamPuzel is using ExpressibleByUnicodeScalarLiteral, his literal expression domain only contains unicode scalars, even though he is using a promoted literal type of Character.

but it really is easy for all this to go wrong, because if he had been using ExpressibleByExtendedGraphemeClusterLiteral this could indeed happen.

True. It happens that there are no canonical equivalents between codepoints in the ASCII range, but since there are equivalences between other individual codepoints (e.g. "\u{212b}" == "\u{00c5}") the optimizer would need to see through unwrapping the Optional<UInt8> that happens in the expression asciiValue!. Which, of course, it can’t do because the list of equivalent characters depends on the version of the Unicode data that’s on the system.

2 Likes
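
For illustration, the equivalence in question can be observed directly; Character comparison in Swift uses canonical equivalence, while the underlying scalars differ:

let angstromSign: Character = "\u{212B}"  // ANGSTROM SIGN
let aWithRing: Character = "\u{00C5}"     // LATIN CAPITAL LETTER A WITH RING ABOVE

// Equal under canonical equivalence...
print(angstromSign == aWithRing)                                        // true

// ...even though they are different code points.
print(angstromSign.unicodeScalars.map { String($0.value, radix: 16) })  // ["212b"]
print(aWithRing.unicodeScalars.map { String($0.value, radix: 16) })     // ["c5"]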

What is "the true C array"?

In C you can allocate things on the stack: void foo() { int array[100]; ... }
on the heap: int* array = malloc(100 * sizeof(int));
or as a static / global variable: static int array[100];
plus there's alloca for explicit stack allocation.

(malloc is part of the C standard library rather than the language itself, and alloca is a non-standard extension, should that matter.)

There are size limitations on stack / global allocations, e.g. allocating 10 KB arrays on the stack is asking for trouble unless you can be sure you've got a large stack, and allocating a 100 MB global array is probably not the best idea either.
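
For comparison, a rough sketch of the closest Swift analogues to those three forms (withUnsafeTemporaryAllocation needs Swift 5.6+, and it may still fall back to the heap for large or non-constant sizes; a plain global Array is not a true static array, since its element storage lives on the heap):

// Stack-like scratch space: the buffer is only valid inside the closure.
withUnsafeTemporaryAllocation(of: Int.self, capacity: 100) { scratch in
    scratch.initialize(repeating: 0)
    scratch[0] = 1
}

// Heap allocation, the rough equivalent of malloc/free.
let heapArray = UnsafeMutablePointer<Int>.allocate(capacity: 100)
heapArray.initialize(repeating: 0, count: 100)
// ... use heapArray ...
heapArray.deinitialize(count: 100)
heapArray.deallocate()

// A global buffer: fixed capacity by convention only, elements stored on the heap.
var globalBuffer = [Int](repeating: 0, count: 100)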

That's true, and not only for memory allocation: things like mutex locks or Dispatch async calls are also not good for real-time applications. Until we have some form of @realtime attribute (discussed here, for example) we must be very careful when using Swift in realtime contexts (to the extent that some people/teams avoid it completely, preferring good old C in those contexts).

What exactly are you trying to do: is it generating Markov chain text in a realtime context, or doing that "as fast as possible"? Note that in the general case you may want different algorithms for the former and the latter, as in realtime contexts it is the worst-case time complexity that matters, while to do a thing as quickly as possible you typically optimise average time complexity.

We don't know your task or constraints well enough to advise better. For example, if you want to prohibit all or most memory allocations (e.g. in realtime contexts) you may choose to avoid Swift's Array, Dictionary and String and implement your own data structures (likely much simpler, specially tailored versions of Array/Dictionary/String) to achieve the best average or worst-case time for your algorithm. You may choose to invest in a custom allocator if your data structures are dynamic, or perhaps you can tolerate fixed-size structures (e.g. capping the number of words associated with a particular key at some maximum of your choice). For Markov chain text, instead of constructing a key by joining string components (even in CChar-array form), you could use an integer key made from a combination of several word indices.
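
To sketch that last idea (the names and table shapes here are hypothetical, just to show an integer key built from word indices):

// Hypothetical word table: each distinct word is interned once and gets a UInt32 index.
struct WordTable {
    private var indices: [String: UInt32] = [:]
    private(set) var words: [String] = []

    mutating func index(of word: String) -> UInt32 {
        if let existing = indices[word] { return existing }
        let next = UInt32(words.count)
        indices[word] = next
        words.append(word)
        return next
    }
}

// A bigram key is two packed indices instead of a joined String,
// so looking up successors never hashes or compares string contents.
func bigramKey(_ first: UInt32, _ second: UInt32) -> UInt64 {
    UInt64(first) << 32 | UInt64(second)
}

var table = WordTable()
var successors: [UInt64: [UInt32]] = [:]
let key = bigramKey(table.index(of: "the"), table.index(of: "cat"))
successors[key, default: []].append(table.index(of: "sat"))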

This is actually how a ton of embedded systems work.

No, I'm not trying to do anything specifically; I just had that algorithm written already, so I used it to test the performance cost of using String vs that array extension. Writing 100000 random characters to an array (without the Markov chain) takes 0.015 seconds. It seems like exactly the same performance as Rust :slightly_smiling_face:

the file:

#if canImport(Darwin)
import Darwin
#elseif canImport(Glibc)
import Glibc
#else
#error("Platform not supported")
#endif

@main
struct Main {
    static func main() {
        var buffer: [CChar] = [] ; buffer.reserveCapacity(101000)
        while buffer.count < 100000 {
            let random: Int8 = .random(in: 1...10)
            
            switch random {
                case  1: buffer.append(contentsOf: Array(stringLiteral: "test1"))
                case  2: buffer.append(contentsOf: Array(stringLiteral: "test2"))
                case  3: buffer.append(contentsOf: Array(stringLiteral: "test3"))
                case  4: buffer.append(contentsOf: Array(stringLiteral: "test4"))
                case  5: buffer.append(contentsOf: Array(stringLiteral: "test5"))
                case  6: buffer.append(contentsOf: Array(stringLiteral: "test6"))
                case  7: buffer.append(contentsOf: Array(stringLiteral: "test7"))
                case  8: buffer.append(contentsOf: Array(stringLiteral: "test8"))
                case  9: buffer.append(contentsOf: Array(stringLiteral: "test9"))
                case 10: buffer.append(contentsOf: Array(stringLiteral: "test0"))
                default: fatalError()
            }
            buffer.append(" ")
        }
        buffer.append(0)
        fputs(&buffer, stdout)
        fflush(stdout)
    }
}

There's also a second file with the Array extension mentioned before.
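
(That extension isn't quoted in this excerpt; something along these lines would be enough for the Array(stringLiteral:) calls above to compile, though the actual version may differ:)

extension Array where Element == CChar {
    // Hypothetical reconstruction: copy the string's UTF-8 bytes as CChars.
    init(stringLiteral value: String) {
        self = value.utf8.map { CChar(bitPattern: $0) }
    }
}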

Random will eat a significant portion of time in this loop. If you can use a low-quality random generator, it would be much faster.

well it seems like Swift doesn't like that:
'rand()' is unavailable in Swift: Use arc4random instead.
Why would this be a compilation error :smile:

Because the platform owner marked the function with __swift_unavailable in <stdlib.h>.

yeah, I meant why was this made an error rather than a warning

I shouldn't have to call this through another C function just to get the compiler to do what I want it to.

This was done on purpose to make it harder to use - Swift decided to make the task of random number generation "safe by default" (not sure if it's cryptographically safe though).

random also doesn't have the ability to customise its seed (e.g. to replicate an exact sequence of random numbers in order to find a bug).

These two things make me look for alternatives every now and then, when I need either better performance or a repeatable sequence – in which case I use some simple (and totally insecure) algorithm:

uint32_t state = 777;

char myRand()
{
   state = state * 1664525 + 1013904223;
   return state >> 24;
}
1 Like
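
For what it's worth, a repeatable sequence is possible in Swift by supplying a custom RandomNumberGenerator; here's a sketch using the standard SplitMix64 constants (insecure, like the LCG above, but seedable and fast):

// A deterministic, seedable generator. Not cryptographically secure; useful
// for reproducing a bug or avoiding the system generator's overhead.
struct SplitMix64: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { self.state = seed }

    // &+ and &* are wrapping arithmetic, as in the C version above.
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

var rng = SplitMix64(seed: 777)
let value = Int8.random(in: 1...10, using: &rng)  // same seed, same sequence every run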

It is on officially supported platforms with the official distribution (specifically it binds arc4random on Apple platforms, getrandom, getentropy, or /dev/urandom on Linux-like systems, and BCryptGenRandom on Windows). We do not document it as such because we can't control what happens in a random fork that someone creates.

1 Like

So I did look into it; however, there's an issue on macOS.
I can use malloc(), but not alloca(), despite it being imported as part of the Darwin module. It causes this error: Undefined symbols for architecture arm64: "_alloca"

Must be another case of Apple not wanting anyone using an api.