swift_tailRealloc

That said, if our implementation internally can take advantage of realloc for uniquely-referenced buffers with bitwise-takeable contents, that's great.

That is an API that would be great to use, but it isn't available on all platforms that we support. Also, as you point out in your next post, this will not be exposed to our users with this design.

I was just responding to the idea of adding it to the buffer types.

I think we could probably implement it with:

  1. A Builtin that tells us whether a value is bitwise movable.
  2. A runtime entry point for the "handle everything" realloc.
  3. An implementation on ManagedBuffer itself that does this by hand.

That being said, I would rather represent this how we represent the rest of the tail allocated stuff, i.e.:

  1. A Builtin that lowers to a realloc_ref SIL instruction. This realloc_ref instruction could then be specialized.
  2. IRGen lowers the builtin.

We should be consistent unless there is a really good reason to not be.
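
For context, on the stdlib side the existing tail-allocation support is reached through a builtin inside ManagedBuffer.create; paraphrased (Builtin is only usable from within the stdlib), the relevant part looks roughly like this:

// Approximate shape of the existing pattern in ManagedBuffer.create:
// one builtin allocates the object together with its tail-allocated
// Element storage, and SIL models that allocation as an alloc_ref
// carrying the tail-element counts.
let p = Builtin.allocWithTailElems_1(
  self, minimumCapacity._builtinWordValue, Element.self)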

I don't understand why a realloc_ref instruction is better than the alternatives. It has very complicated semantics for an instruction and inherits the badness of the realloc C API. We shouldn't contrive new builtins and SIL instructions to accommodate a badly designed legacy C API. The standard library can call realloc when it needs to. An _isBitwiseTakable builtin makes perfect sense.

Just an FYI for the thread. @Andrew_Trick wanted a bit more discussion of the design. I invited him to talk about it in this thread. I am going to respond later.

Some more notes:

  1. I think Johannes is out on vacation until late July IIRC. This is going to be on hold until then.
  2. From talking with Andy over IM, he really wants a more in-depth discussion of the various IR options, how things would look, etc.

That's right. It may turn out that realloc_ref is the best alternative. But it isn't obviously something we would want in SIL apart from supporting this particular internal API, and I think the barrier to adding a SIL instruction should be much higher than "it was convenient at the time". So what do the alternatives look like in terms of stdlib code, new runtime entry points, and SIL/LLVM IR?

I haven't tried to work out all the details myself yet, but certainly someone should do that before we start merging PRs.

Ok, let's revive this thread. Just to bring everybody onto the same page: my eventual goal is to use buffer re-allocation to grow SwiftNIO's ByteBuffers without necessarily copying bytes around. And of course the stdlib's Array, Dictionary, Set, and other collections should and would also benefit from this functionality.

After @Michael_Gottesman initially started this thread, he and I had a session where we sketched out a possible design. The design we came up with is the following:

  • for normal Swift packages (like SwiftNIO), this functionality is exposed through ManagedBuffer
  • the stdlib (which implements ManagedBuffer) will get a Builtin.reallocate (or similarly named) builtin
  • that builtin gets lowered to a realloc_ref SIL instruction
  • the realloc_ref instruction gets lowered into IR which does the 'is bitwise takable' check and then calls libc's realloc

I took a first stab at implementing the realloc_ref SIL instruction. This is a very early version of this PR and I'm also a beginner in Swift compiler land, which is why most of this is me typing up what I could pick from @Michael_Gottesman's brain :).

After opening the PR, @Andrew_Trick rightly pointed out that I didn't even document the new SIL instruction in SIL.rst, which I really should have. Should we reach the conclusion that this is the right approach, I promise to do so. But also, as @Andrew_Trick suggested, let's first take a step back and see if this is really the best approach.

The downside is that the realloc function has odd semantics: it doesn't fail if it can't grow the buffer in place. Instead, it will 1) allocate a new buffer, 2) copy over the bytes that fit, 3) free the old buffer, and 4) return the address of the new buffer. The realloc_ref SIL instruction would replicate those semantics even though they're not great. That's a bit of a bummer, but unfortunately I can't see a different solution: AFAIK, realloc is the only portable function that gives us access to the re-allocation functionality, so there doesn't really seem to be a way around it. In other words, I don't have an idea how to get realloc's functionality without also inheriting its odd semantics.
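
To make those semantics concrete, here is a small, standalone illustration of calling libc's realloc directly from Swift (nothing NIO- or stdlib-specific; it only demonstrates that the contents survive whether or not the block moves):

#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// realloc either grows the block in place or allocates a new block,
// copies the old contents over, frees the old block, and returns the
// new address. Either way, the old pointer must not be used afterwards.
var p = malloc(16)!
p.storeBytes(of: UInt8(0x2A), as: UInt8.self)
p = realloc(p, 1 << 20)!               // may or may not move the block
assert(p.load(as: UInt8.self) == 0x2A) // contents are preserved either way
free(p)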

But we should discuss this here so we can find the best possible solution (even if it ends up being realloc_ref) and also document why that is.

Background: Unlike a traditional compiler IR, such as LLVM, SIL is not just an instruction set where each instruction can be considered an independent set of semantics. SIL passes and SIL verification rely on the ability to identify accepted patterns of SIL code. In other words, SIL has a lot of structural properties that need to be maintained, whereas LLVM has very few.

(@Erik_Eckstein) Adding a realloc_ref instruction is undesirable because SIL passes would now need to check for a realloc_ref instruction wherever this new operation could pop up based on its semantics. For instance, what are all the operations that could potentially destroy a reference, all the operations that copy a heap object, all the operations that could allocate a new heap object? We also need to invent and verify new structural properties, like realloc_ref requires uniqueness and must be the last use of a reference.

Anytime you introduce new SIL patterns that potentially need to be checked in a lot of places, but in practice occur very rarely, you have a source of bugs.

Contrast this with making a runtime call, which all SIL passes and utilities must already understand. Every bit of code in the SIL compiler already needs to have robust handling of apply instructions. It's always the case that an apply can destroy, copy, or instantiate a new object.

In addition to being another source of bugs, I don't think SIL developers should need to think about the possibility of "realloc" and its semantics w.r.t their pass. It simply isn't a natural thing to have in the IR.

Generally, the standard library is the right place to work around platform issues so that the compiler can focus on the core language semantics.

Looking at the suggestions made earlier in this thread, I suggest something like the following. This is a rough idea, it's probably not exactly what we want...

@_transparent
public func _isBitwiseTakable<T>(_ type: T.Type) -> Bool {
  return Bool(Builtin.isBitwiseTakable(type))
}

extension ManagedBuffer {
  public final class func _realloc(
    buffer: inout ManagedBuffer,
    count: Int,
    newMinimumCapacity: Int,
    copyHeaderWith doCopy: (
      ManagedBuffer<Header, Element>) throws -> Header
  ) rethrows {
    if (_isBitwiseTakable(Header.self)
      && _isBitwiseTakable(Element.self)) {

      let sizeInBytes = _elementOffset
        +  newMinimumCapacity * MemoryLayout<Element>.stride

      // Either call libc directly or a runtime call?
      let p = realloc(Unmanaged.passUnretained(buffer).toOpaque(), sizeInBytes)!
      buffer = Unmanaged.fromOpaque(p).takeUnretainedValue()
      return
    }
    // Normal, non-bitwise move. I don't know how to "move" the
    // object header, so let's copy and destroy it.

    // Reimplement _create without the `makingHeaderWith` factory.
    let newBuffer = Builtin.allocWithTailElems_1(
         self, newMinimumCapacity._builtinWordValue, Element.self)
    
    let initHeaderVal = try doCopy(buffer)
    newBuffer.headerAddress.initialize(to: initHeaderVal)

    // Move over all the elements from buffer to newBuffer.
    buffer.withUnsafeMutablePointerToElements { oldPtr in
      newBuffer.withUnsafeMutablePointerToElements { newPtr in
        newPtr.moveInitialize(from: oldPtr, count: count)
      }
    }

    buffer = newBuffer // destroys the old header.
  }
}

If the goal is just to avoid copying bytes when growing the capacity, can you just use a data structure with guaranteed O(1) append instead of amortized O(1)? I know deques are widely considered to not be worth the extra per-operation overhead, but that might not be such a big deal for a byte-buffer since it's less likely to be repeatedly indexed instead of having aggregate operations done on it.
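
For what it's worth, here is a minimal sketch of that alternative; ChunkedBytes is purely illustrative and not an existing type. Appends never copy existing bytes (only the small array of chunk references ever grows), but the storage is no longer contiguous, which is exactly the trade-off discussed in the reply below:

// Purely illustrative: a list of fixed-size chunks. Appending a byte
// never moves bytes that are already stored, at the cost of giving up
// contiguous storage.
struct ChunkedBytes {
  private var chunks: [[UInt8]] = []
  private let chunkSize = 4096

  mutating func append(_ byte: UInt8) {
    // Start a new chunk when there is none yet or the last one is full.
    if chunks.isEmpty || chunks[chunks.count - 1].count == chunkSize {
      var chunk: [UInt8] = []
      chunk.reserveCapacity(chunkSize)
      chunks.append(chunk)
    }
    chunks[chunks.count - 1].append(byte)
  }
}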

We also want to have it stored contiguously in memory. But to answer your question: we have this data structure already. To write out data, you can just enqueue multiple ByteBuffers and we'll write them with one writev system call. From a user's point of view this looks like this:

ctx.write(chunk1)
ctx.write(chunk2)
ctx.flush() // or combine the latter two in `ctx.writeAndFlush(chunk2)`

That will enqueue the ByteBuffers chunk1 and chunk2 into an (expanding) ring buffer, and ctx.flush() will send them (using one writev system call).
So for the outbound side we're pretty good (though in many cases you'd still prefer a contiguous buffer). On the inbound side, however, in very many cases the user will need a contiguously stored chunk in memory, and that chunk must grow. So realloc really makes a difference here and is the only option that I'm aware of.

Okay, if you're sure you need contiguity on the input side, I guess I have to concede the need for reallocation. And you're sure that these buffers will not be reused enough to make the growth cost irrelevant?

My understanding is that realloc is great if the allocation has been page-aligned and the implementation is OS-integrated enough to just remap those pages into a new part of the address space, but that otherwise the optimized path simply does not kick in enough to make much of a difference.

@John_McCall Yes, as you say, realloc works best when we're dealing with buffers >= PAGE_SIZE. And we find there are many use cases where our users do require contiguous buffers that may well hold more than a page worth of data.

Just last week @lukasa found a performance issue where code in ByteBuffer was accidentally causing a malloc + copy; after fixing it to use realloc he saw a great speedup in the TLS code. IIRC it was 20% of the runtime spent in memmove that just went away. The buffer was 16k to start with, and with realloc it could grow cheaply, whereas before we needed to copy. I'm sure @lukasa would be happy to share the details if you were interested.

@normanmaurer also tells us that over in Netty-land using realloc was shown in some cases to improve performance quite a lot.

Hi Andy & @Erik_Eckstein,

That does sound like something that would work well for me. In fact @Michael_Gottesman and I started off with a similar idea but then gravitated towards the realloc_ref SIL instruction. I have to admit I now cannot remember why. @Michael_Gottesman do you remember more?

Looks good. We would need to also check isKnownUniquelyReferenced(&buffer), I think, but otherwise that looks fairly similar to what we were thinking about.

Yes, absolutely.
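
For concreteness, here is a minimal, self-contained illustration of that uniqueness check; Box is purely a stand-in for the buffer class, and in the real code the check would be isKnownUniquelyReferenced(&buffer) guarding the realloc fast path:

// Stand-in class; in the sketch above this would be the ManagedBuffer.
final class Box { var value = 0 }

// Only take the in-place (realloc) path while we hold the sole strong
// reference; otherwise we'd mutate storage another owner can still see.
func canResizeInPlace(_ object: inout Box) -> Bool {
  return isKnownUniquelyReferenced(&object)
}

var b = Box()
assert(canResizeInPlace(&b))   // unique: the fast path is safe
let shared = b
assert(!canResizeInPlace(&b))  // shared: must copy into new storage
_ = shared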

The implementation of Builtin.isBitwiseTakable should be similar to the existing Builtin.isPOD.
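
For reference, the existing stdlib wrapper around Builtin.ispod looks roughly like the following (paraphrased from the stdlib), and Andy's _isBitwiseTakable sketch above mirrors the same shape; the exact builtin name for the bitwise-takable variant is still an open detail:

// Roughly the stdlib's existing wrapper around the ispod builtin; a
// bitwise-takable query would follow the same transparent pattern.
@_transparent
public // @testable
func _isPOD<T>(_ type: T.Type) -> Bool {
  return Bool(Builtin.ispod(type))
}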

This second half of the implementation was truncated in your response. I just want to make sure you saw it...

    // Normal, non-bitwise move. I don't know how to "move" the
    // object header, so let's copy and destroy it.

    // Reimplement _create without the `makingHeaderWith` factory.
    let newBuffer = Builtin.allocWithTailElems_1(
         self, newMinimumCapacity._builtinWordValue, Element.self)
    
    let initHeaderVal = try doCopy(buffer)
    newBuffer.headerAddress.initialize(to: initHeaderVal)

    // Move over all the elements from buffer to newBuffer.
    buffer.withUnsafeMutablePointerToElements { oldPtr in
      newBuffer.withUnsafeMutablePointerToElements { newPtr in
        newPtr.moveInitialize(from: oldPtr, count: count)
      }
    }

    buffer = newBuffer // destroys the old header.
  }

It was due to the symmetry with how the rest of the tail-allocated stuff is implemented. That being said, since then, via internal discussions, we agreed that there are stronger arguments against the realloc_ref instruction than for it. For me the strongest reason against it is that creating a SIL instruction should have a high bar: one should only be created when there is some sort of semantic property that the optimizer needs to understand in a structured way. Otherwise, it makes sense to just use a function call, runtime impl, stdlib impl, or builtin impl.

@Andrew_Trick thanks, saw it. Would we even want to implement that if the types aren't bitwise takable? Couldn't we argue that's programmer error and just trap?

Cool, that sounds good to me!

@Michael_Gottesman & @Andrew_Trick So it seems like we could go with something very similar to Andy's proposed code and the only things we need are

  • Swift builtin Builtin.isBitwiseTakable ((T.Type) -> Bool)
  • a SIL builtin operation "isbitwisetakable", similar to "ispod" ((T.Type) -> Bool), that lowers to emitLoadOfIsBitwiseTakable (which already exists) in IRGen

Do you think a direct libc call to realloc is fine, as outlined in Andy's example?