Ironing out `ManagedBuffer` API wrinkles

lorentey · August 6, 2020, 10:50pm

ManagedBuffer has a public stored property called header. I'm complaining about this particular stored property.

At the time the freshly allocated object is passed to the factory closure, this stored property isn't initialized. The closure is expected to not to access it. From the docs:

During instance creation, in particular during ManagedBuffer.create ’s call to initialize, ManagedBuffer ’s header property is as-yet uninitialized, and therefore reading the header property during ManagedBuffer.create is undefined.

Vending invalid objects to the closure is quite bad enough already, but it gets worse: the closure is allowed to throw. If it does, then the header ivar stays uninitialized, but the custom deinit will still run. In most cases the custom deinit needs to access the header ivar to figure out what parts of the buffer need to be deinitialized. When it does so, things tend to blow up.

struct Foo: Error {}

class Bar {
  func greet() {
    print("Hello from Bar.greet")
  }
}

class Baz: ManagedBuffer<Bar, Int> {
  deinit {
    print("Hello from Baz.deinit")
    header.greet()
  }
}

_ = try? Baz.create(minimumCapacity: 100) { _ in throw Foo() }
// ⟹ Hello from Baz.deinit
// ⟹ 💥 Segmentation fault: 11

The only way to fix this I can see is to stop allowing the closure to throw, or to get rid of the closure altogether.

(The fact that custom subclasses must not define additional ivars is inconvenient, but I'm not complaining about that -- and sadly I don't expect we'd find a way to lift that restriction any time soon. I do think we need to explicitly verify this constraint at runtime, though, at least in debug builds.)

(I'm also not complaining about all the unsafe memory operations within any properly written use of ManagedBuffer. Those are a feature, not a bug.)

If we could find a way to make create an actual initializer, then that would be awesome! It would also resolve the naming problem that we have for its replacement. I'm hoping we would be able to do it without introducing a throwaway protocol for it, though.

My intention here isn't to ridicule the ManagedBuffer design -- I'm sorry if it comes though that way. I'm trying to point out practical problems with it in order to get a picture how people feel about maybe fixing them. ManagedBuffer is objectively not a joy to work with; I don't think we can fix that entirely without language changes, but at least we can fix some of the lower hanging API issues.

If we were to design this from scratch, then with the benefit of hindsight I'd probably consider fixing problems #4 and #5 in the new design. I don't think we're at that stage yet, but I do think we need to seriously consider getting rid of the capacity property. (I.e., deprecating it everywhere based on a language version, and not carrying it forward to new platforms.)

This is good information, but it's not something I can practically act on. The buffer cannot just be a dumb storage class with no knowledge about its concrete element type -- it must be subclassed to customize deinit. Most other operations can be moved out of the class (and in fact that's often how things naturally fall out during the implementation); but the class instance must know how to clean up after itself.

If we had language features such as a type system-level way of modeling methods that must only be called on a unique instance, then we'd probably want to forward to a class method more often, just to reap the benefits of static uniqueness checking. I expect this would put far more pressure on this bug.

Yes, yes, a thousand times yes! However, I suspect we cannot model heterogeneous buffers in a satisfying way yet. (E.g., variadic generics sound like something we would want to use.) It would be interesting to see how far we can get, though.

ManagedRawBuffer would be a way to allow people to manually partition their buffer storage, in a more palatable way than rebinding memory vended by ManagedBuffer. This would at least allow us to manually implement heterogeneous buffers without having to fight against the stdlib every step along the way.

But I think before we attempt to tackle these problems, we should spend some time to see how much we can fix in the APIs we already have. A strongly typed, flat ManagedBuffer is not a useless construct -- flat buffers have many use cases. Its API is obviously deeply, deeply flawed because of some language-level constraints that don't seem easily resolvable, even today. But it also has some minor API wrinkles that are very much resolvable!

This is ~universally done by retrieving the size during initialization and then storing the actual usable capacity in the Header. (I haven't seen any production code that uses ManagedBuffer.capacity instead of this. If someone is doing that, I'm pretty sure they haven't thought it through.)

Making use of any extra "free" capacity needs to be an opt-in thing, because it isn't always desirable. In fact, the desire to use it seems to be the exception rather than the rule.

In practice, this would lead to the capacity being stored twice, which I don't think would be reasonable.

John_McCall · August 6, 2020, 11:30pm

I see, so subclasses are expected to store the capacity in their header? Then I think you have two choices: (1) stop doing that in subclasses and instead rely on ManagedBuffer.capacity, which may not be feasible for ABI and performance reasons, or (2) provide the best-known bound to the subclass initializer somehow, which seems difficult.

lorentey · August 7, 2020, 1:41am

Yes, that is what people tend to do in practice.

Yep. This would be a difficult sell.

It’s in fact really easy, because the problem has nothing to do with initializers. Neither ManagedBuffer itself nor its subclasses are able to provide any usable initializers; they cannot fit into the normal class initialization rules, so they have to circumvent them entirely. (It would be nice to eventually fix this, but I’m not holding my breath for it — I’m sure there will always be a million more important things to work on. The mere fact that ManagedBuffer exists is already a small miracle!)

Managed buffer instances are created through ManagedBuffer.create. This function’s API is currently broken. Luckily, we can easily fix it! The fix is simple, and it does not need any language-level changes — I have already posted one approach in my initial post, and it does fully support retrieving the actually allocated capacity. All we need is to find a nice color for this shed, which is a purely library-level exercise.

Earlier, this topic started to go in a really interesting direction with questioning the fundamental idea behind create’s callback-based header initialization. I’d really like to explore this further.

lorentey · August 7, 2020, 2:29am

Agreed! For simplicity, let’s suppose we’d want to replace the current create with two factories for the with/without malloc_size cases. Do you think it’s important to keep the header factory closure? I did not initially consider getting rid of it entirely, but the more I think about it, the more attractive the idea becomes.

extension ManagedBuffer {
  class func create(
    exactCapacity: Int, 
    header: Header
  ) -> Self

  class func create(
    minimumCapacity: Int, 
    header: Header
  ) -> (object: Self, capacity: Int)
}

This would require people to prepare a dummy initial header for the second case, which seemed like too much of a chore to me. But is it really? Initializing class ivars with dummy defaults, then refining them after the super.init() call is exactly what we do in regular classes...

lorentey · August 7, 2020, 3:31am

Supposing (but not expecting) that we had the resources to spend some language/runtime effort on such a niche feature, I’d imagine an “ultimate” version of ManagedBuffer to look something like this:

class UnsafeManagedBuffer<Element> {
  init(minimumCapacity: Int) // special builtin

  func allocatedCapacity() -> Int? // or a separate fn elsewhere
  ...
}

class MyStorage<Element>: UnsafeManagedBuffer<Element> {
  var count: Int
  var capacity: Int

  init(minimumCapacity: Int) {
    // Note special initialization order
    super.init(minimumCapacity: minimumCapacity)
    count = 0
    capacity = allocatedCapacity() ?? minimumCapacity
  }
  ...
}

This would be a neat usability improvement, but it still wouldn’t make such classes suitable to be declared public.

Again, I’m not expecting we’d implement this anytime soon, if at all. In fact, I expect to live with Header and create indefinitely, and I’m fine with it!

(If for some reason anyone feels inspired to implement this: please don’t! Consider tackling a more important issue instead. I’m happy to recommend some!)

dabrahams · August 7, 2020, 4:28am

I guess my memory is failing me; I thought we couldn't actually have stored properties in these. Oh yeah, it “works” as long as we manage initialization and destruction manually, and in the case of ManagedBuffer (but not ManagedBufferPointer) the library can handle it for you.

At the time the freshly allocated object is passed to the factory closure, this stored property isn't initialized. The closure is expected to not to access it.

Yes, well, the original design didn't have that problem, but then [SR-1955] Consider renaming or eliminating `ManagedProtoBuffer`. · Issue #44564 · apple/swift · GitHub got addressed, with the thinking that some peril was better than API complexity.

Vending invalid objects to the closure is quite bad enough already, but it gets worse: the closure is allowed to throw. If it does, then the header ivar stays uninitialized, but the custom deinit will still run .

Oy vey. That's very sad; at least it wasn't my doing but still, it's an exception-safety mistake on my watch, which is a little embarrassing.

The only way to fix this I can see is to stop allowing the closure to throw, or to get rid of the closure altogether.

The right thing to do would have been to get the regular language mechanisms to initialize and destroy its stored property. If initialization of a normal stored property throws, the language must consider it already-destroyed for error handling to work at all. Short of that, we could have burned 7 bytes to store an Optional<Header> instead of Header directly and then expose a computed property, but I wouldn't be surprised if both such changes were considered impossible now, given the @_fixed_layout and @inlinable implementation.

(The fact that custom subclasses must not define additional ivars is inconvenient, but I'm not complaining about that -- and sadly I don't expect we'd find a way to lift that restriction any time soon. I do think we need to explicitly verify this constraint at runtime, though, at least in debug builds.)

We used to do that. Oh, weird; we only do it in ManagedBufferPointer.

If we could find a way to make create an actual initializer, then that would be awesome! It would also resolve the naming problem that we have for its replacement. I'm hoping we would be able to do it without introducing a throwaway protocol for it, though.

It's not a throwaway if it's a reusable component in the standard library. Of course I wish the language had a more direct way to support factory initializers… not that I actually care about supporting classes .

ManagedBuffer is objectively not a joy to work with

Oh, come now. I don't love working with it either, but I would never go so far as to claim that lack of joy is anything but subjective!

In fact, I do derive a tiny bit of joy from the fact ManagedBuffer[Pointer] is available to me at all, now that I don't get to use the internals upon which it is built.

I don't think we can fix that entirely without language changes, but at least we can fix some of the lower hanging API issues.

What language changes would you be looking for?

If we were to design this from scratch, then with the benefit of hindsight I'd probably consider fixing problems #4 and #5 in the new design. I don't think we're at that stage yet , but I do think we need to seriously consider getting rid of the capacity property. (I.e., deprecating it everywhere based on a language version, and not carrying it forward to new platforms.)

I'm beyond considering; to me, it's obvious that the whole component should be deprecated and replaced with an UnsafeManagedBuffer that doesn't have a capacity property. Short of that, definitely “get rid of” capacity as you suggest.

This is good information, but it's not something I can practically act on.

Sure it is; you can consider it when thinking about how much emphasis to put on making it possible to create beautiful subclasses of ManagedBuffer suitable for public consumption, and about whether APIs exposed on it are going to “pollute” anything important.

The buffer cannot just be a dumb storage class with no knowledge about its concrete element type -- it must be subclassed to customize deinit.

You don't have to handle it by subclassing; for example, it could take a generic parameter that describes deinit policy.

If we had language features such as a type system-level way of modeling methods that must only be called on a unique instance, then we'd probably want to forward to a class method more often, just to reap the benefits of static uniqueness checking. I expect this would put far more pressure on this bug.

Sorry, do you mean “a uniquely-referenced instance?” I can't follow you here.

dabrahams:

We should! These components predate the existence of Builtin.allocWithTailElems_N and those capabilities should be available to users. How many times have I said to myself, "that leads me to make an array of optionals and… then I guess I won't do that because of the 7 bytes of internal fragmentation in common cases. I wish I could make a collection that's semantically an array of optionals but that uses a bitmap to indicate which elements are alive… I suppose I could hack that on top of ManagedBuffer … but it would be too horrible for words. Sigh.”?

Yes, yes, a thousand times yes! However, I suspect we cannot model heterogeneous buffers in a satisfying way yet. (E.g., variadic generics sound like something we would want to use.) It would be interesting to see how far we can get, though.

I think we can get quite far, actually.

ManagedRawBuffer would be a way to allow people to manually partition their buffer storage, in a more palatable way than rebinding memory vended by ManagedBuffer . This would at least allow us to manually implement heterogeneous buffers without having to fight against the stdlib every step along the way.

That makes sense.

dabrahams · August 7, 2020, 4:43am

Simplicity is having just one API, but OK, I'll bite.

What people do with regular classes in the privacy of their own homes is their business, but is surely not a good example for our children to follow; I teach mine never to use classes in the first place .

More seriously, 2-phase initialization is wasteful, not always possible for an arbitrary type, and tends to lead people to weaken invariants (e.g. replacing Header with Optional<Header>). I don't think we should design standard library APIs around it.

FWIW-ly y'rs,
Dave

lorentey · August 7, 2020, 7:08am

This information is golden, thank you!

I have completely forgotten about SE-0127.

(I’m really not blaming these wrinkles on anyone. Actually I feel it’s my own personal fault for not pitching this before we etched these things in stone in 5.0 — if my list of problems came across as sarcastic, that’s because I’m furious with myself for sitting on this for years! But at the end of the day, ManagedBuffer has received attention roughly proportional to its overall weight, and it’s not like it doesn’t fulfill its purpose — it just has a few minor problems that I feel have finally reached a critical mass with @3405691582‘s OpenBSD effort. I’m hoping we can quickly fix these and move on to happier matters.)

If this was a bug that caused any actual problems in real life code, we could add a builtin to destroy a uniquely refd object without deinitializing it. But it seems easier to just deprecate & replace create, eliminating all problems caused by the half-initialized object.

Not being able to throw inside create’s closure (if we do end up keeping it) doesn’t feel like a hardship to me.

Adding this check to the existing create would be a potential bincompat problem, but we should definitely add it to any new implementations.

This feels like it deserves a separate discussion. I don’t personally mind that ManagedBuffer.create does not look like an initializer — because it doesn’t work like a real initializer. The unidiomatic API is alerting me that ManagedBuffer is a really weird class that isn’t following the normal rules.

On this, we are in full agreement. Hyperbole withdrawn!

Absolutely! I don’t know where this comes from — the idea of exposing low-level implementation details such as ManagedBuffer subclasses as public API is frankly absurd.

My concern is with the interfaces these classes vend to their subclasses and the structs that own them. And my primary beef is with confusables such as capacity vs header.capacity — which is why I recommend we eliminate that one. It seems we’re in agreement on that, too!

This is an interesting idea. It could even be done through a member requirement on the Header type. A final managed buffer class would certainly be an attractive concept — but is it a big enough improvement to be worth deprecating ManagedBuffer? I don’t think so, at least not on its own. But we should definitely investigate this approach for the heterogeneous (and/or raw) case.

Well, Swift’s regular initialization rules already fully embraced this approach, so I don’t really see a point in going against it in this particular little corner. Then again, I am 100% sure I will need to store inner pointers to select sub-buffers within the header, and it would be slightly inconvenient if I had to make them optional or populate them with temp data.

I don’t much care about writing twice to a couple of header fields — should I? Even if llvm doesn’t get rid if the dummy stores for me, the header will already be in the cache, so I expect the overhead (if any) would be entirely swamped by malloc noise and the initialization of actual buffer contents.

Ultimately, this is a minor detail. I guess it will come down to whether we find a reasonable way to eliminate the call-site ambiguity with the old & busted create variant.

compnerd · August 7, 2020, 10:56pm

This would be useful in other cases as well: embedded/bare-metal targets as well as for using a custom memory allocator (e.g. scudo).

lukasa · August 8, 2020, 4:45pm

Swift can already use Scudo: it’s available in the 5.3 branch and master as a sanitizer.

dabrahams · August 10, 2020, 2:35pm

I don't think I agree. How do you justify that claim?

lorentey · August 10, 2020, 8:59pm

We must be using two different languages. ¯\_(ツ)_/¯

All I'm saying is that until I've initialized all instance variables, I cannot access self (in order to, say, retrieve the address of its tail-allocated storage). If the initial value of one of my ivars depends on self, then I must use two-phase initialization. Note that I'm not arguing against this behavior; I'm merely pointing out that ManagedBuffer.create violates it.

I don't really consider this a big change -- malloc_size is a useful optimization, but it's hardly essential. I'm also not arguing for eliminating its use; I'm merely proposing to restrict it to allocation time, which covers all use cases within the stdlib (and every deliberate use I've seen outside of it). This is already how things work, except for ManagedBuffer, whose API accidentally assumes malloc_size is always available.

If we consider malloc_size critical enough to require it, then we ought to add a direct UnsafeMutablePointer method for it, so it becomes easily available for manual allocations, too.

dabrahams · August 13, 2020, 9:03pm

You can if you use lazy var. You could argue that that's a form of two-phase initialization, but the var never actually has any other initial value.

If the initial value of one of my ivars depends on self , then I must use two-phase initialization. Note that I'm not arguing against this behavior; I'm merely pointing out that ManagedBuffer.create violates it.

Maybe that's what you meant, but your words were that the language already “fully embraced” two-phase initialization. I don't agree, in the sense that Swift's normal rules enable me to write initializers that establish strong invariants and don't make stored properties accessible before they are initialized. A language that embraced two-phase initialization might, for example, zero-initialize all integer members for you and allow you to interact with them in that state before you ever got a chance to replace that zero with a new value.

ryansobol · May 8, 2022, 6:22am

Just stumbled upon this thread trying to (re)implement some core collection types for fun and learning.

Fwiw, I came to the forums in the first place searching for insights how to use ManagedBuffer. Franky, its docs, public API, and even its source code are unclear and confusing.

This thread has a ton of great ideas that seem to have been lost to time. I wonder, would it be possible to rejuvenate this work in time for Swift 6?