Pointer arithmetic with Builtin.RawPointer?

ignat · February 10, 2019, 5:00pm

I recently upgraded an old project from Swift 2 to Swift 4.2. It hooks into the dyld API to read the dynamic libraries that are loaded when an iOS app is launched.

I'll try to skip the details, but I have a pointer that I use as a cursor (of type UnsafeRawPointer) which moves around memory and reads various data using the base information from dladdr(). I used to be able to initialize this cursor like so:
if var lc_cursor = UnsafeRawPointer(bitPattern: mh.hashValue + header_size) { ... }
header_size is an Int with the value 32 and mh is an UnsafePointer<mach_header>

Now what used to happen is that the hashValue of any UnsafePointer gave you the address of the pointer, but now it gives you a literal hash value, meaning some large random number (and I cannot fathom why this change was made). After reading the Swift source, I discovered that UnsafePointer has a hidden property that is accessible called _rawValue.

Looking at this value in the lldb debugger, I found out that it is of type Builtin.RawPointer. Well ok, I thought at first, so I'll just use it since it can be used to init an UnsafeRawPointer, but it turns out that you can't do arithmetic on this type. There's no overload for + on type RawPointer and Int, so I can't just do UnsafeRawPointer(bitPattern: mh._rawValue + header_size). There are a whole bunch of caveats too: you can't cast this RawPointer type like mh._rawValue as Int, or make an Int like Int(mh._rawValue). Interestingly though, type(of: mh._rawValue) told me that it is of type Builtin.Int64. Therefore it's possible to cast this RawPointer to an Int64 like unsafeBitCast(mh._rawValue, to: Int64.self), but that seems like a lot of ceremony just to add two numbers together. I'm trying to keep my code as short as possible while staying within Swift.

I guess I could just create the UnsafeRawPointer and then add the offset, but isn't that creating an extra pointer? Why can't I just get the address of the UnsafePointer as an Int, add the offset, then create the pointer? Why can't I perform simple arithmetic on this Builtin.RawPointer type?

If you want to take a look at my project, here's a link to the github page. Thanks.

ASwiftUser · February 10, 2019, 5:35pm

The description of a pointer is its address:

0x0FF5G880CDD32378

So Int.init(_, radix:) on ptr.description.dropFirst(2). Then you can use UnsafePointer.init(bitPattern:).

Karl · February 10, 2019, 5:52pm

You shouldn't be relying on implementation details like hash-values or parsing descriptions, and you definitely shouldn't be using any Builtin types or underscored members (implementation details of the standard library). As you've seen, relying on those things can make your code brittle.

The way to do this accurately in Swift is to keep your cursor as an UnsafeRawPointer. URP includes support for the addition operator, and you can use the load(as: T.self) and load(fromByteOffset: Int, as: T.self) methods to load a mach_header or Int32 or whatever else from it.
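A minimal sketch of the cursor approach described above. FakeHeader and the hand-built buffer are stand-ins invented for illustration (the real code would walk a mach_header in image memory); only the UnsafeRawPointer calls are the point here.

```swift
// Hypothetical stand-in for a C header struct such as mach_header.
struct FakeHeader {
    var magic: UInt32
    var commandCount: UInt32
}

func readHeaderAndPayload() -> (magic: UInt32, count: UInt32, payload: UInt32) {
    let headerSize = MemoryLayout<FakeHeader>.size

    // Build a small buffer: one FakeHeader followed directly by a UInt32 payload.
    let storage = UnsafeMutableRawPointer.allocate(
        byteCount: headerSize + MemoryLayout<UInt32>.size,
        alignment: MemoryLayout<FakeHeader>.alignment)
    defer { storage.deallocate() }
    storage.storeBytes(of: FakeHeader(magic: 0xfeedfacf, commandCount: 1),
                       as: FakeHeader.self)
    storage.storeBytes(of: 42, toByteOffset: headerSize, as: UInt32.self)

    // Keep the cursor as an UnsafeRawPointer and move it with `+`.
    let base = UnsafeRawPointer(storage)
    let header = base.load(as: FakeHeader.self)
    let cursor = base + headerSize
    let payload = cursor.load(as: UInt32.self)

    // Alternatively, read a single field without moving the cursor.
    let countOffset = MemoryLayout<FakeHeader>.offset(of: \FakeHeader.commandCount)!
    let count = base.load(fromByteOffset: countOffset, as: UInt32.self)

    return (header.magic, count, payload)
}
```

Note that load(as:) requires the address to be suitably aligned for the type being loaded; the offsets used here satisfy that.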

I'm not sure if it's safe to assume the pointer returned from dlopen is implicitly "bound" in the Swift sense.

ignat · February 10, 2019, 6:25pm

I thought as much. I guess if I want to optimize Swift code that calls C libraries, I might as well just use C. Thanks!

scanon · February 10, 2019, 6:52pm

Addressing the implicit question here: the change to hashing behavior throughout the standard library is SE-0206.

Like @Karl said, do not use Builtin types outside of the standard library. The Builtin module moves slowly, but is subject to essentially arbitrary change, and your code will break.

You can certainly do that, but it's not necessary. The load(fromByteOffset:as:) method that Karl references or the techniques described under "Raw Pointer Arithmetic" in the UnsafeRawPointer documentation allow you to do this in Swift. I do not understand your concern about "creating an extra pointer"; at the machine level these are completely equivalent, and at the source-language level you only need to "create" one pointer.

beccadax · February 10, 2019, 8:13pm

"Creating an extra pointer" is just copying a word of memory. UnsafeRawPointer is a fixed-layout struct, just like Int, so there's no reason not to use it—it won't add any overhead.

You probably could, but you'd suddenly need to worry about integer signedness, your arithmetic ending up with a zero, etc. Better to just leave it as a pointer type.

Basically, by bringing a Builtin type into your code, you've caught a glimpse of the world as the standard library sees it. Types and functions from the Builtin module are primitives that the standard library wraps in actually usable APIs like UnsafeRawPointer. Arithmetic operators are one of the things those wrappers add.

Underscored APIs like _rawValue are "no user-serviceable parts inside"; you shouldn't use them.

ignat · February 10, 2019, 9:28pm

Thank you everyone, I understand now. It was really just a visual preference, where I wanted let new_ptr = UnsafeRawPointer(base + offset) but let new_ptr = UnsafeRawPointer(base) + offset does what I need.

My only remaining thought is that load(as: T.self) on an UnsafeRawPointer instance creates a new instance of that type, so if I understand correctly it won't refer to the memory that the UnsafeRawPointer points to. If I am sure that the memory is already bound correctly, it would be fine to use assumingMemoryBound(to: T.self) instead, right?

Oh and I have these two not very pretty lines:

let seg_cmd = UnsafeMutableRawPointer(mutating: lc).assumingMemoryBound(to: segment_command.self)
let segname = String(cString: &seg_cmd.pointee.segname.0)

Is there a better way to get the segname out of the pointer? It is defined in C as char segname[16]; but is imported into Swift as a tuple of 16 Int8s. I don't know of a String initializer that would take that tuple directly.

scanon · February 10, 2019, 9:44pm

CC @Michael_Ilseman who may have a suggestion here.

Karl · February 10, 2019, 11:48pm

Yes, the documentation for load states that it will copy the memory to initialise a new, independent T.
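To make the difference concrete, here is a small sketch (the function name and the Int32 values are made up for illustration). The value returned by load(as:) is an independent copy, while the pointer returned by assumingMemoryBound(to:) keeps referring to the original memory:

```swift
func compareLoadWithBoundPointer() -> (copied: Int32, viewed: Int32) {
    let raw = UnsafeMutableRawPointer.allocate(
        byteCount: MemoryLayout<Int32>.size,
        alignment: MemoryLayout<Int32>.alignment)
    defer { raw.deallocate() }

    // Bind the memory to Int32 and initialize it to 1.
    let typed = raw.initializeMemory(as: Int32.self, repeating: 1, count: 1)

    // load(as:) copies the bytes into a new Int32 value.
    let copied = UnsafeRawPointer(raw).load(as: Int32.self)

    // assumingMemoryBound(to:) returns a typed pointer to the same memory.
    // This is only valid because the memory really is bound to Int32 above.
    let view = UnsafeRawPointer(raw).assumingMemoryBound(to: Int32.self)

    // Change the stored value after taking the copy.
    typed.pointee = 2

    // `copied` still holds the old value; `view.pointee` sees the new one.
    return (copied, view.pointee)
}
```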

As for raw/typed pointer, unfortunately we don't really have great high-level documentation about it. The official language guide doesn't mention the distinction at all, and I couldn't find any design documents in the compiler source repository. If you go back to the Swift 3 migration notes though, it does say this is okay:

In general, developers should not make layout assumptions. However, some “obvious” cases can be safely assumed, including homogeneous arrays and tuples, and structs with homogeneous stored properties. Imported C structs naturally follow the layout rules of the platform’s C ABI.

As you've seen though, the interface is more awkward once you need to add byte-offsets and read values of different types. This kind of parsing of binary layouts is largely why the raw APIs exist.

Also check out @Andrew_Trick's recent post on correct use of these APIs.

You can use withUnsafe(Mutable)Pointer(to: inout T) on the tuple, transform it into a fixed-capacity buffer as shown here, then use String.init(cString:).

eskimo · February 11, 2019, 9:06am

… then use String.init(cString:).

Segment and section names are not guaranteed to be nul terminated, so you can’t use a C string initialisation. Rather, use an initialiser that takes a length, like String.init?(bytes:encoding:).

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

ignat · February 11, 2019, 1:31pm

I thought segment names are all null-terminated? 16 bytes is just an upper limit, I don't think any of them use up the full 16 bytes, but I can definitely agree that it should be possible. I don't know what will happen if all 16 bytes are not null, will the cString: initializer continue to try to read bytes?

I tried to make a String using String.init?(bytes:encoding:), but tuples are not iterable (they are not Sequences), so I cannot initialize it this way, at least not in a straightforward manner. Looking at answers on StackOverflow, people suggested using Mirror's children to iterate over a tuple, but I'm not the one that needs to iterate over it; I want the String initializer to do it, so again that's a no-go.

@Karl's suggestion is essentially the same as what I'm doing already, just sending a pointer to the first element of the tuple to the initializer instead of the tuple itself. It is a viable way to make the String, although again I don't know what will happen if all 16 bytes have some data. Does withMemoryRebound(to:capacity:body:) add an extra null byte after the end of the rebound memory? Because if not, won't the String(cString:) initializer keep reading bytes?

Looking at other initializers of String, the only other reasonable one would be String(bytesNoCopy:length:encoding:freeWhenDone:), but it adds all the extra 0's as part of the string, creating something like "__TEXT\0\0\0\0\0\0\0\0\0\0". I found that NSString has an initializer, NSString(bytes:length:encoding:), which shows correctly in the debugger (print object) but still falls apart when converted into a String and adds all the 0's back in again.

I guess I need an initializer that's something like String(cString:maxLength:), unless I'm completely misunderstanding how the cString: initializer and memory binding of tuples work. But then again, I don't think I'll ever run into this problem in this use case.

eskimo · February 11, 2019, 3:06pm

I thought segment names are all null-terminated?

Nope. If you want to test this, set Other Linker Flags (OTHER_LDFLAGS) to -sectcreate xxx yyy /some/file/path and tweak xxx and yyy. If you set them to 16 characters, things work and the load command has no trailing nul. If you set either to 17 characters or more, Xcode warns you about truncation.

As for converting this to a String, I’d use something like this.

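extension segment_command {
    var segmentName: String {
        // 17 bytes: the 16-byte name plus one zero byte that is never overwritten,
        // so the buffer stays nul-terminated even when the name fills all 16 bytes.
        var buffer = [UInt8](repeating: 0, count: 17)
        var tmp = self.segname
        memcpy(&buffer, &tmp, 16)
        return String(cString: &buffer)
    }
}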

Still, this worries me because there are two implicit assumptions:

Stuff after the first nul is not significant.
The string is encoded as UTF-8.

If I were doing this in production code I’d limit it to debugging / logging purposes. For identifying a segment, I’d treat the 16 byte name as an opaque token.
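One way to read "opaque token" is to compare all 16 bytes without decoding them. The sketch below is only an illustration: Name16, SegmentNameToken, and the sample byte values are made up here, with Name16 mirroring how a C char[16] field is imported into Swift.

```swift
// Mirrors the imported form of `char segname[16]`: a tuple of 16 Int8 values.
typealias Name16 = (Int8, Int8, Int8, Int8, Int8, Int8, Int8, Int8,
                    Int8, Int8, Int8, Int8, Int8, Int8, Int8, Int8)

// Hypothetical wrapper that keeps the raw 16 bytes and compares them as-is.
struct SegmentNameToken: Hashable {
    let bytes: [UInt8]

    init(_ name: Name16) {
        // Copy the tuple's bytes without interpreting them.
        bytes = withUnsafeBytes(of: name) { Array($0) }
    }
}

// "__TEXT" followed by ten zero bytes.
let textName: Name16 = (95, 95, 84, 69, 88, 84, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

// Same visible prefix, but with a non-zero byte after the first nul.
let oddName: Name16 = (95, 95, 84, 69, 88, 84, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)

// Both would read as "__TEXT" if treated as C strings,
// yet they are different 16-byte names.
let sameToken = SegmentNameToken(textName) == SegmentNameToken(oddName)
```

Because the token is Hashable, it can also serve as a dictionary key or set element when matching segments.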

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

TellowKrinkle · February 11, 2019, 8:40pm

Isn't there Int.init(bitPattern:)?

jrose · February 11, 2019, 9:04pm

Int.init(bitPattern:) is indeed the correct way to convert an UnsafeRawPointer to an Int. However, as noted it's not the preferred way to offset a pointer by a certain number of bytes. I can't actually think of anything that will go wrong with it if you're starting with an integer value, but if you're starting with a pointer then it's better to structure all your operations in terms of pointers so that the compiler better understands your intent. (For instance, a future version of the compiler might be able to warn if you accidentally create a typed pointer with the wrong alignment.)
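For comparison, a short sketch of both routes (the function name, the 64-byte allocation, and the offset of 32 are arbitrary choices for illustration). They arrive at the same address, but the pointer-based form never leaves the pointer type:

```swift
func offsetsAgree() -> Bool {
    let base = UnsafeMutableRawPointer.allocate(byteCount: 64, alignment: 8)
    defer { base.deallocate() }
    let offset = 32

    // Preferred: stay in pointer types and offset with `+`.
    let viaPointer = UnsafeRawPointer(base) + offset

    // Integer round trip: convert to Int, add, then rebuild the pointer.
    // UnsafeRawPointer(bitPattern:) is failable and returns nil for address 0.
    let address = Int(bitPattern: base) + offset
    let viaInteger = UnsafeRawPointer(bitPattern: address)

    return viaInteger == viaPointer
}
```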

Michael_Ilseman · February 11, 2019, 9:37pm

String(decoding: myBytes, as: UTF8.self) is the canonical way to form a String from a collection of bytes representing UTF-8 content. So you just need to get a Collection that doesn't include trailing 0s, and myBytes.prefix { $0 != 0 } will give you that.

I haven't tested this, but looking at this line, it could look like:

let segname = withUnsafeBytes(of: seg_cmd.pointee.segname) { bytes in String(decoding: bytes.prefix { $0 != 0 }, as: UTF8.self) }
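In segment_command, segname follows the cmd and cmdsize fields, so the name bytes start at that field's offset rather than at the start of the struct. A minimal sketch of reading them through a raw buffer, using a made-up FakeSegmentCommand with only the leading fields (an assumption so the example is self-contained; real code would use the imported segment_command):

```swift
typealias Name16 = (Int8, Int8, Int8, Int8, Int8, Int8, Int8, Int8,
                    Int8, Int8, Int8, Int8, Int8, Int8, Int8, Int8)

// Hypothetical stand-in holding only the first three fields of segment_command.
struct FakeSegmentCommand {
    var cmd: UInt32
    var cmdsize: UInt32
    var segname: Name16
}

func segmentName(at command: UnsafePointer<FakeSegmentCommand>) -> String {
    // Byte offset of the name field inside the struct.
    let offset = MemoryLayout<FakeSegmentCommand>.offset(of: \FakeSegmentCommand.segname)!

    // View exactly the 16 name bytes; the count bounds the read,
    // so a name without a trailing nul cannot run past the field.
    let nameBytes = UnsafeRawBufferPointer(
        start: UnsafeRawPointer(command) + offset,
        count: MemoryLayout<Name16>.size)

    // Stop at the first nul, if any, and decode what precedes it as UTF-8.
    return String(decoding: nameBytes.prefix(while: { $0 != 0 }), as: UTF8.self)
}

// "__TEXT" padded with zero bytes.
let paddedCommand = FakeSegmentCommand(
    cmd: 1, cmdsize: 24,
    segname: (95, 95, 84, 69, 88, 84, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
let paddedName = withUnsafePointer(to: paddedCommand) { segmentName(at: $0) }

// "0123456789abcdef": all 16 bytes used, no nul terminator.
let fullCommand = FakeSegmentCommand(
    cmd: 1, cmdsize: 24,
    segname: (48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102))
let fullName = withUnsafePointer(to: fullCommand) { segmentName(at: $0) }
```

String(decoding:as:) replaces invalid UTF-8 with the replacement character instead of failing, so this is best suited to display and logging.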

ASwiftUser · February 11, 2019, 10:00pm

I think you mean myBytes.suffix { $0 != 0 } (or you meant leading zeros)

Torust · February 11, 2019, 10:08pm

prefix(while: { $0 != 0 }) means the prefix until the first zero, so what was written was correct.

ASwiftUser · February 11, 2019, 11:27pm

Oops. I was thinking of dropFirst

scanon · February 12, 2019, 12:06am

Note that all of these prefix variants are slightly wrong anyway, because as clarified by @eskimo, segment names are technically neither nul-terminated nor UTF8.

Karl · February 12, 2019, 12:22am

It is defined to be ASCII, but not necessarily null-terminated.

segname

A C string specifying the name of the segment. The value of this field can be any sequence of ASCII characters, although segment names defined by Apple begin with two underscores and consist of capital letters (as in __TEXT and __DATA ). This field is fixed at 16 bytes in length.

Apple doesn't seem to publish the document any more, but it's mirrored here.