New literal for string from contents of file

anthonylatsis · March 11, 2018, 9:02pm

What I do not understand is why we can't make the compiler instantiate a Data object at compile time.
The alternative, if I understand, is to retrieve the contents of a file as an array of bytes or a pointer and then instantiate an object. What is the difference in terms of ultimately having an instantiated object at compile time?

So the proposal is to have a #dataLiteral that can later be converted to a suitable type if needed? Now that I think of it, it seems like a very good idea. Data would implement the variant with resource: String; [NS]String and [UInt8] would be easily derived with convenience initializers as Chris noted:

protocol ExpressibleByDataLiteral {

    associatedtype DataLiteralType

    init(resource: DataLiteralType)
}

xwu · March 11, 2018, 9:22pm

We can, but it'd be a different kind of literal, one we're designing today. The full name of "#fileLiteral" is "file reference literal."

anthonylatsis:

Data would implement the variant with resource: String; [NS]String and [UInt8] would be easily derived with convenience initializers as Chris noted:

protocol ExpressibleByDataLiteral {
associatedtype DataLiteralType

init(resource: DataLiteralType)
}

I think you're misunderstanding the meaning of *LiteralType types. Those refer to the default type of the literal, not the value that is passed to the literal initializer, which in this case would likely be UnsafeRawBufferPointer.
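To make the distinction concrete, here is a rough sketch of a protocol whose initializer receives the raw bytes rather than a resource name. The `init(dataLiteral:)` label and the `makeFromEmbeddedBytes` helper are made up for illustration (the helper only stands in for whatever the compiler would do), so treat this as one possible shape and not a settled design.

```swift
// Hypothetical protocol: the initializer receives the embedded bytes directly.
protocol ExpressibleByDataLiteral {
    init(dataLiteral bytes: UnsafeRawBufferPointer)
}

// [UInt8] copies the bytes out, because the buffer is only valid during the call.
extension Array: ExpressibleByDataLiteral where Element == UInt8 {
    init(dataLiteral bytes: UnsafeRawBufferPointer) {
        self.init(bytes)
    }
}

// Stand-in for the compiler (an assumption): hands the bytes to the initializer.
func makeFromEmbeddedBytes<T: ExpressibleByDataLiteral>(_ bytes: [UInt8]) -> T {
    return bytes.withUnsafeBytes { T(dataLiteral: $0) }
}

let copy: [UInt8] = makeFromEmbeddedBytes([0x01, 0x02, 0x03])
```

Any other type could conform in the same way, as long as it copies what it needs before the initializer returns.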

mklbtz · March 11, 2018, 10:27pm

The conversation so far leads me to think there are actually a few different components in this feature with progressive levels of convenience.

1. embed file contents as a buffer of bytes: let buffer = [0x1, 0x2, ... ] (à la @Chris_Lattner3’s comment)
2. embed file contents as an instance of Data explicitly (à la @xwu’s suggestions)
3. embed file contents, then initialize any types (including Data) that conform to a protocol providing init from buffer of bytes.

Component 1 is the real core of this pitch. From there it’s a question of how convenient / complex we want to make it. Building up to 2 would be acceptable to me, as it solves the needs in a fairly convenient and discoverable way. Building up to 3 sounds really interesting and flexible to me, but I could sympathize with an argument that it’s too complex for such a niche use case.

If we build up to 3, I think the syntax feature should not concern itself with string encodings. String’s conformance to the ExpressibleByWhateverLiteral protocol can just assume UTF-8. This is actually what Rust’s include_str macro does, according to their docs. I think it’s the most common use case, plus you can always work around it by embedding as Data first, then initializing String with whatever other encoding you need.
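As a small sketch of how the levels stack up, the snippet below uses a hand-written byte array in place of real embedded file contents (that substitution is an assumption; no literal syntax exists yet). It shows the UTF-8 default and the "go through Data" workaround for another encoding.

```swift
import Foundation

// Component 1: the file contents as a plain buffer of bytes ("Hi" in UTF-8).
let buffer: [UInt8] = [0x48, 0x69]

// Component 2: the same bytes wrapped in Data.
let data = Data(buffer)

// Assuming UTF-8: String(decoding:as:) repairs invalid sequences instead of failing.
let text = String(decoding: buffer, as: UTF8.self)

// Workaround for other encodings: go through Data and name the encoding explicitly.
let utf16Bytes: [UInt8] = [0x48, 0x00, 0x69, 0x00]   // "Hi" in UTF-16 little-endian
let other = String(data: Data(utf16Bytes), encoding: .utf16LittleEndian)
```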

anthonylatsis · March 11, 2018, 10:44pm

I understand that. I am just following the naming of the other *LiteralType. Maybe that's what confused you. (ExpressibleByIntegerLiteral, ExpressibleByBooleanLiteral, ...). When I first discovered these protocols, the naming of the associatedtype confused me as well.

~~I still don't understand the difference between creating an NSData object from a path and creating the same object using an UnsafeMutableRawPointer~~ (*). Except that passing an UnsafeMutableRawPointer sounds very inconvenient and, by the way, where do we get this pointer from, as opposed to having a path to the file?

(*) Ah, is this all to maintain a reference to the buffer representing the file and ensure changes in the file affect the literal as well as the underlying object at compile time?

anthonylatsis · March 11, 2018, 10:56pm

Agreed. Moreover, the consideration of encodings would conflict with the protocol, and it is unnecessary given the ~~already mentioned~~ actual high permutability of the types of interest.

Chris_Lattner3 · March 12, 2018, 12:18am

Great summary. Given that this is something that will be used in only a few places, I don't see a strong reason to overcomplicate and over-sugar the proposal. Question to consider: how simple can we make this without losing its core essence?

-Chris

samdeane · March 12, 2018, 12:30am

@Chris_Lattner3 and @xwu managed to express what I was trying to say, better than I managed to say it :)

I thought that Data would have been a safer interface to the embedded data, especially if the compiler could construct the object completely statically to make it efficient, but I didn't consider the dependency that would involve - so I can see why UnsafeBufferPointer makes more sense.

When I mentioned encoding I probably should have picked a different word. I wasn't intending to imply string encodings, but just more generally some hint as to how the raw data is to be interpreted - be that a file extension, MIME type, whatever. I imagine that there will be many times when the raw format is known (or assumed) from the context, without needing this hint, but that there might also be situations where it's essential to be able to provide it?

One thing I am now wondering about is whether widespread (over-enthusiastic?) adoption of this feature would have too negative an impact on load times. I'm assuming that the raw data would go into something akin to the __TEXT section?

anthonylatsis · March 12, 2018, 12:36am

In terms of simplicity, I would say option 2 is a necessary convenience for option 1, while 3 is optional relative to 2. Considering this, the implementation should not be any simpler than a #dataLiteral, which IMO is therefore the best option so far in terms of convenience and generality.

@xwu @mklbtz What do you think?

mklbtz · March 12, 2018, 1:56am

Embedding a ton of data into an executable would certainly have some impact on its initial loading time and of course its size. To what degree… I couldn't say. It's a tradeoff and the main thing you'd be buying is the ability to not rely on the file system when you don't want to. In some cases you'd be able to ship an executable with no file dependencies, but I wouldn't go overboard with it. It would be interesting to benchmark this against loading the same data from a file on disk. See where the limits are.

mklbtz · March 12, 2018, 2:36am

I'm with you on that. Vending Data seems like the minimum level of ergonomics we'd want to provide. Even building a string, my original motivation, would be fairly easy.

let _ = String(data: #fileContentLiteral(...), encoding: .utf8)!
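One caveat about the force-unwrap above: String(data:encoding:) is failable, so bytes that are not valid in the named encoding produce nil. A minimal sketch, using hand-written Data in place of the hypothetical literal (the `decodeUTF8` helper is only for illustration):

```swift
import Foundation

// Illustrative helper: returns nil instead of trapping when the bytes are not UTF-8.
func decodeUTF8(_ data: Data) -> String? {
    return String(data: data, encoding: .utf8)
}

let valid = decodeUTF8(Data([0x6F, 0x6B]))   // the bytes of "ok"
let invalid = decodeUTF8(Data([0xFF]))       // 0xFF never occurs in well-formed UTF-8
```

With a real embedded file the bytes would be fixed at build time, so a bad encoding would show up on first use of the string instead of only on some inputs.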

That said… all the other literals have a corresponding ExpressibleBy protocol. If we're going to the trouble to vend Data, it wouldn't be much more effort to add such a protocol for this and implement it for Data, String, etc.

What I'm not sure about is how much effort it would take on the compiler side to make this "macro" behave like a function with a generic return type. But since this is how #colorLiteral works, I can't imagine it'd be too much.

It's possible to make your own Color type and still use literals.

let _: MyColor = #colorLiteral(red: 1, green: 1, blue: 1, alpha: 1)

This is why I was pushing for the same kind of design.

let _: StringyString = #fileContentLiteral(...)
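For comparison, this is how an existing public literal protocol already lets the type context pick the initializer. The StringyString definition is invented here to match the name used above; whether a file-content literal would follow the same pattern is still an open question.

```swift
// A custom type that accepts string literals through the standard library protocol.
struct StringyString: ExpressibleByStringLiteral {
    let value: String

    init(stringLiteral value: String) {
        self.value = value
    }
}

let inferred = "hello"                 // no annotation: the default literal type, String
let custom: StringyString = "hello"    // same literal, routed through init(stringLiteral:)
```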

anthonylatsis · March 12, 2018, 3:12am

True. But, speaking about reasonable quantities, it's great if we don't have to spend that time at run time. Likely the asymptotics are linear, since we are talking about loading and allocating strings and arrays of bytes. But it looks like we must sacrifice space - twice as much - for our data for everything to be fast. This is a very important point to consider.

Unfortunately, the 'ExpressibleByColorLiteral' protocol is internal; otherwise we of course could, as with the other open ExpressibleBy*Literal protocols. Maybe ExpressibleByDataLiteral should be internal too, to discourage usage with types other than the standard ones.

mklbtz · March 24, 2018, 7:09pm

I'm working on the proposal for this. I'm looking for feedback on some details that haven't been discussed yet. Here's the work-in-progress

Regardless of the exact syntax, we want the parameter to be a file path, like this: #xxxx(contentsOf: "path/to/file.ext"). Here's an excerpt on the constraints on this parameter:

It is considered a build-time error to provide an empty path or a path to a file that does not exist or cannot be read. Paths must be written using Unix conventions. When a relative path is written, the compiler will use the project directory as the root directory for the relative path. Here, "project directory" refers to the directory containing the .xcodeproj or Package.swift.

I'm making a few assumptions about relative file paths here. Using the project directory as the root seemed natural to me, but I might be overlooking something. Anyone have thoughts on this?

Also, I realize it's totally possible to have an Xcode project for a SPM package that's in a different directory. I'm not sure how to rectify that. Do Xcode projects have a more explicit "project root"? Would it happen to match the path to Package.swift when you use swift package generate-xcodeproj?
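To make the path rules in the excerpt easier to discuss, here is a rough sketch of the checks as ordinary Swift. The function and error names are invented, and a real implementation would live in the compiler or build system, not in user code.

```swift
import Foundation

// Hypothetical errors mirroring the build-time failures described in the excerpt.
enum LiteralPathError: Error {
    case emptyPath
    case unreadable(String)
}

// Resolves a Unix-style path against the project directory; absolute paths pass through.
func resolveLiteralPath(_ path: String, projectDirectory: String) throws -> String {
    guard !path.isEmpty else { throw LiteralPathError.emptyPath }
    if path.hasPrefix("/") { return path }
    let root = projectDirectory.hasSuffix("/")
        ? String(projectDirectory.dropLast())
        : projectDirectory
    return root + "/" + path
}

// Fails if the file does not exist or cannot be read.
func checkReadable(_ resolvedPath: String) throws {
    guard FileManager.default.isReadableFile(atPath: resolvedPath) else {
        throw LiteralPathError.unreadable(resolvedPath)
    }
}
```

The open question about which directory counts as the project root only affects the value passed as `projectDirectory`; the checks themselves stay the same.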

Cheers

QuinceyMorris · March 24, 2018, 9:38pm

I think this is a great idea, but in its current form it looks to become a usability problem, in Xcode at least.

The problem is that, as soon as you enshrine even a relative path in code, something as simple as dragging a project item to a different place in the project hierarchy may easily break the code. A single, isolated instance of this is not hard to deal with, but in a large, source-controlled, multi-developer project, these kinds of errors build up to a constant nuisance.

A different problem arises when the file contains text. Traditionally, Xcode tolerates variations of encodings (UTF-8, UTF-16, and other things) and variations of line-endings in source code. Text files come from external sources with varying combinations of such variations, and by the time they're added to the project, the variations are opaque to text editing.

That says to me that such files need a "compilation" pass (in general). For pure binary data, the "compilation" is a pass-through. For text, the compilation is (say) a re-encoding to UTF-8 with standardized line ending. For other recognized types, if any, a different compilation might be needed.
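As an illustration of what the text "compilation" could do once the input has been decoded, here is a minimal line-ending normalizer. The function name is made up, and detecting and converting the source encoding is deliberately left out of this sketch.

```swift
// Rewrites CRLF and lone CR line endings as LF, working on Unicode scalars so that
// "\r\n" (a single Character in Swift) is handled explicitly.
func normalizeLineEndings(_ text: String) -> String {
    var output = String.UnicodeScalarView()
    var previousWasCR = false
    for scalar in text.unicodeScalars {
        if scalar == "\r" {
            output.append("\n")          // CR becomes LF
            previousWasCR = true
        } else if scalar == "\n" && previousWasCR {
            previousWasCR = false        // LF of a CRLF pair: already emitted
        } else {
            output.append(scalar)
            previousWasCR = false
        }
    }
    return String(output)
}

let normalized = normalizeLineEndings("a\r\nb\rc\n")
```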

Separately, it makes no sense to me that the contents of these files are literally compiled into other Swift source files. That makes the compilation unit bigger and slower, and (by the syntax-coloring argument used as motivation) no one really wants to look at the file contents in situ in the source file.

Surely it would be better to break this entire proposal into two parts:

1. Compile the external file independently into an object file, with some mangled/namespaced public symbol, representing some kind of Sequence-compatible data (e.g. a [UInt8] if nothing better), that's linked into the final binary. Maybe this could use a double-barreled file extension convention such as .swiftdata.png for binary, or .swifttext.sql for readable text.
2. Change the #xxx literal syntax to refer to the public symbol, using the associated data in an initializer for Data, as currently proposed.
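A rough picture of the two parts, written out by hand: the EmbeddedFiles namespace, the schemaSQL symbol, and the idea that it came from a file named schema.swifttext.sql are all hypothetical, and a real build step would emit an object file and a mangled symbol, not Swift source.

```swift
// Part 1 (hypothetical): what "compiling" schema.swifttext.sql might expose,
// shown here as a namespaced symbol holding the file's bytes.
enum EmbeddedFiles {
    // The UTF-8 bytes of the text "SELECT 1;".
    static let schemaSQL: [UInt8] = [0x53, 0x45, 0x4C, 0x45, 0x43, 0x54, 0x20, 0x31, 0x3B]
}

// Part 2 (hypothetical): the literal would refer to that symbol, and the bytes
// would feed an ordinary initializer at the use site.
let query = String(decoding: EmbeddedFiles.schemaSQL, as: UTF8.self)
```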

I realize this is not a trivial alternative, and — more significantly — cannot be done without changes to Xcode (at least to recognize the need to "compile" one of the special files). However, I can't think of any approach to this file-literal feature that can work reasonably without some Xcode support.

It would be unfortunate if this great feature was implemented in a way that caused Xcode users to shun it because it was too troublesome to use.

FWIW

anthonylatsis · March 24, 2018, 9:56pm

I have the impression you think the contents of a file will be visibly embedded. The idea is to have a ready-to-use object with the contents loaded into it. Excuse me if I misunderstood.

I don't see why this should be considered a strong argument. Couldn't you say the same about all images, textures, scenes, animations etc. you refer to with paths when developing games, for instance?

Apart from that, you have a point (some of it was already discussed though and people who participated are aware).

mklbtz · March 24, 2018, 10:02pm

Correct. Any kind of special IDE support (like displaying file contents in the editor) is out of scope and indeed would vary from editor to editor. This is more about compiler support for embedding data.

QuinceyMorris · March 24, 2018, 10:07pm

I wasn't sure whether there might be an intention that Xcode would display some representation of the contents (which it does for, say, a color literal in a playground).

If the contents aren't represented "embedded", I was trying to say, that supports the idea of separate compilation. (I was not calling for this kind of representation.)

No, because if you're retrieving resource files at run time, you typically use the Bundle-relative resource APIs, and are not directly concerned with literal path strings at all. (You can choose to use some subpaths within the bundle's resources directory, but it's something of a PITA, and typically not necessary.)
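For reference, the run-time lookup described here usually looks something like the sketch below. The `loadResource` wrapper and its parameter names are invented; `Bundle.url(forResource:withExtension:)` and `Data(contentsOf:)` are the Foundation calls it relies on.

```swift
import Foundation

// Looks a resource up by name within a bundle, so no literal file-system path
// appears in the code. Returns nil if the resource is missing or unreadable.
func loadResource(named name: String,
                  withExtension ext: String,
                  in bundle: Bundle = .main) -> Data? {
    guard let url = bundle.url(forResource: name, withExtension: ext) else {
        return nil
    }
    return try? Data(contentsOf: url)
}

let missing = loadResource(named: "no-such-resource-for-this-sketch", withExtension: "bin")
```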

anthonylatsis · March 24, 2018, 10:20pm

So you mean dragging around a file in your project will likely keep it in the same bundle. I think we can mitigate the shortcoming of invalid paths simply by avoiding absolute paths. For example, "assets/some.js" would search for an assets folder. Or we could even use just file names.

QuinceyMorris · March 24, 2018, 10:37pm

It will (without the "likely"). What goes into the bundle is controlled by target membership, which is unrelated to the project item hierarchy. The project item hierarchy itself bears no necessary relationship to the file system hierarchy of the files represented by the items. The bundle directory hierarchy bears no necessary relationship to the project item or source file system hierarchies.

Even then, Xcode would have to be changed to tell the Swift compiler where to look, since the above variability means that the location of the .swift file is no guide.

anthonylatsis · March 24, 2018, 10:52pm

Thanks for pointing this out! Although it is obvious upon reading, I never actually thought about it.

Xcode can help with indicating the right bundle, I take it? But yes, of course, for this to work nicely we will need some Xcode support. I presume that is your main point.

QuinceyMorris · March 24, 2018, 11:00pm

Yes!