A better data pretty-printer

dabrahams · March 29, 2020, 10:27pm

With this post, I hope to open a discussion of the design requirements for a library, similar to Python's pprint, that could eventually be incorporated into the standard library and inform the design of many parts of the Swift ecosystem.

Introduction

There are many contexts, from educational/research tools like Playgrounds and Colab Notebooks to industrial programming activities like debugging and logging, in which it's important to be able to easily visualize/understand Swift data structures. For consumption by actual humans, though, Swift's facilities for formatting data leave a lot to be desired. Take a trivial example:

(0..<10).map { Array($0..<10) + (0..<$0) }

If you print this expression, you'll see:

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 0], [2, 3, 4, 5, 6, 7, 8, 9, 0, 1], [3, 4, 5, 6, 7, 8, 9, 0, 1, 2], [4, 5, 6, 7, 8, 9, 0, 1, 2, 3], [5, 6, 7, 8, 9, 0, 1, 2, 3, 4], [6, 7, 8, 9, 0, 1, 2, 3, 4, 5], [7, 8, 9, 0, 1, 2, 3, 4, 5, 6], [8, 9, 0, 1, 2, 3, 4, 5, 6, 7], [9, 0, 1, 2, 3, 4, 5, 6, 7, 8]]

Now, if you happen to be in a context like a terminal where you get line-wrapping, this might be enough to give you a sense of what's going on. If the data were any longer, though, it would be a disaster.

Evaluate the expression in the REPL, and you get 122 lines of even less useful output:

$R0: [[Int]] = 10 values {
  [0] = 10 values {
    [0] = 0
    [1] = 1
    [2] = 2
    [3] = 3
    [4] = 4
    [5] = 5
    [6] = 6
    [7] = 7
    [8] = 8
    [9] = 9
  }
  [1] = 10 values {
    [0] = 1
    [1] = 2
    [2] = 3
    ...

The representations we get from LLDB, in the GUI and in the output of p or po, are similarly frustrating (the GUI is actually the worst for visualization: it makes me click 11 triangles to reveal the data). I usually end up typing p print(x) in the debugger to get something I can actually digest. Playgrounds? Don't get me started; the workaround is similar but far more necessary.

For contrast, now fire up ipython from the command line and evaluate the corresponding expression:

In [3]: [list(range(x, 10)) + list(range(0, x)) for x in range(0, 10)]
Out[3]:
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
 [2, 3, 4, 5, 6, 7, 8, 9, 0, 1],
 [3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
 [4, 5, 6, 7, 8, 9, 0, 1, 2, 3],
 [5, 6, 7, 8, 9, 0, 1, 2, 3, 4],
 [6, 7, 8, 9, 0, 1, 2, 3, 4, 5],
 [7, 8, 9, 0, 1, 2, 3, 4, 5, 6],
 [8, 9, 0, 1, 2, 3, 4, 5, 6, 7],
 [9, 0, 1, 2, 3, 4, 5, 6, 7, 8]]

Brilliant! Not only can I take the whole data structure in at a glance, but I can see the relationships between adjacent rows.

Naturally, not every data structure is this simple to format usefully, but surely we can aspire to do much better than we do today.

The Pitch

To be clear, I'm not proposing to change anything in the tools, language, or standard library in the near term; there's far too much to be explored and proven out in a separate package first. I propose to design a library that serves the same purpose as pprint, which is what ipython uses to generate the result above. I believe, if we get it right, similar principles could eventually be applied to improve standard library print, debuggers, playgrounds, and the REPL (of course, if Apple wants to run with some of these ideas and begin improving tools earlier, I'm sure nobody will complain).

As far as I know, there's no existing Swift library that serves the purpose. The point of this thread is to discuss the design of such a library, which features are essential, and what problems need to be solved. I'll kick it off with a few things to think about:

References: Ideally one never formats an instance twice. When an instance is referenced from multiple points in a data structure, how is it presented?
Abbreviation: in my use cases, data can be extremely large, and it's sometimes important to see the “shape” of the data without all of the detail.
- How can we effectively abbreviate an array?
  - I might want to see the first few elements and then an ellipsis.
  - I'd still want to know the length; how do I print that?
- I think this approach generalizes to most types
- Are there data structures that need different treatment?
Columns: is it important to see data structures on separate lines organized into columns?
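One way to picture the abbreviation question is the minimal sketch below. The function name `abbreviated` and its `showing` parameter are hypothetical and not part of any existing library; the sketch assumes a prefix, an ellipsis, and a trailing element count is an acceptable answer to "how do I print the length?".

```swift
// Hypothetical helper, for illustration only.
func abbreviated<C: Collection>(_ items: C, showing limit: Int = 3) -> String {
    // Short collections are printed in full.
    if items.count <= limit {
        return "[" + items.map { String(reflecting: $0) }.joined(separator: ", ") + "]"
    }
    // Otherwise show a prefix, an ellipsis, and the total element count.
    let head = items.prefix(limit).map { String(reflecting: $0) }.joined(separator: ", ")
    return "[\(head), ...] (\(items.count) elements)"
}

print(abbreviated(Array(0..<1000)))
// [0, 1, 2, ...] (1000 elements)
print(abbreviated(["a", "b"]))
// ["a", "b"]
```

Whether the count belongs inside or outside the brackets is exactly the kind of design question raised above; this sketch simply picks one option.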

Thanks for your attention,
Dave

scanon · March 29, 2020, 10:42pm

S4TF has already tackled this to some extent in the specific context of ShapedArray's description( ... summarizing: true), which produces output very similar to Python's pretty printer. @rxwei can probably offer some of his experience implementing that.

dabrahams · March 29, 2020, 10:46pm

Those are the facilities used by today's tools for rendering data as text, but the resulting rendering is often suboptimal. That's the problem I want to solve.

Solving it would probably involve using those same facilities; the more so the better. But if you're satisfied with the renderings you get today… I guess I'll register that as one vote for “it ain't broke.”

Thanks for your reply,
Dave

dabrahams · March 29, 2020, 10:48pm

Thanks, @scanon; I'd be really interested in what @rxwei has to say. Wouldn't it be funny to find part of the answer so close to home!?

owenv · March 29, 2020, 11:45pm

In case anyone's interested, the S4TF method is here.

Based on the above, I see two potential areas of improvement here:

Better context sensitivity for CustomDebugStringConvertible implementations. For example, allowing customization based on terminal/line width, compact/expanded forms, fitting descriptions into columns.
A replacement for String(reflecting:) with pprint-like behavior. This could leverage the new customization points above so that for the most part only primitive types need to worry about layout and sizing details. Unusual data structures that wanted fully custom representations could continue to provide them through protocol conformances.

Both kinds of improvements seem like they could be prototyped in a package fairly easily by building on top of the existing standard library facilities. Overall, I think this is definitely an area worth exploring, thanks for kicking off the discussion!
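As a rough illustration of the line-width and compact/expanded idea, here is a minimal sketch built only on the existing standard library. The function `render` and its `lineWidth` parameter are hypothetical names, and the sketch handles only arrays of arrays.

```swift
// Hypothetical function; `lineWidth` is an assumed parameter name.
func render<T>(_ rows: [[T]], lineWidth: Int = 80) -> String {
    let compact = String(describing: rows)
    // Use the compact, single-line form when it fits.
    if compact.count <= lineWidth { return compact }
    // Otherwise put each row on its own line, aligned under the first row.
    let lines = rows.map { String(describing: $0) }
    return "[" + lines.joined(separator: ",\n ") + "]"
}

print(render([[1, 2], [3, 4]], lineWidth: 80))
// [[1, 2], [3, 4]]
print(render([[1, 2], [3, 4]], lineWidth: 10))
// [[1, 2],
//  [3, 4]]
```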

dan-zheng · March 29, 2020, 11:50pm

TF-419 has a short survey of n-d array printing approaches: NumPy, PyTorch, and TensorFlow.

ShapedArray pretty printing was done in https://github.com/apple/swift/pull/23837, adapting PyTorch's approach, which in turn was simplified from NumPy.

There are a few formatting options:

lineWidth: max line width for printing.
- linewidth from NumPy/PyTorch.
edgeElementCount: max number of elements to print before and after summarization via ....
- edgeitems from NumPy/PyTorch.
summarizing: if true, summarize description when element count exceeds twice edgeElementCount.
- Adapted from threshold from NumPy/PyTorch, but is more direct.
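To make the interaction between edgeElementCount and summarizing concrete, here is a minimal one-dimensional sketch. The free function `summarized` is hypothetical and is not the ShapedArray implementation; it only reuses the option names listed above.

```swift
// Hypothetical function mirroring the option names above.
func summarized<T>(_ elements: [T], edgeElementCount: Int = 3, summarizing: Bool = true) -> String {
    let items = elements.map { String(describing: $0) }
    // Summarize only when requested and when the count exceeds twice the edge count.
    guard summarizing, items.count > 2 * edgeElementCount else {
        return "[" + items.joined(separator: ", ") + "]"
    }
    // Keep `edgeElementCount` items at each end and elide the middle.
    let head = Array(items.prefix(edgeElementCount))
    let tail = Array(items.suffix(edgeElementCount))
    let kept = head + ["..."] + tail
    return "[" + kept.joined(separator: ", ") + "]"
}

print(summarized(Array(0..<10), edgeElementCount: 2))
// [0, 1, ..., 8, 9]
print(summarized(Array(0..<4), edgeElementCount: 2))
// [0, 1, 2, 3]
```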

Here are some ShapedArray/Tensor pretty printing examples. They basically match NumPy.
Here's the example from the OP:

import TensorFlow // https://github.com/tensorflow/swift-apis
let scalars = Array(0..<10).flatMap { Array($0..<10) + (0..<$0) }
let array = ShapedArray<Int>(shape: [10, 10], scalars: scalars)
print(array)
// [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
//  [1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
//  [2, 3, 4, 5, 6, 7, 8, 9, 0, 1],
//  [3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
//  [4, 5, 6, 7, 8, 9, 0, 1, 2, 3],
//  [5, 6, 7, 8, 9, 0, 1, 2, 3, 4],
//  [6, 7, 8, 9, 0, 1, 2, 3, 4, 5],
//  [7, 8, 9, 0, 1, 2, 3, 4, 5, 6],
//  [8, 9, 0, 1, 2, 3, 4, 5, 6, 7],
//  [9, 0, 1, 2, 3, 4, 5, 6, 7, 8]]

The approach scales okay for scalar types with single-line descriptions (example below), but poorly for scalar types with multi-line descriptions:

import TensorFlow
struct Foo {
  var x, y: Float
}
func randomFloat() -> Float {
  Float.random(in: 0..<10)
}
let scalars = Array(0..<30).map { _ in Foo(x: randomFloat(), y: randomFloat()) }
let array = ShapedArray(shape: [3, 10], scalars: scalars)
print(array)
// [[  Foo(x: 0.4480654, y: 8.283595),  Foo(x: 9.3662815, y: 6.9708605),
//     Foo(x: 9.307922, y: 3.0516386),    Foo(x: 8.633892, y: 8.495968),
//     Foo(x: 8.039076, y: 6.2687416),   Foo(x: 9.036159, y: 5.5231953),
//     Foo(x: 9.479092, y: 5.6267643),  Foo(x: 0.27412534, y: 5.213334),
//     Foo(x: 9.509889, y: 5.4769177),  Foo(x: 5.3902664, y: 7.6097817)],
//  [   Foo(x: 8.531519, y: 9.247487),  Foo(x: 4.4876156, y: 4.8444886),
//      Foo(x: 8.801109, y: 9.923952),    Foo(x: 1.137932, y: 8.767807),
//        Foo(x: 9.818808, y: 7.9524),  Foo(x: 0.8811009, y: 3.1550765),
//     Foo(x: 3.2264006, y: 9.470762),   Foo(x: 0.83419025, y: 8.05618),
//    Foo(x: 5.1218147, y: 4.5521345),   Foo(x: 6.997285, y: 0.6220269)],
//  [  Foo(x: 8.387425, y: 1.3053352),  Foo(x: 1.4895821, y: 7.3696184),
//      Foo(x: 2.824216, y: 0.857808),    Foo(x: 1.861074, y: 9.140683),
//   Foo(x: 0.027683973, y: 8.813936),   Foo(x: 5.523879, y: 5.2888365),
//      Foo(x: 1.207906, y: 8.352948), Foo(x: 0.018079877, y: 6.803925),
//     Foo(x: 2.4139428, y: 9.780199),   Foo(x: 1.4824015, y: 8.396431)]]

jrose · March 30, 2020, 12:09am

I wonder how far you could get with special-casing Mirror.DisplayStyle's collection and dictionary cases. You could use that to build up the shape of the data structure and then render the elements individually.
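A minimal sketch of that idea, assuming it is enough to look at the first element of a collection as a representative of the rest. The function `shape(of:)` is a hypothetical name; Mirror and Mirror.DisplayStyle are the existing standard library facilities.

```swift
// Hypothetical helper: describes the shape of a value using Mirror.DisplayStyle.
func shape(of value: Any) -> String {
    let mirror = Mirror(reflecting: value)
    switch mirror.displayStyle {
    case .collection?:
        // Describe the first element as a representative of the rest.
        let inner = mirror.children.first.map { shape(of: $0.value) } ?? "empty"
        return "collection[\(mirror.children.count)] of \(inner)"
    case .dictionary?:
        return "dictionary[\(mirror.children.count)]"
    default:
        // Leaves would be rendered individually; here they are shown by type name.
        return "\(type(of: value))"
    }
}

print(shape(of: [[1, 2, 3], [4, 5, 6]]))
// collection[2] of collection[3] of Int
```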

dabrahams · March 30, 2020, 2:18am

Sure; I think there are lots of things we could do with what’s there, including detecting and using Custom[Debug]StringConvertible conformances when their results are short enough. But I really hope this thread can be about domain requirements and API for users before we get too far into implementation.

Jon_Shier · March 30, 2020, 3:52am

What kind of domain and other API were you thinking? Personally, I think some of this could be handled at the String level with better or more formatting built in. For example, I often find myself wanting to print levels of indented text, such as when printing Alamofire Request descriptions. I would love to be able to define indentation levels or columns for output without having to manually calculate them. In Alamofire right now, we print a response's HTTP headers something like this:

[Headers]:
Access-Control-Allow-Origin: *
Content-Length: 464
Content-Type: application/json
Date: Mon, 30 Mar 2020 03:46:07 GMT
Server: gunicorn/19.9.0
access-control-allow-credentials: true

I would really like it to look something like this:

[Headers]:
    Access-Control-Allow-Origin: *
    Content-Length: 464
    Content-Type: application/json
    Date: Mon, 30 Mar 2020 03:46:07 GMT
    Server: gunicorn/19.9.0
    access-control-allow-credentials: true

This is currently accomplished through simple String interpolation:

"""
[Headers]:
\(sortedHeaders)
"""

So something simple to control indentation would be very nice.

"""
[Headers]:
\(sortedHeaders, indentLevel: 1)
"""

Even better if you could customize the indent representation or perhaps have relative indentation.
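Something close to this spelling can already be prototyped with a custom string interpolation. The sketch below is one possible shape, not an Alamofire or standard library API: the `indentLevel:` overload and the `indentWidth` parameter are hypothetical, while `DefaultStringInterpolation` and `appendLiteral` are existing standard library names.

```swift
extension DefaultStringInterpolation {
    // Hypothetical overload matching the `indentLevel:` spelling above.
    mutating func appendInterpolation(_ text: String, indentLevel: Int, indentWidth: Int = 4) {
        let indent = String(repeating: " ", count: indentLevel * indentWidth)
        // Prefix every line of the interpolated text with the indent.
        let lines = text.split(separator: "\n", omittingEmptySubsequences: false)
        appendLiteral(lines.map { indent + String($0) }.joined(separator: "\n"))
    }
}

// Sample data for illustration only.
let sortedHeaders = "Content-Length: 464\nContent-Type: application/json"
let headerDescription = """
[Headers]:
\(sortedHeaders, indentLevel: 1)
"""
print(headerDescription)
// [Headers]:
//     Content-Length: 464
//     Content-Type: application/json
```

The `indentWidth` default is one way to make the indent representation customizable; relative indentation would need more context than a single interpolation segment has.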

benrimmington · March 30, 2020, 5:19am

You can option-click a disclosure triangle in any outline view to expand (or collapse) all of the item's children.

A similar feature is available in the Files changed tab on GitHub.

ktoso · March 30, 2020, 5:35am

Personally I'd love for an improved print experience. I work around this constantly, though it's less painful since I use (keyboard / IDE) macros (real ones would be nice) to type out tons of boilerplate to get the printout I want (or have to type those for loops every single time manually while debugging).

For more inspiration you may also want to skim Scala's version of this: http://www.lihaoyi.com/PPrint/
it handles not only collections but also case classes (think "structs") quite well.

Another issue to keep in mind while looking into a prettier print: please have it behave consistently, invoking either description or debugDescription all the way through, and not the way collections handle this in Swift today, which is super surprising (description of a collection invokes debugDescription of its elements): [SR-11001] Collections description ends up invoking debugDescription of elements (apple/swift issue #53391). Perhaps worth revisiting that SR concurrently as well...?
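The behavior in question can be reproduced with a small example. The `Token` type is made up for illustration; the printing behavior is that of the current standard library.

```swift
// Illustrative type with deliberately different descriptions.
struct Token: CustomStringConvertible, CustomDebugStringConvertible {
    var description: String { "plain" }
    var debugDescription: String { "debug" }
}

let token = Token()
print(token)    // plain
print([token])  // [debug]
```

Printing the value directly uses description, while printing the array that contains it uses debugDescription for the element.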

dabrahams · March 30, 2020, 2:06pm

Thanks, that sounds really interesting.

IMO, most people would find the results even more surprising if we did that. Please see my reply in the issue.

dabrahams · March 30, 2020, 4:49pm

That's still one keystroke and a click too many for my purposes, and then the result is still similar to what I get from p or po, with poor use of screen real-estate and lots of information I don't normally need. Note that GitHub starts by showing you the useful information. But I don't really want to spend too much time discussing the tools here; my point in bringing them up was merely to point out that they might eventually benefit from work done on this library.

ktoso · March 30, 2020, 5:09pm

You're welcome. I think the width / height concepts are pretty good there; they translate to when an output gets \n-ed or truncated etc.

Thanks for the reply there -- that puts much into context. I'm convinced that changing those semantics is not a good idea. Though, Nate sums the conflicting use cases up very well there: what debugDescription is is more about where it's printed, and not with what intent. So we can't change names and semantics etc, but...

So pprint will have to follow the same semantics, using debugDescription of elements if inside a collection etc. That's good and I understand why we want this (though would love more docs on the protocol why/when it gets invoked).

At the same time, while introducing a pprint, could it also serve the "interactively (println/log) debugging" use-cases (3 & 4)? It seems to me it might easily address those if it had a way to use a Mirror to obtain all the values to print -- then description and debugDescription both contain things which normal users may want to see, and some pprint(mirroring_pickBetterName: value) would do what I was after all along -- print "all" the values pretty formatted (multiline etc), as I've now realized that using debugDescription for manually doing the pretty multi-line print leads to unexpected outputs (because collections).

Do you think this is something we could consider having as part of such an API?

dabrahams · March 30, 2020, 6:25pm

I don't think I understand what you're proposing. As I've said elsewhere in this thread, I take “use a Mirror” for granted as one of the bases for the implementation of this library. I don't think I know what you were after all along.

That said, it might be a good idea for the library to have a lower-level API that uses a visitor for formatting individual pieces of the output, so you can customize how any given type is formatted without having to write all the wrapping and indenting bits.
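A minimal sketch of what such a split might look like, assuming the traversal owns indentation and line breaks while the visitor only renders individual values. The protocol `LeafFormatter`, the function `prettyLines`, and the `HexInts` example are all hypothetical names, not a proposed API.

```swift
// Hypothetical protocol; the names are illustrative only.
protocol LeafFormatter {
    /// Returns a custom rendering for `value`, or nil to use the default.
    func format(_ value: Any) -> String?
}

// The traversal handles nesting and indentation; the visitor only renders pieces.
func prettyLines(_ value: Any, visitor: LeafFormatter, depth: Int = 0) -> [String] {
    let indent = String(repeating: "  ", count: depth)
    if let custom = visitor.format(value) { return [indent + custom] }
    let mirror = Mirror(reflecting: value)
    guard mirror.displayStyle == .collection else {
        return [indent + String(describing: value)]
    }
    var lines = [indent + "["]
    for child in mirror.children {
        lines += prettyLines(child.value, visitor: visitor, depth: depth + 1)
    }
    lines.append(indent + "]")
    return lines
}

// Example visitor: renders every Int in hexadecimal, leaves everything else alone.
struct HexInts: LeafFormatter {
    func format(_ value: Any) -> String? {
        (value as? Int).map { "0x" + String($0, radix: 16) }
    }
}

print(prettyLines([[10, 255], [16]], visitor: HexInts()).joined(separator: "\n"))
```

The visitor never sees indentation or brackets, which is the separation described above.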

ktoso · March 31, 2020, 2:45am

Looking back at it I guess it being implemented in terms of mirrors is kind of obvious in retrospect, though was not really spelled out. I think this has quite the potential to solve what "debug" printing in my opinion should have been (while the debugDescription IMHO remains somewhat weirdly named for what it's actually used for -- when embedded in collections).

Overall, looking forward to this - could help avoid ad-hoc implementing pretty formatted outputs way too many times for some types I own.

Karl · March 31, 2020, 3:52pm

This is a really interesting topic and I'm glad we're looking into it!

Data visualisation is important for far more than just ML; getting a better idea of what your data looks like can be critical to identifying optimisation opportunities or gaps in test coverage.

A question: are we talking about visualising/understanding data using Swift, or visualising/understanding the data structures themselves? I'm going to assume you mean the former.

When it comes to multi-dimensional data, like the examples in the OP, it's clear that we can do better -- at least up to 2 dimensions. If we want to get really ambitious and look beyond that, it's possible that a visualiser with more advanced graphics capabilities (like an Xcode Playground or embedded web-content in a notebook) could render visualisations of even higher-dimensional data. Is the idea that this would be limited to the terminal, or are we looking at adding support for richer visualisers as well?

Even if limited to the terminal and 1 or 2 dimensions, we could do better. I'd really like some way to add value colourisation, and basic statistics like min/max/mean/stddev, for instance.

EDIT: Here's an example of the kind of things you can do with Python's rich library:

dabrahams · April 5, 2020, 1:13am

I'm not sure I understand the distinction you're making.

Data visualization in general is hugely important, but I'm limiting the scope of what I'm talking about here to textual representations. That said, even higher-dimensional data can be effectively represented in text—at least way better than we do today—in prehistoric times I used to program in APL and even that was able to print 3D and 4D arrays sensibly.

I think I don't want statistics injected into my output; we have a programming language so we should be able to just compute those things and print them if we want them. As for colorizing, maybe as a postprocessing step, but I think it needs to be simple to generate plain text.

EDIT: Here's an example of the kind of things you can do with Python's rich library:

Thanks; I'll look into this!

idrougge · April 5, 2020, 7:51pm

Can't much of this be handled by referring to CustomPlaygroundDisplayConvertible?