Formalizing a Numerical/ML Working Group

Would you mind elaborating on this a bit further? I've recently been learning tensor algebra and from what I've gathered, the power of tensors comes from the tensor product, and the tensor product enables one to reason more fluidly about entangled systems. Are you suggesting that we could generalize the tensor product over any type, or that the tensor product is narrowly focused, or am I way off?

Fwiw, I only have a very passive understanding of TensorFlow, so maybe the context is different.

Hi Rex,

Sure, no problem. Tensors are a type used extensively in linear algebra solutions. Tensor algebra is certainly useful and powerful for certain problem sets. There are also mature SIMD and matrix accelerators of different flavors that are particularly efficient at linear algebra and tensor-shaped problems.

But at the same time, it is just one domain in the space of math. There is a whole world of math, logic, and computational problems that either don't work in that context, take a large amount of effort to reshape, or are just a poor fit.

Your typical homogeneous neural nets found in deep learning are successfully simulated in large tensor arrays, which in turn are readily accelerated by a variety of SIMD methods on GPUs, SIMD units, and specialized TPUs/NPUs. That is great, but it is still a limited set of problems, and it lends itself to monolithic solutions.

Swift answers a bigger question: "what if all code could be differentiated no matter the domain?" What new patterns could be enabled? What new approaches could we take to post-deep learning AI?

For example, our group has been working on the principles of generalized autonomy. What if we could eliminate the specialized training in deep learning that leads to unprincipled, single-purpose use (i.e. training a NN to drive a car gets you driving... but that same learning can't be transformed into, say, flying a plane)? Instead, what if we could allow users to ad-hoc define a system for any use – at run-time – that could be controlled without special-purpose network design, learning periods, expert validation, etc.? To do this we needed an existential definition language of "things". That required a highly typed differentiable environment to define "actors" (transports, stores, routers), quanta (the currency of object interchange), and behaviors (the physical definition of how things behave). These get assembled at runtime into heterogeneous compute graphs that represent the physics of real-world systems, whose knowledge is composable, interchangeable, and introspectable. This is a different paradigm that doesn't easily fit into a tensor algebra approach, and where dispatch is fundamentally a MIMD problem set, not SIMD.

That is of course one use case, but we could talk about dozens of examples of post deep learning differentiable computing in other domains.

Troy

11 Likes

Thanks for taking the time to write this. I'm definitely feeling what you're expressing, having an ontology that better matches reality and using actual physics in nodes of a compute graph definitely seems like where engineering should go.

I hope you don't mind me challenging this a bit though, because I'm still not understanding how tensors don't fit into all of this and I'd really like to understand. What whole world of math, logic, and computational problems don't work in that context or need reshaping or poorly fit?

(Note: there's a lot of speculation here, though I try to make that obvious.)

I'll go back to what I was thinking earlier, tensor products help reasoning about entangled systems. The reason I go there is because in my view, composition is easy, combining systems which produce emergence and understanding that emergent character is hard. The only mathematical tool I've so far found to maybe understand emergence is quantum probability theory, which is built on the tensor product, because it expresses the ability to multiply dimensions of systems. E.g. if V has 2 dim and W has 3 dim then V⊗W has 2x3 = 6 dim, greater than the sum of its parts, literally a multiplication. This (I think) may be interpreted as if V has 2 different realities V_1 and V_2, then here we can view W in terms of V_1 or in terms of V_2 simultaneously (or swap W for V accordingly). So then when I think of "the physical definition of how things behave", with the tensor product I'm hoping I can think of all the possible states factual and counterfactual of these combinations of behaviors. Some states may be quantum probabilistic, but some have 0 or 1 probability which makes them deterministic. That doesn't eliminate using a tensor product in those cases, I think it may make the tensor product less useful when deterministic and I think that just gets you back to composition of physical equations. On that note, is there another tool to combine emergent systems? How do we construct an ontology that defines emergence without such a tool? Without such a tool can this ontology be complete?
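The dimension-multiplying behavior described above can be checked with a small sketch (plain Python, purely illustrative): the tensor (Kronecker) product of a 2-vector and a 3-vector lives in a 2×3 = 6-dimensional space.

```python
# Kronecker (tensor) product of two vectors, illustrating dim(V ⊗ W) = dim(V) * dim(W).
def kron(v, w):
    """Return the tensor product of vectors v and w as a flat list of components."""
    return [a * b for a in v for b in w]

v = [1.0, 2.0]        # a vector in a 2-dimensional space V
w = [3.0, 4.0, 5.0]   # a vector in a 3-dimensional space W

vw = kron(v, w)
print(len(vw))  # 2 * 3 = 6 components
print(vw)       # [3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
```

Note that not every vector in V⊗W factors as a single product v⊗w; the ones that don't are exactly the "entangled" states mentioned above.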

Maybe when you say "that context" you're more specifically referring to SIMD. So my next set of questions is: do tensors require SIMD, or is that an implementation fault of deep learning? Is there something in the mathematical expression of tensors/tensor products that constrains it in such a way, and if so, what's the better mathematical model? (If I were to conjecture, it may be some part of category theory, with which I only have basic familiarity, i.e. composition/functors/natural transformations.)

Lastly, what do you mean by "quanta (the currency of object interchange)"?

Thanks for all the info so far, this is exciting stuff.

SIMD (Single Instruction, Multiple Data) is a processor architecture that executes a single instruction stream on multiple pipelines of data. It has been used in supercomputers since the 60's and 70's for parallel computations. It is also the basis for the generation and manipulation of graphics and images in multiple dimensions on current computers, workstations, phones, and tablets. The SIMD types in Swift are an outgrowth of the Apple SIMD framework provided as part of macOS/iOS that eases the use of the graphics hardware in Macintoshes, iPhones, iPads, etc. The graphics processors can also be used for general parallel processing purposes: FFTs, vector/matrix analysis, linear algebra, tensor analysis, and the like.
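The SIMD/MIMD distinction can be sketched very loosely in plain Python (this models the dispatch pattern only, not real silicon): SIMD applies one instruction across every data lane, while MIMD lets each lane run its own instruction stream.

```python
# Illustrative contrast between SIMD-style and MIMD-style dispatch.
data = [1.0, 2.0, 3.0, 4.0]

# SIMD: one instruction (here, doubling) applied uniformly across all lanes.
simd_result = [x * 2.0 for x in data]

# MIMD: each lane may execute a different instruction stream.
instructions = [lambda x: x * 2.0,
                lambda x: x + 10.0,
                lambda x: x ** 2,
                lambda x: -x]
mimd_result = [op(x) for op, x in zip(instructions, data)]

print(simd_result)  # [2.0, 4.0, 6.0, 8.0]
print(mimd_result)  # [2.0, 12.0, 9.0, -4.0]
```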

Thanks for the succinct explanation, understood. Then, does a tensor, in the most abstract sense, require SIMD or is that a fault of the implementation?

Maybe to add a bit more clarity (and I'm maybe speaking more to @Troy_Harvey here). I'm very curious and very excited about what Troy is proposing, this sounds big and important and on the cusp of the future. However I'm also a bit concerned, because it's really important to get the terminology right and understood for a new standard before communities start to bifurcate (in this case tensor vs tensor-less). Maybe that bifurcation was unnecessary and wasted effort, or maybe one side is more in the right and therefore with shared understanding can recruit more people so everyone's working in the same direction (already there's bifurcation on this thread). Knowing that, I'm kind of poking my head in with my limited knowledge to see if we can all get on the same page. (As the goal is to get to a Working Group?)

Given that, afaiu as I've been learning this, a tensor is a geometric object: coordinate-space free, multi-linear, and adhering to the tensor product. What implies that a tensor must be processed under a SIMD architecture? Could not different multi-linear parts of the object be processed by different instructions simultaneously? Could not all types be viewed as tensors under fmap or some equivalent, where we just need some notion of a tensor product on types (in which case we're just talking about the same thing)? If there are restrictions to SIMD, or my definition is off, is there a more abstract object that preserves the notion of entanglement/the tensor product but removes the restrictions? (Mostly questions I've already been asking.)

You can perform tensor analysis, and other continuous simulation and modeling tasks, using conventional processors, SIMD, or, if the processor architecture provides support, MIMD (Multiple Instruction streams, Multiple Data streams, supported by the newest supercomputer architectures). Which architecture is best to use depends on the rank of the problem, the performance required, access to compute resources, and other pragmatic considerations of mapping algorithms to computer architectures (which is what programming is all about). Again, all of these concepts are processor architectures implemented in silicon that form components of the working processors within a computer. The mapping of analysis algorithms (tensors, systems of non-linear equations, neural net computations, hydrodynamic fluid simulations, and other continuous functions) is up to the programmer. There have been efforts, for example, to model these types of systems using collections of virtual servers on AWS.

I think you are conflating the notion of a mathematical "object" (a tensor), and the numerical representation of that object used for computing on a processor architecture.

2 Likes

Thank you; to paraphrase, I'm hearing "a tensor's numerical representation could use MIMD", which I think makes sense to me. You also mention systems of nonlinear equations, etc. Are these the other pieces of computational problems that Troy was hinting at?

Note that if I'm paraphrasing right, this does contradict what I was understanding earlier, given "this is a different paradigm that doesn't fit easily into a tensor algebra approach, and where dispatch is fundamentally a MIMD problem set". I'd think that either tensors do not fit with MIMD or they do. (Or maybe more appropriately, could not or could.)

I did sign up for updates on Quantum, maybe more will become obvious when it’s officially revealed.

1 Like

A tensor is represented in a computer as a block of memory holding the values of the tensor components. Performing a tensor calculation on a computer (for example, multiplying two tensors) requires a program written to manipulate the affected blocks of memory. That program is implemented in the context of a particular processor architecture (with the understanding that a system of distributed compute elements, such as AWS virtual machines, constitutes a processor architecture). That processor architecture can be a conventional von Neumann architecture, a SIMD parallel compute system, a MIMD parallel compute system, or a specialized processor system tailored to the problem at hand. Computing with tensors can be mapped (and has been mapped) to all of those hardware architectures. Which one is used depends on your budget, the resources available, and the performance requirements you have to satisfy.
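A minimal sketch of that "block of memory" view (plain Python, illustrative only): a rank-2 tensor stored as a flat buffer plus a shape, with multiplication implemented as index arithmetic over the two buffers. Any of the architectures above is just a different way of scheduling these loops.

```python
# A rank-2 tensor as a flat memory block plus a shape; multiplication is
# index arithmetic over the two buffers.
def matmul(a, a_shape, b, b_shape):
    (n, k), (k2, m) = a_shape, b_shape
    assert k == k2, "inner dimensions must match"
    out = [0.0] * (n * m)
    for i in range(n):
        for j in range(m):
            out[i * m + j] = sum(a[i * k + p] * b[p * m + j] for p in range(k))
    return out, (n, m)

a = [1.0, 2.0,
     3.0, 4.0]   # a 2x2 tensor, stored row-major
b = [5.0, 6.0,
     7.0, 8.0]   # another 2x2 tensor

c, shape = matmul(a, (2, 2), b, (2, 2))
print(c, shape)  # [19.0, 22.0, 43.0, 50.0] (2, 2)
```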

The mathematical concept of a tensor is a very convenient aid to calculation in physics (calculation in the general sense, not just in the context of a computer). Use of tensors in general relativity results in a compact representation of stress/strain relationships in space-time, which is what we observe as gravity. Tensors have been used in structural physics and engineering to relate the strains in 2D/3D structures to the stresses (i.e., forces) applied to those structures. And, as you have mentioned, it is a convenient method to represent the phenomenon of quantum entanglement. There are many other applications of tensor algebra and tensor calculus in physics and mathematics where transformation from one metric space to another is required.

I think Troy's point is that providing a differentiable "Swift" has many more applications than just machine learning networks, and, even within ML, using tensors can be limiting. Tensors are simply another type that can be used to compute.

2 Likes

Ok, I think the obvious just hit me. In programming lingo, a Tensor type is expressed over one field, just like a Vector type is. You may have a Tensor of Floats (likely most common) or a Tensor of Strings, but there's no such thing as a heterogeneous Tensor type. This isn't to say a heterogeneous Tensor type couldn't be added. I would think if you take Tensor<Float> ⊗ Tensor<String> you get the outer-product Tensor of both types, but this is simply a fancy way to compose heterogeneous data, something we could already get from a hand-crafted data structure. If you include Tensors of tuples and unions and squint hard enough, maybe even every heterogeneous type could be thought of as an outer product of types. The point I think you've helped clarify, and that I think Troy was making, is that we usually already know what compositions we want for heterogeneous data types in day-to-day software engineering, and making all types differentiable gives us the power to form composable differential equations over them in general. On that note, a tensor product over heterogeneous types might still be useful to model entanglement for quantum probability, but, similar to what you're saying, that may just be one bit of the overall larger picture of "Differentiable Types".
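A toy rendering of that Tensor<Float> ⊗ Tensor<String> idea (the naming is hypothetical; plain Python for illustration): the outer product over heterogeneous element types just pairs elements, which is the same information a hand-crafted struct of both fields would carry.

```python
# Outer product of a Float "tensor" and a String "tensor": each entry of the
# result is a (float, string) pair -- ordinary heterogeneous composition.
def outer(xs, ys):
    return [[(x, y) for y in ys] for x in xs]

floats = [0.5, 1.5]
strings = ["on", "off", "idle"]

product = outer(floats, strings)
print(len(product), len(product[0]))  # 2 3  -- a 2x3 grid of pairs
print(product[1][2])                  # (1.5, 'idle')
```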

Am I following this correctly? And thanks for taking the time to help me through this.

Tensors and tensor products as mathematical entities are defined over the same field, regardless of where they are implemented (paper and pencil or a computer). I'm not quite sure what you mean by a heterogeneous tensor. A field defines, over the elements of a set, operations such as addition and multiplication, along with identity elements.

If you can define what it means to multiply two Strings and add two Strings, you could define a mathematical field based on Strings. It seems to be a lot of work to twist this into the tensor methodology, and it's not clear there is a lot of benefit. The same goes for defining a Float-String field. If you can assign a meaning to the product of a Float and a String (and the other properties of a field), you could define such a field. Again, of dubious value.
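To underscore the "a lot of work to twist" point with a toy sketch (plain Python): taking concatenation as String "addition" gets you a monoid, but the field axioms fail immediately, because no string has an additive inverse, so the tensor machinery built on fields has nothing to stand on.

```python
# Strings under concatenation form a monoid (associative op + identity),
# but not a field: there are no inverses, so field-based tensor machinery fails.
def add(s, t):
    return s + t          # "addition" as concatenation

identity = ""             # the identity element of the monoid

assert add("ab", identity) == "ab"                          # identity law holds
assert add(add("a", "b"), "c") == add("a", add("b", "c"))   # associativity holds
# But there is no string s with add("ab", s) == identity -- no inverses,
# hence no field, and no well-defined tensor product over this structure.
print("monoid: yes, field: no")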

As to your assertion that "heterogeneous tensors might be useful to model entanglement for quantum probability", I'm not sure what you are trying to get at. What heterogeneous types are you thinking of that would make up the field that the requisite tensors would be composed of? Quantum entanglement phenomena are already modeled using various formulations of quantum mechanics, statistical mechanics, and other physical theories.

1 Like

A mathematical tensor is a thing defined over a base field. But a "tensor" as colloquially used by ML folks is often more like a multidimensional data store, and sometimes doesn't take advantage of the field properties or the algebra on the tensor structure at all.

7 Likes

As the person resurrecting S4TF, I'd like to give a deeper explanation of why static SIMD tensors are so prevalent in ML. Months ago, I tried a custom matrix multiplication kernel in Metal, which used sparse matrices. Instead of being fully SIMD, these have dynamic sizes that match natural neurons better. But even with zero multiplications, Apple's dense MMX algorithm outperformed mine.

A homogeneous computer architecture also works better for hardware acceleration for machine learning. NVIDIA GPUs have 100-TFLOPS tensor cores that require precise memory alignment and massive SIMD. With the same 50-billion-transistor budget, Apple's M1 Max focuses instead on general-purpose processing. Consequently, it has only 10 TFLOPS of ML processing power.

The point of this is that there are trade-offs between static-ness and dynamic-ness in computing. Sparse matrix multiplications can be seen as more dynamic and efficient, but the trade-off is that they aren't the static tensors suited to machine learning. In the work @Troy_Harvey is mentioning, he is able to utilize heterogeneity/dynamic-ness much more than in traditional machine learning. Despite being statically typed, Swift has opened up more opportunities for dynamic approaches in their field.
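The dense-vs-sparse trade-off can be sketched abstractly (plain Python, nothing like a real Metal kernel): a coordinate-dictionary sparse representation skips the zero multiplications, but its data-dependent, irregular access pattern is exactly what uniform SIMD hardware handles poorly.

```python
# Dense vs sparse matrix-vector multiply, sketching the trade-off above.
dense = [[0.0, 2.0, 0.0],
         [0.0, 0.0, 0.0],
         [3.0, 0.0, 4.0]]
vec = [1.0, 1.0, 1.0]

# Dense: uniform work per element -- maps cleanly onto SIMD hardware.
dense_result = [sum(row[j] * vec[j] for j in range(3)) for row in dense]

# Sparse: store only nonzeros as {(row, col): value}; fewer multiplies,
# but irregular, data-dependent control flow.
sparse = {(0, 1): 2.0, (2, 0): 3.0, (2, 2): 4.0}
sparse_result = [0.0, 0.0, 0.0]
for (i, j), v in sparse.items():
    sparse_result[i] += v * vec[j]

print(dense_result)   # [2.0, 0.0, 7.0]
print(sparse_result)  # [2.0, 0.0, 7.0] -- same answer, different work pattern
```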

2 Likes

This might just be too much speculation on my end. Thanks again for the info.

I think Jonathan and Steve's distinction of the uses of the word "tensor" is important, and here I'm talking in the multidimensional data-store sense.

I'm still not understanding how tensors don't fit into all of this and I'd really like to understand. What whole world of math, logic, and computational problems don't work in that context or need reshaping or poorly fit?

The contention being made is not that tensors aren't useful, but that they're not fundamental enough to dictate goals, architecture, or mental models. It seems there's a nascent recognition of an over-focus in the last ~5 years on tensors, which probably occurred because tensors happened to lend themselves to doing deep learning on the pieces of silicon we happened to have access to. But it's just one way to represent and compute things. We don't want to miss the forest of possibilities for the tree of tensors.

Lastly, what do you mean by "quanta (the currency of object interchange)"?

This question is a good medium for an example.
Using Troy's framing to represent a neural network, the quanta would be activation signals passed from one layer to another. But the framing is general enough to represent much more than neural networks. For example, if your compute graph instead represented a physical system of pipes, boilers, radiators, etc., then the quanta might be water. Or if it represented a network of computers, the quanta might be data packets. Or if it's a bunch of roads, then the quanta are vehicles.
As you can imagine, as you get more flexible about the data and operations your compute graph is embodying, you will desire more flexibility in the programming tools you use to implement them. Tensors lock you in to a homogeneity that works great for neural networks, but quickly becomes restricting if you want to go further.
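A toy sketch of that actors-and-quanta framing (all names here are hypothetical, and plain Python stands in for the real definition language): actors with different behaviors pass quanta along the graph, and the quanta could as easily be water volumes or packets as activations.

```python
# A tiny heterogeneous compute graph: each actor applies its own behavior
# to the quanta flowing through it. The quanta here are floats, but the
# pattern works for any type.
class Actor:
    def __init__(self, name, behavior):
        self.name = name
        self.behavior = behavior  # how this actor transforms incoming quanta

def run(pipeline, quantum):
    for actor in pipeline:
        quantum = actor.behavior(quantum)
    return quantum

# Heterogeneous behaviors: a pump doubles flow, a valve caps it, a meter rounds.
pipeline = [Actor("pump",  lambda q: q * 2.0),
            Actor("valve", lambda q: min(q, 5.0)),
            Actor("meter", lambda q: round(q, 1))]

print(run(pipeline, 3.2))  # 5.0 (6.4 capped at the valve's 5.0 limit)
```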

Is there something in the mathematical expression of tensors/tensor products that constrains it in such way, and if so what's the better mathematical model?

Like Troy said, you can massage a lot of problems into a tensor representation. But it might take effort. What if you could just choose your own representation, whatever is most natural for the problem at hand? And what if, despite having chosen a novel representation, you still had access to automatic differentiation, and could use gradient descent to optimize your model? That's the opportunity Swift is providing us. We don't have to stay in tensor land; everything is differentiable now. In a strongly typed way, with all the benefits a principled compiler offers. There's an opportunity for this group to chart a course that builds on these existing unique benefits. And tensors can play a part, but there's no need to make them a centerpiece.
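To give a minimal flavor of "choose your own representation and still get derivatives", here is a forward-mode automatic differentiation sketch using dual numbers in plain Python (a sketch of the idea only, not of Swift's compiler-integrated autodiff): an arbitrary program is differentiated and optimized by gradient descent with no tensors in sight.

```python
# Forward-mode automatic differentiation via dual numbers, then gradient
# descent -- the mechanism applies to any code path, not just tensor ops.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot   # value and derivative carried together
    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o):
        return Dual(self.val * o.val,
                    self.val * o.dot + self.dot * o.val)  # product rule

def loss(x):
    # An arbitrary program: (x - 3)^2, written as ordinary code.
    d = x + Dual(-3.0)
    return d * d

def grad(f, x):
    return f(Dual(x, 1.0)).dot  # seed dx/dx = 1

x = 0.0
for _ in range(50):
    x -= 0.1 * grad(loss, x)    # plain gradient descent
print(round(x, 3))              # converges toward 3.0, the minimizer
```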

3 Likes

There are many types of heterogeneity: operator heterogeneity, actor heterogeneity, type heterogeneity, and interface heterogeneity. Taken together, these would be hard to stuff into a tensor-first paradigm.

Take a typical neural net. The transfer functions, or actors, are usually the same homogeneous functions repeating over and over — for example, ReLU. That is very limiting, and it also lacks any principled "knowledge" about the model being trained. The neuron interconnects are typically just floats. Again, that lacks any principled "knowledge" about what information is actually flowing from neuron to neuron. That then leads to typical NNs lacking any ability to introspect, which is why deep learning applications are one-way solutions. But equally important, it makes NNs monolithic, because there are no principled interfaces to compose bits and pieces of NNs together into something larger. That's one reason many applications for future AI need types — which is another way to say heterogeneity.

Similarly, you need a structured way to manage dispatch in heterogeneous compute graphs. Keep in mind that roughly 95% of TensorFlow's code base is managing operator-fusion transforms to improve dispatch. This is because it is built on low-level tensors, which offer no strong, principled, or intuitable structure from which to coalesce more efficient graphs — a problem that is well understood to be intractable. This is an information-theory problem: you can't always assemble high-level kinds from lower-level types that don't carry the required information — but a principled high-level definition can be dispatched at multiple levels, or broken down into low-level graphs quite readily. By the time a problem has been shaped into an abstract NN and trained, this information entropy has already occurred.
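The fusion point can be sketched in miniature (plain Python; the op names are hypothetical, not TensorFlow's): a graph expressed as low-level ops must be pattern-matched to recover fusable units, while a high-level node never loses that structure in the first place.

```python
# Why high-level structure eases dispatch: low-level op streams must be
# pattern-matched to rediscover fusable units; high-level nodes carry the
# structure directly.
low_level = ["matmul", "add_bias", "relu", "matmul", "add_bias", "relu"]

def fuse(ops):
    """Greedy pattern match to recover 'dense layer' units from low-level ops."""
    fused, i = [], 0
    while i < len(ops):
        if ops[i:i + 3] == ["matmul", "add_bias", "relu"]:
            fused.append("fused_dense_relu")
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused

high_level = ["DenseLayer(relu)", "DenseLayer(relu)"]  # structure never lost

print(fuse(low_level))  # ['fused_dense_relu', 'fused_dense_relu']
print(high_level)
```

In general the pattern matching only gets harder: this greedy three-op match works here, but recovering optimal fusions from arbitrary low-level graphs is the intractable problem described above.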

We need both a more flexible and more principled definition construct to solve for many of these problems. Quantum, for example, is one such definition language built around physical types. There are bound to be others in other domains.

5 Likes

@Troy_Harvey and @porterchild thank you for taking the time to answer these questions, this has been very helpful.

I'm thinking my intuition was pointing in the same direction as you all; I agree that neural nets, from my limited experience, seemed too limiting. I think I've been approaching that problem with: how can we make a more abstract data structure/model over or even above tensors and tensor products (in the mathy and not strictly multi-dim array sense)? I think I'm mostly sold on what you're proposing, though, which sounds like starting with a less constrained ontology that could include any arbitrary differentiable type, giving users the freedom to not be constrained by any specific kind of data structure. I'll certainly keep researching in the direction I have been out of curiosity. I'm also very interested in seeing what this definition language ultimately looks like and how it operates!

On that note, if we (as in the community) were to start a working group how could that operate? How could we organize around building a roadmap? Is this definition language planned to be a future piece of it?

It's a good question, and why I started this thread.

First, Swift has a unique set of attributes that aren't represented or perhaps aren't easily possible in other languages, and I think there is real opportunity to leverage and elevate that in the general ML community.

There are core building blocks and frameworks that are missing in Swift open source that either we have, others have built, or we can fill in the blanks together. Some of these pieces include introspection & metaprogramming tools, numerical methods, fixed arrays, distributed computing, dataframes, backend independent NN frameworks, constraint engines, generalized differentiable programming scaffolding. These topics are general, and helpful to many domains.

My 1st goal was to assess the community interest.

My 2nd goal was to navigate a more strategic approach to open sourcing our own frameworks than "throw them on GitHub and see what happens".

My 3rd goal was to see if there is enough high-level interest to build a working-group roadmap for these building blocks. I'm a big believer in community-anointed libraries; they build so much more momentum in a community than random smatterings of GitHub libraries. Take data frames, for example. There are probably 4 or 5 half-baked and half-built Swift data frame projects. In Python there is Pandas — everybody knows it and backs it.

My 4th goal was to see if there was Swift.org interest in housing such a working group — or if it should be assembled outside these walls. We have some bandwidth to support it, but we are busy building our own tech, so it can't be an uphill climb.

While S4TF was a good start, it was competing with institutional inertia from its owner (and, within their walls, still searching for a killer application). Going forward, it's important to avoid that single-source dependency. We have a large set of internal initiatives and frameworks, but also feel that distributed community investment is valuable. We started with Quantum, our DSL for physical-systems AI. Quantum is on the path to having dozens of corporate backers by the end of the year. We are going to house it in a non-profit so it has more independence and community openness. I think Quantum is one such killer application for differentiable Swift. Given that, we are exploring the open-source warehousing, under the same independent body, of some of our supporting frameworks that we think have general-purpose community value.

I also should note that, while we are investing in growing our team to work on Swift AI tooling, I am also open to supporting open source efforts if groups or individuals have proposals.

14 Likes

I want to express my support for this working group. I don't have enough experience to make substantial contributions, yet, but I'm very interested in specializing in the subject. Count me in to help in any way I can on the formalization of the working group.

2 Likes

+1 on what @Troy_Harvey said. There are a lot of TensorFlow-specific operators in S4TF, which detract from the ability to extend it to other backends. One example is Lanczos image resampling, and another is the matrix factorization ops. I'm planning to remove anything that isn't directly related to machine learning. Although, I may add some linear algebra operators once I can draft custom kernels in MSL and OpenCL. Outside of S4TF, there would be no convenient way to run linear algebra on TPUs.

This decision partially depends on whether SwiftFusion will recognize my fork as the successor to Swift for TensorFlow and shift their dependency to it. There are examples of SVD and Cholesky decomposition in their code. Just out of curiosity, is there any big effort to make a standalone linear algebra library for Swift that mirrors NumPy or the linear algebra operators in XLA?

You can certainly count me in; however, I'd likely rely on others' insights into direction and priority, given my limited experience specifically in ML. Though I feel confident in learning what needs to be learned and contributing that knowledge into something material.

Curious what you had in mind?

Agreed wholeheartedly. Would it be worthwhile reaching out to the authors of those libraries directly and pointing them towards this forum?

I.e., core team involvement or something different? It is curious that I don't see a current core team member replying in this thread. Though it's nice to look back a year and see Chris Lattner's direct support.

Given the following:

The project lead makes senior appointments to leadership roles, with those leaders coming from the worldwide Swift community of contributors. ... Ted Kremenek is the appointed representative from Apple, and acts as the voice of the project lead. Source

and that such a decision would likely require appointment.

@tkremenek what may it require to form a new working group for ML under Swift.org?