Shout out to MathJax/ASCIIMath for being awesome.

If you don't care about any of the details, you can click here to jump to the code.

Motivation and Problem Statement

We're launching a ball with a great deal of force at a wall. We can describe our wall with four points in 3D space: `a = [x, y, z]`, `b = [x, y, z]`, and so on for `c` and `d`. Our ball travels along a straight line path called `bbL`. Its origin is `p_0` and it's moving towards `p_1`.

There is some redundancy. I picked four points for our wall because it makes intuitive sense, but we're going to be applying this method on only three of those four points, the triangle `Delta abc`. If you feel the compulsion to run this on a square, you can easily extend the approach to two triangles.

Let's begin with describing our plane. There are a few different formulations of the plane in 3D. For our purposes, the 'normal + offset' configuration is the easiest. We'll figure out the normal (think a straight line pointing out of the wall) from our three points.

A quick review of dot and cross

I'm going to assume that the idea of cross-product and dot product are familiar, but here's a very quick review in the interest of completeness.

`a = (x, y, z)`
`b = (x, y, z)`

`a o. b = a_x * b_x + a_y * b_y + a_z * b_z`

`a ox b = [[a_y*b_z - a_z*b_y], [a_z*b_x - a_x*b_z], [a_x*b_y - a_y*b_x]]`

Note that the dot product is a scalar and the cross product is a vector.

One other thing to realize: the dot product of orthogonal vectors is zero. The cross product of two vectors produces a vector that's orthogonal to both. If that's not clear, don't worry.
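For completeness, here's what those two operations look like as a quick Python sketch (plain tuples for vectors; the helper names are my own):

```python
def dot(a, b):
    # Scalar: a_x*b_x + a_y*b_y + a_z*b_z
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def cross(a, b):
    # Vector orthogonal to both a and b.
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

# Orthogonal vectors have a dot product of zero, and the cross
# product of two axes gives the third:
x_axis = (1.0, 0.0, 0.0)
y_axis = (0.0, 1.0, 0.0)
assert dot(x_axis, y_axis) == 0.0
assert cross(x_axis, y_axis) == (0.0, 0.0, 1.0)
```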

The Normal

Let's get back to the wall. We've got `a`, `b`, and `c` and we want to figure out the normal. If these three points make up an infinite plane, then the normal will jut out of it straight towards us. Recall (or look at the notes above) that the cross product of two vectors makes an orthogonal vector. We can convert our three points to two vectors by picking one to be the start. Let's say our two new vectors are `r = b-a` and `s = c-a`. That means our normal, `bbN` is just `r ox s`! And since we picked `a` as our origin, our formula for the plane is `(P - a) o. bbN = 0` for some point `P`. Put differently, if we have some point `P` and it's on the plane, when we plug it into that formula along with `a` and `bbN` we'll get zero.

Enter: The Line

We mentioned before that our line `bbL` has a start point of `p_0` and an end of `p_1`. This means if a point `P` is on the line, then there's some value `t` where `P = p_0 + t*(p_1 - p_0)`. Now comes the fun part. We want to figure out where this line intersects with our plane (if it does). To do that, we'll plug in the formula for a point on our line into the formula for a point on our plane.

`(P - a) o. bbN = 0` // Point on plane.
`((p_0 + t*(p_1 - p_0)) - a) o. bbN = 0` // Replace `P` with the formula.
`(p_0 + t*(p_1 - p_0)) o. bbN - a o. bbN = 0` // Distribute the dot.
`(p_0 + t*(p_1 - p_0)) o. bbN = a o. bbN` // Add `a o. bbN` to both sides, effectively moving it to the right.
`p_0 o. bbN + t*(p_1 - p_0) o. bbN = a o. bbN` // Distribute again.
`t*(p_1 - p_0) o. bbN = a o. bbN - p_0 o. bbN` // Subtract p_0 o. bbN from both sides.
`t*(p_1 - p_0) o. bbN = (a - p_0) o. bbN` // Pull out the dot product.
`t = ((a - p_0) o. bbN) / ((p_1 - p_0) o. bbN)` // Divide by `(p_1 - p_0) o. bbN` on both sides.

Our final answer, then, is

`t = (bbN o. (a - p_0))/(bbN o. (p_1 - p_0))`

If the denominator is zero, there's no solution. This happens when the line is parallel to the plane (its direction is perpendicular to the normal). Otherwise, we can plug `t` back into our line equation to get some point on the plane!
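Putting the derivation together, here's a rough Python sketch (the helper names and the epsilon cutoff for "parallel" are my choices):

```python
# Small vector helpers over plain tuples.
def sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def line_plane_t(a, b, c, p0, p1):
    """Return t where the line p0 + t*(p1 - p0) meets the plane of abc,
    or None when the line is parallel to the plane (denominator ~ 0)."""
    n = cross(sub(b, a), sub(c, a))     # The normal N = r x s.
    denom = dot(n, sub(p1, p0))
    if abs(denom) < 1e-9:
        return None
    return dot(n, sub(a, p0)) / denom

# The z=0 plane and a segment dropping from z=1 to z=-1: hits at t=0.5.
t = line_plane_t((0, 0, 0), (1, 0, 0), (0, 1, 0),
                 (0.25, 0.25, 1.0), (0.25, 0.25, -1.0))
```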

Inside the Triangle

We have a point `P` that's on the plane and the line, but is it inside the triangle `Delta abc`? There's a fairly easy way to check for that. Any point in the triangle can be described as some combination `a + u*(b-a) + v*(c-a)`, where `u` and `v` are in the interval `[0,1]`. If `u` or `v` falls outside that interval, or if their sum is greater than one, the point is outside the triangle. So we just have to find some `u` and `v` for `P = a + u*(b-a) + v*(c-a)`.

Systems of Equations

It might not seem possible. We have two unknowns and only one equation. However, there's something we've overlooked. `P = a + u*(b-a) + v*(c-a)` actually has three equations. We've been using a shorthand for our points, but `u` and `v` are scalars. Really, we should be looking for a solution for this:

`P = a + u*(b-a) + v*(c-a)`

`P - a = u*(b-a) + v*(c-a)`

`[[b_x - a_x, c_x - a_x], [b_y - a_y, c_y - a_y], [b_z - a_z, c_z - a_z]] * [[u],[v]] = [[P_x - a_x], [P_y - a_y], [P_z - a_z]]`

BAM! Two unknowns, three equations. You might also recognize this to be of the form `bbbAx=b`. You'd be correct. If there were three unknowns and three equations, we could have been fancy and used Cramer's Rule. It's not a hard thing to solve, however.

`bbbAx = b`
`bbbA^TbbbAx = bbbA^Tb` // Start by making `bbbA` a square matrix.
`(bbbA^TbbbA)^-1 bbbA^TbbbA x = (bbbA^TbbbA)^-1 bbbA^T b` // `bbbA^TbbbA` is square, and it has an inverse as long as our triangle isn't degenerate.
`bbbI x = (bbbA^TbbbA)^-1 bbbA^T b` // Cancel the inverse.
`x = (bbbA^TbbbA)^-1 bbbA^T b` // Simplify.

And now we've got x in terms that we know (or can calculate)!

Inverse of a Square Matrix

`(bbbA^TbbbA)^-1` looks like a mess, but it's not as bad as it seems. I'm going to multiply it out and simplify it again.

`bbbA^T bbbA = [[b_x - a_x, b_y - a_y, b_z - a_z],[c_x - a_x, c_y - a_y, c_z - a_z]] * [[b_x - a_x, c_x - a_x], [b_y - a_y, c_y - a_y], [b_z - a_z, c_z - a_z]] = ` an unholy mess.
To cut down on that, I'm going to let `b-a = r` and `c-a = s`. If we rewrite the above using that, we get something more manageable.

`bbbA^T bbbA = [[r_x, r_y, r_z],[s_x, s_y, s_z]] * [[r_x, s_x], [r_y, s_y], [r_z, s_z]]`

Since we're basically multiplying things component wise, we can reuse our code for dot product!

`bbbA^T bbbA = [[r o. r, r o. s], [r o. s, s o. s]]`

That's an easy to calculate 2x2 matrix. As an added bonus, there's a closed-form solution for the inverse of a 2D matrix. You can probably work it out yourself easily enough, but we've gone through a lot already, so here's the solution:

`if bbbA = [[a, b], [c, d]] => bbbA^-1 = 1/(ad-bc) [[d, -b], [-c, a]]`
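As a sanity check, that closed form is small enough to write out directly (a sketch; mind the determinant going to zero for a singular matrix):

```python
def inverse_2x2(m):
    """Closed-form inverse of [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular matrix")
    return ((d / det, -b / det), (-c / det, a / det))

m = ((4.0, 7.0), (2.0, 6.0))
inv = inverse_2x2(m)   # det = 10, so ((0.6, -0.7), (-0.2, 0.4))
```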

So we calculate `r o. r`, `r o. s`, and `s o. s` and plug them into the inverse matrix. Then we multiply the inverse and `bbbA^Tb` et voilà: we've got our values for `u` and `v`.

`bbbA^T bbbA = [[r o. r, r o. s], [r o. s, s o. s]]`

I'm running out of letters!

`alpha = r o. r`
`beta = r o. s`
`gamma = r o. s`
`delta = s o. s`

`(bbbA^T bbbA)^-1 = 1/(alpha * delta- beta * gamma) * [[delta, -(beta)], [-(gamma), alpha]]`

And in all its glory:

`(bbbA^T bbbA)^-1 bbbA^T b = 1/(alpha * delta - beta * gamma) * [[delta, -(beta)], [-(gamma), alpha]] * [[r_x, r_y, r_z],[s_x, s_y, s_z]] * [[P_x - a_x], [P_y - a_y], [P_z - a_z]] = [[u],[v]]`
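The whole `u`, `v` recipe collapses into a short function. Here's a Python sketch (helper names are mine; `gamma` is folded into `beta` since the two are equal):

```python
# Small vector helpers over plain tuples.
def sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def barycentric_uv(a, b, c, p):
    """Solve P = a + u*(b-a) + v*(c-a) via the normal equations."""
    r, s = sub(b, a), sub(c, a)
    alpha, beta, delta = dot(r, r), dot(r, s), dot(s, s)
    det = alpha * delta - beta * beta   # gamma equals beta by symmetry.
    w = sub(p, a)                       # The right-hand side, P - a.
    # (A^T A)^-1 A^T b, multiplied out for the 2x2 case.
    u = (delta * dot(r, w) - beta * dot(s, w)) / det
    v = (alpha * dot(s, w) - beta * dot(r, w)) / det
    return u, v

def inside_triangle(u, v):
    return u >= 0 and v >= 0 and u + v <= 1
```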


Closing: Just Show Me The Code

The moment for which you've been waiting. Here's an ECMAScript 6 implementation of the Point and Triangle objects.

A Gist is available on GitHub, or you can play with this interactively on JSFiddle:

Word2Vec is nothing shy of brilliant.  It's elegant, simple, and profoundly effective at translating very sparse textual data into a dense, semantically-sensible representation.  I've spoken its praises before[1], but it's not without issues.  The two most compelling shortcomings of Word2Vec are, to me, the memory consumption of the model and the inability to recognize novel words.  Let's discuss them in order and talk about solutions.

Word2Vec uses an enormous weight matrix of size (words) * (hidden variables) * sizeOf(float).  Google's pre-trained Word2Vec model is 3.5 gigabytes and is built from the Google News corpus of approximately 100 billion words.  That seems like plenty, but consider that the average native speaker knows 20-50 thousand words, then consider the number of prefixes, suffixes, tenses, participles, and Germanic-ultra-concatenations that can be made: is it any surprise that some terms get discarded as noise?  Not to mention the difficulties of named-entity extraction.  It would be great if we could find a place in the latent space for all those special names like Dick and Harry which preserved not only the concept that they were names but all the connotations that are associated with them.

Relatedly (a word also not found in word2vec), the use of misspellings to deliberately illustrate a point is something Word2Vec is not well equipped to handle.  One can reduce the noise from very infrequently seen words by performing spelling correction, or one can preserve the typographical errors at the cost of another (conceptually unique) term.  It would be great to capture both the semantics and lexes of the language.

What are our options?  Given that there's no upper bound on the length of a word in the English language, and given that spacing and punctuation can be (and are) used arbitrarily at times, we are perhaps best off using a character-based recurrent neural network to capture a sentence.  The benefit is a memory-constrained, reasonably quick network which doesn't require tokenization and can handle novel words.  The shortcoming is that training from this point requires a very sizable corpus and takes an extended time.  Not only that, the recurrent network's output isn't guaranteed to have a useful, continuous representation of the latent concept space.

We may also consider hybridizing Word2Vec and recurrent networks, passing each word through a recurrent network and using the output to predict an embedding.  This means we can adapt to deliberate misssspellings and writing flourishes while also obtaining a good latent representation.

I'm not sure which, if either, will come first.  Nonetheless, I'm hopeful that we'll eventually come to pass around something not too dissimilar to the Inception model, but for natural language processing.

In an earlier update, I wrote about implementing computational graphs to learn new programming languages.  I took the time recently to implement one for Kotlin, as I was having a hard time moving forward in Rust.  Here are some implementation decisions I made and how I feel about them looking back:

Implementing operators as standalone classes.

When I built Aij (the Java compute graph), Rust Compute Graph, and the first revision of CG4j, I had a graph class which contained 'add node'/'add operator' methods that appended lambdas to a big list of operations for forward and reverse nodes.  This meant the graph class was nice and portable.  I could drop it into new projects and be on my way.  The downside comes from serialization: Java's built-in serializer can't handle lambdas, so saving and restoring the graph automatically isn't possible.  Another downside is that lambdas incur a small performance overhead, from what I've seen, even in Rust, where they forced an unnecessary heap allocation.  The solution: define an operator or node class and subclass it.
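To make the trade-off concrete, here's a toy Python illustration of the two styles (nothing like the real CG4j code, just the shape of the idea):

```python
# Style 1: lambdas appended to a list. Compact and portable, but closures
# can't go through a standard serializer, so the graph can't be saved.
forward_ops = []
forward_ops.append(lambda values: values[0] + values[1])

# Style 2: a standalone operator class per operation. More boilerplate,
# but each node is a named class, so pickling/saving the graph works.
class Node:
    def __init__(self, *inputs):
        self.inputs = inputs

    def forward(self, values):
        raise NotImplementedError

class AddNode(Node):
    def forward(self, values):
        return values[0] + values[1]
```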

Returning Nodes or Integers?

Depending on whether you opt to use nodes or node indices, you might find yourself passing either a node index and a graph pointer to your constructors OR a node.  Passing around a graph is inconvenient, especially if you have to load one from disk.  It means you need to deal with the whole chicken-and-egg thing.  Passing around a node is easier by far, but it means you have to handle saving out the nodes which might exist on multiple paths and, more problematically...

Recursively computing the forward propagation calls.

One thing I did do correctly in the Rust implementation was calculating the nodes in order.  Each node had an ID/index, and when doing forward-prop I'd just iterate from left-to-right across the node list and calculate each one.  In the Kotlin implementation, I opted to memoize and recursively compute the nodes.  That shouldn't have been an issue, in theory, since I was caching results, but the function calls aren't free, and it means getting stack-traces like this:

 at com.josephcatrambone.cg4j.Tensor.add_i(Tensor.kt:256)
 at com.josephcatrambone.cg4j.AddNode.forwardOperation(Node.kt:87)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:33)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
 at com.josephcatrambone.cg4j.Graph.forward$default(Graph.kt:19)
 at com.josephcatrambone.cg4j.RNN.predict(RNN.kt:129)
 at com.josephcatrambone.cg4j.RNN.predict$default(RNN.kt:97)
 at com.josephcatrambone.cg4j.MainKt.testRNN(Main.kt:107)

The single advantage to this approach is that unused code paths don't get called.  If you have nodes A, B, ..., X, Y, Z, and Z = A+B, then the non-memoized version computes the values for everything between A and Z, as opposed to just A, B, and Z for the recursive version.  Depending on how much of your graph is used at any one time, this can be a big saving.  I'm not sure which is the clear winner here.
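For illustration, here's a toy Python sketch of the two traversal strategies (the node layout and names are invented for the example, not taken from CG4j):

```python
# Each node is (op, input_ids); leaves have op=None and hold a value.

def forward_iterative(nodes, values):
    """Left-to-right over the node list: computes *every* node."""
    out = list(values)
    for i, (op, ins) in enumerate(nodes):
        if op is not None:
            out[i] = op(*[out[j] for j in ins])
    return out

def forward_recursive(nodes, values, target, cache=None):
    """Memoized recursion from the target: touches only the nodes the
    target actually depends on, at the cost of a deep call stack."""
    cache = {} if cache is None else cache
    if target in cache:
        return cache[target]
    op, ins = nodes[target]
    if op is None:
        result = values[target]  # A leaf/input node.
    else:
        result = op(*[forward_recursive(nodes, values, j, cache) for j in ins])
    cache[target] = result
    return result
```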

UPDATE: This code is now available in both Java and Python!

I've been on an automatic differentiation kick ever since reading about dual numbers on Wikipedia.

I implemented a simple forward-mode autodiff system in Rust, thinking it would allow me to do ML faster. I failed to realize/read that forward differentiation, while simpler, requires one forward pass per input: each pass gives the derivatives of ALL outputs with respect to ONE input variable. Reverse mode, in contrast, gives you the derivatives of ONE output with respect to all inputs in a single pass.

That is to say, if I had f(x, y, z) = [a, b, c], forward mode would give me da/dx, db/dx, dc/dx in a single pass. Reverse mode would give me da/dx, da/dy, da/dz in a single pass.
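To make forward mode concrete, here's a toy dual-number sketch in Python (my own minimal version, not the code from the Rust repo):

```python
class Dual:
    """A dual number x + eps*dx with eps^2 = 0. Seeding dx = 1 on one
    input and evaluating gives d(output)/d(that input) in one pass."""
    def __init__(self, x, dx=0.0):
        self.x, self.dx = x, dx

    def __add__(self, other):
        return Dual(self.x + other.x, self.dx + other.dx)

    def __mul__(self, other):
        # The product rule falls out of (x + eps*dx) * (y + eps*dy).
        return Dual(self.x * other.x,
                    self.x * other.dx + self.dx * other.x)

# f(x, y) = x*y + x, seeded to differentiate with respect to x:
x = Dual(3.0, 1.0)
y = Dual(4.0, 0.0)
f = x * y + x
# f.x is 15.0; f.dx is df/dx = y + 1 = 5.0
```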

Forward mode is really easy. I have a repo with code changes here:

Reverse mode took me a while to figure out, mostly because I was confused about how adjoints worked. I'm still confused, but I'm now so accustomed to the strangeness that I'm not noticing it. Here's some simple, single-variable reverse-mode autodiff. It's about 100 lines of Python:

> let doit x y = x(x(y))
> let arnold x = x ++ "aauuugh"
> putStrLn $ arnold "Get to da choppa"
Get to da choppaaauuugh
> putStrLn ( doit arnold "Nauuugh" )
Nauuughaauuughaauuugh