# Cross-Tweet : Terminus

Working on Terminus again. The 1st bot turned out creepy, but I think I'm going to keep it. #gamedev #machinelearning #screenshotsaturday pic.twitter.com/rVsRC8wAuE

— Joseph Catrambone (@JCatrambone) July 17, 2017

# Terminus Live Stream Development

Recorded a dev session.

# Assembly and CPU Design

I revived a prototype platformer I started in college called Terminus. For me, it was mostly an excuse to fiddle with chatbots as a method of driving the plot forward, but I found an interesting extra twist: an in-game assembly language that lets the player solve programming problems.

Here's a general overview of the CPU:

- 256 bytes of memory.
- 1-byte instructions with up to three operands.
- No registers, but every instruction allows dereferencing its operands and writing to an arbitrary memory address.

Here's some sample code which calculates whether a value on BUS0 is prime, writing 1 for primes, 2 for composites, and 0 while it's still working.

```
;; Prime directive.
;; Assorted one-byte values are coming in on BUS0.
;; Write 1 to BUS1 if they're prime numbers or 2 if they're composite.
;; Keep at 0 while processing.
;; 0 and 1 are not prime.
JMP START

; Set aside some protected memory at the start of the program.
BUS0: 0x00
BUS1: 0x00
TARGET: 0x00
DIVISOR: 0x00
TEMP: 0x00

;; Begin.
START:
; Save the value from BUS0 just in case it changes, then init.
MOV *BUS0 TARGET
MOV 2 DIVISOR
MOV 0 BUS1

; Main loop.
LOOP:
JE *DIVISOR *TARGET NO_DIVISOR_FOUND ; Reached the target: no divisor exists.
MOD *TARGET *DIVISOR TEMP
JE *TEMP 0 DIVISOR_FOUND             ; Divides evenly: we found a factor.
ADD 1 *DIVISOR DIVISOR               ; Otherwise, increase our divisor by 1.
JE 0 0 LOOP

DIVISOR_FOUND:
MOV 2 BUS1 ; A divisor means composite.
JE 0 0 START

NO_DIVISOR_FOUND:
MOV 1 BUS1 ; No divisor below TARGET means prime.
JE 0 0 START
```
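The machine is simple enough that a toy interpreter fits in a few lines. Here's a sketch of how the dereferencing scheme could work — the class, operand encoding, and opcode handling are my own invention for illustration, not the game's actual VM: each operand carries a flag for whether it should be read through memory (the `*` prefix), and every instruction writes its result to a memory address.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a Terminus-style CPU: 256 bytes of memory, no
// registers, and operands that are either literals or dereferenced memory
// cells. Names and encoding are assumptions, not the game's real VM.
class ToyCpu {
    final int[] mem = new int[256]; // one byte per cell, stored as masked ints
    int pc = 0;

    // An operand: a literal value, plus a flag meaning "read mem[value] instead".
    record Op(int value, boolean deref) {
        int load(int[] mem) { return deref ? mem[value & 0xFF] : value; }
    }

    // Every instruction writes its result to a memory address; JE instead
    // treats 'dest' as the instruction index to jump to (a resolved label).
    record Instr(String name, Op a, Op b, int dest) {}

    final List<Instr> program = new ArrayList<>();

    // Execute one instruction; returns false once we run off the program's end.
    boolean step() {
        if (pc < 0 || pc >= program.size()) return false;
        Instr i = program.get(pc++);
        switch (i.name()) {
            case "MOV" -> mem[i.dest()] = i.a().load(mem) & 0xFF;
            case "ADD" -> mem[i.dest()] = (i.a().load(mem) + i.b().load(mem)) & 0xFF;
            case "MOD" -> mem[i.dest()] = i.a().load(mem) % i.b().load(mem);
            case "JE"  -> { if (i.a().load(mem) == i.b().load(mem)) pc = i.dest(); }
            default    -> throw new IllegalStateException("unknown opcode " + i.name());
        }
        return true;
    }
}
```

With labels resolved to instruction indices ahead of time, the whole assembler/interpreter stays pleasantly small — one upside of having no registers or addressing modes to decode.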

It's a nice change of pace from ML work to do some CPU design. The choice to leave off registers actually turned out to simplify the code, too. Here are some screenshots:

# Implementation Notes on Calculus for Computational Graphs

In an earlier update, I wrote about implementing computational graphs as a way to learn new programming languages. I recently took the time to implement one in Kotlin, since I was having a hard time moving forward in Rust. Here are some implementation decisions I made and how I feel about them looking back:

### Implementing operators as standalone classes.

When I built Aij (the Java compute graph), the Rust compute graph, and the first revision of CG4j, I had a graph class with 'add node'/'add operator' methods which appended lambdas to a big list of forward and reverse operations. This kept the graph class nice and portable: I could drop it into new projects and be on my way. The downside is serialization. Java's built-in serializer can't handle lambdas, so saving and restoring the graph automatically isn't possible. Lambdas also incur a small performance overhead, from what I've seen, even in Rust, where they cost an unnecessary heap allocation. The solution: define an operator or node class and subclass it for each operation.
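The subclass approach looks roughly like this — a hedged sketch, not CG4j's actual class layout; the names `Node`, `AddNode`, and `MulNode` and the array-based signatures are placeholders I've invented for illustration:

```java
import java.io.Serializable;

// Sketch of the "operator as a named subclass" design. Because each
// operation is a concrete class rather than a lambda, Java's built-in
// serializer can save and restore the whole graph.
abstract class Node implements Serializable {
    final int[] inputs; // indices of parent nodes in the graph's node list
    Node(int... inputs) { this.inputs = inputs; }
    abstract float forward(float[] args);                 // this node's value
    abstract float[] backward(float[] args, float grad);  // gradient per input
}

class AddNode extends Node {
    AddNode(int a, int b) { super(a, b); }
    @Override float forward(float[] args) { return args[0] + args[1]; }
    @Override float[] backward(float[] args, float grad) {
        return new float[]{ grad, grad }; // d(a+b)/da = d(a+b)/db = 1
    }
}

class MulNode extends Node {
    MulNode(int a, int b) { super(a, b); }
    @Override float forward(float[] args) { return args[0] * args[1]; }
    @Override float[] backward(float[] args, float grad) {
        return new float[]{ grad * args[1], grad * args[0] }; // product rule
    }
}
```

The class-per-operator boilerplate is mildly annoying compared to a one-line lambda, but you get serialization, better stack traces, and a place to hang per-operator state.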

### Returning Nodes or Integers?

Depending on whether you opt to use nodes or node indices, you'll find yourself passing your constructors either a node index plus a graph pointer OR a node object. Passing around a graph is inconvenient, especially if you have to load one from disk; you end up dealing with the whole chicken-and-egg problem. Passing around a node is easier by far, but it means you have to handle saving nodes which might be reachable along multiple paths and, more problematically...
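The two styles side by side — a minimal sketch with invented names (`IndexGraph`, `ObjNode`), only meant to show where the graph pointer does and doesn't appear:

```java
import java.util.ArrayList;
import java.util.List;

// Style 1 (hypothetical): nodes are integer indices, so every builder
// method lives on the graph and you must thread the graph everywhere.
class IndexGraph {
    final List<String> ops = new ArrayList<>();
    int input()           { ops.add("input"); return ops.size() - 1; }
    int add(int a, int b) { ops.add("add(" + a + "," + b + ")"); return ops.size() - 1; }
}

// Style 2 (hypothetical): nodes are objects holding references to their
// parents. No graph pointer needed, but a node shared by multiple paths
// must be deduplicated when you save the graph out.
class ObjNode {
    final String op;
    final ObjNode[] parents;
    ObjNode(String op, ObjNode... parents) { this.op = op; this.parents = parents; }
    static ObjNode input()                   { return new ObjNode("input"); }
    static ObjNode add(ObjNode a, ObjNode b) { return new ObjNode("add", a, b); }
}
```

With indices, serialization is trivial (the node list *is* the graph); with objects, building expressions is trivial but the graph exists only implicitly as the reachable set of parents.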

### Recursively computing the forward propagation calls.

One thing I did do correctly in the Rust implementation was calculating the nodes in order. Each node had an ID/index, and when doing forward-prop I'd just iterate from left-to-right across the node list and calculate each one. In the Kotlin implementation, I opted to memoize and recursively compute the nodes. That shouldn't have been an issue, in theory, since I was caching results, but the function calls aren't free, and it means getting stack-traces like this:

```
at com.josephcatrambone.cg4j.Tensor.add_i(Tensor.kt:256)
at com.josephcatrambone.cg4j.AddNode.forwardOperation(Node.kt:87)
at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:33)
at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
at com.josephcatrambone.cg4j.Graph.forward$default(Graph.kt:19)
at com.josephcatrambone.cg4j.RNN.predict(RNN.kt:129)
at com.josephcatrambone.cg4j.RNN.predict$default(RNN.kt:97)
at com.josephcatrambone.cg4j.MainKt.testRNN(Main.kt:107)
```

The single advantage of the recursive approach is that unused code paths don't get called. If you have nodes A, B, ..., X, Y, Z, and Z = A + B, then the iterative version computes values for everything between A and Z, while the recursive version computes just A, B, and Z. Depending on how much of your graph is used at any one time, this can be a big savings. I'm not sure which approach is the clear winner here.
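The two strategies can be sketched as follows, assuming (as both of my implementations do) that node indices are already in topological order; the class name, the `int[][] inputs` parent table, and the `eval` callback are all invented for this illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical forward-pass strategies over a topologically ordered node
// list. inputs[i] holds node i's parent indices; eval.apply(i, args)
// computes node i's value from its parents' values.
class ForwardPass {
    final int[][] inputs;
    final BiFunction<Integer, float[], Float> eval;

    ForwardPass(int[][] inputs, BiFunction<Integer, float[], Float> eval) {
        this.inputs = inputs;
        this.eval = eval;
    }

    // Iterative: one flat loop, constant stack depth, but computes every
    // node up to 'target' whether or not it feeds into the result.
    float forwardIterative(int target) {
        float[] values = new float[target + 1];
        for (int i = 0; i <= target; i++) {
            float[] args = new float[inputs[i].length];
            for (int j = 0; j < args.length; j++) args[j] = values[inputs[i][j]];
            values[i] = eval.apply(i, args);
        }
        return values[target];
    }

    // Recursive + memoized: touches only the ancestors of 'target', at the
    // cost of one stack frame per level of graph depth (hence the deep
    // Graph.forward traces above).
    float forwardRecursive(int target, Map<Integer, Float> cache) {
        Float hit = cache.get(target);
        if (hit != null) return hit;
        float[] args = new float[inputs[target].length];
        for (int j = 0; j < args.length; j++) {
            args[j] = forwardRecursive(inputs[target][j], cache);
        }
        float v = eval.apply(target, args);
        cache.put(target, v);
        return v;
    }
}
```

A third option, which gets the best of both, would be to walk the ancestors of the target once to collect the live node set, then evaluate just those in index order — still flat, but skipping dead paths.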