I did my masters in machine learning, so I’m a little touchy on the subject. It always stands out to me when someone says, ‘big data punishes poor people’ because it sounds like “polynomials are anti-semetic” or “bolt cutters are racist”.

Machine learning is a tool like any other, and it can be used for nefarious purposes. I don’t think it’s an unreasonable assertion that things like search-bubbling actually contribute negatively to echo-chamber effects, as they result in people seeing only data that reinforces their viewpoints (as a side effect of being more relevant). To cast the blanket statement like this, however, I think is a catchy but unnecessarily negative act.

I hope the book doesn’t overlook the positive contributions that data mining has made, like discovering genetic markers for diseases, finding new antibiotics, finding treatments for cancers, decreasing water consumption in agriculture, tracking diminishing animal populations, or even more mundane things like providing automatic subtitles to videos for the hearing impaired.

The most interesting question I have to raise is this: is it _more_ humane to remove the biases of a human? Humans are REALLY good at seeing patterns. We’re so good at seeing patterns that we see them where there are none — we see Jesus in toast, we see faces in the sky, we see people as part of a group. That last one is racist, and while we can’t alter our perceptions we can be made aware of them and do everything we can to try and work around our ‘feelings’. Machines are getting good at recognizing patterns too, now. They even beat us in a lot of cases. If we train a model with racist data, though, it will generate racist predictions. Can we efficiently sanitize data to be sure that it’s fair to everyone involved? Is it inevitable that people will abuse statistics to further their own ends? Equally curious: if data suggests a 99% chance that someone will default on a loan, should we chide the operator of the tool for using it? What if they’re trying to protect their own best interests? I don’t know if there’s a winner there.

There’s a lot of answers I don’t have and, ironically, an inability to predict the future, but I do have an emotional response to the article: it’s unpleasant and bothersome. I can’t say it’s wrong, but I can say it’s an incomplete picture and that it furthers the author’s agenda: making a boogeyman of an emerging technology. I don’t like that.

tl;dr: This is a nuanced topic and I’m dubious that the author can reasonably cover it, fearing instead that it devolves into fear-monger.

Compressed representation is a really good idea. In a nutshell, the idea is this: “There are lots of words. Lots of words mean the same thing. Lots of times the same word means different things. It would be nice if we only had one word per concept. Let’s make a neural network figure that out.” Voilá!

My machine learning toolbox Aij is open source and available on Github. I’ll be using the NeuralNetwork class and the BackpropTrainer to walk you through a rudimentary implementation. What we want in the end is to have a word vector map into a denser semantic vector. If that doesn’t make sense, fear not. We will explain shortly. If you already understand semantic vectors and just want to see the implementation details, skip down to the sign.

This document does assume that you have some understanding of programming and a remote familiarity with machine learning.

When you’re searching on the internet, be it Google, Bing, or DuckDuckGo, the search engine needs a way of mapping from each word in your query to a result. The naïve approach would be to take the search query, split it on the word boundary, and go iterating through each word of each document, counting the number of times the words overlap. This was, for a short while, the approach that some search providers used. (Recall, if you can, the era before Google.)

There are some problems with this method: first, you get LOTs of spurious results when a term happens over and over again in a document. Second, long documents get an unfair advantage over shorter ones, even though the shorter ones might be more relevant. Third, performance isn’t great since you need full-text in memory. I could go on.

You can see some improvements by going from checking for each word into a matrix representation. If you have instead table with a page name and a big vector of all the words in a document, you speed up your search tremendously because you can do a simple dot product between your search vector and your document vector.

Perhaps we’re getting ahead of ourselves. Let’s start by defining a ‘dictionary’. A dictionary is a big map which goes from a word to a number. Here’s my dictionary:

the -> 0
cat -> 1
dog -> 2
my -> 3
ran -> 4

I can use a dictionary to produce a word vector or sentence vector. A vector is an array of length ‘# of words’. If I’ve seen a billion different words, 20,000 of which are unique, my vector is going to have 20000 elements. In the case above, I’ve only seen five words.

The sentence vector for “My cat ran.” would be this:

"My cat ran" -> [the, cat, dog, my, ran] -> [0, 1, 0, 1, 1]

Note that each word in the sentence becomes a ‘1’ in the sentence vector. The position of the ‘1’ is determined by the dictionary.

Let’s really quickly review what ‘dot product’ is. Mathematically, a dot product is the sum of the products of the respective elements.


Let’s look at the dot product of two vectors, <2, 3, 0> and <1, 4, 5>.

“The respective elements…”

2, 1 and 3, 4 and 0, 5.

“The product of…”

2*1 = 2, 3*4 = 12, 0*5 = 0.

“The sum of…”

2+12+0 = 14.

So the dot product takes each corresponding pair of items in each vector, multiplies them, and adds the result.

If we want to say how similar our document, “My cat ran.” and the page “My dog ran.” are, we can look at the dot product of these two vectors!

[0, 1, 0, 1, 1] (dot) [0, 0, 1, 1, 1] = sum(0*0, 1*0, 0*1, 1*1, 1*1) = 2

So, “My cat ran” and “My dog ran” have a match of two. Compare this to the similarity of “My cat ran” and “The dog ran”.

[0, 1, 0, 1, 1] (dot) [1, 0, 1, 0, 1] = sum(0*1, 1*0, 0*1, 1*0, 1*1) = 1

A similarity of just 1!

Dot product is fast to compute and can be done in parallel for a lot of pages at the same time. Still, there are some problems. You’ll note that the vectors are _really_ sparse. That is, most of the 20k rows are zeros. That’s not a problem in and of itself, but it also means when words change form (like ‘ran’ to ‘run’), we might miss an entry for them. It would be handy if we had only ONE word for each CONCEPT. So instead of one entry for ‘feline’, ‘cat’, ‘gato’, ‘neko’. It would be some ‘high-dimensional’ concept in a ‘feature space’. Our objective is to find that representation.

Skip here if you only care about implementation.

We want a ‘compressed’ representation of the words, and we don’t want to hand-craft that representation. What we can do is have our network make this representation. We can take a neural network and give it a bunch of input nodes and THE SAME NUMBER OF OUTPUT NODES. The number of hidden nodes will determine the size of our ‘semantic vectors’. We then train the network using our input vectors as both the examples and the labels! The network has to find a way to encode (using the hidden layers) a compressed representation that best holds the information from the input layers. Here’s what that might look like:

A Semantic Autoencoder

To train this network, all we have to do is make a bunch of sentence vectors and push them through a neural network! It’s that simple! (Okay, maybe not, but it’s a good short summary.) If we can train an autoencoder on a bunch of sentences, hopefully ‘semantically similar’ terms will have vectors which are closer to each other. As an example, dog and cat are probable semantically closer than dog and computer. Here’s the output from one of the unit tests in my library:

Words array: [i, have, a, cat, and, a, dog, i, own, a, computer, my, cat, chases, ...]
Sentences array: [i have a cat and a dog, i own a computer, my cat chases my dog, ...]
cat vec: [0.228247, 0.096414, 0.127684, 0.234154, 0.378221]
dog vec: [0.349040, 0.114843, 0.116345, 0.309349, 0.140704]
comp vec: [0.123077, 0.117904, 0.258514, 0.558843, -0.050910]
catDog dist: 0.07712784882514626
dogComp dist: 0.17024387054490916

Note that indeed, cat and dog are more similar based on their use in the example sentences than dog and cat. Next, I’ll run through the implementation start to finish.

Let us begin by defining our training data, deciding on a ‘window’ size, and picking a semantic vector size.

We could train on all the words in a sentence at the same time, which would make this implementation very close to Principal Component Analysis. Instead, we select a window size of five words. (Which means we grab bunches of five words as we read through a sentence.)

We also select a semantic vector size of five. A bigger semantic vector size gives up more nuance with words, but requires more training data to work properly. We don’t always want more nuance.

Lastly, I’m hard-coding the training data because I am lazy. has an awesome collection of articles from Wikipedia which would work well here, but I’m just going to have one big sentence separated by newlines.

I start by breaking the training data into words and sentences. The words are used first to build the dictionary.

final int WINDOW_SIZE = 5;
final int HIDDEN_LAYER_SIZE = 5;
String paragraph =
	"i have a cat and a dog\n" +
	"i own a computer\n" +
	"my cat chases my dog\n" +
	"i enjoy machine learning\n" +
	"my computer runs machine learning problems\n" +
	"my pets are a cat and a dog\n" +
	"this machine learning problem is expecting unique words\n" +
	"i use github to store my machine learning code\n" +
	"i hope this code will produce good results\n" +
	"this test problem is fun and boring at the same time\n" +
	"my cat and dog are fun\n" +
	"my cat and dog are named cat and dog\n" +
	"i am bad at names\n";

String[] words = paragraph.split("\\s");
String[] sentences = paragraph.split("\\n");

I want to be able to go from a word to its index in the dictionary and back from an index to a word. The index to word part can be done with just an array of strings. Easy peasy. The word to index I do with a Map.

// We have to be able to map from words to our semantic vector.
// Build a mapping without stemming or chunking.
List indexToWord = new ArrayList<>();
Map wordToIndex = new HashMap<>();

for(String w : words) {
	// Expensive unique lookup, but this is just a test.
	// If the word is in the list already, don't bother.
	int foundAt = -1;
	for(int i=0; i < indexToWord.size() && foundAt == -1; i++) {
		if(indexToWord.get(i).equals(w)) {
			foundAt = i;
	// Not in the list so far, so add it to the end and keep the index so we can find it again.
	if(foundAt == -1) {
		wordToIndex.put(w, indexToWord.size()-1);

Be aware that I'm only adding unique words to the mapping. Now that that's finished, we've got our dictionary to make sentence vectors. It's time to go through our sentences and produce some training examples!

// Make our training examples.
int exampleIndex = 0;
Matrix examples = new Matrix(sentences.length*8, indexToWord.size(), 0.0);
for(String sentence : sentences) {
	String[] sentenceWords = sentence.split("\\s");
	for(int window = 0; window < sentenceWords.length-WINDOW_SIZE; window++) {
		for(int j=0; j < WINDOW_SIZE; j++) {
			String w = sentenceWords[window+j];
			examples.set(exampleIndex, wordToIndex.get(w), 1.0);
		exampleIndex++; // Each window of three words is an example.

If I were to run a sliding window over this sentence, it might look like this:

[If I were to run] a sliding window over this sentence, it might look like this:

If [I were to run a] sliding window over this sentence, it might look like this:

If I [were to run a sliding] window over this sentence, it might look like this:

... and so on.

So our first sentence vector has 1's for the words 'if', 'i', 'were', 'to', and 'run'. Repeat until the end of the sentence.

Now that we have our examples, it's time to set up our network. Remember, we want it to have the same number of inputs as outputs, and we want the inputs to be the size of our dictionary.

NeuralNetwork nn = new NeuralNetwork(new int[]{indexToWord.size(), HIDDEN_LAYER_SIZE, indexToWord.size()}, new String[]{"linear", "logistic", "linear"});

My choice of 'linear', 'logistic', 'linear' is somewhat arbitrary. You could use 'tanh' for all of them or any combination. A single caveat: at least ONE layer must be nonlinear. (The product of any set of linear operations is linear. We need this to be nonlinear.)

Finally, we make our trainer, tweak some variables (like how long we want it to run) and let it train.

BackpropTrainer trainer = new BackpropTrainer();
trainer.momentum = 0.0;
trainer.learningRate = 0.02;
trainer.batchSize = 10;
trainer.maxIterations = 10000;
trainer.earlyStopError = 0.0;

trainer.train(nn, examples, examples, null);

Now go make some coffee or read a book, since this will be training for a while. When it's finished, we can use our network to get the semantic-space representation of any word or words it has seen:

Matrix input = new Matrix(1, indexToWord.size());

input.set(0, wordToIndex.get("cat"), 1.0);
Matrix catActivation = nn.forwardPropagate(input)[1];
input.elementMultiply_i(0); // Reset so we can reuse this.

Just like above, we're making a vector with the same number of entries as the dictionary and setting the index of our query word to '1'. The tricky part here is that nn.forwardPropagate(input)[1] bit. Remember in the diagram above how layer 0 is our raw word vector? And layer 2 is our reconstructed vector? That means layer 1 is our semantic vector!

The elementMultiply_i(0) just resets the input vector to zero.

We can do this for whatever word pairs we like to determine how different they are! I'm using squared-error here instead of abs() or L1 error, but the same idea applies. Cosine-distance is also immensely popular because it gives a nice, normalized vector. I'm excluding it here only for the sake of brevity. Go look it up. ;) Smaller distances means the words are closer in semantic space. Think of it like peanut-butter being close to jam in the grocery store. Things which are semantically similar are closer.

Matrix catDogDiff = catActivation.subtract(dogActivation);
double catDogDistance = catDogDiff.elementMultiply(catDogDiff).sum(); // Squared distance.
Matrix dogComputerDiff = dogActivation.subtract(computerActivation);
double dogComputerDistance = dogComputerDiff.elementMultiply(dogComputerDiff).sum();

System.out.println("cat vec: " + catActivation);
System.out.println("dog vec: " + dogActivation);
System.out.println("comp vec: " + computerActivation);
System.out.println("catDog dist: " + catDogDistance);
System.out.println("dogComp dist: " + dogComputerDistance);
assertTrue(catDogDistance < dogComputerDistance);

And that gives us the output we see at the start.

Semantic representations are an awesomely powerful concept which can be applied at a bunch of different levels of abstraction. We're using it now for words, but there's nothing to prevent it from being used for concepts or images or any of a large number of places. Give it a go and see what you can do with it!

Notes and Corrections

Distance and Similarity are opposite sides of the same coin. If the distance between a and b is zero, the similarity between a and b is large. I prefer distance as a metric to similarity because 0 means identical and anything bigger is more dissimilar. Things can only be so identical, but they can be infinitely different. Similarity doesn't have the same property. 0 can be 'not at all similar', but then 1 may or may not be quite similar. Similarity assumes that there's a _maximum_ amount of DISSIMILARITY. That doesn't sit well with me, but there are all kinds of ways to handle it mathematically.

All code for this is available here as part of the unit test for my library.

The article originally misrepresented this idea as 'word2vec'. Since it doesn't include the skip-gram modelling portion I am renaming it 'compressed representation'. A huge thank you to Reddit user Powlerbare for the correction!

You can read more on the following pages:

Feature learning with Word2Vec

Making Sense of Word2Vec

Or: “How I learned to stop worrying and generate planet names.”

A friend of mine is working on a procedural universe generator. Procedural stuff is something I happen to love, as you might ascertain from my older ramblings. In the interest of handing out some knowledge, I want to illustrate an application of different number system representations. After this is done, you’ll have an intuitive understanding of how to take any number (from 0 through 2^64-1) and get back a planet name like ‘Gregorio-X. Ignus system. Charion group. Darion supercluster.’

First, some background.

One numeric basis to which you’re already accustomed is base-10. Base ten has the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 in it. If you write a number in base ten, like 123, you know that a digit in 2’s place counts for 10x what a digit in 3’s place does. That’s perhaps a long way of saying “123 is 100 + 20 + 3”. We can rewrite 123 as 1*10^2 + 2*10^1 + 3*10^0. Anything to the power of zero is 1. 10^0 is 1. Our first 3*10^0 reduces to 3*1 = 3. Great. Now something easier. 10^1 = 10, so 2*10^1 is 2*10 = 20. The process repeats. What you should be seeing from this pattern is each digit position increasing the weight by a factor of 10 — each change in position means the exponent is bigger by one. This will become important when we look at other bases.


Binary is base-2. The digits in binary are simple: 0 and 1. Let’s look at the number 101. That’s 5 in base-10. Similar to how we can write 123 as 1*10^2 + 2*10^1 + 3*10^0, we can rewrite 101 as 1*2^2 + 0*2^1 + 1*2^0. 1*4 + 0*2 + 1*1 = 4 + 0 + 1 = 5.

With me so far? Hang tight. We’re almost there.


What would happen if we wanted a base with MORE than ten digits? Well, hexadecimal happens to be base 16. We have digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F! Note that we need more than just our regular ten digits. To account for all the numbers in hex, we have to include some letters. ‘A’ in this case is ’10’ and ‘B’ is ’11’. And so on up to ‘F’ = ’15’. This means ’10’ in hexadecimal is 16 in decimal. 1*16^1 + 0*16^0. C2A is C*16^2 + 2*16^ 1 + A*16^0. Substitute C with 12 and A with 10 to get 12*16^2 + 2*16^1 + 10*16^0. 12*256 + 2*16 + 10*1 = 3072 + 32 + 1 = 3114.

Arbitrary Bases

Notice how in hexadecimal we had to include other digits A-F. We don’t _have_ to use 0-9 as our basis digits, either. Let’s use base-10, but for digits 0-9 we will substitute ‘fluffy’, ‘fuzzy’, ‘poof’, ‘bunny’, ‘kitten’, ‘puppy’, ‘snuggly’, ‘happy’, ‘cuddly’, ‘puff’. The number 0189 then is ‘fluffyfuzzycuddlypuff’. We can also go backwards, ‘snugglypuppypoof’ is 652.

As a natural extension to this, we can pick bigger bases. In my code, I picked base-65 because that gave me around 8 digits to represent all the numbers 0 to 2^64. Here’s the Python code which converts from base-10 to base-65. My digits for the base are made up names.

Day 3: Mildly Interesting

I’m not sure how fast news travels around my social circles, so I figure it might be wise to leave a first-person report of today’s events. Yes, I hit something with my car on the way up. Yes, it was a mountain. Yes, I’m perfectly fine. I walked away from the accident without so much as a bruise. But I’m getting ahead of myself. Here’s everything in the right order for today:

I woke up to find the weather on Wyoming worse than the night before. While we’d been relegated to sub-45mph travels due to high winds carrying enough snow to block lane markers, today was met with almost total snow blindness. I got maybe 45 miles in the first two hours, following in a cautious but close improv motorcade. After we crested the last mountains outside mile marker 120, things cleared up. Snow disappeared gradually and the sun came out. Wind, too, was largely gone. I relaxed and returned to the posted speed limit (five/ten under, surprisingly. While I’m not one to obey the speed limit, I was doing just that entirely be coincidence.)

Traffic spread out, the speedier people flew on ahead, and the heavier trucks fell behind. I crested one last mountain which, as it turns out, was responsible for the subsiding wind. As I reached the top my front wheels started to slide. I angled my wheels to my lane and released the gas pedal. My vehicle performed then a maneuver I will refer to now as, “The Terrifying Lateral Shimmy.” It slid from the far right lane, still pointing forward, horizontally until the far left lane. As I was about to exit the far left lane I decided to press the brake and immediately regretted doing so. The car got back traction and headed in the direction I’d pointed it. Unfortunately, the direction I’d pointed it was much farther to the right than I anticipated. I corrected to the left and succeeded in swinging the back end more or less sideways, triggering a Fast and Furious: Tokyo Drift style maneuver, except Paul Walker was dead and there was no Vin Diesel.

I recall thinking, “Huh. I’ve never been in a car accident before,” just prior to hitting the center divide. The front passenger headline contacted the center divide at just under 70mph and rotated the vehicle across the passenger side doors. It did a tiny bounce and a totally sweet 180 before the rear tail light contacted the embankment.

I was fine. My indicator lights suggested I check the engine. I did not, in part because I don’t know how and in part because I smelled something like gasoline, except more carcinogenic. (I’d come to understand that this is the smell airbags make when they deploy.) I got out, perhaps in violation of safety procedures, and did a quick check to make sure I didn’t leave any debris which might hurt other drivers. There appeared not to be any. I got back in my car and phoned the police.

All cell conversations were a bit tricky, as this was scarcely in 1X signal territory. A passerby stopped. “That was awesome!” he said. “I know!” I responded. He offered a trip back into town, but I decided to try and deal with the highway patrol. After a dropped call, I phoned again. “Where are you?” “I have no idea. Possibly between Wyoming and Utah, west of Rawling on the 80W.” “What mile marker?” “Not sure.” “See if you can flag a vehicle and have them call it in.” “Okay. I’ll call you back.”

By the time I’d gotten out of the vehicle, someone had already stopped. “That was awesome!” “Yeah. It was pretty sweet.” “Are you hurt?” “Don’t think so. I’m kinda’ hungry.” “Can I help?” “Yes, please. Could you call 911 and tell them the mile marker?” “Sure.”

Before our conversation was wrapped up, another person stopped. “Are you okay?” “I’m okay, thanks. These people are helping me.”

I got back in my car and called my insurance company. “I think the police are sending a tow truck. I’ll call you back when they get here and do the report, but I figure I’d tell you. Seems like the sort of thing you’d want to know.” “Thanks for calling. Is anyone hurt?” “No. I’m a little hungry.” Then we went on.

After around half an hour, the officer arrived, tow truck in… tow. I shook their hands and introduced myself. The tow truck operator went to clearing the other bits of my vehicle from the center divide. He refused my help, but asked for my key. I gave it to him and he sat down in the car and started it. It didn’t explode. Astonishingly, the car even managed to pull a U-turn and end up on the right side of the road where he could clear it without issue.

I ran through the motions with the officer. He was tremendously friendly. Pretty patient, too. I asked if I could try his gun. He said no. I did get a picture inside the squad car though.

Grizzled angry Rowsdower truck driver gave me a ride into town. The conversation was lively, considering I’d almost died and he was the reincarnation of Rowsdower. I need to make it a point to figure out how to relate with career tow truck drivers. Anyway, we ended up in a mountain town with a population of around five.

I sat at the local diner and dug around for a way to get a rental car. Nearest open place was around 15 miles away. Gave them a call to see if anything was available. The guy, Bob, said they had some open cars. Coincidentally, he was also across the street from me and could give me a ride. Holy shit.

I rented a minivan, (it seemed an appropriate upgrade from my effeminate grocery car) and drove back to the town to a choir of horns. Fuck all if I’m doing above 50 after that shit.

I transferred the goods from the Yaris to the minivan and sustained my first injury from the accident — bumped my head on the rear view mirror while moving things.

Called the insurance company one more time and let them speak with Rowsdower for prosperity, allowing him to reiterate how awesomely totaled the car was, then jumped in the minivan and drove off like a madman on Xanax at well under the speed limit. Nine hours later, and I’m here in a hotel in Someplace, Utah, 8 hours from San Francisco.

Good night.

Word2Vec, a method for converting a word to a vector representation, is getting a lot of hype these days. Rightfully so, it’s wonderfully effective and, while not a huge departure from old LSI/LSH methods, is packaged beautifully and simply. For those of you with no concept of what that means, imagine converting a word into a numerical value in some higher dimensional space. Like, instead of ‘hello’, we might have a point at x=10, y=5, z=30. Now add a few more dimensions and you’ve got roughly the idea. Why do we care? Well, we can do mathematical operations on vectors. In fact, an interesting property of the way in which word2vec maps things is that it supports analogies very well. We can say, (dogs – dog) = (kings – ___) and word to vec will give us ‘king’. Another example might be (‘queen’ – ‘woman’) = (___ – ‘man’), and we’ll get “king”.

My biggest issue with word2vec is this: we are still using more or less whole words with some minimal parsing. This is limiting because there do exist words which are spelled wrong, or words which are entirely nonsensical but still need to be represented. If I ask, “What is the plural of ‘florb’?” you will probably respond, “florbs,” even if ‘florb’ is an entirely nonsensical word. Word2vec, because it has no way of vectorizing novel words after the training part, can’t fabricate or comprehend unfamiliar terms. This is what I’d like to correct.

How? Well, I know a bit about deep learning. It is my hammer, and all problems look like nails. The solution I propose is this: train a network on an input of four characters and try and compress it from 128^4 bits to n bits. Why 128^4 bits? Well, if we have four characters, the upper-bound for prefixes and suffixes, and we have 128 printable characters, (26 alpha + 26 alpha uppercase + 10 numeric + punctuation + blah blah) then that means a single 4-character chunk has 128^4 bits. That’s a lot. If we have any hope of understanding full words, we’ll want to get a good amount of variety while keeping our representation compact. Towards that end, we need to decide on exactly what size n we will need.

This is the purpose of today’s experiment: take one million sentences, break it into 4-character chunks, and see how many there are and what they look like. We could break these sentences in a number of ways, at word boundaries or across, in even chunks. Word boundaries might be ‘more correct’ in some sense, but that means things like “San Francisco” would be broken up and change our distribution. “P.F. Chang”, too, would be incorrectly split. So, in the interest of simplicity, we’ll just take chunks of four characters with a sliding window.

# Taking even sized samples.
fin = open("1M_sentences.txt", 'r');
chunks = dict();

# Convert the 1M sentences into a single, long, sentence, split by double spaces.
lines = fin.readlines();
big_sentence = "  ".join(lines);

# Iterate across the sentence in 4-character chunks, appending (and counting) the characters.
for i in range(len(big_sentence)-3):
    chunk = big_sentence[i:i+4];
    if chunk not in chunks:
        chunks[chunk] = 0;
    chunks[chunk] += 1;

There. Now chunks is a large dictionary indexed by blobs of four letters. Let’s look at some properties of this. Let’s get a histogram of chunk counts.

chunk_sizes = dict()
for chunk, count in chunks.iteritems():
    if count not in chunk_sizes:
        chunk_sizes[count] = 0;
    chunk_sizes[count] += 1

1:		############
2:		###########
3:		##########
4:		##########
5:		##########
6:		#########
7:		#########
8:		#########
9:		#########
10:		#########
11:		#########
12:		########
13:		########
14:		########
15:		########
16:		########
17:		########
18:		########
19:		########
20:		########

There are 2^12 chunks of 4-letters which appear once. This falls off linearly in our graph which is logarithmically in the actual numbers. (So there are half as many chunks which appear 2x and a quarter as many which appear 3x, and so on.) This parallels Zipf’s Law. 200 words make up 90% of sentences. We get a lot of information from the small, 1% of words which are unique to a document. There are a practically infinite number of words which appear once or twice, however. There’s plenty of room to compress this.

There are 1611720 unique chunks of four characters in our sample. The most frequent chunk appears 339365 times. Well, by the magic of binary and exponents, if we have a representation with log(1611720)+1 bits, we can represent all the chunks uniquely! It even has a bit of a margin to allow us to compose/cover unseen cases. How many bits do we need? 16. 16 bits to cover all those cases.

Let’s try and train a network to go from 128^4 to 16 bits.