Archive

Rant

Here I sit, overwhelmed at the insanity taking place in the political arena. I'm taking a moment to collect my thoughts for the same reason as anyone else that keeps a journal: so when we look back at the injustices and failures of the past we get some sense of their context. Maybe it will also remind me that we tried.

The Net Neutrality Repeal

The FCC, as chaired by Ajit Pai, has stated its intention to roll back the Title II protections afforded in 2015 under President Barack Obama. There are five members of the board, three Republicans and two Democrats. The Democrats have voiced their opposition to the changes. The three majority members favor the repeal of the consumer protections and have given it the compelling title "Restoring Internet Freedom Order." Their argument is that regulations are stifling innovation. Comcast and Verizon, in investor meetings, have both declared that Net Neutrality rules do not, in fact, hinder innovation. Millions of comments have been filed by consumers who favor the added protections of Title II, some of them form letters. There have also been millions of automated bot comments in opposition. It seems reasonably likely that major ISPs do not expect to get away with the fabricated comments, but hope to muddy the waters enough to make public feedback seem unusable.

It's looking like the repeal will go through, followed by a litany of confusing legal actions which will likely ALSO be muddied by telecom providers. (This can happen because only one appellate court can hear the petition, and it's chosen more or less at random -- first come, first served. If a telecom files a petition against some part of the FCC order, that jurisdiction is entered into the lottery, allowing them to shift the case to a more favorable venue.)

Healthcare Repeal

The House and the Senate have both voted to try to dismantle key provisions of the Affordable Care Act. The ACA has insured a record number of people (in b4 smarmy individual mandate comment) and has started to restrain the growth of health care costs. It has been life-saving for more than a few people and has protected countless others from bankruptcy. Health care costs could be further reduced if states wouldn't refuse federal funds. (This has actually happened.) Additionally, since the president effectively has to sign the checks reimbursing insurers for the high-risk pools, that adds uncertainty to the market and makes it harder for insurance providers to plan ahead -- pushing out smaller providers and driving up costs for all participants.

Tax Reform

After a massive public outcry against the mass repeal of healthcare, it looks like Republicans have doubled down on the "Look, we're all about taxes" mantra. The new tax bill contains provisions to break one of the three legs of the ACA: the individual mandate. Without the individual mandate, there's no incentive for anyone to join the pool of insured until they need insurance. The addition of young, healthy, low-risk people decreases the cost of providing care and drives down premiums. Without the mandate, people can put off buying coverage until they actually need it, and the rules on preexisting conditions mean they can't be refused service (a good thing, if coupled with the individual mandate). This makes Obamacare untenable and gives Republicans deniability: "Look, we always said it was doomed. It had nothing to do with us sabotaging it. It just fell apart due to nothing we did. All we did was pass this one unrelated tax bill and suddenly it exploded."

In Short Supply

I've been in fairly regular contact with my representatives at the House and Senate level. (Props to Jamario and Alex L. You guys rock.) Every day, though, it feels like the best we can hope for is to throw lives at the battlefront while we retreat. Corporate profits continue to skyrocket. Dividends are paid to board members and shareholders instead of employees. The middle class's wages stagnate or shrink while the ranks of the working poor grow. A fraction of a handful of self-serving people are setting up our country for failure to further their own personal gain, and they are manipulating countless thousands into believing it's for their own good. No hope for a better tomorrow.

Legislation should be driven by evidence and inspired by axioms.
Check your results and revisit your decisions. Did anything change? What? Why?
There's no shame in making mistakes -- learn from them and do better.
A good solution now is better than a perfect solution never.

Seek the greatest good for the greatest number of people.
What we're able to do is more important than what we're allowed to do.
Social safety nets make for a healthier society.

Science is fundamental to our economy and our future.
Learning makes for a better world -- it should be accessible to all people.

Employment is a right -- a person should be able to thrive on the fruits of their labor, not merely survive.
There's honor in trades. Craftsmanship should be celebrated.
A person's worth is not contingent upon their employment.

Be decent to everyone.
Be excellent to those who are excellent to you.
In all things, try and maintain a sense of humor.

I had the distinct honor of working with a number of talented individuals at the unofficial unconference.

We were ambitious, maybe too much so, but I comfort myself by saying that incremental progress is progress nonetheless, and that it's a worthwhile endeavor to report both our successes and our failures. The following is an account of what we tried and our reason for trying.

We began with the problem phrased as follows: "Given a high-level input description of a program, produce source code which meets the requirements." We didn't expect to solve this problem in a day, but figured that generating a dataset would be both feasible in the time frame and worthwhile. Collectively, we decided to produce tuples of problem descriptions, programs, and example outputs. The next question was one of output language: we wanted to generate examples in a language which was simple enough for a machine to learn and also usable by a human creating the programming problems. Python was the favorite language among the group, but it had its limitations -- it's a big language with lots of minutiae. Learning to generate Python would require handling list comprehensions and more syntactic subtlety than we felt was strictly necessary. Instead, we opted for a high-level, Turing-complete computational graph representation of a language (basically, an AST). The graph could be "compiled" to a description or "compiled" to Python and then run, giving all the required outputs of the problem.

The next issue was generating programming problems of sufficient variety to be useful. Too few examples would basically guarantee overfitting, so the manual construction of programming examples was out. Too repetitive a dataset would mean the model would ignore the details of the English language and just pick up on the structure of the sentences. That seemed okay at the time -- we figured we could remove details without too much effort to make problems more 'programming-challenge-esque'. It quickly became apparent that selecting which details to omit when framing a problem was almost as big a challenge as the original.

Our graph approach was producing problems like "Given variables a, b, c, return a*c - b if c > a else c*b;" Not a particularly interesting problem, since it basically involves a direct translation from the description to code, and we wanted to avoid building "a compiler, but with ML."
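Just to make the graph idea concrete, here's a toy sketch of my own (not the code we wrote at the event): each node in the graph only needs to know how to render itself two ways, once as English and once as Python.


// Toy sketch, not the hackathon code: each graph node renders as English or as Python.
interface Node {
	String toEnglish();
	String toPython();
}

class Var implements Node {
	final String name;
	Var(String name) { this.name = name; }
	public String toEnglish() { return name; }
	public String toPython() { return name; }
}

class BinOp implements Node {
	final String op; final Node left, right;
	BinOp(String op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }
	public String toEnglish() { return left.toEnglish() + op + right.toEnglish(); }
	public String toPython() { return left.toPython() + " " + op.trim() + " " + right.toPython(); }
}

class IfElse implements Node {
	final Node condition, ifTrue, ifFalse;
	IfElse(Node condition, Node ifTrue, Node ifFalse) { this.condition = condition; this.ifTrue = ifTrue; this.ifFalse = ifFalse; }
	public String toEnglish() { return "return " + ifTrue.toEnglish() + " if " + condition.toEnglish() + " else " + ifFalse.toEnglish(); }
	public String toPython() { return "return (" + ifTrue.toPython() + ") if (" + condition.toPython() + ") else (" + ifFalse.toPython() + ")"; }
}

// Building the example problem above:
// new IfElse(new BinOp(" > ", new Var("c"), new Var("a")),
//            new BinOp(" - ", new BinOp("*", new Var("a"), new Var("c")), new Var("b")),
//            new BinOp("*", new Var("c"), new Var("b")))
// .toEnglish() gives roughly "return a*c - b if c > a else c*b".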

The remainder of the day was split: the first part was spent trying to construct more elaborate program descriptions and more subtle, interesting problems. The latter part was spent in an entirely different mode, where we tried to use one autoencoder to learn the Python AST and another to learn the structure of English, then bridge the two using our limited dataset scraped from Project Euler and from OpenAI's sample dataset.

I'm not yet sure what we'd do differently. It seemed we made the right choices given the available information most of the time -- the biggest oversight to me remains misjudging the quality of the English output graph. I have to wonder if that could be improved through clever graph optimization tricks. Perhaps that's a project for another hackathon.

On a personal note, I was overwhelmingly impressed by the depth of skill of the people there. As much as I don't enjoy being the least smart person in a room, that's a good state for a programming collaboration. I look forward to working with everyone again after I've had a chance to study some more. Maybe a lot more.

Thank you to everyone who participated. It has been a pleasure.

I did my master's in machine learning, so I'm a little touchy on the subject. It always stands out to me when someone says 'big data punishes poor people', because it sounds like "polynomials are anti-Semitic" or "bolt cutters are racist".

Machine learning is a tool like any other, and it can be used for nefarious purposes. I don't think it's an unreasonable assertion that things like search-bubbling actually contribute to echo-chamber effects, as they result in people seeing only data that reinforces their viewpoints (as a side effect of being more relevant). Casting a blanket statement like this, however, strikes me as catchy but unnecessarily negative.

I hope the book doesn't overlook the positive contributions that data mining has made, like discovering genetic markers for diseases, finding new antibiotics, finding treatments for cancers, decreasing water consumption in agriculture, tracking diminishing animal populations, or even more mundane things like providing automatic subtitles to videos for the hearing impaired.

The most interesting question I have to raise is this: is it _more_ humane to remove the biases of a human? Humans are REALLY good at seeing patterns. We're so good at seeing patterns that we see them where there are none -- we see Jesus in toast, we see faces in the sky, we see people as part of a group. That last one is racist, and while we can't alter our perceptions we can be made aware of them and do everything we can to try and work around our 'feelings'. Machines are getting good at recognizing patterns too, now. They even beat us in a lot of cases. If we train a model with racist data, though, it will generate racist predictions. Can we efficiently sanitize data to be sure that it's fair to everyone involved? Is it inevitable that people will abuse statistics to further their own ends? Equally curious: if data suggests a 99% chance that someone will default on a loan, should we chide the operator of the tool for using it? What if they're trying to protect their own best interests? I don't know if there's a winner there.

There are a lot of answers I don't have and, ironically, an inability to predict the future, but I do have an emotional response to the article: it's unpleasant and bothersome. I can't say it's wrong, but I can say it paints an incomplete picture and furthers the author's agenda: making a boogeyman of an emerging technology. I don't like that.

tl;dr: This is a nuanced topic and I'm dubious that the author can reasonably cover it, fearing instead that it devolves into fear-mongering.

Compressed representation is a really good idea. In a nutshell, the idea is this: "There are lots of words. Lots of words mean the same thing. Lots of times the same word means different things. It would be nice if we only had one word per concept. Let's make a neural network figure that out." Voilà!

My machine learning toolbox Aij is open source and available on Github. I'll be using the NeuralNetwork class and the BackpropTrainer to walk you through a rudimentary implementation. What we want in the end is to have a word vector map into a denser semantic vector. If that doesn't make sense, fear not. We will explain shortly. If you already understand semantic vectors and just want to see the implementation details, skip down to the sign.

This document does assume that you have some understanding of programming and a remote familiarity with machine learning.

When you're searching on the internet, be it Google, Bing, or DuckDuckGo, the search engine needs a way of mapping from each word in your query to a result. The naïve approach would be to take the search query, split it on word boundaries, and iterate through each word of each document, counting the number of times the words overlap. This was, for a short while, the approach that some search providers used. (Recall, if you can, the era before Google.)
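As a rough illustration (my own sketch, not any search engine's actual code), that naïve approach boils down to something like this:


// Naive search scoring: count how many times any query word appears in the document.
static int naiveScore(String query, String document) {
	int score = 0;
	for(String q : query.toLowerCase().split("\\s+")) {
		for(String d : document.toLowerCase().split("\\s+")) {
			if(q.equals(d)) { score++; }
		}
	}
	return score;
}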

There are some problems with this method: first, you get LOTS of spurious results when a term appears over and over again in a document. Second, long documents get an unfair advantage over shorter ones, even though the shorter ones might be more relevant. Third, performance isn't great, since you need the full text in memory. I could go on.

You can see some improvements by moving from checking each word to a matrix representation. If you instead have a table with a page name and a big vector of all the words in the document, you speed up your search tremendously, because you can do a simple dot product between your search vector and your document vector.

Perhaps we're getting ahead of ourselves. Let's start by defining a 'dictionary'. A dictionary is a big map which goes from a word to a number. Here's my dictionary:

the -> 0
cat -> 1
dog -> 2
my -> 3
ran -> 4

I can use a dictionary to produce a word vector or sentence vector. A vector is an array of length '# of words'. If I've seen a billion words, 20,000 of which are unique, my vector is going to have 20,000 elements. In the case above, I've only seen five words.

The sentence vector for "My cat ran." would be this:

"My cat ran" -> [the, cat, dog, my, ran] -> [0, 1, 0, 1, 1]

Note that each word in the sentence becomes a '1' in the sentence vector. The position of the '1' is determined by the dictionary.

Let's really quickly review what 'dot product' is. Mathematically, a dot product is the sum of the products of the respective elements.

"What?"

Let's look at the dot product of two vectors, <2, 3, 0> and <1, 4, 5>.

"The respective elements..."

2, 1 and 3, 4 and 0, 5.

"The product of..."

2*1 = 2, 3*4 = 12, 0*5 = 0.

"The sum of..."

2+12+0 = 14.

So the dot product takes each corresponding pair of items in each vector, multiplies them, and adds the result.

If we want to say how similar our document "My cat ran." and the page "My dog ran." are, we can look at the dot product of these two vectors!

[0, 1, 0, 1, 1] (dot) [0, 0, 1, 1, 1] = sum(0*0, 1*0, 0*1, 1*1, 1*1) = 2

So, "My cat ran" and "My dog ran" have a match of two. Compare this to the similarity of "My cat ran" and "The dog ran".

[0, 1, 0, 1, 1] (dot) [1, 0, 1, 0, 1] = sum(0*1, 1*0, 0*1, 1*0, 1*1) = 1

A similarity of just 1!
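If you'd like to verify those numbers yourself, here's a minimal, self-contained sketch (plain Java, and the helper names are my own) that builds the sentence vectors from the five-word dictionary above and computes the two dot products:


import java.util.Arrays;
import java.util.List;

public class BagOfWordsDemo {
	// The tiny dictionary from above: a word's position in the list is its index.
	static final List<String> DICTIONARY = Arrays.asList("the", "cat", "dog", "my", "ran");

	// Turn a sentence into a 0/1 vector over the dictionary.
	static double[] sentenceVector(String sentence) {
		double[] vec = new double[DICTIONARY.size()];
		for(String word : sentence.toLowerCase().split("\\s+")) {
			int index = DICTIONARY.indexOf(word.replaceAll("[^a-z]", ""));
			if(index >= 0) { vec[index] = 1.0; }
		}
		return vec;
	}

	// Dot product: the sum of the products of the respective elements.
	static double dot(double[] a, double[] b) {
		double sum = 0.0;
		for(int i=0; i < a.length; i++) { sum += a[i] * b[i]; }
		return sum;
	}

	public static void main(String[] args) {
		System.out.println(dot(sentenceVector("My cat ran."), sentenceVector("My dog ran.")));  // 2.0
		System.out.println(dot(sentenceVector("My cat ran."), sentenceVector("The dog ran."))); // 1.0
	}
}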

Dot product is fast to compute and can be done in parallel for a lot of pages at the same time. Still, there are some problems. You'll note that the vectors are _really_ sparse. That is, most of the 20k rows are zeros. That's not a problem in and of itself, but it also means that when words change form (like 'ran' to 'run'), we might miss an entry for them. It would be handy if we had only ONE word for each CONCEPT. So instead of one entry each for 'feline', 'cat', 'gato', and 'neko', there would be a single 'high-dimensional' concept in a 'feature space'. Our objective is to find that representation.

Skip here if you only care about implementation.

We want a 'compressed' representation of the words, and we don't want to hand-craft that representation. What we can do is have our network make this representation. We can take a neural network and give it a bunch of input nodes and THE SAME NUMBER OF OUTPUT NODES. The number of hidden nodes will determine the size of our 'semantic vectors'. We then train the network using our input vectors as both the examples and the labels! The network has to find a way to encode (using the hidden layers) a compressed representation that best holds the information from the input layers. Here's what that might look like:

A Semantic Autoencoder

To train this network, all we have to do is make a bunch of sentence vectors and push them through a neural network! It's that simple! (Okay, maybe not, but it's a good short summary.) If we can train an autoencoder on a bunch of sentences, hopefully 'semantically similar' terms will have vectors which are closer to each other. As an example, dog and cat are probably semantically closer than dog and computer. Here's the output from one of the unit tests in my library:

Words array: [i, have, a, cat, and, a, dog, i, own, a, computer, my, cat, chases, ...]
Sentences array: [i have a cat and a dog, i own a computer, my cat chases my dog, ...]
cat vec: [0.228247, 0.096414, 0.127684, 0.234154, 0.378221]
dog vec: [0.349040, 0.114843, 0.116345, 0.309349, 0.140704]
comp vec: [0.123077, 0.117904, 0.258514, 0.558843, -0.050910]
catDog dist: 0.07712784882514626
dogComp dist: 0.17024387054490916

Note that indeed, cat and dog are more similar, based on their use in the example sentences, than dog and computer. Next, I'll run through the implementation start to finish.

Let us begin by defining our training data, deciding on a 'window' size, and picking a semantic vector size.

We could train on all the words in a sentence at the same time, which would make this implementation very close to Principal Component Analysis. Instead, we select a window size of five words. (Which means we grab bunches of five words as we read through a sentence.)

We also select a semantic vector size of five. A bigger semantic vector captures more nuance in the words, but requires more training data to work properly. We don't always need more nuance.

Lastly, I'm hard-coding the training data because I am lazy. Lateral.io has an awesome collection of articles from Wikipedia which would work well here, but I'm just going to use one big paragraph with sentences separated by newlines.

I start by breaking the training data into words and sentences. The words are used first to build the dictionary.


final int WINDOW_SIZE = 5;
final int HIDDEN_LAYER_SIZE = 5;
String paragraph =
	"i have a cat and a dog\n" +
	"i own a computer\n" +
	"my cat chases my dog\n" +
	"i enjoy machine learning\n" +
	"my computer runs machine learning problems\n" +
	"my pets are a cat and a dog\n" +
	"this machine learning problem is expecting unique words\n" +
	"i use github to store my machine learning code\n" +
	"i hope this code will produce good results\n" +
	"this test problem is fun and boring at the same time\n" +
	"my cat and dog are fun\n" +
	"my cat and dog are named cat and dog\n" +
	"i am bad at names\n";

String[] words = paragraph.split("\\s");
String[] sentences = paragraph.split("\\n");

I want to be able to go from a word to its index in the dictionary and back from an index to a word. The index-to-word part can be done with just a list of strings. Easy peasy. The word-to-index part I do with a Map.


// We have to be able to map from words to our semantic vector.
// Build a mapping without stemming or chunking.
List<String> indexToWord = new ArrayList<>();
Map<String, Integer> wordToIndex = new HashMap<>();

for(String w : words) {
	// Expensive unique lookup, but this is just a test.
	// If the word is in the list already, don't bother.
	int foundAt = -1;
	for(int i=0; i < indexToWord.size() && foundAt == -1; i++) {
		if(indexToWord.get(i).equals(w)) {
			foundAt = i;
		}
	}
	// Not in the list so far, so add it to the end and keep the index so we can find it again.
	if(foundAt == -1) {
		indexToWord.add(w);
		wordToIndex.put(w, indexToWord.size()-1);
	}
}

Be aware that I'm only adding unique words to the mapping. Now that that's finished, we've got our dictionary to make sentence vectors. It's time to go through our sentences and produce some training examples!


// Make our training examples.
int exampleIndex = 0;
// sentences.length*8 is a generous upper bound on the number of windows per sentence.
Matrix examples = new Matrix(sentences.length*8, indexToWord.size(), 0.0);
for(String sentence : sentences) {
	String[] sentenceWords = sentence.split("\\s");
	// Slide a WINDOW_SIZE-word window across the sentence.
	for(int window = 0; window <= sentenceWords.length-WINDOW_SIZE; window++) {
		for(int j=0; j < WINDOW_SIZE; j++) {
			String w = sentenceWords[window+j];
			examples.set(exampleIndex, wordToIndex.get(w), 1.0);
		}
		exampleIndex++; // Each window of WINDOW_SIZE words is an example.
	}
}

If I were to run a sliding window over this sentence, it might look like this:

[If I were to run] a sliding window over this sentence, it might look like this:

If [I were to run a] sliding window over this sentence, it might look like this:

If I [were to run a sliding] window over this sentence, it might look like this:

... and so on.

So our first sentence vector has 1's for the words 'if', 'i', 'were', 'to', and 'run'. Repeat until the end of the sentence.

Now that we have our examples, it's time to set up our network. Remember, we want it to have the same number of inputs as outputs, and we want the inputs to be the size of our dictionary.


NeuralNetwork nn = new NeuralNetwork(new int[]{indexToWord.size(), HIDDEN_LAYER_SIZE, indexToWord.size()}, new String[]{"linear", "logistic", "linear"});

My choice of 'linear', 'logistic', 'linear' is somewhat arbitrary. You could use 'tanh' for all of them or any combination. A single caveat: at least ONE layer must be nonlinear. (The composition of any set of linear operations is itself linear; an all-linear network would collapse into a single matrix multiply, so we need at least one nonlinearity.)

Finally, we make our trainer, tweak some variables (like how long we want it to run) and let it train.


BackpropTrainer trainer = new BackpropTrainer();
trainer.momentum = 0.0;
trainer.learningRate = 0.02;
trainer.batchSize = 10;
trainer.maxIterations = 10000;
trainer.earlyStopError = 0.0;

trainer.train(nn, examples, examples, null);

Now go make some coffee or read a book, since this will be training for a while. When it's finished, we can use our network to get the semantic-space representation of any word or words it has seen:


Matrix input = new Matrix(1, indexToWord.size());

input.set(0, wordToIndex.get("cat"), 1.0);
Matrix catActivation = nn.forwardPropagate(input)[1];
input.elementMultiply_i(0); // Reset so we can reuse this.
// Repeat the same three lines for "dog" and "computer" to get dogActivation and computerActivation.

Just like above, we're making a vector with the same number of entries as the dictionary and setting the index of our query word to '1'. The tricky part here is that nn.forwardPropagate(input)[1] bit. Remember in the diagram above how layer 0 is our raw word vector? And layer 2 is our reconstructed vector? That means layer 1 is our semantic vector!

The elementMultiply_i(0) just resets the input vector to zero.

We can do this for whatever word pairs we like to determine how different they are! I'm using squared error here instead of absolute-value (L1) error, but the same idea applies. Cosine distance is also immensely popular because it gives a nice, normalized measure. I'm excluding it here only for the sake of brevity -- go look it up, or see the quick sketch after the code below. ;) Smaller distances mean the words are closer in semantic space. Think of it like peanut butter being close to jam in the grocery store. Things which are semantically similar are closer.


Matrix catDogDiff = catActivation.subtract(dogActivation);
double catDogDistance = catDogDiff.elementMultiply(catDogDiff).sum(); // Squared distance.
Matrix dogComputerDiff = dogActivation.subtract(computerActivation);
double dogComputerDistance = dogComputerDiff.elementMultiply(dogComputerDiff).sum();

System.out.println("cat vec: " + catActivation);
System.out.println("dog vec: " + dogActivation);
System.out.println("comp vec: " + computerActivation);
System.out.println("catDog dist: " + catDogDistance);
System.out.println("dogComp dist: " + dogComputerDistance);
assertTrue(catDogDistance < dogComputerDistance);

And that gives us the output we see at the start.
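For the curious, here's roughly what the cosine version mentioned earlier might look like, built only out of the elementMultiply and sum calls we've already used. Treat it as a sketch, not part of the actual unit test:


// Cosine similarity is dot(a, b) / (|a| * |b|); cosine distance is one minus that.
double dotProduct = catActivation.elementMultiply(dogActivation).sum();
double catNorm = Math.sqrt(catActivation.elementMultiply(catActivation).sum());
double dogNorm = Math.sqrt(dogActivation.elementMultiply(dogActivation).sum());
double cosineSimilarity = dotProduct / (catNorm * dogNorm); // 1.0 means the vectors point the same way.
double cosineDistance = 1.0 - cosineSimilarity;             // 0.0 means identical direction.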

Semantic representation is an awesomely powerful concept which can be applied at a bunch of different levels of abstraction. We're using it here for words, but there's nothing to prevent it from being used for concepts or images or in any of a large number of other places. Give it a go and see what you can do with it!

Notes and Corrections

Distance and similarity are opposite sides of the same coin. If the distance between a and b is zero, the similarity between a and b is large. I prefer distance as a metric to similarity because 0 means identical and anything bigger is more dissimilar. Things can only be so identical, but they can be infinitely different. Similarity doesn't have the same property: 0 can be 'not at all similar', but it's unclear whether 1 means 'quite similar' or 'identical'. A bounded similarity score assumes that there's a _maximum_ amount of DISSIMILARITY. That doesn't sit well with me, but there are all kinds of ways to handle it mathematically.
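As one example of handling it mathematically (my own illustration, not something from the original post), you can squash an unbounded distance into a bounded similarity:


// Hypothetical conversion: maps a distance in [0, infinity) to a similarity in (0, 1].
// A distance of 0 gives a similarity of 1; larger distances approach 0.
double similarity = 1.0 / (1.0 + distance);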

All code for this is available here as part of the unit test for my library.

The article originally misrepresented this idea as 'word2vec'. Since it doesn't include the skip-gram modelling portion I am renaming it 'compressed representation'. A huge thank you to Reddit user Powlerbare for the correction!

You can read more on the following pages:

Feature learning with Word2Vec

Making Sense of Word2Vec