Monthly Archives: November 2017

Word2Vec is nothing short of brilliant.  It’s elegant, simple, and profoundly effective at translating very sparse textual data into a dense, semantically sensible representation.  I’ve sung its praises before[1], but it’s not without issues.  The two most compelling shortcomings of Word2Vec are, to me, the memory consumption of the model and its inability to recognize novel words.  Let’s discuss them in order and talk about solutions.

Word2Vec uses an enormous weight matrix of size (words) * (hidden variables) * sizeOf(float).  Google’s pre-trained Word2Vec model is 3.5 gigabytes and is built from the Google News corpus of approximately 100 billion words.  That seems like plenty, but consider that the average native speaker knows 20-50 thousand words, then consider the number of prefixes, suffixes, tenses, participles, and Germanic-ultra-concatenations that can be made.  It is unsurprising that some terms are seen too rarely and get discarded as noise.  Not to mention the difficulties of named-entity extraction.  It would be great if we could find a place in the latent space for all those special names like Dick and Harry, one which preserved not only the concept that they are names but all the connotations associated with them.
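To make the size formula concrete, here is a back-of-the-envelope sketch.  It assumes the commonly quoted figures for the Google News model, roughly 3 million vocabulary entries and 300 hidden variables, with 4-byte floats.  Those numbers are assumptions for illustration, not something stated above.

```python
def embedding_matrix_bytes(vocab_size, hidden_size, bytes_per_float=4):
    """Size of a dense (words) x (hidden variables) float matrix in bytes."""
    return vocab_size * hidden_size * bytes_per_float


# Assumed figures for the Google News model: ~3M entries, 300 dimensions.
size = embedding_matrix_bytes(vocab_size=3_000_000, hidden_size=300)

print(f"{size / 1e9:.2f} GB")     # 3.60 GB  (decimal gigabytes)
print(f"{size / 2**30:.2f} GiB")  # 3.35 GiB (binary gigabytes)
```

The result lands close to the 3.5 gigabyte figure, and it makes the scaling obvious: the cost grows linearly with every new word added to the vocabulary.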

Relatedly (a word also not found in Word2Vec), the use of misspellings to deliberately illustrate a point is something Word2Vec is not well equipped to handle.  One can reduce the noise from very infrequently seen words by performing spelling correction, or one can preserve the typographical errors at the cost of another (conceptually unique) term in the vocabulary.  It would be great to capture both the semantics and the lexical form of the language.
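A toy sketch of why rare and novel words fall through the cracks.  It mimics the minimum-count cutoff that Word2Vec implementations typically apply when building a vocabulary.  The corpus, the function name build_vocab, and the threshold are all made up for illustration.

```python
from collections import Counter


def build_vocab(tokens, min_count=2):
    """Keep only tokens seen at least min_count times; the rest are dropped."""
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n >= min_count}


# Hypothetical miniature corpus, whitespace-tokenized.
corpus = "the cat sat on the mat and the cat slept on the mat relatedly".split()
vocab = build_vocab(corpus, min_count=2)

for word in ["cat", "relatedly", "caaat"]:
    status = "in vocabulary" if word in vocab else "out of vocabulary"
    print(word, "->", status)
```

Here "relatedly" is discarded for being too rare and "caaat" was never seen at all, so neither has a vector.  Spelling correction would fold "caaat" into "cat", at the cost of losing the flourish.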

What are our options?  Given that there’s no upper bound to the length of a word in the English language, and given that spacing and punctuation can be (and are) used arbitrarily at times, we are perhaps best off using a character-based recurrent neural network to capture a sentence.  The benefit is a memory-constrained, reasonably quick network which doesn’t require tokenization and can handle unique words.  The shortcoming is that training such a network from scratch requires a very sizable corpus and a long time.  Not only that, the recurrent network’s output isn’t guaranteed to be a useful, continuous representation of the latent concept space.
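As a rough illustration of the character-based idea, here is a minimal forward pass of a plain tanh recurrent network over raw bytes.  It is untrained (random weights), and the hidden size and the names init_char_rnn, encode, and count_parameters are assumptions for this sketch.  The point is only that the parameter count is fixed regardless of vocabulary and that any string, tokenized or not, can be fed in.

```python
import math
import random


def init_char_rnn(hidden_size=16, n_symbols=256, seed=0):
    """Random weights for a plain tanh RNN reading one byte per step."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(hidden_size)
    w_in = [[rng.uniform(-scale, scale) for _ in range(hidden_size)]
            for _ in range(n_symbols)]
    w_rec = [[rng.uniform(-scale, scale) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    return w_in, w_rec


def encode(text, params):
    """Fold a string of any length into one fixed-size hidden state."""
    w_in, w_rec = params
    hidden_size = len(w_rec)
    h = [0.0] * hidden_size
    for byte in text.encode("utf-8"):
        x = w_in[byte]  # a one-hot input times w_in is just a row lookup
        h = [math.tanh(x[j] + sum(h[i] * w_rec[i][j] for i in range(hidden_size)))
             for j in range(hidden_size)]
    return h


def count_parameters(params):
    w_in, w_rec = params
    return sum(len(row) for row in w_in) + sum(len(row) for row in w_rec)


params = init_char_rnn()
state = encode("Deliberate misssspellings, odd  spacing... all fine.", params)
print(len(state))                # 16
print(count_parameters(params))  # 4352
```

Compare 4,352 weights with the billions of bytes in the word-level matrix.  What this sketch does not give you is the hard part: without a lot of training, nothing makes nearby states mean similar things.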

We may also consider hybridizing Word2Vec and recurrent networks, passing each word through a recurrent network and using the output to predict an embedding.  This means we can adapt to deliberate misssspellings and writing flourishes while also obtaining a good latent representation.
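A heavily simplified sketch of that hybrid, under loud assumptions: the target vectors below are invented 3-dimensional stand-ins for real Word2Vec embeddings, and to keep the code short the recurrent weights are frozen at random values so that only a linear readout is trained (by plain stochastic gradient descent on squared error).  A real version would train the recurrent weights as well.

```python
import math
import random

HIDDEN = 12  # recurrent state size (assumed)
EMBED = 3    # toy embedding size; real Word2Vec vectors are far larger

rng = random.Random(0)
W_IN = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(256)]
W_REC = [[rng.uniform(-0.3, 0.3) for _ in range(HIDDEN)] for _ in range(HIDDEN)]

# Hypothetical stand-ins for pre-trained word vectors.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "tree": [0.0, 0.9, 0.3],
    "house": [0.1, 0.2, 0.9],
}


def char_state(word):
    """Final state of a frozen tanh RNN run over the word's UTF-8 bytes."""
    h = [0.0] * HIDDEN
    for byte in word.encode("utf-8"):
        h = [math.tanh(W_IN[byte][j] + sum(h[i] * W_REC[i][j] for i in range(HIDDEN)))
             for j in range(HIDDEN)]
    return h


def predict(word, readout):
    """Map a word's character-level state to an embedding-sized vector."""
    h = char_state(word)
    return [sum(h[i] * readout[i][k] for i in range(HIDDEN)) for k in range(EMBED)]


def train_readout(targets, epochs=500, lr=0.05):
    """Fit the linear readout so predictions approach the known embeddings."""
    readout = [[0.0] * EMBED for _ in range(HIDDEN)]
    states = {word: char_state(word) for word in targets}
    for _ in range(epochs):
        for word, target in targets.items():
            h = states[word]
            for k in range(EMBED):
                pred = sum(h[i] * readout[i][k] for i in range(HIDDEN))
                err = pred - target[k]
                for i in range(HIDDEN):
                    readout[i][k] -= lr * err * h[i]
    return readout


def mse(targets, readout):
    """Mean squared error between predicted and target embeddings."""
    total = sum((p - t) ** 2
                for word, target in targets.items()
                for p, t in zip(predict(word, readout), target))
    return total / (len(targets) * EMBED)


untrained = [[0.0] * EMBED for _ in range(HIDDEN)]
readout = train_readout(TOY_EMBEDDINGS)
print(mse(TOY_EMBEDDINGS, untrained), "->", mse(TOY_EMBEDDINGS, readout))
print(predict("caaat", readout))  # an unseen spelling still gets a vector
```

The unseen "caaat" receives a vector without any vocabulary entry.  Whether that vector lands anywhere near "cat" depends entirely on training the recurrent part on enough data, which this sketch deliberately skips.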

I’m not sure which, if either, will come first.  Nonetheless, I’m hopeful that we’ll eventually come to pass around something not too dissimilar to the Inception model, but for natural language processing.