Archive

Rant

Or: “How I learned to stop worrying and generate planet names.”

A friend of mine is working on a procedural universe generator. Procedural stuff is something I happen to love, as you might ascertain from my older ramblings. In the interest of handing out some knowledge, I want to illustrate an application of different number system representations. After this is done, you’ll have an intuitive understanding of how to take any number (from 0 through 2^64-1) and get back a planet name like ‘Gregorio-X. Ignus system. Charion group. Darion supercluster.’

First, some background.

One number base to which you’re already accustomed is base-10. Base ten has the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 in it. If you write a number in base ten, like 123, you know that the position holding the 2 counts for 10x what the position holding the 3 does. That’s perhaps a long way of saying “123 is 100 + 20 + 3”. We can rewrite 123 as 1*10^2 + 2*10^1 + 3*10^0. Anything to the power of zero is 1. 10^0 is 1. The rightmost term, 3*10^0, reduces to 3*1 = 3. Great. Now something easier. 10^1 = 10, so 2*10^1 is 2*10 = 20. The process repeats. What you should be seeing from this pattern is that each digit position increases the weight by a factor of 10; each move to the left means the exponent is bigger by one. This will become important when we look at other bases.
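
To make the pattern concrete, here’s a tiny Python sketch (from_digits is just a name I made up for illustration) that interprets a list of digit values in any base by applying exactly that rule:

# Each step left multiplies the running value by the base before adding the next digit,
# which is the same as weighting each digit by base^position.
def from_digits(digits, base):
    value = 0
    for d in digits:  # most significant digit first
        value = value * base + d
    return value

print(from_digits([1, 2, 3], 10))  # 123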

Binary

Binary is base-2. The digits in binary are simple: 0 and 1. Let’s look at the number 101. That’s 5 in base-10. Similar to how we can write 123 as 1*10^2 + 2*10^1 + 3*10^0, we can rewrite 101 as 1*2^2 + 0*2^1 + 1*2^0. 1*4 + 0*2 + 1*1 = 4 + 0 + 1 = 5.

With me so far? Hang tight. We’re almost there.

Hexadecimal

What would happen if we wanted a base with MORE than ten digits? Well, hexadecimal happens to be base 16. We have digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F! Note that we need more than just our regular ten digits. To account for all the numbers in hex, we have to include some letters. ‘A’ in this case is ‘10’ and ‘B’ is ‘11’. And so on up to ‘F’ = ‘15’. This means ‘10’ in hexadecimal is 16 in decimal: 1*16^1 + 0*16^0. C2A is C*16^2 + 2*16^1 + A*16^0. Substitute C with 12 and A with 10 to get 12*16^2 + 2*16^1 + 10*16^0. 12*256 + 2*16 + 10*1 = 3072 + 32 + 10 = 3114.
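
If you want to check that arithmetic, Python’s built-in int() will parse a string in any base from 2 to 36:

# int(string, base) applies the positional rule for us.
print(int("C2A", 16))  # 3114
print(int("101", 2))   # 5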

Arbitrary Bases

Notice how in hexadecimal we had to include other digits A-F. We don’t _have_ to use 0-9 as our basis digits, either. Let’s use base-10, but for digits 0-9 we will substitute ‘fluffy’, ‘fuzzy’, ‘poof’, ‘bunny’, ‘kitten’, ‘puppy’, ‘snuggly’, ‘happy’, ‘cuddly’, ‘puff’. The number 0189 then is ‘fluffyfuzzycuddlypuff’. We can also go backwards, ‘snugglypuppypoof’ is 652.
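
Here’s that substitution as a quick Python sketch (the list and the to_cuddly helper are just for illustration; the names spell out the digits from above):

# Our ten cuddly "digits", indexed by value 0 through 9.
names = ['fluffy', 'fuzzy', 'poof', 'bunny', 'kitten',
         'puppy', 'snuggly', 'happy', 'cuddly', 'puff']

def to_cuddly(number_string):
    # Swap each ordinary digit for its cuddly counterpart.
    return ''.join(names[int(d)] for d in number_string)

print(to_cuddly("0189"))  # fluffyfuzzycuddlypuff
print(to_cuddly("652"))   # snugglypuppypoof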

As a natural extension to this, we can pick bigger bases. In my code, I picked base-65, which needs around eleven digits to represent all the numbers from 0 to 2^64. Here’s the Python code which converts from base-10 to base-65. My digits for the base are made up names.
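
Something in that spirit, as a minimal sketch (NAMES and to_base65 are hypothetical stand-ins here; the real digit list is just any 65 distinct made-up names):

# Hypothetical stand-in for the digit set: 65 distinct made-up names built from syllables.
firsts = ['Gre', 'Ig', 'Char', 'Dar', 'Ar', 'Bel', 'Cas', 'Del', 'Er']
lasts = ['gorio', 'nus', 'ion', 'dax', 'mir', 'thos', 'val', 'wyn']
NAMES = [f + l for f in firsts for l in lasts][:65]

def to_base65(n):
    # Peel off base-65 digits from least to most significant, then reverse.
    if n == 0:
        return [NAMES[0]]
    digits = []
    while n > 0:
        n, remainder = divmod(n, 65)
        digits.append(NAMES[remainder])
    return list(reversed(digits))

print(to_base65(2 ** 64 - 1))  # eleven name-digits for the largest 64-bit number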

Day 3: Mildly Interesting

I’m not sure how fast news travels around my social circles, so I figure it might be wise to leave a first-person report of today’s events. Yes, I hit something with my car on the way up. Yes, it was a mountain. Yes, I’m perfectly fine. I walked away from the accident without so much as a bruise. But I’m getting ahead of myself. Here’s everything in the right order for today:

I woke up to find the weather in Wyoming worse than the night before. While we’d been relegated to sub-45mph travel due to high winds carrying enough snow to block the lane markers, today brought almost total snow blindness. I got maybe 45 miles in the first two hours, following in a cautious but close improv motorcade. After we crested the last mountains outside mile marker 120, things cleared up. The snow disappeared gradually and the sun came out. The wind, too, was largely gone. I relaxed and returned to the posted speed limit (five or ten under, surprisingly; while I’m not one to obey the speed limit, I was doing just that entirely by coincidence).

Traffic spread out, the speedier people flew on ahead, and the heavier trucks fell behind. I crested one last mountain which, as it turns out, was responsible for the subsiding wind. As I reached the top my front wheels started to slide. I angled my wheels to my lane and released the gas pedal. My vehicle then performed a maneuver I will now refer to as “The Terrifying Lateral Shimmy.” It slid, still pointing forward, from the far right lane horizontally to the far left lane. As I was about to exit the far left lane I decided to press the brake and immediately regretted doing so. The car regained traction and headed in the direction I’d pointed it. Unfortunately, the direction I’d pointed it was much farther to the right than I anticipated. I corrected to the left and succeeded in swinging the back end more or less sideways, triggering a Fast and Furious: Tokyo Drift style maneuver, except Paul Walker was dead and there was no Vin Diesel.

I recall thinking, “Huh. I’ve never been in a car accident before,” just prior to hitting the center divide. The front passenger headlight contacted the center divide at just under 70mph and rotated the vehicle across the passenger side doors. It did a tiny bounce and a totally sweet 180 before the rear tail light contacted the embankment.

I was fine. My indicator lights suggested I check the engine. I did not, in part because I don’t know how and in part because I smelled something like gasoline, except more carcinogenic. (I’d come to understand that this is the smell airbags make when they deploy.) I got out, perhaps in violation of safety procedures, and did a quick check to make sure I didn’t leave any debris which might hurt other drivers. There appeared not to be any. I got back in my car and phoned the police.

All cell conversations were a bit tricky, as this was scarcely in 1X signal territory. A passerby stopped. “That was awesome!” he said. “I know!” I responded. He offered a trip back into town, but I decided to try and deal with the highway patrol. After a dropped call, I phoned again. “Where are you?” “I have no idea. Possibly between Wyoming and Utah, west of Rawlins on the 80W.” “What mile marker?” “Not sure.” “See if you can flag a vehicle and have them call it in.” “Okay. I’ll call you back.”

By the time I’d gotten out of the vehicle, someone had already stopped. “That was awesome!” “Yeah. It was pretty sweet.” “Are you hurt?” “Don’t think so. I’m kinda’ hungry.” “Can I help?” “Yes, please. Could you call 911 and tell them the mile marker?” “Sure.”

Before our conversation was wrapped up, another person stopped. “Are you okay?” “I’m okay, thanks. These people are helping me.”

I got back in my car and called my insurance company. “I think the police are sending a tow truck. I’ll call you back when they get here and do the report, but I figure I’d tell you. Seems like the sort of thing you’d want to know.” “Thanks for calling. Is anyone hurt?” “No. I’m a little hungry.” Then we went on.

After around half an hour, the officer arrived, tow truck in… tow. I shook their hands and introduced myself. The tow truck operator set about clearing the other bits of my vehicle from the center divide. He refused my help, but asked for my key. I gave it to him and he sat down in the car and started it. It didn’t explode. Astonishingly, the car even managed to pull a U-turn and end up on the right side of the road where he could clear it without issue.

I ran through the motions with the officer. He was tremendously friendly. Pretty patient, too. I asked if I could try his gun. He said no. I did get a picture inside the squad car though.

The grizzled, angry Rowsdower of a tow truck driver gave me a ride into town. The conversation was lively, considering I’d almost died and he was the reincarnation of Rowsdower. I need to make it a point to figure out how to relate to career tow truck drivers. Anyway, we ended up in a mountain town with a population of around five.

I sat at the local diner and dug around for a way to get a rental car. Nearest open place was around 15 miles away. Gave them a call to see if anything was available. The guy, Bob, said they had some open cars. Coincidentally, he was also across the street from me and could give me a ride. Holy shit.

I rented a minivan (it seemed an appropriate upgrade from my effeminate grocery car) and drove back to the town to a choir of horns. Fuck all if I’m doing above 50 after that shit.

I transferred the goods from the Yaris to the minivan and sustained my first injury from the accident — bumped my head on the rear view mirror while moving things.

Called the insurance company one more time and let them speak with Rowsdower for posterity, allowing him to reiterate how awesomely totaled the car was, then jumped in the minivan and drove off like a madman on Xanax at well under the speed limit. Nine hours later, and I’m here in a hotel in Someplace, Utah, 8 hours from San Francisco.

Good night.

Word2Vec, a method for converting a word to a vector representation, is getting a lot of hype these days. Rightfully so: it’s wonderfully effective and, while not a huge departure from old LSI/LSH methods, is packaged beautifully and simply. For those of you with no concept of what that means, imagine converting a word into a point in some higher-dimensional space. Like, instead of ‘hello’, we might have a point at x=10, y=5, z=30. Now add a few more dimensions and you’ve got roughly the idea. Why do we care? Well, we can do mathematical operations on vectors. In fact, an interesting property of the way in which word2vec maps things is that it supports analogies very well. We can say (dogs – dog) = (kings – ___) and word2vec will give us ‘king’. Another example might be (‘queen’ – ‘woman’) = (___ – ‘man’), and we’ll get ‘king’.
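
If you want to poke at this yourself, recent versions of gensim make the analogy query a one-liner (the vector filename below is just a placeholder; any pretrained word2vec-format vectors will do):

# Load pretrained word vectors (placeholder filename) and ask for the analogy term.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

# queen - woman + man should land near "king".
print(vectors.most_similar(positive=["queen", "man"], negative=["woman"], topn=1))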

My biggest issue with word2vec is this: we are still using more or less whole words with some minimal parsing. This is limiting because there exist words which are spelled wrong, or words which are entirely nonsensical but still need to be represented. If I ask, “What is the plural of ‘florb’?” you will probably respond, “florbs,” even if ‘florb’ is an entirely nonsensical word. Word2vec, because it has no way of vectorizing novel words once training is done, can’t fabricate or comprehend unfamiliar terms. This is what I’d like to correct.

How? Well, I know a bit about deep learning. It is my hammer, and all problems look like nails. The solution I propose is this: train a network on an input of four characters and try to compress it from 128^4 possible values down to n bits. Why 128^4? Well, if we have four characters, a reasonable upper bound for prefixes and suffixes, and we have 128 possible characters (26 lowercase + 26 uppercase + 10 numeric + punctuation + blah blah), then a single 4-character chunk has 128^4 possible values: that’s 2^28, or 28 bits of raw information. That’s a lot. If we have any hope of understanding full words, we’ll want to get a good amount of variety while keeping our representation compact. Towards that end, we need to decide on exactly what size n we will need.
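
For concreteness, here’s one plausible way to present such a chunk to a network, as four blocks of 128 one-hot values (chunk_to_input is a hypothetical helper and this is just an illustrative encoding, not a settled design):

# Flip on one slot per character position, indexed by the character's ASCII code.
# Assumes plain 7-bit input; four characters become a 512-wide input vector.
import numpy as np

def chunk_to_input(chunk):
    vec = np.zeros(4 * 128, dtype=np.float32)
    for position, char in enumerate(chunk[:4]):
        vec[position * 128 + (ord(char) % 128)] = 1.0
    return vec

x = chunk_to_input("flor")
print(x.shape, int(x.sum()))  # (512,) 4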

This is the purpose of today’s experiment: take one million sentences, break them into 4-character chunks, and see how many there are and what they look like. We could break these sentences up in a number of ways: at word boundaries, or straight across them in even chunks. Word boundaries might be ‘more correct’ in some sense, but that means things like “San Francisco” would be broken up and change our distribution. “P.F. Chang”, too, would be incorrectly split. So, in the interest of simplicity, we’ll just take chunks of four characters with a sliding window.


# Taking even sized samples.
fin = open("1M_sentences.txt", 'r')
chunks = dict()

# Join the 1M sentences into a single long string, separated by double spaces.
lines = fin.readlines()
big_sentence = "  ".join(lines)

# Slide a 4-character window across the string, counting each chunk we see.
for i in range(len(big_sentence) - 3):
    chunk = big_sentence[i:i+4]
    if chunk not in chunks:
        chunks[chunk] = 0
    chunks[chunk] += 1

There. Now chunks is a large dictionary indexed by blobs of four letters. Let’s look at some properties of this, starting with a histogram of chunk counts.


# Count how many distinct chunks occur exactly `count` times.
chunk_sizes = dict()
for chunk, count in chunks.items():
    if count not in chunk_sizes:
        chunk_sizes[count] = 0
    chunk_sizes[count] += 1

1:		############
2:		###########
3:		##########
4:		##########
5:		##########
6:		#########
7:		#########
8:		#########
9:		#########
10:		#########
11:		#########
12:		########
13:		########
14:		########
15:		########
16:		########
17:		########
18:		########
19:		########
20:		########
...
1611720:		

There are 2^12 chunks of four letters which appear exactly once. The bars fall off linearly in our graph, but since each bar is log-scaled, the actual counts fall off geometrically. (So there are half as many chunks which appear twice, a quarter as many which appear three times, and so on.) This parallels Zipf’s Law: a couple hundred words make up the overwhelming majority of our sentences, while a lot of the information comes from the small fraction of words which are nearly unique to a document. There is a practically endless supply of words which appear only once or twice, however. There’s plenty of room to compress this.

There are 1611720 unique chunks of four characters in our sample. The most frequent chunk appears 339365 times. Well, by the magic of binary and exponents, if we have a representation with log2(1611720)+1 bits (rounded down, that’s 20+1), we can represent all the chunks uniquely! It even has a bit of a margin to allow us to compose/cover unseen cases. How many bits do we need? 21. 21 bits to cover all those cases.
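
A quick sanity check on that count:

import math

print(math.ceil(math.log(1611720, 2)))  # 21 bits gives every observed chunk a unique code
print(2 ** 21)                          # 2097152, comfortably more than 1611720 chunks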

Let’s try and train a network to go from 128^4 possible chunks down to 21 bits.