Monthly Archives: December 2019

Shuffling a dataset is a standard operation in data science. Without it, we risk training a model on a moving target, because the data may shift gradually from the start of the file to the end. Most interesting datasets don’t fit in memory, but we’d still like to be able to access them at random. Sure, you can generate a random number and seek to that position on disk, but nothing guarantees that the position is the start of a valid line. Jumping into the middle of a JSON blob and reading to the end of the line is bound to yield a bad time. Here is a brief solution that records where each line starts, so that any line can be fetched by index:

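class LineSeekableFile:
    def __init__(self, seekable):
        self.fin = seekable
        self.line_map = list() # Map from line index -> file position where that line starts.
        position = seekable.tell()
        # Record a position only once a line has actually been read from it,
        # so the end-of-file offset never ends up in the map.
        while seekable.readline():
            self.line_map.append(position)
            position = seekable.tell()
        self.line_count = len(self.line_map)

    def __getitem__(self, index):
        # NOTE: This assumes that you're not reading the file sequentially.  For that, just use 'for line in file'.
        # Negative indices behave as they do for lists; out-of-range indices raise IndexError.
        self.fin.seek(self.line_map[index])
        return self.fin.readline()

    def __len__(self):
        return self.line_count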
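
To show how a line index like this supports shuffling, here is a minimal, self-contained sketch. It is not part of the original class: the helper names `build_line_offsets` and `read_lines_shuffled` are made up for illustration, and it assumes the file is opened in binary mode so that the offsets are plain byte positions. It builds the offset table for a small throwaway file and then reads every line exactly once in a random order.

```python
import os
import random
import tempfile


def build_line_offsets(fin):
    """Return the byte offset at which each line of a binary file starts."""
    offsets = []
    fin.seek(0)
    position = fin.tell()
    while fin.readline():
        # Only record a position once a line was actually read from it.
        offsets.append(position)
        position = fin.tell()
    return offsets


def read_lines_shuffled(fin, offsets, seed=None):
    """Yield every line exactly once, in a random order."""
    order = list(range(len(offsets)))
    random.Random(seed).shuffle(order)  # shuffle the small index, not the data
    for index in order:
        fin.seek(offsets[index])
        yield fin.readline()


# Demo on a throwaway file with one JSON record per line.
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
    for i in range(5):
        tmp.write(b'{"id": %d}\n' % i)
    path = tmp.name

with open(path, "rb") as fin:
    offsets = build_line_offsets(fin)
    shuffled = list(read_lines_shuffled(fin, offsets, seed=0))

os.remove(path)
```

Only the list of offsets lives in memory (one integer per line), so this should scale to files far larger than RAM, at the cost of one seek per line read.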

The LineSeekableFile class is available here as a Python Gist: