You will not find me singing praises for Spark. It is not my favorite system. Spark feels like taking a good idea and building a religion upon it. Here is my experience building and running a Spark application.

Chapter 0: Denial

Our story begins with a system that should have been functional. We have around a terabyte of text data to mill through, with about 5k documents added every week, each just shy of a megabyte of plaintext. My inclination is to keep the setup we have used before: Gensim with some distributed workers. The process takes a few days to work through everything, and in the end we get a model that's reasonably effective. That multi-day turnaround is becoming a problem, and storing the text corpus after we've processed it is becoming a headache. Mostly, this difficulty is my own fault: I don't yet have the comfort, or the confidence in my position, to ask for additional storage. Instead, I have the clever solution: why not read the text incrementally from the slave database?

Chapter 1: Anger

Reading from the database is taking weeks, and the connection gives out often enough that the whole approach is unworkable. It might be necessary for us to use Spark after all. Of course, we'll still be reading from the database, but at least we've got some fault tolerance. "Why is the database so slow?" becomes a daily mantra.

Chapter 2: Bargaining and Special Pleading

It's time to bring in some outside counsel. Latent Dirichlet Allocation is a not uncommon topic-modelling approach. The problems pile up quickly:

- S3A filesystem errors
- fat-JAR building problems
- Scala SBT JAR merge conflicts (see the sbt-assembly sketch below)
- AWSServiceException
- IntelliJ crashing under the Scala project
- cluster setup, getting it running, and workers running out of memory

The worker startup reads its configuration through a chain of scripts: /usr/local/spark/sbin/start-slave.sh -> /usr/local/spark/sbin/spark-config.sh -> /usr/local/spark/bin/load-spark-env.sh -> /usr/local/spark/conf/spark-env.sh. Only problem: spark-env.sh doesn't work. I know this; I've tried it. The worker box shows as having more memory, but the individual workers are still stuck at 1G. What I need to do is update default-env.sh.
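For reference, a minimal sketch of where the two memory knobs usually live on a stock standalone deployment; the values, class, and JAR names below are illustrative, and in my experience the per-application executor memory (the one that defaults to 1G) is the setting that actually needs raising, not the worker's total.

```
# conf/spark-env.sh -- total memory the worker daemon may hand out to executors (illustrative value)
SPARK_WORKER_MEMORY=16g

# conf/spark-defaults.conf -- per-application executor memory; defaults to 1g if unset
spark.executor.memory   8g

# equivalent one-off override at submit time (class and JAR names are placeholders)
# spark-submit --executor-memory 8g --class com.example.Main my-app-assembly.jar
```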
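And for the JAR merge conflicts in the list above, a common sbt-assembly arrangement looks roughly like this; the plugin version and the discard-META-INF strategy are assumptions for illustration, not a record of what this project actually used.

```scala
// project/plugins.sbt -- assumes sbt-assembly is the fat-JAR plugin in play (version illustrative)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt -- a typical merge strategy for the usual META-INF collisions:
// drop duplicated META-INF entries, take the first copy of anything else
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
```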
Spark 2.x Setup