
Run Apache Spark with Docker

posted May 5, 2015, 7:32 PM by Teng-Yok Lee   [ updated May 9, 2015, 8:34 AM ]
I want to learn Spark, but I don't have a cluster, so I use Docker to simulate one for practice. This memo mainly re-organizes several online tutorials (see References).

Prepare a virtual machine

This step is needed because I am using Windows. At first I used the virtual machine that comes with boot2docker, but its root (/) is stored in RAM, so I lost all configuration changes (e.g. BASH settings) after rebooting the guest OS. I therefore decided to install Ubuntu on VirtualBox instead. The tutorial linked below has clear illustrations:

http://www.wikihow.com/Install-Ubuntu-on-VirtualBox

NOTE

  1. Create a disk with at least 20GB. At first I prepared only 8GB, which quickly ran out of space.
  2. Also, don't use a fixed-size disk, because it cannot be resized later if needed (REF).
  3. Assign enough CPUs. I used 4 cores.
  4. Once the guest Ubuntu is installed, install openssh-server (REF) so that you can log in to the Ubuntu guest via PuTTY (see the sketch after this list).
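
For item 4, a minimal sketch on a stock Ubuntu guest (openssh-server is the standard Ubuntu package; note that PuTTY also needs a network path to the guest, e.g. a bridged adapter or NAT port forwarding):

$ sudo apt-get update
$ sudo apt-get install -y openssh-server
$ sudo service ssh status    # the ssh service should be running afterwards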

Install Docker


Docker's official site has clear instructions. Once logged in to the Ubuntu guest, run the following two commands:

$ wget -qO- https://get.docker.com/ | sh

$ sudo docker run hello-world
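
Optionally, if you prefer to run docker without sudo, you can add your user to the docker group (a standard Docker post-install step; log out and back in for it to take effect). The commands in this memo keep the sudo prefix, so this step can be skipped.

$ sudo usermod -aG docker $USER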


Run Spark


This part follows the AMPLab docker-scripts tutorial (see References); nevertheless, the git repo URL is different now. The git command should be:

$ git clone -b blogpost https://github.com/amplab/docker-scripts.git

Then launch the Docker containers for Spark:
$ sudo ./docker-scripts/deploy/deploy.sh -i amplab/spark:0.8.0 -c
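
To confirm that the containers actually started, you can list the running containers on the host (docker ps is a standard Docker command; the exact image names depend on what the script pulled):

$ sudo docker ps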

NOTE
  1. These scripts do not work with newer versions of Spark. First, they only support versions up to 1.0.0. Second, the script for Spark 1.0.0 does not work either: the docker command keeps waiting for the master, which never succeeds because the master cannot launch Spark.
  2. This command launches a Scala shell. To run PySpark instead, type exit to terminate this shell (otherwise it will take all CPUs for its own workers).

Run PySpark

The following is based on the instructions by Aris (see References). The previous step should have printed information about the master. For instance:

***********************************************************************

start shell via:            sudo /home/leeten/projects/docker-scripts/deploy/start_shell.sh -i amplab/spark-shell:0.8.0 -n 5b37cadb558db3380eef69adfd9bcc533dd98a604f529447c81331533dfa951b

visit Spark WebUI at:       http://172.17.0.4:8080/

visit Hadoop Namenode at:   http://172.17.0.4:50070

ssh into master via:        ssh -i /home/leeten/projects/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.17.0.4

/data mapped:

kill master via:           sudo docker kill 7fff30fe8ef3b766504844e0f5eace95a10c66d1e72327fbdd5604b5b8536a16

***********************************************************************

Now you can log in to the master after changing the permission of the id_rsa file:

$ chmod 400 /home/leeten/projects/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa
$ ssh -i /home/leeten/projects/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.17.0.4

In the Docker container, launch pyspark:

$ /opt/spark-0.8.0/pyspark


Example: Estimate PI



The following Python code estimates Pi. It is based on the code segment in Spark Examples, and the complete source code can be found on GitHub. However, the current version fails at the statement that creates the SparkContext when run against Spark 0.8.0, so I revised it as follows. It relies on the sc variable that the pyspark shell already provides:

# REF: https://spark.apache.org/examples.html
# Complete (But not runnable code): https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py

from random import random

# Draw a random point in the unit square; return 1 if it falls inside the
# quarter circle of radius 1, and 0 otherwise. The argument p (the sample
# index) is unused.
def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

NUM_SAMPLES = 100
# sc is the SparkContext that the pyspark shell predefines.
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)


Troubleshooting

What if docker keeps waiting for the master?

You can log in to the master manually. As mentioned in the previous section, the following command prints the master's IP:

$ ./docker-scripts/deploy/deploy.sh -i amplab/spark:0.8.0 -c


Then you can directly log in to the master:

$ ssh -i /home/leeten/projects/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.17.0.4

Once logged in, check whether Spark is running, or manually launch Spark to see whether it works. In my case, the master failed to launch Spark, so Docker kept waiting.
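
A minimal sketch of what to check inside the master container, assuming the standard Spark 0.8.0 layout under /opt/spark-0.8.0 (the log directory and the location of the standalone scripts are assumptions; adjust them if the image differs):

$ ps aux | grep -i spark                  # is a Spark master process running?
$ ls /opt/spark-0.8.0/logs/               # assumed log directory; check the master log for errors
$ /opt/spark-0.8.0/bin/start-master.sh    # assumed script location; try launching the master by hand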

Error: WARN ClusterScheduler: Initial job has not accepted any resources

If spark or pyspark shows the following message, it means that no worker is available (REF):


WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

You can check the Spark status on the WebUI, as shown above. In this example, it is http://172.17.0.4:8080/.
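
If you are working from a terminal without a browser (e.g. over PuTTY), a rough way to check the same page is to fetch it with curl and look for the worker list (this assumes the standalone master WebUI lists its registered workers on that page, which it normally does):

$ curl -s http://172.17.0.4:8080/ | grep -i worker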

References

http://www.wikihow.com/Install-Ubuntu-on-VirtualBox
https://docs.docker.com/installation/ubuntulinux/
https://amplab.cs.berkeley.edu/author/schumach/
http://www.rankfocus.com/run-berkeley-sparks-pyspark-using-docker-couple-minutes/
