I want to learn Spark, but I don't have a cluster, so I use Docker to simulate one for practice. This memo mainly re-organizes multiple tutorials online.

Prepare a virtual machine

This step is needed because I am using Windows. At the beginning I used the virtual machine that came with boot2docker, but its root (/) was stored in RAM, so I lost all configuration changes (e.g. to BASH) after rebooting the guest OS. I therefore decided to install Ubuntu on VirtualBox instead. The tutorial at the link below has a clear illustration: http://www.wikihow.com/Install-Ubuntu-on-VirtualBox
Install Docker

Docker's official site has clear instructions. Once logged in to the Ubuntu VM, run the following 2 commands:
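The authoritative commands are the ones in Docker's Ubuntu installation guide (see References); as a rough sketch, the two-command convenience-script route looks like this, though the official page may show something slightly different:

# Download Docker's convenience install script and run it (sketch only;
# follow the linked Docker docs for the exact commands)
$ curl -fsSL https://get.docker.com -o get-docker.sh
$ sudo sh get-docker.sh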
Deploy the Spark cluster

I mainly follow the instructions by Andre Schumacher: https://amplab.cs.berkeley.edu/got-a-minute-spin-up-a-spark-cluster-on-your-laptop-with-docker/ Nevertheless, the git repo URL has changed, and the git command should now be:

$ git clone -b blogpost https://github.com/amplab/docker-scripts.git

Then launch the Docker containers for Spark:

$ sudo ./docker-scripts/deploy/deploy.sh -i amplab/spark:0.8.0 -c
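To sanity-check the deployment, plain Docker commands are enough (this is not something the deploy scripts require, just a quick verification):

# List running containers; the master and worker containers should all appear
$ sudo docker ps

# If something looks wrong, inspect a container's log
# (replace <container-id> with an ID taken from `docker ps`)
$ sudo docker logs <container-id>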
Run PySpark

The following is based on the instructions by Aris. The previous step should have printed the master information in its output. For instance:
***********************************************************************

Now you can log in to the master after changing the permission of the id_rsa file:

$ chmod 400 /home/leeten/projects/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa
$ ssh -i /home/leeten/projects/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.17.0.4

On the docker container, launch pyspark:

$ /opt/spark-0.8.0/pyspark

Example: Estimate PI

The following Python code estimates PI. It is based on the code segment in Spark Examples, and the full source code can be found on git. However, that version fails at the statement that creates the SparkContext with Spark 0.8.0, so I revise it as follows and rely on the sc object that the pyspark shell already provides:

# REF: https://spark.apache.org/examples.html
# Complete (but not runnable) code: https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
import sys
from random import random
from operator import add

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

NUM_SAMPLES = 100
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
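The snippet above depends on the sc variable that the interactive pyspark shell creates. As a sketch (the master URL below is a placeholder, and the paths are the ones used in this memo), the same estimate can also run as a standalone script that builds its own SparkContext:

# Write the estimator to a file; a script launched through the pyspark wrapper
# does not get the interactive shell's sc, so it must create its own SparkContext
$ cat > /tmp/pi_estimate.py <<'EOF'
from random import random
from pyspark import SparkContext

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

# Placeholder master URL: substitute the master IP printed by deploy.sh
sc = SparkContext("spark://<master-ip>:7077", "PythonPi")
NUM_SAMPLES = 100
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
EOF

# Launch it with the same pyspark script so that PYTHONPATH is set up correctly
$ /opt/spark-0.8.0/pyspark /tmp/pi_estimate.py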
What if docker keeps waiting for the master?

You can log in to the master manually. As mentioned in the previous section, the following command prints the master's IP:

$ ./docker-scripts/deploy/deploy.sh -i amplab/spark:0.8.0 -c

Then you can directly log in to the host:

$ ssh -i /home/leeten/projects/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.17.0.4

Once logged in, check whether Spark is running, or manually launch Spark to see whether it works. In my case, the master failed to launch Spark, so docker kept waiting.
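In practice, "check whether Spark is running" amounts to something like the following (a sketch; the process check is generic, and the start script path is an assumption based on the stock Spark 0.8.0 layout under /opt/spark-0.8.0):

# On the master container: is a Spark master JVM actually up?
$ ps aux | grep -i spark

# If not, try starting the standalone master by hand and watch its output
$ /opt/spark-0.8.0/bin/start-master.sh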
Error: WARN ClusterScheduler: Initial job has not accepted any resources

If spark or pyspark shows the following message, it means that no worker is available (REF):

WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

You can check the Spark status on the web UI. In this example, it is http://172.17.0.4:8080/.
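The same check can be done from a shell on the VM (a sketch; 172.17.0.4 is just the master address used in this walkthrough):

# Fetch the master's web UI and look at the workers/memory/cores summary;
# zero registered workers or 0 MB of memory explains the warning above
$ curl -s http://172.17.0.4:8080/ | grep -i -E 'workers|memory|cores'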
References

http://www.wikihow.com/Install-Ubuntu-on-VirtualBox
https://docs.docker.com/installation/ubuntulinux/
https://amplab.cs.berkeley.edu/author/schumach/
http://www.rankfocus.com/run-berkeley-sparks-pyspark-using-docker-couple-minutes/