Spark - Regenstrief Developer University
This repo is meant to work with a VirtualBox instance installed with Spark for Developer University at Regenstrief. It gives you some basic tools to run and test Spark.
If you want to set up on your own, or don't have internal access, here's what I did. Some of these steps are derived from Zecevic, P., & Bonaci, M. (2016). *Spark in Action*. Manning Publications, so buy their book!
- Set up a new VirtualBox image with Ubuntu. This link might be helpful if you're new to VirtualBox.
- Give yourself some decent memory, if you can spare it. I gave mine about 8GB.
- Create a user named 'spark' (your home directory (~) will be /home/spark) and set a password
- Install Java; I used openjdk-8
- From a terminal,

```
sudo apt-get install openjdk-8-jdk
```

- Install the latest Spark
- Pick the latest Spark release
- Pick the latest pre-built Hadoop version
- Pick mirror and download
- Untar the Spark download

```
tar xvf spark-[version-etc-etc].tgz
```

- Make a Spark home

```
mkdir -p ~/bin/spark-home
```

- Move Spark to its home

```
mv spark-[version-etc-etc] ~/bin/spark-home
```

- Create a symbolic link for Spark (this is if you want to run multiple versions of Spark; then you can just update the link to point at whatever version)

```
cd ~/bin
ln -s spark-home/spark-[version-etc-etc] spark
```

- Install git and whatever other stuff you might want (e.g. your favorite text editor)

```
sudo apt-get install git
sudo apt-get install vim
```

- Create a Spark Shell desktop icon; this should open straight up to the Spark Shell
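The symbolic-link trick above means switching Spark versions is a one-line change. A minimal sketch of the idea, using throwaway directories under /tmp and made-up version numbers rather than a real Spark install:

```shell
# Simulate two installed Spark versions under a scratch spark-home
mkdir -p /tmp/spark-demo/spark-home/spark-2.1.0
mkdir -p /tmp/spark-demo/spark-home/spark-2.2.0
cd /tmp/spark-demo

# Point the 'spark' link at one version (-s makes a symbolic link,
# -f replaces an existing target, -n treats an existing link as a plain file)
ln -sfn spark-home/spark-2.1.0 spark
readlink spark   # prints spark-home/spark-2.1.0

# Later, switch versions by re-pointing the link; nothing else changes
ln -sfn spark-home/spark-2.2.0 spark
readlink spark   # prints spark-home/spark-2.2.0
```

Anything that refers to `~/bin/spark` (like the desktop icons below) keeps working across upgrades, because only the link target moves.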
- Open up your favorite text editor in the ~/Desktop directory
- Create a new file called spark-shell.desktop
- Put the following in the file (the icon is optional)
[Desktop Entry]
Name=Spark Shell
Exec=/home/spark/bin/spark/bin/spark-shell
Terminal=true
Type=Application
Icon=/usr/share/app-install/icons/spark_logo.png
- You might also have to give it permissions
  1. Right click the icon, Properties > Permissions
  2. Check 'Allow executing file as program'
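If you'd rather do this from the terminal, the same permission change is a single `chmod`. A quick sketch, written against a scratch copy in /tmp so it's safe to run anywhere (for the real icon, substitute ~/Desktop/spark-shell.desktop):

```shell
# Stand-in path; the real launcher lives at ~/Desktop/spark-shell.desktop
DESKTOP_FILE=/tmp/spark-shell.desktop
printf '[Desktop Entry]\nName=Spark Shell\n' > "$DESKTOP_FILE"

# Mark the file executable, which is what the Properties checkbox does
chmod +x "$DESKTOP_FILE"
ls -l "$DESKTOP_FILE"   # the permission string now includes x bits
```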
- Update log4j so it doesn't spit out a ton of messages

```
cd ~/bin/spark
vim conf/log4j.properties
```

(or use whatever editor you like)
- Put the following in the file
log4j.rootCategory=INFO, console, file
# console config (restrict only to ERROR and FATAL)
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.threshold=ERROR
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# file config
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=logs/info.log
log4j.appender.file.MaxFileSize=5MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
- Check out this repo, under your home directory, e.g. '/home/spark/spark-dev-u'
At this point, you should be ready to start executing some Spark commands. See the exercises for some examples.
- Download the latest Python tarball (I stuck with 2.7.x)
- Untar it,

```
tar xvf Python-2.7.[version-etc-etc].tgz
```

- Install what you need to compile and configure Python

```
sudo apt-get install build-essential checkinstall
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev
```

- Go to the directory where you untarred Python,

```
cd Python-2.7.[version-etc-etc]/
```

- Execute configure

```
./configure
```

- Make and check install

```
sudo make install
sudo checkinstall
```

- Execute

```
python
```

which should open a Python shell
- Create a PySpark desktop icon; this should open straight up to PySpark
1. Open up your favorite text editor in the ~/Desktop directory
2. Create a new file called pyspark.desktop
3. Put the following in the file (the icon is optional)
[Desktop Entry]
Name=PySpark
Exec=/home/spark/bin/spark/bin/pyspark
Terminal=true
Type=Application
Icon=/usr/share/app-install/icons/spark_logo.png
- Make sure the icon is runnable
- Install R

```
sudo apt-get install r-base
```

- Create an icon
1. Open up your favorite text editor in the ~/Desktop directory
2. Create a new file called spark-r.desktop
3. Put the following in the file (the icon is optional)
[Desktop Entry]
Name=SparkR
Exec=/home/spark/bin/spark/bin/sparkR
Terminal=true
Type=Application
Icon=/usr/share/app-install/icons/spark_logo.png
- Make sure the icon is runnable
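The three launcher files above differ only in their Name and Exec lines, so you could generate them with a small helper instead of editing each by hand. A sketch, assuming the ~/bin/spark layout from earlier; the `make_launcher` function name and the /tmp output directory are made up for illustration (point the output at ~/Desktop to use it for real):

```shell
# make_launcher NAME EXEC_PATH OUTFILE - write a minimal executable .desktop launcher
make_launcher() {
  cat > "$3" <<EOF
[Desktop Entry]
Name=$1
Exec=$2
Terminal=true
Type=Application
Icon=/usr/share/app-install/icons/spark_logo.png
EOF
  chmod +x "$3"   # same as checking 'Allow executing file as program'
}

# Write the three launchers to a scratch directory instead of ~/Desktop
OUT=/tmp/launchers
mkdir -p "$OUT"
make_launcher "Spark Shell" /home/spark/bin/spark/bin/spark-shell "$OUT/spark-shell.desktop"
make_launcher "PySpark"     /home/spark/bin/spark/bin/pyspark     "$OUT/pyspark.desktop"
make_launcher "SparkR"      /home/spark/bin/spark/bin/sparkR      "$OUT/spark-r.desktop"

grep '^Exec=' "$OUT"/*.desktop   # show which binary each launcher runs
```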