This repository demonstrates how to set up an Apache Spark cluster using Docker. The main objective is to build a layered set of Docker images on top of a parameterized base image; `build.sh` is the main bash script and the best starting point for understanding how the pieces fit together. The solution lives in the `docker` folder, with the image definitions under `docker/build`:
```
.
├── README.md
└── docker
    ├── build.sh
    ├── .docker-compose.yml
    ├── .env.spark
    ├── book_data
    │   └── pride-and-prejudice.txt
    ├── spark_apps
    │   └── word_non_null.py
    ├── notebooks
    │   ├── sample2.txt
    │   └── Untitle.ipynb
    └── build
        ├── juypter-lab
        │   └── Dockerfile
        ├── os-base
        │   ├── Dockerfile
        │   └── entrypoint.sh
        ├── spark
        │   ├── conf
        │   │   └── spark-default.conf
        │   ├── requirements
        │   │   └── requirements.txt
        │   ├── Dockerfile
        │   └── entrypoint.sh
        └── spark-base
            └── Dockerfile
```
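As an illustration of the kind of job that lives in `spark_apps/`, here is a minimal sketch of a non-null word count over the bundled book data. The input path and column handling are assumptions for illustration, not the repository's exact code:

```python
# spark_apps/word_non_null.py -- illustrative sketch; the input path is assumed
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lower, split

spark = SparkSession.builder.appName("WordNonNull").getOrCreate()

# Read the sample book from book_data/ (container mount path assumed)
lines = spark.read.text("/opt/spark/data/pride-and-prejudice.txt")

# Split each line into words, drop null/empty tokens, and count occurrences
words = (
    lines.select(explode(split(lower(col("value")), r"\W+")).alias("word"))
         .filter(col("word").isNotNull() & (col("word") != ""))
)
words.groupBy("word").count().orderBy(col("count").desc()).show(20)

spark.stop()
```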
The main pieces are:
- Build script: `build.sh` automates building all of the Docker images.
- Jupyter Lab: to get hands-on with Apache Spark, PySpark is set up alongside JupyterLab.
- Core Apache Spark images: the build produces layered images such as `os-base`, `spark-base`, and `spark`; a sketch of the layering follows this list.
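To illustrate the layering, here is a minimal sketch of how `spark-base` might build on top of `os-base`. The image tags, Spark version, and paths are assumptions, not the repository's actual Dockerfile:

```dockerfile
# build/spark-base/Dockerfile -- illustrative sketch; tags, version, and paths are assumed
FROM os-base:latest

# Install a Spark distribution on top of the shared OS layer
ARG SPARK_VERSION=3.5.1
RUN curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz" \
      | tar -xz -C /opt \
 && ln -s "/opt/spark-${SPARK_VERSION}-bin-hadoop3" /opt/spark

ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${PATH}"
```

The `spark` and `juypter-lab` images would then extend this layer, which is what makes the parameterized base image useful: OS-level changes happen once and propagate to every downstream image.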
Prerequisites:
- Docker
- Docker Compose
- Windows WSL (when using Windows)
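You can verify the prerequisites are available before building:

```bash
docker --version
docker-compose --version
```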
Navigate to the `docker` directory and run the build script:

```bash
cd docker
./build.sh
```

After the script finishes, you will have the different Docker images available locally (`docker image ls` will list them).
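For orientation, here is a minimal sketch of what a parameterized `build.sh` might do. The image names, tags, and build order are assumptions inferred from the folder layout, not the script's exact contents:

```bash
#!/bin/bash
# Illustrative sketch of build.sh -- image names and tags are assumed
set -e

# Build the shared OS layer first, then the images that depend on it
docker build -t os-base:latest     ./build/os-base
docker build -t spark-base:latest  ./build/spark-base
docker build -t spark:latest       ./build/spark
docker build -t jupyter-lab:latest ./build/juypter-lab
```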
Now use Docker Compose to spin up the containers. In this example we start a single Spark worker; you can run multiple workers simply by changing the scale value:

```bash
docker-compose up --scale spark-worker=1
```
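Scaling works because the worker is an ordinary compose service that Docker Compose can replicate. For reference, here is a minimal hedged sketch of what the worker service in the compose file might look like; the service names, image tags, ports, and environment variables are assumptions:

```yaml
# Illustrative fragment of the compose file -- names, images, and ports are assumed
services:
  spark-master:
    image: spark:latest
    ports:
      - "8080:8080"   # Spark master web UI
  spark-worker:
    image: spark:latest
    environment:
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
```

For example, `docker-compose up --scale spark-worker=3` would start three worker containers, all registering against the same master.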
This project demonstrates a scalable and efficient way to build and manage an Apache Spark cluster with Docker.

