Welcome to the BigData Analytics repository! This repository provides hands-on experience and practical instructions for working with various Big Data tools and technologies. The setup creates Docker containers for the following frameworks: HDFS, HBase, Hive, Spark, Jupyter, Hue, MongoDB, Kafka, MySQL, and Zookeeper.
To create and use the environment, we will use Git and Docker:
- Install Docker Desktop on Windows, or Docker on Mac
- Install Git
- On Windows, make sure WSL is installed properly and that Docker is running.
Note: This step should be performed only once. After creating the environment, use docker-compose to start the containers as shown in the topic STARTING THE ENVIRONMENT.
Note: Create a directory called docker.

Suggestion for Windows:
- Create the docker directory at the root of your drive (PowerShell example: C:\docker).

Suggestion for Mac:
- Create the directory in your home folder (Terminal example: ~/docker).
```
git clone https://github.com/usmanakhtar/BigDataCourse.git
```
After cloning the project:
```
cd BigDataCourse
```
To start all the services using Docker:
```
docker-compose up -d
```
To stop all services:
```
docker-compose down
```
Further changes in the docker-compose.yml file
After installation you need to change the image from fjardim/namenode_sqoop to usmanakhtar17/bigdatacourse:namenode. Secondly, you need to create a /Workspace folder and update the volume mapping to point to your folder location:
```yaml
  namenode:
    #image: fjardim/namenode_sqoop
    #build: .
    image: usmanakhtar17/bigdatacourse:namenode
    container_name: namenode
    hostname: namenode
    volumes:
      - ./data/hdfs/namenode:/hadoop/dfs/name
      - /c/Users/uakhtar/Docker-images/BigDataCourse/bigdata_docker/Workspace:/Workspace
```
First step is to go to the namenode container from the terminal:
```
docker exec -it namenode bash
```
Make sure to take the NameNode out of safe mode:
```
hadoop dfsadmin -safemode leave
```
First, create a folder inside HDFS:
```
hadoop dfs -mkdir /Data
```
Now put the file into HDFS:
```
hadoop dfs -put /Workspace/Example.txt /Data
```
To check that the file exists inside HDFS:
```
hadoop dfs -ls /Data
```
Now we will run the first MapReduce job. Change directory from root@namenode:
```
cd /opt/hadoop-2.7.4/share/hadoop/mapreduce
```
Now run the WordCount MapReduce job:
```
hadoop jar hadoop-mapreduce-examples-2.7.4.jar wordcount /Data/Example.txt /Output1
```
In this tutorial, we will describe how to write a simple MapReduce program for Hadoop in the Python programming language. The code has been edited to match our Hadoop environment. Here are the updated files; place them in the Workspace folder.
The mapper (save as mapper.py):
```python
#!/usr/bin/env python3
# mapper.py: emit "<word><TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    line = line.strip()
    for word in line.split():
        print(f"{word}\t1")
```
The reducer (save as reducer.py):
```python
#!/usr/bin/env python3
# reducer.py: sum the counts for each word; assumes input is sorted by word,
# which Hadoop's shuffle phase (or a local `sort`) guarantees
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split('\t')
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the final word, if any input was seen at all
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```
Test the mapper and reducer from the namenode terminal; make sure you run the following commands from the /Workspace folder.
```
echo "foo foo quux labs foo bar quux" | ./mapper.py
```
Sort the mapper output and run it through the reducer:
```
echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py
```
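As an aside, the mapper → sort → reducer pipeline tested above can be simulated in pure Python, which makes it clear why the reducer needs sorted input. This is an illustrative sketch only; the helper names (`map_words`, `reduce_counts`, `word_count`) are not part of the course files.

```python
def map_words(lines):
    """Emit (word, 1) pairs, like mapper.py."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_counts(pairs):
    """Sum counts for consecutive identical keys, like reducer.py.

    Assumes pairs are sorted by key, which is what the `sort` step
    (and Hadoop's shuffle phase) guarantees.
    """
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield (current_word, current_count)

def word_count(lines):
    # sorted() plays the role of the shell `sort` between the two scripts
    return dict(reduce_counts(sorted(map_words(lines))))

print(word_count(["foo foo quux labs foo bar quux"]))
# {'bar': 1, 'foo': 3, 'labs': 1, 'quux': 2}
```

If you skip the sort, equal words are not adjacent and the reducer emits duplicate partial counts — the same failure you would see running ./reducer.py on unsorted mapper output.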
To run MapReduce jobs in Python, we use Hadoop Streaming, which runs the mapper and reducer scripts as the job's tasks. Run the following command:
```
hadoop jar /opt/hadoop-2.7.4/share/hadoop/tools/lib/hadoop-streaming-2.7.4.jar -files /Workspace/mapper.py,/Workspace/reducer.py -mapper mapper.py -reducer reducer.py -input /Data -output /Output3
```
To check the output of the MapReduce job:
```
hdfs dfs -ls /Output3
hdfs dfs -cat /Output3/part-00000
```
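The part-00000 file contains one tab-separated `word<TAB>count` line per word. If you copy it locally (e.g. with `hdfs dfs -get`), a few lines of Python can load it for further analysis. This is a sketch with a hypothetical helper name (`load_counts`), not part of the course files:

```python
def load_counts(text):
    """Parse WordCount output ("word<TAB>count" lines) into a dict."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # rsplit keeps any tabs inside the key intact; the count is last
        word, count = line.rsplit("\t", 1)
        counts[word] = int(count)
    return counts

sample = "bar\t1\nfoo\t3\nquux\t2"
print(load_counts(sample))
# {'bar': 1, 'foo': 3, 'quux': 2}
```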