
BIG DATA Analytics

Welcome to the BigData Analytics repository! This repository is designed to give you hands-on experience and practical instructions for working with various Big Data tools and technologies. The setup creates Docker containers running the following frameworks: HDFS, HBase, Hive, Spark, Jupyter, Hue, MongoDB, Kafka, MySQL, and Zookeeper.

REQUIRED SOFTWARE

To create and use the environment, we will use Git and Docker:

  • Install Docker Desktop (Windows) or Docker (Mac)
  • Install Git
  • On Windows, make sure WSL is installed properly and Docker is running (a quick sanity check follows this list).
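
You can verify the prerequisites from a terminal. A minimal sanity check (standard version flags; wsl --status assumes a recent Windows build):

docker --version
git --version
wsl --status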

SETUP

Note: This step should be performed only once. After creating the environment, use docker-compose to start the containers as shown in the Docker section below.

Creating the Docker Directory:

Note: Create a directory called docker (example commands follow this list).

  • Suggestion for Windows:

    • Create the docker directory at the root of your drive using PowerShell, e.g. C:\docker
  • Suggestion for Mac:

    • Create the directory in your home folder using Terminal, e.g. /Users/<user>/docker
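
For example (the paths are just the suggestions above; adjust them to your machine):

# Windows (PowerShell)
mkdir C:\docker

# Mac (Terminal)
mkdir -p ~/docker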

Open PowerShell (or Terminal on Mac), change into the docker directory, and clone the project from GitHub:

git clone https://github.com/usmanakhtar/BigDataCourse.git

Docker

After cloning the project, change into its directory:

cd BigDataCourse

To start all the services using Docker Compose:

docker-compose up -d 
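
To confirm the containers came up, list the running containers (standard Docker command; container names such as namenode come from this project's compose file):

docker ps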

To stop all services:

docker-compose down 

Further changes in the docker-compose.yml file: after installation, change the namenode image from fjardim/namenode_sqoop to usmanakhtar17/bigdatacourse:namenode. You also need to create a Workspace folder on your host and update the volume path below to point to your own folder location.

  namenode:
    #image: fjardim/namenode_sqoop
    #build: .
    image: usmanakhtar17/bigdatacourse:namenode
    container_name: namenode
    hostname: namenode
    volumes:
      - ./data/hdfs/namenode:/hadoop/dfs/name
      - /c/Users/uakhtar/Docker-images/BigDataCourse/bigdata_docker/Workspace:/Workspace 

Hadoop Setup & (MapReduce Streaming)

The first step is to open a shell inside the namenode container from your terminal:

docker exec -it namenode bash 

Make sure to take the namenode out of safe mode:

hadoop dfsadmin -safemode leave
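
You can also query the current safe mode state with the same tool:

hadoop dfsadmin -safemode get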

1. Hadoop Map/Reduce Jobs with Docker

First, create a folder inside HDFS:

hadoop dfs -mkdir /Data

Now put the example file into HDFS:

hadoop dfs -put /Workspace/Example.txt /Data

To check that the file exists inside HDFS:

hadoop dfs -ls  /Data

Now we will run the first MapReduce job. From the root@namenode prompt, change to the MapReduce examples directory:

cd /opt/hadoop-2.7.4/share/hadoop/mapreduce

Now run the wordcount MapReduce job (the file name must match the case used when uploading):

hadoop jar hadoop-mapreduce-examples-2.7.4.jar wordcount /Data/Example.txt /Output1
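
To inspect the result (part-r-00000 is the usual output file name produced by the Java examples):

hdfs dfs -ls /Output1
hdfs dfs -cat /Output1/part-r-00000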

2. Hadoop Map/Reduce Jobs with Python

In this tutorial, we describe how to write a simple MapReduce program for Hadoop in the Python programming language. I have edited the code to match our Hadoop environment. Here are the updated files; place them in the Workspace folder.

mapper.py

#!/usr/bin/env python3
import sys

# Read lines from standard input, split each into words,
# and emit one "word<TAB>1" pair per word.
for line in sys.stdin:
    line = line.strip()
    for word in line.split():
        print(f"{word}\t1")

reducer.py

#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Hadoop Streaming sorts the mapper output by key, so all counts for a
# word arrive as consecutive lines; sum them until the word changes.
for line in sys.stdin:
    word, count = line.strip().split('\t')
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the final word (guarded so that empty input does not crash).
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Test the mapper and reducer from the namenode terminal. Make sure you run the following commands from the /Workspace folder.
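
If the scripts are not yet executable, mark them as such first (standard chmod, nothing specific to this repo):

chmod +x /Workspace/mapper.py /Workspace/reducer.py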

 echo "foo foo quux labs foo bar quux" | ./mapper.py

Sort the mapper output and pipe it to the reducer:

echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py

To run the MapReduce job in Python, we use Hadoop Streaming. Run the following command:

hadoop jar /opt/hadoop-2.7.4/share/hadoop/tools/lib/hadoop-streaming-2.7.4.jar -files /Workspace/mapper.py,/Workspace/reducer.py -mapper mapper.py -reducer reducer.py -input /Data -output /Output3
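
Note that Hadoop refuses to write to an existing output directory, so if you rerun the job, remove the old output first:

hdfs dfs -rm -r /Output3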

To check the output of the MapReduce job:

hdfs dfs -ls /Output3
hdfs dfs -cat /Output3/part-00000
