SPARK

How I Set Up Spark 4 Locally with Anaconda and VS Code (and Fixed Common Errors)

With the release of Apache Spark 4, I decided to dive deeper into what’s new and how it could enhance distributed data processing. As someone already familiar with Spark, I was especially curious to try out its new features, but before exploring them I needed to get Spark 4 running on my local setup.

This post walks through how I set up Spark 4 using Anaconda, got it working in both Jupyter Notebook and VS Code, and how I resolved issues related to Java compatibility, environment configuration, and common stumbling blocks when switching Spark versions.

If you’re upgrading or just starting with Spark 4, this guide might save you hours of head-scratching.

Prerequisites

  • You should have Anaconda installed (Download from here) and, during installation, choose “Add to PATH”.
  • Basic knowledge of using the terminal or Anaconda Prompt.

Step 1: Create a New Conda Environment

It’s a good practice to isolate your Spark setup in a separate environment.

conda create -n spark4_env python=3.10 -y
conda activate spark4_env

Step 2: Install Java (for Anaconda Users)

Apache Spark requires Java. To avoid system-level conflicts, I installed OpenJDK inside the conda environment:

conda install -c conda-forge openjdk=17 -y

This step is crucial because Spark 4 is compiled for Java 17; an older JDK on the PATH will cause class-version errors.

Step 3: Set JAVA_HOME to Match Conda Environment

To avoid Java errors, I explicitly set JAVA_HOME inside my Python script:

import os
os.environ["JAVA_HOME"] = os.environ["CONDA_PREFIX"]

This ensures PySpark uses the correct Java installation from your conda environment.
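A slightly more defensive version of the same snippet (an optional variant, not part of the original setup) avoids a KeyError when the script is accidentally run outside a conda environment:

```python
import os

# Point JAVA_HOME at the conda env's JDK, but only when the script is
# actually running inside a conda environment ("conda activate" sets
# CONDA_PREFIX); otherwise leave any existing JAVA_HOME untouched.
conda_prefix = os.environ.get("CONDA_PREFIX")
if conda_prefix:
    os.environ["JAVA_HOME"] = conda_prefix

print(os.environ.get("JAVA_HOME", "<not set>"))
```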

Step 4: Test Spark in Jupyter Notebook

I wanted to run PySpark in a Jupyter notebook, so I installed the Jupyter kernel:

pip install ipykernel
python -m ipykernel install --user --name=spark4_env --display-name "Python (spark4_env)"

Then launched Jupyter:

jupyter notebook

And chose the Python (spark4_env) kernel to test:

import os
os.environ["JAVA_HOME"] = os.environ["CONDA_PREFIX"]

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("JupyterSpark").getOrCreate()
print(f'Spark Version = {spark.version}')

df = spark.range(10)
df.show()
Spark Version = 4.0.0
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

Make It Permanent (Optional)

To avoid setting it every time:

  1. Open Anaconda Prompt
  2. Run:
conda activate spark4_env
mkdir %CONDA_PREFIX%\etc\conda\activate.d
echo set JAVA_HOME=%CONDA_PREFIX% >> %CONDA_PREFIX%\etc\conda\activate.d\env_vars.bat

This drops a small batch script into the environment’s activate.d folder; conda runs everything in that folder on activation, so JAVA_HOME is set every time you activate the environment.

The Gotcha: VS Code Error — Wrong JAVA_HOME

Everything was smooth in Jupyter, but I hit a wall in VS Code:

java.lang.UnsupportedClassVersionError:
... compiled by a more recent version of the Java Runtime (class file version 61.0)

Turns out, VS Code was picking up the wrong environment: CONDA_PREFIX pointed to a different path than spark4_env, so Spark ended up running on an older Java (class file version 61.0 corresponds to Java 17).
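A quick way to see which environment a script is really running in (worth running in both Jupyter and the VS Code terminal) is to print the interpreter path and the relevant variables:

```python
import os
import sys

# Print the interpreter and environment this process actually sees.
# If sys.executable does not live under the spark4_env prefix,
# VS Code has selected the wrong interpreter.
print("Interpreter :", sys.executable)
print("CONDA_PREFIX:", os.environ.get("CONDA_PREFIX", "<not set>"))
print("JAVA_HOME   :", os.environ.get("JAVA_HOME", "<not set>"))
```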

Fixing VS Code Setup

Here’s what I did:

Selected the correct interpreter:

  • Ctrl + Shift + P → “Python: Select Interpreter” → chose spark4_env

Made sure the terminal was using the correct conda environment:

conda activate spark4_env

After this, my Spark script ran perfectly in both VS Code and Jupyter!

Final Thoughts

Setting up PySpark on a local machine using Anaconda is totally doable, but small mismatches between Java and Spark versions can cause big headaches.

Here are my key takeaways:

  • Always match your Spark version with a compatible Java version.
  • Use os.environ["JAVA_HOME"] = os.environ["CONDA_PREFIX"] to stay environment-aware.
  • In VS Code, double-check the interpreter and terminal environment.
  • Prefer installing Java via conda (openjdk) to avoid messing with your system Java.

Bonus: Spark Version Compatibility Table

  • Spark 3.2.x: Java 8 or 11
  • Spark 3.3.x – 3.5.x: Java 8, 11 or 17
  • Spark 4.0.x: Java 17 or 21

Let me know in the comments if you’ve hit similar issues or want help setting this up on your machine.

Troubleshooting

How to run a makefile in Windows?

Here is a quick way to run a Makefile (the ‘make’ command) on Windows:

  1. Download git and install (https://git-scm.com/downloads)
  2. Download the make-4.3-without-guile zip file from ezwinports: https://sourceforge.net/projects/ezwinports/files/make-4.3-without-guile-w32-bin.zip/download (parent site: https://sourceforge.net/projects/ezwinports/files/).
  3. Unzip the downloaded zip file (make-4.3-without-guile-w32-bin.zip)
  4. Copy the unzipped contents into the C:\Program Files\Git\mingw64 folder (DO NOT OVERWRITE OR REPLACE any existing files).
  5. Restart GIT Bash.
  6. The ‘make’ command will start working.

I hope this will be helpful.  If you want to share your thoughts, please feel free to comment below to let me know.

Happy Learning!!!

SPARK

Setup PySpark environment on Anaconda

  • Download Spark and extract the .tgz file
  • Set up the PySpark environment

Open Anaconda Command prompt and run below command to create environment:

conda create -n pyspark python=3.6.9

Activate pyspark environment by running below command:

conda activate pyspark

Install Jupyter module

pip install jupyter

Install ‘findspark’ module

pip install findspark

Open jupyter notebook

jupyter notebook

Start Programming in notebook

import os
os.environ['SPARK_HOME'] = r'C:\dev-tools\spark-3.0.0-bin-hadoop2.7'
import findspark
findspark.init()
findspark.find()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark").getOrCreate()

data = [('Teja', 7), ('Bhavishya',3)]
df = spark.createDataFrame(data, ['Name', 'Age'])
df.printSchema()
root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)

df.show()
+---------+---+
|     Name|Age|
+---------+---+
|     Teja|  7|
|Bhavishya|  3|
+---------+---+

GitHub: https://github.com/sateeshfrnd/SparkWithPython/blob/master/notebooks/01_PySpark-Setup-Anaconda.ipynb

Hope this blog has been helpful. If you want to share your thoughts/updates, email me at [email protected].

Enjoy Learning…

OOPS

Four Pillars of Object Oriented Programming

The four pillars for Object Oriented Programming are Abstraction, Encapsulation, Inheritance, Polymorphism.

  1. Abstraction :
    • Abstraction is the process of exposing only the required features of an object/entity to the outside world and hiding the irrelevant details.
  2. Encapsulation :
    • Encapsulation means wrapping up data and member functions/methods together into a single unit (i.e Class).
    • Encapsulation achieves data hiding by making variables private and exposing public properties/methods to access that private data.
  3. Inheritance :
    • Inheritance is the ability to create a new class from an existing class, i.e. a mechanism in which one object acquires the properties and behaviour of another object.
    • It helps to reuse, enhance and customize the existing code that reduces the development time.
  4. Polymorphism :
    • Polymorphism means ‘many forms’.
    • Polymorphism is the ability of an object to respond in more than one form (a single action performed in different ways).
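As a rough sketch, here is how the four pillars can look together in Python (the Shape/Circle/Square classes are invented for illustration):

```python
from abc import ABC, abstractmethod

class Shape(ABC):                      # Abstraction: only the relevant
    @abstractmethod                    # behaviour (area) is exposed
    def area(self) -> float: ...

class Circle(Shape):                   # Inheritance: Circle reuses Shape
    def __init__(self, radius: float):
        self._radius = radius          # Encapsulation: _radius is internal

    @property
    def radius(self) -> float:         # ...exposed via a read-only property
        return self._radius

    def area(self) -> float:
        return 3.14159 * self._radius ** 2

class Square(Shape):
    def __init__(self, side: float):
        self._side = side

    def area(self) -> float:
        return self._side ** 2

# Polymorphism: the same call (shape.area()) behaves differently per object
for shape in (Circle(1.0), Square(2.0)):
    print(type(shape).__name__, shape.area())
```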

Happy Learning !!!

SPARK

Spark Broadcast Variables

In this post, we will discuss on below topics :

  • What are Spark Broadcast variables?
  • When to use Broadcast variables?
  • Broadcast variable LifeCycle
  • How to use Broadcast variables

Broadcast variables

Spark provides a fairly simple mechanism, Broadcast variables, for distributing data across the nodes of a cluster efficiently. Broadcast variables play the same role as the Distributed Cache in the MapReduce paradigm. However, there are a couple of caveats worth knowing before using them.

Broadcast variables

  • Must fit in memory on each machine, since the value is cached on every executor node rather than shipped with each task.
  • Are immutable (read-only); the value cannot be changed after broadcast.
  • Are copied to each executor only once and reused by many tasks (rather than being copied every time a task is launched). This speeds up your Spark application when the broadcast value is large or when there are many more tasks than executors.

When to use Broadcast variables

Broadcast variables are typically used to implement a map-side join against ‘static lookup data’, i.e. when tasks across multiple stages need the same data.

Example: I have two tables, Employees with a huge number of records and Departments with just a few, with the schema below.
Employee : ID, NAME, DEP_ID
Departments : ID, NAME

To get employee details with the department name, you would normally perform a standard join. It works, but is inefficient: the Departments data is sent to the executors for every task, even though it could already be there. If more tasks need the Departments data, you can improve performance by minimizing network transfer using Broadcast variables.

Since our Departments data is quite small, the smarter approach is to ship the small dataset to each node in the cluster once and then perform local lookups against it to join.

With or without Broadcast variables the result is the same, but Broadcast variables win performance-wise when many executors are spawned to execute tasks that use the Departments data.
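Stripped of Spark entirely, the map-side join idea is just a local dictionary lookup per row. A minimal Python sketch (the real Spark version with sc.broadcast follows below; the "UNKNOWN" default is an arbitrary choice here):

```python
# Small "dimension" table: shipped to every worker once (a broadcast
# variable in Spark terms); here it is just a plain Python dict.
departments = {1: "Development", 2: "Testing", 3: "Reporting"}

# Large "fact" table: one tuple per employee (id, name, dep_id).
employees = [(1, "Satish", 1), (2, "Ramya", 2), (3, "Teja", 1)]

# Map-side join: each row is resolved with a local lookup, so no
# employee data ever needs to be shuffled across the network.
joined = [(emp_id, name, departments.get(dep_id, "UNKNOWN"))
          for emp_id, name, dep_id in employees]

print(joined)
# [(1, 'Satish', 'Development'), (2, 'Ramya', 'Testing'), (3, 'Teja', 'Development')]
```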

Broadcast variables LifeCycle

The Broadcast feature in Spark uses SparkContext to create broadcast values and BroadcastManager and ContextCleaner to manage their lifecycle.

  • BroadcastManager – a Spark service that manages broadcast variables. It is created for a Spark application when SparkContext is initialized, and it tracks the broadcast variables in the application.
  • ContextCleaner – a Spark service that runs on the driver. It is created and started when SparkContext starts and stopped when SparkContext stops. It performs application-wide cleanup of shuffles, RDDs, broadcasts, accumulators and checkpointed RDDs, aimed at reducing the memory requirements of long-running, data-heavy Spark applications.

Using Broadcast variables

Using Scala API

 
// Prepare data
val departments = Seq((1, "Development"), (2, "Testing"), (3, "Reporting"))
val employee = Seq((1, "Satish", 1),(2, "Ramya", 2), (3, "Teja", 1 ), (4, "Bavishya",3), (5, "Kumar", 2))

val departments_rdd = sc.parallelize(departments)
val employee_rdd = sc.parallelize(employee)

// broadcast department data
val broadcast_dept = sc.broadcast(departments_rdd.collectAsMap())

// Now let's go ahead and join the two datasets
val empWithDept = employee_rdd.mapPartitions(
  rows => rows.map(emp => (emp._1, emp._2, broadcast_dept.value.getOrElse(emp._3, -1))),
  preservesPartitioning = true
)
empWithDept.collect()

Using Python API

 
# Prepare data
departments = ((1, "Development"), (2, "Testing"), (3, "Reporting"))
employee = ((1, "Satish", 1),(2, "Ramya", 2), (3, "Teja", 1 ), (4, "Bavishya",3), (5, "Kumar", 2))

departments_rdd = sc.parallelize(departments)
employee_rdd = sc.parallelize(employee)

# broadcast department data
broadcast_dept = sc.broadcast(departments_rdd.collectAsMap())

# Now let's go ahead and join the two datasets
rowFunc = lambda emp: (emp[0], emp[1], broadcast_dept.value.get(emp[2], -1))
def mapFunc(partition):
    for row in partition:
        yield rowFunc(row)

empWithDept = employee_rdd.mapPartitions(mapFunc, preservesPartitioning=True)
empWithDept.collect()

 

You may have noticed the preservesPartitioning argument; it tells Spark that the transformation keeps the existing partitioning, which avoids an unnecessary shuffle downstream.

I hope this post has clarified Broadcast variables, when to use them, and how to use them in Spark. If you want to share your thoughts, please feel free to comment below to let me know.

Happy Learning !!!

HIVE

Hive – Using Lateral View UDTF’s

In this post, we will see why we need Lateral View UDTFs and how to use them.

As we know, Hive supports complex datatypes (like array, map and struct) to store a list of values for a row in a single column, and these can also be queried. This way you can reduce the number of rows in a table.

Now, if you want to access or iterate over the individual elements, we use Lateral View with the built-in Table-Generating Functions (UDTFs) available in Hive. A UDTF transforms a single input row into multiple output rows.
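Conceptually, LATERAL VIEW EXPLODE is a flatten-and-rejoin. A plain-Python sketch of what it does to an array column (the sample rows mirror the employee data used later):

```python
# One row per employee, with a list-valued "assets" column.
rows = [
    (1, "SATISH", ["Laptop", "VOIP", "Monitor"]),
    (2, "RAMYA", ["Computer", "Mobile"]),
]

# LATERAL VIEW EXPLODE(assets): emit one output row per list element,
# joined back to the scalar columns of the originating row.
exploded = [(emp_id, name, asset)
            for emp_id, name, assets in rows
            for asset in assets]

for row in exploded:
    print(row)
```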

Lets start exploring how to use lateral view explode() function with example.

Creating table EMPLOYEE with the following columns :

  • emp_id – INT
  • emp_name – STRING
  • assets – ARRAY<STRING>
  • expenses – MAP<STRING,INT>
 
CREATE TABLE bigdataplaybook.employee(
   emp_id SMALLINT,
   emp_name STRING,
   assets ARRAY<STRING>,
   expenses MAP<STRING,INT>
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

Now load the sample data from the file by executing the commands below:

cat /home/bigdataplaybook/employee.txt
1|SATISH|Laptop,VOIP,Monitor|cab:500,toll:100,meal:500
2|RAMYA|Computer,Mobile|cab:500,meal:700
3|Teja|Computer,Mobile|meal:600

LOAD DATA LOCAL INPATH '/home/bigdataplaybook/employee.txt' INTO TABLE bigdataplaybook.employee;

Now we have 3 entries in employee table as seen by below select query:

hive> select * from employee;
OK
1 SATISH ["Laptop","VOIP","Monitor"] {"cab":500,"toll":100,"meal":500}
2 RAMYA ["Computer","Mobile"] {"cab":500,"meal":700}
3 Teja ["Computer","Mobile"] {"meal":600}
Time taken: 0.122 seconds, Fetched: 3 row(s)

Below query shows, how to use explode on the column of type array:

SELECT 
 e.emp_id,
 e.emp_name,
 asset_list.asset 
FROM bigdataplaybook.employee e 
LATERAL VIEW EXPLODE(e.assets) asset_list as asset;
OK
1 SATISH Laptop
1 SATISH VOIP
1 SATISH Monitor
2 RAMYA Computer
2 RAMYA Mobile
3 Teja Computer
3 Teja Mobile
Time taken: 0.211 seconds, Fetched: 7 row(s)

 

Below query shows, how to use explode on the column of type map:

SELECT 
 e.emp_id,
 e.emp_name,
 expenses_list.exp_des,
 expenses_list.exp_cost
FROM bigdataplaybook.employee e 
LATERAL VIEW EXPLODE(e.expenses) expenses_list as exp_des,exp_cost;
OK
1 SATISH cab 500
1 SATISH toll 100
1 SATISH meal 500
2 RAMYA cab 500
2 RAMYA meal 700
3 Teja meal 600
Time taken: 0.198 seconds, Fetched: 6 row(s)

Lateral View first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows, forming a virtual table with the supplied table alias.

I hope this post has clarified how to use Lateral View UDTFs in Hive. If you want to share your thoughts, please feel free to comment below to let me know.

Happy Learning !!!

 

SCALA

Scala Programming Language and Features

Scala Programming Language

Scala is a multi-paradigm programming language that integrates object-oriented and functional language features, aimed at expressing common programming patterns in a concise, elegant and type-safe way.

Martin Odersky and his team started developing Scala in 2001 and released it in 2003. In recent years, Scala has been used to develop scalable concurrent applications.

Scala Features

Object Oriented – Scala is a pure object-oriented language where every value is an object and every operation is a method call. The behaviour of objects is described by classes and traits. Multiple inheritance is achieved through a mixin-based composition mechanism and by extending classes.

Functional – Scala is also a functional programming language: every function is a value, and since every value is an object, every function is an object. Scala supports nested and higher-order functions (passing functions as parameters), currying, and a lightweight syntax for defining anonymous functions.

Statically Typed – Scala is a strongly typed language with a type system that enforces constraints so that abstractions are used in a safe and coherent manner. The type system supports generic classes, variance annotations, upper and lower bounds, explicitly typed self-references, views and polymorphic methods. A local type inference mechanism means there is no need to annotate the program with redundant type information. Together, these form a powerful basis for the safe reuse of programming abstractions.

Extensible – Scala provides a unique combination of language mechanisms that makes it easy to integrate new language constructs in the form of libraries, without resorting to meta-programming facilities such as macros.

Runs Java Code – Scala has the same compilation model as Java and runs on the JVM. Scala programs can access thousands of existing high-quality libraries, such as the Java SDK APIs and open-source projects.

Scala Frameworks – Many frameworks are developed in Scala. Big data frameworks like Scalding, Spark and Kafka were written in Scala, as were popular frameworks such as Play, Akka, Bowler and Lift.

Additional features that differ from Java:

  • Scala has a REPL (Read-Evaluate-Print Loop) for interactive analysis.
  • Everything is an object in Scala; unlike Java, Scala has no primitives.
  • All functions are objects, which makes functions easy to reuse.
  • Supports nested functions, i.e. defining a function inside a function.
  • Type inference – the compiler detects the datatype even when the user does not specify it.
  • Closures – a closure is a function whose return value depends on the value of one or more variables declared outside it.
  • Enables users to write their own language specifications and implement them in Scala.
  • Concurrency support for utilizing multiple CPUs/cores, enabling faster execution of programs than in traditional programming languages.

That’s all about Scala and its features. In coming posts, I will cover some more Scala concepts.

Please drop me a comment if you like my post or have any typo errors/issues/suggestions.

 

HIVE

Hive Date Functions

When we create a Hive table on top of raw data received from different sources, it often becomes necessary to convert some values into a date type or into different date formats.

In this blog post, we summarize the most commonly used Hive date functions with examples of their usage.

Hive Date Functions

Hive supports most of the date functions that exist in relational databases.
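One recurring gotcha is that Hive date functions use Java SimpleDateFormat patterns, where MM is month and mm is minutes. A small Python crib for translating the handful of codes used below (the java_pattern_to_strftime helper is invented here for illustration, not a full converter):

```python
from datetime import datetime

# Rough mapping from Java SimpleDateFormat codes (as used by Hive)
# to Python strftime codes. Order matters: "MM" must be replaced
# before "mm" (month vs minutes).
JAVA_TO_STRFTIME = {
    "yyyy": "%Y",  # 4-digit year
    "MM": "%m",    # 2-digit month
    "dd": "%d",    # day of month
    "HH": "%H",    # 24-hour clock
    "mm": "%M",    # minutes
    "ss": "%S",    # seconds
}

def java_pattern_to_strftime(pattern: str) -> str:
    for java, py in JAVA_TO_STRFTIME.items():
        pattern = pattern.replace(java, py)
    return pattern

fmt = java_pattern_to_strftime("yyyy-MM-dd HH:mm:ss")
print(datetime(2017, 6, 30, 8, 55, 53).strftime(fmt))  # 2017-06-30 08:55:53
```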

Get current Date of the current system
SELECT CURRENT_DATE();
2017-06-30
Get current date with timestamp of the current system
SELECT CURRENT_TIMESTAMP();
2017-06-30 08:55:43.324
Get the current timestamp in UNIX epoch seconds, using the default time zone.
SELECT UNIX_TIMESTAMP() 
1498838153
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix time stamp.
 
SELECT UNIX_TIMESTAMP('2017-06-30 08:55:53'); 
1498838153
Converts a Unix epoch value to a STRING representing the DATE and TIMESTAMP in the current system timezone, in the format “1970-01-01 00:00:00”.
 
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP()); 
2017-06-30 08:59:06
Converts the TIMESTAMP to specified format.
 
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(CURRENT_DATE()), 'yyyyMMdd'); 
20170630
Cast Date type to different format
 
SELECT CAST(CURRENT_TIMESTAMP() AS STRING); 
2017-06-30 09:00:43.637

SELECT CAST(CURRENT_TIMESTAMP() AS BIGINT); 
1498838443
Get the date part of the TIMESTAMP as a DATE in ‘yyyy-MM-dd’ format
 
SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())); 
2017-06-30
Get the YEAR/MONTH/DAY/HOUR/MINUTE/SECOND from the specified date
SELECT YEAR('2017-06-30 09:07:06') 
2017

SELECT MONTH('2017-06-30 09:07:06') 
6

SELECT DAY('2017-06-30 09:07:06') 
30

SELECT HOUR('2017-06-30 09:07:06') 
9

SELECT MINUTE('2017-06-30 09:07:06') 
7

SELECT SECOND('2017-06-30 09:07:06') 
6
Get the WEEKOFYEAR from the input date
 
SELECT WEEKOFYEAR('2017-06-30 09:07:06') 
26
Get the day number of the week (Monday = 1) from an input date given as a STRING with a format
 
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP('20170630','yyyyMMdd'),'u') 
5
Get number of days between the two specified dates
 
SELECT DATEDIFF('2017-06-12', '2017-06-02') 
10
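As a sanity check, the same arithmetic in plain Python, with the dates from the Hive example:

```python
from datetime import date

# DATEDIFF('2017-06-12', '2017-06-02') in Hive is end_date - start_date in days.
delta = (date(2017, 6, 12) - date(2017, 6, 2)).days
print(delta)  # 10
```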
Add days to the specified date
 
SELECT DATE_ADD('2017-06-30', 1); 
2017-07-01
Subtract a number of days from the specified date
 
SELECT DATE_SUB('2017-07-01', 2); 
2017-06-29
Change date format to specified format
select from_unixtime(unix_timestamp('2012/06/12','yyyy/MM/dd'),'yyyy-MM-dd') from table1;
2012-06-12

select from_unixtime(unix_timestamp('JUN 30 2017 053019','MMM dd yyyy HHmmss'),'yyyy-MM-dd HH:mm:ss')
2017-06-30 05:30:19

select from_unixtime(unix_timestamp('JUN 11 2017 053019 GMT','MMM dd yyyy HHmmss zzzz'),'yyyy-MM-dd HH:mm:ss')
2017-06-10 22:30:19
Get the max and min dates from a column of STRING datatype
SELECT
MIN(cast(to_date(from_unixtime(unix_timestamp(colName , 'dd/MM/yyyy'))) as date)) as MinDate,
MAX(cast(to_date(from_unixtime(unix_timestamp(colName , 'dd/MM/yyyy'))) as date)) as MaxDate
FROM tableName;
Retrieve data between specific dates when date is of type String
 
select * from tableName 
WHERE 
unix_timestamp(colName, 'yyyy-MM-dd') >= unix_timestamp('2017-02-01', 'yyyy-MM-dd') 
AND
unix_timestamp(colName, 'yyyy-MM-dd') <= unix_timestamp('2017-06-30', 'yyyy-MM-dd')
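The same inclusive between-dates filter expressed in plain Python terms (a sketch with made-up rows, just to show the logic the query implements):

```python
from datetime import date

# A string-typed date column, as in the Hive table.
rows = [("2017-01-15",), ("2017-03-10",), ("2017-07-02",)]

# Parse each string and keep rows whose date falls in the inclusive range.
lo, hi = date(2017, 2, 1), date(2017, 6, 30)
kept = [r for r in rows if lo <= date.fromisoformat(r[0]) <= hi]
print(kept)  # [('2017-03-10',)]
```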

 

This blog is mostly notes for myself from what I explored and learned. Hope this blog has been helpful.

Please drop me a comment if you like my post or have any thoughts/suggestions.

Keep Learning… Enjoy Learning… Share Learning…

 

 

CCA-175 Preparation · SPARK · SPARK-SQL

Quick Reference to read and write in different file format in Spark

Text File

Read 

 rdd = sparkContext.textFile(SOURCE_PATH)

Write 

 rdd.saveAsTextFile(TARGET_PATH)

Apply compression while writing

rdd.saveAsTextFile(TARGET_PATH, compressionCodecClass=COMPRESSION_CODEC_CLASS)

Supported compression codecs :

  • org.apache.hadoop.io.compress.BZip2Codec
  • org.apache.hadoop.io.compress.GzipCodec
  • org.apache.hadoop.io.compress.SnappyCodec

Parquet File

Read

 dataframe = sqlContext.read.format('parquet').load(SOURCE_PATH)

Write

 dataframe.write.format('parquet').save(TARGET_PATH)

Apply compression while writing

dataframe.write.option("compression","compression_codec") \
.save(TARGET_PATH)

Supported compression codecs : none, gzip, lzo, snappy (default), uncompressed

 

AVRO File

Read

dataframe = sqlContext.read.format('com.databricks.spark.avro') \
.load(SOURCE_PATH)

Write

dataframe.write.format('com.databricks.spark.avro') \
.save(TARGET_PATH)

Apply compression while writing

sqlContext.setConf("spark.sql.avro.compression.codec","compression_codec")
dataframe.write.format('com.databricks.spark.avro').save(TARGET_PATH)

Supported compression codecs : uncompressed, snappy and deflate

Reference: https://github.com/databricks/spark-avro

ORC File

Read

dataframe = sqlContext.read.format('orc').load(SOURCE_PATH)

Write

dataframe.write.format('orc').save(TARGET_PATH)

Apply compression while writing

dataframe.write.format('orc').option("compression","compression_codec") \
.save(TARGET_PATH)

Supported compression codecs : uncompressed, lzo, snappy, zlib, none

JSON File

Read

dataframe = sqlContext.read.json(SOURCE_PATH)

Write

dataframe.write.json(TARGET_PATH)

Apply compression while writing

dataframe.toJSON().saveAsTextFile(TARGET_PATH, compressionCodecClass=COMPRESSION_CODEC_CLASS)

Supported compression codecs : BZip2Codec, GzipCodec, SnappyCodec

CSV File

Read

dataframe = sqlContext.read.format('com.databricks.spark.csv') \
.option("header", "true") \
.option("inferSchema", "true") \
.load(SOURCE_PATH)

Write

dataframe.write.format('com.databricks.spark.csv') \
.option("header", "true") \
.save(TARGET_PATH)

Reference : https://github.com/databricks/spark-csv

XML File

Read

dataframe = sqlContext.read.format('com.databricks.spark.xml') \
                      .option("rowTag", "tag_name") \
                      .load(SOURCE_PATH)

Write

dataframe.write.format('com.databricks.spark.xml') \
.option("rowTag", "tag_name") \
.option("rootTag", "tag_name") \
.save(TARGET_PATH)

Reference : https://github.com/databricks/spark-xml

https://github.com/sateeshfrnd/Spark-Learnings/blob/master/spark-python/WriteReadDifferentFileFormats.py

 

Hope this blog has been helpful. If you want to share your thoughts/updates, email me at [email protected].

Enjoy Learning…

CCA-175 Preparation

CCA 175 Preparation Plan

Data Ingest – Transfer data between external systems and your cluster :

  • Import data from a MySQL database into HDFS using Sqoop – SQOOP
  • Export data to a MySQL database from HDFS using Sqoop – SQOOP
  • Change the delimiter and file format of data during import using Sqoop – SQOOP
  • Ingest real-time and near-real-time streaming data into HDFS – Spark Streaming
  • Process streaming data as it is loaded onto the cluster – Spark Transformations/Actions
  • Load data into and out of HDFS using the Hadoop File System commands – HDFS Commands

Transform, Stage, and Store – Load data from HDFS, perform ETL on the data and write it back to HDFS :

  • Load RDD data from HDFS for use in Spark applications – Spark RDD/DataFrame (Scala/Python)
  • Write the results from an RDD back into HDFS using Spark – Spark RDD/DataFrame (Scala/Python)
  • Read and write files in a variety of file formats – Spark RDD/DataFrame (Scala/Python)
  • Perform standard extract, transform, load (ETL) processes on data – Spark Transformations/Actions

Data Analysis – Load data from the metastore (i.e. Hive) and execute queries to generate the expected reports:

  • Use metastore tables as an input source or an output sink for Spark applications – Hive, Spark DataFrame (Scala/Python)
  • Understand the fundamentals of querying datasets in Spark – Spark DataFrame APIs (Scala/Python)
  • Filter data using Spark – Spark DataFrame APIs (Scala/Python)
  • Write queries that calculate aggregate statistics – Spark DataFrame Aggregate APIs (Scala/Python)
  • Join disparate datasets using Spark – Spark DataFrame Join APIs (Scala/Python)
  • Produce ranked or sorted data – Spark DataFrame Ranking, Window APIs (Scala/Python)

Configuration – Options used while submitting a Spark application:

  • Supply command-line options to change your application configuration, such as increasing available memory – Spark Submit Command Options