With the release of Apache Spark 4, I decided to dive deeper into what’s new and how it could enhance distributed data processing. As someone already familiar with Spark, I was especially curious to try out its new features but before exploring them, I needed to get Spark 4 running on my local setup.
This post walks through how I set up Spark 4 using Anaconda, got it working in both Jupyter Notebook and VS Code, and how I resolved issues related to Java compatibility, environment configuration, and common stumbling blocks when switching Spark versions.
If you’re upgrading or just starting with Spark 4, this guide might save you hours of head-scratching.

Prerequisites
- You should have Anaconda installed (download it from the official Anaconda site) and, during installation, choose "Add to PATH".
- Basic knowledge of using the terminal or Anaconda Prompt.
Step 1: Create a New Conda Environment
It’s a good practice to isolate your Spark setup in a separate environment.
conda create -n spark4_env python=3.10 -y
conda activate spark4_env
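Once the environment is active, it's worth confirming that the interpreter you're running really is the one from spark4_env. A quick sketch (the expected version simply matches the python=3.10 pin above):

```python
import sys

def is_expected_python(version_info, major=3, minor=10):
    """True if the interpreter matches the python=3.10 pin used for spark4_env."""
    return (version_info[0], version_info[1]) == (major, minor)

if __name__ == "__main__":
    print(sys.executable)  # should point inside the spark4_env directory
    print(is_expected_python(sys.version_info))
```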
Step 2: Install Java (for Anaconda Users)
Apache Spark requires Java. To avoid system-level conflicts, I installed OpenJDK inside the conda environment:
conda install -c conda-forge openjdk=17 -y
This step is crucial because Spark 4 requires Java 17 or later.
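Since several Java builds can coexist on one machine, it helps to check which major version the first line of `java -version` reports. A small helper sketch (the parsing is an assumption based on the two common banner formats, the legacy `1.8.0_xxx` scheme and the modern `17.0.9` scheme):

```python
import re

def java_major_version(banner: str) -> int:
    """Extract the major Java version from a `java -version` banner line."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if not m:
        raise ValueError("unrecognized java -version output")
    major = int(m.group(1))
    if major == 1 and m.group(2):  # legacy scheme: "1.8.0_362" means Java 8
        return int(m.group(2))
    return major

print(java_major_version('openjdk version "17.0.9" 2023-10-17'))  # 17
print(java_major_version('java version "1.8.0_362"'))             # 8
```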
Step 3: Set JAVA_HOME to Match Conda Environment
To avoid Java errors, I explicitly set JAVA_HOME inside my Python script:
import os
os.environ["JAVA_HOME"] = os.environ["CONDA_PREFIX"]
This ensures PySpark uses the correct Java installation from your conda environment.
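Before creating a SparkSession, you can also verify that the JAVA_HOME you just set actually contains a java launcher. A defensive sketch (it assumes the layout conda-forge's openjdk package uses, with the launcher under bin/):

```python
import os
from pathlib import Path

def resolve_java_home(prefix: str) -> Path:
    """Return `prefix` as JAVA_HOME if it contains a java launcher, else raise."""
    home = Path(prefix)
    launcher = "java.exe" if os.name == "nt" else "java"
    if not (home / "bin" / launcher).exists():
        raise FileNotFoundError(
            f"no {launcher} under {home / 'bin'}; is openjdk installed in this env?"
        )
    return home

# Usage: fail fast here instead of hitting an obscure Py4J error later.
# os.environ["JAVA_HOME"] = str(resolve_java_home(os.environ["CONDA_PREFIX"]))
```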
Step 4: Test Spark in Jupyter Notebook
I wanted to run PySpark in a Jupyter notebook, so I installed the Jupyter kernel:
pip install ipykernel
python -m ipykernel install --user --name=spark4_env --display-name "Python (spark4_env)"
Then launched Jupyter:
jupyter notebook
And chose the Python (spark4_env) kernel to test:
import os
os.environ["JAVA_HOME"] = os.environ["CONDA_PREFIX"]
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("JupyterSpark").getOrCreate()
print(f'Spark Version = {spark.version}')
df = spark.range(10)
df.show()
Spark Version = 4.0.0
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
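If the basic show() works, a slightly fuller smoke test exercises a transformation, a filter, and an aggregation in one go. A sketch reusing the same session pattern (the import is guarded so the script still runs where pyspark isn't installed):

```python
# Requires: pip install pyspark, and Java 17 on JAVA_HOME (see the steps above).
try:
    from pyspark.sql import SparkSession, functions as F
except ImportError:
    SparkSession = None  # pyspark not installed; skip the live check

def even_id_sum(n: int) -> int:
    """The value the Spark job below should produce, computed locally."""
    return sum(i for i in range(n) if i % 2 == 0)

if SparkSession is not None:
    spark = SparkSession.builder.appName("JupyterSpark").getOrCreate()
    df = spark.range(10).withColumn("even", F.col("id") % 2 == 0)
    total = df.filter("even").agg({"id": "sum"}).collect()[0][0]
    assert total == even_id_sum(10)  # 0 + 2 + 4 + 6 + 8 = 20
```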
Make It Permanent (Optional)
To avoid setting it every time:
- Open Anaconda Prompt
- Run:
conda activate spark4_env
conda env config vars set JAVA_HOME=%CONDA_PREFIX%
conda activate spark4_env
This stores JAVA_HOME in the environment's own configuration, so conda exports it automatically every time you activate spark4_env (reactivating once is needed for it to take effect).
The Gotcha: VS Code Error — Wrong JAVA_HOME
Everything was smooth in Jupyter, but I hit a wall in VS Code:
java.lang.UnsupportedClassVersionError:
... compiled by a more recent version of the Java Runtime (class file version 61.0)
Turns out, VS Code was picking up the wrong environment: CONDA_PREFIX pointed to a different path than spark4_env, so Spark was running on an older Java than the one its classes were built for.
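The class file version in that error maps directly to a Java release: the JVM spec fixes an offset of 44, so class file 52 is Java 8 and 61 is Java 17. A tiny helper sketch for decoding these errors (debugging aid only, not part of the setup):

```python
def java_release_from_class_file_version(major: int) -> int:
    """Map a JVM class file major version to its Java release (fixed offset of 44)."""
    return major - 44

print(java_release_from_class_file_version(61))  # 17 -> the classes need Java 17
print(java_release_from_class_file_version(52))  # 8
```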
Fixing VS Code Setup
Here’s what I did:
Selected the correct interpreter:
Ctrl + Shift + P → "Python: Select Interpreter" → chose spark4_env
Made sure the terminal was using the correct conda environment:
conda activate spark4_env
After this, my Spark script ran perfectly in both VS Code and Jupyter!
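To catch this class of problem early in any editor, a small guard at the top of the script can verify that the active environment is the one you expect before pyspark is even imported. A sketch (spark4_env is the env name used throughout this post):

```python
import os
import sys

def assert_conda_env(expected: str) -> None:
    """Fail fast if the running interpreter is not from the expected conda env."""
    prefix = os.environ.get("CONDA_PREFIX", "")
    if os.path.basename(prefix) != expected:
        sys.exit(
            f"Wrong environment: CONDA_PREFIX={prefix!r}, expected {expected!r}. "
            f"Run `conda activate {expected}` or fix the VS Code interpreter."
        )

# assert_conda_env("spark4_env")  # call this before importing pyspark
```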
Final Thoughts
Setting up PySpark on a local machine using Anaconda is totally doable, but small mismatches between Java and Spark versions can cause big headaches.
Here are my key takeaways:
- Always match your Spark version to a supported Java version.
- Use os.environ["JAVA_HOME"] = os.environ["CONDA_PREFIX"] to stay environment-aware.
- In VS Code, double-check both the interpreter and the terminal environment.
- Prefer installing Java via conda (openjdk) to avoid messing with your system Java.
Bonus: Spark Version Compatibility Table
- Spark 3.2.x: Java 8 or 11
- Spark 3.3.x – 3.5.x: Java 8, 11, or 17
- Spark 4.0.x: Java 17 or 21
Let me know in the comments if you’ve hit similar issues or want help setting this up on your machine.