Run your Spark applications on oleander-managed infrastructure or on your own registered Spark clusters. Upload scripts to oleander for the managed cluster, or keep jobs in your environment for registered clusters. Manage runs and capture lineage metadata for full observability of your data transformations.

Installation

Using Homebrew

Install the oleander CLI:
brew tap OleanderHQ/tap
brew install oleander-cli
Upgrade the oleander CLI:
brew update
brew upgrade oleander-cli

Configuration

Authenticate with your API key. Find it in your oleander settings.
oleander configure --api-key <YOUR_API_KEY>

Oleander Managed Spark

Upload, list, and delete operations apply only to the oleander managed cluster; registered clusters run jobs that already exist in your environment.

List your Spark jobs

List your uploaded Spark scripts and their status:
oleander spark jobs list
The output shows every Spark job available to run on the managed cluster.

Upload your Spark script

Upload your Spark application to oleander. The script is stored and ready to run:
oleander spark jobs upload <your_script_path>
Example:
oleander spark jobs upload ./transformations/process_sales_data.py

Include Python dependencies

If your Spark script needs additional Python modules, package them in a ZIP and include them with --py-files:
oleander spark jobs upload <your_script_path> --py-files <module_archive_zip>
Example:
oleander spark jobs upload ./etl_pipeline.py --py-files ./dependencies.zip
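As a sketch of the packaging step (the `make_dependency_zip` helper and the `deps/` directory name are illustrative, not part of the CLI), the archive can be built with Python's `zipfile` module:

```python
import os
import zipfile

def make_dependency_zip(module_dir: str, archive_path: str) -> str:
    """Package a directory of Python modules into a ZIP suitable for --py-files."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(module_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # Store paths relative to the module directory so imports resolve.
                    zf.write(full, os.path.relpath(full, module_dir))
    return archive_path

# make_dependency_zip("deps", "dependencies.zip")
# then: oleander spark jobs upload ./etl_pipeline.py --py-files ./dependencies.zip
```

Keeping archive paths relative to the module directory ensures the modules are importable once Spark adds the ZIP to the Python path.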

Update an existing job

To replace an existing Spark job with a new version of your script, use the --overwrite flag:
oleander spark jobs upload <your_script_path> --overwrite
Use this to replace a job while iterating or fixing bugs.

Delete a Spark job

Remove an uploaded Spark script from the managed cluster:
oleander spark jobs delete <script_name>
Example:
oleander spark jobs delete process_sales_data

Submit and execute a Spark job

Submit your uploaded Spark script to the oleander managed cluster. Use the exact uploaded file name without the path, such as process_sales_data.py. The --wait flag keeps the command running until the job finishes.
oleander spark jobs submit <script_name> --namespace <namespace> --name <run_name> --wait
Example:
oleander spark jobs submit process_sales_data.py --namespace finance --name process-sales-data --wait

Submit options

  • --namespace (required): Namespace for the job, a logical group such as a team or project.
  • --name (required): Job name. Runs with the same namespace and name are grouped under the same job.
  • --args: Spark job entrypoint arguments.
  • --sparkConf: Spark configuration properties without the --conf prefix, for example spark.default.parallelism=8. Separate multiple properties with whitespace.
  • --jobTags: Job-specific tags in key=value form. Separate multiple tags with whitespace.
  • --runTags: Run-specific tags in key=value form.
  • --executionIamPolicy: IAM policy for job permissions. Final permissions are the intersection of the job execution role and this policy.
  • --driverMachineType: oleander Spark driver machine type.
  • --executorMachineType: oleander Spark executor machine type.
  • --executorNumbers: Number of executor instances.
  • --wait: Wait until the job finishes.
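The submit flags compose in the usual way. As an illustrative sketch (the `build_submit_argv` helper is hypothetical, not part of the CLI), a wrapper that assembles the command line from these options might look like:

```python
def build_submit_argv(script_name, namespace, name,
                      spark_conf=None, job_tags=None, args=None, wait=True):
    """Assemble an `oleander spark jobs submit` command line for the managed cluster."""
    argv = ["oleander", "spark", "jobs", "submit", script_name,
            "--namespace", namespace, "--name", name]
    if spark_conf:  # e.g. ["spark.default.parallelism=8"], whitespace-separated on the CLI
        argv += ["--sparkConf", *spark_conf]
    if job_tags:    # e.g. ["team=finance"]
        argv += ["--jobTags", *job_tags]
    if args:        # entrypoint arguments passed through to the Spark job
        argv += ["--args", *args]
    if wait:
        argv.append("--wait")
    return argv

# import subprocess; subprocess.run(build_submit_argv("process_sales_data.py",
#                                                     "finance", "process-sales-data"))
```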

Registered EMR Serverless Spark

Register your EMR Serverless cluster and target it by name when submitting jobs. Include --cluster <name> and provide the S3 entrypoint (PySpark script or JAR).

Register an EMR Serverless cluster

oleander spark cluster register <name> \
  --type emr-serverless \
  --region <region> \
  --account-id <awsAccountId> \
  --controller-role-arn <controllerRoleArn> \
  --execution-role-arn <executionRoleArn> \
  --application-id <applicationId> \
  --log-bucket <logBucket>

Register options

  • --region: AWS region of the EMR Serverless application.
  • --account-id: AWS account ID of the EMR Serverless application.
  • --controller-role-arn: IAM role ARN oleander assumes to start job runs. Add this to the role’s trust policy so oleander can assume it:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::579897423473:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "oleander"
                }
            }
        }
    ]
}
Add this permissions policy to the controller role so oleander can run the job:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowStartJobRun",
            "Effect": "Allow",
            "Action": "emr-serverless:StartJobRun",
            "Resource": "arn:aws:emr-serverless:<REGION>:<ACCOUNT_ID>:/applications/<APPLICATION_ID>"
        },
        {
            "Sid": "AllowGetJobRun",
            "Effect": "Allow",
            "Action": "emr-serverless:GetJobRun",
            "Resource": "arn:aws:emr-serverless:<REGION>:<ACCOUNT_ID>:/applications/<APPLICATION_ID>/jobruns/*"
        },
        {
            "Sid": "PassExecutionRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<ACCOUNT_ID>:role/<JOB_EXECUTION_ROLE_NAME>"
        },
        {
            "Sid": "ReadLogFromS3",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<LOG_BUCKET_NAME>/*"
        }
    ]
}
  • --execution-role-arn: IAM role ARN the job uses; the Spark application runs with this role’s permissions.
  • --application-id: EMR Serverless application ID.
  • --log-bucket: S3 bucket for job logs.
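The placeholder ARNs in the permissions policy above can be filled in programmatically. As a sketch (the `emr_controller_policy` helper is illustrative), rendering the policy for a given region, account, and application:

```python
import json

def emr_controller_policy(region, account_id, application_id,
                          execution_role_name, log_bucket):
    """Render the controller-role permissions policy with placeholders filled in."""
    app_arn = f"arn:aws:emr-serverless:{region}:{account_id}:/applications/{application_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "AllowStartJobRun", "Effect": "Allow",
             "Action": "emr-serverless:StartJobRun", "Resource": app_arn},
            {"Sid": "AllowGetJobRun", "Effect": "Allow",
             "Action": "emr-serverless:GetJobRun", "Resource": f"{app_arn}/jobruns/*"},
            {"Sid": "PassExecutionRole", "Effect": "Allow",
             "Action": "iam:PassRole",
             "Resource": f"arn:aws:iam::{account_id}:role/{execution_role_name}"},
            {"Sid": "ReadLogFromS3", "Effect": "Allow",
             "Action": "s3:GetObject",
             "Resource": f"arn:aws:s3:::{log_bucket}/*"},
        ],
    }

# print(json.dumps(emr_controller_policy("us-east-1", "123456789012", "00abc123",
#                                        "my-exec-role", "my-log-bucket"), indent=2))
```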

Submit a job to EMR Serverless

oleander spark jobs submit <entrypoint_s3_uri> --cluster <cluster_name> --namespace <namespace> --name <run_name> --wait
Example:
oleander spark jobs submit s3://my-bucket/jobs/process_sales_data.py --cluster my-emr --namespace finance --name process-sales-data --wait

Submit options

  • --cluster (required): Name of the registered cluster.
  • --namespace (required): Namespace for the job, a logical group such as a team or project.
  • --name (required): Job name. Runs with the same namespace and name are grouped under the same job.
  • --args: Spark job entrypoint arguments.
  • --sparkConf: Spark configuration properties without the --conf prefix, for example spark.default.parallelism=8. Separate multiple properties with whitespace.
  • --jobTags: Job-specific tags in key=value form. Separate multiple tags with whitespace.
  • --runTags: Run-specific tags in key=value form.
  • --executionIamPolicy: IAM policy for job permissions. Final permissions are the intersection of the job execution role and this policy.
  • --pyFiles: Extra pyFiles for the PySpark job. Mutually exclusive with --mainClass.
  • --mainClass: Entrypoint main class for the Java/Scala Spark job. Mutually exclusive with --pyFiles.
  • --wait: Wait until the job finishes.
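Because --pyFiles and --mainClass are mutually exclusive, a wrapper around the submit command should reject both at once. A hypothetical sketch (the `build_emr_submit_argv` helper is not part of the CLI):

```python
def build_emr_submit_argv(entrypoint_s3_uri, cluster, namespace, name,
                          py_files=None, main_class=None, wait=True):
    """Assemble an EMR Serverless submit command line.

    --pyFiles (PySpark) and --mainClass (Java/Scala) are mutually exclusive.
    """
    if py_files and main_class:
        raise ValueError("--pyFiles and --mainClass are mutually exclusive")
    argv = ["oleander", "spark", "jobs", "submit", entrypoint_s3_uri,
            "--cluster", cluster, "--namespace", namespace, "--name", name]
    if py_files:
        argv += ["--pyFiles", *py_files]
    if main_class:
        argv += ["--mainClass", main_class]
    if wait:
        argv.append("--wait")
    return argv
```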

Registered Glue Spark

Register your Glue cluster and target it by name when submitting jobs. Include --cluster <name>. When submitting, pass the name of an existing Glue job in your environment.

Register a Glue cluster

oleander spark cluster register <name> \
  --type glue \
  --controller-role-arn <controllerRoleArn>

Register options

  • --controller-role-arn: IAM role ARN oleander assumes to start job runs. Add this to the role’s trust policy so oleander can assume it:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::579897423473:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "oleander"
                }
            }
        }
    ]
}
Add this permissions policy to the controller role so oleander can run the job:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowStartAndGetJobRun",
            "Effect": "Allow",
            "Action": [
                "glue:StartJobRun",
                "glue:GetJobRun"
            ],
            "Resource": "arn:aws:glue:<REGION>:<ACCOUNT_ID>:job/*"
        },
        {
            "Sid": "PassExecutionRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<ACCOUNT_ID>:role/<JOB_EXECUTION_ROLE_NAME>"
        },
        {
            "Sid": "ReadGlueLogs",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogStreams"
            ],
            "Resource": "arn:aws:logs:<REGION>:<ACCOUNT_ID>:log-group:/aws-glue/jobs/output"
        },
        {
            "Sid": "CloudWatchLogsGetLogEvents",
            "Effect": "Allow",
            "Action": [
                "logs:GetLogEvents"
            ],
            "Resource": "arn:aws:logs:<REGION>:<ACCOUNT_ID>:log-group:/aws-glue/jobs/output:log-stream:*"
        }
    ]
}

Submit a job to Glue

Use --cluster to select the registered cluster:
oleander spark jobs submit <job_name> --cluster <cluster_name> --namespace <namespace> --name <run_name> --wait
Example:
oleander spark jobs submit process-sales-data --cluster my-glue --namespace finance --name process-sales-data --wait

Submit options

  • --cluster (required): Name of the registered cluster.
  • --namespace (required): Namespace for the job, a logical group such as a team or project.
  • --name (required): Job name. Runs with the same namespace and name are grouped under the same job.
  • --args: Spark job entrypoint arguments.
  • --sparkConf: Spark configuration properties without the --conf prefix, for example spark.default.parallelism=8. Separate multiple properties with whitespace.
  • --jobTags: Job-specific tags in key=value form. Separate multiple tags with whitespace.
  • --runTags: Run-specific tags in key=value form.
  • --executionIamPolicy: IAM policy for job permissions. Final permissions are the intersection of the job execution role and this policy.
  • --workerType: Glue worker type.
  • --numberOfWorkers: Number of Glue workers.
  • --enableAutoScaling: Set to true for auto scaling, false otherwise.
  • --executionClass: Glue execution class. Either STANDARD or FLEX.
  • --timeoutMinutes: Glue job timeout in minutes.
  • --wait: Wait until the job finishes.
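Since --executionClass accepts only STANDARD or FLEX, a wrapper can validate the Glue-specific options before shelling out. A sketch (the `build_glue_submit_argv` helper is hypothetical, not part of the CLI):

```python
def build_glue_submit_argv(job_name, cluster, namespace, name,
                           worker_type=None, number_of_workers=None,
                           execution_class=None, timeout_minutes=None, wait=True):
    """Assemble a Glue submit command line, validating the Glue-specific options."""
    if execution_class is not None and execution_class not in ("STANDARD", "FLEX"):
        raise ValueError("--executionClass must be STANDARD or FLEX")
    argv = ["oleander", "spark", "jobs", "submit", job_name,
            "--cluster", cluster, "--namespace", namespace, "--name", name]
    if worker_type:
        argv += ["--workerType", worker_type]
    if number_of_workers is not None:
        argv += ["--numberOfWorkers", str(number_of_workers)]
    if execution_class:
        argv += ["--executionClass", execution_class]
    if timeout_minutes is not None:
        argv += ["--timeoutMinutes", str(timeout_minutes)]
    if wait:
        argv.append("--wait")
    return argv
```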

When your Spark job runs, oleander captures OpenLineage metadata for lineage and dependencies. View results and the lineage graph in your oleander dashboard.