Spark & PySpark

Basic & Advanced Application

Spark and PySpark are available via the Jupyter + Spark Basic and Jupyter + Spark Advanced interactive applications on Open OnDemand for Armis2, Great Lakes, and Lighthouse.

The Basic application provides a starter Spark cluster of 16 CPU cores and 90 GB of RAM with a one-day (24-hour) walltime limit, designed for beginner Spark users. This is especially useful for Python novices who have not spent time customizing their environment, as well as for newcomers to the Spark ecosystem.

The Advanced application provides a more comprehensive and configurable interface for advanced Spark applications. Here, users can request more cores and memory than the default, keeping in mind that Spark has fixed requirements for its software executors (namely 3 CPU cores and 15 GB of RAM per executor). Users can also load module files or source a setup file in which shell variables are defined. Note that modules must be compatible with the selected Spark and Python versions.
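As a rough sizing aid, the per-executor requirement above can be turned into a quick calculation. This is only a sketch: it assumes every executor takes exactly 3 cores and 15 GB and that nothing is reserved for the Spark driver or for overhead, which may not match how the application actually allocates resources. The function name `max_executors` is hypothetical.

```python
# Per-executor requirements stated above.
CORES_PER_EXECUTOR = 3
MEM_GB_PER_EXECUTOR = 15


def max_executors(cores: int, mem_gb: int) -> int:
    """Return how many 3-core / 15 GB executors fit in a request.

    Assumption: nothing is set aside for the driver or for overhead.
    """
    by_cores = cores // CORES_PER_EXECUTOR
    by_memory = mem_gb // MEM_GB_PER_EXECUTOR
    # The scarcer resource limits the executor count.
    return min(by_cores, by_memory)


# The Basic application's 16 cores and 90 GB:
print(max_executors(16, 90))  # 5 (limited by cores: 16 // 3)
```

Under these assumptions, a request whose core count is a multiple of 3 and whose memory is a matching multiple of 15 GB leaves no resources stranded.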

Using PySpark with --jars and --packages

In a Jupyter-based Spark environment, users do not invoke pyspark directly from the command line. Instead, Spark dependencies such as external JAR files or Maven packages must be supplied through the Spark configuration when creating the SparkSession.

Method 1: Configure the SparkSession (Recommended)

The preferred way to include external JARs or Spark packages is to specify them when creating the SparkSession in your notebook:


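from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("MySparkApp")
        # Local JAR file(s), given as a comma-separated list of paths
        .config("spark.jars", "/path/to/my-library.jar")
        # Maven coordinates (groupId:artifactId:version), comma-separated
        .config(
            "spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262"
        )
        # Returns the existing session if one is already running
        .getOrCreate()
)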

Multiple JARs or packages should be provided as a comma-separated list. Spark will automatically download packages from Maven Central and distribute them across the cluster.
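When a notebook needs several dependencies, it can be easier to keep them in Python lists and join them into the comma-separated form just before building the session. The helper below is a minimal sketch; `dependency_conf` is a hypothetical name, and the paths and coordinates are the same placeholders used above.

```python
def dependency_conf(jars=(), packages=()):
    """Build Spark settings from lists of JAR paths and Maven coordinates."""
    conf = {}
    if jars:
        # spark.jars expects one comma-separated string of paths.
        conf["spark.jars"] = ",".join(jars)
    if packages:
        # spark.jars.packages expects comma-separated Maven coordinates.
        conf["spark.jars.packages"] = ",".join(packages)
    return conf


conf = dependency_conf(
    jars=["/path/to/my-library.jar"],
    packages=[
        "org.apache.hadoop:hadoop-aws:3.3.4",
        "com.amazonaws:aws-java-sdk-bundle:1.12.262",
    ],
)
print(conf["spark.jars.packages"])
```

Each key and value in the resulting dictionary can then be passed to `SparkSession.builder` with `.config(key, value)` before calling `.getOrCreate()`.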

Method 2: Using Environment Variables (Advanced)

Advanced users may define Spark submit options via environment variables before launching the Jupyter session (for example, in a sourced setup script):


# The value must end with the pyspark-shell token.
export PYSPARK_SUBMIT_ARGS="--packages org.apache.hadoop:hadoop-aws:3.3.4 \
--jars /path/to/my-library.jar pyspark-shell"

This approach applies the configuration globally to the session and is useful when all notebooks in a job require the same dependencies.

Note that any JARs referenced by file path must be accessible on all Spark nodes, typically via a shared filesystem such as /scratch or /home.
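A quick sanity check before creating the session is to confirm that each JAR path is a readable file. The sketch below only inspects the node where the notebook runs, so a clean result does not prove the file is visible on every Spark node; it mainly catches typos and paths outside the shared filesystem. The function name `missing_jars` is hypothetical.

```python
import os


def missing_jars(spark_jars: str) -> list:
    """Return the paths in a comma-separated spark.jars value
    that are not readable regular files on this node."""
    paths = [p.strip() for p in spark_jars.split(",") if p.strip()]
    return [
        p for p in paths
        if not (os.path.isfile(p) and os.access(p, os.R_OK))
    ]


# Placeholder path from the examples above; replace with real JAR paths.
print(missing_jars("/path/to/my-library.jar"))
```

An empty list means every listed path was found and is readable from the notebook's node.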

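Important: Spark configuration options must be set before the SparkSession is created. If a SparkSession already exists, it must be stopped and recreated for changes to take effect. Some settings, such as spark.jars.packages, are resolved when Spark's JVM starts, so restarting the notebook kernel may also be necessary.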
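To tell whether an existing session already carries the settings a notebook needs, the running configuration can be compared against the desired values. This is a minimal sketch: `stale_settings` is a hypothetical helper, and it relies only on `spark.sparkContext.getConf()` and `SparkConf.get(key, default)`.

```python
def stale_settings(spark, wanted: dict) -> dict:
    """Return the wanted settings whose values differ in the running session."""
    current = spark.sparkContext.getConf()
    return {
        key: value
        for key, value in wanted.items()
        # A key that is absent from the session comes back as None.
        if current.get(key, None) != value
    }


# In a notebook with a live session named `spark`:
# if stale_settings(spark, {"spark.jars": "/path/to/my-library.jar"}):
#     spark.stop()  # then rebuild the session with the new settings
```

A non-empty result indicates that the session was created without the requested options and should be stopped and rebuilt.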

Note that closing the browser tab of the Jupyter Server or Jupyter Notebook does NOT stop the Spark cluster, which keeps running in the background. Your account will continue to accrue charges until you explicitly stop the job or the walltime limit is reached. To stop the job, click the ‘Quit’ button in the upper right of the Jupyter Server web page.