If you want to get up and running with Spark in almost no time, you can let the Spark context create a local cluster. But deploying a minimal setup is also interesting: it helps you understand the most common issues of real environments and shows you the kind of performance problems you may run into.
Spark is a service, and you have to connect something else to it. In my case, I tend to connect it to Jupyter.
The fastest and simplest way I've found to deploy a Spark cluster is with Docker Compose.
version: "3"
networks:
  jupyterhub:
    external: false
services:
  jupyterhub:
    restart: always
    build:
      context: .
      dockerfile: Dockerfile
    container_name: jupyterhub
    extra_hosts:
      - host.docker.internal:host-gateway
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:rw"
      - "./data:/data"
      - "./jupyterhub_config.py:/srv/jupyterhub/jupyterhub_config.py"
    networks:
      - jupyterhub
    ports:
      - "8000:8000"
    environment:
      DOCKER_NETWORK_NAME: lab_jupyterhub
    command: >
      jupyterhub -f /srv/jupyterhub/jupyterhub_config.py
  spark:
    image: docker.io/bitnami/spark:3.3
    extra_hosts:
      - host.docker.internal:host-gateway
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - jupyterhub
    ports:
      - '8080:8080'
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    extra_hosts:
      - host.docker.internal:host-gateway
    deploy:
      replicas: 4
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - jupyterhub
Note that the only tricky part is the name of the network that connects the single-user Jupyter servers to the hub: Docker Compose prefixes it with the project name, so with a project called lab the network the spawned containers must join is lab_jupyterhub, not just jupyterhub.
I'm running JupyterLab with the Docker spawner, and I'd recommend you do the same. This means you have to build a container image for the Jupyter server and make sure you use the same Spark version everywhere.
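The relevant part of jupyterhub_config.py might look roughly like this. It's a sketch assuming DockerSpawner; the option names are DockerSpawner's and JupyterHub's, but the single-user image name is just an example.

import os

# Spawn each user's server in its own container.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# Join the spawned containers to the Compose network; remember the project prefix.
c.DockerSpawner.network_name = os.environ.get("DOCKER_NETWORK_NAME", "lab_jupyterhub")
c.DockerSpawner.use_internal_ip = True

# Let the spawned containers reach the hub through its Compose service name.
c.JupyterHub.hub_connect_ip = "jupyterhub"

# The single-user image, built from the multi-stage Dockerfile below (name is an example).
c.DockerSpawner.image = "jupyterlab-spark:latest"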
Multi-stage builds to the rescue!
# Grab Spark from the same image the cluster runs.
FROM docker.io/bitnami/spark:3.3 as spark
FROM jupyter/all-spark-notebook
USER root
# Drop the cluster's Spark build into SPARK_HOME (/usr/local/spark in this image),
# so the driver and the executors agree on the version.
COPY --from=spark /opt/bitnami/spark /usr/local/spark
USER jovyan
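Once a notebook spawns, a quick sanity check is to confirm that the driver-side PySpark matches the cluster's Spark version; mismatches tend to surface later as confusing serialization or RPC errors.

import pyspark

# Driver-side Spark version; it should match what the master's UI on port 8080 reports.
print(pyspark.__version__)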
You can also run MinIO in the same Docker Compose project and connect it to Spark.
import os

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("Spark minIO Test")
    # Point the S3A filesystem at the MinIO service of the Compose project.
    .set("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .set("spark.hadoop.fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
    .set("spark.hadoop.fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
    # MinIO serves buckets under a single host, so use path-style access.
    .set("spark.hadoop.fs.s3a.path.style.access", "true")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
)

spark = (
    SparkSession
    .builder
    .master("spark://spark:7077")
    .config(conf=conf)
    .getOrCreate()
)

(
    spark
    .read
    .options(inferSchema=True, header=True)
    .csv("s3a://datawarehouse/boards/*/*.csv")
    .createOrReplaceTempView("boards")
)

(
    spark
    .sql("""
        select count(*)
        from boards
        """
    )
    .toPandas()
)
The Spark API changes quite often; this snippet is current as of the time of writing.
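One caveat about the snippet above: depending on how your Spark and notebook images were built, the S3A connector (hadoop-aws plus the AWS SDK) may not be on the classpath. If it isn't, one option is to let Spark fetch it when the session starts; the version below is an assumption and must match the Hadoop release bundled with your Spark build.

# Hypothetical extra setting for the SparkConf shown earlier; adjust the version
# to the Hadoop release your Spark 3.3 distribution ships with.
conf = conf.set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.2")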