If you want to get up and running with Spark in almost no time, you can let the Spark context create a local cluster. But deploying a minimal setup is also interesting: it helps you understand the most common issues of real environments and shows you the kind of performance problems you may run into.
Spark is a service, and you have to connect something else to it. In my case, I tend to connect it to Jupyter.
The fastest and simplest way I've found to deploy a Spark cluster is with Docker Compose.
version: "3"
networks:
  jupyterhub:
    external: false
services:
  jupyterhub:
    restart: always
    build:
      context: .
      dockerfile: Dockerfile
    container_name: jupyterhub
    extra_hosts:
      - host.docker.internal:host-gateway
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:rw"
      - "./data:/data"
      - "./jupyterhub_config.py:/srv/jupyterhub/jupyterhub_config.py"
    networks:
      - jupyterhub
    ports:
      - "8000:8000"
    environment:
      DOCKER_NETWORK_NAME: lab_jupyterhub
    command: >
      jupyterhub -f /srv/jupyterhub/jupyterhub_config.py
  spark:
    image: docker.io/bitnami/spark:3.3
    extra_hosts:
      - host.docker.internal:host-gateway
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - jupyterhub
    ports:
      - '8080:8080'
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    extra_hosts:
      - host.docker.internal:host-gateway
    deploy:
      replicas: 4
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - jupyterhub
Note that the only tricky part is the name of the network that connects the single-user Jupyter servers to the hub: Docker Compose prefixes it with the project name, so with a project called lab the network the spawned containers must join is lab_jupyterhub, not just jupyterhub.
I'm running JupyterLab with the Docker spawner, and I'd recommend you do the same. This means you have to build a container image for the Jupyter server and make sure you use the same Spark version everywhere.
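The relevant part of jupyterhub_config.py might look roughly like this. It's a sketch assuming DockerSpawner; the option names are DockerSpawner's and JupyterHub's, but the single-user image name is just an example.

import os

# Spawn each user's server in its own container.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# Join the spawned containers to the Compose network; remember the project prefix.
c.DockerSpawner.network_name = os.environ.get("DOCKER_NETWORK_NAME", "lab_jupyterhub")
c.DockerSpawner.use_internal_ip = True

# Let the spawned containers reach the hub through its Compose service name.
c.JupyterHub.hub_connect_ip = "jupyterhub"

# The single-user image, built from the multi-stage Dockerfile below (name is an example).
c.DockerSpawner.image = "jupyterlab-spark:latest"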
Multi-stage builds to the rescue!
# Grab Spark from the same image the cluster runs.
FROM docker.io/bitnami/spark:3.3 as spark
FROM jupyter/all-spark-notebook
USER root
# Drop the cluster's Spark build into SPARK_HOME (/usr/local/spark in this image),
# so the driver and the executors agree on the version.
COPY --from=spark /opt/bitnami/spark /usr/local/spark
USER jovyan
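Once a notebook spawns, a quick sanity check is to confirm that the driver-side PySpark matches the cluster's Spark version; mismatches tend to surface later as confusing serialization or RPC errors.

import pyspark

# Driver-side Spark version; it should match what the master's UI on port 8080 reports.
print(pyspark.__version__)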
You can also run MinIO in the same Docker Compose project and connect it to Spark.
import os

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("Spark minIO Test")
    # Point the S3A filesystem at the MinIO service of the Compose project.
    .set("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .set("spark.hadoop.fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
    .set("spark.hadoop.fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
    # MinIO serves buckets under a single host, so use path-style access.
    .set("spark.hadoop.fs.s3a.path.style.access", "true")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
)

spark = (
    SparkSession
    .builder
    .master("spark://spark:7077")
    .config(conf=conf)
    .getOrCreate()
)

(
    spark
    .read
    .options(inferSchema=True, header=True)
    .csv("s3a://datawarehouse/boards/*/*.csv")
    .createOrReplaceTempView("boards")
)

(
    spark
    .sql("""
        select count(*)
        from boards
        """
    )
    .toPandas()
)
The Spark API changes quite often; this snippet is current as of the time of writing.
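One caveat about the snippet above: depending on how your Spark and notebook images were built, the S3A connector (hadoop-aws plus the AWS SDK) may not be on the classpath. If it isn't, one option is to let Spark fetch it when the session starts; the version below is an assumption and must match the Hadoop release bundled with your Spark build.

# Hypothetical extra setting for the SparkConf shown earlier; adjust the version
# to the Hadoop release your Spark 3.3 distribution ships with.
conf = conf.set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.2")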