# The simplest and smallest Spark cluster you can build

If you want to get up and running with Spark in almost no time, you can let the Spark context create a local cluster. But deploying a minimal real setup is also interesting: it helps you understand the most common issues of actual environments, and it showcases the performance problems you may run into.
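
For reference, the zero-setup route looks like this. A minimal sketch in PySpark, assuming you have `pyspark` installed locally; the app name is just illustrative:

```
from pyspark.sql import SparkSession

# local[*] runs the driver and executors in a single JVM on this machine,
# with as many worker threads as you have cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-playground")  # hypothetical app name
    .getOrCreate()
)

spark.range(1_000).selectExpr("sum(id)").show()
```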
Spark runs as a service, and you have to connect to it from something else. In my case, I tend to connect to it from Jupyter.
The fastest and simplest way I've found to deploy a Spark cluster is with Docker Compose.
```
version: "3"

networks:
  jupyterhub:
    external: false

services:
  jupyterhub:
    restart: always
    build:
      context: .
      dockerfile: Dockerfile
    container_name: jupyterhub
    extra_hosts:
      - host.docker.internal:host-gateway
    volumes:
      # The hub needs the Docker socket to spawn single-user containers
      - "/var/run/docker.sock:/var/run/docker.sock:rw"
      - "./data:/data"
      - "./jupyterhub_config.py:/srv/jupyterhub/jupyterhub_config.py"
    networks:
      - jupyterhub
    ports:
      - "8000:8000"
    environment:
      # Compose prefixes networks with the project name, hence "lab_"
      DOCKER_NETWORK_NAME: lab_jupyterhub
    command: >
      jupyterhub -f /srv/jupyterhub/jupyterhub_config.py

  spark:
    image: docker.io/bitnami/spark:3.3
    extra_hosts:
      - host.docker.internal:host-gateway
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - jupyterhub
    ports:
      # Spark master web UI
      - "8080:8080"

  spark-worker:
    image: docker.io/bitnami/spark:3.3
    extra_hosts:
      - host.docker.internal:host-gateway
    deploy:
      replicas: 4
    environment:
      - SPARK_MODE=worker
      # "spark" resolves to the master service on the shared network
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - jupyterhub
```
Note that the only tricky piece is the name of the network that connects the Jupyter server and the hub: Docker Compose prefixes it with the project name, so from the spawner's point of view the jupyterhub network is called lab_jupyterhub.
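
Once the stack is up, any notebook container spawned on that network can reach the master through its service name. A minimal sketch of the connection, assuming the Spark versions match on both sides; the app name is illustrative:

```
from pyspark.sql import SparkSession

# "spark" is the Compose service name of the master; Docker's internal DNS
# resolves it for every container attached to the lab_jupyterhub network.
spark = (
    SparkSession.builder
    .master("spark://spark:7077")
    .appName("cluster-smoke-test")  # hypothetical app name
    .getOrCreate()
)

# The count is computed by the workers, not by the notebook container.
print(spark.range(10_000).count())
```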
I'm running JupyterLab with JupyterHub's Docker spawner, and I'd recommend you do the same. This means you have to build a container image for the Jupyter server, and make sure it ships the same version of Spark as the cluster.
Multi-stage builds to the rescue!
```
# Stage 1: reuse the exact Spark build that the cluster runs
FROM docker.io/bitnami/spark:3.3 as spark

# Stage 2: the single-user notebook image
FROM jupyter/all-spark-notebook

USER root
# Overwrite the notebook's Spark with the cluster's; all-spark-notebook
# already points SPARK_HOME at /usr/local/spark, so notebooks pick it up.
COPY --from=spark /opt/bitnami/spark /usr/local/spark

# Drop back to the unprivileged notebook user
USER jovyan
```
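
As a quick sanity check from a spawned notebook, you can confirm that the versions line up. A sketch, assuming `import pyspark` resolves against the copied distribution:

```
import pyspark

# Should report the same 3.3.x as the cluster, since the notebook's Spark
# was copied straight out of the bitnami/spark:3.3 image.
print(pyspark.__version__)
```

The master's web UI on port 8080 reports the cluster's Spark version, so you can compare the two.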