# The simplest and smallest Spark cluster you can build

If you want to get up and running with Spark in almost no time, you can let the Spark context create a local cluster for you. But deploying a minimal real setup is also interesting: it helps you understand the most common issues of actual environments, and it showcases the performance problems you may run into.

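For reference, the throwaway option is a one-liner. A minimal sketch, assuming only that `pyspark` is installed: `local[*]` makes Spark run the driver and executors inside a single local process, using all available cores.

```
# A throwaway local "cluster": driver and executors share one process.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")  # use every local core; no external cluster needed
    .appName("local-playground")
    .getOrCreate()
)
```
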
Spark is a service, so you have to connect something to it. In my case, I tend to connect it to Jupyter.

The fastest and simplest way I've found to deploy a Spark cluster is with Docker Compose.

```
version: "3"

networks:
  jupyterhub:
    external: false

services:
  jupyterhub:
    restart: always
    build:
      context: .
      dockerfile: Dockerfile
    container_name: jupyterhub
    extra_hosts:
      - host.docker.internal:host-gateway
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:rw"
      - "./data:/data"
      - "./jupyterhub_config.py:/srv/jupyterhub/jupyterhub_config.py"
    networks:
      - jupyterhub
    ports:
      - "8000:8000"
    environment:
      DOCKER_NETWORK_NAME: lab_jupyterhub
    command: >
      jupyterhub -f /srv/jupyterhub/jupyterhub_config.py

  spark:
    image: docker.io/bitnami/spark:3.3
    extra_hosts:
      - host.docker.internal:host-gateway
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - jupyterhub
    ports:
      - '8080:8080'

  spark-worker:
    image: docker.io/bitnami/spark:3.3
    extra_hosts:
      - host.docker.internal:host-gateway
    deploy:
      replicas: 4
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - jupyterhub
```

Note that the only tricky piece is the name of the network that connects the Jupyter server and the hub: Compose prefixes it with the project name, which is why the hub gets `DOCKER_NETWORK_NAME: lab_jupyterhub` instead of plain `jupyterhub`.

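Once the stack is up, any notebook attached to that network can reach the master by service name. A minimal smoke test, assuming `pyspark` 3.3 is available in the notebook container:

```
# Connect to the standalone master defined in the compose file above;
# the hostname "spark" resolves because both containers share a network.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark:7077")
    .appName("cluster-smoke-test")
    .getOrCreate()
)

# A trivial job scheduled across the four 1-core workers.
print(spark.sparkContext.parallelize(range(1000)).sum())  # 499500
```
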
I'm running JupyterLab with the Docker spawner, and I'd recommend you do the same. This means you have to build a container image for the Jupyter server, and make sure you use the same Spark version everywhere.

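For the spawner part, the configuration boils down to a few lines. This is only a sketch of the relevant fragment of `jupyterhub_config.py`; the tag `jupyter-spark` is a placeholder for whatever name you give the image built below.

```
# jupyterhub_config.py (fragment)
import os

c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# The notebook image built from the Dockerfile below; the tag
# "jupyter-spark" is just an example name.
c.DockerSpawner.image = "jupyter-spark"

# Attach spawned servers to the same network as the Spark services,
# using the prefixed name handed in by docker-compose.
c.DockerSpawner.network_name = os.environ["DOCKER_NETWORK_NAME"]
```
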
As for the image, multi-stage builds to the rescue!

```
FROM docker.io/bitnami/spark:3.3 AS spark
FROM jupyter/all-spark-notebook

USER root
# Copy the exact Spark build the cluster runs over the image's own,
# at the path the notebook image uses as SPARK_HOME.
COPY --from=spark /opt/bitnami/spark /usr/local/spark
USER jovyan
```
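
A quick sanity check once a server spawns, just to confirm that the notebook and the cluster agree on the version (both should report 3.3.x):

```
import pyspark

# Should match the bitnami/spark:3.3 images running master and workers.
print(pyspark.__version__)
```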