I'm using Spark inside Docker for some processing. We have a Kafka container, a Spark master container, two Spark worker containers, and a Python container that orchestrates the whole flow. We normally bring everything up with docker-compose:
version: '3.4'
volumes:
  zookeeper-persistence:
  kafka-store:
  spark-store:
services:
  zookeeper-server:
    image: 'bitnami/zookeeper:3.6.1'
    expose:
      - '2181'
    environment:
      ...
    volumes:
      - zookeeper-persistence:/bitnami/zookeeper
  kafka-server:
    image: 'bitnami/kafka:2.6.0'
    expose:
      - '29092'
      - '9092'
    environment:
      ...
    volumes:
      - kafka-store:/bitnami/kafka
    depends_on:
      - zookeeper-server
  spark-master:
    image: bitnami/spark:3.0.1
    environment:
      SPARK_MODE: 'master'
      SPARK_MASTER_HOST: 'spark-master'
    ports:
      - '8080:8080'
    expose:
      - '7077'
    depends_on:
      - kafka-server
  spark-worker1:
    image: bitnami/spark:3.0.1
    environment:
      SPARK_MODE: 'worker'
      SPARK_WORKER_MEMORY: '4G'
      SPARK_WORKER_CORES: '2'
    depends_on:
      - spark-master
  spark-worker2:
    # same as spark-worker1
  compute:
    build: ./app
    image: compute
    environment:
      KAFKA_HOST: kafka-server:29092
      COMPUTE_TOPIC: DataFrames
      PYSPARK_SUBMIT_ARGS: "--master spark://spark-master:7077 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 pyspark-shell"
    depends_on:
      - spark-master
      - kafka-server
    volumes:
      - spark-store:/app/checkpoints
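
For context, the compose file only wires the containers together; on the Python side, pyspark picks up PYSPARK_SUBMIT_ARGS from the environment when the session is created, so the compute app does not need to hard-code the master URL or the Kafka package. A minimal sketch of that wiring (the appName is a placeholder, not taken from our code):

import os

from pyspark.sql import SparkSession

# pyspark reads PYSPARK_SUBMIT_ARGS from the environment, so the
# --master and --packages values from docker-compose apply here.
spark = SparkSession.builder.appName("compute").getOrCreate()

# Connection details injected by docker-compose.
kafka_host = os.environ["KAFKA_HOST"]        # kafka-server:29092
compute_topic = os.environ["COMPUTE_TOPIC"]  # DataFrames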
Data is sent by a separate Python application, and the compute container reacts to the changes. We create a ComputeDeployment and call its start function to launch the Spark job.
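The ComputeDeployment class itself is not reproduced in this section, so the following is only a rough sketch of what its start method presumably does, assuming it wraps a Spark structured streaming query that subscribes to the COMPUTE_TOPIC and checkpoints into the spark-store volume; the transformation and sink are placeholders for illustration:

import os

from pyspark.sql import SparkSession


class ComputeDeployment:
    def __init__(self, spark: SparkSession):
        self.spark = spark
        self.kafka_host = os.environ["KAFKA_HOST"]
        self.topic = os.environ["COMPUTE_TOPIC"]

    def start(self):
        # Subscribe to the incoming topic as a structured stream.
        stream = (
            self.spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", self.kafka_host)
            .option("subscribe", self.topic)
            .load()
        )
        # Placeholder transformation; the real computation lives in ./app.
        result = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        # Checkpoint under /app/checkpoints, which is backed by the
        # spark-store volume in the compose file; the console sink is a
        # stand-in for whatever the real job writes to.
        query = (
            result.writeStream
            .format("console")
            .option("checkpointLocation", "/app/checkpoints/compute")
            .start()
        )
        query.awaitTermination()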