Contents
1. Overview
2. Server plan
3. Steps
3.1 Key points
3.2 Configuration files
3.3 Accessing the Spark Master
4. Smoke test
5. References
1. Overview
- The Apache Spark compute cluster is implemented as Docker containers, which makes it easy to adjust per-node resources and scale the number of workers up or down.

2. Server plan
| Server (1 master, 3 workers) | IP | Open ports | Notes |
|---|---|---|---|
| center01.dev.sb | 172.16.20.20 | 8080, 7077 | Hardware: 32 cores / 64 GB RAM. Software: Ubuntu 22.04 + BT Panel (宝塔面板) |
| host001.dev.sb | 172.16.20.60 | 8081, 7077 | 8 cores / 16 GB RAM |
| host002.dev.sb | 172.16.20.61 | 8081, 7077 | ... |
| BEN-ZX-GZ-MH | 172.16.1.106 | | Application server; the job-submitting machine |
3. Steps
3.1 Key points
- Run the worker containers in host network mode; otherwise the links on the Spark UI pages use container IPs and become unreachable from outside.
- Before testing, make sure the job-submitting machine and the worker machines run the same Python version (e.g. both 3.10 or both 3.12); otherwise the job fails with "Python in worker has different version (3, 12) than that in driver 3.10".
- Make sure the job-submitting machine is reachable from the worker nodes; otherwise you will see baffling errors such as:
  "WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"
  The worker's error log reveals the real cause:
  "Caused by: java.io.IOException: Failed to connect to BEN-ZX-GZ-MH/<unresolved>:10027"
  i.e. the worker cannot connect back to the submitting machine. The workaround adopted here is to hard-code the IP mapping in the configuration (the extra_hosts entries in 3.2); a complementary driver-side sketch follows this list.
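As a complement to the extra_hosts mapping, the driver can also advertise a reachable address explicitly. The snippet below is a minimal sketch, not taken from the original setup: it assumes the job is submitted from BEN-ZX-GZ-MH (172.16.1.106) and uses the standard spark.driver.host / spark.driver.port properties; the port value merely mirrors the 10027 seen in the log above.
```python
from pyspark.sql import SparkSession

# Minimal sketch: make the driver advertise an address the workers can reach.
spark = (
    SparkSession.builder.appName("DriverReachability")
    .master("spark://center01.dev.sb:7077")
    # Routable IP of the submitting machine instead of its bare hostname.
    .config("spark.driver.host", "172.16.1.106")
    # Pinning the callback port (illustrative value) keeps firewall rules simple.
    .config("spark.driver.port", "10027")
    .getOrCreate()
)
spark.stop()
```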
3.2 Configuration files
docker-compose.spark-master.yml
```yaml
services:
  spark:
    image: docker.io/bitnami/spark:latest
    container_name: spark-master
    restart: always
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8080:8080'
      - '7077:7077'
```
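With this file on center01.dev.sb, the master comes up with `docker compose -f docker-compose.spark-master.yml up -d`; the web UI is then published on 8080 and the cluster port on 7077, matching the plan in section 2.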
docker-compose.spark-worker.yml
```yaml
services:
  spark-worker:
    image: docker.io/bitnami/spark:latest
    container_name: spark-worker
    restart: always
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8081:8081'
      - '7077:7077'
    extra_hosts:
      - "spark-master:172.16.20.20"
      - "BEN-ZX-GZ-MH:172.16.1.106"
    network_mode: host
```
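Bring a worker up the same way on each worker host: `docker compose -f docker-compose.spark-worker.yml up -d`. One caveat: with `network_mode: host`, Docker ignores published port mappings (they only apply to bridged networks), so the worker UI on 8081 is bound directly on the host and the `ports:` entries here are effectively documentation.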
3.3 Accessing the Spark Master
Open the Spark Master web UI (http://172.16.20.20:8080): two worker machines are registered and ready for use.

4. Smoke test
t3.py
```python
from pyspark.sql import SparkSession
def main():
    # Initialize SparkSession
    spark = (
        SparkSession.builder.appName("HelloSpark")  # type: ignore
        .master("spark://center01.dev.sb:7077")
        .config("spark.executor.memory", "512m")
        .config("spark.cores.max", "1")
        # .config("spark.driver.bindAddress", "center01.dev.sb")
        .getOrCreate()
    )
    # Create an RDD containing numbers from 1 to 10
    numbers_rdd = spark.sparkContext.parallelize(range(1, 11))
    # Count the elements in the RDD
    count = numbers_rdd.count()
    print(f"Count of numbers from 1 to 10 is: {count}")
    # Stop the SparkSession
    spark.stop()
if __name__ == "__main__":
    main()
```
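To double-check the version requirement from 3.1 before running real workloads, a small hypothetical companion script in the same style as t3.py can compare interpreters: it runs a single task on an executor and reports both Python versions.
```python
import sys
from pyspark.sql import SparkSession

# Sketch: compare the driver's Python version with the workers' version.
spark = (
    SparkSession.builder.appName("VersionCheck")
    .master("spark://center01.dev.sb:7077")
    .getOrCreate()
)
driver_version = tuple(sys.version_info[:2])
# A single-partition task forces the lambda to run on a worker, not the driver.
worker_version = (
    spark.sparkContext.parallelize([0], 1)
    .map(lambda _: tuple(sys.version_info[:2]))
    .first()
)
print(f"driver={driver_version}, worker={worker_version}")
spark.stop()
```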
 
Monitoring the run (Spark UI):
Result:
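If everything is wired up correctly, the console prints `Count of numbers from 1 to 10 is: 10` and the finished job appears under Completed Applications on the master UI.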
5. References
- containers/bitnami/spark at main · bitnami/containers · GitHub: https://github.com/bitnami/containers/tree/main/bitnami/spark