In DolphinScheduler, one task type we use very often is the Spark task. In this article we walk through configuring a Spark task in DolphinScheduler. Let's get started.
1. Deploy a Spark environment

First of all we need a Spark environment. For the installation you can refer to the article 《spark安装配置篇(一)spark-3.4.0单机版安装教程》 (Spark installation and configuration, part 1: standalone installation of spark-3.4.0).
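Before wiring Spark into DolphinScheduler, it is worth confirming that the installation works on its own. A minimal check, assuming Spark is installed under /home/pubserver/spark-3.4.0-bin-hadoop3 (the path used throughout this article):

# Confirm the spark-submit client starts and reports version 3.4.0
/home/pubserver/spark-3.4.0-bin-hadoop3/bin/spark-submit --version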
2. Configure environment variables

We need to add the Spark environment variables to DolphinScheduler's environment configuration file (in DolphinScheduler 3.x this is typically bin/env/dolphinscheduler_env.sh) and then distribute the file to every node. The file contents are as follows:
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# JAVA_HOME, will use it to start DolphinScheduler server
export JAVA_HOME=${JAVA_HOME:-/usr/local/jdk1.8.0_271}

# Database related configuration, set database type, username and password
export DATABASE=${DATABASE:-mysql}
export SPRING_PROFILES_ACTIVE=${DATABASE}
export SPRING_DATASOURCE_URL="jdbc:mysql://192.168.31.30:3306/scheduler?useUnicode=true&characterEncoding=UTF-8&useSSL=false"
export SPRING_DATASOURCE_USERNAME="scheduler"
export SPRING_DATASOURCE_PASSWORD="YmBDpz775TeWT6r2"

# DolphinScheduler server related configuration
export SPRING_CACHE_TYPE=${SPRING_CACHE_TYPE:-none}
export SPRING_JACKSON_TIME_ZONE=${SPRING_JACKSON_TIME_ZONE:-UTC}
export MASTER_FETCH_COMMAND_NUM=${MASTER_FETCH_COMMAND_NUM:-10}

# Registry center configuration, determines the type and link of the registry center
export REGISTRY_TYPE=${REGISTRY_TYPE:-zookeeper}
export REGISTRY_ZOOKEEPER_CONNECT_STRING=${REGISTRY_ZOOKEEPER_CONNECT_STRING:-localhost:2181}

# Tasks related configurations, need to change the configuration if you use the related tasks.
export HADOOP_HOME=${HADOOP_HOME:-/home/pubserver/hadoop-3.3.5}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/home/pubserver/hadoop-3.3.5/etc/hadoop}
export SPARK_HOME1=${SPARK_HOME1:-/home/pubserver/spark-3.4.0-bin-hadoop3}
export SPARK_HOME2=${SPARK_HOME2:-/home/pubserver/spark-3.4.0-bin-hadoop3}
export PYTHON_HOME=${PYTHON_HOME:-/usr/bin/python}
export HIVE_HOME=${HIVE_HOME:-/home/pubserver/hive3.1.3}
export FLINK_HOME=${FLINK_HOME:-/home/pubserver/flink-1.17.0}
export DATAX_HOME=${DATAX_HOME:-/home/pubserver/datax}
export SEATUNNEL_HOME=${SEATUNNEL_HOME:-/opt/soft/seatunnel}
export CHUNJUN_HOME=${CHUNJUN_HOME:-/opt/soft/chunjun}

export PATH=$HADOOP_HOME/bin:$SPARK_HOME1/bin:$SPARK_HOME2/bin:$PYTHON_HOME/bin:$JAVA_HOME/bin:$HIVE_HOME/bin:$FLINK_HOME/bin:$DATAX_HOME/bin:$SEATUNNEL_HOME/bin:$CHUNJUN_HOME/bin:$PATH
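After editing, the same file must exist on every master and worker node. A simple distribution sketch, assuming a DolphinScheduler 3.x layout where the file is bin/env/dolphinscheduler_env.sh under a hypothetical install directory /home/pubserver/dolphinscheduler, with hypothetical worker hostnames worker1 and worker2:

# Push the environment file to each node; adjust hostnames and install path for your cluster
for host in worker1 worker2; do
  scp /home/pubserver/dolphinscheduler/bin/env/dolphinscheduler_env.sh \
      ${host}:/home/pubserver/dolphinscheduler/bin/env/dolphinscheduler_env.sh
done

Restarting the DolphinScheduler services afterwards ensures the new variables are picked up.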
3. Configure an environment in DolphinScheduler

When the task executes it still needs these environment variables, so we also add an environment in the DolphinScheduler dashboard. Here we reuse the Hadoop environment configured previously:

Its detailed configuration is as follows:
export JAVA_HOME=/usr/local/jdk1.8.0_271
export HADOOP_HOME=/home/pubserver/hadoop-3.3.5
export HBASE_HOME=/home/pubserver/hbase-2.5.3
export HIVE_HOME=/home/pubserver/hive3.1.3
export FLINK_HOME=/home/pubserver/flink-1.17.0
export SPARK_HOME=/home/pubserver/spark-3.4.0-bin-hadoop3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:${HBASE_HOME}/bin:${HIVE_HOME}/bin:${FLINK_HOME}/bin:${SPARK_HOME}/bin
export HADOOP_CLASSPATH=/home/pubserver/hadoop-3.3.5/etc/hadoop:/home/pubserver/hadoop-3.3.5/share/hadoop/common/lib/*:/home/pubserver/hadoop-3.3.5/share/hadoop/common/*:/home/pubserver/hadoop-3.3.5/share/hadoop/hdfs:/home/pubserver/hadoop-3.3.5/share/hadoop/hdfs/lib/*:/home/pubserver/hadoop-3.3.5/share/hadoop/hdfs/*:/home/pubserver/hadoop-3.3.5/share/hadoop/mapreduce/*:/home/pubserver/hadoop-3.3.5/share/hadoop/yarn:/home/pubserver/hadoop-3.3.5/share/hadoop/yarn/lib/*:/home/pubserver/hadoop-3.3.5/share/hadoop/yarn/*
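To confirm that this environment actually reaches the worker, a quick option is to run a one-line Shell task in DolphinScheduler with this environment selected; a hypothetical task body:

# Print the key variables and the Spark client version from inside a scheduled task
echo "SPARK_HOME=${SPARK_HOME}"
echo "HADOOP_HOME=${HADOOP_HOME}"
spark-submit --version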
4. Upload the Spark example program

Here we use the example program that ships with Spark. It is located in the ${SPARK_HOME}/examples/jars directory:

We upload this jar to DolphinScheduler's Resource Center.
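If you are unsure which file to upload, it can be located on the Spark node first, assuming the installation path used above:

# The example jar shipped with Spark 3.4.0; this is the file to upload to the Resource Center
ls /home/pubserver/spark-3.4.0-bin-hadoop3/examples/jars/spark-examples_2.12-3.4.0.jar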
5. Create a project

Here we create a test project for Spark.

6. Create a workflow

Next we create a workflow; the component we drag onto the canvas is the SPARK component:

The node is edited with the following settings:

1) Node Name

node 1

2) Environment Name

Hadoop environment variables

3) Program Type

java

4) Spark Version

spark2 (this maps to SPARK_HOME2 in the environment file, which here points to the Spark 3.4.0 installation)

5) Main Class

org.apache.spark.examples.SparkPi

6) Main Package

spark-example-jar

7) Deploy Mode

cluster

8) Task Name

pi

9) Main Arguments

12
The full configuration is shown in the figure below:

Then we save it:
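Before bringing the workflow online, it can help to sanity-check the same class and argument with a local spark-submit on the Spark node; a minimal sketch, assuming the jar path from the installation above:

# Run SparkPi locally with the same main class and argument configured in the workflow
${SPARK_HOME}/bin/spark-submit \
  --master 'local[2]' \
  --class org.apache.spark.examples.SparkPi \
  ${SPARK_HOME}/examples/jars/spark-examples_2.12-3.4.0.jar 12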
7. Bring the workflow online and run it

Here we bring the workflow online and run it.

When it runs, you can see that the Spark job is submitted to YARN; the command used is:
${SPARK_HOME2}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --driver-cores 1 \
  --driver-memory 512M \
  --num-executors 2 \
  --executor-cores 2 \
  --executor-memory 2G \
  --name pi \
  --queue default \
  spark-jobs/spark-examples_2.12-3.4.0.jar 12
After submitting, we can see the running task on YARN.
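Besides the ResourceManager web UI, the application can also be checked from the command line on the Hadoop node:

# List applications currently running on YARN; the job appears under the name "pi"
yarn application -list -appStates RUNNING
# Fetch the aggregated logs once the job finishes (replace with the real application id)
yarn logs -applicationId <applicationId>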
The result can also be seen in DolphinScheduler.