In our DolphinScheduler series there is one commonly used component we have not yet covered: Spark. In this article we walk through installing a standalone (single-node) Spark. Let's get started.
1. Download the latest Spark release
To install Spark, we first need to download a release, which we can get from the official Spark download page. Here we use the latest version at the time of writing, spark-3.4.0-bin-hadoop3. After downloading, upload the archive to the server and extract it.
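For reference, the download and unpack steps can be scripted roughly like this. The Apache archive URL pattern below is an assumption (verify it against the official download page), and /home/pubserver matches the directory layout used throughout this article:

```shell
# Assumed Apache archive URL layout; verify against the Spark download page.
SPARK_VERSION=3.4.0
PKG="spark-${SPARK_VERSION}-bin-hadoop3"
URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${PKG}.tgz"

# On the server (network access required):
#   wget "$URL"
#   tar -zxvf "${PKG}.tgz" -C /home/pubserver/
echo "$URL"
```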
2. Configure passwordless SSH login
We skip this step here, since it has been covered many times on this site.
3. Configure environment variables
As in the earlier articles, we need to edit the /etc/profile file and add Spark's environment variables. The complete contents in our demo environment are as follows:

export JAVA_HOME=/usr/local/jdk1.8.0_271
export HADOOP_HOME=/home/pubserver/hadoop-3.3.5
export HBASE_HOME=/home/pubserver/hbase-2.5.3
export HIVE_HOME=/home/pubserver/hive3.1.3
export FLINK_HOME=/home/pubserver/flink-1.17.0
export SPARK_HOME=/home/pubserver/spark-3.4.0-bin-hadoop3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:${HBASE_HOME}/bin:${HIVE_HOME}/bin:${FLINK_HOME}/bin:${SPARK_HOME}/bin
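After editing /etc/profile, it is worth reloading it and sanity-checking that Spark's bin directory made it onto PATH. A minimal check, using this article's install paths:

```shell
# Simulates the relevant lines of /etc/profile; on the real server,
# run `source /etc/profile` instead of re-exporting by hand.
export SPARK_HOME=/home/pubserver/spark-3.4.0-bin-hadoop3
export PATH="$PATH:$SPARK_HOME/bin"

# Verify the Spark bin directory is now a PATH entry.
echo "$PATH" | tr ':' '\n' | grep -x "$SPARK_HOME/bin" && echo "spark bin on PATH"
```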
4. Configure Spark
Now let's configure Spark itself. Two files in the conf directory are involved:

spark-env.sh
slaves

First we create these two files from their templates:
cp spark-env.sh.template spark-env.sh
cp workers.template slaves

Both files now exist. (Note: Spark 3.x renamed conf/slaves to conf/workers; the old slaves name still works but is deprecated, so you can equally well copy workers.template to workers.)
First we edit the spark-env.sh file and put the following content into it:
#!/usr/bin/env bash
# (The Apache license header and the template's commented-out option
#  descriptions are omitted here for brevity.)

# Hadoop environment
export HADOOP_CONF_DIR=/home/pubserver/hadoop-3.3.5/etc/hadoop
export HADOOP_HOME=/home/pubserver/hadoop-3.3.5
export YARN_CONF_DIR=/home/pubserver/hadoop-3.3.5/etc/hadoop

export JAVA_HOME=/usr/local/jdk1.8.0_271
export SPARK_HOME=/home/pubserver/spark-3.4.0-bin-hadoop3

# Standalone master
export SPARK_MASTER_HOST=node1
export SPARK_MASTER_IP=node1
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8090
export MASTER=spark://node1:7077

# Worker resources
export SPARK_WORKER_MEMORY=8g
export SPARK_WORKER_CORES=4

# Where daemon pid files are stored
export SPARK_PID_DIR=/home/pubserver/spark-3.4.0-bin-hadoop3/pids

## ZooKeeper-based Spark HA; skip this if ZooKeeper is not installed
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node1 -Dspark.deploy.zookeeper.dir=/spark"
The stock template contains no active configuration at all; every line is commented out with #. What we actually add to spark-env.sh are the export lines above.
Next, configure the slaves file. Since this is a single machine, node1 is both the master and the worker, so we edit slaves and add node1.
The file initially contains localhost; remove it and add node1. If you have other worker hosts, add them here too, one per line.
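The edit can also be done non-interactively; a sketch (the here-doc simply replaces the default localhost entry — on the server, run this from inside $SPARK_HOME/conf):

```shell
# Overwrite the slaves file with one worker hostname per line.
# (Written to the current directory here; on the server this should
#  be executed inside $SPARK_HOME/conf.)
cat > slaves <<'EOF'
node1
EOF
cat slaves
```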
That completes the configuration of our standalone Spark.
5. Start Spark
With the configuration done, let's start Spark:

cd /home/pubserver/spark-3.4.0-bin-hadoop3/sbin/
./start-all.sh

After startup, two processes appear on the server:

Master
Worker

You can see them with the jps command.
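jps prints one line per JVM in the form `pid ClassName`, so filtering for the two daemons looks like this (the sample output is simulated below, since jps needs a running JDK toolchain; the PIDs will of course differ):

```shell
# Simulated jps output; on the server just run:  jps | grep -E 'Master|Worker'
printf '12345 Master\n12346 Worker\n23456 Jps\n' | grep -E 'Master|Worker'
```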
6. Test Spark
Now that Spark is running, let's test it. First, visit the master web UI on port 8090 (the SPARK_MASTER_WEBUI_PORT we configured above).
The page loads without any problem.
Next, let's test one of Spark's bundled example programs. Here we use the Pi calculation example; run the following commands:

cd /home/pubserver/spark-3.4.0-bin-hadoop3/examples/jars
spark-submit --master spark://node1:7077 --class org.apache.spark.examples.SparkPi spark-examples_2.12-3.4.0.jar 12
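SparkPi prints its estimate to the driver's stdout as a line beginning with "Pi is roughly", which is handy to filter for (the sample line below is simulated; on the cluster, pipe spark-submit's output instead):

```shell
# Simulated driver output; on the cluster run:
#   spark-submit ... spark-examples_2.12-3.4.0.jar 12 2>&1 | grep 'Pi is roughly'
echo "Pi is roughly 3.141592653589793" | grep 'Pi is roughly'
```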
It runs without any problem, and afterwards the job shows up in the web UI:
Next, let's test submitting to YARN, using the following commands:

cd /home/pubserver/spark-3.4.0-bin-hadoop3/examples/jars
spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi spark-examples_2.12-3.4.0.jar 12

After it runs, we can see the job in Hadoop's YARN UI.
That completes our Spark installation.
Notes:
1. We installed this single node following the cluster-style layout, so to scale out to a cluster you only need to set up passwordless SSH and add the new hosts to the slaves file.
2. Jobs submitted to the Spark standalone cluster show up in Spark's web UI after they run.
3. Jobs submitted to YARN do not appear in Spark's web UI; they can only be viewed in Hadoop's YARN UI.