Spark Installation Notes

System Overview

The machine is an Alibaba Cloud Hong Kong lightweight instance, reinstalled with Debian 9 via Vicer's DD script. The server specs are as follows:

CPU model            : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Number of cores      : 1
CPU frequency        : 2500.042 MHz
Total size of Disk   : 24.5 GB (3.6 GB Used)
Total amount of Mem  : 996 MB (163 MB Used)
Total amount of Swap : 1021 MB (1 MB Used)
System uptime        : 5 days, 20 hour 40 min
Load average         : 0.35, 0.09, 0.03
OS                   : Debian GNU/Linux 9
Arch                 : x86_64 (64 Bit)
Kernel               : 4.9.0-8-amd64
ip                   : ***.***.***.***
ipaddr               : 中国 香港   阿里云
vm                   : kvm

Installing Scala

apt-get update
apt-get install scala
root@debian:~# scalac -version
cat: /release: No such file or directory
Scala compiler version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
root@debian:~# scala -version
cat: /release: No such file or directory
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

A minor gripe: the flag is -version rather than --version.
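Beyond printing the version, a quick illustrative check that the runner actually executes code (the -e flag evaluates an expression passed on the command line):

scala -e 'println(List(1, 2, 3).sum)'
# expected output: 6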

Installing Spark

Spark can of course still be installed via apt-get, but to control the exact version it is better to download it from the official site.
Official download page
Pick the matching version and package type and download it.

wget http://mirror-hk.koddos.net/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
mkdir /usr/local/spark
tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz -C /usr/local/spark
export SPARK_HOME=/usr/local/spark/spark-2.4.0-bin-hadoop2.7
export PATH=${SPARK_HOME}/bin:$PATH

That completes the basic installation. Note that the Spark environment variables above only apply to the current shell; to make them permanent, add them to ~/.bashrc, for example as shown below.
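A minimal sketch of persisting the variables, assuming the same install path as above:

echo 'export SPARK_HOME=/usr/local/spark/spark-2.4.0-bin-hadoop2.7' >> ~/.bashrc
echo 'export PATH=${SPARK_HOME}/bin:$PATH' >> ~/.bashrc
source ~/.bashrc   # reload so the current shell picks up the change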

root@debian:~# spark-shell
2018-12-04 01:32:01 WARN  Utils:66 - Your hostname, debian resolves to a loopback address: 127.0.1.1; using 172.17.56.142 instead (on interface ens3)
2018-12-04 01:32:01 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-12-04 01:32:01 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://172.17.56.142:4040
Spark context available as 'sc' (master = local[*], app id = local-1543905134960).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
root@debian:~# pyspark
Python 2.7.13 (default, Sep 26 2018, 18:42:22)
[GCC 6.3.0 20170516] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-12-04 01:32:48 WARN  Utils:66 - Your hostname, debian resolves to a loopback address: 127.0.1.1; using 172.17.56.142 instead (on interface ens3)
2018-12-04 01:32:48 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-12-04 01:32:48 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 2.7.13 (default, Sep 26 2018 18:42:22)
SparkSession available as 'spark'.
>>>
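Beyond launching the REPLs, a quick way to confirm that jobs actually run is to submit one of the bundled examples; run-example lives in $SPARK_HOME/bin, which is already on PATH here:

run-example SparkPi 10
# or pipe a one-liner through spark-shell:
echo 'sc.parallelize(1 to 100).reduce(_ + _)' | spark-shell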

Configuration

Network Configuration

If you want to expose Spark to the public network, there is an Alibaba Cloud quirk: no matter how you open the firewall and security-group rules, the Spark Web UI stays unreachable. The fix is to map the hostname to the private IP in /etc/hosts:

vim /etc/hosts
<private IP> localhost
<private IP> debian
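Alternatively, as the startup warnings above suggest, the bind address can be set explicitly via SPARK_LOCAL_IP. A minimal sketch, assuming the default conf directory shipped with the tarball:

cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
echo 'export SPARK_LOCAL_IP=<private IP>' >> $SPARK_HOME/conf/spark-env.sh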

Connecting with Jupyter

Jupyter needs no introduction; this section only records its installation and the relevant configuration.

pip3 install jupyter
jupyter notebook --generate-config --allow-root
# this prints: Writing default config to: /root/.jupyter/jupyter_notebook_config.py
vim ~/.jupyter/jupyter_notebook_config.py

Change the following setting:

c.NotebookApp.ip='0.0.0.0'

This enables access from the public network, but it is also a serious security risk, so add a password:

jupyter notebook password
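To drive Spark from a notebook rather than the plain pyspark shell, one common approach (an assumption here, not part of the original setup) is to let pyspark launch Jupyter as its driver:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0 --allow-root'
pyspark
# new notebooks then have `sc` and `spark` predefined;
# if Jupyter was installed with pip3, setting PYSPARK_PYTHON=python3 may also be
# needed so the workers use the same Python major version as the driver.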

Spark Resources

Official documentation
databricks
