Working With Apache Spark, Python and PySpark
1. Environment
· Hadoop Version: 3.1.0
· Apache Kafka Version: 1.1.1
· Operating System: Ubuntu 16.04
· Java Version: Java 8
2. Prerequisites
Apache Spark requires Java. To ensure that Java is installed, first update the operating system, then install it:
sudo apt-get update
sudo apt-get -y upgrade
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get install oracle-java8-installer
3. Installing Apache Spark
3.1. Download and install Spark
First, we need to create a directory for Apache Spark.
sudo mkdir /opt/spark
Then, we need to download the Apache Spark binaries package.
wget "http://www-eu.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz"
Next, we need to extract the Apache Spark files into the /opt/spark directory:
sudo tar xzvf spark-2.3.1-bin-hadoop2.7.tgz --directory=/opt/spark --strip 1
3.2. Configure Apache Spark
When Spark launches jobs it transfers its jar files to HDFS so they are available to all worker machines. These files are a large overhead on smaller jobs, so we package them up once, copy them to HDFS, and tell Spark it no longer needs to copy them for every job.
jar cv0f ~/spark-libs.jar -C /opt/spark/jars/ .
hdfs dfs -mkdir /spark-libs
hdfs dfs -put ~/spark-libs.jar /spark-libs/
After copying the files, we must point Spark to the archive in the spark-defaults configuration file so it no longer ships the jar files itself:
sudo gedit /opt/spark/conf/spark-defaults.conf
Add the following lines:
spark.master spark://localhost:7077
spark.yarn.preserve.staging.files true
spark.yarn.archive hdfs:///spark-libs/spark-libs.jar
In this article we will configure Apache Spark to run on a single node, so the list of workers will contain only localhost:
sudo gedit /opt/spark/conf/slaves
Make sure that it contains only the value localhost.
Before running the services, we must open the .bashrc file using gedit:
sudo gedit ~/.bashrc
And add the following lines:
export SPARK_HOME=/opt/spark
export SPARK_CONF_DIR=/opt/spark/conf
export SPARK_MASTER_HOST=localhost
Now, we have to start the Apache Spark services:
sudo /opt/spark/sbin/start-master.sh
sudo /opt/spark/sbin/start-slaves.sh
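If the services started correctly, the master web UI should be reachable at http://localhost:8080 and list one registered worker. As a quick programmatic check, once the Python environment described in sections 4 and 5 is in place, a minimal PySpark session can be opened against the standalone master (this is only a sketch; the application name test-connection is an arbitrary choice):
import findspark
findspark.init()
from pyspark.sql import SparkSession
# Connect explicitly to the standalone master started above
spark = SparkSession.builder.master("spark://localhost:7077").appName("test-connection").getOrCreate()
# Run a trivial job to confirm that the worker executes tasks
print(spark.sparkContext.parallelize(range(100)).sum())
spark.stop()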
4. Installing Python
4.1. Getting latest Python release
Ubuntu 16.04 ships with both Python 3 and Python 2 pre-installed. To make sure that our versions are up to date, we must update and upgrade the system with apt-get (as mentioned in the prerequisites section):
sudo apt-get update
sudo apt-get -y upgrade
We can check the version of Python 3 that is installed in the system by typing:
python3 -V
It should return the installed Python release (for example: Python 3.5.2).
4.2. Install Python utilities
To manage software packages for Python, we must install pip utility:
sudo apt-get install -y python3-pip
There are a few more packages and development tools to install to ensure that we have a robust set-up for our programming environment.
sudo apt-get install -y build-essential libssl-dev libffi-dev python3-dev
4.3. Building the environment
We need to first install the venv module, which allows us to create virtual environments:
sudo apt-get install -y python3-venv
Next, we have to create a directory for our environment:
mkdir testenv
Now we have to go to this directory and create the environment (all environment files will be created inside a directory that we call my_env):
cd testenv
python3 -m venv my_env
Once it finishes, we can check the environment files that were created using ls my_env.
To use this environment, you need to activate it:
source my_env/bin/activate
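To confirm that the environment is active, a quick sanity check (just a sketch) is to ask Python where its interpreter lives; both paths should point inside my_env:
import sys
# With the environment activated, these paths point inside my_env
print(sys.executable)
print(sys.prefix)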
5. Working with PySpark
5.1. Configuration
First, we need to open the .bashrc file:
sudo gedit ~/.bashrc
And add the following lines:
export PYTHONPATH=/usr/lib/python3.5
export PYSPARK_SUBMIT_ARGS="--master local[*] pyspark-shell"
export PYSPARK_PYTHON=/usr/bin/python3.5
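After reloading the shell (for example with source ~/.bashrc), the variables can be verified from Python. This is only a small sanity-check sketch:
import os
# These values should match what was added to ~/.bashrc
for name in ("PYTHONPATH", "PYSPARK_SUBMIT_ARGS", "PYSPARK_PYTHON"):
    print(name, "=", os.environ.get(name))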
5.2. FindSpark library
If we have Apache Spark installed on the machine, we don't need to install the pyspark library into our development environment. Instead, we need to install the findspark library, which is responsible for locating the pyspark library installed with Apache Spark.
pip3 install findspark
In each Python script file we must add the following lines:
import findspark
findspark.init()
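If SPARK_HOME is not exported in the current shell, the installation path can also be passed to findspark explicitly. A minimal sketch, assuming the /opt/spark layout used in section 3:
import findspark
# The path argument is optional when SPARK_HOME is already set
findspark.init('/opt/spark')
import pyspark
# With the package downloaded above, this should print 2.3.1
print(pyspark.__version__)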
5.3. PySpark example
5.3.1. Reading from HDFS
The following script reads a file stored in HDFS:
import findspark
findspark.init()
from pyspark.sql import SparkSession
# Create (or reuse) a Spark session for this application
sparkSession = SparkSession.builder.appName("example-pyspark-hdfs").getOrCreate()
# Load the CSV file from HDFS into a DataFrame and display its contents
df_load = sparkSession.read.csv('hdfs://localhost:9000/myfiles/myfilename')
df_load.show()
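Once the DataFrame is loaded, the usual DataFrame API can be applied. A short continuation of the script above (the column name _c0 is simply the default that Spark assigns when the CSV file has no header row):
# Inspect the inferred schema and count the rows
df_load.printSchema()
print(df_load.count())
# Select the first column and display the first five rows
df_load.select('_c0').show(5)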
5.3.2. Reading from Apache Kafka consumer
We first must add the spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar library to our Apache Spark jars directory /opt/spark/jars. We can download it from the Maven repository.
The following code reads messages from a Kafka topic and prints them line by line:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
KAFKA_TOPIC = 'KafkaTopicName'
KAFKA_BROKERS = 'localhost:9092'  # not used by the ZooKeeper-based receiver below
ZOOKEEPER = 'localhost:2181'
# Local Spark context and a streaming context with 60-second batches
sc = SparkContext('local[*]', 'test')
ssc = StreamingContext(sc, 60)
# Receiver-based stream: connect through ZooKeeper with the consumer group
# 'spark-streaming' and read the topic with a single receiver thread
kafkaStream = KafkaUtils.createStream(ssc, ZOOKEEPER, 'spark-streaming', {KAFKA_TOPIC: 1})
# Each element is a (key, message) pair; keep only the message value
lines = kafkaStream.map(lambda x: x[1])
lines.pprint()
ssc.start()
ssc.awaitTermination()
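The DStream returned by KafkaUtils can also be transformed before printing. As a small sketch, the following lines would replace the lines.pprint() call above to count words within every 60-second batch:
# Split each message into words and count them per batch
counts = lines.flatMap(lambda line: line.split(' ')) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()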