Working With Apache Spark, Python and PySpark

1. Environment

·  Hadoop Version: 3.1.0
·  Apache Kafka Version: 1.1.1
·  Operating System: Ubuntu 16.04
·  Java Version: Java 8

2. Prerequisites

Apache Spark requires Java. To ensure that Java is installed, first update the operating system, then install it:

sudo apt-get update
sudo apt-get -y upgrade
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get install oracle-java8-installer

A quick way to verify the Java installation is sketched at the end of this post.

3. Installing Apache Spark

3.1. Download and install Spark

First, we need to create a directory for Apache Spark:

sudo mkdir /opt/spark

Then, we need to download the Apache Spark binaries package:

wget "http://www-eu.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz"

Next, we need to extract the Apache Spark files into the /opt/spark directory:

sudo tar xzvf spark-2.3.1-bin-hadoop2.7.tgz --directory=/opt/spark --strip-components=1

A minimal smoke test for the extracted installation is also sketched at the end of this post.

3.2. Configure Apache Spark

When Spark launches jobs it transfers its
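As referenced in the prerequisites step above, here is a quick way to confirm that Java 8 is actually available before installing Spark. This is a sanity check added for convenience, not part of the original installation steps:

java -version
# Should report a 1.8.x runtime, e.g.: java version "1.8.0_181"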
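Likewise, once the archive has been extracted into /opt/spark, a minimal smoke test is to ask the bundled launcher scripts for their version and open an interactive PySpark shell. The SPARK_HOME and PATH exports below are a common convention assumed here, not something this post has configured yet; the configuration section may set things up differently:

export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"

spark-submit --version   # prints the Spark 2.3.1 build information
pyspark                  # opens an interactive PySpark (Python) shell; exit with Ctrl+D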