Working With Apache Spark, Python and PySpark
1. Environment

· Hadoop Version: 3.1.0
· Apache Kafka Version: 1.1.1
· Operating System: Ubuntu 16.04
· Java Version: Java 8

2. Prerequisites

Apache Spark requires Java. To make sure Java is installed, first update the operating system, then install Java 8:

    sudo apt-get update
    sudo apt-get -y upgrade
    sudo add-apt-repository -y ppa:webupd8team/java
    sudo apt-get install oracle-java8-installer

3. Installing Apache Spark

3.1. Download and install Spark

First, create a directory for Apache Spark:

    sudo mkdir /opt/spark

Then, download the Apache Spark binaries package:

    wget http://www-eu.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz

Next, extract the Apache Spark files into the /opt/spark directory:

    sudo tar xzvf spark-2.3.1-bin-hadoop2.7.tgz --directory=/opt/spark --strip-components=1

Two short sketches follow at the end of this section: one for putting Spark on the PATH and one for smoke-testing the installation.

3.2. Configure Apache Spark

When Spark launches jobs it transfers its
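With the binaries unpacked into /opt/spark (section 3.1), a common convenience step is to export SPARK_HOME and put Spark's bin directory on the PATH, so that spark-shell, spark-submit, and pyspark can be launched from any directory. This step is not part of the instructions above, so treat the profile file and exact lines as assumptions; a minimal sketch for a bash user:

    # Assumption: SPARK_HOME points at the /opt/spark directory created above;
    # adjust the path if the archive was extracted somewhere else.
    echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
    echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc

    # Reload the profile so the variables take effect in the current shell.
    source ~/.bashrc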
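To confirm that the installation works end to end, a quick smoke test is to print the Spark version (which also exercises the Java runtime installed in section 2) and run one of the examples bundled with the distribution. The paths below assume the /opt/spark layout from section 3.1:

    # Print the Spark build version; a failure here usually points to a Java problem.
    /opt/spark/bin/spark-submit --version

    # Run the bundled SparkPi example with 10 partitions; on success it logs
    # an approximation of Pi.
    /opt/spark/bin/run-example SparkPi 10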