
Showing posts from 2018

Working With Apache Spark, Python and PySpark

1. Environment
·  Hadoop Version: 3.1.0
·  Apache Kafka Version: 1.1.1
·  Operating System: Ubuntu 16.04
·  Java Version: Java 8

2. Prerequisites
Apache Spark requires Java. To make sure Java is installed, first update the operating system, then install it:

sudo apt-get update
sudo apt-get -y upgrade
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get install oracle-java8-installer

3. Installing Apache Spark
3.1. Download and install Spark
First, create a directory for Apache Spark:

sudo mkdir /opt/spark

Then download the Apache Spark binary package:

wget “ ”

Next, extract the Apache Spark files into the /opt/spark directory:

sudo tar xzvf spark-2.3.1-bin-hadoop2.7.tgz --directory=/opt/spark --strip-components 1

3.2. Configure Apache Spark
When Spark launches jobs it transfers its
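The extraction command in section 3.1 drops the archive's top-level folder while unpacking, so the Spark files land directly in /opt/spark instead of in a nested spark-2.3.1-bin-hadoop2.7 folder. The following is only a minimal sketch of that behavior, assuming GNU tar and its --strip-components option. It uses a tiny stand-in archive in a temporary directory rather than the real Spark download, and the names demo_root and spark-demo are made up for illustration.

```shell
#!/bin/sh
# Hypothetical demo: a tiny stand-in archive, NOT the real Spark tarball.
# Everything lives in a throwaway temporary directory.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/spark-demo/bin" "$demo_root/dest"
echo 'placeholder' > "$demo_root/spark-demo/bin/spark-shell"

# Pack it so that every path inside starts with the folder "spark-demo/".
tar -czf "$demo_root/spark-demo.tgz" -C "$demo_root" spark-demo

# Extract into the destination and drop the first path component,
# mirroring the --directory and strip options used in section 3.1.
tar -xzf "$demo_root/spark-demo.tgz" --directory="$demo_root/dest" --strip-components=1

# bin/ now sits directly under dest/, with no spark-demo/ folder in between.
ls "$demo_root/dest"   # prints: bin
```

Without the strip option, the same extraction would create dest/spark-demo/bin, and any path that expects /opt/spark/bin would then have to include the extra folder name.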

Apache Kafka and Flume installation guide (import data from Kafka to HDFS)

This article is a complete guide to installing Apache Kafka, creating Kafka topics, and publishing and subscribing to topic messages. It also covers installing Apache Flume and importing Kafka topic messages into HDFS with Apache Flume.

1. Environment
·  Hadoop Version: 3.1.0
·  Apache Kafka Version: 1.1.1
·  Apache Flume Version: 1.8.0
·  Operating System: Ubuntu 16.04
·  Java Version: Java 8

2. Prerequisites
2.1. Install Java
Apache Kafka requires Java. To make sure Java is installed, first update the operating system, then install it:

sudo apt-get update
sudo apt-get upgrade
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get install oracle-java8-installer

2.2. Install ZooKeeper
Apache Kafka requires the ZooKeeper service, which it uses to track its nodes' heartbeats, store its configuration, and elect leaders.

sudo apt-get insta