In this post I will walk you through Apache Zeppelin and explain how it works, using a sample analysis of CPSE (Central Public Sector Enterprises) data provided by https://data.gov.in/. Apache Zeppelin is an open-source, web-based notebook that enables fast, interactive data analytics. It provides a data processing and analytics environment built around the concept of interpreters, and serves as a multi-purpose notebook covering the whole analytics workflow, from data ingestion through data visualization and collaboration. It ships with many built-in interpreters, such as Scala, Python, Apache Spark, Spark SQL, Hive, Markdown, and Shell, and gives you a single view of data across a diverse set of data stores, from Hive tables to streaming data, which makes it well suited to exploratory data analytics.
Apache Zeppelin has different components:
- Client: the client is the browser-based notebook UI; its familiar notebook-style interface makes it easy for a novice user to learn quickly
- Server: the Zeppelin server is a web service; the client communicates with it through a REST API, and the server in turn sends processing tasks to the different interpreters
- Interpreter: interpreters are the actual data processing engines, for example Apache Spark, Hive, or Cassandra
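This client–server split is easy to see in practice: any HTTP client can talk to the Zeppelin server's REST API, not just the browser UI. For example, assuming the server is running locally on its default port, listing the existing notes looks roughly like this (the exact response shape can vary between Zeppelin versions):

```shell
# Ask the running Zeppelin server for the list of notes via its REST API.
curl http://localhost:8080/api/notebook
```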
Using Apache Zeppelin:
Following are the steps needed to create a simple analytical report:
- To start Zeppelin, run the “zeppelin.sh” script, which brings up the Zeppelin server.
- This starts the Zeppelin server and serves the UI on port 8080. The following figure shows the Zeppelin landing page:
- Now let’s move ahead and create a new notebook by clicking “Create new notebook” under the Notebook tab. This brings up a new, empty notebook, as shown below:
- To start with, we will use the “md” (Markdown) interpreter to create a heading for the first paragraph, as follows:
%md ###First Analytical Chart
- Now let’s crunch some data with the Apache Spark interpreter and register a temporary table that can then be queried with Spark SQL. Following is the code for this step.
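A minimal sketch of such a Spark paragraph, assuming the CPSE data has been downloaded as a CSV file (the file path, field names, and column positions below are all assumptions — adjust them to the actual CPSE CSV from data.gov.in):

```scala
// Runs in Zeppelin's default Spark interpreter (Spark 1.x style).
import sqlContext.implicits._

// Assumed local path to the CPSE CSV downloaded from data.gov.in.
val raw = sc.textFile("/tmp/cpse.csv")
val header = raw.first()

// Assumed columns: enterprise name, paid-up capital, profit/loss, net worth.
case class Enterprise(name: String, paidUpCapital: Double,
                      profitLoss: Double, netWorth: Double)

val enterprises = raw.filter(_ != header).map { line =>
  val cols = line.split(",")
  Enterprise(cols(0), cols(1).toDouble, cols(2).toDouble, cols(3).toDouble)
}.toDF()

// Register the DataFrame as a temporary table so later %sql
// paragraphs can query it.
enterprises.registerTempTable("cpse")
```

Note that registerTempTable only makes the table visible within the current Spark context; on newer Spark versions the equivalent call is createOrReplaceTempView.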
- Once the temporary table is created, we can build charts simply by querying the data with Spark SQL. So in the next paragraph, switch to the Spark SQL interpreter by typing “%sql”, and use an analytical query to examine the relationship between each enterprise’s paid-up capital, its profit and loss, and its net worth.
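As an illustration, a %sql paragraph along these lines would compare the three measures. The table and column names assume the temporary table was registered as cpse with those fields, so match them to your own schema:

```sql
%sql
-- Compare paid-up capital, profit/loss and net worth per enterprise.
-- Table and column names are assumptions; match them to your temp table.
SELECT name, paidUpCapital, profitLoss, netWorth
FROM cpse
ORDER BY netWorth DESC
LIMIT 20
```

Zeppelin renders the result of a %sql paragraph as an interactive table, with one-click switching to bar, pie, line, and scatter charts.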
- Once you run this paragraph, you will see the resulting chart: