Wednesday, June 7, 2017

Posted by beni June 07, 2017

A Brief Intro of Apache Storm


Hi Friends,

This post is about Apache Storm and How to install Apache Storm 0.9.4 (latest version) on Windows 7 & Ubuntu 14.04.2 with an Hello World program.

Overview of Apache Storm
Apache Storm is a free and open source distributed system for real-time computation. Storm is implemented in Clojure & Storm APIs are written in Java by Nathan Marz & team at BackType. Storm is best for distributed processing where unbounded streams of  real time data computation requires. In contrast to batch systems like Hadoop, which often introduce delay of hours, Storm allows us to process online data.Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size
About Nathan Marz
Nathan Marz was the lead engineer on Twitters Publisher Analytics team. Previously Nathan was the lead engineer of BackType which was acquired by Twitter in July of 2011. On march 2013 he left Twitter to start-up his own company He is a major believer in the power of open source and has authored some significant open source projects, including Cascalog, ElephantDB, and Storm. He writes a blog at http://nathanmarz.com.

A Brief History of Apache Storm
In 2010 Apache Storm was a nothing than an idea of Nathan Marz, But now storm has been adopted by many of the worlds largest companies including Yahoo!, Twitter, Microsoft and many more.

Before Storm, BackType built an analytic product to help businesses understand their impact on social media both historically and real time using Queues and Workers approach. Oftentimes these workers would send messages through another set of queues to another set of workers for further processing. Where most portion of code are for sending/receiving messages and serializing/deserializing messages and with small portion of actual business logic.

In December 2010, Nathan Marz realized the complexity of product and come up with idea of "Stream" as distributed abstraction with Top-level abstraction "Topology", a network of spouts and bolts. Spout produces brand new streams and Bolt takes in streams as input by subscribe to whatever streams they need to process and produces streams as output. The key insight is that spout and bolt are inherently parallel. Then he tested this abstraction against his test case and he wanted to validate large set of use cases, so he tweeted as
"Im working on new kind of stream processing system. If this sounds interesting to you, ping me. I want to learn your use cases"
Most of people responded with their use-cases, Then he Initially started designing Storm with RabbitMQ for Intermediate messages. Then without intermediate messages, Nathan Marz developed an algorithm based on random numbers and XORs that would only require 20bytes to track each spout tuple to solve massage processing guarantee. Unlike Hadoop Zombie processes which would not shut down at idle, Storm never has this Zombie process issue and Storm has been designed to be "Process Fault-Tolerant". Storm has bean open sourced on July 2011 and graduated to a top-level Apache project on September 17th, 2014.

Storm Application
A Storm application is designed as a topology in the shape of DAG(Directed ACyclic Graph) with Spouts and Bolts acting as graph vertices. Edges on the graph are data stream between nodes

Here are some typical �prevent� and �optimize� use cases for Storm.

�Prevent� Use Cases�Optimize� Use Cases
Financial   Services
  • Securities fraud
  • Operational risks & compliance violations
  • Order routing
  • Pricing
Telecom
  • Security breaches
  • Network outages
  • Bandwidth allocation
  • Customer service
Retail
  • Shrinkage
  • Stock outs
  • Offers
  • Pricing
Manufacturing
  • Preventative maintenance
  • Quality assurance
  • Supply chain optimization
  • Reduced plant downtime
Transportation
  • Driver monitoring
  • Predictive maintenance
  • Routes
  • Pricing
Web
  • Application failures
  • Operational issues
  • Personalized content
Charactrastics of Storm
  • Fast � benchmarked as processing one million 100 byte messages per second per node
  • Scalable � with parallel calculations that run across a cluster of machines
  • Fault-tolerant � when workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node.
  • Reliable � Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
  • Easy to operate � standard configurations are suitable for production on day one. Once deployed, Storm is easy to operate
Use cases of Storm

  • Processing Streams (No need of intermediate Queues)
  • Continous Computation
  • Distributed RPC

Spout
A Spout is a source of streams in a computation. Spout can read from queue like Kafka , can generate its own stream or read Twitter streaming.

Bolt
A bolt processes any number of input streams and produces any number of new output streams.
Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.

Topology
A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt. A topology is an arbitrarily complex multi-stage stream computation. Topologies run indefinitely when deployed.
A topology can have one spout and multiple bolt
A topology is arrangement of spout and bolt

Understanding Strom Architecture
Storm cluster has 3 sets of nodes
1 Nimbus Node
2.Zookeeper Nodes
3.Supervisor Nodes

Search