What is Flume?
Flume is a reliable, distributed service for collecting and aggregating large amounts of streaming data into HDFS. Many big data analysts use Apache Flume to push data from sources such as Twitter, Facebook, and LinkedIn into Hadoop, Storm, Solr, Kafka, and Spark.
Why do we use Flume?
Hadoop developers most often use this tool to bring log data from social media sites into Hadoop. It was originally developed by Cloudera, and is now an Apache project, for aggregating and moving very large amounts of data. Its primary use is to gather log files from different sources and asynchronously persist them in the Hadoop cluster.
What is a Flume agent?
A Flume agent is a JVM process that hosts the Flume core components (source, channel, sink) through which events flow from an external source, such as a web server, to a destination such as HDFS. The agent is the heart of Apache Flume.
What is a Flume event?
A Flume event is a unit of data: a byte payload with an optional set of string attributes. An external source, such as a web server, sends events to the Flume source in a format that the source understands. For example, an Avro client sends events to an Avro source.
Typically each log record, not a whole log file, becomes one event. Each event has headers and a body: the headers are key-value pairs of strings, and the body carries the data itself.
What are Flume's core components?
Source, channel, and sink are the core components in Apache Flume.
When a Flume source receives an event from an external source, it stores the event in one or more channels.
A Flume channel temporarily stores the event and keeps it until it is consumed by a Flume sink. It acts as Flume's repository.
A Flume sink removes the event from the channel and puts it into an external repository such as HDFS, or forwards it to the next Flume agent.
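As a rough illustration of how the three components fit together, the following is a minimal sketch of a single-agent configuration. The agent name a1, the component names r1, c1, and k1, and the port number are placeholders chosen for this example. The netcat source and logger sink are used only because they need no external systems.

```properties
# Name the components of agent a1 (all names are placeholders)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for lines of text on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffers events in memory until the sink takes them
a1.channels.c1.type = memory

# Sink: writes each event to the agent's log instead of HDFS
a1.sinks.k1.type = logger

# Wiring: the source puts events into c1, the sink drains c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Assuming this is saved as example.conf, an agent like this would typically be started with: bin/flume-ng agent --conf conf --conf-file example.conf --name a1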
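Can Flume provide 100% reliability to the data flow?
Yes, it provides end-to-end reliability of the flow. By default Flume uses a transactional approach to the data flow. Sources and sinks store and retrieve events inside transactions provided by the channel, so an event is removed from a channel only after it has been stored in the next channel or in the final destination. The channels are therefore responsible for passing events reliably from end to end. How strong the guarantee is depends on the channel type: a file channel keeps events on disk, whereas a memory channel loses its queued events if the agent process dies.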
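Because the channel is what carries the reliability guarantee, one common way to favour durability is a file channel, which keeps events on disk. The fragment below is a sketch only: the agent name a1, the channel name c1, and both directory paths are assumptions made for illustration.

```properties
# Durable channel for agent a1 (names and paths are hypothetical)
a1.channels = c1
a1.channels.c1.type = file

# Where the channel stores its checkpoint
a1.channels.c1.checkpointDir = /var/flume/checkpoint

# Where the channel stores the event data itself
a1.channels.c1.dataDirs = /var/flume/data
```

The trade-off is throughput: a memory channel is faster, but it cannot survive an agent restart.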
Can you explain the configuration files?
The agent configuration is stored in a local configuration file. It contains the source, sink, and channel information for each agent.
Each core component (source, sink, and channel) has a name, a type, and a set of type-specific properties.
For example, an Avro source needs a hostname and a port number to receive data from an external client.
A memory channel should have a maximum queue size, given as its capacity.
An HDFS sink should have the file system URI, the path in which to create files, the frequency of file rotation, and other settings.
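The properties listed above might look like this in practice. This is a sketch under assumptions: the agent and component names, the port, the capacity values, and the HDFS URI are invented for the example. Note that the Avro source takes its hostname through the bind property.

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro source: address and port on which to accept events from clients
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# Memory channel: capacity is the maximum number of queued events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# HDFS sink: file system URI and path, plus file rotation settings
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k1.hdfs.filePrefix = events
# Roll to a new file every 300 seconds
a1.sinks.k1.hdfs.rollInterval = 300
# A value of 0 disables rolling by size and by event count
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```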
What are the complicated steps in Flume configuration?
Flume processes streaming data, so once started the process has no natural end. It moves data asynchronously from the source to HDFS through the agent. First of all, the agent must know its individual components and how they are connected in order to load data, so the configuration is what triggers the loading of streaming data. For example, consumerKey, consumerSecret, accessToken, and accessTokenSecret are the key settings for downloading data from Twitter.
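A Twitter source using those four keys could be configured roughly as follows. This sketch assumes the experimental Twitter source class that ships with Apache Flume; many tutorials use a different, separately distributed class. The agent and component names are placeholders, and the key values must be replaced with credentials from your own Twitter application.

```properties
a1.sources = twitter
a1.channels = c1

# Experimental Twitter source bundled with Apache Flume (assumed here)
a1.sources.twitter.type = org.apache.flume.source.twitter.TwitterSource
a1.sources.twitter.channels = c1

# OAuth credentials: placeholders, not real values
a1.sources.twitter.consumerKey = YOUR_CONSUMER_KEY
a1.sources.twitter.consumerSecret = YOUR_CONSUMER_SECRET
a1.sources.twitter.accessToken = YOUR_ACCESS_TOKEN
a1.sources.twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET

a1.channels.c1.type = memory
```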
What are the important steps in the configuration?
The configuration file is the heart of an Apache Flume agent.
Every source must have at least one channel.
Every sink must have exactly one channel.
Every component must have a specific type.
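These three rules show up directly in the property names: a source lists its channels in the plural, while a sink names a single channel. The sketch below uses invented names (a1, r1, c1, c2, k1, k2) and simple component types purely for illustration.

```properties
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# Every component declares a type
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c2.type = memory
a1.sinks.k1.type = logger
a1.sinks.k2.type = logger

# A source may feed one or more channels: note "channels"
a1.sources.r1.channels = c1 c2

# A sink drains exactly one channel: note "channel"
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```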
Does Apache Flume support third-party plugins?
Yes, Flume has a fully plugin-based architecture. Besides its built-in components, it can load sources, sinks, and other components that ship separately from Flume, and use them to move data from external sources to external destinations. This is why many big data analysts use this tool for streaming data.
Can you explain Consolidation in Flume?
The beauty of Flume is consolidation: it collects data from different sources, even from different Flume agents. A Flume source can collect the data flows from several sources and pass them through a channel and a sink, which finally sends the data to HDFS or another target destination.
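One common way to build such a consolidation is a two-tier flow in which first-tier agents forward events over Avro to a collector agent. The following is a sketch only: the agent names web and collector, the host name, the port, the log path, and the HDFS URI are all hypothetical.

```properties
# --- First-tier agent "web", one per web server ---
web.sources = tail
web.channels = mem
web.sinks = toCollector

# Follow a local log file (the path is hypothetical)
web.sources.tail.type = exec
web.sources.tail.command = tail -F /var/log/app/access.log
web.sources.tail.channels = mem

web.channels.mem.type = memory

# Avro sink: forwards events to the collector's Avro source
web.sinks.toCollector.type = avro
web.sinks.toCollector.hostname = collector.example.com
web.sinks.toCollector.port = 4545
web.sinks.toCollector.channel = mem

# --- Second-tier agent "collector" ---
collector.sources = fromWeb
collector.channels = mem
collector.sinks = toHdfs

# Avro source: receives events from every first-tier agent
collector.sources.fromWeb.type = avro
collector.sources.fromWeb.bind = 0.0.0.0
collector.sources.fromWeb.port = 4545
collector.sources.fromWeb.channels = mem

collector.channels.mem.type = memory

# Consolidated output to HDFS
collector.sinks.toHdfs.type = hdfs
collector.sinks.toHdfs.hdfs.path = hdfs://namenode:8020/flume/weblogs
collector.sinks.toHdfs.channel = mem
```

Both agents can live in one file because the agent name given at startup selects which set of properties is used.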
Can Flume distribute data to multiple destinations?
Yes, it supports multiplexing flows. An event flows from one source to multiple channels and on to multiple destinations. This is achieved by defining a flow multiplexer.
For example, the data can be replicated to HDFS and to a second sink with a different destination, while a third sink feeds another agent.
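The fan-out described above could be sketched as one source replicating into three channels, each drained by its own sink. All names, hosts, and ports here are assumptions, and the logger sink merely stands in for the second destination.

```properties
a1.sources = r1
a1.channels = c1 c2 c3
a1.sinks = toHdfs toOther toNextAgent

a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Replicating selector: every event is copied to all three channels
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3

a1.channels.c1.type = memory
a1.channels.c2.type = memory
a1.channels.c3.type = memory

# Destination 1: HDFS
a1.sinks.toHdfs.type = hdfs
a1.sinks.toHdfs.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.toHdfs.channel = c1

# Destination 2: stand-in for some other sink type
a1.sinks.toOther.type = logger
a1.sinks.toOther.channel = c2

# Destination 3: input to another agent's Avro source
a1.sinks.toNextAgent.type = avro
a1.sinks.toNextAgent.hostname = next-agent.example.com
a1.sinks.toNextAgent.port = 4545
a1.sinks.toNextAgent.channel = c3
```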
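Do agents communicate with other agents?
No, each agent runs independently and there is no central coordinator, although one agent's sink can forward events to the next agent's source. Flume can easily scale horizontally. As a result there is no single point of failure.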
What are interceptors?
This is one of the most frequently asked Flume interview questions. Interceptors are used to modify or filter events in flight between the source and the channel. They can drop unnecessary log events or keep only targeted ones. Depending on requirements, you can chain any number of interceptors.
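Interceptors are declared on a source and run in the order in which they are listed. The fragment below is a sketch that assumes an agent a1 with a source r1 already defined elsewhere; the interceptor names i1 and i2 and the regular expression are invented for the example.

```properties
# Chain two interceptors on source r1; they run in this order
a1.sources.r1.interceptors = i1 i2

# i1: adds a timestamp header to every event
a1.sources.r1.interceptors.i1.type = timestamp

# i2: drops every event whose body matches the regex
a1.sources.r1.interceptors.i2.type = regex_filter
a1.sources.r1.interceptors.i2.regex = .*DEBUG.*
a1.sources.r1.interceptors.i2.excludeEvents = true
```

Setting excludeEvents to false would invert the filter, so that only matching events are kept.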
What are channel selectors?
Channel selectors control how events are separated and allocated to particular channels. There are two types: replicating (the default) and multiplexing. A replicating channel selector copies each event to multiple or all channels.
A multiplexing channel selector separates and routes events based on the event's header information. That is, depending on the destination of the sink behind each channel, events are gathered into the matching channel for that sink.
For example, if one sink is connected to Hadoop, another to S3, and another to HBase, a multiplexing channel selector can separate the events and route each one to the appropriate sink.
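A multiplexing selector for the three-destination case might be configured as in the sketch below. It assumes an agent a1 whose source r1 feeds channels c1, c2, and c3, each drained by the sink for one destination. The header name destination and its values are hypothetical; events would need to carry that header, for instance set by an interceptor or by the sending client.

```properties
a1.sources.r1.channels = c1 c2 c3

# Route each event by the value of its "destination" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = destination

# Header value -> channel (each channel is drained by one sink)
a1.sources.r1.selector.mapping.hadoop = c1
a1.sources.r1.selector.mapping.s3 = c2
a1.sources.r1.selector.mapping.hbase = c3

# Events with a missing or unmapped header value go here
a1.sources.r1.selector.default = c1
```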
What are sink processors?
Sink processors are a mechanism by which you can set up failover and load balancing across a group of sinks.
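Sink processors are configured on a sink group. The sketch below shows a failover group and assumes that sinks k1 and k2 are already defined on agent a1; the group name g1 and the priority values are placeholders.

```properties
# Group two existing sinks under one processor
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Failover: the available sink with the highest priority is used
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5

# Upper limit, in milliseconds, on the backoff for a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000

# For load balancing instead, the processor lines above would be
# replaced by something like:
# a1.sinkgroups.g1.processor.type = load_balance
# a1.sinkgroups.g1.processor.selector = round_robin
# a1.sinkgroups.g1.processor.backoff = true
```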
Have you installed Flume? What are the major problems you have faced?
- First of all, install Java version 1.6 or 1.7.
- Sufficient memory and disk space are required: 4 GB of RAM, plus as much disk space as the data flow demands.
- The agent should have read/write permissions; the administrator takes care of this.
- The most complicated part of Flume is configuring the source, with settings such as type, channels, consumerKey, consumerSecret, accessToken, and accessTokenSecret. I faced many problems collecting those keys, but succeeded eventually.
More information: see the Flume User Guide.