Introduction
We at Goibibo rely heavily on open source technologies to manage our systems. One such technology is Apache Flume (https://flume.apache.org/), a distributed system for collecting, aggregating and moving large amounts of data from different sources to a centralized location. Flume can be used for transferring log data, social media data (e.g. tweets) and pretty much any other data we generate.
Our use case
Without going into too much detail about Flume itself, I will talk about how we use it at Goibibo and how it plays an integral part in our log aggregation, system health monitoring and critical alerting.
We have approximately 35 application servers from which data is transferred to our 13 Hadoop machines and finally on to various destinations. A high-level diagram looks something like this:

Ingestion
We stream various forms of log data from our application servers and aggregate the following logs using Flume (a simplified source configuration follows the list):
- Apache logs
- uWSGI logs
- Celery logs
- Custom application logs
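To make this concrete, the snippet below sketches how one of these log files could be tailed on an application server with a Flume exec source. This is only an illustration: the agent name, file path and channel sizing are assumptions, not our actual configuration.

```
# Illustrative Flume agent fragment on an application server (agent name,
# paths and capacities are made up). An exec source tails the Apache access
# log and writes events into a memory channel; the sinks that drain this
# channel are shown in the Collection section below.
app1.sources = apache-src
app1.channels = ch1

app1.sources.apache-src.type = exec
app1.sources.apache-src.command = tail -F /var/log/apache2/access.log
app1.sources.apache-src.channels = ch1

app1.channels.ch1.type = memory
app1.channels.ch1.capacity = 10000
app1.channels.ch1.transactionCapacity = 1000
```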
Collection
As shown in the flow diagram above, we stream all these logs from our application servers to the data nodes of our Hadoop cluster. To handle failover, we use Flume's failover sink processor: each application server is configured with sinks for multiple data nodes, each with a set priority. If the currently active data node fails, the next one in priority order automatically becomes active and starts collecting data from the application servers. One thing to note here is that an application server pushes data to only one data node at any point in time. This data is collected pretty much in real time across all the machines.
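Continuing the illustrative agent above, the failover behaviour can be expressed with a sink group that uses Flume's failover sink processor. Host names, ports and priority values here are made up for the example.

```
# Two Avro sinks on the application-server agent, each pointing at a
# different data node. The failover processor always routes events to the
# highest-priority sink that is healthy, so only one data node receives
# data from this server at any point in time.
app1.sinks = dn1 dn2

app1.sinks.dn1.type = avro
app1.sinks.dn1.channel = ch1
app1.sinks.dn1.hostname = datanode01.internal
app1.sinks.dn1.port = 4545

app1.sinks.dn2.type = avro
app1.sinks.dn2.channel = ch1
app1.sinks.dn2.hostname = datanode02.internal
app1.sinks.dn2.port = 4545

app1.sinkgroups = g1
app1.sinkgroups.g1.sinks = dn1 dn2
app1.sinkgroups.g1.processor.type = failover
# Higher number = higher priority; dn2 takes over only while dn1 is down.
app1.sinkgroups.g1.processor.priority.dn1 = 10
app1.sinkgroups.g1.processor.priority.dn2 = 5
app1.sinkgroups.g1.processor.maxpenalty = 10000
```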
Storage
Once the data is collected at the data nodes, we route it to various destinations depending on the use case.
- HDFS: Every single log we stream through Flume goes to Hadoop and is stored there. We also ensure that each log is stored in a proper format using interceptors. Logs in HDFS are stored under paths of the form /provider=$provider/years=$years/months=$months/days=$days/hours=$hours/. There are two specific advantages to this structure. First, it helps us drill down into specific logs when investigating issues. Second, it lets us create external tables in Hive over these directories and expose the logs through standard SQL-like queries. A sketch of such an HDFS sink configuration appears after this list.

- MongoDB: We route a subset of this data to MongoDB for near-real-time analysis. The data from Flume goes into specific capped collections (http://docs.mongodb.org/manual/core/capped-collections/), which various developers subscribe to for their specific logs. We also use these Mongo collections for alerting, sending emails and text messages to the relevant stakeholders based on the logs.
- Cloudera Search: We also provide an interface for our developers to search the logs using free-text queries. We index these logs in near real time using Cloudera Search (http://www.cloudera.com/content/cloudera/en/documentation/cloudera-search/v1-latest/Cloudera-Search-User-Guide/csug_introducing.html) and make them available through Hue (http://gethue.com/). We store only the last 10 days' data in these collections, and it is used heavily by our developers for debugging. Logs are indexed and made available for search within about 2 seconds of being generated. A sketch of a Flume-to-Solr sink configuration appears below.
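For the HDFS destination, a collector-side sink along the lines below produces the partitioned layout described in the HDFS bullet. This is a sketch with assumed agent, channel and base-path names; it also assumes an upstream interceptor has already put a provider header on each event.

```
# Illustrative HDFS sink on a data-node (collector) agent. %{provider} is
# read from an event header set by an interceptor; %Y/%m/%d/%H come from
# the event timestamp, producing paths like
# /data/logs/provider=web/years=2014/months=09/days=01/hours=10/
coll1.sinks = hdfs-sink
coll1.sinks.hdfs-sink.type = hdfs
coll1.sinks.hdfs-sink.channel = hdfs-ch
coll1.sinks.hdfs-sink.hdfs.path = /data/logs/provider=%{provider}/years=%Y/months=%m/days=%d/hours=%H
coll1.sinks.hdfs-sink.hdfs.fileType = DataStream
# Roll files every 5 minutes rather than by size or event count.
coll1.sinks.hdfs-sink.hdfs.rollInterval = 300
coll1.sinks.hdfs-sink.hdfs.rollSize = 0
coll1.sinks.hdfs-sink.hdfs.rollCount = 0
# Fall back to the local clock if events carry no timestamp header.
coll1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
```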
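For the Cloudera Search part, Flume's MorphlineSolrSink can push events through a morphline (which parses a raw log line into fields) and into Solr. Below is a minimal sketch, assuming a morphline file already exists on the collector and that this sink is added to the collector's sink list alongside the HDFS sink; the channel name and file paths are illustrative.

```
# Illustrative Solr sink on the same collector agent; channel name and
# morphline path are assumptions. The morphline file defines how each
# log line is parsed into indexed fields.
coll1.sinks.solr-sink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
coll1.sinks.solr-sink.channel = solr-ch
coll1.sinks.solr-sink.morphlineFile = /etc/flume-ng/conf/morphline.conf
coll1.sinks.solr-sink.morphlineId = morphline1
coll1.sinks.solr-sink.batchSize = 1000
coll1.sinks.solr-sink.batchDurationMillis = 1000
```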


Statistics
Here are some stats on the number of logs we transfer and index using Flume. These numbers are not as great as what Google or LinkedIn deal with, but they give a good insight into the scale we manage with a mere 13-node Hadoop cluster.
- Total logs transferred in a day: ~150 million
- Logs inserted into MongoDB per second: ~500
- Total logs indexed in a day: ~50 million
What next
In the next article I will share further details about our Flume agents and how they are connected to each other. I will also share some workarounds we had to put in place to make the system run smoothly and withstand failures.