To maintain the high network availability needed to serve all LinkedIn applications, we need to monitor and analyse both network infrastructure and network usage patterns. At LinkedIn, we traditionally use SNMP to monitor traffic patterns passing through different layers of the network. The monitored data is visualized using inGraphs, and anomalies are sent as alerts to network engineers using our in-house alerting system, AutoAlerts. Other protocols, such as NETCONF and vendor-specific APIs, are also used to monitor the network.
However, most of these monitoring systems address the question “How many bytes of data are transferred across the network?” but do not answer:
What kind of data is getting transferred?
Who (which service) is transferring the data?
In the past, this lack of data limited our ability to troubleshoot link-hogging issues, perform capacity planning, understand service usage patterns, detect anomalies in the network, and do traffic flow analysis. Having only approximate data about how our network infrastructure was used made it hard to utilize it effectively and scale it to ever-growing needs. To solve this problem, we needed to devise a solution that could give us more insight into network traffic.
A brief explanation about flow information
NetFlow/IPFIX and sFlow are specifications implemented by device vendors to export flow information periodically. Flow information is a sample of the traffic passing through network gear, and it contains detailed information about the type of traffic being transferred at the network layer. A subset of the information that a device can export in each flow includes:
Source and Destination (IP)
Source and Destination (Port)
Source and Destination (ASN)
IP of the network gear that is exporting flow
Input and Output interface index of the network gear where the traffic is being monitored
Number of bytes transferred
This information is useful for gaining insight into the traffic. Detailed specifications are available at NetFlow/IPFIX and sFlow.
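To make the fields listed above concrete, here is a minimal sketch of a flow record as a Python data structure. The class name, field names, and the dictionary layout passed to `parse_flow` are assumptions made for illustration; they are not taken from the NetFlow/IPFIX or sFlow specifications or from InFlow itself. The addresses and AS numbers come from ranges reserved for documentation.

```python
from dataclasses import dataclass
from ipaddress import ip_address


@dataclass(frozen=True)
class FlowRecord:
    """One exported flow sample, limited to the fields listed above (hypothetical names)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    src_asn: int
    dst_asn: int
    exporter_ip: str       # IP of the network gear exporting the flow
    input_if_index: int    # interface index where the traffic entered
    output_if_index: int   # interface index where the traffic left
    byte_count: int        # number of bytes transferred


def parse_flow(raw: dict) -> FlowRecord:
    """Validate an already-decoded flow (assumed dict layout) and build a FlowRecord."""
    for key in ("src_ip", "dst_ip", "exporter_ip"):
        ip_address(raw[key])  # raises ValueError on a malformed address
    for key in ("src_port", "dst_port"):
        if not 0 <= raw[key] <= 65535:
            raise ValueError(f"{key} out of range: {raw[key]}")
    return FlowRecord(**raw)


sample = parse_flow({
    "src_ip": "192.0.2.10",
    "dst_ip": "198.51.100.7",
    "src_port": 51234,
    "dst_port": 443,
    "src_asn": 64496,
    "dst_asn": 64497,
    "exporter_ip": "203.0.113.1",
    "input_if_index": 3,
    "output_if_index": 7,
    "byte_count": 1500,
})
```

A real collector would first decode the binary export packets; this sketch starts from the decoded values only.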
The InFlow application was built at LinkedIn to answer the who, what, when, where, and how of network traffic by processing the flow information exported from a device. InFlow can integrate with internal and third-party applications to enrich traffic information (“enriching” is the process of mapping an IP address to different possible values/dimensions). InFlow has a comprehensive and flexible reporting mechanism that helps network owners understand:
Where traffic on the network is coming from and going to
Which interfaces and devices are transferring more bytes of data
Which peering links are effectively used
Top talkers of applications on the network
Traffic trends on the network over a period of time
Ability to view source and destination hosts/ports contributing to traffic numbers
Network traffic data is processed and stored in the Hadoop environment. In order to make intelligent data-driven decisions, we need a simple and intuitive view of the data that can be presented to the user. InFlow does precisely this when presenting processed data. Users can drill down to see the hourly trend or the aggregate raw data. Aggregate raw data shows raw samples collected from network gear.
Example of network traffic views provided by the InFlow dashboard
We had to address two major challenges before getting any useful information from the collected flows:
Data Quantity: the amount of sample data to process was huge and difficult to work with. While the volume varied with traffic, on average we observed one million flows per minute.
Flows were samples and not actuals. Flow traffic did not match the actual traffic that was graphed using SNMP.
Flows have information about source and destination IP. Consuming them “as is” is challenging for engineers. It’s easier for users to view and consume the data when it is aggregated by service rather than by individual network gear nodes.
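As a rough illustration of the last two points, the sketch below maps source IPs to services and scales sampled byte counts into an estimate of actual traffic. It is not InFlow's implementation: the subnet-to-service table, the service names, the flow dictionary layout, and the 1-in-N sampling assumption are all hypothetical, and how InFlow actually enriches and corrects flows is not described here.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

# Hypothetical mapping of subnets to owning services (illustrative only).
SERVICE_SUBNETS = {
    ip_network("192.0.2.0/25"): "feed",
    ip_network("192.0.2.128/25"): "search",
}


def service_for(ip: str) -> str:
    """Enrich an IP address with a service name, or 'unknown' if no subnet matches."""
    addr = ip_address(ip)
    for subnet, service in SERVICE_SUBNETS.items():
        if addr in subnet:
            return service
    return "unknown"


def bytes_by_service(flows, sampling_rate: int) -> dict:
    """Estimate bytes sent per source service from sampled flows."""
    totals = defaultdict(int)
    for flow in flows:
        # Assumption: 1-in-N packet sampling, so each sample stands for
        # roughly N times the bytes it reports.
        totals[service_for(flow["src_ip"])] += flow["bytes"] * sampling_rate
    return dict(totals)


flows = [
    {"src_ip": "192.0.2.10", "bytes": 1000},
    {"src_ip": "192.0.2.20", "bytes": 500},
    {"src_ip": "192.0.2.200", "bytes": 300},
    {"src_ip": "203.0.113.5", "bytes": 50},
]
estimates = bytes_by_service(flows, sampling_rate=100)
# estimates == {"feed": 150000, "search": 30000, "unknown": 5000}
```

Scaling by the sampling rate only yields an estimate, which is one reason flow-derived totals can differ from SNMP counters.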
At LinkedIn, data-driven decisions are given the utmost importance. There is a plethora of tools, applications, and platforms that can crunch big data and produce useful information, such as Kafka, Gobblin, Cubert, Samza, and Pinot. Another LinkedIn core value is Leverage, and InFlow leverages the LinkedIn data analytics ecosystem to process the data collected from network gear.