Large Network Event Management: Collection Infrastructure (Part 1)

Integration of security devices remains an issue despite the use of Security Information and Event Management (SIEM) systems.  The overlap in accessing the data stream and in managerial control of alerts leads to awkward solutions.  Complicating the matter is network growth, which produces higher volumes of bandwidth and log data than existed when today’s log management solutions were designed.  Even though tuning the events entering the system increases performance, the small capacity of SIEM datastores hinders response, reducing the value of the security investments.

What is missing for larger-scale event management is a means of deploying security devices into an independent structure and of running multiple central management systems.

Independence means there is separation between products, allowing them to integrate outside their product family.  This independence makes it possible to roadmap future devices while still integrating legacy devices.

Multiple central systems might seem strange.  SIEMs are designed as all-in-one devices: they are the boom box of integrated solutions.  With the increase in security-related events that a network produces, SIEMs are a best fit for mid-sized companies.  Like a boom box owner, a larger organization must choose between keeping the all-in-one unit and upgrading to higher-end separate components.  A SIEM’s collection infrastructure hinders this migration to other management systems by controlling the flow of events with less than optimal capacity.

To create a true collection infrastructure, a company requires an integration solution that has:

  • Simplicity
  • Security
  • Process Integration
  • Capacity/Scalability
  • Speed

To do this they need to develop: a means to collect the data; a means to consume it; and finally a means to use it.

The steps to implementing a collection network are:

  • Create a Shadow Network
  • Implement Tap Points
  • Create tiers based on granularity
  • Implement a syslog sink
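The last step above, the syslog sink, can be sketched in a few lines. This is a minimal illustration, not a production collector: the spool filename and the sample firewall message are assumptions, and it binds to 127.0.0.1 so the sketch runs anywhere (a real deployment would bind the shadow-network interface, classically UDP port 514).

```python
import socket
import socketserver

SPOOL = "events.spool"  # illustrative spool file for downstream consumers

class SyslogHandler(socketserver.BaseRequestHandler):
    """Receive one syslog datagram and append it to a spool file."""
    def handle(self):
        data, _sock = self.request
        # Syslog payloads are not always clean UTF-8; replace bad bytes.
        message = data.decode("utf-8", errors="replace").strip()
        with open(SPOOL, "a") as spool:
            spool.write(f"{self.client_address[0]} {message}\n")

# Bind an ephemeral localhost port for this sketch.
server = socketserver.UDPServer(("127.0.0.1", 0), SyslogHandler)
host, port = server.server_address

# Self-test: send one RFC 3164-style datagram and handle it.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"<34>Oct 11 22:14:15 fw1 deny tcp 10.1.1.5:4312", (host, port))
server.handle_request()
server.server_close()

print(open(SPOOL).read().strip())
```

In practice the sink only receives and spools; parsing and enrichment belong to the consumers downstream, which keeps the collection tier simple and fast.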

Once you have an infrastructure, capacity and scalability will eventually become a problem.  This infrastructure works well, but it too has a limit in events per second.  Vendors will “tune” systems to reduce volume, and this is fine when logs are used only for detection.  But as processes become more automated and streamlined, a further shift in log management occurs: the need to import more event data to support remediation.

Remediation requires higher-volume event logs and ad hoc querying in order to scope the incident and highlight the attributes needed for response and recovery.  Remediation is the underlying issue in network operations.  The Target breach, like many large-scale breaches, was detected; it was a failure in remediation.
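The kind of ad hoc scoping query this implies can be illustrated with a toy example: given event records, find every internal host that talked to an indicator from the triggering alert. The field names and addresses are assumptions for illustration, not a SIEM API.

```python
# Toy event store: a list of flattened event records (fields are illustrative).
events = [
    {"src": "10.1.1.5", "dst": "198.51.100.9", "port": 443},
    {"src": "10.1.1.7", "dst": "198.51.100.9", "port": 443},
    {"src": "10.1.1.5", "dst": "10.1.2.2", "port": 445},
]

# Indicator taken from the initial alert: a suspect external destination.
indicator = "198.51.100.9"

# Scope the incident: which internal hosts communicated with the indicator?
in_scope = sorted({e["src"] for e in events if e["dst"] == indicator})
print(in_scope)  # → ['10.1.1.5', '10.1.1.7']
```

The query is trivial; the problem at scale is that the event store must hold enough data, and answer fast enough, for responders to run many such queries interactively.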

To resolve this last issue, the management component of the collection infrastructure needs to move away from disk-intensive relational databases, away from script-oriented consumption of events toward faster languages, and toward the solutions that have come out of big data.

Divide the Network

The first rule is that security communication gets its own network.  A lesson learned from the days of phone phreaking is not to place the management channel and the information channel on the same line.  Attackers target authentication and security software and devices.  Gaining access not only hides the attack, but also provides powerful control over the network and its data.

The second rule is that passive devices share a tap.  There is a cost-driven desire to use SPAN ports, but SPAN ports introduce a level of corruption, and they limit the number of devices that can sit off the port.  In the old days we often used hubs instead of taps, but hubs impact network throughput and quality.


Customer Premise Equipment (CPE) sits in the field.  The term comes from telecom, referring to the telecom’s equipment in a customer’s facility.  CPE should be thought of as the devices that you just can’t walk down the hallway and touch.  CPE sits at the first drop into a network.


The CPE diagram above is an example of devices sharing a fail-open aggregator tap on a shadow network.  Because these are passive devices, if the tap fails we still want the network to operate; therefore, the tap is fail open.  Because it is an aggregator tap, all the passive devices can take a single stream of data.  Once we get above 5 Gbps lines, there is 10 Gbps aggregated (five inbound, five outbound), and we will need to use a two-line tap.

Taps are normally modular and allow us to add more systems to the tap point.  This means we can place breach detection systems and traditional detection systems side by side.  More importantly, we can add metaflow collection on the same tap.

Devices can also be placed inline.  In that case you still connect their management and logging ports to the shadow network.

Metaflow provides an overall view of all traffic, even traffic that did not generate an alert.  Besides metadata’s pivotal contribution to analysis, response, and recovery, it also allows us to lower the amount of traffic data collected from the other devices.  The primary reason for a SIEM to have a dedicated collector is to gain network granularity from the device.  Metaflow data provides a good portion of this, allowing us to take only syslog from the devices, knowing we are getting the metaflow of the same session.


The ability to see something clearly in a photograph depends on the granularity of the picture.  This analogy is what we mean by network granularity.  When an event is detected, the next step is to scope the incident.  We use granularity to determine validity, associate events, and extract attributes for response.  We store information based on granularity.  Think of all that data as having weight: the greater the weight, the less it travels.

It would seem reasonable that if we captured all the data from a communication and stored it, we would have perfect granularity.  Reason would be wrong here.  If we had a picture of a house, we would still know very little about that house.  Who lives there, what the house is for, when it was built, what its telephone numbers are, how much electricity it uses: all of this is metadata.  Flows are the same.  Metaflow data might include who the user is, when the communication occurred, what subnet was involved, and what the names of the systems are.  Most of what we need to know to respond is in the metadata.  Metaflow data has the attributes we need for response, and it is a fraction of the size of the actual traffic.
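A metaflow record built around the attributes just listed might look like the sketch below. The field names and values are illustrative, not a standard flow schema; the point is that a few hundred bytes of metadata can describe megabytes of traffic.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class MetaflowRecord:
    """One session summary: the attributes described above (illustrative)."""
    user: str            # who was on the session
    start: datetime      # when the communication happened
    src_subnet: str      # what subnet it originated from
    src_host: str        # names of the systems involved
    dst_host: str
    dst_port: int
    bytes_total: int     # size of the full traffic this record summarizes

record = MetaflowRecord(
    user="jdoe",
    start=datetime(2015, 6, 1, 14, 30, tzinfo=timezone.utc),
    src_subnet="10.20.30.0/24",
    src_host="ws-1042",
    dst_host="mail.example.com",
    dst_port=25,
    bytes_total=4_200_000,  # ~4 MB of traffic summarized in one small record
)

print(asdict(record)["dst_host"])
```

Scoping and response queries run against records like this one; the full packets are pulled only when the raw bytes themselves are needed.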

Full packet capture is still very useful; it is just costly.  First, an organization should implement metaflow recording.  If it can afford it, it can implement full packet capture afterwards.  The reason is that data analytics require metaflow data and provide a more cost-effective means to review and scope incidents.  Full packet capture requires trained personnel, and not just more money, but more time.

Tiered CPE

In this CPE diagram, we are using a tap going directly to the full packet capture.  This is ideal for both integration and operations.  A full packet capture can slice the flows into streams for the other devices.  This solves issues of capacity, as the size of the streams can be divided to match the capacity of each device.  It also allows for filtering, so that if a device does not need parts of the flow, like mail or SSL, the traffic can be shaped to the device’s strengths.  Of course, when there is an incident, the metadata is used to determine which sessions an analyst should review, and then a call is made to the packet capture to slice out that piece of the communication for hard-core analysis.
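That "call to the packet capture" can be as simple as turning a scoped session into a capture filter. The sketch below builds a BPF filter string from assumed session attributes; the addresses are illustrative, and the filter could be handed to a capture appliance or to plain tcpdump (e.g. `tcpdump -r full.pcap -w session.pcap '<filter>'`).

```python
def session_to_bpf(src_ip: str, dst_ip: str, dst_port: int) -> str:
    """Build a BPF filter matching one session's traffic in both directions.

    'host' matches either source or destination address, so the one filter
    covers both halves of the conversation.
    """
    return f"host {src_ip} and host {dst_ip} and port {dst_port}"

# Session attributes pulled from a metaflow record (illustrative values).
flt = session_to_bpf("10.20.30.42", "203.0.113.7", 443)
print(flt)
```

The analyst never trawls the raw capture directly: metadata narrows the haystack to one session, and the filter extracts only that session's packets for deep analysis.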

This structure creates three levels of granularity: alert, metaflow (event), and full packet.  Alert is used to target an event for analysis.  Metaflow is used for validation and scoping.  Full packet is used to analyze the actual traffic when the data itself is needed.  For example, a covert channel might be detected through reputation, but the raw traffic is needed to write a more generic signature to find it elsewhere in your network.  As this example shows, full packet capture is a tool for the serious expert.  You will need this type of person in your organization, or you will just have a powerful tool no one knows how to use.

Next (Part 2)

The next post will look at the capacity and speed issues caused by a more robust collection infrastructure.  When collecting in large organizations, the number of events reaches into the billions.  What is the cost, in time and money, of using traditional datastores like ArcSight and Splunk?
