The Fault in Our Logs

Network Log Management invades Security

“The fault, dear Brutus, is not in our logs,
But in ourselves, that we are admins.”

Could the processes of network log management be hurting the security operations centers that rely on centralized security monitoring?  It would seem that collecting events and responding to alerts should be straightforward.  When an organization fails to respond to alerts, the blame falls squarely on the organization and its personnel.  Consider the Target breach, where both Symantec antivirus and FireEye alerted on a malicious file, yet nothing was done.  This pattern of known issues and failed responses repeats itself.  Why do companies constantly fail to respond to events?

Clearly this is not a problem of detection; it is a problem of response.  The idea that an alert should always produce a response is ingrained in how networks are operated.  Security is not so simple.  The primary reason for failure is that the amount of effort it takes to validate each alert and respond correctly is too great given the current tools.  Validation and response require additional information, and there are thousands of alerts a day.  The collection of additional information and the numerous actions related to a single alert take effort.  This effort is unaccounted for in the budget, in the tools used, and in the response process.  The result is a failure to act.
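To make the unaccounted effort concrete, consider a back-of-the-envelope calculation (a minimal sketch in Python; the alert counts and per-alert minutes are illustrative assumptions, not measurements from any particular SOC):

```python
# Back-of-the-envelope triage load. All figures are illustrative
# assumptions, not measurements from a real SOC.
alerts_per_day = 2_000            # "thousands of alerts a day"
minutes_to_validate = 10          # gather context, confirm or dismiss one alert
analyst_minutes_per_day = 8 * 60  # one analyst shift

workload = alerts_per_day * minutes_to_validate       # 20,000 minutes of triage
analysts_needed = workload / analyst_minutes_per_day  # ~42 analysts, every day
print(f"{workload:,} minutes of validation work per day "
      f"= {analysts_needed:.0f} full-time analysts")
```

Even with conservative numbers, the validation effort alone dwarfs the staffing of a typical security team, which is why alerts go unanswered.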

A significant issue is that network-savvy people running security operations implement the tools and processes they are familiar with, but these tools are not designed to support the security process.  When a security operations center (SOC) leverages network tools, the security process must be absorbed entirely by the personnel.  The same personnel that selected the network tools in the first place.  This is further inflamed by the high cost of personnel relative to automated processes.

Terms.  This blog uses a number of terms that have important differences in meaning:

  • Alert: a system message indicating an issue that needs to be addressed.
  • Message: information from a system process about itself or its system.
  • Event: activity that is being monitored.
  • Log: a communication from a process to an external entity.
  • Log Management: the management of system messages.
  • Security Information and Event Management (SIEM): the management of alerts.

Why would a SOC use log management software and not something designed for this problem, like a Breach Information Event Management (BIEM) system?  The reason is that network people who have used network log management tools look at the ingestion capability of log management software.  But the capacity and speed figures quoted for network tools are measured using network logs, which are smaller and less complex than security alerts.  Security tools also import significant amounts of raw events for context.  Event data is extremely diverse and is not indexed properly by network tools.  Security events, unlike network messages, need to be significantly analyzed.

Log Management is not Security Event Management

Log managers, like Splunk, ArcLogger and ELK, are often confused with Security Information and Event Management (SIEM) tools, like ArcSight and Nitro.  Size, complexity, indexes and dictionaries differ between network logs and security events.  There are Breach Information Event Management (BIEM) tools, like SecurityDo’s Fluency, that are designed to handle metaflow.  Metaflow records have the size and complexity of alerts, but arrive in volumes much larger than log managers or SIEMs can handle.  Metaflow is also very powerful for scoping, pivoting and attribute extraction, which are key steps in the response process.

System administrators who build security operations centers have a misconception that there is a one-to-one relationship between a security event and the response that addresses it.  Building a security operations center relies on being able to collect events, process events, and respond to issues.  Historically, security mimics network management in how it handles the event flow, even though the type and complexity of the events differ greatly.  While network events tend to have a close correspondence between an event and its response, security events require a greater degree of validation and forensics, and a larger number of responses to address an issue.

Focus is on Collection not Analysis

While security tools treat “analysis and response” as the key aspects of the response process, network tools focus on “collection and summary analytics”.  Splunk is the latest network tool to enter the security event management space.  Splunk’s pricing is based on gigabytes per day, yet it still emphasizes two metrics when stating its power and comparing itself to the established vendors:

  • Events per second (EPS)
  • Gigabytes per day 

“Measure what is measurable, and make measurable what is not so.”

  – Galileo Galilei

Log management tools focus on overall trends and known critical alerts.  Known alerts, like a full disk drive, are consistent and have clear remediation.  Most information can be derived as it enters the system.  Charts are generated and issues trigger a workflow response.  There is very little searching after data enters.  The result is that log management focuses on how many alarms the system can handle, and the desire is to be able to absorb more.

The analysis of alerts in the Splunk dashboard does not show any insight into the data, just a count and charting of the alerts as if they were all validated.  Apply this approach to security events and you get Splunk’s new dashboard.  There is no useful data on the dashboard; it simply says there are more or fewer types of unvalidated alerts entering the system.  The designers of the interface are not trying to understand security; they are trying to get security people to see the world as if security alerts were the same as system messages.


Splunk Dashboard focuses on Statistics

Security management tools focus on validating critical alerts and determining how to respond.  Critical alerts, like possible blackhole communication, require validation.  Validation, in turn, requires searching for other log events that support the critical alert.  The combination of the number of new types of alerts and the need to validate makes searching a core aspect of breach information event management.
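As a sketch of what validation means in practice, the analyst must pull in events that never alarmed and correlate them with the alert (a hypothetical example in Python; the field names and the search() helper are illustrative assumptions, not any vendor’s API):

```python
# Hypothetical validation of a "possible blackhole communication" alert:
# search the raw event stream for supporting evidence. The search()
# helper and the event fields are illustrative assumptions.
def search(events, **criteria):
    return [e for e in events
            if all(e.get(k) == v for k, v in criteria.items())]

events = [
    {"host": "ws-042", "type": "dns",  "answer": "0.0.0.0"},    # sinkholed answer
    {"host": "ws-042", "type": "conn", "dst": "203.0.113.9"},   # outbound attempt
    {"host": "ws-007", "type": "conn", "dst": "198.51.100.2"},  # unrelated traffic
]

supporting = search(events, host="ws-042")  # events around the alerting host
print(f"{len(supporting)} supporting events found")  # 2 -> worth escalating
```

None of those supporting events raised an alarm on their own; without fast search over them, the critical alert can be neither confirmed nor dismissed.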

Insight Comes at a High Cost

Validation requires information from events that did not trigger an alarm.  Analysts have always needed to see the events that did not alarm in order to get the insight needed to make a decision and direct a response.  Full packet capture systems, like Solera and NetWitness, allow analysts to go back to the system and review the related session data.  But these are extremely costly and time intensive.  With thousands of alerts a day, only the most critical events can be evaluated this way.

To get more insight, systems are set to be noisier, providing more data than is needed.  This is not a bad thing, for the larger amount of readily available data makes the response process more effective.  The challenge of managing these large amounts of data is a strong reason for the popularity of Splunk over HP’s ArcSight.  Splunk is often seen as more cost effective, showing close to a 30% reduction in cost on the larger networks.


Cost by Size

Total Licensing Cost Over 4 Years (ArcSight v Splunk)

But when adding flow data into the mix, the volume at large organizations jumps into the billions of records (1 billion events is about 250GB of data, or roughly 250 bytes per event).

Another item worth noting is that neither vendor provides list prices for volumes higher than 250GB/day, which leads us to simple multiplication of licenses and volume to accommodate the 2 to 4 billion events per day in our largest scenario.  This approach can certainly push the pricing into prohibitive territory and does not offer customers cost-effective options for continuously consuming very large amounts of event data.

HP ArcSight versus Splunk Comparison

Despite the push to store more data centrally, there are clearly cost issues with this approach.  And though billions of events a day are clearly cost ineffective, even the half-billion line (which 250GB/day represents) is a multimillion-dollar initiative, a cost that most organizations cannot afford.

Storing does not mean you can search it

Even if you have the deep pockets to scale one of these solutions to the higher data loads of metaflow, the next hurdle is search speed.

Splunk’s documentation gives a hint of this: “If your query returns more than about 1/1000 to 1/2000 of the data in the range/index then it should be able to cover around 1 to 2 million events/minute”.  This means that at 1 billion events, the analyst is looking at between 8 and 16 hours to perform a query.  This is the primary reason for Splunk’s creation of Hunk.  The reality is that Splunk is not a big data solution, and most companies are unaware of this, for they cannot afford to put billions of events into Splunk.
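A back-of-the-envelope calculation makes the scale concrete (a minimal sketch; the 1 to 2 million events/minute scan rate comes from the documentation quote above):

```python
# Query-time estimate at the scan rates quoted above.
total_events = 1_000_000_000  # one billion indexed events

for rate in (1_000_000, 2_000_000):  # events scanned per minute
    minutes = total_events / rate
    print(f"{rate:,} events/min -> {minutes:,.0f} minutes "
          f"({minutes / 60:.1f} hours)")
# 1,000,000 events/min -> 1,000 minutes (16.7 hours)
# 2,000,000 events/min -> 500 minutes (8.3 hours)
```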

Reading through the blogs of frustrated engineers, a trend emerges: companies that do develop a collection infrastructure find that it takes hours (and even days) to generate reports.  Individual searches might take minutes, but this multiplies when there are complex queries.  Consider knowing that there is a malicious web link in an email sent to employees.  Queries should ask: “What users clicked on this link?”, “What artifacts (files) came from that link?” and “What were the next sites visited after the download (data analytic covert channel query)?”  These questions require multiple queries in reality.  The first gives us a list (let’s say a hundred sessions).  Then each member of that list generates a new query.  That two (2) minute query now takes two hundred minutes (over three hours).  This huge amount of time cripples the analyst, often making it so they never pursue the validation and response.  And this is our underlying problem.
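The fan-out arithmetic is simple but brutal (a minimal sketch; the two-minute query cost and hundred-session result mirror the example above):

```python
# Query fan-out during one investigation, using the figures above.
MINUTES_PER_QUERY = 2  # assumed cost of one search over the index
sessions_found = 100   # results returned by the initial query

total_queries = 1 + sessions_found  # first query, plus one pivot per session
total_minutes = total_queries * MINUTES_PER_QUERY
print(f"{total_queries} queries -> {total_minutes} minutes "
      f"({total_minutes / 60:.1f} hours)")  # 101 queries -> 202 min (3.4 hours)
```

And that covers only one of the three questions per session; each additional pivot multiplies the total again.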

Cost-Effective Insight

The simplest solution is to get the insight needed from metaflow data.  That is the approach of SecurityDo’s Fluency (www.security.do), termed Breach Information Event Management (BIEM).  What prevents log managers and SIEMs from taking this approach is their inability to import the volume of metaflow data needed for insight.

How big is metaflow? Flow data from a single 1 Gbps line can generate about 125 gigabytes per day (GBD), at 5,000 events per second (EPS), or roughly 200 million events per day (EPD).  This amount of data is too high for SIEMs without paying for extensive modifications.  Let’s break that down for a network that has dual entry gigabit lines (a sizing sketch follows the list):

  • 250 gigabytes per day of uncompressed metadata and logs.  This compresses to 50-80 gigabytes, but tools like Splunk charge by uncompressed data.
  • Ten thousand (10k) events per second.  Spikes often jump into the 13k range, and rates drop to the 1k range at night.
  • Four hundred million events per day.  The raw number of events is almost double that: detection engines like Sourcefire are very noisy, and on average there is an additional event for each flow record.
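Putting those per-line figures together (a minimal sizing sketch; the compression ratios are derived from the 50-80 gigabyte figure above):

```python
# Sizing sketch for dual 1 Gbps entry lines, using the per-line
# figures quoted earlier (125 GB/day, 5k EPS, 200M events/day).
LINES = 2
gb_per_day = LINES * 125              # 250 GB/day uncompressed
eps = LINES * 5_000                   # 10,000 events/second sustained
events_per_day = LINES * 200_000_000  # 400 million events/day

compressed = (gb_per_day / 5, gb_per_day / 3.125)  # ~50 to 80 GB on disk
print(gb_per_day, eps, events_per_day, compressed)
# 250 10000 400000000 (50.0, 80.0)
```

Note that license costs are charged against the 250 GB uncompressed figure, not the 50-80 GB actually stored.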

Volume is half the equation; the ability to quickly search and relate attributes determines the quality of response.  Search speed is the missing metric.  A BIEM system must have fast search times for attributes even when volumes of metadata are extremely high.  This is where traditional tools break.

Metaflow data contains the key attributes of the flow.  These attributes are entities like host name, network address, user, URL and file hashes.  When the analyst scopes an event, these attributes are available immediately.  This has a dual effect: it aids in scoping the incident (determining all entities involved) and in generating actionable intelligence (the attributes are then used to respond and recover).

The advantage of a metaflow message is that all the attributes needed in the response are recorded, allowing pivoting and scoping to occur without intensive pcap recovery and replay.  As new actionable intelligence is announced, the entity datastore (real-time attribute access stored in big data) can determine in under a second whether any activity related to the intelligence has occurred.  This compares to 20-30 minutes for log management systems.
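A toy illustration of the entity datastore idea (a minimal sketch in Python; Fluency’s actual datastore is a distributed big data system, and the record fields and placeholder hash here are assumptions):

```python
from collections import defaultdict

# Toy inverted index over metaflow attributes. This only shows why
# indexed attributes make indicator lookups near-instant; a real
# entity datastore is a distributed big data system.
index = defaultdict(set)

def ingest(record_id, record):
    # Index every attribute value so any indicator can be looked up
    # directly, without scanning raw events.
    for field in ("host", "ip", "user", "url", "file_hash"):
        if field in record:
            index[(field, record[field])].add(record_id)

ingest(1, {"host": "ws-042", "ip": "10.1.2.3",
           "url": "http://malicious.example/payload",
           "file_hash": "d41d8cd98f00b204e9800998ecf8427e"})

# New intelligence arrives: has this file hash been seen anywhere?
hits = index[("file_hash", "d41d8cd98f00b204e9800998ecf8427e")]
print(hits)  # {1} -- a constant-time lookup, versus a 20-30 minute log search
```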

BIEM systems are fast at ingesting data, but what separates them is a dedicated big data system that works with the security process of collection, analysis and response.  The big data backend and high-speed entity datastores allow for the higher volume of insight, and that higher volume is supported in the analysis and response structures by designed-in search speeds.

Summary

The issue with a log management approach to security is that security events are not status updates, but complex messages about possible issues.  Security messages are larger and more complex than system messages.  Without a backend to support this complexity, the knowledge and processes of the operation must be enforced by the users.  But the real weakness is that network tools do not support the validation and scoping that analysts need.

The fact that log and SIEM systems are not designed for scoping and pivoting on large datasets has not stopped organizations from building systems dependent on this approach.  Companies that provide consulting on implementing larger log analysis platforms highlight that these platforms are not cost effective as the volume of events grows.

BIEM systems are designed for this higher volume and provide higher quality operational security.  Response and recovery are driven by attributes: the more attributes of an attack that are known, the better the response.  Hence, the more attributes collected and the faster they can be searched, the higher the quality of security that can be performed.  Vendors continue to measure the volume of ingestion, but that is only half the problem.  The ability to search and relate the ingested data is what provides its value.
