Why Big Data?

What is big data? Why do we need it? Why is it hard? After two years of developing Fluency, a big data system for security information and event management, looking back offers a unique perspective: not all parts of the infrastructure can keep pace with Moore's Law. The result is a redesign of how we handle data.

Why Big Data?

What is happening around information is driven by Moore's Law (processor density doubles every two years). Not only is our processing power doubling every two years, so is our information. This is no surprise: those processors produce data, and their output doubles every two years as well. To us, this information ranges from financial transactions to YouTube videos to e-mail. Along with the information comes something called metadata.

Metadata is information associated with, but not part of, the original content. Think of it like an aerial photograph. The data may be a picture of a house. The house's metadata is its address, its purpose, its utility bills, and the people living inside. Every time someone interacts with the Internet, they produce metadata. Marketing people love metadata; when processed, it is called analytics. It can tell us a user's geographic region, the type of system they use, what web sites they visit, and who their friends are. Metadata analytics are easier if the data is all in one place. The problem is that metadata can get very big, especially when it records the relationships between things. For example, there are far more Twitter messages than Twitter accounts. Tweets are what we call edges, for they connect entities, and the number of edges explodes as members join a system.
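To see why edges explode faster than entities, here is a minimal sketch (my own illustration, not from the original post): the number of *possible* connections among members grows roughly with the square of the member count, so the relationship metadata quickly dwarfs the entity list itself.

```python
def possible_edges(members: int) -> int:
    """Number of possible unordered connections among `members` entities."""
    return members * (members - 1) // 2

# Entities grow linearly; potential relationships grow quadratically.
for n in (10, 100, 1000):
    print(f"{n} members -> {possible_edges(n)} possible edges")
```

Real networks record only a fraction of these possible edges, but the quadratic ceiling is why relationship metadata dominates storage as membership grows.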

Metadata is nothing new. Companies, researchers and governments have been collecting it for decades. It is at the heart of government studies such as censuses. An early type of computer called a tabulating machine was used for the 1890 US census. Computers are really good with metadata.

What Changed?

So what changed, such that the common high-end server can no longer handle all this data? It would seem natural that the thing that created this large pile of data could consume it. But it can't. That's because the thing that produced this metadata is a distributed system; it is all of our phones, tablets and laptops.

Enter another law, Amdahl's law. It states that the speedup gained by parallelizing a process is limited by its sequential elements. Simply put, a one-lane road moves at the pace of the slowest car. And though almost all parts of our computer systems have improved with Moore's Law, the speed of disk input/output (IO) has not. This means there is a small straw between the data and the processors, and that straw is getting relatively smaller. We can spend more money keeping data in active memory, but that is expensive and limited, and we are talking big data here. You, the producers of this data, do not feel the disk IO problem because you are one node in a distributed network. The central database has a limited number of disk IO threads.
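The limit Amdahl's law imposes can be made concrete with a small sketch (my own worked example; the 5% serial figure is an assumption chosen to stand in for disk IO):

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Maximum overall speedup when only `parallel_fraction` of the work
    can be spread across `n_workers`; the rest runs sequentially."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

# If 5% of the job is stuck behind a sequential bottleneck (say, disk IO),
# even an unlimited number of processors cannot beat a 20x speedup.
for workers in (2, 16, 1_000_000):
    print(f"{workers} workers -> {amdahl_speedup(0.95, workers):.2f}x")
```

This is why faster processors alone do not help: the sequential disk IO term in the denominator never shrinks, no matter how many cores you add.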

Big Data is About Being Able to Use Data

Big data is an approach to solving this access issue between the processors and the data. It is not just storing the data; it is also searching and managing the data. And this is hard for many reasons. First, the computer world has long focused on creating abstraction between hardware and software. Now, software needs to consider hardware again, especially disk IO. There are new layers between the database interface and the hardware data storage. This is happening so fast that hyped techniques like MapReduce are already becoming obsolete. There is a shift in how data is being managed, and the industry is learning on the fly. Building a big data solution is a full-time job, especially considering that the dust has not settled and does not appear to be settling anytime soon. That full-time cost is too much for most companies.

Big Data is Business

So, with all the uncertainty and difficulty around big data, do companies need to invest in it? This is a bet that a large company needs to take, or it will become obsolete. Bold statement, right? Yet the financial impact of high-level mistakes made because you cannot manage your data is extremely costly. GM knew there were ignition issues. Target knew of malware on its point-of-sale devices. In both cases, it is important to remember that these are just the issues that became public; thousands of issues occur every day at companies like these. Big data may not stop the first instance, but it would reveal the trends before 8.4 million cars are recalled and 40 million credit cards are stolen. The truth is that the people who make the decisions are often overwhelmed by the number of key events that need to be analyzed. Management is often making decisions without a clear picture, a picture that without big data takes too long to come into focus. By then it is too late.

These front-page issues are not alone. We have a customer that sometimes produces more than a billion events a day. That is more data than there are tweets. Which events are important enough to respond to? A traditional database takes a minimum of an hour and a half per search, while fast specialized file systems take 30 minutes. That means each member of the staff can evaluate a maximum of 16 events a day, and that does not include responding to the ones that turn out to be problems. It also means they need to pick the right 16 events out of a billion; each wrong choice means one less discovery. Both the people and the system suffer information overload without big data. With big data, our searches often run in under a second across a two-week window. That is over 10 billion events. This allows analysts to consider, in ten minutes of work, results that would have taken an entire day, and it opens up the ability to evaluate events that would never have been looked at.
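The triage arithmetic above can be checked directly (a sketch using the post's own 90-minute and 30-minute figures; the round-the-clock coverage assumption is mine):

```python
SECONDS_PER_DAY = 24 * 60 * 60      # assume round-the-clock coverage

traditional_search = 90 * 60        # ~1.5 hours per query on a traditional database
specialized_search = 30 * 60        # ~30 minutes on a fast specialized file system

# How many events can one analyst investigate per day at each query speed?
print(SECONDS_PER_DAY // traditional_search)   # traditional database
print(SECONDS_PER_DAY // specialized_search)   # specialized file system
```

Even the faster file system caps an analyst at 48 looks per day against a billion daily events, which is why sub-second search changes the job qualitatively, not just quantitatively.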

Besides the initial cost, building your own big data solution produces costly legacy problems. Start road-mapping and choosing products that have big data built into them. Unfortunately, few legacy products are migrating to big data. Many companies instead force you to use their cloud. This happens when they have not solved the big data management issue themselves: by hiding the management of the big data, they can deliver a solution today. But that introduces security, privacy and infrastructure issues for the buyer.

What is the lesson learned?

Without a technology breakthrough in disk IO, big data is here to stay. Businesses need to roadmap and align a strategy to find and incorporate big data solutions into their decision processes. Where to start? Focus on areas of information overload. A telltale sign: anywhere a person can take a lunch break while the system crunches numbers is a place to start.
