Big Data, Fast Data, Smart Data

DAUNTING DATA

Every minute, 48 hours of video are uploaded to YouTube, 204 million e-mail messages are sent, and 600 new websites are generated. 600,000 pieces of content are shared on Facebook, and more than 100,000 tweets are sent. And that does not even begin to scratch the surface of data generation, which extends to sensors, medical records, corporate databases, and more.

As we record and generate a growing amount of data every millisecond, we also need to be able to understand this data just as quickly. From monitoring traffic to tracking epidemic spreads to trading stocks, time is of the essence. A few seconds’ delay in understanding information could cost not only funds, but also lives.

BIG DATA’S NOT A BUBBLE WAITING TO BURST

Though “Big Data” has recently been deemed an overhyped buzzword, it’s not going to go away any time soon. Information overload is a phenomenon and challenge we face now, and will inevitably continue to face, perhaps with increased severity, over the next decades. In fact, large-scale data analytics, predictive modeling, and visualization are increasingly crucial for companies in both high-tech and mainstream fields to survive. Today, big data capabilities are a need, not a want.

“Big Data” is a broad term that encompasses a variety of angles. There are complex challenges within “Big Data” that must be prioritized and addressed – such as “Fast Data” and “Smart Data.”

SMART DATA

“Smart Data” means information that actually makes sense. It is the difference between seeing a long list of numbers referring to weekly sales and identifying the peaks and troughs in sales volume over time. Algorithms turn meaningless numbers into actionable insights. Smart data is data from which signals and patterns have been extracted by intelligent algorithms. Collecting large amounts of statistics and numbers brings little benefit if there is no layer of added intelligence.
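To make the weekly sales example concrete, here is a minimal sketch in Python of the simplest possible "layer of added intelligence": flagging local peaks and troughs in a series. The function name and the sales figures are made up for illustration, and a real system would likely smooth the data first.

```python
def find_peaks_and_troughs(values):
    """Return (peaks, troughs) as lists of indices of strict local extrema."""
    peaks, troughs = [], []
    # The first and last points have only one neighbor, so they are skipped.
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if cur > prev and cur > nxt:
            peaks.append(i)
        elif cur < prev and cur < nxt:
            troughs.append(i)
    return peaks, troughs


# Hypothetical weekly sales volumes (week 0 to week 9).
weekly_sales = [120, 135, 160, 150, 110, 95, 130, 170, 165, 140]
peaks, troughs = find_peaks_and_troughs(weekly_sales)
print("Peak weeks:", peaks)
print("Trough weeks:", troughs)
```

The raw list says little at a glance; the two short index lists are the kind of signal a person can act on.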

IN-THE-MOMENT DECISIONS

By “Fast Data” we’re talking about as-it-happens information enabling real-time decision-making. A PR firm needs to know how people are talking about its clients’ brands in real time in order to nip bad messages in the bud. A few minutes too late and viral messages might be uncontainable. A retail company needs to know how its latest collection is selling as soon as it is released. Public health workers need to understand disease outbreaks in the moment so they can take action to curb the spread. A bank needs to stay abreast of geo-political and socio-economic situations to make the best investment decisions with a global-macro strategy. A logistics company needs to know how a public disaster or road diversion is affecting transport infrastructure so that it can react accordingly. The list goes on, but one thing is clear: Fast Data is crucial for modern enterprises, and businesses are now catching on to the real need for such data capabilities.

GO REAL-TIME OR GO OBSOLETE

Fast data means real-time information, or the ability to gain insights from data as it is generated. It’s literally as things happen. Why is streaming data so hot at the moment? Because time-to-insight is increasingly critical and often plays a large role in smart, informed decision making.

In addition to the obvious business edge that a company gains from having exclusive access to information about the present, or even the future, streaming data also comes with an infrastructure advantage.

With big data come technical issues to address, one of which is the costly and complex matter of data storage. But data storage is only required in cases where the data must be archived historically. As more and more real-time data is recorded with the rise of sensors, mobile phones, and social media platforms, on-the-fly streaming analysis is often sufficient, and storing all of that data is unnecessary.

STREAMING VS. STORING & DATA’S EXPIRATION DATE

Historical data is useful for retroactive pattern detection; however, there are many cases in which in-the-moment data analyses are more useful. Examples include quality control in manufacturing plants, weather monitoring, tracking the spread of epidemics, traffic control, and more. You need to act based on information coming in by the second. For example, redirecting traffic around a new construction project or a large storm requires that you know the current traffic and weather situation; last week’s information is useless.

When the kind of data you are interested in requires no archiving, or only selective archiving, it does not make sense to build data storage infrastructure that would store all the data historically.

Imagine that you wanted to listen for negative tweets about Justin Bieber. You could either store historical tweets about the pop star, or analyze streaming tweets about him. Recording the entire history of Twitter just for this purpose would cost tens of thousands of dollars in server costs, not to mention the RAM needed to run the algorithms over this massive store of information.

It is crucial to know what kind of data you have and what you want to analyze from it in order to pick a flexible data analytics solution to suit your needs. Sometimes data needs to be analyzed from the stream, not stored. Do we need such massive cloud infrastructure when we do not need persistent data? Perhaps we need more non-persistent data infrastructures for data that does not need to be stored eternally.

Data’s Time-To-Live (TTL) can be set so that it expires after a specific length of time, taking the burden off your data storage capabilities. For example, your company’s sales data from two years ago might be irrelevant to predicting its sales today. And that irrelevant, outdated data should be laid to rest in a timely manner. Just as compulsive hoarding is unnecessary and often a hindrance to people’s lifestyles, so is mindless data storage.
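As a rough illustration of the TTL idea, the Python sketch below keeps each entry alongside an expiry time and drops it once that time has passed. The class name `TTLStore`, its methods, and the sample key are assumptions made for this example, not the API of any particular product; production data stores typically offer expiry as a built-in setting.

```python
import time


class TTLStore:
    """Minimal in-memory key-value store whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be demonstrated without waiting
        self._items = {}     # key -> (value, expiry time)

    def put(self, key, value):
        self._items[key] = (value, self._clock() + self.ttl_seconds)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # lazily evict the expired entry on read
            return default
        return value

    def purge_expired(self):
        """Remove all expired entries and return how many were removed."""
        now = self._clock()
        expired = [k for k, (_, exp) in self._items.items() if now >= exp]
        for k in expired:
            del self._items[k]
        return len(expired)


# Demonstration with a fake clock (seconds) instead of real waiting.
fake_now = [0.0]
store = TTLStore(ttl_seconds=60, clock=lambda: fake_now[0])
store.put("sales:week-1", 1200)
fake_now[0] = 30.0
print(store.get("sales:week-1"))  # still within its time-to-live
fake_now[0] = 61.0
print(store.get("sales:week-1"))  # expired, so the default (None) comes back
```

Choosing the TTL is a business decision rather than a technical one: it should match how long the data remains relevant to the questions you are asking.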

BEYOND BATCH PROCESSING

Aside from determining data life cycles, it is also important to think about how the data should be processed. Let’s look at the options for data processing, and the type of data appropriate for each.

Batch processing: Batch processing means that a series of non-interactive jobs is executed by the computer all at once. When referring to batch processing for data analysis, this means that you have to manually feed the data to the computer and then issue a series of commands that the computer then executes all at once. There is no interaction with the computer while the tasks are being performed. If you have a large amount of data to analyze, for instance, you can order the tasks in the evening and the computer will analyze the data overnight, delivering the results to you the following morning. The results of the data analysis are static and will not change if the original data sets change, unless a whole new series of commands for analysis is issued to the computer. An example is the way all credit card bills are processed by the credit card company at the end of each month.

Real-time data analytics: With real-time data analysis, you get updated results every time you query something. You get answers in near real-time with the most updated data up to the moment the query was sent out. Similar to batch processing, real-time analytics requires that you send a “query” command to the computer, but the task is executed much more quickly, and the data store is automatically updated as new data comes in.

Streaming analytics: Unlike batch and real-time analyses, streaming analytics means the computer automatically updates results about the data analysis as new pieces of data flow into the system. Every time a new piece of information is added, the signals are updated to account for this new data. Streaming analytics automatically provides as-it-occurs signals from incoming data without the need to manually query for anything.
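The difference between recomputing over a stored data set and updating a result as each item arrives can be sketched in a few lines of Python. This is an illustration only: the names `batch_mean` and `StreamingMean` and the sensor readings are invented for the example, and a running average stands in for whatever signal is actually being tracked.

```python
def batch_mean(values):
    """Batch style: needs the full stored data set and is rerun on demand."""
    return sum(values) / len(values)


class StreamingMean:
    """Streaming style: keeps only a count and the current mean, not the data."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental update: move the mean toward x by 1/count of the gap.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


# Hypothetical sensor readings arriving one at a time.
sensor_readings = [10, 20, 30, 40]

stream = StreamingMean()
for reading in sensor_readings:
    current = stream.update(reading)
    print(f"after reading {reading}: running mean = {current}")

# The batch result over the stored list matches the final streaming result.
print("batch mean:", batch_mean(sensor_readings))
```

Note that the streaming version never holds more than two numbers in memory, which is the storage advantage described earlier, while the batch version requires the whole list to be kept.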

REAL-TIME, DISTRIBUTED, FAULT-TOLERANT COMPUTATION

How can we process large amounts of real-time data in a seamless, secure, and reliable way?

One way to ensure reliability and reduce cost is with distributed computing. Instead of running algorithms on one machine, we run an algorithm across 30 to 50 machines. This distributes the processing power required and reduces the stress on each.

Fault-tolerant computing ensures that in a distributed network, should any of the computers fail, another computer will take over the failed computer’s job seamlessly and automatically. This is designed to ensure that every piece of data is processed and analyzed, and that no information gets lost even in the case of a network or hardware breakdown.
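The failover idea can be simulated in a single Python process. The sketch below is a toy, not a real distributed system: the "workers" are plain functions, the worker names and the squaring task are made up, and failures are injected deliberately. It shows only the core logic of handing a failed job to another worker so that no item is dropped.

```python
class WorkerFailure(Exception):
    """Raised by a simulated worker to stand in for a crash or network error."""


def make_worker(name, fail_on=()):
    """Build a simulated worker that squares its input, failing on chosen items."""
    def worker(item):
        if item in fail_on:
            raise WorkerFailure(f"{name} failed on {item}")
        return (name, item * item)
    return worker


def process_with_failover(items, workers):
    """Spread items across workers round-robin; on failure, try the next worker."""
    results = {}
    for index, item in enumerate(items):
        for attempt in range(len(workers)):
            worker = workers[(index + attempt) % len(workers)]
            try:
                results[item] = worker(item)
                break  # this item is done, move on to the next one
            except WorkerFailure:
                continue  # hand the same item to the next worker
        else:
            # Reached only if every worker failed on this item.
            raise RuntimeError(f"all workers failed on {item!r}")
    return results


# Worker A is rigged to fail on item 4, so another worker must pick it up.
workers = [make_worker("A", fail_on={4}), make_worker("B"), make_worker("C")]
results = process_with_failover([1, 2, 3, 4], workers)
for item, (worker_name, value) in results.items():
    print(f"item {item} -> {value} (handled by {worker_name})")
```

Real platforms add what this sketch leaves out, such as detecting that a machine has gone silent and tracking which items were acknowledged, but the principle of reassigning unfinished work is the same.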

IN SHORT

In an age when time to insight is critical across diverse industries, we need to cut time to insight down from weeks to seconds.

Traditional, analog data-gathering took months. Doctors or traffic police would jot down information about patients’ infections or drunk driving accidents, and these forms would then be mailed to a hub that would aggregate all this data. By the time all these details were put into one document, a month had passed since an outbreak of a new disease or a problem in driving behavior. Now that digital data is being rapidly aggregated, however, we are given the opportunity to make sense of this information just as quickly.

This requires analyzing millions of events per second against trained, learning algorithms that detect signals from large amounts of real, live data – much like rapidly fishing for needles in a haystack. In fact, it is like finding the needles the moment they are dropped into the haystack.

How is real-time data analysis useful? Applications range from detecting faulty products in a manufacturing line to sales forecasting to traffic monitoring, among many others. The coming years will usher in a golden age not for any old data, but for fast, smart data. A golden age for as-it-happens actionable insights.

For more information check out www.augify.com