Big Data Should Not Equal Big Insights

I was at a storage event a few weeks ago where every panel, keynote speaker, and workshop presenter mentioned big data throughout their presentations. The part I found most amusing was how they all seemed to have a different definition of what big data was.  For some, it was big unstructured datasets of every kind that were growing at an increasingly alarming rate. To others it was the data that was outgrowing their databases, and to others still, it was complex analytics of structured and unstructured data.  It reminded me of how early on, and in some cases still today, “the cloud” had a nebulous definition that required you to ask, “When you say cloud, what exactly is your definition?”

I decided then and there that whenever I talk about big data, I would start by giving my definition so people understand what part of big data I’m addressing.  So here we go.

Big data is all the data that we create, ingest, store, validate, manage, and analyze to understand our businesses, our customers, our lives, and our world better.  I know some define it as analytics or big insights, but in my opinion, Big Data + Big Analytics = Big Insights.  Rather than collapsing all three fields into big data, we can define them better as three distinct areas, particularly since each can be the focus of a different part of an organization.  This helps me sort out some of the ambiguity around how big data is defined.

Now that I’ve separated the components out, I want to address big data specifically.  Every day, new technologies, businesses, and websites are created that generate more of this data, bringing even more definition and understanding to the equation.  I see this only expanding, since tomorrow we will create more data, in more industries, for more reasons.

In my opinion, big data is multi-dimensional, but at its core it is data that exhibits one or more of the following characteristics in the extreme: volume, velocity, and variety, the “three Vs” first discussed by Gartner, plus volatility.  It has been embraced by organizations that want to extract value and insight from these datasets to be more competitive, productive, and innovative.  Big data is more aligned with trend analysis across millions of records than with reporting on a few records or just one.  Machine-generated data is a perfect example of these kinds of big datasets.  This data is produced by automated systems that create records, event logs, sensor readings, and much more at a rate that is astonishing, continuous, and escalating.  All this data will bring significant change to how companies create, store, manage, and use their data over the next five years.

Our shrinking world is driving us to the digital space to uncover new opportunities.  I sometimes think it is our own human instinct to explore that is creating this appetite to generate more data.  You don’t have to look further than the new data imaging technologies to envision new, unexplored landscapes waiting to be discovered, much as the oceanic explorers of old sought out uncharted lands and new opportunities.  Big data offers an unlimited expanse of new worlds and untapped resources waiting for those brave enough to drill into them with big analytics.

Before becoming successful big data explorers, organizations will need to become proficient in three key areas.  They will need to store big data efficiently, or suffer unnecessary costs tied to scale and operations; manage it effectively, to meet regulatory and governance requirements; and provide fast access to it for analysis, so they can uncover its incremental value.

Big data starts with the creation or ingestion of the data.  For many, this raises the first question: where to put it so that it is stored efficiently and constantly accessible.  Since we know this data is already extreme in some of its characteristics, this may pose a challenge for existing strategies.  Big data for many will mean experimentation with the data to uncover the correlations that matter.  Big data is not about accessing one record and filing it away.  It is about trends across millions of records, possibly over some period of time, to make discoveries.  That means bigger is better, and that the data must always be accessible to meet the constantly changing needs of data scientists and their appetite for more data and data sources.
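The contrast between single-record lookup and trend analysis can be made concrete with a small sketch.  This is an illustration only: the synthetic “event” records, field names, and week-over-week comparison below are my own assumptions standing in for whatever records an organization actually collects.

```python
# Sketch: single-record access vs. trend analysis across many records.
# The synthetic events and field names are illustrative assumptions.
from collections import Counter
import random

random.seed(42)

# Simulate a large batch of timestamped event records (days 0..29).
events = [
    {"day": random.randint(0, 29), "status": random.choice(["ok", "error"])}
    for _ in range(100_000)
]

# Single-record access answers "what happened in record X?"...
first = events[0]

# ...while trend analysis aggregates across the entire dataset:
errors_per_day = Counter(e["day"] for e in events if e["status"] == "error")

# A simple week-over-week comparison of error volume.
week1 = sum(errors_per_day[d] for d in range(7))
week4 = sum(errors_per_day[d] for d in range(21, 28))
trend = "rising" if week4 > week1 else "flat or falling"
```

The discovery here comes not from any one record but from the aggregate shape of all of them, which is why the full dataset has to stay accessible.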

Many big datasets are structured or semi-structured in nature.  Since this data has extreme characteristics like size, it is often beyond the capabilities of traditional databases and is offloaded into file systems that are then pushed to their limits, or archived to tape.  In unstructured formats or in offline archives, the data is harder to access, slower to analyze, and often stripped of any schema.  Overcoming these challenges will require new thinking around traditional technologies and the implementation of new ones.
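The “stripped of any schema” problem is easy to picture: once semi-structured records land in flat files, structure has to be re-inferred before analysis can start.  A minimal sketch, assuming JSON-lines records with made-up sensor fields:

```python
# Sketch: re-inferring schema from semi-structured records offloaded to
# a file-system archive. The records and field names are illustrative.
import json

# Records as they might sit in a flat-file archive (JSON lines).
raw_lines = [
    '{"sensor_id": "a1", "temp_c": 21.5, "ts": 1700000000}',
    '{"sensor_id": "a2", "temp_c": 19.8}',
    '{"sensor_id": "a1", "humidity": 0.41, "ts": 1700000060}',
]

# Re-infer a schema: the union of fields seen, with observed value types.
schema = {}
for line in raw_lines:
    record = json.loads(line)
    for field, value in record.items():
        schema.setdefault(field, set()).add(type(value).__name__)
```

Note that no single record carries the full schema; it only emerges from scanning the set, which is exactly the work a database would otherwise have done on ingest.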

Today there are new database technologies that can be combined with scale-out storage platforms to bring significant benefits.  These OLDR, or Online Data Retention, databases can scale to support extreme ingestion rates and capacities; support SQL with ODBC and JDBC access; include built-in, automated information-management features; and offer significant compression optimized for semi-structured and structured big data.  The result is a much smaller footprint and improved analytic performance and data access.  I’ve also found that these databases typically have very low management requirements: once they are set up, they don’t require tuning or further configuration.  With capabilities like these, these innovative technologies can serve as the tier-1 location for big datasets, with many benefits over traditional or temporary band-aid strategies, and at a significantly reduced cost.
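The practical appeal of SQL with ODBC/JDBC access is that existing query skills and tools carry over unchanged.  I can't demonstrate an OLDR product directly, so the sketch below uses SQLite purely as a stand-in to illustrate the standard-SQL access pattern; in a real deployment the same statements would run through an ODBC or JDBC driver against the retention store.

```python
# Sketch: the standard-SQL access pattern the post describes. SQLite is
# a stand-in here; an OLDR store would be reached via ODBC/JDBC instead.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER, source TEXT, level TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "app", "info"), (2, "app", "error"), (3, "db", "error")],
)

# Plain SQL works the same regardless of the backing store.
rows = conn.execute(
    "SELECT source, COUNT(*) FROM events "
    "WHERE level = 'error' GROUP BY source"
).fetchall()
```

Because the interface is ordinary SQL, analysts and reporting tools don't need to know whether the data sits in a traditional warehouse or a retention tier.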

For some, it might take a hybrid approach that leverages their existing infrastructure and skill sets and complements them with new solutions.  To meet this need, OLDR databases can be combined with existing databases or data warehouses in a tiered archive model that extends the scalability of an existing database or data warehouse while significantly lowering the cost of the entire system.  For many, this is a compelling solution, or a first step in a staged implementation of big data, big analytics, and big insights.
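The tiered archive model boils down to a routing decision: recent, hot records stay on the primary database, while older ones move to the cheaper retention tier.  A toy sketch, where the 90-day cutoff and the tier names are illustrative assumptions rather than anything prescribed by a specific product:

```python
# Toy sketch of tiered archiving: route records by age between a primary
# ("hot") tier and a cheaper archive tier. Cutoff and names are assumptions.
HOT_RETENTION_DAYS = 90

def route_record(age_days: int) -> str:
    """Decide which tier a record belongs on, by age."""
    return "primary" if age_days <= HOT_RETENTION_DAYS else "archive"

records = [{"id": i, "age_days": age} for i, age in enumerate([5, 40, 200, 400])]
tiers = {"primary": [], "archive": []}
for r in records:
    tiers[route_record(r["age_days"])].append(r["id"])
```

The design benefit is that the existing warehouse keeps serving the hot tier it already handles well, while growth lands on the lower-cost tier instead of forcing a warehouse expansion.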

These kinds of systems and other big data technologies, here today and yet to come, will enable organizations to pursue their own goals around leveraging old and new data to uncover opportunities and differentiation.  Organizations that are slower to improve the storage, management, and accessibility of their big datasets may be left behind by competitors who are already tackling these issues and becoming more efficient.

As we move to a smart and connected economy, the successful or unsuccessful transformation of bits to data, data to information, information to knowledge, and knowledge to wisdom will determine which organizations will be the winners over the next 10 years.