By Floyd Christofferson
It is sometimes hard to fathom the sheer magnitude of the explosion of data created in the short time since computing began. According to some analysts, in 2013 we are generating as much data every two days as was created in all of human history up to ten years ago, roughly 5 billion gigabytes. More surprising, it is estimated that by next year we will generate that much data every 10 minutes.
Our digital universe, as it is called, is estimated to grow to 40,000 exabytes by 2020, which is roughly 40,000 billion gigabytes.
Even though the Active Archive Alliance has only been around for a couple of years, active archiving has been a core design principle for some of SGI's biggest customers, who have operated some of the largest active archives on earth, for over two decades.
Consider CSIRO, Australia's national body for scientific research, which has been running a distributed nationwide active archive powered by SGI's DMF tier-virtualization solution for more than 21 years. In all that time, every bit of their continuously expanding data sets has been online and available to a nationwide network of researchers without interruption and without any data loss. This reliable online data accessibility has enabled them to continually pursue projects as diverse as the invention of Wi-Fi technology and the development of the first polymer banknotes.
It has become axiomatic that data of all types is growing at mind-numbing rates across all industries. According to Google's Eric Schmidt, as much data is created every two days as was created from "the dawn of civilization until 2003." This is not news. Anyone managing an IT infrastructure today is painfully aware that the growth of data is relentless and shows no sign of letting up.
According to virtually every study, analysis, pundit or perspective, the growth of file-based enterprise data is skyrocketing. Gartner research from March 2011 forecasts a compound annual growth rate of 55% in raw terabytes of external storage arrays over the next five years. That translates into growth from 11.8 million terabytes sold in 2010 to a projected 107.5 million terabytes in 2015.
This growth rate does not include drives inside laptops or desktop computers. It doesn’t include drives that might be used in a myriad of other devices and technologies. This growth directly represents the increase in storage infrastructure for business, new file-oriented applications used in both enterprise and technical computing, disk-based backup and archive deployments and expanded server and desktop virtualization projects.
Gartner is but one voice making this prediction. IDC, Frost & Sullivan, and almost any IT manager will tell you the same thing: data is exploding exponentially, and in the process it is forcing primary storage and backup infrastructures to grow massively, and expensively, to keep pace.
But there is a companion problem that goes along with the issue of data and infrastructure growth. The problem is that even though more and more files are filling up ever-larger disk silos, the utilization of those files does not necessarily increase at the same rate. In other words, people may be creating more and more files, but they are still using them only a few at a time. My own hard drive has tens of thousands of presentations, documents, photos, emails, and other files, the large majority of which I have not touched in months or years. And yet, I want them available at all times for when I need them.
Translate this to an enterprise and the problem becomes astounding, because it moves from the realm of personal preference (I want my files available all the time) to business necessity (my business needs access to its data at all times).
Researching the Problem
In a 2008 study at the University of California, Santa Cruz funded by the National Science Foundation, an active storage pool of 22TB used by 1500 employees in business and technical workflows was analyzed for utilization of network file system workloads. In other words, they studied usage patterns for the type of workloads used in virtually every enterprise in the world.
What did they find? Files live an order of magnitude longer than in previous studies. Files are rarely reopened; 95 percent are reopened fewer than five times. Over 60% of file reopens occur within a minute of the first open. Over 76% of files are never opened by more than one client, and of those that are opened by others, 90% of sharing is read only. Finally, most files are not re-opened once they are closed.
The net of this is that, in that 22TB environment, most of the files sitting in those arrays are never going to be reopened or changed. And yet, just like the files on my laptop that I haven't touched in a couple of years, business users have a very difficult time determining which files to delete or remove from active storage, so datacenter disk infrastructures keep growing at an astronomical rate. This problem is compounded by the cost of that growth: not just the acquisition of new disk arrays, but also the cost of backing up those arrays, of adding data center space, and of powering and cooling disk drives that spin continually but are seldom used.
Translated into real numbers, a leading network-attached, disk-based storage array uses about $91 per TB per year in power when operational, or roughly 32 kWh per cabinet. For a 2-petabyte system (2,000 TB at about $91 per TB), this works out to roughly $190,000 per year in operational power costs alone at typical U.S. utility rates, not including data center space and cooling costs. Yes, the data is available at all times for users to access, but at what cost?
Inactive Data Sitting on Active Disk
The solution is as simple as it has often been elusive: create an active archive where the data is available in an 'online' state for easy access, where the data is protected for extremely long-term retention, and where the operational cost is extremely low. If any of those three elements is missing, archive strategies tend to fall flat.
The problem is that most archive solutions only address part of the issue. When backup (protection of active data) gets confused with archive (retention of inactive data), data paralysis occurs. Backup and restore times become impossible to manage because they involve both inactive and active data. Seldom-accessed data becomes hard to find. Operating costs skyrocket when additional production disks are needed just to keep up with the relentless growth of data.
What’s worse, the excessive growth of data stored on production disk can be a contributing factor to data becoming segmented into incompatible silos. This makes collaboration between different areas either impossible, or at best a manual process prone to error and wasted effort.
Users deal with files but are forced to work within file systems. The job of a proactive data management strategy – an active archive strategy – is to let users focus on their work rather than waste time, infrastructure, or energy just setting up to do their work. That is what the toolkit of active archiving solutions enables: a way for IT managers to keep data accessible, affordable, and protected without requiring users to add work to their day trying to understand how data is managed, where data is located, or what steps need to be taken to ensure data protection.
Readers of the Active Archive Alliance site, and those who are dealing first hand with the explosion of data, are faced with many of the same problems over and over. Key questions arise in managing data, or more importantly, in cost-effectively enabling users to have easy access to data as it grows. Questions such as:
- How much inactive data is sitting on active production disk? (A rough way to estimate this is sketched after this list.)
-- Simply adding more disk is not a long-term strategy
- As production disk grows, backup windows also grow.
-- Unless archive and backup solutions are aligned, bottlenecks occur. Backup and restore times become unmanageable.
- Growing storage creates silos of data, often incompatible silos.
-- Collaboration becomes difficult, often impractical. Management becomes unwieldy.
- Unchecked data growth = increased operating costs, power, space, cooling issues.
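Before anything else, it helps to put a number on that first question. The following is a minimal sketch, not a product feature: it walks a directory tree and totals the size of files whose last-access time is older than a cutoff. The 90-day threshold, the reliance on atime (many systems mount with noatime, where mtime is a safer proxy), and the reporting units are all illustrative assumptions.

```python
#!/usr/bin/env python3
"""Rough estimate of how much data on a volume has gone cold.

Sums the size of files whose last-access time (atime) is older than a
cutoff. Threshold and use of atime are assumptions, not recommendations.
"""
import os
import sys
import time

CUTOFF_DAYS = 90  # illustrative threshold

def scan(root: str, cutoff_days: int = CUTOFF_DAYS):
    cutoff = time.time() - cutoff_days * 86400
    total = cold = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that vanish or deny access
            total += st.st_size
            if st.st_atime < cutoff:  # or st.st_mtime on noatime mounts
                cold += st.st_size
    return total, cold

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    total, cold = scan(root)
    pct = 100.0 * cold / total if total else 0.0
    print(f"{cold / 1e12:.2f} TB of {total / 1e12:.2f} TB "
          f"({pct:.0f}%) not accessed in {CUTOFF_DAYS} days")
```

Even a crude report like this tends to make the scale of the inactive-data problem visible to both IT and the user community.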
The premise of an active archive is to keep primary, production storage small, or as close to a constant as possible. Growth in data is absorbed by lower-cost tiers through an HSM or tiered storage virtualization strategy, and each of the problems noted above becomes manageable.
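To make the tiering idea concrete, here is a toy sketch of an age-based migration policy in that spirit: files untouched for a configurable period are moved from a hypothetical production path to a hypothetical archive path. This is not how SGI DMF or any particular HSM is implemented; a real tier-virtualization layer keeps migrated files visible at their original paths, which the sketch only notes in a comment. The paths and the 180-day threshold are invented for the example.

```python
#!/usr/bin/env python3
"""Toy illustration of an age-based tiering policy (not a real HSM)."""
import os
import shutil
import time

PRODUCTION = "/mnt/production"   # hypothetical fast, expensive tier
ARCHIVE = "/mnt/archive"         # hypothetical cheap, capacity tier
AGE_DAYS = 180                   # illustrative threshold

def migrate_cold_files(src=PRODUCTION, dst=ARCHIVE, age_days=AGE_DAYS):
    cutoff = time.time() - age_days * 86400
    for dirpath, _dirs, files in os.walk(src):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue
            if st.st_mtime >= cutoff:
                continue  # still warm, leave it on the fast tier
            rel = os.path.relpath(path, src)
            target = os.path.join(dst, rel)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # A real tier-virtualization layer would leave a stub or link
            # so the file stays visible at its original path.
            shutil.move(path, target)
            print(f"migrated {rel} -> {target}")

if __name__ == "__main__":
    migrate_cold_files()
```

The point of the sketch is the policy, not the mechanics: production disk stays roughly constant in size while growth lands on the lower-cost tier.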
But often the problem is that users are unwilling, or unable, to determine what portion of their data can truly be migrated to lower tiers. They fear that even if their data has not been accessed for a significant period of time, it might be too difficult to get back if it is not immediately available on production disk – which brings us back to the problems noted above.
A recent conversation with an IT director of a major pharmaceutical company illustrated the dilemma: “We spent a lot of money on an ‘archive’ solution, which sits at 10 percent utilization,” he said. “Even though it works very well, and there is little latency in getting data back into production environments, the user community can’t agree on which portion of the data should be archived. And so they archive nothing.”
An active archive solution minimizes this, because the content is always available. But even then there is the problem of how to determine what should be kept, what tier the data should live on, and how to ensure that it will be around forever.
Metadata is the starting point... Metadata evolution is the key.
At the heart of this problem is the difficulty of categorizing data: knowing whether it is something to keep forever, to throw away, to keep in active storage or a faster tier, or to move downstream. Compounding the problem, a true archive's indexing and management scheme needs to take into account that metadata schemas will change over time. Metadata evolves, just as language does. New data types will emerge, and new use cases will be added that must be accommodated so that a query into the archive pulls back all relevant content, not just a subset of it. Only by employing an indexing scheme that can harvest metadata actively and automatically, and do so in the face of constantly evolving metadata, can an active archive be persistent for the long haul.
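As an illustration of that last point, here is a small, hypothetical sketch of a schema-less metadata index: each archived object carries a free-form set of metadata fields, so new fields can be harvested later without re-ingesting the archive, and queries still pull back older content by whatever fields it does have. The object names and field names are invented for the example and do not reflect any particular product's catalog.

```python
"""Sketch of a metadata index that tolerates evolving schemas."""
from typing import Any, Dict, Iterator

class MetadataIndex:
    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, Any]] = {}

    def harvest(self, object_id: str, metadata: Dict[str, Any]) -> None:
        """Add or extend metadata for an object; unknown fields are welcome."""
        self._entries.setdefault(object_id, {}).update(metadata)

    def query(self, **criteria: Any) -> Iterator[str]:
        """Yield ids whose metadata matches every given field=value pair.

        Objects indexed before a field existed simply never match on it,
        so old content is still retrievable by the fields it does have.
        """
        for object_id, meta in self._entries.items():
            if all(meta.get(k) == v for k, v in criteria.items()):
                yield object_id

# Usage: the original harvest knows only basic fields...
index = MetadataIndex()
index.harvest("scan-0001.tif", {"project": "wifi-research", "year": 1992})
# ...years later a new field is added to the schema and back-filled.
index.harvest("scan-0001.tif", {"instrument": "radio-telescope"})
print(list(index.query(project="wifi-research")))       # ['scan-0001.tif']
print(list(index.query(instrument="radio-telescope")))  # ['scan-0001.tif']
```

The design choice that matters here is that the index, not the user, absorbs schema change: new fields can appear at any time without breaking queries against older content.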
A true active archive is one that can stay alive for decades... or until the data is no longer relevant. By decoupling production disk from the archive, we get hardware independence: a refresh of back-end infrastructure does not limit the users' ability to access data, no matter what platform may appear in the future. In the same way, archive schemes must allow metadata to evolve retroactively. That way, not only are the files physically available, but the archive itself stays fresh and users can easily find the content they seek. New or old, the archive remains a vital asset, and the problems of isolated data and underutilized or expensive infrastructure can be minimized.
Creating an Active Archive strategy to address both archive and backup in the midst of data explosion
The key to any archive is not only being able to preserve the data for future use, but also to be able to actually find it in a practical and timely manner. The more difficult it is to do this, the less likely it is that people will do it, or do it effectively.
Taking proactive control of data is an essential requirement to getting the full value of data. Only with the proper strategy and the necessary hardware and software tools is it possible to make the management of active data, inactive data and protection data seamless and essentially transparent to the user. Only when looking at all three of those data types as a whole will the full “Time Value” of data be realized and be manageable. As an added benefit, this approach usually results in reduced operational costs.
The Time Value of Data white paper digs into the growing problems of backup and archive, and the typical difficulties when the two are commingled in the midst of a massive data explosion. By addressing the core strategy for creating an active archive, the paper helps IT managers understand how to apply the tactics used at the world's largest archives to the day-to-day realities of their own environments.
To view the Time Value of Data white paper, click here.
SGI has joined the Active Archive Alliance because taking proactive control of data is an essential requirement to getting the full value of your data. Finding the best tools for creating and managing extremely large amounts of data is what SGI customers are all about. Only with the proper tools that make the management of active, inactive and protection data seamless and essentially transparent to the user will the full "Time-Value" of data be realized.
The Time-Value of data is just what it sounds like: when data is fresh, it is typically more valuable. The value of some data decays over time (old receipts, outdated reports, etc.). The value of other data increases over time (NASA footage of the first men walking on the moon). But the methods for understanding the relative value of data are not as simple as these examples might suggest. What if the outdated report is also a necessary piece of evidence that might be needed for compliance or historical purposes? At what point does one realize that it is not just an outdated file?
In addition to understanding the Time-Value of data, it is also important to understand whether that data needs to be active or not. According to IDC, only 40% of fixed data is active or accessed frequently. Forrester Research says that 85% of production data is inactive, with 68% having not been accessed in 90 days. So while this data needs to be accessed sometimes, it doesn't need to be filling up expensive production disk capacity.
So the Time-Value of data includes not only whether the data is important over the long term, but also whether it needs to be immediately accessible. When these things are not proactively managed, and when backup (protection of active data) gets confused with archive (retention of inactive data), data paralysis occurs. Backup and restore times become impossible to manage. Seldom-accessed data becomes hard to find. Operating costs skyrocket when additional production disk is needed just to keep up with the relentless growth of data.
Users deal with files but are forced to work within file systems. The job of a proactive data management strategy, an Active Archive strategy, is to let users focus on their work and not waste time, infrastructure, or energy just setting up to do their work. That is what the toolkit of Active Archiving solutions enables: a way for IT managers to keep data accessible, affordable, and protected without requiring users to add work to their day figuring out how that is done, where the data lives, or what steps need to be taken to get to it.
In successive articles on this blog, SGI and the other Active Archive Alliance members will explore various strategies for achieving the ideal of getting the data management methodology out of the way of the data users. Although common organizing principles are immediately evident, the discussion will reveal that the actual solutions differ across environments. It is in finding those common threads that the Time-Value of these concepts will also emerge.