The Decision Tree for Archiving Data

by Dave Thomson

Many users recognize the need to archive data for compliance reasons, to improve data preservation or to reduce storage costs. Beyond these baseline requirements, your own environment imposes specific demands that an archive must support. For example:

  • Total capacity
  • New capacity per day
  • Smallest/average/largest file size
  • Average file age and last retrieval dates
  • Estimated retrievals per day (and type of retrieval: single file or file sets)
  • Existing archived data (technology and formats used)
  • Redundancy requirements
  • Plan for archive migration

Follow our decision tree for data archiving
In every case, we follow a decision tree that identifies the best and most economical solution for your individual circumstances. A variety of data storage technologies are available, including tape, disk, object storage and cloud storage.
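A decision tree of this kind can be sketched in code. The thresholds and technology choices below are purely illustrative assumptions for the environment metrics listed above, not the actual criteria used in the article's decision tree:

```python
# Illustrative sketch of an archive-technology decision tree.
# Every threshold and outcome here is a hypothetical example,
# not the decision tree referenced in the article.

def choose_archive_tier(total_capacity_tb, retrievals_per_day,
                        needs_offsite_copy):
    """Pick an archive technology from a few environment metrics."""
    if retrievals_per_day > 100:
        # Frequent access favors disk-based object storage.
        return "object storage"
    if needs_offsite_copy and total_capacity_tb < 50:
        # Modest capacity plus offsite requirements maps well to cloud.
        return "cloud storage"
    if total_capacity_tb >= 500:
        # Very large, rarely accessed archives favor tape for cost.
        return "tape"
    return "disk"
```

For example, a 1,000TB archive with only a handful of retrievals per day and no offsite requirement would land on tape under these illustrative rules.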

This blog was excerpted from Around the Storage Block.

Back to Basics

David Thomson – SVP Sales and Marketing – QStar Technologies

Five years ago, a group of companies came together to form the Active Archive Alliance. The group agreed that the term “archive” was regularly being misused, often to describe retaining backups for long periods of time. We saw archiving differently: as a process, separate from backup, that secures non-changing data while keeping it available to the user or application that created it.

Today it seems that although many organizations understand this message, many more do not. I am still perplexed when IT staff fail to grasp the significant advantages of active archive technology. That inspired me to write this blog and restate the benefits of an active archive, and what it means in 2015.

How much data within an organization is static or unchanging? For many organizations, it is a significant percentage, and there are simple, sometimes free, tools to help users understand how much data we are talking about. 
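One of those simple tools can be approximated in a few lines: walk a file tree and treat any file untouched for some window (180 days below, an arbitrary assumption) as static, then report the static share of total bytes. This is a minimal sketch, not any specific vendor's tool:

```python
import os
import time

def static_data_share(root, days=180):
    """Estimate what fraction of the data under `root` is static,
    treating files not modified for `days` days as unchanging."""
    cutoff = time.time() - days * 86400
    total = static = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or unreadable; skip it
            total += st.st_size
            if st.st_mtime < cutoff:
                static += st.st_size
    return static / total if total else 0.0
```

Running this against a departmental share is a quick way to put a number on how much data a prospective archive would need to absorb.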

We do not archive changing or evolving data; such data is secured using RAID, replication, snapshots and backup, all of which are expensive and often time-consuming. Most of the time involved is spent ensuring that these processes work correctly and that, if something fails, data is recoverable rather than lost.

Archiving is about securing unchanging data in a different way. As data is ingested into the archive, it is written to multiple places or media. Should one site or medium fail, there should always be a second, and sometimes a third, place from which to access the data. The failover can be an automatic switch to a second repository or require manual intervention; the choice is left to the organization based on its budget and minimum response times.
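The write-to-multiple-places pattern can be sketched in a few lines. The directories below stand in for independent repositories or sites; this is an illustrative toy, not how any particular archive product implements ingest:

```python
import os
import shutil

def ingest(src, repositories):
    """On ingest, copy the file into every archive repository."""
    for repo in repositories:
        os.makedirs(repo, exist_ok=True)
        shutil.copy2(src, os.path.join(repo, os.path.basename(src)))

def retrieve(name, repositories):
    """Read from the first repository that still holds the file,
    falling back to the next copy if a site or medium has failed."""
    for repo in repositories:
        path = os.path.join(repo, name)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
    raise FileNotFoundError(name)
```

The ordering of `repositories` encodes the failover policy: put the fastest (or cheapest to read) repository first, and retrieval silently falls through to the surviving copies.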

By relocating significant amounts of data that is unchanging into an archive environment, primary data sets of constantly changing information can be more easily and cost effectively protected. Backup windows are reduced, the replicated capacities are reduced and the frequency of snapshots could be increased.

Active archives can be as fast or slow, as expensive or low-cost as an organization needs.  You are not forced to use tape libraries, although many organizations do, due to their low total cost of ownership. Many active archives use SSD, disk, optical and/or cloud to store and secure data. It all depends on the individual requirements of the organization and the static data they are archiving.

If architected correctly, active archives can benefit the entire organization by categorizing data and protecting it using the most economical methods for that data type.

How to Implement an Active Archive for HPC Research

By Eric Polet

The world of high performance computing requires ever-present data accessibility along with scalable capacity. The data management required for computational and data-intensive methods in a high-performance infrastructure creates unique challenges for those tasked with maintaining data and ensuring its long-term integrity and availability. Research and high performance computing (HPC) sites face the challenge of retaining the ever-growing amount of data generated by employees and computers. That data's value extends far beyond what can be gained from it today: to remain useful, it must be kept for decades so it can be reexamined as future advancements are achieved.

HPC requires active archive solutions that are, among other things, reliable, scalable, cost-effective and energy efficient. As data volumes grow, it's imperative that new solutions can sustain the organization's anticipated data growth while seamlessly replacing legacy equipment. The National Computational Infrastructure (NCI), home of the Southern Hemisphere's fastest supercomputer and Australia's highest performance research cloud, was facing this data growth problem. NCI's supercomputer supports more than 5PB of data that must be backed up and archived. Faced with significant forecasted growth, NCI wanted to implement an updated, single archive solution. This goal was achieved with an active archive solution created by Spectra Logic and SGI.

How did they do it?

NCI selected an active archive approach to manage its data, which has proven to operate flawlessly. Active archive solutions turn offline archives into visible, accessible extensions of online storage systems, enabling fast and easy access to archived data. “The incorporation of an active archive solution provides a platform for storage growth,” said NCI associate director Allan Williams. “It allows us to keep our primary data online and accessible to users, while also increasing the reliability of our stored data across physical sites.” The organization can easily scale its storage solution as its data continues to expand, driven by NCI's depth of engagement with research communities and organizations. Some of the key features gained by the implementation of NCI's active archive solution are:

• Extreme scalability
• Intelligent data management
• High data reliability
• Portable data storage solution
• Low cost per terabyte
• Reduction in energy costs and space
• Performance and uptime

HPC organizations that need a scalable storage solution face a number of difficult decisions about how to store and archive their data. Important factors to consider when selecting an archive solution include scalability, data reliability and affordability. Active archive's intelligent data management framework gives organizations file-level access to data at a significantly reduced cost. When NCI introduced its active archive solution, it gained a dense, high-capacity storage solution for its cloud installation with significant economies of scale and data integrity safeguards. By selecting an active archive solution, NCI has created a long-lasting and reliable storage solution for the country's largest supercomputer.

A Perfectly Rational Approach to Data Hoarding

by Mark Pastor

People and the companies they work for hoard data; it's a fact borne out in survey after survey. Hoarders are not always proud of their habit and are often curious about the options available. Contrary to popular belief, in many cases it is OK to hoard data. Sometimes it is necessary, and in many cases the data being saved can be of great value to the company. Clarity on the purpose and requirements in your own organization will provide insight into best practices for maximizing the value of the content you keep with the greatest efficiency.

The Four Hoarder Personas

There are four hoarder personas: Pacifist, Captive, Opportunist and Capitalist. Take a look below to decide which of these best describes your situation and to get ideas on best practices and technologies for your situation.

Pacifist. This persona describes an individual or an organization espousing the policy that it is OK to keep everything, even when there are no requirements to retain data. There are no formal data deletion policies or guidelines for deciding what to delete. These users don't take the time to delete their content, and IT is not empowered to delete it for them. The risk and cost of doing nothing differently are tolerable on all fronts. Storage and protection costs are acceptable; backup windows are satisfactory; there is no legal exposure from keeping all that content lying around; and there is no motivation to shave costs of storage or infrastructure. If this describes your situation, congratulations on finding a rare nirvana.

Captive. Regulations and corporate policies are driving the need to hoard data for years or even decades. The day-to-day business value of the preserved content is negligible. Time-to-data and performance metrics, if they exist, will help decide between the likely technology choices below. Organizations involved in finance and healthcare are well represented in this persona.

Opportunist. This group generates and acquires valuable content. They have made substantial investments to develop the content, and it would be sinful not to have it available when a perfect use arises in the future. They often want to contrast with, or build upon, historical snapshots, or perhaps take advantage of an opportunity to monetize the content. The use of the Opportunist's hoarded content is generally unplanned: an opportunity will surface, and if it is not easy to get to the relevant content, the chance to leverage it may quickly disappear. The organization that can be nimble and regularly draw from the past can gain tremendous advantage. Those who can go beyond merely current content will be the star performers.

Capitalist. Content is king. Capitalists are in the content business and generate or capture content that is difficult if not impossible to reproduce. They market, sell and otherwise monetize their content. Their data and content are core to their business strategy, and success is measured by how quickly they can deliver the content, how economically they can store it until it is needed, and even by the volume of the repository from which they draw.

Which type of hoarder are you?

Use Case Requirements and Technologies

The personas above each carry a set of requirements for data storage architectures. Longer time to access data is acceptable to some while completely unacceptable to others. However, in almost all cases, when hoarding large amounts of content, the most important thing to avoid is using expensive high-performance storage for the hoarded content.

There are many great tools available to help understand how much of a company's content is not active (typically 50%-80%) and to reinforce that inactive content should be stored on a less expensive tier (LTFS tape, object storage disk or cloud). Cheap NAS is not a good option once the cost of protecting content is considered: protection software and replication hardware will be added, raising the cost of ownership and burdening infrastructure.

When discussing best practices, referring to specific storage technology choices is unavoidable. Two key areas must be understood to have a complete view of best hoarding practices: data movers and storage technologies.

The table below simplifies and summarizes the key attributes of storage technology choices that need to be considered for the various hoarding architectures.

Best Practices Based on Persona

Pacifists and Captives: Leverage Your Backup Process. Retained data is not strategic for you so investments should be focused on protecting the currently active data and leveraging that process for long term retention.  Disk with deduplication or tape backup are both very acceptable alternatives. Speedy access to retained content is not critical, so it is acceptable to leverage backup jobs for retention by copying tape backup to deep archive, or sending a copy of backup data to be archived in a cloud.

Opportunists: Deploy a Cost-Effective Active Archive.  You want to take advantage of content when it’s needed, and you cannot predict when that will be.  LTFS tape or object storage disk are very cost-effective means of hoarding content.  These technology choices enable ready access (active archive) to content. Where high growth, larger scale and global access are important, object storage is the obvious choice, though LTFS tape behind a global access infrastructure is still worth considering.

Capitalists: Integrate Active Access and Content Protection. Disk backup is valuable where feasible, but backing up very large content sets is not always practical; some span tens to hundreds of terabytes or more. For these environments, archive and protection need to be one and the same. Dispersed-data object storage is perfect for this use case: data can be cost-effectively and simultaneously stored and protected. Smaller environments (i.e., less than 200TB of data) may do well with LTFS tape, but larger environments still need to consider object storage for their hoard.

As you can see there are many good reasons for hoarding data, and as the motivations for hoarding become clear, so does the best way to manage it.

As previously published on Wired Innovation Insights, Jan 6, 2015

2015 Trends in Data Storage and Archiving

As we predicted at the end of last year, active archives became a more mainstream best practice in 2014. Businesses and organizations are recognizing the value of active archives in addressing their overall long-term data storage needs.

As we begin 2015, Active Archive Alliance members shared their predictions for data storage as it relates to active archives in the coming year. Here's a look at what’s to come according to some of the industry’s top storage experts:

  • Advanced Data Tape Will Carry More of the Storage Load

With all the significant innovation occurring in the tape market, the pieces are in place for tape solutions to expand their presence in the data center and carry more of the storage load in 2015. The timing could not be better as users struggle with increasing data loads and limited budgets. New and exciting innovations like LTFS, Barium Ferrite, tape NAS, Flape (flash + tape), tape in the cloud, new high capacity formats and newly extended roadmaps are all coming together to provide best-practice solutions for data protection and active archiving.

  • There Will be Increased Adoption of Storage Tiers

The need for large-scale data capacity is driving the implementation of an increasing number of tiers of storage across a growing number of organizations.  There will be an increase in Tier 0 with a tidal wave of flash adoption for the fastest form of storage as well as a multi-tier approach to long-term data, with the rapid adoption of public cloud and an anticipated swift increase in private cloud creation. Combinations of flash, disk and tape are being used in both public and private clouds to meet custom requirements. An increasingly complex storage environment will become the norm, with specific data being placed on specific storage technologies for specific periods of time with automated "data fluidity" systems controlling the life-cycle process.

  • Greater Intelligence Between Applications and Storage Will Simplify Active Archive Deployments

Applications that can be integrated with storage will improve overall storage management by removing complexity and helping organizations to better utilize active archive solutions. Solutions will use intelligence to deliver the right storage to meet application performance while driving efficiencies that help keep storage costs within targeted budget requirements.

  • There Will Be a Move to Object Storage as an Archive

There is a big movement in the industry toward object storage as an archive. Object storage is attractive for several reasons: 1) it is massively scalable; 2) it is cost effective; and 3) it can also act as a cloud infrastructure for collaboration. The trend is being accelerated because there are now many ways to access an object-based archive, including NFS, CIFS, mobile OSs and more.

As the demand for more cost-effective, long-term storage options continues, active archives will proliferate. The Active Archive Alliance will support technology expansion and innovation to address the newest advancements in data storage.

Alliance members Crossroads Systems, DataDirect Networks, Fujifilm, and QStar contributed to this blog.

Using LTO Tape to Complement Your Cloud Storage Active Archive

So you’ve decided to deploy a cloud storage solution to protect your active archive data. What will you do if you lose your network connection and need your data? What if your cloud provider decides to close its doors? How will you deal with the slow upload speeds? How about data that you don’t want to store in the cloud because of security concerns? By adding a complementary tape-as-NAS (tNAS) solution on the front end, you can now address these issues using cost effective LTO tape storage.

You’re Not Alone

The issues described above are shared by many IT professionals. The Enterprise Strategy Group polled IT professionals across a variety of business sizes and found that 84% were using some sort of public cloud service. Moreover, 69% of these users were very interested in using their own on-premises storage for some or all of their data. Most users were concerned with overall data protection and security, and 34% were also concerned about performance issues. By using a tNAS solution on the front end of the active archive cloud solution, you can easily mitigate these issues.

The Latest in Tape Technology

In one of my previous blogs, Tape Ensures Future of Active Archives, I wrote about a customer who asked if tape was an essential component of an active archive. The answer is “no,” but if you are faced with increasing data growth and need a cost-effective, reliable, long-term solution, then the answer is decidedly “yes.” In a typical active archive environment, data migrates by policy from expensive primary storage tiers to a more cost-effective tier while maintaining the convenience of online file access to all of the data. Tape is ideal for this application based on its economic benefits, high capacity, low energy consumption, superior error rates and long archival life.
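Policy-based migration of this kind can be sketched with an age threshold and a stand-in for the stubbing that keeps migrated files visible online. The symlink below loosely mimics that behavior; real active archive products use their own stubbing and HSM mechanisms, and the 90-day threshold is an arbitrary assumption:

```python
import os
import shutil
import time

def migrate_cold_files(primary, archive, days=90):
    """Move files idle for `days` days to the archive tier, leaving
    a symlink behind so applications keep their familiar file paths
    (a toy stand-in for an active archive's stub files)."""
    cutoff = time.time() - days * 86400
    moved = []
    os.makedirs(archive, exist_ok=True)
    for name in os.listdir(primary):
        path = os.path.join(primary, name)
        if os.path.isfile(path) and os.stat(path).st_mtime < cutoff:
            target = os.path.join(archive, name)
            shutil.move(path, target)
            os.symlink(target, path)  # keep the original path readable
            moved.append(name)
    return moved
```

The point of the sketch is the policy split: hot files stay untouched on primary storage, while cold files physically move to the cheap tier yet remain openable at their original paths.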

Using Flash as Part of an Active Archive

I am not fond of made-up terms like “Big Data,” which is used and re-used in all sorts of ways to promote all sorts of products. So, you can imagine how I felt over the summer when I began to see words like “Flape” and “Floud” emerge for the first time. If you haven't seen these terms before, “Flape” is a combination of flash and tape, while “Floud” is a combination of flash and cloud.

Yuck! I hate the terms… But I do like the concepts behind them.

Drowning in Backup? Active Archive Is the Lifeboat

You have to protect business data. This typically translates to: you have to back it up, and there are hundreds of backup solutions out there. Why, then, do so many storage administrators say that managing backup data is their biggest challenge? As data continues to grow, content is lodged in backup cycles, increasing backup windows and spurring the need for more storage and data protection investments. To make matters worse, 50-80% of enterprise data will likely never be accessed again after it is first created.

The solution? An active archive with built-in data protection

Workflow Step One: Archive?

A new paradigm for big media and post-production workflows in a high resolution world

Ever tried standing on your head to get a new perspective on a problem? It sounds a bit silly, but you never know, it might just do the trick. This unorthodox approach is akin to what a few media & entertainment storage architects have begun doing to solve their problem of exploding storage driven by higher resolutions, stereo imagery and faster frame rates (4K, 5K, 8K, 3D, 48fps, 60fps). These architects are turning their workflow on its head, starting with the last step first.