Data Management, AI, and the Active Archive
In the next few years, we will see a massive increase in the volume of data stored; some estimates put the global total at close to 200 zettabytes by the end of 2025. To cope with that volume, and to render it as useful as possible for AI, data must be well organized. At a recent Active Archive Alliance video conference, "AI Needs Active Archive," three experts gave their take on how to prepare for the data explosion.
Dr. Catrin Kersten, Marketing Manager at PoINT Software & Systems, advocates object storage tiering with active archiving as the best way to handle AI and other data-intensive workloads. She noted that AI workloads are reshaping storage requirements: they strain traditional storage technologies, raise data privacy, security, compliance, and archiving concerns, and pose major challenges for energy consumption.
S3 object storage has risen to prominence for data-intensive workloads. It scales in both capacity and performance; the standardized S3 RESTful API makes it easy to implement; and it copes well with distributed architectures, offers high redundancy, and has extensive metadata support. However, relying solely on disk- and flash-based object storage creates problems: rapid data growth drives up storage and energy costs, and sensitive datasets demand more robust protection to meet compliance and regulatory requirements. Hence, tape-based active archives are gaining ground as a sustainable and cost-effective secondary tier for S3-based storage.
“Not all data requires constant use, and tape provides high capacity for cold data at an attractive cost,” said Kersten. “S3 object storage on tape offers cost efficiency, scalability, energy efficiency, sustainability, immutability, and seamless integration.”
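In practice, this kind of tiering is often expressed as an S3 lifecycle rule that transitions aging objects into an archive storage class that the tape layer sits behind. The sketch below uses Python and boto3 against a hypothetical S3-compatible archive endpoint; the endpoint URL, bucket name, prefix, and storage-class mapping are illustrative assumptions, since each S3-to-tape product exposes its own.

```python
import boto3

# A minimal sketch: tier cold objects to a tape-backed class via an
# S3 lifecycle rule. Endpoint, bucket, prefix, and storage class are
# hypothetical; an S3-to-tape gateway defines its own mappings.
s3 = boto3.client(
    "s3",
    endpoint_url="https://archive.example.com",  # assumed S3-compatible endpoint
)

s3.put_bucket_lifecycle_configuration(
    Bucket="research-data",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cold-data-to-tape",
                "Status": "Enabled",
                "Filter": {"Prefix": "completed-projects/"},
                "Transitions": [
                    # After 90 days, move objects to the archive tier
                    # (mapped to tape by the gateway in this sketch).
                    {"Days": 90, "StorageClass": "DEEP_ARCHIVE"}
                ],
            }
        ]
    },
)
```

Because the rule lives in the storage layer, applications keep reading and writing through the same S3 API while cold data drains to tape behind the scenes.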
Disaggregated storage for AI and HPC
Mark Pastor, Director of Product Management at Western Digital, believes the future lies in disaggregated storage rather than servers carrying their own internal storage. This, he says, is the most efficient way to prepare for AI. Why? Modern servers are all about compute power: they need to pack as many CPUs and GPUs as possible into a small form factor, leaving little or no real estate for storage.
“Storage external to the server can be connected to AI workflows, and this storage can be every bit as fast and available as local storage,” said Pastor. “We can minimize latency and have GPUs operating at full capability.”
A primary storage layer might consist of SSDs accessed over NVMe, backed by large pools of HDDs fronted by that faster cache. For example, the Western Digital Ultrastar Data102 Hybrid Storage Platform and the OpenFlex Data24 4200 could support AI with both high performance and high capacity.
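To make the caching idea concrete, here is a toy Python sketch of a read-through cache: a small fast tier (standing in for local NVMe flash) in front of a large shared pool (standing in for disaggregated HDD enclosures). It is purely illustrative; real deployments reach the shared pool over an NVMe-oF fabric, not a Python dict.

```python
# Toy read-through cache: a small fast tier in front of a large
# shared capacity pool. Illustrative only, not a real storage stack.

class TieredStore:
    def __init__(self, pool: dict[str, bytes], cache_size: int = 2):
        self.pool = pool                    # large, shared capacity tier (HDD pool)
        self.cache: dict[str, bytes] = {}   # small, fast tier (NVMe flash)
        self.cache_size = cache_size

    def read(self, key: str) -> bytes:
        if key in self.cache:               # fast path: served from flash
            return self.cache[key]
        data = self.pool[key]               # slow path: fetch from shared pool
        if len(self.cache) >= self.cache_size:
            self.cache.pop(next(iter(self.cache)))  # evict oldest entry (FIFO)
        self.cache[key] = data              # promote hot data into the fast tier
        return data
```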
The benefits of disaggregated storage would include:
* Maximized capacity and performance, independent of the server
* Optimized utilization by sharing storage resources across servers
* Lower-cost storage
Programs, Not Projects
Dr. Kel Pults, Chief Clinical Officer & Vice President of Government Strategy at MediQuant, and co-chair of the Active Archive Alliance, spoke about the importance of data management.
“Most organizations have been in a project-based state of mind,” she said. “It is better to work for the long term and adopt a program-based approach.”
Programs are more encompassing: each contains many projects, all driven toward a long-term goal.
“Instead of thinking, ‘We have an archive project,’ think, ‘We are working on a Data Management Program’,” said Pults. “Data management programs involve security, access, regulatory requirements, and long-term storage.”
She laid out further tips:
* Plan well in advance to include the program in your budget when you move to a new system. Elements like active archiving should not be an afterthought; build them into the budget early.
* Select an archive team and vendor with experience and with subject matter experts (SMEs) who know your industry.
* Complete an inventory of all the hardware and software you have at all sites, including all applications. This is especially vital after a merger or acquisition.
* Put data into buckets such as PHI/PII, regulatory, research, clinical, financial, and analytics, and ask why you are keeping it and what it will be used for (see the sketch after this list).
* Reconcile your accumulated data against your vision: determine what you need to retain, what you can discard, and where the retained data is needed. Use this assessment to choose the best storage architecture.
* Determine access requirements and user needs. Hot data should offer immediate access for use and viewing; medium data should serve occasional use or queries; cold data goes to long-term storage that is rarely queried. Restrict access to only those who need to view or use the data, and make sure it is easy to query.
* Initiate program rollout, monitor outcomes, and adjust the program as required.
* Evaluate overall progress and alignment with your program vision and goals as an ongoing process.
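As a concrete illustration of the bucketing and access-tiering steps above, here is a small Python sketch. The category names follow the list; the tier rules, role names, and example datasets are hypothetical assumptions, not MediQuant policy.

```python
from dataclasses import dataclass

# Hypothetical tier rules mapping data buckets to storage tiers.
TIER_RULES = {
    "clinical":   "hot",     # immediate access for use and viewing
    "analytics":  "medium",  # occasional use or queries
    "financial":  "medium",
    "research":   "cold",    # long-term storage, rarely queried
    "regulatory": "cold",
}

@dataclass
class Dataset:
    name: str
    category: str        # e.g. "clinical", "regulatory", "research"
    contains_phi: bool   # PHI/PII drives tighter access restrictions

def assign_tier(ds: Dataset) -> str:
    """Map a dataset to a hot/medium/cold tier per the access rules."""
    return TIER_RULES.get(ds.category, "cold")  # unknown: archive it

def allowed_roles(ds: Dataset) -> set[str]:
    """Restrict access to only those who need to view or use the data."""
    roles = {"archivist"}                 # illustrative role names
    if ds.category == "clinical":
        roles |= {"clinician"}
    if ds.contains_phi:
        roles |= {"privacy_officer"}      # PHI/PII needs extra oversight
    return roles

# Example inventory (fabricated for illustration).
inventory = [
    Dataset("ehr-encounters-2019", "clinical", contains_phi=True),
    Dataset("claims-2012", "financial", contains_phi=True),
    Dataset("trial-imaging-2008", "research", contains_phi=False),
]

for ds in inventory:
    print(ds.name, "->", assign_tier(ds), sorted(allowed_roles(ds)))
```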
“Thorough preparation and planning are the keys to success,” said Pults. “Incorporate responsible AI and other technologies as needed into your active archiving architecture to organize, move, and optimize your data.”