AI Training and Inferencing: How Correct Archiving Can Aid Both
There are a great many different types and categories of AI: Narrow AI, General AI, Super AI, Reactive Machine AI, Limited Memory AI, Theory of Mind AI, Self-Aware AI, and, of course, the one on everyone’s lips, Generative AI (GenAI).
As GenAI is the flavor of the month, let’s zero in on it and review how active archives can assist it in fulfilling its purpose. That opens up two main areas: archiving for training and archiving for inferencing.
How Active Archives Can Serve GenAI Model Training
Eric Polet, Director of Product Marketing at Arcitecta, covered this topic in a recent Active Archive Alliance presentation, “Why AI Needs Active Archive.” He laid out a variety of ways active archives can be deployed to assist AI.
Training Datasets
Any GenAI engine needs to be thoroughly trained on data to ensure it arrives at accurate conclusions. This includes pre-training on massive datasets and fine-tuning to ensure it works well for specific tasks (such as content generation, image creation, call center chat responses, etc.). Those devising and managing LLMs seek to eliminate errors and hallucinations and reinforce desired outcomes. All of this requires vast computational resources and access to a whole lot of data.
When training GenAI models, it is essential for the model to learn from as large a dataset as possible. The largest large language models contain hundreds of billions of parameters, and a few have topped a trillion, with training corpora to match. Such a high volume of data is unwieldy if it is all treated as having the same value. Hence, tiering is necessary to group datasets and assign them levels of importance and priority. An active archive serves this purpose well, as it allows data to be relegated to a lower tier yet remain rapidly available should the AI engine need it.
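To make the tiering idea concrete, here is a minimal sketch in Python. The tier names, priority scale, and thresholds are all hypothetical; real active archive software exposes its own placement policies.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    priority: int           # 1 = critical for current fine-tuning, 3 = rarely touched
    accesses_last_30d: int  # recent read count from archive telemetry

def assign_tier(ds: Dataset) -> str:
    """Map a dataset to a storage tier based on priority and recent access."""
    if ds.priority == 1 or ds.accesses_last_30d > 100:
        return "nvme_cache"      # hot: feed it straight to the training jobs
    if ds.priority == 2 or ds.accesses_last_30d > 10:
        return "disk"            # warm: quick to stage when a job needs it
    return "active_archive"      # cold: cheapest to keep, still retrievable

datasets = [
    Dataset("curated_finetune_corpus", priority=1, accesses_last_30d=240),
    Dataset("web_crawl_2019", priority=3, accesses_last_30d=2),
]
for ds in datasets:
    print(ds.name, "->", assign_tier(ds))
```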
Inference Datasets
Inferencing is where a trained LLM is harnessed to make predictions and answer queries. The goal is to enable it to arrive at new findings that solve real-world problems and challenges. Once the training phase is over, the datasets involved will be considerably smaller, as will the computational resources required. Nevertheless, an active archive can serve up data when called upon, while higher tiers of storage address queries on more frequently requested topics. Caching, too, plays an important role.
“AI inference can be made faster using smart caching,” said Polet. “It is a good way to predict AI access patterns by pre-loading critical datasets and enabling real-time AI performance.”
Active archives provide access to data for problems that may not require real-time responses. For example, someone analyzing a large number of genomes or massive geological datasets isn’t slowed down by a nominal delay in retrieving data from an active archive. Those kinds of datasets can be assigned to an active archive where they have by far the lowest storage and energy costs, yet the information is close by whenever needed. Other datasets can be assigned to higher tiers, and the most frequently used data can be stored in a cache.
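A rough sketch of the “smart caching” idea Polet describes, assuming a simple frequency-based predictor; `fetch_from_archive` is a stand-in for whatever retrieval API the active archive actually exposes, not a real product call:

```python
from collections import Counter

class PredictiveCache:
    """Minimal sketch: pre-load the datasets a workload touches most often."""

    def __init__(self, capacity: int, fetch_from_archive):
        self.capacity = capacity
        self.fetch = fetch_from_archive   # hypothetical archive retrieval call
        self.cache = {}
        self.access_log = Counter()

    def preload(self):
        # Predict future access from past frequency and warm the cache.
        for name, _ in self.access_log.most_common(self.capacity):
            if name not in self.cache:
                self.cache[name] = self.fetch(name)

    def get(self, name):
        self.access_log[name] += 1
        if name in self.cache:            # cache hit: real-time response
            return self.cache[name]
        data = self.fetch(name)           # cache miss: pay the archive latency
        if len(self.cache) < self.capacity:
            self.cache[name] = data
        return data

cache = PredictiveCache(capacity=2, fetch_from_archive=lambda n: f"<contents of {n}>")
cache.get("genome_batch_17")   # first touch goes out to the archive
cache.preload()                # later reads of hot datasets come from cache
```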
“It is all about how fast and efficiently you can feed in the data and get results by using storage wisely to achieve the desired result,” said Polet.
While storage and an active archive can benefit AI in many ways, they are also beneficiaries of AI technology. AI, for example, can make tiering far smarter and stage data based on accurate predictions about when it will be needed. Use cases include:
* AI-driven indexing
* Metadata-rich search
* AI-powered tagging
* Smart retrieval
Such capabilities reduce query response times, help minimize training delays, and improve model accuracy.
“Those wishing to deploy AI should pick a solution that uses metadata-rich search, AI-powered tagging, and fast search,” said Polet.
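As a rough illustration of metadata-rich search with AI-powered tagging over an archive catalog: the tagger below is a toy keyword scorer standing in for a real ML classifier, and the catalog entries are invented.

```python
# Sketch of metadata-rich indexing and search over an archive catalog.
CATALOG = [
    {"id": "a1", "title": "2019 seismic survey, North Sea", "text": "geological seismic readings"},
    {"id": "a2", "title": "Call center transcripts Q3", "text": "customer support chat logs"},
]

TAG_KEYWORDS = {"geology": ["seismic", "geological"], "support": ["customer", "chat"]}

def auto_tag(record: dict) -> list[str]:
    """Attach tags by keyword match; an ML classifier would slot in here."""
    body = (record["title"] + " " + record["text"]).lower()
    return [tag for tag, words in TAG_KEYWORDS.items() if any(w in body for w in words)]

# Build the metadata index once, at ingest time.
index = {rec["id"]: {**rec, "tags": auto_tag(rec)} for rec in CATALOG}

def search(tag: str) -> list[str]:
    """Metadata search: find archived items by tag without touching the data itself."""
    return [rid for rid, rec in index.items() if tag in rec["tags"]]

print(search("geology"))  # -> ['a1']
```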
The Challenges of Leveraging Archived Data for AI Training
Paul Luppino, Director of Data Management at Iron Mountain, went into more detail about the challenges AI training poses for archiving. After all, the data stored in an active archive for long-term retention might include historical records, email, logs, video, audio, and backups. It typically exists in a wide variety of formats and across many storage media, such as databases, documents, disk, cloud, and backup tape, including obsolete file types and numerous generations of technology. Backward compatibility becomes an issue.
“Users want to be able to access archived data in a form that is useful to them,” said Luppino. “But they can extract tremendous value from historical data sitting in cloud, disk, and tape systems as it may contain previously unseen patterns.”
He outlined situations such as having archived data on LTO-5 tape but no longer having a tape drive that can read it. He advised anyone engaged in archiving to think about this type of problem up front, pay attention to aging formats, and future-proof their systems so they always retain access. The last thing you want to be doing when training an AI model is extracting and transforming old tape formats or wrestling with outdated file types on disk; such a situation could significantly slow the rate at which models can be trained.
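One way to think about this problem up front is a periodic audit of the archive manifest that flags media and file formats nearing obsolescence. Below is a minimal sketch; the risk tables are illustrative examples, not a definitive registry, and a real audit would also track drive availability and vendor support windows.

```python
# Sketch: flag archive entries whose media generation or file format is aging out.
AT_RISK_MEDIA = {"LTO-4", "LTO-5"}          # older tape generations, drives getting scarce
AT_RISK_FORMATS = {".dbf", ".wpd", ".bkf"}  # example file types that tools may no longer read

manifest = [
    {"item": "finance_2008.bkf", "media": "LTO-5"},
    {"item": "genomics_2022.parquet", "media": "object-store"},
]

def audit(entries):
    """Yield a migration warning for every at-risk media type or file format."""
    for e in entries:
        ext = "." + e["item"].rsplit(".", 1)[-1]
        if e["media"] in AT_RISK_MEDIA:
            yield f"{e['item']}: migrate off {e['media']} while drives are still available"
        if ext in AT_RISK_FORMATS:
            yield f"{e['item']}: convert {ext} before the readers disappear"

for warning in audit(manifest):
    print(warning)
```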
“The process of extracting, cleaning, and processing archival data from tapes, cloud, and disk storage can be costly and time-consuming, requiring significant expertise,” said Luppino.