Active Archiving at the Heart of User-driven HPC Data Movement

April 11th, 2022 by Armel Bile, Product Marketing Manager at Atempo

High-Performance Computing (HPC) is a world where precious storage resources must be carefully managed. In fact, “storage” is better thought of in the plural, as there are typically three storage tiers within each HPC infrastructure:

    1. High-end storage accessed directly by the compute nodes, mostly made up of very expensive flash drives.
    2. Large-capacity, NAS-like shared storage where researchers store their daily work documents and projects and prepare this data for high-compute simulations and experiments. The filesystems in play here are mostly scale-out Lustre/GPFS or Scale-Out NAS (Isilon, NetApp, …).
    3. Slower yet cost-effective storage that handles long-term retention of cold data for archiving and/or compliance purposes. This tier comprises offline or nearline tape libraries or glacier-type cloud storage.

The growth of unstructured data is phenomenal in all verticals including the world of scientific research. Data volume increases have the greatest impact on medium and cold storage tiers. HPC IT teams receive more and larger data movement requests between these tiers. Researchers need to perform more frequent and longer data movement tasks to and from their home spaces and the compute tier.

Many research centers let researchers choose and install their own tools to perform data transfers. The proliferation of these tools poses many practical problems that impact the performance of the entire infrastructure:

  1. integrity risks for the transferred data,
  2. risks of data leakage,
  3. risk of adding unknown and potentially malicious software,
  4. lack of control over IOs and lack of coordination of their impact on the performance of system components,
  5. loss of a 360° view of data management and user needs for IT teams, which deprives them of the ability to anticipate.

Access to a data management solution that includes cost-effective active archive storage therefore makes solid sense. HPC users get:

  1. An immediate improvement in storage TCO, as the active archive reduces the IO strain on higher-performance storage tiers,
  2. Instant gains on data transfer speeds: the most recent data is available on the active archive that acts as a caching system to both the intermediate NAS storage and the cold storage tier.
  3. Greater autonomy as teams benefit from a reliable tool that simplifies data movement and has built-in mechanisms for:
    1. preventing data loss and corruption,
    2. restarting data transfer on error,
    3. launching actual data movement IOs using direct storage access and not user workstations,
  4. Better IT resource hygiene in terms of file naming, organization, and folder structure,
  5. More automation for applying best practices for data protection. For example:
    1. scheduling versioned backups
    2. storing multiple copies on different media to comply with 3-2-1 best practices
    3. scheduling automatic offloading of colder data from the main storage.
  6. More autonomy and empowerment with control over their own data movement by leveraging secure and predefined policies, data management rules, and industry best practices and conformity.
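
As a rough illustration of the mechanisms listed above (integrity checks, restart on error, and scheduled offloading of colder data), here is a minimal Python sketch. It is not Atempo's implementation or API: the function names (`sha256_of`, `copy_verified`, `find_cold_files`, `offload_cold`), the choice of SHA-256, the modification-time age threshold, and the retry count are all assumptions made for illustration.

```python
import hashlib
import os
import shutil
import time
from pathlib import Path
from typing import List, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src: Path, dst: Path, retries: int = 3) -> str:
    """Copy src to dst, compare checksums, and retry on error or mismatch."""
    expected = sha256_of(src)
    last_error: Optional[Exception] = None
    for _ in range(retries):
        try:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            if sha256_of(dst) == expected:
                return expected
            last_error = OSError(f"checksum mismatch for {dst}")
        except OSError as exc:
            last_error = exc
    raise RuntimeError(f"giving up on {src} after {retries} attempts") from last_error


def find_cold_files(root: Path, max_age_days: float,
                    now: Optional[float] = None) -> List[Path]:
    """Return files under root not modified within the last max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(p for p in root.rglob("*")
                  if p.is_file() and p.stat().st_mtime < cutoff)


def offload_cold(root: Path, archive: Path, max_age_days: float,
                 now: Optional[float] = None) -> List[Path]:
    """Move cold files to the archive tier, keeping the folder structure."""
    moved = []
    for src in find_cold_files(root, max_age_days, now):
        copy_verified(src, archive / src.relative_to(root))
        src.unlink()  # delete the primary copy only after a verified copy exists
        moved.append(src)
    return moved
```

In this sketch the source file is removed only once the archived copy's checksum matches, so an interrupted or corrupted transfer leaves the original in place. A scheduler (cron, for instance) could call `offload_cold` periodically; a real deployment would run such transfers on storage-attached data movers rather than on user workstations, as the list above notes.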

As a bonus, active archiving also benefits tape archives: acting as a giant buffer, it streamlines the archiving process and I/O management. The user can now move massive petabyte-sized data sets quickly between compute storage and tape.

In addition to controlling their data movement between tiers, users are also fully autonomous when it comes to data retrieval from the active archive or long-term archives. Users navigate a simple web interface to search by project name, file name, or specific metadata collected automatically from the datasets. The search provides data previews and is used to transparently recover datasets to a storage of choice or to a storage tier predefined by the IT team.
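
To make the search step concrete, here is a small Python sketch of a catalog lookup by project name, file name, or collected metadata. It is purely illustrative: the `ArchivedDataset` record, its fields, the `search` function, and the sample entries are hypothetical and do not reflect the actual product's data model or interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ArchivedDataset:
    """Hypothetical catalog record for one archived file."""
    project: str
    file_name: str
    tier: str  # e.g. "active-archive" or "tape" (assumed labels)
    metadata: Dict[str, str] = field(default_factory=dict)


def search(catalog: List[ArchivedDataset], term: str) -> List[ArchivedDataset]:
    """Case-insensitive substring match on project, file name, and metadata values."""
    needle = term.lower()
    hits = []
    for entry in catalog:
        haystack = [entry.project, entry.file_name, *entry.metadata.values()]
        if any(needle in text.lower() for text in haystack):
            hits.append(entry)
    return hits


# Made-up sample entries, for illustration only.
catalog = [
    ArchivedDataset("climate-2021", "run_042.nc", "tape", {"instrument": "lidar"}),
    ArchivedDataset("genomics", "sample_a.bam", "active-archive", {"species": "E. coli"}),
]

for hit in search(catalog, "lidar"):
    print(hit.project, hit.file_name, hit.tier)
```

Each hit carries the tier on which the dataset currently resides, which is the information a retrieval step would need to recover the data to a chosen or predefined storage tier.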

In practice, this solution proves to be much more solid and efficient than stub-based or link-based HSM solutions that have difficulty keeping up with the growth of large HPC volumes.

Atempo works with organizations to deploy on-premises, hybrid, and multi-cloud storage active archiving solutions, maximizing operational efficiency and providing complete confidence in content security. Integrating secure, accessible, and scalable archiving means instant and unlimited self-service access. The bottom line? Researchers and IT professionals get to spend more time doing what they do best!
