Concepts

Elm is a modern object store backed by a high-performance disk tier and a scalable tape library. It is ideally suited for cold storage of large research datasets, raw instrument data, and backups.

Design overview

Elm's key strengths include:

Integration with Stanford Identity Provider (IdP) and Workgroup Manager
S3-compatible object store with erasure coding, using MinIO (open source)
Fast and scalable disk tier, backed by Lustre (open source)
Highly scalable tape library, itself an object store managed by Phobos (open source)
Cost-effective at large scale, particularly thanks to tape storage, which keeps media costs and energy consumption low

Elm's big picture is represented below:

Accessing Elm: MinIO Console

Elm is accessible through the MinIO Console over HTTPS. Authentication is handled via Stanford OpenID Connect and secured by two-factor authentication. New Elm users must log in to the Console at least once to generate an access key before they can transfer data using the S3 protocol.

The MinIO Console also features a bucket browser and supports basic file transfers over HTTPS, making it convenient for handling smaller files.

Accessing Elm: S3 endpoint

Elm provides an S3-compatible endpoint designed for data transfers, making it ideal for managing large datasets and research data. While the MinIO Console offers a convenient interface for browsing and transferring small files, the S3 endpoint is optimized for more demanding and automated workflows.

Using the S3 endpoint allows users to leverage parallelization and automation tools.
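As an illustration of the kind of automation the S3 endpoint enables, here is a minimal Python sketch that maps a local directory tree to object keys and runs the uploads in parallel with a thread pool. The function names `plan_uploads` and `upload_all` are hypothetical, and the actual transfer is delegated to an injected `upload_one` callable so that the sketch does not depend on any particular S3 client.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def plan_uploads(root: Path, prefix: str = "") -> List[Tuple[Path, str]]:
    """Pair every regular file under `root` with an object key.

    Keys are relative to `root`, use forward slashes, and are optionally
    prepended with `prefix`.
    """
    pairs = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            if prefix:
                key = f"{prefix.rstrip('/')}/{key}"
            pairs.append((path, key))
    return pairs


def upload_all(
    pairs: List[Tuple[Path, str]],
    upload_one: Callable[[Path, str], None],
    max_workers: int = 8,
) -> Tuple[List[str], Dict[str, Exception]]:
    """Call upload_one(path, key) concurrently for every pair.

    Returns the sorted list of keys that succeeded and a dict mapping
    each failed key to the exception it raised.
    """
    done: List[str] = []
    failed: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(upload_one, path, key): key for path, key in pairs}
        for future in as_completed(futures):
            key = futures[future]
            try:
                future.result()
                done.append(key)
            except Exception as exc:  # collect failures instead of aborting the batch
                failed[key] = exc
    return sorted(done), failed
```

With the third-party `boto3` library, `upload_one` could for instance wrap `client.upload_file(str(path), bucket, key)`, where `client` is created with `boto3.client("s3", endpoint_url=...)` and the access key generated in the MinIO Console. The endpoint URL and bucket name are not given here and must be the ones provided for Elm.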

Elm's architecture is designed to support rapid data ingestion, leveraging a high-performance disk tier to quickly accept large volumes of data. This makes it suitable for bulk uploads of large datasets.

Because Elm is backed by a scalable tape library, it excels at long-term cold storage. However, users should be aware that data retrieval times can vary. While fast data ingestion is a key strength of Elm, retrieval times are not guaranteed, because data may first have to be restored from tape.

Read on to understand how Elm efficiently transitions data from disks to tapes.

From disks to tapes

When you upload your data to Elm, it first arrives at the high-performance disk tier via MinIO. Your data is guaranteed to remain immediately accessible in this tier for at least one month. During this period, Elm's system automatically migrates your data to tapes while keeping it available on the disk tier.

After the initial month, your data is automatically removed from the disk tier but remains safely stored on tapes. This entire process is transparent to users, ensuring a seamless experience.

If you need to retrieve data from Elm after it has been migrated to tapes, expect high latency. The system will automatically mount the corresponding tapes and restore the data to the disk tier, making it accessible through MinIO once again.
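Automated workflows that read older data may want to tolerate this latency. The following Python sketch retries a download with exponentially growing pauses. It assumes, without confirmation from this page, that a request for data still being restored surfaces on the client side as an exception such as a timeout; the name `fetch_with_backoff` and the default delays are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 5,
    first_delay: float = 60.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or `attempts` calls have failed.

    The pause between attempts starts at `first_delay` seconds and is
    multiplied by `factor` after each failure. The last failure is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:  # assumption: narrow this to your client's timeout errors
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= factor
    raise AssertionError("unreachable")
```

Here `fetch` would be any zero-argument callable that performs the download, so the same helper works with whichever S3 client you use.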

This data management lifecycle is implemented using Lustre/HSM and Phobos, both open source components. This approach avoids vendor lock-in and ensures that Elm remains cost-efficient as it scales in the future.

To summarize:

Initial stage: Data is uploaded to the high-performance disk tier via MinIO and guaranteed to remain there for at least a month.
Migration to tapes: Data is automatically migrated to tapes during the first month, while still being accessible on disks.
Post-migration: After a month, data is removed from the disk tier but remains on tapes. Retrieval will involve some latency as the system restores the data to the disk tier.
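The lifecycle summarized above can be modeled with a small Python sketch, useful for example when deciding whether a dataset can still be read without a tape restore. It approximates "one month" as 30 days, which is an assumption, and the names `DISK_GUARANTEE`, `guaranteed_on_disk_until`, and `expected_access` are hypothetical. Since the guarantee is a minimum, data may in practice stay on disk longer than the model suggests.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the "at least one month" disk guarantee is approximated as 30 days.
DISK_GUARANTEE = timedelta(days=30)


def guaranteed_on_disk_until(uploaded_at: datetime) -> datetime:
    """Return the time until which the data is guaranteed to be on the disk tier."""
    return uploaded_at + DISK_GUARANTEE


def expected_access(uploaded_at: datetime, now: datetime) -> str:
    """Describe the access latency to expect for data uploaded at `uploaded_at`."""
    if now < guaranteed_on_disk_until(uploaded_at):
        return "immediate (disk tier)"
    return "possibly delayed (may require restore from tape)"


if __name__ == "__main__":
    uploaded = datetime(2024, 1, 1, tzinfo=timezone.utc)
    for days in (10, 45):
        print(days, expected_access(uploaded, uploaded + timedelta(days=days)))
```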

This architecture ensures that Elm offers both rapid data ingestion and cost-effective long-term storage, leveraging tapes for large-scale media and energy efficiency.

Learn more about Elm's architecture in this presentation from the Lustre Administrators & Developers (LAD) 2024 conference.