
Archiving Data to Elm from Sherlock#

elm_archive#

Archiving large datasets made up of many small files can be challenging. elm_archive is a utility that streamlines the archival of data from Sherlock to Elm storage using Globus. This command-line tool is particularly useful for archiving large numbers of small files, which are poorly suited to Elm and tape storage.


What can elm_archive do for me?#

elm_archive consolidates your files into one or more tar files, and uploads them to Elm for safekeeping. While direct upload is ideal for large files (greater than 1 GB), elm_archive is highly recommended for efficiently archiving directories with many small files from Sherlock to Elm.

Using larger tar files in Elm improves performance and makes future data restoration easier. You can later retrieve specific small files by downloading the relevant tar files and extracting only the needed items.


How does elm_archive work?#

elm_archive runs from Sherlock and orchestrates a chain of dependent Slurm jobs to automate the entire archival process. It handles:

  1. Packing: Consolidating files into .tar archives using the parallel dtar tool.
  2. Post-Processing: Splitting oversized archives and creating manifests for each one.
  3. Transfer: Submitting the data transfer to Elm via Globus.
  4. Monitoring & Cleanup: A recurring job monitors the Globus transfer's progress. Upon completion (success or failure), it automatically cleans up all temporary data from your $SCRATCH directory.

This workflow is designed for unattended execution; you can submit a transfer and the script will handle the rest, even if you log out.



Workflow#

Step 1: Register an Elm Access Key#

1. Retrieve Access Key & Secret#

  • Log in to Elm's MinIO console: https://campus.elm.stanford.edu:9001/
  • Create a new Access Key (recommended: descriptive name, e.g., "sherlock-elm-archive")
  • Optionally restrict your key’s permissions (Globus & Elm Documentation)
  • Copy the Access and Secret Keys for the next step (keep them secure and do not share them)

2. Load Keys in Globus#

Step 2: Log in to Globus from Sherlock#

Use our wrapper to authenticate to Globus and grant required consents:

$ elm_archive auth

Follow the browser prompts and enter the returned tokens when requested. You'll authenticate once for Globus and once for each endpoint (source and destination).

Step 3: Initiate a Transfer#

From a Sherlock login node, start a transfer using the elm_archive transfer command.

Basic example:

$ elm_archive transfer /path/to/source elm-bucket/prefix/

  • /path/to/source is any directory on Sherlock where you have the necessary permissions, including any of the shared filesystems such as /oak. For more information on filesystems, visit the Sherlock Filesystems Documentation.
  • elm-bucket/prefix/ is your Elm bucket and destination path. Subfolders are created automatically.

Transfer Label

By default, the source directory name is used as a label for log files, tar files, manifests, and Globus tasks. You can specify a custom label:

$ elm_archive transfer --label projectA-2025 /path/to/source elm-bucket/prefix/

Job Options#

elm_archive provides several options for customizing how archiving jobs are run on Sherlock. You can adjust key Slurm job parameters (like partition, number of tasks, wall time, and constraints) to optimize resource usage, accommodate larger datasets, or integrate with your group’s workflows.

Example: Customizing Resources#

Suppose you have a large dataset and want to use your group's own partition for better performance, running 8 tasks with up to 8 hours of wall time. You would use:

elm_archive transfer -p owner_partition -n 8 -t 8:00:00 /path/to/dataset elm-bucket/prefix/

Here's what each option means in detail:

Option              Description
-p owner_partition  Use your own Slurm partition (default is service). This allows more resources and longer run times.
-n 8                Request 8 parallel tasks (ntasks). More tasks often speed up packing, especially for many small files. Default is 8.
-t 8:00:00          Allow up to 8 hours of wall time. Increase this for very large datasets or slow packing. Default is 2 hours.

Other useful job options:

Option                         Description
-C, --constraint <constraint>  Set a Slurm constraint (e.g., request specific hardware nodes).
--mail-type <type>             Specify email notification preferences for Slurm jobs (e.g., END, FAIL).

Quick Start for Requesting Custom Resources#

  • For more CPUs (parallelism) or longer job time, adjust -n and -t:

    elm_archive transfer -n 16 -t 1-0 /path/to/data elm-bucket/prefix/
    
    (1-0 is one day)

  • For large datasets, switch to your group's partition:

    elm_archive transfer -p mygroup /path/to/data elm-bucket/prefix/
    

  • For more control (e.g., label, splitting), combine options:

    elm_archive transfer --label myproject \
      -p mypartition \
      -n 32 \
      -t 2-00:00:00 \
      -s 1000 \
      /path/to/source elm-bucket/prefix/
    

Tip: Find your available partitions and groups using sh_part. See Sherlock's Available Resources page for more info.


Tar File Management and Thresholds#

Splitting Tar Files#

If your dataset exceeds a specified size (default: 1 TiB), elm_archive automatically splits it into multiple standalone tar files using tarsplitter.

Note: Each tar file can be downloaded and extracted on its own; you do not need to reassemble tar files to restore individual files. In addition, elm_archive provides a separate manifest file (.tar.txt) listing the contents of each tar.

Best Practices#

  • Specifying Split Size: Adjust the split size with the -s flag. For example, to split files at every 500 GiB:

    elm_archive transfer ... -s 500 ...
    

  • Thresholds & Restore Speed: Setting a smaller split size (e.g., reducing from 1 TiB to 250 GiB using -s 250) enables faster restores; you’ll need to download only a smaller tar file to recover a specific file. This is especially useful when restoring on a machine with limited storage, such as a laptop.
    Conversely, increasing the split size (e.g., from 1 TiB to 4 TiB with -s 4096) means fewer but larger tar files, which can speed up bulk restores of very large datasets.

  • Minimum Split Size: The split threshold must be greater than the largest single file in your dataset, as each file is stored whole in a tar (no file is split across tar files).
    Note: elm_archive will automatically verify the size of the largest file to be archived. If the largest file is bigger than your chosen split size, elm_archive will abort.

  • The maximum tar size allowed by Elm's S3 protocol is 5 TiB. For the largest possible tar files, we recommend using -s 5000.
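Before picking a split size, it can help to check the largest file in your dataset. A minimal sketch, assuming GNU findutils (as on Sherlock); largest_file_bytes is an illustrative name, not part of elm_archive:

```shell
# Print the size in bytes of the largest regular file under a directory.
# The -s threshold (in GiB) must exceed this, or elm_archive will abort
# before packing.
largest_file_bytes() {
    find "$1" -type f -printf '%s\n' | sort -n | tail -n 1
}
```

For example, `largest_file_bytes /path/to/source` returning 300 GB would mean any `-s` value of 300 or above is safe.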


Restoring and Extracting Data From Elm Archives#

You can restore files from Elm by downloading the corresponding tar and manifest files. Each tar file can be used independently; you do not need to merge or process multiple tar files.

Step-by-Step Extraction:

  1. Inspect Manifest: The manifest .tar.txt is the output of tar -tvf, listing files in that tar.

    awk '{for(i=6;i<=NF;i++) printf "%s%s", $i, (i==NF?"\n":" ")}' archive.tar.txt > archive.paths.txt
    

  2. Select Needed Files:

    grep wanted_pattern archive.paths.txt > archive.to_extract.txt
    

  3. Extract Selected Files: Make sure your extraction job is allocated enough time (extracting large tars can take hours):

    tar -xvf archive.tar --files-from=archive.to_extract.txt
    

Notes:

  • You can use any standard tar command.
  • Each tar file stands alone.

Non-Blocking Submission with --headless#

For automation workflows (e.g., job dependencies), elm_archive transfer --headless submits all jobs required for archiving, prints the chain of Slurm job IDs, and exits immediately; it does not block for the full workflow. You can then chain your own jobs using Slurm job dependencies.

How to get elm_archive's last job ID for job dependency:

tmpfile=$(mktemp)
elm_archive transfer --headless ... | tee "$tmpfile"
archive_rc=${PIPESTATUS[0]}   # exit code of elm_archive
tee_rc=${PIPESTATUS[1]}       # (usually zero unless disk full)

if [[ $archive_rc -ne 0 ]]; then
    echo "ERROR: elm_archive failed with exit code $archive_rc"
    # ...handle error...
fi

cleanup_jobid=$(awk -F': ' '/^elm_globus_cleanup:/ {print $2}' "$tmpfile")

# use $cleanup_jobid for job dependency, for example:
# sbatch --dependency=afterok:$cleanup_jobid my_next_job.sh


Summary of misc. options#

Option                        Description
-r, --reattach <label.XXXXX>  Reattach to an existing transfer session (resume monitoring an interrupted job/logs).
-s, --size-threshold <N>      Split tar files when exceeding N GiB (default: 1024). Lower for easier restores; must be >= the largest file.
--headless                    Submit jobs and print the job ID chain, then exit immediately (useful for scripting, pipelines, etc.).
-l, --label <label>           Set a custom label for files/logs (defaults to the source directory name).
--tar-only                    Create tar files and manifests in $SCRATCH but do not start the Globus transfer. This is useful for inspecting archives before a manual transfer. Note: the transfer cannot be resumed by elm_archive later, and since automated cleanup is bypassed, you must manually remove the generated tar files from $SCRATCH once you are done to conserve your quota space.

See all options with:

elm_archive --help


Monitoring Transfers#

You can monitor the progress of your active and completed transfers at any time using the elm_archive status command. This provides a clean, at-a-glance view of all your recent archiving sessions.

Understanding the status table#

Running elm_archive status on a Sherlock login node will display a table similar to this:

$ elm_archive status
SUBMITTED          STATE        STAGE        SIZE/INO      PROGRESS     DURATION   SESSION_LABEL           DESTINATION
2025-10-18 09:31   COMPLETED    Cleaned      1.2T/1.1M     -            1h45m      myoldproject1.ek3oCI    my-bucket/archived_projects/
2025-10-18 10:28   FAILED       Transfer     500G/850k     FAILED       2h15m      myoldproject2.mXYN1J    my-bucket/archived_projects/
2025-10-18 10:39   RUNNING      Packing      2.1T/3.5M     78%          45m+       largedata.yZulTZ        my-bucket/rawdata/
2025-10-18 11:05   RUNNING      Post-Proc    890G/500k     -            12m+       dataset-final.aBCdE1    my-bucket/final/
2025-10-18 11:15   RUNNING      Packing      -             Queued       2m+        new-analysis.fG2hI3     my-bucket/analysis/

At the bottom, a helpful hint will appear if you have any running jobs:

2025-10-18 11:30:00 - To reattach to the most recent running session, use:
    elm_archive transfer --reattach largedata.yZulTZ

Copy-paste this command to attach the session's logs (you can then detach with Control-C at any time).

Column Explanations#

Column         Description
SUBMITTED      The date and time you initiated the transfer. All entries are sorted chronologically.
STATE          The overall condition of the workflow: RUNNING (Yellow), COMPLETED (Green), or FAILED (Red). This is your primary indicator of the job's health.
STAGE          The specific step the workflow is currently in, or where it failed:
                 - Init: The session is being created.
                 - Packing: Consolidating files into a .tar archive. A pre-analysis (dwalk) runs first.
                 - Post-Proc: Splitting large tar files and/or generating manifests.
                 - Queuing: The Globus transfer job is submitted and waiting to run.
                 - Transfer: Actively transferring data to Elm via Globus.
                 - Cleanup: Removing temporary files from $SCRATCH after a successful Globus transfer.
                 - Cleaned: The final stage for a COMPLETED job where temporary files were removed.
                 - Packed: The final stage for a COMPLETED job run with --tar-only.
SIZE/INO       The total size and number of files (inodes) in the source directory, as calculated by dwalk. Example: 1.2T/1.1M means 1.2 terabytes and 1.1 million files.
PROGRESS       Context for the current stage:
                 - Queued: The Slurm job for this stage is waiting to start.
                 - Analyzing: The pre-packing file scan (dwalk) is running.
                 - XX%: Percentage completion for the Packing or Transfer stage.
                 - Submitted: The transfer has been submitted to Globus but is not yet active.
                 - PAUSED(XX%): The Globus transfer has been paused. The job is still RUNNING but no data is moving.
                 - FAILED/SUCCEEDED: The final status of the Globus transfer, visible before cleanup runs.
DURATION       The total time the job has been running. A + suffix (e.g., 1h15m+) indicates the job is still active.
SESSION_LABEL  The unique identifier for this session. Use this label with the --reattach flag.
DESTINATION    The bucket and prefix in Elm where your files will be archived.

Customizing the status view#

The status command can be modified with flags to provide more detailed information or to filter the output.

Option       Description
--running    Show only transfers that are currently active.
--failed     Show only transfers that have failed, which is useful for quickly identifying problems.
--completed  Show only successfully completed transfers.
--last       Show only the most recent transfer session.
--all        Show your complete transfer history (default is the last 30 days).
--full       Display the full source and destination paths on new lines.
--globus     Display a direct link to the Globus task page for each transfer.
--debug      Display detailed troubleshooting information for advanced users.

Live Monitoring a Transfer#

For a real-time, auto-updating view of your most recent transfer, you can use the watch command:

watch -n 10 --color elm_archive status --last

This command provides a live dashboard of your transfer by running elm_archive status --last every 10 seconds, with colors intact. Press Ctrl+C to exit the live view.

Filtering the view#

To help you quickly find the transfers that matter most, you can filter the status list. For example, to see all jobs that are not yet successfully completed (i.e., they are either running or have failed), you can combine flags:

$ elm_archive status --running --failed

When using --all to view your entire transfer history, keep in mind that the session logs for elm_archive are stored in your $SCRATCH directory (inside $SCRATCH/.elm_archive/). Due to Sherlock's 90-day purge policy on $SCRATCH, transfer history older than 90 days may not be available.
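Since session logs live under $SCRATCH/.elm_archive and are subject to the purge policy, you can list sessions approaching the purge window. A small sketch using GNU find; the 60-day cutoff is just an example:

```shell
# List session directories under $SCRATCH/.elm_archive that have not been
# modified in over 60 days (i.e., nearing the 90-day purge window).
find "$SCRATCH/.elm_archive" -mindepth 1 -maxdepth 1 -mtime +60
```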

Debugging a transfer#

The --debug flag provides technical details essential for troubleshooting. It shows the live status of the underlying Slurm jobs and the latest information from Globus.

$ elm_archive status --debug --last
SUBMITTED          STATE     STAGE      SIZE/INO    PROGRESS   DURATION   SESSION_LABEL
2025-10-18 17:09   RUNNING   Transfer   4.5T/8.9M   75%        2h28m+     largedata.jOJMY9
  ├── DEBUG INFO
  │   ├── Session Directory: /scratch/users/jane/.elm_archive/largedata.jOJMY9
  │   ├── Cleanup Status:    Incomplete
  │   ├── SLURM Job IDs:
  │   │   - Packing:       7832699
  │   │   - Post-Proc:     7832700
  │   │   - Xfer-Submit:   7832701
  │   │   - Cleanup:       7832702 (RUNNING)
  │   └── globus_task_status:
  │       Label:                  elm_archive:largedata.jOJMY9
  │       Status:                 ACTIVE
  │       Is Paused:              False
  │       Total Subtasks:         4
  │       Subtasks Succeeded:     3
  │       ...

What this tells you:

  • Session Directory: The location of all logs and metadata for this job in $SCRATCH.
  • Cleanup Status: Shows if the final cleanup job has run.
  • SLURM Job IDs: Lists each job in the workflow. The status in parentheses (e.g., RUNNING, PENDING) is its live status from the Slurm scheduler.
  • globus_task_status: A snapshot of the last known status of the Globus transfer, as seen by the cleanup job.

Limitations#

  • Tar files are staged in your $SCRATCH directory (100 TB quota). To stage them elsewhere (e.g., on /oak), symlink $SCRATCH/.elm_archive to a directory on the target filesystem.
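One way to set up such a symlink, sketched as a small helper (stage_elsewhere is an illustrative name and the /oak path is hypothetical; only do this before any sessions exist, or move the existing directory first):

```shell
# Point elm_archive's staging/session directory at another filesystem.
#   $1: target staging directory (e.g., a path under /oak)
#   $2: your scratch root (normally "$SCRATCH")
stage_elsewhere() {
    mkdir -p "$1"
    ln -sfn "$1" "$2/.elm_archive"
}
```

Usage might look like `stage_elsewhere /oak/stanford/groups/<group>/elm_staging "$SCRATCH"`.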

Support#

Stanford Research Computing maintains this tool. For help, contact srcc-support@stanford.edu (mention "elm_archive" in your subject).