
Integrity and Validation#

Overview#

If you are considering storing your data on Elm, you may want to understand our approach to ensuring data integrity and validation. This page offers a brief FAQ you can skim for quick answers, as well as a more comprehensive technical summary that may be of interest to power users.

FAQ#

How does Elm protect against data corruption for files at rest?

Elm ensures data integrity at rest using advanced storage technologies:

  • Redundancy: Data is split into smaller pieces with parity data, allowing recovery from failures.
  • Regular Integrity Checks: Systems like RAID-6, checksums, and erasure coding monitor and repair data as needed.
  • Tape Backups: Data is also stored on tapes with built-in error correction and checksum verification.

How can I be sure my data was not corrupted while en route into or out of Elm?

Elm validates all data transfers using:

  • Checksum Verification: The system checks data integrity during uploads and downloads by comparing checksums provided by your tools.
  • End-to-End Validation: Tools like Globus and rclone independently confirm successful transfers by matching source and destination integrity.

How can I be sure my data is written to disk and tape correctly in Elm?

Elm safeguards data during writes with:

  • Automated Validation: Checksums confirm data accuracy when written to disk or tape.
  • Built-In Error Correction: Tapes follow industry standards for detecting and correcting errors, and any defective blocks are rewritten automatically.

For a more in-depth technical overview of how we approach data integrity and validation, please read on.

Integrity#

Integrity deals with ensuring that an object's contents will not become altered or corrupted within the Elm system. The Elm system uses a number of methods to check for integrity issues:

Erasure Coding (MinIO)#

The MinIO system that manages objects on disk uses Erasure Code algorithms to ensure that parity data is written alongside object data. Currently Elm is configured to evenly split each object into sub-files across four disk partitions. Each object's sub-files contain a mix of object and parity data, allowing for recovery if any one partition's files are lost or corrupted.

HighwayHash Checksums (MinIO)#

MinIO uses the HighwayHash algorithm to calculate hashes for each chunk of object data, storing those hashes as part of the metadata for the object. Every read and write operation in MinIO consults those hashes to protect against corruption.12

Lustre Metadata (Dell PowerVault ME5 ADAPT)#

Elm uses two Lustre Metadata Servers (MDS) and four Lustre Metadata Target (MDT) block devices3 served by a Dell PowerVault ME5 Storage Array. The ME5 offers redundancy using Autonomic Distributed Allocation Protection Technology (ADAPT) erasure coding, and provides fast filesystem rebuild capabilities and integrated spare drives to handle SSD failures. The ME5 also provides instant snapshotting at the block level, and we will be regularly backing up volume snapshots to the Stanford Research Computing Facility (SRCF) (see Pending Work below).

Lustre Data (Linux RAID)#

The disks used by Elm are managed in a Linux software RAID-6 (8+2) configuration4, to protect against the loss of up to two disks per volume. Volumes are checked for consistency on approximately a monthly basis. We expect that any failed disks will be replaced within 2 business days.
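
The state of a Linux md consistency check can be observed, and a check triggered, through the standard md sysfs interface. A minimal sketch of what such a check looks like on a RAID host (the md0 device name is illustrative):

$ cat /proc/mdstat                             # array health and any check/resync in progress
$ cat /sys/block/md0/md/sync_action            # "idle" when no check is running
$ echo check > /sys/block/md0/md/sync_action   # start a consistency check (requires root)
$ cat /sys/block/md0/md/mismatch_cnt           # sectors found inconsistent by the last check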

Over-The-Wire Checksums (Lustre)#

Elm’s disk tier is served by a Lustre filesystem that provides HSM features to seamlessly copy data between disks and tapes. To ensure that the data has not been corrupted in memory or in transit over Elm’s internal network, Lustre uses the adler32 checksum algorithm.5
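
On a Lustre client, the over-the-wire checksum settings can be inspected with lctl. A brief sketch using standard Lustre tunables (shown for illustration only):

$ lctl get_param osc.*.checksums       # 1 when wire checksums are enabled
$ lctl get_param osc.*.checksum_type   # lists supported algorithms, with the active one (e.g. adler) in brackets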

LTFS Tape Backups#

Files on disk are backed up to an LTFS tape archive, using four distinct tape sets (one tape set for each of the MinIO disk partitions discussed under Erasure Coding above) for long term storage. During writes of files to the tapes an xxh128 hash is calculated and stored in the Phobos database, and in the extended filesystem attributes of the files in LTFS (on tape). The checksum is used to verify the initial write to tape and again when the data is read back from tape. The Phobos database will be regularly backed up to the SRCF (see Pending Work below).
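
The same xxh128 value recorded by Phobos can be reproduced locally with the xxhsum utility from the xxHash project (illustrative; -H2 selects the 128-bit variant):

$ xxhsum -H2 a-a-500MB.dat   # prints the 128-bit xxHash digest of the file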

LTO Error Correction#

When the LTFS system writes data to tape it does so using hardware that follows the LTO specification. Implementations of that specification follow standards for error correction and reliability:

2.1.2 LTO standards

The LTO formats also use advanced error correction codes for data integrity. These systems automatically correct most cross-track errors and provide data correction even if a full track is lost. Data is further protected by the demarcation of bad areas of the tape (for example, where servo signals are unreliable) and through dynamically rewriting bad blocks.

2.3.3 Reliability

Data integrity

The drive performs a read after write for verification. Incorrectly written data, such as the result of a tape defect, is automatically rewritten by the drive in a new location. Data that is rewritten as the result of media defects is not counted against the drive error performance. The drive never records incorrect data to the tape media without posting an error condition.

Power loss

No recorded data is lost as a result of normal or abnormal power loss while the drive is reading or writing data. If power is lost while writing data, only the data block that is being written might be in error. Any previously written data is not destroyed.

Error correction

Data integrity features include two levels of error correction that can provide recovery from longitudinal media scratches.

Pending Work#

As we continue to develop Elm, we aim to implement the following strategies to further safeguard data stored on Elm. These are not in place today but are provided to illustrate how our service roadmap seeks to enhance data integrity measures.

  1. Periodic backup of the Phobos database to a separate datacenter.
  2. Periodic backup of Lustre MDTs to a separate datacenter.

Validation#

Validation deals with ensuring that the files you write to or read from Elm are not corrupted mid-flight. MinIO offers the following layers of support for validation, depending on what your S3 client implements:

  1. Content-MD5 checksums, either for whole object uploads or for the individual multi-part upload parts.
  2. X-Amz-Checksum-Algorithm headers, supporting CRC-32, CRC-32C, SHA-1, or SHA-256, to verify whole object uploads or multi-part upload parts
  3. ETag headers

Using Content-MD5 or X-Amz-Checksum-Algorithm headers requires the client to calculate the checksum of each data payload it sends, and the server will verify that the data it received matches the specified checksums. If the checksums do not match, the server returns an error.
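
For example, the value carried in a Content-MD5 header is the Base64 encoding of the payload's binary MD5 digest (the X-Amz-Checksum-* values are encoded the same way for their respective algorithms). For a 500MB test file (a-a-500MB.dat, used in the examples below) it can be computed as:

$ openssl dgst -md5 -binary a-a-500MB.dat | base64
Bh+iN1Ir0iAO0URT3fpshg==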

The server gathers the checksums used to create an object and calculates a hash-of-hashes checksum; if the object was created by a multi-part upload, that value is followed by a dash and the number of parts. A client that has tracked the checksums sent as part of its upload process can verify that the ETag generated by the server matches its own calculation.

For more details on these methods please see Checking object integrity in Amazon S3 and Additional Checksum Algorithms for Amazon S3.

In addition to these methods, clients have the option to implement additional validation phases outside of explicit support from the server. Examples of this can be seen with Globus and rclone.

Globus#

When a Globus transfer is initiated, the source endpoint will generate an MD5 checksum for the source object.

If an object is within the Globus 500MiB chunk size limit then Globus can create the destination object in a single PUT operation. When Elm processes a single PUT operation it will return an ETag header whose value is the MD5 checksum of the uploaded object. Globus can then compare these two checksums.

As an example, after transferring a file named a-a-500MB.dat, Globus indicates it calculated an MD5 checksum from the source:

{
  "DATA_TYPE": "successful_transfer",
  "checksum": "061fa237522bd2200ed14453ddfa6c86",
  "checksum_algorithm": "MD5",
  "destination_path": "/test-jrobinso/globus/a-a-500MB.dat",
  "dynamic": false,
  "size": 500000000,
  "source_path": "/oak/stanford/groups/ruthm/jrobinso/karl/a-a-500MB.dat"
}

The checksum value matches the output of the md5sum utility:

$ md5sum a-a-500MB.dat
061fa237522bd2200ed14453ddfa6c86  a-a-500MB.dat

When Globus copied this file into Elm it did so using a single PUT request, and the response included an ETag header which matches the MD5 checksum:

PUT /test-jrobinso/globus/a-a-500MB.dat HTTP/1.1
Host: test.elm.stanford.edu:9000
User-Agent: Mozilla/4.0 (Compatible; Unknown; libs3 4.1; Linux x86_64)
Content-Length: 500000000
Accept: */*
Authorization: AWS4-HMAC-SHA256 Credential=uBM8VflA03OwmLarTVp8/20250114/us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=10ce5db9a504d8f588559a514569ca76b3efa58e8eb55fa48b9e9eae168dc8a6
X-Amz-Content-Sha256: UNSIGNED-PAYLOAD
X-Amz-Date: 20250114T200036Z
X-Forwarded-For: 172.18.0.1
Accept-Encoding: gzip

HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Type: text/plain; charset=utf-8
ETag: "061fa237522bd2200ed14453ddfa6c86"
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Id-2: 7e63378ad44bb38bbd22e530765313e809e81cf73e8706de3a00b16c1a63c395
X-Amz-Request-Id: 181AA78E77379CCF
X-Content-Type-Options: nosniff
X-Ratelimit-Limit: 15957
X-Ratelimit-Remaining: 15957
X-Xss-Protection: 1; mode=block
Date: Tue, 14 Jan 2025 20:00:42 GMT

Because the ETag value was a match, Globus knows the file was transferred successfully.

Objects that are larger than 500MiB require multi-part uploads, meaning an object will be copied to Elm in multiple parts. As Globus copies each part to Elm it will calculate the MD5 checksum of the part, and will confirm that Elm returns the expected ETag for the part (the same as with the PUT operation discussed above).

Once all the parts have been copied Globus will sort the decoded (binary) ETag values by their part number, concatenate them, and calculate the MD5 checksum of that concatenation. That value, sometimes referred to as the hash of hashes, then has the number of parts appended to it to arrive at the expected ETag value of the entire object. If Elm returns the same ETag value, then Globus knows the file was transferred successfully.

A complete example follows:

PUT /test-jrobinso/globus/1G.dat?partNumber=1&uploadId=ZjNmZjgyZGUtOGFmMy00ZGViLTlmMzAtOGY3MDY4N2RlNjcxLmU5YzVlMjE1LTc1NWUtNDNjMy05ZTRiLWQ2NDNlMjZjMDgxMXgxNzM2ODg0ODM1Njg4NjM4MzUz HTTP/1.1
Host: test.elm.stanford.edu:9000
User-Agent: Mozilla/4.0 (Compatible; Unknown; libs3 4.1; Linux x86_64)
Content-Length: 500000000
Accept: */*
Authorization: AWS4-HMAC-SHA256 Credential=uBM8VflA03OwmLarTVp8/20250114/us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=2c0f2fc0a7e37212f3762d8b21b7a850d34dc1015a212882b45e58afac93703c
X-Amz-Content-Sha256: UNSIGNED-PAYLOAD
X-Amz-Date: 20250114T200035Z
X-Forwarded-For: 172.18.0.1
Accept-Encoding: gzip

HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Type: text/plain; charset=utf-8
ETag: "550b5bc6c697302869b6c34f246929ba"
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Id-2: 4c3f8076ee3906d0d5211289d83d7d8a5224ac479e2b2af661fb46f8fa445ff2
X-Amz-Request-Id: 181AA78E43D9F6BE

PUT /test-jrobinso/globus/1G.dat?partNumber=2&uploadId=ZjNmZjgyZGUtOGFmMy00ZGViLTlmMzAtOGY3MDY4N2RlNjcxLmU5YzVlMjE1LTc1NWUtNDNjMy05ZTRiLWQ2NDNlMjZjMDgxMXgxNzM2ODg0ODM1Njg4NjM4MzUz HTTP/1.1
Host: test.elm.stanford.edu:9000
User-Agent: Mozilla/4.0 (Compatible; Unknown; libs3 4.1; Linux x86_64)
Content-Length: 500000000
Accept: */*
Authorization: AWS4-HMAC-SHA256 Credential=uBM8VflA03OwmLarTVp8/20250114/us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=75d694ffffad1c30d00064badd02fd81cce2c1a8a2f9bf694dcfd77c31788f45
X-Amz-Content-Sha256: UNSIGNED-PAYLOAD
X-Amz-Date: 20250114T200041Z
X-Forwarded-For: 172.18.0.1
Accept-Encoding: gzip

HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
Content-Type: text/plain; charset=utf-8
ETag: "8df30abb818374c290bb5de8aa6e3c6d"
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Id-2: 7e63378ad44bb38bbd22e530765313e809e81cf73e8706de3a00b16c1a63c395
X-Amz-Request-Id: 181AA78F7EF9DA61

POST /test-jrobinso/globus/1G.dat?uploadId=ZjNmZjgyZGUtOGFmMy00ZGViLTlmMzAtOGY3MDY4N2RlNjcxLmU5YzVlMjE1LTc1NWUtNDNjMy05ZTRiLWQ2NDNlMjZjMDgxMXgxNzM2ODg0ODM1Njg4NjM4MzUz HTTP/1.1
Host: test.elm.stanford.edu:9000
User-Agent: Mozilla/4.0 (Compatible; Unknown; libs3 4.1; Linux x86_64)
Content-Length: 223
Accept: */*
Authorization: AWS4-HMAC-SHA256 Credential=uBM8VflA03OwmLarTVp8/20250114/us-east-1/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date,Signature=89d89dc8e5fc4ed3bc4ac9337b4b5f9cdd72c7477fa24b56e81edb0991628cae
X-Amz-Content-Sha256: UNSIGNED-PAYLOAD
X-Amz-Date: 20250114T200045Z
X-Forwarded-For: 172.18.0.1
Accept-Encoding: gzip

HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 343
Content-Type: application/xml
ETag: "0a0be0923ed0ff63f6d961064f850678-2"
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Id-2: 7e63378ad44bb38bbd22e530765313e809e81cf73e8706de3a00b16c1a63c395
X-Amz-Request-Id: 181AA790A515DB01
X-Content-Type-Options: nosniff
X-Ratelimit-Limit: 15957
X-Ratelimit-Remaining: 15957
X-Xss-Protection: 1; mode=block
Date: Tue, 14 Jan 2025 20:00:46 GMT

In this example the ETags for the individual parts (the results of the two PUT operations) were:

ETag: "550b5bc6c697302869b6c34f246929ba"
ETag: "8df30abb818374c290bb5de8aa6e3c6d"

and the final object's ETag (the result of the POST operation) was:

ETag: "0a0be0923ed0ff63f6d961064f850678-2"

We can replicate the process of generating this ETag on the command line:

$ (echo -n 550b5bc6c697302869b6c34f246929ba | xxd -r -p - ; \
   echo -n 8df30abb818374c290bb5de8aa6e3c6d | xxd -r -p -) | md5sum
0a0be0923ed0ff63f6d961064f850678  -

Because we know there were 2 parts in the object, we know the final ETag value should be:

0a0be0923ed0ff63f6d961064f850678-2

And that does match the ETag returned by Elm. Globus does not appear to surface this ETag value on its results page; instead, it displays the MD5 checksum of the source object as calculated by tools such as md5sum:

{
  "DATA_TYPE": "successful_transfer",
  "checksum": "e1670a70fb010cf827e822742807a59e",
  "checksum_algorithm":MD5",
  "destination_path": "/test-jrobinso/globus/1G.dat",
  "dynamic": false,
  "size": 1000000000,
  "source_path": "/oak/stanford/groups/ruthm/jrobinso/karl/1G.dat"
}

It's important to note that the inner workings of Globus's S3 Connector are somewhat opaque to us, as it is not open source. The description given here is based on conversations with the Globus support team and observation of the headers exchanged.

rclone#

When rclone transfers files from disk into Elm it will calculate an MD5 checksum of each file, and of the individual parts it uploads.

When rclone creates an object in Elm it will embed the object checksum into the object's metadata record. If we follow the creation of an object in the example below, we can see an initial POST request from rclone that contains an X-Amz-Meta-Md5chksum header. The value Bh+iN1Ir0iAO0URT3fpshg== in this example is the Base64-encoded value of the MD5 checksum of the original file.

$ md5sum a-a-500MB.dat | tee /dev/tty | awk '{print $1}' | xxd -r -p - | base64
061fa237522bd2200ed14453ddfa6c86  a-a-500MB.dat
Bh+iN1Ir0iAO0URT3fpshg==

This same MD5 checksum value is passed in a Content-Md5 header when rclone sends the actual object body in a PUT request. Elm will check that the body it read from the client matches this MD5 checksum, and it will reject the request if they do not match.6 rclone then verifies that the expected ETag and X-Amz-Meta-Md5chksum fields are returned by Elm via a HEAD request.

An example:

POST /test-jrobinso/rclone/a-a-500MB.dat?uploads= HTTP/1.1
Host: test.elm.stanford.edu:9000
User-Agent: rclone/v1.57.0-DEV
Content-Length: 0
Accept-Encoding: gzip
Authorization: AWS4-HMAC-SHA256 Credential=uBM8VflA03OwmLarTVp8/20250114/us-east-1/s3/aws4_request, SignedHeaders=content-type;host;x-amz-acl;x-amz-content-sha256;x-amz-date;x-amz-meta-md5chksum;x-amz-meta-mtime, Signature=91557ffbf6f455ed69f396146ccc8ac5a608718918057327315f915c609be5d1
Content-Type: application/octet-stream
X-Amz-Acl: private
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20250114T200250Z
X-Amz-Meta-Md5chksum: Bh+iN1Ir0iAO0URT3fpshg==
X-Amz-Meta-Mtime: 1734560990
X-Forwarded-For: 172.18.0.1

...

PUT /test-jrobinso/rclone/a-a-500MB.dat?partNumber=1&uploadId=ZjNmZjgyZGUtOGFmMy00ZGViLTlmMzAtOGY3MDY4N2RlNjcxLmZkNmM0NWZkLTAyNTEtNDZhNS04YTJiLWYyYmJkYWJiMDQ4NXgxNzM2ODg0OTcwNzU0MzcxMjU4 HTTP/1.1
Host: test.elm.stanford.edu:9000
User-Agent: rclone/v1.57.0-DEV
Content-Length: 500000000
Accept-Encoding: gzip
Authorization: AWS4-HMAC-SHA256 Credential=uBM8VflA03OwmLarTVp8/20250114/us-east-1/s3/aws4_request, SignedHeaders=content-length;content-md5;host;x-amz-content-sha256;x-amz-date, Signature=c41feb02cbe99b300cc350e93b15dc87ba1bc1f9a8a291b59006b1ed74adb9e1
Content-Md5: Bh+iN1Ir0iAO0URT3fpshg==
X-Amz-Content-Sha256: 72eff721bfe77d4320348c4f17be6d772886065f1e86e1c937ddd3162292171d
X-Amz-Date: 20250114T200253Z
X-Forwarded-For: 172.18.0.1

...

HEAD /test-jrobinso/rclone/a-a-500MB.dat HTTP/1.1
Host: test.elm.stanford.edu:9000
User-Agent: rclone/v1.57.0-DEV
Authorization: AWS4-HMAC-SHA256 Credential=uBM8VflA03OwmLarTVp8/20250114/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=9046c34cb89b259699c01808664fc9642a8d521cbb71cb58736d309d35fcd79d
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20250114T200258Z
X-Forwarded-For: 172.18.0.1

HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 500000000
Content-Type: application/octet-stream
ETag: "320a04f6c7d09738d03c8d7c620a4b60-1"
Last-Modified: Tue, 14 Jan 2025 20:02:58 GMT
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Id-2: 4c3f8076ee3906d0d5211289d83d7d8a5224ac479e2b2af661fb46f8fa445ff2
X-Amz-Request-Id: 181AA7AF789B9D67
X-Content-Type-Options: nosniff
X-Ratelimit-Limit: 15955
X-Ratelimit-Remaining: 15955
X-Xss-Protection: 1; mode=block
x-amz-meta-md5chksum: Bh+iN1Ir0iAO0URT3fpshg==
x-amz-meta-mtime: 1734560990
Date: Tue, 14 Jan 2025 20:02:58 GMT

The X-Amz-Meta-Md5chksum value can be used in the future to quickly check whether or not the local file has changed. If there is any question about the validity of the checksum, rclone accepts a --download flag to force it to download the object from Elm and then recalculate the checksum.7
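
For example, rclone check compares the source and destination by hash; adding --download forces it to re-read the object data from Elm rather than trusting stored checksums. A minimal sketch, assuming a remote named elm has been configured for Elm:

$ rclone check --download /local/data elm:test-jrobinso/rclone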

AWS command line client#

The AWS command line offers two different interfaces for managing S3 data.

The easiest mechanism to use is the s3 command. This interface offers a reasonable set of default assumptions for how the S3 API should be called to accomplish tasks such as copying files into the S3 service.

The aws s3 command works in a similar fashion as rclone, providing Content-Md5 headers to Elm to validate the contents of objects. The aws s3 command differs from rclone in that it does not store checksum metadata (the X-Amz-Meta-Md5chksum header used by rclone).
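
A minimal sketch of such a copy (the aws-s3/ prefix is illustrative; --endpoint-url points the CLI at Elm, matching the endpoint shown in the traces below):

$ aws --endpoint-url http://test.elm.stanford.edu:9000 s3 cp a-a-500MB.dat s3://test-jrobinso/aws-s3/a-a-500MB.dat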

An alternative command is s3api, which offers fewer default assumptions and gives the caller more control, including the ability to check object integrity using different checksum algorithms.

Using the aws s3api command you can create a multipart upload and request that the client calculate checksums to transmit to the server. In the example below we're using CRC32, but the S3 API offers a number of different algorithms:

  • CRC-32
  • CRC-32C
  • SHA-1
  • SHA-256

An example upload:

$ aws s3api create-multipart-upload --bucket test-jrobinso --key aws-s3api/a-a-500MB.dat --checksum-algorithm CRC32
{
    "ChecksumAlgorithm": "CRC32",
    "Bucket": "test-jrobinso",
    "Key": "aws-s3api/a-a-500MB.dat",
    "UploadId": "<upload_id elided>"
}

$ aws s3api upload-part --bucket test-jrobinso --key aws-s3api/a-a-500MB.dat --part-number 1 --upload-id ... --body ... --checksum-algorithm CRC32
{
    "ETag": "\"061fa237522bd2200ed14453ddfa6c86\"",
    "ChecksumCRC32": "s5V9kg=="
}

$ aws s3api complete-multipart-upload --bucket test-jrobinso --key aws-s3api/a-a-500MB.dat --multipart-upload 'Parts=[{ETag="061fa237522bd2200ed14453ddfa6c86",ChecksumCRC32=s5V9kg==,PartNumber=1}]' --upload-id ...
{
    "Location": "http://test.elm.stanford.edu:9000/test-jrobinso/aws-s3api/a-a-500MB.dat",
    "Bucket": "test-jrobinso",
    "Key": "aws-s3api/a-a-500MB.dat",
    "ETag": "\"320a04f6c7d09738d03c8d7c620a4b60-1\"",
    "ChecksumCRC32": "UcuxoA==-1"
}

The above commands resulted in the following exchange between the client and the Elm server:

POST /test-jrobinso/aws-s3api/a-a-500MB.dat?uploads HTTP/1.1
Host: test.elm.stanford.edu:9000
User-Agent: aws-cli/2.15.41 Python/3.11.8 Linux/5.14.0-362.18.1.el9_3.x86_64 exe/x86_64.rocky.9 prompt/off command/s3api.create-multipart-upload
Content-Length: 0
Accept-Encoding: identity
Authorization: AWS4-HMAC-SHA256 Credential=uBM8VflA03OwmLarTVp8/20250116/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-checksum-algorithm;x-amz-content-sha256;x-amz-date, Signature=9e355827004a000fbc55d574c416bd3d025c5652ee7c7386d00ce434b1d4ee5a
X-Amz-Checksum-Algorithm: CRC32
X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
X-Amz-Date: 20250116T202541Z
X-Forwarded-For: 172.18.0.1

HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 359
Content-Type: application/xml
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Checksum-Algorithm: CRC32
X-Amz-Id-2: d0ff3e8fe0714877ba3d64d4951e4df01ca73d8b3ee4b38d9228453c3a71c54b
X-Amz-Request-Id: 181B4616005745E2
X-Content-Type-Options: nosniff
X-Ratelimit-Limit: 15934
X-Ratelimit-Remaining: 15934
X-Xss-Protection: 1; mode=block
Date: Thu, 16 Jan 2025 20:25:41 GMT

PUT /test-jrobinso/aws-s3api/a-a-500MB.dat?partNumber=1&uploadId=ZjNmZjgyZGUtOGFmMy00ZGViLTlmMzAtOGY3MDY4N2RlNjcxLjI2YjAxYjg1LTlkYzgtNDQ4OS1iZWFjLTJlNTEwNzFiNWYzYXgxNzM3MDU5MTQxNjEzOTYxNTU5 HTTP/1.1
Host: test.elm.stanford.edu:9000
User-Agent: aws-cli/2.15.41 Python/3.11.8 Linux/5.14.0-362.18.1.el9_3.x86_64 exe/x86_64.rocky.9 prompt/off command/s3api.upload-part
Transfer-Encoding: chunked
Accept-Encoding: identity
Authorization: AWS4-HMAC-SHA256 Credential=uBM8VflA03OwmLarTVp8/20250116/us-east-1/s3/aws4_request, SignedHeaders=content-encoding;host;transfer-encoding;x-amz-content-sha256;x-amz-date;x-amz-decoded-content-length;x-amz-sdk-checksum-algorithm;x-amz-trailer, Signature=a9db960aca7f8d10c58736822af19415c52e3be191c1464230ebd1fcff7a4ed9
Content-Encoding: aws-chunked
Expect: 100-continue
X-Amz-Content-Sha256: STREAMING-UNSIGNED-PAYLOAD-TRAILER
X-Amz-Date: 20250116T202548Z
X-Amz-Decoded-Content-Length: 500000000
X-Amz-Sdk-Checksum-Algorithm: CRC32
X-Amz-Trailer: x-amz-checksum-crc32
X-Forwarded-For: 172.18.0.1

HTTP/1.1 100 Continue

<body transmitted from client to Elm>
x-amz-checksum-crc32:s5V9kg==

HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 0
ETag: "061fa237522bd2200ed14453ddfa6c86"
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Checksum-Crc32: s5V9kg==
X-Amz-Id-2: 4c3f8076ee3906d0d5211289d83d7d8a5224ac479e2b2af661fb46f8fa445ff2
X-Amz-Request-Id: 181B461797C44D39
X-Content-Type-Options: nosniff
X-Ratelimit-Limit: 15955
X-Ratelimit-Remaining: 15955
X-Xss-Protection: 1; mode=block
Date: Thu, 16 Jan 2025 20:25:54 GMT

POST /test-jrobinso/aws-s3api/a-a-500MB.dat?uploadId=ZjNmZjgyZGUtOGFmMy00ZGViLTlmMzAtOGY3MDY4N2RlNjcxLjI2YjAxYjg1LTlkYzgtNDQ4OS1iZWFjLTJlNTEwNzFiNWYzYXgxNzM3MDU5MTQxNjEzOTYxNTU5 HTTP/1.1
Host: test.elm.stanford.edu:9000
User-Agent: aws-cli/2.15.41 Python/3.11.8 Linux/5.14.0-362.18.1.el9_3.x86_64 exe/x86_64.rocky.9 prompt/off command/s3api.complete-multipart-upload
Content-Length: 222
Accept-Encoding: identity
Authorization: AWS4-HMAC-SHA256 Credential=uBM8VflA03OwmLarTVp8/20250116/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=b4582afdb2b4a382c71447fe935ab133fab60d7bd925a80822f5423c36e54d4c
X-Amz-Content-Sha256: 264af8533f513b8042e460c5daf9a6fc6810efd1901817995adac3426eee415f
X-Amz-Date: 20250116T202555Z
X-Forwarded-For: 172.18.0.1

HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 404
Content-Type: application/xml
ETag: "320a04f6c7d09738d03c8d7c620a4b60-1"
Server: MinIO
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Origin
Vary: Accept-Encoding
X-Amz-Checksum-Crc32: UcuxoA==-1
X-Amz-Id-2: 4c3f8076ee3906d0d5211289d83d7d8a5224ac479e2b2af661fb46f8fa445ff2
X-Amz-Request-Id: 181B46193A5E98C3
X-Content-Type-Options: nosniff
X-Ratelimit-Limit: 15955
X-Ratelimit-Remaining: 15955
X-Xss-Protection: 1; mode=block
Date: Thu, 16 Jan 2025 20:25:55 GMT

If you follow the headers you can see that the initial request to start a new multi-part upload specified

X-Amz-Checksum-Algorithm: CRC32

indicating that we'd use CRC32 to calculate the checksum of the object.

When part 1 was uploaded via PUT the SDK generated the initial headers:

X-Amz-Sdk-Checksum-Algorithm: CRC32
X-Amz-Trailer: x-amz-checksum-crc32

and added the trailing header:

x-amz-checksum-crc32:s5V9kg==

This is a useful efficiency the SDK offers: the SDK calculates the checksum of the stream as it is transmitted and then sends the checksum it expects the server to verify. This means that the client does not need to buffer anything before it sends the part to the server; it can simply monitor the data stream it sends and then indicate the checksum it expects the server to have calculated for that stream.
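
As a check, the part checksum carried in the trailer can be reproduced from the original file. Because this upload used a single 500,000,000-byte part, the part checksum is simply the CRC-32 of the whole file, Base64-encoded in big-endian byte order; a sketch using a small Python one-liner (any CRC-32/IEEE implementation would do):

$ python3 -c 'import sys, zlib, base64; print(base64.b64encode(zlib.crc32(open(sys.argv[1], "rb").read()).to_bytes(4, "big")).decode())' a-a-500MB.dat
s5V9kg==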

mc command line client#

The MinIO client offers a number of methods for copying data to an S3 server; basic usage is sketched below:

  • mc cp copies one or more objects
  • mc put uploads one object to a bucket
  • mc od streams a file to an object
  • mc pipe streams standard input to an object
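
A minimal mc cp sketch (the elm alias name is illustrative and must first be configured with mc alias set):

$ mc alias set elm http://test.elm.stanford.edu:9000 <access-key> <secret-key>
$ mc cp a-a-500MB.dat elm/test-jrobinso/mc/a-a-500MB.dat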

More information coming soon.

s3up client#

More information coming soon.

Getting Support#