Introduction

⚠️ CRITICAL: DO NOT USE FOR STANDARD CHECKSUMS

This project is an unsuccessful prototype that produces different hashes from md5sum and sha256sum for all files larger than 1MB. checkle is therefore incompatible with standard MD5/SHA-256 checksum utilities.

Please use standard time-tested tools like md5sum or sha256sum instead.

Welcome to checkle - an extremely fast checksum utility designed for bioinformatics workflows involving terabyte-scale genomics data.

What is checkle?

checkle is a high-performance command-line tool that leverages Merkle tree parallelization to compute checksums faster than traditional tools like md5sum or sha256sum. It's specifically optimized for bioinformatics workflows where data integrity is critical and files can be hundreds of gigabytes each.

Key Features

  • Blazing Fast: 5-10x faster than md5sum on multicore systems
  • Merkle Tree Parallelization: Near-linear speedup with CPU cores
  • Archive Support: Hash files within TAR/ZIP archives without extraction
  • Bioinformatics Focus: Optimized for large genomics files (FASTQ, BAM, VCF)
  • Multiple Output Formats: Text, JSON, CSV for pipeline integration
  • Progress Tracking: Real-time progress for long-running operations

Quick Example

# Hash a single genome file
checkle hash genome.fastq.gz

# Hash all FASTQ files in a sequencing run
checkle hash /data/run_001 --recursive --include "*.fastq.gz"

# Verify downloaded reference genome
checkle verify GRCh38.fa.gz --hash e3b0c44298fc1c149afbf4c8996fb924

# Hash files in compressed archive without extracting
checkle hash sequencing_data.tar.gz:*.fastq

Getting Started

Head over to the Installation guide to get started with checkle.

Installation

checkle offers multiple installation methods to suit different preferences and use cases.

The quickest way to get started is using our installation script:

# Standard build
curl -fsSL https://raw.githubusercontent.com/nrminor/checkle/main/INSTALL.sh | sh

# SIMD-optimized build (faster, requires modern CPU)
curl -fsSL https://raw.githubusercontent.com/nrminor/checkle/main/INSTALL.sh | sh -s -- --simd

Manual Binary Download

Download precompiled binaries from releases:

# SIMD-optimized (recommended for modern CPUs)
wget https://github.com/nrminor/checkle/releases/latest/download/checkle-x86_64-unknown-linux-gnu-simd.tar.gz
tar -xzf checkle-x86_64-unknown-linux-gnu-simd.tar.gz
sudo mv checkle /usr/local/bin/

# Standard compatibility version
wget https://github.com/nrminor/checkle/releases/latest/download/checkle-x86_64-unknown-linux-gnu.tar.gz

Cargo Install

If you have Rust installed, you can build and install checkle using Cargo:

# From crates.io (when published)
cargo install checkle

# With cargo-binstall (if available)
cargo binstall checkle

# From source
cargo install --git https://github.com/nrminor/checkle

Verification

After installation, verify that checkle is working correctly:

checkle --version

You should see output showing the installed version of checkle.

SIMD vs Standard Builds

checkle offers two build variants:

  • SIMD-optimized: Faster performance using advanced CPU instructions. Requires modern CPUs (x86_64 with SSE4.2+ or ARM64 with NEON).
  • Standard: Maximum compatibility across different hardware platforms.

For best performance on modern systems, use the SIMD-optimized build. If you encounter issues or are using older hardware, use the standard build.

Quick Start

Get up and running with checkle in 5 minutes.

Basic Commands

Hash a single file

checkle hash genome.fastq.gz

Hash multiple files

checkle hash *.fastq.gz

Hash with SHA-256 instead of MD5

checkle hash --algo sha256 genome.fastq.gz

Save checksums to a file

checkle hash *.fastq.gz -o checksums.txt

Verification

Verify a single file

checkle verify genome.fastq.gz --hash d41d8cd98f00b204e9800998ecf8427e

Verify multiple files from a checksum file

checkle verify-many --checksum-file checksums.txt

Working with Directories

Hash all files in a directory recursively

checkle hash /data/sequencing_run --recursive

Hash only specific file types

checkle hash /data --recursive --include "*.fastq" --include "*.fasta"

Exclude certain patterns

checkle hash /data --recursive --exclude "*.tmp" --exclude "*.log"

Archive Support

Hash files inside a TAR archive without extracting

checkle hash data.tar.gz:sequences/sample.fastq

Hash all files in an archive

checkle hash data.tar.gz:*

Output Formats

JSON output for downstream processing

checkle hash *.bam --format json > checksums.json

CSV for spreadsheet import

checkle hash *.vcf --format csv > checksums.csv

Pretty table display

checkle hash *.fastq --pretty

Performance Tuning

Increase parallel readers for large files

checkle hash huge_genome.fasta --parallel-readers 16

Adjust chunk size for optimal performance

checkle hash *.bam --chunk-size-kb 4096

Next Steps

See Command Line Usage for the full option reference and Configuration for performance tuning.

Command Line Usage

Complete reference for checkle's command-line interface.

Global Options

checkle [OPTIONS] <COMMAND>

Options available for all commands:

  • -v, --verbose: Increase logging verbosity (use multiple times)
  • -q, --quiet: Suppress non-essential output
  • --version: Display version information
  • --help: Show help information

Commands

hash

Generate checksums for files.

checkle hash [OPTIONS] <FILE_OR_DIR>

Options:

  • --algo <ALGORITHM>: Hash algorithm (md5, sha256) [default: md5]
  • -r, --recursive: Process directories recursively
  • -o, --output <FILE>: Save checksums to file
  • --format <FORMAT>: Output format (text, json, csv)
  • --pretty: Display results in a formatted table
  • --per-file: Create individual checksum files
  • --include <PATTERN>: Include only matching files
  • --exclude <PATTERN>: Exclude matching files
  • --no-ignore: Don't respect .gitignore rules
  • --parallel-readers <N>: Number of parallel readers [default: auto]
  • --chunk-size-kb <SIZE>: Chunk size in KB [default: 1024]
  • --absolute-paths: Display absolute paths in output

Examples:

# Hash single file
checkle hash file.txt

# Hash directory recursively
checkle hash /data --recursive

# Save SHA-256 checksums to file
checkle hash *.fastq --algo sha256 -o checksums.sha256

# Pretty table output
checkle hash *.bam --pretty

verify

Verify a file against a known hash.

checkle verify [OPTIONS] <FILE> --hash <HASH>

Options:

  • --hash <HASH>: Expected hash value (required)
  • --algo <ALGORITHM>: Hash algorithm [default: md5]
  • --chunk-size-kb <SIZE>: Chunk size in KB
  • --parallel-readers <N>: Number of parallel readers

Examples:

# Verify MD5
checkle verify genome.fasta --hash d41d8cd98f00b204e9800998ecf8427e

# Verify SHA-256
checkle verify data.tar.gz --hash e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 --algo sha256

verify-many

Verify multiple files using a checksum file.

checkle verify-many [OPTIONS] --checksum-file <FILE>

Options:

  • --checksum-file <FILE>: File containing checksums (required)
  • --base-dir <DIR>: Base directory for relative paths
  • --algo <ALGORITHM>: Hash algorithm [default: md5]
  • --fail-fast: Stop on first verification failure
  • --quiet: Only show failures
  • --pretty: Display results in formatted table
  • --parallel-files <N>: Files to process in parallel
  • --chunk-size-kb <SIZE>: Chunk size in KB
  • --failed-only: Only show failed verifications

Examples:

# Verify all files in checksum list
checkle verify-many --checksum-file checksums.txt

# Stop on first failure
checkle verify-many --checksum-file checksums.txt --fail-fast

# Show only failures
checkle verify-many --checksum-file checksums.txt --failed-only

# Verify with custom base directory
checkle verify-many --checksum-file checksums.txt --base-dir /data

Archive Syntax

Access files within archives using colon notation:

# Hash specific file in archive
checkle hash archive.tar:path/to/file.txt

# Hash all files matching pattern
checkle hash archive.zip:*.csv

# Hash all files in archive
checkle hash archive.tar.gz:*

Supported archive formats:

  • TAR (.tar, .tar.gz, .tar.bz2, .tar.xz)
  • ZIP (.zip)

Pattern Matching

Use glob patterns to filter files:

# Include patterns
checkle hash /data --include "*.fastq" --include "*.fasta"

# Exclude patterns
checkle hash /data --exclude "*.tmp" --exclude ".git/**"

# Combine include and exclude
checkle hash /data --include "*.txt" --exclude "*test*"

Exit Codes

  • 0: Success
  • 1: General error
  • 2: Verification failure
  • 3: File not found
  • 4: Invalid arguments

Configuration

Configure checkle for optimal performance in your environment.

Performance Tuning

Chunk Size

The chunk size determines how much data is read at once. Larger chunks can improve performance for sequential reads.

# Default: 1MB chunks
checkle hash file.txt

# Larger chunks for fast SSDs
checkle hash file.txt --chunk-size-kb 4096

# Smaller chunks for slower storage
checkle hash file.txt --chunk-size-kb 256

Recommendations:

  • Fast NVMe SSDs: 4096-8192 KB
  • Standard SSDs: 1024-2048 KB (default)
  • HDDs: 256-512 KB
  • Network storage: 128-256 KB
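As a sketch of what the chunk size controls, reads can be driven by a fixed-size buffer that is handed to the hasher one chunk at a time; `for_each_chunk` is a hypothetical helper for illustration, not checkle's actual API:

```rust
use std::fs::File;
use std::io::Read;

// Hypothetical helper: read `path` in chunks of `chunk_kb` kilobytes and
// pass each chunk to a callback (in checkle's case, the per-chunk hasher).
fn for_each_chunk(path: &str, chunk_kb: usize, mut f: impl FnMut(&[u8])) -> std::io::Result<()> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; chunk_kb * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // end of file
        }
        f(&buf[..n]); // hand the filled portion to the consumer
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Demo: write 3000 bytes, read them back in 1 KB chunks.
    std::fs::write("demo.bin", vec![7u8; 3000])?;
    let mut total = 0usize;
    for_each_chunk("demo.bin", 1, |chunk| total += chunk.len())?;
    assert_eq!(total, 3000);
    std::fs::remove_file("demo.bin")?;
    Ok(())
}
```

A larger buffer means fewer syscalls per gigabyte, which is why fast sequential storage benefits from bigger chunks.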

Parallel Readers

Control how many threads read file data in parallel.

# Auto-detect (default)
checkle hash large_file.bin

# Explicit thread count
checkle hash large_file.bin --parallel-readers 8

# Single-threaded for debugging
checkle hash large_file.bin --parallel-readers 1

Guidelines:

  • Files <64MB: Single thread (automatic)
  • Files ≥64MB: Multi-threaded based on CPU cores
  • Maximum useful: ~16 threads (I/O bound)

Batch Processing

When processing many files, control parallelism:

# Process 8 files simultaneously
checkle verify-many --checksum-file list.txt --parallel-files 8

# Limit batch size for memory constraints
checkle hash /data --recursive --max-files-batch 100

Algorithm Selection

Choose the right algorithm for your needs:

MD5 (Default)

  • Speed: Fastest
  • Security: Not cryptographically secure
  • Use case: Data integrity, duplicate detection
  • Compatibility: Universal support

checkle hash file.txt --algo md5

SHA-256

  • Speed: Slower than MD5
  • Security: Cryptographically secure
  • Use case: Security-critical verification
  • Compatibility: Wide support

checkle hash file.txt --algo sha256

Output Configuration

File Output

# Text format (default)
checkle hash *.txt -o checksums.txt

# JSON for programmatic use
checkle hash *.txt --format json -o checksums.json

# CSV for spreadsheets
checkle hash *.txt --format csv -o checksums.csv

Display Options

# Pretty table to stderr
checkle hash *.txt --pretty

# Absolute paths
checkle hash *.txt --absolute-paths

# Per-file checksum files
checkle hash *.txt --per-file

Filtering

Include/Exclude Patterns

# Include only specific extensions
checkle hash /data --include "*.fastq" --include "*.fasta"

# Exclude temporary files
checkle hash /data --exclude "*.tmp" --exclude "*.swp"

# Ignore .gitignore rules
checkle hash /project --no-ignore

Directory Traversal

# Recursive (process subdirectories)
checkle hash /data --recursive

# Non-recursive (default)
checkle hash /data

Environment Variables

While checkle doesn't require environment variables, you can use shell features:

# Default to SHA-256 when hashing
alias checkle-hash='checkle hash --algo sha256'

# Set default verbosity
export CHECKLE_OPTS='-vv'
checkle $CHECKLE_OPTS hash file.txt

Memory Usage

Memory usage scales with:

  • Number of parallel readers × chunk size
  • Number of files processed in parallel
  • Archive decompression buffers

Typical memory usage:

  • Single large file: ~64MB
  • Batch processing: ~256MB
  • Archive processing: ~128MB per archive

To reduce memory usage:

# Smaller chunks
checkle hash large_file --chunk-size-kb 256

# Fewer parallel operations
checkle hash /data --parallel-readers 2 --max-files-batch 10

Hash Algorithms

checkle supports multiple hash algorithms for different use cases.

Available Algorithms

MD5

  • Speed: ~500 MB/s per core
  • Hash size: 128 bits (32 hex characters)
  • Security: Broken for cryptographic use
  • Best for: Fast integrity checks, duplicate detection

checkle hash file.txt --algo md5

SHA-256

  • Speed: ~300 MB/s per core
  • Hash size: 256 bits (64 hex characters)
  • Security: Cryptographically secure
  • Best for: Security-critical verification, compliance

checkle hash file.txt --algo sha256

Algorithm Comparison

| Algorithm | Speed    | Security | Hash Size | Use Case              |
|-----------|----------|----------|-----------|-----------------------|
| MD5       | Fastest  | Weak     | 128 bits  | Data integrity        |
| SHA-256   | Moderate | Strong   | 256 bits  | Security verification |

Choosing an Algorithm

Use MD5 when:

  • Speed is critical
  • Processing terabyte-scale datasets
  • Checking data integrity (not security)
  • Compatibility with legacy systems
  • Detecting accidental corruption

Use SHA-256 when:

  • Security is important
  • Regulatory compliance required
  • Verifying downloaded files
  • Long-term archival storage
  • Protecting against tampering

Performance Characteristics

Merkle Tree Parallelization

checkle uses Merkle trees to parallelize hashing:

  1. File divided into chunks
  2. Each chunk hashed independently
  3. Hash results combined in binary tree
  4. Single root hash produced

This provides:

  • Near-linear speedup with CPU cores
  • Deterministic results
  • Memory-bounded operation
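The four steps can be sketched in Rust with a toy stand-in hash (FNV-1a here, purely for illustration; checkle's real leaves would be MD5 or SHA-256 digests). The leaf-then-combine structure is also why a Merkle root differs from a plain md5sum/sha256sum digest of the same bytes: only chunk digests, not the raw stream, reach the upper levels of the tree.

```rust
// Toy stand-in for MD5/SHA-256: 64-bit FNV-1a, for illustration only.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

fn merkle_root(data: &[u8], chunk_size: usize) -> u64 {
    // Steps 1-2: split into chunks and hash each independently (parallelizable).
    let mut level: Vec<u64> = data.chunks(chunk_size).map(fnv1a).collect();
    if level.is_empty() {
        level.push(fnv1a(&[])); // empty input: hash of nothing
    }
    // Step 3: combine digests pairwise up a binary tree.
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| {
                let mut bytes = pair[0].to_le_bytes().to_vec();
                if let Some(right) = pair.get(1) {
                    bytes.extend_from_slice(&right.to_le_bytes());
                }
                fnv1a(&bytes)
            })
            .collect();
    }
    // Step 4: the single root hash.
    level[0]
}

fn main() {
    let data = vec![42u8; 10_000];
    // Deterministic: same data and chunk size always yield the same root.
    assert_eq!(merkle_root(&data, 1024), merkle_root(&data, 1024));
    println!("root = {:016x}", merkle_root(&data, 1024));
}
```

Note that the root depends on the chunk size as well as the data, which is one reason checkle's hashes cannot be compared across different chunk settings.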

Real-World Performance

On a modern 8-core system:

MD5:

  • Single-threaded: ~500 MB/s
  • Multi-threaded: ~3.5 GB/s

SHA-256:

  • Single-threaded: ~300 MB/s
  • Multi-threaded: ~2.1 GB/s

Implementation Details

Chunk Processing

# Default 1MB chunks
checkle hash genome.fasta

# Larger chunks for better throughput
checkle hash genome.fasta --chunk-size-kb 4096

Parallel Readers

# Auto-detect optimal threads
checkle hash large_file.bin

# Manual thread control
checkle hash large_file.bin --parallel-readers 16

Verification

Single File

# MD5 verification
checkle verify file.txt --hash d41d8cd98f00b204e9800998ecf8427e

# SHA-256 verification
checkle verify file.txt --hash e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 --algo sha256

Batch Verification

# From checksum file
checkle verify-many --checksum-file md5sums.txt --algo md5

Compatibility

Because checkle builds its result from a Merkle tree over file chunks, its output matches md5sum and sha256sum only for files that fit in a single chunk (1MB by default). For larger files the root hash differs from the standard digest, as the warning at the top of this documentation explains.

# Matches standard tools only for files <= 1MB:
md5sum file.txt
sha256sum file.txt

# Standard tools will report mismatches for larger files:
md5sum -c checksums.md5
sha256sum -c checksums.sha256

Performance

checkle is designed for maximum performance on modern multicore systems.

Key Performance Features

Parallel Processing

  • Merkle tree-based parallelization
  • Near-linear scaling with CPU cores
  • Automatic thread count detection
  • Memory-bounded operation

Optimizations

  • SIMD acceleration (optional builds)
  • Zero-copy I/O where possible
  • Buffer pooling to reduce allocations
  • Optimized for SSD characteristics

Benchmarks

Single File Performance

Testing with a 10GB file on 8-core system:

| Tool      | Algorithm | Time | Speed    |
|-----------|-----------|------|----------|
| checkle   | MD5       | 2.8s | 3.5 GB/s |
| md5sum    | MD5       | 20s  | 500 MB/s |
| checkle   | SHA-256   | 4.7s | 2.1 GB/s |
| sha256sum | SHA-256   | 33s  | 300 MB/s |

Batch Processing

Processing 1000 files (100MB each):

| Tool      | Time | Files/sec |
|-----------|------|-----------|
| checkle   | 45s  | 22        |
| md5sum    | 200s | 5         |
| sha256sum | 330s | 3         |

Performance Tuning

For Large Files (>1GB)

# Increase chunk size
checkle hash large_genome.fasta --chunk-size-kb 4096

# More parallel readers
checkle hash large_genome.fasta --parallel-readers 16

For Many Small Files

# Increase batch parallelism
checkle hash /data --recursive --max-files-batch 50

# Reduce per-file overhead
checkle hash /data --recursive --no-progress

For Network Storage

# Smaller chunks to reduce latency impact
checkle hash /nfs/data/file.bin --chunk-size-kb 256

# Fewer parallel readers to avoid congestion
checkle hash /nfs/data/file.bin --parallel-readers 4

Memory Usage

Memory scales with:

  • chunk_size × parallel_readers per file
  • Number of files in parallel batch
  • Archive decompression buffers

Typical usage:

  • Large file (8 threads): ~64MB
  • Batch processing: ~256MB
  • Archive processing: ~128MB

CPU Utilization

checkle efficiently uses available CPU cores:

Small Files (<64MB)

  • Single-threaded (overhead not worth parallelization)
  • Multiple files processed in parallel

Large Files (≥64MB)

  • Multi-threaded per file
  • Scales to available cores
  • I/O and CPU overlapped

Storage Considerations

SSD Optimization

  • Default 1MB chunks align with SSD erase blocks
  • Sequential reads within regions
  • Minimal random I/O

HDD Optimization

# Larger sequential reads
checkle hash /hdd/file.bin --chunk-size-kb 4096

# Single reader to avoid seek overhead
checkle hash /hdd/file.bin --parallel-readers 1

SIMD Acceleration

SIMD builds provide additional speedup:

Performance Gains

  • Hex encoding: 2-3x faster
  • Memory operations: 20-30% faster
  • Overall: 10-15% improvement
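For reference, the scalar digest-to-hex conversion that the SIMD path vectorizes looks roughly like this (a sketch, not checkle's actual code); a 16-byte MD5 digest becomes 32 hex characters:

```rust
// Scalar baseline: one table lookup per nibble. SIMD builds process many
// bytes per instruction instead of this byte-at-a-time loop.
fn hex_encode(bytes: &[u8]) -> String {
    const HEX: &[u8; 16] = b"0123456789abcdef";
    let mut out = String::with_capacity(bytes.len() * 2);
    for &b in bytes {
        out.push(HEX[(b >> 4) as usize] as char); // high nibble
        out.push(HEX[(b & 0x0f) as usize] as char); // low nibble
    }
    out
}

fn main() {
    assert_eq!(hex_encode(&[0xd4, 0x1d, 0x8c]), "d41d8c");
    println!("{}", hex_encode(&[0xde, 0xad, 0xbe, 0xef]));
}
```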

Using SIMD Build

# Install SIMD version
curl -fsSL https://raw.githubusercontent.com/nrminor/checkle/main/INSTALL.sh | sh -s -- --simd

# Verify SIMD support
checkle --version  # Shows "simd" in version string

Comparison with Other Tools

vs Traditional Tools (md5sum, sha256sum)

  • 5-10x faster on multicore systems
  • Linear scaling with cores
  • Better memory efficiency

vs Parallel Implementations

  • Comparable raw performance
  • Better progress reporting
  • Archive support without extraction
  • More output formats

Best Practices

  1. Let checkle auto-detect settings - Default heuristics work well
  2. Use SIMD builds on modern CPUs - Free 10-15% speedup
  3. Match chunk size to storage - Larger for SSD, smaller for HDD
  4. Process files in batches - Better than one at a time
  5. Use appropriate algorithm - MD5 for speed, SHA-256 for security

Archive Support

checkle can hash files within archives without extracting them.

Supported Formats

TAR Archives

  • .tar - Uncompressed
  • .tar.gz / .tgz - Gzip compressed
  • .tar.bz2 - Bzip2 compressed
  • .tar.xz - XZ compressed

ZIP Archives

  • .zip - Various compression methods

Basic Usage

Hash Specific File in Archive

checkle hash archive.tar:path/to/file.txt

Hash All Files in Archive

checkle hash archive.tar.gz:*

Hash Files Matching Pattern

checkle hash data.zip:*.csv
checkle hash backup.tar:logs/*.log

Archive Path Syntax

Use colon (:) to separate archive from internal path:

archive_path:internal_path

Examples:

# Specific file
data.tar.gz:results/output.txt

# All files
data.tar.gz:*

# Pattern matching
data.zip:*.fastq
data.tar:experiments/*/results.csv
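A minimal parse of this notation can split on the first colon; `split_archive_path` is an illustrative name, and the sketch deliberately ignores edge cases such as Windows drive letters (`C:\...`):

```rust
// Split "archive_path:internal_path" at the first ':'. A plain path with no
// colon is returned unchanged with no internal component.
fn split_archive_path(arg: &str) -> (&str, Option<&str>) {
    match arg.split_once(':') {
        Some((archive, inner)) => (archive, Some(inner)),
        None => (arg, None),
    }
}

fn main() {
    assert_eq!(
        split_archive_path("data.tar.gz:results/output.txt"),
        ("data.tar.gz", Some("results/output.txt"))
    );
    assert_eq!(split_archive_path("plain.txt"), ("plain.txt", None));
}
```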

Pattern Matching

Wildcards

  • * - Match any characters (except /)
  • ** - Match any characters (including /)
  • ? - Match single character

Examples

# All CSV files in root
checkle hash archive.zip:*.csv

# All files in subdirectory
checkle hash archive.tar:data/*

# Recursive pattern
checkle hash archive.tar.gz:**/*.txt
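The wildcard rules above can be captured in a simplified matcher: `*` matches within a path segment (it stops at `/`) and `?` matches exactly one character. The cross-directory `**` case is omitted to keep the sketch short, and this is not checkle's actual matcher:

```rust
// Recursive glob matcher over byte slices.
fn glob_match(pat: &[u8], s: &[u8]) -> bool {
    match (pat.split_first(), s.split_first()) {
        (None, None) => true,
        (Some((&b'*', rest_p)), _) => {
            // '*' matches zero characters...
            if glob_match(rest_p, s) {
                return true;
            }
            // ...or consumes one non-'/' character and tries again.
            match s.split_first() {
                Some((&c, rest_s)) if c != b'/' => glob_match(pat, rest_s),
                _ => false,
            }
        }
        (Some((&b'?', rest_p)), Some((_, rest_s))) => glob_match(rest_p, rest_s),
        (Some((p, rest_p)), Some((c, rest_s))) if p == c => glob_match(rest_p, rest_s),
        _ => false,
    }
}

fn main() {
    assert!(glob_match(b"*.csv", b"table.csv"));
    assert!(!glob_match(b"*.csv", b"dir/table.csv")); // '*' does not cross '/'
    assert!(glob_match(b"sample_??.fastq", b"sample_01.fastq"));
}
```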

Performance

Streaming Processing

  • Files processed without full extraction
  • Memory usage bounded
  • Decompression on-the-fly

Limitations

  • Sequential access within archives
  • Cannot parallelize individual archive entries
  • Compressed archives require decompression

Examples

Genomics Data

# Hash FASTQ files in compressed archive
checkle hash sequencing_run.tar.gz:*.fastq

# Verify specific sample
checkle verify reads.tar.gz:sample_001.fastq --hash abc123

Backup Verification

# Hash all files in backup
checkle hash backup.tar.gz:* -o backup_checksums.txt

# Verify backup integrity later
checkle verify-many --checksum-file backup_checksums.txt

Data Transfer

# Before transfer - hash archive contents
checkle hash data.tar.gz:* > checksums_before.txt

# After transfer - verify integrity
checkle hash data.tar.gz:* > checksums_after.txt
diff checksums_before.txt checksums_after.txt

Archive vs Regular File

Without colon - hash the archive itself

checkle hash archive.tar.gz
# Output: abc123def456  archive.tar.gz

With colon - hash contents

checkle hash archive.tar.gz:file.txt
# Output: 789abc012def  archive.tar.gz:file.txt

Compressed Archives

Compression is handled transparently:

# All work the same way
checkle hash data.tar:file.txt      # Uncompressed
checkle hash data.tar.gz:file.txt   # Gzip
checkle hash data.tar.bz2:file.txt  # Bzip2
checkle hash data.tar.xz:file.txt   # XZ

Verification

Single File in Archive

checkle verify archive.tar:important.dat --hash d41d8cd98f00b204e9800998ecf8427e

Multiple Files

Create checksum file:

checkle hash archive.tar:* -o archive_checksums.txt

Verify later:

checkle verify-many --checksum-file archive_checksums.txt

Tips

  1. Use patterns to hash multiple files - More efficient than individual commands
  2. Save checksums for archives - Verify integrity without re-reading
  3. Compressed archives are slower - Decompression adds overhead
  4. Large archives work fine - Streaming prevents memory issues
  5. Archive path must exist - Archive file itself must be accessible

CLI Reference

Complete command-line interface reference for checkle.

Synopsis

checkle [OPTIONS] <COMMAND>

Global Options

| Option    | Short | Description                             |
|-----------|-------|-----------------------------------------|
| --verbose | -v    | Increase verbosity (use multiple times) |
| --quiet   | -q    | Suppress non-essential output           |
| --version |       | Display version                         |
| --help    | -h    | Show help                               |

Commands

checkle hash

Generate checksums for files.

checkle hash [OPTIONS] <FILE_OR_DIR>

Arguments

  • <FILE_OR_DIR> - File, directory, or archive path to hash

Options

| Option                 | Default | Description                       |
|------------------------|---------|-----------------------------------|
| --algo <ALGORITHM>     | md5     | Hash algorithm (md5, sha256)      |
| --recursive            | false   | Process directories recursively   |
| --output <FILE>        | -       | Output file for checksums         |
| --format <FORMAT>      | text    | Output format (text, json, csv)   |
| --pretty               | false   | Display formatted table           |
| --per-file             | false   | Create individual checksum files  |
| --include <PATTERN>    | -       | Include only matching files       |
| --exclude <PATTERN>    | -       | Exclude matching files            |
| --no-ignore            | false   | Don't respect .gitignore          |
| --parallel-readers <N> | auto    | Parallel reader threads           |
| --chunk-size-kb <SIZE> | 1024    | Chunk size in KB                  |
| --max-files-batch <N>  | 1000    | Maximum files per batch           |
| --absolute-paths       | false   | Use absolute paths                |
| --no-progress          | false   | Disable progress bars             |

checkle verify

Verify file against known hash.

checkle verify [OPTIONS] <FILE> --hash <HASH>

Arguments

  • <FILE> - File to verify

Options

| Option                 | Default  | Description             |
|------------------------|----------|-------------------------|
| --hash <HASH>          | required | Expected hash value     |
| --algo <ALGORITHM>     | md5      | Hash algorithm          |
| --chunk-size-kb <SIZE> | 1024     | Chunk size in KB        |
| --parallel-readers <N> | auto     | Parallel reader threads |
| --no-progress          | false    | Disable progress bar    |

checkle verify-many

Verify multiple files from checksum file.

checkle verify-many [OPTIONS] --checksum-file <FILE>

Options

| Option                 | Default  | Description                 |
|------------------------|----------|-----------------------------|
| --checksum-file <FILE> | required | File containing checksums   |
| --base-dir <DIR>       | .        | Base directory for paths    |
| --algo <ALGORITHM>     | md5      | Hash algorithm              |
| --fail-fast            | false    | Stop on first failure       |
| --quiet                | false    | Only show failures          |
| --pretty               | false    | Display formatted table     |
| --parallel-files <N>   | 4        | Files to verify in parallel |
| --chunk-size-kb <SIZE> | 1024     | Chunk size in KB            |
| --failed-only          | false    | Show only failures          |
| --no-progress          | false    | Disable progress bars       |

Archive Path Syntax

archive.tar:internal/path/file.txt

Special patterns:

  • :* - All files in archive
  • :*.ext - Files matching pattern
  • :dir/* - All files in directory

Exit Codes

| Code | Meaning              |
|------|----------------------|
| 0    | Success              |
| 1    | General error        |
| 2    | Verification failure |
| 3    | File not found       |
| 4    | Invalid arguments    |
| 5    | I/O error            |

Environment Variables

No environment variables are used directly. Shell aliases can customize defaults:

alias checkle-hash='checkle hash --algo sha256'

File Formats

Checksum File Format (text)

<hash>  <filepath>

Example:

d41d8cd98f00b204e9800998ecf8427e  file1.txt
e3b0c44298fc1c149afbf4c8996fb924  file2.txt
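The text format above can be parsed by splitting each line on the double space; `parse_line` is an illustrative sketch, not checkle's internal parser:

```rust
// Parse one "<hash>  <filepath>" line; reject entries whose hash is not a
// plausible MD5 (32 hex chars) or SHA-256 (64 hex chars) digest.
fn parse_line(line: &str) -> Option<(&str, &str)> {
    let (hash, path) = line.split_once("  ")?;
    if (hash.len() == 32 || hash.len() == 64) && hash.bytes().all(|b| b.is_ascii_hexdigit()) {
        Some((hash, path))
    } else {
        None
    }
}

fn main() {
    let line = "d41d8cd98f00b204e9800998ecf8427e  file1.txt";
    assert_eq!(
        parse_line(line),
        Some(("d41d8cd98f00b204e9800998ecf8427e", "file1.txt"))
    );
    assert_eq!(parse_line("not a checksum line"), None);
}
```

This two-space convention is the same one md5sum-style checksum files use, which is what makes the format easy to post-process with standard text tools.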

JSON Format

[
  {
    "hash": "d41d8cd98f00b204e9800998ecf8427e",
    "filepath": "file1.txt"
  }
]

CSV Format

hash,filepath
d41d8cd98f00b204e9800998ecf8427e,file1.txt

Configuration Reference

Detailed configuration options for checkle.

Performance Parameters

chunk-size-kb

Controls the size of data chunks read from files.

  • Type: Integer
  • Default: 1024 (1MB)
  • Range: 4 - 65536
  • Unit: Kilobytes

Impact:

  • Larger chunks: Better sequential throughput, more memory
  • Smaller chunks: Less memory, better for slow storage

Recommendations:

| Storage Type | Recommended Size |
|--------------|------------------|
| NVMe SSD     | 4096-8192 KB     |
| SATA SSD     | 1024-2048 KB     |
| HDD          | 256-512 KB       |
| Network      | 128-256 KB       |

parallel-readers

Number of threads for parallel file reading.

  • Type: Integer
  • Default: Auto (based on CPU cores)
  • Range: 1 - 64
  • Auto logic: min(CPU_cores, 8)

Impact:

  • More threads: Better parallelization for large files
  • Fewer threads: Less memory usage, less I/O contention

Guidelines:

  • Files <64MB: Always single-threaded
  • Files ≥64MB: Multi-threaded based on size
  • I/O bound after ~16 threads
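The documented auto heuristic, min(CPU_cores, 8), can be expressed directly with the standard library (a sketch of the idea, not checkle's code):

```rust
use std::thread;

// Auto-detect a reader count: available cores, capped at 8, with a
// single-threaded fallback if detection fails.
fn auto_readers() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
        .min(8)
}

fn main() {
    let n = auto_readers();
    assert!((1..=8).contains(&n));
    println!("auto parallel readers: {n}");
}
```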

max-files-batch

Maximum files to process simultaneously.

  • Type: Integer
  • Default: 1000
  • Range: 1 - 10000

Impact:

  • Larger batch: Better throughput for many small files
  • Smaller batch: Less memory usage

parallel-files

Files to verify in parallel (verify-many only).

  • Type: Integer
  • Default: 4
  • Range: 1 - 32

Impact:

  • More parallel: Faster verification of many files
  • Less parallel: Lower system load

Algorithm Configuration

algo

Hash algorithm selection.

  • Type: Enum
  • Values: md5, sha256
  • Default: md5

Characteristics:

| Algorithm | Speed    | Security | Size    | Compatibility |
|-----------|----------|----------|---------|---------------|
| md5       | Fastest  | Weak     | 128-bit | Universal     |
| sha256    | Moderate | Strong   | 256-bit | Wide          |

Output Configuration

format

Output format for checksums.

  • Type: Enum
  • Values: text, json, csv
  • Default: text (auto-detected from file extension)

Auto-detection:

  • .json → JSON format
  • .csv → CSV format
  • Others → text format
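The auto-detection rules above amount to a match on the output file's extension; `detect_format` is a hypothetical name for illustration:

```rust
use std::path::Path;

// Map an output filename's extension to a format name, falling back to text.
fn detect_format(output: &str) -> &'static str {
    match Path::new(output).extension().and_then(|e| e.to_str()) {
        Some("json") => "json",
        Some("csv") => "csv",
        _ => "text",
    }
}

fn main() {
    assert_eq!(detect_format("checksums.json"), "json");
    assert_eq!(detect_format("checksums.csv"), "csv");
    assert_eq!(detect_format("checksums.txt"), "text");
}
```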

pretty

Display results in formatted table.

  • Type: Boolean
  • Default: false
  • Output: stderr (doesn't interfere with stdout)

per-file

Create individual checksum files.

  • Type: Boolean
  • Default: false
  • Naming: <filename>.<algo> (e.g., file.txt.md5)

absolute-paths

Use absolute paths in output.

  • Type: Boolean
  • Default: false (relative paths)

Filtering Configuration

include

Patterns for files to include.

  • Type: String (multiple allowed)
  • Syntax: Glob patterns
  • Default: Include all

Examples:

--include "*.txt"
--include "data/*.csv"
--include "**/*.fastq"

exclude

Patterns for files to exclude.

  • Type: String (multiple allowed)
  • Syntax: Glob patterns
  • Default: Exclude none

Examples:

--exclude "*.tmp"
--exclude ".git/**"
--exclude "**/test/*"

no-ignore

Don't respect .gitignore rules.

  • Type: Boolean
  • Default: false (respect .gitignore)

recursive

Process directories recursively.

  • Type: Boolean
  • Default: false

Progress Configuration

no-progress

Disable progress bars.

  • Type: Boolean
  • Default: false (show progress)

Auto-disabled when:

  • Output is not a terminal
  • Quiet mode is enabled
  • Single small file

Verification Configuration

fail-fast

Stop on first verification failure.

  • Type: Boolean
  • Default: false (verify all)
  • Applies to: verify-many

failed-only

Show only failed verifications.

  • Type: Boolean
  • Default: false (show all)
  • Applies to: verify-many

quiet

Suppress non-error output.

  • Type: Boolean
  • Default: false
  • Applies to: verify-many

Resource Limits

Built-in limits for stability:

| Resource             | Limit | Configurable          |
|----------------------|-------|-----------------------|
| Max chunk size       | 64MB  | Yes (chunk-size-kb)   |
| Max parallel readers | 64    | Yes (parallel-readers)|
| Max files batch      | 10000 | Yes (max-files-batch) |
| Max archive size     | 1TB   | No                    |
| Max path length      | 4096  | No                    |

Default Behavior

Without options, checkle:

  1. Uses MD5 algorithm
  2. Auto-detects optimal thread count
  3. Uses 1MB chunks
  4. Shows progress bars
  5. Outputs to stdout in text format
  6. Processes single files/directories non-recursively
  7. Respects .gitignore rules

Development Setup

Get your development environment ready for contributing to checkle.

Prerequisites

Required

  • Rust 1.70+ (latest stable recommended)
  • Git
  • C compiler (for dependencies)

Optional

  • Just (command runner)
  • cargo-watch (auto-rebuild)
  • cargo-criterion (benchmarking)

Clone Repository

git clone https://github.com/nrminor/checkle.git
cd checkle

Install Rust

If you don't have Rust installed:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

Update to latest stable:

rustup update stable
rustup default stable

Install Development Tools

Just (Command Runner)

# macOS
brew install just

# Linux
curl --proto '=https' --tlsv1.2 -sSf https://just.systems/install.sh | bash -s -- --to ~/.local/bin

# Via cargo
cargo install just

Development Dependencies

# Auto-rebuild on changes
cargo install cargo-watch

# Benchmarking
cargo install cargo-criterion

# Code coverage
cargo install cargo-tarpaulin

Build and Test

Using Just

# Run all checks
just check

# Run tests
just test

# Run clippy
just clippy

# Format code
just fmt

# Run benchmarks
just bench

Using Cargo Directly

# Build debug version
cargo build

# Build release version
cargo build --release

# Run tests
cargo test

# Run with verbose output
cargo test -- --nocapture

# Run specific test
cargo test test_name

Development Workflow

1. Create Feature Branch

git checkout -b feature/my-feature

2. Make Changes

# Edit files
vim src/module.rs

# Auto-rebuild and test
cargo watch -x test

3. Run Checks

# Format code
cargo fmt

# Run clippy (MUST pass)
cargo clippy --all-targets --all-features -- -D warnings

# Run tests
cargo test

# Verify hashes
just verify-hashes

4. Test Release Build

cargo build --release
./target/release/checkle hash test.txt

Code Standards

Mandatory Checks

Every change MUST pass:

  1. cargo fmt - Code formatting
  2. cargo clippy --all-targets --all-features -- -D warnings - Zero warnings
  3. cargo test - All tests pass
  4. just verify-hashes - Hash correctness

Documentation

  • All public items need rustdoc
  • Include examples in doc comments
  • Update relevant .md files

Testing

  • Add at least 3 tests per change
  • Use proptest for algorithmic code
  • Integration tests for CLI changes

Project Structure

checkle/
├── src/
│   ├── main.rs           # Binary entry point
│   ├── lib.rs            # Library root
│   ├── cli.rs            # CLI definitions
│   ├── hashing.rs        # Core hashing logic
│   ├── io.rs             # File I/O
│   ├── errors.rs         # Error types
│   └── commands/         # Command implementations
├── tests/                # Integration tests
├── benches/              # Benchmarks
├── docs/                 # mdBook documentation
└── justfile              # Development commands

Common Tasks

Add New Hash Algorithm

  1. Add variant to HashingAlgo enum
  2. Implement in Hasher methods
  3. Add tests
  4. Update documentation

Add New Command

  1. Define in cli.rs
  2. Implement in commands/
  3. Add integration tests
  4. Update CLI documentation

Improve Performance

  1. Run benchmarks: just bench
  2. Profile with cargo flamegraph
  3. Make changes
  4. Verify with benchmarks
  5. Ensure tests still pass

Debugging

Enable Debug Output

RUST_LOG=debug cargo run -- hash test.txt

Run with Backtrace

RUST_BACKTRACE=1 cargo run -- hash test.txt

Use GDB/LLDB

cargo build
gdb target/debug/checkle

IDE Setup

VS Code

Install extensions:

  • rust-analyzer
  • CodeLLDB (debugging)

IntelliJ/CLion

Install Rust plugin

Vim/Neovim

Use rust.vim and rust-analyzer LSP

Getting Help

  • Check existing issues on GitHub
  • Read AGENTS.md for AI assistant guidelines
  • Ask in discussions/issues
  • Review test files for examples

Building from Source

Build checkle from source code for development or custom installations.

Quick Build

Standard Build

git clone https://github.com/nrminor/checkle.git
cd checkle
cargo build --release

Binary will be at: target/release/checkle

SIMD-Optimized Build

RUSTFLAGS="-C target-cpu=native" cargo build --release --features simd

Build Types

Debug Build

Fast compilation, slow runtime, debug symbols:

cargo build
# Binary at: target/debug/checkle

Release Build

Slow compilation, fast runtime, optimized:

cargo build --release
# Binary at: target/release/checkle

SIMD Build

Hardware-specific optimizations:

# For current CPU
RUSTFLAGS="-C target-cpu=native" cargo build --release --features simd

# For specific architecture
RUSTFLAGS="-C target-feature=+avx2" cargo build --release --features simd

Installation

System-Wide

# Build and install to ~/.cargo/bin (per-user, assuming it is on your PATH)
cargo install --path .

# Or copy manually for a true system-wide install
sudo cp target/release/checkle /usr/local/bin/

Local Directory

# Build and copy to specific location
cargo build --release
cp target/release/checkle ~/bin/

Cross-Compilation

Linux to Windows

# Install target
rustup target add x86_64-pc-windows-gnu

# Build
cargo build --release --target x86_64-pc-windows-gnu

Linux to macOS

# Install target
rustup target add x86_64-apple-darwin

# Build (requires macOS SDK)
cargo build --release --target x86_64-apple-darwin

Using Cross

# Install cross
cargo install cross

# Build for target
cross build --release --target aarch64-unknown-linux-gnu

Build Features

Available Features

  • simd - Enable SIMD optimizations (requires nightly)

Feature Combinations

# Default (no features)
cargo build --release

# With SIMD
cargo build --release --features simd

# All features
cargo build --release --all-features

Build Requirements

Minimum Requirements

  • Rust 1.70+
  • 2GB RAM
  • 500MB disk space

Recommended

  • Rust latest stable
  • 4GB RAM
  • 1GB disk space
  • SSD for faster builds

Platform-Specific Notes

Linux

No special requirements. Works on all major distributions.

macOS

# Install Xcode command line tools if needed
xcode-select --install

Windows

Use either:

  • MSVC toolchain (Visual Studio)
  • GNU toolchain (MinGW)

# MSVC
rustup default stable-msvc

# GNU
rustup default stable-gnu

Optimizations

Link-Time Optimization

Add to Cargo.toml:

[profile.release]
lto = true

Codegen Units

For maximum performance:

[profile.release]
codegen-units = 1

CPU-Specific

RUSTFLAGS="-C target-cpu=native" cargo build --release
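
Combining the above: a suggested release profile for maximum performance, at the cost of noticeably longer compile times (values mirror the snippets in this section):

```toml
# Suggested combined settings in Cargo.toml; pair with
# RUSTFLAGS="-C target-cpu=native" at build time.
[profile.release]
lto = true
codegen-units = 1
```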

Docker Build

FROM rust:latest as builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
COPY --from=builder /app/target/release/checkle /usr/local/bin/
CMD ["checkle"]

Build:

docker build -t checkle .

Nix Build

Using flake:

nix build

Traditional:

nix-build

Verification

After building, verify the binary works:

# Check version
./target/release/checkle --version

# Run tests
cargo test

# Verify against standard tools (small files only: checkle's Merkle-tree
# output diverges from md5sum for files larger than 1MB)
echo "test" > test.txt
./target/release/checkle hash test.txt
md5sum test.txt  # Should match for this small file

Troubleshooting

Out of Memory

# Limit parallel jobs
cargo build --release -j 2

Linker Errors

# On Linux, install development packages
sudo apt-get install build-essential
# or
sudo yum groupinstall "Development Tools"

SIMD Build Fails

# Use nightly Rust
rustup install nightly
cargo +nightly build --release --features simd

Slow Builds

# Use sccache
cargo install sccache
export RUSTC_WRAPPER=sccache
cargo build --release

Build Artifacts

Build produces:

target/
├── release/
│   ├── checkle          # Main binary
│   ├── deps/            # Dependencies
│   └── build/           # Build scripts output
└── debug/               # Debug build (if built)

Clean Build

# Remove all build artifacts
cargo clean

# Remove only release artifacts
cargo clean --release

Working with AI Agents

As AI coding agents continue to improve, generated code is increasingly part of the software development process. Projects that plan to use AI-generated code should be structured to provide maximum context (goals, style and API design guidelines, and non-negotiable development rules) so that agents can work effectively. The checkle codebase is designed to do just that: starting with AGENTS.md and continuing through the context/ directory, it ships with specific requirements and guidelines that all AI agents must follow to ensure code quality and maintainability.

Strict Rule Compliance

AI agents working on checkle must follow all development rules without exception. At minimum, this includes:

  • Quality Checks: Run cargo fmt, cargo check, and cargo clippy --all-targets --all-features -- -D warnings before declaring any work complete
  • No Lint Suppressions: Never add #[allow()] lint suppressions without explicit permission from the project maintainer. The project treats clippy lints as opt-out: every lint applies, and applies strictly, unless an exception has been explicitly approved. This keeps coding agents "on the rails" while giving them specific, achievable feedback on whether a feature is actually finished.
  • Three-Test Rule: Every change must include at least 3 new, improved, or replaced tests
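
As an illustration of the Three-Test Rule, three meaningful tests for a hypothetical chunk-count helper (not a real checkle function) might look like:

```rust
// Hypothetical helper: how many fixed-size chunks does a file occupy?
fn chunk_count(file_len: u64, chunk_size: u64) -> u64 {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    // Ceiling division without overflow concerns for realistic sizes.
    (file_len + chunk_size - 1) / chunk_size
}

fn main() {
    // 1. Typical case: 10 bytes in 4-byte chunks needs 3 chunks.
    assert_eq!(chunk_count(10, 4), 3);
    // 2. Exact-multiple boundary: no spurious extra chunk.
    assert_eq!(chunk_count(8, 4), 2);
    // 3. Empty-file edge case.
    assert_eq!(chunk_count(0, 4), 0);
    println!("all three tests pass");
}
```

Each test exercises a distinct behavior (typical input, boundary, edge case) rather than restating the implementation.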

Frequent Context Loading

Before making any changes, AI agents must read and understand these documents:

  1. AGENTS.md - Complete development guidelines and project rules
  2. README.md - Project overview and goals
  3. TIGER_STYLE.md - World-class software robustness principles
  4. GRUG BRAIN DEVELOPER - Pragmatic simplicity principles

It is also recommended that agents review these documents after implementing a feature, to confirm that the submitted code is standards-compliant and leaves the codebase better than they found it.

Essential Guidelines for AI Agents

Performance Focus

checkle is designed for bioinformatics workflows with terabyte-scale files. Always consider:

  • Multicore utilization and parallel processing
  • Memory efficiency for large file handling
  • Merkle tree optimization opportunities
  • Buffer reuse and minimal allocations
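
These considerations can be illustrated with a minimal Merkle-style sketch using only the standard library. std's DefaultHasher stands in for checkle's real digest, and the combine step is a single level rather than a full tree:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

// Hash one chunk; placeholder digest, not checkle's actual hash function.
fn chunk_digest(chunk: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    chunk.hash(&mut h);
    h.finish()
}

// Merkle-style parallel hashing: hash fixed-size chunks on worker threads,
// then hash the sequence of chunk digests to get a root.
fn merkle_root(data: &[u8], chunk_size: usize) -> u64 {
    let leaves: Vec<u64> = thread::scope(|s| {
        data.chunks(chunk_size)
            .map(|c| s.spawn(move || chunk_digest(c)))
            .collect::<Vec<_>>() // spawn all workers before joining any
            .into_iter()
            .map(|j| j.join().expect("worker panicked"))
            .collect()
    });
    let mut h = DefaultHasher::new();
    leaves.hash(&mut h); // combine leaf digests into the root
    h.finish()
}

fn main() {
    let data = vec![42u8; 1 << 20];
    println!("root = {:016x}", merkle_root(&data, 64 * 1024));
}
```

Because each chunk is hashed independently, the leaf stage parallelizes across cores; a production version would cap the number of in-flight workers and reuse read buffers rather than spawning one thread per chunk.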

Code Quality Standards

The project enforces extremely strict quality standards:

  • Zero clippy warnings allowed
  • Comprehensive test coverage required
  • No unwrap() calls in production code
  • Proper error handling with context
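
The last two bullets can be sketched as follows. The names are illustrative, not checkle's actual API, and a real implementation would likely use a dedicated error type rather than String:

```rust
use std::fs::File;
use std::io::Read;

// Hypothetical helper illustrating "no unwrap(), add context":
// every fallible call is mapped into an error that names the file involved.
fn read_header(path: &str) -> Result<Vec<u8>, String> {
    let mut file = File::open(path)
        .map_err(|e| format!("failed to open {path}: {e}"))?; // context, not unwrap()
    let mut buf = vec![0u8; 16];
    let n = file
        .read(&mut buf)
        .map_err(|e| format!("failed to read {path}: {e}"))?;
    buf.truncate(n);
    Ok(buf)
}

fn main() {
    match read_header("Cargo.toml") {
        Ok(bytes) => println!("read {} header bytes", bytes.len()),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

When a batch job over thousands of files fails, the error message identifies which file broke, instead of an anonymous panic from unwrap().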

Balance Principles

Follow both development philosophies:

  • Tiger Style: Robustness, assertions, resource limits
  • Grug Brain: Simplicity, avoiding premature complexity

When these conflict, prefer solutions that are both robust AND simple.

Bioinformatics Context

Remember that checkle serves genomics researchers who:

  • Work with files that can be 500GB+ each
  • Need reliable integrity verification for critical data
  • Require fast batch processing of many large files
  • Value performance and correctness over feature richness

Critical Requirements

Documentation Reading

Working without reading the required documents (AGENTS.md, README.md, TIGER_STYLE.md, GRUG BRAIN DEVELOPER) is unacceptable and will result in code rejection.

Error Prevention

Common mistakes AI agents must avoid:

  • Adding dependencies without approval
  • Skipping quality checks
  • Writing tests that do nothing beyond triggering the code's own assertions
  • Adding complexity without clear benefit
  • Ignoring performance implications

Success Criteria

Code is considered complete only when:

  1. All quality checks pass without warnings
  2. At least 3 meaningful tests are included
  3. Performance implications have been considered
  4. Tiger Style and Grug Brain principles are balanced
  5. The change serves the bioinformatics use case

Getting Started

  1. Read all required documentation
  2. Understand the specific task requirements
  3. Consider performance and simplicity implications
  4. Implement with comprehensive error handling
  5. Add thorough tests
  6. Run all quality checks
  7. Document any architectural decisions

By following these guidelines, AI agents can contribute effectively to checkle while maintaining the high standards that make it reliable for critical bioinformatics workflows.