Introduction
⚠️ CRITICAL: DO NOT USE FOR STANDARD CHECKSUMS
This project is an unsuccessful prototype that will produce different hashes than md5sum and sha256sum for all files larger than 1MB. checkle is thus incompatible with standard MD5/SHA256 checksum utilities. Please use standard, time-tested tools like md5sum or sha256sum instead.
Welcome to checkle - an extremely fast checksum utility designed for bioinformatics workflows involving terabyte-scale genomics data.
What is checkle?
checkle is a high-performance command-line tool that leverages Merkle tree parallelization to compute checksums faster than traditional tools like md5sum or sha256sum. It's specifically optimized for bioinformatics workflows where data integrity is critical and files can be hundreds of gigabytes each.
Key Features
- Blazing Fast: 5-10x faster than md5sum on multicore systems
- Merkle Tree Parallelization: Near-linear speedup with CPU cores
- Archive Support: Hash files within TAR/ZIP archives without extraction
- Bioinformatics Focus: Optimized for large genomics files (FASTQ, BAM, VCF)
- Multiple Output Formats: Text, JSON, CSV for pipeline integration
- Progress Tracking: Real-time progress for long-running operations
Quick Example
# Hash a single genome file
checkle hash genome.fastq.gz
# Hash all FASTQ files in a sequencing run
checkle hash /data/run_001 --recursive --include "*.fastq.gz"
# Verify downloaded reference genome
checkle verify GRCh38.fa.gz --hash e3b0c44298fc1c149afbf4c8996fb924
# Hash files in compressed archive without extracting
checkle hash sequencing_data.tar.gz:*.fastq
Getting Started
Head over to the Installation guide to get started with checkle.
Installation
checkle offers multiple installation methods to suit different preferences and use cases.
Quick Install (Recommended)
The quickest way to get started is using our installation script:
# Standard build
curl -fsSL https://raw.githubusercontent.com/nrminor/checkle/main/INSTALL.sh | sh
# SIMD-optimized build (faster, requires modern CPU)
curl -fsSL https://raw.githubusercontent.com/nrminor/checkle/main/INSTALL.sh | sh -s -- --simd
Manual Binary Download
Download precompiled binaries from releases:
# SIMD-optimized (recommended for modern CPUs)
wget https://github.com/nrminor/checkle/releases/latest/download/checkle-x86_64-unknown-linux-gnu-simd.tar.gz
tar -xzf checkle-x86_64-unknown-linux-gnu-simd.tar.gz
sudo mv checkle /usr/local/bin/
# Standard compatibility version
wget https://github.com/nrminor/checkle/releases/latest/download/checkle-x86_64-unknown-linux-gnu.tar.gz
Cargo Install
If you have Rust installed, you can build and install checkle using Cargo:
# From crates.io (when published)
cargo install checkle
# With cargo-binstall (if available)
cargo binstall checkle
# From source
cargo install --git https://github.com/nrminor/checkle
Verification
After installation, verify that checkle is working correctly:
checkle --version
You should see output showing the installed version of checkle.
SIMD vs Standard Builds
checkle offers two build variants:
- SIMD-optimized: Faster performance using advanced CPU instructions. Requires modern CPUs (x86_64 with SSE4.2+ or ARM64 with NEON).
- Standard: Maximum compatibility across different hardware platforms.
For best performance on modern systems, use the SIMD-optimized build. If you encounter issues or are using older hardware, use the standard build.
Quick Start
Get up and running with checkle in 5 minutes.
Basic Commands
Hash a single file
checkle hash genome.fastq.gz
Hash multiple files
checkle hash *.fastq.gz
Hash with SHA-256 instead of MD5
checkle hash --algo sha256 genome.fastq.gz
Save checksums to a file
checkle hash *.fastq.gz -o checksums.txt
Verification
Verify a single file
checkle verify genome.fastq.gz --hash d41d8cd98f00b204e9800998ecf8427e
Verify multiple files from a checksum file
checkle verify-many --checksum-file checksums.txt
Working with Directories
Hash all files in a directory recursively
checkle hash /data/sequencing_run --recursive
Hash only specific file types
checkle hash /data --recursive --include "*.fastq" --include "*.fasta"
Exclude certain patterns
checkle hash /data --recursive --exclude "*.tmp" --exclude "*.log"
Archive Support
Hash files inside a TAR archive without extracting
checkle hash data.tar.gz:sequences/sample.fastq
Hash all files in an archive
checkle hash data.tar.gz:*
Output Formats
JSON output for downstream processing
checkle hash *.bam --format json > checksums.json
CSV for spreadsheet import
checkle hash *.vcf --format csv > checksums.csv
Pretty table display
checkle hash *.fastq --pretty
Performance Tuning
Increase parallel readers for large files
checkle hash huge_genome.fasta --parallel-readers 16
Adjust chunk size for optimal performance
checkle hash *.bam --chunk-size-kb 4096
Next Steps
- See Command Line Usage for detailed command reference
- Learn about Performance optimization
- Explore Archive Support features
Command Line Usage
Complete reference for checkle's command-line interface.
Global Options
checkle [OPTIONS] <COMMAND>
Options available for all commands:
- -v, --verbose: Increase logging verbosity (use multiple times)
- -q, --quiet: Suppress non-essential output
- --version: Display version information
- --help: Show help information
Commands
hash
Generate checksums for files.
checkle hash [OPTIONS] <FILE_OR_DIR>
Options:
- --algo <ALGORITHM>: Hash algorithm (md5, sha256) [default: md5]
- -r, --recursive: Process directories recursively
- -o, --output <FILE>: Save checksums to file
- --format <FORMAT>: Output format (text, json, csv)
- --pretty: Display results in a formatted table
- --per-file: Create individual checksum files
- --include <PATTERN>: Include only matching files
- --exclude <PATTERN>: Exclude matching files
- --no-ignore: Don't respect .gitignore rules
- --parallel-readers <N>: Number of parallel readers [default: auto]
- --chunk-size-kb <SIZE>: Chunk size in KB [default: 1024]
- --absolute-paths: Display absolute paths in output
Examples:
# Hash single file
checkle hash file.txt
# Hash directory recursively
checkle hash /data --recursive
# Save SHA-256 checksums to file
checkle hash *.fastq --algo sha256 -o checksums.sha256
# Pretty table output
checkle hash *.bam --pretty
verify
Verify a file against a known hash.
checkle verify [OPTIONS] <FILE> --hash <HASH>
Options:
- --hash <HASH>: Expected hash value (required)
- --algo <ALGORITHM>: Hash algorithm [default: md5]
- --chunk-size-kb <SIZE>: Chunk size in KB
- --parallel-readers <N>: Number of parallel readers
Examples:
# Verify MD5
checkle verify genome.fasta --hash d41d8cd98f00b204e9800998ecf8427e
# Verify SHA-256
checkle verify data.tar.gz --hash e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 --algo sha256
verify-many
Verify multiple files using a checksum file.
checkle verify-many [OPTIONS] --checksum-file <FILE>
Options:
- --checksum-file <FILE>: File containing checksums (required)
- --base-dir <DIR>: Base directory for relative paths
- --algo <ALGORITHM>: Hash algorithm [default: md5]
- --fail-fast: Stop on first verification failure
- --quiet: Only show failures
- --pretty: Display results in formatted table
- --parallel-files <N>: Files to process in parallel
- --chunk-size-kb <SIZE>: Chunk size in KB
- --failed-only: Only show failed verifications
Examples:
# Verify all files in checksum list
checkle verify-many --checksum-file checksums.txt
# Stop on first failure
checkle verify-many --checksum-file checksums.txt --fail-fast
# Show only failures
checkle verify-many --checksum-file checksums.txt --failed-only
# Verify with custom base directory
checkle verify-many --checksum-file checksums.txt --base-dir /data
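Conceptually, verify-many walks a list of (expected hash, path) pairs, recomputes each digest, and reports mismatches. The loop below is a Python sketch of that behavior, not checkle's implementation: it uses a plain MD5 digest rather than checkle's Merkle hash, and reads each file whole rather than in chunks.

```python
import hashlib
import tempfile
from pathlib import Path

def verify_many(entries, base_dir=".", fail_fast=False, algo="md5"):
    """Check (expected_hash, relative_path) pairs; return the paths that failed."""
    failures = []
    for expected, rel in entries:
        digest = hashlib.new(algo, Path(base_dir, rel).read_bytes()).hexdigest()
        if digest != expected:
            failures.append(rel)
            if fail_fast:  # stop at the first mismatch, like --fail-fast
                break
    return failures

# Demo against a throwaway directory:
with tempfile.TemporaryDirectory() as d:
    Path(d, "f.txt").write_bytes(b"data")
    good = hashlib.md5(b"data").hexdigest()
    ok_failures = verify_many([(good, "f.txt")], base_dir=d)
    bad_failures = verify_many([("0" * 32, "f.txt")], base_dir=d)
```

With --fail-fast the loop stops at the first mismatch instead of collecting every failure.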
Archive Syntax
Access files within archives using colon notation:
# Hash specific file in archive
checkle hash archive.tar:path/to/file.txt
# Hash all files matching pattern
checkle hash archive.zip:*.csv
# Hash all files in archive
checkle hash archive.tar.gz:*
Supported archive formats:
- TAR (.tar, .tar.gz, .tar.bz2, .tar.xz)
- ZIP (.zip)
Pattern Matching
Use glob patterns to filter files:
# Include patterns
checkle hash /data --include "*.fastq" --include "*.fasta"
# Exclude patterns
checkle hash /data --exclude "*.tmp" --exclude ".git/**"
# Combine include and exclude
checkle hash /data --include "*.txt" --exclude "*test*"
Exit Codes
- 0: Success
- 1: General error
- 2: Verification failure
- 3: File not found
- 4: Invalid arguments
- 5: I/O error
Configuration
Configure checkle for optimal performance in your environment.
Performance Tuning
Chunk Size
The chunk size determines how much data is read at once. Larger chunks can improve performance for sequential reads.
# Default: 1MB chunks
checkle hash file.txt
# Larger chunks for fast SSDs
checkle hash file.txt --chunk-size-kb 4096
# Smaller chunks for slower storage
checkle hash file.txt --chunk-size-kb 256
Recommendations:
- Fast NVMe SSDs: 4096-8192 KB
- Standard SSDs: 1024-2048 KB (default)
- HDDs: 256-512 KB
- Network storage: 128-256 KB
Parallel Readers
Control how many threads read file data in parallel.
# Auto-detect (default)
checkle hash large_file.bin
# Explicit thread count
checkle hash large_file.bin --parallel-readers 8
# Single-threaded for debugging
checkle hash large_file.bin --parallel-readers 1
Guidelines:
- Files <64MB: Single thread (automatic)
- Files ≥64MB: Multi-threaded based on CPU cores
- Maximum useful: ~16 threads (I/O bound)
Batch Processing
When processing many files, control parallelism:
# Process 8 files simultaneously
checkle verify-many --checksum-file list.txt --parallel-files 8
# Limit batch size for memory constraints
checkle hash /data --recursive --max-files-batch 100
Algorithm Selection
Choose the right algorithm for your needs:
MD5 (Default)
- Speed: Fastest
- Security: Not cryptographically secure
- Use case: Data integrity, duplicate detection
- Compatibility: Universal support
checkle hash file.txt --algo md5
SHA-256
- Speed: Slower than MD5
- Security: Cryptographically secure
- Use case: Security-critical verification
- Compatibility: Wide support
checkle hash file.txt --algo sha256
Output Configuration
File Output
# Text format (default)
checkle hash *.txt -o checksums.txt
# JSON for programmatic use
checkle hash *.txt --format json -o checksums.json
# CSV for spreadsheets
checkle hash *.txt --format csv -o checksums.csv
Display Options
# Pretty table to stderr
checkle hash *.txt --pretty
# Absolute paths
checkle hash *.txt --absolute-paths
# Per-file checksum files
checkle hash *.txt --per-file
Filtering
Include/Exclude Patterns
# Include only specific extensions
checkle hash /data --include "*.fastq" --include "*.fasta"
# Exclude temporary files
checkle hash /data --exclude "*.tmp" --exclude "*.swp"
# Ignore .gitignore rules
checkle hash /project --no-ignore
Directory Traversal
# Recursive (process subdirectories)
checkle hash /data --recursive
# Non-recursive (default)
checkle hash /data
Environment Variables
While checkle doesn't require environment variables, you can use shell features:
# Set default algorithm
alias checkle='checkle --algo sha256'
# Set default verbosity
export CHECKLE_OPTS='-vv'
checkle hash file.txt $CHECKLE_OPTS
Memory Usage
Memory usage scales with:
- Number of parallel readers × chunk size
- Number of files processed in parallel
- Archive decompression buffers
Typical memory usage:
- Single large file: ~64MB
- Batch processing: ~256MB
- Archive processing: ~128MB per archive
To reduce memory usage:
# Smaller chunks
checkle hash large_file --chunk-size-kb 256
# Fewer parallel operations
checkle hash /data --parallel-readers 2 --max-files-batch 10
Hash Algorithms
checkle supports multiple hash algorithms for different use cases.
Available Algorithms
MD5
- Speed: ~500 MB/s per core
- Hash size: 128 bits (32 hex characters)
- Security: Broken for cryptographic use
- Best for: Fast integrity checks, duplicate detection
checkle hash file.txt --algo md5
SHA-256
- Speed: ~300 MB/s per core
- Hash size: 256 bits (64 hex characters)
- Security: Cryptographically secure
- Best for: Security-critical verification, compliance
checkle hash file.txt --algo sha256
Algorithm Comparison
| Algorithm | Speed | Security | Hash Size | Use Case |
|---|---|---|---|---|
| MD5 | Fastest | Weak | 128 bits | Data integrity |
| SHA-256 | Moderate | Strong | 256 bits | Security verification |
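The hash-size difference in the table is easy to see with Python's hashlib, which implements both algorithms:

```python
import hashlib

data = b"ACGT" * 1000  # a toy stand-in for sequence data

md5_hex = hashlib.md5(data).hexdigest()
sha_hex = hashlib.sha256(data).hexdigest()

# MD5: 128 bits -> 32 hex characters; SHA-256: 256 bits -> 64 hex characters
```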
Choosing an Algorithm
Use MD5 when:
- Speed is critical
- Processing terabyte-scale datasets
- Checking data integrity (not security)
- Compatibility with legacy systems
- Detecting accidental corruption
Use SHA-256 when:
- Security is important
- Regulatory compliance required
- Verifying downloaded files
- Long-term archival storage
- Protecting against tampering
Performance Characteristics
Merkle Tree Parallelization
checkle uses Merkle trees to parallelize hashing:
- File divided into chunks
- Each chunk hashed independently
- Hash results combined in binary tree
- Single root hash produced
This provides:
- Near-linear speedup with CPU cores
- Deterministic results
- Memory-bounded operation
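The four steps above can be sketched in Python. This is an illustrative toy, not checkle's actual tree layout (the chunk boundaries and node-combination rule here are assumptions), but it shows the shape of the computation and why a Merkle root over 1MB chunks diverges from a flat md5sum digest once a file spans more than one chunk:

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # checkle's default 1 MB chunk

def merkle_md5(data: bytes) -> str:
    """Hash each chunk independently, then fold the digests pairwise to a root."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    level = [hashlib.md5(c).digest() for c in chunks]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):  # combine adjacent pairs
            nxt.append(hashlib.md5(level[i] + level[i + 1]).digest())
        if len(level) % 2:  # odd node carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0].hex()
```

For a single-chunk input the root equals the plain digest; for multi-chunk inputs it does not, which is why Merkle-parallel hashes cannot be compared with md5sum output.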
Real-World Performance
On a modern 8-core system:
MD5:
- Single-threaded: ~500 MB/s
- Multi-threaded: ~3.5 GB/s
SHA-256:
- Single-threaded: ~300 MB/s
- Multi-threaded: ~2.1 GB/s
Implementation Details
Chunk Processing
# Default 1MB chunks
checkle hash genome.fasta
# Larger chunks for better throughput
checkle hash genome.fasta --chunk-size-kb 4096
Parallel Readers
# Auto-detect optimal threads
checkle hash large_file.bin
# Manual thread control
checkle hash large_file.bin --parallel-readers 16
Verification
Single File
# MD5 verification
checkle verify file.txt --hash d41d8cd98f00b204e9800998ecf8427e
# SHA-256 verification
checkle verify file.txt --hash e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 --algo sha256
Batch Verification
# From checksum file
checkle verify-many --checksum-file md5sums.txt --algo md5
Compatibility
Because checkle computes a Merkle root rather than a flat digest, its output matches md5sum and sha256sum only for files small enough to fit in a single chunk (1MB by default). For larger files the hashes differ and cannot be exchanged with the standard utilities:
# For files up to one chunk, checkle output matches:
md5sum small.txt
sha256sum small.txt
# For larger files, verify with checkle itself:
checkle hash *.fastq -o checksums.txt
checkle verify-many --checksum-file checksums.txt
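For small inputs the underlying digests are the standard ones. The canonical empty-input values, which also appear as placeholder hashes throughout these docs, can be reproduced with Python's hashlib:

```python
import hashlib

# Canonical digests of empty input, the same values md5sum and sha256sum
# print for /dev/null.
empty_md5 = hashlib.md5(b"").hexdigest()
empty_sha256 = hashlib.sha256(b"").hexdigest()

assert empty_md5 == "d41d8cd98f00b204e9800998ecf8427e"
assert empty_sha256 == (
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
)
```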
Performance
checkle is designed for maximum performance on modern multicore systems.
Key Performance Features
Parallel Processing
- Merkle tree-based parallelization
- Near-linear scaling with CPU cores
- Automatic thread count detection
- Memory-bounded operation
Optimizations
- SIMD acceleration (optional builds)
- Zero-copy I/O where possible
- Buffer pooling to reduce allocations
- Optimized for SSD characteristics
Benchmarks
Single File Performance
Testing with a 10GB file on 8-core system:
| Tool | Algorithm | Time | Speed |
|---|---|---|---|
| checkle | MD5 | 2.8s | 3.5 GB/s |
| md5sum | MD5 | 20s | 500 MB/s |
| checkle | SHA-256 | 4.7s | 2.1 GB/s |
| sha256sum | SHA-256 | 33s | 300 MB/s |
Batch Processing
Processing 1000 files (100MB each):
| Tool | Time | Files/sec |
|---|---|---|
| checkle | 45s | 22 |
| md5sum | 200s | 5 |
| sha256sum | 330s | 3 |
Performance Tuning
For Large Files (>1GB)
# Increase chunk size
checkle hash large_genome.fasta --chunk-size-kb 4096
# More parallel readers
checkle hash large_genome.fasta --parallel-readers 16
For Many Small Files
# Increase batch parallelism
checkle hash /data --recursive --max-files-batch 50
# Reduce per-file overhead
checkle hash /data --recursive --no-progress
For Network Storage
# Smaller chunks to reduce latency impact
checkle hash /nfs/data/file.bin --chunk-size-kb 256
# Fewer parallel readers to avoid congestion
checkle hash /nfs/data/file.bin --parallel-readers 4
Memory Usage
Memory scales with:
- chunk_size × parallel_readers per file
- Number of files in parallel batch
- Archive decompression buffers
Typical usage:
- Large file (8 threads): ~64MB
- Batch processing: ~256MB
- Archive processing: ~128MB
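As a rough back-of-envelope, the buffer bound above can be computed directly. This is a sketch of the formula only; real usage also includes hash state, allocator overhead, and archive buffers:

```python
def peak_buffer_bytes(chunk_size_kb: int, parallel_readers: int,
                      parallel_files: int = 1) -> int:
    """Upper bound on chunk-buffer memory: chunk size x readers x files in flight."""
    return chunk_size_kb * 1024 * parallel_readers * parallel_files

# Defaults (1024 KB chunks, 8 readers) give 8 MiB of chunk buffers per file.
default_bound = peak_buffer_bytes(1024, 8)
```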
CPU Utilization
checkle efficiently uses available CPU cores:
Small Files (<64MB)
- Single-threaded (overhead not worth parallelization)
- Multiple files processed in parallel
Large Files (≥64MB)
- Multi-threaded per file
- Scales to available cores
- I/O and CPU overlapped
Storage Considerations
SSD Optimization
- Default 1MB chunks align with SSD erase blocks
- Sequential reads within regions
- Minimal random I/O
HDD Optimization
# Larger sequential reads
checkle hash /hdd/file.bin --chunk-size-kb 4096
# Single reader to avoid seek overhead
checkle hash /hdd/file.bin --parallel-readers 1
SIMD Acceleration
SIMD builds provide additional speedup:
Performance Gains
- Hex encoding: 2-3x faster
- Memory operations: 20-30% faster
- Overall: 10-15% improvement
Using SIMD Build
# Install SIMD version
curl -fsSL https://raw.githubusercontent.com/nrminor/checkle/main/INSTALL.sh | sh -s -- --simd
# Verify SIMD support
checkle --version # Shows "simd" in version string
Comparison with Other Tools
vs Traditional Tools (md5sum, sha256sum)
- 5-10x faster on multicore systems
- Linear scaling with cores
- Better memory efficiency
vs Parallel Implementations
- Comparable raw performance
- Better progress reporting
- Archive support without extraction
- More output formats
Best Practices
- Let checkle auto-detect settings - Default heuristics work well
- Use SIMD builds on modern CPUs - Free 10-15% speedup
- Match chunk size to storage - Larger for SSD, smaller for HDD
- Process files in batches - Better than one at a time
- Use appropriate algorithm - MD5 for speed, SHA-256 for security
Archive Support
checkle can hash files within archives without extracting them.
Supported Formats
TAR Archives
- .tar - Uncompressed
- .tar.gz / .tgz - Gzip compressed
- .tar.bz2 - Bzip2 compressed
- .tar.xz - XZ compressed
ZIP Archives
- .zip - Various compression methods
Basic Usage
Hash Specific File in Archive
checkle hash archive.tar:path/to/file.txt
Hash All Files in Archive
checkle hash archive.tar.gz:*
Hash Files Matching Pattern
checkle hash data.zip:*.csv
checkle hash backup.tar:logs/*.log
Archive Path Syntax
Use colon (:) to separate archive from internal path:
archive_path:internal_path
Examples:
# Specific file
data.tar.gz:results/output.txt
# All files
data.tar.gz:*
# Pattern matching
data.zip:*.fastq
data.tar:experiments/*/results.csv
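A minimal Python sketch of the colon-splitting rule (checkle's real parser may treat edge cases such as Windows drive letters or colons in filenames differently):

```python
def split_archive_path(spec: str):
    """Split 'archive.tar.gz:inner/path' into (archive, inner_or_None)."""
    archive, sep, inner = spec.partition(":")
    return (archive, inner) if sep else (archive, None)
```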
Pattern Matching
Wildcards
- * - Match any characters (except /)
- ** - Match any characters (including /)
- ? - Match a single character
Examples
# All CSV files in root
checkle hash archive.zip:*.csv
# All files in subdirectory
checkle hash archive.tar:data/*
# Recursive pattern
checkle hash archive.tar.gz:**/*.txt
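The wildcard semantics above, where * stops at path separators but ** crosses them, can be modeled by translating patterns to regular expressions. A hedged sketch, not checkle's actual matcher:

```python
import re

def glob_to_regex(pat: str) -> re.Pattern:
    """Translate a glob: '**' -> any chars, '*' -> any chars except '/', '?' -> one char."""
    out = []
    i = 0
    while i < len(pat):
        if pat.startswith("**", i):
            out.append(".*")
            i += 2
        elif pat[i] == "*":
            out.append("[^/]*")
            i += 1
        elif pat[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pat[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")
```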
Performance
Streaming Processing
- Files processed without full extraction
- Memory usage bounded
- Decompression on-the-fly
Limitations
- Sequential access within archives
- Cannot parallelize individual archive entries
- Compressed archives require decompression
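The streaming idea can be illustrated with Python's tarfile module: each member is read in small blocks and hashed without anything being written to disk, so memory stays bounded regardless of archive size. Note this sketch uses a plain per-member digest, not checkle's Merkle construction:

```python
import hashlib
import io
import tarfile

def hash_tar_members(tar_bytes: bytes, algo: str = "md5"):
    """Stream each regular file in a tar archive and hash it without extracting."""
    results = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            h = hashlib.new(algo)
            stream = tar.extractfile(member)
            for block in iter(lambda: stream.read(64 * 1024), b""):
                h.update(block)  # bounded memory: one 64 KB block at a time
            results[member.name] = h.hexdigest()
    return results

# Build a tiny in-memory archive to demonstrate:
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    payload = b"@read1\nACGT\n+\nFFFF\n"
    info = tarfile.TarInfo("sample.fastq")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

digests = hash_tar_members(buf.getvalue())
```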
Examples
Genomics Data
# Hash FASTQ files in compressed archive
checkle hash sequencing_run.tar.gz:*.fastq
# Verify specific sample
checkle verify reads.tar.gz:sample_001.fastq --hash abc123
Backup Verification
# Hash all files in backup
checkle hash backup.tar.gz:* -o backup_checksums.txt
# Verify backup integrity later
checkle verify-many --checksum-file backup_checksums.txt
Data Transfer
# Before transfer - hash archive contents
checkle hash data.tar.gz:* > checksums_before.txt
# After transfer - verify integrity
checkle hash data.tar.gz:* > checksums_after.txt
diff checksums_before.txt checksums_after.txt
Archive vs Regular File
Without colon - hash the archive itself
checkle hash archive.tar.gz
# Output: abc123def456 archive.tar.gz
With colon - hash contents
checkle hash archive.tar.gz:file.txt
# Output: 789xyz012 archive.tar.gz:file.txt
Compressed Archives
Compression is handled transparently:
# All work the same way
checkle hash data.tar:file.txt # Uncompressed
checkle hash data.tar.gz:file.txt # Gzip
checkle hash data.tar.bz2:file.txt # Bzip2
checkle hash data.tar.xz:file.txt # XZ
Verification
Single File in Archive
checkle verify archive.tar:important.dat --hash d41d8cd98f00b204e9800998ecf8427e
Multiple Files
Create checksum file:
checkle hash archive.tar:* -o archive_checksums.txt
Verify later:
checkle verify-many --checksum-file archive_checksums.txt
Tips
- Use patterns to hash multiple files - More efficient than individual commands
- Save checksums for archives - Verify integrity without re-reading
- Compressed archives are slower - Decompression adds overhead
- Large archives work fine - Streaming prevents memory issues
- Archive path must exist - Archive file itself must be accessible
CLI Reference
Complete command-line interface reference for checkle.
Synopsis
checkle [OPTIONS] <COMMAND>
Global Options
| Option | Short | Description |
|---|---|---|
| --verbose | -v | Increase verbosity (use multiple times) |
| --quiet | -q | Suppress non-essential output |
| --version | | Display version |
| --help | -h | Show help |
Commands
checkle hash
Generate checksums for files.
checkle hash [OPTIONS] <FILE_OR_DIR>
Arguments
- <FILE_OR_DIR> - File, directory, or archive path to hash
Options
| Option | Default | Description |
|---|---|---|
| --algo <ALGORITHM> | md5 | Hash algorithm (md5, sha256) |
| --recursive | false | Process directories recursively |
| --output <FILE> | - | Output file for checksums |
| --format <FORMAT> | text | Output format (text, json, csv) |
| --pretty | false | Display formatted table |
| --per-file | false | Create individual checksum files |
| --include <PATTERN> | - | Include only matching files |
| --exclude <PATTERN> | - | Exclude matching files |
| --no-ignore | false | Don't respect .gitignore |
| --parallel-readers <N> | auto | Parallel reader threads |
| --chunk-size-kb <SIZE> | 1024 | Chunk size in KB |
| --max-files-batch <N> | 1000 | Maximum files per batch |
| --absolute-paths | false | Use absolute paths |
| --no-progress | false | Disable progress bars |
checkle verify
Verify file against known hash.
checkle verify [OPTIONS] <FILE> --hash <HASH>
Arguments
- <FILE> - File to verify
Options
| Option | Default | Description |
|---|---|---|
| --hash <HASH> | required | Expected hash value |
| --algo <ALGORITHM> | md5 | Hash algorithm |
| --chunk-size-kb <SIZE> | 1024 | Chunk size in KB |
| --parallel-readers <N> | auto | Parallel reader threads |
| --no-progress | false | Disable progress bar |
checkle verify-many
Verify multiple files from checksum file.
checkle verify-many [OPTIONS] --checksum-file <FILE>
Options
| Option | Default | Description |
|---|---|---|
| --checksum-file <FILE> | required | File containing checksums |
| --base-dir <DIR> | . | Base directory for paths |
| --algo <ALGORITHM> | md5 | Hash algorithm |
| --fail-fast | false | Stop on first failure |
| --quiet | false | Only show failures |
| --pretty | false | Display formatted table |
| --parallel-files <N> | 4 | Files to verify in parallel |
| --chunk-size-kb <SIZE> | 1024 | Chunk size in KB |
| --failed-only | false | Show only failures |
| --no-progress | false | Disable progress bars |
Archive Path Syntax
archive.tar:internal/path/file.txt
Special patterns:
- :* - All files in archive
- :*.ext - Files matching pattern
- :dir/* - All files in directory
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Verification failure |
| 3 | File not found |
| 4 | Invalid arguments |
| 5 | I/O error |
Environment Variables
No environment variables are used directly. Shell aliases can customize defaults:
alias checkle='checkle --algo sha256'
File Formats
Checksum File Format (text)
<hash> <filepath>
Example:
d41d8cd98f00b204e9800998ecf8427e file1.txt
e3b0c44298fc1c149afbf4c8996fb924 file2.txt
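A minimal parser for this format (assuming the hash and path are separated by whitespace, as in the example above):

```python
def parse_checksum_file(text: str):
    """Parse '<hash> <filepath>' lines into (hash, path) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        digest, _, path = line.partition(" ")
        entries.append((digest, path.strip()))
    return entries
```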
JSON Format
[
{
"hash": "d41d8cd98f00b204e9800998ecf8427e",
"filepath": "file1.txt"
}
]
CSV Format
hash,filepath
d41d8cd98f00b204e9800998ecf8427e,file1.txt
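The JSON form is straightforward to consume downstream; for example, building a path-to-hash lookup in Python:

```python
import json

# Sample of the JSON output format documented above
checksums_json = """[
  {"hash": "d41d8cd98f00b204e9800998ecf8427e", "filepath": "file1.txt"}
]"""

by_path = {entry["filepath"]: entry["hash"] for entry in json.loads(checksums_json)}
```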
Configuration Reference
Detailed configuration options for checkle.
Performance Parameters
chunk-size-kb
Controls the size of data chunks read from files.
- Type: Integer
- Default: 1024 (1MB)
- Range: 4 - 65536
- Unit: Kilobytes
Impact:
- Larger chunks: Better sequential throughput, more memory
- Smaller chunks: Less memory, better for slow storage
Recommendations:
| Storage Type | Recommended Size |
|---|---|
| NVMe SSD | 4096-8192 KB |
| SATA SSD | 1024-2048 KB |
| HDD | 256-512 KB |
| Network | 128-256 KB |
parallel-readers
Number of threads for parallel file reading.
- Type: Integer
- Default: Auto (based on CPU cores)
- Range: 1 - 64
- Auto logic: min(CPU_cores, 8)
Impact:
- More threads: Better parallelization for large files
- Fewer threads: Less memory usage, less I/O contention
Guidelines:
- Files <64MB: Always single-threaded
- Files ≥64MB: Multi-threaded based on size
- I/O bound after ~16 threads
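The documented heuristic, single-threaded below 64MB and otherwise min(CPU_cores, 8), can be written down directly. A sketch; checkle's internal logic may differ in detail:

```python
import os

def auto_parallel_readers(file_size: int) -> int:
    """One reader below the 64MB threshold, otherwise min(cores, 8)."""
    if file_size < 64 * 1024 * 1024:
        return 1
    return min(os.cpu_count() or 1, 8)
```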
max-files-batch
Maximum files to process simultaneously.
- Type: Integer
- Default: 1000
- Range: 1 - 10000
Impact:
- Larger batch: Better throughput for many small files
- Smaller batch: Less memory usage
parallel-files
Files to verify in parallel (verify-many only).
- Type: Integer
- Default: 4
- Range: 1 - 32
Impact:
- More parallel: Faster verification of many files
- Less parallel: Lower system load
Algorithm Configuration
algo
Hash algorithm selection.
- Type: Enum
- Values: md5, sha256
- Default: md5
Characteristics:
| Algorithm | Speed | Security | Size | Compatibility |
|---|---|---|---|---|
| md5 | Fastest | Weak | 128-bit | Universal |
| sha256 | Moderate | Strong | 256-bit | Wide |
Output Configuration
format
Output format for checksums.
- Type: Enum
- Values: text, json, csv
- Default: text (auto-detected from file extension)
Auto-detection:
- .json → JSON format
- .csv → CSV format
- Others → text format
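The extension-based auto-detection can be sketched as:

```python
from pathlib import Path
from typing import Optional

def detect_format(output_path: Optional[str]) -> str:
    """Pick the output format from the file extension, falling back to text."""
    if output_path is None:
        return "text"
    ext = Path(output_path).suffix.lower()
    return {".json": "json", ".csv": "csv"}.get(ext, "text")
```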
pretty
Display results in formatted table.
- Type: Boolean
- Default: false
- Output: stderr (doesn't interfere with stdout)
per-file
Create individual checksum files.
- Type: Boolean
- Default: false
- Naming: <filename>.<algo> (e.g., file.txt.md5)
absolute-paths
Use absolute paths in output.
- Type: Boolean
- Default: false (relative paths)
Filtering Configuration
include
Patterns for files to include.
- Type: String (multiple allowed)
- Syntax: Glob patterns
- Default: Include all
Examples:
--include "*.txt"
--include "data/*.csv"
--include "**/*.fastq"
exclude
Patterns for files to exclude.
- Type: String (multiple allowed)
- Syntax: Glob patterns
- Default: Exclude none
Examples:
--exclude "*.tmp"
--exclude ".git/**"
--exclude "**/test/*"
no-ignore
Don't respect .gitignore rules.
- Type: Boolean
- Default: false (respect .gitignore)
recursive
Process directories recursively.
- Type: Boolean
- Default: false
Progress Configuration
no-progress
Disable progress bars.
- Type: Boolean
- Default: false (show progress)
Auto-disabled when:
- Output is not a terminal
- Quiet mode is enabled
- Single small file
Verification Configuration
fail-fast
Stop on first verification failure.
- Type: Boolean
- Default: false (verify all)
- Applies to: verify-many
failed-only
Show only failed verifications.
- Type: Boolean
- Default: false (show all)
- Applies to: verify-many
quiet
Suppress non-error output.
- Type: Boolean
- Default: false
- Applies to: verify-many
Resource Limits
Built-in limits for stability:
| Resource | Limit | Configurable |
|---|---|---|
| Max chunk size | 64MB | Yes (chunk-size-kb) |
| Max parallel readers | 64 | Yes (parallel-readers) |
| Max files batch | 10000 | Yes (max-files-batch) |
| Max archive size | 1TB | No |
| Max path length | 4096 | No |
Default Behavior
Without options, checkle:
- Uses MD5 algorithm
- Auto-detects optimal thread count
- Uses 1MB chunks
- Shows progress bars
- Outputs to stdout in text format
- Processes single files/directories non-recursively
- Respects .gitignore rules
Development Setup
Get your development environment ready for contributing to checkle.
Prerequisites
Required
- Rust 1.70+ (latest stable recommended)
- Git
- C compiler (for dependencies)
Optional
- Just (command runner)
- cargo-watch (auto-rebuild)
- cargo-criterion (benchmarking)
Clone Repository
git clone https://github.com/nrminor/checkle.git
cd checkle
Install Rust
If you don't have Rust installed:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
Update to latest stable:
rustup update stable
rustup default stable
Install Development Tools
Just (Command Runner)
# macOS
brew install just
# Linux
curl --proto '=https' --tlsv1.2 -sSf https://just.systems/install.sh | bash -s -- --to ~/.local/bin
# Via cargo
cargo install just
Development Dependencies
# Auto-rebuild on changes
cargo install cargo-watch
# Benchmarking
cargo install cargo-criterion
# Code coverage
cargo install cargo-tarpaulin
Build and Test
Using Just
# Run all checks
just check
# Run tests
just test
# Run clippy
just clippy
# Format code
just fmt
# Run benchmarks
just bench
Using Cargo Directly
# Build debug version
cargo build
# Build release version
cargo build --release
# Run tests
cargo test
# Run with verbose output
cargo test -- --nocapture
# Run specific test
cargo test test_name
Development Workflow
1. Create Feature Branch
git checkout -b feature/my-feature
2. Make Changes
# Edit files
vim src/module.rs
# Auto-rebuild and test
cargo watch -x test
3. Run Checks
# Format code
cargo fmt
# Run clippy (MUST pass)
cargo clippy --all-targets --all-features -- -D warnings
# Run tests
cargo test
# Verify hashes
just verify-hashes
4. Test Release Build
cargo build --release
./target/release/checkle hash test.txt
Code Standards
Mandatory Checks
Every change MUST pass:
- cargo fmt - Code formatting
- cargo clippy --all-targets --all-features -- -D warnings - Zero warnings
- cargo test - All tests pass
- just verify-hashes - Hash correctness
Documentation
- All public items need rustdoc
- Include examples in doc comments
- Update relevant .md files
Testing
- Add at least 3 tests per change
- Use proptest for algorithmic code
- Integration tests for CLI changes
Project Structure
checkle/
├── src/
│ ├── main.rs # Binary entry point
│ ├── lib.rs # Library root
│ ├── cli.rs # CLI definitions
│ ├── hashing.rs # Core hashing logic
│ ├── io.rs # File I/O
│ ├── errors.rs # Error types
│ └── commands/ # Command implementations
├── tests/ # Integration tests
├── benches/ # Benchmarks
├── docs/ # mdBook documentation
└── justfile # Development commands
Common Tasks
Add New Hash Algorithm
- Add variant to the HashingAlgo enum
- Implement in the Hasher methods
- Add tests
- Update documentation
Add New Command
- Define in cli.rs
- Implement in commands/
- Add integration tests
- Update CLI documentation
Improve Performance
- Run benchmarks: just bench
- Profile with cargo flamegraph
- Make changes
- Verify with benchmarks
- Ensure tests still pass
Debugging
Enable Debug Output
RUST_LOG=debug cargo run -- hash test.txt
Run with Backtrace
RUST_BACKTRACE=1 cargo run -- hash test.txt
Use GDB/LLDB
cargo build
gdb target/debug/checkle
IDE Setup
VS Code
Install extensions:
- rust-analyzer
- CodeLLDB (debugging)
IntelliJ/CLion
Install Rust plugin
Vim/Neovim
Use rust.vim and rust-analyzer LSP
Getting Help
- Check existing issues on GitHub
- Read AGENTS.md for AI assistant guidelines
- Ask in discussions/issues
- Review test files for examples
Building from Source
Build checkle from source code for development or custom installations.
Quick Build
Standard Build
git clone https://github.com/nrminor/checkle.git
cd checkle
cargo build --release
Binary will be at: target/release/checkle
SIMD-Optimized Build
RUSTFLAGS="-C target-cpu=native" cargo build --release --features simd
Build Types
Debug Build
Fast compilation, slow runtime, debug symbols:
cargo build
# Binary at: target/debug/checkle
Release Build
Slow compilation, fast runtime, optimized:
cargo build --release
# Binary at: target/release/checkle
SIMD Build
Hardware-specific optimizations:
# For current CPU
RUSTFLAGS="-C target-cpu=native" cargo build --release --features simd
# For specific architecture
RUSTFLAGS="-C target-feature=+avx2" cargo build --release --features simd
Installation
System-Wide
# Build and install to ~/.cargo/bin
cargo install --path .
# Or copy manually
sudo cp target/release/checkle /usr/local/bin/
Local Directory
# Build and copy to specific location
cargo build --release
cp target/release/checkle ~/bin/
Cross-Compilation
Linux to Windows
# Install target
rustup target add x86_64-pc-windows-gnu
# Build
cargo build --release --target x86_64-pc-windows-gnu
Linux to macOS
# Install target
rustup target add x86_64-apple-darwin
# Build (requires macOS SDK)
cargo build --release --target x86_64-apple-darwin
Using Cross
# Install cross
cargo install cross
# Build for target
cross build --release --target aarch64-unknown-linux-gnu
Build Features
Available Features
- `simd` - Enable SIMD optimizations (requires nightly)
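The feature gate itself is declared in the crate manifest. A sketch of what the relevant `Cargo.toml` section might look like (the actual manifest may differ):

```toml
[features]
# Opt-in SIMD fast paths; requires a nightly toolchain
simd = []
```

An empty feature list like this acts purely as a compile-time flag, checked in the source via `#[cfg(feature = "simd")]`.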
Feature Combinations
# Default (no features)
cargo build --release
# With SIMD
cargo build --release --features simd
# All features
cargo build --release --all-features
Build Requirements
Minimum Requirements
- Rust 1.70+
- 2GB RAM
- 500MB disk space
Recommended
- Rust latest stable
- 4GB RAM
- 1GB disk space
- SSD for faster builds
Platform-Specific Notes
Linux
No special requirements. Works on all major distributions.
macOS
# Install Xcode command line tools if needed
xcode-select --install
Windows
Use either:
- MSVC toolchain (Visual Studio)
- GNU toolchain (MinGW)
# MSVC
rustup default stable-msvc
# GNU
rustup default stable-gnu
Optimizations
Link-Time Optimization (LTO)
Add to Cargo.toml:
[profile.release]
lto = true
Codegen Units
For maximum performance:
[profile.release]
codegen-units = 1
CPU-Specific
RUSTFLAGS="-C target-cpu=native" cargo build --release
Docker Build
FROM rust:latest as builder
WORKDIR /app
COPY . .
RUN cargo build --release
FROM debian:bookworm-slim
COPY --from=builder /app/target/release/checkle /usr/local/bin/
CMD ["checkle"]
Build:
docker build -t checkle .
Nix Build
Using flake:
nix build
Traditional:
nix-build
Verification
After building, verify the binary works:
# Check version
./target/release/checkle --version
# Run tests
cargo test
# Verify against standard tools
echo "test" > test.txt
./target/release/checkle hash test.txt
md5sum test.txt # Should match (checkle only matches md5sum for files under 1MB)
Troubleshooting
Out of Memory
# Limit parallel jobs
cargo build --release -j 2
Linker Errors
# On Linux, install development packages
sudo apt-get install build-essential
# or
sudo yum groupinstall "Development Tools"
SIMD Build Fails
# Use nightly Rust
rustup install nightly
cargo +nightly build --release --features simd
Slow Builds
# Use sccache
cargo install sccache
export RUSTC_WRAPPER=sccache
cargo build --release
Build Artifacts
Build produces:
target/
├── release/
│   ├── checkle   # Main binary
│   ├── deps/     # Dependencies
│   └── build/    # Build scripts output
└── debug/        # Debug build (if built)
Clean Build
# Remove all build artifacts
cargo clean
# Remove only release artifacts
cargo clean --release
Working with AI Agents
As AI coding agents continue to improve, generated code is increasingly part of
the software development process. Projects that plan to use AI-generated code
should therefore be structured to give agents maximum context, including goals,
style and API design guidelines, and non-negotiable development rules. The
checkle codebase is designed to do just that.
Starting with AGENTS.md and including additional context in the context/
directory, the codebase comes with many specific requirements and guidelines
that all AI agents must follow to ensure code quality and maintainability.
Strict Rule Compliance
AI agents working on checkle must follow all development rules without exception. At minimum, this includes:
- Quality Checks: Run `cargo fmt`, `cargo check`, and `cargo clippy --all-targets --all-features -- -D warnings` before declaring any work complete
- No Lint Suppressions: Never add `#[allow()]` lint suppressions without explicit permission from the project maintainer. clippy lints operate under an opt-out default principle, which means project lints are exceptionally strict and demanding unless exceptions have been explicitly approved. This keeps coding agents "on the rails" quite effectively while also providing specific, achievable feedback on how to verify that a feature is actually finished.
- Three-Test Rule: Every change must include at least 3 new, improved, or replaced tests
Frequent Context Loading
Before making any changes, AI agents must read and understand these documents:
- AGENTS.md - Complete development guidelines and project rules
- README.md - Project overview and goals
- TIGER_STYLE.md - World-class software robustness principles
- GRUG BRAIN DEVELOPER - Pragmatic simplicity principles
It is also recommended that agents review these documents after implementing a feature to ensure that the submitted code is standards-compliant and leaves the codebase better than it was found.
Essential Guidelines for AI Agents
Performance Focus
checkle is designed for bioinformatics workflows with terabyte-scale files. Always consider:
- Multicore utilization and parallel processing
- Memory efficiency for large file handling
- Merkle tree optimization opportunities
- Buffer reuse and minimal allocations
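Buffer reuse, the last point above, means allocating one read buffer up front and refilling it for every chunk instead of allocating a fresh `Vec` per read. A minimal sketch of the pattern (the function and file names are illustrative, not from the checkle source):

```rust
use std::fs::File;
use std::io::{BufReader, Read};

// Streams a file through a single preallocated 1 MiB buffer.
// A real hasher would feed &buf[..n] into its state each iteration;
// here we just count bytes to keep the sketch self-contained.
fn count_bytes(path: &str) -> std::io::Result<u64> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut buf = vec![0u8; 1 << 20]; // allocated once, reused every loop
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break;
        }
        total += n as u64;
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    std::fs::write("chunk_demo.bin", vec![7u8; 3_000_000])?;
    println!("{} bytes", count_bytes("chunk_demo.bin")?);
    Ok(())
}
```

For terabyte-scale genomics files, avoiding a per-chunk allocation keeps memory usage flat regardless of file size.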
Code Quality Standards
The project enforces extremely strict quality standards:
- Zero clippy warnings allowed
- Comprehensive test coverage required
- No `unwrap()` calls in production code
- Proper error handling with context
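The last two rules go together: instead of `unwrap()`, return an error that says which file or operation failed. A sketch using only the standard library (the project's actual error type and helpers may differ):

```rust
use std::fs;
use std::io;

// Instead of: let data = fs::read(path).unwrap();
// attach context so the caller knows which file failed.
fn read_genome(path: &str) -> Result<Vec<u8>, io::Error> {
    fs::read(path).map_err(|e| {
        io::Error::new(e.kind(), format!("failed to read {path}: {e}"))
    })
}

fn main() {
    match read_genome("missing.fastq") {
        Ok(data) => println!("read {} bytes", data.len()),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

For a 500GB batch job, "failed to read /data/run_001/sample_42.fastq.gz" is far more actionable than a bare panic.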
Balance Principles
Follow both development philosophies:
- Tiger Style: Robustness, assertions, resource limits
- Grug Brain: Simplicity, avoiding premature complexity
When these conflict, prefer solutions that are both robust AND simple.
Bioinformatics Context
Remember that checkle serves genomics researchers who:
- Work with files that can be 500GB+ each
- Need reliable integrity verification for critical data
- Require fast batch processing of many large files
- Value performance and correctness over feature richness
Critical Requirements
Documentation Reading
Working without reading the required documents (AGENTS.md, README.md, TIGER_STYLE.md, GRUG BRAIN DEVELOPER) is unacceptable and will result in code rejection.
Error Prevention
Common mistakes AI agents must avoid:
- Adding dependencies without approval
- Skipping quality checks
- Writing tests that only verify assertions
- Adding complexity without clear benefit
- Ignoring performance implications
Success Criteria
Code is considered complete only when:
- All quality checks pass without warnings
- At least 3 meaningful tests are included
- Performance implications have been considered
- Tiger Style and Grug Brain principles are balanced
- The change serves the bioinformatics use case
Getting Started
- Read all required documentation
- Understand the specific task requirements
- Consider performance and simplicity implications
- Implement with comprehensive error handling
- Add thorough tests
- Run all quality checks
- Document any architectural decisions
By following these guidelines, AI agents can contribute effectively to checkle while maintaining the high standards that make it reliable for critical bioinformatics workflows.