OneRoof Pipeline File Reference
A comprehensive guide to every file in the repository
Overview
This document provides a comprehensive reference for all files in the OneRoof bioinformatics pipeline repository. Files are organized by directory to help you quickly find what you’re looking for.
Root Directory Files
Core Pipeline Files
main.nf
- The main entry point for the Nextflow pipeline
- Orchestrates the selection and execution of platform-specific workflows (Nanopore vs Illumina)
- Handles parameter validation and workflow routing
- Essential for running the pipeline
nextflow.config
- Central configuration file for the Nextflow pipeline
- Defines default parameters, process configurations, and execution profiles
- Controls resource allocation, container settings, and platform-specific behaviors
- Must be understood for pipeline customization and optimization
Documentation and Configuration
README.md
- Primary documentation for users
- Contains installation instructions, usage examples, and quick start guides
- First point of reference for new users
CLAUDE.md
- AI assistant guidelines for code development
- Defines project structure, key commands, and development practices
- Useful for maintaining consistency in AI-assisted development
llms.txt
- LLM context file
- Contains project information for AI assistants
- Helps maintain consistent AI interactions
LICENSE
- Software license file
- Defines terms of use and distribution
- Legal requirement for open source software
pyproject.toml
- Python package configuration and dependencies
- Defines project metadata, dependencies, and tool configurations
- Essential for Python environment setup
pixi.lock
- Lock file for Pixi environment manager
- Ensures reproducible environments across different systems
- Critical for dependency management
justfile
- Task runner configuration (similar to a Makefile, but more modern, featureful, and easier to learn)
- Defines common development tasks like building Docker images and generating docs
- Speeds up the development workflow: run "just" in the same directory as the file to see what it can do
Environment and Container Files
Containerfile
- Docker/Podman container definition for the pipeline
- Defines the execution environment with all required tools
- Essential for reproducible, portable execution
flake.nix & flake.lock
- Nix package manager configuration files
- Provide a reproducible environment setup
- Useful for Nix users and HPC environments
uv.lock
- UV package manager lock file
- Extremely fast and robust Python dependency management
- Pins exact Python package versions across platforms for reproducibility
Build and Configuration Files
_quarto.yml
- Quarto documentation system configuration
- Controls documentation rendering settings
- Used for building the documentation website
refman.toml
- Project configuration file for refman, our homegrown bioinformatic reference file management solution
- Can be used to download batches of critical reference files for common pipeline use cases
nf-test.config
- Configuration for Nextflow testing framework
- Defines test settings and locations
- Important for pipeline testing and validation
workflows/ Directory
Platform-specific workflow definitions that orchestrate the entire analysis pipeline:
illumina.nf
- Complete workflow for processing Illumina paired-end sequencing data
- Handles FASTQ input, quality control, alignment, variant calling, and consensus generation
- Optimized for short-read sequencing characteristics
nanopore.nf
- Complete workflow for processing Oxford Nanopore sequencing data
- Supports pod5, BAM, and FASTQ inputs with optional basecalling
- Handles long-read specific challenges and parameters
subworkflows/ Directory
Modular workflow components that can be reused across different main workflows:
alignment.nf
- Handles read alignment to reference genomes
- Integrates minimap2 with platform-specific parameters
- Produces sorted, indexed BAM files for downstream analysis
consensus_calling.nf
- Generates consensus sequences from aligned reads
- Implements platform-specific frequency thresholds
- Critical for producing final genomic sequences
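To make the idea of a frequency threshold concrete, the following is a minimal sketch of calling one consensus position: the majority base is kept only when its share of the reads meets a cutoff, and N is emitted otherwise. The function name, the 0.5 frequency default, and the minimum depth of 10 are illustrative assumptions, not the pipeline's actual settings or implementation.

```python
from collections import Counter

def consensus_base(bases, min_freq=0.5, min_depth=10):
    """Return the consensus base for one position, or 'N' if unsupported.

    bases: iterable of single-character base calls seen at this position.
    min_freq and min_depth are hypothetical defaults chosen for illustration.
    """
    counts = Counter(b.upper() for b in bases)
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # too few reads to call anything
    base, count = counts.most_common(1)[0]
    # Keep the majority base only if it reaches the frequency threshold.
    return base if count / depth >= min_freq else "N"

pileup = ["A"] * 8 + ["G"] * 4
print(consensus_base(pileup))                # A is supported by 8 of 12 reads
print(consensus_base(pileup, min_freq=0.8))  # stricter threshold masks the site
```

A platform-specific threshold would then amount to passing a different min_freq for Nanopore and Illumina runs.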
gather_illumina.nf
- Collects and validates Illumina FASTQ files
- Handles paired-end read organization
- Prepares data for processing pipeline
gather_nanopore.nf
- Collects Nanopore data from various formats (pod5, BAM, FASTQ)
- Handles barcode demultiplexing
- Manages basecalling workflow integration
haplotyping.nf
- Performs viral haplotype reconstruction
- Uses Devider tool for identifying viral quasispecies
- Important for studying viral diversity
illumina_correction.nf
- Applies error correction specific to Illumina data
- May include adapter trimming and quality filtering
- Improves downstream analysis accuracy
metagenomics.nf
- Performs metagenomic profiling using Sylph
- Identifies organisms present in samples
- Useful for contamination detection and co-infections
phylo.nf
- Phylogenetic analysis using Nextclade
- Assigns sequences to clades and identifies mutations
- Essential for epidemiological tracking
primer_handling.nf
- Manages primer validation, trimming, and analysis
- Ensures complete amplicon coverage
- Critical for amplicon sequencing workflows
quality_control.nf
- Comprehensive quality control workflow
- Integrates FastQC, MultiQC, and custom metrics
- Produces quality reports for decision making
slack_alert.nf
- Sends notifications to Slack channels
- Reports pipeline completion status
- Useful for monitoring long-running analyses
variant_calling.nf
- Identifies genetic variants from aligned reads
- Uses ivar for amplicon data, bcftools for general data
- Produces VCF files for downstream analysis
modules/ Directory
Individual process definitions for specific bioinformatics tools:
Basecalling and Preprocessing
dorado.nf
- Oxford Nanopore basecaller integration
- Converts pod5 files to FASTQ with quality scores
- Requires GPU for optimal performance
chopper.nf
- Quality filtering for long reads
- Removes low-quality Nanopore sequences
- Improves downstream analysis quality
fastp.nf
- Fast preprocessing for Illumina reads
- Performs quality filtering and adapter trimming
- Generates QC reports
cutadapt.nf
- Adapter and primer trimming tool
- Removes sequencing artifacts
- Essential for accurate variant calling
Alignment and Coverage
minimap2.nf
- Versatile sequence aligner
- Handles both short and long reads
- Primary alignment tool in the pipeline
samtools.nf
- SAM/BAM file manipulation
- Sorting, indexing, and filtering alignments
- Essential for BAM file processing
mosdepth.nf
- Fast coverage depth calculation
- Generates coverage statistics and plots
- Important for quality assessment
cramino.nf
- CRAM/BAM file statistics
- Provides quick alignment metrics
- Useful for QC checks
Variant Calling and Consensus
ivar.nf
- Variant calling and consensus for amplicon data
- Handles primer trimming and frequency-based calling
- Primary tool for viral genomics
bcftools.nf
- General-purpose variant calling and manipulation
- VCF file processing and filtering
- Complementary to ivar for specific tasks
snpeff.nf
- Variant annotation tool
- Predicts functional effects of variants
- Important for biological interpretation
Quality Control and Reporting
fastqc.nf
- Sequence quality control
- Generates detailed quality metrics
- Standard tool for NGS QC
multiqc.nf
- Aggregates QC reports from multiple tools
- Creates unified quality report
- Essential for multi-sample projects
plot_coverage.nf
- Custom coverage visualization
- Creates coverage plots per amplicon
- Helps identify coverage gaps
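Finding coverage gaps can be sketched as a scan over per-base depths that reports every run of positions below a cutoff. This is an illustration only: the function name and the minimum depth of 20 are assumptions, and the module's real plotting logic may work differently.

```python
def coverage_gaps(depths, min_depth=20):
    """Return 0-based, half-open (start, end) intervals where depth < min_depth.

    depths: list of per-base read depths along the reference.
    min_depth is a hypothetical cutoff used for illustration.
    """
    gaps, start = [], None
    for pos, depth in enumerate(depths):
        if depth < min_depth and start is None:
            start = pos                 # a low-coverage run begins here
        elif depth >= min_depth and start is not None:
            gaps.append((start, pos))   # the run ended at the previous base
            start = None
    if start is not None:               # run extends to the end of the reference
        gaps.append((start, len(depths)))
    return gaps

print(coverage_gaps([30, 30, 5, 0, 30, 10]))
```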
reporting.nf
- Generates analysis reports
- Compiles results into readable formats
- User-facing output generation
Specialized Tools
nextclade.nf
- Viral clade assignment and phylogenetics
- Identifies mutations and QC issues
- Essential for SARS-CoV-2 and influenza analysis
sylph.nf
- Metagenomic profiling
- Fast organism identification
- Useful for contamination detection
devider.nf
- Viral haplotype reconstruction
- Identifies quasispecies in samples
- Important for studying viral diversity
amplicon-tk.nf
- Amplicon analysis toolkit
- Will provide amplicon-specific utilities
- Supports targeted sequencing workflows
- May be used for contamination detection
Utility Modules
bedtools.nf
- BED file manipulation
- Genomic interval operations
- Used for primer and region handling
seqkit.nf
- Sequence manipulation toolkit
- FASTA/FASTQ processing utilities
- General sequence handling
rasusa.nf
- Read subsampling tool
- Reduces coverage to specified depth
- Helps manage computational resources
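The idea behind subsampling to a target depth can be sketched as follows: convert the depth into a base budget (depth times genome size), then keep randomly chosen reads until the budget is met. This illustrates the concept only; it is not rasusa's implementation, and the function name and arguments are hypothetical.

```python
import random

def subsample_to_depth(read_lengths, genome_size, target_depth, seed=0):
    """Return sorted indices of randomly chosen reads that reach the base budget.

    The budget is target_depth * genome_size bases. If the input has fewer
    bases than the budget, every read is kept.
    """
    target_bases = target_depth * genome_size
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # seeded so the selection is reproducible
    kept, total = [], 0
    for i in order:
        if total >= target_bases:
            break
        kept.append(i)
        total += read_lengths[i]
    return sorted(kept)

# Fifty 100-base reads on a 1,000-base genome, reduced to roughly 2x depth.
print(len(subsample_to_depth([100] * 50, genome_size=1000, target_depth=2)))
```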
vsearch.nf
- Sequence clustering and searching
- Supports sequence similarity analyses
duckdb.nf
- SQL database for data analysis
- Enables complex data queries
- Currently not implemented in the pipeline
grepq.nf
- Pattern matching in sequences
- Quick sequence searching
- Utility for sequence filtering
- Currently not implemented in the pipeline
bbmap.nf
- BBMap tool suite integration
- Various sequence processing utilities
- Alternative/complementary to other tools
deacon.nf
- Customizable decontamination module
- Currently not implemented in the pipeline
Pipeline-Specific Modules
validate.nf
- Input validation module
- Checks file formats and parameters
- Ensures pipeline requirements are met
primer_patterns.nf
- Generates primer search patterns
- Supports primer identification in reads
- Important for primer trimming
split_primer_combos.nf
- Splits primers by combinations
- Handles complex primer schemes
- Supports multiplexed amplicons
resplice_primers.nf
- Re-splices primer sequences
- May handle primer artifacts
- Specialized primer processing
write_primer_fasta.nf
- Outputs primers in FASTA format
- Utility for primer sequence export
- Supports downstream analyses
output_primer_tsv.nf
- Exports primer information as TSV
- Creates tabular primer summaries
- Useful for documentation
concat_consensus.nf
- Concatenates consensus sequences
- Combines multi-segment genomes
- Important for segmented viruses
file_watcher.nf
- Monitors directories for new files
- Enables real-time processing
- Supports continuous sequencing runs
call_slack_alert.nf
- Sends Slack notifications
- Reports pipeline events
- Part of monitoring system
bin/ Directory
Python scripts and utilities for data processing:
Core Analysis Scripts
ivar_variants_to_vcf.py
- Converts ivar variant output to standard VCF format
- Fixes known issues with ivar’s VCF generation
- Essential for variant calling pipeline
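One part of such a conversion is allele notation. ivar's variants TSV writes an insertion as +SEQ and a deletion as -SEQ in the ALT column, while VCF expects both alleles spelled out and anchored on the reference base. The sketch below converts one row under that assumption; it relies on the REGION, POS, REF, ALT, TOTAL_DP, ALT_FREQ, and PASS columns of ivar's TSV output, and the FAIL filter label is a hypothetical choice. It is not the repository's script.

```python
def ivar_row_to_vcf(row):
    """Convert one ivar variants TSV row (as a dict of strings) to a VCF data line."""
    chrom, pos = row["REGION"], int(row["POS"])
    ref, alt = row["REF"], row["ALT"]
    if alt.startswith("+"):
        # Insertion: the inserted bases follow the reference base.
        alt = ref + alt[1:]
    elif alt.startswith("-"):
        # Deletion: the deleted bases follow the reference base.
        ref, alt = ref + alt[1:], ref
    # "FAIL" is an illustrative filter label, not one defined by ivar.
    filt = "PASS" if row["PASS"] == "TRUE" else "FAIL"
    info = f"DP={row['TOTAL_DP']};AF={row['ALT_FREQ']}"
    # Column order: CHROM POS ID REF ALT QUAL FILTER INFO
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", filt, info])

row = {"REGION": "ref1", "POS": "100", "REF": "A", "ALT": "-CG",
       "TOTAL_DP": "50", "ALT_FREQ": "0.9", "PASS": "TRUE"}
print(ivar_row_to_vcf(row))
```

A complete converter would also write the VCF header lines that declare the DP and AF INFO fields.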
plot_coverage.py
- Generates coverage plots from alignment data
- Creates visual representation of sequencing depth
- Helps identify problematic regions
concat_consensus.py
- Concatenates consensus sequences from multiple segments
- Handles multi-segment viruses like influenza
- Produces complete genome sequences
generate_variant_pivot.py
- Creates pivot tables of variants across samples
- Useful for comparing mutations between samples
- Supports epidemiological analyses
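A variant pivot table can be pictured as one row per variant and one column per sample, with allele frequencies in the cells. The sketch below builds that layout from (sample, variant, frequency) tuples using only the standard library. The record shape, the function name, and the 0.0 fill value for absent variants are assumptions made for illustration.

```python
def variant_pivot(records, missing=0.0):
    """Pivot (sample, variant, freq) tuples into a header and table rows.

    Returns (header, rows), where header is ["variant", sample1, ...] and
    each row is [variant, freq_in_sample1, ...]. Variants a sample lacks
    are filled with `missing`.
    """
    samples = sorted({s for s, _, _ in records})
    variants = sorted({v for _, v, _ in records})
    freqs = {(s, v): f for s, v, f in records}
    header = ["variant"] + samples
    rows = [[v] + [freqs.get((s, v), missing) for s in samples] for v in variants]
    return header, rows

records = [("s1", "A100G", 0.9), ("s2", "A100G", 0.4), ("s1", "C200T", 0.1)]
header, rows = variant_pivot(records)
print(header)
for row in rows:
    print(row)
```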
Primer Management Scripts
validate_primer_bed.py
- Validates primer BED file format and content
- Checks for primer pair completeness
- Prevents primer-related pipeline failures
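The kind of check described here can be sketched as a pass over the BED lines that verifies the column count and coordinates, then confirms every amplicon has both a forward and a reverse primer. The _LEFT and _RIGHT name suffixes are an assumption borrowed from common amplicon schemes; the repository's script may use different conventions and perform additional checks.

```python
def check_primer_bed(lines, fwd="_LEFT", rev="_RIGHT"):
    """Return a list of problems found in primer BED lines (empty if none).

    fwd and rev are assumed primer-name suffixes, used here for illustration.
    """
    problems, amplicons = [], {}
    for n, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            problems.append(f"line {n}: expected at least 6 columns")
            continue
        start, end, name = fields[1], fields[2], fields[3]
        if not (start.isdigit() and end.isdigit() and int(start) < int(end)):
            problems.append(f"line {n}: invalid coordinates")
        # Record which primer directions have been seen for each amplicon.
        for suffix in (fwd, rev):
            if suffix in name:
                amplicons.setdefault(name.split(suffix)[0], set()).add(suffix)
    for amp, seen in sorted(amplicons.items()):
        if seen != {fwd, rev}:
            problems.append(f"amplicon {amp}: missing primer partner")
    return problems

bed = [
    "ref\t10\t30\tamp1_LEFT\t60\t+",
    "ref\t200\t220\tamp1_RIGHT\t60\t-",
    "ref\t300\t320\tamp2_LEFT\t60\t+",
]
print(check_primer_bed(bed))
```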
make_primer_patterns.py
- Generates regex patterns for primer detection
- Handles primer orientation and mismatches
- Supports primer trimming accuracy
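One way to turn a primer into a search pattern is to expand IUPAC ambiguity codes into regex character classes and to reverse-complement primers that bind the opposite strand. The sketch below shows only that much; it does not implement mismatch tolerance, and the function name is hypothetical rather than taken from the script.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def primer_regex(primer, reverse=False):
    """Compile a regex matching a primer, optionally as its reverse complement."""
    seq = primer.upper()
    if reverse:
        seq = seq.translate(COMPLEMENT)[::-1]
    return re.compile("".join(IUPAC[base] for base in seq))

pattern = primer_regex("ACRT")
print(pattern.pattern)
print(bool(pattern.search("GGACGTGG")))
```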
split_primer_combos.py
- Separates primers by pool/combination
- Handles multiplexed primer schemes
- Important for complex protocols
resplice_primers.py
- Python implementation of primer resplicing
- Handles primer artifacts in sequences
- Complements Rust version
resplice_primers.rs
- Rust implementation for performance
- Fast primer sequence processing
- Used in high-throughput scenarios
Monitoring and Utilities
file_watcher.py
- Monitors directories for new sequencing files
- Triggers pipeline execution automatically
- Enables real-time analysis
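Directory monitoring of this kind can be approximated by polling: list the directory, compare against the files already seen, and hand anything new to the pipeline. The sketch below shows only the detection step. The file suffixes and the function name are assumptions, and the actual script may rely on a different mechanism such as filesystem events.

```python
from pathlib import Path

def new_files(directory, seen, suffixes=(".fastq.gz", ".bam", ".pod5")):
    """Return matching files in `directory` that are not yet in `seen`.

    `seen` is a set of paths that is updated in place, so calling this
    function repeatedly reports each file only once.
    """
    found = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(suffixes) and p not in seen
    )
    seen.update(found)
    return found
```

A watcher loop would call new_files at a fixed interval and launch a pipeline run whenever the returned list is non-empty.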
slack_alerts.py
- Sends notifications to Slack
- Reports pipeline status and errors
- Integrated with monitoring workflow
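A status notification can be sketched with the standard library alone, assuming a Slack incoming webhook, which accepts a JSON body containing a text field. The message wording, function names, and webhook URL handling below are illustrative assumptions; the repository's script may use a different Slack API or client library.

```python
import json
import urllib.request

def build_alert(run_name, succeeded, detail=""):
    """Build the JSON payload for a run-status message."""
    status = "completed successfully" if succeeded else "FAILED"
    text = f"OneRoof run {run_name} {status}"
    if detail:
        text += f": {detail}"
    return {"text": text}

def send_alert(webhook_url, payload):
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,  # supplied by the caller, for example from an environment variable
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status

print(build_alert("run_42", succeeded=False, detail="alignment step exited with an error"))
```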
multisample_plot.py
- Creates plots comparing multiple samples
- Visualizes cross-sample metrics
- Useful for batch analysis
Package Files
__init__.py
- Python package initialization
- Makes bin/ directory a Python module
- Enables script imports
__main__.py
- Package entry point
- Allows running as python -m bin
- May provide a CLI
Test Files
test_*.py files
- Unit tests for corresponding scripts
- Ensures script functionality
- Part of quality assurance
conf/ Directory
Configuration files for various pipeline components:
nanopore.config
- Nanopore-specific pipeline settings
- Defines basecalling models and parameters
- Optimizes for long-read characteristics
illumina.config
- Illumina-specific pipeline settings
- Short-read optimized parameters
- Handles paired-end specific options
snpeff.config
- SnpEff variant annotation settings
- Defines reference databases
- Controls annotation behavior
file_watcher.template.yml
- Template for file watcher configuration
- Defines monitoring parameters
- Customizable for different setups
lib/ Directory
Groovy libraries for Nextflow:
Utils.groovy
- Utility functions for Nextflow workflows
- Common functionality across workflows
- Reduces code duplication
docs/ Directory
Project documentation sources:
Core Documentation
index.qmd
- Main documentation page source
- Renders to HTML/PDF documentation
- User-facing pipeline guide
developer.qmd & developer.md
- Developer documentation
- Technical details for contributors
- Code structure and patterns
pipeline_architecture.qmd & pipeline_architecture.md
- Detailed pipeline design documentation
- Architectural decisions and flow
- Technical reference
data_management.qmd & data_management.md
- Data handling guidelines
- Storage and organization practices
- Best practices documentation
whats-that-file.qmd & whats-that-file.md
- This file: a comprehensive file reference
- Documents every file in the repository
- Helps developers understand project structure
Post-Render Scripts
fix-index-paths.lua
- Lua script for Quarto post-render processing
- Fixes relative paths in generated HTML
- Copies markdown files back to docs directory
- Ensures proper website navigation structure
fix-index-paths.py
- Python alternative to the Lua script
- Same functionality as fix-index-paths.lua
- Provides fallback option if Lua is unavailable
- Uses standard library for cross-platform compatibility
Generated Files
site_libs/
- Quarto-generated web assets
- JavaScript, CSS, and fonts
- Bootstrap and Quarto theme files
- Supports interactive documentation
_site/
- Rendered documentation website
- HTML output from Quarto
- Ready for deployment to GitHub Pages
globus/ Directory
Globus integration for data transfer:
README.md
- Globus setup instructions
- Configuration guidelines
- Integration documentation
action_provider/
- Globus action provider implementation
- Enables automated workflows
- Cloud integration support
config/
- Globus configuration files
- Service settings
- Authentication setup
flows/
- Globus flow definitions
- Automated data workflows
- Pipeline integration
scripts/
- Deployment and testing scripts
- Globus service management
- Operational utilities
tests/ Directory
Test files and data:
README.md
- Test documentation
- Instructions for running tests
- Test data descriptions
data/
- Test datasets
- Example files for each data type
- Validation datasets
modules/, subworkflows/, workflows/
- Nextflow test definitions
- Unit and integration tests
- Pipeline validation
GitHub Workflows (.github/)
workflows/test.yml
- CI/CD test workflow
- Automated testing on commits
- Quality assurance
workflows/docker-image.yaml
- Docker image building workflow
- Automated container updates
- Deployment automation
Summary
The OneRoof pipeline repository is organized into logical directories that separate:
- Core pipeline logic (workflows/, subworkflows/, modules/)
- Utility scripts (bin/)
- Configuration (conf/, *.config)
- Documentation (docs/, *.md)
- Test infrastructure (tests/)
- Reference data (assets/)
- External integrations (globus/)
This structure promotes modularity, reusability, and maintainability while supporting both Nanopore and Illumina sequencing platforms for viral genomics applications.