OneRoof Pipeline Architecture
This document provides a comprehensive map of the OneRoof Nextflow pipeline structure, including workflow dependencies, data flow, and critical testing points.
1 Main Workflow Entry Point
1.1 main.nf
- Purpose: Central orchestrator that routes to platform-specific workflows
- Key Functions:
  - Platform detection (Nanopore vs Illumina) based on input parameters
  - Input channel initialization for all required files
  - Workflow selection and invocation
  - Email notification on completion
Input Channels:
- ch_primer_bed: Optional primer BED file
- ch_refseq: Required reference FASTA
- ch_ref_gbk: Optional GenBank file for annotation
- ch_contam_fasta: Optional contamination sequences
- ch_metagenomics_ref: Optional metagenomics reference
- ch_snpeff_config: Optional SnpEff configuration
- ch_primer_tsv: Optional primer TSV file
- ch_sylph_tax_db: Optional Sylph taxonomy database
2 Platform-Specific Workflows
2.1 NANOPORE Workflow (workflows/nanopore.nf)
Workflow DAG:
graph TD
    A[GATHER_NANOPORE] --> B[PRIMER_HANDLING]
    B --> C[ALIGNMENT]
    C --> D[CONSENSUS]
    C --> E[VARIANTS]
    C --> F[HAPLOTYPING]
    C --> G[METAGENOMICS]
    D --> H[PHYLO]
    E --> I[SLACK_ALERT]
    style B fill:#f9f,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
Dashed boxes indicate optional workflow components.
Key Parameters:
platform = "ont"
min_variant_frequency = 0.2
min_qual = 10
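For test planning, it can help to encode the Nanopore DAG as a plain adjacency map and ask which steps sit downstream of a given sub-workflow. The following is a minimal Python sketch rather than part of the pipeline: the edges are transcribed from the Nanopore DAG in this document, and `downstream` is a hypothetical helper name.

```python
from collections import deque

# Edges transcribed from the Nanopore workflow DAG in this document.
NANOPORE_DAG = {
    "GATHER_NANOPORE": ["PRIMER_HANDLING"],
    "PRIMER_HANDLING": ["ALIGNMENT"],
    "ALIGNMENT": ["CONSENSUS", "VARIANTS", "HAPLOTYPING", "METAGENOMICS"],
    "CONSENSUS": ["PHYLO"],
    "VARIANTS": ["SLACK_ALERT"],
}


def downstream(dag, start):
    """Return every step reachable from `start` (hypothetical helper)."""
    seen = set()
    queue = deque(dag.get(start, []))
    while queue:
        step = queue.popleft()
        if step not in seen:
            seen.add(step)
            queue.extend(dag.get(step, []))
    return seen


# Steps whose outputs may change when ALIGNMENT changes.
print(sorted(downstream(NANOPORE_DAG, "ALIGNMENT")))
```

A change to a sub-workflow then suggests re-running tests for everything the helper returns.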
2.2 ILLUMINA Workflow (workflows/illumina.nf)
Workflow DAG:
graph TD
    A[GATHER_ILLUMINA] --> B[ILLUMINA_CORRECTION]
    B --> C[PRIMER_HANDLING]
    C --> D[ALIGNMENT]
    D --> E[CONSENSUS]
    D --> F[VARIANTS]
    D --> G[PHYLO]
    D --> H[METAGENOMICS]
    E --> I[SLACK_ALERT]
    F --> I
    style C fill:#f9f,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
Key Parameters:
platform = "illumina"
min_variant_frequency = 0.05
min_qual = 20
3 Sub-workflows and Dependencies
3.1 Data Gathering Workflows
3.1.1 GATHER_NANOPORE (subworkflows/gather_nanopore.nf)
Purpose: Handle multiple Nanopore input formats
Input Options:
- Remote POD5 monitoring (remote_pod5_location)
- Local POD5 directory (pod5_dir)
- Pre-called staging directory (precalled_staging)
- Pre-processed data directory (prepped_data)
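Choosing among these input options can be sketched as a simple first-match lookup. This is an illustrative Python sketch only: the parameter names come from the list above, but the precedence order and the function name `select_nanopore_input` are assumptions, not the pipeline's documented behavior.

```python
# Parameter names come from the input options listed above. The precedence
# order is an assumption made for illustration.
NANOPORE_INPUT_PARAMS = (
    "remote_pod5_location",
    "pod5_dir",
    "precalled_staging",
    "prepped_data",
)


def select_nanopore_input(params):
    """Return (name, value) of the first input option that is set."""
    for name in NANOPORE_INPUT_PARAMS:
        value = params.get(name)
        if value:  # treat None and "" as unset
            return name, value
    raise ValueError("no Nanopore input option was provided")


print(select_nanopore_input({"pod5_dir": "runs/pod5", "prepped_data": ""}))
```

A test suite could use a helper like this to confirm that each input mode is exercised at least once.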
Process Flow:
graph LR
    A[POD5 Input] --> B[DOWNLOAD_MODELS]
    B --> C[BASECALL]
    C --> D[MERGE_BAMS]
    D --> E[DEMULTIPLEX]
    F[Pre-called Input] --> G[VALIDATE_NANOPORE]
    E --> G
    G --> H[FILTER_WITH_CHOPPER]
    H --> I[COMPRESS_TO_SORTED_FASTA]
    I --> J[FAIDX]
    J --> K[EARLY_RASUSA_READ_DOWNSAMPLING]
3.1.2 GATHER_ILLUMINA (subworkflows/gather_illumina.nf)
Purpose: Process paired-end Illumina FASTQ files
Process Flow:
graph LR
    A[Paired FASTQs] --> B[VALIDATE_ILLUMINA]
    B --> C[MERGE_READ_PAIRS]
3.2 Processing Workflows
3.2.1 ILLUMINA_CORRECTION (subworkflows/illumina_correction.nf)
Purpose: Quality control and decontamination for Illumina reads
Process Flow:
graph TD
    A[CORRECT_WITH_FASTP] --> B[DECONTAMINATE]
    B --> C[FASTQC]
    C --> D[MULTIQC]
    B --> E[COMPRESS_TO_SORTED_FASTA]
    E --> F[FAIDX]
    F --> G[EARLY_RASUSA_READ_DOWNSAMPLING]
    style B fill:#f9f,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
3.2.2 PRIMER_HANDLING (subworkflows/primer_handling.nf)
Purpose: Validate primers and extract complete amplicons
Input Options:
- Primer BED file
- Primer TSV file
Process Flow:
graph TD
    A[ORIENT_READS] --> B[GET_PRIMER_PATTERNS]
    B --> C[FIND_COMPLETE_AMPLICONS]
    B --> D[TRIM_ENDS_TO_PRIMERS]
    D --> E[PER_AMPLICON_FILTERS]
    E --> F[MERGE_BY_SAMPLE]
3.2.3 ALIGNMENT (subworkflows/alignment.nf)
Purpose: Map reads to reference and generate coverage statistics
Process Flow:
graph TD
    A[ALIGN_WITH_PRESET] --> B[CONVERT_AND_SORT]
    B --> C[RASUSA_ALN_DOWNSAMPLING]
    C --> D[SORT_BAM]
    D --> E[INDEX]
    E --> F[MOSDEPTH]
    F --> G[PLOT_COVERAGE]
    G --> H[COVERAGE_SUMMARY]
3.2.4 VARIANTS (subworkflows/variant_calling.nf)
Purpose: Call and annotate variants
Process Flow:
graph TD
    A[CALL_VARIANTS] --> B[CONVERT_TO_VCF]
    B --> C[ANNOTATE_VCF]
    C --> D[EXTRACT_FIELDS]
    D --> E[MERGE_VCF_FILES]
3.2.5 CONSENSUS (subworkflows/consensus_calling.nf)
Purpose: Generate consensus sequences
Process Flow:
graph LR
    A[CALL_CONSENSUS] --> B[CONCAT]
3.3 Optional Feature Workflows
3.3.1 PHYLO (subworkflows/phylo.nf)
Purpose: Phylogenetic analysis using Nextclade
Process Flow:
graph LR
    A[CHECK_DATASET] --> B[DOWNLOAD_DATASET]
    B --> C[RUN_NEXTCLADE]
3.3.2 METAGENOMICS (subworkflows/metagenomics.nf)
Purpose: Metagenomic classification using Sylph
Process Flow:
graph TD
    A[SKETCH_DATABASE_KMERS] --> C[CLASSIFY_SAMPLE]
    B[SKETCH_SAMPLE_KMERS] --> C
    C --> D[OVERLAY_TAXONOMY]
    D --> E[MERGE_TAXONOMY]
    style D fill:#f9f,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
    style E fill:#f9f,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
3.3.3 HAPLOTYPING (subworkflows/haplotyping.nf)
Purpose: Viral haplotype reconstruction (Nanopore only)
Condition: Number of reference sequences equals number of amplicons
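The gating condition can be expressed as a small predicate. The sketch below is a Python illustration under stated assumptions: it counts reference sequences by counting FASTA header lines, takes the amplicon count as an already-known number, and uses hypothetical function names. The `"ont"` platform value is the one listed under the Nanopore key parameters.

```python
def count_fasta_records(fasta_text):
    """Count sequences by counting FASTA header lines (those starting with '>')."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def should_run_haplotyping(platform, n_refs, n_amplicons):
    """Mirror the stated condition: Nanopore only, one reference per amplicon."""
    return platform == "ont" and n_refs == n_amplicons


# Hypothetical two-record reference used only for this example.
reference = ">amplicon_1\nACGT\n>amplicon_2\nGGCC\n"
n_refs = count_fasta_records(reference)
print(should_run_haplotyping("ont", n_refs, n_amplicons=2))
```

How the pipeline itself counts amplicons is not described here, so that count is passed in rather than derived.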
4 Key Modules/Processes
4.1 Critical Processes for Testing
- DOWNLOAD_MODELS: Model caching
- BASECALL: GPU-based basecalling
- DEMULTIPLEX: Barcode demultiplexing
- ALIGN_WITH_PRESET: Platform-specific alignment
- CALL_VARIANTS: Variant detection
- CALL_CONSENSUS: Consensus generation
- CONVERT_TO_VCF: Format conversion
- CONVERT_AND_SORT: BAM processing
- INDEX: BAM indexing
- VALIDATE_NANOPORE: Input validation
- VALIDATE_ILLUMINA: Paired-end validation
- VALIDATE_PRIMER_BED: Primer validation
5 Critical Testing Paths
5.1 Minimal Test Path (No Primers)
- Nanopore: POD5/FASTQ → Basecall → Align → Consensus/Variants
- Illumina: Paired FASTQs → Merge → Align → Consensus/Variants
5.2 Full Test Path (With Primers)
- Input validation
- Primer handling and amplicon extraction
- Alignment and coverage analysis
- Variant calling and annotation
- Consensus generation
- Optional: Phylogenetics, metagenomics, haplotyping
5.3 Key Input Requirements
- Reference FASTA (--refseq)
- Sequencing data:
  - Nanopore: POD5 files + kit name OR pre-called BAM/FASTQ
  - Illumina: Paired-end FASTQ directory
- Primer BED file (--primer_bed) or TSV (--primer_tsv)
- Reference GenBank (--ref_gbk) for annotation
- SnpEff config for variant annotation
- Contamination FASTA for decontamination
- Metagenomics database for classification
6 Output Structure
6.1 Nanopore Output Tree
nanopore/
├── 01_basecalled_demuxed/
├── 02_primer_handling/
├── 03_alignments/
├── 04_consensus_seqs/
├── 05_variants/
├── 06_QC/
├── 07_phylo/
├── metagenomics/
└── haplotyping/
6.2 Illumina Output Tree
illumina/
├── 01_merged_reads/
├── 02_primer_handling/
├── 03_alignments/
├── 04_consensus_seqs/
├── 05_variants/
├── 06_QC/
├── 07_phylo/
└── metagenomics/
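An integration test might check a results directory against these trees. The following Python sketch assumes the platform directory sits directly under a results root and uses a hypothetical helper name, `missing_output_dirs`. The directory names are copied from the trees above.

```python
from pathlib import Path
import tempfile

# Directory names copied from the output trees above.
COMMON_DIRS = [
    "02_primer_handling", "03_alignments", "04_consensus_seqs",
    "05_variants", "06_QC", "07_phylo", "metagenomics",
]
EXPECTED_DIRS = {
    "nanopore": ["01_basecalled_demuxed", *COMMON_DIRS, "haplotyping"],
    "illumina": ["01_merged_reads", *COMMON_DIRS],
}


def missing_output_dirs(results_root, platform):
    """List expected subdirectories that are absent under results_root/platform."""
    base = Path(results_root) / platform
    return [name for name in EXPECTED_DIRS[platform] if not (base / name).is_dir()]


# Demonstration against a throwaway directory with a single output folder.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "illumina" / "03_alignments").mkdir(parents=True)
    print(missing_output_dirs(root, "illumina"))
```

Directories for optional workflows (phylogenetics, metagenomics, haplotyping) may legitimately be absent when those features are not enabled, so a real test would likely filter the expected list by the options in use.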
7 Configuration and Parameters
7.1 Platform-Specific Defaults
Parameter | Nanopore | Illumina |
---|---|---|
min_variant_frequency | 0.2 | 0.05 |
min_qual | 10 | 20 |
Alignment preset | map-ont | sr |
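When testing that platform-specific parameters are applied, these defaults can be captured as a lookup with user overrides layered on top. This is a hedged Python sketch: the values are copied from the table above, while the dictionary layout, the `alignment_preset` key, and `resolve_params` are illustrative assumptions rather than the pipeline's configuration mechanism.

```python
# Values copied from the platform-specific defaults table. The dict layout
# and key names are illustrative assumptions.
PLATFORM_DEFAULTS = {
    "ont": {
        "min_variant_frequency": 0.2,
        "min_qual": 10,
        "alignment_preset": "map-ont",
    },
    "illumina": {
        "min_variant_frequency": 0.05,
        "min_qual": 20,
        "alignment_preset": "sr",
    },
}


def resolve_params(platform, overrides=None):
    """Start from the platform defaults and apply any user overrides."""
    if platform not in PLATFORM_DEFAULTS:
        raise ValueError(f"unknown platform: {platform!r}")
    params = dict(PLATFORM_DEFAULTS[platform])  # copy so defaults stay intact
    params.update(overrides or {})
    return params


print(resolve_params("illumina", {"min_qual": 30}))
```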
7.2 Resource Management
- pod5_batch_size: Controls GPU memory usage
- downsample_to: Coverage depth limiting
- basecall_max: Parallel basecalling instances
- low_memory: Resource-constrained mode
7.3 Key Process Labels
- big_mem: Memory-intensive processes (variant calling, consensus)
- GPU requirements: Dorado basecalling
8 Error Handling and Retry Strategy
Most processes implement:
    errorStrategy { task.attempt < 3 ? 'retry' : 'ignore' }
    maxRetries 2
This provides resilience against transient failures while preventing infinite loops.
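To make the resulting behavior concrete, the retry decision can be simulated outside Nextflow. The Python sketch below only models the stated rule (retry while the attempt number is below 3, with at most 2 retries, then ignore). It is not how Nextflow schedules tasks, and `run_with_retries` and `flaky` are hypothetical names.

```python
MAX_RETRIES = 2  # mirrors `maxRetries 2`


def error_strategy(attempt):
    """Mirror the closure: retry on attempts 1 and 2, ignore from attempt 3."""
    return "retry" if attempt < 3 else "ignore"


def run_with_retries(task):
    """Call task(attempt) until it succeeds or the strategy says to ignore.

    Returns (status, attempts_used).
    """
    attempt = 1
    while True:
        try:
            task(attempt)
            return "ok", attempt
        except RuntimeError:
            if error_strategy(attempt) == "retry" and attempt <= MAX_RETRIES:
                attempt += 1
            else:
                return "ignored", attempt


def flaky(attempt):
    # Hypothetical task that fails on its first attempt only.
    if attempt == 1:
        raise RuntimeError("transient failure")


print(run_with_retries(flaky))
```

Under this model a task that always fails runs three times in total and is then ignored, so one bad sample does not halt the rest of the run.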
9 Testing Considerations
9.1 Critical Validation Points
- Input file validation (exists, correct format)
- Primer validation (coordinates, sequences)
- Read count filtering (empty file handling)
- Platform-specific parameter application
- Optional workflow branching
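As one example of a validation point, a primer coordinate check might look like the Python sketch below. The rules are assumptions chosen for illustration (six tab-separated BED columns, `0 <= start < end`, strand of `+` or `-`) and may differ from what VALIDATE_PRIMER_BED actually enforces. The sample line is hypothetical.

```python
def validate_bed_line(line):
    """Return a list of problems found in one tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        return [f"expected at least 6 columns, found {len(fields)}"]
    _chrom, start, end, _name, _score, strand = fields[:6]
    problems = []
    try:
        start_i, end_i = int(start), int(end)
    except ValueError:
        problems.append("start and end must be integers")
    else:
        if start_i < 0 or start_i >= end_i:
            problems.append("coordinates must satisfy 0 <= start < end")
    if strand not in {"+", "-"}:
        problems.append(f"strand must be '+' or '-', found {strand!r}")
    return problems


# Hypothetical primer entry; an empty list means no problems were found.
print(validate_bed_line("ref1\t30\t54\tamplicon_1_LEFT\t1\t+"))
```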
9.2 Edge Cases to Test
- Empty input files
- No reads passing filters
- Missing optional inputs
- Primer mismatches
- Low coverage regions
- Multiple reference sequences
- Remote file watching timeout
9.3 Integration Test Scenarios
- Minimal run: reference + reads only
- Full featured run: all optional inputs
- Real-time processing: file watching
- Multi-sample processing
- Platform switching: same data, different platforms