Genotype API Documentation - v0.1.0

    Variable detectEncoding (Const)

    detectEncoding: (qualityString: string) => QualityEncoding = detectEncodingImmediate

    Primary quality encoding detection, re-exported for a more intuitive API. It uses immediate detection (detectEncodingImmediate), which is the primary use case for most workflows.

    Type Declaration

      • (qualityString: string): QualityEncoding
      • Detects the quality encoding of a single quality string using sequencing-technology patterns.

        Detects the quality encoding in a way that accounts for the history of sequencing technology. The algorithm combines ASCII range analysis with pattern recognition to distinguish between Solexa, Phred+64, and Phred+33 encodings. This is essential for processing legacy sequencing data and for staying compatible across the 15+ year history of high-throughput sequencing.

        Detection Algorithm Strategy: The algorithm uses a multi-stage approach to handle overlapping ASCII ranges:

        1. Uniform high patterns: Q40+ modern data (ASCII 73+) → Phred+33
        2. Constrained legacy patterns: High-ASCII only (64-104) → Phred+64 or Solexa
        3. Mixed modern patterns: Wide ASCII range (33-93) → Phred+33
        4. Statistical prevalence: Default to most common encoding (95% Phred+33)
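
        The four stages above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the names SketchEncoding, asciiRange, and sketchDetectEncoding are hypothetical, and the exact thresholds (treating "uniform" as all-identical characters, and ASCII 59-63 as the Solexa-only zone) are assumptions made for the example.

```typescript
// Hypothetical names throughout; a sketch, not the library's code.
type SketchEncoding = "phred33" | "phred64" | "solexa";

// Single pass over the string to find the lowest and highest character codes.
function asciiRange(quality: string): { min: number; max: number } {
  let min = Number.POSITIVE_INFINITY;
  let max = Number.NEGATIVE_INFINITY;
  for (let i = 0; i < quality.length; i++) {
    const code = quality.charCodeAt(i);
    if (code < min) min = code;
    if (code > max) max = code;
  }
  return { min, max };
}

function sketchDetectEncoding(quality: string): SketchEncoding {
  // Empty input carries no evidence: use the prevalence default (stage 4).
  if (quality.length === 0) return "phred33";

  const { min, max } = asciiRange(quality);

  // Stage 1: uniform high pattern. Assumption: "uniform" means every
  // character is identical, at or above ASCII 73 (Q40 in Phred+33) and
  // no higher than ASCII 93 (Q60 in Phred+33).
  if (min === max && min >= 73 && max <= 93) return "phred33";

  // Stage 2: constrained legacy pattern. Assumption: a minimum in 59-63
  // (Solexa scores -5 to -1) points to Solexa, otherwise Phred+64.
  if (min >= 59 && max <= 104) return min < 64 ? "solexa" : "phred64";

  // Stage 3: wide range reaching below the legacy floor.
  if (min < 59 && max <= 93) return "phred33";

  // Stage 4: statistical prevalence.
  return "phred33";
}
```

        The order of the checks matters in this sketch: the uniform-high test runs before the legacy-range test so that a read such as "IIIII" is not classified as Phred+64.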

        Technological Context:

        • Phil Green's Phred (1998): Original quality scoring for Sanger sequencing
        • Solexa Genome Analyzer (2006): Proprietary odds-based scoring
        • Illumina acquisition (2007): Continued Solexa format initially
        • Pipeline 1.3+ (2007-2011): Switched to Phred scores, kept ASCII+64
        • CASAVA 1.8+ (2011): Adopted Sanger-compatible Phred+33 format

        Detection Challenges: The overlapping ASCII ranges create fundamental ambiguity:

        • "High-quality" Phred+33: ASCII 70-93 (Q37-Q60)
        • "Low-quality" Phred+64: ASCII 64-75 (Q0-Q11)
        • Overlap zone: ASCII 64-93 could be either encoding
        • Context clues: Pattern analysis and prevalence-based decisions
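
        To make the ambiguity concrete, the sketch below decodes the same characters under both offsets. The helper name decodeUnderBothOffsets is hypothetical and exists only for this illustration.

```typescript
const PHRED33_OFFSET = 33;
const PHRED64_OFFSET = 64;

// Decode every character twice: once per candidate offset.
function decodeUnderBothOffsets(
  quality: string
): { asPhred33: number[]; asPhred64: number[] } {
  const asPhred33: number[] = [];
  const asPhred64: number[] = [];
  for (let i = 0; i < quality.length; i++) {
    const code = quality.charCodeAt(i);
    asPhred33.push(code - PHRED33_OFFSET);
    asPhred64.push(code - PHRED64_OFFSET);
  }
  return { asPhred33, asPhred64 };
}

// 'F' is ASCII 70 and 'K' is ASCII 75, both inside the overlap zone.
const ambiguous = decodeUnderBothOffsets("FK");
// ambiguous.asPhred33 is [37, 42] (excellent quality)
// ambiguous.asPhred64 is [6, 11]  (very poor quality)
```

        Both readings are valid scores, so the characters alone cannot settle which encoding was used.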

        Biological Impact of Quality Scores:

        • Q10: 90% accuracy (1 in 10 error rate) - poor quality
        • Q20: 99% accuracy (1 in 100 error rate) - acceptable
        • Q30: 99.9% accuracy (1 in 1000 error rate) - high quality
        • Q40: 99.99% accuracy (1 in 10,000 error rate) - excellent
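
        The figures above follow from the standard Phred relation P = 10^(-Q/10). The sketch below shows that conversion alongside the odds-based Solexa relation, Q = -10 log10(P / (1 - P)), solved for P. The function names are hypothetical and not part of this API.

```typescript
// Phred: Q = -10 * log10(P), so P = 10^(-Q/10).
function phredToErrorProbability(q: number): number {
  return Math.pow(10, -q / 10);
}

// Solexa: Q = -10 * log10(P / (1 - P)), so P = 1 / (1 + 10^(Q/10)).
function solexaToErrorProbability(q: number): number {
  return 1 / (1 + Math.pow(10, q / 10));
}

// Accuracy as a percentage, matching the list above (Q20 is about 99%).
function phredAccuracyPercent(q: number): number {
  return (1 - phredToErrorProbability(q)) * 100;
}
```

        The two scales nearly agree at high quality and diverge at low quality: a Solexa score of 0 corresponds to an error probability of 0.5, whereas a Phred score of 0 corresponds to 1.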

        Platform-Specific Characteristics:

        • Solexa GA: Variable quality, often filtered data in ASCII 64-90
        • Illumina 1.3-1.7: Higher baseline quality, ASCII 64-104 range
        • Modern Illumina: Excellent quality, often Q30+ (ASCII 63+)
        • Long-read platforms: Different quality models entirely

        Parameters

        • qualityString: string

          Quality string from FASTQ record

        Returns QualityEncoding

        Detected quality encoding with confidence assessment

        // Typical Illumina 1.5 quality string (Phred+64)
        const illumina15 = "@@CDEFGHIJKLMNOPQRSTUVWXYZ[\\]";
        const legacyEncoding = detectEncodingImmediate(illumina15);
        console.log(legacyEncoding); // "phred64" - legacy Illumina format

        // Modern Illumina data with high quality scores
        const modernHQ = "IIIIIIIIIIIIIIIIIIIII"; // Q40 across entire read
        const modernEncoding = detectEncodingImmediate(modernHQ);
        console.log(modernEncoding); // "phred33" - modern standard

        // Original Solexa scoring (odds-based, rare in modern data)
        const solexa = ";;;;;;;;;;"; // ASCII 59: poor quality, Solexa-specific range
        const solexaEncoding = detectEncodingImmediate(solexa);
        console.log(solexaEncoding); // "solexa" - historical format

        🔥 NATIVE CANDIDATE: ASCII min/max finding with SIMD acceleration