Genotype API Documentation - v0.1.0

    Function detectEncodingImmediate

    • Detect the quality encoding of a single quality string using sequencing technology patterns

      Detects the quality encoding of a read while accounting for how sequencing platforms changed their encodings over time. The algorithm combines ASCII range analysis with pattern recognition to distinguish between Solexa, Phred+64, and Phred+33 encodings. This matters when processing legacy sequencing data and when keeping tools compatible across the 15+ year history of high-throughput sequencing.

      Detection Algorithm Strategy: The algorithm uses a multi-stage approach to handle overlapping ASCII ranges:

      1. Uniform high patterns: Q40+ modern data (ASCII 73+) → Phred+33
      2. Constrained legacy patterns: High-ASCII only (64-104) → Phred+64 or Solexa
      3. Mixed modern patterns: Wide ASCII range (33-93) → Phred+33
      4. Statistical prevalence: Default to most common encoding (95% Phred+33)
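
      The four stages above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the name sketchDetect, the reading of "uniform" as a single repeated character, and the exact thresholds are assumptions taken from the list, and the sketch does not attempt the Solexa versus Phred+64 split mentioned in stage 2.

```typescript
// Hypothetical sketch of the four listed stages; NOT the library's code.
// sketchDetect, EncodingGuess, and every threshold below are assumptions.
type EncodingGuess = "phred33" | "phred64";

function sketchDetect(quality: string): EncodingGuess {
  // Nothing to inspect: fall through to the prevalence default (stage 4).
  if (quality.length === 0) return "phred33";

  // Single pass to find the lowest and highest ASCII codes in the string.
  let min = Number.POSITIVE_INFINITY;
  let max = Number.NEGATIVE_INFINITY;
  for (let i = 0; i < quality.length; i++) {
    const code = quality.charCodeAt(i);
    if (code < min) min = code;
    if (code > max) max = code;
  }

  // Stage 1: one repeated character at ASCII 73+ that still fits Phred+33 (<= 93).
  if (min === max && min >= 73 && max <= 93) return "phred33";

  // Stage 2: every character inside 64-104 (Solexa is not separated out here).
  if (min >= 64 && max <= 104) return "phred64";

  // Stage 3: a wider range that stays inside 33-93.
  if (min >= 33 && max <= 93) return "phred33";

  // Stage 4: default to the most common encoding.
  return "phred33";
}

console.log(sketchDetect("IIIIIIII")); // "phred33" (stage 1)
console.log(sketchDetect("@@CDEFGH")); // "phred64" (stage 2)
```

      Ordering matters in this sketch: the uniform-high check runs before the 64-104 check, otherwise a read made entirely of "I" characters would be classified as legacy data.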

      Technological Context:

      • Phil Green's Phred (1998): Original quality scoring for Sanger sequencing
      • Solexa Genome Analyzer (2006): Proprietary odds-based scoring
      • Illumina acquisition (2007): Continued Solexa format initially
      • Pipeline 1.3+ (2007-2011): Switched to Phred scores, kept ASCII+64
      • CASAVA 1.8+ (2011): Adopted Sanger-compatible Phred+33 format
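
      The "odds-based" Solexa score differs from a Phred score in its definition: Phred is -10·log10(p), while Solexa is -10·log10(p / (1 - p)), where p is the error probability. The sketch below shows the standard conversion between the two. The function names are illustrative and are not part of this API.

```typescript
// Illustrative helpers; the names are assumptions, not library exports.

// Decode one Solexa-encoded character (ASCII offset 64; scores can be negative).
function decodeSolexaChar(ch: string): number {
  return ch.charCodeAt(0) - 64;
}

// Convert a Solexa score to the equivalent Phred score:
// Q_phred = 10 * log10(10^(Q_solexa / 10) + 1)
function solexaToPhred(qSolexa: number): number {
  return (10 * Math.log(Math.pow(10, qSolexa / 10) + 1)) / Math.LN10;
}

const lowest = decodeSolexaChar(";"); // -5, the lowest Solexa score
console.log(lowest, solexaToPhred(lowest).toFixed(2));
console.log(solexaToPhred(40).toFixed(2));
```

      The two scales diverge most at the low end and become nearly identical for high scores, which is why the distinction mostly matters for poor-quality legacy reads.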

      Detection Challenges: The overlapping ASCII ranges create fundamental ambiguity:

      • "High-quality" Phred+33: ASCII 70-93 (Q37-Q60)
      • "Low-quality" Phred+64: ASCII 64-75 (Q0-Q11)
      • Overlap zone: ASCII 64-93 could be either encoding
      • Context clues: Pattern analysis and prevalence-based decisions
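
      To make the ambiguity concrete, the following sketch decodes the same characters under both offsets. The helper decodeScores is hypothetical and exists only for this illustration.

```typescript
// Hypothetical helper: subtract the encoding offset from each character code.
function decodeScores(quality: string, offset: 33 | 64): number[] {
  const scores: number[] = [];
  for (let i = 0; i < quality.length; i++) {
    scores.push(quality.charCodeAt(i) - offset);
  }
  return scores;
}

const ambiguous = "FGHIJ"; // ASCII 70-74, inside the overlap zone

console.log(decodeScores(ambiguous, 33)); // [37, 38, 39, 40, 41] excellent quality
console.log(decodeScores(ambiguous, 64)); // [6, 7, 8, 9, 10]     poor quality
```

      Both results are valid score ranges, so the characters alone cannot settle which encoding was used.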

      Biological Impact of Quality Scores:

      • Q10: 90% accuracy (1 in 10 error rate) - poor quality
      • Q20: 99% accuracy (1 in 100 error rate) - acceptable
      • Q30: 99.9% accuracy (1 in 1000 error rate) - high quality
      • Q40: 99.99% accuracy (1 in 10,000 error rate) - excellent
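
      The accuracy figures above follow from the Phred definition, where the error probability is 10^(-Q/10). A short sketch of that arithmetic, using an illustrative function name that is not part of this API:

```typescript
// Illustrative helper: error probability for a Phred score Q is 10^(-Q / 10).
function phredToErrorProbability(q: number): number {
  return Math.pow(10, -q / 10);
}

for (const q of [10, 20, 30, 40]) {
  const p = phredToErrorProbability(q);
  // Round for display; raw floating-point values may carry tiny rounding error.
  console.log(`Q${q}: about 1 error in ${Math.round(1 / p)} bases`);
}
```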

      Platform-Specific Characteristics:

      • Solexa GA: Variable quality, often filtered data in ASCII 64-90
      • Illumina 1.3-1.7: Higher baseline quality, ASCII 64-104 range
      • Modern Illumina: Excellent quality, often Q30+ (ASCII 63+ under Phred+33)
      • Long-read platforms: Different quality models entirely

      Parameters

      • qualityString: string

        Quality string from FASTQ record

      Returns QualityEncoding

      Detected quality encoding with confidence assessment

      // Typical Illumina 1.5 quality string (Phred+64)
      const illumina15 = "@@CDEFGHIJKLMNOPQRSTUVWXYZ[\\]";
      const legacyEncoding = detectEncodingImmediate(illumina15);
      console.log(legacyEncoding); // "phred64" - legacy Illumina format

      // Modern Illumina data with high quality scores
      const modernHQ = "IIIIIIIIIIIIIIIIIIIII"; // Q40 across entire read
      const modernEncoding = detectEncodingImmediate(modernHQ);
      console.log(modernEncoding); // "phred33" - modern standard

      // Original Solexa scoring (odds-based, rare in modern data)
      const solexa = ";;;;;;;;;;"; // Poor quality, Solexa-specific range
      const solexaEncoding = detectEncodingImmediate(solexa);
      console.log(solexaEncoding); // "solexa" - historical format

      🔥 NATIVE CANDIDATE: the ASCII min/max scan is a candidate for a native implementation with SIMD acceleration