Decoding

Contributions: Lucy, Chae, Riya, Sebastian

Overview, Context and Scope

After DNA is sequenced, we must perform some work on the sequenced strands before we can decode the DNA. This includes alignment of fragments of DNA sequences after sequencing, and the methodology depends the sequencing platform we use.

Many pre-existing alignment strategies are used in the realm of bioinformatics, and rely on a reference template; we don’t have that luxury and conduct sequence alignment without a template, otherwise known as de novo sequence assembly.

To perform iterations of the DBTL cycle, we will complete the algorithms required to perform alignment in a timely manner and with acceptable accuracy on one platform, most likely NGS. We will implement one of the two algorithms that is mentioned below, then in the second iteration try the other algorithm.

What types of sequencing machines are there?

Sanger

  • [@a2021_analyzing] [@a2021_sanger]
  • First 20-40 base pairs are not well resolved
  • Simple data analysis
  • Longer reads (500-700 bps)
  • Low sensitivity (~15–20% detection limit)
  • Sequences close to primer-binding sites to be of poor quality
  • Output: four-color chromatogram representing the peak fluorescence intensity associated with each labeled ddNTP along the DNA sequence sanger

NextGen

  • [@next] [@cheng_2023_methods]
  • Higher sequencing depth for increased sensitivity (down to 1%)
  • Higher discovery power
  • Short reads (150-300 bps) by Illumina ngs

Nanopore

Doesn't require PCR amplification, eliminating amplification bias and simplifying sequencing protocols relatively high error rates, around 10% per nt [@emerging].

What algorithms are we using?

There are many established algorithms in this domain, we will use one of those. Based on the selected platform, sequence alignment will be performed as so:

Sanger

Given four-color chromatogram(s) representing the peak fluorescence intensity, depending on the amplification scale in wet lab, we can compare chromatogram(s) to each other to resolve conflicts. For de novo assembly, we can resolve conflicts based on redundancy of chromatograms [@sanger].

NGS

A fastq file contains multiple chopped up sequence reads, each having a confidence score known as a Phred score. A Phred score is the probability the sequencer called the base incorrectly. We assemble the sequence reads de novo into one long read based on the overlaps between the chopped up sequences [@ngs].

ngs_output

To conduct de novo assembly for output from NGS platform, reads are examined for overlap between them, and the goal to build up a single contig from smaller contigs.

There are established algorithms for solving this problem [@paszkiewicz_2010_de]:

The overlap-layout-consensus (OLC) approach [@paszkiewicz_2010_de]

  • Find overlaps between all pairs of segments, deriving a similarity score for each pair; this will be used to generate our heuristic
  • Then we have to generate the layout based on overlaps, with an overlap graph
    • Vertices: sequence reads
    • Edges: overlaps
  • Find a path that visits each node once -> Hamiltonian circuit, NP-hard
  • So use heuristic (similar score) to greedily select which edges to take to maximize the heuristic (sum of similarities) until a single string is found (path is found)
  • Order of merging pairs matters and can change the final string

Requires overlap to be scored between all pairs of reads, making runtime as least O(n2), overlap is less likely for short reads.

The de Bruijn graph approach [@paszkiewicz_2010_de]

  • No need for overlap phase
  • Sequence reads cut into smaller pieces
  • K-tuples, DNA sequence word of length k, are generated
  • Bruijn graph:
    • Node: (k-1)-tuples that occur in k-tuples
    • Edge: connects (k-1)-tuples that form a k-tuple
  • Find shortest (minimum weight, using heuristic) path that visits every edge -> Chinese postman problem, NP-complete! ** Problems with finding multiple solutions, or sequencing errors which cause high branching and tangle
  • Effect choice of k

Nanopore

TBD, as we may not end up using Nanopore to sequence.

Why do we need to remove primer from the sequence before we do error correction?

We have to remove the nucleotides representing the primer in order to only decode on the nucleotide strand that contains information bases. Using the primer we have stored, we can run fuzzy string matching algorithms [@wikipediacontributors_2019_approximate]. We must use fuzzy (approximate) algorithms because there is a chance the primer doesn't quite match the primer sequence we have stored (due to errors in synthesis and sequencing).

From simplest to advanced [@silva_2022_what]:

  • Levenshtein distance: used in strings
    • Damerau-Levenshtein: “transposition of two characters to find an approximate match”
  • Hamming distance: used in signals
  • Advanced: Hidden Markov Models via probability

For the first iteration of the DBTL cycle, we will try Levenshtein Distance, and pursue other algorithms if there is notable gain from using them.

Sequence collapse to single nucleotides

Given that we are doing semi-specific synthesis, we now collapse homonucleotides to mononucleotides. We use the occurrence of homonucleotides as the probability the sequenced base is actually at that index, to deal with base conflicts, and also signal that the positions of base conflict could either be a deletion, insertion or mutation error. homo

After these steps, we can do error correction.

Current Solutions

How do we test this?

We can test in silico by using open source genome data, and try to reassemble (without the reference template) and then check the performance, additionally through ChaosDNA.