Decoding

Overview, Context and Scope
What types of sequencing machines are there?
- Sanger
- NextGen
- Nanopore
What algorithms are we using?
Why do we need to remove primer from the sequence before we do error correction?
Sequence collapse to single nucleotides
Current Solutions
How do we test this?

Contributions: Lucy, Chae, Riya, Sebastian

Overview, Context and Scope

After DNA is sequenced, we must perform some work on the sequenced strands before we can decode the DNA. This includes alignment of fragments of DNA sequences after sequencing, and the methodology depends the sequencing platform we use.

Many pre-existing alignment strategies are used in the realm of bioinformatics, and rely on a reference template; we don’t have that luxury and conduct sequence alignment without a template, otherwise known as de novo sequence assembly.

To perform iterations of the DBTL cycle, we will complete the algorithms required to perform alignment in a timely manner and with acceptable accuracy on one platform, most likely NGS. We will implement one of the two algorithms that is mentioned below, then in the second iteration try the other algorithm.

What types of sequencing machines are there?

Sanger

[@a2021_analyzing] [@a2021_sanger]
First 20-40 base pairs are not well resolved
Simple data analysis
Longer reads (500-700 bps)
Low sensitivity (~15–20% detection limit)
Sequences close to primer-binding sites to be of poor quality
Output: four-color chromatogram representing the peak fluorescence intensity associated with each labeled ddNTP along the DNA sequence

NextGen

[@next] [@cheng_2023_methods]
Higher sequencing depth for increased sensitivity (down to 1%)
Higher discovery power
Short reads (150-300 bps) by Illumina

Nanopore

Doesn't require PCR amplification, eliminating amplification bias and simplifying sequencing protocols relatively high error rates, around 10% per nt [@emerging].

What algorithms are we using?

There are many established algorithms in this domain, we will use one of those. Based on the selected platform, sequence alignment will be performed as so:

Sanger

Given four-color chromatogram(s) representing the peak fluorescence intensity, depending on the amplification scale in wet lab, we can compare chromatogram(s) to each other to resolve conflicts. For de novo assembly, we can resolve conflicts based on redundancy of chromatograms [@sanger].

NGS

A fastq file contains multiple chopped up sequence reads, each having a confidence score known as a Phred score. A Phred score is the probability the sequencer called the base incorrectly. We assemble the sequence reads de novo into one long read based on the overlaps between the chopped up sequences [@ngs].

ngs_output

To conduct de novo assembly for output from NGS platform, reads are examined for overlap between them, and the goal to build up a single contig from smaller contigs.

There are established algorithms for solving this problem [@paszkiewicz_2010_de]:

The overlap-layout-consensus (OLC) approach [@paszkiewicz_2010_de]

Find overlaps between all pairs of segments, deriving a similarity score for each pair; this will be used to generate our heuristic
Then we have to generate the layout based on overlaps, with an overlap graph
- Vertices: sequence reads
- Edges: overlaps
Find a path that visits each node once -> Hamiltonian circuit, NP-hard
So use heuristic (similar score) to greedily select which edges to take to maximize the heuristic (sum of similarities) until a single string is found (path is found)
Order of merging pairs matters and can change the final string

Requires overlap to be scored between all pairs of reads, making runtime as least O(n²), overlap is less likely for short reads.

The de Bruijn graph approach [@paszkiewicz_2010_de]

No need for overlap phase
Sequence reads cut into smaller pieces
K-tuples, DNA sequence word of length k, are generated
Bruijn graph:
- Node: (k-1)-tuples that occur in k-tuples
- Edge: connects (k-1)-tuples that form a k-tuple
Find shortest (minimum weight, using heuristic) path that visits every edge -> Chinese postman problem, NP-complete! ** Problems with finding multiple solutions, or sequencing errors which cause high branching and tangle
Effect choice of k

Nanopore

TBD, as we may not end up using Nanopore to sequence.

Why do we need to remove primer from the sequence before we do error correction?

We have to remove the nucleotides representing the primer in order to only decode on the nucleotide strand that contains information bases. Using the primer we have stored, we can run fuzzy string matching algorithms [@wikipediacontributors_2019_approximate]. We must use fuzzy (approximate) algorithms because there is a chance the primer doesn't quite match the primer sequence we have stored (due to errors in synthesis and sequencing).

From simplest to advanced [@silva_2022_what]:

Levenshtein distance: used in strings
- Damerau-Levenshtein: “transposition of two characters to find an approximate match”
Hamming distance: used in signals
Advanced: Hidden Markov Models via probability

For the first iteration of the DBTL cycle, we will try Levenshtein Distance, and pursue other algorithms if there is notable gain from using them.

Sequence collapse to single nucleotides

Given that we are doing semi-specific synthesis, we now collapse homonucleotides to mononucleotides. We use the occurrence of homonucleotides as the probability the sequenced base is actually at that index, to deal with base conflicts, and also signal that the positions of base conflict could either be a deletion, insertion or mutation error. homo

After these steps, we can do error correction.

Current Solutions

Basecalling, alignment, assembly and deconvolution of Sanger Chromatogram trace files
Aachen
Open source algorithms for short reads:
- “These software packages are able to perform de novo assembly of Illumina short sequence reads with the exception of SHORTY, which is designed to assemble ABI SOLiD colour-space data. Velvet and SOPRA can assemble sequence-space and colour-space data. aCurtain is a pipeline, based on Velvet, for hierarchical assembly of short sequence reads in order to overcome memory usage limitations. bOases is specifically designed for assembling transcribed sequences.” [@paszkiewicz_2010_de]

How do we test this?

We can test in silico by using open source genome data, and try to reassemble (without the reference template) and then check the performance, additionally through ChaosDNA.

iGEM 2023/24