Enhancing DNA Storage with Synthetic Biology: Software

UBC iGEM

Project Description

Our project aims to tackle the growing need for a better, more energy-efficient data storage medium compared to current magnetic and optical data storage options by means of synthetic biology. Currently, we aim to achieve this through 2 separate tracks:

Developing an enzymatic DNA synthesis platform that can elongate a single-stranded DNA (ssDNA) in a template-independent manner. The synthesized ssDNA strand will then be converted to a more stable, double-stranded DNA (dsDNA) and inserted into a plasmid for long-term data storage.

Developing a data encoding/decoding pipeline that allows binary files (used by computers) to be stored in a ternary format compatible with our DNA synthesis platform, retrieved, and converted back into binary.

Goals

In Silico:

Demonstrate ability to encode and decode information someone may store in long-term storage, in the 1000s of nucleotides long.

With wet lab:

Demonstrate ability to encode and decode a 100 nucleotide sequence with 30% error.

Plan

DBTL 1: March to April

Implement a barebones pipeline, and see how much error can be tolerated in 100 nucleotide long DNA sequences with in silico testing.

DBTL 2: April to May

Redefine algorithms to tolerate up to 30% error in 100 nucleotide long DNA sequences, with in silico testing.

DBTL 1: Proof of Concept

Encoding
Error Correction
Decoding
ChaosDNA

What programming language are we using?

Rust

systems programing language that is fast, memory efficient and memory-safe
expressive type system, great documentation, robust tooling
most admired language among developers

https://github.blog/2023-08-30-why-rust-is-the-most-admired-language-among-developers/

Encoding

Primer Generation
Sequence Generation

Primer Generation

Why generate our own primers?

act as unique identifiers for information
required for PCR amplification
specify requirements for TdT enzyme

How? Using a genetic algorithm.

Why use a genetic algorithm?

type of optimization algorithm
uses a set of constraints to produce heuristics to determine best parent candidates, which go on to produce children population

Primer design fits this description!

Schematic

Example: two chosen parents

Sequence Generation

Binary to Nucleotides

Many encoding formats exist:

Base 4 Encoding: 0 → A, 1 → T, 2 → G, 3 → C
Church Encoding: 0 → A or C, 1 → T or G
Base 2 Encoding: 00 → A, 11 → T, 01 → G, 10 → C
HEDGES (key-autokey cipher) ECC: (hash(input) + bit)mod4 → Base 4 Encoding

These encoding methods will ultimately be tested in silico...

Binary to Nucleotides

Some limitations...

short strands (100 nt)
high rate of deletion errors (30%)

Short strands?

Blocking data

requires we generate unique primers per strand
allows for parallel synthesis of DNA strands

Short strands?

To encode the UBC iGEM sponsorship package...

1.8 MB → 1.8 10⁶ bytes → 1.44 * 10⁷ bits → 180000 strands of DNA
assuming each DNA strand's information portion is 80 unique bases and we are using HEDGES Encoding
we will require compression...

Short strands?

High rate of deletion errors?

For the first iteration, we want to see what percentage of deletion errors we can correct for with minimal error correction. Some ways of reducing rate of deletion errors or preventing deletion errors include that we will explore in further DBTLs are:

synthesizing shorter sequences (more blocks of shorter length)
more complex encoding strategies
more complex error correction methods (inner and outer codes)

High rate of deletion errors?

HEDGES error-correcting codes: Why?

corrects for mutations and deletions, not prevalent type of error in traditional media storage
form of inner code: bits that encode for information

HEDGES error-correcting codes

Redundacy is enforced in the encoding strategy.

Press, W. H., Hawkins, J. A., Jones, S. K., Schaub, J. M., & Finkelstein, I. J. (2020). HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints. Proceedings of the National Academy of Sciences, 117(31), 18489–18496. https://doi.org/10.1073/pnas.2004821117

What does this mean?

HEDGES uses a hash function to encode redundancy and generate the base to be synthesized.

What's a hash function?

maps data of arbitrary size to a fixed-size value
fast to compute hash values
very low chance of returning the same hash value for two different hash inputs

We will also use a established checksum algorithm to generate a checksum to signal if error correction is needed.

HEDGES error-correcting codes: Rationale

"Hashing each bit value withits strand ID, bit index, and a few previous bits “poisons” bad decoding hypotheses, allowing for correction of indels."

"In summary, the algorithm encodes information as a stream of nucleotides such that any single decoding error in either nucleotide identity or nucleotide position will “poison” the downstream predictions. Thus, on decoding, there will be onlyone good-scoring chain of guesses—the correct one."

HEDGES error-correcting codes: Rationale

designed with orthogonality in mind: can add other error correction codes without interference, such as RSC
variable parameters, so we can tune to our specific DNA synthesis
MIT license, implementions online in C++ and Python

Decoding

Sequence Alignment
Error Correction

Sequence Alignment

we will most likely use either (or combination of):

Sanger Sequencing
NGS

De Novo Assembly

Why implement de novo assembly?

no reference template, so de novo assembly is required
important proof of concept to demonstrate our software can put back together a DNA sequence of 1000s of bases long

De Novo Assembly

greedy graph search

Why greedy graph search?

the exact solution is NP-hard problem
our sequences will be at most 1000-3000 bases long, so being greedy will usually yield the exact solution anyways

Error Correction

How does this work?

we are given a DNA sequence, and assuming we have correctly decoded bits[0...i-1], and we want bits[i]
we use the hash function from encoding stage to "guess" what the correct base should be
if we correctly guess the base, we continue searching that branch, otherwise assign a cumulative penality score or abandon that branch

HEDGES error-correction code: Reminder

Error Correction

ChaosDNA

an in-silico tool for generating faulty DNA sequences

Why?

ChaosDNA

an in-silico tool for generating faulty DNA sequences
will allow us to prove our software tool can deal with files of more realistic sizes (MB)
and enable us to work with DNA sequences of varying levels of deletion, insertion and mutation errors

After DBTL 2?

DBTL 3: May to June

Implement DNA Storage Alliance specifications, and do in silico testing on DNA sequences with 1000s of nucleotides.

DBTL 4 and 5: June to July

Test our software on sequences synthesized by wet lab, and redefine algorithms with in silico testing and wet lab data.

Future Directions

DNA Storage Alliance specifications
Outer Codes: GC
Graphical User Interface
SVGs

DNA Storage Alliance specifications

"Unlike traditional storage media such as tape, HDD, and SSD, DNA does not have a fixed physical structure, a built-in controller, or a way to address different regions of the media linearly, and thus needs a mechanism to start reading or “booting up” a DNA archive that does not rely on such a structure. The SNIA DNA Archive Rosetta Stone (DARS) working group, one of four working groups in the DNA Data Storage Alliance aimed at defining standards for DNA data storage systems, has developed two specifications to enable archive readers to find the sequence to begin booting up the data."

https://www.snia.org/news_events/newsroom/dna-data-storage-alliance-releases-its-first-specifications

Outer Codes: Guess & Check+

Correcting Insertions and Deletions in Short DNA sequences

Uses Reed-Solomon outer code to encode redundancies
Within redundant bits, portion of the bits store different possible indel patterns as 'guesses'
Rest of the parity bits encoded are 'checks' i.e. repetitive bits, used to check against guessed indel pattern

Graphical User Interface

SVGs and QR Codes

Scalable Vector Graphics

Store shapes as mathematical equations rather than individual pixels. Further compressible through traditional mechanisms or aforementioned text compression mechanisms.

QR Codes

Designed to store redundant information, enabling extreme error correction. Can be efficiently stored in many formats, including SVG or PNG.

Enhancing DNA Storage with Synthetic Biology: Software

UBC iGEM

Project Description

Goals

In Silico:

With wet lab:

Plan

DBTL 1: March to April

DBTL 2: April to May

DBTL 1: Proof of Concept

What programming language are we using?

Rust

Encoding

Primer Generation

Why generate our own primers?

Why use a genetic algorithm?

Schematic

Example: two chosen parents

Sequence Generation

Binary to Nucleotides

Binary to Nucleotides

Short strands?

Short strands?

Short strands?

High rate of deletion errors?

High rate of deletion errors?

HEDGES error-correcting codes

What does this mean?

HEDGES error-correcting codes: Rationale

HEDGES error-correcting codes: Rationale

Decoding

Sequence Alignment

De Novo Assembly

De Novo Assembly

Why greedy graph search?

Error Correction

HEDGES error-correction code: Reminder

Error Correction

ChaosDNA

ChaosDNA

After DBTL 2?

DBTL 3: May to June

DBTL 4 and 5: June to July

Future Directions

DNA Storage Alliance specifications

Outer Codes: Guess & Check+

Graphical User Interface

SVGs and QR Codes

Scalable Vector Graphics

QR Codes

Thank you!

Questions?