DNA for Digital Data Storage

Every day, 297 billion emails are sent, 500 million tweets are posted, and 4 million gigabytes of Facebook data are generated. In 2020, 59 trillion gigabytes of data were created. Every two years, 100 new hyperscale data centres are built to keep up with the demand for data storage. It is estimated that by 2028, data centres will account for 29% of energy demand in Ireland. In 2022, the Dutch government blocked the construction of what would have been the biggest data centre ever built: it was going to consume 1.38 gigawatt-hours (GWh) of electricity and cover the equivalent of 232 football fields of farmland. Eventually we will not have enough resources to build more centres and maintain the ones that already exist. These centres consume enormous amounts of energy, and by 2025 they are predicted to use 20% of the world’s electricity. But there is an alternative being developed: DNA for digital data storage.

‘DNA Computer Chip’ by Nicolle R. Fuller, NSF.

Why DNA

At the moment, all this data is kept mainly by technologies based on optical and magnetic materials, which store data in the form of 1s and 0s (binary digits). When information is stored in a DNA molecule, it is stored in long sequences of As, Cs, Ts, and Gs; this is how all life is ‘stored’. In living cells, information is read from DNA in three-letter units called codons. The long DNA sequences are converted back into binary code when the data is needed. One gram of DNA can store approximately 215 million gigabytes (215 petabytes). Compare that with a current hard drive, which weighs around 400 grams and stores 1 terabyte: DNA’s storage density is vastly higher than that of traditional storage media.
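The conversion between binary digits and DNA letters can be sketched very simply. The mapping below (00→A, 01→C, 10→G, 11→T) is an illustrative convention for this sketch, not the scheme used by any particular storage system; real codecs add constraints and redundancy on top of this basic idea.

```python
# Minimal sketch: two bits of data per DNA base.
# The specific mapping is an illustrative assumption, not a standard.
BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BIT_PAIR = {base: pair for pair, base in BIT_PAIR_TO_BASE.items()}

def bits_to_dna(bits: str) -> str:
    """Convert an even-length bit string into a DNA sequence."""
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bits(seq: str) -> str:
    """Convert a DNA sequence back into the original bit string."""
    return "".join(BASE_TO_BIT_PAIR[base] for base in seq)

print(bits_to_dna("01100011"))        # "CGAT"
print(dna_to_bits("CGAT"))            # "01100011"
```

Because each base carries two bits, a DNA sequence is half as many symbols as the bit string it encodes, which is one ingredient of DNA’s high storage density.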

DNA has already been used to store digital data. Yazdi et al. encoded and edited Wikipedia pages using a DNA-based storage system. Because DNA molecules have persisted for so long, we know they are very stable: entire genomes have been sequenced from specimens over 500,000 years old. DNA has become cheaper and cheaper to synthesise, sequence, and store, and storing it requires very little energy. In labs today, data stored in DNA can be decoded back to binary error-free, although it takes a long time (minutes to hours).

The idea of using DNA for data storage dates back to the 1960s, when DNA synthesis and sequencing did not yet exist. In the past, most DNA data storage involved in vivo cloning inside living cells. In vivo DNA data storage was developed to record new information in specific regions of an organism’s genome. However, this technique has a lower storage density, since the host cell adds bulk, and it requires genetically modifying the natural DNA of living cells. Even so, this technology has proven useful for storing information about the cellular environment and history, meaning it is possible to track in detail the cellular state and the molecular events that happened to a mammalian cell through its genome!

How does it work? 

A computer algorithm converts strings of bits into DNA sequences. One example is ADS Codex (Adaptive DNA Storage Codec), a programme developed by Los Alamos National Laboratory that translates binary code into genetic code. This process takes a long time, approximately 1 second per base attachment, meaning that it could take decades to write an archive file. But technology only gets better!
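A quick back-of-envelope calculation shows why writing is the bottleneck. All the numbers below are illustrative assumptions (a 1-terabyte archive, 2 bits per base with no redundancy, 1 second per base, and 10,000 strands written in parallel), not measured figures from any real system.

```python
# Back-of-envelope estimate of DNA write time at ~1 second per base.
# Every number here is an illustrative assumption, not a measurement.
archive_bytes = 10**12                      # a 1-terabyte archive
bases = archive_bytes * 8 // 2              # 2 bits stored per base
seconds_per_base = 1.0
parallel_strands = 10_000                   # assume many strands written at once
total_seconds = bases * seconds_per_base / parallel_strands
years = total_seconds / (365 * 24 * 3600)
print(f"~{years:.0f} years")                # on the order of a decade or more
```

Even with massive parallelism across strands, the result lands in the "decades" range the text mentions; writing one base per second serially would take over a hundred thousand years.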

These DNA sequences are then synthesised, generating multiple copies of each sequence. The data cannot all be written into one long DNA strand, simply because synthesis and sequencing technologies are not yet at that level, even though they have improved a lot in the last 10 years. The DNA is synthesised by solid-phase, phosphoramidite-based chemical synthesis, which has a low throughput; the molecules are covalently bound to a solid support material, usually a membrane or, for higher throughput, an array.

After synthesis, the DNA must be stored either in vivo (cloned in a cell) or in vitro (frozen or dried down). When the information is needed, the DNA can be retrieved through random access, in which the data is retrieved by selecting just the relevant portion instead of scanning all the data in storage. This is done using primers that bind to the start of the DNA chunk containing the requested information. However, it is challenging, since molecular storage lacks any physical organisation across the data in the same pool. Once the DNA is selected, it must be sequenced. Both low-throughput (Sanger) and high-throughput (Illumina) sequencing techniques are popular, and more recently Oxford Nanopore sequencing offers real-time readout! However, the latter technology is more error-prone than Illumina. The sequence is then converted back into binary code.
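The random-access idea can be sketched in a few lines: every stored strand begins with a short address region, and retrieval keeps only the strands whose address matches the requested primer (a software stand-in for PCR amplification with that primer). The sequences, primer length, and addresses below are made-up examples.

```python
# Sketch of random access in a DNA pool. Each strand starts with an
# 8-base "address" region; selecting a primer retrieves only matching
# strands. All sequences here are made-up examples.
PRIMER_LEN = 8

pool = [
    "ACGTACGTTTGACCAGGTCA",  # address ACGTACGT + payload
    "GGCCAATTCCGGAATTACGT",  # address GGCCAATT + payload
    "ACGTACGTAAACCCGGGTTT",  # another chunk filed under ACGTACGT
]

def random_access(pool: list[str], primer: str) -> list[str]:
    """Return the payloads of strands whose address matches the primer."""
    return [s[PRIMER_LEN:] for s in pool if s.startswith(primer)]

print(random_access(pool, "ACGTACGT"))
# -> ['TTGACCAGGTCA', 'AAACCCGGGTTT']
```

In a real system the primer physically binds and amplifies the matching strands during PCR; the point of the sketch is that only the addressed subset of the pool is read, not the whole archive.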

Errors during DNA synthesis and sequencing 

Yazdi et al., who developed a writable random-access DNA-based storage system, and Erlich & Zielinski, who developed DNA Fountain, reported ~1% error per base per position, meaning that 1% of the read DNA will carry an error at a given position. This figure relates to Illumina sequencing; nanopore sequencing showed an error rate of approximately 10%. They concluded that most errors come from sequencing and synthesis, and that DNA manipulation, storage, and PCR can cause some sequences to be lost and never sequenced at all.

This error percentage seems like a lot, but traditional magnetic media also has a ~1% raw error rate. However, other storage media suffer only substitution errors, while DNA also has insertions and deletions, which makes the coding more difficult. DNA is replicated many times during synthesis, which helps with errors, since they can be averaged out across copies after sequencing, but this does not guarantee error-free data.

Traditional data storage uses error-correction codes to reduce the number of errors in the information. These codes add redundant information, increasing the probability that the data can be recovered even when parts are corrupted or missing. DNA storage already has copies of the same sequence after synthesis (when it is done with Illumina or another column/array technique) - this is called physical redundancy. Receivers use this extra redundant data to check the accuracy of the information and reconstruct the data. This helps against the natural decay of DNA and allows errors to be averaged out; however, Church et al. still could not achieve zero bit errors even with this.
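The "averaging out" of errors across physical copies can be illustrated with a majority vote: read all copies of a strand, then keep the most common base at each position. The reads below are made-up examples with a couple of substitution errors; note that this simple scheme assumes the reads are aligned and of equal length, so it handles substitutions but not the insertions and deletions that make DNA coding harder.

```python
from collections import Counter

# Sketch of physical redundancy: majority vote across noisy copies.
# Reads are made-up examples, each with at most one substitution error.
reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",   # substitution error at position 3
    "ACGTACCTAC",   # substitution error at position 6
]

def consensus(reads: list[str]) -> str:
    """Most common base at each position (assumes aligned, equal-length reads)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

print(consensus(reads))  # -> "ACGTACGTAC"
```

Each individual read is wrong somewhere, but the consensus recovers the original sequence; real systems combine this physical redundancy with logical error-correction codes, and as the text notes, even then zero bit errors are not guaranteed.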

Today, it is still too expensive to use DNA to store digital data, and access latency (the time it takes to read data) is relatively high (minutes to hours). Various companies, including Twist Bioscience and Microsoft, are developing and advancing DNA-storage technology, which has already been shown to work on a small scale. The EU also has a project to establish a DNA data storage solution for archives. Maybe we will be using DNA to store our data sooner than you think.
