Submitted by Hungry_Definition t3_z7w2a3 in askscience
How do scientists know when to stop sequencing DNA when they are seeking to sequence an entire novel genome? How do they know they have the whole sequence?
doc_nano t1_iy91y85 wrote
There are many ways to do sequencing, but most modern sequencing involves chopping a genome into much smaller chunks at random, sequencing all those chunks, and using lots of computer power to work out how they originally fit together (the "shotgun" method). Since the chunks start and stop in (basically) random places, you will often find two chunks that came from different copies of the genome and share part of their sequence, and you can use that overlap to figure out how the whole thing fits together. This works really well for the most information-dense parts of the genome, and you can get a good sense of how complete the assembly is from how many times each stretch of sequence shows up across your chunks (called "depth" or "coverage" in the sequencing field). If most sequences show up 10 or 20 times and you aren't getting any new sequences, there's a good chance you've sampled all of the genome that you're going to see.
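To put rough numbers on the depth argument, here is a minimal sketch, not anything from an actual sequencing pipeline. It assumes an idealized model: reads of equal length landing uniformly at random on a circular genome, with no sequencing errors. Under that assumption the standard Poisson (Lander-Waterman) estimate says the fraction of bases never sampled is about e^(-depth). The function names and the toy genome size are made up for illustration.

```python
import math
import random


def expected_uncovered_fraction(depth):
    """Poisson (Lander-Waterman) estimate: fraction of bases hit by zero reads."""
    return math.exp(-depth)


def simulate_uncovered_fraction(genome_len, read_len, n_reads, seed=0):
    """Drop reads at random start positions and count bases nothing landed on.

    Assumption: the genome is treated as circular so the ends are not
    under-sampled; real genomes and real library prep are messier.
    """
    rng = random.Random(seed)
    depth = [0] * genome_len
    for _ in range(n_reads):
        start = rng.randrange(genome_len)
        for offset in range(read_len):
            depth[(start + offset) % genome_len] += 1
    return sum(1 for d in depth if d == 0) / genome_len


if __name__ == "__main__":
    genome_len, read_len = 10_000, 100  # toy sizes, chosen only for speed
    for n_reads in (100, 500, 1000, 2000):
        mean_depth = n_reads * read_len / genome_len
        simulated = simulate_uncovered_fraction(genome_len, read_len, n_reads)
        predicted = expected_uncovered_fraction(mean_depth)
        print(f"depth {mean_depth:4.0f}x  simulated {simulated:.5f}  predicted {predicted:.5f}")
```

At 10x depth the predicted unsampled fraction is e^(-10), roughly 0.005% of the genome, which is why "most things show up 10 or 20 times and nothing new appears" is a reasonable stopping signal under these idealized assumptions.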
A hiccup is that large parts of the human genome, and of the genomes of many other multicellular eukaryotes, consist of very long repetitive sequences of DNA. In this situation, you can't break the DNA into chunks shorter than the repeat and expect to piece it back together, since different fragments of the repetitive sequence look the same and you can't tell how long that stretch really is. This is where you need other approaches such as long-read sequencing, where a single read can span the whole repeat. However, the same logic applies: when you've sampled most of the genome many times over without uncovering anything new, the probability that you're missing something is quite low.
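The repeat problem can be shown with a toy example. This is only a sketch under simplifying assumptions: the sequences are invented, reads are error-free, and every possible read of a given length is observed. Two made-up "genomes" differ only in how many copies of a two-base repeat they carry. With short reads, the set of distinct reads is identical for both, so the distinct reads alone cannot tell you how long the repeat is. With reads longer than the shorter repeat tract, the two become distinguishable.

```python
def read_set(genome, read_len):
    """All distinct error-free reads of a given length (idealized sampling)."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


# Invented flanking sequences and repeat unit, purely for illustration.
LEFT_FLANK = "GATTACAGGT"
RIGHT_FLANK = "TTGCCGATAG"
REPEAT_UNIT = "AC"

genome_short_repeat = LEFT_FLANK + REPEAT_UNIT * 10 + RIGHT_FLANK  # 20-base repeat tract
genome_long_repeat = LEFT_FLANK + REPEAT_UNIT * 20 + RIGHT_FLANK   # 40-base repeat tract

if __name__ == "__main__":
    for read_len in (6, 30):
        same = read_set(genome_short_repeat, read_len) == read_set(genome_long_repeat, read_len)
        print(f"read length {read_len}: identical read sets? {same}")
```

With 6-base reads the two genomes yield the same set of reads, while 30-base reads can span the entire 20-base tract plus both flanks in the shorter genome, which pins down its length.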