How do scientists know when to stop sequencing DNA when they are seeking to sequence an entire novel genome? How do they know they have the whole sequence?

Comments

doc_nano t1_iy91y85 wrote on November 29, 2022 at 5:41 PM

There are many ways one could do sequencing, but most modern sequencing involves chopping a genome into much smaller chunks at random, sequencing all those chunks, and using lots of computer power to work out how those chunks originally fit together (the "shotgun" method). Since the chunks start and stop in (basically) random places, you will often find two chunks, coming from different copies of the genome, that share part of their sequence, and you can use that overlap to figure out how the whole thing fits together. This works really well for the most information-dense parts of the genome, and you can get a good sense of how complete the assembly is from how many times each stretch of sequence pops up again and again (something called "depth" in the sequencing field). If most sequences pop up 10 or 20 times and you aren't getting any new sequences, there's a good chance you've sampled all of the genome that you're going to see.
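To put rough numbers on the "depth" argument, here is a minimal Python sketch. It assumes reads land at uniformly random positions, in which case the classic Lander-Waterman estimate says the chance that a given base is missed by every read is about e^(-depth). The function names and the toy genome size are made up for illustration; real sequencing is not perfectly uniform, so treat this as an idealized picture rather than how any actual pipeline decides it is done.

```python
import math
import random

def expected_uncovered_fraction(depth):
    """Lander-Waterman estimate: chance that a given base is hit by no read."""
    return math.exp(-depth)

def simulate_uncovered_fraction(genome_len, read_len, n_reads, seed=0):
    """Drop reads at uniformly random start positions; return the fraction of bases never covered."""
    rng = random.Random(seed)
    covered = [0] * genome_len
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        for i in range(start, start + read_len):
            covered[i] += 1
    return sum(1 for c in covered if c == 0) / genome_len

# Toy numbers (assumptions): a 50 kb "genome" and 100-base reads.
genome_len, read_len = 50_000, 100
for depth in (1, 5, 10, 20):
    n_reads = depth * genome_len // read_len
    sim = simulate_uncovered_fraction(genome_len, read_len, n_reads)
    # Bases near the ends of a linear genome are covered less often,
    # so the simulated value runs a bit above the formula at high depth.
    print(depth, expected_uncovered_fraction(depth), sim)
```

At 1x depth roughly a third of the bases are still unseen; by 10x the estimate is below one base in ten thousand, which is why seeing everything 10 or 20 times is reassuring.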

A hiccup is that large parts of the human genome, and of the genomes of many other multicellular eukaryotic organisms, consist of very long, repetitive sequences of DNA. In this situation, you can't break the DNA into chunks shorter than the repeat and expect to piece it back together, since different fragments of the repetitive sequence will look the same and you can't tell how long that stretch really is. This is where you need other approaches such as long-read sequencing, which reads fragments long enough to span the repeats. However, the same logic applies: when you've sampled most of the genome many times over without uncovering anything new, the probability that you're missing something is quite low.
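A small Python sketch of why repeats are a problem for short reads. The two toy "genomes" below are invented for illustration and differ only in how many copies of a 3-base repeat they carry. The sketch compares idealized, error-free read sets and deliberately ignores how often each read appears, which is a simplification (in real data the depth over a repeat can hint at its copy number).

```python
def all_reads(genome, read_len):
    """Every substring of length read_len: an idealized, error-free read set."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

# Hypothetical flanking sequences and repeat unit, chosen only for the demo.
left, right = "ACGTTGCA", "GGATCCTA"
unit = "CAG"
genome_a = left + unit * 10 + right   # 30-base repeat array
genome_b = left + unit * 25 + right   # 75-base repeat array

short_len = 12   # shorter than the repeat array in either genome
long_len = 40    # spans genome_a's whole repeat array plus flank on both sides

# Short reads: both genomes produce exactly the same set of reads.
print(all_reads(genome_a, short_len) == all_reads(genome_b, short_len))  # True

# Long reads: some reads cross the entire repeat, so the sets differ.
print(all_reads(genome_a, long_len) == all_reads(genome_b, long_len))    # False
```

With 12-base reads there is no way to tell 10 copies from 25; with 40-base reads, a read that starts in one flank and ends in the other pins down the repeat length in the shorter genome.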

baldeagleNL t1_iy9xbv7 wrote on November 29, 2022 at 9:02 PM

So it's basically the same process as making a set of pictures of the horizon scrolling from left to right, and pasting them together where they overlap to create a panorama?

doc_nano t1_iya1u5l wrote on November 29, 2022 at 9:31 PM

Yeah, it's conceptually pretty similar to that kind of stitching. In some ways stitching the DNA sequences together is less complicated than an image because the data are one-dimensional and you don't need to correct for perspective artifacts for near-field objects. On the other hand it's a crap-ton more data elements than the number of pixels in even an HD image, so overall it's a lot more data to crunch through. And there are different kinds of artifacts that show up in sequencing data (such as read errors or mutations that occur during the copying of the DNA prior to sequencing) that need to be dealt with.
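For anyone who wants to see the panorama-style stitching in miniature, here is a toy Python sketch of greedy overlap merging. To be clear about the assumptions: the genome string, read length, and function names are invented for the example, the reads are error-free, and the sequence has no repeats. Real assemblers use more robust graph-based methods and have to cope with read errors, so this only illustrates the overlap idea.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest suffix-prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps enough; leave the remaining pieces separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads

# Toy repeat-free sequence (an assumption for the demo), cut into overlapping 9-base reads
# and presented out of order, the way shotgun reads would arrive.
genome = "ATGCGTACCTGAAGTCCATTG"
reads = [genome[s:s + 9] for s in (8, 0, 12, 4)]
print(greedy_assemble(reads, min_len=4) == [genome])  # True
```

Each neighbouring pair of reads shares 5 bases, and no 4-base stretch occurs twice in this toy sequence, so every overlap the code finds is a genuine one. Once repeats or read errors enter the picture, that guarantee disappears, which is where the extra computation goes.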

Hungry_Definition OP t1_iyai74n wrote on November 29, 2022 at 11:23 PM

Thank you!
