
adc34 t1_isfx6ri wrote

Both coding and non-coding DNA. Actually, 0.1% is a little bit outdated. The variation can be higher according to the 1000 Genomes Project. The article says:

> We find that a typical genome differs from the reference human genome at 4.1 million to 5.0 million sites. Although >99.9% of variants consist of SNPs and short indels, structural variants affect more bases: the typical genome contains an estimated 2,100 to 2,500 structural variants (∼1,000 large deletions, ∼160 copy-number variants, ∼915 Alu insertions, ∼128 L1 insertions, ∼51 SVA insertions, ∼4 NUMTs, and ∼10 inversions), affecting ∼20 million bases of sequence.

These ~20 million bases account for ~0.6% of total genome length.
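
As a rough check of that percentage, here is a minimal sketch. The genome length of about 3.2 billion bases is an assumed round figure for the haploid human genome and is not stated in the quote; the function name is just for illustration.

```python
# Assumption: haploid human genome length of roughly 3.2 billion bases
# (a commonly used round figure, not taken from the quoted article).
GENOME_LENGTH_BP = 3_200_000_000

# "~20 million bases of sequence" from the quoted passage.
AFFECTED_BP = 20_000_000


def percent_of_genome(affected_bp: int, genome_bp: int = GENOME_LENGTH_BP) -> float:
    """Return the share of the genome covered by `affected_bp`, in percent."""
    return 100.0 * affected_bp / genome_bp


share = percent_of_genome(AFFECTED_BP)
print(f"about {share:.3f}% of the genome")
```

With a different assumed genome length the result shifts a little, but it stays in the neighbourhood of 0.6%.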

1,193

danby t1_isgiefk wrote

I think a thing people maybe don't realise is that the 99.9% figure was a bit of a guesstimate 20 years ago. It has taken something like the 1000 Genomes Project to actually calculate the number/amount of differences.

543

RantingRobot t1_ishhfay wrote

Also, a concrete answer to the question doesn't really exist, since the number of differences varies depending on how you count them.

Some stretches of DNA do multiple, overlapping things, so is that counted as one difference or four? Some stretches of DNA can be the same in two people, but epigenetically expressed in one but not the other, so is that counted as one difference or none?

The number will always be kind of a guesstimate.

238

danby t1_ishjkfs wrote

Agreed. "Proportion of non-shared base pairs" is at least a decent enough, semi-objective way to compare the differences between two genomes without getting too far into the weeds about what exactly constitutes a difference. There are, at the end of the day, lots of differences that simply can't be expressed as a percentage difference (like gene/chromosome translocation).
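
To make "proportion of non-shared base pairs" concrete, a minimal sketch follows. It assumes the two sequences are already aligned to the same length, which is the hard part in practice; the function name and toy sequences are made up for illustration.

```python
def proportion_non_shared(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences differ.

    Assumes both sequences are already aligned and of equal length;
    insertions, deletions and rearrangements are NOT handled here.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


# Toy example: one differing base out of ten positions.
toy_difference = proportion_non_shared("ACGTACGTAC", "ACGTACGTAA")
```

Note that a translocation moves a block without changing its bases, so a position-by-position count like this either misses it or badly overcounts it, which is the limitation described above.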

80

Fmatosqg t1_ishmusp wrote

Since all of this is meant to produce proteins, it's only fair that the calculation is biased towards things that make different proteins.

So if a gene/allele gets moved to a different place, it still counts as no difference.

10

DreamWithinAMatrix t1_ishq800 wrote

Protein production used to be the thinking back in the days of the term "junk DNA", but we've since learned that there are sequences with non-protein-generating functions. Promoters and alternative splicing are the ones that come to mind. There are viral gene inserts which were originally thought to have no function but seem to be amplified in some regions and are now hypothesized to be a source of accelerated evolution, for example in neurons, which may have contributed to how humans diverged from chimps. The epigenome includes the methyl groups attached to the DNA, which can open or close regions to allow or prevent gene expression, may be driven mainly by environmental conditions, and can change frequently. There are also some portions of DNA which might fold on themselves to prevent expression.

If you only look at the raw gene sequence and say only the protein-producing ones count, you have no way of telling:

  • how much
  • how many kinds
  • speed
  • and whether the protein is currently being expressed

without taking all those things into account. Also, there are so many of the above being discovered that there's really no way to calculate all that yet.

59

joalheagney t1_isi6mvr wrote

Not to mention all the various segments that code for functional but non-protein encoding RNA.

14

doc_nano t1_ishqweh wrote

Well… sort of. While encoding proteins is arguably the most important and certainly the most visible function of the genome, there are parts that code for RNA that does not get translated into protein. These and other non-coding segments actually make up the majority of the human genome, and many of them play important roles. Though it is true that almost all those roles support the expression or regulation of proteins in some indirect way.

Also, a gene moving to a different locus can actually make a big difference, because the way it is expressed and regulated can change, even if it codes for the same protein.

10

danby t1_isis6zk wrote

> So if a gene/allele gets moved to a different place, it still counts as no difference.

Definitely not. Translocation often leads to or implies different expression of genes. As an aside, many, many translocations over large amounts of evolutionary time can lead to things like chromosome loss and/or speciation events. These are important forms of genetic change/mutation that do lead to important functional change. And they do make genomes quite different in ways that aren't measurable by simple percentages.

9

BryKKan t1_isj5b4f wrote

See, that's the problem though. Simply translocating a sequence, with no alteration, can diminish or amplify expression dramatically. So that could still be considered a difference.

2

derefr t1_isih06b wrote

"Easy" (but impractical to calculate) concrete answer: it's the information-theoretic co-compressibility of all the dependent information required to construct one individual's proteome relative to another individual's.

(I.e., if you have all the DNA + methylations et al of one person's genome, stored in a file, which you then compress in an information-theoretically optimal way [not with a general-purpose compressor, but rather one that takes advantage of the structure of DNA, rearranging things to pack better], and then measure the file size of the result; and then you create another file which contains all that same [uncompressed] information, plus the information of a second person's DNA + methylations et al; and you optimally compress that file; then by what percentage is the second optimally compressed file larger than the first?)

Or, to use a fanciful analogy: if we had a machine to synthesize human cells "from the bottom up", and you had all the information required to print one particular human's cells stored somewhere — then how much more information would you need as a "patch" on the first human's data, to describe an arbitrary other particular human, on average?
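
A toy sketch of that "patch size" measurement is below. It deliberately uses zlib, a general-purpose compressor, as a stand-in for the optimal DNA-aware compressor described above, so it only illustrates the idea. The sequences are random toy data, not real genomes, and all names are hypothetical. zlib also only looks back about 32 KB, so this would not work on genome-sized inputs.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def patch_fraction(first: bytes, second: bytes) -> float:
    """Relative growth of the compressed size when `second` is appended to `first`."""
    base = compressed_size(first)
    combined = compressed_size(first + second)
    return (combined - base) / base


# Toy data (hypothetical): random 2,000-base sequences, seeded for repeatability.
rng = random.Random(0)
person_a = "".join(rng.choices("ACGT", k=2000))

# person_b is a copy of person_a with 10 positions changed to a different base.
bases_b = list(person_a)
for pos in rng.sample(range(2000), 10):
    bases_b[pos] = "ACGT"[("ACGT".index(bases_b[pos]) + 1) % 4]
person_b = "".join(bases_b)

# An unrelated random sequence of the same length, for comparison.
unrelated = "".join(rng.choices("ACGT", k=2000))

similar_cost = patch_fraction(person_a.encode(), person_b.encode())
unrelated_cost = patch_fraction(person_a.encode(), unrelated.encode())
print(f"near-copy adds {similar_cost:.1%}, unrelated adds {unrelated_cost:.1%}")
```

The near-copy should add far less to the compressed size than the unrelated sequence, which is the intuition behind measuring difference as the size of a patch.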

3

Inariameme t1_isk4gr1 wrote

idk that i tend to agree with any of the computational architectures ;)

Simply, is DNA as linear as has been suggested? Probabilistically?

2

snuffleupugus_anus t1_isnsed2 wrote

Would a metric like the ratio of varying base pairs to the differential in expressed proteins be a better metric? I realize that it's just a theoretical number and that we can't actually count literally every protein in a human body, but, as a thought experiment I suppose, is that a more meaningful depiction of actual genetic difference?

1

sunplaysbass t1_isgminf wrote

Half a percent range seems huge to me. But that’s my know-nothing reaction.

61

Ixosis t1_isgv0ig wrote

Really isn’t that large when you find out we share 70% of our DNA with bananas

93

sunplaysbass t1_isgvgdx wrote

To me that is why 0.6% variance within humans is a lot, if we’re 30% off from being a banana.

145

powercow t1_ish1kn3 wrote

56

PhilosopherFLX t1_ishdx0a wrote

Always wonder how that squares with Neanderthal interbreeding when Neanderthals mostly lived 130,000 to 40,000 years ago, right in the middle of 70,000.

17

ECEXCURSION t1_ishvhah wrote

Maybe Neanderthals hunted humans to the brink of extinction. Just like humans and vampires!

5

Sylvurphlame t1_isirxzk wrote

Nah. A giant race war is something humanity would never engage in…

Wait…

7

Angdrambor t1_isjtgb1 wrote

Makes you wonder if they hit that same bottleneck before we wiped them out.

1

Xais56 t1_ishh9ox wrote

Depends on the animal. I doubt cheetahs have much variance.

Something hardy and successful and desired by humans though I could see having huge variance. Cannabis plants must have incredible variance between sexually produced individuals. (I'm aware it's not an animal, but the point stands).

7

powercow t1_ishztdj wrote

Oh, for sure some have similar or even less than us. I was talking more about the average side of things: we are a bit less genetically diverse than most. But especially among endangered species I'd expect diversity to be lower than ours. Not all that long ago they discovered a family of stick insects that everyone thought was extinct, living in a bush on a remote island. Since only a single family of them was found, it's unlikely they are as diverse as we are.

1

LoreChano t1_ishdssl wrote

So this was about the time we started to create art and religion, among other things? I wonder if it's related.

2

jadierhetseni t1_isgwy84 wrote

Eh. It’s hard to overstate how much of the genome isn’t code-specific. That is, some of it is useless, some of it is structural (need x bases of any sort), some of it is compositional (need a lot of g and c but the precise ratio isn’t important) etc

A lot of the major protein-coding, structural, and regulatory stuff is highly conserved, so there’s a lot of overlap between any two species (e.g. humans + bananas).

But all of that other stuff? Eh. It can vary basically as much as it wants consequence-free, producing a lot of within species differences.

33

BiPoLaRadiation t1_isgxrl6 wrote

To be fair the percentage of genes that are different is probably a lot higher than 30 percent. The 30 percent is the number of base pair sequences that are similar between humans and bananas. So us and bananas both have a gene for a sodium pump or some other gene that is shared between most living things and on average the similarity between our average gene and their average gene (of the roughly 7000 genes that they compared in the original study) is about 40 (actual original number) percent (or less because they tested gene products and not base pairs so a lot of minor variability will still result in the same protein product).

If you were to compare on a gene-by-gene basis then probably none of our genes would be exactly the same as a banana's. We and bananas also have multitudes of genes that are exclusive to us or them due to the structural differences and the long, long evolutionary divergence.

So a 0.6% difference in genetic sequence between humans including not just base pairs of genes but also non coding sequences is actually really tiny. It's enough of a difference to do a lot but it's not as big of a difference as you are imagining.

26

Sylvurphlame t1_isis7ih wrote

The way my biology professor explained it, assuming I recall correctly after decades, is that it takes most of the DNA just to make a functional life form of any sort of complexity. So the amount that separates species, or individuals within a species, is relatively small. But important.

3

bschug t1_isj3h9q wrote

Is that overlap the same for every human, or are some humans closer to a banana than others?

2

sunplaysbass t1_isjecm7 wrote

Given this variance I can only assume some humans are closer or farther from being a banana than others. It could be a new path for eugenics, or perhaps a banana cult ranking system.

1

danby t1_islfy04 wrote

> To me that is why 0.6% variance within humans is a lot

Sure but this includes non coding and repetitive DNA which between individuals is somewhat unconstrained. If you look at only protein coding genes you get back down to variances closer to 0.1%

2

dunnp t1_ishmgl8 wrote

That’s comparing just coding regions with bananas, not the non-coding regions, which are the vast majority of the human genome. So it's more like 70% of the coding 2% of the genome that is shared with bananas.

2

Thormeaxozarliplon t1_ishpbj8 wrote

That's only anecdotal. It's meant to show the common evolution of life. Most of that similarity is due to things like "housekeeping" genes and common biochemistry.

0

TomaszA3 t1_ish7nri wrote

I'll just drop here that small things can disable or enable almost the entirety of other "code". Like, change an "if" to the opposite symbol: 0.0...1% of the code has been changed, but 99.9...% of the code is not executing at all. Or only half of the total code is executing on one branch and the other half on another.

0.6% in such a highly flexible codebase should definitely bear massive functional (or not, but evolutionary) changes.

37

sometimesgoodadvice t1_isp75d7 wrote

An interesting analogy but slightly flawed in terms of looking at genomes of already viable organisms. A person whose genome is sequenced to compare to the reference has already undergone the selection criteria for viability and development. Basically, there are plenty of sites where single mutations would lead to a complete breakdown of making a "human" but those would never be seen in a sequenced genome.

The other main difference is that of course code is written to be concise and concrete. As far as I know, no one pastes in some random code that doesn't perform a function just in case it may be needed in the future. Of course, biology works precisely in that way and the genome is a mess of evolutionary history with plenty of space for modification without really resulting in any functional change. So a better example of those 0.6% may be that you can have typos in the comments of the code. In fact, for any large piece of software, I would be surprised if the comment section did not contain at least 0.5% typos.

3

promonk t1_ish7g8b wrote

Now I'm curious: whose genome is the human reference genome?

6

Kandiru t1_ishbuwl wrote

It's no one person's. It's a mishmash of several different high quality genomes, and then over time it's been changed to have the more common variants as the reference rather than the reference being a rare mutation for some genes.

25

promonk t1_ishdkr1 wrote

When you say "more common variants," common in what way?

I'm fascinated by the idea of a "reference human."

5

Kandiru t1_ishg2ww wrote

Say a certain position is an A for 90% of people, but a C for 10%. The A variant is more common than the C.

So when the reference had previously had a C there, in a later version it's often been changed to the most frequent base.
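
A minimal sketch of that kind of update, assuming a set of already aligned sample sequences. This is a toy illustration with made-up names and data, not how the actual reference assemblies are curated.

```python
from collections import Counter


def update_reference(reference: str, samples: list[str]) -> str:
    """Replace each reference base with the most frequent base among the samples.

    Assumes every sample is aligned to the reference and has the same length.
    If the current reference base is tied for most frequent, it is kept.
    """
    updated = []
    for i, ref_base in enumerate(reference):
        counts = Counter(sample[i] for sample in samples)
        top_count = max(counts.values())
        if counts[ref_base] == top_count:
            updated.append(ref_base)  # reference already carries a major allele
        else:
            updated.append(counts.most_common(1)[0][0])
    return "".join(updated)


# Toy data: at the middle position 90% of samples carry A, 10% carry C,
# while the old reference carries the rarer C.
old_reference = "GCT"
toy_samples = ["GAT"] * 9 + ["GCT"]
new_reference = update_reference(old_reference, toy_samples)
```

Here the middle base of the toy reference flips from C to A because A is the more common variant in the samples.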

18

promonk t1_ishu3eb wrote

I get that. What I'm curious about is sampling. 90% of which population? Is it 90% of some college-age kids being paid a hundred bucks for a cheek swab? Or is it drawn from a broad swathe of demographics and locations?

2

tsunamisurfer t1_ishsh54 wrote

Originally though, the reference genome was that of the first sequenced human genome, which I believe belonged to J Craig Venter.

2

Kandiru t1_isim3ny wrote

Actually there were two competing approaches at the beginning. Venter did sequence himself with shotgun sequencing, while the high fidelity BAC sequencing with Sanger sequencing was done on a range of different individuals spanning the genome.

So the first version of the reference was a mixture of them all.

4

Angdrambor t1_isjtnw2 wrote

What makes a genome "High quality"?

1

danby t1_islg7bo wrote

Though I only spent a handful of years in genome sequencing I suspect what is probably meant here is that the sequence was based on several genomes where they were able to prepare high quality genomic libraries for those genomes.

1

Angdrambor t1_ismpsxm wrote

What makes a genomic library high or low quality? Few errors? Faithful representation of the original?

1

Splatulance t1_isia4bj wrote

Typically the question of variance comes down to an aggregate statistic. The most common is "the maximum likelihood estimate", which for a normal enough distribution (bell curve) is the mean.

It's called maximum likelihood because it is the parameter value under which the observed data are most probable; for a bell curve, that value is the sample mean.

The more samples you have (the more genomes, in this case), the better you can estimate the actual average. With enough samples, the actual population mean is overwhelmingly likely to be very close to your estimate.

If the vast majority of people have 99% identical whatever, that's a very tightly grouped distribution around the mean with very low variance. It's practically a vertical line instead of a curve.
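
For a normal distribution the maximum likelihood estimates have a closed form, sketched below. The function name and sample values are made up for illustration; the only real point is that the estimate of the mean is the sample mean and the variance estimate divides by n rather than n - 1.

```python
import statistics


def normal_mle(samples: list[float]) -> tuple[float, float]:
    """Maximum likelihood estimates (mean, variance) for a normal distribution.

    The MLE of the mean is the sample mean; the MLE of the variance divides
    by n (not n - 1), so it is the population-style variance of the sample.
    """
    if not samples:
        raise ValueError("need at least one sample")
    mean = statistics.fmean(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, variance


# Toy values only, chosen so the arithmetic is easy to follow.
toy_mean, toy_variance = normal_mle([1.0, 2.0, 3.0])
```

A tightly grouped sample gives a small variance estimate, which is the "nearly vertical line" picture described above.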

1

Slappy_G t1_isjxerj wrote

This is totally unrelated but I've also heard the figure of 1.6% for how different chimpanzees are compared to humans. So has that figure been revised, or are we saying that the variation inside of humans is much closer to the variation between the two species?

2

Cuco1981 t1_isk0dcq wrote

The differences are too complex to be reduced to a simple percentage. For instance, we have differing numbers of chromosomes.

2

Cornelius_Physales t1_isig1v7 wrote

And even the reference of the 1000 Genomes Project leaves out low-complexity regions. The first whole genome of a human from telomere to telomere was only completely sequenced last year.

1

creperobot t1_isigt3p wrote

So what is the largest difference between the two most extreme samples?

1

PeanutSalsa OP t1_iskk2g1 wrote

How is it determined this is talking about both coding and non-coding DNA combined?

1