danby t1_isgiefk wrote on October 15, 2022 at 8:27 PM

I think a thing people maybe don't realise is that the 99.9% figure was a bit of a guesstimate 20 years ago. It has taken something like the 1000 Genomes Project to actually calculate the number of differences.

RantingRobot t1_ishhfay wrote on October 16, 2022 at 12:49 AM

Also, a concrete answer to the question doesn't really exist, since the number of differences varies depending on how you count them.

Some stretches of DNA do multiple, overlapping things, so is that counted as one difference or four? Some stretches of DNA can be the same in two people, but epigenetically expressed in one but not the other, so is that counted as one difference or none?

The number will always be kind of a guesstimate.

danby t1_ishjkfs wrote on October 16, 2022 at 1:05 AM

Agreed. "Proportion of non-shared base pairs" is at least a decent enough, semi-objective way to compare the differences between two genomes without getting too far into the weeds about what exactly constitutes a difference. There are, at the end of the day, lots of differences that simply can't be expressed as a percentage difference (like gene/chromosome translocation).
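
For what it's worth, here is a minimal sketch of that metric in Python. It assumes the two sequences are already aligned to the same length, which is the hard part in practice, and the function name and toy sequences are made up for illustration.

```python
def proportion_non_shared(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions where the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


# Toy sequences (hypothetical, not real genomic data): one mismatch in ten bases.
person_1 = "ACGTACGTAC"
person_2 = "ACGTTCGTAC"
difference = proportion_non_shared(person_1, person_2)
print(f"{difference:.1%} of positions differ")  # prints "10.0% of positions differ"
```

Note that a translocation would not show up sensibly in a count like this, which is exactly the limitation mentioned above.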

Fmatosqg t1_ishmusp wrote on October 16, 2022 at 1:32 AM

Since all of this is meant to produce proteins, it's only fair that the calculation is biased towards things that make different proteins.

So if a gene/allele gets moved to a different place, it still counts as no difference.

DreamWithinAMatrix t1_ishq800 wrote on October 16, 2022 at 1:58 AM

Protein production used to be the thinking back in the day of the term "junk DNA", but we've since learned that there are sequences with non-protein-generating functions. Promoters and alternative splicing are the ones that come to mind. There are viral gene inserts which were originally thought to have no function but seem to be amplified in some regions and are now hypothesized to be a source of accelerated evolution, for example in neurons, which may have contributed to how humans diverged from chimps. The epigenome is the methyl groups around the DNA which can open or close to prevent the genes from being expressed, which might be mainly driven by environmental conditions and change frequently. There are some portions of DNA which might fold on themselves to prevent expression as well.

If you only look at the raw gene sequence and say only the protein-producing parts count, you have no way of telling:

how much
how many kinds
speed
and whether the protein is currently being expressed

without taking all those other things into account. Also, so many of the above are still being discovered that there's really no way to calculate all of that yet.

joalheagney t1_isi6mvr wrote on October 16, 2022 at 4:18 AM

Not to mention all the various segments that code for functional but non-protein encoding RNA.

doc_nano t1_ishqweh wrote on October 16, 2022 at 2:04 AM

Well… sort of. While encoding proteins is arguably the most important and certainly the most visible function of the genome, there are parts that code for RNA that does not get translated into protein. These and other non-coding segments actually make up the majority of the human genome, and many of them play important roles. Though it is true that almost all those roles support the expression or regulation of proteins in some indirect way.

Also, a gene moving to a different locus can actually make a big difference, because the way it is expressed and regulated can change, even if it codes for the same protein.

danby t1_isis6zk wrote on October 16, 2022 at 8:39 AM

> So if a gene/allele gets moved to a different place, it still counts as no difference.

Definitely not. Translocation often leads to or implies different expression of genes. As an aside many, many translocations over large amounts of evolutionary time can lead to things like chromosome loss and/or speciation events. These are important forms of genetic change/mutation that do lead to important functional change. And they do make genomes quite different in ways that aren't measurable by simple percentages.

BryKKan t1_isj5b4f wrote on October 16, 2022 at 11:38 AM

See, that's the problem though. Simply translocating a sequence, with no alteration, can diminish or amplify expression dramatically. So that could still be considered a difference.

derefr t1_isih06b wrote on October 16, 2022 at 6:11 AM

"Easy" (but impractical to calculate in practice) concrete answer: it's the information-theoretic co-compressibility of all the dependent information required to construct one individual's proteome relative to another individual's.

(I.e., if you have all the DNA + methylations et al of one person's genome, stored in a file, which you then compress in an information-theoretical optimal way [not with a general-purpose compressor, but rather one that takes advantage of the structure of DNA, rearranging things to pack better], and then measure the file-size of the result; and then you create another file which contains all that same [uncompressed] information, plus the information of a second person's DNA + methylations et al; and you optimally compress that file; then by what percentage is the second optimally-compressed file larger than the first?)

Or, to use a fanciful analogy: if we had a machine to synthesize human cells "from the bottom up", and you had all the information required to print one particular human's cells stored somewhere — then how much more information would you need as a "patch" on the first human's data, to describe an arbitrary other particular human, on average?
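
A rough sketch of the idea in Python, with loud caveats: it uses `zlib`, which is exactly the kind of general-purpose compressor the ideal measure would not use, and the "genomes" are random toy strings, not real data. The names `compressed_size` and `patch_ratio` are made up for this illustration.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def patch_ratio(first: bytes, second: bytes) -> float:
    """How much larger the compressed pair is than the compressed first file alone."""
    alone = compressed_size(first)
    together = compressed_size(first + second)
    return (together - alone) / alone


def to_bytes(bases: list) -> bytes:
    return "".join(bases).encode("ascii")


rng = random.Random(0)
n = 10_000  # kept small so both sequences fit inside zlib's 32 KiB window

# Toy data (assumption): a random sequence, a near-copy of it, and an unrelated one.
person_a = [rng.choice("ACGT") for _ in range(n)]
person_b = list(person_a)
for position in rng.sample(range(n), 10):
    # Substitute a different base at ten random positions.
    person_b[position] = rng.choice([b for b in "ACGT" if b != person_b[position]])
unrelated = [rng.choice("ACGT") for _ in range(n)]

related_ratio = patch_ratio(to_bytes(person_a), to_bytes(person_b))
unrelated_ratio = patch_ratio(to_bytes(person_a), to_bytes(unrelated))
print(f"near-copy adds {related_ratio:.1%}, unrelated sequence adds {unrelated_ratio:.1%}")
```

The near-copy should add only a small fraction to the compressed size, while the unrelated sequence roughly doubles it. A compressor that understood DNA structure would presumably do much better than this on real genomes.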

Inariameme t1_isk4gr1 wrote on October 16, 2022 at 4:21 PM

idk that i tend to agree with any of the computational architectures ;)

Simply, is DNA as linear as has been suggested? Probabilistically?

snuffleupugus_anus t1_isnsed2 wrote on October 17, 2022 at 11:11 AM

Would a metric like the ratio of varying base pairs to the differential in expressed proteins be a better metric? I realize that it's just a theoretical number and that we can't actually count literally every protein in a human body, but, as a thought experiment I suppose, is that a more meaningful depiction of actual genetic difference?

sunplaysbass t1_isgminf wrote on October 15, 2022 at 8:56 PM

Half a percent range seems huge to me. But that’s my know-nothing reaction.

Ixosis t1_isgv0ig wrote on October 15, 2022 at 9:58 PM

Really isn’t that large when you find out we share 70% of our DNA with bananas

sunplaysbass t1_isgvgdx wrote on October 15, 2022 at 10:01 PM

To me that is why 0.6% variance within humans is a lot, if we’re 30% off from being a banana.

powercow t1_ish1kn3 wrote on October 15, 2022 at 10:46 PM

From what I read, we have less variation than other animals, due to some event 70,000 years ago that caused our population to collapse to only a few thousand.

PhilosopherFLX t1_ishdx0a wrote on October 16, 2022 at 12:21 AM

Always wonder how that squares with Neanderthal interbreeding when Neanderthals mostly lived 130,000 to 40,000 years ago, right in the middle of 70,000.

ECEXCURSION t1_ishvhah wrote on October 16, 2022 at 2:40 AM

Maybe Neanderthals hunted humans to the brink of extinction. Just like humans and vampires!

Sylvurphlame t1_isirxzk wrote on October 16, 2022 at 8:36 AM

Nah. A giant race war is something humanity would never engage in…

Wait…

Angdrambor t1_isjtgb1 wrote on October 16, 2022 at 3:07 PM

Makes you wonder if they hit that same bottleneck before we wiped them out.

Xais56 t1_ishh9ox wrote on October 16, 2022 at 12:47 AM

Depends on the animal. I doubt cheetahs have much variance.

Something hardy and successful and desired by humans though I could see having huge variance. Cannabis plants must have incredible variance between sexually produced individuals. (I'm aware it's not an animal, but the point stands).

powercow t1_ishztdj wrote on October 16, 2022 at 3:16 AM

Oh, for sure some have similar or even less variation than us. I was talking more about the average: we are a bit less genetically diverse than most. But especially among endangered species I'd expect diversity to be lower than ours. Not all that long ago they discovered a family of stick insects that everyone thought was extinct, living in a bush on a remote island. Since only a single family of them was found, it's unlikely they are as diverse as we are.

LoreChano t1_ishdssl wrote on October 16, 2022 at 12:20 AM

So this was about the time we started to create art and religion, among other things? I wonder if it's related.

jadierhetseni t1_isgwy84 wrote on October 15, 2022 at 10:12 PM

Eh. It’s hard to overstate how much of the genome isn’t code-specific. That is, some of it is useless, some of it is structural (need x bases of any sort), some of it is compositional (need a lot of g and c but the precise ratio isn’t important) etc

A lot of the major protein-coding, structural, and regulatory stuff is highly conserved, so there’s a lot of overlap between any two species (Eg humans + bananas)

But all of that other stuff? Eh. It can vary basically as much as it wants consequence-free, producing a lot of within species differences.

BiPoLaRadiation t1_isgxrl6 wrote on October 15, 2022 at 10:18 PM

To be fair the percentage of genes that are different is probably a lot higher than 30 percent. The 30 percent is the number of base pair sequences that are similar between humans and bananas. So us and bananas both have a gene for a sodium pump or some other gene that is shared between most living things and on average the similarity between our average gene and their average gene (of the roughly 7000 genes that they compared in the original study) is about 40 (actual original number) percent (or less because they tested gene products and not base pairs so a lot of minor variability will still result in the same protein product).

If you were to compare on a gene-by-gene basis then probably none of our genes would be exactly the same as a banana's. We and bananas also have multitudes of genes that are exclusive to us or them due to the structural differences and the long, long evolutionary divergence.

So a 0.6% difference in genetic sequence between humans including not just base pairs of genes but also non coding sequences is actually really tiny. It's enough of a difference to do a lot but it's not as big of a difference as you are imagining.

Sylvurphlame t1_isis7ih wrote on October 16, 2022 at 8:40 AM

The way my biology professor explained it, assuming I recall correctly after decades, is that it takes most of the DNA just to make a functional life form of any sort of complexity. So the amount that separates species, or individuals within a species, is relatively small. But important.

bschug t1_isj3h9q wrote on October 16, 2022 at 11:16 AM

Is that overlap the same for every human, or are some humans closer to a banana than others?

sunplaysbass t1_isjecm7 wrote on October 16, 2022 at 1:11 PM

Given this variance I can only assume some humans are closer or farther from being a banana than others. It could be a new path for eugenics, or perhaps a banana cult ranking system.

danby t1_islfy04 wrote on October 16, 2022 at 9:25 PM

> To me that is why 0.6% variance within humans is a lot

Sure, but this includes non-coding and repetitive DNA, which between individuals is somewhat unconstrained. If you look at only protein-coding genes you get back down to variances closer to 0.1%.

dunnp t1_ishmgl8 wrote on October 16, 2022 at 1:29 AM

That’s comparing just coding regions with bananas, not the non-coding regions, which are the vast majority of the human genome. So it's more like 70% of the coding 2% of the genome that is shared with bananas.

Thormeaxozarliplon t1_ishpbj8 wrote on October 16, 2022 at 1:51 AM

That's only anecdotal. It's meant to show the common evolution of life. Most of that similarity is due to things like "housekeeping" genes and common biochemistry.

TomaszA3 t1_ish7nri wrote on October 15, 2022 at 11:32 PM

I'll just drop here that small things can disable or enable almost the entirety of other "code". Like, change an "if" to the opposite symbol: 0.0...1% of the code has been changed, but 99.9...% of the code is not executing at all. Or only half of the total code is executing on one branch and the other half on another.

0.6% in such a highly flexible codebase should definitely produce massive functional (or not, but evolutionary) changes.
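
To make the analogy concrete, here is a toy sketch in Python (the `develop` function and its "pathway" labels are invented purely for illustration, and this is about code, not a model of how genes work). Flipping one comparison symbol changes a single character of the source but sends the same input down the other branch.

```python
ORIGINAL = """
def develop(signal):
    if signal > 0:
        return "pathway A"
    return "pathway B"
"""

# The "mutation": a single character, '>' becomes '<'.
MUTATED = ORIGINAL.replace("signal > 0", "signal < 0")


def load(source: str):
    """Execute the source text and return the function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["develop"]


changed_chars = sum(1 for a, b in zip(ORIGINAL, MUTATED) if a != b)
fraction_changed = changed_chars / len(ORIGINAL)

original_result = load(ORIGINAL)(1)  # "pathway A"
mutated_result = load(MUTATED)(1)    # "pathway B"
print(f"{fraction_changed:.1%} of characters changed: {original_result} -> {mutated_result}")
```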

sometimesgoodadvice t1_isp75d7 wrote on October 17, 2022 at 5:50 PM

An interesting analogy but slightly flawed in terms of looking at genomes of already viable organisms. A person whose genome is sequenced to compare to the reference has already undergone the selection criteria for viability and development. Basically, there are plenty of sites where single mutations would lead to a complete breakdown of making a "human" but those would never be seen in a sequenced genome.

The other main difference is that of course code is written to be concise and concrete. As far as I know, no one pastes in some random code that doesn't perform a function just in case it may be needed in the future. Of course, biology works precisely in that way and the genome is a mess of evolutionary history with plenty of space for modification without really resulting in any functional change. So a better example of those 0.6% may be that you can have typos in the comments of the code. In fact, for any large piece of software, I would be surprised if the comment section did not contain at least 0.5% typos.

Shadows802 t1_ishatal wrote on October 15, 2022 at 11:56 PM

So 99.4%?

promonk t1_ish7g8b wrote on October 15, 2022 at 11:31 PM

Now I'm curious: whose genome is the human reference genome?

Kandiru t1_ishbuwl wrote on October 16, 2022 at 12:05 AM

It's no one person's. It's a mishmash of several different high quality genomes, and then over time it's been changed to have the more common variants as the reference rather than the reference being a rare mutation for some genes.

promonk t1_ishdkr1 wrote on October 16, 2022 at 12:18 AM

When you say "more common variants," common in what way?

I'm fascinated by the idea of a "reference human."

Kandiru t1_ishg2ww wrote on October 16, 2022 at 12:38 AM

Say a certain position is an A for 90% of people, but a C for 10%. The A variant is more common than the C.

So when the reference had previously had a C there, in a later version it's often been changed to the most frequent base.
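
As a toy illustration of that update in Python (real reference curation is far more involved than this, and the sequences and function name here are made up), you can take the most frequent base at each position across a set of aligned samples:

```python
from collections import Counter


def most_common_bases(sequences: list) -> str:
    """Build a sequence from the most frequent base at each aligned position."""
    if not sequences:
        raise ValueError("need at least one sequence")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*sequences) yields one column of bases per position.
    # On a tie, Counter.most_common returns the base seen first.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sequences))


# Hypothetical samples: 90% carry an A at the second position, 10% carry a C.
samples = ["GATTACA"] * 9 + ["GCTTACA"]
old_reference = "GCTTACA"  # happens to carry the rarer C
updated_reference = most_common_bases(samples)
print(old_reference, "->", updated_reference)  # prints "GCTTACA -> GATTACA"
```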

promonk t1_ishu3eb wrote on October 16, 2022 at 2:29 AM

I get that. What I'm curious about is sampling. 90% of which population? Is it 90% of some college-age kids being paid a hundred bucks for a cheek swab? Or is it drawn from a broad swathe of demographics and locations?

emfts t1_isiaa4s wrote on October 16, 2022 at 4:55 AM

The first human reference genome (from the human genome project) was a group of people from all over, random volunteers.

You can read all about it here:

https://www.genome.gov/12513430/2004-release-ihgsc-describes-finished-human-sequence

Kandiru t1_isimfb1 wrote on October 16, 2022 at 7:20 AM

The 1000 Genomes Project used populations from around the world.

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README_populations.md

Has a list of the ones used.

tsunamisurfer t1_ishsh54 wrote on October 16, 2022 at 2:16 AM

Originally though, the reference genome was that of the first sequenced human genome, which I believe belonged to J Craig Venter.

Kandiru t1_isim3ny wrote on October 16, 2022 at 7:15 AM

Actually there were two competing approaches at the beginning. Venter did sequence himself with shotgun sequencing, while the high fidelity BAC sequencing with Sanger sequencing was done on a range of different individuals spanning the genome.

So the first version of the reference was a mixture of them all.

Angdrambor t1_isjtnw2 wrote on October 16, 2022 at 3:08 PM

What makes a genome "High quality"?

danby t1_islg7bo wrote on October 16, 2022 at 9:26 PM

Though I only spent a handful of years in genome sequencing, I suspect what is probably meant here is that the sequence was based on several genomes for which they were able to prepare high quality genomic libraries.

Angdrambor t1_ismpsxm wrote on October 17, 2022 at 3:12 AM

What makes a genomic library high or low quality? Few errors? Faithful representation of the original?

Splatulance t1_isia4bj wrote on October 16, 2022 at 4:53 AM

Typically the question of variance comes down to an aggregate statistic. The most common is "the maximum likelihood estimate", which for a normal enough distribution (bell curve) is the mean.

It's called maximum likelihood because most of x is most likely to be close to the mean.

The more samples you have, the more genomes in this case, the better you can estimate the actual average. With enough samples the actual population mean is overwhelmingly likely to be the same as your estimate.

If the vast majority of people have 99% identical whatever, that's a very tightly grouped distribution around the mean with very low variance. It's practically a vertical line instead of a curve.
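
A small simulation sketch of the "more samples, better estimate" point in Python. The population mean and spread below are invented numbers for illustration, not measured values, and a normal distribution is assumed.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_MEAN = 0.999         # assumed population-average similarity (illustrative only)
TRUE_SD = 0.0005          # assumed spread between individuals (illustrative only)


def estimate_mean(sample_size: int) -> float:
    """Draw a sample from the assumed normal distribution and return its mean."""
    draws = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(sample_size)]
    return statistics.fmean(draws)


small_estimate = estimate_mean(10)
large_estimate = estimate_mean(10_000)
print(f"n=10:     off by {abs(small_estimate - TRUE_MEAN):.7f}")
print(f"n=10000:  off by {abs(large_estimate - TRUE_MEAN):.7f}")
```

The error of the sample mean shrinks roughly with the square root of the sample size, so the large sample lands very close to the assumed true mean.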

Slappy_G t1_isjxerj wrote on October 16, 2022 at 3:34 PM

This is totally unrelated but I've also heard the figure of 1.6% for how different chimpanzees are compared to humans. So has that figure been revised, or are we saying that the variation inside of humans is much closer to the variation between the two species?

Cuco1981 t1_isk0dcq wrote on October 16, 2022 at 3:54 PM

The differences are too complex to be reduced to a simple percentage. For instance, we have a differing number of chromosomes.

Cornelius_Physales t1_isig1v7 wrote on October 16, 2022 at 6:00 AM

And even the reference of the 1000 Genomes Project leaves out low-complexity regions. The first whole genome of a human from telomere to telomere was only completely sequenced last year.

creperobot t1_isigt3p wrote on October 16, 2022 at 6:08 AM

So what is the largest difference between the two most extreme samples?

PeanutSalsa OP t1_iskk2g1 wrote on October 16, 2022 at 6:00 PM

How is it determined this is talking about both coding and non-coding DNA combined?