What's in Your Genome

Tuesday, May 2, 2023

Prologue

Introduction

It is humbling for me and awe-inspiring to realize that we have caught the first glimpse of our own instruction book, previously known only to God.

Francis Collins (2000)

Those were the words of Francis Collins when the President of the United States, Bill Clinton, announced the completion of the human genome sequence on June 26, 2000. But in spite of what Collins said there were a great many people besides God who had a pretty good idea of what was in your genome. Knowledgeable experts had been predicting for the past 30 years that the human genome would contain about 30,000 genes and lots of other functional regions. They estimated that the human genome was about 90% junk.

The publication of the human genome sequence proved that those knowledgeable experts were correct. (pp. 1-5)

The junk DNA wars

Most scientists were reluctant to believe the experts and they developed all sorts of hypotheses and speculations to avoid accepting the evidence that most of our genome is junk. This spawned the junk DNA wars that continue to this day.

"In this book I will attempt to show you that the concept of junk DNA is compatible with all the evidence, consistent with our understanding of evolution and population genetics, and possesses extraordinary explanatory power. It helps us make sense of biology. I will also show you that all the arguments against junk DNA are incompatible with our present understanding of molecular biology, incompatible with evolution, and lack explanatory power. They do not make sense." (pp. 5-6)

Friday, February 3, 2023

Chapter 6: How Many Genes? How Many Proteins?

Introduction

I think there are about 25,000 genes in the human genome but the annotated human genome says there are 45,000 and many scientists claim there are a lot more genes. Why is there a controversy over the number of genes? (pp. 136-137)

Defining a gene

It's important to have a usable definition of a gene. I define a gene as a DNA sequence that's transcribed to produce a functional product. The important point is that the gene product (RNA or protein) must have a biological function. (pp. 137-138)
[Dan Graur proposes a new definition of "gene"] [Gerald Fink promotes a new definition of a gene]

The molecular gene and the Mendelian gene

I'm talking about the molecular gene. The Mendelian gene is used in genetics and it's similar to the definition Richard Dawkins uses in his book The Selfish Gene. (pp. 138-139)

Counting genes

Draft sequences of genomes always contain predictions of large numbers of genes that are subsequently eliminated by annotators as more information becomes available. The current best estimates are that there are somewhat fewer than 20,000 protein-coding genes. (pp. 139-142)
[The 20th anniversary of the human genome sequence: 3. How many genes?] [How many protein-coding genes in the human genome? (2)] [How many protein-coding genes in the human genome?]

Counting proteins

The latest count is 18,407 proteins detected and 1,343 probable proteins that haven't yet been found for a total of 19,750. (pp. 142-143)
[How many proteins in the human proteome?]

The functions of protein-coding genes

There are about 10,000 housekeeping genes that encode the proteins required for basic metabolic processes. (pp. 143-144)

Historical estimates of the number of genes

Historical estimates predicted that the human genome would have about 30,000 genes and those estimates turned out to be approximately correct. Guesstimates about larger numbers of genes (e.g. 100,000) were not based on facts. (pp. 144-146)
[False history and the number of genes: 2016]

Confusion about the number of genes

The popular press claimed that knowledgeable scientists were predicting 100,000 genes but that's not correct. (p. 147)
[Nature falls (again) for gene hype]

The Deflated Ego Problem

Many scientists don't believe that humans could only have the same number of genes as nematodes and flowering plants. I call this The Deflated Ego Problem. (pp. 147-149)
[Deflated egos and the G-value paradox] [Revisiting the deflated ego problem] [The Deflated Ego Problem]

Introns and the size of genes

A typical protein-coding gene is 61,700 bp long but most of this is introns. Coding regions occupy about 1% of the genome and introns take up 37%. Genes account for 45% of the genome when you add in the noncoding genes. This number is not widely reported in the popular press. (pp. 149-151)

Introns are mostly junk

The weight of evidence strongly favors the view that most of the DNA in introns is junk. The splice sites and the minimum amount of DNA required to form a loop suggest that only 50 bp in each intron is functional DNA. (pp. 151-152)
[Are introns mostly junk?] [Are splice variants functional or noise?]

Box: Yeast loses its introns

Yeast has lost most of its introns since it diverged from other fungi. Most of the rest can be deleted without causing any decrease in fitness but a few seem to be essential. More than 98% of the introns in yeast are dispensable, confirming the idea that introns are mostly junk. (pp. 153-154)
[Yeast loses its introns]

Alternative splicing: common or rare?

One way to solve the Deflated Ego Problem is to assume that human genes can make many different proteins by an alternative splicing mechanism. There are many real examples of biologically relevant alternative splicing. (pp. 154-156)
[Debating alternative splicing (Part I)] [Debating alternative splicing (Part II)] [Debating alternative splicing (Part III)] [Debating alternative splicing (Part IV)]

How does alternative splicing work?

Biologically relevant alternative splicing occurs when splicing factors alter the activity of the spliceosome. Splicing errors are common, and misspliced transcripts (junk RNA) are easily detected and are entered into the transcript databases. (pp. 156-160)

Splicing errors are the best explanation

It's relatively easy to identify most splicing errors and eliminate those transcripts from the annotated reference genome. The vast majority of splice variants fall into the splicing errors category. (pp. 160-163)
[Splicing errors or alternative splicing?] [Alternative splicing and evolution] [Using conservation to determine whether splice variants are functional] [Splice variants of the human triose phosphate isomerase gene: is alternative splicing real?]

The case for splicing errors

There are four good reasons for concluding that true alternative splicing is confined to less than 5% of human protein-coding genes. (p. 163)
[The frequency of splicing errors reflects the balance between selection and drift]

The controversy and how it’s reported

The controversy over the abundance of real alternative splicing is mostly ignored in the scientific literature and in the popular press. It is widely assumed that almost all human genes are alternatively spliced. (pp. 164-165)
[Alternative splicing: function vs noise] [The persistent myth of alternative splicing] [The textbook view of alternative splicing] [The proteome complexity myth]

Box: The false logic of the argument for complexity

If alternative splicing is going to solve the Deflated Ego Problem then it must distinguish humans from other species. But all species produce abundant transcripts due to splicing errors, so humans are no different from nematodes or flowering plants. (pp. 166-167)
[Alternative splicing in the nematode C. elegans]

Alternative splicing and disease

Genetic diseases can be caused by errors in splicing. Their widespread occurrence is taken to be proof that alternative splicing is ubiquitous, but disease-causing splice errors can also occur in junk DNA. (pp. 167-169)

Notes for Chapter 6 (pp. 324-327)

Chapter 5: The Big Picture

Introduction

DNA sequencing and assembly. Cost of sequencing. (pp. 116-118)

A typical gene

DNA sequences are deposited in GenBank. The gene for triose phosphate isomerase (TPI1) is a typical gene. Decoding a protein-coding gene. (pp. 118-122)

Annotators interpret the genome

Human annotators must interpret the DNA sequence. (pp. 122-123)
[Contaminated genome sequences]

How much of the genome has been sequenced?

About 95% of the genome has been sequenced in the standard reference genome. The rest is estimated from the size of the gaps, giving a total of 3.1 Gb. The complete telomere-to-telomere sequence of T2T-CHM13 is also 3.1 Gb. (pp. 123-125)
[Karen Miga and the telomere-to-telomere consortium] [A complete human genome sequence (2022)] [What do we do with two different human genome reference sequences?] [How big is the human genome (2023)?]

Whose genome was sequenced?

The Celera sequence was mostly Craig Venter's genome. The IHGP standard reference genome was originally a composite of several different individuals from Buffalo (New York, USA). (pp. 125-126)

How many genes?

The original genome sequence predicted 30,000-40,000 protein-coding genes but that number has dropped to about 20,000 in the current standard reference genome. There are about 5,000 noncoding genes but this number is disputed. Introns take up most of a protein-coding gene and introns are mostly junk DNA. (pp. 126-128)
[Are introns mostly junk?]

Pseudogenes

There are about 15,000 pseudogenes derived from protein-coding genes. The number derived from noncoding genes is not known. Pseudogenes account for about 5% of the genome. (p. 128)

Regulatory sequences

If we assume about 200 bp of regulatory sequence for each gene then regulatory sequences account for less than 0.2% of your genome. Many scientists believe this number should be much higher. (pp. 128-129)
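The estimate above is simple arithmetic and can be checked with a short sketch. The 200 bp per gene figure comes from the text. The gene count of 25,000 (the author's own estimate from Chapter 6) and the 3.1 Gb genome size (quoted earlier in this chapter) are assumptions plugged in here for illustration.

```python
# Rough check of the regulatory-sequence estimate.
GENOME_SIZE_BP = 3.1e9        # assumption: 3.1 Gb total genome size
GENE_COUNT = 25_000           # assumption: the author's estimate of total genes
REGULATORY_BP_PER_GENE = 200  # from the text


def percent_of_genome(total_bp, genome_bp=GENOME_SIZE_BP):
    """Return the percentage of the genome occupied by total_bp base pairs."""
    return 100.0 * total_bp / genome_bp


regulatory_percent = percent_of_genome(GENE_COUNT * REGULATORY_BP_PER_GENE)
print(f"Regulatory sequences: {regulatory_percent:.2f}% of the genome")
```

Under these assumptions the result is about 0.16%, consistent with "less than 0.2%".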

Origins of replication

There are about 30,000-50,000 functioning origins of replication accounting for <0.3% of your genome. (pp. 129-130)

Centromeres

About 1% of your genome is occupied by centromeres. (p. 130)
[Centromere DNA] [Minimum Centromere Size in Plants]

Telomeres

Telomere sequences are about 0.1%. (pp. 130-131)
[Telomeres]

Scaffold Attachment regions (SARs)

SARs are required for chromatin organization and it's not clear how much DNA sequence is required. Assuming 100,000 loops and 100 bp of SAR per loop gives 0.3% of the genome. (p. 131)
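Because the amount of DNA per SAR is uncertain, it may help to see how the estimate changes with that number. The sketch below uses the 100,000 loops and 100 bp from the text. The 3.1 Gb genome size is taken from earlier in the chapter, and the 500 bp and 1,000 bp values are purely hypothetical alternatives, not figures from the book.

```python
# Sensitivity of the SAR estimate to the assumed SAR length per loop.
GENOME_SIZE_BP = 3.1e9  # assumption: 3.1 Gb total genome size
LOOP_COUNT = 100_000    # from the text


def sar_percent(bp_per_loop, loops=LOOP_COUNT, genome_bp=GENOME_SIZE_BP):
    """Percent of the genome occupied by SARs for a given SAR length per loop."""
    return 100.0 * loops * bp_per_loop / genome_bp


# 100 bp is the text's assumption; 500 and 1,000 bp are hypothetical values.
for bp_per_loop in (100, 500, 1000):
    print(f"{bp_per_loop:>5} bp per loop -> {sar_percent(bp_per_loop):.2f}% of the genome")
```

Even the hypothetical tenfold larger SAR would keep the total near 3% of the genome.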

Transposons

About 55% of the genome contains transposon-related and virus-related sequences. They are scattered throughout the genome including within introns. (pp. 131-132)

Viruses

Defective viruses take up about 9% of the genome and functional, dormant, viruses account for less than 0.1%. (p. 132)

Mitochondrial DNA

Less than 0.01% of your genome is occupied by mitochondrial DNA fragments. (p. 132)

How much of our genome is functional?

Adding up all the known functional sequences gives a value of about 4% functional. The actual amount is probably closer to 8-10% based on sequence conservation. The total amount of presumed junk DNA comes to 89%. About 90% of your genome is junk. (pp. 132-133)
[The 20th anniversary of the human genome sequence: 4. Functional DNA in our genome]
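As a rough cross-check, the percentages quoted in these chapter summaries can be added up. The sketch below sums only the categories that this summary gives a number for, using the quoted upper bounds. It comes out below the book's figure of about 4%, presumably because categories such as noncoding genes and functional sequences within introns are not itemized with a percentage here. That explanation is an assumption, not something the summary states.

```python
# Partial tally of functional DNA, using only percentages quoted in this summary.
itemized_percent = {
    "coding regions": 1.0,               # from the Chapter 6 summary
    "regulatory sequences": 0.2,         # quoted as "less than 0.2%"
    "origins of replication": 0.3,       # quoted as "<0.3%"
    "centromeres": 1.0,
    "telomeres": 0.1,
    "scaffold attachment regions": 0.3,
}

itemized_total = sum(itemized_percent.values())
print(f"Itemized functional DNA: {itemized_total:.1f}% of the genome")
```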

What is junk DNA?

Junk DNA is DNA that can be deleted without reducing the fitness of the individual. The debate is not whether junk DNA exists (it does) but over the amount of junk DNA. Opponents of junk DNA think that it would have been eliminated by natural selection if it were really junk. This is a common view in the popular press and even in the scientific literature. My view is that genomes are sloppy and natural selection isn't capable of purging junk DNA in species with large genomes. (pp. 133-135)
[Identifying functional DNA (and junk) by purifying selection]

Notes for Chapter 5 (p. 324)

Wednesday, February 1, 2023

Chapter 4: Why Don't Mutations Kill Us?

Introduction

Gregor Mendel and mutations. Spontaneous mutations. Rate of mutation. (pp. 82-83)
[Mutation, Randomness, & Evolution]

Why aren’t we extinct? - a 100-year old problem

History of mutation load (genetic load). Prediction of 30,000 genes. (pp. 83-84)
[What Is a Mutation?] [Genetic Load, Neutral Theory, and Junk DNA]

Biochemical mutation rate

Knowing the overall error rate of DNA replication (10^-10 mutations per base pair) and the number of cell divisions in the germ line gives an average of 138 new mutations per generation. (pp. 84-85)
[Parental age and the human mutation rate] [Estimating the Human Mutation Rate: Biochemical Method] [Human Y Chromosome Mutation Rates] [Mutation Rates]
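The biochemical estimate can be sketched as a short calculation. Only the error rate (10^-10 per base pair) and the result (138) appear in the summary above. The haploid genome size of 3.2 Gb and the germ-line division counts (about 30 for an egg, about 400 for a sperm) are assumptions, chosen as round figures that reproduce the quoted number.

```python
# Biochemical estimate of new mutations per generation.
ERROR_RATE_PER_BP = 1e-10  # from the text: mutations per base pair per replication
HAPLOID_GENOME_BP = 3.2e9  # assumption: haploid genome size
DIVISIONS_TO_EGG = 30      # assumption: germ-line cell divisions, zygote to egg
DIVISIONS_TO_SPERM = 400   # assumption: germ-line cell divisions, zygote to sperm

mutations_per_division = ERROR_RATE_PER_BP * HAPLOID_GENOME_BP
maternal_mutations = mutations_per_division * DIVISIONS_TO_EGG
paternal_mutations = mutations_per_division * DIVISIONS_TO_SPERM
total_new_mutations = maternal_mutations + paternal_mutations

print(f"Per division: {mutations_per_division:.2f}")
print(f"Per generation: {round(total_new_mutations)}")
```

With these inputs most of the new mutations come from the paternal line, simply because there are more cell divisions on the way to a sperm cell.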

Phylogenetic mutation rate

If you know the number of generations since the time of a common ancestor then you can calculate a mutation rate by looking at sequences that are evolving at the neutral rate. (pp. 85-86)
[Estimating the Human Mutation Rate: Phylogenetic Method] [Calculating time of divergence using genome sequences and mutation rates (humans vs other apes)]
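A minimal sketch of the phylogenetic calculation, assuming the compared sequences are neutral, so the substitution rate equals the mutation rate. The factor of 2 reflects that both lineages have been accumulating changes since the common ancestor. The divergence and generation count below are hypothetical round numbers for illustration, not values from the book.

```python
def neutral_mutation_rate(divergence_per_site, generations_since_ancestor):
    """Mutations per site per generation from neutral divergence between two lineages.

    The divergence is split across both lineages, hence the factor of 2.
    """
    return divergence_per_site / (2 * generations_since_ancestor)


# Hypothetical inputs, for illustration only.
HYPOTHETICAL_DIVERGENCE = 0.01      # 1% of neutral sites differ between the two species
HYPOTHETICAL_GENERATIONS = 250_000  # generations since the common ancestor

rate = neutral_mutation_rate(HYPOTHETICAL_DIVERGENCE, HYPOTHETICAL_GENERATIONS)
print(f"{rate:.1e} mutations per site per generation")
```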

   Box: Tick, tock, the molecular clock (p. 87)
   [The Modern Molecular Clock] [Can some genomes evolve more slowly than others?]
   [Reading the Entrails of Chickens] [Calibrating the Molecular Clock]

The direct method of calculating mutation rate

Comparing the sequences of a child and both parents gives you the number of new mutations per generation. (p. 88)
[Direct Measurement of Human Mutation Rate] [Parental age and the human mutation rate] [Estimating the Human Mutation Rate: Direct Method] [Human Mutation Rates] [Human mutation rates - what's the right number?] [Somatic cell mutation rate in humans]
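In its simplest form, the direct method is a set comparison: a new mutation is any variant present in the child but in neither parent. The sketch below uses made-up toy variants written as (chromosome, position, base) tuples. Real analyses also have to filter out sequencing errors and somatic mutations, which this illustration ignores.

```python
def de_novo_variants(child, mother, father):
    """Return variants found in the child but in neither parent."""
    return child - (mother | father)


# Hypothetical toy data: each variant is (chromosome, position, base).
mother = {("chr1", 1000, "A"), ("chr2", 5000, "T")}
father = {("chr1", 1000, "A"), ("chr3", 42, "G")}
child = {("chr1", 1000, "A"), ("chr3", 42, "G"), ("chr7", 777, "C")}

new_mutations = de_novo_variants(child, mother, father)
print(f"New mutations in this generation: {len(new_mutations)}")
```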

You are not Craig Venter

Craig Venter's genome sequence was the first one to include all 46 chromosomes separately. The amount of heterogeneity in human genomes means that no two individuals are alike. (pp. 89-90)
[What happens when twins get their DNA tested?] [Genetic variation in the human population] [Genetic variation and the complete human genome sequence] [Sequencing both copies of your diploid genome] [Sequencing human diploid genomes] [All about Craig]

Revisiting the genetic load argument

Given the mutation rate and the probability of deleterious mutations, only a small percentage of the human genome can be susceptible to mutation or our species would go extinct. (pp. 90-94)
[Revisiting the genetic load argument with Dan Graur]

Box: Human gene knockouts (pp. 92-93)

How much of our genome is conserved?

About 8-10% of the DNA sequences in the human genome are conserved in other species. (pp. 94-95)

Defining function

The best definition of function is the maintenance definition that relies on purifying selection. Functional DNA is any stretch of DNA whose deletion from the genome would reduce the fitness of the individual. (pp. 96-98)
[Identifying functional DNA (and junk) by purifying selection] [On the Meaning of the Word "Function"] [The Function Wars: Part I] [The Function Wars: Part II] [The Function Wars: Part III] [The Function Wars: Part IV] [Restarting the function wars (The Function Wars Part V)] [The Function Wars Part VI: The problem with selected effect function] [The Function Wars Part VII: Function monism vs function pluralism] [The Function Wars Part VIII: Selected effect function and de novo genes] [The Function Wars Part IX: Stefan Linquist on Causal Role vs Selected Effect] [The Function Wars Part X: "Spam DNA"?]

Box: Levels of selection (pp. 99-101)
[The Function Wars Part XIII: Ford Doolittle writes about transposons and levels of selection]

Why is the evidence of sequence conservation so hard to accept?

There are several arguments against sequence conservation as an indicator of function. (pp. 101-103)

Box: Deleting DNA to prove that it is junk (pp. 104-105)

Bulk DNA hypotheses

Skeletal DNA hypotheses. The bodyguard hypothesis. Genetic diversity. (pp. 105-110)
[Teaching about genomes using Nessa Carey's book: Junk DNA]

Medical relevance

Medical relevance is a weak argument for function because mutations in junk DNA can cause genetic diseases. (pp. 110-112)
[Junk DNA vs noncoding]

Ignoring history

Opponents of junk DNA have propagated a false narrative about the history of junk DNA by claiming that scientists in the late 1960s and early 1970s thought that all noncoding DNA was junk. (pp. 112-115)
[The "standard" view of junk DNA is completely wrong] [Junk DNA vs noncoding DNA] [The surprising (?) conservation of noncoding DNA] [More misconceptions about junk DNA - what are we doing wrong?] [Alan McHughen defends his views on junk DNA] [A University of Chicago history graduate student's perspective on junk DNA] [Nature journalist is confused about noncoding RNAs and junk] [What is the dominant view of junk DNA?]

Notes for Chapter 4 (pp. 321-324)

Monday, August 29, 2022

Chapter 3: Repetitive DNA and Mobile Genetic Elements

Introduction

Half of our genome is composed of highly repetitive DNA and moderately repetitive DNA. Satellite DNA. C₀t curves. (pp. 57-58)
[Transcription activity in repeat regions of the human genome]

Centromeres

Centromeres contain highly repetitive DNA. (p. 58)
[The structures of centromeres]

Telomeres

Telomeres at the ends of chromosomes contain repetitive DNA. (pp. 58-59)

Box: Dead centromeres and telomeres (pp. 59-60)

Short tandem repeats (STRs)

Short tandem repeats (STRs) are short stretches of repetitive DNA. (p. 60)

Box: DNA fingerprints (pp. 60-61)

Mobile genetic elements

Moderately repetitive DNA consists of interspersed copies of viruses and transposons. (p. 61)

Hidden viruses in your genome

The human genome contains copies of DNA viruses and RNA viruses. Most of them are due to ancient insertions and the viral genomes have acquired inactivating mutations. Many virus-related sequences are just fragments of the original virus genome. (pp. 61-65)

What do we need to know about transposons?

The two main types of transposons are DNA transposons and RNA transposons (retrotransposons). (pp. 65-67)

LINEs and SINEs

Long interspersed elements (LINEs) are transposons that carry a gene for reverse transcriptase. Most LINE-related sequences are degenerate versions of once-active transposons. Short interspersed elements (SINEs) are derived from small noncoding genes and they require exogenous reverse transcriptase to propagate. Alu elements are one example of a SINE and there are more than one million copies in the human genome. (pp. 67-70)

How much of our genome is composed of transposon-related sequences?

Most of the transposon-related sequences are inactive fragments of the original transposons. It's difficult to get a precise estimate of the total amount of transposon-related sequences but it's probably at least 50% of the human genome. (pp. 70-72)

Box: What does the humped bladderwort tell us about junk DNA? (p. 72)

Selfish genes and selfish DNA

Selfish DNA refers to DNA sequences that can propagate by themselves within the genome. (p. 73)
[Junk DNA and selfish DNA] [The selfish gene vs the lucky allele]

Exaptation versus the post hoc fallacy

Some transposon-related sequences have secondarily acquired a function that contributes to the fitness of the organism. This is an example of exaptation. Some scientists believe that transposon-related sequences are retained in order to serve as a reservoir for future exaptation but this argument is related to a logical fallacy called the post hoc fallacy. (pp. 73-78)
[Peter Larsen: "There is no such thing as 'junk DNA'"]

Mitochondria are invading your genome!

The human genome contains fragments of mitochondrial DNA that have recently been incorporated by accident. (pp. 78-79)
[How much mitochondrial DNA in your genome?]

On the origin of junk DNA

A lot of junk DNA originates from ancient insertions of transposons and their subsequent degeneration by acquiring mutations. (pp. 79-80)

If it walks like a duck ...

Transposons look like junk, behave like junk, and evolve like junk, so let's just call them junk. (pp. 80-81)

Notes for Chapter 3 (pp. 320-321)
