Tuesday, May 2, 2023

Prologue

Introduction

It is humbling for me and awe-inspiring to realize that we have caught the first glimpse of our own instruction book, previously known only to God.

                                                                                                Francis Collins (2000)

Those were the words of Francis Collins when the President of the United States, Bill Clinton, announced the completion of the human genome sequence on June 26, 2000. But in spite of what Collins said there were a great many people besides God who had a pretty good idea of what was in your genome. Knowledgeable experts had been predicting for the past 30 years that the human genome would contain about 30,000 genes and lots of other functional regions. They estimated that the human genome was about 90% junk.

The publication of the human genome sequence proved that those knowledgeable experts were correct. (pp. 1-5)

The junk DNA wars
Most scientists were reluctant to believe the experts and they developed all sorts of hypotheses and speculations to avoid accepting the evidence that most of our genome is junk. This spawned the junk DNA wars that continue to this day.

"In this book I will attempt to show you that the concept of junk DNA is compatible with all the evidence, consistent with our understanding of evolution and population genetics, and possesses extraordinary explanatory power. It helps us make sense of biology. I will also show you that all the arguments against junk DNA are incompatible with our present understanding of molecular biology, incompatible with evolution, and lack explanatory power. They do not make sense." (pp. 5-6)

Chapter 11: Zen and the Art of Coping with a Sloppy Genome

Introduction
The title comes from Zen and the Art of Motorcycle Maintenance, one of the most popular philosophy books of all time. The theme of my book is very different; it's about the idea that life at the molecular level is very messy and error-prone and looks nothing like a well-constructed Swiss watch. The author of Zen writes a number of short essays called Chautauquas and I'm going to close this book with a few of my own. (pp. 297-298)
The limitations of genomics
Genomics focuses on a global analysis of the entire genome rather than on specific genes. Genomic studies collect large amounts of data that may be useful in uncovering new features and in forming new hypotheses but those hypotheses still need to be tested at the level of individual genes. Genomics workers often believe they have discovered novel features of the genome that overthrow old ideas—features such as tens of thousands of noncoding genes, abundant alternative splicing, and huge amounts of regulatory sequence—but they have only discovered data that may or may not point to new features pending closer analysis. (pp. 299-302)
The function wars
The ENCODE publicity campaign kicked off an extended discussion about the meaning of the word function—a discussion that Alex Palazzo calls the 'function wars.' The new function wars drew in philosophers who have been debating the meaning of function for many decades. The wars are over and the most reasonable definition of molecular function is the maintenance definition that describes functional DNA as DNA that is currently maintained by natural selection (purifying selection). (pp. 302-307)
[ENCODE and their current definition of "function"] [Identifying functional DNA (and junk) by purifying selection] [The Function Wars Part IX: Stefan Linquist on Causal Role vs Selected Effect] [The Function Wars Part VIII: Selected effect function and de novo genes] [The Function Wars Part VII: Function monism vs function pluralism] [The function wars are over] [Philosophers talking about genes] [When philosophers write about evolution] [When philosophers talk about genomes]

Scientific revolutions
The scientific literature and the popular press are full of reports of scientific revolutions that have just overthrown some old paradigm causing the textbooks to be rewritten. That's not how science really works. Most scientific revolutions develop slowly over a period of many years as more and more data causes us to revise our old ideas. Many of the so-called revolutions reported in the popular press are actually paradigm shafts, not paradigm shifts. The concept of junk DNA refers to a real revolution in our thinking about genomes. It developed over many years in the 1960s and 70s but it failed to convince most biologists. All of the announcements about disproving junk DNA are fake revolutions and paradigm shafts. (pp. 307-311)
[ENCODE and their current definition of "function"] [University press releases are a major source of science misinformation] [Press release from the Francis Crick Institute misrepresents junk DNA]
No comfort for Intelligent Design Creationists
Intelligent Design Creationists have been predicting for years that most of our genome would turn out to be functional. They interpret recent results to be a vindication of their prediction. I hope I've demonstrated that they are wrong. (pp. 311-312)
[Religion vs science (junk DNA): a blast from the past] [Stephen Meyer "predicts" there's no junk DNA] [Do Intelligent Design Creationists still think junk DNA refutes ID?] [You need to understand biology if you are going to debate an Intelligent Design Creationist]
Scientific controversies
There is a genuine controversy over the amount of junk DNA in the human genome but the existence of this controversy is hidden from the general public because most scientists ignore it. Why do most scientists refuse to even consider the idea that our genome could be full of junk? I outline four reasons for this behavior. (pp. 312-315)
[Scientists say "sloppy science" more serious than fraud]
Coping with a sloppy genome
I hope I've convinced you that it's possible to live with the idea that 90% of our genome is junk. (p. 315)

Notes for Chapter 11 (pp. 333-334)

References (pp. 335-358)

Index (pp. 359-372)

Saturday, February 18, 2023

Chapter 10: Turning Genes On and Off

Introduction
Francis Collins, and many others, believe that the concept of junk DNA is outmoded because recent discoveries have shown that most of the human genome is devoted to regulation. This is part of a clash of worldviews where one side sees the genome as analogous to a finely tuned Swiss watch with no room for junk and the other sees the genome as a sloppy entity that's just good enough to survive. (pp. 264-266)
What is regulation?
Regulation refers to gene expression that can be modified according to environmental conditions. (pp. 266-267)
[Protein concentrations in E. coli are mostly controlled at the level of transcription initiation]
Stochastic gene expression
The rate of transcription of a gene can vary from cell to cell due to the stochastic nature of transcription factor binding and the initiation of RNA synthesis. This is not regulation. (pp. 267-268)
What do we know about regulatory sequences?
Most transcription factors bind within a few hundred base pairs of the promoter. With a few exceptions, regulatory sequences are found in close proximity to the 5′ end of the gene. (pp. 268-270)
[Are multiple transcription start sites functional or mistakes?] [How many enhancers in the human genome?] [Are most transcription factor binding sites functional?] [The Encyclopedia of Evolutionary Biology revisits junk DNA]

   Box: The making of a queen by regulating gene expression (p. 270)

Regulation and evolution
One way of resolving the Deflated Ego Problem is to speculate that humans have a much more sophisticated regulatory network than other "lower" species. This is consistent with evo-devo, which postulates that the differences between species are due more to differences in regulating gene expression than to differences in the number of genes. However, most of the results from evo-devo suggest that the differences in regulation are due to very small changes in the binding of existing transcription factors and not huge changes in genome organization. (pp. 271-273)

   Box: Can complex regulation evolve by accident? (pp. 273-274)

Regulating gene expression by rearranging the genome
There are some well-studied examples of regulation that are connected to rearranging the genome by recombination. (pp. 275-277)
Open and closed domains
DNA is more accessible to transcription factors when nucleosomes are loosely organized in an open domain. Gene expression is repressed when the gene is embedded in a highly structured closed domain (heterochromatin). The transition from closed to open domains is coupled to demethylation of DNA and modification of histones. The DNase I sensitivity of DNA in an open domain is correlated with transcription activity. The spontaneous "breathing" of heterochromatic regions allows transcription factors to bind. (pp. 277-279)
[Epigenetic markers in the last 8% of the human genome sequence] [Chromatin organization at promoters in yeast cells]

   Box: X-chromosome inactivation (pp. 279-280)
   [Escape from X chromosome inactivation]

The recruitment model of gene expression
The recruitment model of gene expression says that the binding of transcription factors triggers the demethylation of DNA and the modification of histone proteins to maintain an open domain. The histone code model is often connected to the belief that DNA demethylation and histone modification are the key events in regulation and not epiphenomena. This view is usually associated with a strong belief in the importance of epigenetics. (pp. 281-282)
ENCODE promotes regulation
ENCODE researchers postulate that at least 20% of the genome is required for regulation and there are dozens of transcription factor binding sites for each gene. According to this view, this sophisticated regulation explains why complex humans can exist with the same number of genes as most other species. (pp. 283-285)
[ENCODE's false claims about the number of regulatory sites per gene]
Does regulation explain junk? How can we test the hypothesis?
There are very few known examples of human genes with the complicated regulatory mechanisms promoted by the ENCODE leaders. The few published genomics tests of the hypothesis do not support it. What's missing is the random genome project in order to emphasize the importance of a negative control. (pp. 285-287)
[How much of the human genome is devoted to regulation?]

   Box: A thought experiment (pp. 287-288)

3D chromosomes
One possibility is that a lot of extra DNA is required in humans in order to organize genes into large functional loops of chromatin. This idea has been promoted by several scientists, including Emile Zuckerkandl. (pp. 288-291)
What the heck is epigenetics?
The most useful definition of epigenetics is the Holliday definition that restricts the term to changes that could be inherited by daughter cells following cell division. Proponents of epigenetics claim that chromatin markers can be passed from generation to generation in humans and they determine whether a gene will be expressed or silenced. There is no known mechanism for passing such markers from somatic cells to the germ line. (pp. 292-293)
[What the heck is epigenetics?] [Nessa Carey talks about epigenetics] [What do believers in epigenetics think about junk DNA?]
Restriction/modification and the inheritance of methylated nucleotides
The restriction/modification system in bacteria is a good example of how methylation signals can be passed to daughter cells following cell division but it does not explain epigenetics. There is a lot of hype associated with epigenetics and much of it is unjustified. (pp. 293-296)

Notes for Chapter 10 (pp. 331-333)

Chapter 9: The ENCODE Publicity Campaign

Introduction
On Sept. 5, 2012 Nature published a number of papers by the ENCODE Consortium. (The papers were rejected by Science.) The main summary paper announced that 80% of the human genome has a function and many of the ENCODE leaders pronounced the death of junk DNA. (pp. 238-241)
[The 10th anniversary of the ENCODE publicity campaign fiasco]
ENCODE results
The main results were that the human genome has 20,687 protein-coding genes and 18,451 noncoding genes. About 62% of the genome is transcribed. There are 636,336 binding sites for the 120 transcription factors they examined and these cover 231 million bp or 8.1% of the genome. The researchers identified more than 5 million open chromatin domains accounting for about 40% of the genome. If you add up all the biochemically active DNA it comes to 80.4% of the genome. (pp. 241-244)
The ENCODE publicity campaign
The papers that appeared in the Sept. 5th edition of Nature were accompanied by a massive publicity campaign organized by the editors at Nature. There were press releases from the universities and government research centers that were involved in the project. The main message was that 80% of the genome is functional and the idea of junk DNA has been refuted. The de facto ENCODE leader, Ewan Birney, was hailed as a "Big Talker." (pp. 244-246)
[The ENCODE publicity campaign of 2007]
Criticisms of ENCODE
The blogosphere, Twitter, and Facebook erupted immediately with posts criticising ENCODE for misleading the public about the meaning of function and pointing out that junk DNA is alive and well. Brendan Maher, the feature editor for Nature, realized the next day (Sept. 6) that they had a problem and he announced that the main purpose of the publicity campaign was to create "the biggest splash possible" to promote results that usually don't get much attention in the popular press. He conceded that the claim of 80% functional might have been an exaggeration. Over the next couple of years a number of papers critical of the ENCODE claim were published in the scientific literature. I have never seen such a strong and rapid criticism of papers published by leading scientists in a well-respected journal like Nature. (pp. 247-254)
Science journal doubles down
In December 2012 Science listed the ENCODE results as one of the breakthroughs of the year. Although it acknowledged the controversy, it still reported that 80% of the human genome is functional. (pp. 254-255)
ENCODE backpedals
In 2014, the ENCODE researchers partially retracted their claims about function and announced that the main purpose of ENCODE is to map all the spurious transcripts and spurious transcription factor binding sites in order to provide a resource for the community of scientists. (pp. 255-260)
[ENCODE and their current definition of "function"] [The Function Wars Part XII: Revising history and defending ENCODE] [Manolis Kellis dismisses junk DNA] [What did ENCODE researchers say on Reddit?] [Tim Minchin's "Storm," the animated movie, and another no-so-good Minchin cartoon]
ENCODE III
ENCODE III said in 2020 that there are 20,225 protein-coding genes and 37,595 noncoding genes. There are now 2,157,387 open chromatin domains and 1,224,154 transcription factor binding sites. ENCODE III made no claims about function. (pp. 260-261)
What went wrong?
ENCODE failed to consider the null hypothesis of no function. The researchers failed to acknowledge the criticisms of their claims back in 2007 and they failed to take into account alternative explanations of their data. This is not how science is supposed to work. (pp. 261-263)
[The 20th anniversary of the human genome sequence: 6. Nature doubles down on ENCODE results] [Style vs substance in science communication: The role of science writers in major science journals]

Notes for Chapter 9 (pp. 330-331)

Monday, February 6, 2023

Chapter 8: Noncoding Genes and Junk RNA

Introduction
There are about 5000 noncoding genes. Some scientists attempt to solve The Deflated Ego Problem by postulating the existence of tens of thousands of new noncoding genes. (pp. 192-194)
[John Mattick's new book]
Different kinds of noncoding RNAs
In addition to tRNAs, rRNAs, and several unique RNAs, there are a number of classes of small RNAs: small nuclear RNAs (snRNAs); small nucleolar RNAs (snoRNAs); microRNAs (miRNAs); short interfering RNAs (siRNAs); PIWI-interacting RNAs (piRNAs); long noncoding RNAs (lncRNAs). (pp. 194-196)
Understanding transcription
Transcription initiation requires the binding of RNA polymerase to a gene promoter. This is often aided by additional transcription factors that bind to nearby binding sites. (pp. 197-199)

   Box: Three RNA polymerases for three different kinds of genes

On the important properties of DNA-binding proteins
Transcription factors bind to relatively small DNA sequences and these sequences will be present by chance in large genomes. Most transcription factors will be bound to nonfunctional sites. (pp. 200-203)
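The chance-occurrence argument can be illustrated with a back-of-envelope sketch. It assumes a genome of 3.1 Gb (the figure given in Chapter 5), independent bases with equal frequencies, and exact matches to one fixed sequence on either strand. The motif lengths are illustrative and are not taken from the book.

```python
GENOME_BP = 3_100_000_000  # assumed genome size (3.1 Gb)


def expected_chance_sites(motif_len: int, genome_bp: int = GENOME_BP) -> float:
    """Expected number of exact matches to one fixed motif in random DNA.

    Assumes each base is independent and equally likely (probability 1/4)
    and counts both strands, which double-counts palindromic motifs.
    Real genomes violate these assumptions, so this is only a rough guide.
    """
    positions = genome_bp - motif_len + 1  # possible start positions per strand
    return 2 * positions / 4 ** motif_len


# Illustrative motif lengths (hypothetical, chosen only to show the trend).
for k in (6, 8, 10, 12):
    print(f"{k} bp motif: ~{expected_chance_sites(k):,.0f} chance matches")
```

Under these assumptions an 8 bp sequence is expected tens of thousands of times by chance alone, so a short binding motif cannot by itself mark a functional site.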
Random transcription initiation is the rule
Eukaryotic cells will contain a lot of junk RNA produced by spurious transcription from nonfunctional sites. (pp. 203-204)
[Sloppiness in translation initiation]
Random transcription termination is also common
Transcription termination is inefficient leading to large amounts of junk RNA from the 3′ end of the gene. (p. 204)
Sometimes RNA polymerase goes off in the wrong direction
Spurious transcription can also occur when transcription fires off in the wrong direction. (pp. 204-205)
Antisense transcription
Most antisense transcription is spurious but there are a number of examples of regulatory antisense RNAs. Upstream antisense RNAs are sometimes called enhancer RNAs (eRNAs) and a small number of genes might be regulated by eRNAs. (pp. 205-206)
How much of our genome is transcribed?
The ENCODE results from 2007 suggested that most of the human genome is transcribed at some time or other (pervasive transcription). It was assumed in 2007 that most of these transcripts were functional. (pp. 206-209)
[A 2004 kerfuffle over pervasive transcription in the mouse genome] [Transcription activity in repeat regions of the human genome] [The pervasive transcription controversy: 2002]
The pre-ENCODE history of pervasive transcription
The discovery that most of the human genome is transcribed dates back to the late 1960s. By 1980 it was understood that much of this is due to intron sequences that are rapidly degraded in the nucleus. There are two views on the possible function of all those transcripts: they are products of an exquisitely designed genome like a Swiss watch or products of a sloppy genome. (pp. 209-211)
How do we know about pervasive transcription?
DNA sequencing technology, RNA-Seq, and newer methods of detecting low level transcripts. (p. 211)
[The history of DNA sequencing]
How many lncRNAs?
LncRNA databases contain more than 500,000 putative lncRNAs. Only a small number of lncRNAs have been shown to have a biologically relevant function. Scientists who claim that the genome is full of lncRNA genes often don't understand the concept of the null hypothesis. (pp. 212-217)
[Most lncRNAs are junk] [Junk DNA [and lncRNAs]] [On the misrepresentation of facts about lncRNAs] [lncRNA nonsense from Los Alamos] [How many lncRNAs are functional?] [Experts meet to discuss non-coding RNAs - fail to answer the important question] [Confusion about the number of genes] [How many lncRNAs are functional: can sequence comparisons tell us the answer?]

   Box: Revisiting the Central Dogma?

Many scientists are confused about the meaning of the central dogma of molecular biology and this confusion leads them to promulgate misconceptions about junk. (pp. 217-218)
[Why is the Central Dogma so hard to understand?] [Subhash Lakhotia: The concept of 'junk DNA' becomes junk] [Georgi Marinov reviews two books on junk DNA]
John Mattick proves his hypothesis?
John Mattick claims that the human genome contains huge numbers of regulatory noncoding genes demonstrating that the central dogma is wrong and junk DNA is a myth. The Human Genome Organization awarded him a major prize in 2012 for "proving" his hypothesis. Definition of "paradigm shaft." (pp. 218-221)
[John Mattick presents his view of genomes] [The biggest mistake in the history of molecular biology (not!)] [John Mattick's latest attack on junk DNA] [Paradigm shifting]
The null hypothesis
The importance of the null hypothesis (no function) and criteria for determining whether a transcript is functional or not. (pp. 221-224)

   Box: The Random Genome Project

On the origin of new genes
New genes (de novo genes) can arise from junk DNA and this process might be enhanced by the presence of spurious transcripts. This is a form of exaptation. Some scientists argue that the presence of excess DNA and pervasive transcription might be evolutionarily advantageous to a species. But there aren't very many examples of real de novo genes and the argument is teleological. (pp. 225-228)
[Origin of de novo genes in humans] [The evolution of de novo genes] [Contingency, selection, and the long-term evolution experiment]

   Box: Constructive neutral evolution (pp. 228-229)

What the scientific papers don't tell you
Much of the scientific literature promotes the idea that there are tens of thousands of noncoding RNAs. They don't mention the fact that function has only been established for a small number of these transcripts and they don't mention that spurious transcription is a reasonable explanation of pervasive transcription. (pp. 229-231)

   Box: The false logic of the argument for noncoding RNAs (p. 232)

Biochemistry is messy
Biochemical reactions are not perfect and errors are common. The cell does not look like a finely-tuned Swiss watch. (pp. 233-234)
Change your worldview
If you think that every feature of a cell must be explained by adaptation then you should seriously consider changing your worldview. Richard Dawkins and Stephen Jay Gould represent the two different views of evolution: adaptationism and pluralism. (pp. 234-237)
Notes for Chapter 8 (pp. 328-330)

Saturday, February 4, 2023

Chapter 7: Gene Families and the Birth and Death of Genes

Introduction

The histone gene family. Definition of gene family. Pseudogenes. (pp. 170-171)

The birth and death of genes

As genomes evolve, new genes are born and old genes die. "Birth & death evolution" was mainly developed and promoted by Masatoshi Nei beginning in the early 1970s. Many new genes arise by gene duplication but most of them become pseudogenes within a few million years. Some evolve new functions by subfunctionalization or neofunctionalization. (pp. 172-174)
[On the evolution of duplicated genes: subfunctionalization vs neofunctionalization]

   Box: The smell of sweat (pp. 174-175)

Gene duplication and mutationism

Gene duplication is due mostly to errors in recombination. This is a subset of segmental duplication and it leads to genome expansion. The creation of new genes by mutation is a key aspect of mutationism. (pp. 175-177)
[Mutation, Randomness, & Evolution] [Replaying life's tape] [What is "structuralism"?] [Reactionary fringe meets mutation-biased adaptation: Introduction]

Whole genome duplications and the fate of genes

Polyploidization and hybridization give rise to species with twice as much DNA. The fate of that extra DNA, especially extra genes, can be tracked over time. It looks like the extra DNA is another example of junk DNA, lending support to the idea that species can tolerate large amounts of nonfunctional DNA. (pp. 177-179)
[The birth and death of salmon genes] [Birth and death of genes in a hybrid frog genome]

   Box: Real orphans in the human genome

Completely new genes, de novo genes, are rare but there are genuine examples of genes that are unique in the human genome (ORFans). They arise by gene duplication and they are often polymorphic. (p. 180)

Different kinds of pseudogenes

There are four different kinds of pseudogenes: death of a duplicated gene, processed, unitary, and polymorphic. The human genome has about 15,000 pseudogenes (5% of the genome) and almost all of them are junk. The fixation of a pseudogene involves two steps: mutation and fixation by random genetic drift. Pseudogenes can become unrecognizable after 100 million years. (pp. 181-184)
[Is the high frequency of blood type O in native Americans due to random genetic drift?]

   Box: Conserved pseudogenes and Ken Miller's argument against intelligent design

The presence of a conserved pseudogene in the beta globin gene cluster in chimpanzee and human genomes is difficult to explain by intelligent design. The fact that a small segment of the beta-globin pseudogene contains a SAR sequence is irrelevant to the main argument. (pp. 185-186)

Are they really pseudogenes?

Pseudogenes are broken genes and they are junk by any reasonable definition (see "If It Walks Like a Duck" in chapter 3). Some scientists who are opposed to junk DNA have claimed that most pseudogenes must be functional based on the fact that a tiny number have secondarily acquired a function. This is an example of cherry picking. (pp. 186-188)
[Are pseudogenes really pseudogenes?]

   Box: The short legs of dachshunds (pp. 188-189)

How accurate is the genome sequence?

The accuracy of DNA sequencing methods is approaching 99.99%. If that is coupled to 30x coverage, the overall accuracy is good enough to reliably distinguish between functional genes and pseudogenes. You also need a reliable sequence of your personal genome if you are going to make decisions about your health. (pp. 189-191)
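To illustrate why coverage matters, here is a minimal sketch under strong simplifying assumptions that are not stated in the book: each read is wrong at a given position independently with probability 0.0001 (99.99% accuracy), all wrong reads agree on the same wrong base (a worst case), and the consensus is a simple strict-majority vote over 30 reads. Real variant callers are more sophisticated, errors are not truly independent, and heterozygous sites are ignored here.

```python
from math import comb


def consensus_error(per_read_error: float, coverage: int) -> float:
    """Probability that a strict majority of reads at one position are wrong.

    Binomial model: each of `coverage` reads errs independently with
    probability `per_read_error`.
    """
    threshold = coverage // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(coverage, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (coverage - k)
        for k in range(threshold, coverage + 1)
    )


p = consensus_error(1e-4, 30)
print(f"per-position consensus error: {p:.3e}")
print(f"expected consensus errors in a 3.1 Gb genome: {p * 3.1e9:.3e}")
```

Under this idealized model the chance of a wrong consensus base is vanishingly small, which is why random read errors are unlikely to be mistaken for the mutations that distinguish a pseudogene from a functional gene.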

Notes for Chapter 7 (pp. 327-328)

Friday, February 3, 2023

Chapter 6: How Many Genes? How Many Proteins?

Introduction
I think there are about 25,000 genes in the human genome but the annotated human genome says there are 45,000 and many scientists claim there are a lot more genes. Why is there a controversy over the number of genes? (pp. 136-137)
Defining a gene
It's important to have a usable definition of a gene. I define a gene as a DNA sequence that's transcribed to produce a functional product. The important point is that the gene product (RNA or protein) must have a biological function. (pp. 137-138)
[Dan Graur proposes a new definition of "gene"] [Gerald Fink promotes a new definition of a gene]
The molecular gene and the Mendelian gene
I'm talking about the molecular gene. The Mendelian gene is used in genetics and it's similar to the definition Richard Dawkins uses in his book The Selfish Gene. (pp. 138-139)
Counting genes
Draft sequences of genomes always contain predictions of large numbers of genes that are subsequently eliminated by annotators as more information becomes available. The current best estimates are that there are somewhat fewer than 20,000 protein-coding genes. (pp. 139-142)
[The 20th anniversary of the human genome sequence: 3. How many genes?] [How many protein-coding genes in the human genome? (2)] [How many protein-coding genes in the human genome?]
Counting proteins
The latest count is 18,407 proteins detected and 1,343 probable proteins that haven't yet been found for a total of 19,750. (pp. 142-143)
[How many proteins in the human proteome?]
The functions of protein-coding genes
There are about 10,000 housekeeping genes that encode the proteins required for basic metabolic processes. (pp. 143-144)
Historical estimates of the number of genes
Historical estimates predicted that the human genome would have about 30,000 genes and those estimates turned out to be approximately correct. Guesstimates about larger numbers of genes (e.g. 100,000) were not based on facts. (pp. 144-146)
[False history and the number of genes: 2016]
Confusion about the number of genes
The popular press claimed that knowledgeable scientists were predicting 100,000 genes but that's not correct. (p. 147)
[Nature falls (again) for gene hype]
The Deflated Ego Problem
Many scientists don't believe that humans could only have the same number of genes as nematodes and flowering plants. I call this The Deflated Ego Problem. (pp. 147-149)
[Deflated egos and the G-value paradox] [Revisiting the deflated ego problem] [The Deflated Ego Problem]
Introns and the size of genes
A typical protein-coding gene is 61,700 bp long but most of this is introns. Coding regions occupy about 1% of the genome and introns take up 37%. Genes account for 45% of the genome when you add in the noncoding genes. This number is not widely reported in the popular press. (pp. 149-151)
Introns are mostly junk
The weight of evidence strongly favors the view that most of the DNA in introns is junk. The splice sites and the minimum amount of DNA required to form a loop suggest that only 50 bp in each intron is functional DNA. (pp. 151-152)
[Are introns mostly junk?] [Are splice variants functional or noise?]
   Box: Yeast loses its introns
Yeast has lost most of its introns since it diverged from other fungi. Most of the rest can be deleted without causing any decrease in fitness but a few seem to be essential. More than 98% of the introns in yeast are dispensable, confirming the idea that introns are mostly junk. (pp. 153-154)
[Yeast loses its introns]
Alternative splicing: common or rare?
One way to solve the Deflated Ego Problem is to assume that human genes can make many different proteins by an alternative splicing mechanism. There are many real examples of biologically relevant alternative splicing. (pp. 154-156)
[Debating alternative splicing (Part I)] [Debating alternative splicing (Part II)] [Debating alternative splicing (Part III)] [Debating alternative splicing (Part IV)]
How does alternative splicing work?
Biologically relevant alternative splicing occurs when splicing factors alter the activity of the spliceosome. Splicing errors are common and misspliced transcripts (junk RNA) are easily detectable and entered into the transcript databases. (pp. 156-160)
Splicing errors are the best explanation
It's relatively easy to identify most splicing errors and eliminate those transcripts from the annotated reference genome. The vast majority of splice variants fall into the splicing errors category. (pp. 160-163)
[Splicing errors or alternative splicing?] [Alternative splicing and evolution] [Using conservation to determine whether splice variants are functional] [Splice variants of the human triose phosphate isomerase gene: is alternative splicing real?]
The case for splicing errors
There are four good reasons for concluding that true alternative splicing is confined to less than 5% of human protein-coding genes. (p. 163)
[The frequency of splicing errors reflects the balance between selection and drift]
The controversy and how it’s reported
The controversy over the abundance of real alternative splicing is mostly ignored in the scientific literature and in the popular press. It is widely assumed that almost all human genes are alternatively spliced. (pp. 164-165)
[Alternative splicing: function vs noise] [The persistent myth of alternative splicing] [The textbook view of alternative splicing] [The proteome complexity myth]
   Box: The false logic of the argument for complexity
If alternative splicing is going to solve the Deflated Ego Problem then it must distinguish humans from other species. But all species produce abundant transcripts due to splicing errors so humans are no different than nematodes or flowering plants. (pp. 166-167)
[Alternative splicing in the nematode C. elegans]
Alternative splicing and disease
Genetic diseases can be caused by errors in splicing. Their widespread occurrence is taken to be proof that alternative splicing is ubiquitous, but disease-causing splice errors can also occur in junk DNA. (pp. 167-169)
Notes for Chapter 6 (pp. 324-327)

Chapter 5: The Big Picture

Introduction

DNA sequencing and assembly. Cost of sequencing. (pp. 116-118)

A typical gene

DNA sequences are deposited in GenBank. The gene for triose phosphate isomerase (TPI1) is a typical gene. Decoding a protein-coding gene. (pp. 118-122)

Annotators interpret the genome

Human annotators must interpret the DNA sequence. (pp. 122-123)
[Contaminated genome sequences]

How much of the genome has been sequenced?
About 95% of the genome has been sequenced in the standard reference genome. The rest is estimated from the size of the gaps giving a total of 3.1 Gb. The complete telomere-to-telomere sequence of T2T-CHM13 is also 3.1 Gb. (pp. 123-125)
[Karen Miga and the telomere-to-telomere consortium] [A complete human genome sequence (2022)] [What do we do with two different human genome reference sequences?] [How big is the human genome (2023)?]
Whose genome was sequenced?
The Celera sequence was mostly Craig Venter's genome. The IHGP standard reference genome was originally a composite of several different individuals from Buffalo (New York, USA). (pp. 125-126)
How many genes?

The original genome sequence predicted 30,000-40,000 protein-coding genes but that number has dropped to about 20,000 in the current standard reference genome. There are about 5,000 noncoding genes but this number is disputed. Introns take up most of a protein-coding gene and introns are mostly junk DNA. (pp. 126-128)
[Are introns mostly junk?]

Pseudogenes
There are about 15,000 pseudogenes derived from protein-coding genes. The number derived from noncoding genes is not known. Pseudogenes account for about 5% of the genome. (p. 128)
Regulatory sequences
If we assume about 200 bp of regulatory sequence for each gene then regulatory sequences account for less than 0.2% of your genome. Many scientists believe this number should be much higher. (pp. 128-129)
Origins of replication
There are about 30,000-50,000 functioning origins of replication accounting for <0.3% of your genome. (pp. 129-130)
Centromeres
About 1% of your genome is occupied by centromeres. (p. 130)
[Centromere DNA] [Minimum Centromere Size in Plants]
Telomeres
Telomere sequences are about 0.1%. (pp. 130-131)
[Telomeres]
Scaffold attachment regions (SARs)
SARs are required for chromatin organization but it's not clear how much DNA sequence is required. Assuming 100,000 loops and 100 bp of SAR per loop gives 0.3% of the genome. (p. 131)
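The back-of-envelope percentages in this chapter all follow the same pattern: number of elements times their size, divided by the size of the genome. Here is a minimal sketch of that arithmetic using the 3.1 Gb genome size quoted earlier. The helper name `genome_percent` is hypothetical, and the gene count of 25,000 used for the regulatory estimate is an assumption (about 20,000 protein-coding plus about 5,000 noncoding genes), not a number the chapter summary states for this calculation.

```python
GENOME_BP = 3.1e9  # genome size quoted in this chapter (3.1 Gb)


def genome_percent(n_elements, bp_per_element, genome_bp=GENOME_BP):
    """Return the percentage of the genome covered by n_elements of a given size."""
    return 100.0 * n_elements * bp_per_element / genome_bp


# SARs: 100,000 loops with 100 bp of SAR each (both numbers from the text).
sar_percent = genome_percent(100_000, 100)

# Regulatory sequences: 200 bp per gene (from the text). The gene count of
# 25,000 is an ASSUMPTION: ~20,000 protein-coding + ~5,000 noncoding genes.
regulatory_percent = genome_percent(25_000, 200)

print(f"SARs:       {sar_percent:.2f}% of the genome")
print(f"Regulatory: {regulatory_percent:.2f}% of the genome")
```

With these inputs the SAR estimate comes to roughly 0.3% and the regulatory estimate stays below 0.2%, which matches the figures given in the summaries.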
Transposons
About 55% of the genome contains transposon-related and virus-related sequences. They are scattered throughout the genome including within introns. (pp. 131-132)
Viruses
Defective viruses take up about 9% of the genome, and functional but dormant viruses account for less than 0.1%. (p. 132)
Mitochondrial DNA
Less than 0.01% of your genome is occupied by mitochondrial DNA fragments. (p. 132)
How much of our genome is functional?
Adding up all the known functional sequences gives a value of about 4% functional. The actual amount is probably closer to 8-10% based on sequence conservation. The total amount of presumed junk DNA comes to 89%. About 90% of your genome is junk. (pp. 132-133)
[The 20th anniversary of the human genome sequence: 4. Functional DNA in our genome]
What is junk DNA?
Junk DNA is DNA that can be deleted without reducing the fitness of the individual. The debate is not whether junk DNA exists (it does) but over the amount of junk DNA. Opponents of junk DNA think that it would have been eliminated by natural selection if it were really junk. This is a common view in the popular press and even in the scientific literature. My view is that genomes are sloppy and natural selection isn't capable of purging junk DNA in species with large genomes. (pp. 133-135)
[Identifying functional DNA (and junk) by purifying selection]
Notes for Chapter 5 (p. 324)

Wednesday, February 1, 2023

Chapter 4: Why Don't Mutations Kill Us?

Introduction
Gregor Mendel and mutations. Spontaneous mutations. Rate of mutation. (pp. 82-83)
[Mutation, Randomness, & Evolution]
Why aren’t we extinct? - a 100-year-old problem
History of mutation load (genetic load). Prediction of 30,000 genes. (pp. 83-84)
[What Is a Mutation?] [Genetic Load, Neutral Theory, and Junk DNA]
Biochemical mutation rate
Knowing the overall error rate of DNA replication (10^-10 mutations per base pair) and the number of cell divisions in the germ line gives an average of 138 new mutations per generation. (pp. 84-85)
[Parental age and the human mutation rate] [Estimating the Human Mutation Rate: Biochemical Method] [Human Y Chromosome Mutation Rates] [Mutation Rates]
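The biochemical estimate is a product of three quantities: the replication error rate, the number of base pairs copied at each cell division, and the number of germ-line cell divisions separating one generation from the next. The sketch below shows the arithmetic. Only the error rate comes from the summary above; the haploid genome size (3.2 Gb) and the division counts (30 on the egg side, 400 on the sperm side) are illustrative assumptions chosen because they give a result close to the quoted figure. The function name is hypothetical.

```python
def mutations_per_generation(error_rate, genome_bp, divisions_egg, divisions_sperm):
    """Expected new mutations in a zygote from replication errors alone."""
    # Mutations introduced each time a haploid genome is copied.
    per_replication = error_rate * genome_bp
    # Errors accumulate over every division in both parental germ lines.
    return per_replication * (divisions_egg + divisions_sperm)


ERROR_RATE = 1e-10      # mutations per base pair per replication (from the text)
GENOME_BP = 3.2e9       # ASSUMPTION: haploid genome size used for this estimate
DIVISIONS_EGG = 30      # ASSUMPTION: cell divisions from zygote to egg
DIVISIONS_SPERM = 400   # ASSUMPTION: cell divisions from zygote to sperm

estimate = mutations_per_generation(
    ERROR_RATE, GENOME_BP, DIVISIONS_EGG, DIVISIONS_SPERM
)
print(f"About {estimate:.0f} new mutations per generation")
```

The result is sensitive to the number of divisions on the sperm side, which rises with the age of the father, so different assumptions shift the estimate noticeably.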
Phylogenetic mutation rate
If you know the number of generations since the time of a common ancestor then you can calculate a mutation rate by looking at sequences that are evolving at the neutral rate. (pp. 85-86)
[Estimating the Human Mutation Rate: Phylogenetic Method] [Calculating time of divergence using genome sequences and mutation rates (humans vs other apes)]
   Box: Tick, tock, the molecular clock (p. 87)
   [The Modern Molecular Clock] [Can some genomes evolve more slowly than others?]
   [Reading the Entrails of Chickens] [Calibrating the Molecular Clock]
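For sequences evolving at the neutral rate, the rate of substitution equals the mutation rate, so the phylogenetic estimate comes down to one division. Two lineages accumulate differences independently after they split, which means the observed divergence has to be spread over twice the number of generations since the common ancestor. The sketch below is a minimal illustration of that step. The function name and both input values are hypothetical placeholders, not data from the book.

```python
def phylogenetic_mutation_rate(divergence_per_site, generations_since_ancestor):
    """Per-site, per-generation mutation rate from neutral divergence.

    divergence_per_site: fraction of neutral sites that differ between the
        two species being compared.
    generations_since_ancestor: generations along ONE lineage back to the
        common ancestor.
    """
    # Both lineages have been accumulating mutations, hence the factor of 2.
    return divergence_per_site / (2 * generations_since_ancestor)


# HYPOTHETICAL inputs chosen only to show the arithmetic.
rate = phylogenetic_mutation_rate(
    divergence_per_site=0.012,
    generations_since_ancestor=250_000,
)
print(f"{rate:.2e} mutations per site per generation")
```

Multiplying a per-site rate like this by the number of base pairs in a diploid genome converts it into mutations per generation, which is the form used by the other two methods.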
The direct method of calculating mutation rate
Comparing the sequences of a child and both parents gives you the number of new mutations per generation. (p. 88)
[Direct Measurement of Human Mutation Rate] [Parental age and the human mutation rate] [Estimating the Human Mutation Rate: Direct Method] [Human Mutation Rates] [Human mutation rates - what's the right number?] [Somatic cell mutation rate in humans]
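The direct method can be thought of as a set difference: any variant seen in the child but in neither parent is a candidate new mutation. The sketch below illustrates the idea with toy data. The `Variant` record, the function name, and all the example variants are hypothetical, and a real analysis would also need to filter out sequencing errors and somatic mutations, which this sketch ignores.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A difference from the reference sequence (hypothetical minimal record)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def de_novo_variants(child, mother, father):
    """Return variants present in the child but absent from both parents."""
    return child - (mother | father)


# TOY DATA, made up for illustration only.
mother = {Variant("chr1", 1000, "A", "G"), Variant("chr2", 500, "C", "T")}
father = {Variant("chr1", 1000, "A", "G"), Variant("chr3", 42, "G", "A")}
child = {
    Variant("chr1", 1000, "A", "G"),  # inherited
    Variant("chr3", 42, "G", "A"),    # inherited from the father
    Variant("chr5", 777, "T", "C"),   # in neither parent: candidate new mutation
}

new_mutations = de_novo_variants(child, mother, father)
print(f"{len(new_mutations)} candidate new mutation(s): {sorted(new_mutations)}")
```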
You are not Craig Venter
Craig Venter's genome sequence was the first one to include all 46 chromosomes separately. The amount of heterogeneity in human genomes means that no two individuals are alike. (pp. 89-90)
[What happens when twins get their DNA tested?] [Genetic variation in the human population] [Genetic variation and the complete human genome sequence] [Sequencing both copies of your diploid genome] [Sequencing human diploid genomes] [All about Craig]
Revisiting the genetic load argument
Given the mutation rate and the probability of deleterious mutations, only a small percentage of the human genome can be susceptible to mutation or our species would go extinct. (pp. 90-94)
[Revisiting the genetic load argument with Dan Graur]
   Box: Human gene knockouts (pp. 92-93)
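The core of the genetic load argument is a simple proportion: if new mutations land more or less uniformly across the genome, the number that hit functional DNA scales with the fraction of the genome that is functional. The sketch below shows that scaling, using the 138 mutations per generation from the biochemical estimate and the roughly 10% figure from sequence conservation. The other fractions are hypothetical comparison points, the function name is made up, and the uniform-targeting assumption is a simplification.

```python
def mutations_in_functional_dna(new_mutations, functional_fraction):
    """Expected number of new mutations per generation that fall in functional DNA.

    ASSUMPTION: mutations are spread uniformly over the genome.
    """
    return new_mutations * functional_fraction


NEW_MUTATIONS = 138  # per generation, from the biochemical estimate in the text

# 0.10 reflects the 8-10% conserved figure; 0.50 and 1.00 are HYPOTHETICAL
# comparison points, not claims made in the book summary.
hits = {
    fraction: mutations_in_functional_dna(NEW_MUTATIONS, fraction)
    for fraction in (0.10, 0.50, 1.00)
}

for fraction, expected in hits.items():
    print(f"{fraction:.0%} functional -> about {expected:.0f} hits per generation")
```

Only some of the mutations that land in functional DNA will be deleterious, so these numbers are upper bounds on the deleterious load under each assumption. The point of the comparison is that the load grows in direct proportion to the functional fraction.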
How much of our genome is conserved?
About 8-10% of the DNA sequences in the human genome are conserved in other species. (pp. 94-95)
Defining function
The best definition of function is the maintenance definition that relies on purifying selection. Functional DNA is any stretch of DNA whose deletion from the genome would reduce the fitness of the individual. (pp. 96-98)
[Identifying functional DNA (and junk) by purifying selection] [On the Meaning of the Word "Function"] [The Function Wars: Part I] [The Function Wars: Part II] [The Function Wars: Part III] [The Function Wars: Part IV] [Restarting the function wars (The Function Wars Part V)] [The Function Wars Part VI: The problem with selected effect function] [The Function Wars Part VII: Function monism vs function pluralism] [The Function Wars Part VIII: Selected effect function and de novo genes] [The Function Wars Part IX: Stefan Linquist on Causal Role vs Selected Effect] [The Function Wars Part X: "Spam DNA"?]
   Box: Levels of selection (pp. 99-101)
   [The Function Wars Part XIII: Ford Doolittle writes about transposons and levels of selection]
Why is the evidence of sequence conservation so hard to accept?
There are several arguments against sequence conservation as an indicator of function. (pp. 101-103)
   Box: Deleting DNA to prove that it is junk (pp. 104-105)
Bulk DNA hypotheses
Skeletal DNA hypotheses. The bodyguard hypothesis. Genetic diversity. (pp. 105-110)
[Teaching about genomes using Nessa Carey's book: Junk DNA]
Medical relevance
Medical relevance is a weak argument for function because mutations in junk DNA can cause genetic diseases. (pp. 110-112)
[Junk DNA vs noncoding]
Ignoring history
Opponents of junk DNA have propagated a false narrative about the history of junk DNA by claiming that scientists in the late 1960s and early 1970s thought that all noncoding DNA was junk. (pp. 112-115)
[The "standard" view of junk DNA is completely wrong] [Junk DNA vs noncoding DNA] [The surprising (?) conservation of noncoding DNA] [More misconceptions about junk DNA - what are we doing wrong?] [Alan McHughen defends his views on junk DNA] [A University of Chicago history graduate student's perspective on junk DNA] [Nature journalist is confused about noncoding RNAs and junk] [What is the dominant view of junk DNA?]
Notes for Chapter 4 (pp. 321-324)