The Nitty Gritty Bit
By Thomas D. Schneider, Ph.D.
Posted October 12, 2005
Introduction
Too much of this "debate" in the media about the origin of living creatures
is fluff - no chewy intellectual concepts. Let's get right down to the nitty
gritty. As reported in all the newspapers, the main claim by advocates of
"intelligent design" is that living things are too complex to have evolved. Is
this right? What is 'complexity'? A practical, widely accepted measure is the
one used in communications systems: information theory, developed by Claude Shannon in the 1940s. If you use a phone or the
internet then you rely on Shannon's theory.
Genetic Control Systems
Now let's think about genetic control systems. An exciting one recently
discovered is the set of controls responsible for maintaining stem cell states. As
reported by scientists at MIT, there are three proteins, Oct4, Sox2 and Nanog,
which bind to DNA and keep the cell a stem cell (Boyer et al.). If you remove the proteins from
the cell, the cell switches into specific cell types. How do these proteins
work? They bind to specific spots on the DNA to control other genes.
Finding Spots on a Genome
So we have a problem: how do the proteins find those spots? If they miss the
right spots, then the controls will be thrown off and the cell will go the wrong
direction making, perhaps, nerve tissue where skin should be, eyeballs instead
of teeth ... legs growing out of your head. Such an organism would
not survive. If a genetic control protein binds to the wrong spots, then the
organism will be wasting its energy and grow slower than one that doesn't waste
its protein manufacturing ability. In some cases binding in the wrong place
could turn on or off the wrong gene. In other words, there is strong
selection for a functional genetic control system.
Now suppose (for simplicity's sake) that an organism had only 32 positions on
its genome. That is, there are only 32 places where a protein could bind. This
is orders of magnitude smaller than genomes in nature, but let's use it as an
example.
How much "effort" is required to find one of those positions? The way to find
out is to divide the genome in half a series of times. (Think about using a
knife to divide a cake into pieces.) Each division requires a bit of
information, the measure introduced by Shannon, to specify the choice made. So:
32/2 = 16 (1 bit)
16/2 = 8 (1 bit)
8/2 = 4 (1 bit)
4/2 = 2 (1 bit)
2/2 = 1 (1 bit)
So 5 divisions or 5 bits is enough information for the protein to locate one
site. Mathematically this is log2(32) = 5. (See the appendix of my Information Theory Primer for a lesson on logs from the
ground up. Don't worry, I won't tell anyone that you read it.)
Now, suppose that instead of there being only one of 32 possible sites to
which the protein could bind, let us say that it could bind to two. If the
protein binds to either one, it is doing its job since it doesn't matter which
one it binds to and the other one can be found by another copy of the protein.
So one of the divisions doesn't matter and only log2(32/2) = 4 bits
are needed. Likewise, if there were 4 binding sites, only log2(32/4)
= 3 bits would be needed.
This number is called Rfrequency because it is based on the frequency of
sites and the size of the genome. The R follows Shannon's notation: rate of
information transmission. Shannon worked with bits per second in communications.
In molecular information theory, we work with bits per binding site. If a
protein binds at a certain rate in binding sites per second, then the two
measures would be equivalent.
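The halving argument and the Rfrequency formula can be checked with a few lines of Python. This is only a minimal sketch: the function names `halvings` and `r_frequency` are made up for illustration, and the halving count assumes the number of positions is a power of two, as in the 32-position example.

```python
import math

def halvings(positions: int) -> int:
    """Count how many times a power-of-two number of positions
    must be cut in half before a single position remains."""
    count = 0
    while positions > 1:
        positions //= 2
        count += 1
    return count

def r_frequency(genome_positions: int, num_sites: int) -> float:
    """Bits needed to find any one of num_sites binding sites
    among genome_positions possible positions: log2(G / sites)."""
    return math.log2(genome_positions / num_sites)

print(halvings(32))        # 5 divisions
print(r_frequency(32, 1))  # 5.0 bits
print(r_frequency(32, 2))  # 4.0 bits
print(r_frequency(32, 4))  # 3.0 bits
```

Each extra binding site that "doesn't matter" removes one of the halvings, which is why doubling the number of sites lowers the answer by exactly one bit.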
Patterns in the Genome at Binding Sites
Now let's look at this same problem from a radically different angle. Let's
collect the binding sites together and align them. There sometimes is enough
pattern in the DNA sequences to do this by eye. Suppose that one position in the
sequences is always a T. This means that when the protein is searching for its
binding site, it will pick T out of the four genetic bases (A, C, G and T). So
it makes a 1 in 4 choice, or log2(4/1) = 2 bits.
Now consider a position in the binding sites that is A half of the time and G
half of the time. How many bits is that? Well, the protein picks 2 out of 4 or
log2(4/2) = 1 bit.
Finally, suppose that there is a position that isn't contacted by the
protein. Then we will observe all four bases and the protein picks 4 out of 4 or
log2(4/4) = 0 bits.
So we can look at the binding sites and find out the information at different
positions in the sites. Shannon picked logarithms so that information can be
added. This turns out to be an excellent first order computation for binding
sites. So we can sum up the information across the different positions of a
binding site to get a total. This is called Rsequence because it is information
computed from DNA sequence data.
In nature things are a little more complicated because the frequencies of
bases aren't always 100%, 50% or 25% as in the examples above, but fortunately
Shannon's method lets us measure the information in other cases, and the results
are consistent with what we found above. If you want to know the details, you
can read the Information Theory Primer.
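For readers who like to see the arithmetic, here is a rough Python sketch of the general case. It assumes the standard Shannon uncertainty H = -sum(f * log2(f)) and takes the information at a position to be 2 bits (the maximum for four bases) minus H. The function names and the example frequency tables are hypothetical, and the sketch leaves out refinements such as corrections for a small number of aligned sites.

```python
import math

def position_information(freqs: dict) -> float:
    """Information (bits) at one aligned position: log2(4) minus the
    Shannon uncertainty of the observed base frequencies."""
    uncertainty = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return math.log2(4) - uncertainty

def r_sequence(positions: list) -> float:
    """Sum the information over all positions of the binding site."""
    return sum(position_information(p) for p in positions)

# The three example positions from the text (frequencies are assumed exact).
always_t = {"T": 1.0}
a_or_g = {"A": 0.5, "G": 0.5}
uncontacted = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(position_information(always_t))     # 2.0 bits
print(position_information(a_or_g))       # 1.0 bit
print(position_information(uncontacted))  # 0.0 bits
print(r_sequence([always_t, a_or_g, uncontacted]))  # 3.0 bits
```

The same function accepts uneven frequencies, such as 70% A and 10% each of C, G and T, which is the situation usually found in real binding sites.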
Finding Spots on a Genome versus Patterns at Binding Sites
So we have two measures, Rfrequency and Rsequence. How are they related? In a variety of
genetic control systems in nature, they have almost the same value (Schneider 1986). The result makes intuitive sense: the
amount of information in the binding site (Rsequence) is just enough to find the
binding sites in the genome (Rfrequency). As in any scientific theory, there are
interesting exceptions which are teaching us fascinating new biology, but I'll
let you read those stories from my web site.
So now we have an evolutionary problem. How did this situation of Rsequence
being approximately equal to Rfrequency come about? Clearly the pattern at a
binding site can change more rapidly than the total genome size. Also, the
number of genes that could be controlled is pretty much fixed by the organism's
current circumstances and that won't change rapidly compared to single
mutations. So Rfrequency is more or less fixed over time. Even if the genome
were to double in size, Rfrequency would only go up by 1 bit, so it is a pretty
insensitive measure. So this means that Rsequence should evolve towards
Rfrequency.
Can we model this process? That question led me to write the Ev computer
program. You now have enough background to explore that.
If you are eager, just launch the new Java version of Ev, Evj, recently
written by Paul C. Anagnostopoulos:
(If that doesn't work on your
computer, go to: http://www.ccrnp.ncifcrf.gov/~toms/paper/ev/evj/ and
follow the instructions there.)
Click the 'Run' button in the window that opens up. Look closely and you will
see the values of Rfrequency and Rsequence as the evolution proceeds. Be sure to
watch until at least generation 700.
There is a Guide to Evj:
http://www.ccrnp.ncifcrf.gov/~toms/paper/ev/evj/evj-guide.html
which explains all that odd flickering and jumping around that happens, and
you can read the original scientific paper at
http://www.ccrnp.ncifcrf.gov/~toms/papers/ev/
There are lots more resources on my web site to learn about information theory, bits and how to apply these to molecular biology.
Have fun!
Understanding "Intelligent Design"
Now let's see if the core idea of "intelligent design" holds up. ID
proponents frequently compute the probability of a pattern in nature and then
claim that it couldn't happen 'by chance'. Of course the error here is that
there is replication and selection going on all the time, but they sweep that
under the rug.
Let's take the standard Evj run that you will get if you click Run on the Evj
window and don't do anything else (except, maybe, crank up the speed). The
results are shown in this screenshot:
http://www.ccrnp.ncifcrf.gov/~toms/papers/ev/evj/icons/evj-screenshot.jpg
The genome has 256 positions to which the recognizer can bind and there are
16 sites (marked by green bars) so Rfrequency is log2(256/16) = 4.00
bits. At 10,000 generations Rsequence is 3.19 bits and at 20,000 generations it
is 4.18 bits. Just as we observe in nature, Rsequence is close to Rfrequency!
This result shows that the Ev model of evolution is reasonable because it gives
the observed result. Rsequence fluctuates around 4 bits; let's use that. So the
total information that has evolved in the 16 binding sites is 16 × 4 = 64 bits. If
I flip a coin 64 times, what are the chances it comes up heads every time? Once
every 2^64 = 18446744073709551616 = 1.84 × 10^19 times. If a
"reproduction" (a set of 64 coin flips) occurred every second, how long would
this take? 1.84 × 10^19 / (60 × 60 × 24 × 365.25) = 5.85 × 10^11
years. That's 585 billion years! By comparison, the universe
is known to be only 13.7 billion years old and Earth has only been around 4
to 5 billion years. So, according to the "intelligent design" advocates there
isn't enough time to evolve 16 sites with 4 bits each.
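The coin-flip arithmetic is easy to reproduce. The short Python sketch below uses exact integer arithmetic for the number of outcomes, which gives 2 to the 64th power as 18446744073709551616, and assumes one "reproduction" per second and a year of 365.25 days, as in the calculation above. The variable names are only for illustration.

```python
# 16 binding sites at about 4 bits each
bits = 16 * 4

# Number of equally likely outcomes of 64 independent coin flips
outcomes = 2 ** bits

# Seconds in a year, counting a year as 365.25 days
seconds_per_year = 60 * 60 * 24 * 365.25

# Years needed at one set of 64 flips per second
years = outcomes / seconds_per_year

print(bits)             # 64
print(outcomes)         # 18446744073709551616
print(f"{years:.3g}")   # 5.85e+11, that is, about 585 billion years
```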
But all you have to do is try the Evj program several times with different
initial random seeds to see that 64 bits (or more!) can evolve rather quickly in
several hundred generations.
So what's wrong with that often-repeated "intelligent design" argument?
Several things:
- It neglects natural selection, which works small step by small step
instead of all at once. In other words, it is not legal to multiply to compute
those probabilities because, as everyone should recall from high school
probability, probabilities only multiply if the events are independent.
I call this The AND-Multiplication Error. The next step of
evolution (the children) depends on variations (mutations) of the previous step
(the parents), so the evolutionary process is not independent.
- It neglects populations. A larger population will produce useful
mutations faster. You can try this with the Ev program.
"Intelligent design" advocates have never, to my knowledge, admitted to this
error.
But that error demolishes their claims.
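To get a feel for why step-by-step selection is so different from one giant independent draw, here is a toy Python sketch. It is not the Ev model: the "fitness" is simply the number of heads in a string of 64 coin flips, each child differs from its parent by one flipped coin, and the best string is kept each generation. The function name, the offspring count of 16 and the seed are all assumptions chosen for illustration.

```python
import random

def evolve_all_heads(n_bits=64, offspring=16, seed=1, max_generations=100_000):
    """Toy cumulative selection: return the number of generations
    needed to reach all heads (1s), starting from random flips."""
    rng = random.Random(seed)
    parent = [rng.randint(0, 1) for _ in range(n_bits)]
    generations = 0
    while sum(parent) < n_bits and generations < max_generations:
        children = []
        for _ in range(offspring):
            child = parent[:]            # replication
            i = rng.randrange(n_bits)    # one random mutation
            child[i] ^= 1
            children.append(child)
        best = max(children, key=sum)    # selection: most heads wins
        if sum(best) > sum(parent):      # keep the parent unless a child is better
            parent = best
        generations += 1
    return generations

generations = evolve_all_heads()
print(generations)             # number of generations used in this run
print(generations < 2 ** 64)   # True: far fewer than 2^64 independent tries
```

Because each generation builds on the previous one, the 64 "events" are not independent, so the 2^64 figure never enters the picture. Raising `offspring` in this toy generally shortens the run, which echoes the point about populations.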
Conclusion
This paper does not give examples where the parameters of size of genome,
number of sites and mutation rate match those in nature, but this could be done.
However, several factors imply that the appearance of the observed 'complexity'
(information) in nature was indeed achieved by evolution.
First, in nature the mutation rate is controlled by the organism so the rate
can increase or decrease to obtain advantage. The case that demonstrates this is
that there are mutations in bacteriophage T4 DNA polymerase that decrease the mutation rate. The other relevant
case is HIV, which gains an advantage by keeping a high mutation rate and so
evades the immune system. Of course the fear that there may be a bird flu pandemic (CNN 2005 October 11) is based on our
understanding of the rapid evolution of influenza. Without understanding
Darwinian evolution, we could be hit by a major disaster. By understanding
evolution we have a chance to avert the disaster.
Second, as discussed above, there is plenty of time for the basic
housekeeping genes to have evolved. With horizontal transfer of genetic
material, innovations in one species can end up in other species.
Third, microorganisms are abundant. The Evj model evolves faster when
there is a larger population because more genetic material is exposed to
mutations.
Still, the details of the time of evolution of various features of our
complex heritage need to be worked out. In real science there is always more to
do.
Acknowledgments
Thanks to Mark Perakh of The Panda's Thumb
for useful suggestions and for encouraging me to write this article, and to Pete
Dunkelberg, Russell Durbin, Richard Hoppe, Burt
C. Humburg, Erik "12345", and Danielle Needle for useful comments.
References
-
Intelligent Design claim:
-
Boyer et al.:
- Scientists discover key to embryonic stem-cell
potential MIT News Office, David Cameron, Whitehead Institute September
8, 2005;
- Core transcriptional regulatory circuitry in human
embryonic stem cells, Boyer LA, Lee TI, Cole MF, Johnstone SE, Levine
SS, Zucker JP, Guenther MG, Kumar RM, Murray HL, Jenner RG, Gifford DK,
Melton DA, Jaenisch R, Young RA. Cell. 2005 Sep 23;122(6):947-56.
- Young Lab
-
Antimutator: For papers on antimutator
polymerases, search for "bacteriophage T4 DNA polymerase mutation rate" at PubMed.
-
Commentary at The Panda's Thumb.