Preprints & Reprints

Biology, Computers, Sex & Sorting

1. Introduction

For over 50 years computer science has stared longingly at biology for inspiration. Von Neumann spent years trying to build self-replicating machines, believing this to be the key to life. The cellular automata that arose from this work are still key to alternative parallel computing architectures. Turing considered morphogenesis one of the most interesting phenomena to study and wrote seminal papers on the subject. Artificial Intelligence (AI) and Artificial Life (AL), two major disciplines in computer science, draw their philosophy from biology. So why isn't there more interchange between the disciplines? Here we describe areas of biology and computer science that are now creating cross-discipline developments of extreme importance to both camps. In addition, we outline the way these new developments, together with state-of-the-art correlation, visualisation and modelling techniques, could be utilised by biologists to aid and advance the understanding of the complex systems found in nature.
However, there is a wide range of phenomena still remaining to be examined:
The attraction of biological systems is that they are error and noise tolerant, robust, asynchronous, chaotic, and non-deterministic. They represent a computing domain far from Von Neumann machines and may be absolutely necessary to cope with the emerging and chaotic world of networked machines. Recent developments in computer science offer biologists a huge array of new tools.

Biology can be characterised as a data-rich field where experimentation is difficult due to the massive number of parameters and hidden variables. Many biological systems show extremely non-linear responses to changes in the environment that negate traditional modelling and reductionist approaches. However, computer modellers are beginning to develop techniques to study the ensemble or family properties of complex systems. So it is now possible to make general statements and exploit noise-tolerant pattern matching techniques. It is also possible to use self-correlating and evolutionary systems to extract significance where the nature of any correlation is unknown.

Recent developments in data visualisation allow us to observe correlations that may be difficult to isolate statistically. They take advantage of the best pattern recognition system - the human visual system - to aid biologists in understanding their data. State-of-the-art data mining and visualisation tools are now available for analysing everything from ecosystems to DNA.

Biology is a field where experiments are often difficult to repeat due to complexity, connectivity and sensitivity to initial parameters. New models and data analysis tools based on a sound understanding of non-linear dynamics and chaos are therefore critical. Moreover, a coalescing of modelling and experimentation is essential to advance understanding in this complex field. The days of the microscope, pencil, paper and intuition alone are long gone. Although raw computer power has increased the ability to use modelling to explain results and discoveries in engineering, physics, economics and many other fields, its application to biology seems to have been relatively neglected. Biologists should be at the forefront of system modelling, because they are trying to decode the most complex systems in nature. Curiously, biology has been a source of inspiration for some of the most important concepts and tools in computing. Today, there is a fast growing use of neural networks, genetic algorithms, and attempts at sexual reproduction in software.

2. Learning from Biology

In developing evolutionary systems to run as computer models we naturally start with biological analogies. Evolution is driven by a cycle of reproduction (which must produce a diversity of offspring) and selection (which, statistically, sorts the strong from the weak). Selection is relatively straightforward given a test environment which compares the performance of different programs. A technique called co-evolution ensures that the programs are always tested against the hardest examples in the population. This speeds up the process compared with a constant set of tests, or with testing against every possible example, which is a very lengthy and time consuming process. The key design techniques lie in the nature of the reproductive process, and turn out to have profound implications for the selection of adaptable systems. Currently the most widely used systems are Genetic Algorithms (GA) and Genetic Programs (GP).
In the former case a problem is represented as a set of symbols in a linear string for which parameters or values must be found. The symbols form an effective program that is evaluated by the test harness and used to generate the individual's fitness. Sex (in the form of cross-over) consists of swapping parts of two parental strings to produce an offspring. Mutation is a random change of one of the symbols. In Genetic Programs the program is represented as a branching decision tree in which sub-branches are swapped over. Both techniques have proved very powerful for finding solutions to small but complex problems.

To date GAs and GPs employing these techniques appear to be limited to small programs of about 100 symbols. In contrast, DNA and biological evolution appear to involve millions of symbols - providing an existence theorem for process improvement. One major limitation of both techniques is that the creatures compete indirectly through an external fitness function which never changes - their world is static. This is a common mistake when drawing direct parallels between the disciplines: an incomplete extrapolation from, or modelling of, biology. The real world of biological organisms consists of dynamic entities, so static systems modelling can lead to major errors and spurious artefacts.

When we move to a silicon world we are no longer bound by the limits imposed on carbon-based biological organisms. The first of these concerns children. In biological systems there is no mechanism by which parents can learn from their offspring's success in producing new and improved children. Improvement is a slow, generation-by-generation process. This need not be the case for silicon. For instance, we have designed an evolutionary system where a parent produces two offspring and uses their success or failure to judge how to produce successively better children. These virtual children thus form part of a learning process. On a classical set of test problems, the virtual children system outperforms the best existing algorithm (DHC) on 3 out of 5 of the test set, and is beaten on 2 out of 5. However, the virtual children algorithm makes no hidden assumptions, unlike DHC, which exploits the symmetry in two of the problems to solve them fast. This is a clear warning that when we learn from biology we should not be slavishly bound by the best biological outcomes. In silicon we can be both recursive and postscriptive at the same time - using the benefit of hindsight to advantage.
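To make the GA machinery described above concrete, here is a minimal sketch in Python of the string representation, cross-over and mutation, with the recombination written to accept any number of parents so that it also covers the multi-sex variants explored in the next section. The symbol alphabet, fitness function and parameter values are illustrative assumptions, not the system used in our experiments.

```python
import random

SYMBOLS = "0123456789"          # illustrative symbol alphabet (an assumption)
GENOME_LENGTH = 100             # "about 100 symbols", as noted above

def random_genome():
    return [random.choice(SYMBOLS) for _ in range(GENOME_LENGTH)]

def crossover(parents):
    """Build one child by splicing segments from two (or more) parental strings."""
    cuts = sorted(random.sample(range(1, GENOME_LENGTH), len(parents) - 1))
    bounds = [0] + cuts + [GENOME_LENGTH]
    child = []
    for parent, (lo, hi) in zip(parents, zip(bounds, bounds[1:])):
        child.extend(parent[lo:hi])
    return child

def mutate(genome, rate=0.01):
    """Randomly change a few symbols."""
    return [random.choice(SYMBOLS) if random.random() < rate else s for s in genome]

def evolve(fitness, pop_size=50, n_parents=2, generations=200):
    """Plain generational GA: the external fitness function never changes (a static world)."""
    population = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]        # selection weeds out the weaker half
        children = [mutate(crossover(random.sample(survivors, n_parents)))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)

# Toy usage: evolve a string whose digits sum to the largest possible value.
best = evolve(lambda g: sum(int(s) for s in g))
```

Ordinary two-parent cross-over corresponds to n_parents=2; the multi-sex experiments described next simply vary that number.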
2.2 The N-sex Problem

Sex and evolution are tightly bound, and from these roots also spring many other ideas related to the adaptability and optimisation of biological organisms and engineering systems. These concepts can return to affect biology in many ways, in particular by equipping biologists with powerful and unusual search/compare/find tools for handling large, complex and error-prone data sets.

To examine "why two sexes?", a simulation was devised in a world of 6 species, each containing 100 starting individuals. The environment also contained 10,000 individuals of a seventh species. We could consider the first group as flowers and the second group as insects (specifically bees), with the second group looking to gain energy and the first trying to get themselves pollinated at minimal cost. Critically, the 'bees' were a diverse group wanting different features from the flowers in terms of colour, shape and scent to convince them the best pollen was there.

In each generation the bees moved around the flowers to achieve the highest pay-back. The best flowers breed, the poorer ones die. Now in this world there is a diverse and complex task (how do I make the most 'profit', i.e. attract bees at minimum cost) which is made worse by the choices of competing species. Of course, this analogue is much wider than flowers and bees; it could represent the way companies attract customers, or adaptive algorithms learn to manage traffic flow across international networks.

A number of system parameters were systematically varied, and 2000 trials run for each combination of variables from randomly selected starting points. The 6 species each had a different number of parents for their offspring: either asexual, or between 2 and 6. The other variables included:

The assumption at the beginning of experimentation was that two sexes would win, because that is what happened in nature, and we could think of good statistical reasons why 3 or more would be more disruptive. The results, however, contradicted our beliefs. We found that (1) had little effect - it simply led to high-sex species dying out slightly more rapidly on occasion due to the disappearance of one or more sexes from the population. In a similar manner, (3) was of slight consequence - all the dynamics arose from interactions between species rather than the small changes in the environment. The effects observed were dominated by (2) and by what we decided constituted a success. The results can be summarised as follows:
Why should this be? A first clue lies in early results when we were developing the system and had a less complex world. Then the battle for dominance was between 3 and 4 sexes; but as a more complex and dynamic environment was created, 4 prevailed. This implies that the system dynamics determine the best learning algorithm. A little thought suggests this is axiomatic. But, because most work on learning systems is inspired by chasing optimal solutions where time is immaterial, the answers normally found are very different. Basically, two sexes lead to less disruption than multi-sex, and when a good solution is found the population is not dispersed so easily from it. However, the disruptive effect of multi-sex means populations search much faster than with two sexes. If there is sufficient time, two sexes are best; otherwise three, four or more search progressively faster, but find poorer overall optima. More evidence for this lies in the time-series evolution of single example populations (Fig 1).

Figure 1a: A typical plot showing the number of individuals in 3 species (solid = 6 sexes, dashes = 4 sexes, dots = 2 sexes) over 100 generations. Note the wild swings in the 6 sex population size.
Figure 1b: Based on total profit, 6 sexes come out best due to the number of users.
Figure 1c: Based on efficiency, 4 sexes win. The percentage swings in the 6 sex case are much larger than for 2 as evolution progresses.

Thus the answer to the long-held question of why there are two sexes in biology can be found in the dynamics of the world. When we try to evolve adaptive computer systems, the optimum number of sexes may be very different. Our conclusions are simple but profound: in biology two sexes may be best, but in a computer many sexes may be better. It just depends on the type of problem to be solved.

The models show another interesting phenomenon: if you start with a big enough advantage and a moderate amount of flexibility, your smaller, smarter competitor may be unable to win regardless. Many believe that if you are smarter and more adaptable you win in the long run. Evolutionary modelling shows this is not the case. We do not have to be as adaptable as smaller competitors, only sufficiently adaptable to beat them.

2.3 Massive Populations - More than a guess!

In the natural world there does not seem to be a mechanism to backtrack on the evolutionary road. Once you have evolved into a humanoid, or a dinosaur, you cannot evolve back into a mouse. So far in the silicon world we face the same limitation. But here it is not a fundamental limitation. In principle, new sexes could be promoted with the characteristics required to adapt to niche changes, provided that the world changes relatively slowly. All the evidence in the carbon and silicon worlds is for fast change - so this prospect is one worthy of more investigation. Restarting life at the end of every cataclysm is not a very efficient or effective option for carbon or silicon life.

2.4 Information Processors

Within cells, increasingly complex molecules are produced from simpler components in an intricate series of production lines. The resulting products, mostly proteins, maintain the cell's structure and operation. The reactions which transform molecules into the required products are controlled by enzymes, which selectively accelerate the necessary chemical reactions by many orders of magnitude.
Probably the two most important features of enzymes are that they exhibit specificity for particular molecules (substrates) and, indirectly via the reactions they catalyse, carry out particular functions which transform substrates into products. The specificity is associated with one or more enzyme binding sites to which substrate(s) can bind, whilst an enzyme's active site plays the key part in enabling, and accelerating, a particular reaction and thus provides the enzyme's function. The specificity and function of a particular enzyme derive from the enzyme's shape and the positions of its constituent amino acids, determined by a process known as protein folding. Whilst enzymes exhibit immense pattern specificity, they only carry out one of four chemical operations - break, make or rotate bonds, or transport a molecule. In biology we frequently observe a duality of complexity and simplicity that is hard to explain. The information processing is carried out despite the fact that information (substrates) moves around the system asynchronously and at random. Overall control and integrity are maintained by manipulating concentrations (and thus the rate of processing). Robustness comes from the massive parallelism.

The system we have examined as an analogue of these processes consists of many processing elements. These are analogous to enzymes in that they have specificity, via pattern matching, for their operands (data as opposed to molecules) and cause the data to be transformed in some way. The interactions and dynamics of the system taken as a whole, operating in parallel and asynchronously, comprise the information processing system. Each enzyme performs an interim task in the context of the overall system, in a similar manner to a particular machine working in a factory production line.

We are interested in the processing capabilities of biological cells and what can be learnt and used in the design of information processing systems. By analogy between molecules and data, and enzymes and processing units, cells can be viewed as processors which take simple building blocks and transform them to build increasingly complex structures. As cells have solved many complex problems in the interests of survival and have clearly proved to be evolvable, we may learn important lessons for designing silicon systems which demonstrate similar attributes. The attributes of the biological counterpart which are of particular interest include: robustness against small fluctuations (homeostasis), processing dynamics (enzyme kinetics), the pattern matching of enzymes leading to specificity for substrates, and evolvability. A model for information processing designed with such an architecture is very different from the traditional Von Neumann architecture, so we design systems allowing us to explore the impact of each of these issues.

The information processing architecture consists of an enclosure, which we call a sac, containing processing enzymes, data items and, potentially, further constituent sacs corresponding to intracellular vesicles. Enzymes are capable of pattern matching data, thus providing enzyme specificity, and of carrying out operations on the data, which provide the enzyme functionality. The specificity provides a mechanism for consistent identification of appropriate classes of data to operate upon. The functionality typically transforms the data in some way, such as combining data into a more complex data structure or manipulating data field values.
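As a concrete, much-simplified illustration of this architecture, the sketch below models a sac as a container of data items and enzymes, where each enzyme couples a pattern-matching predicate (its specificity) with a transformation (its functionality). The class and field names are ours, chosen for illustration; they are not taken from the actual implementation.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List

# A data item is just a list of fields (the analogue of a molecule).
DataItem = list

@dataclass
class Enzyme:
    """Couples specificity (a predicate over operands) with functionality (a transformation)."""
    name: str
    arity: int                                             # how many data items it binds at once
    matches: Callable[[List[DataItem]], bool]              # specificity: can these operands bind?
    transform: Callable[[List[DataItem]], List[DataItem]]  # functionality: the reaction products

@dataclass
class Sac:
    """An enclosure holding enzymes, data items and, potentially, nested sacs."""
    enzymes: List[Enzyme] = field(default_factory=list)
    data: List[DataItem] = field(default_factory=list)
    sacs: List["Sac"] = field(default_factory=list)

    def step(self):
        """One asynchronous step: a randomly chosen enzyme meets randomly chosen
        operands; if they match, the operands are consumed and products released."""
        if not self.enzymes or not self.data:
            return
        enzyme = random.choice(self.enzymes)
        if len(self.data) < enzyme.arity:
            return
        operands = random.sample(self.data, enzyme.arity)
        if enzyme.matches(operands):
            for item in operands:
                self.data.remove(item)
            self.data.extend(enzyme.transform(operands))
```

Because each step binds randomly chosen operands, the overall behaviour depends on concentrations (how many copies of each item are present) rather than on any fixed schedule - the property exploited in the sorting example that follows.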
For real enzymes the specificity and functionality result from the enzymes' shape; the enzyme attributes described here could be derived from shape via a simple analogue of protein folding. It is worth noting that although real enzymes catalyse pre-existing chemical reactions rather than carrying out the reactions themselves, the enzymes here are best viewed as active processors performing a particular transformation of data. This is equivalent to assuming that a corresponding reaction (operation) exists and that the enzyme (processing unit) is capable of catalysing it.

In order to use the architecture, it is necessary to specify certain details. The following are a set of minimal requirements:
These details can be specified manually for simple information processing tasks, but eventually they will evolve. A system based on the architecture outlined above was built to investigate a particular information processing application, namely sorting a set of integers into an ordered list (Fig. 2). To carry out this task a data representation was defined as a list comprising a low-end terminator, one or more integers, and a high-end terminator. Two types of processing enzyme were defined, named join and break. A join enzyme simultaneously binds the high-end and low-end terminators of two separate data items. The operation it performs is to concatenate the two lists into a single list by removal of the terminators, but only if the integer value adjacent to the high-end marker is less than the integer value adjacent to the low-end marker. This enzyme acting alone results in ordered lists of integers, but typically with large steps between successive integers. We also define a break enzyme which simultaneously binds the high-end and low-end markers of one data item together with another data item of unit length. The operation it performs is to break the first list into two by addition of terminators, but only if the value of the single integer is both greater than the integer adjacent to the low-end marker and less than the integer adjacent to the high-end marker.

Figure 2: A schematic of an information processor using 'enzymes'.

Having defined the enzymes, it only remains to define the system kinetics, which determine the rate at which data is processed, and the concentrations of data and enzymes. In this implementation data items and enzymes were assigned random positions on a toroidal surface with random initial velocities. Binding takes place when data and enzyme approach within a fixed distance and the necessary conditions are met. Concentrations depend on the number of join and break enzymes and data items introduced into the system. Following the binding of data to an enzyme, the transformed data (the reaction products) are released with random velocities after the appropriate function has been applied to the original data items.

A single sac was created containing only data items of unit length (a single integer between low and high terminators) together with multiple join and break enzymes. Each integer to be sorted is converted into multiple instances of the corresponding unit-length data list with the given integer value. The system was then iterated through time with the join and break enzymes manipulating the data lists. As the number of data items falls due to lists joining together, more unit-length data lists are introduced to compensate. These are created using integers selected randomly from the set of integers we wish to sort.

Simulation results are presented in Fig. 3, in which sorted list frequency is plotted against time. It can be seen that sorted lists grow in length over time, as might be expected, with the longest lists representing the best solutions so far. Intermediate lengths of list initially rise in concentration and then fall, exactly as would be anticipated for intermediates in a chemical reaction with limited starting materials.

Figure 3: Dynamics of list processing by a 'chemical' information processor. The rise and decay of intermediate products follow a similar pattern to that observed in chemical reactions.

The simulation illustrates that an information processing architecture (Fig. 2) can successfully carry out specific application tasks.
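The join and break rules just described can be captured in a few lines. The sketch below strips out the spatial kinetics (positions and velocities on the torus) and simply picks random collision partners each step, which is enough to illustrate how the two rules interact; the split position chosen by the break rule, the step counts and the replenishment policy are our own illustrative assumptions.

```python
import random

LOW, HIGH = "L", "H"   # low-end and high-end terminators

def unit(n):
    """A unit-length data item: a single integer between terminators."""
    return [LOW, n, HIGH]

def try_join(a, b):
    """Join enzyme: concatenate a and b if the value next to a's high-end
    terminator is less than the value next to b's low-end terminator."""
    if a[-2] < b[1]:
        return a[:-1] + b[1:]          # drop the inner terminators
    return None

def try_break(lst, single):
    """Break enzyme: if the unit item's value lies strictly between the values
    adjacent to lst's terminators, split lst into two terminated lists
    (here at the insertion point of that value - the split position is an assumption)."""
    n = single[1]
    values = lst[1:-1]
    if values[0] < n < values[-1]:
        i = next(k for k, v in enumerate(values) if v > n)
        return [[LOW] + values[:i] + [HIGH], [LOW] + values[i:] + [HIGH]]
    return None

def simulate(values, steps=20000, copies=20):
    pool = [unit(v) for v in values for _ in range(copies)]
    for _ in range(steps):
        a, b = random.sample(range(len(pool)), 2)
        item_a, item_b = pool[a], pool[b]
        if random.random() < 0.5:                      # a collision with a join enzyme
            joined = try_join(item_a, item_b) or try_join(item_b, item_a)
            if joined:
                pool = [x for i, x in enumerate(pool) if i not in (a, b)] + [joined]
        elif len(item_b) == 3:                         # a break enzyme needs a unit-length partner
            halves = try_break(item_a, item_b)
            if halves:
                pool = [x for i, x in enumerate(pool) if i != a] + halves
        while len(pool) < copies * len(values):        # replenish unit items as lists join
            pool.append(unit(random.choice(values)))
    return max(pool, key=len)          # the longest list is the best solution so far

print(simulate([5, 1, 9, 3, 7]))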
The processing happens in a highly parallel manner and its dynamics depend on the concentrations of both data items and enzymes present. Note, for instance, that the break enzymes can only perform their role while a supply of unit-length data lists is present. Whilst we would not propose that this architecture is efficient for number sorting in comparison with traditional algorithms, it does have interesting properties. Processing is highly parallel, depends on the concentrations of data items and on the reaction dynamics, and we can evaluate the best solution available at any time by examining the longest list to date. In addition, the system should be relatively stable to small perturbations such as fluctuations in concentration - whether there is one more or one less instance of a particular data item present will simply alter the reaction rates slightly. This homeostasis may prove important to the system being evolvable.

3. Biology and Computers

In classic flocking simulations, simple local rules - steer towards the centre of the flock, match the speed of your neighbours, avoid collisions - produce lifelike flocks of artificial birds, or 'boids'. Now imagine each boid, rather than flying towards the centre of the flock, flies towards other boids with similar characteristics - being artificial, these could be anything from colour to a liking for architecture. If each boid has a group of characteristics and a group of interests, what happens is that dynamic flocks appear, separating and merging, with individuals moving from one to another. The flock rules make self-correlated behaviour appear, with boids of similar characteristics and interests flocking together. The human eye is quick to spot groupings in this dynamic world. It should also be noted that these flocking systems are noise resistant.

Figure 4: Flocking boids represented as fish; each colour of fish represents a different interest. Closeness of concepts is related to the time the flocks spend together. Easily captured in a movie, difficult to see in a still picture.

So one way to examine DNA sequences for homology is to give each boid a DNA sequence, so that boids attract in proportion to the closeness of their match. The resulting dynamic patterns of boid groups visually separate similar sequences in seconds. Curiously, self-correlating behaviour like this was first studied as a way of improving the appearance of sophisticated graphic displays by making them look more natural and less scripted, but it has yet to be re-used in its original environment - biology. Where data is complex or noisy, it is better to use the dynamic pattern-recognising ability of the human eye than the limited power of pre-programmed correlation analysis and static graphs. If a graph is worth a thousand words, experience with data visualisation of complex systems has shown that a 3D movie is worth a thousand graphs. Here the movie further enhances understanding, since the dynamic movement of boids between clusters emphasises the low-correlation links as well as the high-correlation flocks.
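A minimal sketch of this similarity-driven flocking rule is given below: each boid steers towards its neighbours in proportion to how similar their feature vectors are, rather than towards the flock centre. The similarity measure, step sizes and feature encoding are our own illustrative choices (and collision-avoidance terms are omitted for brevity); this is not the system behind Figure 4.

```python
import numpy as np

rng = np.random.default_rng(0)

N, DIM, FEATURES = 60, 2, 8
pos = rng.uniform(0, 1, (N, DIM))          # boid positions on the screen
vel = rng.normal(0, 0.01, (N, DIM))        # current velocities
feat = rng.integers(0, 2, (N, FEATURES))   # characteristics/interests (or an encoded sequence profile)

def similarity(a, b):
    """Fraction of shared features - stands in for closeness of DNA sequence match."""
    return np.mean(a == b)

def step(pos, vel, attraction=0.05, damping=0.9):
    new_vel = vel.copy()
    for i in range(N):
        # steer towards other boids, weighted by similarity rather than flock membership
        weights = np.array([similarity(feat[i], feat[j]) if j != i else 0.0 for j in range(N)])
        centre = (weights[:, None] * pos).sum(axis=0) / (weights.sum() + 1e-9)
        new_vel[i] = damping * vel[i] + attraction * (centre - pos[i])
    return pos + new_vel, new_vel

for _ in range(200):                       # boids with similar features drift into common flocks
    pos, vel = step(pos, vel)
```

Animating the successive positions (rather than plotting a single frame) is what lets the eye pick out both the stable flocks and the individuals that drift between them.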
3.2 Chaos, Biology and Modelling

No deterministic hierarchical system can deal effectively with such chaos. Only dynamical, low, flat and adaptive systems with a light hand of hierarchy can cope. In some respects it is a miracle that the human race has survived so long on the basis of the old, rigid, rank-ridden and hierarchical systems. How many wars, how much civil unrest, how many revolutions have been prompted by 16th-century management systems trying to control 21st-century technology and civilisations? If anything epitomises the gross difference of approach, it is the telephone network and the Internet, or serial and dynamical parallel computing. Change is inevitable and technology driven.

3.3 Categorisation

4. Conclusions

Already, electronic engineers have discovered new aspects to sex, reproduction, genetics, sorting, and chaos in biology that are hidden by the conventional thinking of a biological society. How rich a marriage it could be - biological computing, software that breeds, machines that think and reproduce might not be the dream they are currently perceived to be.

About The Authors

Chris Winter graduated in Biochemistry from Oxford University in 1980, and gained a doctorate in Solid State Physics from Lancaster University in 1982. He was then awarded an 1851 Research Fellowship to explore novel advanced materials before joining the BT Laboratories at Martlesham Heath in 1985. In the 1980s he was responsible for teams developing some of the new optical materials and devices used to build the "Information Superhighway". In 1992 he moved into research on the radical software systems that are designed to make computers easier to use and more intelligent, and co-ordinated a large team developing new mobile services. He currently leads a research group that is attempting to mimic biology with computers to produce "Artificial Life". Chris has published about 50 scientific papers, holds a number of patents and has been interviewed for newspaper, radio and TV articles on scientific futures.