[WIP] Genetic algorithms

Score: 3.5/5 (132 votes)

What is a genetic algorithm?

Genetic algorithms are a subset of evolutionary algorithms; biologically-inspired search heuristics used for finding solutions to problems where the desired result is known. The genetic algorithm attempts to find the best candidate solution for the problem. This solution is often an approximation of the correct solution, especially in problems where an exact solution is either impossible, intractable (requires infinite time or resources), or simply unnecessary. These algorithms work by "evolving" a solution.

How do genetic algorithms work?

Genetic algorithms work by building an initial set of random potential solutions. A subset of these are selected for "breeding" to produce new potential solutions which then become the new population. This process continues until some termination criteria are met. These might include a "good-enough" (if not exact) solution being found, a lack of improvement in the population (convergence), a set maximum number of generations (i.e. new populations) being met, or a set limit for computational time and resources being met.

From this, we can extract three steps:

Initialisation - the initial population of N randomised candidate solutions (alternative, more biological terms are "individual", "organism" or "chromosome") is created
Regeneration - a new population is created from the previous one
Exit (once termination criteria are met) - the best solution found so far is returned and the algorithm stops running

There are three sub-steps to regeneration:

Selection - a subset of the population is selected algorithmically from the population
Recombination (also "crossover") - the selected individuals are combined to produce new ones
Mutation - the new individuals ("offspring") are mutated to increase genetic diversity

Initialisation

The initial population of N solutions is created randomly. Usually, solutions are encoded as a series of bits (binary digits). These can be considered analogous to the base pairs that make up genes in real DNA, although real-life genes consist of triplets of base pairs which each have one of four possible "values" (nucleotides - adenosine, cytosine, guanine and thymine (in RNA, thymine is replaced by uracil)) while our bits only have two - a 0 or a 1. Also, in biology, a chromosome is a coiled strand of DNA which contains many genes; however, in our terminology, a chromosome will simply refer to a series of bits. The solution's "DNA" can be decoded later. Usually the value of N is in the hundreds or thousands. The value 1,000 is acceptable initially and can be tweaked later.

Regeneration

Selection

During selection, a subset of the population—often two solutions, although more can be used if desired (some research suggests using more than two parents can result in higher quality offspring)—is selected using a selection algorithm. One example is called fitness-proportionate-selection, or roulette-wheel-selection. In this algorithm, individuals are selected at random with a probability based on their fitness, which is a value that represents how close that individual is to being a valid solution (often, it is a value between 0 and 1). Fitness functions will be discussed in more detail later on. Each iteration of FPS returns a single individual, so the algorithm can be applied multiple times to acquire the desired number of parents. Simpler selection algorithms include truncation selection, where the best half, third or some other fraction of the population is selected, and tournament selection, where the best individual from a random subset of the population is selected. Another, more complicated but fairer algorithm is called stochastic universal sampling, which is a modified version of RWS where the solutions are evenly spaced and thus weaker solutions (i.e., those with lower fitness function values) have a fair chance of being selected (although the algorithm still generally selects for greater fitness). The benefit in allowing weaker solutions to be selected is that a weak solutions could be a minor modification away from a much stronger solution, and only allowing the very fittest solutions to be selected may result in lack of genetic diversity of solutions.

Recombination

In recombination, the selected solutions are crossed-over to create new solutions, although often there is a probability of this happening; for example, 70%. This concept of crossing-over is also taken from biology: during meiosis (cellular division which results in four genetically-different cells), corresponding paternal (from the father) and maternal (from the mother) chromosomes come together and may swap genes (cross-over). In a genetic algorithm, this process is simulated in a number of ways. The simplest way is single-point or one-point crossover, in which a random position in the chromosome is selected and everything after that point is swapped with the other chromosome. Other methods are more complex but may produce higher quality offspring and more genetic diversity.

Mutation

After crossing-over occurs, mutation may occur with a very small probability (around 0.1% per bit). In mutation, the chromosome is iterated over and, each bit may be flipped according to a small probability. This is analogous to substitution mutations that occasionally occur during cellular division. Instead of simply flipping bits, one may also prepend, insert, append and or remove them, which would be equivalent to insertion and deletion mutations in biology. In this way, genetic diversity is increased further.

A word on encoding and decoding

As mentioned previously, chromosomes in genetic algorithms are often encoded as a sequence of bits. A group of bits—a gene—may represent a character in a string, for example if one wanted to generate the string "hello world", the letter h might be represented by the binary number 000, e by 001, l by 010, o by 011, space by 100, w by 101, r by 110 and d by 111. Eventually one would hope to stumble upon the sequence 000,001,010,010,011,100,101,011,110,010,111 which would correspond to the string "hello world". The great thing about encoding data like this is that the genetic algorithm can be written very generally—any object which has a fitness function, a crossover function and a mutation function can be used, and the algorithm never needs to know the implementation details. Usually decoding would happen at two stages: once whenever the fitness function needed to be calculated, and once whenever one wanted to display the output of the genetic algorithm.

Just what the hell are fitness functions already?

I've mentioned fitness functions several times without properly explaining what they are. Simply put, they measure an individual's fitness, that is, how close it comes to solving the problem desired. The computation done to produce this result is highly domain-specific, although usually a value between 0 and 1 is desired. In our "hello world" example, the fitness function might decode the binary sequence into an ASCII string and then compare that with the ASCII string supplied as the desired output. The difference between the two would then be converted into a number between 0 and 1 like so: 1/(y - x) – where y is the desired solution and x is the result of decoding.

To-do

Pseudo-code implementations of selection algorithms
Pseudo-code implementations of crossover algorithms
Pseudo-code implementations of mutation algorithms
Bare-bones structure of a generic genetic algorithm class