MOGAMOD Overview: Multi-Objective Genetic Algorithm for Motif Discovery

Multi-objective evolutionary algorithms are a popular approach that has been widely used for optimization problems. The work on the multi-objective genetic algorithm for motif discovery (MOGAMOD) was the first study to apply a multi-objective genetic algorithm to the motif finding problem. By maximizing three conflicting objectives (motif length, similarity, and support), a motif model can be obtained with high accuracy and short execution time. MOGAMOD adapts a popular high-performance multi-objective genetic algorithm, the Non-dominated Sorting Genetic Algorithm II (NSGA-II), to the motif search problem in order to find the optimal motif. Like other genetic algorithms, NSGA-II relies on two operators, mutation and crossover, to keep producing new sets of candidate solutions, which it compares against one another to arrive at an optimal final result. The algorithm was tested and analyzed on samples with different properties: a simple sample, a corrupted sample, an invaded sample, and a multiple-model sample. The results were compared with three conventional methods based on statistical approaches to demonstrate its efficiency and superiority.

Introduction

Sequence motifs are recurring patterns in DNA that are found at regulatory sites. These regulatory sites and motif instances serve as protein binding points on the genetic sequence where the transcription process is initiated. Instances of a motif found in DNA sequences usually show slight variations in their components.
Finding motif instances in DNA and its regulatory regions is crucial for understanding the relationship between DNA and proteins such as nucleases and transcription factors; it is also a key factor in controlling gene expression and identifying drug targets for personalized medicine. In real-world problems, DNA can contain up to 220 million base pairs, while motif instances are typically short (30 base pairs). Biological experimental approaches have been developed to extract motif instances from DNA samples; the most popular are DNase footprinting, gel-shift analysis, and linker scanning. These approaches require an enormous amount of laboratory work and time as the length or the number of sequences increases. Computational methods based on statistical approaches, such as Gibbs Sampler and Consensus, have therefore been developed to find motifs in DNA samples. However, these algorithms also have high time complexity as the size of the DNA dataset increases. They also do not handle cases where some sequences in the sample contain no motif instance or where a single sequence contains multiple instances.

This report introduces a new approach that uses a multi-objective genetic algorithm as an alternative to the typical statistical approaches. Instead of optimizing just one objective and performing extremely poorly on others, such as similarity or final motif length, the new approach produces results that trade off between objectives, which addresses the problems found in other methods. The multi-objective genetic algorithm is designed to maximize three properties of the final motif: similarity, length, and support. The algorithm proposed in this article is tested on three datasets and compared with other well-known methods to demonstrate its effectiveness and superiority in terms of accuracy and time complexity.
The algorithm is also compared with a single-objective genetic algorithm to provide a better understanding of the trade-off between the objectives of the problem.

Methods

The multi-objective genetic algorithm for motif discovery (MOGAMOD) was built on a popular high-performance multi-objective genetic algorithm, the Non-dominated Sorting Genetic Algorithm II (NSGA-II). NSGA-II is a population-based method often used in optimization problems to find global optima quickly and effectively. It is founded on Darwin's principle of natural selection and evolves a population toward the best solution for the given problem.

The first step of a genetic algorithm is to create a randomly generated initial population whose individuals represent possible solutions to the problem. In this case, an individual was an array of n genes corresponding to the n DNA sequences in the problem. Each gene was divided into two parts: a weight (wi) and a possible starting position of the motif instance (si). The weight indicated the probability that the potential motif exists in the corresponding sequence and ranged from 0 to 1. MOGAMOD was designed to let users set a threshold on wi so that sequences with a low wi could be excluded from the motif discovery process. The starting position (si) indicated where the motif instance might begin in the corresponding sequence; in this research it was limited to between 7 and 64. Each individual in the population was then evaluated using a fitness function based on three objectives: similarity, motif length, and support.

Similarity

In the motif discovery problem, similarity measures how alike all the instances of an individual's motif are.
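As a rough illustration of such a measure, the Python sketch below scores a set of equal-length motif instances by averaging, over motif positions, the frequency of the most common nucleotide at each position. The function name `similarity` and the example instances are hypothetical, and the exact formula used in MOGAMOD may differ in detail.

```python
from collections import Counter


def similarity(instances):
    """Average, over motif positions, of the frequency of the most common nucleotide.

    `instances` is a list of equal-length strings, one motif instance per sequence.
    This is an illustrative assumption, not necessarily the exact MOGAMOD formula.
    """
    if not instances:
        raise ValueError("need at least one motif instance")
    length = len(instances[0])
    if length == 0 or any(len(s) != length for s in instances):
        raise ValueError("all motif instances must share the same non-zero length")

    total = 0.0
    # zip(*instances) walks the aligned instances column by column.
    for column in zip(*instances):
        top_count = Counter(column).most_common(1)[0][1]
        total += top_count / len(instances)
    return total / length


# Hypothetical aligned instances taken from four sequences.
example = ["ACGT", "ACGA", "ACCT", "TCGT"]
score = similarity(example)
```

A score of 1.0 means every instance is identical; lower values mean the instances disagree at more positions.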
The similarity value of an individual was calculated from the position weight matrix of its motif instances by averaging the probability of the most frequent nucleotide at each position. This value also ranged from 0 to 1 and indicated how likely the candidate was to be chosen as the final motif.

In the motif discovery problem, motif length is an objective that every algorithm tries to maximize in order to reduce the probability of false motif instances and increase the chance of obtaining a strong motif.

Support

An individual's support value was the number of sequences used to compose the candidate motif. It was introduced to exclude “corrupt” sequences that contained no motif instance, so that a strong final motif could be obtained without taking such sequences into account.

In summary, to solve the motif discovery problem, MOGAMOD was designed to optimize three objectives of the final motif: similarity, motif length, and support.

From the initial population, the strongest individuals were selected to pass on to the next generation. A fitness function determined whether an individual's objective values were strong enough compared with those of the other individuals in the current population. Individuals were first ranked by fitness using a non-dominated sorting algorithm. This algorithm has second-order polynomial time complexity, O(MN^2), where M is the number of objectives and N is the number of individuals in the population. According to Deb, a solution A dominates another solution B if and only if A is no worse than B in every objective and A is strictly better than B in at least one objective.
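The dominance rule can be sketched in a few lines of Python. This is a minimal illustration, assuming each individual is summarized by a tuple of objective values that are all maximized, here (similarity, motif length, support). The names `dominates` and `first_front` and the sample values are hypothetical. The function `first_front` only extracts the non-dominated set by pairwise comparison; it is not the full NSGA-II sorting procedure.

```python
def dominates(a, b):
    """Return True if objective vector `a` dominates `b` (all objectives maximized)."""
    if len(a) != len(b):
        raise ValueError("objective vectors must have the same length")
    no_worse_everywhere = all(x >= y for x, y in zip(a, b))
    better_somewhere = any(x > y for x, y in zip(a, b))
    return no_worse_everywhere and better_somewhere


def first_front(population):
    """Return the individuals that no other individual dominates."""
    front = []
    for i, candidate in enumerate(population):
        dominated = any(
            dominates(other, candidate)
            for j, other in enumerate(population)
            if j != i
        )
        if not dominated:
            front.append(candidate)
    return front


# Hypothetical objective vectors: (similarity, motif length, support).
a = (0.90, 12, 8)
b = (0.85, 12, 7)
c = (0.80, 15, 8)
d = (0.80, 12, 8)

front = first_front([a, b, c, d])
```

In this example `a` and `c` are mutually non-dominated: `a` has higher similarity and `c` has a longer motif, which is exactly the kind of trade-off a multi-objective search keeps rather than collapsing into a single score.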