Constrained Refining of Multiple Alignments

Multiple sequence alignment (MSA) is a core problem in bioinformatics. As a naive alignment algorithm applied to n sequences of average length l would have a time complexity of O(l^n), efficient methods rely strongly on heuristics, progressive construction of multiple alignments and optimization to local optima that only approximates the global optimal alignment. These approaches have been very successful in solving important problems such as sequence retrieval, motif identification and phylogenetic analysis.

But one feature common to all current MSA methods is the assumption that mutations are independent. At the core of any MSA is the alignment of sequence pairs, and this alignment always treats sequence positions as independent. This assumption is built into the scoring functions and into the algorithms themselves.

This is a sensible and useful approach for most purposes. All evidence strongly suggests that most evolution at the molecular level involves the accumulation of selectively neutral mutations, which are thus effectively random and independent, since nearly all mutations with a significant impact are likely to be deleterious and eliminated from the gene lineage.

However, this assumption does not hold in the case of mutations fixated by natural selection, particularly when the selective advantage of one mutation is due to another mutation at the same sequence or in another gene. Examples of these categories are compensatory mutations in one amino acid residue of a pair of interacting residues in a protein structure. A mutation in one residue of the pair will select for mutants in the other residue that optimize the interaction. This can occur within one gene or between genes when two protein chains interact in a protein complexes. In these cases the identification of these correlations at the sequence level can give information about the structure but, due to the bias for minimizing sequence discrepancy, current sequence alignment methods tend to obscure these correlations by realigning the mismatching residues.

Another possibility is that mutations are correlated across different organism species or strains because they were all fixated by the same selection pressure. For example, selection for mutations that confer antibiotic resistance may lead to mutations in similar locations fixating in different populations of bacteria. In theory, these can be identified by the correlation of these mutations with antibiotic resistance, which can distinguish mutations relevant for this trait from mutations that are just the result of neutral evolution.

The problem that this project addresses is the strong focus of alignment algorithms in the identification of conserved regions. The minimization of mutations assumed to be independent results in a good alignment in regions where the sequence is conserved but unreliable alignments in regions where sequences vary. Which are precisely those regions that can indicate co-evolution.

The goal of this project is to explore how constraint programming can help identify these correlations by refining the multiple sequence alignments obtained under the default assumption of independent mutations. The hypothesis that two mutations are correlated can be tested by realigning the sequences at those regions giving a lower penalty for mismatches at those positions. Given the nature of the problem and the need to test many combinations of sequence positions – many hypotheses of co-evolution – this problem is particularly suited to a constraint programming approach, using propagation to limit the boundaries of the necessary realignments and a branch-and-bound search to explore only those combinations that can result in significant correlations.

Project financed by FCT/MCTES under designation PTDC/EIA-CCO/115999/2009

FCT