The latest research articles published by BMC Bioinformatics
Background: Gene set testing has become an important analysis technique in high throughput microarray and next generation sequencing studies for uncovering patterns of differential expression of various biological processes. Often, the large number of gene sets that are tested simultaneously require some sort of multiplicity correction to account for the multiplicity effect. This work provides a substantial computational improvement to an existing familywise error rate controlling multiplicity approach (the Focus Level method) for gene set testing in high throughput microarray and next generation sequencing studies using Gene Ontology graphs, which we call the Short Focus Level. Results: The Short Focus Level procedure, which performs a shortcut of the full Focus Level procedure, is achieved by extending the reach of graphical weighted Bonferroni testing to closed testing situations where restricted hypotheses are present, such as in the Gene Ontology graphs. The Short Focus Level multiplicity adjustment can perform the full top-down approach of the original Focus Level procedure, overcoming a significant disadvantage of the otherwise powerful Focus Level multiplicity adjustment. The computational and power differences of the Short Focus Level procedure as compared to the original Focus Level procedure are demonstrated both through simulation and using real data. Conclusions: The Short Focus Level procedure shows a significant increase in computation speed over the original Focus Level procedure (as much as ~15,000 times faster). The Short Focus Level should be used in place of the Focus Level procedure whenever the logical assumptions of the Gene Ontology graph structure are appropriate for the study objectives and when either no a priori focus level of interest can be specified or the focus level is selected at a higher level of the graph, where the Focus Level procedure is computationally intractable.
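As a rough illustration of the graphical weighted Bonferroni testing that the Short Focus Level abstract builds on, the sketch below implements the generic sequentially rejective graphical procedure (local weights plus a transition matrix that passes the weight of a rejected hypothesis on to the others). It is not the Short Focus Level method itself, which extends this idea to restricted hypotheses on Gene Ontology graphs. The function name, argument layout and the two-hypothesis example are assumptions made for illustration.

```python
def graphical_bonferroni(p_values, weights, transitions, alpha=0.05):
    """Sequentially rejective graphical weighted Bonferroni test (generic sketch).

    p_values    : one p-value per hypothesis
    weights     : initial local weights, summing to at most 1
    transitions : matrix g[j][l], the fraction of hypothesis j's weight
                  passed to hypothesis l when j is rejected
    Returns the indices of rejected hypotheses in rejection order.
    """
    w = list(weights)
    g = [list(row) for row in transitions]
    active = set(range(len(p_values)))
    rejected = []
    while True:
        # A hypothesis can be rejected when its p-value is below its local level.
        candidates = [j for j in sorted(active) if p_values[j] <= w[j] * alpha]
        if not candidates:
            return rejected
        j = candidates[0]
        active.remove(j)
        rejected.append(j)
        # Pass the weight of the rejected hypothesis along its outgoing edges.
        for l in active:
            w[l] += w[j] * g[j][l]
        # Rewire the remaining graph so that paths through j are preserved.
        new_g = [row[:] for row in g]
        for l in active:
            for k in active:
                if l == k:
                    new_g[l][k] = 0.0
                    continue
                denom = 1.0 - g[l][j] * g[j][l]
                new_g[l][k] = (g[l][k] + g[l][j] * g[j][k]) / denom if denom > 0 else 0.0
        g = new_g
        w[j] = 0.0


# Two hypotheses with equal weights that pass all weight to each other
# (this graph reproduces the Holm procedure).
holm_graph = [[0.0, 1.0], [1.0, 0.0]]
print(graphical_bonferroni([0.02, 0.04], [0.5, 0.5], holm_graph))
```

With equal weights and full transfer between two hypotheses, the second hypothesis is tested at the full level once the first has been rejected, which is why the graph acts as a shortcut of the corresponding closed test.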
Background: A typical affinity purification coupled to mass spectrometry (AP-MS) experiment includes the purification of a target protein (bait) using an antibody and subsequent mass spectrometry analysis of all proteins co-purifying with the bait (aka prey proteins). Like any other systems biology approach, AP-MS experiments generate a lot of data and visualization has been challenging, especially when integrating AP-MS experiments with orthogonal datasets. Results: We present Circular Interaction Graph for Proteomics (CIG-P), which generates circular diagrams for a visually appealing final representation of AP-MS data. Through a Java-based GUI, the user inputs experimental and reference data as files in CSV format. The resulting circular representation can be manipulated live within the GUI before exporting the diagram as a vector graphic in PDF format. The strength of CIG-P is the ability to integrate orthogonal datasets with each other, e.g. affinity purification data of kinase PRPF4B in relation to the functional components of the spliceosome. Further, various AP-MS experiments can be compared to each other. Conclusions: CIG-P helps to present AP-MS data to a wider audience and we envision that the tool will find other applications too, e.g. kinase-substrate relationships as a function of perturbation. CIG-P is available at: http://sourceforge.net/projects/cig-p/
Background: Because of the difficulties involved in learning and using 3D modeling and rendering software, many scientists hire programmers or animators to create models and animations. This both slows the discovery process and provides opportunities for miscommunication. Working with multiple collaborators, a tool was developed (based on a set of design goals) to enable them to directly construct models and animations. Results: SketchBio is presented, a tool that incorporates state-of-the-art bimanual interaction and drop shadows to enable rapid construction of molecular structures and animations. It includes three novel features: crystal-by-example, pose-mode physics, and spring-based layout that accelerate operations common in the formation of molecular models. Design decisions and their consequences are presented, including cases where iterative design was required to produce effective approaches. Conclusions: The design decisions, novel features, and inclusion of state-of-the-art techniques enabled SketchBio to meet all of its design goals. These features and decisions can be incorporated into existing and new tools to improve their effectiveness.
Background: The complexity of biological data related to the genetic origins of tumour cells poses significant challenges for gleaning valuable knowledge that can be used to predict therapeutic responses. In order to discover a link between gene expression profiles and drug responses, a computational framework based on Consensus p-Median clustering is proposed. The main goal is to simultaneously predict (in silico) anticancer responses by extracting common patterns among tumour cell lines, selecting genes that could potentially explain the therapy outcome and finally learning a probabilistic model able to predict the therapeutic responses. Results: The experimental investigation performed on the NCI60 dataset highlights three main findings: (1) Consensus p-Median is able to create groups of cell lines that are highly correlated both in terms of gene expression and drug response; (2) from a biological point of view, the proposed approach enables the selection of genes that are strongly involved in several cancer processes; (3) the final prediction of drug responses, built upon Consensus p-Median and the selected genes, represents a promising step for predicting potentially useful drugs. Conclusion: The proposed learning framework represents a promising approach for predicting drug response in tumour cells.
Improving the accuracy of expression data analysis in time course experiments using resampling
Background: As time series experiments in higher eukaryotes usually obtain data from different individuals collected at the different time points, a time series sample itself is not equivalent to a true biological replicate but is, rather, a combination of several biological replicates. The analysis of expression data derived from a time series sample is therefore often performed with a low number of replicates due to budget limitations or limitations in sample availability. In addition, most algorithms developed to identify specific patterns in time series dataset do not consider biological variation in samples collected at the same conditions. Results: Using artificial time course datasets, we show that resampling considerably improves the accuracy of transcripts identified as rhythmic. In particular, the number of false positives can be greatly reduced while at the same time the number of true positives can be maintained in the range of other methods currently used to determine rhythmically expressed genes. Conclusions: The resampling approach described here therefore increases the accuracy of time series expression data analysis and furthermore emphasizes the importance of biological replicates in identifying oscillating genes. Resampling can be used for any time series expression dataset as long as the samples are acquired from independent individuals at each time point.
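One way to picture the resampling idea is sketched below. The abstract does not spell out the procedure, so everything here is an assumption for illustration: each time point has several replicates from independent individuals, a pseudo time series is built by drawing one replicate per time point, each series is scored against a cosine template, and the fraction of resampled series that pass a threshold is reported. The function names, the cosine-correlation score and the threshold are hypothetical choices, not the authors' method.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def rhythmicity(series, period_points, n_phases=24):
    """Best correlation between the series and a cosine over a grid of phases."""
    best = -1.0
    for i in range(n_phases):
        shift = period_points * i / n_phases
        template = [math.cos(2 * math.pi * (t - shift) / period_points)
                    for t in range(len(series))]
        best = max(best, pearson(series, template))
    return best


def resampled_support(replicates, period_points, n_resamples=200,
                      threshold=0.8, seed=0):
    """Fraction of resampled pseudo time series that look rhythmic.

    replicates[t] holds the expression values of the independent
    individuals sampled at time point t.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Draw one individual per time point to form one pseudo time series.
        series = [rng.choice(reps) for reps in replicates]
        if rhythmicity(series, period_points) >= threshold:
            hits += 1
    return hits / n_resamples


# Twelve time points, three replicates each, period of six sampling intervals.
rhythmic = [[math.cos(2 * math.pi * t / 6) + d for d in (-0.05, 0.0, 0.05)]
            for t in range(12)]
print(resampled_support(rhythmic, period_points=6))
```

A transcript would then be called rhythmic only when this support is high, which penalizes patterns that depend on one particular combination of individuals.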
FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets
Background: Amino acid replacement rate matrices are a crucial component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Ideally, the rate matrix reflects the mutational behavior of the actual data under study; however, estimating amino acid replacement rate matrices requires large protein alignments and is computationally expensive and complex. As a compromise, sub-optimal pre-calculated generic matrices are typically used for protein-based phylogeny. Sequence availability has now grown to a point where problem-specific rate matrices can often be calculated if the computational cost can be controlled. Results: The most time consuming step in estimating rate matrices by maximum likelihood is building maximum likelihood phylogenetic trees from protein alignments. We propose a new procedure, called FastMG, to overcome this obstacle. The key innovation is the alignment-splitting algorithm that splits alignments with many sequences into non-overlapping sub-alignments prior to estimating amino acid replacement rates. Experiments with different large data sets showed that the FastMG procedure was an order of magnitude faster than without splitting. Importantly, there was no apparent loss in matrix quality if an appropriate splitting procedure is used. Conclusions: FastMG is a simple, fast and accurate procedure to estimate amino acid replacement rate matrices from large data sets. It enables researchers to study the evolutionary relationships for specific groups of proteins or taxa with optimized, data-specific amino acid replacement rate matrices. The programs, data sets, and the new mammalian mitochondrial protein rate matrix are available at http://fastmg.codeplex.com.
Network-based analysis of comorbidities risk during an infection: SARS and HIV case studies
Background: Infections are often associated with comorbidity that increases the risk of medical conditions which can lead to further morbidity and mortality. SARS is a threat which is similar to MERS virus, but the comorbidity is the key aspect to underline their different impacts. One UK doctor says "I'd rather have HIV than diabetes" as life expectancy among diabetes patients is lower than that of HIV. However, HIV has a comorbidity impact on the diabetes. Results: We present a quantitative framework to compare and explore comorbidity between diseases. By using neighbourhood based benchmark and topological methods, we have built comorbidity relationships network based on the OMIM and our identified significant genes. Then based on the gene expression, PPI and signalling pathways data, we investigate the comorbidity association of these 2 infective pathologies with other 7 diseases (heart failure, kidney disorder, breast cancer, neurodegenerative disorders, bone diseases, Type 1 and Type 2 diabetes). Phenotypic association is measured by calculating both the Relative Risk as the quantified measures of comorbidity tendency of two disease pairs and the φ-correlation to measure the robustness of the comorbidity associations. The differential gene expression profiling strongly suggests that the response of SARS affected patients seems to be mainly an innate inflammatory response and statistically dysregulates a large number of genes, pathways and PPIs subnetworks in different pathologies such as chronic heart failure (21 genes), breast cancer (16 genes) and bone diseases (11 genes). HIV-1 induces comorbidities relationship with many other diseases, particularly strong correlation with the neurological, cancer, metabolic and immunological diseases. Similar comorbidities risk is observed from the clinical information. Moreover, SARS and HIV infections dysregulate 4 genes (ANXA3, GNS, HIST1H1C, RASA3) and 3 genes (HBA1, TFRC, GHITM) respectively that affect the ageing process.
It is notable that HIV and SARS similarly dysregulated 11 genes and 3 pathways. Only 4 significantly dysregulated genes are common between SARS-CoV and MERS-CoV, including NFKBIA that is a key regulator of immune responsiveness implicated in susceptibility to infectious and inflammatory diseases. Conclusions: Our method presents a ripe opportunity to use data-driven approaches for advancing our current knowledge on disease mechanism and predicting disease comorbidities in a quantitative way.
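The abstract names the Relative Risk and the phi-correlation as its two comorbidity measures without giving formulas. The sketch below uses the definitions commonly found in the comorbidity-network literature, computed from patient counts; whether the authors used exactly these forms is an assumption, and the function names and example counts are hypothetical.

```python
import math


def relative_risk(c_ij, p_i, p_j, n):
    """Observed co-occurrence divided by the co-occurrence expected by chance.

    c_ij : patients diagnosed with both disease i and disease j
    p_i  : patients diagnosed with disease i
    p_j  : patients diagnosed with disease j
    n    : total number of patients
    """
    return c_ij * n / (p_i * p_j)


def phi_correlation(c_ij, p_i, p_j, n):
    """Pearson correlation between the two binary disease indicators."""
    numerator = c_ij * n - p_i * p_j
    denominator = math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))
    return numerator / denominator


# Hypothetical counts: 1000 patients, 100 with disease i, 200 with disease j,
# 40 with both (twice the 20 expected under independence).
print(relative_risk(40, 100, 200, 1000))
print(phi_correlation(40, 100, 200, 1000))
```

A relative risk above 1 and a phi-correlation above 0 both indicate that the two diseases co-occur more often than expected by chance. The two measures have different biases: relative risk tends to overestimate associations between rare diseases, while phi-correlation tends to underestimate associations between a rare and a common disease, which is why they are usually reported together.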
Background: Viral integration into a host genome is defined by two chimeric junctions that join viral and host DNA. Recently, computational tools have been developed that utilize NGS data to detect chimeric junctions. These methods identify individual viral-host junctions but do not associate chimeric pairs as an integration event. Without knowing the chimeric boundaries of an integration, its genetic content cannot be determined. Results: SummonChimera is a Perl program that associates chimera pairs to infer the complete viral genomic integration event to the nucleotide level within single or paired-end NGS data. SummonChimera integration prediction was verified on a set of single-end IonTorrent reads from a purified Salmonella bacterium with an integrated bacteriophage. Furthermore, SummonChimera predicted integrations from experimentally verified Hepatitis B Virus chimeras within a paired-end Whole Genome Sequencing hepatocellular carcinoma tumor database. Conclusions: SummonChimera identified all experimentally verified chimeras detected by current computational methods. Further, SummonChimera integration inference precisely predicted bacteriophage integration. The application of SummonChimera to cancer NGS accurately identifies deletion of host and viral sequence during integration. The precise nucleotide determination of an integration allows prediction of viral and cellular gene transcription patterns.
Background: Millions of cells are present in thousands of images created in high-throughput screening (HTS). Biologists could classify each of these cells into a phenotype by visual inspection. But in the presence of millions of cells this visual classification task becomes infeasible. Biologists train classification models on a few thousand visually classified example cells and iteratively improve the training data by visual inspection of the important misclassified phenotypes. Classification methods differ in performance and performance evaluation time. We present a comparative study of computational performance of gentle boosting, joint boosting CellProfiler Analyst (CPA), support vector machines (linear and radial basis function) and linear discriminant analysis (LDA) on two data sets of HT29 and HeLa cancer cells. Results: For the HT29 data set we find that gentle boosting, SVM (linear) and SVM (RBF) are close in performance but SVM (linear) is faster than gentle boosting and SVM (RBF). For the HT29 data set the average performance difference between SVM (RBF) and SVM (linear) is 0.42 %. For the HeLa data set we find that SVM (RBF) outperforms other classification methods and is on average 1.41 % better in performance than SVM (linear). Conclusions: Our study proposes SVM (linear) for iterative improvement of the training data and SVM (RBF) for the final classifier to classify all unlabeled cells in the whole data set.
Background: DNA methylation changes are associated with a wide array of biological processes. Bisulfite conversion of DNA followed by high-throughput sequencing is increasingly being used to assess genome-wide methylation at single-base resolution. The relative slowness of most commonly used aligners for processing such data introduces an unnecessarily long delay between receipt of raw data and statistical analysis. While this process can be sped up by using computer clusters, current tools are not designed with them in mind and end-users must create such implementations themselves. Results: Here, we present a novel BS-seq aligner, Bison, which exploits multiple nodes of a computer cluster to speed up this process and also has increased accuracy. Bison is accompanied by a variety of helper programs and scripts to ease, as much as possible, the process of quality control and preparing results for statistical analysis by a variety of popular R packages. Bison is also accompanied by bison_herd, a variant of Bison with the same output but that can scale to a semi-arbitrary number of nodes, with concomitant increased demands on the underlying message passing interface implementation. Conclusions: Bison is a new bisulfite-converted short-read aligner providing end users easier scalability for performance gains, more accurate alignments, and a convenient pathway for quality controlling alignments and converting methylation calls into a form appropriate for statistical analysis. Bison and the more scalable bison_herd are natively able to utilize multiple nodes of a computer cluster simultaneously and serve to simplify the process of creating analysis pipelines.
Background: Signatures are short sequences that are unique and not similar to any other sequence in a database that can be used as the basis to identify different species. Even though several signature discovery algorithms have been proposed in the past, these algorithms require the entirety of databases to be loaded in the memory, thus restricting the amount of data that they can process. It makes those algorithms unable to process databases with large amounts of data. Also, those algorithms use sequential models and have slower discovery speeds, meaning that the efficiency can be improved. Results: In this research, we are debuting the utilization of a divide-and-conquer strategy in signature discovery and have proposed a parallel signature discovery algorithm on a computer cluster. The algorithm applies the divide-and-conquer strategy to solve the problem posed to the existing algorithms where they are unable to process large databases and uses a parallel computing mechanism to effectively improve the efficiency of signature discovery. Even when run with just the memory of regular personal computers, the algorithm can still process large databases such as the human whole-genome EST database which were previously unable to be processed by the existing algorithms. Conclusions: The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as Next Generation Sequencing and other large database analysis and processing. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm.
CLAP: A web-server for automatic classification of proteins with special reference to multi-domain proteins
Background: The function of a protein can be deciphered with higher accuracy from its structure than from its amino acid sequence. Due to the huge gap in the available protein sequence and structural space, tools that can generate functionally homogeneous clusters using only the sequence information, hold great importance. For this, traditional alignment-based tools work well in most cases and clustering is performed on the basis of sequence similarity. But, in the case of multi-domain proteins, the alignment quality might be poor due to varied lengths of the proteins, domain shuffling or circular permutations. Multi-domain proteins are ubiquitous in nature, hence alignment-free tools, which overcome the shortcomings of alignment-based protein comparison methods, are required. Further, existing tools classify proteins using only domain-level information and hence miss out on the information encoded in the tethered regions or accessory domains. Our method, on the other hand, takes into account the full-length sequence of a protein, consolidating the complete sequence information to understand a given protein better. Results: Our web-server, CLAP (Classification of Proteins), is one such alignment-free software for automatic classification of protein sequences. It utilizes a pattern-matching algorithm that assigns local matching scores (LMS) to residues that are a part of the matched patterns between two sequences being compared. CLAP works on full-length sequences and does not require prior domain definitions. Pilot studies undertaken previously on protein kinases and immunoglobulins have shown that CLAP yields clusters, which have high functional and domain architectural similarity. Moreover, parsing at a statistically determined cut-off resulted in clusters that corroborated with the sub-family level classification of that particular domain family.
Conclusions: CLAP is a useful protein-clustering tool, independent of domain assignment, domain order, sequence length and domain diversity. Our method can be used for any set of protein sequences, yielding functionally relevant clusters with high domain architectural homogeneity. The CLAP web server is freely available for academic use at http://nslab.mbu.iisc.ernet.in/clap/
Background: Guide-trees are used as part of an essential heuristic to enable the calculation of multiple sequence alignments. They have been the focus of much method development but there has been little effort at determining systematically, which guide-trees, if any, give the best alignments. Some guide-tree construction schemes are based on pair-wise distances amongst unaligned sequences. Others try to emulate an underlying evolutionary tree and involve various iteration methods. Results: We explore all possible guide-trees for a set of protein alignments of up to eight sequences. We find that pairwise distance based default guide-trees sometimes outperform evolutionary guide-trees, as measured by structure derived reference alignments. However, default guide-trees fall way short of the optimum attainable scores. On average chained guide-trees perform better than balanced ones but are not better than default guide-trees for small alignments. Conclusions: Alignment methods that use Consistency or hidden Markov models to make alignments are less susceptible to sub-optimal guide-trees than simpler methods, that basically use conventional sequence alignment between profiles. The latter appear to be affected positively by evolutionary based guide-trees for difficult alignments and negatively for easy alignments. One phylogeny aware alignment program can strongly discriminate between good and bad guide-trees. The results for randomly chained guide-trees improve with the number of sequences.
Systematic identification of transcriptional and post-transcriptional regulations in human respiratory epithelial cells during influenza A virus infection
Background: Respiratory epithelial cells are the primary target of influenza virus infection in human. However, the molecular mechanisms of airway epithelial cell responses to viral infection are not fully understood. Revealing genome-wide transcriptional and post-transcriptional regulatory relationships can further advance our understanding of this problem, which motivates the development of novel and more efficient computational methods to simultaneously infer the transcriptional and post-transcriptional regulatory networks. Results: Here we propose a novel framework named SITPR to investigate the interactions among transcription factors (TFs), microRNAs (miRNAs) and target genes. Briefly, a background regulatory network on a genome-wide scale (~23,000 nodes and ~370,000 potential interactions) is constructed from curated knowledge and algorithm predictions, to which the identification of transcriptional and post-transcriptional regulatory relationships is anchored. To reduce the dimension of the associated computing problem down to an affordable size, several topological and data-based approaches are used. Furthermore, we propose the constrained LASSO formulation and combine it with the dynamic Bayesian network (DBN) model to identify the activated regulatory relationships from time-course expression data. Our simulation studies on networks of different sizes suggest that the proposed framework can effectively determine the genuine regulations among TFs, miRNAs and target genes; also, we compare SITPR with several selected state-of-the-art algorithms to further evaluate its performance. By applying the SITPR framework to mRNA and miRNA expression data generated from human lung epithelial A549 cells in response to A/Mexico/InDRE4487/2009 (H1N1) virus infection, we are able to detect the activated transcriptional and post-transcriptional regulatory relationships as well as the significant regulatory motifs. 
Conclusion: Compared with other representative state-of-the-art algorithms, the proposed SITPR framework can more effectively identify the activated transcriptional and post-transcriptional regulations simultaneously from a given background network. The idea of SITPR is generally applicable to the analysis of gene regulatory networks in human cells. The results obtained for human respiratory epithelial cells suggest the importance of the transcriptional, post-transcriptional regulations as well as their synergies in the innate immune responses against IAV infection.
Background: Proteins dynamically interact with each other to perform their biological functions. The dynamic operations of protein interaction networks (PPI) are also reflected in the dynamic formations of protein complexes. Existing protein complex detection algorithms usually overlook the inherent temporal nature of protein interactions within PPI networks. Systematically analyzing the temporal protein complexes can not only improve the accuracy of protein complex detection, but also strengthen our biological knowledge on the dynamic protein assembly processes for cellular organization. Results: In this study, we propose a novel computational method to predict temporal protein complexes. Particularly, we first construct a series of dynamic PPI networks by joint analysis of time-course gene expression data and protein interaction data. Then a Time Smooth Overlapping Complex Detection model (TS-OCD) is proposed to detect temporal protein complexes from these dynamic PPI networks. TS-OCD can naturally capture the smoothness of networks between consecutive time points and detect overlapping protein complexes at each time point. Finally, a nonnegative matrix factorization based algorithm is introduced to merge those very similar temporal complexes across different time points. Conclusions: Extensive experimental results demonstrate that the proposed method is more effective in detecting temporal protein complexes than the state-of-the-art complex detection techniques.
Visualization and Correction of Automated Segmentation, Tracking and Lineaging from 5-D Stem Cell Image Sequences
Background: Neural stem cells are motile and proliferative cells that undergo mitosis, dividing to produce daughter cells and ultimately generating differentiated neurons and glia. Understanding the mechanisms controlling neural stem cell proliferation and differentiation will play a key role in the emerging fields of regenerative medicine and cancer therapeutics. Stem cell studies in vitro from 2-D image data are well established. Visualizing and analyzing large three dimensional images of intact tissue is a challenging task. It becomes more difficult as the dimensionality of the image data increases to include time and additional fluorescence channels. There is a pressing need for 5-D image analysis and visualization tools to study cellular dynamics in the intact niche and to quantify the role that environmental factors play in determining cell fate. Results: We present an application that integrates visualization and quantitative analysis of 5-D (x, y, z, t, channel) and large montage confocal fluorescence microscopy images. The image sequences show stem cells together with blood vessels, enabling quantification of the dynamic behaviors of stem cells in relation to their vascular niche, with applications in developmental and cancer biology. Our application automatically segments, tracks, and lineages the image sequence data and then allows the user to view and edit the results of automated algorithms in a stereoscopic 3-D window while simultaneously viewing the stem cell lineage tree in a 2-D window. Using the GPU to store and render the image sequence data enables a hybrid computational approach. An inference-based approach utilizing user-provided edits to automatically correct related mistakes executes interactively on the system CPU while the GPU handles 3-D visualization tasks. Conclusions: By exploiting commodity computer gaming hardware, we have developed an application that can be run in the laboratory to facilitate rapid iteration through biological experiments.
We combine unsupervised image analysis algorithms with an interactive visualization of the results. Our validation interface allows for each data set to be corrected to 100% accuracy, ensuring that downstream data analysis is accurate and verifiable. Our tool is the first to combine all of these aspects, leveraging the synergies obtained by utilizing validation information from stereo visualization to improve the low level image processing tasks.
tigaR: integrative significance analysis of temporal differential gene expression induced by genomic abnormalities
Background: To determine which changes in the host cell genome are crucial for cervical carcinogenesis, a longitudinal in vitro model system of HPV-transformed keratinocytes was profiled in a genome-wide manner. Four cell lines affected with either HPV16 or HPV18 were assayed at 8 sequential time points for gene expression (mRNA) and gene copy number (DNA) using high-resolution microarrays. Available methods for temporal differential expression analysis are not designed for integrative genomic studies. Results: Here, we present a method that allows for the identification of differential gene expression associated with DNA copy number changes over time. The temporal variation in gene expression is described by a generalized linear mixed model employing low-rank thin-plate splines. Model parameters are estimated with an empirical Bayes procedure, which exploits integrated nested Laplace approximation for fast computation. Iteratively, posteriors of hyperparameters and model parameters are estimated. The empirical Bayes procedure shrinks multiple dispersion-related parameters. Shrinkage leads to more stable estimates of the model parameters, better control of false positives and improvement of reproducibility. In addition, to make estimates of the DNA copy number more stable, model parameters are also estimated in a multivariate way using triplets of features, imposing a spatial prior for the copy number effect. Conclusion: With the proposed method for analysis of time-course multilevel molecular data, more profound insight may be gained through the identification of temporal differential expression induced by DNA copy number abnormalities. In particular, in the analysis of an integrative oncogenomics study with a time-course set-up our method finds genes previously reported to be involved in cervical carcinogenesis. Furthermore, the proposed method yields improvements in sensitivity, specificity and reproducibility compared to existing methods. 
Finally, the proposed method is able to handle count (RNAseq) data from time course experiments as is shown on a real data set.
Background: In the past, a number of methods have been developed for predicting post-translational modifications in proteins. In contrast, limited attempts have been made to understand post-transcriptional modifications. Recently it has been shown that tRNA modifications play a direct role in the genome structure and codon usage. This study is an attempt to understand kingdom-wise tRNA modifications, particularly uridine modifications (UMs), as the majority of modifications are uridine-derived. Results: A three-step strategy has been applied to develop an efficient method for the prediction of UMs. In the first step, we developed a common prediction model for all the kingdoms using a dataset from MODOMICS-2008. Support Vector Machine (SVM) based prediction models were developed and evaluated by the five-fold cross-validation technique. Different approaches were applied, and a hybrid approach of binary and structural information achieved the highest area under the curve (AUC) of 0.936. In the second step, we used newly added tRNA sequences of MODOMICS-2012 as an independent dataset to evaluate the kingdom-wise prediction performance of the common model developed in the first step, and achieved AUCs between 0.910 and 0.949. In the third and last step, we used different datasets from MODOMICS-2012 to develop individual kingdom-wise prediction models and achieved AUCs between 0.915 and 0.987. Conclusions: The hybrid approach is efficient not only at predicting kingdom-wise modifications but also at classifying them into the two most prominent UMs: Pseudouridine (Y) and Dihydrouridine (D). A webserver called tRNAmod (http://crdd.osdd.net/raghava/trnamod/) has been developed, which predicts UMs from both tRNA sequences and whole genomes.
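Since the results above are reported as AUC values, a minimal sketch of how an AUC can be computed from classifier scores may help. It uses the rank-based (Mann-Whitney) formulation: the AUC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The function name and the example scores are hypothetical and are not taken from tRNAmod.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as 0.5.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical decision values for modified (positive) and
# unmodified (negative) uridine positions.
modified = [0.9, 0.8, 0.4]
unmodified = [0.7, 0.3, 0.2]
print(auc(modified, unmodified))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported values above 0.9 mean that a modified site outranks an unmodified one in more than nine out of ten pairs. The pairwise loop is quadratic in the number of examples, so a sorting-based implementation is preferable for large datasets.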
MAE-FMD: Multi-agent evolutionary method for functional module detection in protein-protein interaction networks
Background: Studies of functional modules in a Protein-Protein Interaction (PPI) network contribute greatly to the understanding of biological mechanisms. With the development of computing science, computational approaches have played an important role in detecting functional modules. Results: We present a new approach using multi-agent evolution for detection of functional modules in PPI networks. The proposed approach consists of two stages: the solution construction for agents in a population and the evolutionary process of computational agents in a lattice environment, where each agent corresponds to a candidate solution to the detection problem of functional modules in a PPI network. First, the approach utilizes a connection-based encoding scheme to model an agent, and employs a random-walk behavior that merges topological characteristics with functional information to construct a solution. Next, it applies several evolutionary operators, i.e., competition, crossover, and mutation, to realize information exchange among agents as well as solution evolution. Systematic experiments have been conducted on three benchmark testing sets of yeast networks. Experimental results show that the approach is more effective than several other existing algorithms. Conclusions: The algorithm achieves outstanding recall, F-measure, sensitivity and accuracy while remaining competitive on other performance measures, so it can be applied to biological studies that require high accuracy.
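The abstract does not spell out its connection-based encoding, so the following is only a sketch under a common assumption: an agent is a list in which `agent[i] = j` means that protein `i` is linked to protein `j`, and the connected groups induced by those links are read off as candidate modules. The function name `decode_modules` and the integer protein indices are hypothetical, and the random-walk construction and evolutionary operators are not shown.

```python
def decode_modules(agent):
    """Decode a connection-based agent into modules.

    Assumption: agent[i] = j links protein i to protein j. Proteins that are
    connected through any chain of such links fall into the same module.
    """
    parent = list(range(len(agent)))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two groups joined by each link.
    for i, j in enumerate(agent):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j

    # Collect proteins by the root of their group.
    modules = {}
    for i in range(len(agent)):
        modules.setdefault(find(i), []).append(i)
    return sorted(modules.values())
```

With `agent = [1, 0, 1, 4, 3]`, proteins 0, 1 and 2 form one module and proteins 3 and 4 form another. One appeal of this kind of encoding is that the number of modules does not have to be fixed in advance; it follows from the links.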
Background: Structure-based drug design is an iterative process, following cycles of structural biology, computer-aided design, synthetic chemistry and bioassay. In favorable circumstances, this process can yield hundreds of protein-ligand crystal structures. In addition, molecular dynamics (MD) simulations are increasingly being used to further explore the conformational landscape of these complexes. Currently, methods capable of analyzing ensembles of crystal structures and MD trajectories are limited and usually rely upon least squares superposition of coordinates. Results: Novel methodologies are described for the analysis of multiple structures of the same or related proteins. Statistical approaches that rely upon residue equivalence, but not superposition, are developed. Tasks that can be performed include the identification of hinge regions, allosteric conformational changes and transient binding sites. The approaches are tested on crystal structures of CDK2 and other CMGC protein kinases and a simulation of p38alpha. Known relationships between interactions and conformational changes are highlighted, and new ones are revealed. A transient but druggable allosteric pocket in CDK2 is predicted to occur under the CMGC insert. Furthermore, an evolutionarily conserved conformational link from the location of this pocket, via the alphaEF-alphaF loop, to phosphorylation sites on the activation loop is discovered. Conclusions: New methodologies are described and validated for the superimposition-independent conformational analysis of large collections of structures or simulation snapshots of the same protein. The methodologies are encoded in a Python package called Polyphony, which is released as open source to accompany this paper [http://wrpitt.bitbucket.org/polyphony/].
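One generic way to compare structures through residue equivalence without superposing them is to work with internal distances, which do not change when a structure is rotated or translated. The sketch below illustrates that idea only; it is not Polyphony's API or necessarily its method. The function names are hypothetical, and the inputs are assumed to be lists of (x, y, z) coordinates, one per equivalent residue, in the same order for both structures.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between residue positions (e.g. C-alpha atoms)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def per_residue_change(coords_a, coords_b):
    """Mean absolute change, per residue, in its distances to all other residues.

    Assumption: residue i in coords_a is equivalent to residue i in coords_b.
    No superposition is needed because only internal distances are compared.
    """
    n = len(coords_a)
    if n != len(coords_b) or n < 2:
        raise ValueError("need two structures with the same number (>= 2) of residues")
    da = distance_matrix(coords_a)
    db = distance_matrix(coords_b)
    return [
        sum(abs(da[i][j] - db[i][j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
```

A rigidly moved copy of a structure gives a change of zero at every residue, whereas a bend shows up as larger values for the residues on either side of it. A profile like this could serve as a first, crude indicator of where two conformations differ.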