Improving Recognition of Antimicrobial Peptides and their Target Selectivity through Machine Learning and Genetic Programming

Pre-requisites

Java (TM): http://www.oracle.com/technetwork/java/index.html
Base code is written in Java.

Bio-Java (TM): http://biojava.org/wiki/Main_Page
Requires version 3.0.3 or later. Bio-Java is used for sequence pattern matching and I/O.

ECJ (TM): http://cs.gmu.edu/~eclab/projects/ecj
Requires version 20 or later. ECJ is the framework on which EFC, the GP-based feature construction algorithm, is built.

WEKA (TM): http://www.cs.waikato.ac.nz/~ml/weka
Requires version 3.7.* or later. WEKA is used to run training and testing and to compute basic evaluation metrics such as auROC and auPRC. WEKA's attribute evaluation is also used for feature selection. The base sequences, with their class labels, are split into 10 train/test samples using stratified sampling in WEKA.

Datasets (FASTA format. All sequences encoded in GBMR4 alphabet.)

Notes on Running Experiments

Configuration to Run Dataset Splitting into Training/Testing for Cross-Validation:
Load the sequences as a string attribute and the class labels as the class attribute, then split the data into 10 stratified folds using a WEKA filter.
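
To make the splitting step concrete, here is a minimal plain-Java sketch of what stratified fold assignment does. It is not WEKA's implementation and not the project's code. The class name `StratifiedFolds`, the method `assign`, the seed, and the labels "AMP" and "nonAMP" are all hypothetical and chosen only for illustration. In practice the split is done inside WEKA; WEKA ships a supervised instance filter named StratifiedRemoveFolds, though whether that is the filter used here is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative only: a hand-rolled stratified fold assignment, not WEKA's filter.
final class StratifiedFolds {

    /**
     * Assigns every instance to one of numFolds folds so that each class is
     * spread as evenly as possible across the folds.
     * Returns an array where result[i] is the fold index of instance i.
     */
    public static int[] assign(List<String> labels, int numFolds, long seed) {
        // Group instance indices by class label, preserving first-seen class order.
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            byClass.computeIfAbsent(labels.get(i), k -> new ArrayList<>()).add(i);
        }

        Random rng = new Random(seed);
        int[] fold = new int[labels.size()];
        int next = 0; // running counter carried across classes to keep fold sizes balanced
        for (List<Integer> members : byClass.values()) {
            Collections.shuffle(members, rng); // randomize order within the class
            for (int idx : members) {
                fold[idx] = next % numFolds;   // deal instances out round-robin
                next++;
            }
        }
        return fold;
    }

    public static void main(String[] args) {
        // Hypothetical data: 30 positive and 70 negative sequences.
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 30; i++) labels.add("AMP");
        for (int i = 0; i < 70; i++) labels.add("nonAMP");

        int[] fold = assign(labels, 10, 42L);
        for (int f = 0; f < 10; f++) {
            int amp = 0;
            int non = 0;
            for (int i = 0; i < fold.length; i++) {
                if (fold[i] == f) {
                    if (labels.get(i).equals("AMP")) amp++; else non++;
                }
            }
            System.out.println("fold " + f + ": test AMP=" + amp + ", test nonAMP=" + non);
        }
    }
}
```

For fold f, the instances with fold index f form the test set and all remaining instances form the training set. With the hypothetical 30/70 data above, every fold holds 3 AMP and 7 nonAMP test instances, so each fold keeps the class ratio of the full dataset.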
Configuration to Run EFC:
For each training fold, or for the entire dataset when all of it is used for training, run the EFC algorithm using the EFC GP code and amp.params, with the correct Problem and parameters set.
Configuration to Run Feature Interpretation (Training and Testing):
The Hall of Fame, or the features generated as specified in the sequence params file, can be run through the interpreter to generate a machine learning file in libsvm format. For example, to run the NN269 interpreter:
Configuration to Run Feature Reduction:
Open the libsvm file in WEKA and save it as an ARFF file, then change the class variable from numeric to nominal, either manually or using the NumericToNominal filter.
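
As a rough illustration of what this conversion amounts to, the plain-Java sketch below turns libsvm lines into dense ARFF text whose class attribute is declared nominal rather than numeric. It is not WEKA's converter and not part of the project's code. The class name `LibsvmToArff`, the attribute names `f1`, `f2`, ..., and the relation name are hypothetical. It assumes each libsvm line has the form "label index:value ..." with 1-based indices, and it treats missing indices as 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustrative only: a hand-rolled libsvm-to-ARFF conversion, not WEKA's converter.
final class LibsvmToArff {

    /** Converts libsvm lines to dense ARFF text with a nominal class attribute. */
    public static String convert(List<String> libsvmLines, String relation) {
        int maxIndex = 0;
        TreeSet<String> classValues = new TreeSet<>(); // distinct labels, sorted
        List<String[]> rows = new ArrayList<>();

        // First pass: collect the class labels and the highest feature index.
        for (String line : libsvmLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            String[] tokens = trimmed.split("\\s+");
            classValues.add(tokens[0]);
            for (int t = 1; t < tokens.length; t++) {
                int idx = Integer.parseInt(tokens[t].split(":", 2)[0]);
                maxIndex = Math.max(maxIndex, idx);
            }
            rows.add(tokens);
        }

        StringBuilder out = new StringBuilder();
        out.append("@relation ").append(relation).append("\n\n");
        for (int i = 1; i <= maxIndex; i++) {
            out.append("@attribute f").append(i).append(" numeric\n");
        }
        // Listing the labels in braces is what makes the class nominal.
        out.append("@attribute class {").append(String.join(",", classValues)).append("}\n\n");
        out.append("@data\n");

        // Second pass: write one dense row per instance, class label last.
        for (String[] tokens : rows) {
            String[] values = new String[maxIndex];
            Arrays.fill(values, "0"); // features absent from the libsvm line
            for (int t = 1; t < tokens.length; t++) {
                String[] pair = tokens[t].split(":", 2);
                values[Integer.parseInt(pair[0]) - 1] = pair[1]; // 1-based to 0-based
            }
            List<String> cells = new ArrayList<>(Arrays.asList(values));
            cells.add(tokens[0]);
            out.append(String.join(",", cells)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical two-instance input with labels 1 and 0.
        System.out.print(convert(Arrays.asList("1 1:0.5 3:2", "0 2:1"), "demo"));
    }
}
```

The key line is the class declaration: `@attribute class {0,1}` marks the class as nominal, whereas a libsvm file loaded directly into WEKA presents it as numeric.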