Data and Configuration:
Pre-requisites
- Java (TM): http://www.oracle.com/technetwork/java/index.html
Base code is written in Java.
- Bio-Java (TM): http://biojava.org/wiki/Main_Page
Requires version 3.0.3 onwards. Bio-Java is used for sequence pattern matching and I/O.
- ECJ (TM): http://cs.gmu.edu/~eclab/projects/ecj
Requires version 20 onwards. ECJ is the base framework on which the GP-based feature construction algorithm, EFC, works.
- WEKA (TM): http://www.cs.waikato.ac.nz/~ml/weka
Requires version 3.7.* onwards. WEKA is used to run training and testing and to compute basic evaluation metrics such as auROC and auPRC. WEKA's attribute evaluators and search methods are also used for feature selection. The base sequences, together with their class labels, are split into 10 train/test samples using stratified sampling in WEKA.
Datasets (FASTA format. All sequences encoded in GBMR4 alphabet.)
Notes on Running Experiments
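- Configuration to Run Dataset Splitting into
Training/Testing for Cross-Validation:
Load the sequence data with each sequence as a string
attribute and its class label as the class attribute,
then split it into 10 stratified folds using WEKA's
filter. The -F option selects which fold (1 to 10) is
output; adding -V outputs the remaining folds instead.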
weka.filters.supervised.instance.StratifiedRemoveFolds
-S 0 -N 10 -F 1
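To make the idea of stratified folds concrete, here is a
minimal sketch in plain Java. It is not WEKA's implementation
of StratifiedRemoveFolds; it only shows how instances can be
assigned to folds so that each fold keeps roughly the same
class proportions. The class name, method name and the labels
"AMP" and "nonAMP" are hypothetical, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class StratifiedFoldsSketch {

    /**
     * Assigns each instance to a fold in [0, numFolds) so that every
     * class is spread as evenly as possible across the folds.
     */
    public static int[] assignFolds(List<String> labels, int numFolds, long seed) {
        // Group instance indices by class label.
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            byClass.computeIfAbsent(labels.get(i), k -> new ArrayList<>()).add(i);
        }
        Random rng = new Random(seed);
        int[] fold = new int[labels.size()];
        int next = 0; // running counter, dealt round-robin over the folds
        for (List<Integer> indices : byClass.values()) {
            Collections.shuffle(indices, rng); // random order within each class
            for (int idx : indices) {
                fold[idx] = next % numFolds;
                next++;
            }
        }
        return fold;
    }

    public static void main(String[] args) {
        // Hypothetical data: 30 positive and 70 negative sequences.
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 30; i++) labels.add("AMP");
        for (int i = 0; i < 70; i++) labels.add("nonAMP");

        int[] fold = assignFolds(labels, 10, 0L);

        // Fold 0 acts as the test set; the other nine folds form the training set.
        int testPos = 0, testNeg = 0;
        for (int i = 0; i < fold.length; i++) {
            if (fold[i] == 0) {
                if (labels.get(i).equals("AMP")) testPos++; else testNeg++;
            }
        }
        System.out.println("fold 0: AMP=" + testPos + ", nonAMP=" + testNeg);
        // prints "fold 0: AMP=3, nonAMP=7"
    }
}
```

Each fold in turn serves as the test set while the remaining
nine are merged into the training set, which gives the 10
train/test samples described above.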
- Configuration to Run EFC:
For each training fold, or for the entire dataset when
all of it is used for training, run the EFC algorithm
using the EFC GP code, with amp.params set to the right
Problem and parameters.
java ec.Evolve -file amp.params
- Configuration to Run Feature Interpretation
(Training and Testing):
The Hall of Fame or features file, generated as specified
in the sequence params file, can be run through the
interpreter to produce a machine learning file in
libsvm format. For example, to run the AMP interpreter:
java
org.java.evolutionary.sequence.AMPSequenceFeatureInterpreter
C:\Research\Software\ECJ-Trunk-Latest\XiaoHallOfFameFeatures.txt
XiaoTraining.libsvm 1
C:\Research\datasets\Xiao\XiaoAMPTrain.fasta C:\Research\datasets\Xiao\XiaoAMPTest.fasta
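For readers unfamiliar with the libsvm format, the sketch
below shows how one instance is laid out on a line: the class
label, followed by index:value pairs with 1-based indices,
with zero-valued features omitted. This only illustrates the
file format; it is not the interpreter's code, and the class
name, method name and example values are hypothetical.

```java
public class LibsvmLineSketch {

    /** Formats one instance as a libsvm line: "<label> <index>:<value> ...". */
    public static String toLibsvmLine(int label, double[] features) {
        StringBuilder sb = new StringBuilder();
        sb.append(label);
        for (int i = 0; i < features.length; i++) {
            if (features[i] == 0.0) {
                continue; // sparse format: zero-valued features are left out
            }
            // libsvm feature indices start at 1, not 0.
            sb.append(' ').append(i + 1).append(':').append(features[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical instance: class 1, with features 1 and 3 set.
        System.out.println(toLibsvmLine(1, new double[] {1.0, 0.0, 1.0}));
        // prints "1 1:1.0 3:1.0"
    }
}
```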
- Configuration to Run Feature Reduction:
Open the libsvm file in WEKA and save it as an ARFF file,
then change the class attribute from numeric to nominal,
either manually or with the NumericToNominal filter.
Then run the filtered classifier on the training and
testing files:
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
-t /My_Training_File.libsvm.arff \
-T /My_Testing_File.libsvm.arff \
-F "weka.filters.supervised.attribute.AttributeSelection \
-E \"weka.attributeSelection.SymmetricalUncertAttributeSetEval\" \
-S \"weka.attributeSelection.FCBFSearch -N -1\"" \
-W weka.classifiers.functions.Logistic -- -R 1.0E-8 -M -1
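The evaluator used above scores attributes by symmetrical
uncertainty, which the FCBF search relies on to rank features
against the class and to drop redundant ones. As a rough
illustration of the measure itself, the sketch below computes
SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)) for two
discrete variables. It is a plain-Java sketch, not WEKA's
implementation, and the class and method names are
hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SymmetricalUncertaintySketch {

    /** Shannon entropy in bits of a discrete variable given its value counts. */
    private static double entropy(Map<String, Integer> counts, int n) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / n;
            h -= p * (Math.log(p) / Math.log(2.0));
        }
        return h;
    }

    /** Symmetrical uncertainty of two discrete variables, in the range [0, 1]. */
    public static double symmetricalUncertainty(int[] x, int[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("x and y must be non-empty and of equal length");
        }
        int n = x.length;
        Map<String, Integer> cx = new HashMap<>();
        Map<String, Integer> cy = new HashMap<>();
        Map<String, Integer> cxy = new HashMap<>();
        for (int i = 0; i < n; i++) {
            cx.merge(String.valueOf(x[i]), 1, Integer::sum);
            cy.merge(String.valueOf(y[i]), 1, Integer::sum);
            cxy.merge(x[i] + "," + y[i], 1, Integer::sum); // joint value counts
        }
        double hx = entropy(cx, n);
        double hy = entropy(cy, n);
        double hxy = entropy(cxy, n);
        double denom = hx + hy;
        if (denom == 0.0) {
            return 0.0; // both variables are constant, so there is nothing to share
        }
        // H(X) + H(Y) - H(X, Y) is the mutual information between X and Y.
        return 2.0 * (hx + hy - hxy) / denom;
    }

    public static void main(String[] args) {
        int[] classLabel  = {0, 0, 1, 1};
        int[] informative = {0, 0, 1, 1}; // identical to the class: SU close to 1
        int[] unrelated   = {0, 1, 0, 1}; // independent of the class: SU close to 0
        System.out.println(symmetricalUncertainty(informative, classLabel));
        System.out.println(symmetricalUncertainty(unrelated, classLabel));
    }
}
```

A feature with SU near 1 against the class is highly
predictive, while a feature with SU near 0 carries little
information about it.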