Install and Run

Contents:
  Prerequisites
  Running EFFECT Algorithm
  Running Statistical Algorithm Tests
  Running Spectrum/KMer Tests
  Running Gibbs Sampling Tests
  Tuning Algorithms (Methodology and Parameters)
Prerequisites:
The following software is required to run EFFECT end-to-end:
1. Java (TM): http://www.oracle.com/technetwork/java/index.html
The base code is written in Java.
2. Bio-Java (TM): http://biojava.org/wiki/Main_Page
version 3.0.3 onwards. Bio-Java is used for sequence pattern
matching and I/O.
3. ECJ (TM): http://cs.gmu.edu/~eclab/projects/ecj/
version 20 onwards. ECJ is the base framework on which EFC,
the GP-based feature construction algorithm in EFFECT, runs.
4. JSTACS (TM): http://www.jstacs.de/index.php/Main_Page
version 2.1 onwards. JSTACS is used for the comparative
algorithms in the statistical experiments.
5. SHOGUN (TM): http://www.shogun-toolbox.org/
version 2.1 onwards. SHOGUN is used for running SVM-based
kernel methods such as Weighted Degree and Weighted Degree
with Shift.
6. WEKA (TM): http://www.cs.waikato.ac.nz/~ml/weka/
version 3.7.* onwards. WEKA is used for running
training/testing and computing basic evaluation metrics such
as auROC and auPRC. WEKA's feature selection and attribute
evaluation are also used for feature selection in EFFECT.
The base sequences with class labels are split into 10
train/test samples using stratified sampling in WEKA.
Running EFFECT Algorithm:
We give step-by-step instructions on how to run EFFECT:
Step 1: Perform stratified sampling of the sequence data
into 10 folds using WEKA's filter, with the sequences as a
string attribute and the class labels as the class attribute:
weka.filters.supervised.instance.StratifiedRemoveFolds -S 0 -N
10 -F 1
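For reference, a minimal command-line sketch of extracting one
train/test fold pair with this filter; the file names
input.arff, fold1-train.arff, and fold1-test.arff are
hypothetical, and -V inverts the selection so that the
remaining nine folds form the training set:
java weka.filters.supervised.instance.StratifiedRemoveFolds -S 0 -N 10 -F 1 -V -c last -i input.arff -o fold1-train.arff
java weka.filters.supervised.instance.StratifiedRemoveFolds -S 0 -N 10 -F 1 -c last -i input.arff -o fold1-test.arff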
Step 2: For each training fold (or on the entire data, when
all of it is used for training), run the EFC algorithm using
the EFC GP code and a sequence.params file that specifies the
right problem, the right parameters, etc.:
java ec.Evolve -file sequence.params
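As an illustration, a minimal sketch of what a sequence.params
file might contain; the parameter keys follow ECJ's standard
conventions, but the values and the problem class name below
are hypothetical, not the ones shipped with EFFECT:
# hypothetical ECJ parameter file for an EFC run
parent.0 = koza.params          # inherit ECJ's standard GP setup
generations = 50                # number of GP generations
pop.subpop.0.size = 1024        # population size
eval.problem = org.java.evolutionary.sequence.SequenceClassificationProblem
stat.file = $NN269Features.txt  # where the evolved features are written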
Step 3: The Hall of Fame, i.e. the features generated as
specified in the sequence.params file, can be run through the
interpreter to generate a machine-learning file in libsvm
format. For example, to run the NN269 interpreter:
java
org.java.evolutionary.sequence.NN269SequenceFeatureInterpreter
C:\Research\Software\ECJ-Trunk-Latest\NN269Features.txt
NN269Train.libsvm 1
C:\Research\datasets\SpliceData\NN269\splice.train-real.A
C:\Research\datasets\SpliceData\NN269\splice.train-false.A
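For orientation, each line of the generated libsvm-format file
encodes one sequence as a class label followed by sparse
<index>:<value> feature pairs; the values below are made up
purely to illustrate the format:
1 1:1.0 2:1.0 5:3.0
0 2:1.0 3:2.0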
Step 4: Using the same features file, run the testing fold
exactly as in Step 3, but with the positive and negative
testing files, to generate a testing machine-learning file.
For example:
java
org.java.evolutionary.sequence.NN269SequenceFeatureInterpreter
C:\Research\Software\ECJ-Trunk-Latest\NN269Features.txt
NN269Test.libsvm 1
C:\Research\datasets\SpliceData\NN269\splice.test-real.A
C:\Research\datasets\SpliceData\NN269\splice.test-false.A
Step 5: Load the training and testing files (the folds when
cross-validating, or the full train/test split when they are
separate) into WEKA. Open the libsvm files in the WEKA
Explorer and save them in WEKA's ARFF format. Manually, or
using the UI, change the class attribute from numeric to
nominal (categorical).
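The same conversion can also be scripted; a minimal sketch
using WEKA's standard converter and filter classes (the file
names are hypothetical):
java weka.core.converters.LibSVMLoader NN269Train.libsvm > NN269Train-numeric.arff
java weka.filters.unsupervised.attribute.NumericToNominal -R last -i NN269Train-numeric.arff -o NN269Train.arff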
Step 6: Run the training data and testing data through the
Explorer for evaluation using the meta-learner
AttributeSelectedClassifier. Evolutionary feature selection is
done using GeneticSearch, and fitness is evaluated using
CfsSubsetEval:
weka.classifiers.meta.AttributeSelectedClassifier -E
"weka.attributeSelection.CfsSubsetEval " -S
"weka.attributeSelection.GeneticSearch -Z 20 -G 20 -C 0.6 -M
0.033 -R 20 -S 1" -W weka.classifiers.bayes.NaiveBayes --
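The same evaluation can also be run from the command line; a
sketch assuming the ARFF files produced in Step 5 (the file
names are hypothetical; -t and -T are WEKA's standard
training- and test-file options):
java weka.classifiers.meta.AttributeSelectedClassifier -t NN269Train.arff -T NN269Test.arff -E "weka.attributeSelection.CfsSubsetEval" -S "weka.attributeSelection.GeneticSearch -Z 20 -G 20 -C 0.6 -M 0.033 -R 20 -S 1" -W weka.classifiers.bayes.NaiveBayes --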
Step 7: Note the auROC and auPRC from the evaluation output.
Step 8: For cross-validation, we repeat the above 10 times,
once for each training/testing fold drawn from the original
data, and report the mean auROC and mean auPRC.
Running Statistical Algorithm Tests:
To run the statistical tests, change the main() method to call
the test you want to perform and pass the arguments, e.g.:
java org.java.statistics.NN269StatisticalMethodsTest
C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta
C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta
C:\Research\datasets\SpliceData\NN269\splice.test-real.A
C:\Research\datasets\SpliceData\NN269\splice.test-false.A
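To make the "change the main()" step concrete, a hypothetical
sketch of the kind of edit involved; the method name
runMarkovModelTest is invented for illustration and is not
necessarily present in the codebase:
public class NN269StatisticalMethodsTest {
    public static void main(String[] args) {
        // Swap in the statistical test you want to perform, forwarding
        // the positive/negative training and testing file paths.
        runMarkovModelTest(args[0], args[1], args[2], args[3]);
    }

    // Hypothetical stand-in for one of the statistical tests.
    static void runMarkovModelTest(String trainPos, String trainNeg,
                                   String testPos, String testNeg) {
        System.out.println("Would train on " + trainPos + " / " + trainNeg
                + " and evaluate on " + testPos + " / " + testNeg);
    }
}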
Running Spectrum/KMer Tests:
To run the k-mer tests, run:
java org.java.featurebased.kmer.KMerMotifFeatureGenerator
C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta
C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta
NN269DonorKmer.libsvm 8
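For intuition, spectrum/k-mer features count the occurrences
of every length-k substring in each sequence (the trailing
argument 8 above plausibly corresponds to the maximum k-mer
length, matching the K = 1 to 8 range noted in the tuning
section below). A minimal, self-contained Java sketch of the
counting step; this is illustrative only, not the EFFECT
implementation:
import java.util.HashMap;
import java.util.Map;

// Illustrative k-mer counter: maps each length-k substring to its count.
public class KmerCountSketch {
    public static Map<String, Integer> countKmers(String sequence, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            String kmer = sequence.substring(i, i + k);
            counts.merge(kmer, 1, Integer::sum); // increment this k-mer's count
        }
        return counts;
    }

    public static void main(String[] args) {
        // Counting 2-mers in a short DNA string: "AC" occurs twice here.
        System.out.println(countKmers("ACGTAC", 2));
    }
}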
Running Gibbs Sampling Tests:
Gibbs sampling motif generation is run using:
java
org.java.featurebased.gibbs.GibbSamplingMotifFeatureGenerator
C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta
C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta
NN269AcceptorGibbs.libsvm 8
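For background, a compact, self-contained Java sketch of the
core Gibbs sampling loop for motif finding: hold one sequence
out, build a position weight matrix from the motif positions
in the others, then resample the held-out start position in
proportion to its likelihood. This is a generic textbook-style
sampler for illustration, not the EFFECT implementation:
import java.util.Arrays;
import java.util.Random;

public class GibbsMotifSketch {
    // Returns one sampled motif start position per sequence (motif width w).
    public static int[] sample(String[] seqs, int w, int iters, long seed) {
        Random rng = new Random(seed);
        int n = seqs.length;
        int[] starts = new int[n];
        for (int i = 0; i < n; i++)           // random initial motif positions
            starts[i] = rng.nextInt(seqs[i].length() - w + 1);
        for (int it = 0; it < iters; it++) {
            int held = rng.nextInt(n);        // hold one sequence out
            double[][] pwm = new double[w][4];
            for (double[] col : pwm) Arrays.fill(col, 1.0); // pseudocounts
            for (int i = 0; i < n; i++) {     // build PWM from the others
                if (i == held) continue;
                for (int j = 0; j < w; j++)
                    pwm[j][base(seqs[i].charAt(starts[i] + j))] += 1.0;
            }
            for (int j = 0; j < w; j++) {     // normalize each PWM column
                double tot = 0;
                for (double c : pwm[j]) tot += c;
                for (int b = 0; b < 4; b++) pwm[j][b] /= tot;
            }
            int m = seqs[held].length() - w + 1;
            double[] score = new double[m];
            double total = 0;
            for (int s = 0; s < m; s++) {     // likelihood of each candidate start
                double p = 1.0;
                for (int j = 0; j < w; j++)
                    p *= pwm[j][base(seqs[held].charAt(s + j))];
                score[s] = p;
                total += p;
            }
            double r = rng.nextDouble() * total; // resample proportionally
            int s = 0;
            while (s < m - 1 && (r -= score[s]) > 0) s++;
            starts[held] = s;
        }
        return starts;
    }

    private static int base(char c) {         // map nucleotide to 0..3
        switch (c) { case 'A': return 0; case 'C': return 1;
                     case 'G': return 2; default: return 3; }
    }

    public static void main(String[] args) {
        // Toy run: three short sequences sharing the planted motif ACGT.
        String[] seqs = { "ACGTACGT", "TTACGTAA", "GGACGTCC" };
        System.out.println(Arrays.toString(sample(seqs, 4, 200, 42)));
    }
}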
Tuning Algorithms (Methodology and Parameters):
- We took 1% of the training data, balanced 50-50 between the
two classes, as validation/evaluation data that we do not use
in the train-test or cross-validation runs. For each technique
we either used its well-known parameters, obtained by
contacting the original developers or by using their
manuscript as guidance, or used grid-based search for
parameter tuning.
- We also aimed for fair parameter usage. For example, in all
experiments we used K = 1 to 8, since going up to 8-mers gave
the best results in most cases; so the Gibbs sampling motifs,
the motif length in our own technique EFFECT, and the Weighted
Degree and Weighted Degree with Shift kernels all used the
same maximum length as a constraint.
- For SVM-based methods, we set the values of C, gamma, and
lambda either from past research on the same dataset (such as
NN269) or to the values that gave the best performance on the
validation dataset. We used grid search for tuning, with
sufficiently large ranges (e.g., C in [0.00001, 10] and
epsilon/gamma in [0.00001, 10]) and medium step sizes (0.1); a
sketch of this grid search is given after this list.
- For the statistical methods, there are various parameters to
choose, such as the stopping criterion, the epsilon of the
stopping criterion, etc. Each algorithm also has a choice of
Markov chain order, among others. We used some default
parameters after consulting experts from JSTACS. For some
elements, like the Markov chain order, we used the evaluation
data and chose 2nd or 4th order based on auROC.
- We have given the final parameters used for each experiment
as a table in the Supporting Information document accompanying
the paper.
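As referenced above, a minimal Java sketch of the kind of grid
search used for the SVM parameters; the Evaluator interface
and evaluate() call are hypothetical stand-ins for training an
SVM with the given (C, gamma) and scoring it on the validation
set (e.g., by auROC):
public class GridSearchSketch {
    // Hypothetical hook: train with (c, gamma) and return a validation score.
    interface Evaluator { double evaluate(double c, double gamma); }

    public static double[] search(Evaluator eval) {
        double bestC = 0, bestGamma = 0, bestScore = Double.NEGATIVE_INFINITY;
        // Sweep both parameters over [0.00001, 10] with step size 0.1,
        // as described in the tuning notes above.
        for (double c = 0.00001; c <= 10; c += 0.1) {
            for (double gamma = 0.00001; gamma <= 10; gamma += 0.1) {
                double score = eval.evaluate(c, gamma);
                if (score > bestScore) {      // keep the best-scoring pair
                    bestScore = score;
                    bestC = c;
                    bestGamma = gamma;
                }
            }
        }
        return new double[] { bestC, bestGamma };
    }

    public static void main(String[] args) {
        // Dummy evaluator with a known optimum near C = 1, gamma = 0.1.
        double[] best = search((c, g) -> -(Math.pow(c - 1, 2) + Math.pow(g - 0.1, 2)));
        System.out.println("best C = " + best[0] + ", best gamma = " + best[1]);
    }
}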