svmPRAT - Protein Residue Annotation Toolkit


The svmPRAT toolkit is a set of programs for building SVM-based models that annotate amino acid residues in protein sequences using user-supplied features (such as PSI-BLAST or PSIPRED profiles). In particular, the toolkit builds features from a window around each residue and provides a specialized kernel function (the normalized second-order exponential kernel, nsoe) alongside the standard SVM kernel functions.

AVAILABILITY and DOWNLOADS

svmPRAT is currently distributed in binary format, with executables for various architectures and distributions. svmPRAT and SVM_Light are linked with an optimized, machine-dependent BLAS library.

The following files can be downloaded within svmPRAT's distribution:
    EXECUTABLES
svmPRAT_SunOS-sun4u.zip   This contains the Sun Solaris binaries, optimized with and without the BLAS libraries, compiled on a Sun-Blade-1500 running Solaris.
svmPRAT_Linux-i686.zip   This contains the 32-bit Linux binaries, optimized with and without the BLAS libraries, compiled on an i686 Intel(R) Pentium(R) 4 CPU.
svmPRAT_Linux-x86-64.zip   This contains the 64-bit Linux binaries, optimized with and without the BLAS libraries, compiled on a Dual Core AMD Opteron(tm) Processor 270.
svmPRAT_Darwin-i386.zip   This contains the 32-bit Darwin (Mac OS X 10.5.3) binaries, optimized with and without the BLAS libraries, compiled on Intel Core Duo processors.
svmPRAT_MSWIN-x86.zip   This contains the 64-bit MS Windows binaries, compiled without the BLAS libraries on Intel Core 2 Duo processors using Cygwin and Microsoft Visual Studio. svmPRAT_Eval will be added to this folder soon. Note that Cygwin must be installed on your system to use these binaries.
    EXAMPLE FILES
TOY Data   The zip file provides a set of profiles (PSSMs) for 10 sequences used for training and another 10 sequences used for testing. True labels are provided for the training sequences; they indicate whether each residue is disordered or not.
Steps to check your binaries:
  1. Unzip toy_data.zip.
  2. Go to the folder toy_data.
  3. The files "train_10.lst" and "test_10.lst" list the training and test files. Files ending with the suffix "pssm.new" are the PSI-BLAST profile files, and files ending with the suffix "disnew" are the true labels for the disorder prediction problem.
  4. Run the programs:
     To run the learn program: /path-where-you-extracted-svmPRAT/svmPRAT_Learn1.0 train_10.lst model-name
     To run the prediction program: /path-where-you-extracted-svmPRAT/svmPRAT_Predict1.0 test_10.lst model-name prefix-name
     To run the evaluation program: /path-where-you-extracted-svmPRAT/svmPRAT_Eval1.0 train_10.lst
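The three toy-data commands can also be assembled from a script. The following is a minimal sketch, not part of the svmPRAT distribution: the helper names toy_commands and run_all, the install directory, and the model and prefix names are illustrative assumptions. Only the binary names and argument order come from the steps above.

```python
import subprocess
from pathlib import Path


def toy_commands(install_dir, model_name="toy.model", prefix="toy_pred"):
    """Build the learn, predict, and eval command lines for the toy data.

    install_dir, model_name, and prefix are placeholders chosen for this
    sketch; substitute the values that match your own setup.
    """
    bin_dir = Path(install_dir)
    learn = [str(bin_dir / "svmPRAT_Learn1.0"), "train_10.lst", model_name]
    predict = [str(bin_dir / "svmPRAT_Predict1.0"), "test_10.lst", model_name, prefix]
    evaluate = [str(bin_dir / "svmPRAT_Eval1.0"), "train_10.lst"]
    return [learn, predict, evaluate]


def run_all(commands, workdir="toy_data"):
    """Run each command inside the unzipped toy_data folder, stopping on failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=workdir, check=True)


# Building the commands does not execute anything; call run_all() to run them.
commands = toy_commands("/path-where-you-extracted-svmPRAT")
for cmd in commands:
    print(" ".join(cmd))
```

Calling run_all(commands) from the directory that contains toy_data would then execute the three programs in order.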

Command Line Interface for svmPRAT_Learn

Usage
svmPRAT_Learn [options] <input file> <model file>
Input
svmPRAT_Learn has two required parameters:
  1. input file: The input file provides the training data. The first line contains the weights for each of the feature types. Each subsequent line lists the feature files for one protein sequence used for training, in the same order as specified in the config file, followed by the name of the file containing the annotation labels as the last entry on the line. The example below uses PSI-BLAST and PSIPRED features, with a weight of 0.8 for the PSI-BLAST features and 0.2 for the PSIPRED features.
    Example for input file:

     
    0.8 0.2
    /usr/data/pssm/d101m__.pssm /usr/data/psipred/d101m__.psipred /usr/data/ann/d101m__.true
    /usr/data/pssm/d1aaa__.pssm /usr/data/psipred/d1aaa__.psipred /usr/data/ann/d1aaa__.true
    /usr/data/pssm/d1122__.pssm /usr/data/psipred/d1122__.psipred /usr/data/ann/d1122__.true
     

    Below is an example of an annotation file, which stores the true annotation, one line per sequence position, with the first line being the sequence identifier. The annotations can be strings, characters, or integers. svmPRAT maps the distinct annotations to classes to determine the number of classes.
    Example for annotation file:

     
    > /scratch/glu1/astral-1.69-dssps/d101m__.true
    L1
    L2
    L3
    L1
    L1
    L1
    L2
    L3
    

  2. model file: The file where the model parameters are stored. The path is extracted from the model file name, and that is the location where the "num element" one-versus-rest models will be stored.
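The two file formats described above are simple enough to generate and inspect with a short script. The sketch below is an illustration under stated assumptions: write_input_list, read_annotation, and label_mapping are hypothetical helpers, and the first-appearance label numbering is an assumption, since the mapping svmPRAT uses internally is not documented here.

```python
import os
import tempfile


def write_input_list(path, weights, rows):
    """Write a training list: a weight line, then one line per sequence.

    rows is a list of (feature_paths, annotation_path) pairs; each line holds
    the feature files followed by the annotation file as the last entry.
    """
    with open(path, "w") as fh:
        fh.write(" ".join(str(w) for w in weights) + "\n")
        for feature_paths, annotation_path in rows:
            if len(feature_paths) != len(weights):
                raise ValueError("one weight is expected per feature file")
            fh.write(" ".join(list(feature_paths) + [annotation_path]) + "\n")


def read_annotation(path):
    """Return (identifier, labels) from an annotation file."""
    with open(path) as fh:
        lines = [ln.strip() for ln in fh if ln.strip()]
    identifier = lines[0].lstrip(">").strip()
    return identifier, lines[1:]


def label_mapping(labels):
    """Number distinct labels in order of first appearance (an assumption)."""
    mapping = {}
    for lab in labels:
        mapping.setdefault(lab, len(mapping))
    return mapping


# Small demonstration written to a temporary directory.
demo_dir = tempfile.mkdtemp()
list_path = os.path.join(demo_dir, "train.lst")
write_input_list(
    list_path,
    [0.8, 0.2],
    [(["/usr/data/pssm/d101m__.pssm", "/usr/data/psipred/d101m__.psipred"],
      "/usr/data/ann/d101m__.true")],
)
print(open(list_path).read())
print(label_mapping(["L1", "L2", "L3", "L1"]))
```

The number of keys in the mapping corresponds to the number of classes, and therefore to the number of "num element" models described for the model file.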
Output
The output is a model file that describes the models that were learned and stored. It is also used as input to the svmPRAT_Predict program.
Options
-wmer=<integer>
Specifies the length of the wmer used to build the features for a residue. Features are generated from the wmer residues to the left, the wmer residues to the right, and the central residue, for a window of 2*wmer+1 residues. Default value is wmer=2.
-kernel={custom,linear,quad,soe,rbf}
Specifies the kernel to be used for svm-light. The possible values are:
custom      User-defined custom kernel.
linear      Linear dot product kernel.
quad        Quadratic kernel.
soe         Normalized second-order exponential kernel (Default).
rbf         Standard radial basis kernel function.
-c=<float>
The regularization parameter passed to SVM learning. It controls the trade-off between training error and margin. Default is 0.1.
-smer=<float>
Specifies the length (< wmer) up to which the residues around the central residue contribute all of their feature weights individually. The features of residues farther than smer but within wmer are averaged.
-usecr
Turns on the cost ratio flag. Use it when the classes are unevenly distributed. It enables a cost factor by which training errors on positive examples outweigh errors on negative examples (the factor defaults to 1 when the flag is off).
-cascade=<float>
Learns a cascaded two-level model, in which the features for the second-level model are derived from the predictions of a first-level model. The floating-point value sets the weight given to the first-level predictions in the second-level model.
-help
Prints the above help message.
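To make the -wmer and -smer options concrete, the sketch below builds a window encoding for one residue from a per-residue profile. It is an illustration only, not svmPRAT's internal code: the function name window_features, the zero padding at the sequence ends, the treatment of smer as an integer, and the flat concatenation of the vectors are all assumptions.

```python
def window_features(profile, i, wmer=2, smer=None):
    """Encode residue i from the rows of `profile` within a window of wmer.

    profile is a list of equal-length feature vectors, one per residue.
    Positions within smer of the center are kept individually; positions
    farther than smer but within wmer are averaged on each side.
    Positions outside the sequence are zero-padded (an assumption).
    """
    if smer is None:
        smer = wmer  # no averaging unless smer is given
    dim = len(profile[0])
    zero = [0.0] * dim

    def row(j):
        return profile[j] if 0 <= j < len(profile) else zero

    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    features = []
    if smer < wmer:
        # averaged left flank: distances smer+1 .. wmer from the center
        features.extend(mean([row(j) for j in range(i - wmer, i - smer)]))
    for j in range(i - smer, i + smer + 1):
        # individually encoded positions, including the central residue
        features.extend(row(j))
    if smer < wmer:
        # averaged right flank
        features.extend(mean([row(j) for j in range(i + smer + 1, i + wmer + 1)]))
    return features


# One-dimensional toy profile for five residues.
toy = [[1.0], [2.0], [3.0], [4.0], [5.0]]
print(window_features(toy, 2, wmer=2))          # full window around residue 2
print(window_features(toy, 2, wmer=2, smer=0))  # flanks averaged
```

With the default wmer=2 and no averaging, each residue is described by 2*2+1 = 5 feature vectors.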

Command Line Interface for svmPRAT_Predict

Usage
svmPRAT_Predict <test file> <model file> <prediction file>
Input
svmPRAT_Predict has three required parameters:
  1. test file: The test file provides the list of sequence features for which the annotation profiles are to be predicted. It is similar to the input file used in svmPRAT_Learn, except that it does not contain the names of the true annotation files, since those are what is being predicted.
  2. model file: The model file produced by svmPRAT_Learn.
  3. prediction file: The prediction file provides the path and the prefix of the output predictions, which are written as annotation profiles.
Output
The output consists of a set of predicted annotations and profiles; the profiles are the raw SVM predictions from each of the "num element" SVM models. An example of an output profile for an annotation with 16 elements is shown below:
Example of predicted annotation file for a protein:

 
000   14 -1.0004 -1.0592 -1.1488 -1.0352 -1.2810 -1.3322 -1.2483 -1.4799 -1.2799 -1.3312 -1.3516 -1.2172 -0.9993 -1.0002 +0.9993 -1.1071
001   14 -1.5512 -1.0308 -1.1452 -1.0157 -1.1897 -1.2397 -0.9997 -1.1835 -1.2996 -1.3314 -1.2970 -1.5036 -1.0006 -1.1929 +0.9997 -1.5617
002   14 -0.9995 -1.0573 -1.2493 -1.0456 -1.1514 -1.2316 -1.0008 -1.0006 -1.3156 -1.1864 -1.6694 -1.0069 -1.0898 -1.5569 +0.9999 -1.4992
003   14 -1.4105 -1.0994 -1.2708 -1.0072 -1.2759 -1.0005 -0.9995 -1.0840 -1.2864 -0.9999 -1.4743 -1.0155 -0.9995 -1.3479 +0.9995 -1.0651
004   11 -1.2225 -1.0916 -1.1872 -1.0281 -1.2961 -1.1643 -1.1118 -1.1245 -1.1537 -1.3704 -1.3907 +0.9999 -0.9996 -1.0004 -0.9999 -1.2140
005   12 -0.9995 -1.3154 -1.1994 -1.0214 -1.2783 -1.1586 -1.1875 -1.4170 -1.2909 -1.3497 -1.6379 -0.9993 +1.0001 -1.4284 -0.9998 -1.0981
006   12 -1.6333 -1.1797 -1.3248 -1.0424 -1.1269 -1.1318 -1.2340 -1.3809 -1.2517 -1.2218 -1.2112 -0.9995 +1.7395 -1.2302 -1.6457 -1.4433
007   12 -1.4013 -1.0892 -1.2256 -1.0161 -1.3483 -1.4000 -1.0078 -1.2499 -1.1480 -1.4589 -1.1845 -1.6315 +1.3110 -1.0090 -1.7057 -1.2312
008   12 -1.4578 -1.2840 -1.2500 -1.0495 -1.2260 -1.2836 -1.6059 -1.2687 -1.2792 -1.3255 -1.3207 -1.0484 +1.0001 -1.4365 -1.0002 -1.2322
009   12 -1.5083 -1.2800 -1.3130 -1.0532 -1.0515 -1.4708 -1.8131 -1.6142 -1.2192 -1.4026 -1.3504 -1.2876 +1.0007 -1.1374 -0.9997 -1.2064
010   12 -0.9998 -1.0003 -1.2860 -1.0487 -1.2841 -1.2959 -0.9997 -1.4049 -1.2458 -1.3730 -1.3588 -1.3606 +0.9995 -1.5204 -0.9994 -1.4352
011   12 -1.3758 -1.1462 -1.3136 -1.0445 -1.0247 -1.5192 -0.9995 -1.5453 -1.2707 -1.2739 -1.1607 -1.4221 +0.9999 -0.9998 -1.0007 -1.4906
012    6 -0.9994 -1.1934 -1.2947 -1.0490 -1.2755 -1.3801 +0.9994 -1.4202 -1.0704 -1.3127 -1.5102 -1.5656 -0.9995 -1.0937 -0.9994 -1.0534
013   14 -1.3359 -1.2518 -1.2288 -1.0437 -1.2852 -1.2509 -0.9997 -1.1385 -1.2780 -1.4604 -1.3463 -1.6658 -1.2496 -1.0711 +1.3739 -1.1819
014   15 -1.3467 -1.2115 -1.1412 -1.0403 -1.1839 -1.4935 -1.0000 -0.9993 -1.2805 -1.3932 -1.3356 -1.2321 -2.0749 -1.0000 -1.0001 +0.9993
015    7 -1.0002 -1.2452 -1.1169 -1.0453 -1.2380 -1.0668 -0.9995 +1.0000 -1.1621 -1.3070 -1.0004 -0.9993 -1.5969 -0.9998 -0.9998 -1.0000
016   14 -1.2752 -1.1088 -1.0005 -1.0451 -1.1178 -1.0702 -1.0005 -0.9997 -1.1890 -1.3662 -1.4341 -1.0004 -1.5173 -0.9997 +0.9995 -1.0825
........
.....

The first column gives the residue number. The second column gives the predicted annotation label. The remaining sixteen columns are the profile: the predictions from the 16 SVM models in this case.
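A prediction file in this layout can be read with a few lines of code. The sketch below uses hypothetical helper names (parse_profile_line, argmax) and copies two lines from the example above. In that example the predicted label coincides with the zero-based index of the highest-scoring column; that is an observation about the example, not a documented guarantee.

```python
def parse_profile_line(line):
    """Split one prediction line into (residue number, label, scores)."""
    fields = line.split()
    residue = int(fields[0])
    label = fields[1]  # kept as text, since annotations may be strings
    scores = [float(x) for x in fields[2:]]
    return residue, label, scores


def argmax(scores):
    """Zero-based index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Two lines copied from the example profile above.
lines = [
    "000   14 -1.0004 -1.0592 -1.1488 -1.0352 -1.2810 -1.3322 -1.2483 -1.4799 "
    "-1.2799 -1.3312 -1.3516 -1.2172 -0.9993 -1.0002 +0.9993 -1.1071",
    "004   11 -1.2225 -1.0916 -1.1872 -1.0281 -1.2961 -1.1643 -1.1118 -1.1245 "
    "-1.1537 -1.3704 -1.3907 +0.9999 -0.9996 -1.0004 -0.9999 -1.2140",
]

for line in lines:
    residue, label, scores = parse_profile_line(line)
    print(residue, label, len(scores), argmax(scores))
```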
Options
There are no optional parameters for now. All settings are determined by the parameters used with svmPRAT_Learn during training.

Command Line Interface for svmPRAT_Eval

Usage
svmPRAT_Eval [options] <input file>
Input
svmPRAT_Eval has one required parameter:
  1. input file: The input file provides the training data for performing cross-validation and contains the list of features for each protein sequence. It is identical to the input provided to svmPRAT_Learn.
Options
-nfolds=<integer>
Specifies the number of cross-validation folds used in the evaluation mode. Default value is nfolds=5.
All parameters available in svmPRAT_Learn can be used in this mode as well. svmPRAT_Eval can be used to search for the best set of parameters using the provided example script, cv_script.pl.
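A parameter search of this kind amounts to running svmPRAT_Eval once per combination of option values. The sketch below only builds the command lines and is not a replacement for cv_script.pl, whose behavior is not described here. The function name eval_commands and the candidate values are illustrative assumptions; the option names are the ones documented above.

```python
from itertools import product


def eval_commands(input_file, wmers, cs, kernels, nfolds=5, binary="svmPRAT_Eval"):
    """Build one svmPRAT_Eval command line per (wmer, c, kernel) combination."""
    commands = []
    for wmer, c, kernel in product(wmers, cs, kernels):
        commands.append([
            binary,
            f"-nfolds={nfolds}",
            f"-wmer={wmer}",
            f"-c={c}",
            f"-kernel={kernel}",
            input_file,
        ])
    return commands


# Candidate values chosen only for illustration.
grid = eval_commands("train_10.lst", wmers=[2, 5], cs=[0.1, 1.0], kernels=["soe", "rbf"])
for cmd in grid:
    print(" ".join(cmd))
```

Each command could then be passed to subprocess.run, with the reported cross-validation results compared across combinations.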

Contact Information

If you have any questions or problems with svmPRAT, please send an email to rangwala@cs.gmu.edu.

Citing svmPRAT

In citing svmPRAT in your papers, please use the following reference:

"svmPRAT: SVM-based Protein Residue Annotation Toolkit". Huzefa Rangwala, Christopher Kauffman and George Karypis. (Under Review)
"A kernel framework for protein residue annotation". Huzefa Rangwala, Christopher Kauffman and George Karypis. Proceedings of the 2009 PAKDD conference, Bangkok, Thailand.

Copyright and License Information

svmPRAT is primarily written and maintained by Huzefa Rangwala (George Mason University) and is copyrighted by George Mason University. It can be freely used for educational and research purposes by non-profit institutions and US government agencies only. Other organizations are allowed to use svmPRAT only for evaluation purposes, and any further uses will require prior approval.

The software may not be sold or redistributed without prior approval. One may make copies of the software for their own use, provided that the copies are not sold or distributed and are used under the same terms and conditions.

As unestablished research software, this code is provided on an "as is" basis without warranty of any kind, either expressed or implied. Downloading or executing any part of this software constitutes an implicit agreement to these terms. These terms and conditions are subject to change at any time without prior notice.