LSH-Div: Species Diversity Estimation using Locality Sensitive Hashing
(Supplementary Paper: HERE. )

 

The LSH-Div algorithm groups sequences into Operational Taxonomic Units (OTUs) using a locality-sensitive hashing (LSH) function within a greedy, iterative clustering framework. After assigning the sequences in a sample to OTUs (or clusters), LSH-Div reports standard species richness metrics such as the Chao1 index, the Shannon diversity index, and the Abundance-based Coverage Estimator (ACE) index.

Availability and Implementation


LSH-Div is currently distributed as a set of Python scripts. The source code is available under the GNU GPL license.

LSHDIV_SourceCode.zip Contains the Python scripts for the LSH-Div algorithm
LSHDIV_DataFiles.zip Contains the data files used in the LSH-Div paper

Description of Source Code


LSHDIV_SourceCode.zip contains the following scripts.

StatsFasta.py Displays statistics for a fasta file: number of sequences, minimum sequence length, maximum sequence length, and the mean and standard deviation of the sequence lengths.
FilterFasta.py Generates an output fasta file containing only the sequences whose lengths fall within a given minimum and maximum.
EqualLengthFasta.py Generates an output fasta file of equal-length sequences. Every sequence in the output file has the same length, equal to the minimum sequence length in the input file.
LSHDIV.py Estimates the OTUs in a given sample and reports standard species richness metrics.

How To Use


Here are some examples of how to use the LSH-Div scripts.

StatsFasta.py

Usage: StatsFasta.py -i <inputfile.fasta>

Input: <inputfile.fasta> is any sequence file in fasta format

Output: Displays statistics about the fasta file

Reading time: 1.17 seconds
Number of Sequences: 55592
Minimum Sequence Length: 53
Maximum Sequence Length: 100
Mean Sequence Length: 61
Standard Deviation: 2

Done
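
The statistics listed above can be sketched in a few lines of Python. This is only an illustration, not the code in StatsFasta.py: the helper names read_fasta and length_stats are hypothetical, and the use of the population standard deviation is an assumption, since the page does not say which variant the script reports.

```python
import io
import statistics


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_stats(records):
    """Summarise sequence lengths for a non-empty list of FASTA records."""
    lengths = [len(seq) for _, seq in records]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        # Assumption: population (not sample) standard deviation.
        "stdev": statistics.pstdev(lengths),
    }


# Tiny in-memory example with sequence lengths 4, 8 and 6.
demo = io.StringIO(">s1\nACGT\n>s2\nACGTAC\nGT\n>s3\nACGTAC\n")
stats = length_stats(list(read_fasta(demo)))
print(stats)
```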

FilterFasta.py

Usage: FilterFasta.py -i <inputfile.fasta> -o <outputfile> -l <min length> -u <max length>

Input: <inputfile.fasta> is any sequence file in fasta format
<outputfile> is the name of your output file (any name is allowed)
<min length> is the minimum length of a sequence in the filtered output file
<max length> is the maximum length of a sequence in the filtered output file

Output: Generates a fasta file that contains the sequences with lengths greater than <min length> and less than <max length>

Number of Sequences: 55592
Minimum Sequence Length: 53
Maximum Sequence Length: 100
Mean Sequence Length: 61
Standard Deviation: 2

Writing Output to a fasta file

Writing Time: 1.21 seconds

Number of Sequences: 372
Minimum Sequence Length: 70
Maximum Sequence Length: 100
Mean Sequence Length: 72
Standard Deviation: 5

Done
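
The filtering step can be illustrated as follows. This is a hedged sketch rather than the logic of FilterFasta.py itself: filter_by_length and write_fasta are hypothetical helpers, records are assumed to be (header, sequence) tuples, and treating both bounds as inclusive is an assumption, because the page does not settle whether sequences of exactly <min length> or <max length> are kept.

```python
import io


def filter_by_length(records, min_len, max_len):
    """Keep records whose sequence length lies in [min_len, max_len].

    Assumption: both bounds are inclusive.
    """
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


def write_fasta(records, handle):
    """Write records as FASTA, one unwrapped sequence line per record."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")


# Three toy records with lengths 2, 4 and 6.
records = [("a", "AC"), ("b", "ACGT"), ("c", "ACGTAC")]
kept = filter_by_length(records, 4, 6)

buffer = io.StringIO()
write_fasta(kept, buffer)
print(buffer.getvalue(), end="")
```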

EqualLengthFasta.py

Usage: EqualLengthFasta.py -i <inputfile.fasta> -o <outputfile>

Input: <inputfile.fasta> is any sequence file in fasta format
<outputfile> is the name of your output file (any name is allowed)

Output: Generates a fasta file that contains sequences of equal length. The common length is the minimum sequence length in <inputfile.fasta>

Number of Sequences: 372
Minimum Sequence Length: 70
Maximum Sequence Length: 100
Mean Sequence Length: 72
Standard Deviation: 5


Writing Fasta File for Equal Length

Writing Time: 0.14 seconds

Number of Sequences: 372
Minimum Sequence Length: 70
Maximum Sequence Length: 70
Mean Sequence Length: 70
Standard Deviation: 0

Done
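
One simple way to produce equal-length sequences is to cut every sequence down to the shortest length in the file. The sketch below assumes that the leading characters are kept (a prefix truncation). The page only states the resulting length, so where EqualLengthFasta.py actually trims is not confirmed, and the function name truncate_to_min_length is hypothetical.

```python
def truncate_to_min_length(records):
    """Trim every sequence to the length of the shortest one.

    Assumption: the first min_len characters of each sequence are kept.
    """
    if not records:
        return []
    min_len = min(len(seq) for _, seq in records)
    return [(header, seq[:min_len]) for header, seq in records]


# Toy records with lengths 6, 4 and 5; all become length 4.
records = [("a", "ACGTAC"), ("b", "ACGT"), ("c", "TTGCA")]
equal = truncate_to_min_length(records)
print(equal)
```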

LSHDIV.py

Usage: LSHDIV.py -i <inputfile.fasta> -o <outputfile> -l <min length> -s <num sampled indices> -w <wmer length> -p <percentage mismatch> -n <num iterations>

Input: <inputfile.fasta> is any fasta file whose sequences all have the same length
<outputfile> is the name of your output file (any name is allowed)
<min length> is the minimum length of the sequence
<num sampled indices> is the number of indices sampled from the sequences. This number must be less than or equal to <min length>
<wmer length> is the length of the w-mer taken at each sampled index
<percentage mismatch> is the percentage of mismatches allowed between sequences assigned to the same OTU
<num iterations> is the number of iterations the LSH-Div algorithm runs, each with a different set of sampled indices

Output: <outputfile_log.txt> is a log file containing all the parameters, the number of OTUs, the species richness metrics, and other statistics
<outputfile_OTUFasta> is a fasta file in which each sequence tag contains its OTU label
<outputfile_OTULabels> is an index file containing the OTU labels. The number in row one is the OTU label of sequence one in the input file.
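
To show how the parameters above could interact, here is a minimal sketch of greedy, iterative clustering with a sampling-based hash. It is an illustration under stated assumptions, not the algorithm as implemented in LSHDIV.py. The hash key is assumed to be the concatenation of the w-mers starting at the sampled indices, candidates that share a key are compared against a representative using the mismatch percentage, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def sample_indices(seq_len, num_indices, wmer_len, rng):
    """Pick distinct start positions such that every w-mer fits in the sequence."""
    return sorted(rng.sample(range(seq_len - wmer_len + 1), num_indices))


def lsh_key(seq, indices, wmer_len):
    """Assumed hash: concatenate the w-mers found at the sampled positions."""
    return "".join(seq[i:i + wmer_len] for i in indices)


def mismatch_pct(a, b):
    """Percentage of positions at which two equal-length sequences differ."""
    return 100.0 * sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_lsh_cluster(seqs, num_indices, wmer_len, max_mismatch_pct,
                       num_iterations, seed=0):
    """Return one OTU label per sequence (the index of its representative)."""
    rng = random.Random(seed)
    seq_len = len(seqs[0])
    reps = list(range(len(seqs)))    # every sequence starts as its own OTU
    labels = list(range(len(seqs)))
    for _ in range(num_iterations):
        indices = sample_indices(seq_len, num_indices, wmer_len, rng)
        buckets = defaultdict(list)
        for r in reps:
            buckets[lsh_key(seqs[r], indices, wmer_len)].append(r)
        merged_into, new_reps = {}, []
        for members in buckets.values():
            centers = []
            for r in members:
                # Greedy step: join the first close-enough center in the bucket.
                for c in centers:
                    if mismatch_pct(seqs[r], seqs[c]) <= max_mismatch_pct:
                        merged_into[r] = c
                        break
                else:
                    centers.append(r)
            new_reps.extend(centers)
        labels = [merged_into.get(lab, lab) for lab in labels]
        reps = new_reps
    return labels


seqs = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTTTTTTT"]
labels = greedy_lsh_cluster(seqs, num_indices=3, wmer_len=2,
                            max_mismatch_pct=10, num_iterations=5)
print(labels)
```

In this sketch only the surviving representatives are re-hashed in each iteration, so later iterations can merge OTUs that a previous set of sampled indices happened to separate.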

Reading time: 0.04 seconds
Mean Sequence Length: 70
Standard Deviation: 0
Initializing LSH-Div
Clustering Sequences and Estimating OTUs

Total Number of Sequences: 372
Number of OTUs in Iteration 1 is 214
Time taken by LSH-Div is 0.65 seconds
Total Time upto iteration 1 is 0.65 seconds
Number of Singleton OTUs are 177
Number of Doubleton OTUs are 16
Chao1 Estimate is 1130.24
Chao1 LCI 95% is 747.37
Chao1 UCI 95% is 1787.93
Shannon Index is 4.87
Shannon LCI 95% is 4.74
Shannon UCI 95% is 5.00
ACE is 2088.20

Writing Output to a fasta file

Done
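
The richness figures in the log can be related to the OTU counts with the standard formulas. The sketch below uses the bias-corrected form of Chao1, S + F1(F1 - 1) / (2(F2 + 1)), which reproduces the Chao1 estimate in the sample log from 214 OTUs, 177 singletons, and 16 doubletons. That match suggests, but does not confirm, that LSHDIV.py uses this form. The Shannon function assumes the natural logarithm, and both function names are hypothetical.

```python
import math


def chao1_bias_corrected(num_otus, singletons, doubletons):
    """Bias-corrected Chao1: S + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    return num_otus + singletons * (singletons - 1) / (2 * (doubletons + 1))


def shannon_index(otu_sizes):
    """Shannon index H = -sum(p * ln p) over the relative OTU abundances.

    Assumption: natural logarithm.
    """
    total = sum(otu_sizes)
    return -sum((n / total) * math.log(n / total) for n in otu_sizes if n > 0)


# Counts taken from the sample log above.
estimate = chao1_bias_corrected(214, 177, 16)
print(round(estimate, 2))  # 1130.24

# Two equally abundant OTUs give H = ln 2.
print(shannon_index([5, 5]))
```

The Shannon index in the log cannot be recomputed here, because it depends on the full list of OTU sizes, which the log does not show.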

Supplementary Paper


The supplementary paper for LSH-Div can be downloaded from here. It contains results that are not included in the main paper.


Copyright and License Information


The source code for LSH-Div is available under the GNU General Public License (GNU GPL).


© COPYRIGHT 2012 by Zeehasham Rasheed and Huzefa Rangwala (George Mason University). All Rights Reserved

 
