Bioinformatics Journal

Bioinformatics - RSS feed of current issue
  • String graph construction using incremental hashing
    [Dec 2014]

    Motivation: New sequencing technologies generate ever larger amounts of short-read data at decreasing cost. De novo sequence assembly is the problem of combining these reads back into the original genome sequence, without relying on a reference genome. This presents algorithmic and computational challenges, especially for long and repetitive genome sequences. Most existing approaches to the assembly problem operate in the framework of de Bruijn graphs. Yet, a number of recent works use the string graph paradigm, with a variety of methods for storing and processing suffixes and prefixes, such as suffix arrays, the Burrows–Wheeler transform or the FM-index. Our work is motivated by a search for new approaches to constructing the string graph, using alternative yet simple data structures and algorithmic concepts.

    Results: We introduce a novel hash-based method for constructing the string graph. We use incremental hashing, specifically a modification of the Karp–Rabin fingerprint, together with Bloom filters. These probabilistic methods may create false-positive and false-negative edges during the algorithm’s execution, but all such errors are detected and corrected. The advantages of the proposed approach over existing methods are its simplicity and its incorporation of established probabilistic techniques in the context of de novo genome sequencing. Our preliminary implementation compares favorably with the first string graph construction of Simpson and Durbin (2010) (but not with subsequent improvements). Further research and optimizations will hopefully enable the algorithm to be incorporated, with noticeable performance improvement, into state-of-the-art string graph-based assemblers.
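    The incremental hashing ingredient can be illustrated with a minimal Karp–Rabin rolling hash over DNA strings: the fingerprint of each k-mer is derived from the previous one in constant time. The sketch below is illustrative, not the authors' implementation:

```python
# Minimal Karp-Rabin rolling hash over DNA strings (illustrative sketch).
# The hash of s[i+1 : i+k+1] is derived from the hash of s[i : i+k] in O(1).

BASE = 4
MOD = (1 << 61) - 1  # large Mersenne prime keeps collisions rare
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def initial_hash(s, k):
    """Hash of the first k-mer of s."""
    h = 0
    for c in s[:k]:
        h = (h * BASE + CODE[c]) % MOD
    return h

def roll(h, out_char, in_char, k):
    """Slide the window one position: drop out_char, append in_char."""
    h = (h - CODE[out_char] * pow(BASE, k - 1, MOD)) % MOD
    return (h * BASE + CODE[in_char]) % MOD

def all_kmer_hashes(s, k):
    """Hashes of every k-mer of s, computed incrementally."""
    h = initial_hash(s, k)
    hashes = [h]
    for i in range(len(s) - k):
        h = roll(h, s[i], s[i + k], k)
        hashes.append(h)
    return hashes
```

    Fingerprints computed this way can be inserted into a Bloom filter to test, probabilistically, whether one read's suffix matches another's prefix; filter hits must then be verified against the actual strings, mirroring the detect-and-correct step described above.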

    Availability and implementation: A beta version of all source code used in this work can be downloaded from http://www.cs.tau.ac.il/~bchor/StringGraph/

    Contact: ilanbb@gmail.com or benny@cs.tau.ac.il

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Merging of multi-string BWTs with applications
    [Dec 2014]

    Motivation: The throughput of genomic sequencing has increased to the point that it is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation in which large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows–Wheeler Transform (BWT), which provides the side benefit of enabling searches for arbitrary k-mers within the raw data, as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis.

    Results: We present a novel algorithm that merges multi-string BWTs in O(LCS × N) time, where LCS is the length of the longest common substring between any of the inputs and N is the total length of all inputs combined (number of symbols), using O(N × log2(F)) bits, where F is the number of multi-string BWTs merged. The merged multi-string BWT is also shown to be more compressible than the input multi-string BWTs taken separately. Additionally, we explore some uses of a merged multi-string BWT for bioinformatics applications.
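    As background, the multi-string BWT being merged can be described by a naive construction: append a distinct terminator to each string, sort all suffixes (terminator ties broken by string index), and record the character preceding each suffix. The quadratic-time Python sketch below is for illustration only; the paper's contribution is merging precomputed BWTs far more efficiently, and practical tools never build the transform this way:

```python
# Naive multi-string BWT construction (illustration only; real tools such
# as msbwt avoid materializing all suffixes). Each string gets a "$"
# terminator so suffixes from different strings stay distinct; "$" sorts
# before A/C/G/T, with ties broken by string index, a common convention.

def multi_string_bwt(strings):
    suffixes = []
    for idx, s in enumerate(strings):
        t = s + "$"
        for i in range(len(t)):
            # (suffix text, string index, preceding character);
            # t[i - 1] wraps to the terminator when i == 0
            suffixes.append((t[i:], idx, t[i - 1]))
    suffixes.sort()
    return "".join(prev for _, _, prev in suffixes)
```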

    Availability and implementation: The MSBWT package is available through PyPI with source code located at https://code.google.com/p/msbwt/.

    Contact: holtjma@cs.unc.edu

    Categories: Journal Articles
  • Quantifying tumor heterogeneity in whole-genome and whole-exome sequencing data
    [Dec 2014]

    Motivation: Most tumor samples are a heterogeneous mixture of cells, including admixture by normal (non-cancerous) cells and subpopulations of cancerous cells with different complements of somatic aberrations. This intra-tumor heterogeneity complicates the analysis of somatic aberrations in DNA sequencing data from tumor samples.

    Results: We describe an algorithm called THetA2 that infers the composition of a tumor sample—including not only tumor purity but also the number and content of tumor subpopulations—directly from both whole-genome (WGS) and whole-exome (WXS) high-throughput DNA sequencing data. This algorithm builds on our earlier Tumor Heterogeneity Analysis (THetA) algorithm in several important directions, including an improved ability to analyze highly rearranged genomes from a variety of data types: both WGS (including low ~7× coverage) and WXS data. We apply the improved THetA2 algorithm to WGS (including low-pass) and WXS sequence data from 18 samples from The Cancer Genome Atlas (TCGA). We find that the improved algorithm is substantially faster and identifies numerous tumor samples containing subclonal populations in the TCGA data, including one highly rearranged sample for which other tumor purity estimation algorithms were unable to estimate tumor purity.

    Availability and implementation: An implementation of THetA2 is available at http://compbio.cs.brown.edu/software

    Contact: layla@cs.brown.edu or braphael@brown.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • KmerStream: streaming algorithms for k-mer abundance estimation
    [Dec 2014]

    Motivation: Several applications in bioinformatics, such as genome assemblers and error-correction methods, rely on counting and keeping track of k-mers (substrings of length k). Histograms of k-mer frequencies can give valuable insight into the underlying distribution and indicate the error rate and the genome size sampled in the sequencing experiment.

    Results: We present KmerStream, a streaming algorithm for estimating the number of distinct k-mers present in high-throughput sequencing data. The algorithm runs in time linear in the size of the input, and its space requirement is logarithmic in the size of the input. We derive a simple model that allows us to estimate the error rate of the sequencing experiment, as well as the genome size, using only the aggregate statistics reported by KmerStream.

    As an application, we show how KmerStream can be used to compute the error rate of a DNA sequencing experiment. We run KmerStream on a set of 2656 whole-genome-sequenced individuals and compare the error rate to the quality values reported by the sequencing equipment. We discover that while the quality values alone are largely reliable as a predictor of error rate, there is considerable variability in the error rates between sequencing runs, even when accounting for the reported quality values.
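    The distinct-k-mer estimation at the heart of this approach can be illustrated with a generic streaming sketch. The K-minimum-values (KMV) estimator below keeps only the smallest hash values seen and estimates the distinct count from the k-th smallest; this is a standard streaming technique shown for illustration, not KmerStream's exact algorithm:

```python
# K-minimum-values (KMV) sketch: estimate the number of distinct items in
# a stream using small, fixed memory (a generic streaming estimator, not
# KmerStream's algorithm).

import hashlib
import heapq

class KMVEstimator:
    def __init__(self, size=64):
        self.size = size
        self.heap = []    # max-heap (negated values) of the smallest hashes
        self.seen = set() # hash values currently held in the heap

    def _hash(self, item):
        # map the item to a pseudo-uniform float in [0, 1)
        h = int.from_bytes(
            hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        return h / 2**64

    def add(self, item):
        x = self._hash(item)
        if x in self.seen:
            return
        if len(self.heap) < self.size:
            heapq.heappush(self.heap, -x)
            self.seen.add(x)
        elif x < -self.heap[0]:
            # replace the current largest retained hash with x
            self.seen.discard(-heapq.heappushpop(self.heap, -x))
            self.seen.add(x)

    def estimate(self):
        if len(self.heap) < self.size:
            return len(self.heap)  # fewer distinct items than the sketch size
        # kth smallest hash h_k -> estimate (k - 1) / h_k
        return int((self.size - 1) / -self.heap[0])
```

    Feeding every k-mer of every read into such a sketch yields a distinct-k-mer estimate in memory independent of the data size, which is the essential property the streaming approach above exploits.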

    Availability and implementation: The tool KmerStream is written in C++ and is released under a GPL license. It is freely available at https://github.com/pmelsted/KmerStream

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Contact: pmelsted@hi.is or Bjarni.Halldorsson@decode.is

    Categories: Journal Articles
  • TIPP: taxonomic identification and phylogenetic profiling
    [Dec 2014]

    Motivation: Abundance profiling (also called ‘phylogenetic profiling’) is a crucial step in understanding the diversity of a metagenomic sample, and one of the basic techniques used for this is taxonomic identification of the metagenomic reads.

    Results: We present taxon identification and phylogenetic profiling (TIPP), a new marker-based taxon identification and abundance profiling method. TIPP combines SATé-enabled phylogenetic placement, a phylogenetic placement method, with statistical techniques to control classification precision and recall, resulting in improved abundance profiles. TIPP is highly accurate even in the presence of high indel error rates and novel genomes, and matches or improves on previous approaches, including NBC, mOTU, PhymmBL, MetaPhyler and MetaPhlAn.

    Availability and implementation: Software and supplementary materials are available at http://www.cs.utexas.edu/users/phylo/software/sepp/tipp-submission/.

    Contact: warnow@illinois.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Chimera: a Bioconductor package for secondary analysis of fusion products
    [Dec 2014]

    Summary: Chimera is a Bioconductor package that organizes, annotates, analyses and validates fusions reported by different fusion-detection tools; the current implementation can handle output from bellerophontes, chimeraScan, deFuse, fusionCatcher, FusionFinder, FusionHunter, FusionMap, mapSplice, Rsubread, tophat-fusion and STAR. The core of Chimera is a fusion data structure that can store fusion events detected with any of the aforementioned tools. Fusions can then be easily manipulated with standard R functions or through the set of functionalities developed specifically in Chimera to support the user in managing fusions and discriminating false-positive results.

    Availability and implementation: Chimera is implemented as a Bioconductor package in R. The package and the vignette can be downloaded at bioconductor.org.

    Contact: raffaele.calogero@unito.it

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • The Naked Mole Rat Genome Resource: facilitating analyses of cancer and longevity-related adaptations
    [Dec 2014]

    Motivation: The naked mole rat (Heterocephalus glaber) is an exceptionally long-lived and cancer-resistant rodent native to East Africa. Although its genome was previously sequenced, here we report a new assembly, sequenced by us, with substantially higher N50 values for scaffolds and contigs.

    Results: We analyzed the annotation of this new improved assembly and identified candidate genomic adaptations which may have contributed to the evolution of the naked mole rat’s extraordinary traits, including in regions of p53, and the hyaluronan receptors CD44 and HMMR (RHAMM). Furthermore, we developed a freely available web portal, the Naked Mole Rat Genome Resource (http://www.naked-mole-rat.org), featuring the data and results of our analysis, to assist researchers interested in the genome and genes of the naked mole rat, and also to facilitate further studies on this fascinating species.

    Availability and implementation: The Naked Mole Rat Genome Resource is freely available online at http://www.naked-mole-rat.org. This resource is open source and the source code is available at https://github.com/maglab/naked-mole-rat-portal.

    Contact: jp@senescence.info

    Categories: Journal Articles
  • Human structural proteome-wide characterization of Cyclosporine A targets
    [Dec 2014]

    Motivation: Off-target interactions of a popular immunosuppressant Cyclosporine A (CSA) with several proteins besides its molecular target, cyclophilin A, are implicated in the activation of signaling pathways that lead to numerous side effects of this drug.

    Results: Using the structural human proteome and ILbind, a novel algorithm for inverse ligand binding prediction, we determined a comprehensive set of 100+ putative partners of CSA. We empirically show that the predictive quality of ILbind is better than that of other available predictors for this compound. We linked the putative target proteins, which include many new partners of CSA, with cellular functions, canonical pathways and toxicities that are typical for patients who take this drug. We used complementary approaches (molecular docking, molecular dynamics, surface plasmon resonance binding analysis and enzymatic assays) to validate and characterize three novel CSA targets: calpain 2, caspase 3 and p38 MAP kinase 14. The three targets are involved in apoptotic pathways, are interconnected and are implicated in nephrotoxicity.

    Contact: lkurgan@ece.ualberta.ca

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • PrEMeR-CG: inferring nucleotide level DNA methylation values from MethylCap-seq data
    [Dec 2014]

    Motivation: DNA methylation is an epigenetic change occurring at genomic CpG sequences that contributes to the regulation of gene transcription in both normal and malignant cells. Next-generation sequencing has been used to characterize DNA methylation status at the genome scale, but suffers from high sequencing cost in the case of whole-genome bisulfite sequencing, or from reduced resolution (the inability to define precisely which CpGs are methylated) with capture-based techniques.

    Results: Here we present a computational method that computes nucleotide-resolution methylation values from capture-based data by incorporating fragment length profiles into a model of methylation analysis. We demonstrate that it compares favorably with nucleotide-resolution bisulfite sequencing and has better predictive power with respect to a reference than window-based methods, often used for enrichment data. The described method was used to produce the methylation data used in tandem with gene expression to produce a novel and clinically significant gene signature in acute myeloid leukemia. In addition, we introduce a complementary statistical method that uses this nucleotide-resolution methylation data for detection of differentially methylated features.
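    The fragment-length modeling idea can be illustrated with a toy sketch: each captured read implies a fragment whose length follows an empirical profile, and every CpG the fragment could reach receives the corresponding probability mass. The weighting scheme and function names below are assumptions for illustration, not the published PrEMeR-CG model:

```python
# Toy fragment-length-weighted CpG scoring (illustrative assumptions, not
# the published model). A fragment starting at a read position extends a
# random length L drawn from an empirical profile; each CpG at offset d
# receives P(L >= d), the probability the fragment covers it.

def methylation_weights(read_starts, cpg_positions, length_pmf):
    """length_pmf: dict mapping fragment length -> probability."""
    # precompute the tail distribution P(L >= d)
    max_len = max(length_pmf)
    tail = [0.0] * (max_len + 2)
    for d in range(max_len, 0, -1):
        tail[d] = tail[d + 1] + length_pmf.get(d, 0.0)
    weights = {pos: 0.0 for pos in cpg_positions}
    for start in read_starts:
        for pos in cpg_positions:
            d = pos - start + 1  # fragment must span at least d bases
            if 1 <= d <= max_len:
                weights[pos] += tail[d]
    return weights
```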

    Availability: Software in the form of Python and R scripts is available at http://bioserv.mps.ohio-state.edu/premer and is free for non-commercial use.

    Contact: pearlly.yan@osumc.edu; bundschuh@mps.ohio-state.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Frameshift alignment: statistics and post-genomic applications
    [Dec 2014]

    Motivation: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score.

    Results: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two ‘post-genomic’ applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
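    The core of frameshift alignment is a dynamic program in which a protein residue may consume 2, 3 or 4 nucleotides, with the non-3 widths penalized as frameshifts. Below is a minimal Python sketch of such a scoring DP (global alignment, toy scoring constants, truncated codon table); it illustrates the recurrence only and is not the FALP or LAST implementation:

```python
# Frameshift-aware DNA-to-protein alignment score (toy global DP).
# A residue normally consumes 3 nt; consuming 2 or 4 nt models a
# frameshift and incurs an extra penalty.

CODON = {"ATG": "M", "AAA": "K", "TTT": "F", "GGG": "G", "TAA": "*"}
MATCH, MISMATCH, GAP, FRAMESHIFT = 2, -1, -3, -4

def translate(codon):
    return CODON.get(codon, "X")

def frameshift_align(dna, protein):
    n, m = len(dna), len(protein)
    NEG = float("-inf")
    # dp[i][j]: best score aligning dna[:i] with protein[:j]
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = dp[i][j]
            if cur == NEG:
                continue
            if j < m:
                # normal codon: 3 nt -> 1 residue
                if i + 3 <= n:
                    s = MATCH if translate(dna[i:i+3]) == protein[j] else MISMATCH
                    dp[i+3][j+1] = max(dp[i+3][j+1], cur + s)
                # frameshifted codon: 2 or 4 nt -> 1 residue
                for w in (2, 4):
                    if i + w <= n:
                        dp[i+w][j+1] = max(dp[i+w][j+1], cur + MISMATCH + FRAMESHIFT)
                # gap in DNA: residue left unmatched
                dp[i][j+1] = max(dp[i][j+1], cur + GAP)
            # gap in protein: skip one codon of DNA
            if i + 3 <= n:
                dp[i+3][j] = max(dp[i+3][j], cur + GAP)
    return dp[n][m]
```

    A production method would also translate the frameshifted widths and use a full substitution matrix; the point here is only the 2/3/4-nucleotide consumption pattern that distinguishes frameshift alignment from ordinary translated alignment.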

    Availability and implementation: The statistical calculation is available in FALP (http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/index/software.html), and giga-scale frameshift alignment is available in LAST (http://last.cbrc.jp/falp).

    Contact: spouge@ncbi.nlm.nih.gov or martin@cbrc.jp

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Protein-protein binding affinity prediction from amino acid sequence
    [Dec 2014]

    Motivation: Protein–protein interactions play crucial roles in many biological processes and are responsible for smooth functioning of the machinery in living organisms. Predicting the binding affinity of protein–protein complexes provides deep insights to understand the recognition mechanism and identify the strong binding partners in protein–protein interaction networks.

    Results: In this work, we collected experimental binding affinity data for a set of 135 protein–protein complexes and analyzed the relationship between binding affinity and 642 properties obtained from amino acid sequence. We noticed that the overall correlation is poor and that the factors influencing affinity depend on the type of the complex, as defined by its function, molecular weight and binding site residues. Based on these results, we developed a novel methodology for predicting the binding affinity of protein–protein complexes from sequence-based features by classifying the complexes with respect to their function and predicted percentage of binding site residues. We developed regression models for the complexes belonging to different classes with three to five properties each, which showed correlations in the range of 0.739–0.992 in jack-knife tests. We suggest that our approach adds a new aspect of biological significance by classifying protein–protein complexes for affinity prediction.

    Availability and implementation: Freely available on the Web at http://www.iitm.ac.in/bioinfo/PPA_Pred/

    Contact: gromiha@iitm.ac.in

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Amplicon identification using SparsE representation of multiplex PYROsequencing signal (AdvISER-M-PYRO): application to bacterial resistance genotyping
    [Dec 2014]

    Motivation: Pyrosequencing is a cost-effective DNA sequencing technology that has many applications, including rapid genotyping of a broad spectrum of bacteria. When molecular typing requires genotyping multiple DNA stretches, several pyrosequencing primers can be used simultaneously, but this creates overlapping primer-specific signals that are visually uninterpretable. Accordingly, our objective was to develop a new signal-processing method (AdvISER-M-PYRO) to automatically analyze and interpret multiplex pyrosequencing signals. In parallel, the nucleotide dispensation order was improved by developing the SENATOR (‘SElecting the Nucleotide dispensATion Order’) algorithm.

    Results: In this proof-of-concept study, quintuplex pyrosequencing was applied to eight bacterial DNA samples, targeting genetic alterations underlying resistance to β-lactam antibiotics. Using a SENATOR-driven dispensation order, all genetic variants (31 of 31; 100%) were correctly identified with AdvISER-M-PYRO. Among nine expected negative results, there was only one false positive, which was tagged with an ‘unsafe’ label.

    Availability and implementation: SENATOR and AdvISER-M-PYRO are implemented in the AdvISER-M-PYRO R package (http://sites.uclouvain.be/md-ctma/index.php/softwares) and can be used to improve the dispensation order and to analyze multiplex pyrosequencing signals generated in a broad range of typing applications.

    Contact: jerome.ambroise@uclouvain.be

    Categories: Journal Articles
  • Limbform: a functional ontology-based database of limb regeneration experiments
    [Dec 2014]

    Summary: The ability of certain organisms to completely regenerate lost limbs is a fascinating process that is far from solved. Despite the extraordinary published efforts of scientists performing amputations, transplantations and molecular experiments over the past centuries, no mechanistic model yet exists that can completely explain patterning during the limb regeneration process. The lack of a centralized repository enabling efficient mining of this huge dataset is hindering the discovery of comprehensive models of limb regeneration. Here, we introduce Limbform (Limb formalization), a centralized database of published limb regeneration experiments. In contrast to natural language or text-based ontologies, Limbform is based on a functional ontology that uses mathematical graphs to represent limb phenotypes and manipulation procedures unambiguously. The centralized database currently contains >800 published limb regeneration experiments covering many model organisms, including salamanders, frogs, insects, crustaceans and arachnids. The database represents an extraordinary resource for mining the existing functional data in this field; furthermore, its mathematical nature, based on a functional ontology, will pave the way for artificial intelligence tools applied to the discovery of the sought-after comprehensive limb regeneration models.

    Availability and implementation: The Limbform database is freely available at http://limbform.daniel-lobo.com.

    Contact: michael.levin@tufts.edu

    Categories: Journal Articles
  • Multi-factor data normalization enables the detection of copy number aberrations in amplicon sequencing data
    [Dec 2014]

    Motivation: Because of its low cost, amplicon sequencing, also known as ultra-deep targeted sequencing, is now becoming widely used in oncology for detection of actionable mutations, i.e. mutations influencing cell sensitivity to targeted therapies. Amplicon sequencing is based on the polymerase chain reaction amplification of the regions of interest, a process that considerably distorts the information on copy numbers initially present in the tumor DNA. Therefore, additional experiments such as single nucleotide polymorphism (SNP) or comparative genomic hybridization (CGH) arrays often complement amplicon sequencing in clinics to identify copy number status of genes whose amplification or deletion has direct consequences on the efficacy of a particular cancer treatment. So far, there has been no proven method to extract the information on gene copy number aberrations based solely on amplicon sequencing.

    Results: Here we present ONCOCNV, a method that includes a multifactor normalization and annotation technique enabling the detection of large copy number changes from amplicon sequencing data. We validated our approach on high and low amplicon density datasets and demonstrated that ONCOCNV can achieve a precision comparable with that of array CGH techniques in detecting copy number aberrations. Thus, ONCOCNV applied on amplicon sequencing data would make the use of additional array CGH or SNP array experiments unnecessary.

    Availability and implementation: http://oncocnv.curie.fr/

    Contact: valentina.boeva@curie.fr

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • MindTheGap: integrated detection and assembly of short and long insertions
    [Dec 2014]

    Motivation: Insertions play an important role in genome evolution. However, such variants are difficult to detect from short-read sequencing data, especially when they exceed the paired-end insert size. Many approaches have been proposed to call short insertion variants based on paired-end mapping. However, there remains a lack of practical methods to detect and assemble long variants.

    Results: We propose here an original method, called MindTheGap, for the integrated detection and assembly of insertion variants from re-sequencing data. Importantly, it is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. MindTheGap uses an efficient k-mer-based method to detect insertion sites in a reference genome, and subsequently assemble them from the donor reads. MindTheGap showed high recall and precision on simulated datasets of various genome complexities. When applied to real Caenorhabditis elegans and human NA12878 datasets, MindTheGap detected and correctly assembled insertions >1 kb, using at most 14 GB of memory.
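    The k-mer-based detection step can be illustrated with a simplified sketch: collect the k-mers of the donor reads, then scan the reference for runs of exactly k-1 consecutive positions whose k-mers are absent from the donor, the signature a homozygous insertion leaves at its breakpoint. This toy version ignores sequencing errors, heterozygosity and repeats, and is not the MindTheGap implementation:

```python
# Simplified k-mer-based insertion-site detection (illustrative only).
# A homozygous insertion in the donor leaves a gap of exactly k-1
# consecutive reference positions whose spanning k-mers never occur in
# the donor reads.

def donor_kmers(reads, k):
    kmers = set()
    for r in reads:
        for i in range(len(r) - k + 1):
            kmers.add(r[i:i+k])
    return kmers

def insertion_sites(reference, reads, k):
    kmers = donor_kmers(reads, k)
    absent = [reference[i:i+k] not in kmers
              for i in range(len(reference) - k + 1)]
    sites = []
    i = 0
    while i < len(absent):
        if absent[i]:
            j = i
            while j < len(absent) and absent[j]:
                j += 1
            if j - i == k - 1:           # signature gap of an insertion
                sites.append(i + k - 1)  # breakpoint position in reference
            i = j
        else:
            i += 1
    return sites
```

    Once a breakpoint is located this way, local assembly of the donor reads anchored on either side of it can reconstruct the inserted sequence, which is the integrated detect-then-assemble design described above.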

    Availability and implementation: http://mindthegap.genouest.org

    Contact: guillaume.rizk@inria.fr or claire.lemaitre@inria.fr

    Categories: Journal Articles
  • Characterization of structural variants with single molecule and hybrid sequencing approaches
    [Dec 2014]

    Motivation: Structural variation is common in human and cancer genomes. High-throughput DNA sequencing has enabled genome-scale surveys of structural variation. However, the short reads produced by these technologies limit the study of complex variants, particularly those involving repetitive regions. Recent ‘third-generation’ sequencing technologies provide single-molecule templates and longer sequencing reads, but at the cost of higher per-nucleotide error rates.

    Results: We present MultiBreak-SV, an algorithm to detect structural variants (SVs) from single molecule sequencing data, paired read sequencing data, or a combination of sequencing data from different platforms. We demonstrate that combining low-coverage third-generation data from Pacific Biosciences (PacBio) with high-coverage paired read data is advantageous on simulated chromosomes. We apply MultiBreak-SV to PacBio data from four human fosmids and show that it detects known SVs with high sensitivity and specificity. Finally, we perform a whole-genome analysis on PacBio data from a complete hydatidiform mole cell line and predict 1002 high-probability SVs, over half of which are confirmed by an Illumina-based assembly.

    Availability and implementation: MultiBreak-SV is available at http://compbio.cs.brown.edu/software/.

    Contact: annaritz@vt.edu or braphael@cs.brown.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Detecting differential peaks in ChIP-seq signals with ODIN
    [Dec 2014]

    Motivation: Detection of changes in deoxyribonucleic acid (DNA)–protein interactions from ChIP-seq data is a crucial step in unraveling the regulatory networks behind biological processes. The simplest variation of this problem is the differential peak calling (DPC) problem. Here, one has to find genomic regions with ChIP-seq signal changes between two cellular conditions in the interaction of a protein with DNA. The great majority of peak calling methods can only analyze one ChIP-seq signal at a time and are unable to perform DPC. Recently, a few approaches based on the combination of these peak callers with statistical tests for detecting differential digital expression have been proposed. However, these methods fail to detect detailed changes of protein–DNA interactions.

    Results: We propose an One-stage DIffereNtial peak caller (ODIN); an Hidden Markov Model-based approach to detect and analyze differential peaks (DPs) in pairs of ChIP-seq data. ODIN performs genomic signal processing, peak calling and p-value calculation in an integrated framework. We also propose an evaluation methodology to compare ODIN with competing methods. The evaluation method is based on the association of DPs with expression changes in the same cellular conditions. Our empirical study based on several ChIP-seq experiments from transcription factors, histone modifications and simulated data shows that ODIN outperforms considered competing methods in most scenarios.

    Availability and implementation: http://costalab.org/wp/odin.

    Contact: ivan.costa@rwth-aachen.de

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips
    [Dec 2014]

    Motivation: Genomics is expanding from a single reference per species paradigm into a more comprehensive pan-genome approach that analyzes multiple individuals together. A compressed de Bruijn graph is a sophisticated data structure for representing the genomes of entire populations. It robustly encodes shared segments, simple single-nucleotide polymorphisms and complex structural variations far beyond what can be represented in a collection of linear sequences alone.

    Results: We explore deep topological relationships between suffix trees and compressed de Bruijn graphs and introduce an algorithm, splitMEM, that directly constructs the compressed de Bruijn graph in time and space linear to the total number of genomes for a given maximum genome size. We introduce suffix skips to traverse several suffix links simultaneously and use them to efficiently decompose maximal exact matches into graph nodes. We demonstrate the utility of splitMEM by analyzing the nine-strain pan-genome of Bacillus anthracis and up to 62 strains of Escherichia coli, revealing their core-genome properties.
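    As background, a compressed de Bruijn graph collapses every non-branching chain of k-mer nodes into a single node. The naive Python sketch below builds it directly from explicit k-mers (assuming an acyclic graph, for brevity); splitMEM's contribution is constructing the same graph via suffix trees with suffix skips instead:

```python
# Naive compressed de Bruijn graph over a pan-genome (illustration only;
# assumes no cycles; splitMEM builds this structure far more efficiently).

def debruijn_edges(genomes, k):
    """Map each (k-1)-mer node to its successor and predecessor sets."""
    succ, pred = {}, {}
    for g in genomes:
        for i in range(len(g) - k + 1):
            u, v = g[i:i+k-1], g[i+1:i+k]
            succ.setdefault(u, set()).add(v)
            pred.setdefault(v, set()).add(u)
            succ.setdefault(v, set())
            pred.setdefault(u, set())
    return succ, pred

def compress(genomes, k):
    """Merge non-branching chains; return the sorted compressed node labels."""
    succ, pred = debruijn_edges(genomes, k)

    def extendable(v):  # exactly one way in and one way out
        return len(succ[v]) == 1 and len(pred[v]) == 1

    starts = [v for v in succ if not extendable(v)]
    labels = []
    for v in starts:
        for nxt in succ[v]:
            label = v
            while extendable(nxt):
                label += nxt[-1]
                nxt = next(iter(succ[nxt]))
            label += nxt[-1]
            labels.append(label)
    return sorted(labels)
```

    In the test below, two strains share the prefix ACG and then diverge, so the compressed graph contains one shared node and two branch nodes, which is exactly how shared segments and variants are encoded in a pan-genome graph.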

    Availability and implementation: Source code and documentation are available open source at http://splitmem.sourceforge.net.

    Contact: mschatz@cshl.edu

    Supplementary information: Supplementary data are available at Bioinformatics online.

    Categories: Journal Articles
  • Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone
    [Dec 2014]

    Motivation: The landscape of structural variation (SV) including complex duplication and translocation patterns is far from resolved. SV detection tools usually exhibit low agreement, are often geared toward certain types or size ranges of variation and struggle to correctly classify the type and exact size of SVs.

    Results: We present Gustaf (Generic mUlti-SpliT Alignment Finder), a sound generic multi-split SV detection tool that detects and classifies deletions, inversions, dispersed duplications and translocations of ≥30 bp. Our approach is based on a generic multi-split alignment strategy that can identify SV breakpoints with base pair resolution. We show that Gustaf correctly identifies SVs, especially in the range from 30 to 100 bp, which we call the next-generation sequencing (NGS) twilight zone of SVs, as well as larger SVs >500 bp. Gustaf performs better than similar tools in our benchmark and is furthermore able to correctly identify size and location of dispersed duplications and translocations, which otherwise might be wrongly classified, for example, as large deletions.

    Availability and implementation: Project information, paper benchmark and source code are available via http://www.seqan.de/projects/gustaf/.

    Contact: kathrin.trappe@fu-berlin.de

    Categories: Journal Articles
  • Resolving complex tandem repeats with long reads
    [Dec 2014]

    Motivation: Resolving tandemly repeated genomic sequences is a necessary step in improving our understanding of the human genome. Short tandem repeats (TRs), or microsatellites, are often used as molecular markers in genetics, and clinically, variation in microsatellites can lead to genetic disorders such as Huntington’s disease. Accurately resolving repeats, and in particular TRs, remains a challenging task in genome alignment, assembly and variant calling. Though tools have been developed for detecting microsatellites in short-read sequencing data, these are limited in the size and types of events they can resolve. Single-molecule sequencing technologies may resolve a broader spectrum of TRs given their increased read lengths, but their significantly higher raw error rates make accurately identifying and sizing TRs a unique challenge that requires new approaches.

    Results: Here we present PacmonSTR, a reference-based probabilistic approach to identify TR regions and estimate the number of TR elements in long DNA reads. We present a multistep approach that requires as input a reference region and the reference TR element. Initially, the TR region is identified in the long DNA reads via a three-stage modified Smith–Waterman approach; then, the expected number of TR elements is calculated using a pair-hidden Markov model-based method. Finally, TR-based genotype selection (or clustering: homozygous/heterozygous) is performed with Gaussian mixture models, using the Akaike information criterion and coverage expectations.
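    The final genotype-selection step can be illustrated with a toy version of the model comparison it describes: fit one- and two-component Gaussian mixtures to per-read repeat-count estimates and choose by AIC. The EM implementation below is a self-contained illustration, not the PacmonSTR code:

```python
# Toy genotype selection by AIC over 1- vs 2-component Gaussian mixtures
# fitted to per-read repeat-count estimates (illustrative only).

import math

def log_norm(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def fit_one(xs):
    """Single Gaussian: (log-likelihood, number of parameters)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    sigma = max(math.sqrt(var), 1e-3)
    return sum(log_norm(x, mu, sigma) for x in xs), 2

def fit_two(xs, iters=50):
    """Two-component Gaussian mixture fitted by EM."""
    xs = sorted(xs)
    lo, hi = xs[: len(xs) // 2], xs[len(xs) // 2:]
    mu = [sum(lo) / len(lo), sum(hi) / len(hi)]  # init from the two halves
    sigma, w = [1.0, 1.0], [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities of each component for each point
        resp = []
        for x in xs:
            p = [w[k] * math.exp(log_norm(x, mu[k], sigma[k])) for k in range(2)]
            t = sum(p) or 1e-300
            resp.append([pk / t for pk in p])
        # M-step: update weights, means, variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)
    ll = sum(math.log(sum(w[k] * math.exp(log_norm(x, mu[k], sigma[k]))
                          for k in range(2)) or 1e-300) for x in xs)
    return ll, 5  # parameters: 2 means, 2 sigmas, 1 free weight

def genotype(xs):
    """Pick homozygous (1 component) vs heterozygous (2) by AIC = 2k - 2ll."""
    one, two = fit_one(xs), fit_two(xs)
    aic_one, aic_two = 2 * one[1] - 2 * one[0], 2 * two[1] - 2 * two[0]
    return "homozygous" if aic_one <= aic_two else "heterozygous"
```

    The AIC penalty (here 2 parameters vs 5) guards against calling a heterozygous genotype whenever the two-component fit improves the likelihood only marginally, which is the role model selection plays in the clustering step above.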

    Availability and implementation: https://github.com/alibashir/pacmonstr

    Contact: ajayummat@gmail.com or ali.bashir@mssm.edu

    Categories: Journal Articles