Genomic Data Analysis

Single nucleotide polymorphisms and other mutations have been associated with drug resistance in M. tuberculosis. Although some mutations are well-known and explain the majority of drug resistance cases (except for pyrazinamide), they do not explain all cases.


Pyrazinamide (PZA) is an important first-line drug used to treat tuberculosis (TB) and is included in all new TB treatment regimens. Unfortunately, reported cases of PZA-resistance in M.tuberculosis are on the rise, especially among MDR/XDR cases. Drug Susceptibility Testing (DST) suffers from a relatively high false-resistance rate, requiring investigation into alternative susceptibility tests. This study provides a comprehensive assessment of a potential molecular diagnostic platform for detection of PZA resistance in M/XDR TB cases.

A set of 224 resistant and 72 susceptible, mostly MDR/XDR, clinical M. tuberculosis isolates from four high TB-burden countries were subjected to DST, confirmatory Wayne’s PZase assay, and whole-genome sequencing. Three genes implicated in PZA resistance, pncA, rpsA, panD, and their promoters were investigated for their potential role as molecular markers of PZA resistance.

We report 90% sensitivity and 65% specificity for a pncA-based molecular platform. The addition of rpsA and panD only provides potential markers for an additional 2% of resistant cases. Heterogeneity in pncA had a strong association with resistance and should be evaluated as a diagnostic tool. This was not the case in rpsA and panD. Further analysis indicates lineage-specific mutation patterns in pncA and that Euro-American isolates are most likely to escape molecular detection.

Overall, this study contributes to the body of knowledge regarding the mechanism of PZA resistance, finds pncA not sufficiently sensitive or specific in MDR/XDR cases as a sole basis for molecular diagnostics, and highlights the need for identification of alternative mechanism(s) of resistance beyond rpsA and panD.


Rifampicin (RIF) is a first-line drug used to treat Mycobacterium tuberculosis (M. tuberculosis). Resistance to RIF is most often associated with chromosomal mutations in the rpoB gene, especially in the 81-bp “hot spot” region called the rifampicin-resistance-determining region (RRDR). This study considered 347 isolates (303 resistant, 44 susceptible) from India, Moldova, Philippines, and South Africa; there were 7 isoniazid-monoresistant isolates, 18 multi-drug resistant (MDR), 41 pre-extensively drug resistant (pre-XDR), 242 XDR, 29 pan-susceptible, and 9 with other variable drug profiles . In this study, the rpo complex (rpoA, rpoB, and rpoC) and the promoter regions of three efflux pump genes (emrB, pstB, and drrA) were investigated for the molecular basis of RIF resistance and associated compensatory mutations. We found a molecular basis for RIF resistance in 295 (97.4%) RIF resistant (RIFR) isolates while 36 (81.8%) RIF susceptible (RIFS) isolates harbored no resistance conferring mutations in the six genes. One novel nonsynonymous SNP (Ser450Phe) was identified in RRDR of one RIFR isolate, while eight (2.6%) RIFR isolates lacked a molecular explanation for their resistance in the six genes. Eight (18.2%) RIFS isolates harbored mutations reported to confer RIF resistance. In the efflux pump genes, one particular mutation in the promoter region of drrA, G3243630A, was found in 41 (13.5%) RIFR isolates and six (13.6%) RIFS isolates. Of these, 38 RIFR and one RIFS also had a mutation in the RRDR known to confer resistance. This mutation was also harbored by one (of the eight) RIFR isolates otherwise lacking a resistance-conferring mutation. Occurrence of the mutation among 6 susceptible isolates (two of which tested as resistant with site DST) may then be a result of the resistance level nearing the DST threshold.  


Fluoroquinolones (FQs) are important second-line bactericides used in the treatment of Mycobacterium tuberculosis (M. tuberculosis). Due to the increasing prevelence of multi-drug resistance in new (3.3%) and previously treated (20%) Tuberculosis, use of fluoroquinolones to treat M. tuberculosis is increasing. In this study, we have performed susceptibility testing for ofloxacin (OFX) and moxifloxacin (MOX) and whole genome sequencing (WGS) on 331347 clinical M. tuberculosis isolates from India, Moldova, Philippines, and South Africa. Minimum inhibitory concentration (MICs) was performed for isolates that lacked canonical mutations. We have also collected previously published MIC levels associated with gyrA and gyrB mutations and compared them to MIC levels of our isolates without canonical mutations. A total of 82 different variants (34 SNPs, 4 insertions and 44 deletions) were observed in gyrA among all isolates. In gyrB, 55 variants (26 SNPs, 5 insertions and 24 deletions) were observed amongst all isolates in gyrB. We have determined the distribution of known-resistance conferring mutations in gyrase subunits gyrA and gyrB and searched for novel genomic correlates of FQ-resistance. We have identified novel gyrA and efflux pump mutations associated with MOX and OFX resistance. Susceptibility results were highly concordant between the two drugs despite recent recommendations to treat OFX resistant isolates with MOX. Isolates harboring mutations in QRDR consistently demonstrated high MOX MIC levels relative to OFX MIC levels, suggesting differential efficacy between the two drugs. Twenty-nine isolates had no known resistance conferring mutations but were phenotypically resistant. In these isolates, two novel variants within the gyrA QRDR and two outside the gyrB QRDR were discovered. Lineage analysis revealed that Asp94Gly was overrepresented in the East Asian lineage and Ala90Val in the Euro-American lineage. This study provides possible alternative mechanisms of resistance in isolates with unexplained FQ resistance and contributes to the growing information pool correlating MIC levels to mutations in gyrA and gyrB.


We are investigating the association of specific mutations in rrseis promoter, and tlyA to amikacin, kanamycin, and capreomycin susceptibility testing. These are known mechanisms of resistance and contribute to the rapid spread of XDR-TB.


Whole-genome sequencing offers a high-resolution view of phylogenetics. Constructing phylogenetic trees with such data helps us study patterns of evolution of drug resistance in M tuberculosis.


Lineage-specific differences in the Mycobacterium tuberculosis complex (MTBC) play an important role in virulence and emergence of drug-resistance. Recent research indicates that genome-wide single nucleotide polymorphisms (SNPs) allow for highly sensitive lineage and sublineage differentiation. We analyzed isolates collected from India, Moldova, South Africa, the Philippines, and Sweden and utilized Pacific Biosciences’ (PacBio) single molecule sequencing technology for our Whole Genome Sequencing, making this one of the first studies to use long read sequencing in order to include Mtb’s repetitive genomic regions in a phylogenetic analysis.

         SNP Phylogeny


In order to gain functional insight from DNA sequencing data, the genome must be annotated with structural features (gene boundaries, transcriptional start sites, etc.) and with the gene products encoded by transcribed genes. We are in the process of reconciling conflicting annotations from various databases and publications into a unified, maximally accurate annotation of reference strain H37Rv. We are approaching this from three fronts: Systematic manual curation of gene product function from the literature, compilation of empirically determined changes/additions to structural features of the genome and incorporation of high-confidence protein structural homology predictions.

          Systematic Annotation of the “Hypotheticome”

We are performing a comprehensive manual update of gene product annotation from articles published since 2010. We are focusing on a set of 1,057 genes classified as “conserved hypothetical” or “unknown” on TubercuList, along with an additional 668 genes we determined to be annotated ambiguously. Several hundred of these genes have experimentally determined gene products, and incorporation of these data into a single annotation will improve studies that attempt to relate genotypic changes to potential phenotypic consequences.

          Genome Structural Annotation

The positions of genic and intergenic features such as Transcriptional and Translational Start sites, Ribosomal binding sites, and operon boundaries are often computationaly predicted. While these predictions often hold true, reality often has the final word when different coordinates are revealed empirically. Many of these coordinates remain unintegrated into the primary references genome of Mycobacterium tuberculosis, H37Rv. This leads to suboptimal association of genomic variants with potential phenotypic consequences, needlessly lowering the knowledge yield of comparative genomics studies, and slowing progress in the community’s understanding of Mtb. As part of our multifacted annotation efforts, we are reconciling empirically determined feature coordinates with the reference genome to provide the community with a more accurate and informative reference genome.

          Structural Homology Predictions

Similarly to gene coordinates, and also based off of them, the function of many genes in the genomes are assigned on the basis of sequence siumilarity. While this approach leads to appropriate annotation in many cases, in many others it does not, as protein structure is more indicative of function than is sequence, and is also better conserved across evolutionary time. Due to this, structural similarity is often preserved over evolutionary time while sequence similarity is lost. In order to identify structural homologues invisible through the lens of sequence similarity, we are employing structural homology with I-TASSER on the set of hypothetical and poorly annotated genes in Mtb. These efforts should reveal high-confidence predictions for previously unannotated or misannotated sequences, and enable molecular biology labs to prioritize and guide empirical determination of the function of these gene’s products.

De novo Assembly

Typically sequencing data from genomes of interest are aligned reads to a reference sequence. Though cost-effective, every genome is different and this method preculdes the discovery of key novel elements such as structural variations. By assembling each genome “de novo” we are able to detect important genomic characteristics such as copy number variants, structural variants, and gene transfers.


As a proof-of-concept of our de novo assembly pipeline’s accuracy, we tested our assembly of avirulent reference strain H37Ra. To our surprise, not only did our assembly match up well with the reference assembly published in 2008, but we also discovered that mutations in over half of the genes reportedly different in H37Ra than reference strain H37Rv turned out to be sequencing errors, as our assembly matched H37Rv precisely and did not match the prior assembly of H37Ra. We wrote a paper describing our findings, their potential phenotypic implications on the source of virulence attenuation of H37Ra, and the utility of single-molecule seqeuncing for producing high-quality reference genomes. The pre-print is available here and will soon be published in BMC Genomics.

         Clinical Isolates

We have de novo assembled the genomes of dozens of isolates from TB patients across the world, and are analyzing their genomes for genetic indicators of antibiotic resistance and other characteristics of interest.


                     Positive Selection

Systems Biology

With variants scattered all over the genome, it is difficult to associate those having weaker effects with a phenotype. We employ methods of constraint-based modeling to make hypotheses about the far-reaching effects that antibiotics have on the organism’s biochemical networks.

          Metabolic Model

Genome-scale metabolic models have been instrumental in studying metabolic capabilities of organisms and identifying essential pathways and reactions that can serve as potential drug targets. By reconstructing the metabolic network of M. tuberculosis, we can simulate systems level consequences of genomic mutations.


Protein-protein interaction information can elucidate pathways and functions between two or more proteins.  We are using this data to probe a set of clinically collected Mtb isolates to determine potential indicators of multiple drug resistance, as well as novel mechanisms of resistance.

Pattern Recognition

Once dimensionality is kept under control, pattern recognition becomes a powerful method to identify novel phenomena.

Structural Modeling

It is often difficult what the effect of a coding mutation will be on a resulting protein. By simulating protein-folding, we can formulate hypotheses about the effects of these mutations.

            Structural homology


Epigenetics provide a mechanism for clonal species to alter phenotype is response to environmental cues without a change in DNA. We believe that persistence and other phenotypic characteristics may be related to epigenetic changes in the pathogen. They may buy the organism time until it acquires the genomic mutations that confer drug resistance. With single-molecule sequencing, we are able to study methylation patterns in the M. tuberculosis genome at single-base resolution

           DNA Methylation Patterns

Once thought to be merely a mechanism for differentiation of self-DNA and foreign DNA, DNA methylation has been unveiled as a marker relevant to many physiological processes including antigenic variation, recruitment of alternative DNA repair machinery, cell-cycle control, and transposition frequency. We believe such mechanisms exist in Mtb, and are examining the methylomes of clinical isolates for patterns indicative of such roles.

           DNA Methylation Heterogeneity

Heterogeneity in DNA methylation may act as a means to allow phenoypic variation in a genotpically clonal species such as Mycobacterium tuberculosis.  We are examining this phenomenon in the genomes of clinical isolates. The concept of methylation heterogeneity and its two forms are explained below

A) Site-specific heterogeneity – A clinical isolate genome exhibiting site-specific heterogeneity (“SSH”, black circle). Projection to subpopulations ( “SP”, red bars) within the isolate shows three distributions among SPs that would give rise to the observed methylated read fraction (33%) at the circled position. Competitive antagonism from DNA-binding factors (DBF) can cause both uniform SSH across SPs (first arrow) and varying levels of SSH across SPs (second arrow). Non-competitive antagonism from DBFs (not pictured) can cause binary variation between full methylation and full methylation across SPs (third arrow) by completely occluding the motif from MTase access.

B) Genome-wide heterogeneity (GWH) – A clear signature of GWH is illustrated in pane 3 (green dots). This pattern can occur when an LOF-MTase mutation arises in a SP but is absent in the other SPs. This may confer phenotypic heterogeneity to the colony by creating SP-specific regulatory responses, potentially widening the range of habitable microniches at the population level. Alternatively, the LOF-mutation could either be selected out of the population, or become the dominant genotype.



Motivation: Single molecule real-time (SMRT) sequencing has important and underutilized advantages that amplification-based platforms lack. Among others, lack of GC-bias (systematic error), ability to accurately assemble large repetitive regions, and complete de novo assembly without the need for scaffolding, can be mentioned. Here, we introduce PBHoover, a software that uses a heuristic calling algorithm in order to make base calls with high certainty in low coverage regions. This software is also capable of mixed population detection with high sensitivity. In order to improve coverage depth, PBHoover uses CigarRoller—an in-house developed CIGAR string correction package.

Results: We tested both modules on 349Mycobacterium tuberculosis clinical isolates sequenced on chemistry 1 (C1) (n=76) and chemistry 2 (C2) (n=275). On average, CigarRoller fixed 31% of reads per C1 isolate and 49% of reads per C2 isolate. To validate our results, we compared 801920 PBHoover base calls to those of Sanger sequencing: we observed a 99.97% concordance with Sanger sequencing resulting in a quality value of 35.

          HADTB Database

The effective control of Tuberculosis (TB) carries great importance today as the rise in drug-resistance hinders efforts to control the disease. Recent advances in sequencing have generated vast amounts of TB genomic data but existing databases are not reliably maintained and do not integrate well with other analysis tools. To address these issues, Hub for Aggregated Data for TB (HADTB) is being developed with an interactive user interface that presents all the necessary information in a single page web application. The web interface provides intuitive user interface for gene-based, variant-based, and isolate-based analysis. All search fields provided for each type of analysis are relationally mapped to one another which allows the user to retrieve all associated genomic and meta data in a single table view. Moreover, a built in report function provides users with a broad range of powerful statistical and graphical methods for easy visualization.


MIRU-HERO is a stand-alone linux based software which takes a whole genome sequence assembly from Mycobacterium tuberculosis or M. bovis and predicts spoligotype, MIRU-type, and lineage information for that sequence. The program will also identify M. canettii. As whole-genome sequencing (WGS) becomes cheaper and more commonplace, useful ways to analyze genomic data must also be available for the scientists who use it. MIRU-HERO was designed as a way for public-health scientists working with WGS data to efficiently generate results which are important in Mycobacterial strain typing/genotyping and outbreak detection.


Github website