To gain functional insight from DNA sequencing data, the genome must be annotated with structural features (gene boundaries, transcriptional start sites, etc.) and with the products encoded by its genes. We are reconciling conflicting annotations from various databases and publications into a unified, maximally accurate annotation of the reference strain H37Rv. We are approaching this on three fronts: systematic manual curation of gene product function from the literature, compilation of empirically determined changes and additions to structural features of the genome, and incorporation of high-confidence protein structural homology predictions.
Systematic Annotation of the “Hypotheticome”
We are performing a comprehensive manual update of gene product annotation from articles published since 2010. We are focusing on a set of 1,057 genes classified as “conserved hypothetical” or “unknown” on TubercuList, along with an additional 668 genes we determined to be annotated ambiguously. Several hundred of these genes have experimentally characterized products, and incorporating these data into a single annotation will improve studies that attempt to relate genotypic changes to potential phenotypic consequences.
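As a rough illustration of how such a target set could be assembled, the sketch below flags genes by keywords in their product descriptions. The records, the locus tags, and the keyword list for "ambiguous" annotations are all hypothetical assumptions made for this example; they are not the criteria actually used to select the 1,057 and 668 genes.

```python
import re

# Hypothetical (locus tag, product description) records, not real TubercuList entries.
EXAMPLE_GENES = [
    ("gene_0001", "conserved hypothetical protein"),
    ("gene_0002", "DNA gyrase subunit A"),
    ("gene_0003", "unknown protein"),
    ("gene_0004", "possible membrane protein"),
]

# Terms named in the text for uncharacterized genes.
UNCHARACTERIZED = re.compile(r"\b(conserved hypothetical|hypothetical|unknown)\b", re.IGNORECASE)
# Assumed markers of ambiguous annotation (illustrative only).
AMBIGUOUS = re.compile(r"\b(possible|probable|putative)\b", re.IGNORECASE)


def classify(product):
    """Return 'uncharacterized', 'ambiguous', or 'annotated' for a product description."""
    if UNCHARACTERIZED.search(product):
        return "uncharacterized"
    if AMBIGUOUS.search(product):
        return "ambiguous"
    return "annotated"


def curation_targets(genes):
    """Keep only the genes that still need manual curation, with their category."""
    targets = []
    for tag, product in genes:
        category = classify(product)
        if category != "annotated":
            targets.append((tag, category))
    return targets


if __name__ == "__main__":
    for tag, category in curation_targets(EXAMPLE_GENES):
        print(tag, category)
```

Keyword matching of this kind only produces a first-pass candidate list; deciding whether an annotation is truly ambiguous still requires manual review.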
Genome Structural Annotation
The positions of genic and intergenic features such as transcriptional and translational start sites, ribosomal binding sites, and operon boundaries are often computationally predicted. These predictions frequently hold true, but experiments sometimes reveal different coordinates, and in those cases the empirical result should take precedence. Many of these empirically determined coordinates have not been integrated into the primary reference genome of Mycobacterium tuberculosis, H37Rv. This leads to suboptimal association of genomic variants with potential phenotypic consequences, needlessly lowering the knowledge yield of comparative genomics studies and slowing progress in the community’s understanding of Mtb. As part of our multifaceted annotation efforts, we are reconciling empirically determined feature coordinates with the reference genome to provide the community with a more accurate and informative reference.
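The reconciliation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual pipeline: the `Feature` record, the `reconcile` function, the locus names, and all coordinates are hypothetical. It assumes 1-based inclusive coordinates with `start <= end`, so that the translational start of a minus-strand gene is stored in `end`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Feature:
    """Hypothetical gene record; coordinates are 1-based inclusive, start <= end."""
    locus: str
    start: int
    end: int
    strand: str            # "+" or "-"
    evidence: str = "predicted"


def reconcile(reference, empirical_starts):
    """Apply empirically determined start sites to predicted features.

    reference: list of Feature records from the current annotation.
    empirical_starts: dict mapping locus -> experimentally determined start position.
    Returns (updated_features, changes), where changes lists (locus, old, new).
    """
    updated, changes = [], []
    for feat in reference:
        new_start = empirical_starts.get(feat.locus)
        if new_start is None:
            updated.append(feat)          # no experimental data: keep the prediction
            continue
        if feat.strand == "+":
            old = feat.start
            new_feat = replace(feat, start=new_start, evidence="empirical")
        else:
            old = feat.end                # minus strand: the start site is the higher coordinate
            new_feat = replace(feat, end=new_start, evidence="empirical")
        if new_feat.start > new_feat.end:
            raise ValueError(f"empirical start for {feat.locus} lies outside the gene")
        updated.append(new_feat)
        if old != new_start:
            changes.append((feat.locus, old, new_start))
    return updated, changes


if __name__ == "__main__":
    ref = [Feature("geneA", 100, 400, "+"), Feature("geneB", 500, 900, "-")]
    updated, changes = reconcile(ref, {"geneA": 130})
    print(changes)
```

Recording the evidence type alongside each coordinate keeps predicted and experimentally supported features distinguishable in the merged annotation, and the change list gives a reviewable log of every coordinate that moved.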
Structural Homology Predictions
Like gene coordinates, and building on them, the functions of many genes in the genome are assigned on the basis of sequence similarity. While this approach leads to appropriate annotation in many cases, in many others it does not: protein structure is more indicative of function than sequence is, and it is also better conserved, so structural similarity is often preserved over evolutionary time while sequence similarity is lost. To identify structural homologues invisible through the lens of sequence similarity, we are applying structure-based homology prediction with I-TASSER to the set of hypothetical and poorly annotated genes in Mtb. These efforts should yield high-confidence predictions for previously unannotated or misannotated sequences, and enable molecular biology labs to prioritize and guide empirical determination of the function of these genes’ products.
It is often difficult to predict what effect a coding mutation will have on the resulting protein. By simulating protein folding, we can formulate hypotheses about the effects of these mutations.