MicroMeta: Protein sequence search and classification (A reflection on my thesis)

My thesis has been submitted to committee members for several days. I finally get some high quality sleep after one and a half sleepless weeks. For the rest few days before my defense, I just planed to relax and prepare the presentation. My advisor, C. Titus Brown, sent me an email, “next step”, right after I made the plan, saying I have to write a blog about my research according to my Maters program requirement and send some draft to him by Monday morning. I habitually answered “OK” though some fun time this weekend already planned. This just reminds me the saying “Life is like a marathon” and he might also want to emphasize “in graduate school”.

Last summer, I began my first research project in bioinformatics. Before that I was a wet experimental person. The reason that I wanted to try bioinformatics was I had interest in metagenomics which tightly relied on bioinformatics. As a beginner, my advisor asked me to try two sequence similarity search tools, HMMER (Eddy 1998) and BLAST (Altschul, Gish et al. 1990), first.

BLAST is one of the most used tools in all biology fields. It does local alignment, which can align small similar regions in two sequences and is good at detecting small regions that are conserved between two sequences. Word method (shown in Figure 1) is used to speed up the local alignment search in large sequence databases.

Figure 1. The BLAST algorithm. BLAST applies a heuristic search method which finds k-letter words (default = 3 in blastp) scoring at T (neighborhood score threshold) when aligned to a specific k-letter word in the query and scored with a substitution matrix. Words scoring above T (neighborhood words) are searched in database and then extended in both directions till the score start to decrease. The resulting locally optimal scoring alignment is called HSP (high scoring pair) and later reported by BLAST if it has a score of higher than S (BLAST score cutoff) or E-value lower than a specified cutoff.

Figure 2. Simple illustration of profile search. A: profile with absolute frequency at each position (column) is made from MSA (multiple sequence alignment). The profile is transformed into a new one with relative frequency matrix or log likelihood matrix. The new profile is aligned with every 7 base pair region of query sequence. The regions with scores above cutoff are reported. B: sequence logo (Schneider and Stephens 1990) is graphical representation of profiles and their sequence conservation.

I chose nirK, a type of nitrite reductase, as my target gene for BLAST and HMMER search against the soil metagenomic data from Tracy Teal in Tom Schmidt’s Lab. As a result I found BLAST has much more hits than HMMER (Figure 3). Then we began to be skeptical about the BLAST results and BLAST might have false positives. We also found it had been reported that BLAST perform poorly for twilight zone homologous sequence pairs with identity within 20-35% (Rost 1999) and a lot of BLAST hits from the metagenomic data fell in this range (Figure 4). Then we analyzed the BLAST false positive problem by matched the hits back to the query nirK sequences and check where they are matched to (the domain region or non-domain region). I had written a python script to do this automatically, but further analysis was still needed to address the BLAST false positive problem.

Figure 3. BLAST and HMMER search comparison. Search results from both datasets show BLAST gets more hits than HMMER search. Two datasets are technical replicates of same soil DNA sample.

Figure 4. Distribution of identities in BLAST search result of six well known nirKs against dataset1 and dataset2.

Back to novel gene search, BLAST and HMMER use different methods to find sequence similarity. In spite of the false positive problem, they are still very useful tool for similarity search. Our strategy to find homologous sequences is to add a step downstream of the BLAST or HMMER search and screen for sequences more likely to be real homologs. My advisor asked me to try the phylogenetic tree method. The tree I used is neighbor-joining tree, which is made from a distance matrix of all pairwise comparison between all sequences. If there are m well known sequences for building profile and n new sequences for search (n>>m), BLAST would do m*n comparisons, HMMER would do n comparisons, and phylogenetic tree would do n*n comparison. Thus, phylogenetic tree contains the most information on similarities among sequences that is shown in the form of nodes and braches in the tree. Then we tried some ‘tricks’ on the tree, and got some sequences more possible to be real homologs. The tree squeezing method (Figure 5) is the one I used on searching novel nirKs. In the other case study (classification of closely related genes), ammonia monooxygenase subunit A (amoA) and particular methane monooxygenase subunit A are two genes evolutionary related and have close sequence similarity. BLAST and HMMER cannot separate them well (Figure 6, 7). My advisor suggested me to try the tree method on amoA and pmoA. I found tree did have better resolution to separate the amoAs and pmoAs in UniProtKB (Figure 8). Then I used low/high node method (not shown here) to obtain sequences possibly misclassified in UniProtKB.

Figure 5. Simple example of tree squeezing method.

Figure 6. E-value distribution of amoA BLAST hits. Blue is overall E-value distribution of amoA BLAST hits. Green is a subset of overall BLAST hits that are annotated as amoA in UniProtKB. Red is a subset of overall BLAST hits that are annotated as pmoA in UniProtKB.

Figure 7. E-value distribution of amoA HMMER hits. Blue is overall E-value distribution of amoA HMMER hits. Green is a subset of overall HMMER hits that are annotated as amoA in UniProtKB. Red is a subset of overall BLAST hits that are annotated as pmoA in UniProtKB.

Figure 8. Trees of all amoAs and pmoAs. Tree is made from all amoAs and pmoAs (including those from culture and environmental samples). The majority of amoAs and pmoAs separate from each other though some pmoAs are scattered in amoA clusters and some amoAs are scattered in pmoA clusters.

Figure 9. Two primary databases of UniProt KB.

The emergence of metagenomics presents a new challenge to using homology searches to annotate and/or classify sequences. Metagenomics is the study of DNA recovered directly from environmental samples, and it investigates the diversity and functional pathways represented by microbial community. Given the fact that most microbes cannot be cultured in standard lab conditions, metagenomics is an alternative approach to studying the microbial world. With the introduction of next generation sequencing (NGS) techniques, metagenomic data has rapidly accumulated in the last five years. The resulting large number of sequences and their characteristic short lengths present challenges to sequence annotation.

Second, we need to pay attention to BLAST false positive problem.
When a new sequence is sequenced, it is used to BLAST against a reference database. This is how most wet experiment people annotate their sequences. However, the NCBI BLAST default E-value cutoff is 10, so it will give hits with small identity with the query. BLAST performs poorly for sequences with 20-35% identity. Thus, BLAST gives false positives, especially those sequence hits with relatively high E-value but low identity.

Third, phylogenetically diverse well known sequences are important for future new sequence annotation.

Our ability to detect novel nirK and reliably classify pmoA/amoA is strongly affected by the lack of diverse nirK/pmoA/amoA. Tools which could highlight important “sequencing gaps” would be useful in providing more resolution to maximize gene discovery using homology searches to annotate sequences.

Further, some sequences in current major big databases are probably misclassified.

The classifications of amoA and pmoA genes in our tree which disagree with UniProt annotations highlight the possibility of incorrect annotations in reference databases. Similarity searches based on incorrect annotations will result in future incorrect annotations. Tools which could evaluate the accuracy of annotations would be very useful for making sound biological conclusions from similarity searches.

In sum, a repetitive theme throughout this study was that tools that are currently used, whether BLAST, HMMER, or even a reference database, can often give the wrong results which could result in wrong conclusions. Understanding the limits of the tools and their underlying methodology is important in their effective use. Much like experimental methods, using computational tools requires preliminary thought and planning.

References:
. Retrieved July 21, 2010, from http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/BLAST_algorithm.html.
Altschul, S. F., W. Gish, et al. (1990). "Basic local alignment search tool." J Mol Biol 215(3): 403-410.
Eddy, S. R. (1998). "Profile hidden Markov models." Bioinformatics 14(9): 755-763.
Rost, B. (1999). "Twilight zone of protein sequence alignments." Protein Eng 12(2): 85-94.
Schneider, T. D. and R. M. Stephens (1990). "Sequence logos: a new way to display consensus sequences." Nucleic Acids Res 18(20): 6097-6100.
Wu, C. H., R. Apweiler, et al. (2006). "The Universal Protein Resource (UniProt): an expanding universe of protein information." Nucleic Acids Res 34(Database issue): D187-191.

MicroMeta

Sunday, August 22, 2010

Protein sequence search and classification (A reflection on my thesis)

No comments:

Post a Comment

Blog Archive

Followers