Proteogenomic Workflows on Draft Genomes (#209)
We sought to improve our in-house genome annotation capabilities through the use of proteogenomic pipelines. By combining mass spectrometry data with genomic information, we increased the number of novel proteins detected compared with methods such as Mascot searches against a FASTA database created through gene prediction (for example, with Glimmer) or a simple 6-frame translation.
We used the Nexus proteogenomic pipeline, which works by creating intermediate ‘virtual proteins’ that can be confirmed by physical evidence (i.e. mass spectrometry) and combined to form expected proteins. We first tested the pipeline on an unannotated, incomplete genome, that of Scedosporium aurantiacum, which was available to us only as ~10,000 contigs. The pipeline outperformed a standard 6-frame translation at detecting known proteins (verified against Swiss-Prot by batch BLAST searches). The pipeline has also been tested on the Pacific oyster (Crassostrea gigas) genome, currently comprising 11,969 contigs, with similar results. This lends weight to treating the other detected proteins as potentially novel.
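For readers unfamiliar with the baseline used in this comparison, a 6-frame translation can be sketched as below. This is a minimal illustration only, not code from the Nexus pipeline: the function names (`reverse_complement`, `translate`, `six_frame_translation`) are hypothetical, the standard genetic code is assumed, and codons containing ambiguous bases are reported as `X`.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate a contig in three forward and three reverse frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Toy contig, for illustration only.
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

In practice, each translated frame would then be split at stop codons (`*`) and the resulting open reading frames written to a FASTA database for the search engine. That step is omitted here for brevity.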
We find that the Nexus pipeline, when accompanied by optimizations for each genome being run, gives results similar to those of other proteogenomic pipelines. Its major advantage is that it can be run on incomplete, unassembled and unannotated genomes, whereas other pipelines need complete and/or annotated genomes.