Integration of a central protein repository into a standard data processing application for mining proteomics data

Kai Fritzemeier, Jakob Kristensen, Martin Røssel Larsen, Torsten Ueckert, Bernard Delanghe, Nils J. Færgeman, Julius Fredens, Kasper Engholm-Keller, Christian Ravnsborg

Research output: Contribution to conference without publisher/journal › Poster › Research


Novel Aspect
All major protein repositories integrated into a central domain for direct analysis and interpretation in standard proteomics data analysis software.
Modern proteomics faces the challenge of performing bioinformatics analysis and comparison of large datasets. Distinguishing known proteins from novel proteins in these datasets is time-consuming, and at times nearly impossible, without proper annotation and comparison with literature sources. Tools are needed that can handle the complexity of these data, including redundancy (the same protein under different accession codes), accession codes that differ between protein databases, outdated accession codes, and protein annotation. To resolve these issues, we have developed a consolidated proteomics database that provides annotations to Proteome Discoverer through directly integrated web service technology: a repository that enables efficient data mining and categorization of large datasets.
All samples were analyzed on an Orbitrap mass spectrometer coupled to a nano Easy LC.
The proteomics repository database is built on Java and MySQL database technology for optimal performance. Proteome Discoverer version 1.3 is used for database searching and is directly integrated with the proteomics repository.
Preliminary Data
Our proteomics database consolidates public sequence databases into a comprehensive and consistent superset of 13 million protein sequences derived from over 100 million protein records from GenBank, RefSeq, EMBL, FlyBase, UniProt, WormBase, Swiss-Prot, TrEMBL, PIR, IPI, PDB, Ensembl and others, including more than 10 million outdated accession numbers. Proteins are richly annotated by consolidating annotations from public databases with high-standard annotation from internal computational enrichment of the sequence data. The integrated database is updated continually, at a rate that depends on each source, which enables tracking of outdated accession keys.
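The consolidation idea described above can be illustrated with a minimal sketch. It is not the repository's actual implementation: it assumes that records are merged when their sequences are identical, and the function names, the `retired` mapping and the accession strings are all hypothetical.

```python
import hashlib
from collections import defaultdict


def sequence_key(sequence: str) -> str:
    """Return a stable key for a protein sequence, ignoring whitespace and case."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def consolidate(records):
    """Collapse (accession, source, sequence) records into one entry per unique sequence.

    Returns (entries, accession_index):
      entries          maps sequence key -> set of (source, accession) pairs
      accession_index  maps every accession -> sequence key
    """
    entries = defaultdict(set)
    accession_index = {}
    for accession, source, sequence in records:
        key = sequence_key(sequence)
        entries[key].add((source, accession))
        accession_index[accession] = key
    return dict(entries), accession_index


def resolve(accession, accession_index, retired=None):
    """Map an accession, possibly outdated, to its sequence key (None if unknown).

    `retired` is a hypothetical mapping from outdated accession to its successor.
    """
    retired = retired or {}
    seen = set()
    # Follow the chain of replacements, guarding against accidental cycles.
    while accession in retired and accession not in seen:
        seen.add(accession)
        accession = retired[accession]
    return accession_index.get(accession)


# Made-up records: two sources report the same sequence under different accessions.
records = [
    ("UP_0001", "UniProt", "MKTAYIAKQR"),
    ("RS_0001", "RefSeq", "mktayiakqr"),
    ("EN_0002", "Ensembl", "MSSHEGGKKK"),
]
retired = {"UP_0000": "UP_0001"}  # hypothetical outdated accession

entries, accession_index = consolidate(records)
```

In this toy input, three records collapse into two unique sequences, and the outdated accession `UP_0000` resolves to the same entry as `RS_0001`. A production system would likely also need rules for near-identical sequences, isoforms and species, which this sketch leaves out.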
Preliminary results from a comparison of protein annotation coverage in UniProt, NCBI and our proteomics repository for frequently used model organisms show that collecting unique annotation information from multiple sources significantly increases protein annotation coverage in human, mouse, yeast, C. elegans and E. coli.
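As a rough illustration of why combining sources raises coverage, the sketch below computes annotation coverage as the fraction of proteins with at least one annotation. The protein identifiers and source contents are invented for the example and are not data from this study, and the coverage definition itself is an assumption.

```python
def coverage(protein_ids, annotated):
    """Fraction of protein_ids that appear in the set of annotated proteins."""
    ids = set(protein_ids)
    if not ids:
        return 0.0
    return len(ids & set(annotated)) / len(ids)


# Toy proteome and two hypothetical annotation sources.
proteome = ["P1", "P2", "P3", "P4", "P5"]
source_a = {"P1", "P2"}
source_b = {"P2", "P3", "P4"}

# Consolidating sources corresponds to taking the union of annotated proteins.
combined = source_a | source_b

coverage_a = coverage(proteome, source_a)
coverage_b = coverage(proteome, source_b)
coverage_combined = coverage(proteome, combined)
```

Because the combined set is a union, its coverage can never fall below that of either source. How much it rises depends on how much unique annotation each source contributes.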
A quantitative stable isotope labeling proteomics study comparing wild-type C. elegans with a nuclear hormone receptor 49 mutant is used as a case study to demonstrate the importance of using a consolidated proteomics repository.
Original language: English
Publication date: 2011
Publication status: Published - 2011
Event: American Society of Mass Spectrometry - Denver, CO, United States
Duration: 5 Jun 2011 to 9 Jun 2011


Conference: American Society of Mass Spectrometry
Country/Territory: United States
City: Denver, CO
