TY - GEN
T1 - Reconciling Inconsistent Molecular Structures from Biochemical Databases
AU - Eriksen, Casper Asbjørn
AU - Andersen, Jakob Lykke
AU - Fagerberg, Rolf
AU - Merkle, Daniel
PY - 2023
Y1 - 2023
N2 - Information on the structure of molecules, retrieved via biochemical databases, plays a pivotal role in various disciplines, such as metabolomics, systems biology, and drug discovery. However, no such database can be complete, and the chemical structure for a given compound is not necessarily consistent between databases. This paper presents StructRecon, a novel tool for resolving unique and correct molecular structures from database identifiers. StructRecon traverses the cross-links between database entries in different databases to construct what we call an identifier graph, which offers a more complete view of the total information available on a particular compound across all the databases. In order to reconcile discrepancies between databases, we first present an extensible model for chemical structure which supports multiple independent levels of detail, allowing standardisation of the structure to be applied iteratively. In some cases, our standardisation approach results in multiple structures for a given compound, in which case a random walk-based algorithm is used to select the most likely structure among incompatible alternates. We applied StructRecon to the EColiCore2 model, resolving a unique chemical structure for 85.11% of identifiers. StructRecon is open-source and modular, which enables the potential support for more databases in the future.
KW - Chemical structure identifiers
KW - Cheminformatics
KW - Small-molecule databases
KW - Standardisation
U2 - 10.1007/978-981-99-7074-2_5
DO - 10.1007/978-981-99-7074-2_5
M3 - Article in proceedings
AN - SCOPUS:85174222895
SN - 9789819970735
T3 - Lecture Notes in Computer Science
SP - 58
EP - 71
BT - Bioinformatics Research and Applications - 19th International Symposium, ISBRA 2023, Proceedings
A2 - Guo, Xuan
A2 - Mangul, Serghei
A2 - Patterson, Murray
A2 - Zelikovsky, Alexander
PB - Springer Science+Business Media
T2 - 19th International Symposium on Bioinformatics Research and Applications, ISBRA 2023
Y2 - 9 October 2023 through 12 October 2023
ER -