Using machine learning tools for protein database biocuration assistance

König, Caroline; Shaim, Ilmira; Vellido Alcacena, Alfredo; Romero Merino, Enrique; Alquézar Mancho, René; Giraldo Arjonilla, Jesús

doi:10.1038/s41598-018-28330-z

Document type: Article

Publication date: 2018-07-05

Publisher: Nature

Access conditions: Open access

Except where otherwise noted, the contents of this work are subject to the Creative Commons license: Attribution 3.0 Spain

Abstract

Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise, as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.
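
The idea of flagging sequences whose database label disagrees with a classifier's prediction can be illustrated with a minimal sketch. This is not the authors' pipeline: the amino acid composition transformation, the leave-one-out nearest-centroid classifier, the function names, and the toy sequences and labels below are all assumptions chosen for illustration, and the abstract's "consistently misclassified" criterion would in practice rely on several models and repeated evaluation rather than a single pass.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Amino acid composition: an alignment-free transformation of a sequence."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def flag_suspect_labels(sequences, labels):
    """Leave-one-out nearest-centroid classification.

    Returns the indices of sequences whose predicted subtype differs
    from the label recorded in the (hypothetical) database.
    """
    feats = [composition(s) for s in sequences]
    suspects = []
    for i, (x, y) in enumerate(zip(feats, labels)):
        # Build class centroids from every sequence except the held-out one.
        by_label = {}
        for j, (f, lab) in enumerate(zip(feats, labels)):
            if j != i:
                by_label.setdefault(lab, []).append(f)
        centroids = {lab: centroid(fs) for lab, fs in by_label.items()}
        pred = min(centroids, key=lambda lab: sq_dist(x, centroids[lab]))
        if pred != y:
            suspects.append(i)
    return suspects


# Toy data (invented): subtype "A" is rich in L/A, subtype "B" in K/E.
# The last sequence looks like subtype "B" but is deliberately labeled "A".
sequences = [
    "LLALLAALLA", "ALLLAALALL", "LALALLLAAL",
    "KEKKEEKEKK", "EKKEKEEKKE", "KKEEKEKEKE",
    "KEEKKEKKEE",
]
labels = ["A", "A", "A", "B", "B", "B", "A"]

suspects = flag_suspect_labels(sequences, labels)
print(suspects)  # [6]
```

A flagged index is only a candidate for a curator to review: a disagreement may indicate label noise, but it may equally reflect a weak classifier or an uninformative transformation.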

Citation: König, C., Shaim, I., Vellido, A., Romero, E., Alquézar, R., Giraldo, J. Using machine learning tools for protein database biocuration assistance. "Scientific reports", 5 July 2018, article 10148, p. 1-10.

URI: http://hdl.handle.net/2117/119566

DOI: 10.1038/s41598-018-28330-z

ISSN: 2045-2322

Publisher version: https://www.nature.com/articles/s41598-018-28330-z
