Using random forests for assistance in the curation of G-protein coupled receptor databases

Shkurin, Aleksei; Vellido Alcacena, Alfredo

doi:10.1186/s12938-017-0357-4

dc.contributor.author	Shkurin, Aleksei
dc.contributor.author	Vellido Alcacena, Alfredo
dc.contributor.other	Universitat Politècnica de Catalunya. Departament de Ciències de la Computació
dc.date.accessioned	2017-10-19T08:24:03Z
dc.date.available	2017-10-19T08:24:03Z
dc.date.issued	2017-08-18
dc.identifier.citation	Shkurin, A., Vellido, A. Using random forests for assistance in the curation of G-protein coupled receptor databases. "Biomedical engineering online", 18 August 2017, vol. 16, supplement 1, p. 1-21.
dc.identifier.issn	1475-925X
dc.identifier.uri	http://hdl.handle.net/2117/108838
dc.description.abstract	Background: Biology is undergoing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of a known crystal structure, such discrimination must rely on their primary amino acid sequences. Methods: We are interested not so much in achieving maximum sub-family discrimination accuracy using quantitative methods as in exploring sequence misclassification behavior. Specifically, we are interested in isolating those sequences that show consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family. Random forests are used for this analysis because their ensemble nature makes them naturally suited to gauging the consistency of misclassification. This consistency is defined here through the voting scheme of their base tree classifiers. Results: Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each sub-family and transformation were obtained, as well as an overall shortlist of those cases that were consistently misclassified across transformations. The latter should be referred to experts for further investigation as a data curation task.
Conclusion: The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. This study has investigated in some detail the consistency of their misclassification using random forest ensemble classifiers. Different sub-families have been shown to display very different discrimination consistency behaviors. The individual identification of consistently misclassified sequences should provide GPCR database curators with a tool for quality control.
dc.format.extent	21 p.
dc.language.iso	eng
dc.rights	Attribution 3.0 Spain
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/es/
dc.subject	UPC subject areas::Computer science::Computer applications::Bioinformatics
dc.subject.lcsh	Biochemistry
dc.subject.lcsh	G proteins
dc.subject.other	Database curation
dc.subject.other	G-Protein coupled receptors
dc.subject.other	Machine learning
dc.subject.other	Random forests
dc.title	Using random forests for assistance in the curation of G-protein coupled receptor databases
dc.type	Article
dc.subject.lemac	Biochemistry
dc.subject.lemac	G proteins
dc.contributor.group	Universitat Politècnica de Catalunya. SOCO - Soft Computing
dc.identifier.doi	10.1186/s12938-017-0357-4
dc.description.peerreviewed	Peer Reviewed
dc.relation.publisherversion	https://biomedical-engineering-online.biomedcentral.com/articles/10.1186/s12938-017-0357-4
dc.rights.access	Open Access
local.identifier.drac	21553971
dc.description.version	Postprint (published version)
local.citation.author	Shkurin, A.; Vellido, A.
local.citation.publicationName	Biomedical engineering online
local.citation.volume	16
local.citation.number	Supplement 1
local.citation.startingPage	1
local.citation.endingPage	21
dc.identifier.pmid	28830426
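
The abstract defines misclassification consistency through the votes of the forest's base trees: a sequence is of interest when it is very often misclassified and almost always to the same wrong sub-family. A minimal Python sketch of that idea follows. It assumes the per-tree votes are already available as plain lists of labels; the function names, the two thresholds, and the placeholder labels "A", "B" and "C" are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def misclassification_consistency(tree_votes, true_label):
    """Summarise the base-tree votes for one sequence.

    Returns (error_rate, top_wrong_label, top_wrong_share):
      error_rate       fraction of trees voting for any wrong label
      top_wrong_label  the wrong label that received the most votes (or None)
      top_wrong_share  fraction of the wrong votes going to that label
    """
    wrong = [vote for vote in tree_votes if vote != true_label]
    if not wrong:
        return 0.0, None, 0.0
    top_wrong_label, top_count = Counter(wrong).most_common(1)[0]
    return len(wrong) / len(tree_votes), top_wrong_label, top_count / len(wrong)


def shortlist(votes_by_sequence, true_labels, min_error=0.8, min_share=0.9):
    """Keep sequences that are often misclassified, mostly to one wrong label.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    flagged = {}
    for seq_id, votes in votes_by_sequence.items():
        error_rate, wrong_label, share = misclassification_consistency(
            votes, true_labels[seq_id]
        )
        if error_rate >= min_error and share >= min_share:
            flagged[seq_id] = wrong_label
    return flagged


# Toy data: ten trees vote on three sequences whose recorded label is "A".
votes_by_sequence = {
    "seq1": ["B"] * 9 + ["A"],                  # wrong often, always to "B"
    "seq2": ["B"] * 4 + ["C"] * 4 + ["A"] * 2,  # wrong often, but split
    "seq3": ["A"] * 10,                         # never wrong
}
true_labels = {"seq1": "A", "seq2": "A", "seq3": "A"}

flagged = shortlist(votes_by_sequence, true_labels)
print(flagged)
```

In this toy data only seq1 is flagged: seq2 is misclassified by most trees, but its wrong votes are split between two labels, so it does not count as consistent under these assumed thresholds.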

Files in this item

Name: s12938-017-0357-4.pdf
Size: 1.337 MB
Format: PDF


UPCommons. Open knowledge portal of the UPC
