A benchmark of synthetic transcriptomic cancer data reconstruction
Abstract
Cancer is the second most common cause of death worldwide, and its incidence is increasing [1]. Several methodologies have been developed to study it. For instance, PAM50, a panel of 50 genes important for cancer characterization, has helped categorize cancer subtypes [2]. However, the rapid growth of sequenced biological data, or omics, has made it possible to measure far larger numbers of genes. Still, the number of samples available in studies tends to be low. This combination of small sample size and high dimensionality, known as the curse of dimensionality, makes rigorous data analysis less effective. Hence, there are limitations to deep learning implementations on omics data in general and cancer data in particular [3]. In the former case, the curse of dimensionality has hindered the application of deep learning, given its data-hungry nature. In the latter, our current understanding of the molecular mechanisms of cancer progression challenges the interpretation of deep learning models applied to omics data [4]. To circumvent both issues, we aim to learn a low-dimensional representation of the real data, use this representation to augment the original data with high-fidelity reconstructions, and obtain meaningful insights into cancer progression along the way. The autoencoder (AE) [5] is a deep learning technique that reduces data dimensionality. In this study, we define and use three types of autoencoder: the vanilla autoencoder [5], the Variational Autoencoder (VAE) [6], and the Conditional Variational Autoencoder (CVAE) [7]. We discuss how to learn from real cancer data, such as that provided by The Cancer Genome Atlas (TCGA), reconstruct the original data, and generate new data in silico, i.e., synthetic data.
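To illustrate the dimensionality-reduction idea behind these models, the toy sketch below trains a purely linear autoencoder by gradient descent on synthetic low-rank data. All sizes, hyperparameters, and the data itself are hypothetical stand-ins; the study applies deep AE/VAE/CVAE models to TCGA expression data, not this simplified version.

```python
import numpy as np

# Toy linear autoencoder (illustrative only, not the study's models):
#   encoder: z = X @ W_e  maps samples to a low-dimensional representation
#   decoder: X_hat = z @ W_d  reconstructs the original features
# Both weight matrices are trained by gradient descent on the squared
# reconstruction error. Dimensions are hypothetical stand-ins for omics data.
rng = np.random.default_rng(0)
n_samples, n_genes, latent_dim = 200, 50, 4

# Synthetic low-rank "expression" matrix, standardized per gene
X = rng.normal(size=(n_samples, latent_dim)) @ rng.normal(size=(latent_dim, n_genes))
X = (X - X.mean(axis=0)) / X.std(axis=0)

W_e = rng.normal(scale=0.1, size=(n_genes, latent_dim))  # encoder weights
W_d = rng.normal(scale=0.1, size=(latent_dim, n_genes))  # decoder weights
lr = 0.01

mse_init = float(np.mean((X @ W_e @ W_d - X) ** 2))
for _ in range(3000):
    Z = X @ W_e                                # low-dimensional codes
    err = Z @ W_d - X                          # reconstruction error
    grad_Wd = (Z.T @ err) / n_samples          # gradient w.r.t. decoder
    grad_We = (X.T @ (err @ W_d.T)) / n_samples  # gradient w.r.t. encoder
    W_d -= lr * grad_Wd
    W_e -= lr * grad_We

mse_final = float(np.mean((X @ W_e @ W_d - X) ** 2))
print(mse_init, mse_final)  # reconstruction error drops sharply after training
```

Because this autoencoder is linear, its optimum coincides with PCA on the standardized matrix; the deep, nonlinear AE/VAE/CVAE models used in the study learn richer representations and, in the variational cases, also allow sampling new synthetic profiles from the latent space.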




