Automating installation, testing and development of bcbio-nextgen pipeline
Document type: Master thesis (pre-Bologna period)
Rights access: Open Access
In recent years, the cost of obtaining biological data has dropped drastically, which has led to exponential growth in the amount of data available. The software written to analyze this growing volume of data is often highly platform-dependent and difficult to scale. This final project tries to address those problems in a real bioinformatics core facility at the Science For Life Laboratory, a center for large-scale biosciences with a focus on health and environmental research, located in Stockholm, Sweden. The laboratory currently has 15 next-generation sequencing instruments, with a combined DNA sequencing capacity equal to several hundred complete human genomes per year. This implies a massive amount of data to be managed and analyzed. The data is analyzed using bcbio-nextgen, an in-house maintained genomics pipeline originally developed by Brad Chapman at Harvard School of Public Health [Rom12]. The first goal of this project is to automate the installation, deployment and testing of this pipeline. The second goal is to modify the alignment step of the analysis to use Seal, a Hadoop-based aligner. This will also allow us to check that all the automation works properly, as the pipeline will have to be installed and tested on several nodes.