Coupling databases and advanced analytic tools (R)
Tutor / director / evaluator: Abelló Gamazo, Alberto
Document type: Master thesis
Rights access: Open Access
Many organizations today collect various kinds of data, creating large data repositories. However, performing advanced analytics over the large amounts of data stored in databases remains a significant challenge for both statistical software (R, S, SAS, SPSS, etc.) and database management systems (DBMSs). Statistical software provides comprehensive analytics and modelling functionality but can handle only limited amounts of data. DBMSs, in contrast, can handle large amounts of data but lack adequate analytical facilities. The need to draw on the strengths of both camps gave rise to the idea of coupling databases with advanced analytical or statistical tools, an approach that seems very promising and is gaining ground. This work studied how far the integration of an increasingly popular advanced analytical tool (R) with database systems (PostgreSQL, Oracle, DB2, SQL Server) has developed, and investigated the analytic performance of such coupling vis-à-vis that of a stand-alone R implementation. The results showed that, overall, coupling databases with R is about two (2) times faster than stand-alone R. On some individual benchmarks, the coupled systems (R+DBMS) are more than ten (10) times faster. However, challenges remain: efficient retrieval and passing of data to analytic functions, code portability, indistinguishable or flat analytics performance on small datasets, and integration configuration snags with some of the well-known DBMSs. Although stand-alone R performs competitively with DBMSs coupled with R on very small datasets, the issue of data security still lingers.
Our conclusion is that coupling databases with advanced analytical tools (R) is a sound concept and technique that yields considerable performance gains for advanced analytics on substantial datasets, provided that data are retrieved and passed to the analytical functions efficiently. We thus confirm the initial hypothesis, on the condition that a significant amount of data is involved and that the data are efficiently retrieved and passed to the analytic functions. Overall, we recommend an integration that combines the robust data management capabilities of DBMSs with the rich statistical functionality of advanced analytical tools, so that complex analytics run in situ in the database, in all situations, for faster performance and data security.