Predicting the outcome of a chess game by statistical and machine learning techniques
Tutor / director / evaluator: Delicado Useros, Pedro Francisco
Document type: Official Master's Final Project
Access conditions: Open access
This document goes through the process of Big Data analytics, which combines computer science, data warehousing, and applied statistics. We aim to predict the result of chess matches after the first twenty full moves. To do this we are constrained to work with the complete database that was provided at the start of this project. The Gorgo Base consists of around three million matches and comes in an unknown database format; once we were able to read it, we were confronted with its size, since the database can overwhelm any computer that tries to run many operations on it at the same time, and this was a major challenge to overcome. As is usual with a database of this size, we had to spend significant resources filtering out missing and faulty data. To process the database we tokenized it, separated it into chunks we could actually compute on, and then began aggregating and filtering the data. Aggregation is an important part of any dataset creation; using the whole database we were able, for example, to compute the average Elo rating of every player we found. We also generated a score for every board position, later used to predict game results. At this step we generated our training and test files, assigning 70% of the games to training and 30% to testing. One final challenge was collecting all the information on board positions: we wanted to keep a record of the historical results of every position in our database, which required comparing and adding results, and in the end we recorded thirteen million historical board records. We did the same with the historical records of the players, storing their average Elo and their results history to create the competitors database.
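The pipeline described above (process the database in memory-sized chunks, aggregate a running average Elo per player, then make a 70/30 train/test split) can be sketched as follows. This is a minimal illustration, not the thesis code: the tiny `games` list is a hypothetical stand-in for the parsed Gorgo Base records, and all field names are assumptions.

```python
import random
from collections import defaultdict

# Hypothetical miniature stand-in for the parsed Gorgo Base:
# one record per game (field names are illustrative assumptions).
games = [
    {"white": "a", "black": "b", "white_elo": 2100, "black_elo": 1950, "result": "1-0"},
    {"white": "b", "black": "c", "white_elo": 1955, "black_elo": 2200, "result": "0-1"},
    {"white": "a", "black": "c", "white_elo": 2110, "black_elo": 2210, "result": "1/2-1/2"},
]

def chunks(records, size):
    """Yield fixed-size chunks so each pass fits in memory."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

# Aggregate: running Elo sum/count per player, one chunk at a time,
# so the average never requires the whole database in memory.
elo_sum = defaultdict(float)
elo_count = defaultdict(int)
for chunk in chunks(games, size=2):
    for g in chunk:
        for side, key in (("white", "white_elo"), ("black", "black_elo")):
            elo_sum[g[side]] += g[key]
            elo_count[g[side]] += 1

avg_elo = {p: elo_sum[p] / elo_count[p] for p in elo_sum}

# 70/30 train/test split on shuffled game indices.
random.seed(42)
idx = list(range(len(games)))
random.shuffle(idx)
cut = int(0.7 * len(idx))
train = [games[i] for i in idx[:cut]]
test = [games[i] for i in idx[cut:]]
```

On the real three-million-game database the same pattern applies, with the chunks read from disk rather than sliced from a list.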
The biggest problem in predicting chess matches is the enormous number of legal board positions, estimated at around 10^43 by Shannon and others; but since we are not taking the endgame into account, because we want to predict the result at an early stage, we believe we can still exploit the information in the matches of this database. Finally, we gathered the data from our three sources, the refined Gorgo Base, the movement history, and the competitors records, to generate a dataset we could work with. We applied an SVM with an RBF kernel and compared it to a random forest model. In the end we were satisfied with our results, which showed us how powerful Big Data techniques are for solving problems of this kind.
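The model comparison mentioned above could be set up along these lines. This is a hedged sketch only: scikit-learn is an assumed toolkit (the abstract does not name one), and the synthetic features stand in for the real board-score, position-history, and Elo features of the final dataset.

```python
# Assumption: scikit-learn as the modelling library; synthetic data
# replaces the thesis features purely for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the engineered features and game outcomes.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Same 70/30 split as in the thesis pipeline.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# SVM with an RBF kernel versus a random forest, compared on held-out accuracy.
svm = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

svm_acc = svm.score(X_te, y_te)
rf_acc = rf.score(X_te, y_te)
```

The real comparison would of course use the merged dataset and a chess-specific target (win / draw / loss), but the fit-and-score structure is the same.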