Market Basket Analysis in Retail
Tutor / director / evaluator: Sànchez-Marrè, Miquel
Document type: Master thesis
Rights access: Open Access
This Master's thesis describes a full end-to-end data science project carried out at CleverData, a successful start-up specialised in data mining techniques and analytics tools. The project was performed for one of its clients, a major retail company from Spain. Its aim was twofold: to analyse the possibly different selling behaviour of the client's stores or shops, and to analyse customers' purchase behaviour, also known as Market Basket Analysis, in order to confirm the client's hypotheses about the existence of different customer purchasing profiles and different store selling profiles within the company. The project was divided into three tasks. The first was oriented to the study, detection and validation of the different behaviour profiles of the client's shops/stores. This analysis was done by means of a descriptive process using clustering techniques. To guarantee a minimum robustness of the profiles obtained, three clustering algorithms were used: a hierarchical agglomerative clustering technique, a partitional clustering technique with a fixed number of clusters (K-means) and a partitional clustering technique with automatic detection of the number of clusters (G-means). For each algorithm, the output clusters were analysed and compared. First, the similarity of cluster composition across algorithms was analysed. Secondly, the resulting clusters (each partition) from each method were structurally validated using four Cluster Validity Indexes (CVIs): the Minimum Cluster Separation Index, the Maximum Cluster Diameter Index, the Dunn Index and the Davies-Bouldin Index. Finally, the best partition from a technical point of view was identified. After that, the client was expected to interpret and validate the meaning of the clusters obtained.
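The multi-algorithm clustering and structural validation described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn and synthetic data in place of the real store variables; G-means is omitted here because it has no standard scikit-learn implementation, and the Dunn index is computed by hand since the library does not provide it.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

def dunn_index(X, labels):
    """Dunn index: minimum inter-cluster separation / maximum cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Maximum diameter over all clusters (largest intra-cluster distance)
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    # Minimum separation between any pair of clusters
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    return min_sep / max_diam

# Synthetic stand-in for the store feature matrix
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Two of the three partitional strategies used in the project
partitions = {
    "hierarchical": AgglomerativeClustering(n_clusters=4).fit_predict(X),
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X),
}

# Structural validation of each partition: higher Dunn is better,
# lower Davies-Bouldin is better
for name, labels in partitions.items():
    print(name,
          "Dunn:", round(dunn_index(X, labels), 3),
          "Davies-Bouldin:", round(davies_bouldin_score(X, labels), 3))
```

In the same spirit, the Minimum Cluster Separation and Maximum Cluster Diameter indexes are simply the numerator and denominator of the Dunn index taken on their own.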
Once the partition most meaningful to the client had been chosen, the second task was devoted to providing a descriptive analysis of the clusters that was as meaningful as possible to the client. To that end, some common techniques were used, such as the computation of the cluster centroids and the characterisation of each cluster through the variables used. However, an important obstacle appeared in this task: the number of variables was so high (around 400) that it was impossible for the client to analyse and summarise the selling behaviour profile of the different shops. The proposed solution was to apply a feature selection approach, taking advantage of the clustering already performed, and to aggregate variables with a temporal relationship. In this way, the cluster to which each store belonged was recorded as the label of a newly created class variable. Then, a Random Forest ensemble technique was selected and applied to the new dataset. In addition to being able to predict the label of a new, unlabelled instance or observation, this discriminant technique provides information about the attributes most relevant for discrimination (i.e., the ones used in the trees of the forest). Based on those most important attributes, the descriptive analysis of each cluster was carried out, and it could be interpreted and fully understood by the client. The third task focused on the analysis of customers' purchase behaviour through the analysis of one year of recorded purchase tickets. To identify possible purchase patterns, it was decided to apply an associative model to find out whether co-occurrences or associations could be identified; concretely, the association rules model was used. Because the set of clusters was meaningful to the client, it was decided that the analysis of purchase behaviour would be done locally within each cluster.
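The feature selection step above can be sketched as follows: the cluster assignment becomes the class label, a Random Forest is fitted, and its impurity-based importances rank the attributes. This is a minimal illustration with synthetic data standing in for the ~400 store variables; the dataset shapes and the number of retained attributes are assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the store dataset: ~400 variables per store,
# with y playing the role of the cluster label obtained in the first task
X, y = make_classification(n_samples=200, n_features=400,
                           n_informative=10, n_classes=3, random_state=0)

# Fit the Random Forest on the labelled dataset
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank attributes by impurity-based importance and keep the top ones;
# these drive the per-cluster descriptive analysis
top = np.argsort(forest.feature_importances_)[::-1][:10]
print("most relevant attribute indices:", top)
```

The importances sum to 1, so they can be read directly as each attribute's relative share of the discrimination between clusters.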
Therefore, each cluster was examined to discover associations or co-occurrences of purchase patterns among its customers, and association rules were discovered for the purchase patterns in each store. Two strategies were used to generate the rules: the Lift measure and the Leverage measure. To summarise and conclude the analysis, a web page was created where the results were published, easing the client's access to them. Throughout the thesis, it is explained step by step how the project was developed, from the first step of defining the objectives to the final delivery of results. In the project, both the Python language with its machine learning libraries and the BigML tool, which offers machine learning as a service, were used. At the end of the project, the results accomplished were analysed and compared against the initial goals of the project, with satisfactory results both from the client's practical point of view and from a technical point of view.
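The two rule-scoring measures can be made concrete with a toy example. For a rule A → B over a set of tickets, lift is supp(A ∪ B) / (supp(A) · supp(B)) and leverage is supp(A ∪ B) − supp(A) · supp(B); lift above 1 (or leverage above 0) indicates that A and B co-occur more often than independence would predict. The tickets and items below are purely illustrative, not data from the project.

```python
# Hypothetical purchase tickets, each a set of items
tickets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
    {"butter", "eggs"},
]

def support(itemset):
    """Fraction of tickets containing every item of the itemset."""
    return sum(itemset <= t for t in tickets) / len(tickets)

def lift(a, b):
    """supp(A ∪ B) / (supp(A) * supp(B)); > 1 means positive association."""
    return support(a | b) / (support(a) * support(b))

def leverage(a, b):
    """supp(A ∪ B) - supp(A) * supp(B); > 0 means positive association."""
    return support(a | b) - support(a) * support(b)

a, b = {"bread"}, {"butter"}
print("lift:", lift(a, b), "leverage:", leverage(a, b))
```

In practice a rule-mining library (for instance mlxtend's `association_rules`, which supports both `lift` and `leverage` as selection metrics) would generate and score candidate rules at scale; the functions above just show what the two measures compute.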
In collaboration with the Universitat de Barcelona (UB) and the Universitat Rovira i Virgili (URV)