Market Basket Analysis in Retail
Cite as:
hdl:2117/109798
Document type: Official Master's final project
Date: 2017-05
Access conditions: Open access
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication or transformation without the authorisation of the rights holder is prohibited.
Abstract
This Master's thesis report describes a full end-to-end data science project carried out at CleverData, a successful start-up specialised in data mining techniques and analytics tools. The project was performed for one of its clients, a major Spanish retail company. Its aim was twofold: to analyse the possibly different selling behaviour of the client's stores, and to analyse customers' purchase behaviour, also known as Market Basket Analysis, in order to confirm the client's hypotheses about the existence of distinct customer purchasing profiles and distinct store selling profiles within the company. The project was divided into three tasks.
The first task was devoted to the study, detection and validation of different behaviour profiles among the client's stores. This analysis was performed through a descriptive process using clustering techniques. To guarantee a minimum robustness of the profiles obtained, three clustering algorithms were used: hierarchical agglomerative clustering, partitional clustering with a fixed number of clusters (K-means), and partitional clustering with automatic detection of the number of clusters (G-means). For each algorithm, the output clusters were analysed and compared. First, the similarity of cluster composition across algorithms was analysed. Second, the resulting partitions from each method were structurally validated using four Cluster Validity Indexes (CVIs): the Minimum Cluster Separation Index, the Maximum Cluster Diameter Index, the Dunn Index and the Davies-Bouldin Index. Finally, the best partition from a technical point of view was identified.
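The comparison described here can be sketched in Python with scikit-learn. This is a minimal illustration on synthetic data standing in for the confidential store features; it covers the hierarchical and K-means algorithms (G-means has no scikit-learn implementation) and two of the four validity measures, with the Dunn Index computed directly from its definition as minimum cluster separation over maximum cluster diameter:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic stand-in for the store feature matrix.
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Two of the three algorithms used in the project.
labels = {
    "hierarchical": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
}

def dunn_index(X, y):
    """Minimum inter-cluster separation divided by maximum cluster diameter."""
    clusters = [X[y == k] for k in np.unique(y)]
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    max_diam = max(cdist(c, c).max() for c in clusters)
    return min_sep / max_diam

for name, y in labels.items():
    # Higher Dunn is better; lower Davies-Bouldin is better.
    print(name, round(dunn_index(X, y), 3),
          round(davies_bouldin_score(X, y), 3))
```

A partition that scores well on both indexes (high Dunn, low Davies-Bouldin) is a reasonable candidate for the "best partition from a technical point of view".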
After that, the client had to interpret and validate the meaning of the clusters obtained. Once the partition most meaningful to the client was chosen, the second task was devoted to providing a descriptive analysis of the clusters that was as meaningful as possible to the client. To that end, common techniques were used, such as computing the cluster centroids and characterising each cluster through the variables involved. However, an important obstacle appeared in this task: the number of variables was so high (around 400) that the client could not analyse and summarise the selling behaviour profile of the different stores. The proposed solution was to apply a feature selection approach, taking advantage of the clustering already performed, and to aggregate variables with a temporal relationship. The cluster to which each store belonged was recorded as the label of a newly created class variable, and a Random Forest ensemble technique was then applied to the resulting dataset. Besides being able to predict the label of a new, unlabelled observation, this discriminant technique provides information about the attributes relevant for discrimination (i.e., those used in the trees of the forest). Based on those most important attributes, a descriptive analysis of each cluster was produced that the client could interpret and fully understand.
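This cluster-label-plus-Random-Forest feature selection step can be sketched as follows. The example uses synthetic data (a few informative dimensions padded with noise up to the ~400 variables mentioned above); the actual variables and cluster counts of the project are not public, so all names and sizes here are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Stand-in dataset: 5 informative features padded with noise to ~400 columns.
rng = np.random.default_rng(0)
X_info, _ = make_blobs(n_samples=120, centers=3, n_features=5, random_state=0)
X = np.hstack([X_info, rng.normal(size=(120, 395))])

# Step 1: cluster membership becomes the label of a new class variable.
y = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_info)

# Step 2: a Random Forest trained on the labelled data reveals which
# attributes actually drive the discrimination between clusters.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(forest.feature_importances_)[::-1][:10]
print("most relevant features:", top)
```

The descriptive analysis of each cluster can then be restricted to the handful of attributes the forest ranks highest, instead of all 400 variables.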
The third task focused on the analysis of customers' purchase behaviour through one year of recorded purchase tickets. To identify possible purchase patterns, an associative model was applied to find out whether co-occurrences or associations could be identified; concretely, the association rules model was used. Because the set of clusters was meaningful to the client, the purchase behaviour analysis was performed locally within each cluster: each cluster was examined to discover associations or co-occurrences of purchase patterns among its customers, yielding association rules describing the purchase patterns in each store. Two strategies were used to generate the rules: the Lift measure and the Leverage measure.
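The two rule measures are straightforward to compute from ticket data. As a minimal sketch, assuming hypothetical tickets standing in for one cluster's purchase history, Lift is the ratio of the observed joint support of two items to the support expected under independence, and Leverage is the difference of those same two quantities:

```python
from collections import Counter
from itertools import combinations

# Hypothetical tickets standing in for one cluster's purchase history.
tickets = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"milk", "eggs"},
    {"bread", "butter"}, {"bread", "milk", "butter"}, {"eggs"},
]
n = len(tickets)

item_count = Counter(i for t in tickets for i in t)
pair_count = Counter(frozenset(p)
                     for t in tickets for p in combinations(sorted(t), 2))

def lift(a, b):
    # support(a ∧ b) / (support(a) * support(b)); > 1 means positive association.
    supp_ab = pair_count[frozenset({a, b})] / n
    return supp_ab / ((item_count[a] / n) * (item_count[b] / n))

def leverage(a, b):
    # support(a ∧ b) - support(a) * support(b); > 0 means positive association.
    supp_ab = pair_count[frozenset({a, b})] / n
    return supp_ab - (item_count[a] / n) * (item_count[b] / n)

print(lift("bread", "milk"), leverage("bread", "milk"))
```

In practice, libraries such as mlxtend expose both measures directly when generating association rules from frequent itemsets; the explicit formulas above just make the two selection strategies concrete.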
To summarise and conclude the analysis, a web page was created where the results were published, making it easier for the client to access them.
Throughout the report, the development of the project is explained step by step, from the initial definition of objectives to the final delivery of results. Both the Python language with its machine learning libraries and the BigML tool, which provides machine learning as a service, were used. At the end of the project, the results achieved were analysed and compared against the initial goals, with satisfactory results from both the client's practical point of view and a technical point of view.
Description
In collaboration with the Universitat de Barcelona (UB) and the Universitat Rovira i Virgili (URV)
Degree: MÀSTER UNIVERSITARI EN INTEL·LIGÈNCIA ARTIFICIAL (Pla 2012)
Collections
Files | Description | Size | Format
---|---|---|---
129057.pdf | | 3.36 MB | PDF