Journal articles
http://hdl.handle.net/2117/639
2016-10-01T01:40:14Z
http://hdl.handle.net/2117/89696
Put three and three together: Triangle-driven community detection
Prat Pérez, Arnau; Domínguez Sal, David; Brunat Blay, Josep Maria; Larriba Pey, Josep
Community detection has become one of the most relevant topics in graph data mining because of its applications in fields such as biology, social networks, and network traffic analysis. Although the existing metrics used to quantify the quality of a community work well in general, under some circumstances they fail to capture that notion correctly. The main reason is that these metrics consider the internal community edges as a set, but ignore how these edges actually connect the vertices of the community. We propose the Weighted Community Clustering (WCC), a new community metric that takes the triangle instead of the edge as the minimal structural motif indicating the presence of a strong relation in a graph. We theoretically analyse WCC in depth and formally prove, by means of a set of properties, that the maximization of WCC guarantees communities with cohesion and structure. In addition, we propose Scalable Community Detection (SCD), a community detection algorithm based on WCC that is designed to be fast and scalable on SMP machines, and we show experimentally, using real datasets, that WCC correctly captures the concept of community in social networks. Finally, using ground-truth data, we show that SCD provides better quality than the best state-of-the-art disjoint community detection algorithms while running faster.
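The idea of treating the triangle as the basic motif can be illustrated with a small sketch. It only counts, for each vertex, how many of its triangles are closed inside a candidate community; it is not the WCC formula, which the abstract does not give. The toy graph, the function name `triangles_through`, and the `allowed` parameter are assumptions made for this example.

```python
from itertools import combinations


def triangles_through(adj, x, allowed=None):
    """Count triangles containing x whose other two vertices are in `allowed`.

    `adj` maps each vertex to the set of its neighbours (undirected graph).
    With allowed=None, every vertex of the graph is eligible.
    """
    nbrs = [v for v in adj[x] if allowed is None or v in allowed]
    # A pair of neighbours of x closes a triangle when the two are adjacent.
    return sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])


# Hypothetical toy graph: triangles a-b-c and c-d-e share the vertex c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("c", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

community = {"a", "b", "c"}
for x in sorted(community):
    inside = triangles_through(adj, x, community)
    total = triangles_through(adj, x)
    print(x, inside, total)
```

In this toy graph, vertex c closes only one of its two triangles inside the community, whereas a and b close all of theirs. An edge-count view would not show that difference as directly.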
2016-09-08T08:14:33Z
http://hdl.handle.net/2117/26446
Automatic multi-partite graph generation from arbitrary data
Álvarez García, Sandra; Baeza Yates, Ricardo; Brisaboa, Nieves R.; Larriba Pey, Josep; Pedreira, Oscar
In this paper we present a generic model for the automatic generation of basic multi-partite graphs from collections of arbitrary input data, following user instructions. The paper also presents GraphGen, a tool that implements this model. The input data is a collection of complex objects, each composed of a set or list of heterogeneous elements. Our tool provides a simple interface for the user to specify the types of nodes that are relevant for the application domain in each case. The nodes and the relationships between them are derived from the input data by applying a set of derivation rules specified by the user. The resulting graph can be exported in the standard GraphML format so that it can be further processed with other graph management and mining systems. We end by giving some examples from real scenarios that show the usefulness of this model.
2015-02-20T11:24:35Z
http://hdl.handle.net/2117/25227
Two-way replacement selection
Martínez Palau, Xavier; Domínguez Sal, David; Larriba Pey, Josep
The performance of external sorting using merge sort is highly dependent on the length of the runs generated. One of the most commonly used run generation strategies is Replacement Selection (RS) because, on average, it generates runs that are twice the size of the memory available. However, the runs generated by RS become shorter for data with certain characteristics, such as inputs sorted inversely with respect to the desired output order.
The goal of this paper is to propose and analyze two-way replacement selection (2WRS), a generalization of RS that uses two heaps instead of the single heap used by RS. The appropriate management of these two heaps allows generating runs larger than the memory available in a stable way, i.e. independently of the characteristics of the datasets. Depending on the changing characteristics of the input dataset, 2WRS assigns each new data record to one heap or the other, and grows or shrinks each heap, adapting to the increasing or decreasing tendency of the dataset. On average, 2WRS creates runs at least as long as those generated by RS, and longer for datasets that combine increasing and decreasing data subsets. We tested both algorithms on large datasets with different characteristics: 2WRS achieves speedups at least similar to those of RS, and over 2.5 when RS fails to generate large runs.
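For context, classic single-heap replacement selection can be sketched as follows. This is the textbook RS baseline, not the 2WRS algorithm proposed in the paper; the function name and the `memory` parameter (the number of records that fit in memory) are assumptions for illustration.

```python
import heapq
from typing import Iterable, Iterator, List


def replacement_selection(records: Iterable[int], memory: int) -> Iterator[List[int]]:
    """Yield sorted runs produced by classic replacement selection."""
    if memory < 1:
        raise ValueError("memory must hold at least one record")
    it = iter(records)
    done = object()  # sentinel marking the end of the input

    # Fill the heap with the first `memory` records, all tagged for run 0.
    heap = []
    for x in it:
        heap.append((0, x))
        if len(heap) == memory:
            break
    heapq.heapify(heap)

    current_run, run = 0, []
    while heap:
        run_id, key = heapq.heappop(heap)
        if run_id != current_run:
            # Only records deferred to the next run remain: close this run.
            yield run
            run, current_run = [], run_id
        run.append(key)
        nxt = next(it, done)
        if nxt is not done:
            # A record smaller than the last output cannot join the current run.
            tag = current_run if nxt >= key else current_run + 1
            heapq.heappush(heap, (tag, nxt))
    if run:
        yield run


# Inversely sorted input: every run is only as long as the memory.
print(list(replacement_selection(range(9, 0, -1), memory=3)))
# Already sorted input: a single run covers the whole input.
print(list(replacement_selection(range(1, 10), memory=3)))
```

The first call shows the weakness described above: on inversely sorted input, each run is no longer than the memory size.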
2015-01-12T15:12:35Z
http://hdl.handle.net/2117/25011
Using genetic algorithms for attribute grouping in multivariate microaggregation
Balasch Masoliver, Jordi; Muntés Mulero, Víctor; Nin Guerrero, Jordi
Anonymization techniques that provide k-anonymity suffer from loss of quality when data dimensionality is high. Microaggregation techniques are no exception. Given a set of records, attributes are grouped into non-intersecting subsets and microaggregated independently. While this improves quality by reducing the loss of information, it usually leads to the loss of the k-anonymity property, increasing entity disclosure risk. In spite of this, grouping attributes is still a common practice for data sets containing a large number of records. Depending on the attributes chosen and their correlation, the amount of information loss and disclosure risk vary. However, there have been no serious attempts to find the best way of grouping attributes. In this paper, we present GOMM, the Genetic Optimizer for Multivariate Microaggregation which, as far as we know, represents the first proposal using evolutionary algorithms for this problem. The goal of GOMM is to find the optimal, or near-optimal, attribute grouping, taking into account both information loss and disclosure risk. We propose a way to map attribute subsets into a chromosome and a set of new mutation operations for this context. We also provide a comprehensive analysis of the operations proposed and show that, after using our evolutionary approach on different real data sets, we obtain better quality in the anonymized data than with previously used ad-hoc attribute grouping techniques. Additionally, we provide an improved version of GOMM called D-GOMM, in which operations are dynamically executed during the optimization process to reduce the GOMM execution time.
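One simple way to encode an attribute grouping as a chromosome is sketched below. The abstract does not describe GOMM's actual encoding or mutation operations, so this is a hypothetical illustration: gene `i` holds the group index of attribute `i`, and the mutation moves a single attribute to a different group.

```python
import random


def decode(chromosome):
    """Turn a list of group indices into non-intersecting attribute subsets."""
    groups = {}
    for attr, group in enumerate(chromosome):
        groups.setdefault(group, []).append(attr)
    return [groups[g] for g in sorted(groups)]


def mutate(chromosome, n_groups, rng):
    """Return a copy in which one randomly chosen attribute changes group.

    Assumes n_groups >= 2 so that a different group always exists.
    """
    child = list(chromosome)
    i = rng.randrange(len(child))
    child[i] = rng.choice([g for g in range(n_groups) if g != child[i]])
    return child


parent = [0, 1, 0, 2, 1]           # five attributes, three groups
print(decode(parent))              # attribute subsets encoded by the parent
child = mutate(parent, n_groups=3, rng=random.Random(0))
print(decode(child))
```

Because every attribute carries exactly one group index, any chromosome decodes to non-intersecting subsets. Mutation therefore cannot produce an invalid grouping.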
2014-12-12T12:11:36Z
http://hdl.handle.net/2117/24027
A discussion on the design of graph database benchmarks
Domínguez Sal, David; Martínez Bazán, Norbert; Muntés Mulero, Víctor; Baleta Ferrer, Pedro; Larriba Pey, Josep
Graph Database Management systems (GDBs) are gaining popularity. They are used to analyze the huge graph datasets that naturally appear in many application areas to model interrelated data. The objective of this paper is to raise a new topic of discussion in the benchmarking community and to give practitioners a set of basic guidelines for GDB benchmarking. We strongly believe that GDBs will become an important player in the data analysis market and that, as a result, their performance and capabilities will also become important. For this reason, we discuss the aspects that are important from our perspective, i.e. the characteristics of the graphs to be included in the benchmark, the characteristics of the queries that are important in graph analysis applications, and the evaluation workbench.
2014-09-10T08:42:32Z
http://hdl.handle.net/2117/19430
Generalized median string computation by means of string embedding in vector spaces
Jiang, Xiaoyi; Wentker, Jöran; Ferrer Sumsi, Miquel
In structural pattern recognition the median string has been established as a useful tool to represent a set of strings. However, its exact computation is complex and computationally expensive. In this paper we propose a new approach for the computation of the median string based on string embedding. Strings are embedded into a vector space and the median is computed in the vector domain. We apply three different inverse transformations to go from the vector domain back to the string domain in order to obtain a final approximation of the median string. All of them are based on the weighted mean of a pair of strings. Experiments show that we succeed in computing good approximations of the median string.
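The embedding step can be sketched as follows, under simplifying assumptions. Each string is mapped to the vector of its edit distances to a set of prototype strings, taken here to be the input strings themselves. The centre in the vector domain is the arithmetic mean, used as a simple stand-in for the median. The inverse step just picks the input string whose vector lies closest to that centre, a crude substitute for the paper's three weighted-mean-based inverse transformations, which the abstract does not detail. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def embed(s, prototypes):
    """Map a string to its vector of distances to the prototypes."""
    return [levenshtein(s, p) for p in prototypes]


def approximate_median(strings):
    """Pick the input string whose embedding is nearest to the mean vector."""
    vectors = [embed(s, strings) for s in strings]
    dim = len(strings)
    centre = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

    def sq_dist(v):
        return sum((x - c) ** 2 for x, c in zip(v, centre))

    best = min(range(len(strings)), key=lambda i: sq_dist(vectors[i]))
    return strings[best]


words = ["kitten", "sitten", "sittin", "mitten", "bitten"]
print(approximate_median(words))
```

Because the result is always one of the input strings, this behaves more like a set median than a generalized median, which may be a string outside the set.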
2013-05-28T16:06:00Z
http://hdl.handle.net/2117/16552
Using Evolutive Summary Counters for Efficient Cooperative Caching in Search Engines
Domínguez Sal, David; Aguilar Saborit, Josep; Surdeanu, Mihai; Larriba Pey, Josep
We propose and analyze a distributed cooperative caching strategy based on the Evolutive Summary Counters (ESC), a new data structure that stores an approximate record of the data accesses in each computing node of a search engine. The ESC capture the frequency of accesses to the elements of a data collection, and the evolution of the access patterns for each node in a network of computers. The ESC can be efficiently summarized into what we call ESC-summaries to obtain approximate statistics of the document entries accessed by each computing node.
We use the ESC-summaries to introduce two algorithms that manage our distributed caching strategy: one for the distribution of the cache contents, ESC-placement, and another for the search of documents in the distributed cache, ESC-search. While the former improves the hit rate of the system and keeps a large ratio of data accesses local, the latter reduces the network traffic by restricting the number of nodes queried to find a document.
We show that our cooperative caching approach outperforms state-of-the-art models in hit rate, throughput, and location recall for multiple scenarios, i.e., different query distributions and systems with varying degrees of complexity.
2012-09-21T09:29:32Z
http://hdl.handle.net/2117/13533
Social based layouts for the increase of locality in graph operations
Prat Pérez, Arnau; Domínguez Sal, David; Larriba Pey, Josep
Graphs provide a natural data representation for analyzing the relationships among entities in many application areas. Since the analysis algorithms perform memory-intensive operations, it is important that the graph layout is adapted to take advantage of the memory hierarchy. Here, we propose layout strategies based on community detection to improve the in-memory data locality of generic graph algorithms. We conclude that the detection of communities in a graph provides a layout strategy that improves the performance of graph algorithms consistently over other state-of-the-art strategies.
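A minimal sketch of a community-based layout, assuming the communities have already been detected: vertices are renumbered so that members of the same community receive consecutive identifiers, which tends to place their data close together in memory. The function names and the toy data are hypothetical, and the paper's actual layout strategies may differ.

```python
def community_layout(communities):
    """Map each vertex to a new id so that each community is contiguous.

    `communities` maps a vertex to the label of its community.
    """
    order = sorted(communities, key=lambda v: (communities[v], v))
    return {v: new_id for new_id, v in enumerate(order)}


def relabel(edges, new_id):
    """Rewrite an edge list using the new vertex identifiers."""
    return [(new_id[u], new_id[v]) for u, v in edges]


# Hypothetical input: vertices 1 and 3 form community "a", 0 and 2 form "b".
communities = {0: "b", 1: "a", 2: "b", 3: "a"}
edges = [(0, 2), (1, 3), (2, 3)]

new_id = community_layout(communities)
print(new_id)
print(relabel(edges, new_id))
```

After relabelling, an adjacency array indexed by the new identifiers stores each community in one contiguous range.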
2011-10-17T11:15:59Z
http://hdl.handle.net/2117/13116
Cooperative cache analysis for distributed search engines
Domínguez Sal, David; Pérez Casany, Marta; Larriba Pey, Josep
In this paper, we study the performance of a distributed search engine from a data caching point of view using statistical tools on a varied set of configurations. We study two strategies to achieve better performance: cache-aware load balancing, which issues the queries to nodes that store the computation in cache; and cooperative caching (CC), which stores and transfers the available computed contents from one node in the network to others. Since cache-aware decisions depend on information about the recent history, we also analyse how the ageing of this information impacts the system performance. Our results show that the combination of both strategies yields better throughput than implementing cooperative caching or cache-aware load balancing individually, because of a synergistic improvement of the hit rate. Furthermore, the analysis concludes that the data structures used to monitor the system need only moderate precision to achieve optimal throughput.
2011-08-25T10:59:55Z