LARCA - Laboratori d'Algorísmia Relacional, Complexitat i Aprenentatge
http://hdl.handle.net/2117/3486
2015-11-25T18:21:12Z
http://hdl.handle.net/2117/79345
Non-crossing dependencies: Least effort, not grammar
Ferrer Cancho, Ramon
The use of null hypotheses (in a statistical sense) is common in hard sciences but not in theoretical linguistics. Here the null hypothesis that the low frequency of syntactic dependency crossings is expected by an arbitrary ordering of words is rejected. It is shown that this would require star dependency structures, which are both unrealistic and too restrictive. The hypothesis of the limited resources of the human brain is revisited. Stronger null hypotheses taking into account actual dependency lengths for the likelihood of crossings are presented. Those hypotheses suggest that crossings are likely to decrease when dependencies are shortened. A hypothesis based on pressure to reduce dependency lengths is more parsimonious than a principle of minimization of crossings or a grammatical ban that is totally dissociated from the general and non-linguistic principle of economy.
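To make the notion of a dependency crossing concrete, here is a minimal Python sketch. It is not taken from the paper: the function name `count_crossings` and the two toy trees are illustrative assumptions. It counts pairs of dependencies that cross in a given word order, and shows that a star structure, where every dependency shares the hub, cannot produce a crossing.

```python
from itertools import combinations


def count_crossings(edges):
    """Count pairs of dependencies that cross in a linear arrangement.

    `edges` is a list of (i, j) pairs of word positions (hypothetical
    input format). Two dependencies cross when exactly one endpoint of
    one lies strictly inside the span of the other.
    """
    spans = [tuple(sorted(e)) for e in edges]
    crossings = 0
    for (a, b), (c, d) in combinations(spans, 2):
        if a < c < b < d or c < a < d < b:
            crossings += 1
    return crossings


# Star tree on 5 words with the hub at position 3: edges that share a
# vertex can never cross, so the count is zero for any ordering.
star = [(3, 1), (3, 2), (3, 4), (3, 5)]

# A non-star tree on the same positions; (1, 3) and (2, 4) cross.
other = [(1, 3), (2, 4), (3, 4), (4, 5)]

print(count_crossings(star))
print(count_crossings(other))
```

Comparing such counts against those obtained from random orderings is one way to set up the kind of null hypothesis the abstract refers to; the exact procedure used by the author is not reproduced here.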
2015-11-17T09:45:11Z
http://hdl.handle.net/2117/79017
Entailment among probabilistic implications
Atserias, Albert; Balcázar Navarro, José Luis
We study a natural variant of the implicational fragment of propositional logic. Its formulas are pairs of conjunctions of positive literals, related together by an implicational-like connective; the semantics of this sort of implication is defined in terms of a threshold on a conditional probability of the consequent, given the antecedent: we are dealing with what the data analysis community calls confidence of partial implications or association rules. Existing studies of redundancy among these partial implications have characterized so far only entailment from one premise and entailment from two premises. By exploiting a previously noted alternative view of this entailment in terms of linear programming duality, we characterize exactly the cases of entailment from arbitrary numbers of premises. As a result, we obtain decision algorithms of better complexity; additionally, for each potential case of entailment, we identify a critical confidence threshold and show that it is, actually, intrinsic to each set of premises and antecedent of the conclusion.
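As a rough illustration of the semantics described above, the following Python sketch computes the confidence of a partial implication on a toy dataset and checks it against a threshold. The names `confidence` and `holds`, the transaction data, and the choice to return `None` when no transaction contains the antecedent are assumptions made for this example only. The paper's entailment characterization is not implemented here.

```python
from fractions import Fraction


def confidence(transactions, antecedent, consequent):
    """Conditional frequency of `consequent` among transactions containing `antecedent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    covering = [t for t in transactions if antecedent <= set(t)]
    if not covering:
        return None  # assumption of this sketch: undefined on empty support
    hits = sum(1 for t in covering if consequent <= set(t))
    return Fraction(hits, len(covering))


def holds(transactions, antecedent, consequent, threshold):
    """True when the partial implication reaches the confidence threshold."""
    c = confidence(transactions, antecedent, consequent)
    return c is not None and c >= threshold


# Hypothetical toy dataset: each transaction is a set of items.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]

print(confidence(transactions, {"a"}, {"b"}))           # 2 of the 3 transactions with "a" contain "b"
print(holds(transactions, {"a"}, {"b"}, Fraction(3, 5)))
print(holds(transactions, {"a"}, {"b"}, Fraction(3, 4)))
```

Exact fractions are used instead of floats so that comparisons with a threshold do not depend on rounding.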
2015-11-11T12:38:12Z
http://hdl.handle.net/2117/78645
A multi-scale smoothing kernel for measuring time-series similarity
Troncoso, Alicia; Arias Vicente, Marta; Riquelme Santos, José Cristóbal
In this paper a kernel for time-series data is introduced so that it can be used for any data mining task that relies on a similarity or distance metric. The main idea of our kernel is that it should recognize as highly similar time-series that are essentially the same but may be slightly perturbed from each other: for example, if one series is shifted with respect to the other or if it is slightly misaligned. Namely, our kernel tries to focus on the shape of the time-series and ignores small perturbations such as misalignments or shifts. First, a recursive formulation of the kernel directly based on its definition is proposed. Then it is shown how to efficiently compute the kernel using an equivalent matrix-based formulation. To validate the proposed kernel, three experiments have been carried out. As an initial step, several synthetic datasets have been generated from the UCR time-series repository and the KDD challenge of 2007 with the purpose of validating the kernel-derived distance over shifted time-series. Also, the kernel has been applied to the original UCR time-series to analyze its potential in time-series classification in conjunction with Support Vector Machines. Finally, two real-world applications related to ozone concentration in the atmosphere and electricity demand have been considered.
2015-11-02T14:11:48Z
http://hdl.handle.net/2117/77870
An agent-based model of the emergence and transmission of a language system for the expression of logical combinations
Sierra Santibáñez, Josefina
This paper presents an agent-based model of the emergence and transmission of a language system for the expression of logical combinations of propositions. The model assumes the agents have some cognitive capacities for invention, adoption, repair, induction and adaptation, a common vocabulary for basic categories, and the ability to construct complex concepts using recursive combinations of basic categories and logical categories. It also supposes the agents initially do not have a vocabulary for logical categories (i.e. logical connectives), nor grammatical constructions for expressing logical combinations of basic categories through language. The results of the experiments we have performed show that a language system for the expression of logical combinations emerges as a result of a process of self-organisation of the agents’ linguistic interactions. Such a language system is concise, because it only uses words and grammatical constructions for three logical categories (i.e. and, or, not). It is also expressive, since it allows the communication of logical combinations of categories of the same complexity as propositional logic formulas, using linguistic devices such as syntactic categories, word order and auxiliary words. Furthermore, it is easy to learn and reliably transmitted across generations, according to the results of our experiments.
2015-10-19T10:27:18Z
http://hdl.handle.net/2117/77862
Zipf's law for word frequencies: Word forms versus lemmas in long texts
Corral, Alvaro; Boleda Torrent, Gemma; Ferrer Cancho, Ramon
Zipf's law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf's law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with different levels of morphological complexity. In all cases Zipf's law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf's law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies. In contrast, the low-frequency cut-offs are less stable, tending to increase substantially after the transformation.
2015-10-19T08:26:11Z
http://hdl.handle.net/2117/77791
ALOJA-ML: a framework for automating characterization and knowledge discovery in Hadoop deployments
Berral García, Josep Lluís; Poggi, Nicolas; Carrera Pérez, David; Call, Aaron; Reinauer, Rob; Green, Daron
This article presents ALOJA-Machine Learning (ALOJA-ML), an extension to the ALOJA project that uses machine learning techniques to interpret Hadoop benchmark performance data and performance tuning; here we detail the approach, efficacy of the model and initial results.
The ALOJA-ML project is the latest phase of a long-term collaboration between BSC and Microsoft, to automate the characterization of cost-effectiveness on Big Data deployments, focusing on Hadoop.
Hadoop presents a complex execution environment, where costs and performance depend on a large number of software (SW) configurations and on multiple hardware (HW) deployment choices.
Recently the ALOJA project presented an open, vendor-neutral repository, featuring over 16,000 Hadoop executions. These results are accompanied by a test bed and tools to deploy and evaluate the cost-effectiveness of the different hardware configurations, parameter tunings, and Cloud services.
Despite early success within ALOJA from expert-guided benchmarking, it became clear that a genuinely comprehensive study requires automation of modeling procedures to allow a systematic analysis of large and resource-constrained search spaces.
ALOJA-ML provides such an automated system allowing knowledge discovery by modeling Hadoop executions from observed benchmarks across a broad set of configuration parameters.
The resulting empirically derived performance models can be used to forecast execution behavior of various workloads; they allow a priori prediction of the execution times for new configurations and HW choices and they offer a route to model-based anomaly detection. In addition, these models can guide the benchmarking exploration efficiently, by automatically prioritizing candidate future benchmark tests.
Insights from ALOJA-ML's models can be used to reduce the operational time on clusters, speed up the data acquisition and knowledge discovery process, and, importantly, reduce running costs.
In addition to learning from the methodology presented in this work, the community can benefit in general from ALOJA data-sets, framework, and derived insights to improve the design and deployment of Big Data applications.
2015-10-15T17:25:03Z
http://hdl.handle.net/2117/28349
Displacement logic for anaphora
Morrill, Glyn; Valentín Fernández Gallart, José Oriol
The displacement calculus of Morrill, Valentín and Fadda (2011) [25] aspires to replace the calculus of Lambek (1958) [13] as the foundation of categorial grammar by accommodating intercalation as well as concatenation while remaining free of structural rules and enjoying Cut-elimination and its good corollaries. Jäger (2005) [11] proposes a type logical treatment of anaphora with syntactic duplication using limited contraction. Morrill and Valentín (2010) [24] apply (modal) displacement calculus to anaphora with lexical duplication and propose extension with a negation as failure in conjunction with additives to capture binding conditions. In this paper we present an account of anaphora developing characteristics and employing machinery from both of these proposals.
2015-06-19T09:15:45Z
http://hdl.handle.net/2117/28306
The placement of the head that minimizes online memory: a complex systems approach
Ferrer Cancho, Ramon
It is well known that the length of a syntactic dependency determines its online memory cost. Thus, the problem of the placement of a head and its dependents (complements or modifiers) that minimizes online memory is equivalent to the problem of the minimum linear arrangement of a star tree. However, how that length is translated into cognitive cost is not known. This study shows that the online memory cost is minimized when the head is placed at the center, regardless of the function that transforms length into cost, provided only that this function is strictly monotonically increasing. Online memory defines a quasi-convex adaptive landscape with a single central minimum if the number of elements is odd and two central minima if that number is even. We discuss various aspects of the dynamics of word order of subject (S), verb (V) and object (O) from a complex systems perspective and suggest that word orders tend to evolve by swapping adjacent constituents from an initial or early SOV configuration that is attracted towards a central word order by online memory minimization. We also suggest that the stability of SVO is due to at least two factors, the quasi-convex shape of the adaptive landscape in the online memory dimension and online memory adaptations that avoid regression to SOV. Although OVS is also optimal for placing the verb at the center, its low frequency is explained by its long distance to the seminal SOV in the permutation space.
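The central claim about head placement can be checked numerically on small cases. The Python sketch below is an illustration under stated assumptions, not the paper's derivation: it places a head and its dependents at positions 1..n, sums a cost function `g` over the dependency lengths, and lists the head positions with minimum total cost. The function names and the two example cost functions (identity and square, both strictly increasing on positive lengths) are hypothetical choices.

```python
def star_cost(n, head_pos, g):
    """Total cost of a star tree on positions 1..n with the head at head_pos.

    Each dependent at position q contributes g(|head_pos - q|).
    """
    return sum(g(abs(head_pos - q)) for q in range(1, n + 1) if q != head_pos)


def best_head_positions(n, g):
    """Head positions that minimize the total cost (exhaustive search)."""
    costs = {p: star_cost(n, p, g) for p in range(1, n + 1)}
    best = min(costs.values())
    return [p for p, c in costs.items() if c == best]


def linear(d):
    return d       # cost equal to length


def quadratic(d):
    return d * d   # cost growing faster than length


print(best_head_positions(5, linear), best_head_positions(5, quadratic))
print(best_head_positions(6, linear), best_head_positions(6, quadratic))
```

For an odd number of elements the search returns a single central position, and for an even number it returns the two central positions, which matches the single-minimum and two-minima cases described in the abstract. Integer-valued cost functions are used so that ties are detected exactly.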
2015-06-15T11:41:48Z
http://hdl.handle.net/2117/28305
Reply to the commentary "Be careful when assuming the obvious", by P. Alday
Ferrer Cancho, Ramon
Here we respond to some comments by Alday concerning headedness in linguistic theory and the validity of the assumptions of a mathematical model for word order. For brevity, we focus only on two assumptions: the unit of measurement of dependency length and the monotonicity of the cost of a dependency as a function of its length. We also revise the implicit psychological bias in Alday’s comments. Notwithstanding, Alday is indicating the path for linguistic research with his unusual concerns about parsimony from multiple dimensions.
2015-06-15T11:27:58Z
http://hdl.handle.net/2117/28279
The risks of mixing dependency lengths from sequences of different length
Ferrer Cancho, Ramon; Liu, Haitao
Mixing dependency lengths from sequences of different length is a common practice in language research. However, the empirical distribution of dependency lengths of sentences of the same length differs from that of sentences of varying length. The distribution of dependency lengths depends on sentence length for real sentences and also under the null hypothesis that dependencies connect vertices located in random positions of the sequence. This suggests that certain results, such as the distribution of syntactic dependency lengths mixing dependencies from sentences of varying length, could be a mere consequence of that mixing. Furthermore, differences in the global averages of dependency length (mixing lengths from sentences of varying length) for two different languages do not simply imply a priori that one language optimizes dependency lengths better than the other because those differences could be due to differences in the distribution of sentence lengths and other factors.
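The dependence on sentence length under a random null hypothesis can be seen in a small exact computation. The Python sketch below assumes, as one reading of the null hypothesis, that a dependency links two distinct positions chosen uniformly at random in a sequence of n words; the function names are hypothetical. Under that assumption the mean length works out to (n + 1) / 3, so it grows with n, and pooling sentences of different lengths blends different distributions.

```python
from fractions import Fraction
from itertools import combinations


def null_length_distribution(n):
    """Exact distribution of |i - j| for a uniformly random pair of
    distinct positions in a sequence of n >= 2 words (assumed null model)."""
    pairs = list(combinations(range(1, n + 1), 2))
    counts = {}
    for i, j in pairs:
        counts[j - i] = counts.get(j - i, 0) + 1
    return {d: Fraction(c, len(pairs)) for d, c in counts.items()}


def mean_length(dist):
    """Expected dependency length under the given distribution."""
    return sum(d * p for d, p in dist.items())


short = null_length_distribution(4)
long_ = null_length_distribution(10)

print(mean_length(short))   # equals (4 + 1) / 3
print(mean_length(long_))   # equals (10 + 1) / 3
print(short[1], long_[1])   # probability of length 1 differs with n
```

Because both the mean and the probability of each length change with n even in this random baseline, a difference between two pooled averages may reflect the sentence length distributions of the samples and not the languages themselves.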
2015-06-11T11:35:43Z