The development of hybrid MT systems is guided by the evaluation method used to compare different combinations of basic subsystems. This work presents a deep evaluation experiment of a hybrid architecture that tries to get the best of both worlds, rule-based and statistical. In a first evaluation, human assessments were used to compare only the single statistical system and the hybrid one; the rule-based system was not evaluated by hand because the results of automatic evaluation showed a clear disadvantage. A second, wider evaluation experiment surprisingly showed that, according to human evaluation, the best system was the rule-based one, which had achieved the worst results in automatic evaluation. An examination of sentences with controversial results suggested that the linguistic well-formedness of the output should be considered in evaluation. After experimenting with 6 possible metrics, we conclude that a simple arithmetic mean of BLEU and BLEU calculated on the parts of speech of the words is clearly a more human-conformant metric than lexical metrics alone.
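As an illustration of the combined metric described above, the following is a minimal sketch, not the authors' implementation. It assumes a plain corpus-level BLEU (n-grams up to 4, one reference per sentence, no smoothing) and assumes that the part-of-speech sequences have already been produced by some tagger; the function names `corpus_bleu` and `combined_score` and the tag labels are hypothetical, and the abstract does not state the exact BLEU configuration or tagger used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis, no smoothing (assumed setup)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


def combined_score(hyp_words, ref_words, hyp_pos, ref_pos):
    """Arithmetic mean of word-level BLEU and BLEU over part-of-speech tags."""
    return (corpus_bleu(hyp_words, ref_words) + corpus_bleu(hyp_pos, ref_pos)) / 2


# Toy data: one lexical mismatch, identical (hypothetical) POS sequences.
ref_words = [["the", "cat", "sat", "on", "the", "mat"]]
hyp_words = [["the", "cat", "sat", "on", "a", "mat"]]
ref_pos = [["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]]
hyp_pos = [["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]]

print(corpus_bleu(hyp_words, ref_words))
print(combined_score(hyp_words, ref_words, hyp_pos, ref_pos))
```

In this toy case the hypothesis differs from the reference in one word but keeps the same part-of-speech sequence, so the combined score is higher than word-level BLEU alone. That is the kind of credit for well-formed output that the averaged metric is meant to give.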
Citation: Labaka, G. [et al.]. Deep evaluation of hybrid architectures: simple metrics correlated with human judgments. In: International Workshop on Using Linguistic Information for Hybrid Machine Translation. "LIHMT 2011 Sponsors International Workshop on Using Linguistic Information for Hybrid Machine Translation". Barcelona: 2011, p. 50-57.