Deep evaluation of hybrid architectures: simple metrics correlated with human judgments
Document typeConference lecture
Rights accessOpen Access
The process of developing hybrid MT systems is guided by the evaluation method used to compare different combinations of basic subsystems. This work presents a deep evaluation experiment of a hybrid architecture that tries to get the best of both worlds, rule-based and statistical. In a first evaluation human assessments were used to compare just the single statistical system and the hybrid one, the rule-based system was not compared by hand because the results of automatic evaluation showed a clear disadvantage. But a second and wider evaluation experiment surprisingly showed that according to human evaluation the best system was the rule-based, the one that achieved the worst results using automatic evaluation. An examination of sentences with controversial results suggested that linguistic well-formedness in the output should be considered in evaluation. After experimenting with 6 possible metrics we conclude that a simple arithmetic mean of BLEU and BLEU calculated on parts of speech of words is clearly a more human conformant metric than lexical metrics alone.
CitationLabaka, G. [et al.]. Deep evaluation of hybrid architectures: simple metrics correlated with human judgments. A: International Workshop on Using Linguistic Information for Hybrid Machine Translation. "LIHMT 2011 Sponsors International Workshop on Using Linguistic Information for Hybrid Machine Translation". Barcelona: 2011, p. 50-57.