Deep evaluation of hybrid architectures: Use of different metrics in MERT weight optimization
Document typeConference report
Rights accessOpen Access
The process of developing hybrid MT systems is usually guided by an evaluation method used to compare different combinations of basic subsystems. This work presents a deep evaluation experiment of a hybrid architecture, which combines rule-based and statistical translation approaches. Differences between the results obtained from automatic and human evaluations corroborate the inappropriateness of pure lexical automatic evaluation metrics to compare the outputs of systems that use very different translation approaches. An examination of sentences with controversial results suggested that linguistic well-formedness should be considered in the evaluation of output translations. Following this idea, we have experimented with a new simple automatic evaluation metric, which combines lexical and PoS information. This measure showed higher agreement with human assessments than BLEU in a previous study (Labaka et al., 2011). In this paper we have extended its usage throughout the system development cycle, focusing on its ability to improve parameter optimization. Results are not totally conclusive. Manual evaluation reflects a slight improvement, compared to BLEU, when using the proposed measure in system optimization. However, the improvement is too small to draw any clear conclusion. We believe that we should first focus on integrating more linguistically representative features in the developing of the hybrid system, and then go deeper into the development of automatic evaluation metrics.
CitationEspaña-Bonet, C. [et al.]. Deep evaluation of hybrid architectures: Use of different metrics in MERT weight optimization. A: Free/Open-Source Rule-Based Machine Translation. "Proceedings of the Free/Open-Source Rule-Based Machine Translation Workshop". Gòteborg: 2013, p. 65-76.