The automatic generation of image captions has received considerable attention, but the problem of evaluating caption generation systems remains comparatively unexplored. We propose a novel evaluation approach based on comparing the underlying visual semantics of candidate and ground-truth captions. With this goal in mind, we have defined a semantic-tuple (ST) representation for visually descriptive language and have augmented a subset of the Flickr-8K dataset with ST annotations. Our evaluation metric, BAST, can be used not only to compare systems but also to perform error analysis and gain a better understanding of the types of mistakes a system makes. Computing BAST requires predicting the semantic representation of automatically generated captions; we therefore use the Flickr-ST dataset to train classifiers that predict STs, so that evaluation can be fully automated.
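As a rough illustration of the idea of scoring a candidate caption by the overlap of its semantic tuples with the ground-truth tuples, the sketch below computes a simple set-overlap F1. This is a hypothetical simplification for exposition only; the function name, tuple format, and aggregation are assumptions, not the actual BAST definition from the paper.

```python
def tuple_f1(predicted, reference):
    """Set-overlap F1 between predicted and reference semantic tuples.

    Each tuple is a hashable triple, e.g. (subject, predicate, object).
    This is an illustrative stand-in, not the paper's BAST metric.
    """
    pred, ref = set(predicted), set(reference)
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)          # tuples shared by both captions
    precision = overlap / len(pred)    # fraction of predicted tuples that are correct
    recall = overlap / len(ref)        # fraction of reference tuples recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: candidate caption captures one of two reference tuples
# and also predicts one spurious tuple.
pred = [("man", "ride", "bike"), ("man", "wear", "hat")]
ref = [("man", "ride", "bike"), ("bike", "on", "road")]
print(tuple_f1(pred, ref))  # → 0.5
```

In practice a metric of this kind would be averaged over a test set, and partial credit could be given for matching individual tuple slots rather than whole tuples.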
Citation: Ellebracht, L., Ramisa, A., Shantharam, P., Cordero, J., Moreno-Noguer, F., Quattoni, A. "Semantic tuples for evaluation of image sentence generation." In: Proceedings of the 4th Workshop on Vision and Language, Lisbon, 2015, pp. 18-28.