Evaluation of Natural Language Generation


Benchmarking and evaluation are vital for both scientific progress and the successful commercial use of machine learning and NLP systems, so a substantial part of my research has focused on this topic. In the field of natural language generation (NLG), I motivated the need for a new, more reliable metric after thoroughly analyzing existing automatic evaluation metrics such as BLEU and ROUGE. As a natural follow-up, I worked on a novel automatic quality-estimation metric for NLG that requires no human-authored references and still improves correlation with human ratings. Beyond automatic metrics, I demonstrated that experimental design has a significant impact on the reliability and consistency of human judgements, and introduced RankME, a rank-based magnitude estimation method that yields better agreement amongst human raters and more discriminative results in human-based NLG evaluation.
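
To give a flavour of this kind of metric analysis, the sketch below (toy data only, not the original study code) computes sentence-level BLEU with sacrebleu and correlates the scores with hypothetical human ratings using Spearman's rho:

```python
# Minimal sketch: correlate an automatic metric with human ratings.
# The outputs, references and ratings below are purely illustrative;
# sacrebleu and scipy are assumed to be installed.
from sacrebleu.metrics import BLEU
from scipy.stats import spearmanr

outputs = [
    "there is a cheap restaurant near the riverside",
    "the wrestlers is a pub serving fast food",
    "a family friendly coffee shop in the city centre",
    "blue spice is an expensive french restaurant",
]
references = [
    "a cheap restaurant can be found near the riverside",
    "the wrestlers is a fast food pub",
    "a family friendly coffee shop located in the city centre",
    "blue spice serves expensive french food",
]
human_ratings = [4, 3, 5, 2]  # hypothetical 1-5 quality judgements

# effective_order avoids zero BLEU on very short segments
bleu = BLEU(effective_order=True)
segment_scores = [
    bleu.sentence_score(out, [ref]).score
    for out, ref in zip(outputs, references)
]

rho, p_value = spearmanr(segment_scores, human_ratings)
print(f"Spearman correlation between BLEU and human ratings: {rho:.2f} (p={p_value:.2f})")
```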

To address the scarcity of high-quality in-domain corpora, a major bottleneck for proper benchmarking, my colleagues at Heriot-Watt University and I collected the E2E dataset, at the time a large and lexically diverse corpus for training end-to-end NLG systems in the restaurant domain. With this dataset, we organized the first E2E NLG shared task, which assessed whether recent end-to-end NLG systems can generate more complex output by learning from data with greater lexical richness, syntactic complexity and diverse discourse phenomena. Since then, the E2E dataset has been included in the Hugging Face datasets repository, the GEM living benchmark, and the BIG-bench collaborative benchmark, while the shared task itself has influenced, inspired and motivated a number of studies outwith the original competition.
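
For reference, a minimal sketch of loading the data through the Hugging Face datasets library; the dataset identifier "e2e_nlg" and the column names below are assumed from the public hub listing and may differ between dataset versions:

```python
# Minimal sketch: load the E2E NLG data from the Hugging Face hub.
from datasets import load_dataset

e2e = load_dataset("e2e_nlg")             # splits: train / validation / test
example = e2e["train"][0]
print(example["meaning_representation"])  # slot-value meaning representation
print(example["human_reference"])         # crowd-sourced natural-language reference
```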

Some additional details are available in the presentation embedded below:

WiNLP 2017 presentation, Jekaterina Novikova