Data-to-text Natural Language Generation and Evaluation

HIT-MSRA Summer School, Harbin Institute of Technology, Harbin, China, July 2012

Course description

Natural Language Generation (NLG) is concerned with systems that produce text from non-linguistic input (for example, numerical data, logical forms, or structured database entries). Over the past couple of decades, there have been several successful applications of NLG technology, both commercial and academic.

This set of lectures will introduce the field of NLG in some detail. After an introductory overview, the lectures will focus on the following current research issues:

  1. The design of systems that summarise raw, unstructured data from different sources. Recent work on the BabyTalk project will serve as a running example. BabyTalk developed a family of systems that summarise large amounts of raw medical data collected about patients in a Neonatal Intensive Care Unit, for different audiences (doctors, nurses, family members). Through this example, the lectures will illustrate some of the challenges in the following areas of NLG: (a) Document Planning and content selection from large, heterogeneous datasets; (b) Dealing with large numbers of events with diverse temporal, causal and rhetorical relations, and especially the challenges this raises for the generation of coherent narrative text.
  2. Evaluation in NLG. The themes in this part of the course will be the following: (a) Methodological issues in large-scale system evaluations with users; (b) Types of evaluation methods for assessing the quality of specific sub-tasks of NLG. Here the particular focus will be on Referring Expression Generation (REG): the lectures will consider some recent work on comparative evaluation of REG algorithms, as well as attempts to evaluate these algorithms against psycholinguistic evidence.

Lectures and slides

The lectures have been divided into four broad topics, though we may not have time to cover all the material.

  1. Lecture 1: Introduction to NLG and data-to-text systems. [pptx] [pdf]
  2. Lecture 2: Planning documents from large data sets [pptx] [pdf]
  3. Lecture 3: Handling time and temporal coherence in the generation of narratives. [pptx] [pdf]
  4. Lecture 4: NLG Evaluation: Problems and methods. [pptx] [pdf]

Practical

The project related to this course will focus on a small-scale task to generate text from real-world data. Some parts (but not all) require programming. For the task, you'll need the following:

Bibliography

Here are some readings that give background to the material covered in the course. Most of these are available online (if no link is provided, try Google!).

General background on NLG

E. Reiter and R. Dale (2000). Building Natural Language Generation Systems. Cambridge: CUP

E. Reiter and R. Dale (1997). Building Applied Natural-Language Generation Systems. Natural Language Engineering, 3: 57-87. [pdf]

E. Reiter (2010). Natural Language Generation. In: A. Clark, C. Fox, and S. Lappin (Eds.), Handbook of Computational Linguistics and Natural Language Processing. Wiley.

Data to text generation and example systems

Here is some background literature on NLG systems that generate text from raw data in some form. This is not intended to be exhaustive, nor is it intended to cover the field of NLG as a whole, but only those kinds of approaches that have some relationship to the things discussed in the course.

K. Kukich (1983). Design of a knowledge-based report generator. Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics (ACL-83), Cambridge, MA.

L. Iordanskaja, M. Kim, R. Kittredge, B. Lavoie and A. Polguere (1992). Generation of extended bilingual statistical reports. Proceedings of the 15th International Conference on Computational Linguistics (COLING-92), Nantes, France.

E. Goldberg, N. Driedger and R.I. Kittredge (1994). Using natural language processing to produce weather forecasts. IEEE Expert 9(2): 45-53.

J. Coch (1998). Interactive generation and knowledge administration in MULTIMETEO. Proceedings of the 9th International Workshop on Natural Language Generation (IWNLG-98), Niagara-on-the-Lake, ON, Canada.

C.B. Callaway and J.C. Lester (2002). Narrative prose generation. Artificial Intelligence, 139(2): 213-252.

D. Hueske-Kraus (2003). Suregen-2: A shell system for the generation of clinical documents. Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-03), Budapest, Hungary.

D. Hueske-Kraus (2003). Text generation in clinical medicine: A review. Methods of Information in Medicine, 42(1): 51-60.

S. Sripada, E. Reiter and I. Davy (2003). Sumtime-mousam: Configurable marine weather forecast generator. Expert Update, 6(3): 4-10.

E. Reiter, S. Sripada, J. Hunter, J. Yu and I. Davy (2005). Choosing words in computer-generated weather forecasts. Artificial Intelligence, 167: 137-169.

L. Ferres, A. Parush, S. Roberts and G. Lindgaard (2006). Helping people with visual impairments gain access to graphical information through natural language: The igraph system. Proceedings of the 10th International Conference on Computers Helping People with Special Needs (ICCHP-06), University of Linz, Linz, Austria.

C. Mellish, D. Scott, L. Cahill, D. Paiva, R. Evans and M. Reape (2006). A reference architecture for Natural Language Generation systems. Natural Language Engineering, 12(1): 1-34.

A. Cawsey, R. Jones and J. Pearson (2000). An evaluation of a personalised health information system for patients with cancer. User Modeling and User-Adapted Interaction, 10: 47-72.

R. Turner, S. Sripada, E. Reiter and I. Davy (2007). Selecting the content of textual descriptions of geographically located events in spatio-temporal weather data. Proceedings of the Conference on Applications and Innovations in Intelligent Systems XV, Cambridge, UK.

J. Yu, E. Reiter, J. Hunter and C. Mellish (2007). Choosing the content of textual summaries of large time-series data sets. Natural Language Engineering 13: 25-49.

E. Reiter (2007). An architecture for data-to-text systems. Proceedings of the European Workshop on Natural Language Generation (ENLG'07). [pdf]

B. Bohnet, F. Lareau and L. Wanner (2007). Automatic production of multilingual environmental information. Proceedings of the 21st Conference on Informatics for Environmental Protection (EnviroInfo-07), Warsaw, Poland.

M. Harris (2008). Building a large-scale commercial NLG system for an EMR. Proceedings of the 5th International Conference on Natural Language Generation (INLG-08), Salt Fork, OH, USA.

E. Reiter, R. Turner, N. Alm, R. Black, M. Dempster and A. Waller (2009). Using NLG to help language-impaired users tell stories and participate in social dialogues. Proceedings of the 12th European Workshop on Natural Language Generation (ENLG-09), Athens, Greece.

The BabyTalk systems

These are some papers describing various aspects of the BabyTalk Project.

A. Gatt, F. Portet, E. Reiter, J. Hunter, S. Mahamood, W. Moncur and S. Sripada (2009). From data to text in the neonatal intensive care unit: Using NLG technology for decision support and information management. AI Communications, 22: 153-186. [pdf]

F. Portet, E. Reiter, A. Gatt, J. Hunter, S. Sripada, Y. Freer and C. Sykes (2009). Automatic generation of textual summaries from neonatal intensive care data. Artificial Intelligence, 173: 789-816 (doi: 10.1016/j.artint.2008.12.002). [pdf]

J. Hunter, Y. Freer, A. Gatt, E. Reiter, S. Sripada, C. Sykes and D. Westwater (2011). BT-Nurse: Computer Generation of Natural Language Shift Summaries from Complex Heterogeneous Medical Data. Journal of the American Medical Informatics Association (JAMIA), 18(5): 621-624. [pdf]

Computational approaches to handling time

This is a (very selective!) list of references with useful background on time in natural language, especially from a computational point of view. Note that some of these works deal with understanding, not generation, but they are nevertheless useful.

M. Moens and M. Steedman (1988). Temporal ontology and temporal reference. Computational Linguistics, 14(2): 15-28. [pdf]

R.J. Passonneau (1988). A computational model of the semantics of tense and aspect. Computational Linguistics, 14(2): 44-60. [pdf]

B.L. Webber (1988). Tense as a discourse anaphor. Computational Linguistics, 14(2): 61-73. [pdf]

J. Oberlander and A. Lascarides (1992). Preventing false implicatures: Interactive defaults for text generation. Proceedings of the 15th International Conference on Computational Linguistics (COLING-92). [pdf]

B. Dorr and T. Gaasterland (1995). Selecting tense, aspect and connecting words in language generation. Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95). [pdf]

P. Bramsen, P. Deshpande, Y.K. Lee and R. Barzilay (2006). Inducing temporal graphs. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-06).

I. Mani, M. Verhagen, B. Wellner, C.M. Lee and J. Pustejovsky (2006). Machine learning of temporal relations. Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL-06).

D.K. Elson and K.R. McKeown (2010). Tense and aspect assignment in narrative discourse. Proceedings of the 6th International Conference on Natural Language Generation (INLG-10), Dublin, Ireland. [pdf]

F. Portet and A. Gatt (2010). Towards a Possibility-Theoretic Approach to Uncertainty in Medical Data Interpretation for Text Generation. In: D. Riano, A. ten Teije, S. Miksch, and M. Peleg (Eds.), Knowledge Representation for Health-Care: Data, Processes and Guidelines. Berlin and Heidelberg: Springer (LNAI 5943). [pdf]

I. Mani (2010). The Imagined Moment: Time, Narrative and Computation. Nebraska: University of Nebraska Press.

A. Gatt and F. Portet (2011). If it may have happened before, it happened, but not necessarily before. Proceedings of the 13th European Workshop on Natural Language Generation (ENLG'11). [pdf]

Evaluation issues in generation and beyond

Here are some recent papers on evaluation in NLP. More information, including datasets and downloadable reports, is available on the Generation Challenges website.

J.C. Lester and B.W. Porter (1997). Developing and empirically evaluating robust explanation generators: The KNIGHT experiments. Computational Linguistics, 23(1): 65-101. [pdf]

K. Papineni, S. Roukos, T. Ward and W.-J. Zhu (2002). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, PA. [pdf]

I. Langkilde-Geary (2002). An empirical verification of coverage and correctness for a general-purpose sentence generator. Proceedings of the 2nd International Conference on Natural Language Generation (INLG'02), Harriman, NY. [pdf]

C.B. Callaway (2003). Evaluating coverage for large, symbolic NLG grammars. Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, Mexico. [pdf]

C-Y. Lin and E. Hovy (2003). Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics. Proceedings of HLT-NAACL-03, Edmonton, Canada. [pdf]

E. Reiter, R. Robertson and L. Osman (2003). Lessons from a failure: Generating tailored smoking cessation letters. Artificial Intelligence, 144: 41-58. [pdf]

A. Nenkova, R. Passonneau and K. McKeown (2007). The Pyramid Method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, 4(2). [pdf]

A. Belz and A. Gatt (2008). Intrinsic vs. extrinsic evaluation measures for referring expression generation. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL-08). [pdf]

E. Reiter and A. Belz (2009). An investigation into the validity of some metrics for automatically evaluating NLG systems. Computational Linguistics, 35(4): 529-558. [pdf]

A. Gatt and A. Belz (2010). Introducing shared task evaluation to NLG: The TUNA shared task evaluation challenges. In: E. Krahmer and M. Theune (Eds.), Empirical Methods in Natural Language Generation. Berlin and Heidelberg: Springer (LNCS 5790). [pdf]

A. Gatt and F. Portet (2010). Textual properties and task-based evaluation: Investigating the role of surface properties, structure and content. Proceedings of the 6th International Conference on Natural Language Generation (INLG-10). [pdf]

A. Belz, E. Kow, J. Viethen and A. Gatt (2010). Generating referring expressions in context: The GREC task evaluation challenges. In: E. Krahmer and M. Theune (Eds.), Empirical Methods in Natural Language Generation. Berlin and Heidelberg: Springer (LNCS 5790). [pdf]

A. Belz and A. Gatt (2012). A Repository of Data and Evaluation Resources for Natural Language Generation. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'12). [pdf]