Corpus Analysis: A great way to understand what your NLG system needs to do

Practical advice on building NLG systems

One of the biggest challenges in creating NLG systems is mapping the input data to words and phrases in the narrative. If you have some examples of good narratives (written by a person), then I recommend that you start the process of mapping data to words by marking up (annotating) some of the examples, according to where each piece of information comes from. The process is fairly quick, and can give good insight into the challenges and feasibility of the project.

I suggest you do this by loading the examples into Word (or whatever editor you use), and color-coding text fragments as follows:

Unchanging (normal) – text does not change (i.e., this text does not depend on data)

Direct data (yellow) – text is taken directly from the input data

Derived insights (blue) – insights derived from the data

Not available (purple) – information is not present or derivable from data

It’s probably easiest to show this with an example: Let’s assume that the NLG system is supposed to generate a summary of an election result. The developers ask a human journalist to write a short summary of an election result, and she produces the following:

“Kirsty Blackman was re-elected as the MP for Aberdeen North, with 54% of the votes. Her margin of victory was much higher than in previous elections, perhaps because her main opponent, Ryan Houghton, was suspended from the Conservative Party because of anti-Semitic comments.”

Let’s also assume that the input data to the system is election data, including historic as well as current elections. In this case, we can annotate what the journalist wrote as follows:

“[yellow: Kirsty Blackman] was [yellow: re]-elected as the MP for [yellow: Aberdeen North], with [yellow: 54%] of the votes. Her margin of victory was [blue: much higher than in previous elections], perhaps because [purple: her main opponent, Ryan Houghton, was suspended from the Conservative Party because of anti-Semitic comments].”

In this case,

“Kirsty Blackman”, the “re” in “re-elected”, “Aberdeen North”, and “54%” are directly present in the data.
“much higher than in previous elections” is an insight which is computed by looking at historic election data.
“her main opponent, Ryan Houghton, was suspended from the Conservative Party because of anti-Semitic comments” is not derivable just from election data.
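An annotation like this can also be recorded in a machine-readable form, which makes it easy to tally how much of a narrative falls into each category. The following is a minimal sketch in Python, not part of any existing tool: the `Category` enum, the `Fragment` class, and the `coverage` helper are illustrative names. It also assumes that any text the analysis above does not explicitly classify (such as “perhaps because”) counts as unchanging.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    UNCHANGING = "normal"      # fixed text, does not depend on data
    DIRECT = "yellow"          # taken directly from the input data
    DERIVED = "blue"           # insight computed from the data
    NOT_AVAILABLE = "purple"   # not present in or derivable from the data


@dataclass
class Fragment:
    text: str
    category: Category


# The journalist's summary, split into annotated fragments.
# Fragments not explicitly classified in the article are assumed UNCHANGING.
annotated = [
    Fragment("Kirsty Blackman", Category.DIRECT),
    Fragment(" was ", Category.UNCHANGING),
    Fragment("re", Category.DIRECT),
    Fragment("-elected as the MP for ", Category.UNCHANGING),
    Fragment("Aberdeen North", Category.DIRECT),
    Fragment(", with ", Category.UNCHANGING),
    Fragment("54%", Category.DIRECT),
    Fragment(" of the votes. Her margin of victory was ", Category.UNCHANGING),
    Fragment("much higher than in previous elections", Category.DERIVED),
    Fragment(", perhaps because ", Category.UNCHANGING),
    Fragment(
        "her main opponent, Ryan Houghton, was suspended from the "
        "Conservative Party because of anti-Semitic comments",
        Category.NOT_AVAILABLE,
    ),
    Fragment(".", Category.UNCHANGING),
]


def coverage(fragments):
    """Return the share of characters that falls into each category."""
    counts = Counter()
    for fragment in fragments:
        counts[fragment.category] += len(fragment.text)
    total = sum(counts.values())
    return {category: counts[category] / total for category in Category}


# Joining the fragments reproduces the original narrative.
narrative = "".join(fragment.text for fragment in annotated)

for category, share in coverage(annotated).items():
    print(f"{category.name:<14} {share:.0%}")
```

A large share of “not available” text, as in this example, is an early warning that the system will need additional data sources to reproduce what the human writer produced.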

This analysis is fairly quick and easy to do. It highlights the sorts of insights that we will need to compute, and also useful content that cannot be generated unless we provide the system with more data. I recommend that anyone building an NLG system start off by doing corpus analysis, if example narratives are available!
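To give a feel for what computing such an insight might involve, here is a rough sketch of how a phrase like “much higher than in previous elections” could be derived from historic margins. Everything in it is an assumption made for illustration: the function name, the thresholds, and the margin figures are hypothetical and are not taken from real election data or from any particular NLG system.

```python
def describe_margin_change(current, previous, big=10.0, small=2.0):
    """Describe the current margin of victory relative to earlier ones.

    Margins are in percentage points. The thresholds `big` and `small`
    are arbitrary illustrative values, not established conventions.
    """
    if not previous:
        raise ValueError("need at least one previous margin to compare against")
    baseline = sum(previous) / len(previous)   # average of earlier margins
    diff = current - baseline
    if abs(diff) < small:
        return "similar to previous elections"
    size = "much " if abs(diff) >= big else ""
    direction = "higher" if diff > 0 else "lower"
    return f"{size}{direction} than in previous elections"


# Made-up margins (percentage points), for illustration only.
print(describe_margin_change(current=33.0, previous=[11.0, 13.0]))
# prints "much higher than in previous elections"
```

In a real system the choice of baseline (last election, average, trend) and of thresholds would itself be informed by the corpus: the annotated examples show which comparisons human writers actually make and how they word them.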

If you want to learn more about corpus analysis, see chapter 2 of my book Building Natural Language Generation Systems.

About the author: Prof. Ehud Reiter, Arria’s Chief Scientist, is a pioneer in the science of Natural Language Generation (NLG) and one of the world’s foremost authorities in the field. He is responsible for the overall direction of Arria’s core technology development as well as supervision of specific NLG projects. He is Professor of Computing Science at the University of Aberdeen School of Natural and Computing Sciences.
