On the Evaluation of Machine-Generated Reports

Published

Author(s)

James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, Noah Hibbler

Abstract

Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of users. In this perspective paper, we draw together opinions from industry and academia, and from a variety of related research areas, to present our vision for automatic report generation, and, critically, a flexible framework by which such reports can be evaluated. In contrast with other summarization tasks, automatic report generation starts with a detailed description of an information need, stating the necessary background, requirements, and scope of the report. Further, the generated reports should be complete, accurate, and verifiable. These qualities, which are desirable, if not required, in many analytic report-writing settings, require rethinking how to build and evaluate systems that exhibit these qualities. To foster new efforts in building these systems, we present an evaluation framework that draws on ideas found in various evaluations. To test completeness and accuracy, the framework uses nuggets of information, expressed as questions and answers, that need to be part of any high-quality generated report. Additionally, evaluation of citations that map claims made in the report to their source documents ensures verifiability.
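As a rough illustrative sketch only, and not the paper's actual implementation, the snippet below shows one way the two checks the abstract describes (nugget-based completeness and citation-based verifiability) could be expressed in code. The names Nugget, Claim, answers_nugget, and supports are hypothetical; the judgment functions are deliberately pluggable so the same scaffolding could be driven by human assessors, string matching, or an automated judge.

```python
from dataclasses import dataclass, field


@dataclass
class Nugget:
    """An atomic unit of information a high-quality report should contain,
    expressed as a question paired with its acceptable answers."""
    question: str
    acceptable_answers: list[str]


@dataclass
class Claim:
    """A statement made in the generated report, with the IDs of the
    source documents it cites."""
    text: str
    cited_doc_ids: list[str] = field(default_factory=list)


def nugget_recall(report_claims: list[Claim], nuggets: list[Nugget],
                  answers_nugget) -> float:
    """Completeness/accuracy proxy: fraction of nuggets whose answer is
    found somewhere in the report. `answers_nugget(claim, nugget)` is a
    stand-in judgment function (assessor, match heuristic, or LLM judge)."""
    if not nuggets:
        return 0.0
    covered = sum(
        1 for nugget in nuggets
        if any(answers_nugget(claim, nugget) for claim in report_claims)
    )
    return covered / len(nuggets)


def citation_support_rate(report_claims: list[Claim], supports) -> float:
    """Verifiability proxy: fraction of claims backed by at least one cited
    document that actually supports them. `supports(claim, doc_id)` is
    again a stand-in judgment function."""
    if not report_claims:
        return 0.0
    supported = sum(
        1 for claim in report_claims
        if any(supports(claim, doc_id) for doc_id in claim.cited_doc_ids)
    )
    return supported / len(report_claims)
```

Under these assumptions, a report would score well only if it both covers the assessor-defined nuggets and cites documents that genuinely back its claims, mirroring the completeness, accuracy, and verifiability goals stated in the abstract.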
Proceedings Title
Proceedings of ACM SIGIR 2024
Conference Dates
July 15-19, 2024
Conference Location
Washington, DC, US
Conference Title
ACM SIGIR 2024

Keywords

generative AI, AI evaluation

Citation

Mayfield, J., Yang, E., Lawrie, D., MacAvaney, S., McNamee, P., Oard, D., Soldaini, L., Soboroff, I., Weller, O., Kayi, E., Sanders, K., Mason, M. and Hibbler, N. (2024), On the Evaluation of Machine-Generated Reports, Proceedings of ACM SIGIR 2024, Washington, DC, US, [online], https://doi.org/10.1145/3626772.3657846, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=957812 (Accessed October 19, 2024)

Issues

If you have any questions about this publication or are having problems accessing it, please contact reflib@nist.gov.

Created July 14, 2024, Updated October 1, 2024