QuALITY is a multiple-choice question answering dataset with context passages in English that have an average length of about 5,000 tokens. QuALITY is distributed under a CC BY 4.0 License. The dataset can be downloaded from this repository. For more details about QuALITY, please refer to the paper: Pang et al. (2022).
For submission instructions, please refer to this page.
@inproceedings{pang-etal-2022-quality,
    title = "{Q}u{ALITY}: Question Answering with Long Input Texts, Yes!",
    author = "Pang, Richard Yuanzhe and Parrish, Alicia and Joshi, Nitish and Nangia, Nikita and Phang, Jason and Chen, Angelica and Padmakumar, Vishakh and Ma, Johnny and Thompson, Jana and He, He and Bowman, Samuel",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.391",
    pages = "5336--5358",
    abstract = "To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Our baseline models perform poorly on this task (55.4{\%}) and significantly lag behind human performance (93.5{\%}).",
}
Leaderboard (last updated: September 2024)
Important notes:
- Rankings are determined by accuracy on the entire test set.
- Accuracy = (number of correct answers) / (number of examples).
- SAT-style score = (number of correct answers - (1/3) * number of incorrect answers + 0 * number of abstained answers) / (number of examples). A short scoring sketch follows this list.
- [2023/5] Please also refer to the SCROLLS benchmark, which includes the QuALITY task; as of May 2023, the top QuALITY accuracy on SCROLLS is 48.1 (test set) / 43.8 (hard subset of the test set) by CoLT5 XL.
- [2022/11] We have added promising but unranked results at the bottom of the table.
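For concreteness, here is a minimal Python sketch of how the two metrics above can be computed. It is not the official evaluation script: the function name `score_quality` and the convention of marking abstained answers with `None` are assumptions made here for illustration.

```python
def score_quality(predictions, gold):
    """Compute QuALITY leaderboard metrics (sketch, not the official scorer).

    predictions: list of predicted answer indices; None marks an abstention
                 (an assumed convention for this example).
    gold: list of gold answer indices, same length as predictions.
    """
    assert len(predictions) == len(gold)
    n = len(gold)
    correct = sum(p is not None and p == g for p, g in zip(predictions, gold))
    abstained = sum(p is None for p in predictions)
    incorrect = n - correct - abstained

    # Accuracy = correct / total examples.
    accuracy = correct / n
    # SAT-style score: wrong answers are penalized by 1/3,
    # abstentions contribute 0.
    sat_style = (correct - incorrect / 3) / n
    return accuracy, sat_style


# Toy usage with hypothetical predictions:
preds = [0, 2, None, 1]
golds = [0, 1, 3, 1]
acc, sat = score_quality(preds, golds)
# acc = 0.5; sat = (2 - 1/3) / 4 ~= 0.4167
```

Note the design of the penalty: with four answer options per question, random guessing is correct 1/4 of the time and wrong 3/4 of the time, so its expected SAT-style score is 1/4 - (1/3)(3/4) = 0, whereas abstaining also scores 0.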