QuALITY

Question Answering with Long Input Texts, Yes!

QuALITY is a multiple-choice question answering dataset with context passages in English that have an average length of about 5,000 tokens. QuALITY is distributed under a CC BY 4.0 License. The dataset can be downloaded from this repository. For more details about QuALITY, please refer to the paper: Pang et al. (2022).
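
If the split files are JSON-lines (one JSON record per passage with its question set), a loader might look like the minimal sketch below. The field names used here (article, questions, question, options, gold_label) are assumptions based on typical QuALITY releases, not a guaranteed schema; check the data README in the repo after downloading.

    import json

    def load_quality(path):
        """Yield (article, question, options, gold_label) tuples from one split file."""
        # Assumed format: one JSON object per line, each holding a passage ("article")
        # and its question set ("questions"). Field names are assumptions, not an
        # official schema.
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                for q in record.get("questions", []):
                    yield (
                        record.get("article", ""),
                        q.get("question", ""),
                        q.get("options", []),
                        q.get("gold_label"),  # typically absent on the unlabeled test split
                    )

    # Usage (file name is hypothetical):
    # for article, question, options, gold in load_quality("quality.train"):
    #     ...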

For submission instructions, please refer to this page.

@inproceedings{pang-etal-2022-quality,
    title = "{Q}u{ALITY}: Question Answering with Long Input Texts, Yes!",
    author = "Pang, Richard Yuanzhe  and
      Parrish, Alicia  and
      Joshi, Nitish  and
      Nangia, Nikita  and
      Phang, Jason  and
      Chen, Angelica  and
      Padmakumar, Vishakh  and
      Ma, Johnny  and
      Thompson, Jana  and
      He, He  and
      Bowman, Samuel R.",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.391",
    pages = "5336--5358",
    abstract = "To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Our baseline models perform poorly on this task (55.4{\%}) and significantly lag behind human performance (93.5{\%}).",
}

Leaderboard (last updated: September 2024)
Important notes:
  • Rankings are determined by the accuracy on the entire test set.
  • Accuracy = (number of correct answers) / (number of examples).
  • SAT-style score = (number of correct answers - (1/3) * number of incorrect answers + 0 * number of abstained answers) / (number of examples). See the scoring sketch after these notes.
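
As a concrete illustration of the two formulas above, here is a minimal scoring sketch in Python (function and variable names are illustrative; this is not the official evaluation script):

    def score(predictions, gold):
        """Compute accuracy and SAT-style score as defined in the notes above.

        predictions: dict mapping example id -> chosen option, or None to abstain
        gold: dict mapping example id -> correct option
        """
        n = len(gold)
        correct = sum(1 for ex_id, answer in gold.items() if predictions.get(ex_id) == answer)
        abstained = sum(1 for ex_id in gold if predictions.get(ex_id) is None)
        incorrect = n - correct - abstained
        accuracy = correct / n
        sat_style = (correct - incorrect / 3) / n  # abstentions contribute 0 to the numerator
        return accuracy, sat_style
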
Updates:
  • [2023/5] Please also refer to the SCROLLS benchmark, which includes the QuALITY task; as of May 2023, the top QuALITY accuracy on SCROLLS is 48.1 (test set) / 43.8 (hard subset of the test set), achieved by CoLT5 XL.
  • [2022/11] We have added promising but unranked results at the bottom of the table.
Rank | Date | Model name | Team | Accuracy (test set) | Accuracy (hard subset) | SAT-style score (test set) | SAT-style score (hard subset)
0 | 2021/12 | Human annotators | New York University | 93.5 | 89.1 | 91.4 | 85.4
1 | 2024/09 | Baseline model: RAPTOR + gpt-4o w/ query intent & entity understanding | powerdrill.ai | 83.1 | 77.3 | 77.5 | 69.7
2 | 2023/06 | RAPTOR (collapsed tree) + GPT-4 | Stanford University | 82.6 | 76.2 | 77.5 | 69.3
3 | 2024/01 | Baseline model: Long-context GPT-3.5 (gpt-3.5-turbo-16k) as of January 2024 | Anonymous | 74.7 | 64.3 | 66.2 | 52.4
4 | 2023/10 | LongMA: Fine-Tuning TechGPT-7B using QLoRA on QuALITY and RACE subset | Qi Ma, Northeastern University | 73.0 | 64.0 | 64.8 | 53.0
5 | 2022/05 | CoLISA: DPR & DeBERTaV3-large architecture plus contrastive learning & in-sample attention | SUDA NLP & I2R at Soochow University | 62.3 | 54.7 | 49.7 | 39.6
6 | 2022/04 | CoLISA: DPR & DeBERTaV3-large architecture & contrastive learning | SUDA NLP & I2R at Soochow University | 62.1 | 54.3 | 49.5 | 39.1
7 | 2021/12 | Baseline model: DPR retrieval using questions & DeBERTaV3-large with intermediate training on RACE | New York University | 55.4 | 46.1 | 40.5 | 28.1
8 | 2021/12 | Baseline model: DPR retrieval using questions & RoBERTa-large with intermediate training on RACE | New York University | 51.4 | 44.7 | 35.2 | 26.3
9 | 2021/12 | Baseline model: DPR retrieval using questions & DeBERTaV3-large | New York University | 49.0 | 41.2 | 32.0 | 21.6
10 | 2021/12 | Question-only baseline: DeBERTaV3-large with intermediate training on RACE | New York University | 43.3 | 38.2 | 24.4 | 17.6
11 | 2021/12 | Baseline model: fastText retrieval using questions & RoBERTa-large | New York University | 42.7 | 35.7 | 23.6 | 14.3
12 | 2021/12 | Question-only baseline: DeBERTaV3-large | New York University | 39.7 | 35.2 | 19.6 | 13.5
13 | 2021/12 | Baseline model: Longformer with intermediate training on RACE | New York University | 39.5 | 35.3 | 19.4 | 13.8
14 | 2024/01 | Baseline model: Vicuna-7B | Anonymous | 39.1 | 33.9 | 18.8 | 11.9
15 | 2021/12 | Baseline model: Longformer | New York University | 30.7 | 29.3 | 7.6 | 5.7
-- | 2022/11 | Best-of-20 chain-of-thought w/ a 52B-parameter LM (Bai et al., 2022) fine-tuned by reinforcement learning with human feedback (RLHF) [Note: added by QuALITY authors; unranked given that performance is on dev set only] | Anthropic, Surge AI | 66.9 (dev set) | n/a | n/a | n/a