QuALITY

Question Answering with Long Input Texts, Yes!

QuALITY is a multiple-choice question answering dataset with context passages in English that have an average length of about 5,000 tokens. QuALITY is distributed under a CC BY 4.0 License. The dataset can be downloaded from the repo here. For more details about QuALITY, please refer to the paper: Pang et al. (2021).

For submission instructions, please refer to this page.

@article{pang2021quality,
  title={{QuALITY}: Question Answering with Long Input Texts, Yes!},
  author={Pang, Richard Yuanzhe and Parrish, Alicia and Joshi, Nitish and Nangia, Nikita and Phang, Jason and Chen, Angelica and Padmakumar, Vishakh and Ma, Johnny and Thompson, Jana and He, He and Bowman, Samuel R.},
  journal={arXiv preprint arXiv:2112.08608},
  year={2021}
}
Leaderboard
  • Rankings are determined by the accuracy on the entire test set.
  • Accuracy = (number of correct answers) / (number of examples).
  • SAT-style score = (number of correct answers - (1/3) * number of incorrect answers + 0 * number of abstained answers) / (number of examples).
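The two metrics above can be sketched in a few lines of Python. This is only an illustrative sketch, not the official evaluation script: the function name `score`, the use of `None` to mark an abstained answer, and the integer option labels are all assumptions made for the example. Results are returned as fractions, whereas the leaderboard reports them as percentages.

```python
from typing import Dict, Optional, Sequence


def score(predictions: Sequence[Optional[int]], gold: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy and SAT-style score as fractions.

    `predictions` holds one predicted option label per example, or None
    when the model abstains (an assumed convention for this sketch).
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    correct = 0
    incorrect = 0
    for pred, answer in zip(predictions, gold):
        if pred is None:
            continue  # abstained answers contribute 0 to both metrics
        if pred == answer:
            correct += 1
        else:
            incorrect += 1

    total = len(gold)
    return {
        # Accuracy = (number of correct answers) / (number of examples)
        "accuracy": correct / total,
        # SAT-style = (correct - (1/3) * incorrect + 0 * abstained) / (number of examples)
        "sat_style": (correct - incorrect / 3) / total,
    }


if __name__ == "__main__":
    # 2 correct, 1 incorrect, 1 abstained out of 4 examples.
    print(score([1, 2, None, 4], [1, 3, 2, 4]))
```

When a model answers every question, the SAT-style score reduces to accuracy - (1 - accuracy) / 3, which is consistent with the leaderboard rows (for example, an accuracy of 62.3 corresponds to a SAT-style score of about 49.7).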
Rank | Date | Model name | Affiliation | Accuracy (test set) | Accuracy (hard subset) | SAT-style score (test set) | SAT-style score (hard subset)
--- | --- | --- | --- | --- | --- | --- | ---
0 | 2021/12 | Human annotators | New York University | 93.5 | 89.1 | 91.4 | 85.4
1 | 2022/05 | CoLISA: DPR & DeBERTaV3-large architecture plus contrastive learning & in-sample attention | Anonymous (temporary) | 62.3 | 54.7 | 49.7 | 39.6
2 | 2022/04 | CoLISA: DPR & DeBERTaV3-large architecture & contrastive learning | Anonymous (temporary) | 62.1 | 54.3 | 49.5 | 39.1
3 | 2021/12 | Baseline model: DPR retrieval using questions & DeBERTaV3-large with intermediate training on RACE | New York University | 55.4 | 46.1 | 40.5 | 28.1
4 | 2021/12 | Baseline model: DPR retrieval using questions & RoBERTa-large with intermediate training on RACE | New York University | 51.4 | 44.7 | 35.2 | 26.3
5 | 2021/12 | Baseline model: DPR retrieval using questions & DeBERTaV3-large | New York University | 49.0 | 41.2 | 32.0 | 21.6
6 | 2021/12 | Question-only baseline: DeBERTaV3-large with intermediate training on RACE | New York University | 43.3 | 38.2 | 24.4 | 17.6
7 | 2021/12 | Baseline model: fastText retrieval using questions & RoBERTa-large | New York University | 42.7 | 35.7 | 23.6 | 14.3
8 | 2021/12 | Question-only baseline: DeBERTaV3-large | New York University | 39.7 | 35.2 | 19.6 | 13.5
9 | 2021/12 | Baseline model: Longformer with intermediate training on RACE | New York University | 39.5 | 35.3 | 19.4 | 13.8
10 | 2021/12 | Baseline model: Longformer | New York University | 30.7 | 29.3 | 7.6 | 5.7