QuALITY Leaderboard

QuALITY is a multiple-choice question answering dataset with context passages in English that have an average length of about 5,000 tokens. QuALITY is distributed under a CC BY 4.0 License. The dataset can be downloaded from the repo here. For more details about QuALITY, please refer to the paper: Pang et al. (2022).

For submission instructions, please refer to this page.

@inproceedings{pang-etal-2022-quality,
    title = "{Q}u{ALITY}: Question Answering with Long Input Texts, Yes!",
    author = "Pang, Richard Yuanzhe  and
      Parrish, Alicia  and
      Joshi, Nitish  and
      Nangia, Nikita  and
      Phang, Jason  and
      Chen, Angelica  and
      Padmakumar, Vishakh  and
      Ma, Johnny  and
      Thompson, Jana  and
      He, He  and
      Bowman, Samuel",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.391",
    pages = "5336--5358",
    abstract = "To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Our baseline models perform poorly on this task (55.4{\%}) and significantly lag behind human performance (93.5{\%}).",
}
}

Leaderboard (last updated: January 2025)

Important notes:

Rankings are determined by the accuracy on the entire test set.
Accuracy = (number of correct answers) / (num of examples).
SAT-style score = (number of correct answers - (1/3) * number of incorrect answers + 0 * number of abstained answers) / (number of examples).

Updates:

[2023/5] Please also refer to the SCROLLS benchmark which includes the QuALITY task; as of May 2023, the top QuALITY accruacy on SCROLLS is 48.1 (test set) / 43.8 (hard subset of the test set) by CoLT5 XL.
[2022/11] We have added promising but unranked results at the bottom of the table.

	Model name	Paper	Code	Accuracy		SAT-style score
	Model name	Paper	Code	Test set	Hard subset	Test set	Hard subset
0 2021/12	Human annotators New York University		–	93.5	89.1	91.4	85.4
Model description: We estimate human accuracy on QuALITY on a random sample of 20 passages (367 questions). Each question is annotated by 3 new validation annotators who had not previously annotated that passage, and whose labels do not contribute to the assignment of the gold label. See paper for details.
1 2025/01	Clustering and Decomposition using Qwen2.5-7b and chat using DeepSeek Mike Wang			88.0	81.9	84.7	76.8
Model description: We first cluster the initial chunks using a clustering algorithm like RAPTOR to obtain some new chun ks (clustered chunks). After that, we use the Qwen2.5-7b model to do the same information decomposition for all chunks. Each chunk decomposed contains two parts, one is the QA for the content of the chunk itself, and the other is the QA for some details inside. These QAs and all chunks are stored in the vector database as retrieval entries. When entering a query for retrieval and question-answering, all retrieved results will be traced back to the original chunk. We accumulated 30 original chunks from the top-ranked recall results as the context information finally submitted to LLM. [External data or resources: we use the SBERT to generate embeddings, use the Qwen2.5-7b model to generate pseudo QA-pairs, and use the DeepSeek-V3 to get the predict choice.]
2 2024/09	Baseline model: RAPTOR + gpt-4o w/ query intent & entity understanding powerdrill.ai			83.1	77.3	77.5	69.7
Model description: Based on RAPTOR, we first decompose user query into a multi-hop plan. For each plan, key entities are extracted to precisely match corresponding information. We also use a rerank model to filter out unnecessary chunk pieces to avoid long context for cost saving and latency improvement. [dev@powerdrill.ai]
3 2023/06	RAPTOR (collapsed tree) + GPT-4 Stanford University			82.6	76.2	77.5	69.3
Model description: RAPTOR recursively clusters chunks of text and generates text summaries of those clusters constructing a summarization tree from the bottom-up. At inference time, RAPTOR retrieves from this tree, allowing it to integrate information across large text corpora at varying levels of abstraction. RAPTOR employs a novel variant of the Gaussian Mixture Model (GMM) for text clustering and clusters the text after dimensionality reduction on the embeddings using Uniform Manifold Approximation and Projection (UMAP). Then, the formed clusters are summarized using a large language model (LLM). The generated summary text is again subjected to clustering. This process is iteratively performed for a predetermined number of layers. The outcome of this iterative process is a bottom-up hierarchical tree structure, wherein each node signifies a cluster of related text chunks. RAPTOR uses SBERT embeddings for clustering.
4 2024/01	Baseline model: Long-context GPT-3.5 (gpt-3.5-turbo-16k) as of January 2024 Anonymous			74.7	64.3	66.2	52.4
Model description: This entry is submitted anonymously. The numbers reflect the zero-shot performance of gpt-3.5-turbo-16k as of January 2024, but the specific prompt used is unclear.
5 2023/10	LongMA: Fine-Tuning TechGPT-7B using QLoRA on QuALITY and RACE subset Qi Ma, Northeastern University			73.0	64.0	64.8	53.0
Model description: We utilize TechGPT to identify the most pertinent contexts for a given question and set of options, which is used to create subsets derived from the QuALITY and RACE datasets. To ensure compatibility, we limit the maximum token length to 2048 for the question, options, and contexts. During the training phase, we use QLoRA to fine-tune the TechGPT-7b model. In the first stage, we fine-tuned 2 epochs on the RACE subset, followed by a second stage of fine-tuning on the QuALITY subset for 2 epochs.
6 2024/12	RAPTOR+GPT-4o-mini, initial cluster centers determined through subqueries Guanran Luo, Xiamen University			71.7	60.0	62.3	46.7
Model description: Based on RAPTOR, we first decompose the query into multiple subqueries by incorporating the document structure, using these subqueries as the initial cluster centers for the first layer of recursive summarization. Additionally, we have eliminated global clustering. This approach helps improve the efficiency of document tree generation and search, and makes the generated summaries and queries more relevant. [30920240157704 at stu.xmu.edu.cn]
7 2022/05	CoLISA: DPR & DeBERTaV3-large architecture plus contrastive learning & in-sample attention SUDA NLP & I2R at Soochow University			62.3	54.7	49.7	39.6
Model description: We extract a short context from the original passage by DPR (using questions and options). Then we successively fine-tune a multiple-choice model on RACE and QuALITY. Besides, we introduce contrastive learning method and in-sample attention mechanism within each sample to better distinguish the answers from the distractors. External data/resources: We use the off-the-shelf DPR retriever to extract the short contexts from original articles. And we use the pre-trained DeBERTaV3-large model on Hugging Face. Besides, RACE is introduced into our first stage of fine tuning the model.
8 2022/04	CoLISA: DPR & DeBERTaV3-large architecture & contrastive learning SUDA NLP & I2R at Soochow University			62.1	54.3	49.5	39.1
Model description: We extract a short context from the original passage by DPR (using questions and options). Then we successively fine-tune a multiple-choice model on RACE and QuALITY. Besides, we introduce contrastive learning method within each sample to better distinguish the answers from the distractors. External data/resources: We use the off-the-shelf DPR retriever to extract the short contexts from original articles. And we use the pre-trained DeBERTaV3-large model on Hugging Face. Besides, RACE is introduced into our first stage of fine tuning the model.
9 2021/12	Baseline model: DPR retrieval using questions & DeBERTaV3-large with intermediate training on RACE New York University			55.4	46.1	40.5	28.1
Model description: We use DeBERTaV3-large and first do intermediate training on RACE; then we fine-tune the model on QuALITY. For fine-tuning, we use the similarity (based on DPR) between each source sentence and the question to select shorter contexts to feed into the QA model. See paper for details.
10 2021/12	Baseline model: DPR retrieval using questions & RoBERTa-large with intermediate training on RACE New York University			51.4	44.7	35.2	26.3
Model description: We use RoBERTa-large and first do intermediate training on RACE; then we fine-tune the model on QuALITY. For fine-tuning, we use the similarity (based on DPR) between each source sentence and the question to select shorter contexts to feed into the QA model. See paper for details.
11 2021/12	Baseline model: DPR retrieval using questions & DeBERTaV3-large New York University			49.0	41.2	32.0	21.6
Model description: We fine-tune DeBERTa-V3-large on QuALITY. We use the similarity (based on DPR) between each source sentence and the question to select shorter contexts to feed into the QA model. See paper for details.
12 2021/12	Question-only baseline: DeBERTaV3-large with intermediate training on RACE New York University			43.3	38.2	24.4	17.6
Model description: We use DeBERTaV3-large and first do intermediate training on RACE; then we fine-tune the model on QuALITY. For fine-tuning the input only consists of the questions and options (so, no articles).
13 2021/12	Baseline model: fastText retrieval using questions & RoBERTa-large New York University			42.7	35.7	23.6	14.3
Model description: We fine-tune RoBERTa-large on QuALITY. We use the similarity (based on fastText) between each source sentence and the question to select shorter contexts to feed into the QA model. See paper for details.
14 2021/12	Question-only baseline: DeBERTaV3-large New York University			39.7	35.2	19.6	13.5
Model description: Model description: We fine-tune DeBERTa-V3-large on QuALITY but the input only consists of the questions and options (so, no articles).
15 2021/12	Baseline model: Longformer with intermediate training on RACE New York University			39.5	35.3	19.4	13.8
Model description: We first do intermediate training using a pretrained Longformer (which supports up to 4,096 tokens) on RACE. Then, we fine-tune the model on QuALITY. See paper for details.
16 2024/01	Baseline model: Vicuna-7B Anonymous			39.1	33.9	18.8	11.9
Model description: This entry is submitted anonymously. The numbers reflect the zero-shot performance of Vicuna-7B as of January 2024, but the specific prompt used is unclear.
17 2021/12	Baseline model: Longformer New York University			30.7	29.3	7.6	5.7
Model description: We fine-tune a pretrained Longformer which supports up to 4,096 tokens, on QuALITY. See paper for details.
-- 2022/11	Best-of-20 chain-of-thought w/ a 52B-parameter LM (Bai et al., 2022) fine-tuned by reinforcement learning with human feedback (RLHF) [Note: added by QuALITY authors; unranked given that performance is on dev set only] Anthropic, Surge AI	&		66.9	–	–	–
Model description: This submission is manually added by QuALITY authors. From the paper: "We prompt the model with a chain-of-thought-style input before asking them to answer (Nye et al., 2021; Wei et al., 2022). We sample 20 instances of model reasoning and choose the one that scores best under our RLHF preference model before conditioning on that reasoning string to generate the answer."