VEX is a longitudinal benchmark for Automated Short Answer Grading (ASAG).
It is designed to evaluate grading systems under realistic educational conditions by moving beyond isolated item-level prediction and towards virtual exam-based assessment, student-level aggregation, and feedback quality evaluation.
VEX is built from real student responses collected in a live university database systems course. The benchmark supports research on joint grading and feedback generation, with a focus on deployment-relevant evaluation: ranking consistency, pass/fail decisions, grade-boundary agreement, and pedagogical feedback usefulness.
| Property | Value |
|---|---|
| Total student responses | ~31,000 |
| Unique questions | 239 |
| Students | 173 |
| Gold-labeled responses | 3,222 |
| Language | German |
| Domain | University database systems course |
| Score scale | 0, 0.25, 0.5, 0.75, 1 |
| Split strategy | Question-disjoint |
| Evaluation setting | Item-level and virtual-exam level |
The gold subset contains expert-annotated ordinal grades and is used as the benchmark evaluation standard. The remaining data can be used for optional training, representation learning, or silver-label experiments.
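As a concrete illustration of the question-disjoint split strategy, the sketch below partitions responses so that no question appears in both the training and evaluation portions. This is an informal sketch, not the official splitting code: the field name `question_id` is hypothetical, and the actual splits ship with the dataset repositories.

```python
# Minimal sketch of a question-disjoint split: held-out questions never
# appear in training data. The field name (question_id) is hypothetical;
# see the dataset repositories for the actual schema and official splits.
import random

def question_disjoint_split(responses, test_fraction=0.2, seed=0):
    """Partition responses so that train and test share no questions."""
    question_ids = sorted({r["question_id"] for r in responses})
    rng = random.Random(seed)
    rng.shuffle(question_ids)
    n_test = max(1, int(len(question_ids) * test_fraction))
    test_qs = set(question_ids[:n_test])
    train = [r for r in responses if r["question_id"] not in test_qs]
    test = [r for r in responses if r["question_id"] in test_qs]
    return train, test
```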
Most ASAG benchmarks evaluate models as isolated item-level predictors: one answer in, one score out.
However, real educational decisions are rarely based on a single response. They depend on cumulative performance across multiple questions, such as:

- pass/fail outcomes,
- final grades and grade boundaries,
- a student's ranking relative to the cohort.
VEX introduces virtual exams to model this setting. A virtual exam is created by sampling multiple held-out questions and aggregating the answers of students who responded to all selected questions. This makes it possible to evaluate whether a grading system preserves student-level outcomes, not just individual answer scores.
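A minimal sketch of this construction is shown below: sample held-out questions, keep only the students who answered all of them, and aggregate their scores. The field names (`student_id`, `question_id`, `score`) are hypothetical; refer to the dataset repositories for the actual schema.

```python
# Minimal sketch of virtual-exam construction: sample held-out questions,
# keep only students who answered all of them, and aggregate their scores.
# Field names (student_id, question_id, score) are hypothetical.
import random
from collections import defaultdict

def build_virtual_exam(responses, held_out_questions, exam_size=5, seed=0):
    rng = random.Random(seed)
    exam_qs = set(rng.sample(sorted(held_out_questions), exam_size))

    # Collect each student's scores on the sampled questions.
    per_student = defaultdict(dict)
    for r in responses:
        if r["question_id"] in exam_qs:
            per_student[r["student_id"]][r["question_id"]] = r["score"]

    # Keep only students who answered every sampled question, then aggregate.
    return {
        student: sum(scores.values())
        for student, scores in per_student.items()
        if len(scores) == len(exam_qs)
    }
```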
Standard item-level ASAG metrics are reported for comparability with prior work. These metrics measure local scoring quality on individual responses.
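The exact item-level metrics are defined by the released evaluation code; as an illustration, the sketch below computes two metrics commonly reported in ASAG work, accuracy and quadratic weighted kappa (QWK), after mapping the five score levels to integer ranks.

```python
# Minimal sketch: item-level metrics commonly used in ASAG work.
# Which metrics VEX officially reports is defined by its evaluation code;
# accuracy and quadratic weighted kappa (QWK) are shown here as examples.
from sklearn.metrics import accuracy_score, cohen_kappa_score

gold = [0.0, 0.25, 0.5, 0.75, 1.0, 1.0]   # expert scores (toy data)
pred = [0.0, 0.5, 0.5, 0.75, 0.75, 1.0]   # system scores (toy data)

# Map the 5-level score scale {0, 0.25, ..., 1} to integer ranks 0..4
# so that the quadratic kappa weights reflect ordinal distance.
gold_ranks = [round(s * 4) for s in gold]
pred_ranks = [round(s * 4) for s in pred]

print("accuracy:", accuracy_score(gold_ranks, pred_ranks))
print("QWK:", cohen_kappa_score(gold_ranks, pred_ranks,
                                labels=[0, 1, 2, 3, 4],
                                weights="quadratic"))
```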
Virtual exams evaluate cumulative grading behaviour using metrics such as:

- student ranking consistency across an exam,
- pass/fail decision agreement,
- grade-boundary agreement between predicted and gold outcomes.

VEX supports both absolute and distribution-based grading schemes, as sketched below.
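The sketch below illustrates exam-level evaluation on aggregated per-student totals. The 50% pass threshold and the median-based curve are illustrative assumptions, not VEX's official grading schemes.

```python
# Minimal sketch of exam-level metrics, assuming per-student exam totals.
# The 50% threshold and the median-based curve are illustrative assumptions.
import statistics
from scipy.stats import spearmanr

# Per-student totals over one virtual exam (toy data, max score = 6.0).
gold_totals = {"s1": 5.0, "s2": 2.5, "s3": 4.25, "s4": 1.75}
pred_totals = {"s1": 4.75, "s2": 3.0, "s3": 4.5, "s4": 2.0}

students = sorted(gold_totals)
gold = [gold_totals[s] for s in students]
pred = [pred_totals[s] for s in students]

# Ranking consistency: Spearman correlation of the exam totals.
rho, _ = spearmanr(gold, pred)
print("ranking consistency (Spearman rho):", round(rho, 3))

# Absolute scheme: pass/fail agreement at a fixed threshold.
threshold = 0.5 * 6.0
agree = sum((g >= threshold) == (p >= threshold) for g, p in zip(gold, pred))
print("pass/fail agreement:", agree / len(students))

# Distribution-based scheme (illustrative): pass relative to the cohort,
# here everyone at or above the median exam total.
gold_pass = {s for s, g in zip(students, gold) if g >= statistics.median(gold)}
pred_pass = {s for s, p in zip(students, pred) if p >= statistics.median(pred)}
both_pass = gold_pass & pred_pass
both_fail = set(students) - (gold_pass | pred_pass)
print("median-curve agreement:", (len(both_pass) + len(both_fail)) / len(students))
```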
Generated feedback is evaluated along pedagogical dimensions, which allows VEX to assess systems that produce both grades and natural-language feedback.
VEX is intended for research on:

- automated short answer grading,
- joint grading and feedback generation,
- deployment-relevant, student-level evaluation of grading systems.
The benchmark is especially useful for studying cases where item-level metrics alone are insufficient to judge whether a system is safe and useful for educational deployment.
This Hugging Face organization hosts the VEX benchmark resources. Typical resources include:

- dataset releases,
- evaluation code,
- documentation and supporting artifacts.
Please refer to the individual repository README files for exact file descriptions and usage instructions.
If you use VEX, please cite the corresponding paper once available.
```bibtex
@misc{vex2026,
  title  = {VEX: A Virtual Exam Benchmark for Automated Short Answer Grading},
  author = {TBD},
  year   = {2026},
  note   = {Dataset and benchmark for longitudinal ASAG evaluation}
}
```