VEX is a longitudinal benchmark for Automated Short Answer Grading (ASAG).
It is designed to evaluate grading systems under realistic educational conditions by moving beyond isolated item-level prediction and towards virtual exam-based assessment, student-level aggregation, and feedback quality evaluation.
VEX is built from real student responses collected in a live university database systems course. The benchmark supports research on joint grading and feedback generation, with a focus on deployment-relevant evaluation: ranking consistency, pass/fail decisions, grade-boundary agreement, and pedagogical feedback usefulness.
| Property | Value |
|---|---|
| Total student responses | ~31,000 |
| Unique questions | 239 |
| Students | 173 |
| Gold-labeled responses | 3,222 |
| Language | German |
| Domain | University database systems course |
| Score scale | 0, 0.25, 0.5, 0.75, 1 |
| Split strategy | Question-disjoint |
| Evaluation setting | Item-level and virtual-exam level |
The gold subset contains expert-annotated ordinal grades and is used as the benchmark evaluation standard. The remaining data can be used for optional training, representation learning, or silver-label experiments.
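As a concrete illustration of the question-disjoint split strategy, the sketch below partitions responses so that no question appears in both the training and evaluation portions. This is an informal sketch, not the official splitting code: the field name `question_id` is hypothetical, and the actual splits ship with the dataset repositories.

```python
# Minimal sketch of a question-disjoint split: held-out questions never
# appear in training data. The field name (question_id) is hypothetical;
# see the dataset repositories for the actual schema and official splits.
import random

def question_disjoint_split(responses, test_fraction=0.2, seed=0):
    """Partition responses so that train and test share no questions."""
    question_ids = sorted({r["question_id"] for r in responses})
    rng = random.Random(seed)
    rng.shuffle(question_ids)
    n_test = max(1, int(len(question_ids) * test_fraction))
    test_qs = set(question_ids[:n_test])
    train = [r for r in responses if r["question_id"] not in test_qs]
    test = [r for r in responses if r["question_id"] in test_qs]
    return train, test
```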
Most ASAG benchmarks evaluate models as isolated item-level predictors: one answer in, one score out.
However, real educational decisions are rarely based on a single response. They depend on cumulative performance across multiple questions, such as:

- pass/fail outcomes,
- final grades and grade boundaries,
- a student's ranking relative to the cohort.
VEX introduces virtual exams to model this setting. A virtual exam is created by sampling multiple held-out questions and aggregating the answers of students who responded to all selected questions. This makes it possible to evaluate whether a grading system preserves student-level outcomes, not just individual answer scores.
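A minimal sketch of this construction is shown below: sample held-out questions, keep only the students who answered all of them, and aggregate their scores. The field names (`student_id`, `question_id`, `score`) are hypothetical; refer to the dataset repositories for the actual schema.

```python
# Minimal sketch of virtual-exam construction: sample held-out questions,
# keep only students who answered all of them, and aggregate their scores.
# Field names (student_id, question_id, score) are hypothetical.
import random
from collections import defaultdict

def build_virtual_exam(responses, held_out_questions, exam_size=5, seed=0):
    rng = random.Random(seed)
    exam_qs = set(rng.sample(sorted(held_out_questions), exam_size))

    # Collect each student's scores on the sampled questions.
    per_student = defaultdict(dict)
    for r in responses:
        if r["question_id"] in exam_qs:
            per_student[r["student_id"]][r["question_id"]] = r["score"]

    # Keep only students who answered every sampled question, then aggregate.
    return {
        student: sum(scores.values())
        for student, scores in per_student.items()
        if len(scores) == len(exam_qs)
    }
```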
Standard item-level ASAG metrics are reported for comparability with prior work. These metrics measure local scoring quality on individual responses.
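The exact item-level metrics are defined by the released evaluation code; as an illustration, the sketch below computes two metrics commonly reported in ASAG work, accuracy and quadratic weighted kappa (QWK), after mapping the five score levels to integer ranks.

```python
# Minimal sketch: item-level metrics commonly used in ASAG work.
# Which metrics VEX officially reports is defined by its evaluation code;
# accuracy and quadratic weighted kappa (QWK) are shown here as examples.
from sklearn.metrics import accuracy_score, cohen_kappa_score

gold = [0.0, 0.25, 0.5, 0.75, 1.0, 1.0]   # expert scores (toy data)
pred = [0.0, 0.5, 0.5, 0.75, 0.75, 1.0]   # system scores (toy data)

# Map the 5-level score scale {0, 0.25, ..., 1} to integer ranks 0..4
# so that the quadratic kappa weights reflect ordinal distance.
gold_ranks = [round(s * 4) for s in gold]
pred_ranks = [round(s * 4) for s in pred]

print("accuracy:", accuracy_score(gold_ranks, pred_ranks))
print("QWK:", cohen_kappa_score(gold_ranks, pred_ranks,
                                labels=[0, 1, 2, 3, 4],
                                weights="quadratic"))
```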
Virtual exams evaluate cumulative grading behaviour using metrics such as:

- student ranking consistency across an exam,
- pass/fail decision agreement,
- grade-boundary agreement between predicted and gold outcomes.

VEX supports both absolute and distribution-based grading schemes, as sketched below.
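The sketch below illustrates exam-level evaluation on aggregated per-student totals. The 50% pass threshold and the median-based curve are illustrative assumptions, not VEX's official grading schemes.

```python
# Minimal sketch of exam-level metrics, assuming per-student exam totals.
# The 50% threshold and the median-based curve are illustrative assumptions.
import statistics
from scipy.stats import spearmanr

# Per-student totals over one virtual exam (toy data, max score = 6.0).
gold_totals = {"s1": 5.0, "s2": 2.5, "s3": 4.25, "s4": 1.75}
pred_totals = {"s1": 4.75, "s2": 3.0, "s3": 4.5, "s4": 2.0}

students = sorted(gold_totals)
gold = [gold_totals[s] for s in students]
pred = [pred_totals[s] for s in students]

# Ranking consistency: Spearman correlation of the exam totals.
rho, _ = spearmanr(gold, pred)
print("ranking consistency (Spearman rho):", round(rho, 3))

# Absolute scheme: pass/fail agreement at a fixed threshold.
threshold = 0.5 * 6.0
agree = sum((g >= threshold) == (p >= threshold) for g, p in zip(gold, pred))
print("pass/fail agreement:", agree / len(students))

# Distribution-based scheme (illustrative): pass relative to the cohort,
# here everyone at or above the median exam total.
gold_pass = {s for s, g in zip(students, gold) if g >= statistics.median(gold)}
pred_pass = {s for s, p in zip(students, pred) if p >= statistics.median(pred)}
both_pass = gold_pass & pred_pass
both_fail = set(students) - (gold_pass | pred_pass)
print("median-curve agreement:", (len(both_pass) + len(both_fail)) / len(students))
```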
Generated feedback is evaluated along pedagogical dimensions, which allows VEX to assess systems that produce both grades and natural-language feedback.
VEX is intended for research on:

- automated short answer grading,
- joint grading and feedback generation,
- deployment-relevant, student-level evaluation of grading systems.
The benchmark is especially useful for studying cases where item-level metrics alone are insufficient to judge whether a system is safe and useful for educational deployment.
This Hugging Face organization hosts the VEX benchmark resources. Typical resources include:

- dataset releases,
- evaluation code,
- documentation and supporting artifacts.
Please refer to the individual repository README files for exact file descriptions and usage instructions.
If you use VEX, please cite the corresponding paper once available.
```bibtex
@misc{vex2026,
  title  = {VEX: A Virtual Exam Benchmark for Automated Short Answer Grading},
  author = {TBD},
  year   = {2026},
  note   = {Dataset and benchmark for longitudinal ASAG evaluation}
}
```