VEX — Virtual Exam Benchmark

VEX is a longitudinal benchmark for Automated Short Answer Grading (ASAG).

It is designed to evaluate grading systems under realistic educational conditions, moving beyond isolated item-level prediction towards virtual-exam-based assessment, student-level aggregation, and feedback quality evaluation.

VEX is built from real student responses collected in a live university database systems course. The benchmark supports research on joint grading and feedback generation, with a focus on deployment-relevant evaluation: ranking consistency, pass/fail decisions, grade-boundary agreement, and pedagogical feedback usefulness.


What VEX Provides


Dataset Overview

| Property | Value |
|---|---|
| Total student responses | ~31k |
| Unique questions | 239 |
| Students | 173 |
| Gold-labeled responses | 3,222 |
| Language | German |
| Domain | University database systems course |
| Score scale | 0, 0.25, 0.5, 0.75, 1 |
| Split strategy | Question-disjoint |
| Evaluation setting | Item-level and virtual-exam level |

The gold subset contains expert-annotated ordinal grades and is used as the benchmark evaluation standard. The remaining data can be used for optional training, representation learning, or silver-label experiments.
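For orientation, here is a minimal loading sketch. The repository id `vex-benchmark/vex` and the column names `question`, `answer`, and `score` are illustrative placeholders; consult the dataset card of the actual release for the real schema and split names.

```python
from datasets import load_dataset

# Placeholder repository id -- see the dataset card of the actual
# release for the real repository, schema, and split names.
ds = load_dataset("vex-benchmark/vex", split="test")

for row in ds.select(range(3)):
    # Each row pairs a student answer with an expert ordinal grade
    # on the {0, 0.25, 0.5, 0.75, 1} scale.
    print(row["question"], row["answer"], row["score"])
```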


Why Virtual Exams?

Most ASAG benchmarks evaluate models as isolated item-level predictors: one answer in, one score out.

However, real educational decisions are rarely based on a single response; they depend on cumulative performance across multiple questions. Examples include:

- pass/fail outcomes,
- final grades,
- placement relative to grade boundaries.

VEX introduces virtual exams to model this setting. A virtual exam is created by sampling multiple held-out questions and aggregating the answers of students who responded to all selected questions. This makes it possible to evaluate whether a grading system preserves student-level outcomes, not just individual answer scores.
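The construction is straightforward to sketch in code. The snippet below assumes a flat table of `(student_id, question_id, score)` records and a fixed exam size; the official sampling protocol and aggregation rule may differ.

```python
import random
from collections import defaultdict

def build_virtual_exam(records, num_questions=10, seed=0):
    """Sample held-out questions and aggregate the scores of students
    who answered all of them. `records` is an iterable of
    (student_id, question_id, score) tuples -- an assumed schema."""
    rng = random.Random(seed)
    questions = sorted({q for _, q, _ in records})
    exam_questions = set(rng.sample(questions, num_questions))

    # Collect each student's scores on the sampled questions.
    per_student = defaultdict(dict)
    for student, question, score in records:
        if question in exam_questions:
            per_student[student][question] = score

    # Keep only students who answered every sampled question, then
    # aggregate (here: the sum of item scores as the exam total).
    return {
        student: sum(scores.values())
        for student, scores in per_student.items()
        if len(scores) == len(exam_questions)
    }
```

Requiring complete coverage of the sampled questions keeps exam totals comparable across students.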


Evaluation Dimensions

1. Item-Level Grading

Standard ASAG metrics are reported for comparability with prior work, typically including:

- accuracy,
- macro-F1,
- quadratic weighted kappa (QWK).

These metrics measure local scoring quality on individual responses.
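As an illustration, the sketch below maps the five ordinal grade levels to integer labels and computes two common ASAG metrics with scikit-learn; the specific metric set shown here is an assumption, not the benchmark's official definition.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Map the ordinal score scale to integer labels 0..4 so that
# quadratic weighting reflects the distance between grade levels.
SCALE = [0.0, 0.25, 0.5, 0.75, 1.0]
TO_LABEL = {s: i for i, s in enumerate(SCALE)}

def item_level_metrics(gold_scores, predicted_scores):
    gold = [TO_LABEL[s] for s in gold_scores]
    pred = [TO_LABEL[s] for s in predicted_scores]
    return {
        "accuracy": accuracy_score(gold, pred),
        "qwk": cohen_kappa_score(gold, pred, weights="quadratic"),
    }

print(item_level_metrics([0.0, 0.5, 1.0, 0.75], [0.0, 0.25, 1.0, 0.75]))
```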

2. Exam-Level Assessment

Virtual exams evaluate cumulative grading behaviour using metrics such as:

- ranking consistency between predicted and gold exam outcomes,
- pass/fail decision agreement,
- grade-boundary agreement.

VEX supports both absolute and distribution-based grading schemes.
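As a minimal sketch of the exam-level comparison, the function below takes gold and predicted exam totals for the same students (e.g., as produced by the virtual-exam construction above). Spearman correlation for ranking consistency and a fixed absolute pass threshold are illustrative choices, not the benchmark's fixed protocol.

```python
from scipy.stats import spearmanr

def exam_level_metrics(gold_totals, pred_totals, pass_threshold):
    """Compare gold vs. predicted exam totals. Both arguments are
    dicts mapping student_id -> exam total for the same students."""
    students = sorted(gold_totals)
    gold = [gold_totals[s] for s in students]
    pred = [pred_totals[s] for s in students]

    # Ranking consistency: do predicted totals order students correctly?
    rank_corr = spearmanr(gold, pred).correlation

    # Pass/fail agreement under an absolute grading scheme.
    agree = sum(
        (g >= pass_threshold) == (p >= pass_threshold)
        for g, p in zip(gold, pred)
    ) / len(students)

    return {"ranking_consistency": rank_corr, "pass_fail_agreement": agree}
```

Under a distribution-based scheme, the boundary would instead be derived from the observed score distribution, for example a fixed percentile.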

3. Feedback Utility

Generated feedback is evaluated along pedagogical dimensions, focusing on its usefulness to the learner.

This allows VEX to evaluate systems that produce both grades and natural-language feedback.


Intended Use

VEX is intended for research on:

- automated short answer grading (ASAG),
- joint grading and feedback generation,
- student-level and virtual-exam-level evaluation,
- deployment-relevant assessment of grading systems.

The benchmark is especially useful for studying cases where item-level metrics alone are insufficient to judge whether a system is safe and useful for educational deployment.


Organization Contents

This Hugging Face organization hosts the VEX benchmark resources. Typical resources include:

- dataset releases (the gold-labeled evaluation subset and additional responses for optional training),
- evaluation code,
- documentation and supporting artifacts.

Please refer to the individual repository README files for exact file descriptions and usage instructions.


Citation

If you use VEX, please cite the corresponding paper once available.

```bibtex
@misc{vex2026,
  title        = {VEX: A Virtual Exam Benchmark for Automated Short Answer Grading},
  author       = {TBD},
  year         = {2026},
  note         = {Dataset and benchmark for longitudinal ASAG evaluation}
}
```