Automatic Short-Answer Grading in Sustainability Education: AI-Human Agreement


Date

2026


Publisher

Wiley

Green Open Access

No


Publicly Funded

No
Impulse: Average
Influence: Average
Popularity: Average


Abstract

Background: Sustainability education emphasises critical thinking and interdisciplinary understanding, making the assessment of students' learning outcomes complex. While Large Language Models (LLMs) have shown promise in educational assessment, their reliability in domains that require contextual reasoning, such as sustainability, remains unclear.

Objectives: This study evaluates the agreement between human raters and several LLMs (GPT-4o, Gemini 2.0 Flash, DeepSeek V3, LLaMA 3.3) in assessing short-answer responses from a university-level Sustainability course. It also investigates how this agreement varies across cognitive skill levels.

Methods: A total of 232 short-answer responses were evaluated using a rubric aligned with Bloom's Revised Taxonomy. Consensus scores from human raters were compared to LLM-generated scores using multiple statistical measures, including Quadratic Weighted Kappa (QWK), Intraclass Correlation Coefficient (ICC), Pearson correlation, and distributional overlap.

Results: Moderate agreement was found between LLMs and human raters on total scores (QWK: 0.585-0.640; r: 0.660-0.668; eta: 0.681-0.803). Inter-rater reliability among human raters was good to excellent (ICC: 0.667-0.800). Criterion-level agreement declined as cognitive complexity increased, with notably low agreement when evaluating higher-order skills.

Conclusions: Overall, LLM-human agreement was moderate on total scores but declined at higher cognitive levels, indicating that LLMs are suitable for basic comprehension checks while human oversight remains necessary for assessing complex reasoning.
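
For illustration only, the following is a minimal Python sketch, not the authors' actual analysis pipeline, showing how two of the agreement measures named above (Quadratic Weighted Kappa and Pearson correlation) can be computed for a pair of score vectors with scikit-learn and SciPy. The score arrays are hypothetical placeholders, not data from the study.

# Minimal sketch, assuming integer rubric scores; the values below are
# hypothetical placeholders, not the study's data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human_consensus = np.array([4, 3, 5, 2, 4, 3, 5, 1, 2, 4])  # hypothetical human consensus scores
llm_scores      = np.array([4, 3, 4, 2, 5, 3, 5, 2, 2, 3])  # hypothetical LLM-generated scores

# Quadratic Weighted Kappa: chance-corrected agreement that penalises
# larger score discrepancies more heavily.
qwk = cohen_kappa_score(human_consensus, llm_scores, weights="quadratic")

# Pearson correlation: linear association between the two score sets.
r, p = pearsonr(human_consensus, llm_scores)

print(f"QWK = {qwk:.3f}, Pearson r = {r:.3f} (p = {p:.3f})")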

Keywords

Rubric-Based Evaluation, Sustainability Education, Automated Assessment, Large Language Model, Scoring Agreement, Educational AI


Source

Journal of Computer Assisted Learning

Volume

42

Issue

1

PlumX Metrics

Citations (Scopus): 0
Captures (Mendeley Readers): 28

OpenAlex FWCI
9.9024
