Automatic Short-Answer Grading in Sustainability Education: AI-Human Agreement


Date

2026


Publisher

Wiley

Green Open Access

No


Publicly Funded

No
Impulse: Average
Influence: Average
Popularity: Average


Abstract

Background: Sustainability education emphasises critical thinking and interdisciplinary understanding, making the assessment of students' learning outcomes complex. While Large Language Models (LLMs) have shown promise in educational assessment, their reliability in domains that require contextual reasoning, such as sustainability, remains unclear.

Objectives: This study evaluates the agreement between human raters and several LLMs (GPT-4o, Gemini 2.0 Flash, DeepSeek V3, LLaMA 3.3) in assessing short-answer responses from a university-level Sustainability course. It also investigates how this agreement varies across cognitive skill levels.

Methods: A total of 232 short-answer responses were evaluated using a rubric aligned with Bloom's Revised Taxonomy. Consensus scores from human raters were compared to LLM-generated scores using multiple statistical measures, including Quadratic Weighted Kappa (QWK), Intraclass Correlation Coefficient (ICC), Pearson correlation, and distributional overlap.

Results: Moderate agreement was found between LLMs and human raters on total scores (QWK: 0.585-0.640; r: 0.660-0.668; eta: 0.681-0.803). Inter-rater reliability among human raters was good to excellent (ICC: 0.667-0.800). Criterion-level agreement declined as cognitive complexity increased, with notably low agreement when evaluating higher-order skills.

Conclusions: Overall, LLM-human agreement was moderate on total scores but declined at higher cognitive levels, indicating that LLMs are suitable for basic comprehension checks while human oversight remains necessary for assessing complex reasoning.
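
For illustration only, the following is a minimal Python sketch, not the authors' actual analysis pipeline, showing how two of the agreement measures named above (Quadratic Weighted Kappa and Pearson correlation) can be computed for a pair of score vectors with scikit-learn and SciPy. The score arrays are hypothetical placeholders, not data from the study.

# Minimal sketch, assuming integer rubric scores; the values below are
# hypothetical placeholders, not the study's data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human_consensus = np.array([4, 3, 5, 2, 4, 3, 5, 1, 2, 4])  # hypothetical human consensus scores
llm_scores      = np.array([4, 3, 4, 2, 5, 3, 5, 2, 2, 3])  # hypothetical LLM-generated scores

# Quadratic Weighted Kappa: chance-corrected agreement that penalises
# larger score discrepancies more heavily.
qwk = cohen_kappa_score(human_consensus, llm_scores, weights="quadratic")

# Pearson correlation: linear association between the two score sets.
r, p = pearsonr(human_consensus, llm_scores)

print(f"QWK = {qwk:.3f}, Pearson r = {r:.3f} (p = {p:.3f})")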

Keywords

Rubric-Based Evaluation, Sustainability Education, Automated Assessment, Large Language Model, Scoring Agreement, Educational AI


Source

Journal of Computer Assisted Learning

Volume

42

Issue

1

PlumX Metrics

Citations (Scopus): 0
Captures (Mendeley Readers): 28

OpenAlex FWCI
9.9024
