Title: Enhancing academic assessment: Leveraging AI for grading scholarly student papers
Abstract:
Background: Interest in AI-assisted grading has grown as nursing faculty face increasing workload and ongoing faculty shortages. However, evidence remains limited regarding whether large language model (LLM) platforms can evaluate scholarly nursing writing with sufficient reliability for high-stakes assessment.
Objectives: To examine the extent to which LLM-based AI platforms generate rubric-aligned scores for undergraduate nursing papers that are consistent with faculty scoring.
Design: Observational method-comparison study.
Methods: Twenty-six de-identified Bachelor of Science in Nursing student papers were graded using a standardized rubric by one experienced faculty grader and three AI platforms: ChatGPT, Gemini, and Grammarly. Analyses included descriptive statistics, Friedman rank-sum testing, Bonferroni-adjusted Wilcoxon signed-rank tests, Spearman rank-order correlations, and Bland–Altman agreement analysis.
Results: Mean scores were broadly similar across graders; however, no AI platform demonstrated significant rank-order association with faculty scoring. Bland–Altman analyses showed wide limits of agreement for all platforms, indicating substantial variability at the individual-paper level. Additional operational concerns included arithmetic errors, inconsistent scoring behavior, and greater leniency toward lower-performing papers.
Conclusions: Although AI platforms may support grading efficiency, they did not demonstrate sufficiently reliable agreement with faculty judgment for independent use in high-stakes nursing assessment. AI-assisted grading in nursing education should therefore remain cautious, faculty-supervised, and subject to further validation.
Key words: Nursing education; nursing educator workloads; AI-assisted grading; large language models (LLMs); faculty scoring; high-stakes assessment; workforce crisis; pipeline.
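
Illustrative note on the agreement analysis: the following minimal sketch shows how Bland–Altman bias and 95% limits of agreement are conventionally computed for paired scores, and why similar mean scores can coexist with wide limits of agreement at the individual-paper level. It is not the study's analysis code. The function name `bland_altman` and the score lists are hypothetical, chosen only for illustration, and the 1.96 multiplier assumes the conventional 95% limits.

```python
from statistics import mean, stdev


def bland_altman(faculty, ai):
    """Return (bias, lower_limit, upper_limit) for paired scores."""
    if len(faculty) != len(ai) or len(faculty) < 2:
        raise ValueError("need two equal-length score lists with at least 2 papers")
    # Paired differences: AI score minus faculty score for each paper.
    diffs = [a - f for f, a in zip(faculty, ai)]
    bias = mean(diffs)   # mean difference (systematic over- or under-scoring)
    sd = stdev(diffs)    # sample standard deviation of the differences (n - 1)
    # Conventional 95% limits of agreement: bias +/- 1.96 * SD.
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical scores for illustration only; NOT data from the study.
faculty_scores = [80, 85, 90, 70, 95]
ai_scores = [84, 83, 88, 80, 90]

bias, lower, upper = bland_altman(faculty_scores, ai_scores)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

With these made-up scores, the mean difference is only 1 point, yet the limits of agreement span roughly -10.76 to 12.76 points, so an individual paper could be scored far from the faculty grade even though the averages look alike.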

