Section A: Project Information
In recent years, humanities education has drawn increasing national attention, with a strong focus on improving students’ critical thinking, writing, and communication skills. Automated Essay Scoring (AES) has emerged as a key tool to support large-scale writing instruction and assessment. However, most existing AES systems rely on handcrafted features and provide only overall scores, lacking the depth needed for educational feedback. To address this, we introduce EssayGrader, a fine-grained, multimodal benchmark designed to evaluate the scoring ability of Multimodal Large Language Models (MLLMs) in realistic educational settings. EssayGrader aims to provide a standardized and interpretable evaluation framework that aligns with classroom goals and promotes fairness in AI-powered assessment. Its main innovations include: (1) fine-grained trait-level evaluation across ten writing dimensions such as coherence, argument clarity, and grammatical diversity; (2) support for multimodal inputs, where each essay is paired with both a text and an image prompt, reflecting real-world assessment formats; and (3) a modular, open-source design that allows researchers to extend the benchmark with new models, prompts, and traits. EssayGrader not only enables deeper analysis of MLLM performance, but also helps guide the development of more accurate and equitable AES systems.
Section B: Participant Information
| Title | First Name | Last Name | Organisation/Institution | Faculty/Department/Unit | Email | Phone Number | Contact Person / Team Leader |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Prof. | Xuming | Hu | The Hong Kong University of Science and Technology (Guangzhou) | AI Thrust, Info Hub | sujiamin360@gmail.com | 18041163989 | |
| Mr. | Yibo | Yan | The Hong Kong University of Science and Technology (Guangzhou) | AI Thrust, Info Hub | sujiamin360@gmail.com | 18054231221 | |
| Mr. | Zhuoran | Gao | The Hong Kong University of Science and Technology (Guangzhou) | AI Thrust, Info Hub | 18091000007@163.com | 18091000007 | |
| Ms. | Jiamin | Su | The Hong Kong University of Science and Technology (Guangzhou) | AI Thrust, Info Hub | sujiamin360@gmail.com | 15029609223 | |
Section C: Project Details
Our idea originated from observing critical limitations of traditional AES. First, existing AES systems rely heavily on handcrafted linguistic features, limiting their adaptability across diverse writing tasks. Second, traditional AES struggles to accurately evaluate nuanced traits, such as coherence and argumentative persuasiveness, that are essential for meaningful feedback. Third, current AES methods inadequately handle multimodal information, despite the growing integration of text and visuals in educational contexts. Inspired by recent advances in MLLMs, we identified their potential to overcome these limitations through their inherent ability to understand complex textual and visual contexts. This motivated us to develop EssayGrader, a multimodal benchmark tailored for comprehensive essay assessment.
Our hypothesis is that MLLMs can significantly enhance AES performance by inherently capturing rich semantic patterns and multimodal context. Specifically, MLLMs excel at interpreting detailed linguistic traits at the lexical, sentence, and discourse levels without manual feature engineering. Moreover, their proven success in multimodal reasoning tasks provides strong evidence of their potential effectiveness in AES. Preliminary experiments comparing various MLLMs against human scorers support this hypothesis, showing promising improvements in accuracy and reliability. Thus, we are confident that EssayGrader will advance educational assessment by enabling more precise, nuanced, and contextually aware automated scoring.
To implement EssayGrader, we design a comprehensive benchmark that evaluates the trait-level scoring capabilities of MLLMs in AES. The benchmark includes over 1,000 multimodal essays, each paired with an image prompt, a text prompt, and ground-truth scores across ten fine-grained traits. We provide standardized prompts and expert-designed rubrics to guide models in performing consistent trait-level scoring. The project relies on three essential resources: high-quality data comprising essay images, textual prompts, and authentic student essays; expert-designed rubrics with reliable human-annotated scores; and a collection of representative MLLMs for evaluation. To assess broader demand, we plan to engage both the educational technology and AI research communities through an open-source release and community-based comparisons.
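To make the data layout concrete, below is a minimal sketch of how a single benchmark entry could be represented in code. The field names, the integer score scale, and the trait identifiers beyond the three named in this proposal are illustrative assumptions, not the released format.

```python
# Hypothetical record layout for one EssayGrader entry; field names and the
# integer score scale are illustrative assumptions, not the released format.
from dataclasses import dataclass

# Three of the ten traits are named in this proposal; the remaining seven
# are defined by the expert rubrics and omitted here.
KNOWN_TRAITS = ["coherence", "argument_clarity", "grammatical_diversity"]

@dataclass
class EssayEntry:
    essay_id: str
    text_prompt: str               # textual writing prompt given to the student
    image_prompt_path: str         # path to the paired image prompt
    essay_text: str                # authentic student essay
    trait_scores: dict[str, int]   # human ground-truth score per trait

# Example instance with placeholder content.
entry = EssayEntry(
    essay_id="demo-0001",
    text_prompt="Describe the scene in the picture and state your view.",
    image_prompt_path="prompts/demo-0001.png",
    essay_text="(student essay text)",
    trait_scores={trait: 4 for trait in KNOWN_TRAITS},
)
```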
The core functionality of EssayGrader is to offer a standardized, multi-level evaluation protocol that quantitatively measures how well MLLMs align with human scores at the lexical, sentence, and discourse levels. It enables consistent and fair comparison across different models and input types (e.g., text-only vs. multimodal) using Quadratic Weighted Kappa (QWK), the most widely adopted metric in AES research for measuring scoring agreement. By focusing on trait-specific performance, EssayGrader reveals model strengths and weaknesses in a reproducible and transparent manner, facilitating future research on trustworthy, interpretable, and high-performing AES systems.
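As a minimal sketch of how trait-level agreement can be computed, the snippet below uses scikit-learn's cohen_kappa_score with quadratic weights. The 1-5 rating scale and the example scores are hypothetical, and using scikit-learn is an implementation choice rather than a requirement of the benchmark.

```python
# Quadratic Weighted Kappa (QWK) between human and model trait scores,
# computed with scikit-learn's cohen_kappa_score.
from sklearn.metrics import cohen_kappa_score

def trait_qwk(human_scores, model_scores):
    """QWK for one trait; inputs are parallel lists of integer ratings."""
    return cohen_kappa_score(human_scores, model_scores, weights="quadratic")

# Hypothetical coherence ratings on a 1-5 scale for six essays.
human = [4, 3, 5, 2, 4, 3]
model = [4, 3, 4, 2, 5, 3]
print(f"coherence QWK = {trait_qwk(human, model):.3f}")
```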
Stream 1
EssayGrader presents an innovative solution to a key gap in current AES research by introducing a new benchmark focused on evaluating MLLMs across fine-grained traits. Our work centers on standardized evaluation—measuring how well existing MLLMs perform in scoring real student essays based on multimodal prompts. This benchmark framework addresses the lack of research attention on MLLMs’ scoring capabilities and their alignment with human rubrics at the lexical, sentence, and discourse levels. By enabling trait-specific and modality-sensitive evaluation, EssayGrader helps the research community identify nuanced strengths and weaknesses across models, promoting more transparent, interpretable, and rigorous assessment in educational AI.
The project further demonstrates creativity through its methodological design: integrating multimodal prompts, expert-defined rubrics, and trait-specific performance analysis to achieve a highly efficient and comprehensive evaluation process. By applying the benchmark to real-world student essays, EssayGrader delivers high-resolution insights that inform the development of next-generation AES tools and foster targeted model improvements. Its innovation lies not only in technical implementation, but also in its integrated vision that connects educational needs, model capabilities, and evaluation science in a coherent and scalable framework.
EssayGrader is designed with scalability at its core, enabling consistent evaluation of a wide range of MLLMs under growing model diversity and use cases. To support expansion, we adopt a modular design: essays, rubrics, and evaluation scripts are decoupled, allowing easy addition of new prompts, traits, or model outputs. Our benchmark infrastructure supports batch evaluation and parallel processing, helping to reduce computation time when benchmarking hundreds of models or configurations. Potential bottlenecks, such as model API rate limits or computational overhead, will be mitigated by caching model outputs, distributing tasks across GPUs or cloud resources, and releasing subsets of the benchmark for lightweight use cases.
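The output-caching mitigation mentioned above can be sketched as follows; the cache layout, key construction, and the call_model hook are hypothetical illustrations of the strategy, not the benchmark's actual implementation.

```python
# Sketch of response caching: each model output is keyed by
# (model, essay, prompt) and persisted, so repeated benchmark runs reuse
# cached responses instead of re-issuing rate-limited API calls.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("cache")  # hypothetical cache location
CACHE_DIR.mkdir(exist_ok=True)

def cached_response(model_name: str, essay_id: str, prompt: str, call_model):
    """Return a cached response if present; otherwise call the model once."""
    key = hashlib.sha256(f"{model_name}|{essay_id}|{prompt}".encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    response = call_model(prompt)  # user-supplied function that queries the MLLM
    cache_file.write_text(json.dumps(response))
    return response
```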
For long-term sustainability, our benchmark promotes reusability and community contribution. It will be maintained as an open-source project with clear documentation, version control, and transparent evaluation protocols. We aim to engage educators, researchers, and developers in expanding the benchmark with new traits, essay types, or languages. To support environmental sustainability, we prioritize efficient inference strategies and encourage selective model evaluation rather than exhaustive testing. By remaining adaptable to new educational needs and model capabilities, EssayGrader will continue to serve as a reliable foundation for future AES innovation.
EssayGrader aims to improve fairness, transparency, and inclusiveness in AES, which plays an increasingly important role in writing instruction at scale. In many classrooms, teachers lack the time and resources to provide consistent and detailed feedback to every student. Existing AES systems often use simplistic scoring methods and may overlook the needs of students from diverse backgrounds. To address this, EssayGrader introduces a benchmark for evaluating the scoring ability of MLLMs across ten fine-grained writing traits, such as coherence, argument clarity, and grammatical diversity. Our team collaborates closely with leading education companies in China and has access to over 100,000 real student essays from a wide range of learning environments. This ensures that the benchmark is grounded in real classroom data. Expert educators also help design rubrics and annotate essays, aligning the benchmark with real teaching goals. Beyond technical evaluation, EssayGrader supports the long-term goal of building more equitable and responsible AES systems. We will measure social impact through score consistency across student groups, adoption in research and industry, and feedback from teachers. By supporting better evaluation tools, EssayGrader helps make writing assessment more fair, efficient, and accessible—contributing to higher-quality education and greater opportunity for all students.
Personal Information Collection Statement (PICS):
1. The personal data collected in this form will be used for activity organization, record keeping, and reporting only. The collected personal data will be purged within 6 years after the event.
2. Please note that it is obligatory to provide the personal data required.
3. Your personal data collected will be kept by the LTTC and will not be transferred to outside parties.
4. You have the right to request access to and correction of information held by us about you. If you wish to access or correct your personal data, please contact our staff at lttc@eduhk.hk.
5. The University’s Privacy Policy Statement can be accessed at https://www.eduhk.hk/en/privacy-policy.
- I have read and agree to the competition rules and privacy policy.