Open Category
Entry ID
348
Participant Type
Team
Expected Stream
Stream 1: Identifying an educational problem and proposing a solution.

Section A: Project Information

Project Title:
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
Project Description (maximum 300 words):

ErrorRadar is a benchmark designed to evaluate the complex mathematical reasoning capabilities of Multimodal Large Language Models (MLLMs). Unlike traditional math benchmarks that focus solely on solution accuracy, ErrorRadar delves deeper into the cognitive processes involved in problem-solving. It features a dataset of 2,500 real-world, multimodal K-12 math problems, each accompanied by multiple student solutions containing various types of errors. These errors are meticulously categorized (e.g., visual perception, reasoning, knowledge, calculation, misinterpretation), enabling a granular analysis of an MLLM's ability to identify not only whether an error exists, but also where and why it occurred. This level of detail allows for a more nuanced evaluation of MLLMs and can guide future development toward models that truly understand mathematical concepts.
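
For illustration, a single benchmark record might be organized along the following lines. This is only a sketch: every field name here (problem_id, image, student_solution, error_step, error_category) is an assumption for exposition, not the benchmark's actual schema.

    # A hypothetical ErrorRadar-style record; all field names are
    # illustrative assumptions, not the benchmark's published schema.
    example_record = {
        "problem_id": "k12-0001",
        "image": "figures/k12-0001.png",  # multimodal context: diagram or handwritten work
        "question": "Find the area of the shaded region.",
        "student_solution": [
            "Step 1: The radius of the circle is 4 cm.",
            "Step 2: Area = 2 * pi * r = 8 * pi cm^2.",  # confuses circumference with area
            "Step 3: Shaded area = (8 * pi - 16) cm^2.",
        ],
        "has_error": True,
        "error_step": 2,                # where the error occurred
        "error_category": "knowledge",  # why: wrong formula recalled
    }

A record like this supports all three questions the description raises: whether an error exists, where it occurs, and why.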

File Upload

Section B: Participant Information

Personal Information (Team Member)
Title | First Name | Last Name | Organisation/Institution | Faculty/Department/Unit | Email | Phone Number | Contact Person / Team Leader
Dr. | Shen | Wang | Squirrel Ai Learning | Squirrel Ai Learning | shenwang@squirrelai.com | +1 7347306787 |
Mr. | Yibo | YAN | Hong Kong University of Science and Technology (Guangzhou) | AI Thrust | yanyibo70@gmail.com | 18054231221 |
Dr. | Qingsong | Wen | Squirrel Ai Learning | Squirrel Ai Learning | qingsongedu@gmail.com | +1 (425) 520-1766 |

Section C: Project Details

Project Details
Please answer the questions from the perspectives below regarding your project.
1. Problem Identification and Relevance in Education (Maximum 300 words)

Current methods for evaluating MLLMs on math problems often fall short of capturing the nuances of student reasoning. Simply checking for correct answers ignores the valuable information contained within incorrect solutions. ErrorRadar addresses this gap by providing a rich dataset of student errors, enabling researchers to assess an MLLM’s ability to diagnose and understand these errors. This is crucial for developing AI systems that can provide effective and personalized feedback in educational settings. By focusing on the “process” of problem-solving, ErrorRadar promotes the development of MLLMs that can support deeper learning and improved problem-solving skills.

2a. Feasibility and Functionality (for Streams 1&2 only) (Maximum 300 words)

ErrorRadar’s dataset is constructed from real student responses to K-12 math problems, ensuring its relevance to real-world educational scenarios. The multimodal nature of the dataset, including handwritten solutions, typed text, and diagrams, reflects the diverse ways students engage with mathematical concepts. The benchmark will provide clear evaluation metrics, such as the accuracy of error detection, the precision of error localization, and the correctness of error categorization. This allows for a comprehensive assessment of MLLM performance on complex mathematical reasoning tasks.
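
As a rough sketch of how these three metrics could be computed, assume each model prediction and gold annotation is a small record with a binary error flag, an error-step index, and a category label; the record layout and function below are hypothetical, for illustration only.

    # Minimal sketch of the three metrics named above: error detection
    # accuracy, error-step localization accuracy, and error categorization
    # accuracy. The keys ('has_error', 'error_step', 'error_category') are
    # assumptions for illustration, not ErrorRadar's actual data format.
    def evaluate(preds, golds):
        n = len(golds)
        detection = sum(p["has_error"] == g["has_error"]
                        for p, g in zip(preds, golds)) / n
        # Localization and categorization are scored only on problems whose
        # gold annotation actually contains an error.
        errs = [(p, g) for p, g in zip(preds, golds) if g["has_error"]]
        m = max(len(errs), 1)  # guard against division by zero
        localization = sum(p["error_step"] == g["error_step"]
                           for p, g in errs) / m
        categorization = sum(p["error_category"] == g["error_category"]
                             for p, g in errs) / m
        return {"detection_acc": detection,
                "localization_acc": localization,
                "categorization_acc": categorization}

Under this sketch, a model that flags every error, points to the right step, and names the right category scores 1.0 on all three keys.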

2b. Technical Implementation and Performance (for Streams 3&4 only) (Maximum 300 words)

Not applicable; this submission is under Stream 1.

3. Innovation and Creativity (Maximum 300 words)

ErrorRadar introduces a novel approach to evaluating MLLMs in the context of mathematics education. Its focus on error analysis and categorization provides a level of granularity not found in existing benchmarks. This innovative approach pushes the boundaries of MLLM evaluation beyond simple answer checking, encouraging the development of models that can truly understand and diagnose student difficulties. The multimodal nature of the dataset further enhances the benchmark’s relevance and challenges MLLMs to handle the complexities of real-world mathematical problem-solving.

4. Scalability and Sustainability (Maximum 300 words)

ErrorRadar is designed to be a scalable and sustainable benchmark. The dataset can be easily expanded with additional problems and error annotations. The evaluation metrics are clearly defined and can be readily applied to new MLLMs. The benchmark’s focus on fundamental cognitive processes in mathematical reasoning ensures its long-term relevance as MLLM technology continues to evolve. By providing a robust and adaptable framework for evaluating MLLMs, ErrorRadar contributes to the ongoing development of AI systems that can effectively support and enhance math education.

5. Social Impact and Responsibility (Maximum 300 words)

ErrorRadar addresses social issues by enhancing math education through error detection, promoting equity and inclusion. It empowers educators with insights into student errors, enabling tailored interventions that support diverse learning needs. To measure its social impact, we will track improvements in student performance, gather teacher feedback, monitor equity metrics, and assess community engagement. We ensure responsiveness through a continuous feedback loop, regular updates, community collaboration, and pilot programs. By doing so, ErrorRadar aims to create a more inclusive and effective educational environment, contributing to broader social goals.

Do you have additional materials to upload?
No
Personal Information Collection Statement (PICS):
1. The personal data collected in this form will be used for activity organisation, record keeping and reporting only. The collected personal data will be purged within 6 years after the event.
2. Please note that it is obligatory to provide the personal data required.
3. Your personal data collected will be kept by the LTTC and will not be transferred to outside parties.
4. You have the right to request access to and correction of information held by us about you. If you wish to access or correct your personal data, please contact our staff at lttc@eduhk.hk.
5. The University’s Privacy Policy Statement can be accessed at https://www.eduhk.hk/en/privacy-policy.
Agreement
  • I have read and agree to the competition rules and privacy policy.