Open Category
Entry ID
348
Participant Type
Team
Expected Stream
Stream 1: Identifying an educational problem and proposing a solution.

Section A: Project Information

Project Title:
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
Project Description (maximum 300 words):

ErrorRadar is a benchmark designed to evaluate the complex mathematical reasoning capabilities of Multimodal Large Language Models (MLLMs). Unlike traditional math benchmarks that focus solely on solution accuracy, ErrorRadar delves deeper into the cognitive processes involved in problem-solving. It features a dataset of 2,500 real-world, multimodal K-12 math problems, each accompanied by multiple student solutions containing various types of errors. These errors are meticulously categorized (e.g., visual perception, reasoning, knowledge, calculation, misinterpretation), enabling a granular analysis of an MLLM's ability to identify not only whether an error exists, but also where and why it occurred. This level of detail allows for a more nuanced evaluation of MLLMs and can guide future development toward models that truly understand mathematical concepts.
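
For illustration, a single benchmark record might be organized along the following lines. This is only a sketch: every field name here (problem_id, image, student_solution, error_step, error_category) is an assumption for exposition, not the benchmark's actual schema.

    # A hypothetical ErrorRadar-style record; all field names are
    # illustrative assumptions, not the benchmark's published schema.
    example_record = {
        "problem_id": "k12-0001",
        "image": "figures/k12-0001.png",  # multimodal context: diagram or handwritten work
        "question": "Find the area of the shaded region.",
        "student_solution": [
            "Step 1: The radius of the circle is 4 cm.",
            "Step 2: Area = 2 * pi * r = 8 * pi cm^2.",  # confuses circumference with area
            "Step 3: Shaded area = (8 * pi - 16) cm^2.",
        ],
        "has_error": True,
        "error_step": 2,                # where the error occurred
        "error_category": "knowledge",  # why: wrong formula recalled
    }

A record like this supports all three questions the description raises: whether an error exists, where it occurs, and why.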

File Upload

Section B: Participant Information

Personal Information (Team Member)
Title | First Name | Last Name | Organisation/Institution | Faculty/Department/Unit | Email | Phone Number | Contact Person / Team Leader
Dr. | Shen | Wang | Squirrel Ai Learning | Squirrel Ai Learning | shenwang@squirrelai.com | +1 7347306787 |
Mr. | Yibo | YAN | Hong Kong University of Science and Technology (Guangzhou) | AI Thrust | yanyibo70@gmail.com | 18054231221 |
Dr. | Qingsong | Wen | Squirrel Ai Learning | Squirrel Ai Learning | qingsongedu@gmail.com | +1 (425) 520-1766 |

Section C: Project Details

Project Details
Please answer the questions from the perspectives below regarding your project.
1. Problem Identification and Relevance in Education (Maximum 300 words)

Current methods for evaluating MLLMs on math problems often fall short of capturing the nuances of student reasoning. Simply checking for correct answers ignores the valuable information contained within incorrect solutions. ErrorRadar addresses this gap by providing a rich dataset of student errors, enabling researchers to assess an MLLM’s ability to diagnose and understand these errors. This is crucial for developing AI systems that can provide effective and personalized feedback in educational settings. By focusing on the “process” of problem-solving, ErrorRadar promotes the development of MLLMs that can support deeper learning and improved problem-solving skills.

2a. Feasibility and Functionality (for Streams 1&2 only) (Maximum 300 words)

ErrorRadar’s dataset is constructed from real student responses to K-12 math problems, ensuring its relevance to real-world educational scenarios. The multimodal nature of the dataset, including handwritten solutions, typed text, and diagrams, reflects the diverse ways students engage with mathematical concepts. The benchmark will provide clear evaluation metrics, such as the accuracy of error detection, the precision of error localization, and the correctness of error categorization. This allows for a comprehensive assessment of MLLM performance on complex mathematical reasoning tasks.
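
As a rough sketch of how these three metrics could be computed, assume each model prediction and gold annotation is a small record with a binary error flag, an error-step index, and a category label; the record layout and function below are hypothetical, for illustration only.

    # Minimal sketch of the three metrics named above: error detection
    # accuracy, error-step localization accuracy, and error categorization
    # accuracy. The keys ('has_error', 'error_step', 'error_category') are
    # assumptions for illustration, not ErrorRadar's actual data format.
    def evaluate(preds, golds):
        n = len(golds)
        detection = sum(p["has_error"] == g["has_error"]
                        for p, g in zip(preds, golds)) / n
        # Localization and categorization are scored only on problems whose
        # gold annotation actually contains an error.
        errs = [(p, g) for p, g in zip(preds, golds) if g["has_error"]]
        m = max(len(errs), 1)  # guard against division by zero
        localization = sum(p["error_step"] == g["error_step"]
                           for p, g in errs) / m
        categorization = sum(p["error_category"] == g["error_category"]
                             for p, g in errs) / m
        return {"detection_acc": detection,
                "localization_acc": localization,
                "categorization_acc": categorization}

Under this sketch, a model that flags every error, points to the right step, and names the right category scores 1.0 on all three keys.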

2b. Technical Implementation and Performance (for Streams 3&4 only) (Maximum 300 words)

Not applicable; this submission is under Stream 1.

3. Innovation and Creativity (Maximum 300 words)

ErrorRadar introduces a novel approach to evaluating MLLMs in the context of mathematics education. Its focus on error analysis and categorization provides a level of granularity not found in existing benchmarks. This innovative approach pushes the boundaries of MLLM evaluation beyond simple answer checking, encouraging the development of models that can truly understand and diagnose student difficulties. The multimodal nature of the dataset further enhances the benchmark’s relevance and challenges MLLMs to handle the complexities of real-world mathematical problem-solving.

4. Scalability and Sustainability (Maximum 300 words)

ErrorRadar is designed to be a scalable and sustainable benchmark. The dataset can be easily expanded with additional problems and error annotations. The evaluation metrics are clearly defined and can be readily applied to new MLLMs. The benchmark’s focus on fundamental cognitive processes in mathematical reasoning ensures its long-term relevance as MLLM technology continues to evolve. By providing a robust and adaptable framework for evaluating MLLMs, ErrorRadar contributes to the ongoing development of AI systems that can effectively support and enhance math education.

5. Social Impact and Responsibility (Maximum 300 words)

ErrorRadar addresses social issues by enhancing math education through error detection, promoting equity and inclusion. It empowers educators with insights into student errors, enabling tailored interventions that support diverse learning needs. To measure its social impact, we will track improvements in student performance, gather teacher feedback, monitor equity metrics, and assess community engagement. We ensure responsiveness through a continuous feedback loop, regular updates, community collaboration, and pilot programs. By doing so, ErrorRadar aims to create a more inclusive and effective educational environment, contributing to broader social goals.

Do you have additional materials to upload?
No
Personal Information Collection Statement (PICS):
1. The personal data collected in this form will be used for activity organisation, record keeping and reporting only. The collected personal data will be purged within 6 years after the event.
2. Please note that it is obligatory to provide the personal data required.
3. Your personal data collected will be kept by the LTTC and will not be transferred to outside parties.
4. You have the right to request access to and correction of information held by us about you. If you wish to access or correct your personal data, please contact our staff at lttc@eduhk.hk.
5. The University’s Privacy Policy Statement can be accessed at https://www.eduhk.hk/en/privacy-policy.
Agreement
  • I have read and agree to the competition rules and privacy policy.