Section A: Project Information
ErrorRadar is a benchmark designed to evaluate the complex mathematical reasoning capabilities of Multimodal Large Language Models (MLLMs). Unlike traditional math benchmarks that focus solely on solution accuracy, ErrorRadar delves deeper into the cognitive processes involved in problem-solving. It features a dataset of 2,500 real-world, multimodal K-12 math problems, each accompanied by multiple student solutions containing various types of errors. These errors are meticulously categorized (e.g., visual perception, reasoning, knowledge, calculation, misinterpretation), providing a granular analysis of an MLLM's ability not only to identify whether an error exists, but also to determine where and why it occurred. This granular analysis allows for a more nuanced evaluation of MLLMs and can guide future development towards models that truly understand mathematical concepts.
Section B: Participant Information
| Title | First Name | Last Name | Organisation/Institution | Faculty/Department/Unit | Email Address | Phone Number | Contact Person / Team Leader |
|---|---|---|---|---|---|---|---|
| Dr. | Shen | Wang | Squirrel Ai Learning | Squirrel Ai Learning | shenwang@squirrelai.com | +1 7347306787 | |
| Mr. | Yibo | Yan | Hong Kong University of Science and Technology (Guangzhou) | AI Thrust | yanyibo70@gmail.com | 18054231221 | |
| Dr. | Qingsong | Wen | Squirrel Ai Learning | Squirrel Ai Learning | qingsongedu@gmail.com | +1 (425) 520-1766 | |
Section C: Project Details
Current methods for evaluating MLLMs on math problems often fall short of capturing the nuances of student reasoning. Simply checking for correct answers ignores the valuable information contained within incorrect solutions. ErrorRadar addresses this gap by providing a rich dataset of student errors, enabling researchers to assess an MLLM’s ability to diagnose and understand these errors. This is crucial for developing AI systems that can provide effective and personalized feedback in educational settings. By focusing on the “process” of problem-solving, ErrorRadar promotes the development of MLLMs that can support deeper learning and improved problem-solving skills.
ErrorRadar’s dataset is constructed from real student responses to K-12 math problems, ensuring its relevance to real-world educational scenarios. The multimodal nature of the dataset, including handwritten solutions, typed text, and diagrams, reflects the diverse ways students engage with mathematical concepts. The benchmark will provide clear evaluation metrics, such as the accuracy of error detection, the precision of error localization, and the correctness of error categorization. This allows for a comprehensive assessment of MLLM performance on complex mathematical reasoning tasks.
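The three evaluation metrics above can be sketched as a simple scoring routine. This is a minimal illustration only; the record fields (`has_error`, `error_step`, `error_category`) and the exact scoring rules are assumptions for demonstration, not ErrorRadar's actual schema or official scorer.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Annotation:
    has_error: bool                  # does the solution contain an error?
    error_step: Optional[int]        # index of the first erroneous step, if any
    error_category: Optional[str]    # e.g. "calculation", "reasoning"

def score(gold: list, pred: list) -> dict:
    """Compute detection accuracy, localization precision, and categorization accuracy."""
    assert len(gold) == len(pred)
    # Detection: did the model correctly decide whether an error exists?
    detected = sum(g.has_error == p.has_error for g, p in zip(gold, pred))
    # Localization and categorization are judged only on solutions that truly contain an error.
    erroneous = [(g, p) for g, p in zip(gold, pred) if g.has_error]
    localized = sum(g.error_step == p.error_step for g, p in erroneous)
    categorized = sum(g.error_category == p.error_category for g, p in erroneous)
    n_err = max(len(erroneous), 1)  # avoid division by zero on all-correct batches
    return {
        "detection_accuracy": detected / len(gold),
        "localization_precision": localized / n_err,
        "categorization_accuracy": categorized / n_err,
    }
```

A model could, for example, correctly detect and localize an error but mislabel a calculation slip as a reasoning error; separating the three metrics makes that failure mode visible.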
Stream 1
ErrorRadar introduces a novel approach to evaluating MLLMs in the context of mathematics education. Its focus on error analysis and categorization provides a level of granularity not found in existing benchmarks. This innovative approach pushes the boundaries of MLLM evaluation beyond simple answer checking, encouraging the development of models that can truly understand and diagnose student difficulties. The multimodal nature of the dataset further enhances the benchmark’s relevance and challenges MLLMs to handle the complexities of real-world mathematical problem-solving.
ErrorRadar is designed to be a scalable and sustainable benchmark. The dataset can be easily expanded with additional problems and error annotations. The evaluation metrics are clearly defined and can be readily applied to new MLLMs. The benchmark’s focus on fundamental cognitive processes in mathematical reasoning ensures its long-term relevance as MLLM technology continues to evolve. By providing a robust and adaptable framework for evaluating MLLMs, ErrorRadar contributes to the ongoing development of AI systems that can effectively support and enhance math education.
ErrorRadar addresses social issues by enhancing math education through error detection, promoting equity and inclusion. It empowers educators with insights into student errors, enabling tailored interventions that support diverse learning needs. To measure its social impact, we will track improvements in student performance, gather teacher feedback, monitor equity metrics, and assess community engagement. We ensure responsiveness through a continuous feedback loop, regular updates, community collaboration, and pilot programs. By doing so, ErrorRadar aims to create a more inclusive and effective educational environment, contributing to broader social goals.
Personal Information Collection Statement (PICS):
1. The personal data collected in this form will be used for activity organization, record-keeping, and reporting only. The collected personal data will be purged within 6 years after the event.
2. Please note that provision of the required personal data is obligatory.
3. Your personal data collected will be kept by the LTTC and will not be transferred to outside parties.
4. You have the right to request access to and correction of information held by us about you. If you wish to access or correct your personal data, please contact our staff at lttc@eduhk.hk.
5. The University’s Privacy Policy Statement can be accessed at https://www.eduhk.hk/en/privacy-policy.
- I have read and agree to the competition rules and privacy policy.