Section A: Project Information
Key innovations:
This project presents a novel AI-driven educational tool that performs fine-grained error attribution for K-12 mathematics using large language models (LLMs) and multi-agent orchestration. The system mimics expert teacher behavior by analyzing student problem-solving drafts, reconstructing their thought processes, and pinpointing the exact causes of mistakes, whether conceptual misunderstandings, procedural errors, or misapplied logic.
Design concepts:
Unlike traditional intelligent tutoring systems that rely on template matching, our system achieves highly personalized diagnostics. When errors are detected, the model engages students in human-like, multi-turn conversations that guide reflective thinking. These conversations are fully error-context-aware, because the diagnosed cause of the error is embedded into the conversation system before the dialogue begins.
Technical principles:
Our tool is built on a modular, agent-based architecture that includes an OCR reader, a semantic reasoner, a knowledge point mapper, and a Socratic-style dialogue generator. A controller orchestrates these modules to compose step-wise AI reasoning that reconstructs the student's original chain of thought. Reasoning results for similar errors are cached and shared across all students, which keeps the system computationally efficient and avoids unnecessary cost.
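The following minimal sketch illustrates this orchestration pattern in Python; the module, class, and method names are illustrative placeholders rather than our production implementation.

```python
# Illustrative sketch of the controller orchestration; names are placeholders,
# not the production implementation.
from dataclasses import dataclass


@dataclass
class Diagnosis:
    knowledge_point: str   # mapped knowledge point for the problem
    error_cause: str       # e.g., "conceptual", "procedural", "misapplied logic"
    explanation: str       # step where the student's reasoning diverged


class Controller:
    def __init__(self, ocr, reasoner, mapper, dialogue, cache):
        self.ocr = ocr              # OCR + vision LLM reader for handwritten drafts
        self.reasoner = reasoner    # semantic reasoner reconstructing the chain of thought
        self.mapper = mapper        # knowledge point mapper
        self.dialogue = dialogue    # Socratic-style dialogue generator
        self.cache = cache          # shared cache of previously attributed errors

    def handle_submission(self, problem, draft_image):
        steps = self.ocr.read(draft_image)              # transcribe the draft into ordered steps
        diagnosis = self.cache.lookup(problem.id, steps)  # reuse attribution for similar errors
        if diagnosis is None:
            trace = self.reasoner.reconstruct(problem, steps)
            diagnosis = Diagnosis(
                knowledge_point=self.mapper.map(problem, trace),
                error_cause=self.reasoner.attribute(trace),
                explanation=trace.first_divergence(),
            )
            self.cache.store(problem.id, steps, diagnosis)
        # The diagnosed error context is embedded before the conversation starts.
        return self.dialogue.start(student_context=diagnosis)
```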
Evaluations:
Based on manual evaluation of over 800 cases by pedagogical experts, the current accuracy of error attribution is around 83% for K-6 and 79% for grades 7-12. We expect accuracy to improve significantly (by 10%+ within months) as vision LLMs continue to advance.
Potential impacts:
The system currently supports around 112K daily active students in mathematics. It has generated rich data signals that power downstream tasks such as recommending relevant content for weak knowledge concepts (e.g., practice problems and teaching videos), building fine-grained student profiles, and exploring alternative solutions to the same problem. Broader impacts include knowledge tagging that captures nuanced relationships between knowledge points, recognition of students' psychological patterns, and more.
Section B: Participant Information
Title | First Name | Last Name | Organisation/Institution | Faculty/Department/Unit | Email | Phone Number | Contact Person / Team Leader |
---|---|---|---|---|---|---|---|
Dr. | Tianlong | Xu | Squirrel Ai Learning | AI Research Lab | tianlongxu@squirrelai.com | +14043987727 | |
Dr. | Shen | Wang | Squirrel Ai Learning | AI Research Lab | shenwang@squirrelai.com | +17347306787 | |
Dr. | Zhendong | Chu | Squirrel Ai Learning | AI Research Lab | zhendongchu@squirrelai.com | +14343289773 | |
Dr. | Qingsong | Wen | Squirrel Ai Learning | AI Research Lab | qingsongwen@squirrelai.com | +14255201766 |
Section C: Project Details
As early as Spring 2024, we launched an initiative at Squirrel Ai encouraging students to write down their thought processes while solving math problems. Our goal was to cultivate disciplined, step-by-step reasoning and normalize structured thinking. Within a month, we had amassed millions of digital drafts: raw, messy, and full of untapped learning signals. It was immediately clear to us that there was immense value hidden in these drafts, but we needed the right technology to unlock it.
Not long after, advances in multimodal large language models, especially ChatGPT’s emerging vision capabilities, presented a clear opportunity. These models could understand handwritten math drafts with surprising competence. We began testing hundreds of examples, and even our earliest prototypes showed promising results for simpler problems. Encouraged, we designed a multi-agent architecture to tackle more complex cases, integrating OCR, semantic reasoning, and dynamic feedback generation.
At the core of this project are two intersecting forces: the data (millions of diverse student thought processes) and the vision capabilities of modern LLMs. With both evolving rapidly, especially with the emergence of models like Gemini 2.5, we believe this application will become significantly more powerful and relevant in the near future.
Our hypothesis is that if we can accurately reconstruct a student’s thinking, diagnose the root cause of their mistake, and deliver personalized, emotionally resonant feedback, then students will improve not only in correctness but in confidence and conceptual mastery. As attribution accuracy increases, so does the quality of the AI-student conversation, leading to more meaningful engagement, precise recommendations, and deeper learning.
This project originated from a practical need, but it has since become a vision: that every student can have a tutor who understands how they think, and helps them think better.
NA - see 2b instead.
Our system uses a modular, multi-agent architecture to replicate student reasoning, attribute root causes of math errors, and generate personalized tutoring interactions.
Function Point | Technical Application | Progress
---|---|---
Draft Recognition | OCR + vision LLM with chain-of-thought deep thinking | Integrated and operational
Error Attribution | Multi-agent logic with controller + cached retrieval layer | Accuracy at 83% (K-6) and 79% (7-12); improving toward 90%+
Personalized Dialogue Generation | Socratic-style response agent with detected errors embedded in the system | Live with 75% meaningfulness; improving toward 90%+
Recommendation Engine | Content tagging + knowledge point mapper + retrieval model | Training and integration in progress
Standalone model for error attribution and conversations | Fine-tuning an open-source vision reasoning model (e.g., sky-r1v-38b) | Training in progress
Expansion to other STEM subjects | Adapting the vision LLM pipelines and core prompts to the specific requirements of subjects such as Physics, Chemistry, and Biology | Development in progress
Evaluation and Metrics
Offline:
• Error Attribution Accuracy: one-by-one manual evaluation by experts.
• Conversation Meaningfulness (rubric-based): structured rubric with four criteria (a conversation counts as meaningful when at least two are met):
o Being corrective & anti-misleading.
o Being inspirational & encouraging.
o Being precise & rigorous with calculation.
o Being effective & to-the-point.
Online:
• User satisfaction: ~80% positive.
• Knowledge mastery indicators (see the sketch after this list):
o NIACT: The cumulative number of incorrect answers in the knowledge point associated with the current question during a learning session.
o NQCT: The cumulative number of questions in the knowledge point associated with the current question during a learning session.
o ARCT: The average correct answer rate for the knowledge point associated with the current question during a learning session.
o NVRS: The number of times students rewatched videos and relearned the knowledge point associated with the current question due to low mastery during a learning session.
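To make these definitions concrete, the following minimal sketch shows how the four indicators could be computed from a simplified log of a session's answer events; the event schema and field names are illustrative assumptions, not our internal data format.

```python
# Illustrative computation of the session-level mastery indicators;
# the event schema below is a simplified assumption.
from dataclasses import dataclass


@dataclass
class AnswerEvent:
    knowledge_point: str   # knowledge point associated with the question
    correct: bool          # whether the student answered correctly
    relearned: bool        # rewatched a video / relearned the point due to low mastery


def mastery_indicators(events, current_kp):
    """Compute NIACT, NQCT, ARCT, and NVRS for the knowledge point of the current question."""
    kp_events = [e for e in events if e.knowledge_point == current_kp]
    nqct = len(kp_events)                               # cumulative questions on this knowledge point
    niact = sum(1 for e in kp_events if not e.correct)  # cumulative incorrect answers
    arct = (nqct - niact) / nqct if nqct else 0.0       # average correct answer rate
    nvrs = sum(1 for e in kp_events if e.relearned)     # rewatch / relearn count
    return {"NIACT": niact, "NQCT": nqct, "ARCT": arct, "NVRS": nvrs}
```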
Our project is among the first to apply multimodal large language models (LLMs) to interpret student-generated content in math learning, specifically handwritten solution drafts. While traditional AI in education focuses on grading answers or recommending practice problems, our system goes a step further: it deciphers how students think.
Student problem-solving processes are messy, diverse, and deeply personal. A wrong answer might stem from a misunderstanding of a concept, a skipped step, a careless arithmetic slip, or even illegible handwriting. No static rule-based system, or even a skilled teacher with limited time, can possibly track every nuance—especially at scale. With over 112,000 daily active students in our system, each potentially making 50+ mistakes per day, it is humanly impossible to diagnose every individual error in a timely, precise, and personalized manner.
This is where our innovation shines. By leveraging vision-capable LLMs in a multi-agent system, we accurately reconstruct students’ reasoning, identify subtle mistakes, and classify their root causes. These may include misapplied logic, incomplete steps, typos, misinterpretations of constraints, or visually ambiguous expressions (e.g., writing “20 + 1” in a way that looks like “20 − 11”).
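For illustration, an attribution result could be represented with a schema like the following sketch; the enum values mirror the root-cause categories above, while the structure itself is a simplified assumption rather than our production schema.

```python
# Illustrative schema for an attribution result; the enum values mirror the
# root-cause categories described above, the structure itself is an assumption.
from enum import Enum
from dataclasses import dataclass


class ErrorCause(Enum):
    MISAPPLIED_LOGIC = "misapplied logic"
    INCOMPLETE_STEPS = "incomplete steps"
    TYPO = "typo"
    MISREAD_CONSTRAINT = "misinterpretation of constraints"
    AMBIGUOUS_WRITING = "visually ambiguous expression"


@dataclass
class AttributionResult:
    problem_id: str
    step_index: int          # first step where the student's reasoning diverges
    cause: ErrorCause
    evidence: str            # transcription of the offending step
    knowledge_point: str     # weak knowledge point for downstream recommendation
```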
But we don’t stop at error detection. Our system initiates tailored, context-aware human-AI conversations that guide students to reflect, correct, and understand—just like a real tutor would. Then it dynamically recommends content aligned with the identified weak knowledge point.
This level of fine-grained, empathetic, and scalable feedback has never been achieved before in K-12 education. It brings AI into the heart of how students learn and struggle—turning every mistake into a meaningful learning opportunity.
Our tool is engineered for seamless scalability across subjects, languages, and educational ecosystems. Its modular architecture and minimal integration requirements make it easy to deploy at scale, ensuring sustained effectiveness as usage grows.
Current Scale and Validation
With 112,000 daily active users, the system has already demonstrated robust performance under real-world load. Should user demand increase drastically, we expect to have sufficient resilience to cope with the challenge. Since the initial launch, we have cached at least 100 error cases per problem, which are shared across all students. This mechanism is highly computationally efficient by design, as it ensures that only a small fraction of requests (roughly 10%) need to be processed from scratch.
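A minimal sketch of how such a shared error-case cache could work is shown below; the similarity keying via a normalized signature is a deliberate simplification of our actual cached retrieval layer.

```python
# Illustrative shared cache keyed by problem and a normalized error signature;
# the signature function is a simplified stand-in for the real retrieval layer.
import hashlib


class ErrorCaseCache:
    def __init__(self):
        self._store = {}   # (problem_id, signature) -> cached diagnosis

    @staticmethod
    def _signature(steps):
        # Normalize the transcribed steps so similar mistakes map to the same key.
        normalized = "|".join(s.strip().lower() for s in steps)
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def lookup(self, problem_id, steps):
        return self._store.get((problem_id, self._signature(steps)))

    def store(self, problem_id, steps, diagnosis):
        self._store[(problem_id, self._signature(steps))] = diagnosis
```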
Cross-Subject, Language, and Curriculum Edition Adaptability
The tool’s architecture supports rapid expansion to new STEM subjects, including Physics, Chemistry, and Biology. It requires only a few standard fields: problem statements, student solutions, and drafts, all commonly available on most learning platforms. Its language-agnostic design enables easy extension from English and Chinese (both currently supported) to other languages with minimal retraining. Our system already supports nine curriculum editions within China and can be adapted to other national standards, supporting long-term engagement through localized alignment.
Deployment and Integration
The tool integrates directly with online tutoring platforms and digital learning content systems via simple APIs. Its lightweight setup enables fast rollout across diverse educational contexts, from large centralized systems to smaller regional platforms.
Cost and Environmental Sustainability
The primary recurring cost is LLM API usage. To optimize for efficiency, we’ve implemented intelligent caching for repeated reasoning patterns, reducing computation load and energy usage. This strategy makes the system financially and environmentally sustainable as it scales.
Addressing Access Challenges
To overcome infrastructure gaps, we are developing offline-capable standalone versions for deployment in low-connectivity regions.
Our tool is built with a strong commitment to equity, safety, and responsible AI, ensuring that every student, regardless of background, has access to personalized learning support that is ethical, inclusive, and secure.
Equity and Inclusion
To reduce educational disparities, we offer free account options to students in low-income regions, ensuring that financial limitations do not restrict access to quality AI tutoring. The system is designed for remote-first usage, removing geographic barriers, and it will soon support a standalone offline version for learners in areas with limited internet connectivity.
Its architecture allows easy adaptation across diverse cultural and linguistic contexts and expansion into other STEM subjects. Additionally, the platform supports real-time access by parents and educators, enabling more inclusive oversight and support for underrepresented or struggling students.
To measure and enhance equity, we track user demographic distributions across gender, race, region, and socioeconomic background, and are working to include LGBTQ+ visibility and accessibility features in our roadmap. This ensures our system not only reaches a diverse population, but also serves them with awareness and sensitivity.
Safety, Privacy, and Transparency
We prioritize AI safety through:
• Hallucination control mechanisms, reducing irrelevant or incorrect outputs.
• Abuse detection and real-time conversation shutdown if harmful language is detected.
• Exclusion of unclear or misleading drafts from training and reasoning processes.
Privacy is protected by data anonymization, strict zero-retention policies with LLM providers, and clear user consent protocols. All interactions are labeled as AI-generated, and students are educated on how their data contributes to system improvement.
Impact Evaluation
We actively monitor and optimize:
• Hallucination rates
• Frequency of serious user complaints
• Equity metrics across diverse user identities and regions
Personal Information Collection Statement (PICS):
1. The personal data collected in this form will be used for activity-organizing, record keeping and reporting only. The collected personal data will be purged within 6 years after the event.
2. Please note that it is obligatory to provide the personal data required.
3. Your personal data collected will be kept by the LTTC and will not be transferred to outside parties.
4. You have the right to request access to and correction of information held by us about you. If you wish to access or correct your personal data, please contact our staff at lttc@eduhk.hk.
5. The University’s Privacy Policy Statement can be accessed at https://www.eduhk.hk/en/privacy-policy.
- I have read and agree to the competition rules and privacy policy.