Section A: Project Information
Automatic Multi-modal Deep Learning Analysis System is a cutting-edge platform that leverages a fine-tuned Multimodal Large Language Model (MLLM) to personalize English language learning. Our system allows students to describe real-world scenario images and receive immediate, high-quality feedback tailored to their proficiency levels.
Unlike conventional tools that rely solely on textual inputs, our system integrates a fine-tuned MLLM that interprets both images and text. This dynamic and automated approach reduces the hallucinations common in general-purpose models. Students engage with real-world photos and generate descriptions, which receive level-specific feedback. The system also supports continuous improvement through real-time user feedback collection, further improving accuracy and relevance.
Piloted with over 1,000 students across Hong Kong and Mainland China, our system has demonstrated strong pedagogical impact and scalability. It promises to revolutionize English language education by aligning cutting-edge AI with authentic, personalized learning experiences.
Section B: Participant Information
Title | First Name | Last Name | Organisation/Institution | Faculty/Department/Unit | Email Address | Phone Number | Current Study Programme | Current Year of Study | Contact Person / Team Leader
---|---|---|---|---|---|---|---|---|---
Mr. | Zhiwei | XIE | The Education University of Hong Kong | Department of Mathematics and Information Technology | s1141195@s.eduhk.hk | +852 63651896 | Doctoral Programme | Year 1 | 
Prof. | Leung Ho | Yu | The Education University of Hong Kong | Department of Mathematics and Information Technology | plhyu@eduhk.hk | 29487819 | Doctoral Programme | Professor | 
Section C: Project Details
Students often struggle with contextual writing tasks due to traditional language exercises' abstract and decontextualized nature. Descriptions based on real-life scenarios, essential for practical communication, are rarely emphasized, and individualized feedback is limited due to teacher time constraints.
Our team observed that students learn more effectively when language is connected to personal, visual experiences. However, existing AI tools offer generic feedback that fails to accommodate the learner's level and often introduces factual inaccuracies (hallucinations).
We hypothesize that multimodal learning, which grounds language in images, combined with personalized feedback can significantly improve writing fluency, engagement, and vocabulary retention. Our system builds on this hypothesis by integrating a fine-tuned MLLM with educational-stage-aligned datasets to ensure age-appropriate, curriculum-based support.
Our system aligns with educational psychology (dual coding theory, scaffolding) and AI capabilities (multimodal understanding, generative feedback). It automates repetitive tasks for teachers while enriching the learning experience for students. Early pilot studies showed improved student motivation and writing quality, supporting its real-world potential.
N/A
Architecture Overview:
Our MLLM is composed of three fundamental components: an LLM backbone for text generation, a visual encoder to extract features from images, and a vision-language connector that effectively bridges the visual and textual modalities.
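The three-component composition can be sketched structurally as follows. This is a minimal illustrative skeleton, not the deployed implementation: the class names, the stub feature extraction, and the placeholder projection are all assumptions made for clarity.

```python
# Structural sketch of the three MLLM components described above.
# All class names and internals are illustrative stand-ins.

class VisualEncoder:
    """Extracts a feature vector from an image (stubbed here)."""
    def encode(self, image_bytes: bytes) -> list[float]:
        # A real encoder (e.g. a vision transformer) would return
        # high-dimensional features; this stub returns one number.
        return [float(len(image_bytes) % 7)]

class VisionLanguageConnector:
    """Projects visual features into the LLM's embedding space."""
    def project(self, features: list[float]) -> list[float]:
        return [f * 0.5 for f in features]  # placeholder linear map

class LLMBackbone:
    """Generates text conditioned on visual tokens and a prompt."""
    def generate(self, visual_tokens: list[float], prompt: str) -> str:
        return (f"Feedback on: {prompt!r} "
                f"(conditioned on {len(visual_tokens)} visual tokens)")

class MLLM:
    """Wires encoder -> connector -> backbone into one pipeline."""
    def __init__(self) -> None:
        self.encoder = VisualEncoder()
        self.connector = VisionLanguageConnector()
        self.backbone = LLMBackbone()

    def respond(self, image_bytes: bytes, prompt: str) -> str:
        features = self.encoder.encode(image_bytes)
        visual_tokens = self.connector.project(features)
        return self.backbone.generate(visual_tokens, prompt)
```

The key design point the sketch captures is that the connector is the only bridge between modalities: the backbone never sees raw pixels, only projected visual tokens.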
Workflow:
Students describe a given real-life photo.
Our model analyzes the image and text, extracting key visual elements and linguistic features.
The model provides immediate, level-specific feedback across multiple dimensions (grammar, vocabulary, sentence structure, etc.).
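The multi-dimensional feedback step above can be sketched as a function from a description and a level to per-dimension comments. The dimension names follow the workflow description, but the rules below are simplified stand-ins for the model's actual analysis.

```python
# Illustrative sketch of level-specific, multi-dimensional feedback.
# The rule-based comments are placeholders for model-generated output.

def level_specific_feedback(description: str, level: str) -> dict[str, str]:
    words = description.split()
    distinct = len(set(w.lower() for w in words))
    return {
        "grammar": ("Check subject-verb agreement." if level == "primary"
                    else "Review tense consistency across sentences."),
        "vocabulary": (f"You used {distinct} distinct words; "
                       "try adding one descriptive adjective."),
        "sentence_structure": ("Combine short sentences with 'and' or "
                               "'because'." if level == "primary"
                               else "Vary sentence openings to improve flow."),
    }
```

A call such as `level_specific_feedback("A dog runs in the park.", "primary")` returns one comment per dimension, mirroring the structured feedback a student sees.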
Innovative Features:
Multi-layer feedback mechanism tailored to learner profiles
Curriculum-aligned fine-tuning to reduce hallucinations
Real-time analysis with batch processing support
Work plan:
Step 1: Dataset curation
Step 2: Model training
Step 3: System development and pilot testing
Step 4: Evaluation, refinement and expansion
Key Innovations:
Multimodal Contextualization: Most writing tools rely on text prompts. Our system uses images as authentic contexts, simulating real-world language applications.
Level-aware Personalization: Our system dynamically adapts feedback to the student’s educational stage.
Curriculum-grounded Fine-tuning: Our model is trained on level-appropriate corpora to ensure relevance and reduce hallucinations.
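The level-aware personalization above amounts to mapping an educational stage to a feedback configuration. The sketch below illustrates the idea; the stage names, field names, and values are hypothetical examples, not the deployed settings.

```python
# Hypothetical stage-to-configuration mapping for level-aware feedback.
# Values are illustrative, not the system's actual parameters.

STAGE_CONFIG = {
    "primary": {
        "max_vocab_tier": 1, "feedback_detail": "simple",
        "examples_per_error": 2,
    },
    "junior_secondary": {
        "max_vocab_tier": 2, "feedback_detail": "moderate",
        "examples_per_error": 1,
    },
    "senior_secondary": {
        "max_vocab_tier": 3, "feedback_detail": "detailed",
        "examples_per_error": 1,
    },
}

def config_for(stage: str) -> dict:
    # Fall back to the most conservative setting for unknown stages.
    return STAGE_CONFIG.get(stage, STAGE_CONFIG["primary"])
```

Defaulting unknown stages to the simplest configuration is a deliberate safety choice: feedback that is too simple is less harmful than feedback pitched above a learner's level.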
Creative Impact:
Our system turns a static writing task into an engaging learning journey. The feedback is no longer merely corrective but constructive. Students receive examples, not just scores. The system learns from their patterns, offering iterative support that mimics a personalized tutor.
Differentiation:
Compared to other AI systems, ours is not just a tech demo: it is pedagogically informed and field-tested. Its creative value lies in combining AI fluency with human-like, education-centred interaction that is scalable for classrooms.
Scalability Strategy:
Our system is designed for institutional-scale deployment. With a containerized backend, REST APIs, and support for cloud and on-premise hosting, schools can integrate our model with minimal friction. It supports multilingual and multi-curricular configurations via retraining with new corpora.
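As a sketch of the REST integration path, a school client would submit an image and a student description in one JSON payload. The endpoint path and field names below are hypothetical and shown only to illustrate how little glue code an integration requires.

```python
# Sketch of a client-side request payload for a hypothetical
# /feedback REST endpoint; field names are assumptions.

import base64
import json

def build_feedback_request(image_bytes: bytes, description: str,
                           stage: str) -> str:
    """Serialize one feedback request as a JSON string."""
    payload = {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "description": description,
        "stage": stage,
    }
    return json.dumps(payload)
```

Base64-encoding the image keeps the whole request in a single JSON body, which works unchanged against both cloud and on-premise deployments.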
Bottlenecks and Mitigation:
Latency: Optimized transformer inference pipelines reduce processing time.
Dataset expansion: Use of synthetic augmentation and active learning to scale labelled data.
User load: Built-in load balancing and serverless deployment options for peak usage periods.
Sustainability Strategies:
Eco-aware computing: Using lightweight fine-tuning techniques on pre-trained models to reduce energy costs.
Continual learning: Feedback data from users is anonymized and used to fine-tune future versions.
Teacher co-piloting: Teachers remain in control, guiding and customizing feedback, ensuring long-term trust.
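The anonymization step in the continual-learning strategy above can be sketched as replacing user identifiers with salted one-way hashes before feedback records are stored. The record schema and salt handling here are assumptions for illustration.

```python
# Minimal sketch of anonymizing a feedback record before storage.
# The record fields and salting scheme are illustrative assumptions.

import hashlib

def anonymize_record(record: dict, salt: str) -> dict:
    """Replace the raw user ID with a salted SHA-256 pseudonym."""
    digest = hashlib.sha256((salt + record["user_id"]).encode()).hexdigest()
    return {
        "user_hash": digest[:16],  # truncated pseudonymous ID
        "description": record["description"],
        "rating": record["rating"],
    }
```

Because the hash is one-way and salted, stored records can still be grouped per learner for fine-tuning without the raw identity ever leaving the ingestion step.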
Long-term Viability:
Our system encourages habitual use by building learner profiles over time and adjusting difficulty and feedback accordingly. Its adaptability to different subjects (e.g., medical descriptions) opens pathways for cross-disciplinary impact.
Our system directly addresses equity, inclusion, and access in language education. Many students struggle with contextual writing tasks due to limited exposure and one-size-fits-all instruction; our system democratizes high-quality feedback by providing tailored, AI-powered learning support anytime and anywhere.
Social benefits include:
Reduced inequality: Students from under-resourced schools gain access to personalized feedback traditionally limited to private instruction.
Inclusive design: The system supports multiple educational stages and is adaptable to different learning abilities.
Empowered teachers: By reducing repetitive grading, our system allows educators to focus on mentorship and student interaction.
Global adaptability: Fine-tuning with different corpora allows adoption in diverse linguistic and cultural settings.
Metrics for social impact:
Feedback usefulness scores from students and teachers
Improvement in student outcomes in underperforming schools
Engagement metrics in lower-income or rural settings
Gender and language equity analysis in system usage data
Responsiveness to Communities:
Our system is co-developed with educators. Their feedback shapes content design, ensuring the tool supports human instruction. Continuous community outreach (via workshops, surveys, and teacher networks) aligns the project with real needs.
Personal Information Collection Statement (PICS):
1. The personal data collected in this form will be used for activity organization, record keeping and reporting only. The collected personal data will be purged within 6 years after the event.
2. Please note that it is obligatory to provide the personal data required.
3. Your personal data collected will be kept by the LTTC and will not be transferred to outside parties.
4. You have the right to request access to and correction of information held by us about you. If you wish to access or correct your personal data, please contact our staff at lttc@eduhk.hk.
5. The University’s Privacy Policy Statement can be accessed at https://www.eduhk.hk/en/privacy-policy.
- I have read and agree to the competition rules and privacy policy.