Section A: Project Information
This project aims to bridge the communication gap for the deaf and hard-of-hearing community by developing an innovative AI-driven system that translates multiple spoken or written languages into American Sign Language (ASL) videos in real time. Inspired by the challenges faced by individuals who use sign language, we built the system around a core design philosophy of accessibility, accuracy, and affordability.
Key Innovations & Design: The system accepts input via text, voice recording, or audio file upload through a user-friendly web interface. It leverages AI to provide accurate, on-demand translations, overcoming the limitations of human interpreter availability, cost, and potential inconsistencies. By pre-processing and storing ASL video segments, the system ensures fast and cost-effective delivery.
Technical Principles: The workflow involves several AI technologies:
1. Speech-to-Text: OpenAI's Whisper transcribes audio input into text.
2. Translation & ASL Gloss: Google Gemini processes the text, translating it if necessary, and converts it into ASL gloss notation following specific grammatical rules.
3. Video Generation: ComfyUI, utilizing Stable Diffusion, DWPose for pose extraction, and AnimateDiff for animation, generates corresponding sign language video snippets from a dataset of sign language videos.
4. Delivery: The system, hosted on Google Cloud Platform and Firebase, matches the ASL gloss to the pre-rendered video segments stored in cloud storage and presents the final video sequence to the user.
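As a minimal sketch of steps 1, 2, and the gloss-to-clip matching in step 4 (illustrative only: the model names, the gloss prompt, and the clip index are assumptions, and the Gemini call uses the google-generativeai SDK rather than the Vertex AI client for brevity):

```python
import os
from openai import OpenAI
import google.generativeai as genai

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini = genai.GenerativeModel("gemini-1.5-flash")  # assumed model choice

def transcribe(audio_path: str) -> str:
    """Step 1: speech-to-text with Whisper via the OpenAI API."""
    with open(audio_path, "rb") as f:
        result = openai_client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def to_asl_gloss(text: str) -> list[str]:
    """Step 2: ask Gemini to rewrite the (translated) text as ASL gloss tokens."""
    prompt = (
        "Translate the following sentence into English if needed, then rewrite it "
        "as ASL gloss notation: uppercase gloss tokens separated by single spaces.\n"
        + text
    )
    response = gemini.generate_content(prompt)
    return response.text.strip().split()

def gloss_to_clips(gloss: list[str], clip_index: dict[str, str]) -> list[str]:
    """Step 4: map each gloss token to a pre-rendered clip path, skipping unknowns."""
    return [clip_index[token] for token in gloss if token in clip_index]
```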
Potential Impact: This translator can significantly enhance communication access in educational settings, allowing deaf students to follow lectures in real-time. It also serves as an educational tool for learning correct ASL and provides a low-cost alternative to human interpreters for individuals and organizations, fostering greater inclusion.
Section B: Participant Information
Title | First Name | Last Name | Organisation/Institution | Faculty/Department/Unit | Email Address | Phone Number | Current Study Programme | Current Year of Study | Contact Person / Team Leader |
---|---|---|---|---|---|---|---|---|---|
Ms. | Hau Yee | LEUNG | Hong Kong Institute of Information Technology at IVE (Lee Wai Lee) | Information Technology | 240515500@stu.vtc.edu.hk | 53742162 | Higher Diploma | Year 1 | |
Ms. | Oi Ki | WU | Hong Kong Institute of Information Technology at IVE (Lee Wai Lee) | Information Technology | 240367049@stu.vtc.edu.hk | 61517228 | Higher Diploma | Year 1 | |
Mr. | Haoyuan | XIAO | Hong Kong Institute of Information Technology at IVE (Lee Wai Lee) | Information Technology | 240363898@stu.vtc.edu.hk | 96071660 | Higher Diploma | Year 1 | |
Mr. | Hiu Fung | MAK | Hong Kong Institute of Information Technology at IVE (Lee Wai Lee) | Information Technology | 240302611@stu.vtc.edu.hk | 52669619 | Higher Diploma | Year 1 |
Section C: Project Details
Thought Process & Inspiration:
Our inspiration stemmed largely from the film "The Way We Talk" (看我今天怎麼說). The movie highlighted that sign language is not merely a communication tool but a vital part of deaf identity and culture. It presented the perspective that relying solely on alternatives such as subtitles can feel inadequate or even disrespectful, because they do not fully embrace the linguistic identity of the deaf community. This resonated with our observations of the communication barriers deaf individuals face, particularly the severe shortage and high cost of professional interpreters in Hong Kong (fewer than 200 interpreters for over 100,000 deaf individuals). This scarcity is especially critical in education, hindering students' access to lectures and participation in class. Motivated by the desire to offer a solution that respects deaf culture and addresses practical barriers, we conceived this AI project to translate directly into sign language.
Hypothesis & Rationale for Success:
Our hypothesis is that an AI-powered system capable of instantly translating spoken or written educational content into accurate sign language video can significantly enhance learning accessibility and outcomes for deaf students, offering a culturally respectful alternative to text-only solutions.
We believe this will succeed because:
Addresses Unmet Need: It tackles the interpreter shortage and cost, providing an on-demand resource.
Technological Feasibility: Current AI (speech recognition, translation, video generation) makes automated sign language translation feasible.
Scalability & Consistency: The AI offers a scalable, lower-cost solution with consistent translation quality.
Equitable and Accessible Learning: Providing immediate sign language translation allows students to learn in a way that affirms their linguistic identity, potentially improving engagement and comprehension.
Technology, Resources & Market Validation: Our solution leverages a combination of cutting-edge AI and cloud technologies. The frontend uses React.js for a responsive user interface that accepts text, voice (via the browser microphone), and audio file uploads (WAV, MP3, M4A). The backend combines Google Cloud Platform (GCP) services with external AI APIs: Firebase for hosting and authentication, OpenAI's Whisper (via API) for speech-to-text, Google Gemini (Vertex AI) for translation and ASL gloss generation, and ComfyUI (running on Compute Engine) utilizing Stable Diffusion, DWPose, and AnimateDiff for generating sign videos from a pre-existing dataset. Videos are stored in Cloud Storage for efficient retrieval.
Resources required include cloud service subscriptions (GCP, potentially Hugging Face), computing power (especially GPUs for video generation), a comprehensive ASL video dataset, and expertise in AI/ML, cloud architecture, and web development. Market demand is initially validated by the documented severe shortage and high cost of human interpreters in Hong Kong. Further validation will involve pilot programs with educational institutions and feedback collection from the deaf community.
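As a minimal sketch of how pre-rendered clips could be retrieved from Cloud Storage (illustrative; the bucket name and object naming convention are assumptions, not the project's actual layout):

```python
from datetime import timedelta
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("asl-sign-clips")  # hypothetical bucket name

def signed_clip_urls(gloss: list[str], ttl_minutes: int = 60) -> list[str]:
    """Return short-lived download URLs for each gloss token's pre-rendered clip."""
    urls = []
    for token in gloss:
        blob = bucket.blob(f"clips/{token}.mp4")  # assumed object naming convention
        if blob.exists():
            urls.append(
                blob.generate_signed_url(version="v4", expiration=timedelta(minutes=ttl_minutes))
            )
    return urls
```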
Core Functionalities, UX & Metrics: The core functionalities are:
1. Multi-modal input (text, voice, audio file).
2. AI-driven transcription, translation to ASL gloss, and mapping to sign video segments.
3. Output display of original text, ASL gloss, and concatenated sign language video.
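As a minimal sketch of how the concatenated video in item 3 could be assembled from per-sign clips (illustrative; assumes ffmpeg is available and that all clips share the same codec, resolution, and frame rate):

```python
import subprocess
import tempfile

def concatenate_clips(clip_paths: list[str], output_path: str) -> None:
    """Write an ffmpeg concat list and stream-copy the clips into one video."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as listing:
        for path in clip_paths:
            listing.write(f"file '{path}'\n")
        list_path = listing.name
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", list_path, "-c", "copy", output_path],
        check=True,
    )
```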
We ensure a positive user experience through a simple, intuitive interface, clear feedback during processing, and optimized performance by using pre-generated video assets. Key performance metrics include:
Translation Accuracy: Correctness of ASL gloss compared to expert evaluation.
Sign Intelligibility: Clarity and understandability of generated video signs rated by native users.
Latency: End-to-end processing time from input to video output.
User Satisfaction: Qualitative feedback on usability and effectiveness.
N/A
This project represents an innovative solution by leveraging a creative synthesis of cutting-edge AI technologies to address the communication barriers faced by the deaf community, moving beyond traditional methods. The core innovation lies in automating the generation of dynamic sign language video from spoken or written language in real-time.
Demonstration of Innovation & Creativity:
1. AI Integration for Video Generation: Unlike text-based captions or simpler translation tools, we creatively combine multiple AI models: OpenAI Whisper for speech recognition, Google Gemini for nuanced translation into ASL gloss (incorporating specific linguistic rules), and the ComfyUI platform (using Stable Diffusion, DWPose, AnimateDiff) to generate fluid, avatar-based sign language videos. This complex pipeline automates a highly specialized human skill.
2. Pre-computation Strategy: Recognizing the high cost of real-time video synthesis, we innovatively use pre-generated video segments stored in the cloud. This creative architectural choice drastically reduces latency and operational costs, making the solution more feasible and responsive.
3. Focus on Visual Language: By generating video output, the project respects sign language as a visual and gestural language, offering a potentially more effective and culturally appropriate communication method for users compared to text-only solutions.
Enhanced Effectiveness: These innovative elements directly enhance effectiveness:
Accessibility & Scalability: AI automation overcomes the scarcity and high cost of human interpreters, making sign language translation widely available on demand.
Consistency: AI ensures standardized translations based on learned patterns and rules.
Speed: Real-time processing facilitates timely communication, crucial in settings like education.
User Experience: Providing visual sign language caters directly to the preferred communication mode of many users, potentially improving comprehension and engagement.
Scalability & Bottleneck Management:
Our strategy leverages Google Cloud Platform (GCP) and Firebase, which offer inherently scalable infrastructure for hosting, databases (Firestore), AI services (Vertex AI), and serverless functions. The use of pre-generated video segments stored in Cloud Storage significantly reduces real-time computational load, enhancing responsiveness and scalability. Potential bottlenecks include the video generation component (ComfyUI on Compute Engine) if extensive new sign generation is needed, and potential rate limits on external AI APIs (Whisper, Gemini). Mitigation involves auto-scaling Compute Engine resources, optimizing generation workflows, implementing caching for translations (as suggested by sentence_cache in Firestore), and potentially using a Content Delivery Network (CDN) for efficient video distribution.
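As a minimal sketch of the translation caching mentioned above (illustrative; the sentence_cache collection name comes from the design, but the document key scheme and field names are assumptions):

```python
import hashlib
from google.cloud import firestore

db = firestore.Client()

def _cache_key(sentence: str) -> str:
    """Normalise the sentence and hash it so it can serve as a document ID."""
    return hashlib.sha256(sentence.strip().lower().encode("utf-8")).hexdigest()

def cached_gloss(sentence: str) -> list[str] | None:
    """Return a previously computed ASL gloss for this sentence, if cached."""
    doc = db.collection("sentence_cache").document(_cache_key(sentence)).get()
    return doc.get("gloss") if doc.exists else None

def store_gloss(sentence: str, gloss: list[str]) -> None:
    """Cache the gloss so repeated sentences skip the Gemini call."""
    db.collection("sentence_cache").document(_cache_key(sentence)).set(
        {"sentence": sentence, "gloss": gloss}
    )
```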
Sustainability, Engagement & Adaptability:
Environmental sustainability is addressed by using efficient cloud infrastructure (GCP often utilizes renewable energy and operates at scale) and minimizing real-time GPU usage via pre-computation. Long-term user engagement relies on continuously improving translation accuracy and sign intelligibility, expanding the sign vocabulary dataset, and actively incorporating user feedback from the deaf community and educators. The modular architecture, relying on distinct components and APIs for functions like transcription and translation, allows for adaptation; individual modules can be updated or replaced as AI technology evolves or user needs change. Regularly updating the underlying sign language datasets and mappings is crucial for maintaining relevance and accuracy over time.
Addressing Social Issues & Enhancing Lives: This solution directly tackles the social issue of communication inequality faced by the deaf and hard-of-hearing community. By providing automated, real-time translation into sign language video, it addresses the critical shortage and prohibitive cost of human interpreters, which often limits access to education, information, and services. For primary beneficiaries—deaf individuals and students—it enhances life by fostering greater independence, enabling better access to real-time information (like classroom lectures), and facilitating communication in various settings. Crucially, by translating into sign language, it respects and affirms the linguistic identity and culture of the deaf community, aligning with broader social goals of equity and inclusion.
Measuring Impact & Responsiveness: Social impact will be measured through:
User Adoption: Tracking the number of active users, particularly within educational contexts.
Usage Metrics: Analyzing frequency and types of use (e.g., educational, personal).
Qualitative Feedback: Conducting surveys and interviews with deaf users and educators to assess perceived improvements in communication access, learning outcomes, and overall satisfaction.
Responsiveness to evolving community needs will be ensured through:
Community Partnerships: Establishing ongoing collaboration with deaf organizations and educational institutions.
Feedback Channels: Implementing clear mechanisms for users to report inaccuracies and suggest improvements.
Iterative Development: Regularly updating the sign vocabulary, improving AI accuracy, and adding features based directly on community input.
Personal Information Collection Statement (PICS):
1. The personal data collected in this form will be used for activity-organizing, record keeping and reporting only. The collected personal data will be purged within 6 years after the event.
2. Please note that it is obligatory to provide the personal data required.
3. Your personal data collected will be kept by the LTTC and will not be transferred to outside parties.
4. You have the right to request access to and correction of information held by us about you. If you wish to access or correct your personal data, please contact our staff at lttc@eduhk.hk.
5. The University’s Privacy Policy Statement can be accessed at https://www.eduhk.hk/en/privacy-policy.
- I have read and agree to the competition rules and privacy policy.