AI-Based Deaf Companion System for Bridging Communication Between Deaf and Hearing Communities

I. INTRODUCTION

Sign language is the native language of hearing-impaired people. It is developed and used as a mode of communication within a community that includes hearing-impaired people, their friends, and their families. It is a visual-gestural language that exchanges information through gestures rather than sound. William Stokoe was the first to propose that sign language is not pantomime but a complex system of symbols, or gestures, made up of distinct parts. These parts consist of manual and non-manual gestures, which are the basic units of signs and are analogous to phonemes. The signs themselves are analogous to the morphemes of a spoken language. Manual gestures are movements of the hands and the use of different hand shapes in different locations and orientations. Body language and facial expressions are referred to as non-manual gestures. Sign language differs from spoken language in many respects, the main difference being modality.

Communication in sign language takes place mainly through visual gestures built from manual and non-manual features. The arrangement of gestures that conveys meaning also differs widely from the word order of spoken language, because gestures are non-linear in nature, limited in number, and articulated within a signing space. Gestures must also be articulated properly; otherwise the meaning can change drastically. Emotions are usually expressed through non-manual features, whereas in spoken language they are conveyed through tone and pitch. Therefore, in sign language, manual gesture sequences should be co-articulated with non-manual gesture sequences.

Communication between hearing individuals and the deaf or hard-of-hearing community remains difficult in many everyday situations, including education, healthcare, workplaces, and public services. While sign language is a vital tool for deaf communication, most hearing individuals do not understand it, and human sign language interpreters are not always available or practical in real-time settings.

Current technologies provide limited support for real-time translation of spoken language into sign language. Furthermore, existing solutions often lack expressive, naturalistic animated avatars that can convey sign language in a human-like manner in real time. There is therefore a clear need for an intelligent system that automatically converts spoken language into sign language using animated avatars in real time, enabling inclusive and seamless communication for the deaf and hard-of-hearing community.

II. RELATED WORK

Related work spans machine translation (MT) systems for text-to-sign-language translation, methods for emotion detection, the generation of avatar animations, and SignWriting systems. Xu Lin and Gao Wen (2023) proposed a system that translates Chinese into Chinese Sign Language by applying rules to translate each word of the Chinese text and then arranging the words according to Chinese Sign Language structure. Once the words are arranged into a sentence, the equivalent gestures are looked up in a dictionary that pairs Chinese words with their corresponding gestures. The resulting gestures are then rendered through avatar animation.

Kaushik Datta et al. (2021) proposed a Bangla text to Bangla Sign Language translation system that presents sign gestures as video clips. The system relies on a dictionary that maps text to signs according to a set of rules. The process is carried out in steps: first, the input text is rearranged according to Bangla Sign Language structure; next, each word is mapped to a video clip stored in a database; finally, the video clips are concatenated to form the sign gesture sequence. The system therefore depends on three things: a dictionary of words and their corresponding signs, a sound rule set for rearranging words into sign language order, and techniques for cleanly concatenating the video clips.

Rhythm Shahriar et al. (2022) proposed a system for digitally converting Bangla speech to Bangla Sign Language, together with a second system for converting text to speech to ensure two-way communication. The system accepts Bangla speech as input and converts it to Bangla text. The text is separated into individual words, and each word is mapped to an image of its sign stored in a database. The main drawback of this approach is that static images are difficult for a hearing-impaired person to understand.

Ian Marshall and Eva Safar (2024) proposed a prototype that translates English into British Sign Language in four stages. First, the English text is parsed for syntactic information, from which semantic details are derived in the form of a discourse representation. This representation is then transformed according to the semantic structure of the sign language. After the semantic transfer, HamNoSys notation is generated along with video clips for the text.

Ameera Amasoud and Hend Al-Khalifa (2022) describe the translation of Arabic text into Arabic Sign Language by applying a set of translation rules and a domain ontology. They built a sign language translation system for the prayer domain. The system analyses the morphological structure of a sentence and checks the grammatical transformation using semantic analysis, from which the SignWriting output is generated. A Spanish text to Spanish Sign Language system was developed by Jordi Porta et al. (2024). It uses rules to translate Spanish into Spanish Sign Language glosses. The evaluation of this system reports a Bilingual Evaluation Understudy (BLEU) score of 0.30 and a Translation Error Rate (TER) of 42%.

III. METHODOLOGY

The system is designed to bridge communication between deaf and hearing individuals by integrating three main modules: the Sign Language Recognition Module (SRM), the Speech Recognition and Synthesis Module (SRSM), and the Avatar Module (AM). The SRM plays a crucial role in interpreting sign language gestures made by a deaf user. It uses a webcam to capture real-time hand movements, identifying the positions and trajectories of both hands across video frames. These hand movements are analyzed to extract numerical features based on joint positions, which are then compiled over short time intervals to capture the dynamics of the gestures. A deep learning model trained on sign language data processes these features to accurately classify gestures into corresponding words or phrases. The recognized gestures are translated into text and stored in a message buffer, which can be used by other modules. To avoid redundancy, the system compares current outputs with recent history and suppresses repeated detections of the same gesture. Additionally, the recognized text is displayed on-screen to provide visual feedback to the user.
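The windowing and repeat-suppression behaviour described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the class name `GestureBuffer`, the window and history sizes, and the `classify` callable (standing in for the trained deep learning model) are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence

Frame = Sequence[float]  # flattened joint coordinates for one video frame


class GestureBuffer:
    """Hypothetical helper: groups per-frame features into fixed-length
    windows and suppresses repeated detections of the same gesture."""

    def __init__(self, classify: Callable[[List[Frame]], str],
                 window_size: int = 30, history_size: int = 3) -> None:
        self.classify = classify              # stand-in for the trained model
        self.window_size = window_size
        self.window: List[Frame] = []
        self.history: Deque[str] = deque(maxlen=history_size)
        self.messages: List[str] = []         # message buffer for other modules

    def add_frame(self, features: Frame) -> Optional[str]:
        """Add one frame; return a newly recognized label, or None."""
        self.window.append(features)
        if len(self.window) < self.window_size:
            return None                       # not enough frames yet
        label = self.classify(self.window)
        self.window = []                      # start a fresh window
        if label in self.history:
            return None                       # seen recently: suppress
        self.history.append(label)
        self.messages.append(label)
        return label
```

A consequence of this simple design is that a deliberately repeated sign is also suppressed until it drops out of the recent history, so the history length would need tuning in practice.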

Simultaneously, the SRSM captures spoken language from hearing users through a microphone. It converts this audio input into text using a speech recognition engine and improves recognition accuracy by analyzing audio patterns and adapting to environmental noise. When the system is uncertain about a recognition result, it flags the text with a low-confidence indicator. The recognized text is then passed to the Avatar Module so that the message can be signed visually to the deaf user. This module also performs text-to-speech synthesis to vocalize the deaf user's signed messages, allowing hearing users to understand what has been signed. Users can choose among voice options, such as male and female voices, for better personalization.
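The low-confidence indicator can be sketched as a simple threshold check. The `Transcript` type, the `[?]` marker, and the threshold of 0.6 are illustrative assumptions; the paper does not specify how confidence is obtained or displayed.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    confidence: float  # assumed score in [0, 1] reported by the recognizer


def flag_low_confidence(result: Transcript, threshold: float = 0.6) -> str:
    """Return the text, prefixed with a marker when confidence is low."""
    if result.confidence < threshold:
        return f"[?] {result.text}"
    return result.text
```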

The Avatar Module functions as the visual communicator of the system. It receives text input either from the SRSM (speech-to-text) or from the SRM (recognized gestures) and converts it into sign language animations. A 3D virtual avatar performs these gestures, making it easier for deaf users to understand spoken content in a visual format. The module manages incoming text messages in sequence, ensuring that each message is animated clearly and in order. When no messages are being processed, a placeholder or idle animation is displayed to indicate readiness. This avatar acts as a real-time visual interpreter, running continuously to update animations as soon as new input is received.
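The in-order message handling and idle behaviour described above amount to a first-in, first-out queue with a fallback. The following is a minimal sketch under that assumption; `AvatarQueue` and the `"IDLE"` placeholder name are hypothetical.

```python
from collections import deque
from typing import Deque


class AvatarQueue:
    """Hypothetical FIFO of text messages awaiting animation."""

    IDLE = "IDLE"  # placeholder animation shown when nothing is pending

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()

    def submit(self, text: str) -> None:
        """Queue a message from the SRSM or the SRM."""
        self._pending.append(text)

    def next_animation(self) -> str:
        """Return the oldest pending message, or the idle placeholder."""
        if self._pending:
            return self._pending.popleft()
        return self.IDLE
```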

Overall, the system enables smooth and continuous bidirectional communication. Deaf users can sign naturally, and their gestures are translated into text and voice for hearing individuals. In return, hearing individuals can speak normally, and their words are converted into sign animations, making conversations accessible in both directions. Each module operates autonomously but remains interconnected, updating and responding in real time to user input until the system is turned off or paused.

IV. SYSTEM ARCHITECTURE

Step 1: Speech Input and Recognition

Objective: Convert spoken words into text.

Tools: Google Speech Recognition API / OpenAI Whisper / Vosk (for offline use)

Process:

· Capture audio via microphone.

· Convert audio to text using a speech-to-text model.

· Output real-time transcribed text.

Step 2: Natural Language Processing (NLP)

Objective: Process the raw text into a simplified, grammatically correct form for sign language translation.

Tools: Python with spaCy or NLTK for text processing.

Process:

· Remove filler words (e.g., "uh", "like").

· Normalize sentence structure (lemmatization, POS tagging).

· Translate full sentences into simplified sign language gloss.

Example: "I am going to the market" → "I GO MARKET"
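The example above can be reproduced with a toy rule-based sketch. The word lists below are small hand-written stand-ins for the filler removal, stop-word filtering, and lemmatization that spaCy or NLTK would provide; they are assumptions for illustration only and cover just this example.

```python
import re

# Toy tables (assumptions): a real system would use an NLP library instead.
FILLERS = {"uh", "um", "like"}
DROPPED = {"a", "an", "the", "am", "is", "are", "to"}
LEMMAS = {"going": "go", "went": "go", "markets": "market"}


def to_gloss(sentence: str) -> str:
    """Convert an English sentence to a simplified upper-case gloss string."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    kept = [t for t in tokens if t not in FILLERS and t not in DROPPED]
    return " ".join(LEMMAS.get(t, t).upper() for t in kept)


print(to_gloss("Uh, I am going to the market"))
```

Treating "like" as a filler unconditionally would also delete it when it is used as a verb, which is one reason POS tagging is listed in the process above.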

Step 3: Gloss to Sign Gesture Mapping

Objective: Match each gloss word with a corresponding sign.

Approach:

· Use a sign language dictionary (like WLASL or ASLLVD) that maps glosses to gesture data.

· If animation clips are available: directly link gloss to animation.

· If not: generate hand pose sequences using a machine learning model (LSTM, RNN, etc.).
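The lookup-with-fallback approach can be sketched as follows. The clip paths, the `clip_index` dictionary, and the `generate_pose` callable (standing in for the learned pose generator) are hypothetical names, not part of WLASL, ASLLVD, or the system itself.

```python
from typing import Callable, Dict, List, Sequence, Tuple

SignPlan = List[Tuple[str, str, str]]  # (gloss, source, payload)


def map_glosses(glosses: Sequence[str],
                clip_index: Dict[str, str],
                generate_pose: Callable[[str], str]) -> SignPlan:
    """Use a stored clip when one exists; otherwise call the generator."""
    plan: SignPlan = []
    for gloss in glosses:
        clip = clip_index.get(gloss)
        if clip is not None:
            plan.append((gloss, "clip", clip))
        else:
            plan.append((gloss, "generated", generate_pose(gloss)))
    return plan
```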

Step 4: Avatar Animation Rendering

Objective: Display sign language using an animated avatar.

Tools: Unity 3D / Blender (3D avatar design and animation); WebGL / Three.js (for browser-based deployment)

Process:

· Load animation clip or generate skeletal motion based on gloss.

· Animate the avatar’s hand gestures and optionally facial expressions.

· Sync gestures with audio timing for realism.
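One possible way to align signs with audio timing is sketched below. It assumes that the speech recognizer can report a start time for each word and that each sign clip has a known duration; the function name and the scheduling rule are illustrative, not the method used in the system.

```python
from typing import List, Sequence, Tuple


def align_to_audio(word_starts: Sequence[float],
                   clip_durations: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for each sign clip.

    A sign starts when its word is spoken, but never before the
    previous sign has finished playing.
    """
    timeline: List[Tuple[float, float]] = []
    previous_end = 0.0
    for start, duration in zip(word_starts, clip_durations):
        begin = max(start, previous_end)
        previous_end = begin + duration
        timeline.append((begin, previous_end))
    return timeline
```

Because signs often take longer than the corresponding spoken words, a schedule of this kind lets the avatar fall behind the speaker rather than cut signs short.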

Step 5: System Integration and User Interface

Objective: Build a cohesive interface for user interaction.

Tools:

Frontend: HTML, CSS, JS (or Tkinter for desktop)

Backend: Python (Flask/Django) or Node.js

Process:

· Integrate all modules into a real-time pipeline.

· Display live avatar output based on spoken input.

· Provide options for language selection and playback controls.
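The integration of the modules into one pipeline can be sketched as plain function composition. The stage functions below are trivial stand-ins for the speech recognition, gloss translation, and sign mapping stages; `build_pipeline` is a hypothetical helper, not part of Flask, Django, or any other named tool.

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]


def build_pipeline(*stages: Stage) -> Stage:
    """Chain stages so that each one receives the previous stage's output."""
    def run(value: Any) -> Any:
        for stage in stages:
            value = stage(value)
        return value
    return run


# Stand-in stages (assumptions) showing the order of the real modules.
pipeline = build_pipeline(
    lambda audio: "hello world",               # speech to text
    lambda text: text.upper().split(),         # text to gloss tokens
    lambda glosses: [f"sign:{g}" for g in glosses],  # gloss to sign
)
```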

Step 6: Testing and Evaluation

Objective: Evaluate accuracy, speed, and usability.

Process:

· Test with a dataset of spoken phrases.

· Validate output against known sign language sequences.

· Gather feedback from sign language users and interpreters.
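Validation against known sign language sequences could, for instance, use a word error rate computed over gloss tokens. The sketch below uses the standard edit-distance formulation; the choice of this metric is an assumption, since the paper does not state which measure was used.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Edit distance between token sequences, divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)
```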

V. RESULTS

A. Model Evaluation Metrics

B. OBSERVATIONS

· The model achieved high accuracy in transcribing speech and translating it into gloss.

· Data augmentation (e.g., mirroring, scaling skeletons) improved generalization.

· ML-based keypoint generation produced more natural gestures than template-based animation.

· Avatar realism was rated positively by users, especially when facial expressions and fluid transitions were added.
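The mirroring and scaling augmentations mentioned above can be sketched for 2D keypoints as follows. The sketch assumes that x coordinates are normalized to the range [0, 1] and that scaling is applied about the skeleton's centroid; neither detail is stated in the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mirror(skeleton: Sequence[Point]) -> List[Point]:
    """Flip keypoints horizontally, assuming x is normalized to [0, 1]."""
    return [(1.0 - x, y) for x, y in skeleton]


def scale(skeleton: Sequence[Point], factor: float) -> List[Point]:
    """Scale keypoints about their centroid by the given factor."""
    cx = sum(x for x, _ in skeleton) / len(skeleton)
    cy = sum(y for _, y in skeleton) / len(skeleton)
    return [(cx + (x - cx) * factor, cy + (y - cy) * factor)
            for x, y in skeleton]
```

Mirroring swaps the apparent left and right hands, so if keypoints are labelled by hand the labels would need to be swapped as well.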

C. LIMITATIONS

· Real-time performance dropped slightly when rendering full-body avatars with facial expressions (~15 FPS on low-end GPUs).

· Some ambiguous glosses (e.g., “light” as noun vs. verb) confused the ML model.

· Regional sign language variations are not accounted for.

VI. CONCLUSION

The development of a real-time speech-to-sign-language translation system using animated avatars is a promising step toward bridging the communication gap between the hearing and deaf communities. By integrating speech recognition, natural language processing, gloss translation, and avatar-based sign rendering, the system provides an accessible, automated, and inclusive communication tool. The project converts spoken English into simplified sign language gloss, maps those glosses to corresponding signs, and animates them using a virtual avatar, allowing deaf users to understand spoken content visually without relying on human interpreters. The use of machine learning enhances the system's ability to adapt, to process language accurately, and to support scalable enhancements in the future. While some challenges remain, such as handling complex grammar, supporting multiple sign languages, and incorporating facial expressions, the project lays a strong foundation for future expansion. With further refinement, this system can be deployed in educational institutions, public service environments, and communication apps, contributing significantly to accessibility and social inclusion.
