Instagram Video Translation System Overview Talk by Meta
Scaling AI Translations at Meta: tackling latency and media processing challenges (March 2025)
I came across a really interesting tech talk from an engineer at Meta going over how the video translation system implemented for Instagram Reels works.
Video: https://www.youtube.com/watch?v=wJOjVr2_WH0
Title: Scaling AI translations at Meta: tackling latency and media processing challenges
Speaker: Jordi Cenzano (Software Engineer, Meta)
Topic: Meta AI’s media translation system with audio and video (lip-sync) translations
Conference: 2025/03/27 MAD Video Tech.
Overview
This talk from an engineer at Meta goes through how they built the automatic video translation feature in Instagram.
The primary model is the open-source SeamlessM4T model, but the overall feature requires pre- and post-processing with other tools and models to be feature complete.
Models Used
The pipeline uses more than 10 models, including:
- Voice-Ambient Audio Separation Model: separates speech from background noise to produce clean source audio (a rough open-source stand-in is sketched after this list).
- Language Detection Model: determines whether the audio is in a supported language at all.
- Sentence Segmentation Model: Seamless is designed for short 10-15 s audio clips, so the input audio needs to be broken into small chunks that still contain enough context for an accurate translation.
- Seamless Audio-to-Audio Translation Model: performs the actual translation from the source language into the target language (see the translation sketch after this list).
- LipSync Model: takes the translated clean audio plus the original video and generates lip-synced video.
- Toxicity Filter / Hallucination Detection Model: attempts to detect whether the translation is more toxic than the input, which would suggest the translation was incorrect.
The talk outlines that more than 10 ML models are used, though only Seamless is named directly.
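As a rough illustration of the speech/ambient separation step, here is a sketch that uses Demucs, an open-source source-separation tool, as a stand-in. The talk does not say which separation model Meta actually uses, and the file paths below are hypothetical.

```python
# Sketch: split a track into speech ("vocals") and everything else using the
# open-source Demucs CLI as a stand-in for Meta's (unnamed) separation model.
import subprocess
from pathlib import Path

def separate_speech(audio_path: str, out_dir: str = "separated") -> tuple[Path, Path]:
    """Return paths to the (speech, ambient) stems produced by Demucs."""
    subprocess.run(
        ["demucs", "--two-stems", "vocals", "-o", out_dir, audio_path],
        check=True,
    )
    # Default Demucs layout: <out_dir>/htdemucs/<track name>/{vocals,no_vocals}.wav
    stem_dir = Path(out_dir) / "htdemucs" / Path(audio_path).stem
    return stem_dir / "vocals.wav", stem_dir / "no_vocals.wav"
```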
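And here is a minimal sketch of the core translation step using the public SeamlessM4T v2 checkpoint from Hugging Face transformers. The fixed-length chunking is a crude stand-in for the sentence segmentation model described above, and none of this reflects Meta's internal serving setup; model ID, chunk length, and target language are just illustrative defaults.

```python
# Minimal sketch: chunked speech-to-speech translation with the public
# SeamlessM4T v2 checkpoint (not Meta's internal pipeline).
import torch
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model

MODEL_ID = "facebook/seamless-m4t-v2-large"  # public Hugging Face checkpoint
SAMPLE_RATE = 16_000                         # Seamless expects 16 kHz mono input
CHUNK_SECONDS = 12                           # crude stand-in for sentence segmentation

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = SeamlessM4Tv2Model.from_pretrained(MODEL_ID)

def translate_speech(path: str, tgt_lang: str = "spa") -> torch.Tensor:
    """Translate a clean, single-speaker speech file chunk by chunk."""
    waveform, orig_sr = torchaudio.load(path)
    waveform = torchaudio.functional.resample(waveform, orig_sr, SAMPLE_RATE)
    waveform = waveform.mean(dim=0)          # downmix to mono

    translated = []
    chunk_len = CHUNK_SECONDS * SAMPLE_RATE
    for start in range(0, waveform.numel(), chunk_len):
        chunk = waveform[start:start + chunk_len]
        inputs = processor(audios=chunk.numpy(), sampling_rate=SAMPLE_RATE,
                           return_tensors="pt")
        # generate() returns the translated waveform when generate_speech=True (default)
        out = model.generate(**inputs, tgt_lang=tgt_lang)[0]
        translated.append(out.squeeze().cpu())
    return torch.cat(translated)

# Example usage:
# torchaudio.save("translated.wav", translate_speech("clean_speech.wav").unsqueeze(0), SAMPLE_RATE)
```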
Translation Pipeline Overview
- Upload
  - User uploads a video via Instagram or Facebook.
  - The video is uploaded to Meta's distributed storage system.
- Audio Translation
  - Audio extraction: the audio track is demuxed and decoded into the format needed for processing (AAC → PCM); a rough ffmpeg sketch follows this overview.
  - Pre-processing:
    • Clean speech extraction (background noise removal)
    • Language detection
    • Sentence segmentation
  - Seamless audio translation:
    • Input: short PCM audio segments
    • Output: translated PCM audio
  - Post-processing:
    • Re-align segments (time alignment across languages)
    • (Implied: the video is also adjusted slightly if more time is needed in the target language?)
    • Recombine segments
    • Re-add ambient noise
    • Re-encode audio (PCM → AAC)
- Video (LipSync)
  - Uses the clean translated audio and the original video.
  - LipSync model:
    • Generates a new video with lip movement synced to the translated audio
  - Post-processing:
    • Overlay a visible watermark (e.g., a visible tag marking AI-generated content)
    • Encode and mux the final audio/video; a muxing sketch follows this overview
- Safety and Approval
  - Toxicity filter checks both the input and the output
  - Invisible watermarking
  - Optional user approval flow before publishing
  - Feedback system for corrections/removals
- Playback Logic
  - Language-aware serving (a small sketch of the decision follows this overview):
    • Prefers the original audio if it matches the viewer's device language
    • Otherwise auto-serves the translated/lip-synced audio and video
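To make the media handling steps concrete, here is a sketch of what the AAC → PCM extraction and the PCM → AAC re-encode could look like with ffmpeg driven from Python. The file names, sample rate, and bitrate are assumptions; the talk does not describe Meta's internal media tooling.

```python
# Rough sketch of the decode/encode steps around the translation models,
# using ffmpeg from Python. File names and settings are hypothetical.
import subprocess

def extract_pcm(video_in: str, pcm_wav_out: str) -> None:
    """Demux the AAC track and decode it to 16 kHz mono PCM for the ML models."""
    subprocess.run([
        "ffmpeg", "-y", "-i", video_in,
        "-vn",                 # drop the video stream
        "-ac", "1",            # mono
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # signed 16-bit PCM in a WAV container
        pcm_wav_out,
    ], check=True)

def encode_aac(pcm_wav_in: str, aac_out: str) -> None:
    """Re-encode the recombined translated audio (speech + ambient) back to AAC."""
    subprocess.run([
        "ffmpeg", "-y", "-i", pcm_wav_in,
        "-c:a", "aac", "-b:a", "128k",
        aac_out,
    ], check=True)
```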
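Similarly, the final assembly step (visible watermark overlay plus muxing the translated audio with the lip-synced video) could be expressed as a single ffmpeg filter graph. The paths, watermark asset, and codec settings below are purely illustrative.

```python
# Sketch of the final assembly: overlay a visible AI-content watermark on the
# lip-synced video and mux in the translated audio track. Paths are hypothetical.
import subprocess

def overlay_and_mux(lipsync_video: str, translated_audio: str,
                    watermark_png: str, out_mp4: str) -> None:
    subprocess.run([
        "ffmpeg", "-y",
        "-i", lipsync_video,     # video generated by the lip-sync model
        "-i", translated_audio,  # translated AAC audio track
        "-i", watermark_png,     # visible AI-content tag
        # Place the watermark 20 px from the bottom-right corner of the video.
        "-filter_complex", "[0:v][2:v]overlay=W-w-20:H-h-20[v]",
        "-map", "[v]", "-map", "1:a",
        "-c:v", "libx264", "-c:a", "copy",
        out_mp4,
    ], check=True)
```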
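Finally, the language-aware serving decision boils down to a simple preference rule. The sketch below uses made-up field names purely to illustrate the logic described in the talk.

```python
# Toy sketch of the language-aware serving decision under "Playback Logic".
# Field names and structure are made up for illustration.
from dataclasses import dataclass

@dataclass
class Rendition:
    language: str      # language code, e.g. "en", "es"
    translated: bool   # True if this is an AI-translated / lip-synced rendition

def pick_rendition(renditions: list[Rendition], original_lang: str,
                   viewer_lang: str) -> Rendition:
    """Prefer the original when it matches the viewer's language,
    otherwise fall back to a translated rendition in that language."""
    if viewer_lang == original_lang:
        return next(r for r in renditions if not r.translated)
    for r in renditions:
        if r.translated and r.language == viewer_lang:
            return r
    # No translation available for this viewer: serve the original.
    return next(r for r in renditions if not r.translated)
```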
Future Improvements
The talk also mentioned several areas that need improvement:
- Add support for more languages (Seamless supports over 100 languages, but the overall system is more limited)
- Support multi-speaker content (currently this system only supports a single speaker)
- Improve speaker voice cloning (matching the emotion and intonation of the original speaker)
- Better support for videos with background music (auto-detect which song is playing to help separate and re-add the audio)
- Improve latency and real-time deployment (make it faster)
- Enhance evaluation via better metrics & automation
Given that the base audio-to-audio translation model has been open source for some time, the main new information in the talk is how such a model is productised and how much pre- and post-processing is needed to deploy it in a real end-user product.