We present Livatar, a real-time audio-driven talking-head video generation framework. Existing baselines suffer from limited lip-sync accuracy and long-term pose drift. We address these limitations with a tailored flow matching framework. Coupled with system optimizations, Livatar achieves state-of-the-art lip-sync quality, with a LipSync Confidence of 8.50 on the HDTF dataset, and reaches a throughput of 141 FPS with an end-to-end latency of 0.17 s on a single A10 GPU. This makes high-fidelity avatars accessible to a broader range of applications. Our project page is available at https://www.hedra.com/.
Livatar supports a wide variety of character styles, including anime, cartoon, photorealistic, and oil painting aesthetics, among others.
Livatar supports the animation of reference images with diverse facial orientations, including subjects facing forward, in profile, or with turned heads.
Livatar supports character animation across a wide range of ages and genders, including elderly, middle-aged, young adult, and adolescent subjects, and is robust to variations in ethnicity, facial morphology, and demographic attributes.