Embodiment - Part 1
A D-ID-like pipeline on your local machine
I was delighted to discover an Internet post creating an embodiment of an AI: a “talking head” video where the text, the speaker’s voice, and the video of the speaker actually talking were all the result of AI synthesis. Do check it out - click through to the full tweet for the video:
1 WebUIでAI画像を生成
2 ChatGPTで文章を生成(合成音声AIについての説明)
3 COEIROINKでの合成音声AI(25/100epoch)を作成し、文章を音声出力
4 D-IDで1,3を読み込ませてAIによる自動口パク・動画化
コエイロインクが学習途中データだからまだまだガビってるけど面白い。 pic.twitter.com/yVB8A9pTLD
— 852話 (@8co28) February 15, 2023
In English, this was the author’s workflow:
- Generate an AI image with WebUI
- Generate text with ChatGPT (an explanation of synthetic voice AI)
- Create a voice AI with COEIROINK and synthesize the voice from the text
- Use D-ID for lipsync and speaking animation

(Their closing remark: the COEIROINK voice is still glitchy, since the model is only partway through training, but the result is fun.)
I have a capable graphics card, so I wondered if I could achieve something similar, running locally, for free. I succeeded:
The duration of our passions is no more dependant upon us than the duration of our life.
My workflow:
- Generate an AI image with a local AUTOMATIC1111 WebUI
- I don’t have an LLM running locally (yet), so I made do with “Maxims” - a collection of 500 or so aphorisms left to us by François, Duc de La Rochefoucauld. They have a winning combination of being brief and being enjoyable to read and reflect on.
- Create a voice AI with TorToiSe. To lean into spoken-word synthesis, I did not build my training data from any living person’s recordings, but took advantage of SynthV Karin instead.
- Use SadTalker for lipsync and speaking animation (a sketch of the whole chain follows below)
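To make the glue between the stages concrete, here is a minimal sketch of how the three tools can be chained from one Python script. It assumes the AUTOMATIC1111 WebUI is running locally with its `--api` flag, that tortoise-tts is installed with a custom voice folder (here hypothetically named `karin`) under `tortoise/voices/`, and that SadTalker is checked out in a local `SadTalker` directory; the prompt, file names, and paths are all placeholders rather than the exact settings I used.

```python
# pipeline.py - a sketch of the image -> voice -> video chain.
# Assumptions: AUTOMATIC1111 WebUI started with --api on port 7860,
# tortoise-tts installed (with a "karin" folder of reference WAV clips
# under tortoise/voices/), and a local SadTalker checkout.
import base64
import subprocess

import requests
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

TEXT = ("The duration of our passions is no more dependant "
        "upon us than the duration of our life.")

# 1. Generate the speaker portrait via the WebUI txt2img endpoint.
resp = requests.post(
    "http://127.0.0.1:7860/sdapi/v1/txt2img",
    json={"prompt": "portrait photo of a woman facing the camera", "steps": 30},
)
with open("speaker.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))

# 2. Synthesize the maxim with TorToiSe, conditioned on the custom voice.
tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("karin")
speech = tts.tts_with_preset(
    TEXT,
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)
torchaudio.save("maxim.wav", speech.squeeze(0).cpu(), 24000)

# 3. Animate the portrait with SadTalker's inference script.
subprocess.run(
    [
        "python", "inference.py",
        "--source_image", "../speaker.png",
        "--driven_audio", "../maxim.wav",
        "--result_dir", "../results",
    ],
    cwd="SadTalker",  # path to the local SadTalker checkout
    check=True,
)
```

Running it as one script keeps the intermediate artifacts (`speaker.png`, `maxim.wav`) around on disk, which is handy when you want to iterate on one stage without re-running the others.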
My hope is to share the details of setting up this stack locally, along with my learnings, in future posts, so please stay tuned for Part 2, which will cover local setup and iterative development.