
LipSync, Sound to Pose Morphs, Need suggestions.

n00bi

Active Member
Nov 24, 2022
575
652
Howdy.

I am experimenting with some sounds and lip-sync, and this is an area I have no clue how to approach.
Currently I am just messing about with misc things.
Basically, I am adjusting morph sliders based on sound.

What I have done so far is use an online AI to separate the vocals and instruments.
But it's not free, well, the 1st sample is free :p

This is kind of working, but it has issues.

For example:
"Beeep" creates a mouth expression like === , i.e. the mouth is horizontally stretched while the pressure is on the E.
"Baaap" creates a mouth expression like -0- , i.e. the mouth is shaped like an O while the pressure is on the A.

Currently I have no way to tell them apart in the sound.
Since the vocal is in its own sound file, I was wondering if one could use stereo:
the left channel for the === expressions, the right channel for the -0- expressions.

But the problem still remains: how to tell them apart.
I have also tried openai-whisper to extract the lyrics,
but I'm not sure where to go from there.
Words need to be timed, etc.
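(For what it's worth, whisper can emit word-level timestamps, which covers the timing part. A minimal sketch; model size and file name are placeholders:)

Python:
import whisper  # pip install openai-whisper

# "vocal.mp3" would be the separated vocal track from the earlier step
model = whisper.load_model("base")
result = model.transcribe("vocal.mp3", word_timestamps=True)

# Each segment carries a list of words with start/end times in seconds
for segment in result["segments"]:
    for w in segment["words"]:
        print(f'{w["start"]:6.2f}s - {w["end"]:6.2f}s  {w["word"]}')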

My current experiments:
A short snippet of music from YT, so it might be muted due to copyright, let me know :p


Anyway, I am not asking for C4D advice. The pose morphs I have used are from Daz ("Vis AA", etc.).
I just want some suggestions and ideas from others on how it can be done.
Any ideas are welcome.
 

osanaiko

Engaged Member
Modder
Jul 4, 2017
2,930
5,597
I recall having in my Daz install a set of morph sliders for each of the English phonemes that had mouth/face effects. Each control was literally called "Oh" "ahhh" "sss" or similar. It might have been only for G3F or perhaps an even older generation.
 

n00bi

Active Member
Nov 24, 2022
575
652
Yea, but the main problem is how to extract the "phonemes" from a voice.
One idea I had was to use openai-whisper on the vocal.mp3 to get the lyrics as text,
"text as a subtitle", figuratively speaking, because then I'd have the timestamp of each word.
But the problem with lyrics is that words are sometimes spelled differently from how you pronounce them.
Also, it just gave me plain text, so it's not like a subtitle track for the voice.
I could always make up a small DB, idk.
 

osanaiko

Engaged Member
Modder
Jul 4, 2017
2,930
5,597
There are definitely phonemic mappings for words in common online dictionaries; that's how the automated pronunciation playback is implemented. I'd imagine with some searching around you could surely find a dataset of English words with their phonetic mappings.
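For example, the CMU Pronouncing Dictionary is one such dataset and ships with NLTK. A minimal sketch of a lookup (it returns ARPABET phones, with stress digits attached to the vowels):

Python:
import nltk
nltk.download("cmudict", quiet=True)
from nltk.corpus import cmudict

pronunciations = cmudict.dict()
print(pronunciations["father"])  # [['F', 'AA1', 'DH', 'ER0']]
print(pronunciations["boot"])    # [['B', 'UW1', 'T']]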
 

n00bi

Active Member
Nov 24, 2022
575
652
I think I am going down the wrong rabbit hole here.
The whole approach of using openai-whisper to get the text from a sound sample,
then converting that to phonemic mappings by checking words against a dict, is wrong.

It's overcomplicated, and I am not really interested in the words, so translating words to vowels/phonemes (or whatever the term is) is just a waste of time.
And it kind of locks you down to one language.

The mouth makes a general posture based on what vowel/phoneme you are speaking,
and vowels/phonemes are related to frequency, not language (to some degree language and dialect matter).
But it doesn't matter whether you speak Arabic, Chinese, or German: try to sing/speak an "AAAA" tone or an "OOOO" tone,
and it's the same general mouth posture across all those languages.

In general the frequencies are like this (these are not written in stone and vary from person to person):

Vowel            F1 (Hz)    F2 (Hz)
/A/ ("father")   700-900    1000-1300
/O/ ("bought")   400-600    800-1100
/U/ ("boot")     300-500    700-1000
/E/ ("bet")      400-600    1800-2300
/I/ ("tree")     200-400    2200-3000
/æ/ ("cat")      600-800    1700-2500
/ɛ/ ("bed")      400-600    1900-2500
/ʌ/ ("cup")      500-700    1400-2000
/ɪ/ ("bit")      300-500    1800-2500
/ʊ/ ("book")     400-600    1000-1500

F1 and F2 are known as formants.
In short, the roles of F1 (first formant) and F2 (second formant) are as follows (a small lookup sketch comes after the list):
  • F1: This formant is largely related to the height of the tongue (how high or low it is in the mouth) and how open the mouth is. A higher F1 indicates that the mouth is more open (as in vowels such as /a/), and a lower F1 indicates a more closed mouth (as in /i/).
  • F2: This formant is linked to the position of the tongue from front to back. A higher F2 is associated with front vowels (like /i/ and /e/), where the tongue is positioned toward the front of the mouth. A lower F2 is found in back vowels (like /u/ and /o/), where the tongue is further back.
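So once F1/F2 are measured, one way to pick a posture is a nearest-center lookup against the table above. A minimal sketch (the centers are just the midpoints of the ranges; a real version would probably weight the two axes differently, since F2 spans a much wider range):

Python:
# Midpoints of the F1/F2 ranges from the table above (Hz)
VOWEL_CENTERS = {
    "A": (800, 1150),   # "father"
    "O": (500, 950),    # "bought"
    "U": (400, 850),    # "boot"
    "E": (500, 2050),   # "bet"
    "I": (300, 2600),   # "tree"
}

def classify_vowel(f1, f2):
    # Nearest vowel center in (F1, F2) space by squared distance
    return min(VOWEL_CENTERS,
               key=lambda v: (VOWEL_CENTERS[v][0] - f1) ** 2 +
                             (VOWEL_CENTERS[v][1] - f2) ** 2)

print(classify_vowel(750, 1200))  # -> "A"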

I wrote a small script using Python and matplotlib to analyze the voice and get the gist of what posture the mouth has.
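Roughly, the idea is something along these lines (a sketch, not the exact script; it assumes scipy for reading the WAV):

Python:
import matplotlib.pyplot as plt
from scipy.io import wavfile

rate, samples = wavfile.read("song-vocal.wav")
if samples.ndim > 1:           # mix stereo down to mono
    samples = samples.mean(axis=1)

# Spectrogram of the vocal; the F1/F2 bands of interest sit below ~3 kHz
plt.specgram(samples, Fs=rate, NFFT=1024, noverlap=512)
plt.ylim(0, 3500)
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()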


Now I am trying to mess with FFmpeg to separate the vowels/phonemes into their own sound clips.

Python:
import subprocess

fileName = "song-vocal.wav"
outdir = "snd/"

# Metadata
artist = "x"
title = "x"

# High front vowel /I/ ("tree"), closed mouth:
# F1 in range 200 to 400 Hz, F2 in range 2200 to 3000 Hz
f1_width = (400 - 200) / 2
f1_band = 400 - f1_width   # center freq of the F1 bandpass
f2_width = (3000 - 2200) / 2
f2_band = 3000 - f2_width  # center freq of the F2 bandpass
# width_type=h makes ffmpeg interpret w in Hz (the default width_type
# is a Q-factor, which would make these bands far too narrow)
bp1 = f"bandpass=f={f1_band}:width_type=h:w={f1_width}"
bp2 = f"bandpass=f={f2_band}:width_type=h:w={f2_width}"

# Isolate the F1 band
file1 = "I_l_out.wav"
cmd1 = ["ffmpeg", "-y", "-i", fileName, "-af", bp1,
        "-metadata", f"IART={artist}", "-metadata", f"INAM={title}",
        outdir + file1]
subprocess.run(cmd1)

# Isolate the F2 band
file2 = "I_h_out.wav"
cmd2 = ["ffmpeg", "-y", "-i", fileName, "-af", bp2,
        "-metadata", f"IART={artist}", "-metadata", f"INAM={title}",
        outdir + file2]
subprocess.run(cmd2)

# Combine the two bands back into one clip
output_file = "I.wav"
cmd3 = [
    "ffmpeg",
    "-y",
    "-i", outdir + file1,
    "-i", outdir + file2,
    "-filter_complex", "amix=inputs=2:duration=longest",
    "-c:a", "pcm_s16le",  # WAV output
    outdir + output_file,
]
subprocess.run(cmd3)
This is producing some weird results.
I have also tried messing with the EQ:

Python:
# ... same setup (fileName, formant bands, metadata) as above ...

# g is the gain in dB; with g=0 the equalizer leaves the band
# untouched, so a nonzero boost is needed for it to do anything
eq1 = f"equalizer=f={f1_band}:width_type=h:w={f1_width}:g=15"
eq2 = f"equalizer=f={f2_band}:width_type=h:w={f2_width}:g=15"
# Combine filters
filter_chain = f"{eq1},{eq2}"
outputFile = "I_out.wav"
cmd = ["ffmpeg", "-y", "-i", fileName, "-af", filter_chain,
       "-metadata", f"IART={artist}", "-metadata", f"INAM={title}",
       outdir + outputFile]
subprocess.run(cmd)
Anyway, I think this is a more correct approach.
Once I have the files, I can merge/overlay the sound clips on top of each other and make 3 or 4 main clips
for the most dominant postures.
I will then start the samples at the same time and blend the morphs based on the strength each sample has, as sketched below.
Still a lot to do.
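Blending the morphs from clip strengths could look something like this (a sketch, assuming numpy/scipy; snd/A.wav and snd/O.wav are hypothetical clips produced by the same bandpass pipeline, so all envelopes have the same length):

Python:
import numpy as np
from scipy.io import wavfile

def rms_envelope(path, frame_len=1024):
    """Short-time RMS of a clip, one value per frame."""
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float64)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    return rate, np.sqrt((frames ** 2).mean(axis=1))

# One band-filtered clip per mouth posture
rate, env_i = rms_envelope("snd/I.wav")
_,    env_a = rms_envelope("snd/A.wav")
_,    env_o = rms_envelope("snd/O.wav")

# Normalize per frame so the weights sum to 1; each row can then
# drive the corresponding morph slider over time
stack = np.vstack([env_i, env_a, env_o])
weights = stack / (stack.sum(axis=0) + 1e-9)
frame_time = 1024 / rate  # seconds per weight sample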

I was also thinking about how to connect to Daz, for future reference.
C4D can communicate with external programs through its scripting language (Python), or a C++ plugin for more hardcore stuff.
But from the quick looks of it, Daz seems limited in what it can do with external apps. I'm not sure how stripped down the JavaScript stuff is. Can it do named pipes? UDP sockets?
Anyway, those were just thoughts, as I am mainly using C4D.
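(For what it's worth, since C4D's embedded Python ships with the full standard library, a plain UDP socket is one low-friction option for streaming weights in from the analysis script; whether DazScript exposes anything comparable, I don't know. A minimal sketch of the sender side; the port, address, and morph names are made up:)

Python:
import json
import socket

# Hypothetical: stream per-frame morph weights to a listener at
# localhost:9000, e.g. a Python script polling inside C4D
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_weights(frame, weights):
    payload = json.dumps({"frame": frame, "morphs": weights})
    sock.sendto(payload.encode("utf-8"), ("127.0.0.1", 9000))

send_weights(0, {"Vis_AA": 0.7, "Vis_OH": 0.2, "Vis_IY": 0.1})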
 
Last edited:

osanaiko

Engaged Member
Modder
Jul 4, 2017
2,930
5,597
You're obviously going much deeper into this question than my own poor understanding encompasses, so all credit to you for thoroughness and practical experimentation.

However my point was this: the whole reason for the existence of the International Phonetic Alphabet is to provide a commonly agreed notation for the range of sounds present in human speech. Some of these sounds map to various shapes of the lips and mouth; the rest could be ignored. So I was suggesting that "mapping from word -> IPA -> filter to those sounds that have a visible effect -> link to animation presets on the timeline" could be a way forward (a rough sketch of the filter step is below). I'd bet dollars to peanuts that there are existing academic papers on this exact process for animating speaking 3D avatars, although whether there is a practical way to use that in your project is another matter.
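The filter step might look something like this sketch, using ARPABET phones (as returned by the CMU dict above) and a hand-rolled, purely illustrative viseme grouping; real pipelines use finer, standardized sets:

Python:
# Illustrative ARPABET -> viseme grouping (not a standard)
PHONE_TO_VISEME = {
    "AA": "open",  "AE": "open",  "AH": "open",
    "AO": "round", "OW": "round", "UW": "round",
    "IY": "wide",  "IH": "wide",  "EH": "wide",
    "M": "closed", "B": "closed", "P": "closed",
}

def word_to_visemes(phones):
    """Keep only phonemes with a visible mouth shape."""
    stripped = (p.rstrip("012") for p in phones)  # drop stress digits
    return [PHONE_TO_VISEME[p] for p in stripped if p in PHONE_TO_VISEME]

print(word_to_visemes(["F", "AA1", "DH", "ER0"]))  # -> ['open']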

Regarding DazScript (as you say, a weirdly constrained JavaScript implementation), there is very little doco around these days, as they purged their old CMS from the interwebs, but you can find some stuff from their site back in the mid-2010s using the Internet Archive Wayback Machine. I don't know what, if any, method exists for interfacing with external programs. The resource I'd consult for anything complex to do with Daz scripting is "mCasual", as he has made lots of interesting stuff over the years and shared it publicly:
 