Crazy Joe Davola
I used suno-ai/bark and haoheliu/audio-ldm to replace all the audio from a scene in Seinfeld. Of course, I chose a scene from my favorite episode, Crazy Joe Davola. Here are the results and how I made them.
Step 1: Download
First I downloaded the video clip using pytube.
python -m pip install pytube
pytube https://www.youtube.com/watch?v=rrtwWraQDn8
open .
Then I wrote down the audio I'd need to generate:
[italian opera singing]
[seinfeld music]
hi clown, make us laugh clown, make me laugh clown!
[fighting sounds]
[groans]
---
paliacci, paliacci, tragic clown
what did you say?
what are you, a cop?
no, I'm a clown.
You look familiar.
You ever been to the circus?
When I was a kid.
Did you like it?
Uh, well, it was fun, but I was kind of scared of the clowns.
Are you still scared of clowns?
...Yeaaah
[audience laugh effect]
Step 2: Generate and clip together
Then, I pulled the mp4 into iMovie and clipped out the audio. For each sound effect / line, I generated the audio using the bark colab notebook or the Replicate GUI for audio-ldm.
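If you would rather script the assembly than drag clips around in an editor, placing each generated clip at a start time and summing the overlaps can be sketched as below. This is a minimal sketch, not what was done for this video: `mix_on_timeline` is a hypothetical helper, and it assumes every clip is mono, stored as floats between -1 and 1, and already at the same sample rate (clips from different models may need resampling first).

```python
def mix_on_timeline(clips, sample_rate, total_seconds):
    """Sum mono clips into one track, each starting at its own offset.

    clips: list of (start_seconds, samples) pairs, with samples as floats in [-1, 1].
    """
    track = [0.0] * int(round(total_seconds * sample_rate))
    for start_seconds, samples in clips:
        offset = int(round(start_seconds * sample_rate))
        for i, sample in enumerate(samples):
            position = offset + i
            if position < 0:
                continue  # skip anything placed before the start of the scene
            if position >= len(track):
                break  # drop anything that runs past the end of the scene
            track[position] += sample
    # Overlapping clips can sum past full scale, so clamp to [-1, 1].
    return [max(-1.0, min(1.0, s)) for s in track]
```

Nudging a clip earlier or later is then just a matter of changing its `start_seconds` and mixing again.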
It took some trial and error to line the audio up with the lip movements. The end result looked like:
Prompts
italian opera singing (suno-ai/bark colab notebook)
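from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

preload_models()  # download and cache the model weights on first run
text_prompt = """
♪ ridi, Pagliaccio, e ognun applaudirà! ♪
"""
audio_array = generate_audio(text_prompt)
Audio(audio_array, rate=SAMPLE_RATE)  # play the clip inline in the notebook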
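The Audio widget only plays the clip inside the notebook, so to use it in a video editor you need a file on disk. Here is a minimal sketch using only Python's standard library. It assumes `audio_array` is a one-dimensional sequence of floats between -1 and 1, and `save_wav` is a hypothetical helper name, not part of bark.

```python
import struct
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        # Scale to the signed 16-bit range and pack as little-endian.
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

In the notebook this would be called as `save_wav("opera.wav", audio_array, SAMPLE_RATE)`. If SciPy is installed, `scipy.io.wavfile.write` does the same job in one line.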
jazz brush beat (Replicate)
import replicate

replicate.run(
    "haoheliu/audio-ldm:b61392adecdd660326fc9cfc5398182437dbe5e97b5decfb36e1a36de68b5b95",
    input={"prompt": "upbeat jazz beat brush", "guidance_scale": 10, "duration": 2.5},
)
guys in park
text_prompt = """MAN: Hi clown. MAN: Make us laugh clown. MAN: Make me laugh clown!"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_2")
Audio(audio_array, rate=SAMPLE_RATE)
fighting sounds
replicate.run(
    "haoheliu/audio-ldm:b61392adecdd660326fc9cfc5398182437dbe5e97b5decfb36e1a36de68b5b95",
    input={"prompt": "a group of men groaning outside", "guidance_scale": 10, "duration": 2.5},
)
sitcom audience
replicate.run(
    "haoheliu/audio-ldm:b61392adecdd660326fc9cfc5398182437dbe5e97b5decfb36e1a36de68b5b95",
    input={"prompt": "upbeat jazz beat brush", "guidance_scale": 2.5, "duration": 5},
)
kramer
Note that I split up the Kramer / Joe Davola text. It didn't work well when I asked bark to do the whole conversation together. I got decently consistent results with NARRATOR prepended to each line.
text_prompt = """
NARRATOR: paliacci, paliacci, tragic clown
NARRATOR: what are you, a cop?
NARRATOR: You look familiar.
NARRATOR: When I was a kid.
NARRATOR: Uh, well, it was fun, but I was kind of scared of the clowns.
NARRATOR: ...Yeaaah
"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_1")
Audio(audio_array, rate=SAMPLE_RATE)
crazy joe davola
text_prompt = """
MAN: what did you say?
MAN: no, I'm a clown.
MAN: You ever been to the circus?
MAN: Did you like it?
MAN: Are you still scared of clowns?
"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_1")
Audio(audio_array, rate=SAMPLE_RATE)
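Splitting the conversation by hand gets tedious for longer scenes. The per-speaker prompts above could be built from the scene's dialogue with something like the following. It is a minimal sketch: `prompts_by_speaker` and the speaker keys are hypothetical names, and the only thing taken from the prompts above is the pattern of one tag plus a colon in front of every line.

```python
def prompts_by_speaker(dialogue, tags):
    """Group a scene's lines per speaker and prefix each line with a tag.

    dialogue: list of (speaker, line) pairs in scene order.
    tags: mapping from speaker to the label placed before each line.
    """
    grouped = {}
    for speaker, line in dialogue:
        grouped.setdefault(speaker, []).append(f"{tags[speaker]}: {line}")
    # One multi-line prompt per speaker, ready to hand to a text-to-speech call.
    return {speaker: "\n".join(lines) for speaker, lines in grouped.items()}


scene = [
    ("kramer", "what are you, a cop?"),
    ("davola", "no, I'm a clown."),
    ("kramer", "You look familiar."),
    ("davola", "You ever been to the circus?"),
]
prompts = prompts_by_speaker(scene, {"kramer": "NARRATOR", "davola": "MAN"})
```

Each value in `prompts` can then be passed as the text prompt for that speaker, so every character is generated in its own call.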
Take-aways
- When bark works, it's incredible. When it doesn't, or something is off, it's terrifying. I would say I get amazing results ~20% of the time. It can take a few tries to get things right.
- I'm not sure how, but bark added the eerie music behind Joe Davola's lines by itself. Did it realize he was supposed to be creepy?
- Sometimes it helps to have MAN/NARRATOR/WOMAN: prepended. Sometimes it doesn't.
- You can enter sound effects in bark, like [gasp] and [laughter], and I even got it to work with [answering machine beep]. But that tends to make it less likely that the rest of the audio generation sounds good.
- I'm really happy this is open source. This is definitely the most exciting open source text-to-audio model. I'm sure it will get better fast.
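To put the ~20% figure in perspective, here is a rough back-of-the-envelope calculation of how many generations a usable take might cost. It assumes each attempt is independent and succeeds with the same probability, which is a simplification; the function names are hypothetical.

```python
import math


def chance_of_a_good_take(success_rate, tries):
    """Probability that at least one of `tries` independent attempts is good."""
    return 1 - (1 - success_rate) ** tries


def tries_needed(success_rate, target):
    """Smallest number of attempts whose chance of a good take reaches `target`."""
    return math.ceil(math.log(1 - target) / math.log(1 - success_rate))
```

Under those assumptions, five tries at a 20% success rate give about a 67% chance of at least one good take, and it takes eleven tries to reach 90%.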