Turn Voice Recordings into Shareable Videos with Python and FFmpeg

October 25, 2021
Written by Carlos Mucuho

In this tutorial, we are going to learn how to build an application with Python and FFmpeg that will allow us to turn voice recordings into cool videos that can be easily shared on social media.

At the end of the tutorial we will have turned a voice recording into a video that looks similar to the following:

Project demo

Tutorial requirements

To follow this tutorial you are going to need the following components:

  • One or more voice recordings that you want to convert to videos. Programmable Voice recordings stored in your Twilio account work great for this tutorial.
  • Python 3.6+ installed.
  • FFmpeg version 4.3.1 or newer installed.

Creating the project structure

In this section, we will create our project directory, and inside this directory, we will create sub-directories where we will store the recordings, images, fonts, and videos that will be used in this tutorial. Lastly, we will create the Python file that will contain the code that will allow us to use FFmpeg to create and edit a video.

Open a terminal window and enter the following commands to create the project directory and move into it:

mkdir twilio-turn-recording-to-video
cd twilio-turn-recording-to-video

Use the following commands to create four subdirectories:

mkdir images
mkdir fonts
mkdir videos
mkdir recordings

The images directory is where we will store the background images of our videos. Download this image, and store it in the images directory with the name bg.png. This image was originally downloaded from Freepik.com.

In the fonts directory we will store font files used to write text in our videos. Download this font, and store it in the fonts directory with the name LeagueGothic-CondensedRegular.otf. This font was originally downloaded from fontsquirrel.com.

The videos directory will contain videos and animations that will be added on top of the background image. Download this video of a spinning record with the Twilio logo in the center, and store it in the videos directory with the name spinningRecord.mp4. The source image used in this video was downloaded from flaticon.com.

The recordings directory is where we will store the voice recordings that will be turned into videos. Add one or more voice recordings of your own to this directory.

Now that we have created all the directories needed, open your favorite code editor and create a file named main.py in the top-level directory of the project. This file will contain the code responsible for turning our recordings into videos.

If you don’t want to follow every step of the tutorial, you can get the complete project source code here.

Turning an audio file into a video

In this section, we are going to add the code that will allow us to turn a recording into a video that shows the recording’s sound waves.

We are going to use FFmpeg to generate a video from an audio file. In order to call FFmpeg and related programs from Python, we are going to use Python's subprocess module.

Running a command

Add the following code inside the main.py file:

import subprocess


def run_command(command):
    p = subprocess.run(
        command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    print('Done!!!')
    print('stdout:\n{}'.format(p.stdout.decode()))

    return p.stdout.decode().strip()

In the block of code above, we have imported the subprocess module and created a run_command() function. As the name suggests, this function is responsible for running a command that is passed in the argument. When the command completes, we print the output and also return it to the caller.
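To see this pattern in isolation before wiring it up to FFmpeg, here is a minimal, self-contained sketch that exercises the same subprocess call with a harmless shell command standing in for FFmpeg:

```python
import subprocess


def run_command(command):
    # Run the command through the shell, merging stderr into stdout
    # so the caller sees everything the command printed
    p = subprocess.run(
        command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    return p.stdout.decode().strip()


# Sanity check with a command available in any POSIX shell
print(run_command("echo hello"))  # hello
```

Because shell=True hands the whole string to the shell, the FFmpeg command strings built later in this tutorial can be passed in unchanged.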

Obtaining the duration of a recording

Add the following code below the run_command() function:

def get_rec_duration(rec_name):
    rec_path = "./recordings/{}".format(rec_name)

    command = "ffprobe -i {rec_path} -show_entries format=duration -v quiet \
    -of csv=\"p=0\"".format(rec_path=rec_path)

    rec_duration = run_command(command)
    print("rec duration", rec_duration)

    return rec_duration

Here, we created a function named get_rec_duration(). This function is responsible for retrieving the duration of a recording. The function receives a recording name (rec_name) as an argument, which is prepended with the name of the recordings directory and stored in the rec_path local variable.

We then build a command string that uses the ffprobe program, which is part of FFmpeg, to get the duration of the recording. We call the run_command() function with this command and store the value returned in rec_duration.

Lastly, we print and then return the recording duration we obtained.

The recording duration is needed so that the video generated from the recording can be given the same duration.
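Note that run_command() returns the duration as a string. ffprobe prints a value such as 12.432000 (the sample value below is purely illustrative), so cast it to float before doing any arithmetic with it:

```python
raw_duration = "12.432000"  # shape of a typical ffprobe duration string
duration = float(raw_duration)  # cast before doing any arithmetic
whole_seconds = int(duration)
print(whole_seconds)  # 12
```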

Converting audio to video

Add the following code below the get_rec_duration() function:

def turn_audio_to_video(rec_name, rec_duration):
    rec_path = "./recordings/{}".format(rec_name)
    bg_image_path = "./images/bg.png"
    video_name = "video_with_sound_waves.mp4"

    command = 'ffmpeg -y -i {rec_path} -loop 1 -i {bg_image_path} -t {rec_duration} \
    -filter_complex "[0:a]showwaves=s=1280x150:mode=cline:colors=00e5ff[fg];  \
    drawbox=x=0:y=285:w=1280:h=150:color=black@0.8:t=fill[bg]; \
    [bg][fg]overlay=format=auto:x=(W-w)/2:y=(H-h)/2 " \
    -map 0:a -c:v libx264 -preset fast -crf 18 -c:a aac \
    -shortest ./videos/{video_name}'.format(
        rec_path=rec_path,
        bg_image_path=bg_image_path,
        rec_duration=rec_duration,
        video_name=video_name,
    )

    print(video_name)
    run_command(command)
    return video_name

The turn_audio_to_video() function will turn recordings into videos showing the recording's sound waves. The function takes as arguments the recording name (rec_name) and the recording duration (rec_duration).

The FFmpeg command that generates the video from the audio uses the recording path (rec_path), the path to a background image (bg_image_path), and the output filename for the video (video_name).

Let’s take a closer look at the FFmpeg command:

ffmpeg -y -i {rec_path} -loop 1 -i {bg_image_path} -t {rec_duration} \
-filter_complex \"[0:a]showwaves=s=1280x150:mode=cline:colors=00e5ff[fg];  \
drawbox=x=0:y=285:w=1280:h=150:color=black@0.8:t=fill[bg]; \
[bg][fg]overlay=format=auto:x=(W-w)/2:y=(H-h)/2 \" \
-map 0:a -c:v libx264 -preset fast -crf 18 -c:a aac -shortest ./videos/{video_name}

The -y option tells FFmpeg to overwrite the output file if it exists on disk.

The -i option specifies the inputs. In this case we have 2 input files: the recording file, rec_path, and the image that we are using as a background, which is stored in bg_image_path.

The -loop option generates a video by repeating (looping) an input file. Here we are looping over our image input in bg_image_path. The default value is 0 (don't loop), so we set it to 1 (loop) to repeat this image in all the video frames.

The -t option specifies a duration, either in seconds or using the "hh:mm:ss[.xxx]" syntax. Here we are using the recording duration (rec_duration) value to set the duration of our output video.
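If you ever want the hh:mm:ss[.xxx] form instead, a small helper (hypothetical, not part of the tutorial code) can convert the seconds value that ffprobe returns:

```python
def to_timestamp(seconds):
    # Convert seconds (a float or a numeric string, as returned by
    # ffprobe) into FFmpeg's hh:mm:ss.xxx duration syntax
    seconds = float(seconds)
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = seconds % 60
    return "{:02d}:{:02d}:{:06.3f}".format(h, m, s)


print(to_timestamp("125.5"))  # 00:02:05.500
```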

The -filter_complex option allows us to define a complex filtergraph: one with an arbitrary number of inputs and/or outputs. This option takes a number of arguments, discussed below.

First, we use the showwaves filter to convert the voice recording, referenced as [0:a], to video output. The s parameter is used to specify the video size for the output, which we set to 1280x150. The mode parameter defines how the audio waves are drawn. The available values are: point, line, p2p, and cline. The colors parameter specifies the color of the waveform. The waveform drawing is assigned the label [fg].

We use the drawbox filter to draw a colored box on top of our background image to help the waveform stand out. The x and y parameters specify the top left corner coordinates of the box, while w and h set its width and height. The color parameter configures the color of the box to black with an opacity of 80%. The t parameter sets the thickness of the box border; by setting the value to fill we create a solid box.

To complete the definition of this filter we use overlay to put the waveform drawing on top of the black box. The overlay filter is configured with format, which sets the pixel format automatically, and x and y, which specify the coordinates in which the overlay will be placed in the video frame. We use some math to specify that x and y should be placed in the center of our video.
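To make the structure of the filtergraph easier to see, here is a sketch that assembles the same three filter chains in Python. FFmpeg separates chains with semicolons, and the [fg]/[bg] labels connect the output of one chain to the input of another:

```python
# The three chains of the filtergraph, using the same values as the
# command above: waveform drawing, dark box, and the overlay joining them
showwaves = "[0:a]showwaves=s=1280x150:mode=cline:colors=00e5ff[fg]"
drawbox = "drawbox=x=0:y=285:w=1280:h=150:color=black@0.8:t=fill[bg]"
overlay = "[bg][fg]overlay=format=auto:x=(W-w)/2:y=(H-h)/2"

# Chains are joined with semicolons to form the full filtergraph
filtergraph = "; ".join([showwaves, drawbox, overlay])
print(filtergraph)
```

Building the string in named pieces like this makes it easier to tweak one filter without hunting through a long one-liner.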

The -map option is used to choose which streams from the input(s) should be included or excluded in the output(s). Here we select the audio stream of our recording (0:a) for inclusion in the output video.

The -c:v option is used to encode a video stream with a certain codec. We are telling FFmpeg to use the libx264 encoder.

The -preset option selects a collection of options that will provide a certain encoding speed to compression ratio. We are using the fast option here, but feel free to change the preset to a slower (better quality) or faster (lower quality) one if you like.

The -crf option stands for constant rate factor. Rate control decides how many bits will be used for each frame. This will determine the file size and also the quality of the output video. A value of 18 is recommended to obtain visually lossless quality.

The -c:a option is used to encode an audio stream with a certain codec. We are encoding the audio with the AAC codec.

The -shortest option tells FFmpeg to stop writing the output when the shortest of the input streams ends.

The ./videos/{video_name} option at the end of the command specifies the path of our output file.

In case you are curious, here is what all the FFmpeg waveform modes discussed above do and how they look.

Point draws a point for each sample:

Point waveform mode

Line draws a vertical line for each sample:

Line waveform mode

P2p draws a point for each sample and a line between them:

P2p waveform mode

Cline draws a centered vertical line for each sample. This is the one we are using in this tutorial:

Cline waveform mode

Add the following code below the turn_audio_to_video() function:

def main():
    rec_name = "rec_1.mp3"
    rec_duration = get_rec_duration(rec_name)
    turn_audio_to_video(rec_name, rec_duration)


main()

In this newly introduced code, we have a function named main(). In it we store the recording name in a variable named rec_name. You should update this line to include the name of your own voice recording file.

After that, we call the get_rec_duration() function to get the recording duration.

Then, we call the turn_audio_to_video() function with the recording name and duration.

Lastly, we call the main() function to run the whole process.
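If you have several recordings, a hypothetical extension (not part of the tutorial code) is to loop over the recordings directory and run the pipeline once per file. Here is a sketch of the file-listing part, assuming .mp3 and .wav extensions; the demonstration uses a throwaway directory instead of ./recordings:

```python
import os
import tempfile


def list_recordings(recordings_dir):
    # Return audio files in a stable, sorted order so runs are reproducible
    return sorted(
        name for name in os.listdir(recordings_dir)
        if name.endswith((".mp3", ".wav"))
    )


# Demonstration with a temporary directory standing in for ./recordings
with tempfile.TemporaryDirectory() as d:
    for name in ("rec_1.mp3", "rec_2.wav", "notes.txt"):
        open(os.path.join(d, name), "w").close()
    print(list_recordings(d))  # ['rec_1.mp3', 'rec_2.wav']
```

Inside main() you would then call get_rec_duration() and turn_audio_to_video() for each name the helper returns.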

Go back to your terminal, and run the following command to generate the video:

python main.py

Look for a file named video_with_sound_waves.mp4 in the videos directory, open it and you should see something similar to the following:

Waveform rendering

Adding a video on top of the background

In this section, we are going to add a video of a spinning record on the bottom left corner of the generated video. The video that we are going to add is stored in the file named spinningRecord.mp4 in the videos directory.

Spinning record animation

Go back to your code editor, open the main.py file, and add the following code below the turn_audio_to_video() function:

def add_spinning_record(video_name, rec_duration):
    video_path = "./videos/{}".format(video_name)
    spinning_record_video_path = "./videos/spinningRecord.mp4"
    new_video_name = "video_with_spinning_record.mp4"

    command = 'ffmpeg -y -i {video_path} -stream_loop -1 -i {spinning_record_video_path} \
    -t {rec_duration} -filter_complex "[1:v]scale=w=200:h=200[fg]; \
    [0:v] scale=w=1280:h=720[bg], [bg][fg]overlay=x=25:y=(H-225)" \
    -c:v libx264 -preset fast -crf 18 -c:a copy \
    ./videos/{new_video_name}'.format(
        video_path=video_path,
        spinning_record_video_path=spinning_record_video_path,
        rec_duration=rec_duration,
        new_video_name=new_video_name,
    )

    print(new_video_name)
    run_command(command)
    return new_video_name

Here, we have created a function named add_spinning_record(). This function will be responsible for adding the spinningRecord.mp4 video on top of the video showing sound waves. It takes as arguments the name of the video generated earlier (video_name) and the recording duration (rec_duration).

This function also runs FFmpeg. Here is the command in detail:

ffmpeg -y -i {video_path} -stream_loop -1 -i {spinning_record_video_path} \
-t {rec_duration} -filter_complex \"[1:v]scale=w=200:h=200[fg]; \
[0:v] scale=w=1280:h=720[bg], [bg][fg]overlay=x=25:y=(H-225)\" \
-c:v libx264 -preset fast -crf 18 -c:a copy ./videos/{new_video_name}

The command above has the following options:

The -y, -t, -c:v, -preset, and -crf options are the same as in the FFmpeg command that generated the audio waves.

The -i option was also used before, but in this case, we have 2 videos as input files, the video file generated in the previous step, and the spinning record video file.

The -stream_loop option allows us to set the number of times an input stream should be looped. A value of 0 means to disable looping, while -1 means to loop infinitely. We set the spinning record video to loop infinitely. This would make FFmpeg encode the output video indefinitely, but since we also specified the duration of the output video, FFmpeg will stop encoding the video when it reaches this duration.

The -filter_complex option also works as before, but here we have two videos as input files: the video created in the previous section ([0:v]) and the spinning record video ([1:v]).

The filter first uses scale to resize the spinning record video to 200x200 pixels and assigns it the [fg] label. We then use the scale filter again to set the video created in the previous section to a 1280x720 size, with the [bg] label. Finally, we use the overlay filter to place the spinning record video on top of the video created in the previous section, at the coordinates x=25 and y=H-225 (H stands for the video height).
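The arithmetic behind y=(H-225) is simple: with the background scaled to a 720-pixel height and the overlay scaled to 200 pixels, subtracting 225 leaves a 25-pixel bottom margin that matches the x=25 left margin:

```python
frame_h = 720    # height of the scaled background video ([bg])
overlay_h = 200  # height of the scaled spinning record ([fg])
margin = 25      # same margin we use for x

# The top edge of the overlay, measured from the top of the frame
y = frame_h - overlay_h - margin
print(y)  # 495, the value FFmpeg computes for H-225
```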

The -c:a option was also introduced in the previous section, but in this case we use the special value copy to tell FFmpeg to copy the audio stream from the source video without re-encoding it.

The final part of the command, ./videos/{new_video_name} sets the path of our output file.

Replace the code inside the main() function with the following, which adds the call to the add_spinning_record() function:

def main():
    rec_name = "rec_1.mp3"
    rec_duration = get_rec_duration(rec_name)
    video_with_sound_waves = turn_audio_to_video(rec_name, rec_duration)
    add_spinning_record(video_with_sound_waves, rec_duration)

Run the following command in your terminal to generate a video:

python main.py

Look for a file named video_with_spinning_record.mp4 in the videos directory, open it and you should see something similar to the following:

Waveform rendering with spinning record

Adding text to video

In this section, we are going to add a title to the top portion of the video. As part of this, we are going to learn how to use FFmpeg to draw text and control its color, size, font, and position.

Go back to your code editor, open the main.py file, and add the following code below the add_spinning_record function:

def add_text_to_video(video_name):
    video_path = "./videos/{}".format(video_name)
    new_video_name = "video_with_text.mp4"
    font_path = "./fonts/LeagueGothic-CondensedRegular.otf"

    command = "ffmpeg -y -i {video_path} -vf \"drawtext=fontfile={font_path}:  \
    text='Turning your Twilio voice recordings into videos':fontcolor=black: \
    fontsize=90:box=1:boxcolor=white@0.5 \
    :boxborderw=5:x=((W/2)-(tw/2)):y=100\" \
    -c:a copy ./videos/{new_video_name}".format(
        video_path=video_path,
        font_path=font_path,
        new_video_name=new_video_name
    )

    print(new_video_name)
    run_command(command)
    return new_video_name

Here, we have created a function named add_text_to_video(), which invokes a new FFmpeg command to draw the text. Let’s take a closer look at the FFmpeg command:

ffmpeg -y -i {video_path} -vf \"drawtext=fontfile={font_path}:  \
text='Turning your Twilio voice recordings into videos':fontcolor=black: \
fontsize=90:box=1:boxcolor=white@0.5:boxborderw=5:x=((W/2)-(tw/2)):y=100\" \
-c:a copy ./videos/{new_video_name}

The -y and -c:a options are used exactly as before.

The -i option, which defines the inputs, now has only one input file, the video file generated in the previous section.

The -vf option allows us to create a simple filtergraph and use it to filter the stream. Here we use the drawtext filter to draw the text on top of the video, with a number of parameters: fontfile is the font file to be used for drawing the text; text defines the text to draw (feel free to change it to your liking); fontcolor sets the text color to black; fontsize sets the text size; box enables a box around the text; boxcolor sets the color of this box to white with a 50% opacity; boxborderw sets the width of the box border; and x and y set the position within the video where the text is drawn. We used a little math to center the text horizontally.
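As with the earlier filtergraph, it can help to assemble the drawtext parameters in Python before formatting them into the command. This is a sketch using the same values as above (the title text is a placeholder you would substitute):

```python
font_path = "./fonts/LeagueGothic-CondensedRegular.otf"
title = "Turning your Twilio voice recordings into videos"

# Parameter order mirrors the command above; drawtext expects the
# text value to be quoted inside the filter string
params = [
    ("fontfile", font_path),
    ("text", "'{}'".format(title)),
    ("fontcolor", "black"),
    ("fontsize", "90"),
    ("box", "1"),
    ("boxcolor", "white@0.5"),
    ("boxborderw", "5"),
    ("x", "((W/2)-(tw/2))"),  # tw is the rendered text width
    ("y", "100"),
]
drawtext = "drawtext=" + ":".join("{}={}".format(k, v) for k, v in params)
print(drawtext)
```

Keeping the parameters in a list makes it easy to experiment with individual values (a different font size, a new y position) without editing a long filter string by hand.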

The ./videos/{new_video_name} option at the end sets the output file, just like in the previous FFmpeg commands.

Replace the code inside the main() function with the following version, which adds the title step:

def main():
    rec_name = "rec_1.mp3"
    rec_duration = get_rec_duration(rec_name)
    video_with_sound_waves = turn_audio_to_video(rec_name, rec_duration)
    video_with_spinning_record = add_spinning_record(video_with_sound_waves, rec_duration)
    video_with_text = add_text_to_video(video_with_spinning_record)

Go back to your terminal, and run the following command to generate a video with a title:

python main.py

Look for a file named video_with_text.mp4 in the videos directory, open it and you should see something similar to the following:

Waveform rendering with spinning record and title

Conclusion

In this tutorial, we learned how to use some of the advanced options in FFmpeg to turn a voice recording into a video that can be shared on social media. I hope this encourages you to learn more about FFmpeg.

The code for the entire application is available in the following repository: https://github.com/CSFM93/twilio-turn-recording-to-video.

Carlos Mucuho is a Mozambican geologist turned developer who enjoys using programming to bring ideas into reality. https://github.com/CSFM93.