Transcribe your Phone Calls to Text in Real Time with Twilio and Vosk

March 16, 2022

In this tutorial, you are going to learn how to implement live transcription of phone calls to text. The phone calls will be routed through a Twilio phone number, and we will use the Media Streams API to stream the incoming audio to a small WebSocket server built using Python. Once in your server, the audio stream will be passed to Vosk, a lightweight open-source speech recognition engine that runs locally on your computer, with support for many languages.

Live transcription of phone calls demonstration

Requirements

To work on this tutorial, you will need:

Add a Twilio phone number

Your first task is to add a Twilio phone number to your account. This is the number that will receive the phone calls to transcribe.

Log in to the Twilio Console, select “Phone Numbers”, and then click on the “Buy a number” button to buy a Twilio number. Note that if you have a free account, you will be using your trial credit for this purchase.

On the “Buy a Number” page, select your country and check “Voice” in the “Capabilities” field. If you’d like to request a number from your region, you can enter your area code prefix in the “Number” field.

Buy a Twilio phone number

Click the “Search” button to see what numbers are available, and then click “Buy” for the number you like from the results. If you are using a trial account, this purchase uses your trial credit. After you confirm your purchase, write down your new phone number and click the “Close” button.

Project setup

In this section, you are going to set up a brand new Python project. To keep things nicely organized, open a terminal or command prompt, find a suitable location, create a new directory where the project you are about to create will live, and navigate into the project directory:

mkdir vosk-live-transcription
cd vosk-live-transcription

Create a virtual environment

Following Python best practices, you are now going to create a virtual environment, where you are going to install the Python packages needed for this project.

If you are using a Unix or macOS system, open a terminal and enter the following commands:

python3 -m venv venv
source venv/bin/activate

If you are following the tutorial on Windows, enter the following commands in a command prompt window:

python -m venv venv
venv\Scripts\activate

With the virtual environment activated, you are ready to install the packages required by this project:

pip install twilio vosk flask flask-sock simple-websocket pyngrok

The packages installed are:

  • twilio: the Twilio helper library for Python
  • vosk: a lightweight speech recognition engine
  • flask: a Python web framework
  • flask-sock: a WebSocket extension for Flask
  • simple-websocket: a WebSocket server used by Flask-Sock
  • pyngrok: a Python wrapper for ngrok, a utility to temporarily make a server running on your computer publicly available

Download a language model for Vosk

The Vosk package installed in the previous section is just an engine. To be able to transcribe audio, this engine needs to pass the incoming audio data through a model that has been trained for the intended language.

The Vosk models page has models for many languages. Pick one of the models and download it. To test this project, I used the “vosk-model-small-en-us-0.15” model for American English.

Each model comes as a zip file. Extract the contents of the zip file you downloaded to the vosk-live-transcription directory. The contents of the zip file should all be inside a single folder. Change the name of this top-level model folder to model.

The directory structure of the project, including the Python virtual environment and the Vosk model, should match the following:

Project's directory structure
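If you want to double-check the layout before moving on, a small script along these lines can help. This is only a sketch: the model_dir_problem() function is a hypothetical helper, not part of Vosk, and the only thing it assumes is that the model files sit directly inside a folder named model, as described above.

```python
import os


def model_dir_problem(path='model'):
    """Return a description of what looks wrong, or None if the folder seems usable."""
    if not os.path.isdir(path):
        return f'{path!r} does not exist or is not a directory'
    entries = os.listdir(path)
    if not entries:
        return f'{path!r} is empty'
    # A single sub-directory and nothing else usually means the zip's
    # top-level folder was extracted inside model instead of renamed to it.
    if len(entries) == 1 and os.path.isdir(os.path.join(path, entries[0])):
        return f'{path!r} only contains {entries[0]!r}; the model files may be nested one level too deep'
    return None
```

Calling model_dir_problem() from the project directory returns None when nothing looks obviously wrong. It does not validate the model files themselves, so Vosk may still reject a damaged download.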

Configure the Twilio credentials

To work with Twilio, the Python application needs to have access to your account credentials to authenticate. The most convenient way to define these configuration values is to set environment variables for them. In a bash or zsh session, you can configure these settings as follows:

export TWILIO_ACCOUNT_SID=xxxxxxxxx
export TWILIO_AUTH_TOKEN=xxxxxxxxx

If you are following this tutorial on Windows, use set instead of export in your command prompt window.

You will need to replace the xxxxxxxxx placeholders with the correct values from your account. The two variables are your Twilio “Account SID” and your “Auth Token”. You can find them in the dashboard of the main page of the Twilio Console, under “Account Info”:

Twilio credentials in the Console
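Since a missing variable only shows up later as an error when the Twilio client is created, it can be handy to check for both variables up front. The sketch below is one possible way to do that. The missing_credentials() helper and the REQUIRED tuple are names I made up for illustration; only the two variable names come from this tutorial.

```python
import os

# The two environment variables this tutorial relies on.
REQUIRED = ('TWILIO_ACCOUNT_SID', 'TWILIO_AUTH_TOKEN')


def missing_credentials(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]
```

For example, print(missing_credentials()) shows an empty list when both variables are defined in the current terminal session.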

Python web server

You are now ready to code the web server that will support this project in Python. For this, you are going to use the Flask web framework. Since audio will be streamed by Twilio over WebSocket, and Flask does not support this protocol natively, the Flask-Sock extension will be used for this route.

Here is the general structure of the web server. Copy this code to a file named app.py in the project directory. Note that the two functions in this code will be implemented later; for now, only their signatures and docstrings are provided.

import audioop
import base64
import json
import os
from flask import Flask, request
from flask_sock import Sock, ConnectionClosed
from twilio.twiml.voice_response import VoiceResponse, Start
from twilio.rest import Client
import vosk

app = Flask(__name__)
sock = Sock(app)
twilio_client = Client()
model = vosk.Model('model')

CL = '\x1b[0K'
BS = '\x08'


@app.route('/call', methods=['POST'])
def call():
    """Accept a phone call."""
    # TODO


@sock.route('/stream')
def stream(ws):
    """Receive and transcribe audio stream."""
    # TODO


if __name__ == '__main__':
    from pyngrok import ngrok
    port = 5000
    public_url = ngrok.connect(port, bind_tls=True).public_url
    number = twilio_client.incoming_phone_numbers.list()[0]
    number.update(voice_url=public_url + '/call')
    print(f'Waiting for calls on {number.phone_number}')

    app.run(port=port)

The following sections discuss the different parts of this file.

Imports

This server is going to do several things, so it needs to import a variety of modules. Many of these imports are well known packages that provide general support to the web server, but there are some notable imports that you may not be familiar with.

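For example, the audioop module is a little-known module from the Python standard library. It provides functions to perform audio encoding, decoding, and conversion, which makes it extremely useful for this project. Note that audioop was deprecated in Python 3.11 and removed in Python 3.13, so on newer interpreters this import fails unless you install a replacement package that provides the same module.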

Some of the imports are related to standing up a web server. The Flask class is used to implement HTTP web servers in Python. The Sock class extends Flask with WebSocket support.

The VoiceResponse and Start imports from the twilio package will be used to generate the commands that instruct Twilio to stream audio to the server. The Client import, also from twilio, is used to make Twilio API calls.

Finally, vosk is the speech recognition engine that will do the transcriptions to text.

Global variables

The server has a few variables that are initialized in the global scope. The app variable represents the web server, while sock enables this server to create WebSocket routes.

The twilio_client variable is an instance of Twilio’s Client class, used to make Twilio API calls. This instance will fail to initialize if the TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN environment variables aren’t defined as indicated above.

The model variable holds the speech recognition model for the chosen language, loaded by Vosk. The 'model' argument passed when this instance is created is the path to the directory where the model data is stored on disk.

The CL and BS constants define VT-100 terminal codes to clear the line from the cursor position to the end, and to move the cursor back one character respectively. These will be used when printing live transcriptions to the terminal.

Server initialization

At the bottom of app.py, the web server is initialized and started. The logic in this section is more complex than what you may have seen in other Flask based web servers, because the web server needs to have a public URL that can be passed on to Twilio to use. Let’s go over the statements in this section of the application in detail.

First, the ngrok service is initialized:

    from pyngrok import ngrok
    port = 5000
    public_url = ngrok.connect(port, bind_tls=True).public_url

These instructions create an ngrok tunnel to port 5000, which is the port on which the Flask web server will run. The ngrok service will set up a public web server on a random URL on its ngrok.io domain, and will forward all the traffic it receives on it to port 5000. This is necessary when testing Twilio applications that require webhooks, because Twilio needs to have a public URL to connect to. The bind_tls argument tells ngrok to generate an https:// URL with encryption. The public_url variable receives the URL that ngrok assigned to us. On recent macOS versions, port 5000 might not be available. In that case, switch to a different port.

Using ngrok in this way gives you access to their entry-level service tier, which provides tunnels that expire after two hours. If you run the application for longer than that, you will need to restart it to generate a fresh tunnel with a new URL. If you have an ngrok account, you can configure your ngrok token to remove the time limitation.
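Regarding the note above about port 5000 being taken on some macOS versions, you can find out whether a port is busy before starting the server. The following is a rough sketch that uses only the standard library. The port_in_use() and pick_port() helpers and the list of candidate ports are my own assumptions, not something Flask or ngrok provide.

```python
import socket


def port_in_use(port, host='127.0.0.1'):
    """Return True if something already accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        # connect_ex() returns 0 on success instead of raising an exception.
        return s.connect_ex((host, port)) == 0


def pick_port(candidates=(5000, 5001, 8000)):
    """Return the first candidate port that nothing is listening on."""
    for port in candidates:
        if not port_in_use(port):
            return port
    raise RuntimeError('none of the candidate ports are free')
```

With these helpers, the port = 5000 line in app.py could become port = pick_port(), and the rest of the initialization would stay the same because both ngrok and Flask receive the port from that variable.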

The next part of the server initialization configures the webhook URL that Twilio will call when there is an incoming phone call to the Twilio phone number.

    number = twilio_client.incoming_phone_numbers.list()[0]
    number.update(voice_url=public_url + '/call')
    print(f'Waiting for calls on {number.phone_number}')

To keep this application as simple as possible, this code uses the Twilio API to get a list of phone numbers associated with the account. From this list, only the first number is used. If you have a single number in your account, then this code will work just fine. If you have more than one number and need to choose a specific one to use with this project, then you’ll have to iterate over the returned numbers to find the correct one to use.

The update() method on the phone number object is passed a voice_url argument, set to the public URL from ngrok with a /call path added at the end. This is the webhook URL that will handle incoming phone calls.
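If your account has more than one number, the iteration mentioned above could look something like the sketch below. It assumes that the objects returned by the Twilio API expose a phone_number attribute, which is what the print() statement in app.py already uses. The pick_number() helper and the TWILIO_PHONE_NUMBER environment variable are hypothetical names chosen for this example.

```python
import os


def pick_number(numbers, wanted=None):
    """Return the entry whose phone_number matches wanted, or the first one."""
    if not numbers:
        raise RuntimeError('no phone numbers found in this account')
    if wanted is None:
        return numbers[0]
    for number in numbers:
        if number.phone_number == wanted:
            return number
    raise RuntimeError(f'{wanted} is not one of the numbers in this account')


# In app.py this could replace the list()[0] line, for example:
# number = pick_number(twilio_client.incoming_phone_numbers.list(),
#                      os.environ.get('TWILIO_PHONE_NUMBER'))
```

When the variable is not set, the helper falls back to the first number, so the behavior for single-number accounts does not change.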

The final step to start the server is probably the one you are most familiar with:

    app.run(port=port)

This call starts the Flask web server. At this point, the local computer is accepting requests on port 5000, and any requests that are sent to the public URL provisioned by ngrok will be redirected to it.

Accepting phone calls

When a call is made to the Twilio phone number, Twilio sends a POST request to the URL that was configured as the voice_url for the number. The request includes information about the call, such as the caller ID, which is given in a From parameter in the body of the request.

The request handler needs to tell Twilio how it wants to handle the incoming call by returning a TwiML response. TwiML is a language created by Twilio that is derived from XML. It includes an extensive list of “verbs” that allow the application to indicate how calls should be handled. The simplest TwiML example is one in which a call is answered with a text-to-speech message, using the Say verb:

<Response>
  <Say>Please leave a message</Say>
</Response>

Instead of writing raw XML, Twilio provides a collection of classes that create the XML for us. The above example can be written in Python code as follows:

response = VoiceResponse()
response.say('Please leave a message')

When Twilio receives TwiML from the application’s webhook, it executes the instructions provided inside the <Response> element, and when it reaches the end it hangs up. The above example says “Please leave a message” to the caller and then immediately hangs up. The Pause verb can be used to give the caller time to speak:

response = VoiceResponse()
response.say('Please leave a message')
response.pause(length=60)

The TwiML response for our application needs to tell Twilio to stream the audio from the caller to the application, so that it can be transcribed. The Stream verb, which is slightly more complex than the previous ones, is used for that purpose.

Below you can find the complete implementation of the /call webhook.

@app.route('/call', methods=['POST'])
def call():
    """Accept a phone call."""
    response = VoiceResponse()
    start = Start()
    start.stream(url=f'wss://{request.host}/stream')
    response.append(start)
    response.say('Please leave a message')
    response.pause(length=60)
    print(f'Incoming call from {request.form["From"]}')
    return str(response), 200, {'Content-Type': 'text/xml'}

To help you understand the TwiML response that is being constructed, here is its XML representation:

<Response>
  <Start>
    <Stream url="..." />
  </Start>
  <Say>Please leave a message</Say>
  <Pause length="60" />
</Response>

The Stream verb has two modes of operation: synchronous and asynchronous. For this application, an asynchronous stream is best. This means that Twilio will start streaming audio to our application and, at the same time, continue to execute the remaining verbs in the TwiML response.

To create an asynchronous stream, the Stream verb must be enclosed in a Start element. The url attribute of the Stream verb is the URL of the WebSocket endpoint where Twilio should stream the audio data. In Flask, the request.host expression is the domain that was used in the current request. The WebSocket URL is constructed with the wss:// scheme, the same domain used in the /call endpoint, and a /stream path. The Stream verb also supports a track attribute, which can be used to specify if the application wants a stream for the inbound, outbound or both audio tracks. The default is to only stream the inbound audio, which is what this application needs.

For information purposes, the handler prints a message with the phone number of the caller, which in Flask can be obtained with the request.form['From'] expression.

The XML response is generated by converting the response object to a string. A 200 status code is used to tell Twilio that the request was handled successfully. The Content-Type header is set to indicate that the response contains an XML body.

Shortly after this request ends, Twilio will initiate a WebSocket connection to the URL passed in the Stream verb.
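To get a feel for how the track attribute mentioned earlier fits into the markup, here is a sketch that builds the same document with the standard library's ElementTree instead of the Twilio helper classes. It is meant only to show the resulting XML, not to replace the handler above. The build_twiml() function and the example.ngrok.io host are made up for this illustration. With the helper library, I would expect the equivalent to be a track argument passed to start.stream(), but verify that against the helper library's documentation.

```python
import xml.etree.ElementTree as ET


def build_twiml(host, track=None):
    """Build the <Response> document described above as an XML string."""
    response = ET.Element('Response')
    start = ET.SubElement(response, 'Start')
    stream = ET.SubElement(start, 'Stream', url=f'wss://{host}/stream')
    if track is not None:
        # Values documented by Twilio: inbound_track, outbound_track, both_tracks
        stream.set('track', track)
    ET.SubElement(response, 'Say').text = 'Please leave a message'
    ET.SubElement(response, 'Pause', length='60')
    return ET.tostring(response, encoding='unicode')


print(build_twiml('example.ngrok.io', track='both_tracks'))
```

Leaving track out produces the same structure as the XML shown earlier, which relies on Twilio's default of streaming only the inbound audio.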

Streaming and transcribing the audio from the call

The last piece of this application is the WebSocket endpoint. The complete code for this endpoint is below.

@sock.route('/stream')
def stream(ws):
    """Receive and transcribe audio stream."""
    rec = vosk.KaldiRecognizer(model, 16000)
    while True:
        message = ws.receive()
        packet = json.loads(message)
        if packet['event'] == 'start':
            print('Streaming is starting')
        elif packet['event'] == 'stop':
            print('\nStreaming has stopped')
        elif packet['event'] == 'media':
            audio = base64.b64decode(packet['media']['payload'])
            audio = audioop.ulaw2lin(audio, 2)
            audio = audioop.ratecv(audio, 2, 1, 8000, 16000, None)[0]
            if rec.AcceptWaveform(audio):
                r = json.loads(rec.Result())
                print(CL + r['text'] + ' ', end='', flush=True)
            else:
                r = json.loads(rec.PartialResult())
                print(CL + r['partial'] + BS * len(r['partial']), end='', flush=True)

As soon as the /call endpoint returns the TwiML response, Twilio will make a WebSocket connection to this endpoint.

The rec variable that is initialized at the start is an instance of the Vosk speech recognition engine. The arguments that are passed are the language model loaded earlier, and the sample rate of the audio. At the time I’m writing this, this recognizer only supports an audio rate of 16K samples per second.

The main logic in this function has to deal with a stream of messages that Twilio sends in JSON format. A while loop is used to read each message and decode it to a Python dictionary.

All messages have an event key that indicates their type. The complete list of events is in the documentation, but for the purposes of this application, the most interesting messages are the ones with type media, as these messages include the audio data. In addition, the start and stop messages are sent before and after the streaming respectively. This application prints messages to the terminal when these messages are received.
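The message handling can be tried in isolation, without a phone call. The sketch below feeds a few hand-written messages through the same decoding steps used in the endpoint. The sample messages are trimmed stand-ins that contain only the keys this tutorial reads (real messages from Twilio carry more fields), and the handle() function is a name I picked for the example.

```python
import base64
import json


def handle(message):
    """Decode one JSON message; return ('media', bytes) or (event, None)."""
    packet = json.loads(message)
    event = packet['event']
    if event == 'media':
        # The audio travels as base64 text inside the JSON document.
        return event, base64.b64decode(packet['media']['payload'])
    return event, None


# Hand-written stand-ins for what Twilio sends; real messages carry more fields.
samples = [
    json.dumps({'event': 'start'}),
    json.dumps({'event': 'media',
                'media': {'payload': base64.b64encode(b'\xff' * 4).decode()}}),
    json.dumps({'event': 'stop'}),
]
for message in samples:
    event, audio = handle(message)
    print(event, len(audio) if audio is not None else '-')
```

Keeping the decoding in a small function like this also makes it easier to ignore event types the application does not care about.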

The core portion of this function is in the section that handles the media messages. Let’s go over this part in detail. First, the audio needs to be converted to the proper format for Vosk:

            audio = base64.b64decode(packet['media']['payload'])
            audio = audioop.ulaw2lin(audio, 2)
            audio = audioop.ratecv(audio, 2, 1, 8000, 16000, None)[0]

Twilio provides the audio data encoded in a format called μ-law (pronounced mu-law). The encoded audio data is added to the message in base64 format. The code above extracts the base64 payload from the JSON packet and removes the base64 encoding. Then the audioop.ulaw2lin() function from the Python standard library is used to decode the μ-law encoded data to 16-bit uncompressed format. Finally, the audioop.ratecv() function converts the audio from Twilio’s sample rate of 8000 samples per second to the 16000 required by Vosk.
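If you are curious about what the μ-law decoding step does to each byte, the sketch below re-implements the standard G.711 μ-law expansion in plain Python. This is an illustration written for this explanation, not the code that audioop runs, and the two function names are my own. It returns integer samples instead of the packed bytes that audioop.ulaw2lin() produces.

```python
def ulaw_byte_to_sample(byte):
    """Decode one G.711 mu-law byte (0-255) to a signed 16-bit sample."""
    value = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = value & 0x80               # top bit: set for negative samples
    exponent = (value >> 4) & 0x07    # next three bits: segment number
    mantissa = value & 0x0F           # low four bits: position in the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def ulaw_to_samples(data):
    """Decode a bytes object of mu-law audio to a list of integer samples."""
    return [ulaw_byte_to_sample(b) for b in data]


chunk = bytes([0xFF, 0x80, 0x00])
print(ulaw_to_samples(chunk))   # silence, largest positive, largest negative
```

Each one-byte μ-law value expands to a two-byte sample, and doubling the sample rate afterwards roughly doubles the number of samples again, so the data handed to Vosk is about four times the size of the decoded payload.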

The audio variable now has the raw audio data in the format that Vosk needs. The next section sends the data to the Vosk engine for transcription.

            if rec.AcceptWaveform(audio):
                r = json.loads(rec.Result())
                print(CL + r['text'] + ' ', end='', flush=True)
            else:
                r = json.loads(rec.PartialResult())
                print(CL + r['partial'] + BS * len(r['partial']), end='', flush=True)

The rec.AcceptWaveform() method receives the blob of audio data, and returns True or False depending on the resulting transcription being final or partial respectively. The idea is that the engine is going to receive the audio from the caller in small chunks, so until the speaker pauses or finishes a sentence, it is unlikely that the recognizer will have enough context to make an accurate transcription. When Vosk believes that the transcription can improve after more audio data is provided, it returns False and provides a best-effort partial transcription that is going to be superseded by a better one later. A return value of True means that the provided transcription is final.

Results from the speech recognition engine are provided in JSON format, via the rec.Result() method for final results and the rec.PartialResult() method for partial ones. The application prints the transcribed text to the terminal in both cases. When the results are partial, it moves the cursor back to the start of the partial section, so that the next time results are printed, they overwrite the previous text. When a final transcription is provided, the cursor is finally advanced, so that it can start printing the next portion of the dialogue.

To support the cursor movement, the CL and BS constants defined at the beginning of the file are used. These are VT-100 terminal control codes that clear the line from the cursor to the end and move the cursor back one character respectively.
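The overwrite trick is easier to follow with a small simulation. The sketch below builds the same strings that the endpoint prints, and then runs them through a tiny emulator that understands only these two control codes, to show what would remain visible on the line. The render() and visible() functions are hypothetical helpers written for this explanation, and the emulator is a simplification of what a real terminal does.

```python
CL = '\x1b[0K'
BS = '\x08'


def render(text, final):
    """Return the string the endpoint would print for one result."""
    if final:
        return CL + text + ' '
    return CL + text + BS * len(text)


def visible(output):
    """Very small emulator for the two control codes used above."""
    line = []
    cursor = 0
    i = 0
    while i < len(output):
        if output.startswith(CL, i):
            del line[cursor:]          # clear from the cursor to the end
            i += len(CL)
        elif output[i] == BS:
            cursor = max(0, cursor - 1)
            i += 1
        else:
            if cursor < len(line):
                line[cursor] = output[i]
            else:
                line.append(output[i])
            cursor += 1
            i += 1
    return ''.join(line)


output = (render('hello', False) + render('hello word', False) +
          render('hello world', True) + render('how', False))
print(visible(output))
```

In this made-up sequence, the two partial guesses are overwritten by the final text, and the next partial result starts right after it.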

The while loop will continue to run for as long as Twilio maintains the WebSocket connection. When the caller hangs up, or the 60-second timeout from the Pause verb is reached, the connection will end.

Running the application

Ready to try this application out? With the Python virtual environment activated, run the application as follows:

python app.py

You will see some messages from Vosk as it loads the language model, then you’ll see a message printed by the application:

Waiting for calls on +1xxxxxxxxxx

Right after this, Flask is going to print some log messages regarding the state of the web server.

At this point, you can pick up your phone and call your Twilio phone number. Twilio will answer and connect the call to the application, which will start receiving the audio as you speak. A moment later, the transcription of what you speak will start appearing in real time in the terminal.

The effect of partial and final results is clearly seen in the example below. The initial guesses the recognizer made about what I was saying in this example were wrong a couple of times, but they were corrected automatically as I continued speaking and provided more context.

Live transcription of phone calls demonstration

Conclusion

I hope this tutorial gets you started on live transcribing your phone calls. If you are looking for ideas to enhance this project, here are a few:

  • Try out different Vosk models. The larger models have greater accuracy.
  • Change the Pause verb to Dial, to forward the incoming call to your personal phone. Also change the streaming to send both inbound and outbound audio channels, so that both callers are transcribed.
  • Instead of printing the transcribed text to the console, push it to a web application through WebSocket or maybe Socket.IO.
  • Expand the application to allow multiple callers to communicate over a conference call, and transcribe each participant’s audio track as a record of the conversation.

I can’t wait to see what you transcribe with Twilio and Vosk!

Miguel Grinberg is a Principal Software Engineer for Technical Content at Twilio. Reach out to him at mgrinberg@twilio.com if you have a cool project you’d like to share on this blog!