Convoice Construction Chronicles: Fully Serverless, Low Latency, LLM-to-Voice Service

Published in Convoice, Jan 10, 2024

Our goal at Convoice is to deliver value to small businesses via low costs and high functionality. When we were first building, we aimed for a fully serverless infrastructure built on AWS to reach our goals as fast as possible. As a first step, we set out to solve a limited version of the task, with only one-way voice (the bot can talk to you, but you can’t speak to the bot), primarily accessed through a website.

This service has some important properties:

1. LLM inference can be much faster than speech.
2. While the user is typing their input, no action from the service is required.
3. With a website as the primary interface, the client can maintain state.
4. Latency to the user is measured from the moment they press the ‘send’ button on the client to the first sound of voice from the bot.

Points 1, 2, and 3 mean that our service only has to be active immediately after the user sends a message. While they are typing, we do nothing; once they send, the service can emit voice chunks as fast as it can, in any order, because the client can queue and sort them. This led to our big development goal: to make a product that delivers market-leading value at the lowest prices, we had to make a service billed by message, not by session duration.
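
To make the "queue and sort chunks in the client" idea concrete, here is a minimal sketch of a reorder buffer. It assumes each chunk carries a sequence number starting at 0, which the post does not specify, and it is written in Python for illustration even though the real client is a website.

```python
from typing import Dict, List


class ChunkReorderBuffer:
    """Releases voice chunks in sequence order, however they arrive."""

    def __init__(self) -> None:
        self._pending: Dict[int, bytes] = {}
        self._next_seq = 0  # sequence number of the next chunk to play

    def push(self, seq: int, audio: bytes) -> List[bytes]:
        """Store one chunk and return every chunk that is now playable."""
        if seq < self._next_seq:
            return []  # already played; ignore the duplicate
        self._pending[seq] = audio
        ready: List[bytes] = []
        # Drain the longest in-order run starting at the next expected chunk.
        while self._next_seq in self._pending:
            ready.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
        return ready


# Chunks arrive out of order but are played in order.
buffer = ChunkReorderBuffer()
played: List[bytes] = []
for seq, audio in [(1, b"world"), (0, b"hello"), (2, b"!")]:
    played.extend(buffer.push(seq, audio))
```

Because ordering is resolved on the client, the service is free to send each chunk the moment it is generated.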

Part 1: Managing client connections

As a team, we decided to communicate with the client via WebSockets from the start, as it most closely matched what we intended our final product to use. Unfortunately, holding a WebSocket connection inherently requires long-running processes, directly contradicting our big development goal. To avoid having a service running for the duration of the call, we turned towards API Gateway WebSockets.

Though API Gateway does bill for WebSocket connection minutes, at $0.25 per million minutes connected, its primary cost for our use case is the per-message fee: $1.00 per million messages.
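
As a rough illustration of why the per-message fee dominates, the sketch below applies the two prices quoted above to a hypothetical session. The session shape (10 minutes connected, 120 messages in total) is an assumption for the example, and the calculation ignores any free tier or size-based metering.

```python
# Prices quoted in the text, converted to dollars per unit.
CONNECTION_MINUTE_USD = 0.25 / 1_000_000
MESSAGE_USD = 1.00 / 1_000_000


def api_gateway_cost(minutes_connected: float, messages: int) -> float:
    """Estimated API Gateway WebSocket cost in USD for one session."""
    return minutes_connected * CONNECTION_MINUTE_USD + messages * MESSAGE_USD


# Hypothetical session: 10 minutes connected, 120 messages in total.
connection_part = 10 * CONNECTION_MINUTE_USD
message_part = 120 * MESSAGE_USD
cost = api_gateway_cost(minutes_connected=10, messages=120)
print(f"connection: ${connection_part:.7f}, messages: ${message_part:.7f}, total: ${cost:.7f}")
```

In this example the connection-minute charge is a small fraction of the total, so billing tracks messages far more closely than session length.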

API Gateway WebSockets allows attaching an AWS integration to WebSocket events ($connect, $disconnect, and other user-defined actions). We defined the $message action as the primary method for a user to interact with our service. Each integration connected to a WebSocket event also receives a globally unique connection ID; thus, our service now has the means for managing state.
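
The connection ID is also what a backend uses to push data to a specific client. The sketch below shows one way this might look with the `post_to_connection` call of the boto3 `apigatewaymanagementapi` client. The payload shape (a JSON object with `seq` and base64 `audio` fields) and the function name are assumptions for illustration, not taken from the Convoice source, and a recording stand-in replaces the real client so the example runs offline.

```python
import base64
import json
from typing import Any, Iterable, List, Tuple


def send_voice_chunks(client: Any, connection_id: str, chunks: Iterable[bytes]) -> int:
    """Push each chunk to one WebSocket client; returns the number sent."""
    sent = 0
    for seq, audio in enumerate(chunks):
        # Hypothetical payload: a sequence number plus base64-encoded audio.
        payload = {"seq": seq, "audio": base64.b64encode(audio).decode("ascii")}
        client.post_to_connection(
            ConnectionId=connection_id,
            Data=json.dumps(payload).encode("utf-8"),
        )
        sent += 1
    return sent


class RecordingClient:
    """Stand-in for the boto3 client so the sketch runs without AWS."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, bytes]] = []

    def post_to_connection(self, ConnectionId: str, Data: bytes) -> None:
        self.calls.append((ConnectionId, Data))


# With AWS access, the client would instead be created along the lines of:
#   boto3.client("apigatewaymanagementapi", endpoint_url=<WebSocket API stage URL>)
client = RecordingClient()
count = send_voice_chunks(client, "abc123", [b"\x00\x01", b"\x02"])
```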

[Figure: API Gateway integration for $message]

Part 2: Orchestrating each message step

Though the connectionId gave us a means for deciding which incoming events belong to which session, we now had to build the meat of the service: a method to maintain history, call the LLM and voice generation services, and send the responses back to the client. Step Functions to the rescue!

AWS Step Functions are state machines whose transitions operate on a JSON state. Each state transition can invoke an AWS service and transform its input and result with basic JSON manipulations. They come in two variants: standard and express.

Standard step functions:

are billed by the number of state transitions,
have higher latency between state transitions and on instantiation,
and support the callback pattern (blocking until released by another service).

In contrast, express step functions:

are billed by the number of executions and their duration and memory use,
have (very) low latency for both state transitions and instantiation,
and cannot block while waiting for a callback from another service.

We defined three Step Functions and attached them to the $connect, $message, and $disconnect API Gateway WebSocket actions.

$connect:

This workflow is long-running but mostly idle, so a standard step function, which is billed per state transition rather than by duration, best fits our big development goal. Additionally, its instantiation latency has minimal impact on the user experience, so it is acceptable.

For this service, we used DynamoDB as the session store (this could easily be replaced with something like ElastiCache or Redis). On connecting to the API Gateway WebSocket, an instance of the $connect standard Step Function is instantiated; this serves as the session manager.

1. After instantiation, it puts the connectionId into a row in a DynamoDB table, Sessions. Importantly, DynamoDB PutItem state transitions in Step Functions can block until another service releases them (or some amount of time has passed) through the “Wait For Callback” setting. If enabled, the state transition is also given a task token as input. Any other service on AWS can then call the Step Functions API with the task token and additional arbitrary data; the execution waiting on that task token is released and receives the data as the result of that state transition. We found that this had ~200ms of latency on average. The state transition puts the task token into the DynamoDB row with the session’s connectionId as the key.
2. It checks whether the function that released it was the $disconnect function. If so, jump to 7. Otherwise, it was released by $message and now holds the user’s message.
3. It invokes a Lambda function (a serverless function for small pieces of code) that runs the LLM inference and generates voice chunks as fast as possible. Make sure to pick the right language runtime: the time to start a Lambda function may vary dramatically! The Lambda function sends the voice chunks directly to the client, which avoids the latency of waiting for all the chunks to finish generating, passing them back to the Step Function, and making another state transition. The Lambda function then returns the LLM’s text response to the Step Function.
4. It combines the input and output of the Lambda function to form a list of messages from the user and LLM (the conversation history).
5. Again, it puts a new task token into the corresponding DynamoDB row.
6. Again, it checks for disconnect. If disconnected, proceed. Otherwise, jump back to 3.
7. It deletes the session’s row from the DynamoDB table.
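
The steps above can be sketched as an Amazon States Language definition, built here as a Python dictionary. This is a simplified illustration under stated assumptions, not the published Convoice definition: the state names, the `respond` function name, the `type` field on the callback data, and the one-hour timeout are all hypothetical, the two wait steps are merged into a single looping state, and conversation-history handling is omitted.

```python
import json
from typing import Any, Dict, Set


def build_connect_definition(table: str = "Sessions",
                             function_name: str = "respond") -> Dict[str, Any]:
    """Return a simplified state machine for the $connect session manager."""
    key = {"connectionId": {"S.$": "$.connectionId"}}
    return {
        "StartAt": "WaitForEvent",
        "States": {
            # Store the task token, then block until $message or $disconnect
            # completes it (or the hypothetical timeout elapses).
            "WaitForEvent": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:dynamodb:putItem.waitForTaskToken",
                "Parameters": {
                    "TableName": table,
                    "Item": {**key, "taskToken": {"S.$": "$$.Task.Token"}},
                },
                "TimeoutSeconds": 3600,
                "Catch": [{"ErrorEquals": ["States.Timeout"],
                           "ResultPath": "$.error",
                           "Next": "DeleteSession"}],
                "ResultPath": "$.event",
                "Next": "IsDisconnect",
            },
            # Assumed convention: the callback data carries a "type" field.
            "IsDisconnect": {
                "Type": "Choice",
                "Choices": [{"Variable": "$.event.type",
                             "StringEquals": "disconnect",
                             "Next": "DeleteSession"}],
                "Default": "Respond",
            },
            # The Lambda function streams voice chunks to the client itself.
            "Respond": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
                "ResultPath": "$.reply",
                "Next": "WaitForEvent",
            },
            "DeleteSession": {
                "Type": "Task",
                "Resource": "arn:aws:states:::dynamodb:deleteItem",
                "Parameters": {"TableName": table, "Key": key},
                "End": True,
            },
        },
    }


def dangling_targets(definition: Dict[str, Any]) -> Set[str]:
    """Return transition targets that do not name a defined state."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        targets.update(state[k] for k in ("Next", "Default") if k in state)
        for rule in state.get("Choices", []) + state.get("Catch", []):
            targets.add(rule["Next"])
    return targets - set(states)


definition = build_connect_definition()
definition_json = json.dumps(definition, indent=2)  # what would be deployed
```

The small `dangling_targets` check is only a local sanity test that every transition points at a defined state; it does not replace validation by Step Functions itself.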

$message and $disconnect:

Both follow the same procedure; the only difference is the data each attaches to the task token completion callback.

1. It gets the task token from the DynamoDB row with the corresponding connectionId.
2. It completes the task token, which releases the waiting $connect Step Function. For $disconnect, it attaches a marker indicating the session is done. For $message, it attaches the user’s query.
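
The post implements these two steps as Step Functions; the Python sketch below shows the equivalent API calls (`get_item` on DynamoDB and `send_task_success` on Step Functions, both available in boto3) to make the mechanism concrete. The `release_session` helper, the `taskToken` attribute name, and the event shapes are assumptions for illustration, and small fakes stand in for the AWS clients so the example runs offline.

```python
import json
from typing import Any, Dict, List, Tuple


def release_session(dynamodb: Any, sfn: Any, connection_id: str,
                    event: Dict[str, Any], table: str = "Sessions") -> bool:
    """Look up the session's task token and complete it with `event`."""
    row = dynamodb.get_item(
        TableName=table,
        Key={"connectionId": {"S": connection_id}},
        ConsistentRead=True,  # read the most recently stored token
    )
    item = row.get("Item")
    if item is None:
        return False  # no such session
    # The output becomes the result of the blocked state transition.
    sfn.send_task_success(taskToken=item["taskToken"]["S"], output=json.dumps(event))
    return True


class FakeDynamoDB:
    """Stand-in for boto3.client("dynamodb")."""

    def __init__(self, rows: Dict[str, Dict[str, Any]]) -> None:
        self.rows = rows

    def get_item(self, TableName: str, Key: Dict[str, Any],
                 ConsistentRead: bool = False) -> Dict[str, Any]:
        item = self.rows.get(Key["connectionId"]["S"])
        return {"Item": item} if item else {}


class FakeStepFunctions:
    """Stand-in for boto3.client("stepfunctions")."""

    def __init__(self) -> None:
        self.completed: List[Tuple[str, Dict[str, Any]]] = []

    def send_task_success(self, taskToken: str, output: str) -> None:
        self.completed.append((taskToken, json.loads(output)))


db = FakeDynamoDB({"abc123": {"connectionId": {"S": "abc123"},
                              "taskToken": {"S": "token-1"}}})
sfn = FakeStepFunctions()
# $message attaches the user's query; $disconnect attaches a done marker.
released = release_session(db, sfn, "abc123", {"type": "message", "text": "Hi"})
missing = release_session(db, sfn, "unknown", {"type": "disconnect"})
```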

And there we go — fully serverless and billed (almost) only when the user sends messages. Feel free to check out the Step Function source code!

Written by Ashwin Baluja