AI อะไรเนี่ย

Amazon Polly Gets Bidirectional Streaming for Real-Time TTS

Real-Time Conversational AI Just Got Faster with Amazon Polly

Amazon Polly has released a new Bidirectional Streaming API for developers building real-time conversational AI applications. It changes how text-to-speech (TTS) synthesis fits into applications that generate text incrementally, such as those powered by large language models (LLMs). The API is designed to cut latency and make spoken interactions smoother and more natural.

Previously, an application had to wait for the LLM to finish its entire response before TTS could begin. The new Bidirectional Streaming API supports full-duplex communication over HTTP/2: you send text to Amazon Polly incrementally and receive synthesized audio bytes back at the same time. Your application can start speaking as soon as the first words are generated, which makes conversations feel more immediate and fluid. The API is built around four events: TextEvent (client to Polly) sends text, CloseStreamEvent (client to Polly) signals the end of input, AudioEvent (Polly to client) delivers synthesized audio, and StreamClosedEvent signals that the stream has closed.
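To make the event flow concrete, here is a minimal sketch that models the four events in plain Python. It is not AWS SDK code: the dataclasses, their field names, the fake_polly generator, and the assumption that StreamClosedEvent travels from Polly to the client are all hypothetical stand-ins, and the "audio" is just the text's bytes.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union

# Hypothetical stand-ins for the four event types named in the article.
# These are NOT the AWS SDK classes; the field names are assumptions.

@dataclass
class TextEvent:            # client -> Polly: a piece of text to synthesize
    text: str

@dataclass
class CloseStreamEvent:     # client -> Polly: no more text will follow
    pass

@dataclass
class AudioEvent:           # Polly -> client: a chunk of synthesized audio
    audio_chunk: bytes

@dataclass
class StreamClosedEvent:    # assumed Polly -> client: the stream has ended
    pass

ClientEvent = Union[TextEvent, CloseStreamEvent]
ServerEvent = Union[AudioEvent, StreamClosedEvent]

def fake_polly(events: Iterable[ClientEvent]) -> Iterator[ServerEvent]:
    """Toy service: yields one AudioEvent per TextEvent as soon as it arrives."""
    for event in events:
        if isinstance(event, CloseStreamEvent):
            break
        # Placeholder "audio": the UTF-8 bytes of the text, not real speech.
        yield AudioEvent(audio_chunk=event.text.encode("utf-8"))
    yield StreamClosedEvent()

def run_demo() -> list:
    """Send two text fragments, then close the stream, and collect the replies."""
    outgoing = [TextEvent("Hello, "), TextEvent("world."), CloseStreamEvent()]
    return list(fake_polly(outgoing))

if __name__ == "__main__":
    for reply in run_demo():
        print(reply)
```

The point of the model is the ordering: each AudioEvent is produced before the next TextEvent is consumed, so playback does not have to wait for the full text.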

Why It Matters: Dramatically Reduced Latency and Simpler Workflows

The traditional approach to TTS has a built-in bottleneck: the complete text must be collected before a synthesis request can start. In conversational AI, where LLMs generate responses token by token or word by word, users had to wait for the full text, then for synthesis, and only then for audio playback.

The Bidirectional Streaming API addresses this by processing text as it arrives, which removes that wait. In benchmarks using 7,045 characters (970 words), the new API completed synthesis in 70,071 ms (about 70 seconds), while the traditional SynthesizeSpeech API took 115,226 ms (about 115 seconds). That is roughly 39% less total time. It also cuts overhead: the new API needed just 1 API call, compared to 27 calls for the traditional method, a 27x reduction. The result is lower end-to-end latency from user prompt to spoken response, which is what responsive conversational experiences depend on. The full announcement, Introducing Amazon Polly Bidirectional Streaming, covers the technical details and benefits in more depth.
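The percentages above follow directly from the quoted figures. The short calculation below reproduces them; the variable names are illustrative, and the average request size is derived here from the article's numbers rather than stated in it.

```python
# Figures quoted in the article's benchmark.
chars = 7_045
bidirectional_ms = 70_071
traditional_ms = 115_226
traditional_calls = 27
bidirectional_calls = 1

# Share of total time saved relative to the traditional API.
time_saved_pct = (traditional_ms - bidirectional_ms) / traditional_ms * 100

# How many times fewer API calls the streaming approach makes.
call_ratio = traditional_calls / bidirectional_calls

# Derived, not stated in the article: average characters per traditional call.
avg_chars_per_call = chars / traditional_calls

if __name__ == "__main__":
    print(f"time saved: {time_saved_pct:.1f}%")
    print(f"call reduction: {call_ratio:.0f}x")
    print(f"average characters per traditional call: {avg_chars_per_call:.0f}")
```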

How to Get Started: Integrating the New API into Your Projects

The Bidirectional Streaming API is designed to be developer-friendly. It is currently supported across a wide range of AWS SDKs: Java 2.x, JavaScript v3, .NET v4, C++, Go v2, Kotlin, PHP v3, Ruby v3, Rust, and Swift.

If you are planning an integration, note that the AWS Command Line Interface (AWS CLI) v1 and v2, PowerShell v4 and v5, Python, and .NET v3 do not currently support the new API. In the supported SDKs, the typical pattern is to set up an asynchronous client, send TextEvent objects as text becomes available, and process AudioEvent chunks in real time as Polly streams them back. This simplifies the architecture for real-time speech synthesis in complex AI systems. For implementation details and code examples, see the official AWS blog post.
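The send-while-receiving pattern described above can be sketched with three concurrent tasks. This is a language-neutral illustration written with Python's asyncio, not SDK code (the article notes the Python SDK does not support this API yet): llm_tokens, fake_synthesizer, and play_audio are hypothetical, None stands in for CloseStreamEvent and StreamClosedEvent, and the "audio" is again just encoded text.

```python
import asyncio

async def llm_tokens(text_queue: asyncio.Queue) -> None:
    """Hypothetical LLM: emits text fragments one at a time."""
    for fragment in ["Streaming ", "keeps ", "latency ", "low."]:
        await asyncio.sleep(0)            # yield control, like a real token stream
        await text_queue.put(fragment)    # a real client would send a TextEvent
    await text_queue.put(None)            # stands in for CloseStreamEvent

async def fake_synthesizer(text_queue: asyncio.Queue,
                           audio_queue: asyncio.Queue) -> None:
    """Toy service: turns each fragment into 'audio' as soon as it arrives."""
    while True:
        fragment = await text_queue.get()
        if fragment is None:
            await audio_queue.put(None)   # stands in for StreamClosedEvent
            return
        await audio_queue.put(fragment.encode("utf-8"))  # stands in for AudioEvent

async def play_audio(audio_queue: asyncio.Queue, played: list) -> None:
    """Consumer: handles each chunk as it is received."""
    while True:
        chunk = await audio_queue.get()
        if chunk is None:
            return
        played.append(chunk)              # a real app would hand this to a player

async def main() -> list:
    text_queue: asyncio.Queue = asyncio.Queue()
    audio_queue: asyncio.Queue = asyncio.Queue()
    played: list = []
    # All three tasks run concurrently, so sending and receiving overlap.
    await asyncio.gather(
        llm_tokens(text_queue),
        fake_synthesizer(text_queue, audio_queue),
        play_audio(audio_queue, played),
    )
    return played

if __name__ == "__main__":
    print(asyncio.run(main()))
```

Keeping the sender and the receiver in separate tasks is the design choice that matters here: neither side blocks the other, which is what lets playback begin while text is still being generated.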

Read more: Introducing Amazon Polly Bidirectional Streaming and start building more responsive conversational AI today!