AWS Introduces Disaggregated LLM Inference with llm-d

What llm-d Does for LLM Inference

AWS has announced a partnership with the open-source llm-d team, bringing disaggregated inference capabilities to its cloud platform. At its core, llm-d is an open-source, Kubernetes-native framework built on vLLM, designed to enhance large language model (LLM) serving with production-grade orchestration, advanced scheduling, and high-performance interconnect support. The collaboration aims to change how large-scale LLM inference workloads are handled on AWS, promising higher performance, better GPU utilization, and lower costs.

The framework centers on an architectural pattern called "disaggregated serving." Rather than treating LLM inference as a single, monolithic process, llm-d separates and independently optimizes distinct stages of inference (prefill, decode, and KV-cache management) across distributed GPU resources. To make this integration seamless, AWS and the llm-d team have released a new container image, ghcr.io/llm-d/llm-d-aws, which incorporates AWS-specific libraries such as Elastic Fabric Adapter (EFA) and libfabric. llm-d also integrates with the NIXL library to support multi-node disaggregated inference and expert parallelism, which is particularly beneficial for Mixture-of-Experts (MoE) models.
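To make the pattern concrete, here is a minimal, purely illustrative sketch of disaggregated serving. The class and method names are hypothetical, not the llm-d API: the point is only that prefill and decode run in separate worker pools, with the KV-cache handed off between them rather than both stages sharing one GPU.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    prompt: str
    # In a real system this would be per-layer key/value tensors on the GPU;
    # here we just record which tokens have been prefilled.
    tokens: list = field(default_factory=list)

class PrefillWorker:
    """Compute-bound stage: ingest the whole prompt in one batched pass."""
    def run(self, prompt: str) -> KVCache:
        return KVCache(prompt=prompt, tokens=prompt.split())

class DecodeWorker:
    """Memory-bound stage: generate tokens one at a time from the cache."""
    def run(self, cache: KVCache, max_new_tokens: int) -> str:
        out = []
        for i in range(max_new_tokens):
            out.append(f"<tok{i}>")  # placeholder for real sampling
        return " ".join(out)

# A scheduler ships the KV-cache from the prefill pool to the decode pool
# (over EFA/NIXL in the real system) instead of running both on one GPU.
cache = PrefillWorker().run("Explain disaggregated serving")
text = DecodeWorker().run(cache, max_new_tokens=3)
print(text)  # -> <tok0> <tok1> <tok2>
```

Because the two pools are separate, each can be sized and scaled to its own bottleneck, which is the core efficiency argument behind disaggregation.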

Why This Matters for Your AI Workloads

In the era of agentic AI and complex reasoning, LLMs are generating significantly more tokens and processing longer computational chains, leading to highly variable demands that can bog down traditional inference pipelines. LLM inference involves two distinct phases: a compute-bound "prefill" phase that processes the entire prompt at once, and a memory-bound "decode" phase that generates output tokens one at a time. Historically these have been difficult to optimize independently on the same hardware, which often results in suboptimal GPU utilization and increased costs.
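A back-of-the-envelope calculation shows why the two phases behave so differently. The numbers below are illustrative toy values, not from the announcement: prefill amortizes each weight read over the whole prompt, while decode re-reads the weights for every single new token.

```python
def arithmetic_intensity(tokens: int, d_model: int) -> float:
    """FLOPs per byte moved for one dense matmul over a batch of `tokens` tokens."""
    flops = 2 * tokens * d_model * d_model   # matmul: tokens x d_model x d_model
    bytes_moved = 2 * d_model * d_model      # fp16 weight matrix read once
    return flops / bytes_moved

# Prefill sees the whole prompt in one pass; decode sees one token per step.
prefill = arithmetic_intensity(tokens=2048, d_model=4096)
decode = arithmetic_intensity(tokens=1, d_model=4096)

print(f"prefill intensity: {prefill:.0f} FLOPs/byte")  # high -> compute-bound
print(f"decode intensity:  {decode:.0f} FLOPs/byte")   # low  -> memory-bound
```

With thousands of FLOPs per byte, prefill saturates the GPU's compute units; at one FLOP-pair per byte, decode is limited by memory bandwidth, which is why co-locating the two phases leaves hardware idle.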

llm-d directly addresses these challenges by allowing the prefill and decode stages to be scaled and optimized independently. It also introduces intelligent inference scheduling that predicts KV-cache locality: in multi-replica deployments, requests are routed to instances that already hold relevant cached context, significantly improving throughput and latency for workloads with high prefix reuse, such as multi-turn conversations. This translates directly into more efficient resource usage and a better user experience for your AI applications.
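The routing idea can be sketched in a few lines. This is a simplified illustration under assumed names, not the llm-d scheduler itself: prefer the replica that already caches the longest matching prefix of the incoming request, so a follow-up turn in a conversation lands where its context already lives.

```python
class Replica:
    def __init__(self, name: str):
        self.name = name
        self.cached_prefixes: set[str] = set()

    def serve(self, prompt: str) -> None:
        # The replica's KV-cache grows as requests land on it.
        self.cached_prefixes.add(prompt)

def route(replicas: list[Replica], prompt: str) -> Replica:
    """Pick the replica whose cache covers the longest prefix of the prompt."""
    def longest_match(r: Replica) -> int:
        return max((len(p) for p in r.cached_prefixes
                    if prompt.startswith(p)), default=0)
    # A real scheduler would fall back to load balancing on a cache miss;
    # here we simply take the best prefix match.
    return max(replicas, key=longest_match)

a, b = Replica("a"), Replica("b")
b.serve("user: hi\nassistant: hello\n")  # turn 1 lands on replica b

follow_up = "user: hi\nassistant: hello\nuser: how are you?"
target = route([a, b], follow_up)        # turn 2 reuses b's cached context
print(target.name)  # -> b
```

Routing the follow-up to replica b skips re-prefilling the shared conversation prefix, which is where the throughput and latency gains for high-prefix-reuse workloads come from.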

Getting Started on AWS

For organizations ready to move from prototyping to deploying AI at scale, these capabilities are now available on AWS. Customers can leverage llm-d on AWS's managed Kubernetes offerings, specifically Amazon SageMaker HyperPod and Amazon Elastic Kubernetes Service (Amazon EKS). The dedicated ghcr.io/llm-d/llm-d-aws container simplifies deployment, providing out-of-the-box access to these optimizations.

By adopting disaggregated inference with llm-d on AWS, you can unlock significant improvements in inference performance, resource utilization, and operational efficiency for your large language models. Whether you're dealing with long prompts, massive models, or complex agentic workflows, this new offering provides the tools to scale your LLM deployments more effectively and cost-efficiently.

Read more in Introducing Disaggregated Inference on AWS to dive deeper into these capabilities and start optimizing your LLM inference workloads today.