1. Memory fragmentation
Internal fragmentation
Systems pre-allocate a large chunk of memory for each request, assuming the maximum possible output length (e.g., 2048 tokens). However, if a request generates only a short output, much of that reserved memory goes unused, leading to significant waste.
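A minimal sketch of this effect, assuming each request reserves slots for the full 2048-token maximum (the function and numbers below are illustrative, not from any real serving system):

```python
# Hypothetical sketch: quantify internal fragmentation when every request
# pre-allocates KV-cache slots for the maximum possible output length.
MAX_LEN = 2048  # slots reserved per request (the 2048-token assumption above)

def internal_waste(actual_lengths, max_len=MAX_LEN):
    """Return the fraction of reserved slots that go unused."""
    reserved = max_len * len(actual_lengths)
    used = sum(min(n, max_len) for n in actual_lengths)
    return (reserved - used) / reserved

# Three requests that finish after short outputs: almost 90% of the
# reservation is wasted.
print(internal_waste([100, 512, 30]))
```

With these example lengths, roughly 5502 of the 6144 reserved slots sit idle, which is the waste the paragraph describes.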
External fragmentation
Because different requests reserve chunks of different sizes, GPU memory becomes scattered with small, unusable gaps, making it hard to fit new requests even when the total free memory would suffice. Our sources show that in existing systems, only 20.4% – 38.2% of KV cache memory is actually used to store token states, with the rest being waste.
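The failure mode can be sketched with a toy first-fit allocator (the function and gap sizes are illustrative assumptions, not taken from a real system):

```python
# Hypothetical sketch: contiguous allocation leaves gaps that no new
# request fits into, even though total free memory is sufficient.
def first_fit(free_gaps, request):
    """Return the index of the first gap that can hold `request`, else None."""
    for i, gap in enumerate(free_gaps):
        if gap >= request:
            return i
    return None

# After requests of different sizes finish, free memory is scattered:
gaps = [300, 250, 400, 350]   # free slots, in four separate locations
total_free = sum(gaps)        # 1300 slots free in total
print(first_fit(gaps, 1024))  # None: no single gap fits a 1024-slot request
```

Here 1300 slots are free overall, yet a request needing 1024 contiguous slots cannot be placed anywhere.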
2. No memory sharing
Advanced decoding methods like parallel sampling or beam search often generate multiple outputs from a single prompt, meaning they could share parts of the KV cache. However, existing systems cannot easily share this memory because each sequence's KV cache lives in its own separate, contiguous block.
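One way to visualize the missed opportunity is a block table that maps each sample to KV-cache blocks, where prompt blocks could be referenced by both samples instead of copied (the block names and table layout below are purely illustrative):

```python
from collections import Counter

# Hypothetical sketch: two parallel samples from one prompt. With a block
# table, the prompt blocks (P*) could be stored once and referenced twice;
# with contiguous per-sequence layouts, they must be duplicated.
block_table = {
    "sample_1": ["P0", "P1", "S1"],  # P* = shared prompt blocks
    "sample_2": ["P0", "P1", "S2"],  # S* = per-sample output blocks
}

ref_counts = Counter(b for blocks in block_table.values() for b in blocks)
physical_blocks = len(ref_counts)                          # stored once each: 4
contiguous_blocks = sum(len(b) for b in block_table.values())  # duplicated: 6
print(physical_blocks, contiguous_blocks)
```

In this toy case, sharing the two prompt blocks cuts storage from six blocks to four; with long prompts and many samples, the savings grow accordingly.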