
Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

Hugging Face Blog · 2026-05-27 08:00

Published May 27, 2026
Authors: Amine Dirhoussi (aminediroHF), Quentin Gallouédec (qgallouedec), Kashif Rasul (kashif), Lewis Tunstall (lewtun), Edward Beeching (edbeeching), Albert Villanova del Moral (albertvillanova), Leandro von Werra (lvwerra)
1. The One Terabyte Problem
2. Why bf16 RL Weights Are Almost Always Sparse
3. HF Buckets and the Architecture
3.1 What is a Bucket?
3.2 The Three Boxes
4. The Protocol
4.1 Safetensors as the Wire Format
4.2 The Trainer Side: a Boolean Mask From an Optimizer Hook
4.3 The vLLM Side: a 30 Line Extension
5. Standing It Up on Spaces, For Real
6. So What Does This Actually Unlock?
7. What's Still on Our Plate
8. Try It
TL;DR, because you have models to train and we respect that:
Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine. For a 7B in bf16 that is 14 GB. For a frontier 1T model checkpoint that is on the order of a terabyte. Per step.
It turns out you do not have to. Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical (and never less than 98% in the worst case). The actual delta is tiny.
We landed a TRL PR that encodes just the changed elements as a sparse safetensors file, uploads it to a Hugging Face Bucket, and tells vLLM to fetch it. On Qwen3-0.6B, the per-step payload drops from 1.2 GB to 20 to 35 MB.
The cherry on top: we ran a full disaggregated training where the trainer was on one box, vLLM lived in a Hugging Face Space, the Wordle environment lived in another Space, and weights flowed through a single Hub bucket. No shared cluster, no RDMA, no VPN.
Async RL just got a lot cheaper. Read on.
Two ways to ship the same weights. Red is wall-clock time during which no tokens are being generated.
1. The One Terabyte Problem
If you read our previous post on the landscape of async RL training, you already know the punchline. Every async RL library, regardless of how it spells "actor model" or which color its NCCL backend is painted, eventually trips over the same root: weight synchronization.
The inference engine speaks the policy of step N. The trainer just finished step N+1. The fresh weights have to get from one side to the other before the inference engine starts drifting hopelessly off-policy. This sits on the critical path whether you are running sync or async: a blocking transfer is idle compute, wasted on GPUs that are not generating tokens. With a sparse delta path you collapse that idle time into seconds, and the trainer does not even have to wait for the inference engine to be ready: it uploads the weights to the shared bucket the moment its optimizer step finishes and publishes "weights ready", while the inference engine fetches on its own time.
Fireworks put a very memorable number on this in their post Frontier RL Is Cheaper Than You Think: for a frontier 1T-parameter checkpoint at fp8 (their setting), a full snapshot is 1024 GiB, and that is what conventional wisdom says you have to ship every time you update your rollout fleet. That is the kind of number that gets people to start drawing diagrams with mega-clusters, RDMA fabrics, and dedicated cross-region links. Their measured average delta between adjacent checkpoints lands at 20.3 GiB, or 1.98% of the full model, and "more than 98% of weights in bf16 format remain bit-equivalent between consecutive checkpoints".
Cursor's Composer 2 report tells a parallel story. They run training and inference in different regions and stitch them together with a shared S3 bucket (their exact words), into which the trainer uploads compressed weight diffs every training step. Each cluster independently downloads and reconstructs from the shared delta chain, "requiring no direct connectivity to the training cluster". The two sides never speak to each other about parameters directly. The bucket is the wire.
Both papers agree on three things, and we want to repeat them slowly, because the rest of this post is essentially a faithful open source translation:
Most of the weights have not actually changed between two adjacent RL steps.
If you send only the parts that changed, your bandwidth bill collapses by roughly two orders of magnitude.
If you route those tiny diffs through a shared object store, you no longer need the trainer and the inference cluster to live in the same data center.
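As a quick sanity check, the reduction factors implied by the figures quoted in this post can be worked out directly. The snippet below is only back-of-the-envelope arithmetic on those quoted numbers; nothing here is a new measurement, and the variable names are ours.

```python
# Back-of-the-envelope ratios using only figures quoted in this post.
full_gib = 1024.0    # full 1T-parameter snapshot reported by Fireworks, in GiB
delta_gib = 20.3     # their average delta between adjacent checkpoints, in GiB

fraction = delta_gib / full_gib
print(f"Fireworks delta fraction: {fraction:.2%}")
print(f"Reduction factor: {1 / fraction:.0f}x")

# Qwen3-0.6B: 1.2 GB dense payload versus a 20 to 35 MB sparse payload.
dense_mb = 1200.0
reductions = {sparse_mb: dense_mb / sparse_mb for sparse_mb in (20.0, 35.0)}
for sparse_mb, factor in reductions.items():
    print(f"{sparse_mb:.0f} MB payload is {factor:.0f}x smaller than the dense one")
```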
The only thing missing was a version of this story that you can pip install. So we wrote one.
2. Why bf16 RL Weights Are Almost Always Sparse
Before we wire anything up, it is worth understanding why this whole game is even winnable. The "98% of weights do not change" claim sounds suspiciously like one of those numbers that works in the demo and falls apart in the wild. It is not. It falls out of how bf16 arithmetic works at the learning rates RL uses.
A bf16 number has 7 mantissa bits. Between two consecutive powers of two, there are exactly $2^7 = 128$ representable values, so the spacing between adjacent bf16 numbers around $|w|$ is roughly $|w| \cdot 2^{-7}$. An update gets absorbed by the bf16 cast whenever it sits below half of that spacing, i.e., when $|\Delta w| < |w|/256$. This is the "bf16 visibility threshold" PULSE plots in their Figure 3.
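To make the absorption effect concrete, here is a minimal sketch that emulates the bf16 cast with the standard library only. The helpers `to_bf16_bits` and `bf16_to_float` are hypothetical names written for this illustration (they are not part of TRL or PyTorch), and they assume round-to-nearest-even on finite values.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Round x to bf16 (round-to-nearest-even) and return the 16-bit pattern.

    Toy helper for finite values only; NaN and infinity are not handled.
    """
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits of float32; the bias makes truncation round to nearest even.
    bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + bias) >> 16) & 0xFFFF

def bf16_to_float(bits16: int) -> float:
    """Expand a bf16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

w = bf16_to_float(to_bf16_bits(0.019))  # a weight sitting exactly on the bf16 grid
eta = 3e-6                               # an RL-scale update

absorbed = to_bf16_bits(w - eta) == to_bf16_bits(w)   # far below half the grid spacing
visible = to_bf16_bits(w - 1e-3) != to_bf16_bits(w)   # several grid steps away
print(absorbed, visible)
```

The first comparison is the case the rest of this post relies on: the update is applied, the cast rounds it away, and the stored bits are unchanged.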
Now look at what Adam does. At an RL learning rate of, say, $3 \times 10^{-6}$, the update to a single weight is:
$$\Delta w = -\eta \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$
The normalized step $\hat{m}/(\sqrt{\hat{v}}+\epsilon)$ is roughly order one, so $|\Delta w| \approx \eta \approx 3 \times 10^{-6}$. For most weights, $|w|$ sits somewhere around $10^{-2}$ to $10^{-1}$ (PULSE reports a median of 0.019 for representative LLM weights). The threshold $|w|/256$ at that magnitude is around $4 \times 10^{-5}$ to $4 \times 10^{-4}$, which is bigger than the update.
In other words: the optimizer is whispering, and bf16 cannot hear it. The update gets absorbed by rounding, the byte representation of $w$ does not change, and from the inference engine's perspective, this weight did not move. Multiply that by a few hundred million parameters, and you get the >99% sparsity number, for free, with zero approximation.
This is exactly the argument made formal in the PULSE paper (Mihai & Belilovsky, 2026). They define two thresholds. The absorption bound $10\eta$ is the conservative worst case for an Adam update, and the effective bound $\eta$ is the regime you actually live in. The bf16 visibility threshold is $|w|/256$. Whenever the update sits below the visibility threshold, it gets absorbed, and the bf16 byte does not change. Their Figure 3 plots both bounds against a cloud of representative LLM weights, and the conclusion is unambiguous: at $\eta = 3 \times 10^{-6}$, the absorption bound itself already sits below the visibility threshold for almost every weight in the model. They measure this empirically across Qwen2.5 (0.5B/1.5B/7B), Llama-3.2-3B, and Gemma-3-4B, and consistently find a mean per-step sparsity of ~99%, with a standard deviation of 0.2 to 0.4% over 400 training steps. The worst-case step stays above 98%. So <1% changed is not a lucky measurement; it is what the arithmetic guarantees.
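The comparison between the two bounds and the visibility threshold can be checked numerically. The sketch below uses the approximate threshold $|w|/256$ and the bounds $\eta$ and $10\eta$ as they are described in this post; the function and variable names are ours, and this is not a reproduction of the PULSE analysis.

```python
# Compare the two Adam bounds described above with the bf16 visibility threshold.
eta = 3e-6                     # RL learning rate used in the post
effective_bound = eta          # typical |delta w|
absorption_bound = 10 * eta    # conservative worst case

def visibility_threshold(w_abs: float) -> float:
    """Approximate threshold: updates smaller than |w| / 256 are rounded away."""
    return w_abs / 256.0

for w_abs in (1e-2, 0.019, 1e-1):
    thr = visibility_threshold(w_abs)
    print(
        f"|w|={w_abs:<6} threshold={thr:.2e} "
        f"absorbs_typical={effective_bound < thr} absorbs_worst={absorption_bound < thr}"
    )

# Smallest |w| whose threshold still exceeds the worst-case bound: |w| > 2560 * eta.
min_w_for_worst_case = 256 * absorption_bound
print(f"worst-case updates are absorbed once |w| > {min_w_for_worst_case:.2e}")
```

Under these assumptions, even the conservative bound is absorbed for any weight larger than about $7.7 \times 10^{-3}$ in magnitude, which sits below the reported median of 0.019.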
We do not have to predict this analytically (and indeed, we tried predicting the change mask from Adam's $m$ and $v$ statistics, but recall was a sad 30%, more on that later). We just need to observe which bytes flipped. That is a tiny boolean tensor per parameter, computed right around the optimizer step.
Drag the learning rate down to RL territory and watch the cast-back-to-bf16 marker snap to the original tick. The 256-element grid on the bottom left is the aggregate effect across a tiny model.
3. HF Buckets and the Architecture
Here is where the second piece of the story comes in, and where this post stops being a translation of Fireworks/Cursor and starts being a Hugging Face thing.
3.1 What is a Bucket?
A Bucket is a repo type on the Hub designed for high-frequency object storage. No commit ceremony, no PR workflow, no LFS quirks. You add files, you list files, you download files. The Python interface is two functions:
from huggingface_hub import batch_bucket_files, download_bucket_files

# Trainer side
batch_bucket_files("my-org/wordle-deltas", add=[(buffer, "deltas/step_000042.safetensors")])

# Inference side
download_bucket_files("my-org/wordle-deltas", files=[("deltas/step_000042.safetensors", local_path)])
That is it. Two function calls and your weights are in flight.
Under the hood, buckets are backed by Xet, the Hub's content-defined chunking storage layer. Xet looks at every file you upload, slices it into chunks based on its actual content (not fixed offsets), and deduplicates against everything already in the bucket. The practical upshot, which is delightful in this context, is that even if we were too lazy to write the sparse encoding and just uploaded full anchors every step, Xet would still only transfer the changed chunks. Sparse encoding + Xet stack: we pay for what moved, and we pay for it once.
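For readers who have not met content-defined chunking before, the toy sketch below shows the general idea: chunk boundaries are chosen from the bytes themselves, and chunks are stored under a content hash so repeated content costs nothing to re-upload. This is emphatically not Xet's algorithm or API; `cdc_chunks`, `ToyChunkStore`, the rolling hash, and the parameters `mask` and `min_size` are all assumptions made up for illustration.

```python
import hashlib

def cdc_chunks(data: bytes, mask: int = 0x3F, min_size: int = 16) -> list[bytes]:
    """Toy content-defined chunking: cut wherever a rolling hash hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # older bytes shift out after 32 steps
        if i + 1 - start >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks

class ToyChunkStore:
    """Stores chunks by SHA-256 digest so identical chunks are kept only once."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def upload(self, data: bytes) -> int:
        """Store data and return how many new bytes had to be transferred."""
        sent = 0
        for chunk in cdc_chunks(data):
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self.chunks:
                self.chunks[key] = chunk
                sent += len(chunk)
        return sent

store = ToyChunkStore()
blob = bytes(range(256)) * 8
first = store.upload(blob)    # first upload has to send data
second = store.upload(blob)   # identical content: nothing new to send
print(first, second)
```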
This is the open source equivalent of the "shared S3 bucket" both Fireworks and Cursor reach for, except the storage layer already knows about content hashing, your existing HF token already has permission, and it composes natively with the rest of the stack (Spaces, datasets, models).
3.2 The Three Boxes
The full architecture has exactly three boxes and one shared substrate:
Trainer. Wherever you want. One GPU, eight GPUs, a laptop with a USB-attached H100, we will not judge. Owns the model weights, runs the optimizer, emits sparse deltas.
HF Bucket. A single repo, two prefixes: anchors/ for occasional full snapshots and deltas/ for the sparse patches in between. This is the only thing both sides agree on.
vLLM rollout server. Wherever you want, and crucially not necessarily where the trainer is. Pulls from the bucket, applies the delta, and serves rollouts.
Environment. Hangs off the rollout server in the usual way (HTTP, function calls, whatever your env speaks).
The property to internalize, the one Cursor's paper sells hard and that holds verbatim here: the trainer and the rollout server never talk to each other about weights. They exchange a tiny POST containing {"repo_id": ..., "filename": ...}, and that is the entire control plane. The actual byte transfer happens between each side and the bucket, in parallel, with no shared network fabric.
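To show how small that control message is, here is a minimal sketch of how the POST body could be built. The helper `control_message` is hypothetical and not taken from the TRL PR; the file naming follows the `anchors/` and `deltas/` examples used in this post, and only the two fields named above are included.

```python
import json

def control_message(repo_id: str, step: int, is_anchor: bool) -> str:
    """Build the JSON body telling the rollout server which file to fetch."""
    prefix = "anchors" if is_anchor else "deltas"
    filename = f"{prefix}/step_{step:06d}.safetensors"
    return json.dumps({"repo_id": repo_id, "filename": filename})

msg = control_message("my-org/wordle-deltas", step=42, is_anchor=False)
print(msg)
print(f"{len(msg.encode('utf-8'))} bytes on the control plane")
```

Everything else, including the weights themselves, travels between each side and the bucket.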
Why that matters in practice:
The rollout server can be in another region, another cloud, or behind NAT inside a Hugging Face Space. It does not care.
N inference replicas can pull the same delta from the same bucket, and Xet deduplicates the bytes across all of them.
The trainer never has to know how many inference replicas exist, or where, or whether one of them just crashed.
The trainer writes. Replicas read. The Hub does the plumbing.
4. The Protocol
Now we can open the hood. The protocol has four parts: a wire format, a bucket layout, a 30 line vLLM extension, and a trainer side change detector. It is honestly less code than it sounds.
4.1 Safetensors as the Wire Format
We picked safetensors for the on-disk and on-wire format. It is already the canonical checkpoint format on the Hub, every reasonable framework can read it, and the header carries arbitrary string metadata. That metadata field is where we hide the protocol.
There are two kinds of files in the bucket. Anchors look like a normal checkpoint: one tensor per parameter, full bf16 weights, written every $N$ syncs (we default to $N=10$).
anchors/step_000010.safetensors
├── model.layers.0.self_attn.q_proj.weight (bf16, full)
├── model.layers.0.self_attn.k_proj.weight (bf16, full)
└── ...
metadata: sparse=False, model_version=10, sparsity=0.0
Deltas are the interesting bit. For each parameter that actually changed, we store two entries: a flat int32 tensor of element indices, and a bf16 tensor of values at those indices.
deltas/step_000011.safetensors
├── model.layers.0.self_attn.q_proj.weight.indices (int32, [num_changed])
├── model.layers.0.self_attn.q_proj.weight.values (bf16, [num_changed])
├── model.layers.0.mlp.gate_proj.weight.indices
├── model.layers.0.mlp.gate_proj.weight.values
└── ...
metadata: sparse=True, model_version=11, sparsity=0.9938, changed_params=[...]
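The (indices, values) encoding described above can be sketched in a few lines. To keep the example dependency-free, it uses plain Python lists of bf16 bit patterns rather than tensors; `encode_delta` and `apply_delta` are hypothetical helper names for illustration, not functions from the PR.

```python
def encode_delta(old: list[int], new: list[int]) -> tuple[list[int], list[int]]:
    """Return (indices, values) for every position where new differs from old."""
    indices = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    values = [new[i] for i in indices]
    return indices, values

def apply_delta(snapshot: list[int], indices: list[int], values: list[int]) -> list[int]:
    """Rebuild the updated flat tensor from a snapshot plus a sparse patch."""
    patched = list(snapshot)  # copy so the caller's snapshot is left untouched
    for i, v in zip(indices, values):
        patched[i] = v
    return patched

# Flattened weights represented as bf16 bit patterns (toy data).
old = [0x3C9C, 0x3D00, 0xBC10, 0x3E80]
new = [0x3C9C, 0x3D01, 0xBC10, 0x3E7F]

indices, values = encode_delta(old, new)
print(indices, [hex(v) for v in values])
print(apply_delta(old, indices, values) == new)
```

Because the patch stores the new values rather than differences, applying it is a plain overwrite and the reconstruction is exact.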
A few nice consequences of this choice:
A delta is a file. You can open it with safe_open(...) in Python and inspect every tensor in it. No proprietary framing, no length prefixes, no version handshake.
The metadata is self-describing. The receiver reads sparse=True/False and branches. There is no separate manifest.
It is zero-copy via mmap on the inference side, which matters when you are doing this every few seconds.
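The layout also makes payload sizes easy to estimate: a dense bf16 tensor costs 2 bytes per element, while a sparse patch costs 6 bytes per changed element (one int32 index plus one bf16 value). The sketch below ignores the safetensors header and metadata, and assumes a round 600 million parameters for Qwen3-0.6B; the function names are ours.

```python
def dense_bytes(num_elements: int) -> int:
    """Full bf16 tensor: 2 bytes per element."""
    return 2 * num_elements

def sparse_bytes(num_changed: int) -> int:
    """Sparse patch: one int32 index (4 bytes) plus one bf16 value (2 bytes) per change."""
    return (4 + 2) * num_changed

n = 600_000_000        # assumed round figure for Qwen3-0.6B
changed = n // 100     # 1% of elements changed in one step

full = dense_bytes(n)
patch = sparse_bytes(changed)
print(f"dense: {full / 1e9:.1f} GB, sparse at 1% changed: {patch / 1e6:.0f} MB")

# The sparse encoding only wins while fewer than 1/3 of the elements change.
break_even = 2 / 6
print(f"break-even changed fraction: {break_even:.1%}")
```

At the sub-1% change rates discussed earlier, this estimate lands in the same range as the 20 to 35 MB payloads reported for Qwen3-0.6B.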
The cadence is straightforward: anchor every Nth step, delta in between. Both end up in the same bucket under anchors/ and deltas/ prefixes. Each new inference replica only needs to grab the most recent anchor and then replay the deltas since.
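The bootstrap rule for a new replica (latest anchor, then every later delta in order) can be sketched as a small planning function over a bucket listing. `replay_plan` is a hypothetical helper, and it assumes the `step_XXXXXX.safetensors` naming used in the examples above; how the listing is obtained is left out.

```python
import re

_STEP = re.compile(r"^(anchors|deltas)/step_(\d+)\.safetensors$")

def replay_plan(filenames: list[str]) -> list[str]:
    """Return the latest anchor followed by every later delta, in step order."""
    anchors, deltas = [], []
    for name in filenames:
        m = _STEP.match(name)
        if m is None:
            continue  # ignore anything that does not follow the naming scheme
        target = anchors if m.group(1) == "anchors" else deltas
        target.append((int(m.group(2)), name))
    if not anchors:
        raise ValueError("no anchor found; a replica cannot bootstrap from deltas alone")
    anchor_step, anchor_name = max(anchors)
    return [anchor_name] + [name for step, name in sorted(deltas) if step > anchor_step]

files = [
    "anchors/step_000010.safetensors",
    "deltas/step_000011.safetensors",
    "deltas/step_000009.safetensors",   # older than the anchor, so it is skipped
    "deltas/step_000012.safetensors",
]
plan = replay_plan(files)
print(plan)
```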
Ten training steps. Anchor (full snapshot) on step 1 and step 6, sparse delta on every other step. Files land in the bucket as you watch.
4.2 The Trainer Side: a Boolean Mask From an Optimizer Hook
The trainer needs to know which bf16 elements actually flipped. We do this with a tiny BF16ChangeDetector that registers a pre-step and post-step hook on the optimizer:
import torch

class BF16ChangeDetector:
    def __init__(self, model, optimizer):
        # Simplified for this post: track parameters by name (the PR matches them via data_ptr()).
        self._params = dict(model.named_parameters())
        self._pre_step_bf16: dict[str, torch.Tensor] = {}
        self._validated_masks: dict[str, torch.Tensor] = {}
        optimizer.register_step_pre_hook(self._pre_step_hook)
        optimizer.register_step_post_hook(self._post_step_hook)

    def _pre_step_hook(self, opt, args, kwargs):
        # Snapshot every parameter as bf16 on CPU before the optimizer step.
        for name, p in self._params.items():
            self._pre_step_bf16[name] = p.detach().to(torch.bfloat16).cpu().clone()

    def _post_step_hook(self, opt, args, kwargs):
        # Compare against the snapshot: True wherever the bf16 value changed.
        for name, p in self._params.items():
            self._validated_masks[name] = (
                p.detach().to(torch.bfloat16).cpu() != self._pre_step_bf16[name]
            )
The actual code in the PR has a bit more plumbing (matching optimizer param objects to model params via data_ptr(), because Accelerate wraps them as different Python objects), but the idea fits on a napkin: snapshot, step, diff.
This is ground truth. We tried the more elegant path of predicting the mask from Adam's $m$ and $v$ statistics, using the bf16 ULP threshold directly. It works in principle. In practice, recall was around 30%, which means we would have shipped a delta missing roughly 70% of the actual updates. Adam's normalization is messy enough that the analytical threshold is not tight. So we just compare bytes. It costs one bf16 CPU snapshot of the model on the trainer side, which we are willing to pay.
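For reference, recall here means the share of truly changed elements that a predicted mask also flags. The toy sketch below uses made-up masks (not data from our experiments) to show why a recall of 0.3 is unusable for weight sync: every missed element is a weight the inference engine would never receive.

```python
def recall(predicted: list[bool], actual: list[bool]) -> float:
    """Fraction of truly changed elements that the predicted mask also flags."""
    true_positives = sum(p and a for p, a in zip(predicted, actual))
    total_changed = sum(actual)
    return true_positives / total_changed if total_changed else 1.0

# Toy masks: 10 elements really changed, the predictor catches only 3 of them.
actual = [True] * 10 + [False] * 10
predicted = [True] * 3 + [False] * 17

r = recall(predicted, actual)
missed = 1.0 - r
print(f"recall={r:.0%}, changed elements missing from the delta={missed:.0%}")
```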
The four phases of the new _sync_weight flow are:
Upload while inference keeps running. The trainer encodes the masked elements into a safetensors buffer and pushes it to the bucket. vLLM is still happily serving the old policy during this whole step.
Pause vLLM. A short HTTP call, hundreds of milliseconds.
Signal /update_weights. Send the bucket coordinates. vLLM downloads, applies, returns.
Resume. vLLM is back on the air.
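The ordering of the four phases, and where the pause window starts and ends, can be sketched with injected callables. This is an illustrative outline only: `sync_weights` and its four parameters are hypothetical stand-ins, not the PR's `_sync_weight` implementation, and the real HTTP calls are replaced by lambdas that record what happened.

```python
import time
from typing import Callable

def sync_weights(
    upload: Callable[[], str],
    pause: Callable[[], None],
    signal_update: Callable[[str], None],
    resume: Callable[[], None],
) -> tuple[float, float]:
    """Run the four phases and return (total_seconds, paused_seconds)."""
    t0 = time.perf_counter()
    filename = upload()            # phase 1: inference keeps serving the old policy
    pause()                        # phase 2: stop generation
    t_pause = time.perf_counter()
    try:
        signal_update(filename)    # phase 3: server downloads and applies the file
    finally:
        resume()                   # phase 4: always resume, even if the update failed
    t_end = time.perf_counter()
    return t_end - t0, t_end - t_pause

# Stand-in callables that just record the order of operations.
events: list[str] = []
total, paused = sync_weights(
    upload=lambda: events.append("upload") or "deltas/step_000042.safetensors",
    pause=lambda: events.append("pause"),
    signal_update=lambda name: events.append(f"update:{name}"),
    resume=lambda: events.append("resume"),
)
print(events)
```

The point of the structure is that the upload sits outside the pause window, so only the download-and-apply step counts as lost generation time.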
The log lines tell the story:
Delta: 1234567/200000000 elements changed (sparsity=99.38%)
[delta_engine] uploaded user/wordle-deltas/deltas/step_000042.safetensors (27.4 MB, ...)
Weight sync: done. Total 9.4s (inference paused 1.1s)
The line that matters is the parenthesis. Inference was paused for 1.1 seconds. The remaining 8.3 seconds of the 9.4 second total were spent uploading, which occurred while the rollout server was still generating tokens. With NCCL, we were paying the full sync time as pause time. Here we are paying for it as background time.
A single sync, end to end. Switch between delta-over-bucket and NCCL broadcast, and try the replica count toggle to see the fan-out story.
4.3 The vLLM Side: a 30 Line Extension
vLLM has a clean abstraction for this called WeightTransferEngine. We implement a DeltaWeightTransferEngine whose receive_weights method is, in spirit:
def receive_weights(self, update_info, load_weights):
    download_bucket_files(update_info.repo_id, files=[(update_info.filename, local_path)])
    with safe_open(local_path, framework="pt", device="cpu") as f:
        meta = PatchMetadata.from_metadata_dict(f.metadata())
        if not meta.sparse:
            # Anchor: feed every tensor and snapshot for future deltas
            for name in f.keys():
                tensor = f.get_tensor(name)
                self._bf16_snapshot[name] = tensor.clone()
                load_weights([(name, tensor)])
        else:
            # Delta: apply (indices, values) to snapshot, hand full tensor to vLLM
            for name in json.loads(meta.changed_params):
                indices = f.get_tensor(f"{name}.indices").long()
                values = f.get_tensor(f"{name}.values")
                snap = self._bf16_snapshot[name].flatten()
                snap[indices] = values
                self._bf16_snapshot[name] = snap.reshape(self._bf16_snapshot[name].shape)
                load_weights([(name, self._bf16_snapshot[name])])
We register it via vLLM's --worker-extension-cls flag, which means no fork of vLLM is required. You install TRL into the same image as vLLM, point the CLI at our class, and you are done.
Worth mentioning: vLLM itself has an in-flight effort to land sparse weight transfer natively, vllm-project/vllm#40096. It adds receive_sparse_weights() and trainer_send_sparse_weights() directly on the WeightTransferEngine base class, with patches encoded as (indices, values) and applied in place via index_copy_(), removing the GPU/CPU validation roundtrip entirely. The PR reports a transfer of 0.16 MB in 0.40 ms for a sparse patch on Qwen3-1.7B versus 942 MB in 192 ms for a full dense send.
One honest caveat in our implementation on the inference side: we keep a CPU bf16 snapshot of the model so we can reconstruct full tensors from sparse (indices, values) patches, because load_weights in vLLM today expects full tensors. Once #40096 (or its successor) lands and exposes an in-place sparse load_weights path, we can apply the indices directly on the GPU and drop the snapshot!
5. Standing It Up on Spaces, For Real
This is the part we are smug about. Everything we have described so far works on your laptop, but the point of routing weights through a Hub bucket is that the trainer and the rollout server do not have to live anywhere near each other. So we ran a fully disaggregated training with three machines, none of which share a network:
A box with one GPU running the trainer.
A Hugging Face Space (Docker SDK, L4 GPU) running vLLM with our extension class.
A second Hugging Face Space (CPU) running the Wordle environment server with 256 concurrent session capacity.
A Hub bucket in the middle.
Setting this up is genuinely a few hf CLI calls. The vLLM Space's Dockerfile is essentially the upstream vLLM image plus pip install trl@... plus the entrypoint:
FROM vllm/vllm-openai:latest
RUN pip install "trl @ git+https://github.com/huggingface/trl.git@delta-weight-sync"
ENV VLLM_SERVER_DEV_MODE=1
EXPOSE 7860
ENTRYPOINT ["vllm", "serve", "Qwen/Qwen3-1.7B", \
    "--host", "0.0.0.0", "--port", "7860", \
    "--worker-extension-cls", "trl.experimental.async_grpo.delta_engine.DeltaWorkerExtension", \
    "--weight-transfer-config", "{\"backend\":\"nccl\"}", \
    "--max-model-len", "32768", \
    "--gpu-memory-utilization", "0.8"]
Deploy it as a Space:
hf repos create $USER/vllm-wordle-inference \
    --type space --space-sdk docker --flavor l4x1 \
    --secrets HF_TOKEN=$HF_TOKEN
hf upload $USER/vllm-wordle-inference examples/scripts/openenv/vllm_space/ --type space
And kick of