Training jobs can run for thousands of steps, and each step generates a new model checkpoint. For most training runs, these checkpoints are LoRAs that take up 80-150MB of disk space each. To reduce storage overhead while preserving only the checkpoints that matter, you can set up automatic deletion of all but your best-performing and most recent checkpoints.
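For a rough sense of scale (using the 80-150MB per-checkpoint range above as an assumption), storage grows linearly with the number of checkpoints kept:

```python
# assumed per-checkpoint sizes, taken from the 80-150MB range above
low_mb, high_mb = 80, 150
steps = 1000  # training jobs can run for thousands of steps

# total storage if every step's checkpoint is kept
print(f"{steps} checkpoints: {low_mb * steps / 1000:.0f}-{high_mb * steps / 1000:.0f}GB")
# -> 1000 checkpoints: 80-150GB
```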
To delete all but the most recent and best-performing checkpoints of a model, call the delete_checkpoints method as shown below.
import art

# also works with LocalBackend
from art.serverless.backend import ServerlessBackend

model = art.TrainableModel(
    name="agent-001",
    project="checkpoint-deletion-demo",
    base_model="OpenPipe/Qwen3-14B-Instruct",
)
backend = ServerlessBackend()

# in order for the model to know where to look for its existing checkpoints,
# we have to point it to the correct backend
await model.register(backend)

# deletes all but the most recent checkpoint
# and the checkpoint with the highest val/reward
await model.delete_checkpoints()
By default, delete_checkpoints ranks existing checkpoints by their val/reward score and erases all but the highest-performing and most recent. However, you can configure delete_checkpoints to rank checkpoints by any metric your runs record.
await model.delete_checkpoints(best_checkpoint_metric="train/eval_1_score")
Keep in mind that once checkpoints are deleted, they generally cannot be recovered, so use this method with caution.
Deleting within a training loop
Below is a simple example of a training loop that trains a model for 50 steps before exiting. By default, the LoRA checkpoint generated by each step will automatically be saved in the storage mechanism your backend uses (in this case, W&B Artifacts).
import art
from art.serverless.backend import ServerlessBackend

from .rollout import rollout
from .scenarios import load_train_scenarios

TRAINING_STEPS = 50

model = art.TrainableModel(
    name="agent-001",
    project="checkpoint-deletion-demo",
    base_model="OpenPipe/Qwen3-14B-Instruct",
)
backend = ServerlessBackend()
await model.register(backend)

train_scenarios = load_train_scenarios()

# training loop
for step in range(await model.get_step(), TRAINING_STEPS):
    train_groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(rollout(model, scenario, step) for _ in range(8))
            for scenario in train_scenarios
        ),
        pbar_desc=f"gather(train:{step})",
    )
    # trains model and automatically persists each LoRA as a W&B Artifact
    # ~120MB per step
    await model.train(
        train_groups,
        config=art.TrainConfig(learning_rate=5e-5),
    )

# ~6GB of storage used by checkpoints
However, since each LoRA checkpoint generated by this training run is ~120MB, the run as a whole will require ~6GB of storage for model checkpoints alone. To reduce our storage overhead, let's implement checkpoint deletion on each step.
...
# training loop
for step in range(await model.get_step(), TRAINING_STEPS):
    train_groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(rollout(model, scenario, step) for _ in range(8))
            for scenario in train_scenarios
        ),
        pbar_desc=f"gather(train:{step})",
    )
    # trains model and automatically persists each LoRA as a W&B Artifact
    # ~120MB per step
    await model.train(
        train_groups,
        config=art.TrainConfig(learning_rate=5e-5),
    )
    # clear all but the most recent and best-performing checkpoint
    # on the train/reward metric
    await model.delete_checkpoints(best_checkpoint_metric="train/reward")

# ~240MB of storage used by checkpoints
With this change, we've reduced the total storage used by checkpoints from ~6GB to ~240MB, while preserving the checkpoint that performed best on train/reward.
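As a back-of-the-envelope sanity check (assuming a flat ~120MB per checkpoint, as in the comments above), here is where the 6GB and 240MB figures come from:

```python
CHECKPOINT_MB = 120  # assumed per-checkpoint LoRA size
TRAINING_STEPS = 50

# keeping every step's checkpoint
without_pruning_mb = CHECKPOINT_MB * TRAINING_STEPS
# keeping only the most recent and best-performing checkpoints
with_pruning_mb = CHECKPOINT_MB * 2

print(f"without pruning: ~{without_pruning_mb}MB, with pruning: ~{with_pruning_mb}MB")
# -> without pruning: ~6000MB, with pruning: ~240MB
```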