Garbage Collection and Forgetting
Repositories grow over time as commits, branch metadata, and user blobs accumulate. Because every blob is content-addressed and immutable, nothing is ever overwritten, and there is no automatic reclamation when branches move or objects become orphaned. To keep disk usage in check, a repository can periodically forget blobs that are no longer referenced.
Forgetting is deliberately conservative. It only removes local copies, so re-synchronising from a peer or pushing a commit that references a "forgotten" blob will transparently restore it. Forgetting therefore complements the monotonic model: history never disappears globally, but any node can opt out of retaining data it no longer needs.
The main challenge is deciding which blobs are still reachable without
reconstructing every TribleSet. The sections below outline how the repository
module solves that problem and how you can compose the building blocks in your
own tools.
Understanding the Roots
The walk begins with a root set—the handles you know must stay alive. In a
typical repository this includes the metadata blob for each branch (which in
turn names the commit heads), tags, or any additional anchors your deployment
requires. Roots are cheap to enumerate: walk the branch store via
BranchStore::branches
and load each branch head, or read the subset of metadata relevant to the
retention policy you are enforcing. Everything reachable from those handles
will be retained by the traversal; everything else is eligible for forgetting.
Conservative Reachability
Every commit and branch metadata record is stored as a SimpleArchive. The
archive encodes a canonical TribleSet as 64-byte tribles, each containing a
32-byte value column. The blob store does not track which handles correspond to
archives, so the collector treats every blob identically: it scans the raw bytes
in 32-byte chunks and considers each chunk a candidate handle. Chunks that are
not value columns—for example the combined entity/attribute half of a trible or
arbitrary attachment bytes—are discarded when the candidate lookup fails. If a
chunk matches the hash of a blob in the store we assume it is a reference,
regardless of the attribute type. With 32-byte hashes the odds of a random
collision are negligible, so the scan may keep extra blobs but will not drop a
referenced one.
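The chunk scan described above can be illustrated with a small std-only sketch. This is not the actual triblespace implementation; the function name `candidate_handles` and the `HashSet` standing in for the blob store's hash index are illustrative assumptions.

```rust
use std::collections::HashSet;

/// Scan a blob's bytes in 32-byte windows and report every window that
/// matches a known blob hash. Windows whose lookup fails are simply
/// ignored, so entity/attribute columns and opaque attachment bytes
/// never contribute false edges.
fn candidate_handles(bytes: &[u8], known: &HashSet<[u8; 32]>) -> Vec<[u8; 32]> {
    bytes
        .chunks_exact(32)
        .filter_map(|chunk| {
            let hash: [u8; 32] = chunk.try_into().ok()?;
            known.contains(&hash).then_some(hash)
        })
        .collect()
}

fn main() {
    let referenced = [7u8; 32];
    let unrelated = [9u8; 32];
    let known: HashSet<[u8; 32]> = HashSet::from([referenced]);

    // A fake "archive": one 32-byte column that is a real handle,
    // followed by 32 bytes that match nothing in the store.
    let mut blob = Vec::new();
    blob.extend_from_slice(&referenced);
    blob.extend_from_slice(&unrelated);

    let found = candidate_handles(&blob, &known);
    assert_eq!(found, vec![referenced]);
    println!("found {} candidate handle(s)", found.len());
}
```

Because the scan only ever *adds* candidates when a lookup succeeds, a random 32-byte collision could at worst retain an extra blob; it can never cause a referenced blob to be dropped.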
Content blobs that are not SimpleArchive instances (for example large binary
attachments) therefore behave as leaves: the traversal still scans them, but
because no additional lookups succeed they contribute no further handles. They
become reachable when some archive references their handle and are otherwise
eligible for forgetting.
Traversal Algorithm
- Enumerate all branches and load their metadata blobs.
- Extract candidate handles from the metadata. This reveals the current commit head along with any other referenced blobs.
- Recursively walk the discovered commits and content blobs. Rather than deserialising each archive, scan every blob in 32-byte steps and enqueue any chunk whose lookup succeeds.
- Stream the discovered handles into whatever operation you need. The reachable helper returns an iterator of handles, so you can retain them, transfer them into another store, or collect them into whichever structure your workflow expects.
Because the traversal is purely additive you can compose additional filters or instrumentation as needed—for example to track how many objects are held alive by a particular branch or to export a log of missing blobs for diagnostics.
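The steps above amount to a conservative mark phase. The following std-only sketch models the blob store as a plain `HashMap` from 32-byte hashes to byte payloads; the `reachable` function here is an illustrative stand-in, not the repository module's actual helper.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Conservative mark phase: starting from `roots`, scan each blob in
/// 32-byte windows and enqueue every window that names another blob in
/// the store. Returns the set of reachable handles.
fn reachable(
    store: &HashMap<[u8; 32], Vec<u8>>,
    roots: impl IntoIterator<Item = [u8; 32]>,
) -> HashSet<[u8; 32]> {
    let mut seen: HashSet<[u8; 32]> = HashSet::new();
    let mut queue: VecDeque<[u8; 32]> = roots.into_iter().collect();
    while let Some(handle) = queue.pop_front() {
        if !seen.insert(handle) {
            continue; // already visited
        }
        let Some(bytes) = store.get(&handle) else { continue };
        for chunk in bytes.chunks_exact(32) {
            let candidate: [u8; 32] = chunk.try_into().unwrap();
            if store.contains_key(&candidate) && !seen.contains(&candidate) {
                queue.push_back(candidate);
            }
        }
    }
    seen
}

fn main() {
    // root -> child -> leaf; `orphan` is never referenced.
    let (root, child, leaf, orphan) = ([1u8; 32], [2u8; 32], [3u8; 32], [4u8; 32]);
    let mut store = HashMap::new();
    store.insert(root, child.to_vec());
    store.insert(child, leaf.to_vec());
    // The leaf holds opaque bytes shorter than 32, so no lookups succeed.
    store.insert(leaf, b"opaque attachment".to_vec());
    store.insert(orphan, vec![]);

    let live = reachable(&store, [root]);
    assert!(live.contains(&root) && live.contains(&child) && live.contains(&leaf));
    assert!(!live.contains(&orphan));
    println!("marked {} of {} blobs", live.len(), store.len());
}
```

Note how the leaf blob behaves exactly as described earlier: it is scanned, yields no successful lookups, and contributes no further handles.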
Automating the Walk
The repository module already provides most of the required plumbing. The
reachable helper exposes the traversal as a reusable iterator so you can
compose other operations along the way, while transfer duplicates whichever
handles you feed it. The in-memory MemoryBlobStore can retain live blobs,
duplicate them into a scratch store, and report how many handles were touched
without writing bespoke walkers:
```rust
#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
    use triblespace::core::blob::memoryblobstore::MemoryBlobStore;
    use triblespace::core::repo::{self, BlobStoreKeep, BlobStoreList, BranchStore};
    use triblespace::core::value::schemas::hash::Blake3;

    let mut store = MemoryBlobStore::<Blake3>::default();
    // ... populate the store or import data ...
    let mut branch_store = /* your BranchStore implementation */;
    let reader = store.reader()?;

    // Collect the branch metadata handles we want to keep alive.
    let mut roots = Vec::new();
    for branch_id in branch_store.branches()? {
        if let Some(meta) = branch_store.head(branch_id?)? {
            roots.push(meta.transmute());
        }
    }

    // Trim unreachable blobs in place.
    store.keep(repo::reachable(&reader, roots.clone()));

    // Optionally copy the same reachable blobs into another store.
    let mut scratch = MemoryBlobStore::<Blake3>::default();
    let visited = repo::reachable(&reader, roots.clone()).count();
    let mapping: Vec<_> = repo::transfer(
        &reader,
        &mut scratch,
        repo::reachable(&reader, roots),
    )
    .collect::<Result<_, _>>()?;
    println!("visited {} blobs, rewrote {} handles", visited, mapping.len());

    Ok(())
}
```
In practice you will seed the walker with the handles extracted from branch
metadata or other root sets instead of iterating the entire store. The helper
takes any IntoIterator of handles, so once branch heads (and other roots) have
been identified, they can be fed directly into the traversal without writing
custom queues or visitor logic. Passing the resulting iterator to
MemoryBlobStore::keep or repo::transfer makes it easy to implement
mark-and-sweep collectors or selective replication pipelines without duplicating
traversal code.
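A mark-and-sweep collector built from these pieces reduces to "mark the reachable set, then retain only what was marked". The sweep half can be sketched in std-only Rust; the `sweep` function below mimics the retain-only-live semantics of MemoryBlobStore::keep and is an illustrative assumption, not the library's code.

```rust
use std::collections::{HashMap, HashSet};

/// Sweep phase: drop every blob whose handle is not in the live set.
/// Mirrors the "keep only what the traversal visited" semantics.
fn sweep(store: &mut HashMap<[u8; 32], Vec<u8>>, live: &HashSet<[u8; 32]>) {
    store.retain(|handle, _| live.contains(handle));
}

fn main() {
    let keep_me = [1u8; 32];
    let drop_me = [2u8; 32];
    let mut store: HashMap<[u8; 32], Vec<u8>> = HashMap::new();
    store.insert(keep_me, vec![0xAA]);
    store.insert(drop_me, vec![0xBB]);

    // In a real pipeline `live` would come from the mark traversal.
    let live: HashSet<[u8; 32]> = HashSet::from([keep_me]);
    sweep(&mut store, &live);

    assert!(store.contains_key(&keep_me));
    assert!(!store.contains_key(&drop_me));
    println!("retained {} blob(s)", store.len());
}
```

Because the mark phase only ever over-approximates the live set, sweeping with it is safe: the worst case is keeping a blob that was not actually referenced.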
When you already have metadata represented as a TribleSet, the
potential_handles helper converts its value column into the conservative
stream of Handle&lt;H, UnknownBlob&gt; instances expected by these operations.
Operational Tips
- Schedule forgetting deliberately. Trigger it after large merges or imports rather than on every commit so you amortise the walk over meaningful changes.
- Watch available storage. Because forgetting only affects the local node, replicating from a peer may temporarily reintroduce forgotten blobs. Consider monitoring disk usage and budgeting headroom for such bursts.
- Keep a safety margin. If you are unsure whether a handle should be retained, include it in the root set. Collisions between 32-byte handles are effectively impossible, so cautious root selection simply preserves anything that might be referenced.
Future Work
The public API for triggering garbage collection is still evolving. The
composition-friendly walker introduced above is one building block; future work
could layer additional convenience helpers or integrate with external retention
policies. Conservative reachability by scanning SimpleArchive bytes remains
the foundation for safe space reclamation.