Garbage Collection and Forgetting

Repositories grow over time as commits, branch metadata, and user blobs accumulate. Because every blob is content-addressed and immutable, nothing is ever overwritten and there is no automatic reclamation when branches move or objects become orphaned. To keep disk usage in check, a repository can periodically forget blobs that are no longer referenced.

Forgetting is deliberately conservative. It only removes local copies, so re-synchronising from a peer or pushing a commit that references a "forgotten" blob will transparently restore it. Forgetting therefore complements the monotonic model: history never disappears globally, but any node can opt out of retaining data it no longer needs.

The main challenge is deciding which blobs are still reachable without reconstructing every TribleSet. The sections below outline how the repository module solves that problem and how you can compose the building blocks in your own tools.

Understanding the Roots

The walk begins with a root set—the handles you know must stay alive. In a typical repository this includes the metadata blob for each branch (which in turn names the commit heads), tags, or any additional anchors your deployment requires. Roots are cheap to enumerate: walk the branch store via BranchStore::branches and load each branch head, or read the subset of metadata relevant to the retention policy you are enforcing. Everything reachable from those handles will be retained by the traversal; everything else is eligible for forgetting.

Conservative Reachability

Every commit and branch metadata record is stored as a SimpleArchive. The archive encodes a canonical TribleSet as 64-byte tribles, each containing a 32-byte value column. The blob store does not track which handles correspond to archives, so the collector treats every blob identically: it scans the raw bytes in 32-byte chunks and treats each chunk as a candidate handle. Chunks that are not value columns—for example the combined entity/attribute half of a trible or arbitrary attachment bytes—are discarded when the candidate lookup fails. If a chunk matches the hash of a blob in the store, we assume it is a reference, regardless of the attribute type. With 32-byte hashes the odds of a random collision are negligible, so the scan may keep extra blobs but will not drop a referenced one.

Content blobs that are not SimpleArchive instances (for example large binary attachments) therefore behave as leaves: the traversal still scans them, but because no additional lookups succeed they contribute no further handles. They become reachable when some archive references their handle and are otherwise eligible for forgetting.
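The scan itself is mechanical: step through a blob in aligned 32-byte chunks and ask the store whether each chunk is the hash of a blob it holds. The sketch below illustrates the idea; candidate_refs and blob_exists are illustrative names rather than crate APIs, and the real walker resolves candidates through the blob-store reader instead of a closure.

// Conservative scan sketch: every aligned 32-byte chunk is a candidate handle.
// `blob_exists` stands in for the blob-store lookup.
fn candidate_refs(bytes: &[u8], blob_exists: impl Fn(&[u8; 32]) -> bool) -> Vec<[u8; 32]> {
    let mut found = Vec::new();
    for chunk in bytes.chunks_exact(32) {
        let candidate: [u8; 32] = chunk.try_into().expect("chunks_exact yields 32-byte chunks");
        // Chunks that do not correspond to stored blobs (entity/attribute halves,
        // attachment bytes, ...) simply fail the lookup and are skipped.
        if blob_exists(&candidate) {
            found.push(candidate);
        }
    }
    found
}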

Traversal Algorithm

  1. Enumerate all branches and load their metadata blobs.
  2. Extract candidate handles from the metadata. This reveals the current commit head along with any other referenced blobs.
  3. Recursively walk the discovered commits and content blobs. Each blob is scanned in 32-byte steps; any chunk that resolves to a stored blob is enqueued, so the walker never needs to deserialise the archive.
  4. Stream the discovered handles into whatever operation you need. The reachable helper returns an iterator of handles, so you can retain them, transfer them into another store, or collect them into whichever structure your workflow expects.

Because the traversal is purely additive you can compose additional filters or instrumentation as needed—for example to track how many objects are held alive by a particular branch or to export a log of missing blobs for diagnostics.
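As a concrete illustration, the loop below sketches the worklist behind such a walk, reusing the hypothetical candidate_refs scan from the previous sketch. It is a simplified model of what repo::reachable does for you, not the crate's actual implementation.

use std::collections::{HashSet, VecDeque};

// Simplified worklist traversal over raw 32-byte handles. `load_bytes` stands
// in for the reader's blob lookup; blobs that cannot be loaded contribute no
// children, which keeps the walk purely additive.
fn reachable_from(
    roots: impl IntoIterator<Item = [u8; 32]>,
    load_bytes: impl Fn(&[u8; 32]) -> Option<Vec<u8>>,
    blob_exists: impl Fn(&[u8; 32]) -> bool,
) -> HashSet<[u8; 32]> {
    let mut seen: HashSet<[u8; 32]> = roots.into_iter().collect();
    let mut queue: VecDeque<[u8; 32]> = seen.iter().copied().collect();
    while let Some(handle) = queue.pop_front() {
        let Some(bytes) = load_bytes(&handle) else { continue };
        for child in candidate_refs(&bytes, &blob_exists) {
            // Only enqueue handles we have not visited yet.
            if seen.insert(child) {
                queue.push_back(child);
            }
        }
    }
    seen
}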

Automating the Walk

The repository module already provides most of the required plumbing. The reachable helper exposes the traversal as a reusable iterator so you can compose other operations along the way, while transfer duplicates whichever handles you feed it. The in-memory MemoryBlobStore can retain live blobs, duplicate them into a scratch store, and report how many handles were touched without writing bespoke walkers:

#![allow(unused)]
fn main() -> Result<(), Box<dyn std::error::Error>> {
use triblespace::core::blob::memoryblobstore::MemoryBlobStore;
use triblespace::core::repo::{self, BlobStoreKeep, BlobStoreList, BranchStore};
use triblespace::core::value::schemas::hash::Blake3;

let mut store = MemoryBlobStore::<Blake3>::default();
// ... populate the store or import data ...

let mut branch_store = /* your BranchStore implementation */;
let reader = store.reader()?;

// Collect the branch metadata handles we want to keep alive.
let mut roots = Vec::new();
for branch_id in branch_store.branches()? {
    if let Some(meta) = branch_store.head(branch_id?)? {
        roots.push(meta.transmute());
    }
}

// Trim unreachable blobs in-place.
store.keep(repo::reachable(&reader, roots.clone()));

// Optionally copy the same reachable blobs into another store.
let mut scratch = MemoryBlobStore::<Blake3>::default();
let visited = repo::reachable(&reader, roots.clone()).count();
let mapping: Vec<_> = repo::transfer(
    &reader,
    &mut scratch,
    repo::reachable(&reader, roots),
)
.collect::<Result<_, _>>()?;

println!("visited {} blobs, copied {}", visited, mapping.len());
println!("rewrote {} handles", mapping.len());
Ok(())
}

In practice you will seed the walker with the handles extracted from branch metadata or other root sets instead of iterating the entire store. The helper takes any IntoIterator of handles, so once branch heads (and other roots) have been identified, they can be fed directly into the traversal without writing custom queues or visitor logic. Passing the resulting iterator to MemoryBlobStore::keep or repo::transfer makes it easy to implement mark-and-sweep collectors or selective replication pipelines without duplicating traversal code.

When you already have metadata represented as a TribleSet, the potential_handles helper converts its value column into the conservative stream of Handle<H, UnknownBlob> instances expected by these operations.
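A minimal fragment of that flow might look as follows, continuing the names from the example above; metadata_set is a TribleSet you already hold, and the exact potential_handles signature is assumed here rather than quoted from the crate.

// Illustrative fragment: seed the walk from an in-memory TribleSet instead of
// branch heads. Check the crate docs for the precise potential_handles signature.
let roots = repo::potential_handles(&metadata_set);
store.keep(repo::reachable(&reader, roots));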

Operational Tips

  • Schedule forgetting deliberately. Trigger it after large merges or imports rather than on every commit so you amortise the walk over meaningful changes.
  • Watch available storage. Because forgetting only affects the local node, replicating from a peer may temporarily reintroduce forgotten blobs. Consider monitoring disk usage and budgeting headroom for such bursts.
  • Keep a safety margin. If you are unsure whether a handle should be retained, include it in the root set. Collisions between 32-byte handles are effectively impossible, so cautious root selection simply preserves anything that might be referenced.
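One way to follow the first tip is to gate the walk behind a simple counter, as in this illustrative fragment that reuses the names from the example in "Automating the Walk"; commits_since_forget and FORGET_THRESHOLD are hypothetical bookkeeping, not crate features.

// Hypothetical policy: amortise the walk over a batch of commits instead of
// running it on every push.
const FORGET_THRESHOLD: u64 = 1_000;

if commits_since_forget >= FORGET_THRESHOLD {
    let reader = store.reader()?;
    // `roots` collected from branch heads as in the example above.
    store.keep(repo::reachable(&reader, roots));
    commits_since_forget = 0;
}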

Future Work

The public API for triggering garbage collection is still evolving. The composition-friendly walker introduced above is one building block; future work could layer additional convenience helpers or integrate with external retention policies. Conservative reachability by scanning SimpleArchive bytes remains the foundation for safe space reclamation.