Introduction

Welcome to the Tribles Book. This guide provides a gentle introduction to Trible Space, its design goals and how to use the tribles crate. The aim is to present a clear narrative for newcomers while linking to in‑depth reference material for those who want to dig deeper.

Trible Space combines ideas from databases and version control. Data is stored in small immutable blobs that can live in memory, on disk or inside remote object stores without conversion. Cryptographic identifiers ensure integrity and enable efficient sharing across systems.

What makes Trible Space stand out?

  • Content‑addressed immutability – values are stored by the hash of their contents, making them easy to verify and share.
  • Lightweight queries – queries work over unordered collections such as hash maps, trible sets or custom full‑text/semantic indexes; you can embed domain‑specific DSLs, and queries run fast enough to be used like everyday language constructs.
  • Repository workflows – history and collaboration follow familiar version‑control patterns.

The opening chapters introduce these ideas at a gentle pace. Later sections dive into architecture, query semantics, schemas and commit selectors so you can build and reason about complex data sets.

While this book walks you through the basics, the crate documentation offers a complete API reference. Use the links throughout the text to jump directly to the modules that interest you. A Glossary is provided for quick reference to common terms.

If you would like to work on Tribles yourself, check out Developing Locally.

Getting Started

First add the required crates to your project:

cargo add tribles ed25519-dalek rand

This example uses ed25519-dalek to generate a signing key and rand for randomness.

Next create a simple repository and commit some data:

use tribles::prelude::*;
use tribles::examples::literature;
use tribles::repo::Repository;
use std::path::Path;
use ed25519_dalek::SigningKey;
use rand::rngs::OsRng;

const MAX_PILE_SIZE: usize = 1 << 20;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pile: Pile<MAX_PILE_SIZE> = Pile::open(Path::new("example.pile"))?;
    let mut repo = Repository::new(pile, SigningKey::generate(&mut OsRng));
    let mut ws = repo.branch("main")?;

    ws.commit(literature::entity!(&ufoid(), { firstname: "Alice" }), None);
    repo.push(&mut ws)?;
    Ok(())
}

Running this program with cargo run creates an example.pile file in the current directory and pushes a single entity to the main branch.
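
To read the data back you can check out the commit and query it. The following is a hedged sketch that continues inside main from the example above; the pattern! macro and Workspace::checkout it uses are introduced in later chapters:

// Continuing inside `main`, after the push above.
let head = ws.head().expect("branch has a commit");
let content = ws.checkout(head).expect("checkout");

for (_person, name) in find!((person: Value<_>, name: Value<_>),
        literature::pattern!(&content, [{ person @ firstname: name }])) {
    println!("Hello, {}!", name.from_value::<&str>());
}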

See the crate documentation for additional modules and examples.

Developing Locally

To build and test Tribles yourself you need a recent Rust toolchain. Install it from rustup.rs.

The repository provides helper scripts for common workflows:

  • ./scripts/preflight.sh formats the code, runs the full test suite and rebuilds this book. Run it before committing.
  • ./scripts/devtest.sh executes only the tests for faster feedback during development.
  • ./scripts/build_book.sh rebuilds the documentation once mdbook is installed with cargo install mdbook.

These scripts keep the project formatted, tested and documented while you iterate locally.

Architecture Overview

Trible Space is designed to keep data management simple, safe and fast. The README introduces these goals in more detail, emphasizing a lean design with predictable performance and straightforward developer experience. This chapter explains how the pieces fit together and why they are organised this way.

Design Goals

A full discussion of the motivation behind Trible Space can be found in the Philosophy section. At a high level we want a self‑contained data store that offers:

  • Simplicity – minimal moving parts and predictable behaviour.
  • Developer Experience – a clear API that avoids complex servers or background processes.
  • Safety and Performance – sound data structures backed by efficient content addressed blobs.

These goals grew out of earlier "semantic" technologies that attempted to model knowledge as graphs. While systems like RDF promised great flexibility, in practice they often became difficult to host, query and synchronise. Trible Space keeps the idea of describing the world with simple statements but stores them in a form that is easy to exchange and reason about.

Data Model

The fundamental unit of information is a Trible. Its 64 byte layout is described in Trible Structure. A Trible links a subject entity to an attribute and value. Multiple tribles are stored in a TribleSet, which behaves like a hashmap with three columns — subject, attribute and value.

Trible Sets

TribleSets provide fast querying and cheap copy‑on‑write semantics. They can be merged, diffed and searched entirely in memory. When durability is needed the set is serialised into a blob and tracked by the repository layer.

To keep joins skew‑resistant, each set maintains all six orderings of entity, attribute and value. The trees reuse the same leaf nodes so a trible is stored only once, avoiding a naïve six‑fold memory cost while still letting the query planner pick the most selective permutation.
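
A small sketch of these semantics, using the literature example namespace that ships with the crate:

#![allow(unused)]
fn main() {
use tribles::prelude::*;
use tribles::examples::literature;

let author = ufoid();

let mut base = TribleSet::new();
base += literature::entity!(&author, { firstname: "Frank" });

// Cloning is cheap: both sets share structure until one of them changes.
let mut extended = base.clone();
extended += literature::entity!(&author, { lastname: "Herbert" });

// `difference` yields only the tribles added on top of `base`.
let delta = extended.difference(&base);
}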

Blob Storage

All persistent data lives in a BlobStore. Each blob is addressed by the hash of its contents, so identical data occupies space only once and readers can verify integrity by recomputing the hash. The trait exposes simple get and put operations, leaving caching and eviction strategies to the backend. Implementations decide where bytes reside: an in‑memory MemoryBlobStore, an on‑disk Pile described in Pile Format or a remote object store. Because handles are just 32‑byte hashes, repositories can copy or cache blobs without coordination. Trible sets, user blobs and commit records all share this mechanism.
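
A minimal sketch of that interface, assuming the in-memory store and the Blake3 handles used later in this book:

#![allow(unused)]
fn main() {
use tribles::prelude::*;
use valueschemas::{Handle, Blake3};
use blobschemas::LongString;

// Putting a blob returns a 32-byte handle derived from its content hash.
let mut store: MemoryBlobStore<Blake3> = MemoryBlobStore::new();
let handle: Value<Handle<Blake3, LongString>> = store
    .put("a quote that is far too long to inline into a 32-byte value")
    .unwrap();

// Storing the same bytes again yields the same handle and does not grow
// the store.
}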

Branch Store

A BranchStore keeps track of the tips of each branch. Updates use a simple compare‑and‑set operation so concurrent writers detect conflicts. Both the in‑memory and pile repositories implement this trait.

Because only this single operation mutates repository state, nearly all other logic is value oriented and immutable. Conflicts surface only at the branch store update step, which simplifies concurrent use and reasoning about changes.
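
The contract is easiest to see in isolation. The following stand-alone sketch illustrates the compare-and-set idea with made-up types; it is not the crate's actual BranchStore trait:

#![allow(unused)]
fn main() {
#[derive(Clone, Copy, PartialEq, Debug)]
struct Head([u8; 32]);

struct Branch { head: Option<Head> }

impl Branch {
    /// Succeeds only if the stored head still equals `expected`;
    /// otherwise the caller sees the winning head and must merge and retry.
    fn compare_and_set(&mut self, expected: Option<Head>, new: Head) -> Result<(), Option<Head>> {
        if self.head == expected {
            self.head = Some(new);
            Ok(())
        } else {
            Err(self.head)
        }
    }
}
}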

Repository

The Repository combines a blob store with a branch store. Commits store a trible set blob along with a parent link and signature. Because everything is content addressed, multiple repositories can share blobs or synchronize through a basic file copy.

Workspaces

A Workspace represents mutable state during editing. Checking out or branching yields a workspace backed by a fresh MemoryBlobStore. Commits are created locally and only become visible to others when pushed, as described in Repository Workflows.

Commits and History

TribleSets written to blobs form immutable commits. Each commit references its parent, creating an append‑only chain signed by the author. This is the durable history shared between repositories.

Putting It Together

+-------------------------------------------------------------+
|                         Repository                          |
|   +---------------------+    +---------------------------+  |
|   |      BlobStore      |    |        BranchStore        |  |
|   +---------------------+    +---------------------------+  |
+-------------------------------------------------------------+
           ^ pull                              | push
           |                                   v
+-------------------------------------------------------------+
|                         Workspace                           |
|   +---------------------+    +---------------------------+  |
|   |   MemoryBlobStore   |    |         TribleSet         |  |
|   +---------------------+    +---------------------------+  |
+-------------------------------------------------------------+
                      |
                      | commit/add_blob
                      v
                 TribleSet blobs

Repositories persist blobs and branch metadata while workspaces stage changes before pushing them. Because every blob is addressed by its hash, repositories can safely share data through any common storage without coordination.

Query Engine

Queries describe the patterns you want to retrieve. The engine favors extreme simplicity and aims for predictable latency and skew resistance without any tuning. Every constraint implements the Constraint trait so operators, sub-languages and even alternative data sources can compose cleanly.

Queries as Schemas

You might notice that trible.space does not define a global ontology or schema beyond associating attributes with a ValueSchema or BlobSchema. This is deliberate. The semantic web taught us that per-value typing, while desirable, was awkward in RDF: literal datatypes are optional, custom types need globally scoped IRIs and there is no enforcement, so most data degenerates into untyped strings. Trying to regain structure through global ontologies and class hierarchies made schemas rigid and reasoning computationally infeasible. Real-world data often arrives with missing, duplicate or additional fields, which clashes with these global, class-based constraints.

Our approach is to be sympathetic to edge cases and have the system deal only with the data it declares itself capable of handling. These application-specific schema declarations are exactly the shapes and constraints expressed by our queries[1]. Data not conforming to these queries is simply ignored by definition, as a query only returns data satisfying its constraints.[2]

Join Strategy

The query engine uses the Atreides family of worst-case optimal join algorithms. These algorithms leverage cardinality estimates to guide a depth-first search over variable bindings, providing skew-resistant and predictable performance. For a detailed discussion, see the Atreides Join chapter.

Query Languages

Instead of a single query language, the engine exposes small composable constraints that combine with logical operators such as and and or. These constraints are simple yet flexible, enabling a wide variety of operators while still allowing the engine to explore the search space efficiently.

The query engine and data model are flexible enough to support many query styles, including graph, relational and document-oriented queries.

For example, the namespace module offers macros that generate constraints for a given trible pattern in a query-by-example style reminiscent of SPARQL or GraphQL but tailored to a document-graph data model. It would also be possible to layer a property-graph language like Cypher or a relational language like Datalog on top of the engine.[3]
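
As a hedged sketch of that query-by-example style, using the literature example namespace and its pattern! macro (covered in the Query Language chapter):

#![allow(unused)]
fn main() {
use tribles::prelude::*;
use tribles::examples::{self, literature};

let dataset = examples::dataset();

// "Find every book together with its author and title."
let results: Vec<_> = find!(
    (author: Value<_>, book: Value<_>, title: Value<_>),
    literature::pattern!(&dataset, [
        { author @ firstname: ("Frank"), lastname: ("Herbert") },
        { book @ author: author, title: title }
    ])
).collect();
}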

Great care has been taken to ensure that query languages with different styles and semantics can coexist and even be mixed with other languages and data models within the same query. For practical examples of the current facilities, see the Query Language chapter.

[1] Note that this query-schema isomorphism isn't necessarily true in all databases or query languages, e.g., it does not hold for SQL.

[2] In RDF terminology: we challenge the classical A-Box & T-Box dichotomy by replacing the T-Box with a "Q-Box", which is descriptive and open rather than prescriptive and closed. This Q-Box naturally evolves with new and changing requirements, contexts and applications.

[3] SQL would be a bit more challenging, as it is surprisingly imperative with its explicit JOINs and ORDER BYs, and its lack of a clear declarative semantics. This makes it harder to implement on top of a constraint-based query engine tailored towards a more declarative and functional style.

The Atreides Family of Worst-case Optimal Join Algorithms

The heart of the system is a constraint-solving approach based on the theory of worst-case optimal joins. Queries are represented as sets of constraints over variables, and the join algorithm explores this search space by binding one variable at a time.

Size estimates, normally used only by query optimizers, directly guide the join. Instead of building explicit join plans, the engine asks each constraint for cardinality bounds and chooses the next variable with the smallest estimated result. Traditional databases rely on sorted indexes to efficiently find corresponding values. Atreides still uses random lookups when confirming each candidate, but the bounds let it enumerate the most specific constraint sequentially and probe only a few possibilities in the larger ones, offering similar pruning power without maintaining sorted indexes everywhere.

Maintaining accurate estimates is crucial. We therefore provide data structures that update cardinalities in O(1) time so the algorithm can adapt quickly as bindings accumulate.

We currently expose four variants of the Atreides family, each with a descriptive name and a playful Dune nickname. They differ in how much information they consider when predicting the remaining search space:

  • Row-count Join (Jessica) – uses per-constraint row counts to pick the variable with the smallest number of matching rows.
  • Distinct-value Join (Paul) – estimates the smallest number of distinct values for the variable across one column.
  • Partial-binding Join (Ghanima) – refines the estimate using the current partial binding but ignores yet-to-be-bound variables.
  • Exact-result Join (Leto) – the ideal algorithm that knows the exact result size even for unbound variables.

Each variant trades complexity for precision. More accurate estimates let the engine prune failing paths earlier.

For example, given constraints that relate ?person to ?parent and ?city, the Distinct-value Join (Paul) will bind whichever variable has the fewest distinct candidates. If ?city only has a handful of possibilities, the search tries them first and backtracks quickly when they fail.

The join proceeds as a depth-first search. At every step the engine binds a value, evaluates the affected constraints, and backtracks when a constraint reports no matches. Because estimates come directly from the constraints, the engine avoids complex query planning and remains resistant to data skew — even wildly imbalanced cardinalities do not degrade performance.

Query Language

This chapter introduces the basic query facilities provided by tribles. Queries are declared in a small declarative language that describes which values should match rather than how to iterate them.

The find! macro builds a Query by declaring variables and a constraint expression. A minimal invocation looks like:

#![allow(unused)]
fn main() {
let results = find!((a), a.is(1.into())).collect::<Vec<_>>();
}

find! returns an Iterator over tuples of the bound variables. Matches can be consumed lazily or collected into common collections:

#![allow(unused)]
fn main() {
for (a,) in find!((a), a.is(1.into())) {
    println!("match: {a}");
}
}

Variables can optionally specify a concrete type to convert the underlying value:

#![allow(unused)]
fn main() {
find!((x: i32, y: Value<ShortString>),
      and!(x.is(1.into()), y.is("foo".to_value())))
}

The first variable is read as an i32 and the second as a short string if the conversion succeeds. The query engine walks all possible assignments that satisfy the constraint and yields tuples of the declared variables.

Built-in constraints

find! queries combine a small set of constraint operators to form a declarative language for matching tribles:

  • and! builds an IntersectionConstraint requiring all sub-constraints to hold.
  • or! constructs a UnionConstraint accepting any satisfied alternative.
  • ignore! tells the query engine to ignore variables in a sub-query. Constraints mentioning ignored variables are evaluated without checking those positions, so their bindings neither join with surrounding constraints nor appear in the result set.
  • Collection types such as TribleSet provide a has method yielding a ContainsConstraint for membership tests.

Any data structure that can iterate its contents, test membership and report its size can implement ContainsConstraint, so queries have no inherent ordering requirements.

Ignored variables are handy when a sub-expression references fields you want to drop. The engine skips checking them entirely, effectively treating the positions as wildcards. If the underlying constraint guarantees some value, ignoring works like existential quantification; otherwise the ignored portion is simply discarded. Without ignoring, those variables would leak into the outer scope and either appear in the results or unintentionally join with other constraints.

Alternatives are expressed with or! and temporary variables can be hidden with ignore!:

#![allow(unused)]
fn main() {
find!((x), or!(x.is(1.into()), x.is(2.into())));

find!((x),
      ignore!((y), and!(x.is(1.into()), y.is(2.into()))));
}

In the second query y is ignored entirely—the engine never checks the y.is(2.into()) part—so the outer query only enforces x.is(1.into()) regardless of whether any y equals 2.

Example

#![allow(unused)]
fn main() {
use tribles::prelude::*;
use tribles::examples::{self, literature};

let dataset = examples::dataset();

for (title,) in find!((title: Value<_>),
                     and!(dataset.has(title), title.is("Dune".to_value()))) {
    println!("Found {}", title.from_value::<&str>());
}
}

This query searches the example dataset for the book titled "Dune". The variables and constraint can be adapted to express more complex joins and filters.

matches!

Sometimes you only want to check whether a constraint has any solutions. The matches! macro mirrors the find! syntax but returns a boolean:

#![allow(unused)]
fn main() {
use tribles::prelude::*;

assert!(matches!((x), x.is(1.into())));
assert!(!matches!((x), and!(x.is(1.into()), x.is(2.into()))));
}

Custom constraints

Every building block implements the Constraint trait. You can implement this trait on your own types to integrate custom data sources or query operators with the solver.

Regular path queries

The path! macro lets you search for graph paths matching a regular expression over edge attributes. It expands to a RegularPathConstraint and can be combined with other constraints. Invoke it through a namespace module (social::path!) to implicitly resolve attribute names:

#![allow(unused)]
fn main() {
use tribles::prelude::*;
NS! { namespace social {
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" as follows: GenId;
    "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" as likes: GenId;
} }
let mut kb = TribleSet::new();
let a = fucid(); let b = fucid(); let c = fucid();
kb += social::entity!(&a, { follows: &b });
kb += social::entity!(&b, { likes: &c });

let results: Vec<_> = find!((s: Value<_>, e: Value<_>),
    social::path!(&kb, s (follows | likes)+ e)).collect();
}

The middle section uses a familiar regex syntax to describe allowed edge sequences. Editors with Rust macro expansion support provide highlighting and validation of the regular expression at compile time. Paths reference attributes from a single namespace; to traverse edges across multiple namespaces, create a new namespace that re-exports the desired attributes and invoke path! through it.

Incremental Queries

The query engine normally evaluates a pattern against a complete TribleSet, recomputing every match from scratch. Applications that ingest data continuously often only need to know which results are introduced by new tribles. Tribles supports this with semi‑naive evaluation, a classic incremental query technique.

Delta evaluation

Given a base dataset and a set of newly inserted tribles, the engine runs the original query multiple times. Each run restricts a different triple constraint to the delta while the remaining constraints see the full set. The union of these runs yields exactly the new solutions. The process is:

  1. compute a delta TribleSet containing only the inserted tribles,
  2. for every triple in the query, evaluate a variant where that triple matches against delta,
  3. union all per‑triple results to obtain the incremental answers.

Because each variant touches only one triple from the delta, the work grows with the number of constraints and the size of the delta set rather than the size of the full dataset.

Monotonicity and CALM

Removed results are not tracked. Tribles follow the CALM principle: a program whose outputs are monotonic in its inputs needs no coordination. Updates simply add new facts and previously derived conclusions remain valid. When conflicting information arises, applications append fresh tribles describing their preferred view instead of retracting old ones. Stores may forget obsolete data, but semantically tribles are never deleted.

Example

Namespaces provide a pattern_changes! operator to express these delta queries. It behaves like pattern! but takes the current TribleSet and a precomputed changeset. The macro unions variants of the query where each triple is constrained to that changeset, matching only the newly inserted tribles. Combined with the union constraint, this lets us run incremental updates using the familiar find! interface.

#![allow(unused)]
fn main() {
    let storage = MemoryRepo::default();
    let mut repo = Repository::new(storage, SigningKey::generate(&mut OsRng));
    let mut ws = repo.branch("main").expect("branch");

    // Commit initial data
    let shakespeare = ufoid();
    let hamlet = ufoid();
    let mut base = TribleSet::new();
    base += literature::entity!(&shakespeare, { firstname: "William", lastname: "Shakespeare" });
    base += literature::entity!(&hamlet, { title: "Hamlet", author: &shakespeare });
    ws.commit(base.clone(), None);
    let c1 = ws.head().unwrap();

    // Commit a new book
    let macbeth = ufoid();
    let mut change = TribleSet::new();
    change += literature::entity!(&macbeth, { title: "Macbeth", author: &shakespeare });
    ws.commit(change.clone(), None);
    let c2 = ws.head().unwrap();

    // Compute updated state and delta between commits
    let base_state = ws.checkout(c1).expect("base");
    let updated = ws.checkout(c2).expect("updated");
    let delta = updated.difference(&base_state);

    // Find new titles by Shakespeare
    let results: Vec<_> = find!(
        (author: Value<_>, book: Value<_>, title: Value<_>),
        literature::pattern_changes!(&updated, &delta, [
            { author @ firstname: ("William"), lastname: ("Shakespeare") },
            { book @ author: author, title: title }
        ])
    )
    .map(|(_, b, t)| (b, t))
    .collect();

    println!("{results:?}");
}

Comparing history points

Workspace::checkout accepts commit selectors which can describe ranges in repository history. By checking out a range like a..b we obtain exactly the tribles introduced by commits reachable from b but not from a. Feeding that delta into pattern_changes! lets us ask, “What new matches did commit b introduce over a?”
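
A hedged sketch of that combination, repeating the setup from the example above and swapping the manual difference for a range checkout:

#![allow(unused)]
fn main() {
    // Same setup as the example above.
    let storage = MemoryRepo::default();
    let mut repo = Repository::new(storage, SigningKey::generate(&mut OsRng));
    let mut ws = repo.branch("main").expect("branch");

    let shakespeare = ufoid();
    let hamlet = ufoid();
    let mut base = TribleSet::new();
    base += literature::entity!(&shakespeare, { firstname: "William", lastname: "Shakespeare" });
    base += literature::entity!(&hamlet, { title: "Hamlet", author: &shakespeare });
    ws.commit(base, None);
    let c1 = ws.head().unwrap();

    let macbeth = ufoid();
    let mut change = TribleSet::new();
    change += literature::entity!(&macbeth, { title: "Macbeth", author: &shakespeare });
    ws.commit(change, None);
    let c2 = ws.head().unwrap();

    // `c1..c2` checks out exactly the tribles introduced between the two
    // commits, so no manual `difference` is required.
    let updated = ws.checkout(ancestors(c2)).expect("updated");
    let delta = ws.checkout(c1..c2).expect("delta");

    let results: Vec<_> = find!(
        (author: Value<_>, book: Value<_>, title: Value<_>),
        literature::pattern_changes!(&updated, &delta, [
            { author @ firstname: ("William"), lastname: ("Shakespeare") },
            { book @ author: author, title: title }
        ])
    )
    .collect();

    println!("{results:?}");
}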

Trade‑offs

  • Applications must compute and supply the delta set; the engine does not track changes automatically.
  • Queries must remain monotonic since deletions are ignored.
  • Each triple incurs an extra variant, so highly selective constraints keep incremental evaluation efficient.

Schemas

Trible Space stores data in strongly typed values and blobs. A schema defines the language‑agnostic byte layout for these types: Values always occupy exactly 32 bytes while Blobs may be any length. Schemas translate those raw bytes to concrete application types and decouple persisted data from a particular implementation. This separation lets you refactor to new libraries or frameworks without rewriting what's already stored. The crate ships with a collection of ready‑made schemas located in src/value/schemas and src/blob/schemas.

Why 32 bytes?

Storing arbitrary Rust types requires a portable representation. Instead of human‑readable identifiers like RDF's URIs, Tribles uses a fixed 32‑byte array for all values. This size provides enough entropy to embed intrinsic identifiers—typically cryptographic hashes—when a value references data stored elsewhere in a blob. Keeping the width constant avoids platform‑specific encoding concerns and makes it easy to reason about memory usage.

Conversion traits

Schemas define how to convert between raw bytes and concrete Rust types. The conversion traits ToValue/FromValue and their fallible counterparts live on the schema types rather than on Value itself, avoiding orphan‑rule issues when supporting external data types. The Value wrapper treats its bytes as opaque; schemas may validate them or reject invalid patterns during conversion.
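
For example, a round-trip through the ShortString schema looks roughly like this:

#![allow(unused)]
fn main() {
use tribles::prelude::*;
use tribles::value::schemas::shortstring::ShortString;

// Pack a &str into a 32-byte value and read it back out.
let v: Value<ShortString> = "hello world".to_value();
let s: &str = v.from_value();
assert_eq!(s, "hello world");
}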

Schema identifiers

Every schema declares a unique 128‑bit identifier such as VALUE_SCHEMA_ID (and optionally BLOB_SCHEMA_ID for blob handles). Persisting these IDs allows applications to look up the appropriate schema at runtime, even when they were built against different code. The schema_id method on Value and Blob returns the identifier so callers can dispatch to the correct conversion logic.

Built‑in value schemas

The crate provides the following value schemas out of the box:

  • GenId – an abstract 128 bit identifier.
  • ShortString – a UTF-8 string up to 32 bytes.
  • U256BE / U256LE – 256-bit unsigned integers.
  • I256BE / I256LE – 256-bit signed integers.
  • R256BE / R256LE – 256-bit rational numbers.
  • F256BE / F256LE – 256-bit floating point numbers.
  • Hash and Handle – cryptographic digests and blob handles (see hash.rs).
  • ED25519RComponent, ED25519SComponent and ED25519PublicKey – signature fields and keys.
  • NsTAIInterval to encode time intervals.
  • UnknownValue as a fallback when no specific schema is known.
#![allow(unused)]
fn main() {
use tribles::value::schemas::shortstring::ShortString;
use tribles::value::{ToValue, ValueSchema};

let v = "hi".to_value::<ShortString>();
assert_eq!(v.schema_id(), ShortString::VALUE_SCHEMA_ID);
}

Built‑in blob schemas

The crate also ships with these blob schemas:

  • LongString for arbitrarily long UTF‑8 strings.
  • SimpleArchive which stores a raw sequence of tribles.
  • SuccinctArchive providing a compressed index for offline queries.
  • UnknownBlob for data of unknown type.
#![allow(unused)]
fn main() {
use tribles::blob::schemas::longstring::LongString;
use tribles::blob::{ToBlob, BlobSchema};

let b = "example".to_blob::<LongString>();
assert_eq!(LongString::BLOB_SCHEMA_ID, b.schema_id());
}

Defining new schemas

Custom formats implement ValueSchema or BlobSchema. A unique identifier serves as the schema ID. The example below defines a little-endian u64 value schema and a simple blob schema for arbitrary bytes.

#![allow(unused)]

fn main() {
pub struct U64LE;

impl ValueSchema for U64LE {
    const VALUE_SCHEMA_ID: Id = id_hex!("0A0A0A0A0A0A0A0A0A0A0A0A0A0A0A0A");
    type ValidationError = Infallible;
}

impl ToValue<U64LE> for u64 {
    fn to_value(self) -> Value<U64LE> {
        let mut raw = [0u8; VALUE_LEN];
        raw[..8].copy_from_slice(&self.to_le_bytes());
        Value::new(raw)
    }
}

impl FromValue<'_, U64LE> for u64 {
    fn from_value(v: &Value<U64LE>) -> Self {
        u64::from_le_bytes(v.raw[..8].try_into().unwrap())
    }
}

pub struct BytesBlob;

impl BlobSchema for BytesBlob {
    const BLOB_SCHEMA_ID: Id = id_hex!("B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0");
}

impl ToBlob<BytesBlob> for Bytes {
    fn to_blob(self) -> Blob<BytesBlob> {
        Blob::new(self)
    }
}

impl TryFromBlob<BytesBlob> for Bytes {
    type Error = Infallible;
    fn try_from_blob(b: Blob<BytesBlob>) -> Result<Self, Self::Error> {
        Ok(b.bytes)
    }
}

}

See examples/custom_schema.rs for the full source.

Repository Workflows

Tribles borrows much of its vocabulary from Git:

  • Repository – top-level object that tracks history through a BlobStore and BranchStore.
  • Workspace – mutable view of a branch, similar to Git's working directory and index combined.
  • BlobStore – stores commits and blobs.
  • BranchStore – records branch metadata.

Both stores can be in memory, on disk or backed by a remote service. The examples in examples/repo.rs and examples/workspace.rs showcase this API and should feel familiar to anyone comfortable with Git.

Branching

A branch records a line of history. Creating one writes initial metadata to the underlying store and yields a Workspace pointing at that branch. Typical steps look like:

  1. Create a repository backed by blob and branch stores.
  2. Open or create a branch to obtain a Workspace.
  3. Commit changes in the workspace.
  4. Push the workspace to publish those commits.

While Repository::branch is a convenient way to start a fresh branch, most workflows use Repository::pull to obtain a workspace for an existing branch:

#![allow(unused)]
fn main() {
let mut repo = Repository::new(pile, SigningKey::generate(&mut OsRng));
let mut ws = repo.branch("main").expect("create branch");
let mut ws2 = repo.pull(ws.branch_id()).expect("open branch");
}

After committing changes you can push the workspace back:

#![allow(unused)]
fn main() {
ws.commit(change, Some("initial commit"));
repo.push(&mut ws)?;
}

Inspecting History

You can explore previous commits using Workspace::checkout which returns a TribleSet with the union of the specified commit contents. Passing a single commit returns just that commit. To include its history you can use the ancestors helper. Commit ranges are supported for convenience. The expression a..b yields every commit reachable from b that is not reachable from a, treating missing endpoints as empty (..b) or the current HEAD (a.. and ..):

#![allow(unused)]
fn main() {
let history = ws.checkout(commit_a..commit_b)?;
let full = ws.checkout(ancestors(commit_b))?;
}

The history_of helper builds on the filter selector to retrieve only the commits affecting a specific entity. Commit selectors are covered in more detail in the next chapter:

#![allow(unused)]
fn main() {
let entity_changes = ws.checkout(history_of(my_entity))?;
}

Merging and Conflict Handling

When pushing a workspace another client might have already updated the branch. Repository::push attempts to update the branch atomically and returns an optional conflicting Workspace if the head moved. The usual loop looks like:

#![allow(unused)]
fn main() {
while let Some(mut incoming) = repo.push(&mut ws)? {
    incoming.merge(&mut ws)?;
    ws = incoming;
}
}

This snippet is taken from examples/workspace.rs. The examples/repo.rs example demonstrates the same pattern with two separate workspaces. The returned Workspace already contains the remote commits, so after merging your changes you push that new workspace to continue.

Typical CLI Usage

There is a small command line front-end in the trible repository. It exposes push and merge operations over simple commands and follows the same API presented in the examples. The tool is currently experimental and may lag behind the library, but it demonstrates how repository operations map onto a CLI.

Diagram

A simplified view of the push/merge cycle:


        ┌───────────┐         pull          ┌───────────┐
        │ local ws  │◀──────────────────────│   repo    │
        └─────┬─────┘                       └───────────┘
              │
              │ commit
              ▼
        ┌───────────┐         push          ┌───────────┐
        │ local ws  │──────────────────────▶│   repo    │
        └─────┬─────┘                       └─────┬─────┘
              │                                   │
              │ merge                             │ conflict?
              └──────▶┌─────────────┐◀────────────┘
                      │ conflict ws │
                      └──────┬──────┘
                             │              ┌───────────┐
                             └─────────────▶│   repo    │
                                     push   └───────────┘

Each push either succeeds or returns a workspace containing the other changes. Merging incorporates your commits and the process repeats until no conflicts remain.

Commit Selectors

Commit selectors describe which commits to load from a workspace. The Workspace::checkout method accepts any type implementing the CommitSelector trait and returns a TribleSet containing data from those commits. It currently supports individual commit handles, lists of handles and a handful of higher level selectors.

Most selectors operate on ranges inspired by Git's two‑dot syntax. a..b selects all commits reachable from b that are not reachable from a. In set terms it computes ancestors(b) - ancestors(a). Omitting the start defaults a to the empty set so ..b yields ancestors(b). Omitting the end defaults b to the current HEAD, making a.. resolve to ancestors(HEAD) - ancestors(a) while .. expands to ancestors(HEAD).

#![allow(unused)]
fn main() {
// Check out the entire history of the current branch
let history = ws.checkout(ancestors(ws.head()))?;
}

While convenient, the range-based design makes it difficult to compose complex queries over the commit graph.

Implemented selectors

CommitSelector is implemented for:

  • CommitHandle – a single commit.
  • Vec<CommitHandle> and &[CommitHandle] – explicit lists of commits.
  • ancestors(commit) – a commit and all of its ancestors.
  • nth_ancestor(commit, n) – follows the first-parent chain n steps.
  • parents(commit) – direct parents of a commit.
  • symmetric_diff(a, b) – commits reachable from either a or b but not both.
  • Standard ranges: a..b, a.., ..b and .. following the two‑dot semantics described above.
  • filter(selector, predicate) – retains commits for which predicate returns true.
  • history_of(entity) – commits touching a specific entity (built on filter).
  • time_range(start, end) – commits whose timestamps intersect the inclusive range.

A future redesign could mirror Git's revision selection semantics. Instead of passing ranges, callers would construct commit sets derived from reachability. Primitive functions like ancestors(<commit>) and descendants(<commit>) would produce sets. Higher level combinators such as union, intersection and difference would then let users express queries like "A minus B" or "ancestors of A intersect B". Each selector would return a CommitSet patch of commit handles for checkout to load.

This approach aligns with Git's mental model and keeps selection logic separate from workspace mutation. It also opens the door for additional operations on commit sets without complicating the core API.

Filtering commits

The filter selector wraps another selector and keeps only the commits for which a user provided closure returns true. The closure receives the commit metadata and its payload, allowing inspection of authors, timestamps or the data itself. Selectors compose, so you can further narrow a range:

#![allow(unused)]
fn main() {
use hifitime::Epoch;
use tribles::repo::{filter, time_range};

let since = Epoch::from_unix_seconds(1_609_459_200.0); // 2020-12-01
let now = Epoch::now().unwrap();
let recent = ws.checkout(filter(time_range(since, now), |_, payload| {
    payload.iter().any(|t| t.e() == &my_entity)
}))?;
}

Higher level helpers can build on this primitive. For example history_of(entity) filters ancestors(HEAD) to commits touching a specific entity:

#![allow(unused)]
fn main() {
let changes = ws.checkout(history_of(my_entity))?;
}

Git Comparison

The table below summarizes Git's revision grammar. Each row links back to the official documentation. Forms that rely on reflogs or reference objects other than commits are listed for completeness but are unlikely to be implemented.

Git Syntax    | Planned Equivalent        | Reference    | Status
A             | commit(A)                 | gitrevisions | Implemented
A^ / A^N      | nth_parent(A, N)          | gitrevisions | Not planned
A~N           | nth_ancestor(A, N)        | gitrevisions | Implemented
A^@           | parents(A)                | gitrevisions | Implemented
A^!           | A minus parents(A)        | gitrevisions | Unimplemented
A^-N          | A minus nth_parent(A, N)  | gitrevisions | Not planned
A^0           | commit(A)                 | gitrevisions | Implemented
A^{}          | deref_tag(A)              | gitrevisions | Unimplemented
A^{type}      | object_of_type(A, type)   | gitrevisions | Not planned: non-commit object
A^{/text}     | search_from(A, text)      | gitrevisions | Not planned: requires commit message search
:/text        | search_repo(text)         | gitrevisions | Not planned: requires repository search
A:path        | blob_at(A, path)          | gitrevisions | Not planned: selects a blob not a commit
:[N:]path     | index_blob(path, N)       | gitrevisions | Not planned: selects from the index
A..B          | range(A, B)               | gitrevisions | Implemented
A...B         | symmetric_diff(A, B)      | gitrevisions | Implemented
^A            | exclude(reachable(A))     | gitrevisions | Unimplemented
A@{upstream}  | upstream_of(A)            | gitrevisions | Not planned: depends on remote config
A@{push}      | push_target_of(A)         | gitrevisions | Not planned: depends on remote config
A@{N}         | reflog(A, N)              | gitrevisions | Not planned: relies on reflog history
A@{<date>}    | reflog_at(A, date)        | gitrevisions | Not planned: relies on reflog history
@{N}          | reflog(HEAD, N)           | gitrevisions | Not planned: relies on reflog history
@{-N}         | previous_checkout(N)      | gitrevisions | Not planned: relies on reflog history

Only a subset of Git's revision grammar will likely be supported. Selectors relying on reflog history, remote configuration, or searching commits and blobs add complexity with little benefit for workspace checkout. They are listed above for completeness but remain unplanned for now.

TimeRange

Commits record when they were made via a timestamp attribute of type NsTAIInterval. When creating a commit this interval defaults to (now, now) but other tools could provide a wider range if the clock precision is uncertain. The TimeRange selector uses this interval to gather commits whose timestamps fall between two Epoch values:

#![allow(unused)]
fn main() {
use hifitime::Epoch;
use tribles::repo::time_range;

let since = Epoch::from_unix_seconds(1_609_459_200.0); // 2020-12-01
let now = Epoch::now().unwrap();
let tribles = ws.checkout(time_range(since, now))?;
}

This walks the history from HEAD and returns only those commits whose timestamp interval intersects the inclusive range.

Internally it uses filter(ancestors(HEAD), ..) to check each commit's timestamp range.

Garbage Collection and Forgetting

Repositories grow over time as commits, branch metadata and user blobs accumulate. Because each blob is content addressed and immutable, nothing is ever overwritten and there is no automatic reclamation when branches move or objects become orphaned. To keep disk usage in check a repository can periodically forget blobs that are no longer referenced. Forgetting only removes local copies; any blob can be reintroduced later without violating Trible Space's monotonic model. The challenge is deciding which blobs are still reachable without rebuilding every record.

Conservative Reachability

Every commit and branch metadata record is stored as a SimpleArchive. The archive encodes a canonical TribleSet as alternating 32‑byte keys and values. Instead of deserialising the archive, the collector scans the raw bytes in 32‑byte chunks. Each second chunk is treated as a candidate value. If a chunk matches the hash of a blob in the store we assume it is a reference, regardless of the attribute type. With 32‑byte hashes the odds of a random collision are negligible, so the scan may keep extra blobs but will not drop a referenced one.
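
A sketch of that scan over the raw archive bytes; this is illustrative code, not the collector's actual implementation:

#![allow(unused)]
fn main() {
/// Yields every second 32-byte chunk of a SimpleArchive as a candidate
/// blob handle. False positives are possible but harmless; misses are not.
fn candidate_handles(archive: &[u8]) -> impl Iterator<Item = [u8; 32]> + '_ {
    archive
        .chunks_exact(32)
        .skip(1)      // keys and values alternate; values are the odd chunks
        .step_by(2)
        .map(|chunk| chunk.try_into().expect("chunk is 32 bytes"))
}
}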

Traversal Algorithm

  1. Enumerate all branches and load their metadata blobs.
  2. Extract candidate handles from the metadata. This reveals the current commit head along with any other referenced blobs.
  3. Recursively walk the discovered commits and content blobs. Whenever a referenced blob is a SimpleArchive, scan every second 32‑byte segment for further handles instead of deserialising it.
  4. Collect all visited handles into a plain set or list of 32‑byte handles. A keep‑style operation can pass this collection to the blob store and prune everything else without imposing any trible semantics.

Content blobs that are not SimpleArchive instances (for example large binary attachments) act as leaves. They become reachable when some archive references their handle and are otherwise eligible for forgetting.

Future Work

The public API for triggering garbage collection is still open. The blob store could expose a method that retains only a supplied collection of handles, or a helper such as Repository::forget_unreachable() might compute those handles before delegating pruning. A more flexible ReachabilityWalker could also let applications decide how to handle reachable blobs. Whatever interface emerges, conservative reachability by scanning SimpleArchive bytes lays the groundwork for safe space reclamation.

Glossary

Blob: A chunk of binary data addressed by the hash of its contents. Blobs are immutable and can live in memory, on disk, or in remote object stores.

Trible: A three-part tuple of entity, attribute, and value. Tribles are the atomic facts that make up all higher level structures in Trible Space.

Trible Space: The overall storage model that organises tribles across blobs, PATCHes, and repositories. It emphasises immutability and content addressing.

PATCH: A tree-shaped index used to organise tribles for efficient queries. Different permutations of the same trible share leaves to resist skewed data.

Pile: An append-only collection of blobs such as tribles, patches, and other data. Piles can be opened from local files or object stores.

Workspace: A mutable working area for preparing commits. Once a commit is finalised, it becomes immutable like the rest of Trible Space.

Deep Dive

This section contains in-depth discussions about Trible Space internals.

Philosophy

This section collects the more detailed discussions around the design of Trible Space and the reasoning behind certain choices. It is meant as an optional read for the curious.

We prioritise a simple and predictable system over clever heuristics. Each component should be understandable on its own and interact cleanly with the rest of the stack.

Developer experience is equally important. APIs aim to be straightforward and use synchronous building blocks that can be composed as needed.

Finally, we strive for soundness and performance. Safety checks guard against invalid data while efficient data structures keep the core fast.

Identifiers for Distributed Systems

We found it useful to categorize identifiers along two axes:

           | Abstract                | Semantic
Intrinsic  | Hash, Signature, PubKey | embeddings
Extrinsic  | UUID, UFOID, FUCID      | names, DOI, URL

Abstract vs. Semantic Identifiers

  • Semantic Identifiers (e.g., human-readable names, URLs, embeddings)
    These identifiers carry meaning and context about the entity they represent. This can make them useful for human users, as they convey information about the entity without requiring additional lookups. For example, a URL can provide information about the location of a resource, and a human-readable name can provide information about the entity itself. Embeddings are a special case of semantic identifiers, as they represent the content of an entity in a way that can be compared to other entities. Semantic identifiers are also more likely to change over time, as the context of the entity changes. This makes them less useful for identity, as they are not necessarily unique; their strength is to aid interpretation rather than define persistence. To avoid ambiguities and conflicts, or the need for a central authority to manage them, semantic identifiers should always be explicitly scoped to a context, such as a namespace or system environment. This ensures that the same name can coexist in different contexts without collision or confusion. This scoping also addresses social challenges inherent in human-readable names: different users may prefer different names for the same entity. By allowing local names to reference persistent identifiers (extrinsic or intrinsic), each user can adopt their preferred naming conventions while maintaining a shared understanding of the underlying identity.

  • Abstract Identifiers (e.g., UUIDs, UFOIDs, FUCIDs, hashes, signatures)
    These identifiers provide abstract identity without imposing any semantic meaning or cultural connotations. They can be generated cheaply and without coordination, relying on high entropy to make collisions practically impossible, so they address an entity uniquely, globally and persistently, regardless of its content or context. When used to reference entities in a system, abstract identifiers provide a stable and unique identity that is independent of the content or context of the entity. They are particularly useful in distributed systems, where they can be used to address entities across different nodes without requiring a central authority.

Intrinsic vs. Extrinsic Identifiers

  • Intrinsic Identifiers (e.g., hashes, signatures)
    These identifiers provide intrinsic identity by acting as unique fingerprints of the exact content they represent. Unlike abstract identifiers, intrinsic identifiers are directly tied to the data itself, ensuring immutability and self-validation.

    Intrinsic identifiers are generated by applying cryptographic functions to the content. Their entropy requirements are higher than those of abstract identifiers, as they must not only prevent accidental collisions but also withstand adversarial scenarios, such as deliberate attempts to forge data.

  • Extrinsic Identifiers (e.g., human-readable names, URLs, DOIs, UUIDs, UFOIDs, FUCIDs)
    These identifiers provide identity that is not tied to the content itself, but only by association. They reference entities in a system without providing any guarantees about the content or the entity itself, allowing for continuity even as that entity changes or evolves.

Extrinsic identifiers and intrinsic identifiers represent different kinds of metaphysical identity.
For example, in the ship of Theseus thought experiment, both the original ship and the reconstructed ship
would share the same extrinsic identity but have different intrinsic identities.

Embeddings as Semantic Intrinsic Identifiers

Note that embeddings are the somewhat curious case of semantic intrinsic identifiers. They are intrinsic in that they are tied to the content they represent, but they are also semantic in that they carry meaning about the content. Embeddings are used to represent the content of an entity in a way that can be compared to other entities, such as for similarity search or classification. This makes them especially interesting for search and retrieval systems, where they can be used to find similar entities based on a reference entity. But less useful for identity, as they are not necessarily unique.

One thing that makes them especially interesting is that they can be used to compare entities across different systems or contexts, even if the entities themselves are not directly comparable. For example, you could compare the embeddings of a text document and an image to find similar content, even though the two entities are of different types.

Furthermore, they aid in the decentralization and commoditization of search and retrieval systems, as they allow the relatively expensive process of generating embeddings to be decoupled from the indexing and retrieval process. The embeddings can be generated once, in a distributed manner, and then used by any system that needs to compare entities. With embeddings acting as a common language for comparing entities, different systems can interoperate without needing to know about each other's specifics.

By contrast, classic search and retrieval systems require a central authority to index and search the content, as the indexing process is tightly coupled with the indexed data. This makes it difficult to compare entities across different systems, as each system has its own index and retrieval process. It also makes merging indexes virtually impossible, as the indexes are tightly coupled with the structure of the data they index.

High-Entropy Identifiers

For a truly distributed system, the creation of identifiers must avoid the bottlenecks and overhead associated with a central coordinating authority. At the same time, we must ensure that these identifiers are unique.

To guarantee uniqueness, we use abstract identifiers containing a large amount of entropy, making collisions statistically irrelevant. However, the entropy requirements differ based on the type of identifier:

  • Extrinsic abstract identifiers need enough entropy to prevent accidental collisions in normal operation.
  • Intrinsic abstract identifiers must also resist adversarial forging attempts, requiring significantly higher entropy.

From an information-theoretic perspective, the length of an identifier determines the maximum amount of entropy it can encode. For example, a 128-bit identifier can represent 2^128 unique values, which is sufficient to make collisions statistically negligible even for large-scale systems.
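
As a rough sanity check (a standard birthday-bound estimate, not a figure from the crate): with n random identifiers of b bits the collision probability is approximately n^2 / 2^(b+1), so even 10^12 identifiers of 128 bits collide with probability on the order of 10^24 / 2^129 ≈ 10^-15.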

For intrinsic identifiers, 256 bits is widely considered sufficient when modern cryptographic hash functions (e.g., SHA-256) are used. These hash functions provide strong guarantees of collision resistance, preimage resistance, and second-preimage resistance. Even in the event of weaknesses being discovered in a specific algorithm, it is more practical to adopt a new hash function than to increase the bit size of identifiers.

Additionally, future advances such as quantum computing are unlikely to undermine this length. Grover's algorithm would halve the effective security of a 256-bit hash, reducing it to 2^128 operations—still infeasible with current or theoretical technology. As a result, 256 bits remains a future-proof choice for intrinsic identifiers.

Such 256-bit intrinsic identifiers are represented by the types tribles::value::schemas::hash::Hash and tribles::value::schemas::hash::Handle.

Additionally, we define three types of high-entropy abstract identifiers to address different requirements:
RNGID, UFOID, and FUCID. Each balances trade-offs between entropy, locality, compression, and predictability, as summarized below.

Comparison of Identifier Types

               | RNGID | UFOID | FUCID
Entropy        | High  | High  | Low
Locality       | None  | High  | High
Compression    | None  | Low   | High
Predictability | None  | Low   | Mid

Example: Scientific Publishing

Consider the case of published scientific papers. Each artifact, such as a .html or .pdf file, should be identified by its abstract intrinsic identifier, typically a cryptographic hash of its content. This ensures that any two entities referencing the same hash are referring to the exact same version of the artifact, providing immutability and validation.

Across different versions of the same paper, an abstract extrinsic identifier can be used to tie these artifacts together as part of one logical entity. The identifier provides continuity, regardless of changes to the paper’s content over time.

Semantic (human-readable) identifiers, such as abbreviations in citations or bibliographies, are scoped to individual papers and provide context-specific usability for readers. These names do not convey identity but serve as a way for humans to reference the persistent abstract identifiers that underlie the system.

Sadly the identifiers used in practice, such as DOIs, fail to align with these principles and strengths. They attempt to provide global extrinsic semantic identifiers for scientific papers, an ultimately flawed approach. They lack the associated guarantees of intrinsic identifiers and bring all the challenges of semantic identifiers. With their scope defined too broadly, and their authority centralized, they fail to live up to the potential of distributed systems.

ID Ownership

In distributed systems, consistency requires monotonicity due to the CALM principle. However, this is not necessary for single writer systems. By assigning each ID an owner, we ensure that only the current owner can write new information about an entity associated with that ID. This allows for fine-grained synchronization and concurrency control.

To create a transaction, you can uniquely own all entities involved and write new data for them simultaneously. Since there can only be one owner for each ID at any given time, you can be confident that no other information has been written about the entities in question.

By default, all minted ExclusiveIds are associated with the thread they are dropped from. These IDs can be found in queries via the local_ids function.

Ownership and Eventual Consistency

While a simple grow set like the history stored in a Head already constitutes a conflict-free replicated data type (CRDT), it is also limited in expressiveness. To provide richer semantics while guaranteeing conflict-free mergeability we allow only "owned" IDs to be used in the entity position of newly generated tribles. As owned IDs are Send but not Sync, owning a set of them essentially constitutes a single-writer transaction domain, allowing for some non-monotonic operations like if-does-not-exist over the set of contained entities. Note that this does not make operations that would break CALM (consistency as logical monotonicity) safe, e.g. delete.

Trible Structure

The trible module contains the definition of the Trible struct, which is the fundamental unit of knowledge in the knowledge graph. Instances of Trible are stored in TribleSets, which index each trible in various ways, allowing for efficient querying and retrieval of data.

┌────────────────────────────64 byte───────────────────────────┐
┌──────────────┐┌──────────────┐┌──────────────────────────────┐
│  entity-id   ││ attribute-id ││        inlined value         │
└──────────────┘└──────────────┘└──────────────────────────────┘
└────16 byte───┘└────16 byte───┘└────────────32 byte───────────┘
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─▶

On a high level, a trible is a triple consisting of an entity, an attribute, and a value. The entity and attribute are both 128‑bit abstract extrinsic identifiers as described in crate::id, while the value is an arbitrary 256‑bit crate::value::Value. The width of the value is chosen so that it can hold an entire intrinsic identifier, allowing larger payloads to be referenced via blobs without inflating the inlined representation.

Abstract identifiers

Entities and attributes are purely extrinsic; their identifiers do not encode any meaning beyond uniqueness. An entity may accrue additional tribles over time and attributes simply name relationships without prescribing a schema. This keeps the format agnostic to external ontologies and minimises accidental coupling between datasets.

The value slot can carry any 256‑bit payload. Its size is dictated by the need to embed an intrinsic identifier for out‑of‑line data. When a fact exceeds this space the value typically stores a blob handle pointing to the larger payload.

Tribles are stored as a contiguous 64‑byte array with the entity occupying the first 16 bytes, the attribute the next 16 and the value the final 32 bytes. The name "trible" is a portmanteau of triple and byte and is pronounced like "tribble" from Star Trek – hence the project's mascot, Robert the tribble.
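
The layout can be pictured as a plain struct; the field names below are illustrative rather than the crate's actual definition:

#![allow(unused)]
fn main() {
#[repr(C)]
struct RawTrible {
    entity: [u8; 16],    // abstract extrinsic identifier
    attribute: [u8; 16], // abstract extrinsic identifier
    value: [u8; 32],     // inlined value or intrinsic handle to a blob
}

// The fixed layout means a dataset of N tribles occupies exactly N * 64
// bytes of canonical leaf data, before any index overhead.
const _: () = assert!(core::mem::size_of::<RawTrible>() == 64);
}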

Index permutations

TribleSets index each fact under all six permutations of entity (E), attribute (A) and value (V) so any combination of bound variables can be resolved efficiently:

┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐  ┌─────┐
│ EAV │  │ EVA │  │ AEV │  │ AVE │  │ VEA │  │ VAE │
└──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘
   │        │        │        │        │        │
┌───────────────────────────────────────────────────────┐
│            order-specific inner nodes                 │
└───────────────────────────────────────────────────────┘ 
   │        │        │        │        │        │
   ▼        ▼        ▼        ▼        ▼        ▼

┌───────────────────────────────────────────────────────┐
│                   SHARED LEAVES                       │
│     single canonical E–A–V tribles used by all        │
└───────────────────────────────────────────────────────┘

Each permutation has its own inner nodes, but all six share leaf nodes containing the 64‑byte trible. This avoids a naïve six‑fold memory cost while still letting the query planner pick the most selective ordering, keeping joins resistant to skew.

Advantages

  • A total order over tribles enables efficient storage and canonicalisation.
  • Simple byte‑wise segmentation supports indexing and querying without an interning mechanism, keeping memory usage low and parallelisation easy while avoiding the need for garbage collection.
  • Schemas describe the value portion directly, making serialisation and deserialisation straightforward.
  • The fixed 64‑byte layout makes it easy to estimate the physical size of a dataset as a function of the number of tribles stored.
  • The minimalistic design aims to minimise entropy while retaining collision resistance, making it likely that a similar format would emerge through convergent evolution and could serve as a universal data interchange format.
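
As a concrete sizing example, one million tribles account for exactly 64 million bytes of canonical trible data; the order-specific inner nodes of the six indices add overhead on top of that.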

Direction and Consistency

In other triple stores the direction of the edge drawn by a triple is often chosen incidentally, e.g. there is no intrinsic preference for hasColor over colorOf. This can lead to confusion and inconsistency in the graph, as different writers might choose different directions for the same edge. This is typically solved by:

  • Automatically inferring the opposite edge for every edge inserted, as done by OWL and RDF with the inverseOf predicate, which either doubles the number of edges in the graph or pushes the inference to query time.
  • Endless bikeshedding about the "right" direction of edges.

In the tribles crate we solve this problem by giving the direction of the edge an explicit semantic meaning: the direction indicates which entity is making the statement, i.e. which entity is observing the fact or proclaiming the relationship. This is a simple and consistent rule that fits naturally into a distributed system, where each entity is associated with a single writer responsible for the consistency of the facts it asserts.

A different perspective is that edges are always ordered from describing to described entities, with cycles constituting consensus between them.

For example, the edge hasColor is always drawn from the entity that has the color to the entity that represents the color. This makes the direction of the edge a natural consequence of the semantics of the edge, and not an arbitrary choice.

Blobs

Blobs are immutable sequences of bytes used to represent data that does not fit into the fixed 256‑bit value slot of a trible. Each blob is typed by a BlobSchema similar to how values use ValueSchema. This allows structured data to be serialized into a blob while still tracking its schema.

Values and tribles capture small facts in a fixed width, whereas blobs are used "in the large" for documents, media and other sizable payloads. A blob can therefore represent anything from a single file to a complete archive of tribles.

Converting Rust types to blobs is infallible in practice, so only the ToBlob and TryFromBlob traits are widely used. The TryToBlob and FromBlob variants have been dropped to keep the API surface small.

The following example demonstrates creating blobs, archiving a TribleSet and signing its contents:

#![allow(unused)]
fn main() {
use tribles::prelude::*;
use tribles::examples::literature;
use tribles::repo::repo;
use valueschemas::{Handle, Blake3};
use blobschemas::{SimpleArchive, LongString};
use rand::rngs::OsRng;
use ed25519_dalek::{Signature, Signer, SigningKey};

// Build a BlobStore and fill it with some data.
let mut memory_store: MemoryBlobStore<Blake3> = MemoryBlobStore::new();

let book_author_id = fucid();
let quote_a: Value<Handle<Blake3, LongString>> = memory_store
    .put("Deep in the human unconscious is a pervasive need for a logical universe that makes sense. But the real universe is always one step beyond logic.")
    .unwrap();
let quote_b = memory_store
    .put("I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain.")
    .unwrap();

let set = literature::entity!({
   title: "Dune",
   author: &book_author_id,
   quote: quote_a,
   quote: quote_b
});

// Serialize the TribleSet and store it as another blob.
let archived_set_handle: Value<Handle<Blake3, SimpleArchive>> = memory_store.put(&set).unwrap();

let mut csprng = OsRng;
let commit_author_key: SigningKey = SigningKey::generate(&mut csprng);
let signature: Signature = commit_author_key.sign(
    &memory_store
        .reader()
        .get::<Blob<SimpleArchive>, SimpleArchive>(archived_set_handle)
        .unwrap()
        .bytes,
);

// Store the handle in another TribleSet.
let _meta_set = repo::entity!({
   content: archived_set_handle,
   short_message: "Initial commit",
   signed_by: commit_author_key.verifying_key(),
   signature_r: signature,
   signature_s: signature,
});
}

Blobs complement tribles and values by handling large payloads while keeping the core data structures compact. A blob's hash, computed via a chosen HashProtocol, acts as a stable handle that can be embedded into tribles or other blobs, enabling content‑addressed references without copying the payload.

PATCH

The Persistent Adaptive Trie with Cuckoo-compression and Hash-maintenance (PATCH) is the core data structure used for set operations in Trible Space. It stores keys in a compressed 256-ary trie where each node uses a byte oriented cuckoo hash table to map to its children. This single node layout supports anywhere from two to 256 entries, avoiding the complex branching logic of other adaptive tries.

Traditional Adaptive Radix Trees (ART) employ multiple node variants like Node4, Node16 or Node48 to keep memory usage proportional to the number of children. PATCH instead compresses every branch with a byte oriented cuckoo hash table. Each node contains two arrays of candidate slots, and inserts may displace previous entries, as in classic cuckoo hashing. The layout never changes, so we avoid the branching logic and pointer chasing common in ART implementations while still achieving high occupancy.

Our byte table uses two hash functions built from a specialised compressed permutation. The first always uses the identity mapping and the second picks a random bijective byte→byte permutation. The current table size simply masks off the upper bits to compress these results. Doubling the table reveals one more significant bit so entries either stay in place or move to bucket index * 2. When all 256 children exist we disable the random permutation and use the identity for both hashes, turning the full table into a simple array where each byte already occupies its canonical slot.
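
A minimal sketch of the bucket selection, assuming a power-of-two table size: perm here is only a stand-in for the crate's fixed random byte permutation (any bijection over u8 serves for illustration) and the function names are made up.

// Stand-in permutation: multiplying by an odd constant is bijective over u8.
fn perm(byte: u8) -> u8 {
    byte.wrapping_mul(167)
}

fn bucket_candidates(byte: u8, size: usize) -> (usize, usize) {
    debug_assert!(size.is_power_of_two() && size <= 256);
    let mask = size - 1;
    // First hash: the identity, compressed to the current table size.
    let b1 = (byte as usize) & mask;
    // Second hash: the permutation, compressed the same way. At the full size
    // of 256 the permutation is disabled and every byte occupies its canonical slot.
    let b2 = if size == 256 {
        byte as usize
    } else {
        (perm(byte) as usize) & mask
    };
    (b1, b2)
}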

The byte_table_resize_benchmark demonstrates how densely the table can fill before triggering a resize. The benchmark inserts all byte values many times and measures the occupancy that forced each power-of-two table size to grow:

ByteTable resize fill - random: 0.863, sequential: 0.972
Per-size fill (random)
  size   2: 1.000  # path compression keeps two-entry nodes fully occupied
  size   4: 0.973
  size   8: 0.899
  size  16: 0.830
  size  32: 0.749
  size  64: 0.735
  size 128: 0.719
  size 256: 1.000  # identity hash maps all 256 children without resizing
Per-size fill (sequential)
  size   2: 1.000  # path compression keeps two-entry nodes fully occupied
  size   4: 1.000
  size   8: 0.993
  size  16: 1.000
  size  32: 0.928
  size  64: 0.925
  size 128: 0.927
  size 256: 1.000  # identity hash maps all 256 children without resizing

Random inserts average roughly 86% table fill while sequential inserts hold about 97% before doubling the table size. Nodes of size two are always 100% full thanks to path compression, and the final 256-ary node also reaches 100% occupancy because the identity hash maps every byte to its canonical slot; the benchmark now reports these values explicitly instead of showing 0.000. This keeps memory usage predictable without the specialized node formats used by ART.

Hash maintenance

Each node maintains a 256‑bit rolling hash derived from the keys in its subtree. On insert or delete the old influence of a child is XORed out and the new one XORed in, yielding an updated hash in constant time. These hashes allow set operations such as union, intersection and difference to skip entire branches when the hashes differ.
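
The update rule amounts to XOR-ing a per-key hash into a running node hash, as in the sketch below; leaf_hash is a stand-in for whatever 256-bit hash the real implementation derives per key, and the type names are made up.

struct NodeHash([u8; 32]);

// Stand-in for the real per-key hash; any fixed function of the key bytes
// works for illustration.
fn leaf_hash(key: &[u8; 64]) -> [u8; 32] {
    let mut h = [0u8; 32];
    for (i, b) in key.iter().enumerate() {
        h[i % 32] ^= b.rotate_left((i % 8) as u32);
    }
    h
}

impl NodeHash {
    fn xor_in(&mut self, other: &[u8; 32]) {
        for (a, b) in self.0.iter_mut().zip(other) {
            *a ^= *b;
        }
    }

    // Adding a key XORs its influence in; removing it XORs the same influence
    // back out. Either update is constant time regardless of subtree size.
    fn insert_key(&mut self, key: &[u8; 64]) { self.xor_in(&leaf_hash(key)); }
    fn remove_key(&mut self, key: &[u8; 64]) { self.xor_in(&leaf_hash(key)); }
}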

Keys can be viewed in different orders with the KeyOrdering trait and segmented via KeySegmentation to enable prefix based queries. All updates use copy‑on‑write semantics, so cloning a tree is cheap and safe.

Pile Format

The on-disk pile keeps every blob and branch in a single append-only file, giving a simple write-ahead-log-style database: new data is only ever appended, blob and branch storage share one file, and the format stays resilient to crashes. The pile backs local repositories and acts as a durable content-addressed store. Because existing bytes are never mutated, the file can be memory mapped for fast reads and safely shared between threads.

While large databases often avoid mmap due to pitfalls with partial writes and page cache thrashing [1], this design works well for the pile's narrow usage pattern.

Design Rationale

This format emphasizes simplicity over sophisticated on-disk structures. Appending new blobs rather than rewriting existing data keeps corruption windows small and avoids complicated page management. Storing everything in a single file makes a pile easy to back up or replicate over simple transports while still allowing it to be memory mapped for fast reads. The 64 byte alignment ensures each entry begins on a cache line boundary, which improves concurrent access patterns and allows safe typed views with the zerocopy crate.

Hash verification only happens when blobs are read. Opening even a very large pile is therefore fast while still catching corruption before data is used.

Every record begins with a 16 byte magic marker that identifies whether it stores a blob or a branch. The sections below illustrate the layout of each type.

Usage

A pile typically lives as a .pile file on disk. Repositories open it through Pile::open which performs any necessary recovery and returns a handle for appending new blobs or branches. Multiple threads may share the same handle thanks to internal synchronisation, making a pile a convenient durable store for local development.

Blob Storage

                             8 byte  8 byte
            ┌────16 byte───┐┌──────┐┌──────┐┌────────────32 byte───────────┐
          ┌ ┌──────────────┐┌──────┐┌──────┐┌──────────────────────────────┐
 header   │ │magic number A││ time ││length││             hash             │
          └ └──────────────┘└──────┘└──────┘└──────────────────────────────┘
            ┌────────────────────────────64 byte───────────────────────────┐
          ┌ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐
          │ │                                                              │
 payload  │ │              bytes (64byte aligned and padded)               │
          │ │                                                              │
          └ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┘

Each blob entry records its creation timestamp, the length of the payload and its hash. The payload is padded so the next record begins on a 64 byte boundary.
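
A hedged sketch of that header, using made-up type names; the actual on-disk types live in the pile module and are defined with the zerocopy crate.

#[repr(C)]
struct BlobHeader {
    magic: [u8; 16], // magic number A, marks the record as a blob
    timestamp: u64,  // creation time
    length: u64,     // payload length in bytes, before padding
    hash: [u8; 32],  // hash of the payload
}

// The payload is padded so the next record starts on a 64-byte boundary.
fn padded_len(length: u64) -> u64 {
    (length + 63) & !63
}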

Branch Storage

            ┌────16 byte───┐┌────16 byte───┐┌────────────32 byte───────────┐
          ┌ ┌──────────────┐┌──────────────┐┌──────────────────────────────┐
 header   │ │magic number B││  branch id   ││             hash             │
          └ └──────────────┘└──────────────┘└──────────────────────────────┘

Branch entries map a branch identifier to the hash of a blob.
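
The branch header has the same 64-byte footprint; the type name below is again made up for illustration.

#[repr(C)]
struct BranchHeader {
    magic: [u8; 16],     // magic number B, marks the record as a branch update
    branch_id: [u8; 16], // identifier of the branch being updated
    hash: [u8; 32],      // hash of the blob the branch now points to
}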

Recovery

When [Pile::try_open] scans an existing file it checks that every header uses a known marker and that the whole record fits. It does not verify any hashes. If a truncated or unknown block is found the function reports the number of bytes that were valid so far using [OpenError::CorruptPile].

The convenience wrapper [Pile::open] re-runs the same validation and truncates the file to the valid length if corruption is encountered. This recovers from interrupted writes by discarding incomplete data. Hash verification happens lazily only when individual blobs are loaded so that opening a large pile remains fast.
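
The scan can be sketched roughly as follows. The marker constants, field offsets and endianness are placeholders for illustration, and the real implementation reports the valid length through OpenError::CorruptPile rather than a bare offset.

// Placeholder markers; the real 16-byte magic values differ.
const MAGIC_BLOB: [u8; 16] = *b"hypothetical-A!!";
const MAGIC_BRANCH: [u8; 16] = *b"hypothetical-B!!";

fn padded(len: usize) -> usize {
    (len + 63) & !63
}

// Returns Ok(total) if every record parses, or Err(valid_bytes) at the first
// unknown or truncated record. No hashes are checked during this pass.
fn valid_prefix(file: &[u8]) -> Result<usize, usize> {
    let mut offset = 0;
    while offset < file.len() {
        let Some(header) = file.get(offset..offset + 64) else {
            return Err(offset); // truncated header
        };
        let record_len = if header[..16] == MAGIC_BLOB {
            // length field assumed little-endian at bytes 24..32 in this sketch
            let length = u64::from_le_bytes(header[24..32].try_into().unwrap());
            64 + padded(length as usize)
        } else if header[..16] == MAGIC_BRANCH {
            64
        } else {
            return Err(offset); // unknown marker
        };
        if offset + record_len > file.len() {
            return Err(offset); // record does not fit
        }
        offset += record_len;
    }
    Ok(offset)
}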

For more details on interacting with a pile see the Pile struct documentation.

Pile Blob Metadata

The pile file already stores a timestamp and length in each blob header, but this information is discarded when building the in-memory index. Clients therefore cannot query when a blob was added or how large it is without re-parsing the file.

Proposed Changes

  • Extend IndexEntry in src/repo/pile.rs with a timestamp field. The length can be determined from the stored bytes when needed.
  • Introduce a public BlobMetadata struct containing timestamp and length so callers do not depend on internal types.
  • Populate the timestamp when Pile::try_open scans existing entries and when inserting new blobs. Lengths are computed on demand.
  • Add PileReader::metadata(&self, handle) to retrieve a blob's metadata if it exists. Iterators may later be extended to yield this information alongside the blob itself.

This approach keeps the current API intact while making useful details available for replication and debugging tools.
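
A minimal sketch of the proposed surface; nothing here exists in the crate yet, and the stand-in types are only there to keep the example self-contained.

// Stand-ins so the sketch compiles on its own; the real PileReader and blob
// handle types already live in src/repo/pile.rs.
pub struct PileReader;
pub struct BlobHandle;

/// Proposed metadata exposed per blob, mirroring the fields in the blob header.
pub struct BlobMetadata {
    pub timestamp: u64, // creation time recorded when the blob was appended
    pub length: u64,    // payload length in bytes, computed on demand
}

impl PileReader {
    /// Proposed accessor: returns the metadata for a stored blob, or None if
    /// the handle is unknown to this pile.
    pub fn metadata(&self, _handle: &BlobHandle) -> Option<BlobMetadata> {
        todo!("look up the index entry and derive the length from the stored bytes")
    }
}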