Introduction
Welcome to the Tribles Book. This guide provides a gentle introduction to Trible Space, its design goals and how to use the tribles crate. The aim is to present a clear narrative for newcomers while linking to in‑depth reference material for those who want to dig deeper.
Trible Space combines ideas from databases and version control. Data is stored in small immutable blobs that can live in memory, on disk or inside remote object stores without conversion. Cryptographic identifiers ensure integrity and enable efficient sharing across systems.
What makes Trible Space stand out?
- Content‑addressed immutability – values are stored by the hash of their contents, making them easy to verify and share.
- Lightweight queries – work over unordered collections like hash maps, trible sets or custom full‑text/semantic indexes; you can embed domain‑specific DSLs and they run fast enough to use like everyday language constructs.
- Repository workflows – history and collaboration follow familiar version‑control patterns.
The opening chapters introduce these ideas at a gentle pace. Later sections dive into architecture, query semantics, schemas and commit selectors so you can build and reason about complex data sets.
While this book walks you through the basics, the crate documentation offers a complete API reference. Use the links throughout the text to jump directly to the modules that interest you. A Glossary is provided for quick reference to common terms.
If you would like to work on Tribles yourself, check out Developing Locally.
Getting Started
First add the required crates to your project:
cargo add tribles ed25519-dalek rand
This example uses ed25519-dalek to generate a signing key and rand for randomness.
Next create a simple repository and commit some data:
use tribles::prelude::*;
use tribles::examples::literature;
use tribles::repo::Repository;
use std::path::Path;
use ed25519_dalek::SigningKey;
use rand::rngs::OsRng;
const MAX_PILE_SIZE: usize = 1 << 20;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pile: Pile<MAX_PILE_SIZE> = Pile::open(Path::new("example.pile"))?;
    let mut repo = Repository::new(pile, SigningKey::generate(&mut OsRng));
    let mut ws = repo.branch("main")?;
    ws.commit(literature::entity!(&ufoid(), { firstname: "Alice" }), None);
    repo.push(&mut ws)?;
    Ok(())
}
Running this program with cargo run creates an example.pile file in the current directory and pushes a single entity to the main branch.
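To read the data back you can pull the branch into a fresh workspace and check out its history. The following is a hedged sketch that assumes the pull, head and ancestors APIs described later in the Repository Workflows and Commit Selectors chapters:

// Hedged continuation of the example above: pull the branch we just pushed and
// materialise its full history as a TribleSet.
let mut ws2 = repo.pull(ws.branch_id())?;
let _content = ws2.checkout(ancestors(ws2.head().expect("branch has a head")))?;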
See the crate documentation for additional modules and examples.
Developing Locally
To build and test Tribles yourself you need a recent Rust toolchain. Install it from rustup.rs.
The repository provides helper scripts for common workflows:
- ./scripts/preflight.sh formats the code, runs the full test suite and rebuilds this book. Run it before committing.
- ./scripts/devtest.sh executes only the tests for faster feedback during development.
- ./scripts/build_book.sh rebuilds the documentation once mdbook is installed with cargo install mdbook.
These scripts keep the project formatted, tested and documented while you iterate locally.
Architecture Overview
Trible Space is designed to keep data management simple, safe and fast. The README introduces these goals in more detail, emphasizing a lean design with predictable performance and straightforward developer experience. This chapter explains how the pieces fit together and why they are organised this way.
Design Goals
A full discussion of the motivation behind Trible Space can be found in the Philosophy section. At a high level we want a self‑contained data store that offers:
- Simplicity – minimal moving parts and predictable behaviour.
- Developer Experience – a clear API that avoids complex servers or background processes.
- Safety and Performance – sound data structures backed by efficient content addressed blobs.
These goals grew out of earlier "semantic" technologies that attempted to model knowledge as graphs. While systems like RDF promised great flexibility, in practice they often became difficult to host, query and synchronise. Trible Space keeps the idea of describing the world with simple statements but stores them in a form that is easy to exchange and reason about.
Data Model
The fundamental unit of information is a Trible. Its 64 byte layout is described in Trible Structure. A Trible links a subject entity to an attribute and value. Multiple tribles are stored in a TribleSet, which behaves like a hashmap with three columns: subject, attribute and value.
Trible Sets
TribleSets provide fast querying and cheap copy‑on‑write semantics. They can be merged, diffed and searched entirely in memory. When durability is needed the set is serialised into a blob and tracked by the repository layer.
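The following minimal sketch shows this in-memory use, assuming the literature example namespace from the crate's examples module and that += merges one TribleSet into another as it does in the later examples:

use tribles::prelude::*;
use tribles::examples::literature;

// Describe an author and a book in two separate sets.
let author = ufoid();
let book = ufoid();

let mut facts = TribleSet::new();
facts += literature::entity!(&author, { firstname: "Frank", lastname: "Herbert" });

let mut more = TribleSet::new();
more += literature::entity!(&book, { title: "Dune", author: &author });

// Merging is cheap thanks to copy-on-write structural sharing; the combined
// set can later be serialised into a blob by the repository layer.
facts += more;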
To keep joins skew‑resistant, each set maintains all six orderings of entity, attribute and value. The trees reuse the same leaf nodes so a trible is stored only once, avoiding a naïve six‑fold memory cost while still letting the query planner pick the most selective permutation.
Blob Storage
All persistent data lives in a BlobStore. Each blob is addressed by the hash of its contents, so identical data occupies space only once and readers can verify integrity by recomputing the hash. The trait exposes simple get and put operations, leaving caching and eviction strategies to the backend. Implementations decide where bytes reside: an in‑memory MemoryBlobStore, an on‑disk Pile described in Pile Format or a remote object store. Because handles are just 32‑byte hashes, repositories can copy or cache blobs without coordination. Trible sets, user blobs and commit records all share this mechanism.
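A minimal sketch of this content addressing, reusing the MemoryBlobStore and LongString schema that appear in the Blobs chapter (the equality check assumes handles compare by their hash):

use tribles::prelude::*;
use valueschemas::{Blake3, Handle};
use blobschemas::LongString;

let mut store: MemoryBlobStore<Blake3> = MemoryBlobStore::new();

// `put` hashes the bytes and returns a 32-byte handle to the stored blob.
let first: Value<Handle<Blake3, LongString>> = store.put("hello world").unwrap();

// Storing identical content again yields the same handle, so the data is
// deduplicated and any reader can verify it by recomputing the hash.
let second: Value<Handle<Blake3, LongString>> = store.put("hello world").unwrap();
assert_eq!(first, second);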
Branch Store
A BranchStore keeps track of the tips of each branch. Updates use a simple compare‑and‑set operation so concurrent writers detect conflicts. Both the in‑memory and pile repositories implement this trait.
Because only this single operation mutates repository state, nearly all other logic is value oriented and immutable. Conflicts surface only at the branch store update step, which simplifies concurrent use and reasoning about changes.
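The contract can be pictured with a toy compare‑and‑set function; this is purely illustrative and not the crate's BranchStore API:

// Illustrative only: move the branch tip from `expected` to `new`, failing if
// another writer already advanced it.
fn compare_and_set(
    tip: &mut Option<[u8; 32]>,
    expected: Option<[u8; 32]>,
    new: [u8; 32],
) -> Result<(), Option<[u8; 32]>> {
    if *tip == expected {
        *tip = Some(new);
        Ok(())
    } else {
        // The caller sees the tip that won, merges its work and retries.
        Err(*tip)
    }
}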
Repository
The Repository combines a blob store with a branch store. Commits store a trible set blob along with a parent link and signature. Because everything is content addressed, multiple repositories can share blobs or synchronize through a basic file copy.
Workspaces
A Workspace represents mutable state during editing. Checking out or branching yields a workspace backed by a fresh MemoryBlobStore. Commits are created locally and only become visible to others when pushed, as described in Repository Workflows.
Commits and History
TribleSets written to blobs form immutable commits. Each commit references its parent, creating an append‑only chain signed by the author. This is the durable history shared between repositories.
Putting It Together
+-----------------------------------------------------------+
| Repository |
| +---------------------+ +----------------------------+ |
| | BlobStore | | BranchStore | |
| +---------------------+ +----------------------------+ |
+-----------------------------------------------------------+
^ pull | push
| v
+-----------------------------------------------------------+
| Workspace |
| +---------------------+ +----------------------------+ |
| | MemoryBlobStore | | TribleSet | |
| +---------------------+ +----------------------------+ |
+-----------------------------------------------------------+
|
| commit/add_blob
v
TribleSet blobs
Repositories persist blobs and branch metadata while workspaces stage changes before pushing them. Because every blob is addressed by its hash, repositories can safely share data through any common storage without coordination.
Query Engine
Queries describe the patterns you want to retrieve. The engine favors extreme simplicity and aims for predictable latency and skew resistance without any tuning. Every constraint implements the Constraint trait so operators, sub-languages and even alternative data sources can compose cleanly.
Queries as Schemas
You might notice that trible.space does not define a global ontology or schema beyond associating attributes with a ValueSchema or BlobSchema. This is deliberate. The semantic web taught us that per-value typing, while desirable, was awkward in RDF: literal datatypes are optional, custom types need globally scoped IRIs and there is no enforcement, so most data degenerates into untyped strings. Trying to regain structure through global ontologies and class hierarchies made schemas rigid and reasoning computationally infeasible. Real-world data often arrives with missing, duplicate or additional fields, which clashes with these global, class-based constraints.
Our approach is to be sympathetic to edge cases and have the system deal only with the data it declares capable of handling. These application-specific schema declarations are exactly the shapes and constraints expressed by our queries¹. Data not conforming to these queries is simply ignored by definition, as a query only returns data satisfying its constraints.²
Join Strategy
The query engine uses the Atreides family of worst-case optimal join algorithms. These algorithms leverage cardinality estimates to guide a depth-first search over variable bindings, providing skew-resistant and predictable performance. For a detailed discussion, see the Atreides Join chapter.
Query Languages
Instead of a single query language, the engine exposes small composable constraints that combine with logical operators such as and! and or!. These constraints are simple yet flexible, enabling a wide variety of operators while still allowing the engine to explore the search space efficiently.
The query engine and data model are flexible enough to support many query styles, including graph, relational and document-oriented queries.
For example, the namespace module offers macros that generate constraints for a given trible pattern in a query-by-example style reminiscent of SPARQL or GraphQL but tailored to a document-graph data model. It would also be possible to layer a property-graph language like Cypher or a relational language like Datalog on top of the engine.³
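As a hedged sketch of that query-by-example style, the following assumes the literature example namespace and a pattern! macro that accepts the same clause syntax as the pattern_changes! macro shown in the Incremental Queries chapter:

use tribles::prelude::*;
use tribles::examples::{self, literature};

let dataset = examples::dataset();

// Join author entities that carry a firstname with the books that reference them.
let results: Vec<_> = find!(
    (author: Value<_>, book: Value<_>, title: Value<_>, name: Value<_>),
    literature::pattern!(&dataset, [
        { author @ firstname: name },
        { book @ author: author, title: title }
    ])
)
.collect();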
Great care has been taken to ensure that query languages with different styles and semantics can coexist and even be mixed with other languages and data models within the same query. For practical examples of the current facilities, see the Query Language chapter.
¹ Note that this query-schema isomorphism isn't necessarily true in all databases or query languages, e.g., it does not hold for SQL.
² In RDF terminology: We challenge the classical A-Box & T-Box dichotomy by replacing the T-Box with a "Q-Box", which is descriptive and open rather than prescriptive and closed. This Q-Box naturally evolves with new and changing requirements, contexts and applications.
³ SQL would be a bit more challenging, as it is surprisingly imperative with its explicit JOINs and ORDER BYs, and its lack of a clear declarative semantics. This makes it harder to implement on top of a constraint-based query engine tailored towards a more declarative and functional style.
The Atreides Family of Worst-case Optimal Join Algorithms
The heart of the system is a constraint-solving approach based on the theory of worst-case optimal joins. Queries are represented as sets of constraints over variables, and the join algorithm explores this search space by binding one variable at a time.
Size estimates, normally used only by query optimizers, directly guide the join. Instead of building explicit join plans, the engine asks each constraint for cardinality bounds and chooses the next variable with the smallest estimated result. Traditional databases rely on sorted indexes to efficiently find corresponding values. Atreides still uses random lookups when confirming each candidate, but the bounds let it enumerate the most specific constraint sequentially and probe only a few possibilities in the larger ones, offering similar pruning power without maintaining sorted indexes everywhere.
Maintaining accurate estimates is crucial. We therefore provide data structures that update cardinalities in O(1) time so the algorithm can adapt quickly as bindings accumulate.
We currently expose four variants of the Atreides family, each with a descriptive name and a playful Dune nickname. They differ in how much information they consider when predicting the remaining search space:
- Row-count Join (Jessica) – uses per-constraint row counts to pick the variable with the smallest number of matching rows.
- Distinct-value Join (Paul) – estimates the smallest number of distinct values for the variable across one column.
- Partial-binding Join (Ghanima) – refines the estimate using the current partial binding but ignores yet-to-be-bound variables.
- Exact-result Join (Leto) – the ideal algorithm that knows the exact result size even for unbound variables.
Each variant trades complexity for precision. More accurate estimates let the engine prune failing paths earlier.
For example, given constraints that relate ?person to ?parent and ?city, the Distinct-value Join (Paul) will bind whichever variable has the fewest distinct candidates. If ?city only has a handful of possibilities, the search tries them first and backtracks quickly when they fail.
The join proceeds as a depth-first search. At every step the engine binds a value, evaluates the affected constraints, and backtracks when a constraint reports no matches. Because estimates come directly from the constraints, the engine avoids complex query planning and remains resistant to data skew — even wildly imbalanced cardinalities do not degrade performance.
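The search loop can be sketched as follows; the Var, Binding and Constraint types here are hypothetical stand-ins for illustration, not the crate's actual interfaces:

use std::collections::HashMap;

type Var = usize;
type Val = u64;
type Binding = HashMap<Var, Val>;

trait Constraint {
    /// Upper bound on how many values `var` could take under `binding`.
    fn estimate(&self, var: Var, binding: &Binding) -> usize;
    /// Candidate values for `var` under `binding`.
    fn propose(&self, var: Var, binding: &Binding) -> Vec<Val>;
    /// Is the (partial) binding still consistent with this constraint?
    fn satisfied(&self, binding: &Binding) -> bool;
}

fn solve(vars: &[Var], binding: &mut Binding, cs: &[Box<dyn Constraint>], out: &mut Vec<Binding>) {
    // Pick the unbound variable with the smallest estimated number of candidates.
    let Some(&next) = vars
        .iter()
        .filter(|&&v| !binding.contains_key(&v))
        .min_by_key(|&&v| {
            cs.iter()
                .map(|c| c.estimate(v, binding))
                .min()
                .unwrap_or(usize::MAX)
        })
    else {
        // Every variable is bound: record a solution.
        out.push(binding.clone());
        return;
    };

    // Enumerate candidates from the most specific constraint and probe the rest.
    let proposer = cs
        .iter()
        .min_by_key(|c| c.estimate(next, binding))
        .expect("at least one constraint");
    for val in proposer.propose(next, binding) {
        binding.insert(next, val);
        if cs.iter().all(|c| c.satisfied(binding)) {
            solve(vars, binding, cs, out);
        }
        // Backtrack and try the next candidate.
        binding.remove(&next);
    }
}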
Query Language
This chapter introduces the basic query facilities provided by tribles. Queries are declared in a small declarative language that describes which values should match rather than how to iterate them.
The find! macro builds a Query by declaring variables and a constraint expression. A minimal invocation looks like:
let results = find!((a), a.is(1.into())).collect::<Vec<_>>();
find! returns an Iterator over tuples of the bound variables. Matches can be consumed lazily or collected into common collections:
for (a,) in find!((a), a.is(1.into())) {
    println!("match: {a}");
}
Variables can optionally specify a concrete type to convert the underlying value:
find!((x: i32, y: Value<ShortString>), and!(x.is(1.into()), y.is("foo".to_value())))
The first variable is read as an i32 and the second as a short string if the conversion succeeds. The query engine walks all possible assignments that satisfy the constraint and yields tuples of the declared variables.
Built-in constraints
find! queries combine a small set of constraint operators to form a declarative language for matching tribles:
- and! builds an IntersectionConstraint requiring all sub-constraints to hold.
- or! constructs a UnionConstraint accepting any satisfied alternative.
- ignore! tells the query engine to ignore variables in a sub-query. Constraints mentioning ignored variables are evaluated without checking those positions, so their bindings neither join with surrounding constraints nor appear in the result set.
- Collection types such as TribleSet provide a has method yielding a ContainsConstraint for membership tests.
Any data structure that can iterate its contents, test membership and report its size can implement ContainsConstraint, so queries have no inherent ordering requirements.
Ignored variables are handy when a sub-expression references fields you want to drop. The engine skips checking them entirely, effectively treating the positions as wildcards. If the underlying constraint guarantees some value, ignoring works like existential quantification; otherwise the ignored portion is simply discarded. Without ignoring, those variables would leak into the outer scope and either appear in the results or unintentionally join with other constraints.
Alternatives are expressed with or! and temporary variables can be hidden with ignore!:
find!((x), or!(x.is(1.into()), x.is(2.into())));
find!((x), ignore!((y), and!(x.is(1.into()), y.is(2.into()))));
In the second query y is ignored entirely (the engine never checks the y.is(2.into()) part), so the outer query only enforces x.is(1.into()) regardless of whether any y equals 2.
Example
use tribles::prelude::*;
use tribles::examples::{self, literature};

let dataset = examples::dataset();

for (title,) in find!((title: Value<_>),
    and!(dataset.has(title), title.is("Dune".to_value())))
{
    println!("Found {}", title.from_value::<&str>());
}
This query searches the example dataset for the book titled "Dune". The variables and constraint can be adapted to express more complex joins and filters.
matches!
Sometimes you only want to check whether a constraint has any solutions. The matches! macro mirrors the find! syntax but returns a boolean:
use tribles::prelude::*;

assert!(matches!((x), x.is(1.into())));
assert!(!matches!((x), and!(x.is(1.into()), x.is(2.into()))));
Custom constraints
Every building block implements the Constraint trait. You can implement this trait on your own types to integrate custom data sources or query operators with the solver.
Regular path queries
The path! macro lets you search for graph paths matching a regular expression over edge attributes. It expands to a RegularPathConstraint and can be combined with other constraints. Invoke it through a namespace module (social::path!) to implicitly resolve attribute names:
use tribles::prelude::*;

NS! {
    namespace social {
        "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" as follows: GenId;
        "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB" as likes: GenId;
    }
}

let mut kb = TribleSet::new();
let a = fucid();
let b = fucid();
let c = fucid();
kb += social::entity!(&a, { follows: &b });
kb += social::entity!(&b, { likes: &c });

let results: Vec<_> = find!((s: Value<_>, e: Value<_>),
    social::path!(&kb, s (follows | likes)+ e)).collect();
The middle section uses a familiar regex syntax to describe allowed edge sequences. Editors with Rust macro expansion support provide highlighting and validation of the regular expression at compile time. Paths reference attributes from a single namespace; to traverse edges across multiple namespaces, create a new namespace that re-exports the desired attributes and invoke path! through it.
Incremental Queries
The query engine normally evaluates a pattern against a complete TribleSet, recomputing every match from scratch. Applications that ingest data continuously often only need to know which results are introduced by new tribles. Tribles supports this with semi‑naive evaluation, a classic incremental query technique.
Delta evaluation
Given a base dataset and a set of newly inserted tribles, the engine runs the original query multiple times. Each run restricts a different triple constraint to the delta while the remaining constraints see the full set. The union of these runs yields exactly the new solutions. The process is:
- compute a delta TribleSet containing only the inserted tribles,
- for every triple in the query, evaluate a variant where that triple matches against delta,
- union all per‑triple results to obtain the incremental answers.
Because each variant touches only one triple from the delta, the work grows with the number of constraints and the size of the delta set rather than the size of the full dataset.
Monotonicity and CALM
Removed results are not tracked. Tribles follow the CALM principle: a program whose outputs are monotonic in its inputs needs no coordination. Updates simply add new facts and previously derived conclusions remain valid. When conflicting information arises, applications append fresh tribles describing their preferred view instead of retracting old ones. Stores may forget obsolete data, but semantically tribles are never deleted.
Example
Namespaces provide a pattern_changes! operator to express these delta queries. It behaves like pattern! but takes the current TribleSet and a precomputed changeset. The macro unions variants of the query where each triple is constrained to that changeset, matching only the newly inserted tribles. Combined with the union constraint, this lets us run incremental updates using the familiar find! interface.
let storage = MemoryRepo::default();
let mut repo = Repository::new(storage, SigningKey::generate(&mut OsRng));
let mut ws = repo.branch("main").expect("branch");

// Commit initial data
let shakespeare = ufoid();
let hamlet = ufoid();
let mut base = TribleSet::new();
base += literature::entity!(&shakespeare, { firstname: "William", lastname: "Shakespeare" });
base += literature::entity!(&hamlet, { title: "Hamlet", author: &shakespeare });
ws.commit(base.clone(), None);
let c1 = ws.head().unwrap();

// Commit a new book
let macbeth = ufoid();
let mut change = TribleSet::new();
change += literature::entity!(&macbeth, { title: "Macbeth", author: &shakespeare });
ws.commit(change.clone(), None);
let c2 = ws.head().unwrap();

// Compute updated state and delta between commits
let base_state = ws.checkout(c1).expect("base");
let updated = ws.checkout(c2).expect("updated");
let delta = updated.difference(&base_state);

// Find new titles by Shakespeare
let results: Vec<_> = find!(
    (author: Value<_>, book: Value<_>, title: Value<_>),
    literature::pattern_changes!(&updated, &delta, [
        { author @ firstname: ("William"), lastname: ("Shakespeare") },
        { book @ author: author, title: title }
    ])
)
.map(|(_, b, t)| (b, t))
.collect();

println!("{results:?}");
Comparing history points
Workspace::checkout accepts commit selectors which can describe ranges in repository history. By checking out a range like a..b we obtain exactly the tribles introduced by commits reachable from b but not from a. Feeding that delta into pattern_changes! lets us ask, "What new matches did commit b introduce over a?"
Trade‑offs
- Applications must compute and supply the delta set; the engine does not track changes automatically.
- Queries must remain monotonic since deletions are ignored.
- Each triple incurs an extra variant, so highly selective constraints keep incremental evaluation efficient.
Schemas
Trible Space stores data in strongly typed values and blobs. A schema defines the language‑agnostic byte layout for these types: Values always occupy exactly 32 bytes while Blobs may be any length. Schemas translate those raw bytes to concrete application types and decouple persisted data from a particular implementation. This separation lets you refactor to new libraries or frameworks without rewriting what's already stored. The crate ships with a collection of ready‑made schemas located in src/value/schemas and src/blob/schemas.
Why 32 bytes?
Storing arbitrary Rust types requires a portable representation. Instead of human‑readable identifiers like RDF's URIs, Tribles uses a fixed 32‑byte array for all values. This size provides enough entropy to embed intrinsic identifiers—typically cryptographic hashes—when a value references data stored elsewhere in a blob. Keeping the width constant avoids platform‑specific encoding concerns and makes it easy to reason about memory usage.
Conversion traits
Schemas define how to convert between raw bytes and concrete Rust types. The conversion traits ToValue/FromValue and their fallible counterparts live on the schema types rather than on Value itself, avoiding orphan‑rule issues when supporting external data types. The Value wrapper treats its bytes as opaque; schemas may validate them or reject invalid patterns during conversion.
Schema identifiers
Every schema declares a unique 128‑bit identifier such as VALUE_SCHEMA_ID (and optionally BLOB_SCHEMA_ID for blob handles). Persisting these IDs allows applications to look up the appropriate schema at runtime, even when they were built against different code. The schema_id method on Value and Blob returns the identifier so callers can dispatch to the correct conversion logic.
Built‑in value schemas
The crate provides the following value schemas out of the box:
- GenId – an abstract 128 bit identifier.
- ShortString – a UTF-8 string up to 32 bytes.
- U256BE/U256LE – 256-bit unsigned integers.
- I256BE/I256LE – 256-bit signed integers.
- R256BE/R256LE – 256-bit rational numbers.
- F256BE/F256LE – 256-bit floating point numbers.
- Hash and Handle – cryptographic digests and blob handles (see hash.rs).
- ED25519RComponent, ED25519SComponent and ED25519PublicKey – signature fields and keys.
- NsTAIInterval – encodes time intervals.
- UnknownValue – a fallback when no specific schema is known.
use tribles::value::schemas::shortstring::ShortString;
use tribles::value::{ToValue, ValueSchema};

let v = "hi".to_value::<ShortString>();
assert_eq!(v.schema_id(), ShortString::VALUE_SCHEMA_ID);
Built‑in blob schemas
The crate also ships with these blob schemas:
- LongString – arbitrarily long UTF‑8 strings.
- SimpleArchive – stores a raw sequence of tribles.
- SuccinctArchive – provides a compressed index for offline queries.
- UnknownBlob – data of unknown type.
use tribles::blob::schemas::longstring::LongString;
use tribles::blob::{ToBlob, BlobSchema};

let b = "example".to_blob::<LongString>();
assert_eq!(LongString::BLOB_SCHEMA_ID, b.schema_id());
Defining new schemas
Custom formats implement ValueSchema or BlobSchema. A unique identifier serves as the schema ID. The example below defines a little-endian u64 value schema and a simple blob schema for arbitrary bytes.
pub struct U64LE;

impl ValueSchema for U64LE {
    const VALUE_SCHEMA_ID: Id = id_hex!("0A0A0A0A0A0A0A0A0A0A0A0A0A0A0A0A");
    type ValidationError = Infallible;
}

impl ToValue<U64LE> for u64 {
    fn to_value(self) -> Value<U64LE> {
        let mut raw = [0u8; VALUE_LEN];
        raw[..8].copy_from_slice(&self.to_le_bytes());
        Value::new(raw)
    }
}

impl FromValue<'_, U64LE> for u64 {
    fn from_value(v: &Value<U64LE>) -> Self {
        u64::from_le_bytes(v.raw[..8].try_into().unwrap())
    }
}

pub struct BytesBlob;

impl BlobSchema for BytesBlob {
    const BLOB_SCHEMA_ID: Id = id_hex!("B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0");
}

impl ToBlob<BytesBlob> for Bytes {
    fn to_blob(self) -> Blob<BytesBlob> {
        Blob::new(self)
    }
}

impl TryFromBlob<BytesBlob> for Bytes {
    type Error = Infallible;
    fn try_from_blob(b: Blob<BytesBlob>) -> Result<Self, Self::Error> {
        Ok(b.bytes)
    }
}
See examples/custom_schema.rs for the full source.
Repository Workflows
Tribles borrows much of its vocabulary from Git:
- Repository – top-level object that tracks history through a BlobStore and BranchStore.
- Workspace – mutable view of a branch, similar to Git's working directory and index combined.
- BlobStore – stores commits and blobs.
- BranchStore – records branch metadata.
Both stores can be in memory, on disk or backed by a remote service. The examples in examples/repo.rs and examples/workspace.rs showcase this API and should feel familiar to anyone comfortable with Git.
Branching
A branch records a line of history. Creating one writes initial metadata to the underlying store and yields a Workspace pointing at that branch. Typical steps look like:
- Create a repository backed by blob and branch stores.
- Open or create a branch to obtain a Workspace.
. - Commit changes in the workspace.
- Push the workspace to publish those commits.
While Repository::branch is a convenient way to start a fresh branch, most workflows use Repository::pull to obtain a workspace for an existing branch:
let mut repo = Repository::new(pile, SigningKey::generate(&mut OsRng));
let mut ws = repo.branch("main").expect("create branch");
let mut ws2 = repo.pull(ws.branch_id()).expect("open branch");
After committing changes you can push the workspace back:
ws.commit(change, Some("initial commit"));
repo.push(&mut ws)?;
Inspecting History
You can explore previous commits using Workspace::checkout which returns a TribleSet with the union of the specified commit contents. Passing a single commit returns just that commit. To include its history you can use the ancestors helper. Commit ranges are supported for convenience. The expression a..b yields every commit reachable from b that is not reachable from a, treating missing endpoints as empty (..b) or the current HEAD (a.. and ..):
let history = ws.checkout(commit_a..commit_b)?;
let full = ws.checkout(ancestors(commit_b))?;
The history_of helper builds on the filter selector to retrieve only the commits affecting a specific entity. Commit selectors are covered in more detail in the next chapter:
let entity_changes = ws.checkout(history_of(my_entity))?;
Merging and Conflict Handling
When pushing a workspace another client might have already updated the branch. Repository::push attempts to update the branch atomically and returns an optional conflicting Workspace if the head moved. The usual loop looks like:
while let Some(mut incoming) = repo.push(&mut ws)? {
    incoming.merge(&mut ws)?;
    ws = incoming;
}
This snippet is taken from examples/workspace.rs. The examples/repo.rs example demonstrates the same pattern with two separate workspaces. The returned Workspace already contains the remote commits, so after merging your changes you push that new workspace to continue.
Typical CLI Usage
There is a small command line front-end in the trible repository. It exposes push and merge operations over simple commands and follows the same API presented in the examples. The tool is currently experimental and may lag behind the library, but it demonstrates how repository operations map onto a CLI.
Diagram
A simplified view of the push/merge cycle:
┌───────────┐ pull ┌───────────┐
| local ws |◀───────────────────── | repo |
└─────┬─────┘ └───────────┘
│
│ commit
│
▼
┌───────────┐ push ┌───────────┐
│ local ws │ ─────────────────────▶│ repo │
└─────┬─────┘ └─────┬─────┘
│ │
│ merge │ conflict?
└──────▶┌─────────────┐◀────────────┘
│ conflict ws │
└───────┬─────┘
│ ┌───────────┐
└────────────▶| repo │
push └───────────┘
Each push either succeeds or returns a workspace containing the other changes. Merging incorporates your commits and the process repeats until no conflicts remain.
Commit Selectors
Commit selectors describe which commits to load from a workspace. The Workspace::checkout method accepts any type implementing the CommitSelector trait and returns a TribleSet containing data from those commits. It currently supports individual commit handles, lists of handles and a handful of higher level selectors.
Most selectors operate on ranges inspired by Git's two‑dot syntax. a..b selects all commits reachable from b that are not reachable from a. In set terms it computes ancestors(b) - ancestors(a). Omitting the start defaults a to the empty set so ..b yields ancestors(b). Omitting the end defaults b to the current HEAD, making a.. resolve to ancestors(HEAD) - ancestors(a) while .. expands to ancestors(HEAD).
// Check out the entire history of the current branch
let history = ws.checkout(ancestors(ws.head()))?;
While convenient, the range-based design makes it difficult to compose complex queries over the commit graph.
Implemented selectors
CommitSelector is implemented for:
- CommitHandle – a single commit.
- Vec<CommitHandle> and &[CommitHandle] – explicit lists of commits.
- ancestors(commit) – a commit and all of its ancestors.
- nth_ancestor(commit, n) – follows the first-parent chain n steps.
- parents(commit) – direct parents of a commit.
- symmetric_diff(a, b) – commits reachable from either a or b but not both.
- Standard ranges: a..b, a.., ..b and .. following the two‑dot semantics described above.
- filter(selector, predicate) – retains commits for which predicate returns true.
- history_of(entity) – commits touching a specific entity (built on filter).
- time_range(start, end) – commits whose timestamps intersect the inclusive range.
A future redesign could mirror Git's revision selection semantics. Instead of passing ranges, callers would construct commit sets derived from reachability. Primitive functions like ancestors(<commit>) and descendants(<commit>) would produce sets. Higher level combinators such as union, intersection and difference would then let users express queries like "A minus B" or "ancestors of A intersect B". Each selector would return a CommitSet patch of commit handles for checkout to load.
This approach aligns with Git's mental model and keeps selection logic separate from workspace mutation. It also opens the door for additional operations on commit sets without complicating the core API.
Filtering commits
The filter selector wraps another selector and keeps only the commits for which a user provided closure returns true. The closure receives the commit metadata and its payload, allowing inspection of authors, timestamps or the data itself. Selectors compose, so you can further narrow a range:
use hifitime::Epoch;
use tribles::repo::{filter, time_range};

let since = Epoch::from_unix_seconds(1_609_459_200.0); // 2021-01-01
let now = Epoch::now().unwrap();
let recent = ws.checkout(filter(time_range(since, now), |_, payload| {
    payload.iter().any(|t| t.e() == &my_entity)
}))?;
Higher level helpers can build on this primitive. For example history_of(entity) filters ancestors(HEAD) to commits touching a specific entity:
let changes = ws.checkout(history_of(my_entity))?;
Git Comparison
The table below summarizes Git's revision grammar. Each row links back to the official documentation. Forms that rely on reflogs or reference objects other than commits are listed for completeness but are unlikely to be implemented.
Git Syntax | Planned Equivalent | Reference | Status |
---|---|---|---|
A | commit(A) | gitrevisions | Implemented |
A^ /A^N | nth_parent(A, N) | gitrevisions | Not planned |
A~N | nth_ancestor(A, N) | gitrevisions | Implemented |
A^@ | parents(A) | gitrevisions | Implemented |
A^! | A minus parents(A) | gitrevisions | Unimplemented |
A^-N | A minus nth_parent(A, N) | gitrevisions | Not planned |
A^0 | commit(A) | gitrevisions | Implemented |
A^{} | deref_tag(A) | gitrevisions | Unimplemented |
A^{type} | object_of_type(A, type) | gitrevisions | Not planned: non-commit object |
A^{/text} | search_from(A, text) | gitrevisions | Not planned: requires commit message search |
:/text | search_repo(text) | gitrevisions | Not planned: requires repository search |
A:path | blob_at(A, path) | gitrevisions | Not planned: selects a blob not a commit |
:[N:]path | index_blob(path, N) | gitrevisions | Not planned: selects from the index |
A..B | range(A, B) | gitrevisions | Implemented |
A...B | symmetric_diff(A, B) | gitrevisions | Implemented |
^A | exclude(reachable(A)) | gitrevisions | Unimplemented |
A@{upstream} | upstream_of(A) | gitrevisions | Not planned: depends on remote config |
A@{push} | push_target_of(A) | gitrevisions | Not planned: depends on remote config |
A@{N} | reflog(A, N) | gitrevisions | Not planned: relies on reflog history |
A@{<date>} | reflog_at(A, date) | gitrevisions | Not planned: relies on reflog history |
@{N} | reflog(HEAD, N) | gitrevisions | Not planned: relies on reflog history |
@{-N} | previous_checkout(N) | gitrevisions | Not planned: relies on reflog history |
Only a subset of Git's revision grammar will likely be supported. Selectors relying on reflog history, remote configuration, or searching commits and blobs add complexity with little benefit for workspace checkout. They are listed above for completeness but remain unplanned for now.
TimeRange
Commits record when they were made via a timestamp attribute of type NsTAIInterval. When creating a commit this interval defaults to (now, now) but other tools could provide a wider range if the clock precision is uncertain. The TimeRange selector uses this interval to gather commits whose timestamps fall between two Epoch values:
use hifitime::Epoch;
use tribles::repo::time_range;

let since = Epoch::from_unix_seconds(1_609_459_200.0); // 2021-01-01
let now = Epoch::now().unwrap();
let tribles = ws.checkout(time_range(since, now))?;
This walks the history from HEAD and returns only those commits whose timestamp interval intersects the inclusive range. Internally it uses filter(ancestors(HEAD), ..) to check each commit's timestamp range.
Garbage Collection and Forgetting
Repositories grow over time as commits, branch metadata and user blobs accumulate. Because each blob is content addressed and immutable, nothing is ever overwritten and there is no automatic reclamation when branches move or objects become orphaned. To keep disk usage in check a repository can periodically forget blobs that are no longer referenced. Forgetting only removes local copies; any blob can be reintroduced later without violating Trible Space's monotonic model. The challenge is deciding which blobs are still reachable without rebuilding every record.
Conservative Reachability
Every commit and branch metadata record is stored as a SimpleArchive. The archive encodes a canonical TribleSet as alternating 32‑byte keys and values. Instead of deserialising the archive, the collector scans the raw bytes in 32‑byte chunks. Each second chunk is treated as a candidate value. If a chunk matches the hash of a blob in the store we assume it is a reference, regardless of the attribute type. With 32‑byte hashes the odds of a random collision are negligible, so the scan may keep extra blobs but will not drop a referenced one.
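The scan itself is small enough to sketch. The helper below is illustrative, assuming the archive is available as a raw byte slice and that a HashSet of known blob hashes stands in for the store:

use std::collections::HashSet;

// Illustrative only: collect every value-position chunk that matches the hash
// of a blob already present in the store.
fn candidate_handles(archive: &[u8], known: &HashSet<[u8; 32]>) -> Vec<[u8; 32]> {
    archive
        .chunks_exact(32)
        .enumerate()
        // A SimpleArchive alternates 32-byte keys and values; only the value
        // positions (every second chunk) can reference another blob.
        .filter(|(i, _)| i % 2 == 1)
        .filter_map(|(_, chunk)| {
            let mut hash = [0u8; 32];
            hash.copy_from_slice(chunk);
            // Conservative: keep the chunk only if a blob with this hash exists.
            known.contains(&hash).then_some(hash)
        })
        .collect()
}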
Traversal Algorithm
- Enumerate all branches and load their metadata blobs.
- Extract candidate handles from the metadata. This reveals the current commit head along with any other referenced blobs.
- Recursively walk the discovered commits and content blobs. Whenever a referenced blob is a SimpleArchive, scan every second 32‑byte segment for further handles instead of deserialising it.
- Collect all visited handles into a plain set or list of 32‑byte handles.

A keep‑style operation can pass this collection to the blob store and prune everything else without imposing any trible semantics; a sketch of such a walk follows below.
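A hedged sketch of that walk, building on the candidate_handles helper above and a hypothetical load_archive closure that returns a blob's bytes only when the blob is a SimpleArchive:

use std::collections::HashSet;

// Illustrative only: depth-first walk from the branch roots, scanning each
// archive for further handles and treating every other blob as a leaf.
fn reachable(
    roots: &[[u8; 32]],
    known: &HashSet<[u8; 32]>,
    load_archive: impl Fn(&[u8; 32]) -> Option<Vec<u8>>,
) -> HashSet<[u8; 32]> {
    let mut visited = HashSet::new();
    let mut stack: Vec<[u8; 32]> = roots.to_vec();
    while let Some(handle) = stack.pop() {
        if !visited.insert(handle) {
            continue; // already walked this blob
        }
        if let Some(bytes) = load_archive(&handle) {
            stack.extend(candidate_handles(&bytes, known));
        }
    }
    visited
}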
Content blobs that are not SimpleArchive instances (for example large binary attachments) act as leaves. They become reachable when some archive references their handle and are otherwise eligible for forgetting.
Future Work
The public API for triggering garbage collection is still open. The blob store could expose a method that retains only a supplied collection of handles, or a helper such as Repository::forget_unreachable() might compute those handles before delegating pruning. A more flexible ReachabilityWalker could also let applications decide how to handle reachable blobs. Whatever interface emerges, conservative reachability by scanning SimpleArchive bytes lays the groundwork for safe space reclamation.
Glossary
Blob: A chunk of binary data addressed by the hash of its contents. Blobs are immutable and can live in memory, on disk, or in remote object stores.
Trible: A three-part tuple of entity, attribute, and value. Tribles are the atomic facts that make up all higher level structures in Trible Space.
Trible Space: The overall storage model that organises tribles across blobs, PATCHes, and repositories. It emphasises immutability and content addressing.
PATCH: A tree-shaped index used to organise tribles for efficient queries. Different permutations of the same trible share leaves to resist skewed data.
Pile: An append-only collection of blobs such as tribles, patches, and other data. Piles can be opened from local files or object stores.
Workspace: A mutable working area for preparing commits. Once a commit is finalised, it becomes immutable like the rest of Trible Space.
Deep Dive
This section contains in-depth discussions about Trible Space internals.
Philosophy
This section collects the more detailed discussions around the design of Trible Space and the reasoning behind certain choices. It is meant as an optional read for the curious.
We prioritise a simple and predictable system over clever heuristics. Each component should be understandable on its own and interact cleanly with the rest of the stack.
Developer experience is equally important. APIs aim to be straightforward and use synchronous building blocks that can be composed as needed.
Finally, we strive for soundness and performance. Safety checks guard against invalid data while efficient data structures keep the core fast.
Identifiers for Distributed Systems
We found it useful to categorize identifiers along two axes:
| Abstract | Semantic |
---|---|---|
Intrinsic | Hash, Signature, PubKey | embeddings |
Extrinsic | UUID, UFOID, FUCID | names, DOI, URL |
Abstract vs. Semantic Identifiers
- Semantic Identifiers (e.g., human-readable names, URLs, embeddings) – These identifiers carry meaning and context about the entity they represent. This can make them useful for human users, as they can convey information about the entity without requiring additional lookups. For example, a URL can provide information about the location of a resource, or a human readable name can provide information about the entity itself. Embeddings are a special case of semantic identifiers, as they represent the content of an entity in a way that can be compared to other entities. They are also more likely to change over time, as the context of the entity changes. This makes them less useful for identity, as they are not necessarily unique; their strength is to aid interpretation rather than define persistence. To avoid ambiguities and conflicts or the need for a central authority to manage them, semantic identifiers should always be explicitly scoped to a context, such as a namespace or system environment. This ensures that the same name can coexist in different contexts without collision or confusion. This scoping also addresses social challenges inherent in human-readable names: different users may prefer different names for the same entity. By allowing local names to reference persistent identifiers (extrinsic or intrinsic), each user can adopt their preferred naming conventions while maintaining a shared understanding of the underlying identity.
- Abstract Identifiers (e.g., UUIDs, UFOIDs, FUCIDs, hashes, signatures) – These identifiers provide abstract identity without imposing any semantic meaning or cultural connotations. They can be generated cheaply and without coordination, relying on high entropy to make collisions practically impossible, uniquely, globally, and persistently addressing an entity, regardless of its content or context. Abstract identifiers, when used to reference entities in a system, provide a stable and unique identity that is independent of the content or context of the entity. They are particularly useful in distributed systems, where they can be used to address entities across different nodes without requiring a central authority.
Intrinsic vs. Extrinsic Identifiers
- Intrinsic Identifiers (e.g., hashes, signatures) – These identifiers provide intrinsic identity by acting as unique fingerprints of the exact content they represent. Unlike abstract identifiers, intrinsic identifiers are directly tied to the data itself, ensuring immutability and self-validation. Intrinsic identifiers are generated by applying cryptographic functions to the content. Their entropy requirements are higher than those of abstract identifiers, as they must not only prevent accidental collisions but also withstand adversarial scenarios, such as deliberate attempts to forge data.
- Extrinsic Identifiers (e.g., human-readable names, URLs, DOIs, UUIDs, UFOIDs, FUCIDs) – These identifiers provide identity that is not tied to the content itself, but only by association. They are used to reference entities in a system, but do not provide any guarantees about the content or the entity itself, allowing for continuity even as that entity may change or evolve.
Extrinsic identifiers and intrinsic identifiers represent different kinds of metaphysical identity.
For example, in the ship of Theseus thought experiment, both the original ship and the reconstructed ship
would share the same extrinsic identity but have different intrinsic identities.
Embeddings as Semantic Intrinsic Identifiers
Note that embeddings are the somewhat curious case of semantic intrinsic identifiers. They are intrinsic in that they are tied to the content they represent, but they are also semantic in that they carry meaning about the content. Embeddings are used to represent the content of an entity in a way that can be compared to other entities, such as for similarity search or classification. This makes them especially interesting for search and retrieval systems, where they can be used to find similar entities based on a reference entity. But less useful for identity, as they are not necessarily unique.
One thing that makes them especially interesting is that they can be used to compare entities across different systems or contexts, even if the entities themselves are not directly comparable. For example, you could compare the embeddings of a text document and an image to find similar content, even though the two entities are of different types.
Furthermore they aid in the decentralization and commoditization of search and retrieval systems, as they allow for the relatively expensive process of generating embeddings to be done decoupled from the indexing and retrieval process. This allows for the embedding generation to be done once in a distributed manner, and then the embeddings can be used by any system that needs to compare entities. With the embeddings acting as a common language for comparing entities, different embeddings can be compared without needing to know about the specifics of each system.
Contrastingly classic search and retrieval systems require a central authority to index and search the content, as the indexing process is tightly coupled with the indexed data. This makes it difficult to compare entities across different systems, as each system has its own index and retrieval process. It also makes merging indexes virtually impossible, as the indexes are tightly coupled with the structure of the data they index.
High-Entropy Identifiers
For a truly distributed system, the creation of identifiers must avoid the bottlenecks and overhead associated with a central coordinating authority. At the same time, we must ensure that these identifiers are unique.
To guarantee uniqueness, we use abstract identifiers containing a large amount of entropy, making collisions statistically irrelevant. However, the entropy requirements differ based on the type of identifier:
- Extrinsic abstract identifiers need enough entropy to prevent accidental collisions in normal operation.
- Intrinsic abstract identifiers must also resist adversarial forging attempts, requiring significantly higher entropy.
From an information-theoretic perspective, the length of an identifier determines the maximum amount of entropy it can encode. For example, a 128-bit identifier can represent 2^128 unique values, which is sufficient to make collisions statistically negligible even for large-scale systems.
For intrinsic identifiers, 256 bits is widely considered sufficient when modern cryptographic hash functions (e.g., SHA-256) are used. These hash functions provide strong guarantees of collision resistance, preimage resistance, and second-preimage resistance. Even in the event of weaknesses being discovered in a specific algorithm, it is more practical to adopt a new hash function than to increase the bit size of identifiers.
Additionally, future advances such as quantum computing are unlikely to undermine this length. Grover's algorithm would halve the effective security of a 256-bit hash, reducing it to 2^128 operations, which is still infeasible with current or theoretical technology. As a result, 256 bits remains a future-proof choice for intrinsic identifiers.
Such 256-bit intrinsic identifiers are represented by the types tribles::value::schemas::hash::Hash and tribles::value::schemas::hash::Handle.
Additionally, we define three types of high-entropy abstract identifiers to address different requirements: RNGID, UFOID, and FUCID. Each balances trade-offs between entropy, locality, compression, and predictability, as summarized below.
Comparison of Identifier Types
| RNGID | UFOID | FUCID |
---|---|---|---|
Entropy | High | High | Low |
Locality | None | High | High |
Compression | None | Low | High |
Predictability | None | Low | Mid |
Example: Scientific Publishing
Consider the case of published scientific papers. Each artifact, such as a .html or .pdf file, should be identified by its abstract intrinsic identifier, typically a cryptographic hash of its content. This ensures that any two entities referencing the same hash are referring to the exact same version of the artifact, providing immutability and validation.
Across different versions of the same paper, an abstract extrinsic identifier can be used to tie these artifacts together as part of one logical entity. The identifier provides continuity, regardless of changes to the paper’s content over time.
Semantic (human-readable) identifiers, such as abbreviations in citations or bibliographies, are scoped to individual papers and provide context-specific usability for readers. These names do not convey identity but serve as a way for humans to reference the persistent abstract identifiers that underlie the system.
Sadly the identifiers used in practice, such as DOIs, fail to align with these principles and strengths. They attempt to provide global extrinsic semantic identifiers for scientific papers, an ultimately flawed approach. They lack the associated guarantees of intrinsic identifiers and bring all the challenges of semantic identifiers. With their scope defined too broadly, and their authority centralized, they fail to live up to the potential of distributed systems.
ID Ownership
In distributed systems, consistency requires monotonicity due to the CALM principle. However, this is not necessary for single writer systems. By assigning each ID an owner, we ensure that only the current owner can write new information about an entity associated with that ID. This allows for fine-grained synchronization and concurrency control.
To create a transaction, you can uniquely own all entities involved and write new data for them simultaneously. Since there can only be one owner for each ID at any given time, you can be confident that no other information has been written about the entities in question.
By default, all minted ExclusiveIds are associated with the thread they are dropped from. These IDs can be found in queries via the local_ids function.
Ownership and Eventual Consistency
While a simple grow set like the history stored in a Head already constitutes a conflict-free replicated data type (CRDT), it is also limited in expressiveness. To provide richer semantics while guaranteeing conflict-free mergeability we allow only "owned" IDs to be used in the entity position of newly generated triples.
As owned IDs are Send but not Sync, owning a set of them essentially constitutes a single writer transaction domain, allowing for some non-monotonic operations like if-does-not-exist over the set of contained entities. Note that this does not make operations that would break CALM (consistency as logical monotonicity) safe, e.g. delete.
The trible module contains the definition of the Trible struct, which is the fundamental unit of knowledge in the knowledge graph. Instances of Trible are stored in TribleSets which index the trible in various ways, allowing for efficient querying and retrieval of data.
┌────────────────────────────64 byte───────────────────────────┐
┌──────────────┐┌──────────────┐┌──────────────────────────────┐
│ entity-id ││ attribute-id ││ inlined value │
└──────────────┘└──────────────┘└──────────────────────────────┘
└────16 byte───┘└────16 byte───┘└────────────32 byte───────────┘
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─▶
On a high level, a trible is a triple consisting of an entity, an attribute, and a value. The entity and attribute are both 128‑bit abstract extrinsic identifiers as described in crate::id, while the value is an arbitrary 256‑bit crate::value::Value. The width of the value is chosen so that it can hold an entire intrinsic identifier, allowing larger payloads to be referenced via blobs without inflating the inlined representation.
Abstract identifiers
Entities and attributes are purely extrinsic; their identifiers do not encode any meaning beyond uniqueness. An entity may accrue additional tribles over time and attributes simply name relationships without prescribing a schema. This keeps the format agnostic to external ontologies and minimises accidental coupling between datasets.
The value slot can carry any 256‑bit payload. Its size is dictated by the need to embed an intrinsic identifier for out‑of‑line data. When a fact exceeds this space the value typically stores a blob handle pointing to the larger payload.
Tribles are stored as a contiguous 64‑byte array with the entity occupying the first 16 bytes, the attribute the next 16 and the value the final 32 bytes. The name "trible" is a portmanteau of triple and byte and is pronounced like "tribble" from Star Trek – hence the project's mascot, Robert the tribble.
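The layout can be pictured as a plain struct; the field names below are illustrative, not the crate's actual definition:

// Illustrative only: the fixed 64-byte layout of a trible.
#[repr(C)]
struct RawTrible {
    entity: [u8; 16],    // extrinsic entity id
    attribute: [u8; 16], // extrinsic attribute id
    value: [u8; 32],     // inlined value, or an intrinsic handle to a blob
}

const _: () = assert!(std::mem::size_of::<RawTrible>() == 64);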
Index permutations
TribleSets index each fact under all six permutations of entity (E), attribute (A) and value (V) so any combination of bound variables can be resolved efficiently:
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ EAV │ │ EVA │ │ AEV │ │ AVE │ │ VEA │ │ VAE │
└──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘
│ │ │ │ │ │
┌───────────────────────────────────────────────────────┐
│ order-specific inner nodes │
└───────────────────────────────────────────────────────┘
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
┌───────────────────────────────────────────────────────┐
│ SHARED LEAVES │
│ single canonical E–A–V tribles used by all │
└───────────────────────────────────────────────────────┘
Each permutation has its own inner nodes, but all six share leaf nodes containing the 64‑byte trible. This avoids a naïve six‑fold memory cost while still letting the query planner pick the most selective ordering, keeping joins resistant to skew.
Advantages
- A total order over tribles enables efficient storage and canonicalisation.
- Simple byte‑wise segmentation supports indexing and querying without an interning mechanism, keeping memory usage low and parallelisation easy while avoiding the need for garbage collection.
- Schemas describe the value portion directly, making serialisation and deserialisation straightforward.
- The fixed 64‑byte layout makes it easy to estimate the physical size of a dataset as a function of the number of tribles stored.
- The minimalistic design aims to minimise entropy while retaining collision resistance, making it likely that a similar format would emerge through convergent evolution and could serve as a universal data interchange format.
Direction and Consistency
In other triple stores the direction of the edge drawn by a triple is often chosen incidentally, e.g. there is no intrinsic preference for hasColor over colorOf. This can lead to confusion and inconsistency in the graph, as different writers might choose different directions for the same edge.
This is typically solved by:
- Automatically inferring the opposite edge for every edge inserted, as done by OWL and RDF with the inverseOf predicate, leading to a doubling of the number of edges in the graph or inference at query time.
- Endless bikeshedding about the "right" direction of edges.
In the tribles crate we solve this problem by giving the direction of the edge an explicit semantic meaning: the direction of the edge indicates which entity is the one making the statement, i.e. which entity is observing the fact or proclaiming the relationship. This is a simple and consistent rule that naturally fits into a distributed system, where each entity is associated with a single writer that is responsible for the consistency of the facts it asserts (see ID Ownership).
A different perspective is that edges are always ordered from describing to described entities, with circles constituting consensus between them.
For example, the edge hasColor is always drawn from the entity that has the color to the entity that represents the color. This makes the direction of the edge a natural consequence of the semantics of the edge, and not an arbitrary choice.
Blobs
Blobs are immutable sequences of bytes used to represent data that does not fit into the fixed 256‑bit value slot of a trible. Each blob is typed by a BlobSchema, similar to how values use a ValueSchema. This allows structured data to be serialized into a blob while still tracking its schema.
Values and tribles capture small facts in a fixed width, whereas blobs are used "in the large" for documents, media and other sizable payloads. A blob can therefore represent anything from a single file to a complete archive of tribles.
Converting Rust types to blobs is infallible in practice, so only the ToBlob and TryFromBlob traits are widely used. The TryToBlob and FromBlob variants have been dropped to keep the API surface small.
The following example demonstrates creating blobs, archiving a TribleSet and signing its contents:

#![allow(unused)]
fn main() {
use tribles::prelude::*;
use tribles::examples::literature;
use tribles::repo::repo;
use valueschemas::{Handle, Blake3};
use blobschemas::{SimpleArchive, LongString};
use rand::rngs::OsRng;
use ed25519_dalek::{Signature, Signer, SigningKey};

// Build a BlobStore and fill it with some data.
let mut memory_store: MemoryBlobStore<Blake3> = MemoryBlobStore::new();

let book_author_id = fucid();
let quote_a: Value<Handle<Blake3, LongString>> = memory_store
    .put("Deep in the human unconscious is a pervasive need for a logical universe that makes sense. But the real universe is always one step beyond logic.")
    .unwrap();
let quote_b = memory_store
    .put("I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain.")
    .unwrap();

let set = literature::entity!({
    title: "Dune",
    author: &book_author_id,
    quote: quote_a,
    quote: quote_b
});

// Serialize the TribleSet and store it as another blob.
let archived_set_handle: Value<Handle<Blake3, SimpleArchive>> =
    memory_store.put(&set).unwrap();

let mut csprng = OsRng;
let commit_author_key: SigningKey = SigningKey::generate(&mut csprng);

let signature: Signature = commit_author_key.sign(
    &memory_store
        .reader()
        .get::<Blob<SimpleArchive>, SimpleArchive>(archived_set_handle)
        .unwrap()
        .bytes,
);

// Store the handle in another TribleSet.
let _meta_set = repo::entity!({
    content: archived_set_handle,
    short_message: "Initial commit",
    signed_by: commit_author_key.verifying_key(),
    signature_r: signature,
    signature_s: signature,
});
}
Blobs complement tribles and values by handling large payloads while keeping the
core data structures compact. A blob's hash, computed via a chosen
HashProtocol
, acts as a stable handle that can be embedded into tribles or
other blobs, enabling content‑addressed references without copying the payload.
PATCH
The Persistent Adaptive Trie with Cuckoo-compression and Hash-maintenance (PATCH) is the core data structure used for set operations in Trible Space. It stores keys in a compressed 256-ary trie where each node uses a byte oriented cuckoo hash table to map to its children. This single node layout supports anywhere from two to 256 entries, avoiding the complex branching logic of other adaptive tries.
Traditional Adaptive Radix Trees (ART) employ multiple node variants like Node4, Node16 or Node48 to keep memory usage proportional to the number of children. PATCH instead compresses every branch with a byte-oriented cuckoo hash table. Each node contains two arrays of candidate slots, and inserts may displace previous entries as in classic cuckoo hashing. The layout never changes, so we avoid the branching logic and pointer chasing common in ART implementations while still achieving high occupancy.
Our byte table uses two hash functions built from a specialised compressed permutation. The first always uses the identity mapping and the second picks a random bijective byte→byte permutation. The current table size simply masks off the upper bits to compress these results. Doubling the table reveals one more significant bit, so entries either stay in place or move to bucket index * 2.

When all 256 children exist we disable the random permutation and use the identity for both hashes, turning the full table into a simple array where each byte already occupies its canonical slot.
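The sketch below illustrates the addressing scheme just described; the permutation is a stand-in rather than the one the crate actually generates, and the table-growth mechanics are left out.

// Two byte-table hash functions: the identity and a bijective byte->byte
// permutation, both compressed to the current power-of-two table size by
// masking off the upper bits.
struct ByteHashes {
    permutation: [u8; 256], // random bijection; replaced by the identity when full
}

impl ByteHashes {
    fn bucket(&self, key: u8, use_permutation: bool, table_size: usize) -> usize {
        debug_assert!(table_size.is_power_of_two() && table_size <= 256);
        let hashed = if use_permutation {
            self.permutation[key as usize]
        } else {
            key // the first hash is always the identity
        };
        hashed as usize & (table_size - 1)
    }
}

fn main() {
    // With a full 256-entry table both hashes degenerate to the identity,
    // so every byte sits in its canonical slot.
    let identity: [u8; 256] = std::array::from_fn(|i| i as u8);
    let hashes = ByteHashes { permutation: identity };
    assert_eq!(hashes.bucket(0xAB, false, 256), 0xAB);
    assert_eq!(hashes.bucket(0xAB, true, 256), 0xAB);
}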
The byte_table_resize_benchmark demonstrates how densely the table can fill before triggering a resize. The benchmark inserts all byte values many times and measures the occupancy that forced each power-of-two table size to grow:
ByteTable resize fill - random: 0.863, sequential: 0.972
Per-size fill (random)
size 2: 1.000 # path compression keeps two-entry nodes fully occupied
size 4: 0.973
size 8: 0.899
size 16: 0.830
size 32: 0.749
size 64: 0.735
size 128: 0.719
size 256: 1.000 # identity hash maps all 256 children without resizing
Per-size fill (sequential)
size 2: 1.000 # path compression keeps two-entry nodes fully occupied
size 4: 1.000
size 8: 0.993
size 16: 1.000
size 32: 0.928
size 64: 0.925
size 128: 0.927
size 256: 1.000 # identity hash maps all 256 children without resizing
Random inserts average roughly 86% table fill while sequential inserts hold about 97% before doubling the table size. Nodes of size two are always 100% full thanks to path compression, and the final 256‑ary node also reaches 100% occupancy because of the identity hash, which we now report explicitly instead of 0.000. This keeps memory usage predictable without the specialized node formats used by ART.
Hash maintenance
Each node maintains a 256‑bit rolling hash derived from the keys in its subtree. On insert or delete the old influence of a child is XORed out and the new one XORed in, yielding an updated hash in constant time. These hashes allow set operations such as union, intersection and difference to skip entire branches when the hashes differ.
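The following sketch shows the idea in isolation; the 256-bit hash is represented here as four u64 words, which is an illustrative choice rather than the crate's actual type.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Hash256([u64; 4]);

impl Hash256 {
    // XOR is associative and self-inverse, which is what makes the
    // constant-time add/remove of a child's influence possible.
    fn xor(self, other: Hash256) -> Hash256 {
        Hash256([
            self.0[0] ^ other.0[0],
            self.0[1] ^ other.0[1],
            self.0[2] ^ other.0[2],
            self.0[3] ^ other.0[3],
        ])
    }
}

// Replace one child's contribution without touching the rest of the subtree.
fn update_node_hash(node: Hash256, old_child: Hash256, new_child: Hash256) -> Hash256 {
    node.xor(old_child).xor(new_child)
}

fn main() {
    let node = Hash256([1, 2, 3, 4]);
    let old_child = Hash256([5, 6, 7, 8]);
    let new_child = Hash256([9, 10, 11, 12]);
    let updated = update_node_hash(node, old_child, new_child);
    // Undoing the swap restores the original node hash.
    assert_eq!(updated.xor(new_child).xor(old_child), node);
}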
Keys can be viewed in different orders with the KeyOrdering trait and segmented via KeySegmentation to enable prefix-based queries.
All updates use copy‑on‑write semantics, so cloning a tree is cheap and safe.
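The copy-on-write behaviour can be pictured with plain reference counting; the sketch below uses std Arc for illustration and is not the crate's actual node representation.

use std::sync::Arc;

// Cloning the tree only bumps a reference count; a mutation copies just the
// nodes along the modified path while untouched subtrees stay shared.
#[derive(Clone)]
struct Node {
    key_byte: u8,
    children: Vec<Arc<Node>>,
}

fn with_child(node: &Arc<Node>, child: Arc<Node>) -> Arc<Node> {
    let mut copy = (**node).clone(); // private copy of this node only
    copy.children.push(child);
    Arc::new(copy)
}

fn main() {
    let leaf = Arc::new(Node { key_byte: 7, children: vec![] });
    let root = Arc::new(Node { key_byte: 0, children: vec![leaf.clone()] });
    let snapshot = root.clone();           // cheap clone of the whole tree
    let updated = with_child(&root, leaf); // snapshot is unaffected
    assert_eq!(snapshot.children.len(), 1);
    assert_eq!(updated.children.len(), 2);
}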
Pile Format
The on-disk pile keeps every blob and branch in one append-only file. This layout provides a simple write-ahead log style database where new data is only appended. It allows both blob and branch storage in a single file while remaining resilient to crashes. The pile backs local repositories and acts as a durable content‑addressed store. The pile file can be memory mapped for fast reads and is safely shared between threads because existing bytes are never mutated.
While large databases often avoid mmap due to pitfalls with partial writes and page cache thrashing [1], this design works well for the pile's narrow usage pattern.
Design Rationale
This format emphasizes simplicity over sophisticated on-disk structures.
Appending new blobs rather than rewriting existing data keeps corruption
windows small and avoids complicated page management. Storing everything in a
single file makes a pile easy to back up or replicate over simple transports
while still allowing it to be memory mapped for fast reads. The 64 byte
alignment ensures each entry begins on a cache line boundary, which improves
concurrent access patterns and allows safe typed views with the zerocopy
crate.
Hash verification only happens when blobs are read. Opening even a very large pile is therefore fast while still catching corruption before data is used.
Every record begins with a 16 byte magic marker that identifies whether it stores a blob or a branch. The sections below illustrate the layout of each type.
Usage
A pile typically lives as a .pile file on disk. Repositories open it through Pile::open, which performs any necessary recovery and returns a handle for appending new blobs or branches. Multiple threads may share the same handle thanks to internal synchronisation, making a pile a convenient durable store for local development.
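The sketch below shows what day-to-day use might look like, assuming the pile exposes the same BlobStore-style put/reader/get API used with MemoryBlobStore in the Blobs chapter; the exact method names and schema parameters here are assumptions rather than a definitive reference.

use tribles::prelude::*;
use valueschemas::{Handle, Blake3};
use blobschemas::LongString;
use std::path::Path;

const MAX_PILE_SIZE: usize = 1 << 20;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Opening recovers from any trailing partial record before returning.
    let mut pile: Pile<MAX_PILE_SIZE> = Pile::open(Path::new("example.pile"))?;

    // Appending a blob yields a content-addressed handle.
    let handle: Value<Handle<Blake3, LongString>> = pile
        .put("a note kept outside the fixed-width value slot")
        .unwrap();

    // Reads go through a reader view; the hash is verified lazily at this point.
    let _blob = pile
        .reader()
        .get::<Blob<LongString>, LongString>(handle)
        .unwrap();
    Ok(())
}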
Blob Storage
8 byte 8 byte
┌────16 byte───┐┌──────┐┌──────┐┌────────────32 byte───────────┐
┌ ┌──────────────┐┌──────┐┌──────┐┌──────────────────────────────┐
header │ │magic number A││ time ││length││ hash │
└ └──────────────┘└──────┘└──────┘└──────────────────────────────┘
┌────────────────────────────64 byte───────────────────────────┐
┌ ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┐
│ │ │
payload │ │ bytes (64byte aligned and padded) │
│ │ │
└ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┘
Each blob entry records its creation timestamp, the length of the payload and its hash. The payload is padded so the next record begins on a 64 byte boundary.
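As a quick worked example of that padding rule (the header itself is 64 bytes: a 16-byte marker, 8-byte timestamp, 8-byte length and 32-byte hash):

// Total on-disk size of one blob record: 64-byte header plus the payload
// rounded up to the next 64-byte boundary.
fn record_size(payload_len: u64) -> u64 {
    const HEADER: u64 = 64;
    let padded = ((payload_len + 63) / 64) * 64;
    HEADER + padded
}

fn main() {
    assert_eq!(record_size(100), 64 + 128); // a 100-byte blob pads up to 128 bytes
    assert_eq!(record_size(64), 64 + 64);   // already aligned payloads add no padding
}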
Branch Storage
┌────16 byte───┐┌────16 byte───┐┌────────────32 byte───────────┐
┌ ┌──────────────┐┌──────────────┐┌──────────────────────────────┐
header │ │magic number B││ branch id ││ hash │
└ └──────────────┘└──────────────┘└──────────────────────────────┘
Branch entries map a branch identifier to the hash of a blob.
Recovery
When [Pile::try_open
] scans an existing file it checks that every header uses a known marker and that the whole record fits. It does not verify any hashes. If a truncated or unknown block is found the function reports the number of bytes that were valid so far using [OpenError::CorruptPile
].
The convenience wrapper Pile::open re-runs the same validation and truncates the file to the valid length if corruption is encountered. This recovers from interrupted writes by discarding incomplete data.
Hash verification happens lazily only when individual blobs are loaded so that
opening a large pile remains fast.
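A minimal sketch of choosing between the two entry points follows; the exact signatures and the error's shape are assumptions based on the names mentioned above.

use tribles::prelude::*;
use std::path::Path;

const MAX_PILE_SIZE: usize = 1 << 20;

fn main() {
    let path = Path::new("example.pile");

    // Strict: surface corruption (e.g. OpenError::CorruptPile) to the caller
    // instead of repairing the file in place.
    match Pile::<MAX_PILE_SIZE>::try_open(path) {
        Ok(_pile) => println!("pile scanned cleanly"),
        Err(err) => eprintln!("pile needs recovery: {err:?}"),
    }

    // Lenient: truncate any trailing partial record and continue.
    let _pile: Pile<MAX_PILE_SIZE> = Pile::open(path).expect("open with recovery");
}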
For more details on interacting with a pile see the Pile struct documentation.
Pile Blob Metadata
The pile file already stores a timestamp and length in each blob header but this information is discarded when building the in-memory index. Clients therefore cannot query when a blob was added or how large it is without re-parsing the file.
Proposed Changes
- Extend IndexEntry in src/repo/pile.rs with a timestamp field. The length can be determined from the stored bytes when needed.
- Introduce a public BlobMetadata struct containing timestamp and length so callers do not depend on internal types.
- Populate the timestamp when Pile::try_open scans existing entries and when inserting new blobs. Lengths are computed on demand.
- Add PileReader::metadata(&self, handle) to retrieve a blob's metadata if it exists. Iterators may later be extended to yield this information alongside the blob itself (see the sketch below).
This approach keeps the current API intact while making useful details available for replication and debugging tools.
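A sketch of the proposed surface, shown here only to make the shape of the change concrete; none of this exists in the crate yet and the names simply follow the list above.

// Public metadata type so callers do not depend on the internal IndexEntry.
pub struct BlobMetadata {
    pub timestamp: u64, // creation time recorded in the blob header
    pub length: u64,    // payload length, computed from the stored bytes
}

fn main() {
    // A replication or debugging tool could, for example, skip blobs older
    // than some cutoff using the returned metadata.
    let meta = BlobMetadata { timestamp: 1_700_000_000_000, length: 4096 };
    assert!(meta.length > 0 && meta.timestamp > 0);
}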