Identifiers for Distributed Systems

We found it useful to categorize identifiers along two axes:

            Abstract                  Semantic
Intrinsic   Hash, Signature, PubKey   embeddings
Extrinsic   UUID, UFOID, FUCID        names, DOI, URL

Abstract vs. Semantic Identifiers

  • Semantic Identifiers (e.g., human-readable names, URLs, embeddings)
    These identifiers carry meaning and context about the entity they represent. This makes them useful for human users, as they convey information about the entity without requiring additional lookups. For example, a URL can indicate the location of a resource, and a human-readable name can describe the entity itself. Embeddings are a special case: they represent the content of an entity in a way that can be compared to other entities.

    Semantic identifiers are also more likely to change over time, as the context of the entity changes, and they are not necessarily unique. This makes them less useful for identity; their strength is to aid interpretation rather than to define persistence.

    To avoid ambiguities and conflicts without requiring a central authority to manage them, semantic identifiers should always be explicitly scoped to a context, such as a namespace or system environment. This ensures that the same name can coexist in different contexts without collision or confusion. Scoping also addresses a social challenge inherent in human-readable names: different users may prefer different names for the same entity. By allowing local names to reference persistent identifiers (extrinsic or intrinsic), each user can adopt their preferred naming conventions while maintaining a shared understanding of the underlying identity.

  • Abstract Identifiers (e.g., UUIDs, UFOIDs, FUCIDs, hashes, signatures)
    These identifiers provide abstract identity without imposing any semantic meaning or cultural connotations. They can be generated cheaply and without coordination, relying on high entropy to make collisions practically impossible. An abstract identifier addresses an entity uniquely, globally, and persistently, independent of its content or context. This makes abstract identifiers particularly useful in distributed systems, where they can address entities across different nodes without requiring a central authority.
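
The scoping of semantic names described above can be sketched as a map from (scope, name) pairs to persistent identifiers. This is a minimal illustration, not part of any library: `ScopedNames`, `bind`, `resolve`, and the `u128` stand-in for an abstract identifier are all hypothetical names chosen for this example.

```rust
use std::collections::HashMap;

/// Hypothetical 128-bit abstract identifier (a stand-in for a UUID-like ID).
pub type Id = u128;

/// Maps (scope, name) pairs to persistent identifiers.
#[derive(Default)]
pub struct ScopedNames {
    bindings: HashMap<(String, String), Id>,
}

impl ScopedNames {
    /// Binds `name` inside `scope` to `id`, returning any previous binding.
    pub fn bind(&mut self, scope: &str, name: &str, id: Id) -> Option<Id> {
        self.bindings.insert((scope.to_owned(), name.to_owned()), id)
    }

    /// Resolves `name` within `scope`; bindings from other scopes are invisible.
    pub fn resolve(&self, scope: &str, name: &str) -> Option<Id> {
        self.bindings
            .get(&(scope.to_owned(), name.to_owned()))
            .copied()
    }
}
```

With this shape, two users can bind the same name to different entities, or different names to the same entity, without ever conflicting, because the scope is part of the key.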

Intrinsic vs. Extrinsic Identifiers

  • Intrinsic Identifiers (e.g., hashes, signatures)
    These identifiers provide intrinsic identity by acting as unique fingerprints of the exact content they represent. Unlike extrinsic identifiers, intrinsic identifiers are directly tied to the data itself, ensuring immutability and self-validation.

    Intrinsic identifiers are generated by applying cryptographic functions to the content. Their entropy requirements are higher than those of extrinsic abstract identifiers, as they must not only prevent accidental collisions but also withstand adversarial scenarios, such as deliberate attempts to forge data.

  • Extrinsic Identifiers (e.g., human-readable names, URLs, DOIs, UUIDs, UFOIDs, FUCIDs)
    These identifiers provide identity that is tied to an entity only by association, not by its content. They are used to reference entities in a system, but give no guarantees about the content or the entity itself. This allows for continuity even as the entity changes or evolves.

Extrinsic identifiers and intrinsic identifiers represent different kinds of metaphysical identity.
For example, in the ship of Theseus thought experiment, both the original ship and the reconstructed ship
would share the same extrinsic identity but have different intrinsic identities.
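
The ship of Theseus example can be made concrete with a small sketch. Everything here is hypothetical: the `Ship` type, the fixed `extrinsic_id`, and in particular the `fingerprint` function, which uses 64-bit FNV-1a purely as a dependency-free stand-in. FNV-1a is not cryptographic; a real intrinsic identifier would use a cryptographic hash such as SHA-256.

```rust
/// Toy 64-bit FNV-1a fingerprint of some content.
/// NOT cryptographic: it only stands in for a real hash such as SHA-256.
pub fn fingerprint(content: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    for &byte in content {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV-1a prime
    }
    hash
}

/// A ship with an assigned (extrinsic) identity and replaceable parts.
pub struct Ship {
    /// Extrinsic: assigned once and kept, whatever happens to the planks.
    pub extrinsic_id: u128,
    pub planks: Vec<String>,
}

impl Ship {
    /// Intrinsic: derived from the current content, so it changes with it.
    pub fn intrinsic_id(&self) -> u64 {
        fingerprint(self.planks.join("\n").as_bytes())
    }
}
```

Replacing a single plank leaves `extrinsic_id` untouched while `intrinsic_id()` yields a different value, which mirrors the distinction drawn above.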

Embeddings as Semantic Intrinsic Identifiers

Note that embeddings are the somewhat curious case of semantic intrinsic identifiers. They are intrinsic in that they are tied to the content they represent, but they are also semantic in that they carry meaning about the content. Embeddings are used to represent the content of an entity in a way that can be compared to other entities, such as for similarity search or classification. This makes them especially interesting for search and retrieval systems, where they can be used to find similar entities based on a reference entity. It also makes them less useful for identity, as they are not necessarily unique.
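
As a rough illustration of such a comparison, the sketch below ranks candidates by cosine similarity. The text does not prescribe a metric, so cosine similarity is an assumption here, and `cosine_similarity` and `most_similar` are hypothetical helper names.

```rust
/// Cosine similarity of two embeddings.
/// Returns `None` if the lengths differ or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Index of the candidate most similar to `query`, skipping candidates
/// that cannot be compared (wrong length or zero norm).
pub fn most_similar(query: &[f32], candidates: &[Vec<f32>]) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        .filter_map(|(i, c)| cosine_similarity(query, c).map(|s| (i, s)))
        .max_by(|(_, x), (_, y)| x.total_cmp(y))
        .map(|(i, _)| i)
}
```

Note that nearby embeddings compare as similar rather than equal, which is exactly why they aid retrieval but do not serve as identity.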

One thing that makes them especially interesting is that they can be used to compare entities across different systems or contexts, even if the entities themselves are not directly comparable. For example, you could compare the embeddings of a text document and an image to find similar content, even though the two entities are of different types, provided both embeddings were produced by a model that maps them into a shared embedding space.

Furthermore, they aid in the decentralization and commoditization of search and retrieval systems, as the relatively expensive process of generating embeddings can be decoupled from indexing and retrieval. Embeddings can be generated once, in a distributed manner, and then used by any system that needs to compare entities. With the embeddings acting as a common language for comparing entities, different embeddings can be compared without needing to know about the specifics of each system.

In contrast, classic search and retrieval systems require a central authority to index and search the content, as the indexing process is tightly coupled with the indexed data. This makes it difficult to compare entities across different systems, as each system has its own index and retrieval process. It also makes merging indexes virtually impossible, as the indexes are tightly coupled with the structure of the data they index.

High-Entropy Identifiers

For a truly distributed system, the creation of identifiers must avoid the bottlenecks and overhead associated with a central coordinating authority. At the same time, we must ensure that these identifiers are unique.

To ensure uniqueness in practice, we use abstract identifiers containing a large amount of entropy, making collisions statistically irrelevant. However, the entropy requirements differ based on the type of identifier:

  • Extrinsic abstract identifiers need enough entropy to prevent accidental collisions in normal operation.
  • Intrinsic abstract identifiers must also resist adversarial forging attempts, requiring significantly higher entropy.

From an information-theoretic perspective, the length of an identifier determines the maximum amount of entropy it can encode. For example, a 128-bit identifier can represent 2^128 unique values, which is sufficient to make collisions statistically negligible even for large-scale systems.
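
To get a feeling for "statistically negligible", the standard birthday approximation can be evaluated directly. The sketch below assumes that every bit of the identifier is uniformly random, which is a simplification, since some formats reserve bits for versioning or timestamps. `collision_probability` is a hypothetical helper name.

```rust
/// Approximate probability that at least two of `n` uniformly random
/// `bits`-bit identifiers collide, using the birthday approximation
/// p = 1 - exp(-n(n-1) / 2^(bits+1)).
pub fn collision_probability(n: f64, bits: i32) -> f64 {
    let space = 2f64.powi(bits);
    let exponent = n * (n - 1.0) / (2.0 * space);
    // 1 - exp(-x), computed via exp_m1 to stay accurate for tiny x.
    -(-exponent).exp_m1()
}
```

For a trillion (10^12) fully random 128-bit identifiers, this gives a probability on the order of 10^-15, whereas a 32-bit space reaches even odds of a collision at only about 77,000 identifiers.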

For intrinsic identifiers, 256 bits is widely considered sufficient when modern cryptographic hash functions (e.g., SHA-256) are used. These hash functions provide strong guarantees of collision resistance, preimage resistance, and second-preimage resistance. Even in the event of weaknesses being discovered in a specific algorithm, it is more practical to adopt a new hash function than to increase the bit size of identifiers.

Additionally, future advances such as quantum computing are unlikely to undermine this length. Grover's algorithm would halve the effective preimage security of a 256-bit hash in bits, reducing a brute-force search to roughly 2^128 operations, which is still infeasible with current or theoretical technology. As a result, 256 bits remains a future-proof choice for intrinsic identifiers.

Such 256-bit intrinsic identifiers are represented by the types tribles::value::schemas::hash::Hash and tribles::value::schemas::hash::Handle.

Additionally, we define three types of high-entropy abstract identifiers to address different requirements:
RNGID, UFOID, and FUCID. Each balances trade-offs between entropy, locality, compression, and predictability, as summarized below.

Comparison of Identifier Types

                 RNGID   UFOID   FUCID
Entropy          High    High    Low
Locality         None    High    High
Compression      None    Low     High
Predictability   None    Low     Mid

Example: Scientific Publishing

Consider the case of published scientific papers. Each artifact, such as a .html or .pdf file, should be identified by its abstract intrinsic identifier, typically a cryptographic hash of its content. This ensures that any two entities referencing the same hash are referring to the exact same version of the artifact, providing immutability and validation.

Across different versions of the same paper, an abstract extrinsic identifier can be used to tie these artifacts together as part of one logical entity. The identifier provides continuity, regardless of changes to the paper’s content over time.

Semantic (human-readable) identifiers, such as abbreviations in citations or bibliographies, are scoped to individual papers and provide context-specific usability for readers. These names do not convey identity but serve as a way for humans to reference the persistent abstract identifiers that underlie the system.

Sadly, the identifiers used in practice, such as DOIs, fail to align with these principles and strengths. They attempt to provide global extrinsic semantic identifiers for scientific papers, an ultimately flawed approach. They lack the guarantees of intrinsic identifiers and bring all the challenges of semantic identifiers. With their scope defined too broadly and their authority centralized, they fail to live up to the potential of distributed systems.

ID Ownership

In distributed systems, consistency without coordination requires monotonicity, as stated by the CALM principle. However, this is not necessary for single-writer systems. By assigning each ID an owner, we ensure that only the current owner can write new information about an entity associated with that ID. This allows for fine-grained synchronization and concurrency control.

To create a transaction, you can uniquely own all entities involved and write new data for them simultaneously. Since there can only be one owner for each ID at any given time, you can be confident that no other information has been written about the entities in question.

By default, all minted ExclusiveIds are associated with the thread they are dropped from. These IDs can be found in queries via the local_ids function.

Ownership and Eventual Consistency

While a simple grow set like the history stored in a Head already constitutes a conflict-free replicated data type (CRDT), it is also limited in expressiveness. To provide richer semantics while guaranteeing conflict-free mergeability, we allow only "owned" IDs to be used in the entity position of newly generated triples. As owned IDs are Send but not Sync, owning a set of them essentially constitutes a single-writer transaction domain. This allows for some non-monotonic operations, such as if-does-not-exist, over the set of contained entities. Note that this does not make operations that would break CALM (consistency as logical monotonicity), such as delete, safe.
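
The interplay of grow sets and owned IDs can be sketched with standard library types only. This is an illustration under simplifying assumptions, not the library's API: `OwnedId`, `Triple`, `merge`, `insert_if_absent`, and `assert_send` are hypothetical names, and the ID value is passed in by hand rather than drawn from a high-entropy source.

```rust
use std::cell::Cell;
use std::collections::BTreeSet;
use std::marker::PhantomData;

/// Hypothetical owned identifier (not the library's actual type).
/// It is deliberately not `Clone`, so each ID has exactly one owner, and
/// `PhantomData<Cell<()>>` keeps it `Send` while making it `!Sync`.
pub struct OwnedId {
    raw: u128,
    _not_sync: PhantomData<Cell<()>>,
}

impl OwnedId {
    /// Illustration only: a real system would generate `raw` with high entropy.
    pub fn mint(raw: u128) -> Self {
        OwnedId { raw, _not_sync: PhantomData }
    }
}

/// Heavily simplified (entity, attribute, value) triple.
pub type Triple = (u128, &'static str, String);

/// Merging two grow-only sets is a set union, which is commutative,
/// associative, and idempotent, so replicas converge in any merge order.
pub fn merge(a: &BTreeSet<Triple>, b: &BTreeSet<Triple>) -> BTreeSet<Triple> {
    a.union(b).cloned().collect()
}

/// Adds a triple only if the entity has no value for `attribute` yet.
/// Requiring `&OwnedId` means only the owner can write about the entity,
/// which is what makes this check-then-write step reasonable here.
pub fn insert_if_absent(
    set: &mut BTreeSet<Triple>,
    entity: &OwnedId,
    attribute: &'static str,
    value: &str,
) -> bool {
    let exists = set
        .iter()
        .any(|(e, a, _)| *e == entity.raw && *a == attribute);
    if !exists {
        set.insert((entity.raw, attribute, value.to_owned()));
    }
    !exists
}

/// Compiles only for `Send` types; `assert_send::<OwnedId>()` checks that
/// an owned ID may be moved to another thread.
pub fn assert_send<T: Send>() {}
```

Both operations only ever add triples, so the merged result remains a grow set; nothing in this sketch removes data, in line with the note that deletion stays unsafe.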