Importing Other Data Formats
Import pipelines let you bring external datasets into a tribles repository without
hand-writing schemas or entity identifiers every time. This chapter introduces the
import namespace, explains how the JSON importers map foreign fields onto
attributes, and outlines how you can extend the same patterns to new formats.
Import Namespace Overview
The triblespace_core::import module collects conversion helpers that translate
structured documents into raw tribles. Today the namespace ships with two
deterministic JSON importers:
- JsonObjectImporter hashes attribute/value pairs to derive entity identifiers so identical inputs reproduce the same entities. It accepts a top-level JSON object (or a top-level array of objects). Construct it with a blob sink (e.g., a Workspace’s store or a MemoryBlobStore) and an optional 32-byte salt when you want to mix in extra entropy to avoid collisions. Each import_* call returns a Fragment that exports the root entity id(s) and contains the emitted facts.
- JsonTreeImporter preserves the full JSON structure and ordering by emitting explicit node and entry entities (a JSON AST). It derives content-addressed identifiers from the JSON values themselves so identical subtrees deduplicate across overlapping imports. Unlike the object importer it can represent arbitrary JSON roots, including primitives. Each import_* call returns a rooted Fragment for the imported JSON value.
JsonObjectImporter uses a fixed mapping for JSON primitives:
- strings → Handle<Blake3, LongString>
- numbers → F64
- booleans → Boolean
Arrays are treated as multi-valued fields; every element becomes its own trible
under the same attribute identifier. Nested objects recurse automatically,
linking parent to child entities through GenId attributes derived from the
containing field name. After one or more imports, call metadata() to retrieve
metadata as a Fragment exporting the derived attribute ids. The fragment
contains attribute descriptors plus multi-value hints (a metadata::tag edge
pointing to metadata::KIND_MULTI). Use clear() when you need a completely
fresh run; it drops the per-field attribute caches and the multi-value tracking.
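The flattening rules above can be sketched with the standard library alone. This is an illustration, not the crate's code: the Json, Value and Fact types and the import_object and emit functions are hypothetical names, field names stand in for attribute ids, and child entities get counter-based ids here instead of the hashed identifiers described later in this chapter. The text does not say how null is handled, so the sketch simply skips it.

```rust
/// Minimal stand-in for a parsed JSON value (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Encoded value, named after the schema it would be stored under.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    LongString(String), // stands in for Handle<Blake3, LongString>
    F64(f64),
    Boolean(bool),
    GenId(usize), // link from a parent entity to a child entity
}

/// One emitted fact; the field name stands in for the attribute id.
#[derive(Debug, Clone, PartialEq)]
struct Fact {
    entity: usize,
    field: String,
    value: Value,
}

/// Imports one object and returns its entity id.
/// Assumption: ids come from a counter to keep the sketch short.
fn import_object(fields: &[(String, Json)], next_id: &mut usize, out: &mut Vec<Fact>) -> usize {
    let entity = *next_id;
    *next_id += 1;
    for (field, value) in fields {
        emit(entity, field, value, next_id, out);
    }
    entity
}

/// Emits the fact(s) for a single field value.
fn emit(entity: usize, field: &str, value: &Json, next_id: &mut usize, out: &mut Vec<Fact>) {
    let encoded = match value {
        Json::String(s) => Value::LongString(s.clone()),
        Json::Number(n) => Value::F64(*n),
        Json::Bool(b) => Value::Boolean(*b),
        // Arrays are multi-valued: one fact per element, same field.
        Json::Array(items) => {
            for item in items {
                emit(entity, field, item, next_id, out);
            }
            return;
        }
        // Nested objects recurse; the parent links to the child via GenId.
        Json::Object(child) => Value::GenId(import_object(child, next_id, out)),
        // Not covered by the text above; skipped in this sketch.
        Json::Null => return,
    };
    out.push(Fact { entity, field: field.to_string(), value: encoded });
}
```

A document with a two-element array and one nested object therefore yields two facts for the array field plus a GenId fact that connects the root to the child entity.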
Mapping JSON Fields to Attributes
Attributes are derived through Attribute::from_name, which hashes the JSON
field name together with the fixed ValueSchema for that primitive. The
importer caches the resulting RawIds per field so the hash only has to be
computed once per run. Arrays are treated as multi-valued fields: every item is
encoded and stored under the same attribute identifier, producing one trible per
element.
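To make the derive-then-cache idea concrete, here is a minimal sketch that does not use the crate. The names derive_attribute and AttrCache are hypothetical, a u64 stands in for the real RawId, and std's DefaultHasher stands in for whatever hash Attribute::from_name actually applies (DefaultHasher output is not guaranteed to be stable across Rust releases, so it would not be suitable for persisted ids).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the real attribute identifier type.
type AttrId = u64;

/// Derives an id from the field name together with the schema name,
/// so the same field with a different schema gets a different id.
fn derive_attribute(field: &str, schema: &str) -> AttrId {
    let mut hasher = DefaultHasher::new();
    field.hash(&mut hasher);
    schema.hash(&mut hasher);
    hasher.finish()
}

/// Per-run cache: each (field, schema) pair is hashed at most once.
#[derive(Default)]
struct AttrCache {
    ids: HashMap<(String, &'static str), AttrId>,
    /// Number of times the hash was actually computed.
    derivations: usize,
}

impl AttrCache {
    fn get(&mut self, field: &str, schema: &'static str) -> AttrId {
        let key = (field.to_string(), schema);
        if let Some(&id) = self.ids.get(&key) {
            return id; // cache hit: no hashing needed
        }
        self.derivations += 1;
        let id = derive_attribute(field, schema);
        self.ids.insert(key, id);
        id
    }
}
```

Looking up the same field twice computes the hash once, while the same field name paired with a different schema produces a second, distinct identifier.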
After an import completes the importer regenerates metadata from its cached
attribute map. The import_* call returns a Fragment exporting the root
entity id(s) for the imported document and containing the emitted facts; call
metadata() to retrieve a separate Fragment exporting the derived attribute
ids and containing attribute descriptors plus multi-value hints (via
metadata::tag pointing to metadata::KIND_MULTI). Merge those descriptors
into your repository alongside the imported facts when you want queries to
discover the original JSON field names or project datasets by schema without
repeating the derivation logic. Field names are stored as metadata::name
handles to LongString blobs so arbitrarily long keys survive roundtrips.
metadata::name is a general-purpose entity naming attribute; the importers
reuse it here for field names. Importers intentionally avoid emitting attribute
usage annotations; those are reserved for code-defined attributes so each
codebase can attach its own contextual names and descriptions.
You can import multiple documents by merging fragments:
let mut all = Fragment::default();
all += importer.import_str(doc1)?;
all += importer.import_str(doc2)?;
// all.exports() yields the root ids; all.facts() yields the merged tribles.
When exporting back to JSON, pass a blob reader (e.g., from a Workspace or
MemoryBlobStore) to export_to_json so LongString values can be inlined. If a blob
is missing or unreadable the exporter returns an error with the handle hash
instead of silently emitting a placeholder, keeping roundtrips lossless when
blobs are present. The exporter uses the same fixed mapping in reverse:
ShortString → JSON string, Handle<Blake3, LongString> → JSON string (via
blob lookup), Boolean → JSON bool, F64 → JSON number, GenId → inlined
object (unless already visited). Attributes that use other schemas are ignored
so JSON roundtrips stay predictable even when the dataset mixes in
format-specific extensions.
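The reverse mapping can be sketched as a recursive walk with a visited set. Everything below is hypothetical and independent of the real export_to_json: facts and blobs live in plain HashMaps, handles are u64 keys, and a GenId that points at an already visited entity is simply omitted, which is an assumption because the text above does not say what the exporter emits in that case.

```rust
use std::collections::{HashMap, HashSet};

/// Minimal JSON output type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Bool(bool),
    Number(f64),
    String(String),
    Object(Vec<(String, Json)>),
}

/// Stored value, tagged by the schema it was written with.
#[derive(Debug, Clone)]
enum Stored {
    ShortString(String),
    LongString(u64), // handle into the blob store (hypothetical u64 key)
    Boolean(bool),
    F64(f64),
    GenId(usize),
    Other, // any schema this exporter does not understand
}

type Facts = HashMap<usize, Vec<(String, Stored)>>;
type Blobs = HashMap<u64, String>;

fn export(
    entity: usize,
    facts: &Facts,
    blobs: &Blobs,
    visited: &mut HashSet<usize>,
) -> Result<Json, String> {
    visited.insert(entity);
    let mut fields = Vec::new();
    if let Some(pairs) = facts.get(&entity) {
        for (name, value) in pairs {
            let json = match value {
                Stored::ShortString(s) => Json::String(s.clone()),
                // A missing blob is an error, never a silent placeholder.
                Stored::LongString(handle) => match blobs.get(handle) {
                    Some(text) => Json::String(text.clone()),
                    None => return Err(format!("missing blob for handle {handle:#x}")),
                },
                Stored::Boolean(b) => Json::Bool(*b),
                Stored::F64(n) => Json::Number(*n),
                Stored::GenId(child) => {
                    if visited.contains(child) {
                        continue; // assumption: omit links to visited entities
                    }
                    export(*child, facts, blobs, visited)?
                }
                Stored::Other => continue, // other schemas are ignored
            };
            fields.push((name.clone(), json));
        }
    }
    Ok(Json::Object(fields))
}
```

The visited set is what keeps a cycle (a child that points back at its parent) from recursing forever.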
Nested objects recurse automatically. The parent receives a GenId attribute
that points at the child entity, allowing the importer to represent the entire
object graph as a connected set of tribles. Because those GenId attributes are
also derived from the parent field names they remain stable even when you import
related documents in separate batches.
Lossless JSON Import
JsonTreeImporter trades the compact attribute/value encoding for a
lossless JSON AST representation. Each JSON value becomes a node tagged with a
kind (json_tree::kind_*). Objects and arrays emit explicit entry entities
that store field names and indices (json_tree::field_* and
json_tree::array_*), preserving ordering and allowing repeated keys.
Numbers are stored as raw decimal strings via Handle<Blake3, LongString> so
precision is not lost. Array and field indices are stored as U256BE to keep
ordering exact even for large collections.
Because node identifiers are derived from the content of each value, identical subtrees converge automatically when you import overlapping backups. This makes lossless imports a good archival layer: you can keep full-fidelity raw JSON and still layer semantic projections on top.
Each import_* call returns a rooted Fragment containing the JSON AST facts.
Merge fragments when you ingest multiple documents. metadata() returns a
fixed Fragment exporting the schema ids for the json_tree::* attributes and
kinds. You typically merge it once alongside your lossless archive.
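The content-addressing idea behind the tree importer can be illustrated with a small standalone sketch. It is not the JsonTreeImporter implementation: Json, TreeStore and import are hypothetical names, DefaultHasher stands in for the real hash, node ids are u64, and indices are u64 where the real importer stores U256BE. Numbers are kept as their raw decimal text, as described above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// JSON value with numbers kept as raw decimal text (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(String),
    String(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>), // ordered; repeated keys are allowed
}

type NodeId = u64;

/// Maps each node id to its kind; identical content shares one entry.
#[derive(Default)]
struct TreeStore {
    nodes: HashMap<NodeId, &'static str>,
}

/// Imports a value and returns an id derived only from its content.
fn import(value: &Json, store: &mut TreeStore) -> NodeId {
    let mut hasher = DefaultHasher::new();
    let kind = match value {
        Json::Null => "null",
        Json::Bool(b) => {
            b.hash(&mut hasher);
            "bool"
        }
        Json::Number(raw) => {
            raw.hash(&mut hasher); // "1" and "1.0" stay distinct
            "number"
        }
        Json::String(s) => {
            s.hash(&mut hasher);
            "string"
        }
        Json::Array(items) => {
            for (index, item) in items.iter().enumerate() {
                // Each entry contributes its position and its child's id.
                (index as u64, import(item, store)).hash(&mut hasher);
            }
            "array"
        }
        Json::Object(fields) => {
            for (index, (name, item)) in fields.iter().enumerate() {
                (index as u64, name, import(item, store)).hash(&mut hasher);
            }
            "object"
        }
    };
    kind.hash(&mut hasher);
    let id = hasher.finish();
    store.nodes.insert(id, kind);
    id
}
```

Importing two documents that share a subtree adds that subtree's nodes only once, because both imports derive the same ids for it.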
Managing Entity Identifiers
JsonObjectImporter buffers the encoded attribute/value pairs for each object,
sorts them, and feeds the resulting byte stream into a hash protocol. The first 16
bytes of that digest become the entity identifier, ensuring identical JSON
inputs produce identical IDs even across separate runs. You can supply an
optional 32-byte salt via the constructor to keep deterministic imports from
colliding with existing data. Once the identifier is established,
the importer writes the derived pairs into a TribleSet via Trible::new and
returns them as a Fragment whose exports are the root entity id(s) for the
imported document.
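The sort, hash and truncate steps can be sketched as follows. The digest here is a toy built from FNV-1a lanes so the example runs with the standard library only; it is not the hash protocol the crate uses and is not collision resistant. The names Pair, toy_digest and entity_id are hypothetical, and prepending the salt to the byte stream is an assumption about where the salt is mixed in.

```rust
/// 64-bit FNV-1a with a caller-chosen starting state.
fn fnv1a(seed: u64, bytes: &[u8]) -> u64 {
    let mut state = seed;
    for &byte in bytes {
        state ^= byte as u64;
        state = state.wrapping_mul(0x0000_0100_0000_01b3);
    }
    state
}

/// Toy 32-byte digest made of four differently seeded FNV-1a lanes.
/// Stand-in only: a real importer would use a cryptographic hash.
fn toy_digest(bytes: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for lane in 0..4usize {
        let seed = 0xcbf2_9ce4_8422_2325u64 ^ (lane as u64).wrapping_mul(0x9e37_79b9_7f4a_7c15);
        let word = fnv1a(seed, bytes);
        out[lane * 8..(lane + 1) * 8].copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// Encoded (attribute id, value) pair; the sizes are illustrative.
type Pair = ([u8; 16], [u8; 32]);

/// Derives an entity id from the pairs, independent of their input order.
fn entity_id(pairs: &[Pair], salt: Option<[u8; 32]>) -> [u8; 16] {
    // Sorting makes the byte stream independent of field order.
    let mut sorted = pairs.to_vec();
    sorted.sort();

    let mut stream = Vec::new();
    if let Some(salt) = salt {
        stream.extend_from_slice(&salt); // assumption: salt goes first
    }
    for (attribute, value) in &sorted {
        stream.extend_from_slice(attribute);
        stream.extend_from_slice(value);
    }

    // The first 16 bytes of the digest become the identifier.
    let digest = toy_digest(&stream);
    let mut id = [0u8; 16];
    id.copy_from_slice(&digest[..16]);
    id
}
```

Because the pairs are sorted before hashing, two objects with the same fields in a different order map to the same id, while supplying a salt moves the whole import into a different id space.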
This hashing step also changes how repeated structures behave. When a JSON
document contains identical nested objects—common in fixtures such as
citm_catalog or Twitter exports—the deterministic importer emits the same
identifier for each recurrence. Only the first copy reaches the underlying
TribleSet; later occurrences are recognised as duplicates and skipped during
the merge. Even though the hash itself is fast, that deduplication step further
reduces the workload on datasets with significant repetition.
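A small sketch of why repeated objects collapse: when the entity id is a function of the content, every recurrence produces exactly the same facts, and a set keeps only one copy. The names content_id and import_all are hypothetical, DefaultHasher stands in for the real hash, and a BTreeSet stands in for the TribleSet.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// (entity id, field name, value) with strings standing in for encoded data.
type Fact = (u64, String, String);

/// Content-derived id: sort the fields, then hash them.
fn content_id(fields: &[(&str, &str)]) -> u64 {
    let mut sorted = fields.to_vec();
    sorted.sort();
    let mut hasher = DefaultHasher::new();
    sorted.hash(&mut hasher);
    hasher.finish()
}

/// Imports every object and reports how many facts were already present.
fn import_all(objects: &[Vec<(&str, &str)>]) -> (BTreeSet<Fact>, usize) {
    let mut set = BTreeSet::new();
    let mut skipped = 0;
    for fields in objects {
        let id = content_id(fields);
        for (name, value) in fields {
            // insert returns false when the identical fact is already stored.
            if !set.insert((id, name.to_string(), value.to_string())) {
                skipped += 1;
            }
        }
    }
    (set, skipped)
}
```

With three objects of which two are identical, the set ends up holding the facts of only two entities and the second copy's facts are all counted as skipped.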
Extending the Importers
To support a new external format, implement a module in the import namespace
that follows the same pattern: decode the source data, derive attributes with
Attribute::from_name, encode values using the appropriate ValueSchema, and
hand the results to Trible::new. If the format supplies stable identifiers,
mix them into the hashing step or salt so downstream systems can keep imports
idempotent.
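As a rough illustration of that pattern, the sketch below imports a tiny comma-separated table. It is deliberately simplified and does not touch the crate: import_csv, derive_attribute and Fact are hypothetical, ids are u64 values from DefaultHasher, every cell is treated as a string, and the parser splits on commas without handling quoting. The id column plays the role of the stable identifier supplied by the format.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone, PartialEq)]
struct Fact {
    entity: u64,
    attribute: u64,
    value: String,
}

/// Derives an attribute id from the column name and a fixed schema label.
fn derive_attribute(name: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    name.hash(&mut hasher);
    "LongString".hash(&mut hasher);
    hasher.finish()
}

/// Imports header-plus-rows text; `id_column` names the stable identifier.
fn import_csv(text: &str, id_column: &str) -> Result<Vec<Fact>, String> {
    // Step 1: decode the source data (naive split, no quoting support).
    let mut lines = text.lines().filter(|line| !line.trim().is_empty());
    let header: Vec<&str> = lines
        .next()
        .ok_or("empty input")?
        .split(',')
        .map(str::trim)
        .collect();
    let id_index = header
        .iter()
        .position(|column| *column == id_column)
        .ok_or_else(|| format!("no column named {id_column}"))?;

    let mut facts = Vec::new();
    for line in lines {
        let cells: Vec<&str> = line.split(',').map(str::trim).collect();
        if cells.len() != header.len() {
            return Err(format!("row has {} cells, expected {}", cells.len(), header.len()));
        }
        // Step 2: mix the format's stable identifier into the entity hash.
        let mut hasher = DefaultHasher::new();
        cells[id_index].hash(&mut hasher);
        let entity = hasher.finish();
        // Step 3: derive attributes from names and emit one fact per cell.
        for (name, cell) in header.iter().zip(&cells) {
            facts.push(Fact {
                entity,
                attribute: derive_attribute(name),
                value: cell.to_string(),
            });
        }
    }
    Ok(facts)
}
```

Re-importing the same text yields the same facts, which is the idempotence property the stable identifier is meant to provide.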