Importing Other Data Formats
Import pipelines let you bring external datasets into a tribles repository without
hand-writing schemas or entity identifiers every time. This chapter introduces the
import namespace, explains how the JSON importers map foreign fields onto
attributes, and outlines how you can extend the same patterns to new formats.
Import Namespace Overview
The triblespace_core::import module collects conversion helpers that translate
structured documents into raw tribles. Today the namespace ships with two JSON
importers:
- JsonImporter generates fresh entity identifiers for every object it visits.
- DeterministicJsonImporter derives entity identifiers by hashing the encoded attribute/value pairs, so the same input always reproduces the same entities.
Both variants accept encoder callbacks for JSON primitives. Those closures turn
strings, numbers, and booleans into Value instances and can allocate blobs or
perform validation before handing the data back to the importer. The
valueschemas::Boolean helper stores false as all-zero bytes and true as
all ones so JSON flags round-trip without ambiguity when you wire the boolean
encoder to it.
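As a minimal sketch, a boolean encoder wired to that schema might look like the
following. The closure signature and the Value::new call are assumptions made
for illustration; the byte layout is the one documented above.

```rust
use triblespace_core::prelude::*; // illustrative paths; adjust to your layout

// All-zero bytes for `false`, all-one bytes for `true`, matching the
// documented Boolean layout. The closure shape and `Value::new` are
// assumptions of this sketch.
let bool_encoder = |flag: &bool| -> Result<Value<valueschemas::Boolean>, EncodeError> {
    Ok(Value::new(if *flag { [0xFF; 32] } else { [0x00; 32] }))
};
```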
Both importers accumulate statements internally. After feeding one or more
JSON documents through import_value or import_str, call data() to inspect
the tribles that were emitted and metadata() to retrieve attribute
descriptors. When you want to start a fresh batch without recomputing attribute
hashes, clear_data() drops the accumulated statements while leaving the
attribute caches intact. Call clear() to reset both the staged data and the
attribute caches when you need a completely fresh run.
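A sketch of that lifecycle, assuming an importer that has already been wired
with primitive encoders (construction is sketched later in this chapter):

```rust
// Feed a document and inspect what was emitted so far.
importer.import_str(r#"{"name": "Ada", "active": true}"#)?;

let data = importer.data();      // tribles emitted for the imported entities
let meta = importer.metadata();  // attribute descriptors (name, schemas)

// Start a fresh batch but keep the cached attribute hashes...
importer.clear_data();
// ...or reset statements and caches together.
importer.clear();
```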
Mapping JSON Fields to Attributes
Attributes are derived through Attribute::from_name, which hashes the JSON
field name together with the ValueSchema selected for that primitive. The
importers cache the resulting RawIds per field and schema so the hash only has
to be computed once per run. Arrays are treated as multi-valued fields: every
item is encoded and stored under the same attribute identifier, producing one
trible per element.
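The same derivation can be reproduced directly. The generic-parameter syntax
on Attribute below is an assumption of this sketch, but the name-plus-schema
hashing is exactly what the importers cache:

```rust
use triblespace_core::prelude::*; // illustrative paths

// Field name + value schema determine the attribute id. Calling this
// twice yields the same id, so the per-field cache saves recomputation
// without changing identity.
let tags: Attribute<valueschemas::ShortString> = Attribute::from_name("tags");

// For {"tags": ["a", "b"]}, both elements are stored under this single
// attribute id: one trible per element.
```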
After an import completes, the importers regenerate metadata from their cached
attribute maps. The metadata()
accessor returns tribles that link each derived attribute id to its field name,
value schema, and optional blob schema. Merge those descriptors into your
repository alongside the imported data when you want queries to discover the
original JSON field names or project datasets by schema without repeating the
derivation logic.
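For example, assuming data() and metadata() both yield TribleSets, a merge
might look like this, with TribleSet::union standing in for whatever merge
step your repository workflow uses:

```rust
use triblespace_core::prelude::*; // illustrative paths

let mut combined = TribleSet::new();
combined.union(importer.data().clone());     // imported statements
combined.union(importer.metadata().clone()); // attribute descriptors

// `combined` now carries both the data and the descriptors, so queries
// can recover the original JSON field names later.
```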
Nested objects recurse automatically. The parent receives a GenId attribute
that points at the child entity, allowing the importer to represent the entire
object graph as a connected set of tribles. Because those GenId attributes are
also derived from the parent field names they remain stable even when you import
related documents in separate batches.
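A short sketch of what that looks like in practice, reusing the hypothetical
importer from the earlier examples:

```rust
let doc = r#"{
    "title": "Invoice",
    "customer": { "name": "Ada" }
}"#;
importer.import_str(doc)?;

// This emits one entity for the invoice and one for the customer. The
// invoice entity carries a GenId attribute derived from the field name
// "customer" that points at the child entity, so importing another
// document with a "customer" field in a later batch reuses the same
// attribute id.
```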
Managing Entity Identifiers
JsonImporter::new defaults to ufoid() for identifier generation, but the
constructor is parameterized so you can inject your own policy—for example, a
fucid() generator or any other closure that returns an ExclusiveId. The
custom generator is applied consistently to every object the importer touches,
including nested documents.
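A sketch of both policies; with_generator is a hypothetical constructor name
standing in for whatever parameterization the crate exposes:

```rust
use triblespace_core::prelude::*; // illustrative paths

// Default policy: a fresh ufoid() for every object.
let importer = JsonImporter::new(/* primitive encoders elided */);

// Hypothetical variant injecting fucid() instead; the constructor name
// is an assumption, but any closure returning an ExclusiveId fits the
// policy described above.
let importer = JsonImporter::with_generator(|| fucid(), /* encoders */);
```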
DeterministicJsonImporter takes a different approach. It buffers the encoded
attribute/value pairs for each object, sorts them, and feeds the resulting byte
stream into a user-supplied hash protocol. The first 16 bytes of that digest
become the entity identifier, ensuring identical JSON inputs produce identical
IDs even across separate runs. Once the identifier is established, the importer
writes the cached pairs into its internal trible set via Trible::new, so
subsequent calls to data() expose the deterministic statements alongside the
metadata generated for every derived attribute.
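As an illustration, assuming the hash protocol is supplied as a type parameter
(the Blake3 re-export and the constructor shape below are both assumptions):

```rust
use triblespace_core::prelude::*; // illustrative paths

let mut importer = DeterministicJsonImporter::<Blake3>::new(/* encoders */);

importer.import_str(r#"{"name": "Ada"}"#)?;
let first = importer.data().clone();

importer.clear_data();
importer.import_str(r#"{"name": "Ada"}"#)?;

// Same input, same sorted pairs, same digest, same entity identifier.
assert_eq!(first, importer.data().clone());
```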
This hashing step also changes how repeated structures behave. When a JSON
document contains identical nested objects—common in fixtures such as
citm_catalog or Twitter exports—the deterministic importer emits the same
identifier for each recurrence. Only the first copy reaches the underlying
TribleSet; later occurrences are recognized as duplicates and skipped during
the merge. The nondeterministic importer must still mint a fresh identifier for
every repetition, so it inserts and deduplicates a full set of tribles each
time. Even if the ID generator itself is fast, that extra merge work makes the
deterministic importer faster in benchmarks on datasets with significant repetition.
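A sketch of the behavior with a repeated nested object:

```rust
// Two structurally identical children, as seen in fixtures like
// citm_catalog. With the deterministic importer both nested objects hash
// to the same entity id, so the second occurrence collapses into the
// first inside the TribleSet; only the two GenId links differ.
importer.import_str(r#"{
    "home":   { "city": "Berlin" },
    "office": { "city": "Berlin" }
}"#)?;
```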
Working with Encoder Callbacks
Encoder callbacks receive borrowed references to the raw JSON values. Because the closures are generic over a lifetime you can capture external resources—like a blob store connection or a schema registry—without allocating reference-counted wrappers. Callers can stage binary payloads in whichever blob backend they prefer and return handles that will be persisted alongside the tribles.
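For instance, a string encoder might stage large payloads in a blob store it
captures by reference. MemoryBlobStore, put, to_blob, and the handle types
below are assumptions for this sketch:

```rust
use triblespace_core::prelude::*; // illustrative paths

let mut blobs = MemoryBlobStore::<Blake3>::new();

// The closure borrows `blobs` mutably for as long as the importer uses
// it; no Rc/Arc wrapper is needed because the callbacks are generic over
// a lifetime.
let string_encoder = |s: &str| {
    let handle = blobs.put(s.to_blob());
    Ok::<_, EncodeError>(handle)
};
```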
The callbacks report failures through EncodeError. You can construct an error
with a simple message or wrap an existing error type. The importer surfaces the
field name alongside the original error so schema mismatches remain easy to
diagnose while keeping the hot path lightweight.
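A sketch of a validating number encoder; EncodeError::msg, the F256 schema,
and the to_value conversion are assumed names illustrating the simple-message
style (wrapping an existing error, e.g. via From, is the other style):

```rust
use triblespace_core::prelude::*; // illustrative paths

// Rejects non-finite numbers with a simple message before encoding.
let number_encoder = |n: &f64| -> Result<Value<valueschemas::F256>, EncodeError> {
    if !n.is_finite() {
        return Err(EncodeError::msg("non-finite numbers are not supported"));
    }
    Ok((*n).to_value())
};
```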
Extending the Importers
To support a new external format, implement a module in the import namespace
that follows the same pattern: decode the source data, derive attributes with
Attribute::from_name, and hand encoded values to Trible::new. Reuse the
lifetime-parameterized encoder callbacks so callers can plug in existing blob
stores or validation logic. If the format supplies stable identifiers, offer a
constructor that accepts a custom generator or hash protocol so downstream
systems can choose between ephemeral and deterministic imports.
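As a rough outline, a CSV importer following that recipe might be shaped like
the sketch below; every name here is illustrative rather than an existing API:

```rust
use triblespace_core::prelude::*; // illustrative paths

// A hypothetical CSV importer following the same recipe as the JSON
// importers: derive attributes per column, encode fields through
// caller-supplied callbacks, and accumulate statements internally.
pub struct CsvImporter<F> {
    id_gen: F,       // identifier policy, mirroring JsonImporter
    data: TribleSet, // accumulated statements
    // per-column attribute caches, metadata, encoder callbacks, ...
}

impl<F: FnMut() -> ExclusiveId> CsvImporter<F> {
    pub fn import_record(&mut self, header: &[&str], record: &[&str]) {
        let entity = (self.id_gen)();
        for (name, _field) in header.iter().zip(record) {
            // Derive (and cache) the attribute for this column...
            let _attr: Attribute<valueschemas::ShortString> =
                Attribute::from_name(name);
            // ...then encode the field and hand the result to Trible::new,
            // e.g.: self.data.insert(&Trible::new(&entity, &_attr, &value));
        }
    }
}
```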