Skip to content

Blob and Artifact Storage Architecture

This document describes how coding-agent stores large/binary payloads outside session JSONL, how truncated tool output is persisted, and how internal URLs (artifact://, agent://) resolve back to stored data.

The runtime uses two different persistence mechanisms for different data shapes:

  • Content-addressed blobs (blob:sha256:<hash>): global, binary-oriented storage used to externalize large image base64 payloads from persisted session entries.
  • Session-scoped artifacts (files under <sessionFile-without-.jsonl>/): per-session text files used for full tool outputs and subagent outputs.

They are intentionally separate:

  • blob storage optimizes deduplication and stable references by content hash,
  • artifact storage optimizes append-only session tooling and human/tool retrieval by local IDs.

SessionManager constructs BlobStore(getBlobsDir()), so blob files live in a shared global blob directory (not in a session folder).

Blob file naming:

  • file path: <blobsDir>/<sha256-hex>
  • no extension
  • reference string stored in entries: blob:sha256:<sha256-hex>

Implications:

  • same binary content across sessions resolves to the same hash/path,
  • writes are idempotent at the content level,
  • blobs can outlive any individual session file.

ArtifactManager derives artifact directory from session file path:

  • session file: .../<timestamp>_<sessionId>.jsonl
  • artifacts directory: .../<timestamp>_<sessionId>/ (strip .jsonl)

Artifact types share this directory:

  • truncated tool output files: <numericId>.<toolType>.log (for artifact://)
  • subagent output files: <outputId>.md (for agent://)

BlobStore.put() computes SHA-256 over raw binary bytes and returns:

  • hash: hex digest,
  • path: <blobsDir>/<hash>,
  • ref: blob:sha256:<hash>.

No session-local counter is used.

Artifact IDs: session-local monotonic integer

Section titled “Artifact IDs: session-local monotonic integer”

ArtifactManager scans existing *.log artifact files on first use to find max existing numeric ID and sets nextId = max + 1.

Allocation behavior:

  • file format: {id}.{toolType}.log
  • IDs are sequential strings ("0", "1", …)
  • resume does not overwrite existing artifacts because scan happens before allocation.

If artifact directory is missing, scanning yields empty list and allocation starts from 0.

AgentOutputManager allocates IDs for subagent outputs as <index>-<requestedId> (optionally nested under parent prefix, e.g. 0-Parent.1-Child). It scans existing .md files on initialization to continue from the next index on resume.

Before session entries are written (#rewriteFile / incremental persist), SessionManager calls prepareEntryForPersistence() (via truncateForPersistence).

Key behaviors:

  1. Large string truncation: oversized strings are cut and suffixed with "[Session persistence truncated large content]".
  2. Transient field stripping: partialJson and jsonlEvents are removed from persisted entries.
  3. Image externalization to blobs:
    • only applies to image blocks in content arrays,
    • only when data is not already a blob ref,
    • only when base64 length is at least threshold (BLOB_EXTERNALIZE_THRESHOLD = 1024),
    • replaces inline base64 with blob:sha256:<hash>.

This keeps session JSONL compact while preserving recoverability.

When opening a session (setSessionFile), after migrations, SessionManager runs resolveBlobRefsInEntries().

For each message/custom-message image block with blob:sha256:<hash>:

  • reads blob bytes from blob store,
  • converts bytes back to base64,
  • mutates in-memory entry to inline base64 for runtime consumers.

If blob is missing:

  • resolveImageData() logs warning,
  • returns original ref string unchanged,
  • load continues (no hard crash).

OutputSink powers streaming output in bash/python/ssh and related executors.

Behavior:

  1. Every chunk is sanitized and appended to in-memory tail buffer.
  2. When in-memory bytes exceed spill threshold (DEFAULT_MAX_BYTES, 50KB), sink marks output truncated.
  3. If an artifact path is available, sink opens a file writer and writes:
    • existing buffered content once,
    • all subsequent chunks.
  4. In-memory buffer is always trimmed to tail window for display.
  5. dump() returns summary including artifactId only when file sink was successfully created.

Practical effect:

  • UI/tool return shows truncated tail,
  • full output is preserved in artifact file and referenced as artifact://<id>.

If file sink creation fails (I/O error, missing path, etc.), sink silently falls back to in-memory truncation only; full output is not persisted.

blob:sha256:<hash> is a persistence reference inside session entry payloads, not an internal URL scheme handled by the router. Resolution is done by SessionManager during session load.

Handled by ArtifactProtocolHandler:

  • requires active session artifact directory,
  • ID must be numeric,
  • resolves by matching filename prefix <id>.,
  • returns raw text (text/plain) from the matched .log file,
  • when missing, error includes list of available artifact IDs.

Missing directory behavior:

  • if artifacts directory does not exist, throws No artifacts directory found.

Handled by AgentProtocolHandler over <artifactsDir>/<id>.md:

  • plain form returns markdown text,
  • /path or ?q= forms perform JSON extraction,
  • path and query extraction cannot be combined,
  • if extraction requested, file content must parse as JSON.

Missing directory behavior:

  • throws No artifacts directory found.

Missing output behavior:

  • throws Not found: <id> with available IDs from existing .md files.

Read tool integration:

  • read supports offset/limit pagination for non-extraction internal URL reads,
  • rejects offset/limit when agent:// extraction is used.
  • ArtifactManager scans existing {id}.*.log files on first allocation and continues numbering.
  • AgentOutputManager scans existing .md output IDs and continues numbering.
  • SessionManager rehydrates blob refs to base64 on load.

SessionManager.fork() creates a new session file with new session ID and parentSession link, then returns old/new file paths. Artifact copying is handled by AgentSession.fork():

  • attempts recursive copy of old artifact directory to new artifact directory,
  • missing old directory is tolerated,
  • non-ENOENT copy errors are logged as warnings and fork still completes.

ID implications after fork:

  • if copy succeeded, artifact counters in new session continue after max copied ID,
  • if copy failed/skipped, new session artifact IDs start from 0.

Blob implications after fork:

  • blobs are global and content-addressed, so no blob directory copy is required.

SessionManager.moveTo() renames both session file and artifact directory to the new default session directory, with rollback logic if a later step fails. This preserves artifact identity while relocating session scope.

CaseBehavior
Blob file missing during rehydrationWarn and keep blob:sha256: ref string in-memory
Blob read ENOENT via BlobStore.getReturns null
Artifact directory missing (ArtifactManager.listFiles)Returns empty list (allocation can start fresh)
Artifact directory missing (artifact:// / agent://)Throws explicit No artifacts directory found
Artifact ID not foundThrows with available IDs listing
OutputSink artifact writer init failsContinues with tail-only truncation (no full-output artifact)
No session file (some task paths)Task tool falls back to temp artifacts directory for subagent outputs

Binary blob externalization vs text-output artifacts

Section titled “Binary blob externalization vs text-output artifacts”
  • Blob externalization is for binary image payloads inside persisted session entry content; it replaces inline base64 in JSONL with stable content refs.
  • Artifacts are plain text files for execution output and subagent output; they are addressable by session-local IDs through internal URLs.

The two systems intersect only indirectly (both reduce session JSONL bloat) but have different identity, lifetime, and retrieval paths.