Skip to content

Filesystem Scan Cache Architecture

Filesystem Scan Cache Architecture Contract

Section titled “Filesystem Scan Cache Architecture Contract”

This document defines the current contract for the shared filesystem scan cache implemented in Rust (crates/pi-natives/src/fs_cache.rs) and consumed by native discovery/search APIs exposed to packages/coding-agent.

The cache stores full directory-scan entry lists (GlobMatch[]) keyed by scan scope and traversal policy, then lets higher-level operations (glob filtering, fuzzy scoring, grep file selection) run against those cached entries.

Primary goals:

  • avoid repeated filesystem walks for repeated discovery/search calls
  • keep consistency across glob, fuzzyFind, and grep when they share the same scan policy
  • allow explicit staleness recovery for empty results and explicit invalidation after file mutations
  • Cache implementation and policy: crates/pi-natives/src/fs_cache.rs
  • Native consumers:
    • crates/pi-natives/src/glob.rs
    • crates/pi-natives/src/fd.rs (fuzzyFind)
    • crates/pi-natives/src/grep.rs
  • JS binding/export:
    • packages/natives/src/glob/index.ts (invalidateFsScanCache)
    • packages/natives/src/glob/types.ts
    • packages/natives/src/grep/types.ts
  • Coding-agent mutation invalidation helpers:
    • packages/coding-agent/src/tools/fs-cache-invalidation.ts

Each entry is keyed by:

  • canonicalized root directory path
  • include_hidden boolean
  • use_gitignore boolean

Implications:

  • Hidden and non-hidden scans do not share entries.
  • Gitignore-respecting and ignore-disabled scans do not share entries.
  • Consumers must pass stable semantics for hidden/gitignore behavior; changing either flag creates a different cache partition.

node_modules inclusion is not in the cache key. The cache stores entries with node_modules included; per-consumer filtering is applied after retrieval.

Cache population uses a deterministic walker (ignore::WalkBuilder) configured by include_hidden and use_gitignore:

  • follow_links(false)
  • sorted by file path
  • .git is always skipped
  • node_modules is always collected at cache-scan time (and optionally filtered later)
  • entry file type + mtime are captured via symlink_metadata

Search roots are resolved by resolve_search_path:

  • relative paths are resolved against current cwd
  • target must be an existing directory
  • root is canonicalized when possible

Global policy (environment-overridable):

  • FS_SCAN_CACHE_TTL_MS (default 1000)
  • FS_SCAN_EMPTY_RECHECK_MS (default 200)
  • FS_SCAN_CACHE_MAX_ENTRIES (default 16)

Behavior:

  • get_or_scan(...)
    • if TTL is 0: bypass cache entirely, always fresh scan (cache_age_ms = 0)
    • on cache hit within TTL: return cached entries + non-zero cache_age_ms
    • on expired hit: evict key, rescan, store fresh entry
  • max entry enforcement is oldest-first eviction by created_at

Empty-result fast recheck (separate from normal hits)

Section titled “Empty-result fast recheck (separate from normal hits)”

Normal cache hit:

  • a cache hit inside TTL returns cached entries and does nothing else.

Empty-result fast recheck:

  • this is a caller-side policy using ScanResult.cache_age_ms
  • if filtered/query result is empty and cached scan age is at least empty_recheck_ms(), caller performs one force_rescan(...) and retries
  • intended to reduce stale-negative results when files were recently added but cache is still within TTL

Current consumers:

  • glob: rechecks when filtered matches are empty and scan age exceeds threshold
  • fuzzyFind (fd.rs): rechecks only when query is non-empty and scored matches are empty
  • grep: rechecks when selected candidate file list is empty

Cache is opt-in on all exposed APIs (cache?: boolean, default false).

Current defaults in native APIs:

  • glob: hidden=false, gitignore=true, cache=false
  • fuzzyFind: hidden=false, gitignore=true, cache=false
  • grep: hidden=true, cache=false, and cache scan always uses use_gitignore=true

Coding-agent callers today:

  • High-volume mention candidate discovery enables cache:
    • packages/coding-agent/src/utils/file-mentions.ts
    • profile: hidden=true, gitignore=true, includeNodeModules=true, cache=true
  • Tool-level grep integration currently disables scan cache (cache: false):
    • packages/coding-agent/src/tools/grep.ts

Native invalidation entrypoint:

  • invalidateFsScanCache(path?: string)
    • with path: remove cache entries whose root is a prefix of target path
    • without path: clear all scan cache entries

Path handling details:

  • relative invalidation paths are resolved against cwd
  • invalidation attempts canonicalization
  • if target does not exist (e.g., delete), fallback canonicalizes parent and reattaches filename when possible
  • this preserves invalidation behavior for create/delete/rename where one side may not exist

Coding-agent mutation flow responsibilities

Section titled “Coding-agent mutation flow responsibilities”

Coding-agent code must invalidate after successful filesystem mutations.

Central helpers:

  • invalidateFsScanAfterWrite(path)
  • invalidateFsScanAfterDelete(path)
  • invalidateFsScanAfterRename(oldPath, newPath) (invalidates both sides when paths differ)

Current mutation tool callsites:

  • packages/coding-agent/src/tools/write.ts
  • packages/coding-agent/src/patch/index.ts (hashline/patch/replace flows)

Rule: if a flow mutates filesystem content or location and bypasses these helpers, cache staleness bugs are expected.

When introducing cache use in a new scanner/search path:

  1. Use stable scan policy inputs

    • decide hidden/gitignore semantics first
    • pass them consistently to get_or_scan/force_rescan so cache partitions are intentional
  2. Treat cache data as pre-filtered only by traversal policy

    • apply tool-specific filtering (glob patterns, type filters, node_modules rules) after retrieval
    • never assume cached entries already reflect your higher-level filters
  3. Implement empty-result fast recheck only for stale-negative risk

    • use scan.cache_age_ms >= empty_recheck_ms()
    • retry once with force_rescan(..., store=true, ...)
    • keep this path separate from normal cache-hit logic
  4. Respect no-cache mode explicitly

    • when caller disables cache, call force_rescan(..., store=false, ...)
    • do not populate shared cache in a no-cache request path
  5. Wire mutation invalidation for any new write path

    • after successful write/edit/delete/rename, call the coding-agent invalidation helper
    • for rename/move, invalidate both old and new paths
  6. Do not add per-call TTL knobs

    • current contract is global policy only (env-configured), no per-request TTL override
  • Cache scope is process-local in-memory (DashMap), not persisted across process restarts.
  • Cache stores scan entries, not final tool results.
  • glob/fuzzyFind/grep share scan entries only when key dimensions (root, hidden, gitignore) match.
  • .git is always excluded at scan collection time regardless of caller options.