2026-03-13 AutoResearch Archetype Design

Goal¶

Define the minimal Archetype v0.1 surface needed to express Karpathy-style autonomous software optimization: a tracked branch frontier, experiments against that frontier, bounded runs, recorded results, and keep/discard branch advancement semantics.

Design Summary¶

Archetype v0.1 should model an AutoResearch loop as an experiment engine over a single tracked branch head:

branch head -> experiment -> run -> result -> keep|discard|crash -> maybe advance branch head

The design intentionally stays close to Karpathy's terminology:

experiment: the hypothesis under test
run: the bounded execution of that experiment
commit: the concrete git state associated with the experiment
branch: the tracked frontier path
result: the metrics emitted by the run
keep / discard / crash: the selection outcomes

Architectural Decisions¶

1. World-per-experiment¶

For v0.1, the preferred shape is one world per experiment, not a single world containing many competing experiments.

Why:

It matches autoresearch's branch-local evaluation loop closely.
It uses Archetype's strongest existing capability: isolated worlds with bounded runs.
It keeps higher-order experiment selection outside ordinary per-tick processors.
It preserves a clean path to richer multi-world search later.

2. One tracked branch head¶

For v0.1, each autonomous loop tracks exactly one authoritative branch head.

The frontier does not live on main.
The frontier lives on a dedicated optimization branch, such as autoresearch/<tag>.
Every new experiment starts from the current tracked branch head.
A kept experiment advances that branch head.
A discarded or crashed experiment leaves the branch head unchanged.

This mirrors autoresearch directly while leaving merge-back-to-main policy out of scope.
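
The branch-head rules above can be sketched in a few lines. This is a minimal illustration, not the Archetype API: the BranchHead class here carries only two of the suggested fields, and apply_outcome, the branch name, and the commit hashes are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, replace
from typing import Literal

Outcome = Literal["keep", "discard", "crash"]


@dataclass(frozen=True)
class BranchHead:
    # Trimmed-down stand-in for the tracked frontier record (assumption).
    branch_name: str
    current_commit_hash: str


def apply_outcome(head: BranchHead, outcome: Outcome, experiment_commit: str) -> BranchHead:
    """Return the tracked head after one experiment has finished."""
    if outcome == "keep":
        # Only a kept experiment moves the frontier forward.
        return replace(head, current_commit_hash=experiment_commit)
    # Discarded and crashed experiments leave the head unchanged.
    return head


head = BranchHead("autoresearch/demo", "aaa111")
head = apply_outcome(head, "discard", "bbb222")  # head stays at aaa111
head = apply_outcome(head, "keep", "ccc333")     # head advances to ccc333
print(head.current_commit_hash)
```

Returning a new immutable value rather than mutating the head in place is a design choice of this sketch; it makes the "unchanged on discard or crash" rule easy to check.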

3. Git-aware, but not git-driven core¶

Archetype should understand git coordinates and frontier semantics, but should not turn the core engine into a shell wrapper.

Framework-owned responsibilities:

repository, branch, and commit identity
tracked branch-head state
experiment/run/result state machine
frontier comparison rules
branch-head advancement decisions

App-layer responsibilities:

checkout and worktree materialization
patch application
git commit
rollback/reset/cleanup
launching concrete training or evaluation commands

This yields the intended boundary:

Archetype models and decides; the app layer materializes and executes.

4. Transactional git adapter in the app layer¶

The git side effects should live in a contained app-layer module that behaves like a transactional adapter, not just a utility helper.

The adapter should own a small transaction boundary:

resolve tracked branch head
materialize checkout/worktree
apply experiment change
commit or rollback
emit resulting git coordinates back to Archetype

The important safety properties are:

idempotence
serialization per tracked branch
crash recovery with enough journal state to reconcile partial progress
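
One way to picture this transaction boundary is the sketch below. It is an illustration under stated assumptions, not a proposed implementation: GitTransactionAdapter and its three injected callables are hypothetical, no real git commands are run, and the journal is an in-memory dict where a real adapter would need durable storage to support crash recovery.

```python
import threading
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class GitTransactionAdapter:
    """Hypothetical app-layer adapter; the git operations are injected callables."""

    resolve_head: Callable[[str], str]           # branch name -> commit hash
    apply_and_commit: Callable[[str, str], str]  # (base commit, experiment id) -> new commit hash
    rollback: Callable[[str], None]              # experiment id -> None
    journal: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    _locks: Dict[str, Any] = field(default_factory=dict)
    _guard: Any = field(default_factory=threading.Lock)

    def _lock_for(self, branch: str):
        # One lock per tracked branch, created lazily under a guard lock.
        with self._guard:
            return self._locks.setdefault(branch, threading.Lock())

    def run(self, branch: str, experiment_id: str) -> Dict[str, Any]:
        with self._lock_for(branch):  # serialization per tracked branch
            entry = self.journal.get(experiment_id)
            if entry is not None and entry["state"] == "committed":
                return entry  # idempotence: a replay returns the recorded coordinates
            base = self.resolve_head(branch)
            entry = {"branch": branch, "base_commit": base, "state": "materialized"}
            self.journal[experiment_id] = entry  # journal before side effects
            try:
                entry["commit"] = self.apply_and_commit(base, experiment_id)
                entry["state"] = "committed"
            except Exception:
                self.rollback(experiment_id)
                entry["state"] = "rolled_back"
            return entry


calls = []

def fake_commit(base: str, experiment_id: str) -> str:
    calls.append(experiment_id)
    return f"{base}-{experiment_id}"

adapter = GitTransactionAdapter(
    resolve_head=lambda branch: "abc123",
    apply_and_commit=fake_commit,
    rollback=lambda experiment_id: None,
)
first = adapter.run("autoresearch/demo", "exp-1")
second = adapter.run("autoresearch/demo", "exp-1")  # replayed, not re-applied
```

Injecting the git operations keeps the transaction logic testable without a repository, which fits the intent that the adapter is a contained module rather than a scattering of shell calls.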

Core Model¶

The minimum persistent conceptual model is:

Repository
BranchHead
Commit
Experiment
Run
Result

Repository¶

Identifies the repo under optimization.

Suggested fields:

repository_id
canonical_path or remote URL
default_branch

BranchHead¶

Represents the single tracked frontier for the active loop.

Suggested fields:

repository_id
branch_name
current_commit_hash
frontier_metric_name
frontier_metric_direction (min or max)
frontier_metric_value

Commit¶

Represents a concrete git state.

Suggested fields:

repository_id
branch_name
commit_hash
parent_commit_hash
message
created_at

For v0.1, this can start as a thin record over git facts while remaining first-class in the model.
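
As an illustration of a "thin record over git facts", the sketch below builds a Commit from a single line of git log output. The assumptions are loud: the line is expected to come from `git log -1 --format=%H%x00%P%x00%s%x00%cI` (hash, parents, subject, and committer date separated by NUL bytes), created_at is taken to be the committer date, only the first parent is recorded, and commit_from_git_log is a hypothetical helper name.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed pretty-format: hash, parent hashes, subject, ISO 8601 committer date.
GIT_LOG_FORMAT = "%H%x00%P%x00%s%x00%cI"


@dataclass(frozen=True)
class Commit:
    repository_id: str
    branch_name: str
    commit_hash: str
    parent_commit_hash: Optional[str]  # None for a root commit
    message: str
    created_at: datetime


def commit_from_git_log(repository_id: str, branch_name: str, line: str) -> Commit:
    """Parse one NUL-separated line produced with GIT_LOG_FORMAT."""
    commit_hash, parents, subject, committed = line.rstrip("\n").split("\x00")
    parent_hashes = parents.split()  # empty for a root commit, several for a merge
    return Commit(
        repository_id=repository_id,
        branch_name=branch_name,
        commit_hash=commit_hash,
        parent_commit_hash=parent_hashes[0] if parent_hashes else None,
        message=subject,
        created_at=datetime.fromisoformat(committed),
    )


# Hypothetical sample line; the hashes are shortened for readability.
sample = "c0ffee\x00badc0de\x00Lower learning rate\x002026-03-13T10:00:00+00:00\n"
commit = commit_from_git_log("repo-1", "autoresearch/demo", sample)
print(commit.commit_hash, commit.parent_commit_hash, commit.created_at.isoformat())
```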

Experiment¶

Represents the proposed advancement of the tracked frontier.

An experiment is composed of two layers:

a git layer, which identifies the software state being tested
a runtime layer, which declares how Archetype should instantiate and evaluate that state

An Experiment is therefore not itself a World or a RunConfig. Instead, it carries the declarative recipe needed to create a world and derive a concrete run configuration at execution time.

Suggested fields:

experiment_id
repository_id
branch_name
base_commit_hash
proposal_summary
world_spec_json
run_spec_json
evaluation_spec_json
status
created_at

Suggested meanings:

world_spec_json: how to instantiate the world for this experiment
run_spec_json: how to derive the concrete RunConfig for execution
evaluation_spec_json: how to interpret the resulting metrics and compare against the frontier

Canonical spec shapes for v0.1¶

For v0.1, these specs should start with a small canonical shape instead of unconstrained blobs.

Suggested world_spec_json:

{
  "world_name": "experiment-world",
  "storage": {
    "uri": "./archetype_data",
    "namespace": "archetypes",
    "backend": "lancedb"
  },
  "cache": {
    "flush_rows": 1000000,
    "flush_mb": 512,
    "idle_sec": 30.0
  },
  "resources": {},
  "metadata": {}
}

Suggested run_spec_json:

{
  "budget_kind": "steps",
  "budget_value": 1,
  "debug": false,
  "prefer_live_reads": true,
  "show_rows": 0,
  "suite": "autoresearch",
  "trial": null,
  "metadata": {}
}

Suggested evaluation_spec_json:

{
  "primary_metric_name": "val_bpb",
  "direction": "min",
  "secondary_metric_names": [
    "peak_vram_mb",
    "training_seconds",
    "total_seconds"
  ],
  "crash_is_failure": true,
  "metadata": {}
}

The intent is not to freeze these forever, only to keep the first implementation typed and interoperable.
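
To show what "typed" could mean in practice, here is a minimal sketch that parses evaluation_spec_json into a dataclass. EvaluationSpec and parse_evaluation_spec are hypothetical names, and rejecting unknown keys is an assumption of this sketch rather than a stated requirement.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass(frozen=True)
class EvaluationSpec:
    primary_metric_name: str
    direction: str  # "min" or "max"
    secondary_metric_names: List[str] = field(default_factory=list)
    crash_is_failure: bool = True
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.direction not in ("min", "max"):
            raise ValueError(f"direction must be 'min' or 'max', got {self.direction!r}")


def parse_evaluation_spec(raw: str) -> EvaluationSpec:
    """Parse the JSON text, rejecting keys outside the canonical shape (assumption)."""
    data = json.loads(raw)
    known = {f.name for f in fields(EvaluationSpec)}
    unknown = set(data) - known
    if unknown:
        raise ValueError(f"unknown evaluation spec keys: {sorted(unknown)}")
    return EvaluationSpec(**data)


raw = """
{
  "primary_metric_name": "val_bpb",
  "direction": "min",
  "secondary_metric_names": ["peak_vram_mb", "training_seconds", "total_seconds"],
  "crash_is_failure": true,
  "metadata": {}
}
"""
spec = parse_evaluation_spec(raw)
print(spec.primary_metric_name, spec.direction)
```

The same pattern would extend to world_spec_json and run_spec_json; strict key checking is one way to keep the shapes small, at the cost of requiring a code change for every new field.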

Run¶

Represents one bounded execution of an experiment.

Suggested fields:

run_id
experiment_id
world_id
status
budget_type
budget_value
started_at
finished_at
artifact_uri or log reference

Result¶

Represents the metrics and terminal outcome of a run.

Suggested fields:

result_id
experiment_id
run_id
primary_metric_name
primary_metric_value
primary_metric_direction
secondary_metrics_json
runtime_metadata_json
failure_metadata_json

State Machine¶

Experiment states¶

pending
running
succeeded
crashed
kept
discarded

Run states¶

pending
running
completed
crashed
timed_out

Transition rules¶

Create an Experiment from the current tracked BranchHead.
Create a Run for that experiment.
Move both to running.
If execution fails: Run -> crashed|timed_out, Experiment -> crashed.
If execution completes: Run -> completed, Experiment -> succeeded.
Compare the Result to the tracked branch frontier.
If the result advances the frontier: Experiment -> kept and advance BranchHead.
Otherwise: Experiment -> discarded and leave BranchHead unchanged.

Important distinction:

succeeded means the experiment's run completed without crashing or timing out; it says nothing about the metric.
kept means the experiment improved the tracked frontier.
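
The transition rules above can be encoded as a small table of allowed moves. This is a sketch under assumptions: ExperimentState, ALLOWED, transition, and settle are hypothetical names, and run states are passed as plain strings for brevity.

```python
from enum import Enum


class ExperimentState(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    CRASHED = "crashed"
    KEPT = "kept"
    DISCARDED = "discarded"


S = ExperimentState

# Allowed experiment transitions; crashed, kept, and discarded are terminal.
ALLOWED = {
    S.PENDING: {S.RUNNING},
    S.RUNNING: {S.SUCCEEDED, S.CRASHED},
    S.SUCCEEDED: {S.KEPT, S.DISCARDED},
    S.CRASHED: set(),
    S.KEPT: set(),
    S.DISCARDED: set(),
}


def transition(current: ExperimentState, target: ExperimentState) -> ExperimentState:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def settle(run_state: str, advances_frontier: bool) -> ExperimentState:
    """Walk an experiment from pending to its terminal state for a finished run."""
    state = transition(S.PENDING, S.RUNNING)
    if run_state in ("crashed", "timed_out"):
        return transition(state, S.CRASHED)
    if run_state != "completed":
        raise ValueError(f"run is not in a terminal state: {run_state}")
    state = transition(state, S.SUCCEEDED)
    return transition(state, S.KEPT if advances_frontier else S.DISCARDED)


print(settle("completed", True).value)   # kept
print(settle("completed", False).value)  # discarded
print(settle("timed_out", True).value)   # crashed
```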

Run and Result Contract¶

The intended relationship between experiment-time and runtime primitives is:

Experiment: declarative definition of what to test and how to instantiate the runtime
World: instantiated runtime environment for that experiment
RunConfig: concrete execution budget/config derived from the experiment's run_spec
Run: realized execution record for that world and run config

For v0.1, a run is a bounded execution of exactly one experiment against one concrete repository/branch/base-commit tuple.

A result must provide:

exactly one designated primary frontier metric
the direction of comparison (min or max)
optional secondary metrics
runtime and cost metadata
failure metadata when relevant

For autoresearch, the primary metric is val_bpb, but the engine should not hard-code that metric name.
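
A metric-agnostic comparison could look like the sketch below. Two points are assumptions of this sketch, not decisions made in this document: ties do not advance the frontier (strict improvement is required), and a missing frontier value means the first completed result sets it. The function name and the sample values are hypothetical.

```python
from typing import Optional


def advances_frontier(candidate: float, frontier: Optional[float], direction: str) -> bool:
    """Return True if candidate strictly improves on the tracked frontier value."""
    if direction not in ("min", "max"):
        raise ValueError(f"direction must be 'min' or 'max', got {direction!r}")
    if frontier is None:
        # Assumption: with no frontier recorded yet, the first result sets it.
        return True
    if direction == "min":
        return candidate < frontier
    return candidate > frontier


# The metric name never appears in the comparison; only value and direction do.
print(advances_frontier(0.98, 1.00, "min"))  # True: lower is better
print(advances_frontier(1.00, 1.00, "min"))  # False: a tie does not advance
print(advances_frontier(0.98, 1.00, "max"))  # False: higher is better
```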

Boundary with Existing Archetype Layers¶

Core¶

Core remains the sacred runtime substrate: tick execution, world stepping, update/query semantics, persistence.

App¶

The app layer is the correct home for:

the transactional git adapter
experiment orchestration
bounded run execution against concrete repos
reference autoresearch-style controller logic

DSL¶

DSL support is explicitly out of scope for v0.1.

The goal is to first make the experiment system correct and minimal without depending on additional DSL sugar.

Explicit Non-Goals for v0.1¶

no merge-back-to-main policy
no multi-branch frontier racing
no population-wide search semantics
no RL rollout or trajectory terminology in the public model
no requirement to embed git shelling directly into core
no requirement to expose the first implementation through the DSL

Open Questions¶

These remain intentionally flexible:

Which model records must be fully persisted in Archetype tables from day one?
How thin or rich should the Commit record be initially?
Should bounded runs support only wall-clock budgets first, or also tick-count budgets?
How much journal state should the transactional git adapter persist for crash recovery?

Risks¶

Core/app boundary drift: git orchestration pressure may leak into core/ unless the transaction and state-machine logic stays in app/.
Dual-state divergence: git state and Archetype state can disagree if commit creation and experiment/result persistence are not performed atomically.
Crash recovery ambiguity: long-running loops need enough journal state to distinguish not-started, materialized, committed, rolled-back, and unknown-after-crash states.
Branch-head race conditions: retries or multiple workers can attempt to advance the same tracked frontier from stale base commits.
Non-reproducible runs: underspecified runtime or environment details can make a kept experiment hard to reproduce later.
Metric comparability drift: keep/discard decisions are only meaningful if the evaluation contract stays stable across experiments.
Spec over-flexibility: unconstrained JSON specs will quickly become a schema trap unless the canonical shapes stay small and explicit.
Result/commit mismatch: a run can succeed while pointing at the wrong git state if the execution boundary is not tightly coupled to the git transaction.
Artifact sprawl: worktrees, logs, checkpoints, and result artifacts will accumulate rapidly unless ownership and cleanup rules are defined early.
Security and side effects: autonomous software optimization implies arbitrary code execution, so sandbox and trust boundaries must be stated even if v0.1 does not fully solve them.

Recommendation¶

Implement the framework primitives and state machine at the app/framework boundary first, then build a reference autoresearch controller on top. That yields a working end-to-end loop without prematurely coupling the design to DSL ergonomics or deeper multi-world search policies.