Specialized Task Agents And Real-Model Validation
Specialized task agents are default-config agents designed to complete a narrow project task reliably. They combine a concise developer/system message, a task-specific guide in bundled docs, focused tool access, and a real-model test that shows the agent can complete the task in a realistic fixture.
This pattern is useful when a product feature depends on model behavior, not only deterministic code. The goal is not to freeze exact model output. The goal is to verify that the shipped default config gives a capable model enough context, tools, and documentation to complete the intended workflow.
Recommended Shape
Add the specialized agent to the default config, not only to a test-local config. Installed-bundle tests should exercise the same shipped defaults users receive.
Keep the developer/system message short. It should say what the agent is for, state important constraints, and point to a bundled guide by absolute path. Put step-by-step instructions, examples, schemas, and troubleshooting details in the guide, not in the developer message.
Use config variable substitution for doc paths. For docs shipped in plugins,
prefer paths under ${env:BUILTIN_PLUGINS} so both source checkouts and
installed bundles can resolve them.
Example structure:
{
  "mixins": {
    "example_task_message": {
      "system_message_enabled": true,
      "system_message": {
        "template": "Read the task guide at `{{TASK_GUIDE_PATH}}` before changing files. Follow the guide workflow, use project docs to infer project-specific details, run cheap checks before final validation, and do not print secrets.",
        "variables": {
          "TASK_GUIDE_PATH": {
            "text": "${env:BUILTIN_PLUGINS}/example-plugin/docs/task-guide.md"
          }
        }
      }
    }
  },
  "agents": {
    "example-task-author": {
      "mixin_refs": ["codex_developer_message", "codex_mcp_defaults", "example_task_message"],
      "provider": "openai",
      "model": "gpt-5.4-mini",
      "allowed_paths": [".", "..", "${env:BUILTIN_PLUGINS}/example-plugin"],
      "plugins": ["path:${env:BUILTIN_PLUGINS}/codex-tools"],
      "shell_tool_mode": "auto"
    }
  }
}
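The `${env:...}` placeholders in the example are resolved by the config loader. As a rough illustration of the expected behavior, and not the loader's actual implementation, the Python sketch below expands such placeholders and fails loudly when a variable is missing. The function name `resolve_env_refs` and the error behavior are assumptions made for this example.

```python
from __future__ import annotations

import os
import re

# Hypothetical resolver for illustration; the real config loader may differ.
_ENV_REF = re.compile(r"\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_refs(value: str, env: dict[str, str] | None = None) -> str:
    """Replace each ${env:NAME} placeholder with the value of NAME."""
    source = os.environ if env is None else env

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in source:
            # Assumed behavior: fail instead of emitting a broken doc path.
            raise KeyError(f"environment variable {name} is not set")
        return source[name]

    return _ENV_REF.sub(replace, value)
```

A test can use a helper like this to compute the guide path it expects the agent to read, so the same check works in source checkouts and installed bundles.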
For harder fixtures, use a stronger model only in the specific test scenario. Keep the cheaper model as the shipped default when it is good enough for normal use.
Authoring Guide
The guide should include:
- the task goal and expected output files
- the minimum schemas or file formats the agent must generate
- the intended workflow
- fast checks the agent should run before final validation
- examples or references to existing standard implementations
- constraints around secrets, idempotence, generated files, and project-specific guidance
- when plugin source may be read as a fallback
For project-specific tasks, tell the agent to read normal project guidance such
as AGENTS.md, README files, package manifests, and tests. Do not require the
project guidance to be task-specific; it should remain the ordinary project
setup and testing documentation.
Installed-Bundle Real-Model Tests
Real-model tests for specialized agents should use the installed bundle when the feature must work for installed users. The test should:
- extract the bundle archive
- start the bundled CLI/server with the default config
- create a realistic fixture project
- send a generic user request to the specialized agent
- allow only the fixture paths and any bundled docs/source paths the agent may need
- override auth/model only through normal config overrides
- assert final artifacts and run independent validation after the model stops
The user prompt should stay generic. Project details should live in the fixture project docs and files. Task mechanics should live in the bundled guide.
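The setup half of such a test might look like the Python sketch below. It covers only the filesystem steps (extracting the bundle, creating a fixture, and building a request with a generic prompt); starting the bundled CLI/server is left out because it depends on the product. All helper names, the fixture contents, the bundle layout, and the request shape are hypothetical.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def extract_bundle(archive: Path, dest: Path) -> Path:
    """Unpack the bundle archive (zip or tar, chosen by extension) into dest."""
    dest.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(dest))
    return dest


def create_fixture(root: Path) -> Path:
    """Create a small fixture project with ordinary, non-task-specific docs."""
    project = root / "fixture-project"
    (project / "tests").mkdir(parents=True)
    (project / "README.md").write_text("# Fixture\n\nRun tests with `make test`.\n")
    (project / "AGENTS.md").write_text("Use the Makefile targets for setup and tests.\n")
    return project


def build_request(agent: str, project: Path, bundle_root: Path) -> dict:
    """Build a request whose prompt is generic; details live in the fixture."""
    return {
        "agent": agent,
        # Generic prompt: no project details and no task mechanics.
        "prompt": "Complete the task for this project.",
        # Only the fixture and the bundled plugin docs/source are allowed.
        # The plugins/example-plugin layout is an assumption of this sketch.
        "allowed_paths": [
            str(project),
            str(bundle_root / "plugins" / "example-plugin"),
        ],
    }
```

Keeping the prompt free of project specifics means a failing run points at the guide or the fixture docs rather than at prompt wording.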
When a test supports ChatGPT login auth, make the auth mode explicit. For
example, read the credentials path from an env var such as CHATGPT_CREDENTIALS
and set auth_mode: "chatgpt" plus that path in the request overrides. Do not
silently fall back from a requested credential mode to API-key mode.
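One way to enforce the no-fallback rule is to build the overrides in a helper that raises when the requested mode cannot be satisfied. The Python sketch below assumes the override key for the credentials path is `credentials_path`; that key name and the helper name are illustrative, not confirmed configuration.

```python
from __future__ import annotations

import os


def auth_overrides(requested_mode: str, env: dict[str, str] | None = None) -> dict[str, str]:
    """Return request overrides for the requested auth mode, or raise."""
    env = dict(os.environ) if env is None else env
    if requested_mode == "chatgpt":
        credentials = env.get("CHATGPT_CREDENTIALS")
        if not credentials:
            # No silent fallback to API-key mode: the test must fail or skip.
            raise RuntimeError(
                "auth_mode 'chatgpt' was requested but CHATGPT_CREDENTIALS is not set"
            )
        # "credentials_path" is an assumed key name for this sketch.
        return {"auth_mode": "chatgpt", "credentials_path": credentials}
    raise ValueError(f"unsupported auth mode in this sketch: {requested_mode!r}")
```

A test runner can catch the `RuntimeError` and mark the scenario as skipped, which keeps the report honest about which credential mode was actually exercised.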
Process Telemetry
Capture process-quality signals separately from final pass/fail:
- total tool calls and tool calls by name
- whether the agent read the guide
- whether it read plugin source
- whether it read example tests
- generated files
- syntax checks or cheap local tests attempted
- final validation attempted
- command failures observed
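These signals can be gathered in a small collector that the test feeds with one record per tool call. The Python sketch below is a minimal version covering a subset of the list; the class name, the `record` signature, and the idea that each tool call exposes a single target path are assumptions for illustration.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ProcessTelemetry:
    """Process-quality signals, kept separate from the pass/fail result."""

    guide_path: str
    plugin_source_root: str
    tool_calls: Counter = field(default_factory=Counter)
    read_guide: bool = False
    read_plugin_source: bool = False
    command_failures: int = 0

    def record(self, tool: str, target: str = "", ok: bool = True) -> None:
        """Record one tool call; target is the path or command it touched."""
        self.tool_calls[tool] += 1
        if target == self.guide_path:
            self.read_guide = True
        elif target.startswith(self.plugin_source_root):
            self.read_plugin_source = True
        if not ok:
            self.command_failures += 1

    @property
    def total_tool_calls(self) -> int:
        return sum(self.tool_calls.values())
```

Reporting these values alongside the final result makes it possible to tell a clean pass from one that needed plugin source reads or many failed commands.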
Use both soft and hard budgets:
- soft tool-call limit: warn or report diagnostics, but allow a valid solution
- hard tool-call limit: stop or fail the request
- soft timeout: warn and continue so downstream validation can still be observed
- hard timeout: fail and terminate the run
Exceeding a soft budget is a signal to tune the instructions, not a test failure. Exceeding a hard budget is a test failure.
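The soft/hard distinction can be sketched as a two-threshold check in Python. The names `Budget`, `BudgetStatus`, and `evaluate`, and the example limits in the usage note, are hypothetical. The sketch only classifies usage after the fact; actually stopping a run at the hard limit would need enforcement in the test harness.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    SOFT_EXCEEDED = "soft_exceeded"
    HARD_EXCEEDED = "hard_exceeded"


@dataclass(frozen=True)
class Budget:
    soft: float
    hard: float

    def check(self, used: float) -> BudgetStatus:
        if used > self.hard:
            return BudgetStatus.HARD_EXCEEDED
        if used > self.soft:
            return BudgetStatus.SOFT_EXCEEDED
        return BudgetStatus.OK


def evaluate(
    tool_calls: int, elapsed_s: float, tool_budget: Budget, time_budget: Budget
) -> tuple[bool, list[str]]:
    """Return (passed, warnings): soft overruns warn, hard overruns fail."""
    statuses = {
        "tool calls": tool_budget.check(tool_calls),
        "elapsed seconds": time_budget.check(elapsed_s),
    }
    warnings = [
        f"{name} exceeded soft budget"
        for name, status in statuses.items()
        if status is BudgetStatus.SOFT_EXCEEDED
    ]
    passed = all(s is not BudgetStatus.HARD_EXCEEDED for s in statuses.values())
    return passed, warnings
```

For example, with a tool-call budget of `Budget(soft=40, hard=80)` and a time budget of `Budget(soft=300, hard=600)`, a run that used 45 tool calls in 100 seconds passes with one warning, while a run that used 90 tool calls fails.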
Fixture Design
Fixtures should be small but realistic. Prefer real files, package manifests, git repositories, and tests over mocked internals. For multi-repository or multi-service workflows, initialize fixture repositories during test setup and clean them up through the test temp directory.
Assertions should check behavior, not exact text. Good assertions include:
- required files exist and are executable
- generated JSON matches expected shape
- generated scripts pass shell syntax checks
- a local no-Docker harness passes when custom transfer/setup behavior is generated
- the final public validation command passes
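The first three kinds of assertion can be written as small helpers in Python. This is a sketch under assumptions: the helper names are hypothetical, the JSON check only covers top-level keys and types, and the syntax check relies on a POSIX `sh` being available on the test machine.

```python
from __future__ import annotations

import json
import os
import subprocess
from pathlib import Path


def assert_executable(path: Path) -> None:
    """The file exists and the current user may execute it."""
    assert path.is_file(), f"missing file: {path}"
    assert os.access(path, os.X_OK), f"not executable: {path}"


def assert_json_shape(path: Path, required: dict[str, type]) -> None:
    """Top-level keys exist with the expected types; values are not compared."""
    data = json.loads(path.read_text())
    assert isinstance(data, dict), f"expected a JSON object in {path}"
    for key, expected in required.items():
        assert key in data, f"missing key {key!r} in {path}"
        assert isinstance(data[key], expected), f"{key!r} is not {expected.__name__}"


def assert_shell_syntax(path: Path) -> None:
    """`sh -n` parses the script without running it."""
    result = subprocess.run(["sh", "-n", str(path)], capture_output=True, text=True)
    assert result.returncode == 0, result.stderr
```

None of these helpers compares generated text verbatim, so a model can vary naming and wording without breaking the test as long as the artifacts behave correctly.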
Avoid test-only product shortcuts. If the realistic test exposes a product defect, keep the test realistic and fix the product code or document a follow-up task.
Documentation And Completion Notes
When the test exposes model confusion, update the guide first. Keep the developer message concise and generic. Record observed model behavior in task completion notes, including tool counts, timeouts, source reads, and remaining product issues.
If a passing test still exceeds a soft budget, treat it as functionally passing but performance-suspect. Do not mask product startup or dependency problems by simply raising hard timeouts.