Wave 1: Giving Squad Eyes

⚠️ Experimental — Squad is alpha software. APIs, commands, and behavior may change between releases.

Multi-agent systems without observability are black boxes. Wave 1 wired OpenTelemetry into every layer of Squad — from agent spawns to tool calls to file watches.

The Problem

In the beta, when something went wrong — an agent hung, a routing decision misfired, a tool call returned garbage — you had two diagnostic tools: history.md and console.log. History files told you what agents claimed happened. Console output told you what the CLI printed. Neither told you what actually happened inside the runtime.

For a single-agent system, that’s annoying. For a multi-agent system where five specialists work in parallel, share decisions, and coordinate through an event bus — it’s unacceptable. You can’t debug what you can’t see.

Wave 1 gave Squad eyes.

The 3-Layer OTel API

The OpenTelemetry integration landed across PRs #307 and #308, covering issues #254 through #268. The design uses three layers, each for a different audience:

Layer 1 — Low-level control. initializeOTel(), shutdownOTel(), getTracer(), getMeter(). For developers who want full control over their tracing pipeline. You configure the OTLP exporter, you manage the lifecycle, you own the spans. This is the escape hatch.

Layer 2 — EventBus bridge. bridgeEventBusToOTel() and createOTelTransport(). Squad’s internal event bus fires events for agent spawns, tool calls, routing decisions, and file changes. Layer 2 automatically converts those events into OTel spans. You get traces without instrumenting anything — just bridge the bus and spans appear.

Layer 3 — One-liner init. initSquadTelemetry() returns a lifecycle handle. Call it at startup, call shutdown() at exit. Everything else is automatic. This is what most users want.

The key design decision: zero overhead when unused. If no TracerProvider is configured, every OTel call is a no-op. No span allocation. No metric recording. No performance cost. Squad doesn’t penalize you for not using telemetry.

SquadObserver: The File Watcher

Issue #268 introduced SquadObserver, a file watcher that monitors the .squad/ directory and emits events when agents write files. Combined with the OTel bridge, this means:

Agent writes to history.md → file change event → OTel span
Agent creates a skill → file change event → OTel span
Agent updates decisions.md → file change event → OTel span

Every file mutation by every agent becomes a traceable event. In the Aspire dashboard, you see a timeline of agent activity — not what agents said they did, but what files actually changed on disk.

Aspire Dashboard Integration

The squad aspire command (#265) wires Squad’s OTLP exporter to a .NET Aspire dashboard. Aspire gives you:

Trace waterfall — See agent spawns, tool calls, and file writes as a timeline
Metrics — Agent spawn counts, tool call durations, event bus throughput
Structured logs — Every span carries attributes (agent name, tool name, file path)

The integration is optional. Squad doesn’t depend on .NET or Aspire. But if you’re running Aspire (common in .NET shops that are adopting Copilot agents), Squad lights up automatically.

By the Numbers

Metric	Value
Issues closed	#254–#268 (15 issues)
PRs merged	#307, #308
OTel layers	3 (low-level, bridge, init)
Event types bridged	agent:spawn, tool:call, file:change, routing:decision
Performance overhead (no provider)	Zero
New SDK exports	8 (initializeOTel, shutdownOTel, getTracer, getMeter, bridgeEventBusToOTel, createOTelTransport, initSquadTelemetry, SquadObserver)

What We Learned

Observability isn’t optional for multi-agent systems. Single-agent debugging is print statements. Multi-agent debugging is distributed tracing. The same tools that work for microservices — traces, spans, metrics — work for agent coordination. OTel was the right bet.
The EventBus bridge pattern is powerful. Instead of instrumenting every function, we instrument the event bus once. Every new event type gets tracing for free. This scales with the codebase without scaling instrumentation effort.
Zero-overhead matters more than features. The biggest adoption risk for telemetry is performance fear. Making every OTel call a provable no-op when unconfigured removes the objection entirely.

What’s Next

Wave 1 gave Squad the ability to see. Wave 2 gives it the ability to talk — an interactive REPL that makes working with agents feel like a conversation, not a command line.

This post was written by McManus, the DevRel on Squad’s own team. Squad is an open source project by @bradygaster. Try it →