8. Monitoring & Observability
When AI agents operate autonomously, visibility is everything. Sutra gives you comprehensive, real-time dashboards to monitor system health, track agent behavior, debug issues, control costs, and catch problems before your users do. Think of it as mission control for your autonomous organization.
Real-Time Monitor
The Monitor dashboard (/monitor) is designed for system administrators to oversee the live state of the platform.

Live Telemetry
- Infrastructure: View live active connections to PostgreSQL, WebSocket activity, and memory usage.
- Queues: Monitor the exact number of pending tasks in Redis. If a batch job is running, you can watch the queue size decrease in real-time.
- Activity Feed: A scrolling live feed of every action taking place across the system—from an agent answering a chat to a cron job executing a workflow.
The Monitor dashboard auto-refreshes every few seconds, giving you a live picture of your platform's pulse. During batch processing or high-traffic periods, this is invaluable for spotting bottlenecks — if your Redis queue is growing faster than it's being consumed, you know to scale up your Celery workers.
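As a minimal sketch of that bottleneck check, the logic amounts to comparing successive queue-depth samples and flagging sustained growth. The function and thresholds below are illustrative, not Sutra's actual implementation:

```python
# Compare successive Redis queue-depth samples and flag sustained growth.
# Sample cadence and streak length here are illustrative defaults.

def queue_is_growing(samples, min_consecutive=3):
    """Return True if queue depth rose for `min_consecutive` samples in a row."""
    streak = 0
    for prev, curr in zip(samples, samples[1:]):
        streak = streak + 1 if curr > prev else 0
        if streak >= min_consecutive:
            return True
    return False

# Depth sampled every few seconds while workers fall behind:
print(queue_is_growing([10, 14, 19, 27, 40]))  # True: consider scaling workers
print(queue_is_growing([40, 31, 25, 18, 9]))   # False: workers keep up
```

Requiring several consecutive increases avoids false alarms from normal short-lived spikes.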
Agent Watchdog
Running behind the Monitor is the Agent Watchdog — a background process that continuously monitors agent health and automatically recovers from failures.
- Health Checks: Every 60 seconds (configurable), the watchdog checks whether each running agent has sent a recent heartbeat. An agent's heartbeat is recorded after each successful invocation.
- Auto-Restart: If an agent hasn't sent a heartbeat within 3× the check interval (i.e., 3 minutes by default), it's considered unresponsive. The watchdog automatically restarts it — up to 3 consecutive times (configurable).
- Graceful Degradation: If an agent exceeds the maximum restart count, it's unregistered from monitoring and marked as failed. This prevents a broken agent from consuming resources in an endless restart loop.
- Reset on Success: The restart counter resets to zero whenever an agent successfully handles a request, so transient failures don't accumulate over time.
The watchdog runs automatically — no configuration required. You can tune the check interval, timeout multiplier, and max restarts in System Settings if needed.
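The decision logic above can be modeled in a few lines. This is a simplified in-memory sketch, not the real background process; class and function names are assumptions:

```python
CHECK_INTERVAL = 60      # seconds between health checks (configurable)
TIMEOUT_MULTIPLIER = 3   # unresponsive after 3x the check interval
MAX_RESTARTS = 3         # consecutive restarts before marking failed

class AgentState:
    def __init__(self, now):
        self.last_heartbeat = now
        self.restart_count = 0
        self.failed = False

def record_success(agent, now):
    """A heartbeat is recorded after each successful invocation;
    success also resets the consecutive-restart counter."""
    agent.last_heartbeat = now
    agent.restart_count = 0

def check(agent, now):
    """One watchdog pass: restart an unresponsive agent, or mark it
    failed once it exceeds the restart budget."""
    if agent.failed:
        return "failed"
    if now - agent.last_heartbeat <= CHECK_INTERVAL * TIMEOUT_MULTIPLIER:
        return "healthy"
    if agent.restart_count >= MAX_RESTARTS:
        agent.failed = True      # graceful degradation: no endless restart loop
        return "failed"
    agent.restart_count += 1
    agent.last_heartbeat = now   # a restart counts as a fresh start
    return "restarted"
```

For example, an agent silent since `t=0` reads as `"healthy"` at `t=100` (within the 180-second window), `"restarted"` at `t=300`, and after three fruitless restarts is marked `"failed"` and unregistered.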
Analytics
Analytics (/analytics) aggregates data to help you understand macro trends.

Key Insights
- Agent Utilization: Heatmaps showing which agents are used most frequently and at what times of day.
- Tool Success Rates: Identify if a specific tool (e.g., a custom web scraper) is consistently failing, indicating that the agent's instructions or the tool's code needs adjustment.
- Latency Metrics: Track the average time to first byte (TTFB) and total response time for different LLM providers.
- Conversation Quality: Track conversation length, user satisfaction signals (if configured), and how often agents need to use tools versus answering from memory — helping you identify which agents need more training data or better system prompts.
- Model Comparison: Side-by-side performance data for different LLM providers across the same agent tasks. See which models deliver the best quality-to-cost ratio for your specific use cases.
You can export analytics data as CSV for deeper analysis in your preferred BI tool.
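The tool success-rate view can be understood as a simple aggregation over tool-call events. The event shape below is an assumption for illustration; Sutra's internal schema may differ:

```python
from collections import defaultdict

def tool_success_rates(events):
    """events: iterable of (tool_name, succeeded: bool) pairs.
    Returns {tool_name: success_rate} with rates in [0.0, 1.0]."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for tool, ok in events:
        totals[tool] += 1
        if ok:
            wins[tool] += 1
    return {tool: wins[tool] / totals[tool] for tool in totals}

events = [
    ("web_scraper", False), ("web_scraper", False), ("web_scraper", True),
    ("calendar", True), ("calendar", True),
]
rates = tool_success_rates(events)
# A consistently low rate (web_scraper at ~33% here) signals that the
# tool's code or the agent's instructions need adjustment.
```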
System Logs
When you need to debug a specific issue, the Logs viewer (/logs) is your primary tool.

Deep Inspection
- Filtering: Filter logs by severity (INFO, WARN, ERROR), specific Agent IDs, or Conversation IDs.
- Tracing: Because Sutra attaches unique IDs to every execution chain, you can trace an error backward—from a failed web search, back to the agent's reasoning process, all the way to the original user prompt.
Every log entry includes a correlation ID that links it to the full execution chain — from the user's original prompt, through the agent's reasoning steps, tool calls, and final response. This end-to-end traceability makes debugging complex multi-agent workflows straightforward: find the error, click the correlation ID, and see the entire story.
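Conceptually, reconstructing that story is a filter plus a sort over log entries sharing one correlation ID. Field names like `correlation_id` and `ts` are illustrative:

```python
def trace(logs, correlation_id):
    """Return the full execution chain for one correlation ID, in order."""
    chain = [e for e in logs if e["correlation_id"] == correlation_id]
    return sorted(chain, key=lambda e: e["ts"])

logs = [
    {"ts": 3, "correlation_id": "abc123", "level": "ERROR", "msg": "web search failed"},
    {"ts": 1, "correlation_id": "abc123", "level": "INFO",  "msg": "user prompt received"},
    {"ts": 5, "correlation_id": "zzz999", "level": "INFO",  "msg": "unrelated request"},
    {"ts": 2, "correlation_id": "abc123", "level": "INFO",  "msg": "agent reasoning step"},
]
for entry in trace(logs, "abc123"):
    print(entry["level"], entry["msg"])
# INFO user prompt received
# INFO agent reasoning step
# ERROR web search failed
```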
Alerts
Don't wait for users to report an issue. Configure proactive Alerts (/alerts).

Alert Rules
Define custom conditions that trigger notifications:
- "If the Email Integration fails 3 times in 1 hour..."
- "If daily token consumption exceeds 1,000,000 tokens..."
- Notification Channels: Route alerts to your email inbox, a dedicated Slack #ops channel, or an SMS integration.
Sutra ships with pre-configured default alert rules covering common failure modes: error rate thresholds (>1% warning, >10% critical), agent failure streaks (3+ consecutive failures), P95 latency spikes (>10s), and quota usage warnings (>80% and 100%). Each rule is configurable with custom thresholds, evaluation windows, and cooldown periods to prevent alert spam.
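A rule of this shape reduces to a threshold over a time window, gated by a cooldown. The sketch below shows that evaluation under assumed field names, not Sutra's actual rule engine:

```python
def should_fire(rule, events, now):
    """events: timestamps (seconds) at which the condition occurred,
    e.g. integration failures. Fires when the count inside the window
    meets the threshold and the rule is not cooling down."""
    if now - rule["last_fired"] < rule["cooldown"]:
        return False  # suppress alert spam after a recent notification
    recent = [t for t in events if now - t <= rule["window"]]
    return len(recent) >= rule["threshold"]

# "If the Email Integration fails 3 times in 1 hour..."
rule = {"threshold": 3, "window": 3600, "cooldown": 1800, "last_fired": -10**9}
failures = [100, 2000, 3500]                   # three failures within one hour
print(should_fire(rule, failures, now=3600))   # True: rule fires
rule["last_fired"] = 3600
print(should_fire(rule, failures, now=4000))   # False: still in cooldown
```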
Financials
AI orchestration can get expensive if left unmonitored. The Financials module (/financials) provides absolute clarity on spending.

Cost Tracking
- Granular Breakdown: Sutra multiplies your exact prompt and completion token usage by the known per-token pricing of each LLM provider.
- Attribution: View costs broken down by Provider (OpenAI vs. Anthropic), by Agent (how much does the "Senior Dev" cost vs. the "Customer Support" agent), or by Project.
- Exporting: Download CSV reports for your finance team to process cross-departmental chargebacks.
The Financials dashboard also projects estimated monthly costs based on your current usage trajectory. If you're on track to exceed your budget, you'll see a warning early enough to adjust rate limits or switch some agents to more cost-effective models. No more end-of-month surprises.
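As a back-of-the-envelope sketch, the projection extrapolates month-to-date spend across the full month. Simple linear extrapolation is an assumption here; the dashboard may weight recent usage differently:

```python
def project_monthly_cost(spend_to_date, day_of_month, days_in_month):
    """Linear projection of month-end spend from the current run rate."""
    return spend_to_date / day_of_month * days_in_month

spend = 124.50   # dollars spent by day 10 of a 31-day month
projected = project_monthly_cost(spend, day_of_month=10, days_in_month=31)
print(round(projected, 2))   # 385.95

budget = 300.0
if projected > budget:
    print("warning: on track to exceed budget")  # time to adjust rate limits
```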
Model Pricing Table
Sutra ships with built-in per-token pricing data for major providers — OpenAI (GPT-4o, GPT-4o Mini, o1, o1-mini), Anthropic (Claude Opus, Sonnet, Haiku), Google (Gemini 1.5/2.0/2.5), and Groq (Llama, Mixtral). Costs are calculated using separate input and output token rates per 1K tokens, reflecting each provider's actual pricing structure.
You can customize pricing for any provider/model pair, and a wildcard fallback (*/*) ensures new or unknown models still get a reasonable cost estimate. This makes Sutra's cost attribution accurate even as you add new LLM providers.
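The lookup can be pictured as a table keyed by provider/model pairs with separate per-1K input and output rates, falling back to the `*/*` wildcard. All prices below are illustrative placeholders, not Sutra's shipped pricing data:

```python
PRICING = {  # (provider, model) -> dollars per 1K (input, output) tokens
    ("openai", "gpt-4o"):           (0.0025, 0.0100),
    ("anthropic", "claude-sonnet"): (0.0030, 0.0150),
    ("*", "*"):                     (0.0020, 0.0080),  # fallback for unknown models
}

def cost(provider, model, prompt_tokens, completion_tokens):
    """Per-request cost using separate input/output rates per 1K tokens."""
    rate_in, rate_out = PRICING.get((provider, model), PRICING[("*", "*")])
    return prompt_tokens / 1000 * rate_in + completion_tokens / 1000 * rate_out

known = cost("openai", "gpt-4o", 12_000, 3_000)          # exact provider rates
unknown = cost("newvendor", "mystery-model", 1_000, 1_000)  # falls back to */*
```

The wildcard fallback is what keeps attribution sane when a brand-new model appears before its rates are configured.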
Social Pulse
The Social Pulse dashboard (/social-pulse) tracks trending content and viral topics across the web, helping your agents stay informed about what's happening in your industry.

Trending Content Research
- Aggregate trending topics, sentiment analysis, and virality scoring across Google Trends, YouTube, Reddit, and Hacker News.
- Configure niches to focus research on your areas of interest. Sutra ships with 7 built-in niches: Technology, Business & Finance, Marketing & Social Media, Health & Fitness, Entertainment, Gaming, and Science & Education.
- Track specific keywords across platforms to monitor topics relevant to your business.
Social Pulse runs automatically every 30 minutes (configurable via cron). Each trending item includes platform source, category (trending/rising/viral/keyword_track), sentiment analysis (positive/negative/neutral/mixed), and engagement metrics. You can create custom niches with your own Google Trends keywords, subreddit lists, and YouTube categories.
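A custom niche might look like the structure below. The exact keys are assumptions matching the knobs described above (Google Trends keywords, subreddit lists, YouTube categories), not Sutra's real schema:

```python
custom_niche = {
    "name": "Developer Tools",
    "google_trends_keywords": ["code editor", "CI/CD", "LLM agents"],
    "subreddits": ["programming", "devops", "MachineLearning"],
    "youtube_categories": ["Science & Technology"],
}

def validate_niche(niche):
    """Minimal sanity check before registering a custom niche."""
    required = {"name", "google_trends_keywords", "subreddits",
                "youtube_categories"}
    missing = required - niche.keys()
    if missing:
        raise ValueError(f"niche missing fields: {sorted(missing)}")
    return True

print(validate_niche(custom_niche))  # True
```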