8. Monitoring & Observability
When AI agents operate autonomously, visibility is everything. Sutra gives you comprehensive, real-time dashboards to monitor system health, track agent behavior, debug issues, control costs, and catch problems before your users do. Think of it as mission control for your autonomous organization.
Real-Time Monitor
The Monitor dashboard (/monitor) is designed for system administrators to oversee the live state of the platform.

Live Telemetry
- Infrastructure: View live active connections to PostgreSQL, WebSocket activity, and memory usage.
- Queues: Monitor the exact number of pending tasks in Redis. If a batch job is running, you can watch the queue size decrease in real-time.
- Activity Feed: A scrolling live feed of every action taking place across the system—from an agent answering a chat to a cron job executing a workflow.
The Monitor dashboard auto-refreshes every few seconds, giving you a live picture of your platform's pulse. During batch processing or high-traffic periods, this is invaluable for spotting bottlenecks — if your Redis queue is growing faster than it's being consumed, you know to scale up your Celery workers.
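As a minimal sketch of that bottleneck check, the logic amounts to comparing successive queue-depth samples and flagging sustained growth. The function and thresholds below are illustrative, not Sutra's actual implementation:

```python
# Compare successive Redis queue-depth samples and flag sustained growth.
# Sample cadence and streak length here are illustrative defaults.

def queue_is_growing(samples, min_consecutive=3):
    """Return True if queue depth rose for `min_consecutive` samples in a row."""
    streak = 0
    for prev, curr in zip(samples, samples[1:]):
        streak = streak + 1 if curr > prev else 0
        if streak >= min_consecutive:
            return True
    return False

# Depth sampled every few seconds while workers fall behind:
print(queue_is_growing([10, 14, 19, 27, 40]))  # True: consider scaling workers
print(queue_is_growing([40, 31, 25, 18, 9]))   # False: workers keep up
```

Requiring several consecutive increases avoids false alarms from normal short-lived spikes.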
Agent Watchdog
Running behind the Monitor is the Agent Watchdog — a background process that continuously monitors agent health and automatically recovers from failures.
- Health Checks: Every 60 seconds (configurable), the watchdog checks whether each running agent has sent a recent heartbeat. An agent's heartbeat is recorded after each successful invocation.
- Auto-Restart: If an agent hasn't sent a heartbeat within 3× the check interval (i.e., 3 minutes by default), it's considered unresponsive. The watchdog automatically restarts it — up to 3 consecutive times (configurable).
- Graceful Degradation: If an agent exceeds the maximum restart count, it's unregistered from monitoring and marked as failed. This prevents a broken agent from consuming resources in an endless restart loop.
- Reset on Success: The restart counter resets to zero whenever an agent successfully handles a request, so transient failures don't accumulate over time.
The watchdog runs automatically — no configuration required. You can tune the check interval, timeout multiplier, and max restarts in System Settings if needed.
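The decision logic above can be modeled in a few lines. This is a simplified in-memory sketch, not the real background process; class and function names are assumptions:

```python
CHECK_INTERVAL = 60      # seconds between health checks (configurable)
TIMEOUT_MULTIPLIER = 3   # unresponsive after 3x the check interval
MAX_RESTARTS = 3         # consecutive restarts before marking failed

class AgentState:
    def __init__(self, now):
        self.last_heartbeat = now
        self.restart_count = 0
        self.failed = False

def record_success(agent, now):
    """A heartbeat is recorded after each successful invocation;
    success also resets the consecutive-restart counter."""
    agent.last_heartbeat = now
    agent.restart_count = 0

def check(agent, now):
    """One watchdog pass: restart an unresponsive agent, or mark it
    failed once it exceeds the restart budget."""
    if agent.failed:
        return "failed"
    if now - agent.last_heartbeat <= CHECK_INTERVAL * TIMEOUT_MULTIPLIER:
        return "healthy"
    if agent.restart_count >= MAX_RESTARTS:
        agent.failed = True      # graceful degradation: no endless restart loop
        return "failed"
    agent.restart_count += 1
    agent.last_heartbeat = now   # a restart counts as a fresh start
    return "restarted"
```

For example, an agent silent since `t=0` reads as `"healthy"` at `t=100` (within the 180-second window), `"restarted"` at `t=300`, and after three fruitless restarts is marked `"failed"` and unregistered.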
Analytics
Analytics (/analytics) aggregates data to help you understand macro trends.

Key Insights
- Agent Utilization: Heatmaps showing which agents are used most frequently and at what times of day.
- Tool Success Rates: Identify if a specific tool (e.g., a custom web scraper) is consistently failing, indicating that the agent's instructions or the tool's code needs adjustment.
- Latency Metrics: Track the average time to first byte (TTFB) and total response time for different LLM providers.
- Conversation Quality: Track conversation length, user satisfaction signals (if configured), and how often agents need to use tools versus answering from memory — helping you identify which agents need more training data or better system prompts.
- Model Comparison: Side-by-side performance data for different LLM providers across the same agent tasks. See which models deliver the best quality-to-cost ratio for your specific use cases.
You can export analytics data as CSV for deeper analysis in your preferred BI tool.
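The tool success-rate view can be understood as a simple aggregation over tool-call events. The event shape below is an assumption for illustration; Sutra's internal schema may differ:

```python
from collections import defaultdict

def tool_success_rates(events):
    """events: iterable of (tool_name, succeeded: bool) pairs.
    Returns {tool_name: success_rate} with rates in [0.0, 1.0]."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for tool, ok in events:
        totals[tool] += 1
        if ok:
            wins[tool] += 1
    return {tool: wins[tool] / totals[tool] for tool in totals}

events = [
    ("web_scraper", False), ("web_scraper", False), ("web_scraper", True),
    ("calendar", True), ("calendar", True),
]
rates = tool_success_rates(events)
# A consistently low rate (web_scraper at ~33% here) signals that the
# tool's code or the agent's instructions need adjustment.
```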
System Logs
When you need to debug a specific issue, the Logs viewer (/logs) is your primary tool.

Deep Inspection
- Filtering: Filter logs by severity (INFO, WARN, ERROR), specific Agent IDs, or Conversation IDs.
- Tracing: Because Sutra attaches unique IDs to every execution chain, you can trace an error backward—from a failed web search, back to the agent's reasoning process, all the way to the original user prompt.
Every log entry includes a correlation ID that links it to the full execution chain — from the user's original prompt, through the agent's reasoning steps, tool calls, and final response. This end-to-end traceability makes debugging complex multi-agent workflows straightforward: find the error, click the correlation ID, and see the entire story.
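Conceptually, reconstructing that story is a filter plus a sort over log entries sharing one correlation ID. Field names like `correlation_id` and `ts` are illustrative:

```python
def trace(logs, correlation_id):
    """Return the full execution chain for one correlation ID, in order."""
    chain = [e for e in logs if e["correlation_id"] == correlation_id]
    return sorted(chain, key=lambda e: e["ts"])

logs = [
    {"ts": 3, "correlation_id": "abc123", "level": "ERROR", "msg": "web search failed"},
    {"ts": 1, "correlation_id": "abc123", "level": "INFO",  "msg": "user prompt received"},
    {"ts": 5, "correlation_id": "zzz999", "level": "INFO",  "msg": "unrelated request"},
    {"ts": 2, "correlation_id": "abc123", "level": "INFO",  "msg": "agent reasoning step"},
]
for entry in trace(logs, "abc123"):
    print(entry["level"], entry["msg"])
# INFO user prompt received
# INFO agent reasoning step
# ERROR web search failed
```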
Alerts
Don't wait for users to report an issue. Configure proactive Alerts (/alerts).

Alert Rules
Define custom conditions that trigger notifications:
- "If the Email Integration fails 3 times in 1 hour..."
- "If daily token consumption exceeds 1,000,000 tokens..."
- Notification Channels: Route alerts to your email inbox, a dedicated Slack #ops channel, or an SMS integration.
Sutra ships with pre-configured default alert rules covering common failure modes: error rate thresholds (>1% warning, >10% critical), agent failure streaks (3+ consecutive failures), P95 latency spikes (>10s), and quota usage warnings (>80% and 100%). Each rule is configurable with custom thresholds, evaluation windows, and cooldown periods to prevent alert spam.
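A rule of this shape reduces to a threshold over a time window, gated by a cooldown. The sketch below shows that evaluation under assumed field names, not Sutra's actual rule engine:

```python
def should_fire(rule, events, now):
    """events: timestamps (seconds) at which the condition occurred,
    e.g. integration failures. Fires when the count inside the window
    meets the threshold and the rule is not cooling down."""
    if now - rule["last_fired"] < rule["cooldown"]:
        return False  # suppress alert spam after a recent notification
    recent = [t for t in events if now - t <= rule["window"]]
    return len(recent) >= rule["threshold"]

# "If the Email Integration fails 3 times in 1 hour..."
rule = {"threshold": 3, "window": 3600, "cooldown": 1800, "last_fired": -10**9}
failures = [100, 2000, 3500]                   # three failures within one hour
print(should_fire(rule, failures, now=3600))   # True: rule fires
rule["last_fired"] = 3600
print(should_fire(rule, failures, now=4000))   # False: still in cooldown
```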
Financials
AI orchestration can get expensive if left unmonitored. The Financials module (/financials) provides absolute clarity on spending.

Cost Tracking
- Granular Breakdown: Sutra multiplies your exact prompt and completion token usage by the known per-token pricing of each LLM provider.
- Attribution: View costs broken down by Provider (OpenAI vs. Anthropic), by Agent (how much does the "Senior Dev" cost vs. the "Customer Support" agent), or by Project.
- Exporting: Download CSV reports for your finance team to process cross-departmental chargebacks.
The Financials dashboard also projects estimated monthly costs based on your current usage trajectory. If you're on track to exceed your budget, you'll see a warning early enough to adjust rate limits or switch some agents to more cost-effective models. No more end-of-month surprises.
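As a back-of-the-envelope sketch, the projection extrapolates month-to-date spend across the full month. Simple linear extrapolation is an assumption here; the dashboard may weight recent usage differently:

```python
def project_monthly_cost(spend_to_date, day_of_month, days_in_month):
    """Linear projection of month-end spend from the current run rate."""
    return spend_to_date / day_of_month * days_in_month

spend = 124.50   # dollars spent by day 10 of a 31-day month
projected = project_monthly_cost(spend, day_of_month=10, days_in_month=31)
print(round(projected, 2))   # 385.95

budget = 300.0
if projected > budget:
    print("warning: on track to exceed budget")  # time to adjust rate limits
```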
Model Pricing Table
Sutra ships with built-in per-token pricing data for major providers — OpenAI (GPT-4o, GPT-4o Mini, o1, o1-mini), Anthropic (Claude Opus, Sonnet, Haiku), Google (Gemini 1.5/2.0/2.5), and Groq (Llama, Mixtral). Costs are calculated using separate input and output token rates per 1K tokens, reflecting each provider's actual pricing structure.
You can customize pricing for any provider/model pair, and a wildcard fallback (*/*) ensures new or unknown models still get a reasonable cost estimate. This makes Sutra's cost attribution accurate even as you add new LLM providers.
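The lookup can be pictured as a table keyed by provider/model pairs with separate per-1K input and output rates, falling back to the `*/*` wildcard. All prices below are illustrative placeholders, not Sutra's shipped pricing data:

```python
PRICING = {  # (provider, model) -> dollars per 1K (input, output) tokens
    ("openai", "gpt-4o"):           (0.0025, 0.0100),
    ("anthropic", "claude-sonnet"): (0.0030, 0.0150),
    ("*", "*"):                     (0.0020, 0.0080),  # fallback for unknown models
}

def cost(provider, model, prompt_tokens, completion_tokens):
    """Per-request cost using separate input/output rates per 1K tokens."""
    rate_in, rate_out = PRICING.get((provider, model), PRICING[("*", "*")])
    return prompt_tokens / 1000 * rate_in + completion_tokens / 1000 * rate_out

known = cost("openai", "gpt-4o", 12_000, 3_000)          # exact provider rates
unknown = cost("newvendor", "mystery-model", 1_000, 1_000)  # falls back to */*
```

The wildcard fallback is what keeps attribution sane when a brand-new model appears before its rates are configured.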
Social Pulse
The Social Pulse dashboard (/social-pulse) tracks trending content and viral topics across the web, helping your agents stay informed about what's happening in your industry.

Trending Content Research
- Aggregate trending topics, sentiment analysis, and virality scoring across Google Trends, YouTube, Reddit, and Hacker News.
- Configure niches to focus research on your areas of interest. Sutra ships with 7 built-in niches: Technology, Business & Finance, Marketing & Social Media, Health & Fitness, Entertainment, Gaming, and Science & Education.
- Track specific keywords across platforms to monitor topics relevant to your business.
Social Pulse runs automatically every 30 minutes (configurable via cron). Each trending item includes platform source, category (trending/rising/viral/keyword_track), sentiment analysis (positive/negative/neutral/mixed), and engagement metrics. You can create custom niches with your own Google Trends keywords, subreddit lists, and YouTube categories.
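A custom niche might look like the structure below. The exact keys are assumptions matching the knobs described above (Google Trends keywords, subreddit lists, YouTube categories), not Sutra's real schema:

```python
custom_niche = {
    "name": "Developer Tools",
    "google_trends_keywords": ["code editor", "CI/CD", "LLM agents"],
    "subreddits": ["programming", "devops", "MachineLearning"],
    "youtube_categories": ["Science & Technology"],
}

def validate_niche(niche):
    """Minimal sanity check before registering a custom niche."""
    required = {"name", "google_trends_keywords", "subreddits",
                "youtube_categories"}
    missing = required - niche.keys()
    if missing:
        raise ValueError(f"niche missing fields: {sorted(missing)}")
    return True

print(validate_niche(custom_niche))  # True
```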