Skip to content

feat: add chat automations database schema and queries#23675

Open
kylecarbs wants to merge 1 commit intomainfrom
automations-database
Open

feat: add chat automations database schema and queries#23675
kylecarbs wants to merge 1 commit intomainfrom
automations-database

Conversation

@kylecarbs
Copy link
Member

@kylecarbs kylecarbs commented Mar 26, 2026

Why

Automations bridge external events to Coder chats. A GitHub webhook, a PagerDuty alert, or a cron schedule fires a trigger, and Coder turns it into a chat conversation — either creating a new one or continuing an existing one by matching labels.

This PR adds the database layer only: three tables, their indexes, RBAC resource definition, dbauthz authorization wrappers, and all generated code. No HTTP handlers, SDK types, or application logic — those stay in #23631 and will land on top of this.

Schema design

automations — what to do when a trigger fires

Each automation belongs to a user+org pair (RBAC-scoped) and carries:

  • instructions — the user-role message injected into every chat. This is the core prompt.
  • model_config_id / mcp_server_ids / allowed_tools — which model, MCP servers, and tools to use. Config references are SET NULL on delete so automations survive config removal.
  • status — three-state lifecycle: disabled (silently drop events), preview (log events but take no chat action), active (create/continue chats). Preview mode lets users verify their filter and label config before going live.
  • max_chat_creates_per_hour / max_messages_per_hour — two-tier rate limiting. The first caps new chats (protects against webhook storms creating hundreds of conversations). The second caps total messages including continuations (protects against tight cron loops flooding a single chat). Both have CHECK > 0 constraints.
  • Unique constraint on (owner_id, organization_id, name) so automations are addressable by name.

automation_triggers — how an automation is invoked

Each automation can have multiple triggers. Two types share one table with type-specific nullable columns (simpler than inheritance):

  • Webhook triggers: webhook_secret for HMAC-SHA256 verification, webhook_secret_key_id reserved for future dbcrypt encryption.
  • Cron triggers: cron_schedule (5-field standard cron), last_triggered_at for the scheduler to compute the next fire time. Falls back to created_at when NULL (first fire).
  • Shared columns: filter (gjson path→value conditions that must all match), label_paths (gjson paths that extract chat labels from payloads for routing events to the right conversation).

automation_events — audit trail and rate-limit data source

Every trigger invocation produces an event row regardless of outcome. The status column captures what happened: filtered, preview, created, continued, rate_limited, or error. Rate-limit window counts query this table. Rows are append-only and purged by retention.

Two indexes: (automation_id, received_at DESC) for the per-automation event list UI, and (received_at) for the purge job.

chats.automation_id — backlink

Nullable FK from chats to the automation that created them. SET NULL on delete. Indexed for "list chats spawned by automation X" queries.

What is included

  • Single migration (000453_automations) with all three tables, indexes, constraints, and the chats FK
  • SQL queries for full CRUD, rate-limit window counts, cron trigger discovery, and event purge
  • RBAC automation resource with create/read/update/delete actions
  • dbauthz wrappers for all 18 query methods
  • LockIDAutomationCron advisory lock constant for the cron scheduler
  • All make gen output (models, querier, dbmock, dbmetrics, constraints, dump.sql)

What is NOT included

HTTP handlers, SDK types, db2sdk converters, the automations package, the cron scheduler, and CLI wiring all stay in #23631.

@kylecarbs kylecarbs requested a review from Emyrk as a code owner March 26, 2026 14:56
@kylecarbs kylecarbs force-pushed the automations-database branch 3 times, most recently from d54f12a to 5ef6764 Compare March 26, 2026 15:41
-- chats. An automation defines *what* to say, *which* model and tools to
-- use, and *how fast* it is allowed to create or continue chats.
CREATE TABLE automations (
id uuid DEFAULT gen_random_uuid() NOT NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generate the uuids in golang rather than in the sql.

This has been helpful when unable to returning * and in dbgen when the uuid is nice to know up front.

Comment on lines +31 to +35
-- Lifecycle state:
-- disabled — trigger events are silently dropped.
-- preview — events are logged but no chat is created (dry-run).
-- active — events create or continue chats.
status text NOT NULL DEFAULT 'disabled',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an enum?

Comment on lines +44 to +45
created_at timestamp with time zone NOT NULL DEFAULT now(),
updated_at timestamp with time zone NOT NULL DEFAULT now(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use defaults. Make the Insert specify the times. Really helps with unit testing and mocking different timestamps.

-- Webhook and cron triggers share the same row shape with type-specific
-- nullable columns to keep the schema simple.
CREATE TABLE automation_triggers (
id uuid DEFAULT gen_random_uuid() NOT NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generate uuid in Go, not sql

Comment on lines +71 to +73
-- Discriminator: 'webhook' or 'cron'. Determines which nullable
-- columns are meaningful.
type text NOT NULL,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enum?

Comment on lines +100 to +101
created_at timestamp with time zone NOT NULL DEFAULT now(),
updated_at timestamp with time zone NOT NULL DEFAULT now(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do in golang, not sql

@kylecarbs kylecarbs marked this pull request as draft March 26, 2026 15:51
@kylecarbs kylecarbs force-pushed the automations-database branch 4 times, most recently from 9f5762f to 3004e2d Compare March 26, 2026 19:27
@kylecarbs kylecarbs changed the title feat(coderd/database): add automations database schema and queries feat(coderd/database): add chat automations database schema and queries Mar 26, 2026
@kylecarbs kylecarbs force-pushed the automations-database branch from 3004e2d to 007645d Compare March 26, 2026 20:03
@kylecarbs kylecarbs changed the title feat(coderd/database): add chat automations database schema and queries feat: add chat automations database schema and queries Mar 26, 2026
@kylecarbs kylecarbs force-pushed the automations-database branch from 007645d to 0ae7379 Compare March 26, 2026 20:23
@kylecarbs kylecarbs marked this pull request as ready for review March 26, 2026 20:44
"chat:delete",
"chat:read",
"chat:update",
"chat_automation:*",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no docs on experimental endpoints - sadly agents need to be reminded to look at other methods in x/chatd

mcp_server_ids uuid[] NOT NULL DEFAULT '{}',
-- Tool allowlist. Empty means all tools available to the model
-- config are permitted.
allowed_tools text[] NOT NULL DEFAULT '{}',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mcps will add their own tools on top of this from their allowed list? or is this tools from our harness?

-- Maximum number of *new* chats this automation may create in a
-- rolling one-hour window. Prevents runaway webhook storms from
-- flooding the system.
max_chat_creates_per_hour integer NOT NULL DEFAULT 10,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be not null? cron jobs shouldn't have to worry about this. do subagents go into that count?

@mafredri mafredri self-requested a review March 26, 2026 21:12
Copy link
Collaborator

@ibetitsmike ibetitsmike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice schema design. The three-table split (automations, triggers, events) is clean and well-normalized. The partial index on rate-limit events, the dbcrypt integration, and the check constraints for webhook/cron field exclusivity are all well-thought-out. A few findings from a deep review below.

1 P2, 3 P3, 4 Nit across 8 inline comments. This review contains no blockers, but the P2 and the P3 constraint findings are worth discussing before merge.

This review was generated by an automated agent (best-of-10 deep-review synthesis across database, security, concurrency, contract, edge-case, test, style, and dbcrypt domains).

return database.ChatAutomationEvent{}, err
}
if err := q.authorizeContext(ctx, policy.ActionUpdate, automation); err != nil {
return database.ChatAutomationEvent{}, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 InsertChatAutomationEvent checks ActionUpdate on the automation, but events are append-only audit records created by the system (webhook handlers, cron scheduler), not user mutations of the automation config. (8/10 independent reviewers flagged this)

The wrapper requires ActionUpdate permission to insert an event, but events are system-generated side effects of trigger evaluation, not user-initiated modifications.

If the webhook handler or cron scheduler runs as AsSystemRestricted(ctx), this passes fine. But the semantic mismatch could cause confusion: a future developer reading this code will think "inserting events requires the ability to modify the automation," when really it's "the system is recording what happened." Could this be ActionRead instead (since the system is reading the automation config to process the event), or should there be a comment explaining why ActionUpdate was chosen?

The same pattern is used for InsertChatAutomationTrigger at position 159, which is more defensible since creating a trigger is genuinely modifying the automation's configuration.


This comment was written by an automated agent.

return err
}
if err := q.authorizeContext(ctx, policy.ActionUpdate, automation); err != nil {
return err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 DeleteChatAutomationTriggerByID authorizes with ActionUpdate, not ActionDelete. (3/10 reviewers flagged this)

Deleting a trigger requires only update permission on the parent automation, not delete. This is inconsistent with other delete operations in the codebase (e.g., DeleteChatAutomationByID uses deleteQ which checks ActionDelete).

This may be intentional if triggers are considered sub-resources whose lifecycle is managed via "updating" the automation. If so, a short comment explaining the rationale would help future readers. If not, this should be ActionDelete for consistency.


This comment was written by an automated agent.

PRIMARY KEY (id),
FOREIGN KEY (automation_id) REFERENCES chat_automations(id) ON DELETE CASCADE,
CONSTRAINT chat_automation_triggers_webhook_fields CHECK (
type != 'webhook' OR (cron_schedule IS NULL AND last_triggered_at IS NULL)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 The CHECK constraints enforce that webhook triggers don't have cron fields and vice versa, but they don't enforce that a webhook trigger actually has a webhook_secret, or that a cron trigger actually has a cron_schedule. (3/10 reviewers flagged this)

A row like (type='webhook', webhook_secret=NULL, cron_schedule=NULL) passes both constraints but is logically invalid: a webhook trigger with no shared secret can't verify HMAC signatures.

This may be acceptable if application-level validation catches it before insert, but a database-level guard would prevent malformed rows from any code path:

CONSTRAINT chat_automation_triggers_webhook_has_secret CHECK (
    type != 'webhook' OR webhook_secret IS NOT NULL
),
CONSTRAINT chat_automation_triggers_cron_has_schedule CHECK (
    type != 'cron' OR cron_schedule IS NOT NULL
)

Non-blocking, but worth considering for defense-in-depth.


This comment was written by an automated agent.

FOREIGN KEY (matched_chat_id) REFERENCES chats(id) ON DELETE SET NULL,
FOREIGN KEY (created_chat_id) REFERENCES chats(id) ON DELETE SET NULL,
CONSTRAINT chat_automation_events_chat_exclusivity CHECK (
matched_chat_id IS NULL OR created_chat_id IS NULL
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 The chat_exclusivity constraint ensures at most one of matched_chat_id/created_chat_id is set, but doesn't enforce that at least one is set when status IN ('created', 'continued'). (1/10 reviewers flagged this, but the logic is sound)

An event with status='created' and both matched_chat_id=NULL and created_chat_id=NULL passes the constraint but is logically inconsistent: it claims a chat was created but doesn't reference which one.

A status-dependent constraint would tighten this:

CONSTRAINT chat_automation_events_chat_required CHECK (
    status NOT IN ('created', 'continued')
    OR matched_chat_id IS NOT NULL
    OR created_chat_id IS NOT NULL
)

This is lower-priority since the insert code controls these values, but it prevents silent data integrity issues from any code path.


This comment was written by an automated agent.

}

func (q *querier) GetActiveChatAutomationCronTriggers(ctx context.Context) ([]database.GetActiveChatAutomationCronTriggersRow, error) {
if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceChatAutomation.All()); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit GetActiveChatAutomationCronTriggers checks rbac.ResourceChatAutomation.All(), which is a binary admin-or-reject gate with no per-row filtering. This is correct for a system-level scheduler, but worth a short comment explaining that this function must only be called from a privileged/system context (e.g., dbauthz.AsSystemRestricted). (8/10 reviewers flagged this)

Without the comment, a future developer might wonder why there's no SQL-level row filtering like GetChatAutomations has.


This comment was written by an automated agent.

AND received_at > @window_start::timestamptz;

-- name: PurgeOldChatAutomationEvents :exec
DELETE FROM chat_automation_events
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit The 7-day retention is hardcoded in SQL. (5/10 reviewers noted this)

This is fine for a V1, but if the retention period ever needs to change, it requires modifying the SQL query rather than a configuration knob. Also, for high-volume deployments, a single unbounded DELETE could hold locks for a while. Consider batching or accepting a parameter from the caller:

DELETE FROM chat_automation_events
WHERE received_at < @cutoff::timestamptz;

Non-blocking for this PR since handlers aren't included yet.


This comment was written by an automated agent.

SET webhook_secret = NULL,
webhook_secret_key_id = NULL
WHERE webhook_secret_key_id IS NOT NULL;COMMIT;
`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit Formatting: COMMIT; is concatenated directly to the WHERE clause without a newline. The SQL is technically valid (; terminates the UPDATE, then COMMIT; runs), but it's easy to misread. (1/10 reviewers flagged this)

-		WHERE webhook_secret_key_id IS NOT NULL;COMMIT;
+		WHERE webhook_secret_key_id IS NOT NULL;
+COMMIT;

This comment was written by an automated agent.

LockIDReconcileSystemRoles
LockIDBoundaryUsageStats
LockIDChatAutomationCron
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit Missing comment on LockIDChatAutomationCron. Other lock constants in this file have (or should have) comments explaining their purpose. (4/10 reviewers noted this)

Suggestion:

// LockIDChatAutomationCron prevents concurrent cron trigger
// evaluation across coderd replicas.
LockIDChatAutomationCron

This comment was written by an automated agent.

Copy link
Collaborator

@ibetitsmike ibetitsmike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice schema design. The three-table split (automations, triggers, events) is clean and well-normalized. The partial index on rate-limit events, the dbcrypt integration, and the check constraints for webhook/cron field exclusivity are all well-thought-out. A few findings from a deep review below.

1 P2, 3 P3, 4 Nit across 8 inline comments. This review contains no blockers, but the P2 and the P3 constraint findings are worth discussing before merge.

This review was generated by an automated agent (best-of-10 deep-review synthesis across database, security, concurrency, contract, edge-case, test, style, and dbcrypt domains).

return database.ChatAutomationEvent{}, err
}
if err := q.authorizeContext(ctx, policy.ActionUpdate, automation); err != nil {
return database.ChatAutomationEvent{}, err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 InsertChatAutomationEvent checks ActionUpdate on the automation, but events are append-only audit records created by the system (webhook handlers, cron scheduler), not user mutations of the automation config. (8/10 independent reviewers flagged this)

The wrapper requires ActionUpdate permission to insert an event, but events are system-generated side effects of trigger evaluation, not user-initiated modifications.

If the webhook handler or cron scheduler runs as AsSystemRestricted(ctx), this passes fine. But the semantic mismatch could cause confusion: a future developer reading this code will think "inserting events requires the ability to modify the automation," when really it's "the system is recording what happened." Could this be ActionRead instead (since the system is reading the automation config to process the event), or should there be a comment explaining why ActionUpdate was chosen?

The same pattern is used for InsertChatAutomationTrigger at position 159, which is more defensible since creating a trigger is genuinely modifying the automation's configuration.


This comment was written by an automated agent.

mafredri

This comment was marked as duplicate.

mafredri

This comment was marked as duplicate.

mafredri

This comment was marked as duplicate.

Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-designed schema. The three-state lifecycle, two-tier rate limiting, and trigger-type discriminator with cross-check constraints are thoughtful. The dbcrypt integration covers all paths and the cron trigger query correctly excludes secret columns.

2 P1, 3 P2, 2 P3, 2 obs, 2 nits across 11 inline comments.

Killua: "A deployment receiving 1 event/second accumulates ~600K rows/week. If the purge job falls behind, this single DELETE attempts to remove all accumulated rows in one transaction: long lock hold, massive WAL generation, vacuum pressure."

Hisoka: "An org admin could set owner_id to another admin or a user with different permissions, attributing automated chat actions to that user. This is the kind of thing that holds up fine until the handler trusts the data layer to enforce ownership."

Nit: the commit message body references migration 000454 (doesn't exist, the column is in 000453), and uses wrong file/query names: automations.sql (actual: chatautomations.sql), automationtriggers.sql (actual: chatautomationtriggers.sql), automationevents.sql (actual: chatautomationevents.sql), GetActiveCronTriggers (actual: GetActiveChatAutomationCronTriggers), LockIDAutomationCron (actual: LockIDChatAutomationCron). (Leorio)

PS. The PR description says webhook_secret_key_id reserved for future dbcrypt encryption, but the diff implements full dbcrypt integration (seven methods in dbcrypt.go, rotation/decrypt/delete in cliutil.go). Worth updating so reviewers don't skip the encryption paths.

🤖 This review was automatically generated with Coder Agents.

@kylecarbs kylecarbs force-pushed the automations-database branch from 0ae7379 to cabaa9b Compare March 26, 2026 21:57
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the unbounded purge, the missing .WithOwner(), the COMMIT formatting, and adding the status consistency constraint. All four look correct.

One P1 remains (dbcrypt test coverage), one P2 (TOCTOU, may be fine to address in the handler PR). The rest are nits.

1 P1, 1 P2, 1 P3, 1 obs, 3 nits.

PS. The PR description still says webhook_secret_key_id reserved for future dbcrypt encryption and the commit message still references migration 000454 and wrong file/query names (automations.sql instead of chatautomations.sql, GetActiveCronTriggers instead of GetActiveChatAutomationCronTriggers, LockIDAutomationCron instead of LockIDChatAutomationCron).

🤖 This review was automatically generated with Coder Agents.

@kylecarbs kylecarbs force-pushed the automations-database branch from cabaa9b to 227578d Compare March 27, 2026 12:10
Adds the database layer for the automations feature:

Schema (migration 000453):
- automations table with status, rate limits, instructions, MCP config
- automation_triggers table (webhook + cron types)
- automation_events table for audit trail
- Indexes for performance and unique constraint on (owner, org, name)
- automation_id FK on chats table

Schema (migration 000454):
- last_triggered_at column on automation_triggers for cron scheduling

SQL queries:
- automations.sql: CRUD with authorize_filter support
- automationtriggers.sql: CRUD + GetActiveCronTriggers JOIN query
- automationevents.sql: insert, list, count windows, purge

RBAC:
- automation resource with CRUD actions
- dbauthz wrappers for all 18 query methods

Also adds LockIDAutomationCron advisory lock constant.
@kylecarbs kylecarbs force-pushed the automations-database branch from 227578d to 51de284 Compare March 27, 2026 12:27
@kylecarbs kylecarbs requested a review from mafredri March 27, 2026 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants