Operator console
Last updated 5/24/2026
Colony Operator Console
Document ID: TECH-OPS-001
Version: 1.0
Classification: Internal - Engineering & Operations
Product: Colony - ZoomProp GTM Agentic Stack
Scope: Full internal operations surface, including deployment, monitoring, secret management, environment configuration, and support workflows
Table of Contents
- Operations Overview
- Admin Dashboard and Routes
- Internal Tooling Inventory
- Operational Workflows
- Monitoring and Alerting
- Support Workflows
- Environment Management
- Secret Management
- Health Checks and Status
- Runbook Index
1. Operations Overview
Colony is deployed as a Next.js 16 application on Google Cloud Run, backed by Cloud SQL (PostgreSQL 16) with pgvector and pgcrypto extensions, and Google Cloud Storage for asset persistence. The agent runtime runs as durable Inngest functions triggered by application events. Infrastructure is provisioned and versioned via Terraform (infrastructure/gcp/).
The operations surface spans five distinct control planes:
| Plane | System | Access Method |
|---|---|---|
| Platform infrastructure | GCP (Cloud Run, Cloud SQL, GCS, Secret Manager) | gcloud CLI, Terraform, GCP Console |
| Application identity and RBAC | Clerk | Clerk Dashboard, Svix webhooks |
| Agent observability | Langfuse | Langfuse Dashboard |
| Application errors | Sentry | Sentry Dashboard |
| Durable job runtime | Inngest | Inngest Dashboard |
All platform-level secrets are stored in GCP Secret Manager and injected at deploy time. All per-organization integration credentials are stored in the Colony Vault (api_keys table, KMS-encrypted rows in Cloud SQL).
2. Admin Dashboard and Routes
2.1 Application Routes
Colony's UI is built on the Next.js App Router. The primary operator-facing entry point is /overview, which serves as both the user-facing daily brief and the command interface for the GTM Orchestrator.
| Route | Purpose | Minimum Role |
|---|---|---|
| /overview | GTM Orchestrator chat interface; daily brief; pipeline snapshot | member |
| /pipeline | 9-stage deal tracker with Pipedrive bi-sync | member |
| /outbound | Sequence engine, HOT reply queue, angle selection | member |
| /content | 6-pillar content pipeline with attribution tracking | member |
| /recordings | Gemini Meet Notes ingestion and signal extraction | member |
| /onboarding | Deployment Kit generator (8 assets on Closed-Won) | member |
| /knowledge | Knowledge Core editor: 10-domain pgvector retrieval | admin |
| /settings/keys | Colony Vault UI: per-org API key management | admin |
| /settings/org | Organization configuration, user roster | admin |
2.2 RBAC Model
Clerk manages authentication with three roles enforced across all routes. The .github/workflows/rbac-check.yml CI workflow validates role enforcement on every pull request.
| Role | Capabilities |
|---|---|
| owner | Full access; can manage billing, rotate org-level secrets, add/remove admins |
| admin | Knowledge Core editing, Colony Vault key management, org settings, all member capabilities |
| member | All GTM surface routes; cannot access vault or knowledge editor |
Role assignment is managed through the Clerk Dashboard at https://dashboard.clerk.com. Webhook sync via Svix propagates role changes to the Colony database in real time. If a role change does not reflect within 60 seconds, check Svix webhook delivery logs in the Clerk Dashboard.
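The role model above can be read as a simple ordering. The following is a minimal sketch, assuming that owner outranks admin and admin outranks member, as the capability table implies. The helper names and the numeric ranking are hypothetical and are not taken from the Colony codebase; the route minimums are copied from the Application Routes table.

```typescript
// Hypothetical sketch of the role ordering implied by the RBAC table.
type Role = "owner" | "admin" | "member";

// Assumption: a higher number means more privilege.
const ROLE_RANK: Record<Role, number> = { member: 0, admin: 1, owner: 2 };

// Minimum roles copied from the Application Routes table (subset).
const ROUTE_MIN_ROLE = new Map<string, Role>([
  ["/overview", "member"],
  ["/pipeline", "member"],
  ["/outbound", "member"],
  ["/knowledge", "admin"],
  ["/settings/keys", "admin"],
  ["/settings/org", "admin"],
]);

function hasMinimumRole(actual: Role, required: Role): boolean {
  return ROLE_RANK[actual] >= ROLE_RANK[required];
}

function canAccessRoute(role: Role, route: string): boolean {
  const required = ROUTE_MIN_ROLE.get(route);
  // Routes missing from the table are denied by default in this sketch.
  if (required === undefined) return false;
  return hasMinimumRole(role, required);
}
```

A check like this is useful when reproducing an AUTH-* report: if the role shown in Clerk should pass the check but the user is still denied, the likely gap is webhook propagation rather than the role model.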
2.3 External Control Panels
| Panel | URL Pattern | Purpose |
|---|---|---|
| GCP Console | console.cloud.google.com | Cloud Run revisions, Cloud SQL, GCS, IAM |
| Clerk Dashboard | dashboard.clerk.com | User management, org management, RBAC, webhook logs |
| Inngest Dashboard | app.inngest.com | Durable function runs, retries, event replay |
| Langfuse Dashboard | cloud.langfuse.com | LLM trace observability, token usage, latency |
| Sentry | sentry.io | Application error tracking, stack traces |
| Pipedrive | {org}.pipedrive.com | CRM bi-sync verification |
3. Internal Tooling Inventory
3.1 Infrastructure as Code
Location: infrastructure/gcp/
Provider: Terraform with hashicorp/google v7.29.0 and hashicorp/google-beta v7.29.0
| File | Manages |
|---|---|
| main.tf | GCP provider configuration, project binding, region |
| cloud-run-runner.tf | Cloud Run service definition, traffic routing, scaling config |
| cloud-sql.tf | Cloud SQL PostgreSQL 16 instance colony-39989:…:colony, private IP, backup policy |
| iam-runner.tf | Service account for Cloud Run, IAM bindings, least-privilege roles |
| secrets.tf | GCP Secret Manager secret definitions and IAM accessor bindings |
| storage.tf | GCS bucket gs://colony-assets, CMEK configuration, signed URL policy |
| variables.tf | Input variable declarations |
| terraform.tfvars | Environment-specific variable values (not committed to VCS in production) |
Standard Terraform workflow:
cd infrastructure/gcp
terraform init # first-time or after provider changes
terraform plan # review changeset
terraform apply # apply to active GCP project
State is stored in infrastructure/gcp/terraform.tfstate. A .backup copy is maintained automatically. For production, migrate state to a GCS backend to enable team access and state locking.
3.2 Database Initialization Scripts
Location: infrastructure/postgres/init/
| Script | Purpose |
|---|---|
| 01-multiple-dbs.sh | Shell script that creates the colony database and any additional databases required for local multi-service development |
| 02-extensions.sql | Installs pgvector (1536-dimension vectors for Knowledge Core embeddings) and pgcrypto (UUID generation and column-level encryption for the Colony Vault) |
These scripts are consumed by infrastructure/docker-compose.yml on first container startup. They are not re-run against Cloud SQL — the Cloud SQL instance is bootstrapped per docs/phase1/runbooks/db-bootstrap.md.
3.3 Local Development Stack
Location: infrastructure/docker-compose.yml
Provides a local-only PostgreSQL 16 instance with the initialization scripts applied. No local equivalents for Cloud Run, GCS, or Secret Manager are defined — developers use .env for local overrides and connect to a shared development Cloud SQL instance or a local Postgres container.
docker compose -f infrastructure/docker-compose.yml up -d
3.4 Playwright Runner
Location: .github/workflows/playwright-runner.yml
Artifacts: runners/.artifacts/pipedrive-dry-run/ and runners/.artifacts/pipedrive-batch/
The Playwright runner executes browser-automation jobs against Pipedrive (and other scrape targets defined in phase 2). It produces structured artifact directories per run:
| Artifact | Content |
|---|---|
| batch.log | Full execution log for batch scrape runs |
| collect-N.log | Per-collector logs for parallel data collection tasks |
| import-N.log | Per-import logs for database write operations |
| summary.json | Machine-readable run summary: counts, errors, duration |
| candidates.json | Extracted prospect candidates (dry-run mode) |
| captures.json | Page capture metadata |
| final.png | Screenshot of final Playwright browser state |
| session.log | Raw browser session log |
Artifacts are retained under runners/.artifacts/ organized by run type and ISO timestamp. Operators should review summary.json after each batch run to confirm import counts and identify partial failures.
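The review of summary.json can be partly automated. The sketch below is illustrative only: the document does not specify the schema of summary.json, so the field names errors and importCount are assumptions, as is the idea of passing in an expected import count. Adjust the names to the real file before relying on it.

```typescript
// Hypothetical shape of summary.json; the real schema is not documented here.
interface RunSummary {
  errors: number;      // assumed numeric, since the runbook checks "errors > 0"
  importCount: number; // assumed field name for the number of imported records
}

interface SummaryVerdict {
  ok: boolean;
  reasons: string[];
}

function reviewSummary(rawJson: string, expectedImports: number): SummaryVerdict {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return { ok: false, reasons: ["summary.json is not valid JSON"] };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, reasons: ["summary.json is not a JSON object"] };
  }
  const summary = parsed as Partial<RunSummary>;
  if (typeof summary.errors !== "number" || typeof summary.importCount !== "number") {
    return { ok: false, reasons: ["summary.json lacks numeric errors/importCount fields"] };
  }
  const reasons: string[] = [];
  if (summary.errors > 0) {
    reasons.push(`${summary.errors} error(s) reported`);
  }
  if (summary.importCount < expectedImports) {
    reasons.push(`imported ${summary.importCount} of ${expectedImports} expected`);
  }
  return { ok: reasons.length === 0, reasons };
}
```

A run is treated as clean only when it reports no errors and the import count meets the expectation. Any other verdict points the operator to the collect-N.log and import-N.log files.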
3.5 RBAC Enforcement CI Check
Location: .github/workflows/rbac-check.yml
Automated CI workflow that validates RBAC enforcement rules on every PR. Failures block merge. Operators investigating access-control regressions should review this workflow's run history in GitHub Actions before escalating.
3.6 Codacy Static Analysis
Location: .codacy/
| Config File | Tool |
|---|---|
| .codacy/tools-configs/eslint.config.mjs | ESLint (TypeScript/JS) |
| .codacy/tools-configs/semgrep.yaml | Semgrep (security patterns) |
| .codacy/tools-configs/trivy.yaml | Trivy (dependency and container vulnerability scanning) |
| .codacy/tools-configs/pylint.rc | Pylint (Python, if applicable) |
| .codacy/tools-configs/lizard.yaml | Lizard (cyclomatic complexity) |
| .codacy/cli.sh | Local Codacy CLI runner |
| .codacy/codacy.yaml | Top-level Codacy configuration |
Run Codacy locally:
bash .codacy/cli.sh
4. Operational Workflows
4.1 Deploy Workflow
Colony is deployed to Cloud Run via gcloud run deploy. Secrets are injected at deploy time from GCP Secret Manager using --update-secrets flags. The canonical deploy procedure is documented in docs/phase1/runbooks/deploy.md.
Standard deploy sequence:
# 1. Build and push container image
gcloud builds submit --tag gcr.io/{PROJECT_ID}/colony:{TAG}
# 2. Deploy to Cloud Run with secret injection
gcloud run deploy colony \
--image gcr.io/{PROJECT_ID}/colony:{TAG} \
--region {REGION} \
--update-secrets=CLERK_SECRET_KEY=clerk-secret-key:latest,\
INNGEST_SIGNING_KEY=inngest-signing-key:latest,\
DATABASE_URL=colony-db-url:latest,\
LANGFUSE_SECRET_KEY=langfuse-secret-key:latest,\
SENTRY_DSN=sentry-dsn:latest
# 3. Verify traffic routing
gcloud run services describe colony --region {REGION}
All Cloud Run configuration (CPU allocation, concurrency, min/max instances, environment variable bindings) is defined in infrastructure/gcp/cloud-run-runner.tf. Changes to scaling or routing must go through Terraform, not the GCP Console, to avoid state drift.
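The --update-secrets value in the deploy sequence is a comma-separated list of ENV_VAR=secret-name:version pairs. A small helper can generate it from a single mapping, so the flag and the secret inventory do not drift apart. This is a sketch under that assumption; the helper name and the mapping constant are hypothetical, and the five bindings are the ones shown in the deploy command.

```typescript
// Environment variable -> Secret Manager secret name, as used in the deploy command.
const SECRET_BINDINGS: Record<string, string> = {
  CLERK_SECRET_KEY: "clerk-secret-key",
  INNGEST_SIGNING_KEY: "inngest-signing-key",
  DATABASE_URL: "colony-db-url",
  LANGFUSE_SECRET_KEY: "langfuse-secret-key",
  SENTRY_DSN: "sentry-dsn",
};

// Hypothetical helper: renders the flag exactly as gcloud expects it,
// with no whitespace between the comma-separated pairs.
function buildUpdateSecretsFlag(
  bindings: Record<string, string>,
  version: string = "latest",
): string {
  const pairs = Object.entries(bindings).map(
    ([envVar, secretName]) => `${envVar}=${secretName}:${version}`,
  );
  return `--update-secrets=${pairs.join(",")}`;
}
```

Only secret names and versions appear in the output, never secret values, so the generated flag is safe to print in deploy logs.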
4.2 Rollback Workflow
Cloud Run maintains a full revision history. Rollback is instant and does not require a new build.
# List recent revisions
gcloud run revisions list --service colony --region {REGION}
# Route 100% traffic to a prior revision
gcloud run services update-traffic colony \
--to-revisions {REVISION_NAME}=100 \
--region {REGION}
For database schema rollbacks, Colony uses migration files. Consult docs/phase1/runbooks/db-bootstrap.md for the migration toolchain. Schema rollbacks require a maintenance window and must be coordinated with the on-call engineer.
4.3 Feature Flags
Colony does not currently implement a dedicated feature-flag service. Feature gating is implemented via:
- RBAC role checks: features restricted to admin or owner roles act as controlled rollouts to a subset of users.
- Org-level API key presence: features that require a third-party integration (Pipedrive, Unipile, Resend, Google Drive) are conditionally available only when the corresponding key exists in the Colony Vault (api_keys table) for that organization.
- Environment variable flags: early-phase features may be toggled via environment variables injected through GCP Secret Manager at deploy time.
To enable a new integration for a specific organization, an admin user for that org must register the API key via /settings/keys. There is no operator-side UI bypass for this — it must be done through the application or directly via a SQL write to the api_keys table (production access procedure: see Section 4.5).
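Gating on key presence amounts to a lookup against the api_keys rows for the organization. The sketch below illustrates the idea only. The column names org_id and provider appear elsewhere in this document, but the function, the row type, and the in-memory array are hypothetical stand-ins for whatever query the application actually runs.

```typescript
// Subset of providers named in the feature-gating list above.
type Provider = "pipedrive" | "unipile" | "resend" | "google_drive";

// Only non-secret columns are modeled; the encrypted value is never needed here.
interface ApiKeyRow {
  org_id: string;
  provider: string;
}

// Hypothetical helper: a feature is available when the org has a row for the provider.
function isIntegrationEnabled(rows: ApiKeyRow[], orgId: string, provider: Provider): boolean {
  return rows.some((row) => row.org_id === orgId && row.provider === provider);
}
```

Because the check reads only org_id and provider, it can be reproduced during support without touching encrypted values.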
4.4 User Management
Day-to-day user operations are performed in the Clerk Dashboard (dashboard.clerk.com):
- Create/deactivate users
- Assign or change roles (owner, admin, member)
- Manage organization memberships
- Review sign-in activity and MFA status
- Inspect and replay Svix webhook deliveries
Operator actions available in Clerk Dashboard:
| Action | Path in Clerk UI |
|---|---|
| Add user to organization | Organizations → {Org} → Members → Invite |
| Change user role | Organizations → {Org} → Members → {User} → Role |
| Deactivate user | Users → {User} → Ban or Delete |
| View webhook delivery log | Webhooks → {Endpoint} → Attempts |
| Replay failed webhook | Webhooks → {Endpoint} → Attempts → Retry |
Role changes propagate to Colony via Svix in real time. If propagation fails (user retains old permissions), manually trigger a webhook replay for the relevant organizationMembership.updated event.
4.5 Production Database Access
Direct production database access requires the Cloud SQL Auth Proxy and appropriate IAM credentials. This is a privileged operation tracked in the on-call log.
# Start Cloud SQL Auth Proxy
cloud-sql-proxy colony-39989:{REGION}:colony
# Connect via psql
psql -h 127.0.0.1 -U colony -d colony
IAM bindings for Cloud SQL access are defined in infrastructure/gcp/iam-runner.tf. Only engineers with the cloudsql.client IAM role may connect. All production SQL queries on api_keys (the Colony Vault) involve KMS-encrypted rows — plaintext values are never returned without explicit decryption via pgcrypto and the appropriate KMS key.
4.6 Pipedrive Integration Setup
Per docs/phase1/runbooks/pipedrive-fields.md, custom Pipedrive fields must be created before the bi-sync functions correctly. This is a one-time manual step per organization. See the runbook for the full field manifest. The Playwright runner (runners/) automates the scrape-to-import pipeline once fields are configured.
4.7 Google Drive / Recording Pipeline Setup
Per docs/phase1/runbooks/google-drive-setup.md and docs/phase2/runbooks/03_recording_pipeline.md, the Google Drive integration requires:
- A Google Cloud service account with Drive API access
- OAuth scope expansion (phase 2: docs/phase2/action_plans/12_google_oauth_scope_expansion.md) for Gmail and Calendar integration
- Manual workspace authorization steps per docs/phase2/runbooks/02_workspace_manual_steps.md
5. Monitoring and Alerting
5.1 Observability Stack
| Signal Type | System | What Is Captured |
|---|---|---|
| LLM traces | Langfuse | All agent tool calls, prompt/completion pairs, token counts, latency per span, model used |
| Application errors | Sentry | Unhandled exceptions, API route errors, agent runtime failures, stack traces |
| Infrastructure metrics | GCP Cloud Monitoring | Cloud Run request count, latency p50/p95/p99, instance count, CPU utilization, memory utilization |
| Database metrics | Cloud SQL Monitoring | Connection count, query latency, disk utilization, replication lag (if replica configured) |
| Durable job execution | Inngest Dashboard | Function run status, retry counts, event queue depth, failed runs |
| CI quality | Codacy | Code coverage delta, new issues per PR, security findings (Semgrep, Trivy) |
5.2 What to Monitor
Critical (page-level):
| Signal | Threshold | Likely Cause |
|---|---|---|
| Cloud Run 5xx error rate | > 1% over 5 minutes | Agent runtime crash, database connection exhaustion, invalid secret |
| Cloud SQL connection count | > 90% of max connections | Connection pool leak, traffic spike without pooling |
| Inngest failed runs | Any function with > 3 consecutive failures | Downstream API outage (Pipedrive, Unipile, Resend), malformed event payload |
| Sentry error spike | > 10× baseline in 10 minutes | Bad deployment, upstream API change, schema migration issue |
Warning (non-paging):
| Signal | Threshold | Action |
|---|---|---|
| LLM token usage (Langfuse) | > 150% of 7-day average | Review orchestrator prompts for runaway loops; check circuit breakers |
| Playwright batch run failures | Any summary.json with errors > 0 | Review collect-N.log and import-N.log for root cause |
| Cloud Run instance count | Sustained at max configured instances | Evaluate scaling configuration in cloud-run-runner.tf |
| Cloud SQL disk utilization | > 80% | Review knowledge_core embedding growth; consider pgvector index pruning |
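The numeric thresholds in the two tables above can be expressed as one evaluation function, which is handy for testing alert rules before wiring them into Cloud Monitoring or Sentry. This is a sketch only: the threshold values come from the tables, while the Metrics shape, its field names, and the function itself are hypothetical. The Playwright and instance-count signals are omitted because they are not simple numeric comparisons.

```typescript
// Hypothetical metrics snapshot; field names are illustrative.
interface Metrics {
  http5xxRatePct: number;             // Cloud Run 5xx rate over 5 minutes, in percent
  sqlConnections: number;             // current Cloud SQL connections
  sqlMaxConnections: number;          // configured connection limit
  inngestConsecutiveFailures: number; // worst-case streak for any single function
  sentryErrors10m: number;            // Sentry errors in the last 10 minutes
  sentryBaseline10m: number;          // baseline errors per 10 minutes
  llmTokensCurrent: number;           // current LLM token usage
  llmTokens7dAvg: number;             // 7-day average LLM token usage
  sqlDiskUsedPct: number;             // Cloud SQL disk utilization, in percent
}

interface Alert {
  severity: "critical" | "warning";
  signal: string;
}

function evaluateThresholds(m: Metrics): Alert[] {
  const alerts: Alert[] = [];
  const add = (severity: Alert["severity"], signal: string): void => {
    alerts.push({ severity, signal });
  };
  // Critical (page-level) signals.
  if (m.http5xxRatePct > 1) add("critical", "Cloud Run 5xx error rate");
  if (m.sqlConnections > 0.9 * m.sqlMaxConnections) add("critical", "Cloud SQL connection count");
  if (m.inngestConsecutiveFailures > 3) add("critical", "Inngest failed runs");
  if (m.sentryErrors10m > 10 * m.sentryBaseline10m) add("critical", "Sentry error spike");
  // Warning (non-paging) signals.
  if (m.llmTokensCurrent > 1.5 * m.llmTokens7dAvg) add("warning", "LLM token usage");
  if (m.sqlDiskUsedPct > 80) add("warning", "Cloud SQL disk utilization");
  return alerts;
}
```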
5.3 Circuit Breakers
Colony implements per-org circuit breakers for the outbound engine. These are defined in the application layer (see docs/phase1/action_plans/22_circuit_breakers.md and docs/phase2/action_plans/11_circuit_breakers_phase2.md) and enforce limits on outbound volume per organization. When a breaker trips:
- The affected org's outbound sequences pause automatically.
- The HOT reply queue continues to function (inbound-initiated replies are not blocked).
- The admin or owner for the org sees an alert on /overview and /outbound.
- An operator can inspect breaker state via the Inngest Dashboard; look for paused or cancelled runs in the outbound function family.
Phase 2 circuit breakers extend this model to the discovery pipeline (see docs/phase2/testing/11_breakers_phase2.spec.ts).
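For readers unfamiliar with the pattern, a per-org volume breaker can be sketched as a sliding-window counter. This is not Colony's implementation, which is specified in the action plans referenced above. The class name, the limit, and the window are all hypothetical, and the sketch models only the outbound path, since inbound-initiated HOT replies are not blocked by the breaker.

```typescript
// Hypothetical per-org sliding-window breaker for outbound sends.
class OutboundBreaker {
  private readonly limit: number;
  private readonly windowMs: number;
  // Timestamps (ms) of recent outbound sends, keyed by org ID.
  private readonly sentAt = new Map<string, number[]>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the send if the org is under its limit;
  // returns false (breaker tripped) when the limit is reached within the window.
  tryOutbound(orgId: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    const recent = (this.sentAt.get(orgId) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.sentAt.set(orgId, recent);
      return false;
    }
    recent.push(nowMs);
    this.sentAt.set(orgId, recent);
    return true;
  }
}
```

State is kept per org, so one organization tripping its breaker has no effect on another, and the breaker resets on its own once older sends fall out of the window.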
5.4 Escalation Path
L1 — On-call engineer
├── Reviews Sentry for error detail
├── Reviews Inngest Dashboard for failed runs
├── Reviews Cloud Run logs in GCP Console
└── If database: connects via Cloud SQL Auth Proxy
L2 — Infrastructure engineer
├── Reviews Cloud Monitoring metrics
├── Reviews Terraform state for drift (terraform plan)
└── Executes rollback if L1 remediation fails
L3 — Engineering lead
└── Authorizes schema rollback, emergency secret rotation,
or extended maintenance window
6. Support Workflows
6.1 Ticket Triage
Incoming support requests should be classified against the following taxonomy before investigation:
| Class | Description | First Step |
|---|---|---|
| AUTH-* | User cannot sign in, wrong role, org access denied | Check Clerk Dashboard → Users and Webhook delivery log |
| VAULT-* | Integration not working, API key validation fails | Check /settings/keys in the org context; verify key format |
| AGENT-* | Orchestrator not responding, tool call hangs, empty output | Check Inngest Dashboard for the corresponding run; check Langfuse for trace |
| OUTBOUND-* | Sequences not sending, circuit breaker tripped | Check outbound circuit breaker state in Inngest; check Unipile / Resend vault keys |
| PIPELINE-* | Pipedrive sync missing data, deal stage mismatch | Check Pipedrive custom field configuration per docs/phase1/runbooks/pipedrive-fields.md |
| RECORDING-* | Meet Notes not appearing, signals not extracted | Check Google Drive service account permissions; check Inngest for recording ingestion run |
| ONBOARDING-* | Deployment Kit not generated on Closed-Won | Verify deal stage mapping; check Inngest for onboarding agent trigger |
| INFRA-* | Service unavailable, elevated error rate | Check Cloud Run revisions and GCP Cloud Monitoring |
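The taxonomy above maps directly onto a lookup from class prefix to first step. The sketch below assumes ticket identifiers take the form PREFIX-suffix (for example AUTH-1042), which this document implies but does not define. The function name is hypothetical, and the first steps are abbreviated from the table.

```typescript
// First steps abbreviated from the triage table; keys are class prefixes.
const FIRST_STEP = new Map<string, string>([
  ["AUTH", "Check Clerk Dashboard users and webhook delivery log"],
  ["VAULT", "Check /settings/keys in the org context; verify key format"],
  ["AGENT", "Check Inngest Dashboard for the run; check Langfuse for the trace"],
  ["OUTBOUND", "Check circuit breaker state in Inngest; check Unipile / Resend vault keys"],
  ["PIPELINE", "Check Pipedrive custom field configuration"],
  ["RECORDING", "Check Google Drive service account permissions and the ingestion run"],
  ["ONBOARDING", "Verify deal stage mapping; check Inngest for the onboarding trigger"],
  ["INFRA", "Check Cloud Run revisions and GCP Cloud Monitoring"],
]);

// Hypothetical helper: returns the first step, or undefined for an unknown class.
function firstStepFor(ticketId: string): string | undefined {
  const prefix = (ticketId.split("-", 1)[0] ?? "").trim().toUpperCase();
  return FIRST_STEP.get(prefix);
}
```

An undefined result signals that the request does not fit the taxonomy and needs manual classification before investigation.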
6.2 Common Issues and Resolution Playbooks
PLAY-001: User Has Wrong Role After Org Change
Symptom: User reports they cannot access /knowledge or /settings/keys despite being promoted to admin.
Steps:
- Open Clerk Dashboard → Organizations → {Org} → Members.
- Confirm the role displayed in Clerk matches the expected role.
- Navigate to Clerk Dashboard → Webhooks → {Colony endpoint} → Attempts.
- Find the most recent organizationMembership.updated event for this user.
- If delivery failed, click Retry. The Colony database will update within 30 seconds of successful delivery.
- If Clerk shows the correct role and the webhook delivered successfully, connect to the database and inspect the users or memberships table for the user record.
PLAY-002: Org Integration Key Not Working (Colony Vault)
Symptom: Agent reports a tool call failure for Pipedrive / Unipile / Resend / Google Drive.
Steps:
- Ask the org admin to navigate to /settings/keys and verify the key is present and saved (keys are stored but not displayed in plaintext post-save).
- If the key was recently rotated in the third-party system, the admin must delete and re-enter it in the vault UI.
- If the vault UI is not accessible, connect to Cloud SQL and query:
  SELECT org_id, provider, updated_at FROM api_keys WHERE org_id = '{ORG_ID}';
  Do not select the encrypted value column in a logged session.
- After re-entry, trigger a test agent run from /overview to verify connectivity.
PLAY-003: Inngest Function Stuck or Failing
Symptom: An agent operation initiated from /overview or a background job (daily brief, recording ingestion, batch scrape) never completes.
Steps:
- Open Inngest Dashboard → Functions.
- Filter by function name (e.g., gtm-orchestrator, recording-ingestion, onboarding-kit-generator).
- Inspect the failed run: review event payload, step output, and error message.
- If the failure is a transient third-party error (rate limit, timeout), click Replay on the run.
- If the failure is a schema validation error or unexpected payload shape, this likely indicates a code bug. Capture the run ID and file an AGENT-* ticket with the Inngest run URL.
- If the function is stuck (in-progress for > 15 minutes), it may be holding a database connection. Check Cloud SQL connection count in Cloud Monitoring. If connections are exhausted, the Cloud Run service may need to be redeployed to recycle connection pools.
PLAY-004: Playwright Batch Run Partial Failure
Symptom: runners/.artifacts/pipedrive-batch/{TIMESTAMP}/summary.json shows errors > 0 or import count is lower than expected.
Steps:
- Open batch.log for the overall execution timeline.
- Check collect-N.log files for the specific collector that failed; look for HTTP errors, selector mismatches, or authentication failures.
- Check import-N.log for database write errors; look for constraint violations or connection errors.
- If the failure is an authentication issue with Pipedrive, the Pipedrive session cookie in the runner configuration may have expired. Update the credentials in GCP Secret Manager.
- Re-run the batch for the failed segment only (do not re-run already-imported segments to avoid duplicates).
PLAY-005: Daily Brief Not Delivered
Symptom: User reports no brief email and no brief on /overview for a given morning.
Steps:
- Check Inngest Dashboard for the daily-brief function scheduled run for that date. Verify it triggered and completed.
- If the Inngest run succeeded but no email was received, check Resend delivery logs for the user's email address.
- If Resend shows delivery but the brief is absent from /overview, the brief was likely written to the database but the UI query is failing; check Sentry for errors on the /overview route around that time.
- If the Inngest run did not trigger, verify the cron schedule configuration has not drifted. Review the function definition.
PLAY-006: Disaster Recovery
Per docs/phase2/runbooks/04_dr_recovery.md, the DR procedure covers:
- Cloud SQL point-in-time recovery from automated backups
- GCS object restoration from versioned bucket
- Cloud Run service re-deployment from the last known good container image revision
The DR runbook must be followed in sequence. Do not attempt manual data restoration outside the documented procedure.
7. Environment Management
7.1 Environment Topology
| Environment | Infrastructure | Purpose |
|---|---|---|
| Local | infrastructure/docker-compose.yml + .env | Developer iteration; Postgres only, no GCS or Secret Manager |
| Development / CI | Shared GCP project or per-PR ephemeral | Integration testing; GitHub Actions Playwright runner |
| Production | GCP project colony-39989, Cloud Run, Cloud SQL, GCS | Live system |
7.2 Local Environment Configuration
The .env file at the repository root provides local overrides. It is gitignored. Developers must populate it with:
| Variable | Source |
|---|---|
| DATABASE_URL | Local Docker Postgres or development Cloud SQL (via proxy) |
| CLERK_SECRET_KEY | Clerk development instance |
| CLERK_PUBLISHABLE_KEY | Clerk development instance |
| INNGEST_SIGNING_KEY | Inngest dev environment |
| INNGEST_EVENT_KEY | Inngest dev environment |
| LANGFUSE_SECRET_KEY | Langfuse cloud or self-hosted |
| LANGFUSE_PUBLIC_KEY | Langfuse cloud or self-hosted |
| SENTRY_DSN | Sentry project DSN (development project) |
| GCP_KMS_KEY_NAME | KMS key for Colony Vault encryption (dev key) |
| GCS_BUCKET_NAME | gs://colony-assets or a dev bucket |
7.3 Production Environment Configuration
All production secrets are stored in GCP Secret Manager and injected via gcloud run deploy --update-secrets. They are never stored in .env, Terraform variable files, or application code. The canonical list of secrets is defined in infrastructure/gcp/secrets.tf.
7.4 Database Initialization
Local:
docker compose -f infrastructure/docker-compose.yml up -d
# Runs infrastructure/postgres/init/01-multiple-dbs.sh
# Runs infrastructure/postgres/init/02-extensions.sql automatically
Cloud SQL (production bootstrap):
Follow docs/phase1/runbooks/db-bootstrap.md — extensions (pgvector, pgcrypto) must be manually enabled on the Cloud SQL instance after provisioning, as the init scripts are not run against managed instances.
7.5 Cost Monitoring
Per docs/phase2/runbooks/05_cost_dashboard.md, a GCP cost dashboard is maintained for the Colony project. Key cost drivers to monitor monthly: Cloud Run compute, Cloud SQL storage and IOPS, GCS egress, Gemini API token consumption, and Langfuse hosted usage.
8. Secret Management
8.1 Two-Vault Architecture
Colony operates two intentionally separate secret stores:
Store A: GCP Secret Manager (platform-level)
Holds secrets required to bootstrap and operate the platform itself. These are not tenant-specific and are managed by the infrastructure team.
Store B: Colony Vault (per-org, api_keys table)
Holds tenant-specific integration credentials. Rows are encrypted at the column level using pgcrypto with a GCP KMS key. No plaintext credential is ever stored in this table.
8.2 GCP Secret Manager Inventory
| Secret Name | Contents | Consumer |
|---|---|---|
| clerk-secret-key | Clerk backend API key | Next.js API routes, webhook validation |
| inngest-signing-key | Inngest function signing key | Inngest function handler |
| inngest-event-key | Inngest event ingestion key | Application event dispatch |
| colony-db-url | Cloud SQL connection string | Next.js DB client, migration runner |
| langfuse-secret-key | Langfuse API secret | LLM client wrapper |
| langfuse-public-key | Langfuse API public key | LLM client wrapper |
| sentry-dsn | Sentry project DSN | Sentry Next.js SDK |
| gcp-kms-key-name | KMS key resource name for Colony Vault | api_keys encryption/decryption |
| bootstrap-llm-key | Platform-level LLM key (Gemini / OpenAI) | Agent runtime bootstrap |
Secrets are versioned in GCP Secret Manager. The latest alias always points to the active version. When rotating, create a new version first, update the Cloud Run deployment to bind latest, verify functionality, then disable the prior version.
8.3 Colony Vault Inventory (api_keys table)
Per-organization. One row per integration per org. The provider column identifies the integration.
| Provider Key | Integration | Used By |
|---|---|---|
| pipedrive | Pipedrive CRM API key | Pipeline bi-sync, Playwright runner imports |
| unipile | Unipile LinkedIn/messaging API | Outbound sequence delivery, prospect discovery |
| resend | Resend transactional email | Outbound email sequences, daily brief delivery |
| google_drive | Google Drive service account credentials | Recording ingestion, Meet Notes pipeline |
| google_oauth | Google OAuth tokens (Gmail, Calendar) | Gmail integration, Calendar integration (phase 2) |
| serpapi | SerpAPI key | Prospect discovery web search (phase 2) |
| notion | Notion integration token | Knowledge Core sync to Notion (phase 2) |
Keys are entered by org admin users via /settings/keys. Operators do not routinely access vault contents. Emergency access requires Cloud SQL Auth Proxy, the KMS key, and explicit approval from the Engineering lead (L3 escalation).
8.4 Secret Rotation Policy
| Secret | Rotation Trigger | Rotation Procedure |
|---|---|---|
| GCP Secret Manager secrets | Annually, or on personnel change, or on suspected compromise | Create new version in Secret Manager → redeploy Cloud Run with --update-secrets → disable old version |
| Colony Vault keys (per-org) | On third-party credential rotation or org admin request | Org admin deletes old key entry in /settings/keys and adds new value |
| Clerk signing keys | Per Clerk rotation schedule or on compromise | Rotate in Clerk Dashboard → update Secret Manager → redeploy |
| Cloud SQL password | On DBA personnel change | Rotate in Cloud SQL → update colony-db-url secret → redeploy |
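After any GCP Secret Manager rotation, redeploy Cloud Run so that every instance picks up the new version. A binding pinned to a specific version (not latest) must be updated explicitly. A binding on latest is resolved when an instance starts, so instances that were already running keep the old value until they are replaced.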
8.5 KMS Configuration
The GCP KMS key used for Colony Vault encryption is referenced by the gcp-kms-key-name secret. The key is a symmetric encryption key in the colony-39989 project keyring. Key rotation is managed through GCP KMS key rotation policy (recommended: 90-day automatic rotation). Rotation does not require re-encryption of existing rows: it only changes the primary key version used for new encryptions, and earlier key versions remain available for decryption as long as they are not disabled or destroyed.
9. Health Checks and Status
9.1 Cloud Run Health Check
Cloud Run performs startup and liveness checks against the deployed service. The health check path should be configured in infrastructure/gcp/cloud-run-runner.tf. The expected health endpoint (a common convention for Next.js services, not a framework built-in) is:
GET /api/health
This endpoint should return HTTP 200 with a JSON body confirming database connectivity and any critical dependency status. If no HTTP probe is configured for the service, Cloud Run falls back to a default TCP startup probe, which only checks that the container is listening on the configured port.
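A minimal sketch of the response logic for such an endpoint is shown below, kept framework-free so it can be unit tested. The type names, the body shape, and the choice of HTTP 503 for a failing dependency are assumptions. The document only specifies HTTP 200 with a JSON body when the service is healthy.

```typescript
// Hypothetical result of probing one dependency (for example the database).
interface DependencyCheck {
  name: string;
  healthy: boolean;
}

interface HealthResponse {
  status: 200 | 503;
  body: {
    status: "ok" | "degraded";
    checks: Record<string, "up" | "down">;
  };
}

// Builds the status code and JSON body from the individual check results.
function buildHealthResponse(checks: DependencyCheck[]): HealthResponse {
  const result: Record<string, "up" | "down"> = {};
  let allHealthy = true;
  for (const check of checks) {
    result[check.name] = check.healthy ? "up" : "down";
    if (!check.healthy) allHealthy = false;
  }
  return {
    status: allHealthy ? 200 : 503,
    body: { status: allHealthy ? "ok" : "degraded", checks: result },
  };
}
```

In a route handler, the check results would come from a lightweight database query and similar probes. Returning a non-2xx status on failure lets an HTTP probe distinguish a listening container from a working one.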
9.2 Pub/Sub Topics
Per docs/phase2/runbooks/01_pubsub_topics.md, GCP Pub/Sub topics are used for event-driven pipeline components introduced in phase 2 (recording pipeline, Google Chat push notifications). Topic health can be monitored via GCP Cloud Monitoring → Pub/Sub → Message delivery metrics.
9.3 Inngest Function Health
The Inngest Dashboard provides a real-time view of all