Colony

Operator console

Last updated 5/24/2026

Colony Operator Console

Document ID: TECH-OPS-001
Version: 1.0
Classification: Internal — Engineering & Operations
Product: Colony — ZoomProp GTM Agentic Stack
Scope: Full internal operations surface including deployment, monitoring, secret management, environment configuration, and support workflows


Table of Contents

  1. Operations Overview
  2. Admin Dashboard and Routes
  3. Internal Tooling Inventory
  4. Operational Workflows
  5. Monitoring and Alerting
  6. Support Workflows
  7. Environment Management
  8. Secret Management
  9. Health Checks and Status
  10. Runbook Index

1. Operations Overview

Colony is deployed as a Next.js 16 application on Google Cloud Run, backed by Cloud SQL (PostgreSQL 16) with pgvector and pgcrypto extensions, and Google Cloud Storage for asset persistence. The agent runtime runs as durable Inngest functions triggered by application events. Infrastructure is provisioned and versioned via Terraform (infrastructure/gcp/).

The operations surface spans five distinct control planes:

Plane | System | Access Method
Platform infrastructure | GCP (Cloud Run, Cloud SQL, GCS, Secret Manager) | gcloud CLI, Terraform, GCP Console
Application identity and RBAC | Clerk | Clerk Dashboard, Svix webhooks
Agent observability | Langfuse | Langfuse Dashboard
Application errors | Sentry | Sentry Dashboard
Durable job runtime | Inngest | Inngest Dashboard

All platform-level secrets are stored in GCP Secret Manager and injected at deploy time. All per-organization integration credentials are stored in the Colony Vault (api_keys table, KMS-encrypted rows in Cloud SQL).


2. Admin Dashboard and Routes

2.1 Application Routes

Colony's UI is built on the Next.js App Router. The primary operator-facing entry point is /overview, which serves as both the user-facing daily brief and the command interface for the GTM Orchestrator.

Route | Purpose | Minimum Role
/overview | GTM Orchestrator chat interface; daily brief; pipeline snapshot | member
/pipeline | 9-stage deal tracker with Pipedrive bi-sync | member
/outbound | Sequence engine, HOT reply queue, angle selection | member
/content | 6-pillar content pipeline with attribution tracking | member
/recordings | Gemini Meet Notes ingestion and signal extraction | member
/onboarding | Deployment Kit generator (8 assets on Closed-Won) | member
/knowledge | Knowledge Core editor — 10-domain pgvector retrieval | admin
/settings/keys | Colony Vault UI — per-org API key management | admin
/settings/org | Organization configuration, user roster | admin

2.2 RBAC Model

Clerk manages authentication with three roles enforced across all routes. The .github/workflows/rbac-check.yml CI workflow validates role enforcement on every pull request.

Role | Capabilities
owner | Full access; can manage billing, rotate org-level secrets, add/remove admins
admin | Knowledge Core editing, Colony Vault key management, org settings, all member capabilities
member | All GTM surface routes; cannot access vault or knowledge editor

Role assignment is managed through the Clerk Dashboard at https://dashboard.clerk.com. Webhook sync via Svix propagates role changes to the Colony database in real time. If a role change does not reflect within 60 seconds, check Svix webhook delivery logs in the Clerk Dashboard.
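
The route table in Section 2.1 and the role table above can be read together as a minimum-role check. The following is a minimal sketch of that check, not the application's actual enforcement code. It assumes the three roles are strictly hierarchical (owner above admin above member) and that unknown routes are denied; the names Role, ROUTE_MIN_ROLE, and can_access are hypothetical.

```python
from enum import IntEnum


class Role(IntEnum):
    # Assumption: roles are strictly ordered, so a higher role
    # inherits every capability of the roles below it.
    MEMBER = 1
    ADMIN = 2
    OWNER = 3


# Minimum role per route, copied from the route table in Section 2.1.
ROUTE_MIN_ROLE = {
    "/overview": Role.MEMBER,
    "/pipeline": Role.MEMBER,
    "/outbound": Role.MEMBER,
    "/content": Role.MEMBER,
    "/recordings": Role.MEMBER,
    "/onboarding": Role.MEMBER,
    "/knowledge": Role.ADMIN,
    "/settings/keys": Role.ADMIN,
    "/settings/org": Role.ADMIN,
}


def can_access(role: Role, route: str) -> bool:
    """Return True if `role` meets the minimum role for `route`.

    Routes missing from the table are denied (fail closed); that
    default is an assumption of this sketch.
    """
    required = ROUTE_MIN_ROLE.get(route)
    return required is not None and role >= required
```

For example, can_access(Role.MEMBER, "/knowledge") is False, which matches the symptom described in PLAY-001 when a promotion to admin has not yet propagated.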

2.3 External Control Panels

Panel | URL Pattern | Purpose
GCP Console | console.cloud.google.com | Cloud Run revisions, Cloud SQL, GCS, IAM
Clerk Dashboard | dashboard.clerk.com | User management, org management, RBAC, webhook logs
Inngest Dashboard | app.inngest.com | Durable function runs, retries, event replay
Langfuse Dashboard | cloud.langfuse.com | LLM trace observability, token usage, latency
Sentry | sentry.io | Application error tracking, stack traces
Pipedrive | {org}.pipedrive.com | CRM bi-sync verification

3. Internal Tooling Inventory

3.1 Infrastructure as Code

Location: infrastructure/gcp/
Provider: Terraform with hashicorp/google v7.29.0 and hashicorp/google-beta v7.29.0

File | Manages
main.tf | GCP provider configuration, project binding, region
cloud-run-runner.tf | Cloud Run service definition, traffic routing, scaling config
cloud-sql.tf | Cloud SQL PostgreSQL 16 instance colony-39989:…:colony, private IP, backup policy
iam-runner.tf | Service account for Cloud Run, IAM bindings, least-privilege roles
secrets.tf | GCP Secret Manager secret definitions and IAM accessor bindings
storage.tf | GCS bucket gs://colony-assets, CMEK configuration, signed URL policy
variables.tf | Input variable declarations
terraform.tfvars | Environment-specific variable values (not committed to VCS in production)

Standard Terraform workflow:

cd infrastructure/gcp
terraform init          # first-time or after provider changes
terraform plan          # review changeset
terraform apply         # apply to active GCP project

State is stored in infrastructure/gcp/terraform.tfstate. A .backup copy is maintained automatically. For production, migrate state to a GCS backend to enable team access and state locking.

3.2 Database Initialization Scripts

Location: infrastructure/postgres/init/

Script | Purpose
01-multiple-dbs.sh | Shell script that creates the colony database and any additional databases required for local multi-service development
02-extensions.sql | Installs pgvector (1536-dimension vectors for Knowledge Core embeddings) and pgcrypto (UUID generation and column-level encryption for the Colony Vault)

These scripts are consumed by infrastructure/docker-compose.yml on first container startup. They are not re-run against Cloud SQL — the Cloud SQL instance is bootstrapped per docs/phase1/runbooks/db-bootstrap.md.

3.3 Local Development Stack

Location: infrastructure/docker-compose.yml

Provides a local-only PostgreSQL 16 instance with the initialization scripts applied. No local equivalents for Cloud Run, GCS, or Secret Manager are defined — developers use .env for local overrides and connect to a shared development Cloud SQL instance or a local Postgres container.

docker compose -f infrastructure/docker-compose.yml up -d

3.4 Playwright Runner

Location: .github/workflows/playwright-runner.yml
Artifacts: runners/.artifacts/pipedrive-dry-run/ and runners/.artifacts/pipedrive-batch/

The Playwright runner executes browser-automation jobs against Pipedrive (and other scrape targets defined in phase 2). It produces structured artifact directories per run:

Artifact | Content
batch.log | Full execution log for batch scrape runs
collect-N.log | Per-collector logs for parallel data collection tasks
import-N.log | Per-import logs for database write operations
summary.json | Machine-readable run summary: counts, errors, duration
candidates.json | Extracted prospect candidates (dry-run mode)
captures.json | Page capture metadata
final.png | Screenshot of final Playwright browser state
session.log | Raw browser session log

Artifacts are retained under runners/.artifacts/ organized by run type and ISO timestamp. Operators should review summary.json after each batch run to confirm import counts and identify partial failures.
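
The post-run review of summary.json can be partly automated. The sketch below is illustrative only: this document does not specify the schema of summary.json, so the field names errors (a count or a list of messages) and imported (a count) are assumptions, as is the helper name review_summary.

```python
from typing import List, Optional


def review_summary(summary: dict, expected_imports: Optional[int] = None) -> List[str]:
    """Return human-readable problems found in a parsed summary.json.

    Assumed (hypothetical) fields:
      errors   - either an integer count or a list of error messages
      imported - integer count of imported records
    """
    problems = []

    errors = summary.get("errors", 0)
    error_count = len(errors) if isinstance(errors, list) else int(errors)
    if error_count > 0:
        problems.append(f"{error_count} error(s) reported")

    imported = summary.get("imported")
    if expected_imports is not None and imported is not None:
        if imported < expected_imports:
            problems.append(
                f"imported {imported} of {expected_imports} expected records"
            )

    return problems
```

In practice the dictionary would come from json.load on the summary.json file of the run under review; an empty result means the run needs no follow-up under these assumptions.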

3.5 RBAC Enforcement CI Check

Location: .github/workflows/rbac-check.yml

Automated CI workflow that validates RBAC enforcement rules on every PR. Failures block merge. Operators investigating access-control regressions should review this workflow's run history in GitHub Actions before escalating.

3.6 Codacy Static Analysis

Location: .codacy/

Config File | Tool
.codacy/tools-configs/eslint.config.mjs | ESLint (TypeScript/JS)
.codacy/tools-configs/semgrep.yaml | Semgrep (security patterns)
.codacy/tools-configs/trivy.yaml | Trivy (dependency and container vulnerability scanning)
.codacy/tools-configs/pylint.rc | Pylint (Python, if applicable)
.codacy/tools-configs/lizard.yaml | Lizard (cyclomatic complexity)
.codacy/cli.sh | Local Codacy CLI runner
.codacy/codacy.yaml | Top-level Codacy configuration

Run Codacy locally:

bash .codacy/cli.sh

4. Operational Workflows

4.1 Deploy Workflow

Colony is deployed to Cloud Run via gcloud run deploy. Secrets are injected at deploy time from GCP Secret Manager using --update-secrets flags. The canonical deploy procedure is documented in docs/phase1/runbooks/deploy.md.

Standard deploy sequence:

# 1. Build and push container image
gcloud builds submit --tag gcr.io/{PROJECT_ID}/colony:{TAG}

# 2. Deploy to Cloud Run with secret injection
gcloud run deploy colony \
  --image gcr.io/{PROJECT_ID}/colony:{TAG} \
  --region {REGION} \
  --update-secrets=CLERK_SECRET_KEY=clerk-secret-key:latest,\
INNGEST_SIGNING_KEY=inngest-signing-key:latest,\
DATABASE_URL=colony-db-url:latest,\
LANGFUSE_SECRET_KEY=langfuse-secret-key:latest,\
SENTRY_DSN=sentry-dsn:latest

# 3. Verify traffic routing
gcloud run services describe colony --region {REGION}

All Cloud Run configuration (CPU allocation, concurrency, min/max instances, environment variable bindings) is defined in infrastructure/gcp/cloud-run-runner.tf. Changes to scaling or routing must go through Terraform, not the GCP Console, to avoid state drift.

4.2 Rollback Workflow

Cloud Run maintains a full revision history. Rollback is instant and does not require a new build.

# List recent revisions
gcloud run revisions list --service colony --region {REGION}

# Route 100% traffic to a prior revision
gcloud run services update-traffic colony \
  --to-revisions {REVISION_NAME}=100 \
  --region {REGION}

For database schema rollbacks, Colony uses migration files. Consult docs/phase1/runbooks/db-bootstrap.md for the migration toolchain. Schema rollbacks require a maintenance window and must be coordinated with the on-call engineer.

4.3 Feature Flags

Colony does not currently implement a dedicated feature-flag service. Feature gating is implemented via:

  1. RBAC role checks — features restricted to admin or owner roles act as controlled rollouts to a subset of users.
  2. Org-level API key presence — features that require a third-party integration (Pipedrive, Unipile, Resend, Google Drive) are conditionally available only when the corresponding key exists in the Colony Vault (api_keys table) for that organization.
  3. Environment variable flags — early-phase features may be toggled via environment variables injected through GCP Secret Manager at deploy time.

To enable a new integration for a specific organization, an admin user for that org must register the API key via /settings/keys. There is no operator-side UI bypass for this — it must be done through the application or directly via a SQL write to the api_keys table (production access procedure: see Section 4.5).
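
Gating by key presence (mechanism 2 above) amounts to a set-inclusion test per organization. The following sketch shows the idea under stated assumptions: the feature names and the FEATURE_REQUIRED_PROVIDERS mapping are hypothetical, while the provider keys match the provider column values listed in Section 8.3.

```python
# Hypothetical mapping of feature name -> vault providers it depends on.
# Provider keys match the `provider` column values in Section 8.3.
FEATURE_REQUIRED_PROVIDERS = {
    "pipeline_sync": {"pipedrive"},
    "outbound_email": {"resend"},
    "outbound_linkedin": {"unipile"},
    "recording_ingestion": {"google_drive"},
}


def enabled_features(org_providers: set) -> set:
    """Return the features whose required providers all have a key
    registered for the organization."""
    return {
        feature
        for feature, required in FEATURE_REQUIRED_PROVIDERS.items()
        if required <= org_providers  # subset test
    }
```

The set of providers for an org could be read with the same column selection used in PLAY-002 (org_id, provider, updated_at), which never touches the encrypted value.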

4.4 User Management

Day-to-day user operations are performed in the Clerk Dashboard (dashboard.clerk.com):

  • Create/deactivate users
  • Assign or change roles (owner, admin, member)
  • Manage organization memberships
  • Review sign-in activity and MFA status
  • Inspect and replay Svix webhook deliveries

Operator actions available in Clerk Dashboard:

Action | Path in Clerk UI
Add user to organization | Organizations → {Org} → Members → Invite
Change user role | Organizations → {Org} → Members → {User} → Role
Deactivate user | Users → {User} → Ban or Delete
View webhook delivery log | Webhooks → {Endpoint} → Attempts
Replay failed webhook | Webhooks → {Endpoint} → Attempts → Retry

Role changes propagate to Colony via Svix in real time. If propagation fails (user retains old permissions), manually trigger a webhook replay for the relevant organizationMembership.updated event.

4.5 Production Database Access

Direct production database access requires the Cloud SQL Auth Proxy and appropriate IAM credentials. This is a privileged operation tracked in the on-call log.

# Start Cloud SQL Auth Proxy
cloud-sql-proxy colony-39989:{REGION}:colony

# Connect via psql
psql -h 127.0.0.1 -U colony -d colony

IAM bindings for Cloud SQL access are defined in infrastructure/gcp/iam-runner.tf. Only engineers with the cloudsql.client IAM role may connect. All production SQL queries on api_keys (the Colony Vault) involve KMS-encrypted rows — plaintext values are never returned without explicit decryption via pgcrypto and the appropriate KMS key.

4.6 Pipedrive Integration Setup

Per docs/phase1/runbooks/pipedrive-fields.md, custom Pipedrive fields must be created before the bi-sync functions correctly. This is a one-time manual step per organization. See the runbook for the full field manifest. The Playwright runner (runners/) automates the scrape-to-import pipeline once fields are configured.

4.7 Google Drive / Recording Pipeline Setup

Per docs/phase1/runbooks/google-drive-setup.md and docs/phase2/runbooks/03_recording_pipeline.md, the Google Drive integration requires:

  1. A Google Cloud service account with Drive API access
  2. OAuth scope expansion (phase 2: docs/phase2/action_plans/12_google_oauth_scope_expansion.md) for Gmail and Calendar integration
  3. Manual workspace authorization steps per docs/phase2/runbooks/02_workspace_manual_steps.md

5. Monitoring and Alerting

5.1 Observability Stack

Signal Type | System | What Is Captured
LLM traces | Langfuse | All agent tool calls, prompt/completion pairs, token counts, latency per span, model used
Application errors | Sentry | Unhandled exceptions, API route errors, agent runtime failures, stack traces
Infrastructure metrics | GCP Cloud Monitoring | Cloud Run request count, latency p50/p95/p99, instance count, CPU utilization, memory utilization
Database metrics | Cloud SQL Monitoring | Connection count, query latency, disk utilization, replication lag (if replica configured)
Durable job execution | Inngest Dashboard | Function run status, retry counts, event queue depth, failed runs
CI quality | Codacy | Code coverage delta, new issues per PR, security findings (Semgrep, Trivy)

5.2 What to Monitor

Critical (page-level):

Signal | Threshold | Likely Cause
Cloud Run 5xx error rate | > 1% over 5 minutes | Agent runtime crash, database connection exhaustion, invalid secret
Cloud SQL connection count | > 90% of max connections | Connection pool leak, traffic spike without pooling
Inngest failed runs | Any function with > 3 consecutive failures | Downstream API outage (Pipedrive, Unipile, Resend), malformed event payload
Sentry error spike | > 10× baseline in 10 minutes | Bad deployment, upstream API change, schema migration issue

Warning (non-paging):

Signal | Threshold | Action
LLM token usage (Langfuse) | > 150% of 7-day average | Review orchestrator prompts for runaway loops; check circuit breakers
Playwright batch run failures | Any summary.json with errors > 0 | Review collect-N.log and import-N.log for root cause
Cloud Run instance count | Sustained at max configured instances | Evaluate scaling configuration in cloud-run-runner.tf
Cloud SQL disk utilization | > 80% | Review knowledge_core embedding growth; consider pgvector index pruning
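
The numeric thresholds in the two tables above can be expressed as a small evaluation function. This is a sketch for illustration, not the alerting configuration itself: the Snapshot fields and the returned signal names are hypothetical, and the metric values are assumed to have been collected already from Cloud Monitoring, Sentry, and Langfuse.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # All field names are hypothetical; values are assumed to be
    # gathered from the monitoring systems listed in Section 5.1.
    error_rate_5xx: float        # fraction of requests, last 5 minutes
    sql_connections: int
    sql_max_connections: int
    sentry_errors_10m: int
    sentry_baseline_10m: float   # typical error count per 10 minutes
    tokens_today: int
    tokens_7d_avg: float
    sql_disk_utilization: float  # fraction of disk used


def evaluate(s: Snapshot) -> dict:
    """Classify breached thresholds as critical (paging) or warning."""
    critical, warning = [], []

    if s.error_rate_5xx > 0.01:
        critical.append("cloud_run_5xx")
    if s.sql_connections > 0.9 * s.sql_max_connections:
        critical.append("sql_connections")
    if s.sentry_baseline_10m > 0 and s.sentry_errors_10m > 10 * s.sentry_baseline_10m:
        critical.append("sentry_spike")

    if s.tokens_7d_avg > 0 and s.tokens_today > 1.5 * s.tokens_7d_avg:
        warning.append("llm_token_usage")
    if s.sql_disk_utilization > 0.8:
        warning.append("sql_disk")

    return {"critical": critical, "warning": warning}
```

The Inngest consecutive-failure rule, the Playwright summary.json rule, and the sustained-max-instances rule are omitted here because they depend on run history rather than a single snapshot.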

5.3 Circuit Breakers

Colony implements per-org circuit breakers for the outbound engine. These are defined in the application layer (see docs/phase1/action_plans/22_circuit_breakers.md and docs/phase2/action_plans/11_circuit_breakers_phase2.md) and enforce limits on outbound volume per organization. When a breaker trips:

  1. The affected org's outbound sequences pause automatically.
  2. The HOT reply queue continues to function (inbound-initiated replies are not blocked).
  3. The admin or owner for the org sees an alert on /overview and /outbound.
  4. An operator can inspect breaker state via the Inngest Dashboard — look for paused or cancelled runs in the outbound function family.

Phase 2 circuit breakers extend this model to the discovery pipeline (see docs/phase2/testing/11_breakers_phase2.spec.ts).
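
The breaker behavior described in this section can be sketched as follows. This is a simplified model under stated assumptions, not the implementation documented in the action plans: the volume limit, the absence of a time window, the reset method, and the class name OutboundBreaker are all hypothetical. Only the two behaviors stated above are taken from this document: sequences pause per org when the breaker trips, and HOT replies are never blocked.

```python
from collections import defaultdict


class OutboundBreaker:
    """Per-org outbound volume breaker (illustrative sketch)."""

    def __init__(self, max_sends: int):
        self.max_sends = max_sends    # hypothetical per-org limit
        self.sent = defaultdict(int)  # org_id -> sends counted so far
        self.tripped = set()          # org_ids with a tripped breaker

    def allow(self, org_id: str, is_hot_reply: bool = False) -> bool:
        # Inbound-initiated (HOT) replies bypass the breaker entirely.
        if is_hot_reply:
            return True
        if org_id in self.tripped:
            return False
        if self.sent[org_id] >= self.max_sends:
            self.tripped.add(org_id)  # pause this org's sequences
            return False
        self.sent[org_id] += 1
        return True

    def reset(self, org_id: str) -> None:
        """Clear the breaker for one org (assumed operator action)."""
        self.sent[org_id] = 0
        self.tripped.discard(org_id)
```

State is keyed by org_id, so one organization tripping its breaker does not affect outbound sequences for any other organization.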

5.4 Escalation Path

L1 — On-call engineer
  ├── Reviews Sentry for error detail
  ├── Reviews Inngest Dashboard for failed runs
  ├── Reviews Cloud Run logs in GCP Console
  └── If database: connects via Cloud SQL Auth Proxy

L2 — Infrastructure engineer
  ├── Reviews Cloud Monitoring metrics
  ├── Reviews Terraform state for drift (terraform plan)
  └── Executes rollback if L1 remediation fails

L3 — Engineering lead
  └── Authorizes schema rollback, emergency secret rotation,
      or extended maintenance window

6. Support Workflows

6.1 Ticket Triage

Incoming support requests should be classified against the following taxonomy before investigation:

Class | Description | First Step
AUTH-* | User cannot sign in, wrong role, org access denied | Check Clerk Dashboard → Users and Webhook delivery log
VAULT-* | Integration not working, API key validation fails | Check /settings/keys in the org context; verify key format
AGENT-* | Orchestrator not responding, tool call hangs, empty output | Check Inngest Dashboard for the corresponding run; check Langfuse for trace
OUTBOUND-* | Sequences not sending, circuit breaker tripped | Check outbound circuit breaker state in Inngest; check Unipile / Resend vault keys
PIPELINE-* | Pipedrive sync missing data, deal stage mismatch | Check Pipedrive custom field configuration per docs/phase1/runbooks/pipedrive-fields.md
RECORDING-* | Meet Notes not appearing, signals not extracted | Check Google Drive service account permissions; check Inngest for recording ingestion run
ONBOARDING-* | Deployment Kit not generated on Closed-Won | Verify deal stage mapping; check Inngest for onboarding agent trigger
INFRA-* | Service unavailable, elevated error rate | Check Cloud Run revisions and GCP Cloud Monitoring
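
The taxonomy above maps directly to a lookup from ticket class to first step. The sketch below assumes ticket IDs take the form CLASS-identifier (for example AUTH-102); that ID format and the helper name first_step are assumptions, and the first-step text is abbreviated from the table.

```python
# First investigation step per ticket class, abbreviated from the
# triage table in Section 6.1.
FIRST_STEP = {
    "AUTH": "Check Clerk Dashboard users and webhook delivery log",
    "VAULT": "Check /settings/keys in the org context; verify key format",
    "AGENT": "Check Inngest Dashboard run; check Langfuse trace",
    "OUTBOUND": "Check outbound breaker state in Inngest; check Unipile / Resend vault keys",
    "PIPELINE": "Check Pipedrive custom field configuration",
    "RECORDING": "Check Google Drive service account permissions; check Inngest ingestion run",
    "ONBOARDING": "Verify deal stage mapping; check Inngest onboarding trigger",
    "INFRA": "Check Cloud Run revisions and GCP Cloud Monitoring",
}


def first_step(ticket_id: str) -> str:
    """Return the first investigation step for a ticket ID such as
    'AUTH-102' (hypothetical ID format)."""
    prefix = ticket_id.split("-", 1)[0].upper()
    return FIRST_STEP.get(
        prefix, "Unclassified: assign a class before investigating"
    )
```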

6.2 Common Issues and Resolution Playbooks

PLAY-001: User Has Wrong Role After Org Change

Symptom: User reports they cannot access /knowledge or /settings/keys despite being promoted to admin.

Steps:

  1. Open Clerk Dashboard → Organizations → {Org} → Members.
  2. Confirm the role displayed in Clerk matches the expected role.
  3. Navigate to Clerk Dashboard → Webhooks → {Colony endpoint} → Attempts.
  4. Find the most recent organizationMembership.updated event for this user.
  5. If delivery failed, click Retry. The Colony database will update within 30 seconds of successful delivery.
  6. If Clerk shows the correct role and the webhook delivered successfully, connect to the database and inspect the users or memberships table for the user record.

PLAY-002: Org Integration Key Not Working (Colony Vault)

Symptom: Agent reports a tool call failure for Pipedrive / Unipile / Resend / Google Drive.

Steps:

  1. Ask the org admin to navigate to /settings/keys and verify the key is present and saved (keys are stored but not displayed in plaintext post-save).
  2. If the key was recently rotated in the third-party system, the admin must delete and re-enter it in the vault UI.
  3. If the vault UI is not accessible, connect to Cloud SQL and query:
    SELECT org_id, provider, updated_at FROM api_keys WHERE org_id = '{ORG_ID}';
    
    Do not select the encrypted value column in a logged session.
  4. After re-entry, trigger a test agent run from /overview to verify connectivity.

PLAY-003: Inngest Function Stuck or Failing

Symptom: An agent operation initiated from /overview or a background job (daily brief, recording ingestion, batch scrape) never completes.

Steps:

  1. Open Inngest Dashboard → Functions.
  2. Filter by function name (e.g., gtm-orchestrator, recording-ingestion, onboarding-kit-generator).
  3. Inspect the failed run: review event payload, step output, and error message.
  4. If the failure is a transient third-party error (rate limit, timeout), click Replay on the run.
  5. If the failure is a schema validation error or unexpected payload shape, this likely indicates a code bug. Capture the run ID and file an AGENT-* ticket with the Inngest run URL.
  6. If the function is stuck (in-progress for > 15 minutes), it may be holding a database connection. Check Cloud SQL connection count in Cloud Monitoring. If connections are exhausted, the Cloud Run service may need to be redeployed to recycle connection pools.
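
The 15-minute rule in step 6 can be applied to a list of run records as shown below. This is a sketch only: the record shape (id, status, started_at) and the status value "running" are hypothetical and do not reflect the Inngest API; the records are assumed to have been exported or copied from the Inngest Dashboard.

```python
from datetime import datetime, timedelta, timezone

# Threshold from step 6 of PLAY-003.
STUCK_AFTER = timedelta(minutes=15)


def find_stuck_runs(runs: list, now: datetime) -> list:
    """Return IDs of runs that have been in progress longer than
    STUCK_AFTER. Each run is a dict with hypothetical keys
    'id', 'status', and 'started_at' (timezone-aware datetime)."""
    return [
        run["id"]
        for run in runs
        if run["status"] == "running" and now - run["started_at"] > STUCK_AFTER
    ]
```

Passing now explicitly, for example datetime.now(timezone.utc), keeps the function deterministic and easy to check against a known set of runs.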

PLAY-004: Playwright Batch Run Partial Failure

Symptom: runners/.artifacts/pipedrive-batch/{TIMESTAMP}/summary.json shows errors > 0 or import count is lower than expected.

Steps:

  1. Open batch.log for the overall execution timeline.
  2. Check collect-N.log files for the specific collector that failed — look for HTTP errors, selector mismatches, or authentication failures.
  3. Check import-N.log for database write errors — look for constraint violations or connection errors.
  4. If the failure is an authentication issue with Pipedrive, the Pipedrive session cookie in the runner configuration may have expired. Update the credentials in GCP Secret Manager.
  5. Re-run the batch for the failed segment only (do not re-run already-imported segments to avoid duplicates).

PLAY-005: Daily Brief Not Delivered

Symptom: User reports no brief email and no brief on /overview for a given morning.

Steps:

  1. Check Inngest Dashboard for the daily-brief function scheduled run for that date. Verify it triggered and completed.
  2. If the Inngest run succeeded but no email was received, check Resend delivery logs for the user's email address.
  3. If Resend shows delivery but the brief is absent from /overview, the brief was likely written to the database but the UI query is failing — check Sentry for errors on the /overview route around that time.
  4. If the Inngest run did not trigger, verify the cron schedule configuration has not drifted. Review the function definition.

PLAY-006: Disaster Recovery

Per docs/phase2/runbooks/04_dr_recovery.md, the DR procedure covers:

  • Cloud SQL point-in-time recovery from automated backups
  • GCS object restoration from versioned bucket
  • Cloud Run service re-deployment from the last known good container image revision

The DR runbook must be followed in sequence. Do not attempt manual data restoration outside the documented procedure.


7. Environment Management

7.1 Environment Topology

Environment | Infrastructure | Purpose
Local | infrastructure/docker-compose.yml + .env | Developer iteration; Postgres only, no GCS or Secret Manager
Development / CI | Shared GCP project or per-PR ephemeral | Integration testing; GitHub Actions Playwright runner
Production | GCP project colony-39989, Cloud Run, Cloud SQL, GCS | Live system

7.2 Local Environment Configuration

The .env file at the repository root provides local overrides. It is gitignored. Developers must populate it with:

Variable | Source
DATABASE_URL | Local Docker Postgres or development Cloud SQL (via proxy)
CLERK_SECRET_KEY | Clerk development instance
CLERK_PUBLISHABLE_KEY | Clerk development instance
INNGEST_SIGNING_KEY | Inngest dev environment
INNGEST_EVENT_KEY | Inngest dev environment
LANGFUSE_SECRET_KEY | Langfuse cloud or self-hosted
LANGFUSE_PUBLIC_KEY | Langfuse cloud or self-hosted
SENTRY_DSN | Sentry project DSN (development project)
GCP_KMS_KEY_NAME | KMS key for Colony Vault encryption (dev key)
GCS_BUCKET_NAME | gs://colony-assets or a dev bucket

7.3 Production Environment Configuration

All production secrets are stored in GCP Secret Manager and injected via gcloud run deploy --update-secrets. They are never stored in .env, Terraform variable files, or application code. The canonical list of secrets is defined in infrastructure/gcp/secrets.tf.

7.4 Database Initialization

Local:

docker compose -f infrastructure/docker-compose.yml up -d
# On first container startup, the init scripts run automatically, in order:
#   infrastructure/postgres/init/01-multiple-dbs.sh
#   infrastructure/postgres/init/02-extensions.sql

Cloud SQL (production bootstrap): Follow docs/phase1/runbooks/db-bootstrap.md — extensions (pgvector, pgcrypto) must be manually enabled on the Cloud SQL instance after provisioning, as the init scripts are not run against managed instances.

7.5 Cost Monitoring

Per docs/phase2/runbooks/05_cost_dashboard.md, a GCP cost dashboard is maintained for the Colony project. Key cost drivers to monitor monthly: Cloud Run compute, Cloud SQL storage and IOPS, GCS egress, Gemini API token consumption, and Langfuse hosted usage.


8. Secret Management

8.1 Two-Vault Architecture

Colony operates two intentionally separate secret stores:

Store A — GCP Secret Manager (platform-level)
Holds secrets required to bootstrap and operate the platform itself. These are not tenant-specific and are managed by the infrastructure team.

Store B — Colony Vault (per-org, api_keys table)
Holds tenant-specific integration credentials. Rows are encrypted at the column level using pgcrypto with a GCP KMS key. No plaintext credential is ever stored in this table.

8.2 GCP Secret Manager Inventory

Secret Name | Contents | Consumer
clerk-secret-key | Clerk backend API key | Next.js API routes, webhook validation
inngest-signing-key | Inngest function signing key | Inngest function handler
inngest-event-key | Inngest event ingestion key | Application event dispatch
colony-db-url | Cloud SQL connection string | Next.js DB client, migration runner
langfuse-secret-key | Langfuse API secret | LLM client wrapper
langfuse-public-key | Langfuse API public key | LLM client wrapper
sentry-dsn | Sentry project DSN | Sentry Next.js SDK
gcp-kms-key-name | KMS key resource name for Colony Vault | api_keys encryption/decryption
bootstrap-llm-key | Platform-level LLM key (Gemini / OpenAI) | Agent runtime bootstrap

Secrets are versioned in GCP Secret Manager. The latest alias always points to the active version. When rotating, create a new version first, update the Cloud Run deployment to bind latest, verify functionality, then disable the prior version.

8.3 Colony Vault Inventory (api_keys table)

Per-organization. One row per integration per org. The provider column identifies the integration.

Provider Key | Integration | Used By
pipedrive | Pipedrive CRM API key | Pipeline bi-sync, Playwright runner imports
unipile | Unipile LinkedIn/messaging API | Outbound sequence delivery, prospect discovery
resend | Resend transactional email | Outbound email sequences, daily brief delivery
google_drive | Google Drive service account credentials | Recording ingestion, Meet Notes pipeline
google_oauth | Google OAuth tokens (Gmail, Calendar) | Gmail integration, Calendar integration (phase 2)
serpapi | SerpAPI key | Prospect discovery web search (phase 2)
notion | Notion integration token | Knowledge Core sync to Notion (phase 2)

Keys are entered by org admin users via /settings/keys. Operators do not routinely access vault contents. Emergency access requires Cloud SQL Auth Proxy, the KMS key, and explicit approval from the Engineering lead (L3 escalation).

8.4 Secret Rotation Policy

Secret | Rotation Trigger | Rotation Procedure
GCP Secret Manager secrets | Annually, or on personnel change, or on suspected compromise | Create new version in Secret Manager → redeploy Cloud Run with --update-secrets → disable old version
Colony Vault keys (per-org) | On third-party credential rotation or org admin request | Org admin deletes old key entry in /settings/keys and adds new value
Clerk signing keys | Per Clerk rotation schedule or on compromise | Rotate in Clerk Dashboard → update Secret Manager → redeploy
Cloud SQL password | On DBA personnel change | Rotate in Cloud SQL → update colony-db-url secret → redeploy

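After any GCP Secret Manager rotation, redeploy Cloud Run so that every instance picks up the new version. Secrets injected as environment variables are resolved when an instance starts, so running instances keep the old value even when the binding points at latest. If a pinned version (not latest) was previously in use, the redeploy must also update the binding to the new version.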

8.5 KMS Configuration

The GCP KMS key used for Colony Vault encryption is referenced by the gcp-kms-key-name secret. The key is a symmetric encryption key in the colony-39989 project keyring. Key rotation is managed through GCP KMS key rotation policy (recommended: 90-day automatic rotation). Rotation does not require re-encryption of existing rows: Cloud KMS encrypts new data with the new primary key version and keeps prior versions available, so existing rows remain decryptable as long as the key version that encrypted them is not disabled or destroyed.


9. Health Checks and Status

9.1 Cloud Run Health Check

Cloud Run performs startup and liveness checks against the deployed service. The health check path should be configured in infrastructure/gcp/cloud-run-runner.tf. Next.js does not ship a built-in health route, so the application must define one; the expected path is:

GET /api/health

This endpoint should return HTTP 200 with a JSON body confirming database connectivity and any critical dependency status. If no probe is configured for the service, Cloud Run applies its default TCP startup probe, which only checks that the container is listening on the configured port.
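
The response contract described above can be sketched independently of the web framework. The example below only illustrates the status-code logic and is not the application's route handler: the body shape (status, checks), the use of HTTP 503 when a dependency fails, and the function name health_response are assumptions.

```python
def health_response(checks: dict) -> tuple:
    """Build (status_code, body) from a mapping of dependency name
    to a boolean health result, e.g. {"database": True}.

    Returning 503 when any dependency is unhealthy is an assumption
    of this sketch; the document only specifies the 200 case.
    """
    healthy = all(checks.values())
    body = {
        "status": "ok" if healthy else "degraded",
        "checks": dict(checks),
    }
    return (200 if healthy else 503), body
```

A probe that treats any non-2xx status as a failure would then mark the instance unhealthy whenever the database check fails, which is the behavior the health check is meant to surface.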

9.2 Pub/Sub Topics

Per docs/phase2/runbooks/01_pubsub_topics.md, GCP Pub/Sub topics are used for event-driven pipeline components introduced in phase 2 (recording pipeline, Google Chat push notifications). Topic health can be monitored via GCP Cloud Monitoring → Pub/Sub → Message delivery metrics.

9.3 Inngest Function Health

The Inngest Dashboard provides a real-time view of all