Operator console

Last updated 5/24/2026

Colony Operator Console

Document ID: TECH-OPS-001
Version: 1.0
Classification: Internal — Engineering & Operations
Product: Colony — ZoomProp GTM Agentic Stack
Scope: Full internal operations surface including deployment, monitoring, secret management, environment configuration, and support workflows

Operations Overview
Admin Dashboard and Routes
Internal Tooling Inventory
Operational Workflows
Monitoring and Alerting
Support Workflows
Environment Management
Secret Management
Health Checks and Status
Runbook Index

1. Operations Overview

Colony is deployed as a Next.js 16 application on Google Cloud Run, backed by Cloud SQL (PostgreSQL 16) with pgvector and pgcrypto extensions, and Google Cloud Storage for asset persistence. The agent runtime runs as durable Inngest functions triggered by application events. Infrastructure is provisioned and versioned via Terraform (infrastructure/gcp/).

The operations surface spans five distinct control planes:

Plane	System	Access Method
Platform infrastructure	GCP (Cloud Run, Cloud SQL, GCS, Secret Manager)	`gcloud` CLI, Terraform, GCP Console
Application identity and RBAC	Clerk	Clerk Dashboard, Svix webhooks
Agent observability	Langfuse	Langfuse Dashboard
Application errors	Sentry	Sentry Dashboard
Durable job runtime	Inngest	Inngest Dashboard

All platform-level secrets are stored in GCP Secret Manager and injected at deploy time. All per-organization integration credentials are stored in the Colony Vault (api_keys table, KMS-encrypted rows in Cloud SQL).

2. Admin Dashboard and Routes

2.1 Application Routes

Colony's UI is built on the Next.js App Router. The primary operator-facing entry point is /overview, which serves as both the user-facing daily brief and the command interface for the GTM Orchestrator.

Route	Purpose	Minimum Role
`/overview`	GTM Orchestrator chat interface; daily brief; pipeline snapshot	`member`
`/pipeline`	9-stage deal tracker with Pipedrive bi-sync	`member`
`/outbound`	Sequence engine, HOT reply queue, angle selection	`member`
`/content`	6-pillar content pipeline with attribution tracking	`member`
`/recordings`	Gemini Meet Notes ingestion and signal extraction	`member`
`/onboarding`	Deployment Kit generator (8 assets on Closed-Won)	`member`
`/knowledge`	Knowledge Core editor — 10-domain pgvector retrieval	`admin`
`/settings/keys`	Colony Vault UI — per-org API key management	`admin`
`/settings/org`	Organization configuration, user roster	`admin`

2.2 RBAC Model

Clerk manages authentication with three roles enforced across all routes. The .github/workflows/rbac-check.yml CI workflow validates role enforcement on every pull request.

Role	Capabilities
`owner`	Full access; can manage billing, rotate org-level secrets, add/remove admins
`admin`	Knowledge Core editing, Colony Vault key management, org settings, all member capabilities
`member`	All GTM surface routes; cannot access vault or knowledge editor

Role assignment is managed through the Clerk Dashboard at https://dashboard.clerk.com. Webhook sync via Svix propagates role changes to the Colony database in real time. If a role change does not reflect within 60 seconds, check Svix webhook delivery logs in the Clerk Dashboard.
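
The route table in 2.1 and the role table in 2.2 can be read as a simple rank comparison. The sketch below is illustrative only: `canAccessRoute`, `ROLE_RANK`, and `ROUTE_MIN_ROLE` are hypothetical names, and it assumes the roles are strictly ordered (member < admin < owner), which the capability descriptions suggest but which the real enforcement code may implement differently.

```typescript
// Roles from section 2.2, ordered from least to most privileged (assumed ordering).
type ColonyRole = "member" | "admin" | "owner";

const ROLE_RANK: Record<ColonyRole, number> = { member: 0, admin: 1, owner: 2 };

// Minimum role per route, copied from the table in section 2.1.
const ROUTE_MIN_ROLE: { [route: string]: ColonyRole } = {
  "/overview": "member",
  "/pipeline": "member",
  "/outbound": "member",
  "/content": "member",
  "/recordings": "member",
  "/onboarding": "member",
  "/knowledge": "admin",
  "/settings/keys": "admin",
  "/settings/org": "admin",
};

// Hypothetical helper: true when `role` meets the route's minimum role.
// Routes that are not in the table are denied by default.
function canAccessRoute(role: ColonyRole, route: string): boolean {
  if (!Object.prototype.hasOwnProperty.call(ROUTE_MIN_ROLE, route)) return false;
  const minimum = ROUTE_MIN_ROLE[route];
  if (minimum === undefined) return false;
  return ROLE_RANK[role] >= ROLE_RANK[minimum];
}
```

Denying unknown routes by default is a design choice of this sketch, not something the source specifies.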

2.3 External Control Panels

Panel	URL Pattern	Purpose
GCP Console	`console.cloud.google.com`	Cloud Run revisions, Cloud SQL, GCS, IAM
Clerk Dashboard	`dashboard.clerk.com`	User management, org management, RBAC, webhook logs
Inngest Dashboard	`app.inngest.com`	Durable function runs, retries, event replay
Langfuse Dashboard	`cloud.langfuse.com`	LLM trace observability, token usage, latency
Sentry	`sentry.io`	Application error tracking, stack traces
Pipedrive	`{org}.pipedrive.com`	CRM bi-sync verification

3. Internal Tooling Inventory

3.1 Infrastructure as Code

Location: infrastructure/gcp/
Provider: Terraform with hashicorp/google v7.29.0 and hashicorp/google-beta v7.29.0

File	Manages
`main.tf`	GCP provider configuration, project binding, region
`cloud-run-runner.tf`	Cloud Run service definition, traffic routing, scaling config
`cloud-sql.tf`	Cloud SQL PostgreSQL 16 instance `colony-39989:…:colony`, private IP, backup policy
`iam-runner.tf`	Service account for Cloud Run, IAM bindings, least-privilege roles
`secrets.tf`	GCP Secret Manager secret definitions and IAM accessor bindings
`storage.tf`	GCS bucket `gs://colony-assets`, CMEK configuration, signed URL policy
`variables.tf`	Input variable declarations
`terraform.tfvars`	Environment-specific variable values (not committed to VCS in production)

Standard Terraform workflow:

cd infrastructure/gcp
terraform init          # first-time or after provider changes
terraform plan          # review changeset
terraform apply         # apply to active GCP project

State is stored in infrastructure/gcp/terraform.tfstate. A .backup copy is maintained automatically. For production, migrate state to a GCS backend to enable team access and state locking.

3.2 Database Initialization Scripts

Location: infrastructure/postgres/init/

Script	Purpose
`01-multiple-dbs.sh`	Shell script that creates the `colony` database and any additional databases required for local multi-service development
`02-extensions.sql`	Installs `pgvector` (1536-dimension vectors for Knowledge Core embeddings) and `pgcrypto` (UUID generation and column-level encryption for the Colony Vault)

These scripts are consumed by infrastructure/docker-compose.yml on first container startup. They are not re-run against Cloud SQL — the Cloud SQL instance is bootstrapped per docs/phase1/runbooks/db-bootstrap.md.

3.3 Local Development Stack

Location: infrastructure/docker-compose.yml

Provides a local-only PostgreSQL 16 instance with the initialization scripts applied. No local equivalents for Cloud Run, GCS, or Secret Manager are defined — developers use .env for local overrides and connect to a shared development Cloud SQL instance or a local Postgres container.

docker compose -f infrastructure/docker-compose.yml up -d

3.4 Playwright Runner

Location: .github/workflows/playwright-runner.yml
Artifacts: runners/.artifacts/pipedrive-dry-run/ and runners/.artifacts/pipedrive-batch/

The Playwright runner executes browser-automation jobs against Pipedrive (and other scrape targets defined in phase 2). It produces structured artifact directories per run:

Artifact	Content
`batch.log`	Full execution log for batch scrape runs
`collect-N.log`	Per-collector logs for parallel data collection tasks
`import-N.log`	Per-import logs for database write operations
`summary.json`	Machine-readable run summary: counts, errors, duration
`candidates.json`	Extracted prospect candidates (dry-run mode)
`captures.json`	Page capture metadata
`final.png`	Screenshot of final Playwright browser state
`session.log`	Raw browser session log

Artifacts are retained under runners/.artifacts/ organized by run type and ISO timestamp. Operators should review summary.json after each batch run to confirm import counts and identify partial failures.
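
The post-run review of summary.json could be automated along these lines. This is a minimal sketch: the field names `errors`, `importedCount`, and `expectedCount` are assumptions, since the source only states that the file contains counts, errors, and duration, so the shape should be adjusted to the real file.

```typescript
// Assumed shape of summary.json; the real field names may differ.
interface RunSummary {
  errors: number;        // number of errors recorded during the run
  importedCount: number; // rows written by the import step
  expectedCount: number; // rows the collectors produced for import
}

interface SummaryVerdict {
  ok: boolean;
  reasons: string[];
}

// Parse the JSON text of summary.json and flag partial failures.
function reviewSummary(jsonText: string): SummaryVerdict {
  const summary = JSON.parse(jsonText) as RunSummary;
  const reasons: string[] = [];
  if (summary.errors > 0) {
    reasons.push(`${summary.errors} error(s) reported`);
  }
  if (summary.importedCount < summary.expectedCount) {
    reasons.push(`imported ${summary.importedCount} of ${summary.expectedCount}`);
  }
  return { ok: reasons.length === 0, reasons };
}
```

A run with an empty `reasons` list needs no follow-up; anything else points the operator at the collect-N.log and import-N.log files.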

3.5 RBAC Enforcement CI Check

Location: .github/workflows/rbac-check.yml

Automated CI workflow that validates RBAC enforcement rules on every PR. Failures block merge. Operators investigating access-control regressions should review this workflow's run history in GitHub Actions before escalating.

3.6 Codacy Static Analysis

Location: .codacy/

Config File	Tool
`.codacy/tools-configs/eslint.config.mjs`	ESLint (TypeScript/JS)
`.codacy/tools-configs/semgrep.yaml`	Semgrep (security patterns)
`.codacy/tools-configs/trivy.yaml`	Trivy (dependency and container vulnerability scanning)
`.codacy/tools-configs/pylint.rc`	Pylint (Python, if applicable)
`.codacy/tools-configs/lizard.yaml`	Lizard (cyclomatic complexity)
`.codacy/cli.sh`	Local Codacy CLI runner
`.codacy/codacy.yaml`	Top-level Codacy configuration

Run Codacy locally:

bash .codacy/cli.sh

4. Operational Workflows

4.1 Deploy Workflow

Colony is deployed to Cloud Run via gcloud run deploy. Secrets are injected at deploy time from GCP Secret Manager using --update-secrets flags. The canonical deploy procedure is documented in docs/phase1/runbooks/deploy.md.

Standard deploy sequence:

# 1. Build and push container image
gcloud builds submit --tag gcr.io/{PROJECT_ID}/colony:{TAG}

# 2. Deploy to Cloud Run with secret injection
gcloud run deploy colony \
  --image gcr.io/{PROJECT_ID}/colony:{TAG} \
  --region {REGION} \
  --update-secrets=CLERK_SECRET_KEY=clerk-secret-key:latest,\
INNGEST_SIGNING_KEY=inngest-signing-key:latest,\
DATABASE_URL=colony-db-url:latest,\
LANGFUSE_SECRET_KEY=langfuse-secret-key:latest,\
SENTRY_DSN=sentry-dsn:latest

# 3. Verify traffic routing
gcloud run services describe colony --region {REGION}

All Cloud Run configuration (CPU allocation, concurrency, min/max instances, environment variable bindings) is defined in infrastructure/gcp/cloud-run-runner.tf. Changes to scaling or routing must go through Terraform, not the GCP Console, to avoid state drift.
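
To keep the `--update-secrets` value consistent between deploys, the flag can be rendered from a single list of bindings. The sketch below reproduces the bindings used in the deploy command above; `buildUpdateSecretsFlag` and `SECRET_BINDINGS` are hypothetical names, not part of the repository.

```typescript
// Env var name -> Secret Manager secret name, matching the deploy command in 4.1.
const SECRET_BINDINGS: Array<[string, string]> = [
  ["CLERK_SECRET_KEY", "clerk-secret-key"],
  ["INNGEST_SIGNING_KEY", "inngest-signing-key"],
  ["DATABASE_URL", "colony-db-url"],
  ["LANGFUSE_SECRET_KEY", "langfuse-secret-key"],
  ["SENTRY_DSN", "sentry-dsn"],
];

// Hypothetical helper: render the flag as comma-separated ENV=secret:version pairs.
function buildUpdateSecretsFlag(
  bindings: Array<[string, string]>,
  version: string = "latest"
): string {
  const pairs = bindings.map(
    ([envName, secretName]) => `${envName}=${secretName}:${version}`
  );
  return "--update-secrets=" + pairs.join(",");
}
```

Passing a version number instead of `latest` produces a pinned binding, for example `SENTRY_DSN=sentry-dsn:3`.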

4.2 Rollback Workflow

Cloud Run maintains a full revision history. Rollback takes effect as soon as traffic is rerouted to the prior revision and does not require a new build.

# List recent revisions
gcloud run revisions list --service colony --region {REGION}

# Route 100% traffic to a prior revision
gcloud run services update-traffic colony \
  --to-revisions {REVISION_NAME}=100 \
  --region {REGION}

For database schema rollbacks, Colony uses migration files. Consult docs/phase1/runbooks/db-bootstrap.md for the migration toolchain. Schema rollbacks require a maintenance window and must be coordinated with the on-call engineer.

4.3 Feature Flags

Colony does not currently implement a dedicated feature-flag service. Feature gating is implemented via:

RBAC role checks — features restricted to admin or owner roles act as controlled rollouts to a subset of users.
Org-level API key presence — features that require a third-party integration (Pipedrive, Unipile, Resend, Google Drive) are conditionally available only when the corresponding key exists in the Colony Vault (api_keys table) for that organization.
Environment variable flags — early-phase features may be toggled via environment variables injected through GCP Secret Manager at deploy time.

To enable a new integration for a specific organization, an admin user for that org must register the API key via /settings/keys. There is no operator-side UI bypass for this — it must be done through the application or directly via a SQL write to the api_keys table (production access procedure: see Section 4.5).
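
The three gating mechanisms described in this section can be combined into one check. The following is a sketch under stated assumptions: the `FeatureGate` and `GateContext` shapes, the helper name `isFeatureEnabled`, and the convention that an environment flag is enabled when its value is the string "true" are all hypothetical.

```typescript
type GateRole = "member" | "admin" | "owner";

// A feature may be gated by any combination of the three mechanisms in 4.3.
interface FeatureGate {
  minRole?: GateRole;        // RBAC gate
  requiredProvider?: string; // api_keys.provider that must exist for the org
  envFlag?: string;          // environment variable that must equal "true" (assumed convention)
}

interface GateContext {
  role: GateRole;
  orgProviders: string[];                      // providers with a key in the Colony Vault
  env: { [name: string]: string | undefined }; // for example, the process environment
}

const GATE_RANK: Record<GateRole, number> = { member: 0, admin: 1, owner: 2 };

// Returns true only when every gate that is set on the feature passes.
function isFeatureEnabled(gate: FeatureGate, ctx: GateContext): boolean {
  if (gate.minRole !== undefined && GATE_RANK[ctx.role] < GATE_RANK[gate.minRole]) {
    return false;
  }
  if (gate.requiredProvider !== undefined && ctx.orgProviders.indexOf(gate.requiredProvider) === -1) {
    return false;
  }
  if (gate.envFlag !== undefined && ctx.env[gate.envFlag] !== "true") {
    return false;
  }
  return true;
}
```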

4.4 User Management

Day-to-day user operations are performed in the Clerk Dashboard (dashboard.clerk.com):

Create/deactivate users
Assign or change roles (owner, admin, member)
Manage organization memberships
Review sign-in activity and MFA status
Inspect and replay Svix webhook deliveries

Operator actions available in Clerk Dashboard:

Action	Path in Clerk UI
Add user to organization	Organizations → {Org} → Members → Invite
Change user role	Organizations → {Org} → Members → {User} → Role
Deactivate user	Users → {User} → Ban or Delete
View webhook delivery log	Webhooks → {Endpoint} → Attempts
Replay failed webhook	Webhooks → {Endpoint} → Attempts → Retry

Role changes propagate to Colony via Svix in real time. If propagation fails (user retains old permissions), manually trigger a webhook replay for the relevant organizationMembership.updated event.
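
When debugging a propagation failure, it helps to know what the handler needs from the event. The sketch below extracts a role change from an organizationMembership.updated payload. The payload fields shown (`data.role`, `data.organization.id`, `data.public_user_data.user_id`) and the optional `org:` role prefix are assumptions about Clerk's webhook format and should be checked against a real delivery in the webhook log. Svix signature verification is out of scope here and must happen before any payload is trusted.

```typescript
// Assumed subset of the organizationMembership.updated payload.
interface MembershipUpdatedEvent {
  type: string;
  data: {
    role: string; // for example "org:admin" or "admin" (assumed formats)
    organization: { id: string };
    public_user_data: { user_id: string };
  };
}

type ColonyMemberRole = "owner" | "admin" | "member";

interface RoleChange {
  orgId: string;
  userId: string;
  role: ColonyMemberRole;
}

const KNOWN_MEMBER_ROLES: ColonyMemberRole[] = ["owner", "admin", "member"];

// Returns the role change to persist, or null when the event is of another
// type or carries a role that Colony does not recognise.
function extractRoleChange(event: MembershipUpdatedEvent): RoleChange | null {
  if (event.type !== "organizationMembership.updated") return null;
  const bare = event.data.role.replace(/^org:/, "");
  for (const known of KNOWN_MEMBER_ROLES) {
    if (known === bare) {
      return {
        orgId: event.data.organization.id,
        userId: event.data.public_user_data.user_id,
        role: known,
      };
    }
  }
  return null;
}
```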

4.5 Production Database Access

Direct production database access requires the Cloud SQL Auth Proxy and appropriate IAM credentials. This is a privileged operation tracked in the on-call log.

# Start Cloud SQL Auth Proxy
cloud-sql-proxy colony-39989:{REGION}:colony

# Connect via psql
psql -h 127.0.0.1 -U colony -d colony

IAM bindings for Cloud SQL access are defined in infrastructure/gcp/iam-runner.tf. Only engineers granted the Cloud SQL Client IAM role (roles/cloudsql.client) may connect. All production SQL queries on api_keys (the Colony Vault) involve KMS-encrypted rows; plaintext values are never returned without explicit decryption via pgcrypto and the appropriate KMS key.

4.6 Pipedrive Integration Setup

Per docs/phase1/runbooks/pipedrive-fields.md, custom Pipedrive fields must be created before the bi-sync can function correctly. This is a one-time manual step per organization. See the runbook for the full field manifest. The Playwright runner (runners/) automates the scrape-to-import pipeline once fields are configured.

4.7 Google Drive / Recording Pipeline Setup

Per docs/phase1/runbooks/google-drive-setup.md and docs/phase2/runbooks/03_recording_pipeline.md, the Google Drive integration requires:

A Google Cloud service account with Drive API access
OAuth scope expansion (phase 2: docs/phase2/action_plans/12_google_oauth_scope_expansion.md) for Gmail and Calendar integration
Manual workspace authorization steps per docs/phase2/runbooks/02_workspace_manual_steps.md

5. Monitoring and Alerting

5.1 Observability Stack

Signal Type	System	What Is Captured
LLM traces	Langfuse	All agent tool calls, prompt/completion pairs, token counts, latency per span, model used
Application errors	Sentry	Unhandled exceptions, API route errors, agent runtime failures, stack traces
Infrastructure metrics	GCP Cloud Monitoring	Cloud Run request count, latency p50/p95/p99, instance count, CPU utilization, memory utilization
Database metrics	Cloud SQL Monitoring	Connection count, query latency, disk utilization, replication lag (if replica configured)
Durable job execution	Inngest Dashboard	Function run status, retry counts, event queue depth, failed runs
CI quality	Codacy	Code coverage delta, new issues per PR, security findings (Semgrep, Trivy)

5.2 What to Monitor

Critical (page-level):

Signal	Threshold	Likely Cause
Cloud Run 5xx error rate	> 1% over 5 minutes	Agent runtime crash, database connection exhaustion, invalid secret
Cloud SQL connection count	> 90% of max connections	Connection pool leak, traffic spike without pooling
Inngest failed runs	Any function with > 3 consecutive failures	Downstream API outage (Pipedrive, Unipile, Resend), malformed event payload
Sentry error spike	> 10× baseline in 10 minutes	Bad deployment, upstream API change, schema migration issue

Warning (non-paging):

Signal	Threshold	Action
LLM token usage (Langfuse)	> 150% of 7-day average	Review orchestrator prompts for runaway loops; check circuit breakers
Playwright batch run failures	Any `summary.json` with `errors > 0`	Review `collect-N.log` and `import-N.log` for root cause
Cloud Run instance count	Sustained at max configured instances	Evaluate scaling configuration in `cloud-run-runner.tf`
Cloud SQL disk utilization	> 80%	Review `knowledge_core` embedding growth; consider pgvector index pruning
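
The numeric thresholds in the two tables above can be expressed as one evaluation function. This is a sketch only: the `SignalSnapshot` shape and the signal identifiers are hypothetical, how the numbers are collected from Cloud Monitoring, Sentry, and Langfuse is not covered, and it assumes the token threshold compares the last 24 hours against the 7-day daily average.

```typescript
// One reading of the signals from section 5.2 (field names are assumptions).
interface SignalSnapshot {
  http5xxRate5m: number;        // fraction of 5xx responses over 5 minutes (0..1)
  sqlConnections: number;
  sqlMaxConnections: number;
  sentryErrors10m: number;
  sentryBaseline10m: number;    // typical error count for a 10-minute window
  tokens24h: number;
  tokens7dDailyAverage: number;
  sqlDiskUtilization: number;   // fraction of disk used (0..1)
}

interface ColonyAlert {
  severity: "critical" | "warning";
  signal: string;
}

function evaluateSignals(s: SignalSnapshot): ColonyAlert[] {
  const alerts: ColonyAlert[] = [];
  // Critical (page-level) thresholds.
  if (s.http5xxRate5m > 0.01) alerts.push({ severity: "critical", signal: "cloud-run-5xx" });
  if (s.sqlConnections > 0.9 * s.sqlMaxConnections) alerts.push({ severity: "critical", signal: "sql-connections" });
  if (s.sentryErrors10m > 10 * s.sentryBaseline10m) alerts.push({ severity: "critical", signal: "sentry-spike" });
  // Warning (non-paging) thresholds.
  if (s.tokens24h > 1.5 * s.tokens7dDailyAverage) alerts.push({ severity: "warning", signal: "token-usage" });
  if (s.sqlDiskUtilization > 0.8) alerts.push({ severity: "warning", signal: "sql-disk" });
  return alerts;
}
```

The consecutive-failure and sustained-instance-count signals are omitted because they need run history rather than a single snapshot.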

5.3 Circuit Breakers

Colony implements per-org circuit breakers for the outbound engine. These are defined in the application layer (see docs/phase1/action_plans/22_circuit_breakers.md and docs/phase2/action_plans/11_circuit_breakers_phase2.md) and enforce limits on outbound volume per organization. When a breaker trips:

The affected org's outbound sequences pause automatically.
The HOT reply queue continues to function (inbound-initiated replies are not blocked).
The admin or owner for the org sees an alert on /overview and /outbound.
An operator can inspect breaker state via the Inngest Dashboard — look for paused or cancelled runs in the outbound function family.

Phase 2 circuit breakers extend this model to the discovery pipeline (see docs/phase2/testing/11_breakers_phase2.spec.ts).
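
The outbound breaker behaviour listed above can be illustrated with a small in-memory model. This is not the implementation from the action plans: the class name, the single per-window send limit, and the manual `reset` are assumptions made for illustration, and the real limits and storage are defined in the application layer.

```typescript
// Hypothetical per-org breaker: trips when sends in the current window exceed the limit.
class OutboundBreaker {
  private limitPerWindow: number;
  private counts: { [orgId: string]: number } = {};
  private tripped: { [orgId: string]: boolean } = {};

  constructor(limitPerWindow: number) {
    this.limitPerWindow = limitPerWindow;
  }

  // Record an attempted send and return true if it may proceed.
  // Inbound-initiated (HOT queue) replies are never blocked, as in section 5.3.
  allowSend(orgId: string, inboundInitiated: boolean): boolean {
    if (inboundInitiated) return true;
    if (this.tripped[orgId] === true) return false;
    const next = (this.counts[orgId] || 0) + 1;
    if (next > this.limitPerWindow) {
      this.tripped[orgId] = true; // sequences for this org pause from here on
      return false;
    }
    this.counts[orgId] = next;
    return true;
  }

  isTripped(orgId: string): boolean {
    return this.tripped[orgId] === true;
  }

  // Called at the start of a new window or after operator review.
  reset(orgId: string): void {
    delete this.counts[orgId];
    delete this.tripped[orgId];
  }
}
```

State is keyed by organization, so one org tripping its breaker does not affect sends for any other org.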

5.4 Escalation Path

L1 — On-call engineer
  ├── Reviews Sentry for error detail
  ├── Reviews Inngest Dashboard for failed runs
  ├── Reviews Cloud Run logs in GCP Console
  └── If database: connects via Cloud SQL Auth Proxy

L2 — Infrastructure engineer
  ├── Reviews Cloud Monitoring metrics
  ├── Reviews Terraform state for drift (terraform plan)
  └── Executes rollback if L1 remediation fails

L3 — Engineering lead
  └── Authorizes schema rollback, emergency secret rotation,
      or extended maintenance window

6. Support Workflows

6.1 Ticket Triage

Incoming support requests should be classified against the following taxonomy before investigation:

Class	Description	First Step
AUTH-*	User cannot sign in, wrong role, org access denied	Check Clerk Dashboard → Users and Webhook delivery log
VAULT-*	Integration not working, API key validation fails	Check `/settings/keys` in the org context; verify key format
AGENT-*	Orchestrator not responding, tool call hangs, empty output	Check Inngest Dashboard for the corresponding run; check Langfuse for trace
OUTBOUND-*	Sequences not sending, circuit breaker tripped	Check outbound circuit breaker state in Inngest; check Unipile / Resend vault keys
PIPELINE-*	Pipedrive sync missing data, deal stage mismatch	Check Pipedrive custom field configuration per `docs/phase1/runbooks/pipedrive-fields.md`
RECORDING-*	Meet Notes not appearing, signals not extracted	Check Google Drive service account permissions; check Inngest for recording ingestion run
ONBOARDING-*	Deployment Kit not generated on Closed-Won	Verify deal stage mapping; check Inngest for onboarding agent trigger
INFRA-*	Service unavailable, elevated error rate	Check Cloud Run revisions and GCP Cloud Monitoring
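
The taxonomy above lends itself to a lookup keyed on the ticket prefix. The sketch assumes ticket identifiers take the form CLASS-suffix (for example VAULT-17), which the source implies with the `*` wildcard but does not define; `firstStepFor` is a hypothetical helper and the step texts are abbreviated from the table.

```typescript
// First investigation step per ticket class, abbreviated from the triage table.
const TRIAGE_FIRST_STEP: { [ticketClass: string]: string } = {
  AUTH: "Check Clerk Dashboard users and webhook delivery log",
  VAULT: "Check /settings/keys in the org context; verify key format",
  AGENT: "Check Inngest Dashboard for the run; check Langfuse for the trace",
  OUTBOUND: "Check outbound circuit breaker state; check Unipile / Resend vault keys",
  PIPELINE: "Check Pipedrive custom field configuration",
  RECORDING: "Check Google Drive service account permissions; check Inngest ingestion run",
  ONBOARDING: "Verify deal stage mapping; check Inngest for onboarding agent trigger",
  INFRA: "Check Cloud Run revisions and GCP Cloud Monitoring",
};

// Returns the first step for a ticket id, or null if the class is not in the taxonomy.
function firstStepFor(ticketId: string): string | null {
  const match = /^([A-Z]+)-/.exec(ticketId);
  if (match === null) return null;
  const ticketClass = match[1];
  if (ticketClass === undefined) return null;
  const step = TRIAGE_FIRST_STEP[ticketClass];
  return step === undefined ? null : step;
}
```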

6.2 Common Issues and Resolution Playbooks

PLAY-001: User Has Wrong Role After Org Change

Symptom: User reports they cannot access /knowledge or /settings/keys despite being promoted to admin.

Steps:

Open Clerk Dashboard → Organizations → {Org} → Members.
Confirm the role displayed in Clerk matches the expected role.
Navigate to Clerk Dashboard → Webhooks → {Colony endpoint} → Attempts.
Find the most recent organizationMembership.updated event for this user.
If delivery failed, click Retry. The Colony database will update within 30 seconds of successful delivery.
If Clerk shows the correct role and the webhook delivered successfully, connect to the database and inspect the users or memberships table for the user record.

PLAY-002: Org Integration Key Not Working (Colony Vault)

Symptom: Agent reports a tool call failure for Pipedrive / Unipile / Resend / Google Drive.

Steps:

Ask the org admin to navigate to /settings/keys and verify the key is present and saved (keys are stored but not displayed in plaintext post-save).
If the key was recently rotated in the third-party system, the admin must delete and re-enter it in the vault UI.
If the vault UI is not accessible, connect to Cloud SQL and query:
```
SELECT org_id, provider, updated_at FROM api_keys WHERE org_id = '{ORG_ID}';
```
Do not select the encrypted value column in a logged session.
After re-entry, trigger a test agent run from /overview to verify connectivity.
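
If the metadata query in this playbook is run from a script rather than psql, the org ID should be passed as a bind parameter instead of being interpolated into the SQL text. The sketch below only builds the query object; the `{ text, values }` shape matches what node-postgres accepts in `client.query`, but whether Colony uses that driver is an assumption, and `vaultMetadataQuery` is a hypothetical helper.

```typescript
// Query object in the { text, values } form used by node-postgres (assumed driver).
interface VaultQuery {
  text: string;
  values: string[];
}

// Metadata columns only; the encrypted value column is deliberately excluded.
const VAULT_METADATA_COLUMNS = ["org_id", "provider", "updated_at"];

function vaultMetadataQuery(orgId: string): VaultQuery {
  return {
    // $1 is a bind placeholder, so the org id never becomes part of the SQL text.
    text: `SELECT ${VAULT_METADATA_COLUMNS.join(", ")} FROM api_keys WHERE org_id = $1`,
    values: [orgId],
  };
}
```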

PLAY-003: Inngest Function Stuck or Failing

Symptom: An agent operation initiated from /overview or a background job (daily brief, recording ingestion, batch scrape) never completes.

Steps:

Open Inngest Dashboard → Functions.
Filter by function name (e.g., gtm-orchestrator, recording-ingestion, onboarding-kit-generator).
Inspect the failed run: review event payload, step output, and error message.
If the failure is a transient third-party error (rate limit, timeout), click Replay on the run.
If the failure is a schema validation error or unexpected payload shape, this likely indicates a code bug. Capture the run ID and file an AGENT-* ticket with the Inngest run URL.
If the function is stuck (in-progress for > 15 minutes), it may be holding a database connection. Check Cloud SQL connection count in Cloud Monitoring. If connections are exhausted, the Cloud Run service may need to be redeployed to recycle connection pools.
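
The decision points in this playbook reduce to a small function. This is a sketch: the `RunInfo` shape, the failure classification, and the action labels are hypothetical, and classifying a failure as transient or schema-related still requires a person to read the error in the Inngest Dashboard.

```typescript
type RunState = "failed" | "running" | "completed";
type FailureKind = "transient" | "schema" | "unknown";

interface RunInfo {
  state: RunState;
  startedAtMs: number;       // epoch milliseconds
  failureKind?: FailureKind; // set by the operator after reading the error
}

type RunAction =
  | "none"
  | "replay"
  | "file-agent-ticket"
  | "check-sql-connections"
  | "inspect-manually";

// A run in progress for longer than 15 minutes is treated as stuck.
const STUCK_AFTER_MS = 15 * 60 * 1000;

function nextActionForRun(run: RunInfo, nowMs: number): RunAction {
  if (run.state === "running") {
    return nowMs - run.startedAtMs > STUCK_AFTER_MS ? "check-sql-connections" : "none";
  }
  if (run.state === "failed") {
    if (run.failureKind === "transient") return "replay";
    if (run.failureKind === "schema") return "file-agent-ticket";
    return "inspect-manually";
  }
  return "none";
}
```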

PLAY-004: Playwright Batch Run Partial Failure

Symptom: runners/.artifacts/pipedrive-batch/{TIMESTAMP}/summary.json shows errors > 0 or import count is lower than expected.

Steps:

Open batch.log for the overall execution timeline.
Check collect-N.log files for the specific collector that failed — look for HTTP errors, selector mismatches, or authentication failures.
Check import-N.log for database write errors — look for constraint violations or connection errors.
If the failure is an authentication issue with Pipedrive, the Pipedrive session cookie in the runner configuration may have expired. Update the credentials in GCP Secret Manager.
Re-run the batch for the failed segment only (do not re-run already-imported segments to avoid duplicates).

PLAY-005: Daily Brief Not Delivered

Symptom: User reports no brief email and no brief on /overview for a given morning.

Steps:

Check Inngest Dashboard for the daily-brief function scheduled run for that date. Verify it triggered and completed.
If the Inngest run succeeded but no email was received, check Resend delivery logs for the user's email address.
If Resend shows delivery but the brief is absent from /overview, the brief was likely written to the database but the UI query is failing — check Sentry for errors on the /overview route around that time.
If the Inngest run did not trigger, verify the cron schedule configuration has not drifted. Review the function definition.

PLAY-006: Disaster Recovery

Per docs/phase2/runbooks/04_dr_recovery.md, the DR procedure covers:

Cloud SQL point-in-time recovery from automated backups
GCS object restoration from versioned bucket
Cloud Run service re-deployment from the last known good container image revision

The DR runbook must be followed in sequence. Do not attempt manual data restoration outside the documented procedure.

7. Environment Management

7.1 Environment Topology

Environment	Infrastructure	Purpose
Local	`infrastructure/docker-compose.yml` + `.env`	Developer iteration; Postgres only, no GCS or Secret Manager
Development / CI	Shared GCP project or per-PR ephemeral	Integration testing; GitHub Actions Playwright runner
Production	GCP project `colony-39989`, Cloud Run, Cloud SQL, GCS	Live system

7.2 Local Environment Configuration

The .env file at the repository root provides local overrides. It is gitignored. Developers must populate it with:

Variable	Source
`DATABASE_URL`	Local Docker Postgres or development Cloud SQL (via proxy)
`CLERK_SECRET_KEY`	Clerk development instance
`CLERK_PUBLISHABLE_KEY`	Clerk development instance
`INNGEST_SIGNING_KEY`	Inngest dev environment
`INNGEST_EVENT_KEY`	Inngest dev environment
`LANGFUSE_SECRET_KEY`	Langfuse cloud or self-hosted
`LANGFUSE_PUBLIC_KEY`	Langfuse cloud or self-hosted
`SENTRY_DSN`	Sentry project DSN (development project)
`GCP_KMS_KEY_NAME`	KMS key for Colony Vault encryption (dev key)
`GCS_BUCKET_NAME`	`gs://colony-assets` or a dev bucket
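
A startup check can catch an incomplete .env before the application fails in a less obvious way. The variable names below are taken from the table above; `missingEnvVars` is a hypothetical helper, and treating whitespace-only values as missing is an assumption of this sketch.

```typescript
// Variables from the table in section 7.2.
const REQUIRED_LOCAL_VARS = [
  "DATABASE_URL",
  "CLERK_SECRET_KEY",
  "CLERK_PUBLISHABLE_KEY",
  "INNGEST_SIGNING_KEY",
  "INNGEST_EVENT_KEY",
  "LANGFUSE_SECRET_KEY",
  "LANGFUSE_PUBLIC_KEY",
  "SENTRY_DSN",
  "GCP_KMS_KEY_NAME",
  "GCS_BUCKET_NAME",
];

// Returns the names of required variables that are absent or blank.
function missingEnvVars(env: { [name: string]: string | undefined }): string[] {
  return REQUIRED_LOCAL_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

In a Node.js process this would typically be called with the process environment, failing fast when the returned list is not empty.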

7.3 Production Environment Configuration

All production secrets are stored in GCP Secret Manager and injected via gcloud run deploy --update-secrets. They are never stored in .env, Terraform variable files, or application code. The canonical list of secrets is defined in infrastructure/gcp/secrets.tf.

7.4 Database Initialization

Local:

docker compose -f infrastructure/docker-compose.yml up -d
# Runs infrastructure/postgres/init/01-multiple-dbs.sh
# Runs infrastructure/postgres/init/02-extensions.sql automatically

Cloud SQL (production bootstrap): Follow docs/phase1/runbooks/db-bootstrap.md — extensions (pgvector, pgcrypto) must be manually enabled on the Cloud SQL instance after provisioning, as the init scripts are not run against managed instances.

7.5 Cost Monitoring

Per docs/phase2/runbooks/05_cost_dashboard.md, a GCP cost dashboard is maintained for the Colony project. Key cost drivers to monitor monthly: Cloud Run compute, Cloud SQL storage and IOPS, GCS egress, Gemini API token consumption, and Langfuse hosted usage.

8. Secret Management

8.1 Two-Vault Architecture

Colony operates two intentionally separate secret stores:

Store A — GCP Secret Manager (platform-level) Holds secrets required to bootstrap and operate the platform itself. These are not tenant-specific and are managed by the infrastructure team.

Store B — Colony Vault (per-org, api_keys table) Holds tenant-specific integration credentials. Rows are encrypted at the column level using pgcrypto with a GCP KMS key. No plaintext credential is ever stored in this table.

8.2 GCP Secret Manager Inventory

Secret Name	Contents	Consumer
`clerk-secret-key`	Clerk backend API key	Next.js API routes, webhook validation
`inngest-signing-key`	Inngest function signing key	Inngest function handler
`inngest-event-key`	Inngest event ingestion key	Application event dispatch
`colony-db-url`	Cloud SQL connection string	Next.js DB client, migration runner
`langfuse-secret-key`	Langfuse API secret	LLM client wrapper
`langfuse-public-key`	Langfuse API public key	LLM client wrapper
`sentry-dsn`	Sentry project DSN	Sentry Next.js SDK
`gcp-kms-key-name`	KMS key resource name for Colony Vault	`api_keys` encryption/decryption
`bootstrap-llm-key`	Platform-level LLM key (Gemini / OpenAI)	Agent runtime bootstrap

Secrets are versioned in GCP Secret Manager. The latest alias always points to the active version. When rotating, create a new version first, update the Cloud Run deployment to bind latest, verify functionality, then disable the prior version.

8.3 Colony Vault Inventory (`api_keys` table)

Per-organization. One row per integration per org. The provider column identifies the integration.

Provider Key	Integration	Used By
`pipedrive`	Pipedrive CRM API key	Pipeline bi-sync, Playwright runner imports
`unipile`	Unipile LinkedIn/messaging API	Outbound sequence delivery, prospect discovery
`resend`	Resend transactional email	Outbound email sequences, daily brief delivery
`google_drive`	Google Drive service account credentials	Recording ingestion, Meet Notes pipeline
`google_oauth`	Google OAuth tokens (Gmail, Calendar)	Gmail integration, Calendar integration (phase 2)
`serpapi`	SerpAPI key	Prospect discovery web search (phase 2)
`notion`	Notion integration token	Knowledge Core sync to Notion (phase 2)

Keys are entered by org admin users via /settings/keys. Operators do not routinely access vault contents. Emergency access requires Cloud SQL Auth Proxy, the KMS key, and explicit approval from the Engineering lead (L3 escalation).

8.4 Secret Rotation Policy

Secret	Rotation Trigger	Rotation Procedure
GCP Secret Manager secrets	Annually, or on personnel change, or on suspected compromise	Create new version in Secret Manager → redeploy Cloud Run with `--update-secrets` → disable old version
Colony Vault keys (per-org)	On third-party credential rotation or org admin request	Org admin deletes old key entry in `/settings/keys` and adds new value
Clerk signing keys	Per Clerk rotation schedule or on compromise	Rotate in Clerk Dashboard → update Secret Manager → redeploy
Cloud SQL password	On DBA personnel change	Rotate in Cloud SQL → update `colony-db-url` secret → redeploy

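After any GCP Secret Manager rotation, redeploy Cloud Run so that every instance picks up the new version. Secrets exposed as environment variables are resolved when an instance starts, so instances that are already running keep the old value even when the binding uses latest. If a pinned version (not latest) was previously in use, the binding itself must also be updated to the new version.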

8.5 KMS Configuration

The GCP KMS key used for Colony Vault encryption is referenced by the gcp-kms-key-name secret. The key is a symmetric encryption key in the colony-39989 project keyring. Key rotation is managed through GCP KMS key rotation policy (recommended: 90-day automatic rotation). Rotation does not require re-encryption of existing rows: Cloud KMS retains prior key versions, and ciphertext produced under an older version remains decryptable as long as that version is not disabled or destroyed.

9. Health Checks and Status

9.1 Cloud Run Health Check

Cloud Run performs startup and liveness checks against the deployed service. The health check path should be configured in infrastructure/gcp/cloud-run-runner.tf. Next.js does not ship a built-in health route, so the application must implement one; the expected endpoint is:

GET /api/health

This endpoint should return HTTP 200 with a JSON body confirming database connectivity and any critical dependency status. If it does not exist as a discrete route, Cloud Run falls back to checking that the container is listening on the configured port.
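
The response described above can be assembled from individual dependency checks. This is a minimal sketch: the `DependencyCheck` shape, the JSON body layout, and the use of HTTP 503 when a critical dependency is down are assumptions, since the source only specifies HTTP 200 with a JSON body on success.

```typescript
interface DependencyCheck {
  name: string;      // for example "database"
  healthy: boolean;
  critical: boolean; // a failing critical dependency fails the whole check
}

interface HealthReport {
  status: 200 | 503;
  body: { ok: boolean; checks: { [name: string]: "up" | "down" } };
}

// Fold individual checks into the status code and JSON body for GET /api/health.
function buildHealthReport(checks: DependencyCheck[]): HealthReport {
  const summary: { [name: string]: "up" | "down" } = {};
  let ok = true;
  for (const check of checks) {
    summary[check.name] = check.healthy ? "up" : "down";
    if (check.critical && !check.healthy) ok = false;
  }
  return { status: ok ? 200 : 503, body: { ok, checks: summary } };
}
```

In an App Router route handler (hypothetically app/api/health/route.ts), the report could be returned with `Response.json(report.body, { status: report.status })` after running a trivial database query to populate the database check.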

9.2 Pub/Sub Topics

Per docs/phase2/runbooks/01_pubsub_topics.md, GCP Pub/Sub topics are used for event-driven pipeline components introduced in phase 2 (recording pipeline, Google Chat push notifications). Topic health can be monitored via GCP Cloud Monitoring → Pub/Sub → Message delivery metrics.

9.3 Inngest Function Health

The Inngest Dashboard provides a real-time view of all
