Cloud-based RDS managent on AWS
- Mark Kendall
- 3 days ago
- 4 min read
Think of it as self-service with guardrails. You want teams to move fast on day-to-day DB work (tables, indexes, connections) without a ticket queue, while the platform enforces safety, cost, and compliance.
Operating Model (in one glance)
• Golden paths per engine (Postgres, Oracle, SQL Server, DynamoDB) with ready-made templates.
• Everything via Git (GitOps): schemas, migrations, access, and infra are PR-driven.
• Short-lived credentials + network policy + audit by default.
• Observability baked in (Performance Insights/pg_stat_statements, CloudWatch, alarms).
⸻
1) Provisioning & Access (self-serve)
Platform provides:
• Terraform/CDK modules for each engine (RDS/Aurora, Oracle, SQL Server, DynamoDB) with:
• subnet groups, security groups, parameter groups
• backups, PITR, deletion protection
• RDS Proxy / PgBouncer for connection pooling (Lambdas, K8s)
• Secrets Manager (automatic rotation) or IAM DB Auth (Postgres/MySQL)
• Backstage/AWS Service Catalog forms to instantiate DBs by team/env.
Teams do:
• Create a repo db-<domain> with:
• /schema (DDL in migrations)
• /access (who can connect; least-privilege policies)
• /observability (dashboards/alerts)
Connection patterns:
• Lambdas: role → RDS Proxy → DB, secret fetched at runtime.
• K8s: IRSA service account → secret mount/env → PgBouncer/RDS Proxy.
• Human access: SSO to a bastion/SSM Session Manager + short-lived creds; no static users.
⸻
2) Day-to-Day Changes (tables, indexes, procs)
Rule: changes land via migrations, not consoles.
Migration tool (pick one): Flyway or Liquibase.
Workflow
1. Dev writes migration: V2025_09_04__add_order_idx.sql
2. PR runs CI dry-run against an ephemeral DB (or dev replica):
• lints DDL (no full-table locks on big tables, naming conventions)
• runs EXPLAIN checks for new indexes
• verifies rollbacks (where feasible)
3. On merge, CD applies migrations to dev → stage → prod with approvals.
Example Flyway layout
/migrations
└─ V2025_09_04__create_orders.sql
└─ V2025_09_04__add_orders_customer_idx.sql
Example GitHub Actions (Postgres)
name: db-migrations
on: [pull_request, push]
jobs:
validate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker://flyway/flyway:10
with:
args: >
-url=jdbc:postgresql://ephemeral-db:5432/app
-user=${{ secrets.DB_USER }}
-password=${{ secrets.DB_PASS }}
-locations=filesystem:migrations
-connectRetries=10
info validate migrate -target=latest -dryRunOutput=flyway.sql
⸻
3) Indexing & Performance Governance
• Naming convention: ix_<table>_<cols>, cap N indexes/table unless justified.
• Automated checks:
• reject duplicate/unused indexes via pg_stat_user_indexes
• require EXPLAIN (ANALYZE, BUFFERS) artifacts for queries tied to new indexes
• Insights: enable Performance Insights + pg_stat_statements; weekly report of top SQL.
• Safety: large tables require ONLINE/CONCURRENT index creation (Postgres CREATE INDEX CONCURRENTLY).
⸻
4) Multi-Engine Notes
Postgres/Aurora
• Prefer Aurora Serverless v2 for elastic capacity.
• Use IAM auth where possible; else Secrets Manager rotation.
• Read replicas for analytics; logical replication if needed.
Oracle / SQL Server (RDS)
• Standardize on Liquibase (good vendor support).
• Parameter groups enforce audit/tracing; AWR/Extended Events to CloudWatch Logs.
DynamoDB
• Changes via IaC only:
• table, GSIs/LSIs defined in Terraform/CDK
• on-demand for spiky loads; PITR on
• Data access by fine-grained IAM; add a small “table access” module teams can include.
• Model changes are code changes (no migrations), but enforce GSIs review in PR.
⸻
5) Connectivity Without Central Tickets
• Network: VPC endpoints + SGs managed by module; teams declare allowed SGs in PR (access/allowlist.tf).
• AuthZ: teams request roles via code (access/app-roles.tf) mapped to:
• db_readonly, db_writer, db_admin (admin reserved)
• Secrets: db/<env>/<service> path; rotation policy auto-applies.
• Catalog: register each schema in the data catalog (owner, PII flags, retention).
⸻
6) Observability & SRE Basics
• Dashboards per DB with:
• CPU/IOPS/latency, active connections, deadlocks, top queries
• Alarms:
• connection saturation, replication lag, disk space, slow query budget
• Runbooks in the repo (/runbooks): hot index creation, vacuum bloat, failover steps.
⸻
7) “Golden Templates” (copy/paste starters)
Terraform: Postgres DB + RDS Proxy (excerpt)
module "aurora_pg" {
source = "modules/aurora-postgres"
name = "catalog"
engine_version = "15"
vpc_id = var.vpc_id
subnets = var.db_subnets
backup_retention_days = 7
performance_insights = true
}
module "rds_proxy" {
source = "modules/rds-proxy"
name = "catalog-proxy"
target_cluster = module.aurora_pg.cluster_arn
secrets_manager_arn = module.aurora_pg.secret_arn
iam_auth = true
}
Liquibase migration (Oracle/SQL Server flavor)
-- changeset team:2025-09-04-create-customer
CREATE TABLE CUSTOMER (
ID VARCHAR2(36) PRIMARY KEY,
NAME VARCHAR2(200) NOT NULL,
CREATED_AT TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
DynamoDB table (CDK, TypeScript)
new dynamodb.Table(this, "Orders", {
tableName: "orders",
partitionKey: { name: "orderId", type: dynamodb.AttributeType.STRING },
billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
pointInTimeRecovery: true,
});
⸻
What this gives you
• Teams create/alter their own schemas with speed (PRs → pipelines), no DBA bottleneck.
• Platform ensures security, resilience, cost control, and auditability.
• Works uniformly across Postgres, Oracle, SQL Server, and DynamoDB.
Comments