Serverless Go — Professional
1. The production framing
Serverless Go in production is rarely "write the handler, deploy the zip". It's a deployment pipeline, an infrastructure-as-code definition, an observability story, a secrets policy, a multi-region plan, and a cost dashboard — with the Go code being the smallest part. The professional job, roughly:
- Build reproducibly in CI with locked Go version, locked dependencies, locked architecture.
- Deploy via IaC (SAM / Terraform / Serverless Framework / CDK) so the function and its IAM, alarms, and triggers move together.
- Observe cold starts, errors, throttles, durations, downstream call rates, and dependency call latencies as separate signals.
- Secure with least-privilege IAM, secrets via Secrets Manager / Parameter Store, and per-environment KMS keys.
- Plan for region failover, traffic shifting on deploy, and quota-exhaustion incidents.
The rest of this file is what that looks like.
2. The reproducible build pipeline
CI must produce the same bytes for the same commit, on any machine. Drift in the resulting bootstrap binary breaks layer caching, complicates incident review, and makes "is this the binary we tested?" unanswerable.
# .github/workflows/build.yml
name: build
on: [push]
jobs:
  build:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: 1.24.2 # pinned; not "stable"
          check-latest: false
      - name: Build
        env:
          GOOS: linux
          GOARCH: arm64
          CGO_ENABLED: 0
          GOFLAGS: -trimpath
        run: |
          go build -tags lambda.norpc \
            -ldflags="-s -w -buildid=" \
            -o bootstrap ./cmd/lambda
      - name: Verify reproducibility
        run: |
          shasum -a 256 bootstrap > checksum.txt
          cat checksum.txt
      - uses: actions/upload-artifact@v4
        with: { name: bootstrap, path: bootstrap }
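Three details:
- -trimpath removes absolute paths from the binary.
- -buildid= clears the build ID, which can otherwise differ between build environments.
- GOFLAGS=-trimpath applies the flag to every go command in the step, so a later go build or go test cannot silently drop it.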
The resulting binary is byte-identical across machines for the same commit and Go version. Verify with shasum -a 256.
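Printing one hash records a build but does not compare it with anything. As a minimal sketch of the comparison step, the program below hashes two build outputs and reports whether they match. The helper names (sumHex, fileSHA256, sameBytes) and the command-line usage are hypothetical; in CI the two files would come from two independent builds of the same commit.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// sumHex returns the hex-encoded SHA-256 digest of data.
func sumHex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// fileSHA256 reads a file fully and returns its digest. A Lambda bootstrap
// binary is small enough that reading it into memory is acceptable here.
func fileSHA256(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return sumHex(data), nil
}

// sameBytes reports whether two files have identical SHA-256 digests.
func sameBytes(a, b string) (bool, error) {
	ha, err := fileSHA256(a)
	if err != nil {
		return false, err
	}
	hb, err := fileSHA256(b)
	if err != nil {
		return false, err
	}
	return ha == hb, nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: samebytes <build-a> <build-b>")
		os.Exit(2)
	}
	ok, err := sameBytes(os.Args[1], os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	if !ok {
		fmt.Println("builds differ")
		os.Exit(1)
	}
	fmt.Println("builds are byte-identical")
}
```

A non-zero exit code lets a CI step fail the pipeline when two builds of the same commit diverge.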
3. SAM template
AWS SAM (Serverless Application Model) is the AWS-native IaC for serverless. It compiles to CloudFormation.
# template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Globals:
  Function:
    Runtime: provided.al2023
    Architectures: [arm64]
    Timeout: 10
    MemorySize: 512
    Tracing: Active
    Environment:
      Variables:
        LOG_LEVEL: info
        GOMEMLIMIT: 460MiB
Resources:
  HelloFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: !Sub ${AWS::StackName}-hello
      CodeUri: ./bootstrap
      Handler: bootstrap
      Policies:
        - AWSLambdaBasicExecutionRole
        - DynamoDBCrudPolicy:
            TableName: !Ref ItemsTable
      Events:
        Api:
          Type: HttpApi
          Properties:
            Path: /hello/{id}
            Method: GET
  ItemsTable:
    Type: AWS::Serverless::SimpleTable
    Properties:
      PrimaryKey: { Name: id, Type: String }
Build and deploy with sam build followed by sam deploy. sam build knows how to compile Go: it runs go build with suitable flags as long as the function resource's Metadata section sets BuildMethod: go1.x. For full control over flags (as in §2), use a Makefile-based build (BuildMethod: makefile) instead.
4. Terraform alternative
For multi-cloud or multi-account setups, Terraform is the lingua franca:
resource "aws_lambda_function" "hello" {
  function_name    = "hello-go"
  filename         = "function.zip"
  source_code_hash = filebase64sha256("function.zip")
  handler          = "bootstrap"
  runtime          = "provided.al2023"
  architectures    = ["arm64"]
  memory_size      = 512
  timeout          = 10
  role             = aws_iam_role.lambda.arn

  environment {
    variables = {
      LOG_LEVEL  = "info"
      GOMEMLIMIT = "460MiB"
    }
  }

  tracing_config { mode = "Active" }
}

resource "aws_apigatewayv2_api" "http" {
  name          = "hello-api"
  protocol_type = "HTTP"
}

resource "aws_apigatewayv2_integration" "hello" {
  api_id                 = aws_apigatewayv2_api.http.id
  integration_type       = "AWS_PROXY"
  integration_uri        = aws_lambda_function.hello.invoke_arn
  payload_format_version = "2.0"
}
The trade-off: SAM is more concise for Lambda-only stacks; Terraform wins when Lambda is one piece of a larger system that includes EKS, RDS, VPC, etc.
5. CI/CD with canary deployments
A production deploy never just replaces the alias all at once:
HelloFunction:
  Type: AWS::Serverless::Function
  Properties:
    ...
    AutoPublishAlias: live
    DeploymentPreference:
      Type: Canary10Percent5Minutes
      Alarms:
        - !Ref HelloErrorAlarm
This makes SAM (via CodeDeploy):
- Publish a new version on every deploy.
- Shift 10 % of traffic via the live alias to the new version.
- Wait 5 minutes.
- Shift the remaining 90 % if the HelloErrorAlarm hasn't fired.
- Roll back automatically if the alarm fires.
The error alarm must be defined on per-version metrics (for example the Errors metric with the ExecutedVersion dimension); then an error spike on the new version triggers rollback without affecting the old version. Configurations: Canary10Percent5Minutes, Linear10PercentEvery1Minute, AllAtOnce (don't), and custom.
6. Secrets and config
| Type | Store | Access pattern |
|---|---|---|
| Configuration (table names, endpoints) | Environment variables (in IaC) | os.Getenv |
| Sensitive config (API keys, signing secrets) | SSM Parameter Store (SecureString) | Lazy fetch + cache |
| Database credentials | Secrets Manager (auto-rotation) | Lazy fetch + cache + handle rotation |
| Encryption keys | KMS | Don't fetch the key; sign/decrypt remotely |
Pattern for SSM:
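import (
	"context"
	"sync"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ssm"
)

// SecretCache holds one SSM parameter value in memory until it expires.
// Use one cache per parameter name.
type SecretCache struct {
	mu      sync.Mutex
	client  *ssm.Client // built once at init with ssm.NewFromConfig
	value   string
	expires time.Time // moment after which the value is refetched
}

// Get returns the cached value, refetching from SSM once the TTL has passed.
func (c *SecretCache) Get(ctx context.Context, name string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.value != "" && time.Now().Before(c.expires) {
		return c.value, nil
	}
	out, err := c.client.GetParameter(ctx, &ssm.GetParameterInput{
		Name:           aws.String(name),
		WithDecryption: aws.Bool(true),
	})
	if err != nil {
		return "", err
	}
	c.value = aws.ToString(out.Parameter.Value)
	c.expires = time.Now().Add(5 * time.Minute)
	return c.value, nil
}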
The TTL is a trade-off: too short and you hammer SSM; too long and rotation propagates slowly. Five minutes is a fair default.
For Secrets Manager rotation, subscribe the function to the rotation lifecycle (an EventBridge, formerly CloudWatch Events, rule matching source aws.secretsmanager) or accept that the in-memory cache lags by up to the TTL.
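If the cache is allowed to lag, one common way to bound the damage is to treat an authentication failure as a hint that the secret rotated: refetch once and retry. The sketch below illustrates that idea with the standard library only. The errAuth sentinel, the get and op callbacks, and the simulate helper are all hypothetical stand-ins; a real handler would map its driver's "credentials rejected" error onto this path.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errAuth is a hypothetical sentinel for "credentials rejected"; the real
// error depends on the database driver or API client in use.
var errAuth = errors.New("authentication failed")

// withRotationRetry runs op with the cached secret. If op reports an
// authentication failure, it forces one refresh and retries exactly once.
func withRotationRetry(
	ctx context.Context,
	get func(ctx context.Context, forceRefresh bool) (string, error),
	op func(ctx context.Context, secret string) error,
) error {
	secret, err := get(ctx, false)
	if err != nil {
		return err
	}
	if err = op(ctx, secret); !errors.Is(err, errAuth) {
		return err // success, or a failure unrelated to credentials
	}
	// The cached value may predate a rotation: refetch and retry once.
	if secret, err = get(ctx, true); err != nil {
		return err
	}
	return op(ctx, secret)
}

// simulate wires withRotationRetry to in-memory stand-ins and returns how
// many times op was attempted, together with the final outcome.
func simulate(cached, current string) (int, error) {
	attempts := 0
	get := func(_ context.Context, force bool) (string, error) {
		if force {
			cached = current // pretend to refetch from the secret store
		}
		return cached, nil
	}
	op := func(_ context.Context, secret string) error {
		attempts++
		if secret != current {
			return errAuth
		}
		return nil
	}
	err := withRotationRetry(context.Background(), get, op)
	return attempts, err
}

func main() {
	attempts, err := simulate("old-password", "new-password")
	fmt.Println(attempts, err) // prints: 2 <nil>
}
```

Retrying only once keeps a genuinely wrong credential from turning into a retry loop against the secret store.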
7. Observability stack
Three signals per function, no fewer:
| Signal | Source | Dashboard tile |
|---|---|---|
| Duration | CloudWatch metric AWS/Lambda Duration | p50, p95, p99 |
| Errors | AWS/Lambda Errors and Throttles | Stacked count |
| Cold starts | Log Insights query on Init Duration | Rate (% of invocations) |
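The Init Duration query is the source named in the table. As a complementary sketch, a handler can also tag its own log lines with a cold-start flag, which makes the rate easy to compute from structured logs. This assumes the usual Lambda model of one invocation at a time per execution environment, so the package-level flag needs no locking; the names coldStart and consumeColdStart are hypothetical.

```go
package main

import (
	"log/slog"
	"os"
)

// coldStart is true only until the first invocation in this execution
// environment has been observed.
var coldStart = true

// consumeColdStart reports whether this is the first invocation handled by
// the environment and clears the flag for all later invocations.
func consumeColdStart() bool {
	first := coldStart
	coldStart = false
	return first
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	// Two iterations stand in for two invocations landing on the same
	// warm environment: the first logs cold_start=true, the second false.
	for i := 0; i < 2; i++ {
		logger.Info("invocation", "cold_start", consumeColdStart())
	}
}
```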
Add for HTTP-fronted functions:
| Signal | Source |
|---|---|
| API Gateway 4xx/5xx | AWS/ApiGateway metrics |
| API Gateway integration latency | IntegrationLatency metric |
Distributed tracing via OpenTelemetry:
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// The otelaws package
// (go.opentelemetry.io/contrib/instrumentation/github.com/aws/aws-sdk-go-v2/otelaws)
// is imported where the SDK clients are built, not here.

// initTracer installs a global tracer provider that batches spans to an
// OTLP/gRPC collector.
func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp)
	return tp, nil
}
For Lambda, point the exporter at the collector that ships in the AWS Distro for OpenTelemetry Lambda layer instead of a remote endpoint. The layer runs as an extension, buffers spans, and ships them out-of-band so you don't block the handler on export.
Propagate trace IDs across event sources:
| Boundary | Propagation |
|---|---|
| API Gateway → Lambda | X-Amzn-Trace-Id header (X-Ray) or traceparent (W3C, via integration mapping) |
| Lambda → SQS | Set MessageAttributes with the trace context |
| SQS → Lambda | Read MessageAttributes; pass to span builder |
| Lambda → DynamoDB / S3 / etc. | otelaws middleware on the SDK client |
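The SQS rows in the table can be sketched with the standard library alone. The example below writes a W3C traceparent value into a plain map that stands in for the string values of SQS MessageAttributes, then reads it back on the consumer side. The inject and extract helpers are hypothetical and hand-rolled for illustration; in a real handler the OpenTelemetry propagation package would normally do this through a carrier.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// traceparentRE matches version 00 of the W3C Trace Context header:
// 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>.
var traceparentRE = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// inject writes the trace context into attrs, a stand-in for the string
// values of SQS MessageAttributes on the producer side.
func inject(attrs map[string]string, traceID, spanID string, sampled bool) {
	flags := "00"
	if sampled {
		flags = "01"
	}
	attrs["traceparent"] = "00-" + traceID + "-" + spanID + "-" + flags
}

// extract reads the trace context back on the consumer side. ok is false
// when the attribute is missing, malformed, or carries all-zero IDs.
func extract(attrs map[string]string) (traceID, parentID string, sampled, ok bool) {
	m := traceparentRE.FindStringSubmatch(attrs["traceparent"])
	if m == nil {
		return "", "", false, false
	}
	if strings.Trim(m[1], "0") == "" || strings.Trim(m[2], "0") == "" {
		return "", "", false, false
	}
	flags, _ := strconv.ParseUint(m[3], 16, 8) // the regexp guarantees two hex digits
	return m[1], m[2], flags&1 == 1, true
}

func main() {
	attrs := map[string]string{}
	inject(attrs, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	fmt.Println(attrs["traceparent"]) // prints: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

	traceID, parentID, sampled, ok := extract(attrs)
	fmt.Println(traceID, parentID, sampled, ok)
}
```

The consumer would pass the extracted trace ID and parent ID to its span builder so the SQS hop appears as one continuous trace.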
8. The CloudWatch Logs Insights queries you actually need
# Cold starts per hour (divide by total invocations for a rate)
fields @timestamp, @initDuration
| filter ispresent(@initDuration)
| stats count() as cold_starts by bin(1h)

# p99 of invocation duration
fields @timestamp, @duration
| stats pct(@duration, 99) by bin(5m)

# Errors with request IDs (for correlation with traces)
fields @timestamp, @requestId, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Memory utilization in percent (right-size opportunity)
fields @timestamp, @maxMemoryUsed, @memorySize
| stats max(@maxMemoryUsed / @memorySize * 100) as util_pct by bin(1h)
Save these as the function's standard query set in the CloudWatch console. The memory utilization query in particular tells you when to drop a tier and save money.
9. Multi-region
Lambda is regional. Multi-region deployment patterns:
| Pattern | When | Mechanism |
|---|---|---|
| Active-active | Customer-facing, latency-critical | Route 53 latency routing + per-region API Gateway + DynamoDB Global Tables |
| Active-passive | DR for less-critical workloads | Route 53 failover + on-demand replication |
| Read-local / write-primary | Eventual-consistency-OK | Read from local DynamoDB replica; write through primary |
| Region-pinned | Data sovereignty | Per-region stack with regional routing at CDN |
Cross-region pitfalls in Go code:
- Don't hardcode us-east-1 in config.LoadDefaultConfig. The SDK reads AWS_REGION from Lambda's environment; trust it.
- DynamoDB Global Tables reconcile concurrent writes with last-writer-wins; your handler must tolerate eventually consistent reads.
- KMS keys are regional: data encrypted in eu-west-1 requires that region's CMK to decrypt. Plan envelope keys accordingly.
10. Cost dashboard
Per-function monthly cost has three components:
| Component | Formula |
|---|---|
| Invocation | requests × $0.20/M |
| Compute | requests × duration_s × memory_GB × $0.0000166667 (x86 rate per GB-second; arm64 is $0.0000133334) |
| Data transfer | mostly negligible inside AWS; egress to internet billed |
Plus, if used:
| Component | Formula |
|---|---|
| Provisioned concurrency | provisioned_GB-s × $0.0000041667 (x86, us-east-1) |
| CloudWatch Logs ingestion | GB ingested × $0.50 (us-east-1) |
| X-Ray traces | traces recorded × $5/M |
The hidden costs that bite teams: CloudWatch Logs ingestion (a chatty handler with slog.Debug everywhere can cost more than the Lambda itself), and trace sampling (X-Ray's default rule records the first request each second plus 5 % of the rest; a 100 % rate is fine in dev, ruinous in prod, so set the sampling rate explicitly).
Build a per-function cost panel via Cost Explorer's "Lambda" cost category. Tag every function with service, env, owner from IaC so the dashboard slices cleanly.
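To make the invocation and compute formulas concrete, here is a small worked sketch. It uses the per-request and per-GB-second prices quoted in the first table and ignores the free tier, data transfer, and the optional components; the monthlyCost helper and the sample workload (10 million requests, 120 ms average billed duration, 512 MB) are hypothetical.

```go
package main

import "fmt"

const (
	pricePerRequest  = 0.20 / 1_000_000 // USD per invocation
	pricePerGBSecond = 0.0000166667     // USD per GB-second, as quoted in the table
)

// monthlyCost returns the invocation and compute cost in USD for one
// function. avgDurationMs is the average billed duration per invocation and
// memoryMB is the configured memory size.
func monthlyCost(requests, avgDurationMs, memoryMB float64) (invocation, compute float64) {
	invocation = requests * pricePerRequest
	gbSeconds := requests * (avgDurationMs / 1000) * (memoryMB / 1024)
	compute = gbSeconds * pricePerGBSecond
	return invocation, compute
}

func main() {
	// 10M requests × 0.12 s × 0.5 GB = 600,000 GB-seconds.
	inv, comp := monthlyCost(10_000_000, 120, 512)
	fmt.Printf("invocation=$%.2f compute=$%.2f total=$%.2f\n", inv, comp, inv+comp)
	// prints: invocation=$2.00 compute=$10.00 total=$12.00
}
```

Halving either the duration or the memory size halves the compute term, which is why the memory utilization query in §8 pays for itself.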
11. Throttling and concurrency limits
Each AWS account has a regional concurrent execution limit (1000 by default; request raises to 10 000+). Each function can carve out a slice with ReservedConcurrentExecutions, for example 100.
Two effects:
- Cap: the function never exceeds 100 concurrent invocations. Excess invocations throttle (synchronous: HTTP 429 to the caller; async: Lambda retries the event from its internal queue, then hands it to the DLQ or on-failure destination).
- Floor: that 100 is reserved away from the account pool. Other functions can't use it.
For SQS-triggered Lambdas, the effective concurrency is min(ReservedConcurrentExecutions, MaximumConcurrency on the event-source mapping). Setting only the function-level reservation leaves the SQS pollers free to attempt more invocations than the cap allows; the excess is throttled, the messages return to the queue, and they can land in the redrive DLQ without ever being processed.
12. Dead-letter queues
Async invocations (SNS, S3, EventBridge) that still fail after Lambda's two automatic retries land in a DLQ if you configure one:
HelloFunction:
  Type: AWS::Serverless::Function
  Properties:
    ...
    DeadLetterQueue:
      Type: SQS
      TargetArn: !GetAtt HelloDLQ.Arn
DLQs require operational discipline: a CloudWatch alarm on ApproximateNumberOfMessagesVisible > 0, a runbook that says "replay or discard", and a dashboard that shows DLQ depth per function.
For SQS-source Lambdas, configure the SQS queue's own RedrivePolicy for the source queue. The function's DLQ is for async invocation failures, which by definition can't happen from a polling event source.
13. Standard library and ecosystem footguns
| Pattern | Why it bites in production |
|---|---|
| time.Now() in init for cache keys | Cold and warm starts produce different keys; cache thrashes |
| os.Hostname() for log fields | Returns localhost.localdomain on Lambda; useless |
| net.Dial with default timeout | No Go-level timeout, only the OS connect timeout; Lambda will kill the invocation before the dial gives up |
| database/sql without SetConnMaxLifetime | Stale connections accumulate; first request after freeze can hit a dead conn |
| Logging entire request bodies | CloudWatch ingestion cost dominates; also leaks PII |
| panic for control flow | Lambda kills the invocation; the caller sees 502 with no detail |
| Background goroutines after handler returns | Frozen mid-flight; resumes weirdly on next invocation |
Each of these has been a real production incident at some team. Code review for them once, then put them on a Lambda-specific checklist.
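For the net.Dial row, a minimal sketch of the usual remedy is a net.Dialer with an explicit Timeout, called through DialContext so the invocation's deadline also applies. The dialWithBudget and demo names, the 500 ms budget, and the local listener standing in for a downstream service are all hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// dialWithBudget opens a TCP connection but gives up after whichever comes
// first: the per-dial timeout or the deadline already carried by ctx.
func dialWithBudget(ctx context.Context, addr string, timeout time.Duration) (net.Conn, error) {
	d := net.Dialer{Timeout: timeout}
	return d.DialContext(ctx, "tcp", addr)
}

// demo dials a throwaway local listener, then checks that a context whose
// deadline has already passed aborts the dial instead of hanging.
func demo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	// In a handler, ctx would be the invocation context passed in by the runtime.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	conn, err := dialWithBudget(ctx, ln.Addr().String(), 500*time.Millisecond)
	if err != nil {
		return err
	}
	conn.Close()

	expired, cancelExpired := context.WithDeadline(context.Background(), time.Now().Add(-time.Second))
	defer cancelExpired()
	if c, err := dialWithBudget(expired, ln.Addr().String(), 500*time.Millisecond); err == nil {
		c.Close()
		return errors.New("expected the expired context to abort the dial")
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("dial respected both the timeout and the context deadline")
}
```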
14. The "function is degraded" runbook
When CloudWatch alarms fire:
- Check throttles. If Throttles > 0, either raise ReservedConcurrentExecutions or investigate why a downstream is slowing things down enough to consume concurrency.
- Check cold start rate. A spike in cold starts means new environments are being spun up; either traffic surged or environments are being killed for some reason.
- Check downstream call latency. If DynamoDB / RDS / external API latency is up, your handler durations grow and you exhaust concurrency.
- Sample recent error traces. X-Ray or CloudWatch Insights filter on errors; look for the common root cause.
- Roll back if a recent deploy correlates. The canary alarm should have caught it; if it didn't, the alarm needs tightening.
- Capture a pprof from a warm instance. Available via go.amzn.com/lambda/extensions or by exposing a port through API Gateway in a debug stack.
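Independent of how the profile leaves the function, the capture step itself needs only the standard library. The sketch below takes a heap profile into memory with runtime/pprof; the heapProfile name is hypothetical, and shipping the bytes somewhere durable (for example an object store) is assumed and left out.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// heapProfile returns a heap profile in the gzip-compressed protobuf format
// that go tool pprof reads.
func heapProfile() ([]byte, error) {
	runtime.GC() // bring the heap statistics up to date before sampling
	var buf bytes.Buffer
	if err := pprof.Lookup("heap").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	p, err := heapProfile()
	if err != nil {
		fmt.Println("profile failed:", err)
		return
	}
	// In a debug stack the handler would upload p; that step is omitted here.
	fmt.Printf("captured %d bytes of heap profile\n", len(p))
}
```

Triggering the capture from a debug-only event field or environment variable keeps the cost out of normal invocations.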
15. Summary
Production serverless Go is a system: reproducible byte-identical builds, IaC-managed deployments with canary rollback, three-signal observability (duration / errors / cold starts) plus per-function cost panels, lazy-fetched secrets with bounded TTL, multi-region planning, and reserved-concurrency budgeting. The Go code itself is small and disciplined: lazy init, context-honoring downstream calls, structured logs with request IDs. The operational surface area is large; treat each function like a microservice with full ownership of its own pipeline, alerts, runbook, and cost.
Further reading
- AWS SAM developer guide: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/
- Terraform AWS provider — Lambda: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function
- AWS Distro for OpenTelemetry on Lambda: https://aws-otel.github.io/docs/getting-started/lambda
- CodeDeploy traffic shifting: https://docs.aws.amazon.com/lambda/latest/dg/configuration-aliases.html
- Lambda Powertools (Go community port): https://github.com/aws-powertools/powertools-lambda-go