Go memory profiling for traffic spikes: pprof walkthrough
Go memory profiling helps you handle sudden traffic spikes. A hands-on pprof walkthrough to spot allocation hot spots in JSON, DB scans, and middleware.

What sudden traffic spikes do to a Go service’s memory
A “memory spike” in production rarely means one simple number went up. You might see RSS (process memory) climb fast while the Go heap barely moves, or the heap grows and drops in sharp waves as GC runs. At the same time, latency often gets worse because the runtime spends more time cleaning up.
Common patterns in metrics:
- RSS rises faster than expected and sometimes doesn’t fully drop after the spike
- Heap in-use rises, then falls in sharp cycles as GC runs more often
- Allocation rate jumps (bytes allocated per second)
- GC pause time and GC CPU time increase, even if each pause is small
- Request latency jumps and tail latency gets noisy
Traffic spikes magnify per-request allocations because “small” waste scales linearly with load. If one request allocates an extra 50 KB (temporary JSON buffers, per-row scan objects, middleware context data), then at 2,000 requests per second you’re feeding the allocator about 100 MB every second. Go can handle a lot, but the GC still has to trace and free those short-lived objects. When allocation outpaces cleanup, the heap target grows, RSS follows, and you can hit memory limits.
The symptoms are familiar: OOM kills from your orchestrator, sudden latency jumps, more time spent in GC, and a service that looks “busy” even when CPU isn’t pinned. You can also get GC thrash: the service stays up but keeps allocating and collecting so throughput drops right when you need it most.
pprof helps answer one question fast: which code paths allocate the most, and are those allocations necessary? A heap profile shows what’s retained right now. Allocation-focused views (like alloc_space) show what’s getting created and thrown away.
What pprof won’t do is explain every byte of RSS. RSS includes more than the Go heap (stacks, runtime metadata, OS mappings, cgo allocations, fragmentation). pprof is best at pointing to allocation hot spots in your Go code, not proving an exact container-level memory total.
Set up pprof safely (step by step)
pprof is easiest to use as HTTP endpoints, but those endpoints can reveal a lot about your service. Treat them like an admin feature, not a public API.
1) Add pprof endpoints
In Go, the simplest setup is to run pprof on a separate admin server. That keeps profiling routes away from your main router and middleware.
package main

import (
    "log"
    "net/http"

    // Blank import registers the /debug/pprof/* handlers on http.DefaultServeMux.
    _ "net/http/pprof"
)

func main() {
    go func() {
        // Admin only: bind to localhost so pprof isn't reachable from outside the host.
        log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
    }()

    // Your main server starts here...
    // http.ListenAndServe(":8080", appHandler)
    select {}
}
If you can’t open a second port, you can mount pprof routes into your main server, but it’s easier to expose them by accident. A separate admin port is the safer default.
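If you do take that route, one option is to register the handlers explicitly on your own mux instead of relying on the blank import and DefaultServeMux, so every pprof route stays visible in your routing code and is easy to wrap with auth. A sketch, with a hypothetical newMux constructor:
package server

import (
    "net/http"
    "net/http/pprof"
)

// newMux builds the main router with the pprof handlers registered explicitly.
func newMux() *http.ServeMux {
    mux := http.NewServeMux()

    // Your application routes go here...
    // mux.Handle("/api/", apiHandler)

    // Explicit pprof routes instead of the blank-import side effect.
    mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, allocs, goroutine, etc.
    mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

    return mux
}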
2) Lock it down before you deploy
Start with controls that are hard to mess up. Binding to localhost means the endpoints aren’t reachable from the internet unless someone also exposes that port.
A quick checklist:
- Run pprof on an admin port, not the main user-facing port
- Bind to 127.0.0.1 (or a private interface) in production
- Add an allowlist at the network edge (VPN, bastion, or internal subnet)
- Require auth if your edge can enforce it (basic auth or token)
- Verify you can fetch the profiles you’ll actually use: heap, allocs, goroutine
3) Build and roll out safely
Keep the change small: add pprof, ship it, and confirm it’s reachable only from where you expect. If you have staging, test there first by simulating some load and capturing a heap and allocs profile.
For production, roll out gradually (one instance or a small slice of traffic). If pprof is misconfigured, the blast radius stays small while you fix it.
Capture the right profiles during a spike
During a spike, a single snapshot is rarely enough. Capture a small timeline: a few minutes before the spike (baseline), during the spike (impact), and a few minutes after (recovery). That makes it easier to separate real allocation changes from normal warm-up behavior.
If you can reproduce the spike with controlled load, match production as closely as possible: request mix, payload sizes, and concurrency. A spike of small requests behaves very differently from a spike of large JSON responses.
Take both a heap profile and an allocation-focused profile. They answer different questions:
- Heap (inuse) shows what’s alive and holding memory right now
- Allocations (alloc_space or alloc_objects) show what’s being allocated heavily, even if it gets freed quickly
A practical capture pattern: grab one heap profile, then an allocation profile, then repeat 30 to 60 seconds later. Two points during the spike help you see whether a suspect path is steady or accelerating.
# examples: adjust host/port and timing to your setup
curl -o heap_during.pprof "http://127.0.0.1:6060/debug/pprof/heap"
curl -o allocs_30s.pprof "http://127.0.0.1:6060/debug/pprof/allocs?seconds=30"
Alongside pprof files, record a few runtime stats so you can explain what the GC was doing at the same time. Heap size, number of GCs, and pause time are usually enough. Even a short log line at each capture time helps correlate “allocations went up” with “GC started running constantly.”
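One way to grab those numbers is a small helper around runtime.ReadMemStats, called right before or after each capture; a sketch (the logMemStats name is just for illustration):
package capture

import (
    "log"
    "runtime"
    "time"
)

// logMemStats prints a one-line GC summary so each pprof capture can be matched
// with heap size and GC activity at the same moment. ReadMemStats briefly stops
// the world, so call it sparingly.
func logMemStats(label string) {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    log.Printf("memstats %s: heap_inuse=%dMB heap_alloc=%dMB num_gc=%d gc_cpu=%.3f pause_total=%s",
        label,
        m.HeapInuse>>20, // bytes to MiB
        m.HeapAlloc>>20,
        m.NumGC,
        m.GCCPUFraction,
        time.Duration(m.PauseTotalNs), // cumulative stop-the-world pause time
    )
}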
Keep incident notes as you go: build version (commit/tag), Go version, important flags, config changes, and what traffic was happening (endpoints, tenants, payload sizes). Those details often matter later when you compare profiles and realize the request mix shifted.
How to read heap and allocation profiles
A heap profile answers different questions depending on the view.
Inuse space shows what’s still in memory at the moment of capture. Use it for leaks, long-lived caches, or requests that leave objects behind.
Alloc space (total allocations) shows what was allocated over time, even if it was freed quickly. Use it when spikes cause high GC work, latency jumps, or OOMs from churn.
Sampling matters. Go doesn’t record every allocation. It samples allocations (controlled by runtime.MemProfileRate), so small, frequent allocations can be underrepresented and numbers are estimates. The biggest offenders still tend to stand out, especially under spike conditions. Look for trends and top contributors, not perfect accounting.
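If you suspect the default sampling is hiding many small allocations, you can raise the resolution for an investigation. A sketch, assuming you accept the extra overhead; the 64 KB value is only an example:
package main

import "runtime"

// A smaller MemProfileRate means more frequent samples. Set it as early as
// possible (the runtime assumes the rate is constant for the whole run) and
// treat it as a temporary debugging knob.
func init() {
    // Default is 512 * 1024: roughly one sampled allocation per 512 KB allocated.
    // A value of 1 records (almost) every allocation, which is expensive.
    runtime.MemProfileRate = 64 * 1024
}

func main() {
    // Start your servers as usual...
}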
The most useful pprof views:
- top: a quick read on who dominates inuse or alloc (check both flat and cumulative)
- list: line-level allocation sources inside a hot function
- graph: call paths that explain how you got there
Diffs are where it gets practical. Compare a baseline profile (normal traffic) with a spike profile to highlight what changed, instead of chasing background noise.
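One way to run that comparison is pprof's diff mode; a sketch, assuming you also saved a baseline capture as heap_baseline.pprof alongside the heap_during.pprof from earlier:
# compare a baseline heap profile against one captured during the spike
go tool pprof -diff_base=heap_baseline.pprof heap_during.pprof
# then, at the interactive prompt: top, list <function>, web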
Validate findings with a small change before a big refactor:
- Reuse a buffer (or add a small sync.Pool) in the hot path
- Cut per-request object creation (for example, avoid building intermediate maps for JSON)
- Re-profile under the same load and confirm the diff shrinks where you expected
If the numbers move the right way, you’ve found a real cause, not just a scary report.
Find allocation hot spots in JSON encoding
During spikes, JSON work can become a major memory bill because it runs on every request. JSON hot spots often show up as lots of small allocations that push GC harder.
Red flags to look for in pprof
If the heap or allocation view points at encoding/json, look closely at what you feed into it. These patterns commonly inflate allocations:
- Using map[string]any (or []any) for responses instead of typed structs
- Marshaling the same object multiple times (for example, logging it and also returning it)
- Pretty printing with json.MarshalIndent in production
- Building JSON through temporary strings (fmt.Sprintf, string concatenation) before marshaling
- Converting large []byte to string (or back) just to match an API
json.Marshal always allocates a new []byte for the full output. json.NewEncoder(w).Encode(v) usually avoids that one big buffer because it writes to an io.Writer, but it can still allocate internally, especially if v is full of any, maps, or pointer-heavy structures.
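A minimal sketch of the difference, assuming a hypothetical Order struct and writeOrders helper:
package server

import (
    "encoding/json"
    "net/http"
)

// Order is a hypothetical typed response model; concrete fields avoid the
// interface boxing you get with map[string]any.
type Order struct {
    ID         int64  `json:"id"`
    Status     string `json:"status"`
    TotalCents int64  `json:"total_cents"`
}

func writeOrders(w http.ResponseWriter, orders []Order) error {
    w.Header().Set("Content-Type", "application/json")

    // Path 1: json.Marshal builds the entire response as one []byte first.
    //   b, err := json.Marshal(orders)
    //   ... write b ...

    // Path 2: Encoder streams straight to the ResponseWriter and skips that
    // single large output buffer (it still allocates internally).
    return json.NewEncoder(w).Encode(orders)
}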
Quick fixes and quick experiments
Start with typed structs for your response shape. They reduce reflection work and avoid per-field interface boxing.
Then remove avoidable per-request temporaries: reuse bytes.Buffer via a sync.Pool (carefully), don’t indent in production, and don’t re-marshal just for logs.
Small experiments that confirm JSON is the culprit:
- Replace map[string]any with a struct for one hot endpoint and compare profiles
- Switch from Marshal to an Encoder writing directly to the response
- Remove MarshalIndent or debug-only formatting and re-profile under the same load
- Skip JSON encoding for unchanged cached responses and measure the drop
Find allocation hot spots in query scanning
When memory jumps during a spike, database reads are a common surprise. It’s easy to focus on SQL time, but the scan step can allocate a lot per row, especially when you scan into flexible types.
Common offenders:
- Scanning into interface{} (or map[string]any) and letting the driver decide types
- Converting []byte to string for every field
- Using nullable wrappers (sql.NullString, sql.NullInt64) in large result sets
- Pulling big text/blob columns you don't always need
One pattern that quietly burns memory is scanning row data into temporary variables, then copying into a real struct (or building a map per row). If you can scan straight into a struct with concrete fields, you avoid extra allocations and type checks.
Batch size and pagination change your memory shape. Fetching 10,000 rows into a slice allocates for slice growth and every row, all at once. If the handler only needs a page, push that into the query and keep the page size stable. If you must process lots of rows, stream them and aggregate small summaries instead of storing every row.
Large text fields need special care. Many drivers return text as []byte. Converting that to string copies the data, so doing it for every row can explode allocations. If you only need the value sometimes, delay the conversion or scan fewer columns for that endpoint.
To confirm whether the driver or your code is doing most of the allocating, check what dominates your profiles:
- If frames point to your mapping code, focus on scan targets and conversions
- If frames point into database/sql or the driver, reduce rows and columns first, then consider driver-specific options
- Check both alloc_space and alloc_objects; many tiny allocations can be worse than a few big ones
Example: a “list orders” endpoint scans SELECT * into []map[string]any. During a spike, each request builds thousands of small maps and strings. Changing the query to select only the needed columns and scanning into a typed []Order slice (with concrete fields like ID int64, Status string, TotalCents int64) often drops allocations immediately. The same idea applies if you’re profiling a generated Go backend from AppMaster: the hot spot is usually in how you shape and scan result data, not the database itself.
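A sketch of that change, assuming those column names, Postgres-style placeholders, and a hypothetical listOrders helper:
package store

import (
    "context"
    "database/sql"
)

type Order struct {
    ID         int64
    Status     string
    TotalCents int64
}

// listOrders selects only the needed columns and scans straight into typed
// fields: no per-row map, no interface{} boxing, one preallocated slice.
func listOrders(ctx context.Context, db *sql.DB, limit int) ([]Order, error) {
    rows, err := db.QueryContext(ctx,
        "SELECT id, status, total_cents FROM orders ORDER BY id LIMIT $1", limit)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    orders := make([]Order, 0, limit) // capacity hint avoids repeated slice growth
    for rows.Next() {
        var o Order
        if err := rows.Scan(&o.ID, &o.Status, &o.TotalCents); err != nil {
            return nil, err
        }
        orders = append(orders, o)
    }
    return orders, rows.Err()
}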
Middleware patterns that quietly allocate per request
Middleware feels cheap because it’s “just a wrapper,” but it runs on every request. During a spike, small per-request allocations add up fast and show up as a rising allocation rate.
Logging middleware is a common source: formatting strings, building maps of fields, or copying headers for nicer output. Request ID helpers can allocate when they generate an ID, convert it to a string, then attach it to context. And context.WithValue allocates a small wrapper value on every call, on top of whatever new objects or strings you store in it.
Compression and body handling are another frequent culprit. If middleware reads the full request body to “peek” or validate it, you can end up with a large buffer per request. Gzip middleware can allocate a lot if it creates new readers and writers each time instead of reusing buffers.
Auth and session layers can be similar. If each request parses tokens, base64-decodes cookies, or loads session blobs into fresh structs, you get constant churn even when the handler work is light.
Tracing and metrics can allocate more than expected when labels are built dynamically. Concatenating route names, user agents, or tenant IDs into new strings per request is a classic hidden cost.
Patterns that often show up as “death by a thousand cuts”:
- Building log lines with fmt.Sprintf and new map[string]any values per request
- Copying headers into new maps or slices for logging or signing
- Allocating new gzip buffers and readers/writers instead of pooling
- Creating high-cardinality metric labels (many unique strings)
- Storing new structs in context on every request
To isolate middleware cost, compare two profiles: one with the full chain enabled and one with middleware temporarily disabled or replaced with a no-op. A simple test is a health endpoint that should be almost allocation-free. If /health allocates heavily during a spike, the handler isn’t the problem.
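One way to make that comparison repeatable is a switch that swaps the full chain for the bare handler; a sketch with hypothetical chain and buildHandler helpers:
package server

import "net/http"

type middleware func(http.Handler) http.Handler

// chain wraps h so that mws[0] ends up as the outermost middleware.
func chain(h http.Handler, mws ...middleware) http.Handler {
    for i := len(mws) - 1; i >= 0; i-- {
        h = mws[i](h)
    }
    return h
}

// buildHandler returns either the fully wrapped handler or the bare one,
// so the same endpoint can be profiled with and without cross-cutting layers.
func buildHandler(h http.Handler, mws []middleware, bare bool) http.Handler {
    if bare {
        return h // no-op path: no logging, tracing, or auth wrappers
    }
    return chain(h, mws...)
}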
If you build Go backends generated by AppMaster, the same rule applies: keep cross-cutting features (logging, auth, tracing) measurable, and treat per-request allocations as a budget you can audit.
Fixes that usually pay off quickly
Once you have heap and allocs views from pprof, prioritize changes that reduce per-request allocations. The goal isn’t clever tricks. It’s making the hot path create fewer short-lived objects, especially under load.
Start with the safe, boring wins
If sizes are predictable, preallocate. If an endpoint usually returns around 200 items, create your slice with capacity 200 so it doesn’t grow and copy itself several times.
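A tiny sketch of that hint, with a hypothetical Item type (the 200 comes from the example above):
package server

type Item struct {
    ID   int64
    Name string
}

// visiblePage builds a response slice with a capacity hint so append never has
// to grow and re-copy the backing array for a typical ~200-item response.
// The same idea applies to maps: make(map[int64]Item, 200) avoids rehashing.
func visiblePage(src []Item) []Item {
    out := make([]Item, 0, 200)
    for _, it := range src {
        if it.Name == "" { // hypothetical filter: skip incomplete rows
            continue
        }
        out = append(out, it)
    }
    return out
}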
Avoid building strings in hot paths. fmt.Sprintf is convenient, but it often allocates. For logging, prefer structured fields, and reuse a small buffer where it makes sense.
If you generate big JSON responses, consider streaming them instead of building one huge []byte or string in memory. A common spike pattern is: request comes in, you read a large body, build a big response, memory jumps until GC catches up.
Quick changes that typically show up clearly in before/after profiles:
- Preallocate slices and maps when you know the size range
- Replace fmt-heavy formatting in request handling with cheaper alternatives
- Stream large JSON responses (encode directly to the response writer)
- Use sync.Pool for reusable, same-shaped objects (buffers, encoders) and return them consistently
- Set request limits (body size, payload size, page size) to cap worst cases
Use sync.Pool carefully
sync.Pool helps when you repeatedly allocate the same thing, like a bytes.Buffer per request. It can also hurt if you pool objects with unpredictable sizes or forget to reset them, which keeps large backing arrays alive.
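A minimal sketch, assuming a bytes.Buffer reused per request; the names and the 64 KB cap are illustrative:
package server

import (
    "bytes"
    "sync"
)

var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

const maxPooledBufBytes = 64 << 10 // don't keep unusually large backing arrays alive

func getBuf() *bytes.Buffer {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset() // always reset before reuse
    return buf
}

func putBuf(buf *bytes.Buffer) {
    if buf.Cap() > maxPooledBufBytes {
        return // let the GC reclaim oversized buffers instead of pooling them
    }
    bufPool.Put(buf)
}
Call putBuf in a defer so every code path returns the buffer, and keep the pooled type consistent so Get never hands back something unexpected.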
Measure before and after using the same workload:
- Capture an allocs profile during the spike window
- Apply one change at a time
- Re-run the same request mix and compare total allocs/op
- Watch tail latency, not just memory
If you build Go backends generated by AppMaster, these fixes still apply to custom code around handlers, integrations, and middleware. That’s where spike-driven allocations tend to hide.
Common pprof mistakes and false alarms
The fastest way to waste a day is to optimize the wrong thing. If the service is slow, start with CPU. If it gets killed by OOM, start with heap. If it survives but GC is working nonstop, look at allocation rate and GC behavior.
Another trap is staring at “top” and calling it done. “Top” hides context. Always inspect call stacks (or a flame graph) to see who called the allocator. The fix is often one or two frames above the hot function.
Also watch for the inuse vs churn mix-up. A request might allocate 5 MB of short-lived objects, trigger extra GC, and end with only 200 KB inuse. If you only look at inuse, you miss churn. If you only look at total allocations, you might optimize something that never stays resident and doesn’t matter for OOM risk.
Quick sanity checks before changing code:
- Confirm you’re in the right view: heap inuse for retention, alloc_space/alloc_objects for churn
- Compare stacks, not just function names (encoding/json is often a symptom)
- Reproduce traffic realistically: same endpoints, payload sizes, headers, concurrency
- Capture a baseline and a spike profile, then diff them
Unrealistic load tests cause false alarms. If your test sends tiny JSON bodies but production sends 200 KB payloads, you’ll optimize the wrong path. If your test returns one database row, you’ll never see the scanning behavior that appears with 500 rows.
Don’t chase noise. If a function appears only in the spike profile (not baseline), it’s a strong lead. If it appears in both at the same level, it might be normal background work.
A realistic incident walkthrough
A Monday morning promo goes out and your Go API starts getting 8x normal traffic. The first symptom isn’t a crash. RSS climbs, GC gets busier, and p95 latency jumps. The hottest endpoint is GET /api/orders because the mobile app refreshes it on every screen open.
You take two snapshots: one from a quiet moment (baseline) and one during the spike. Capture the same type of heap profile both times so the comparison stays fair.
The flow that works in the moment:
- Take a baseline heap profile and note current RPS, RSS, and p95 latency
- During the spike, take another heap profile plus an allocation profile within the same 1 to 2 minute window
- Compare the top allocators between the two and focus on what grew the most
- Walk from the biggest function to its callers until you hit your handler path
- Make one small change, redeploy to a single instance, and re-profile
In this case, the spike profile showed most new allocations came from JSON encoding. The handler built map[string]any rows, then called json.Marshal on a slice of maps. Each request created lots of short-lived strings and interface values.
The smallest safe fix was to stop building maps. Scan database rows directly into a typed struct and encode that slice. Nothing else changed: same fields, same response shape, same status codes. After rolling the change to one instance, allocations in the JSON path dropped, GC time fell, and latency stabilized.
Only then do you roll out gradually while watching memory, GC, and error rates. If you build services on a no-code platform like AppMaster, this is also a reminder to keep response models typed and consistent, because it helps avoid hidden allocation costs.
Next steps to prevent the next memory spike
Once you’ve stabilized a spike, make the next one boring. Treat profiling like a repeatable drill.
Write a short runbook your team can follow when they’re tired. It should say what to capture, when to capture it, and how to compare it to a known-good baseline. Keep it practical: exact commands, where profiles go, and what “normal” looks like for your top allocators.
Add lightweight monitoring for allocation pressure before you hit OOM: heap size, GC cycles per second, and bytes allocated per request. Catching “allocations per request up 30% week over week” is often more useful than waiting for a hard memory alarm.
Push checks earlier with a short load test in CI on a representative endpoint. Small response changes can double allocations if they trigger extra copies, and it’s better to find that before production traffic does.
If you run a generated Go backend, export the source and profile it the same way. Generated code is still Go code, and pprof will point to real functions and lines.
If your requirements change often, AppMaster (appmaster.io) can be a practical way to rebuild and regenerate clean Go backends as the app evolves, then profile the exported code under realistic load before it ships.
FAQ
Why does a traffic spike make memory jump even though the code didn’t change?
A spike usually increases the allocation rate more than you expect. Even small per-request temporary objects add up linearly with RPS, which forces the GC to run more often and can push RSS up even if the live heap isn’t huge.
Why don’t RSS and the Go heap metrics match?
Heap metrics track Go-managed memory, but RSS includes more: goroutine stacks, runtime metadata, OS mappings, fragmentation, and any non-heap allocations (including some cgo usage). It’s normal for RSS and heap to move differently during spikes, so use pprof to find allocation hot spots rather than trying to “match” RSS exactly.
Should I capture a heap profile or an allocation profile first?
Start with a heap profile when you suspect retention (something is staying alive), and an allocation-focused profile (like allocs/alloc_space) when you suspect churn (lots of short-lived objects). During traffic spikes, churn is often the real problem because it drives GC CPU time and tail latency.
How do I expose pprof safely in production?
The simplest safe setup is to run pprof on a separate admin-only server bound to 127.0.0.1, and only make it reachable through internal access. Treat pprof like an admin interface because it can expose internal details about your service.
When should I capture profiles during an incident?
Capture a short timeline: one profile a few minutes before the spike (baseline), one during the spike (impact), and one after (recovery). This makes it easier to see what changed, instead of chasing normal background allocations.
When should I look at inuse versus alloc views?
Use inuse to find what’s actually retained at capture time, and use alloc_space (or alloc_objects) to find what’s being created heavily. A common mistake is using only inuse and missing churn that causes GC thrash and latency spikes.
What if encoding/json dominates my profiles?
If encoding/json dominates allocations, the usual culprit is your data shape, not the package itself. Replacing map[string]any with typed structs, avoiding json.MarshalIndent, and not building JSON through temporary strings often reduces allocations immediately.
Which database scanning patterns allocate the most?
Scanning rows into flexible targets like interface{} or map[string]any, converting []byte to string for many fields, and fetching too many rows or columns can allocate a lot per request. Selecting only needed columns, paging results, and scanning directly into concrete struct fields are common high-impact fixes.
Why does middleware show up so heavily in allocation profiles?
Middleware runs on every request, so small allocations become huge under load. Logging that builds new strings, tracing that creates high-cardinality labels, request ID generation, gzip readers/writers created per request, and context values that store new objects can all show up as steady allocation churn in profiles.
Does this approach work for generated Go backends?
Yes, because the same profile-driven approach applies to any Go code, generated or handwritten. If you export the generated backend source, you can run pprof, identify the allocating call paths, and then adjust your models, handlers, and cross-cutting logic to reduce per-request allocations before the next spike.


