Stress Testing Your Web App: Finding the Breaking Point Before Your Users Do
APR 17, 2026 - Written by Yves Soete, Blacksight LLC — QA scanning + CI/CD gating in one tool at kuality.io
Every production outage during a traffic spike has the same post-mortem: "We didn't know our capacity limit." The database connection pool exhausted at 800 concurrent users. The session store ran out of memory at 1,200. The CDN origin started 503-ing when cache-miss rate exceeded 40%. Each of these is discoverable before launch with a two-hour stress test. The problem isn't that stress testing is hard — it's that teams skip it because the tooling feels heavyweight and the results feel academic until the first time the site falls over on a real traffic spike.
Load testing vs. stress testing vs. soak testing
These terms get used interchangeably but they measure different things:
Load testing validates that your application performs acceptably under expected traffic. You simulate your typical peak — say, 500 concurrent users — and verify that response times, error rates, and resource utilization stay within acceptable bounds. This is what most teams mean when they say "performance testing."
Stress testing pushes past expected traffic to find the breaking point. You ramp up from 500 to 5,000 concurrent users and watch for the moment latency spikes, errors appear, or the app stops responding. The goal isn't to prove it handles the load — it's to find exactly where it breaks and what breaks first.
Soak testing runs sustained load over hours or days, looking for slow memory leaks, connection pool exhaustion, log-volume disk fills, and other time-dependent failures. A system that passes a 10-minute stress test at 1,000 users might OOM-kill after 6 hours at 200 users because of a leak.
For launch readiness, you need all three. Load testing first (does it handle expected traffic?), stress testing second (where does it break?), soak testing third (does it survive a full day?).
Tool selection: k6 vs. Locust vs. Artillery vs. Gatling
The open-source load testing landscape has consolidated around four tools, each with a clear use case:
k6 (Grafana Labs) — JavaScript-based scenarios, excellent CLI, built-in metrics export to Prometheus/Grafana. Best for teams already in the Grafana ecosystem. Runs locally or distributed. The scripting ergonomics are the best in class: you write ES6 modules that describe user behavior, and k6 handles connection pooling, cookie jars, and protocol details.
Locust — Python-based, event-driven architecture. Best for teams with Python expertise who want to write realistic user scenarios as Python classes. Excellent distributed mode with a web UI for watching tests in real time. Weaker on protocol-level testing (WebSocket, gRPC) without plugins.
Artillery — YAML-defined scenarios with JavaScript hooks. Best for teams that want declarative test definitions checked into the repo alongside infrastructure-as-code. Supports HTTP, WebSocket, Socket.io, and gRPC natively. Good AWS Lambda integration for distributed tests without managing load generators.
Gatling — Scala/Java-based, the oldest of the four. Best for enterprise Java shops. Highest overhead to learn but the most sophisticated scenario modeling for complex multi-step user journeys.
For most web applications, k6 or Locust. Pick based on whether your team thinks in JavaScript or Python.
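All four tools implement the same closed-loop model: N virtual users, each looping request → think time → repeat. The pure-Python sketch below shows that core loop with a stubbed request function; it is an illustration of the model, not a replacement for any of these tools.

```python
import threading
import time

def run_virtual_users(num_users, duration_s, make_request, think_time_s=0.1):
    """Closed-loop load model: each virtual user issues a request,
    records its latency, waits (think time), and repeats until the
    test duration elapses."""
    results = []  # (latency_seconds, ok) tuples
    lock = threading.Lock()
    deadline = time.monotonic() + duration_s

    def user_loop():
        while time.monotonic() < deadline:
            start = time.monotonic()
            ok = make_request()
            latency = time.monotonic() - start
            with lock:
                results.append((latency, ok))
            time.sleep(think_time_s)

    threads = [threading.Thread(target=user_loop) for _ in range(num_users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In practice `make_request` would be an HTTP call against your target; k6 and Locust add everything this sketch omits — connection pooling, ramp profiles, cookie handling, and metric aggregation.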
What to measure and what to ignore
The default output of every load testing tool is a wall of numbers. Here's what actually matters:
p95 response time — not average, not median. The 95th percentile tells you what your slowest-but-not-outlier users experience. If your p95 is 3 seconds, 1 in 20 users waits 3+ seconds on every page load. Average hides this because the fast requests dilute it.
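To see the dilution effect concretely, here is a nearest-rank p95 next to the average on an invented sample where 10% of requests are slow:

```python
import math

def p95(latencies_ms):
    """95th percentile via the nearest-rank method: the value at
    rank ceil(0.95 * n) in the sorted sample."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

# 18 fast requests and 2 slow ones: 1 in 10 users waits 3 seconds,
# but the average still looks passable.
latencies = [120] * 18 + [3000] * 2
print(sum(latencies) / len(latencies))  # 408.0 ms
print(p95(latencies))                   # 3000 ms
```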
Error rate — the percentage of requests that returned 4xx/5xx or timed out. During a stress test, you want to find the concurrency level where this exceeds 1%. That's your effective capacity ceiling.
Throughput curve — requests per second as concurrency increases. In a healthy system, throughput rises linearly with concurrency until you hit saturation, then plateaus, then drops as the system starts queueing and timing out. The knee of that curve is your maximum sustainable throughput.
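Locating that knee can be automated from the ramp data. A sketch (the concurrency/RPS pairs are invented): flag the first point where the incremental throughput gain per added user falls well below the initial slope.

```python
def find_knee(samples, gain_threshold=0.5):
    """samples: list of (concurrency, requests_per_second) in ramp order.
    Returns the concurrency where throughput stops scaling: the last
    point before the incremental gain per added user drops below
    gain_threshold times the initial slope."""
    base_slope = (samples[1][1] - samples[0][1]) / (samples[1][0] - samples[0][0])
    for (c0, r0), (c1, r1) in zip(samples, samples[1:]):
        slope = (r1 - r0) / (c1 - c0)
        if slope < gain_threshold * base_slope:
            return c0  # last concurrency level that still scaled
    return samples[-1][0]

ramp = [(100, 950), (200, 1900), (300, 2850), (400, 3100), (500, 3150)]
print(find_knee(ramp))  # 300 — past this point, added users mostly queue
```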
Resource utilization — CPU, memory, disk I/O, and network on the application server, database, cache, and CDN origin. During the test, one of these will hit a ceiling first. That's your bottleneck. Knowing which resource saturates first tells you exactly what to scale.
Ignore: average response time, minimum response time, total requests sent. These are vanity metrics that mask real problems.
The pre-launch stress test protocol
Here is the protocol I use for every launch. It takes about two hours and produces a one-page capacity report:
Step 1: Baseline (15 minutes) — Run 50 virtual users for 5 minutes against your three most critical endpoints (homepage, login, primary conversion flow). Record p95 response time and error rate. This is your baseline.
Step 2: Ramp to expected peak (20 minutes) — Linearly ramp from 50 to your expected peak concurrency over 10 minutes, hold for 10 minutes. Verify p95 stays under 2 seconds and error rate stays under 0.1%. If either threshold breaks, you have a capacity problem at expected load — stop and fix before continuing.
Step 3: Find the breaking point (30 minutes) — Continue ramping at 50 users per minute until error rate exceeds 5% or p95 exceeds 10 seconds. Record the concurrency level. This is your breaking point. Divide by your expected peak to get your headroom multiple. Anything below 3x is risky for a launch.
Step 4: Identify the bottleneck (15 minutes) — Review resource utilization during the breaking point. CPU-bound? Scale horizontally or optimize hot paths. Memory-bound? Check for leaks or increase limits. Database connection pool? Increase pool size or add read replicas. CDN cache-miss rate? Warm the cache or extend TTLs.
Step 5: Soak test (60 minutes) — Run at 70% of your breaking point for one hour. Watch for memory growth, connection count drift, and error rate creep. If the error rate is higher at minute 60 than minute 5, you have a time-dependent failure mode.
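Steps 3 and 4 reduce to a one-page capacity summary. A minimal sketch, using the 3x headroom rule from Step 3 (the sample numbers are made up):

```python
def capacity_report(expected_peak, breaking_point):
    """Summarize a stress test: headroom multiple = breaking-point
    concurrency divided by expected peak. Below 3x is launch-risky."""
    headroom = breaking_point / expected_peak
    return {
        "expected_peak": expected_peak,
        "breaking_point": breaking_point,
        "headroom_multiple": round(headroom, 2),
        "launch_ready": headroom >= 3.0,
    }

print(capacity_report(expected_peak=500, breaking_point=2100))
# headroom 4.2x -> launch_ready True
```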
Common failure modes and their fixes
Database connection pool exhaustion — the single most common failure in web applications under load. Most frameworks default to 10-20 connections. At 500 concurrent users each issuing roughly one query per second, with 100ms average query time, you need 50 connections just to keep up. Fix: increase pool size, add pgbouncer/ProxySQL, or optimize slow queries.
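The 50-connection figure follows from Little's Law: average connections in use = query arrival rate × query duration. A sketch, assuming each user issues about one query per second; the 1.5x safety factor for bursts is my addition, not part of the law:

```python
import math

def required_pool_size(concurrent_users, queries_per_user_per_s, query_time_s,
                       safety_factor=1.5):
    """Little's Law: average connections in use = arrival rate x service time.
    The safety factor covers bursts and retry storms."""
    arrival_rate = concurrent_users * queries_per_user_per_s  # queries/sec
    in_use = arrival_rate * query_time_s                      # avg busy connections
    return math.ceil(in_use * safety_factor)

# 500 users, 1 query/sec each, 100 ms per query:
print(required_pool_size(500, 1.0, 0.1))  # 75 (50 in use x 1.5 headroom)
```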
Memory leak in session store — common in Node.js apps using in-memory sessions. Each session allocates 1-5KB, and at 10,000 sessions the process hits its memory limit. Fix: use Redis or PostgreSQL-backed sessions.
CDN origin overload — your CDN handles the static assets, but API calls and dynamic pages go straight to origin. Under load, the origin becomes the bottleneck even though the CDN reports 99.9% hit rate. Fix: cache API responses where possible, add edge-side includes, or put a reverse proxy with short-TTL caching in front of the API.
DNS resolution latency — overlooked because it's invisible in development. Under load, DNS resolution for external services (payment APIs, email providers, analytics endpoints) can add 50-200ms per request. Fix: use connection pooling for external services, cache DNS locally, or use IP-based connections for critical paths.
Stress testing in CI
Full stress tests don't belong in every PR pipeline — they're too slow and too resource-intensive. But lightweight load tests do. A k6 script that runs 100 virtual users for 60 seconds against a preview deployment catches the most common performance regressions: accidentally removing a database index, adding an N+1 query, or introducing a synchronous call in a hot path.
Set a p95 threshold (say, 500ms for API endpoints, 2s for page loads) and fail the build if the PR exceeds it. This catches 80% of performance regressions at 1% of the cost of a full stress test.
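The gate itself is a few lines. A sketch of the check, as it might run after any load test that emits per-endpoint p95 numbers; the thresholds, field names, and sample results are all assumptions:

```python
# Per-endpoint-class p95 budgets in milliseconds (assumed values).
THRESHOLDS_MS = {"api": 500, "page": 2000}

def check_build(results):
    """results: list of dicts like {"name": ..., "kind": "api" | "page",
    "p95_ms": ...} produced by the load test run. Returns the list of
    threshold violations; an empty list means the build passes."""
    violations = []
    for r in results:
        budget = THRESHOLDS_MS[r["kind"]]
        if r["p95_ms"] > budget:
            violations.append(f"{r['name']}: p95 {r['p95_ms']}ms > {budget}ms")
    return violations

results = [
    {"name": "GET /api/search", "kind": "api", "p95_ms": 430},
    {"name": "GET /checkout", "kind": "page", "p95_ms": 2300},
]
for violation in check_build(results):
    print("FAIL", violation)  # any output here -> fail the build
```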
For the full protocol, run it on a schedule — weekly against staging, and always before a major release. The results go into a capacity dashboard that the team reviews monthly. If your headroom multiple drops below 3x, it's time to scale before you need to.
Kuality's upcoming stress testing scanner (M79 on our roadmap) will automate this protocol with domain-verified targets and scheduled capacity reports. In the meantime, you can use kuality.io to audit the quality foundations — performance, accessibility, security headers — that stress testing builds on top of.