Executive Summary
This post covers two patterns every AWS platform team eventually encounters, where AWS’s defaults end, and what deliberate architectural choices—including solutions built in partnership with AWS—actually resolve them.
- Horizontal scaling exposes the database ceiling. Adding application instances doesn’t add database connections—it consumes a finite pool. And even before you hit that pool, native cloud storage I/O becomes the hidden constraint that no amount of horizontal scaling can fix. RDS Proxy addresses the connection problem but leaves the storage I/O ceiling untouched.
- Auto Scaling chases the ceiling; it doesn’t prevent it. Scaling policies react to lagging metrics. By the time scale-out triggers and completes, users have already felt the degradation—often because the real bottleneck was never compute in the first place. When storage I/O is the true constraint, adding EC2 instances doesn’t solve the problem. It can make it worse.
Understanding where these ceilings live—and making deliberate architectural choices before load exposes them—is the difference between infrastructure that scales and infrastructure that scales until it doesn’t.
When the Playbook Fails
You scaled. You did everything right.
You set up Auto Scaling groups with sensible thresholds. You decomposed the monolith into services. You moved background jobs to Lambda. You added caching layers. The architecture diagram looked exactly like the re:Invent slide it was modeled after.
Then traffic grew—real growth, the kind you wanted—and something started breaking. Not catastrophically. Subtly. P99 latency climbed. A few timeouts. A queue that started falling behind. By the time your team traced it, you had a problem that more compute couldn’t fix, because compute was never the constraint.
This is the part the architecture diagrams don’t show: AWS scaling patterns are excellent at moving the ceiling. They are not, by themselves, excellent at removing it.
Pattern 1: Horizontal Scaling and the Database Ceiling
The pattern: Horizontal scaling is the right answer for stateless application tiers. Add more EC2 instances, ECS tasks, or Lambda functions when load increases; distribute the work; keep each node stateless. At the application layer, this works exactly as advertised. The pattern breaks at the first stateful dependency downstream—almost always the database.
How the ceiling forms: Your Auto Scaling group adds five new EC2 instances in response to rising CPU. Each instance maintains a connection pool to your RDS database. A db.r6g.large running PostgreSQL or MySQL has a hard max_connections limit—a function of instance class memory, not your traffic pattern. Add enough application instances and you don’t just approach that limit. You hit it, and new connection requests start failing. Your application scales out successfully while your database starts refusing to talk to it.
This isn’t a misconfiguration. It’s the structural consequence of a pattern that doesn’t account for downstream capacity.
Where the ceiling lives: max_connections on RDS defaults to a formula based on instance-class memory. For PostgreSQL the formula is LEAST(DBInstanceClassMemory / 9531392, 5000), which puts a db.r5.large (16 GiB) at roughly 1,800 connections and caps even the largest instance classes at 5,000. The parameter is technically tunable, but every connection consumes memory, so raising it past the default trades connection headroom for working memory. Every Lambda invocation, every ECS task, every new Auto Scaling instance that opens a connection is drawing from that finite pool.
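That default formula is simple enough to sketch. This is an illustrative estimate, not an authoritative sizing tool; the memory figures are the published instance-class sizes in GiB, and the divisor is the documented PostgreSQL formula:

```python
# Rough sketch of how RDS derives the default PostgreSQL max_connections
# from instance-class memory: LEAST(DBInstanceClassMemory / 9531392, 5000).
GIB = 1024 ** 3

def rds_pg_default_max_connections(memory_gib: float) -> int:
    """Estimate the default PostgreSQL max_connections for an RDS instance."""
    return min(int(memory_gib * GIB / 9531392), 5000)

for instance_class, mem_gib in [("db.r5.large", 16), ("db.r5.4xlarge", 128)]:
    print(instance_class, rds_pg_default_max_connections(mem_gib))
# db.r5.large lands around 1,800; db.r5.4xlarge hits the 5,000 cap.
```

The point of the sketch is that the ceiling is a function of memory, not traffic: no scaling policy on the application tier changes it.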
But the connection ceiling is only the first layer. Even teams that solve the connection problem with proxies and pooling often hit a second, deeper constraint: storage I/O throughput. Cloud-native storage tiers are convenient and scalable, but they introduce real boundaries: throughput and IOPS caps on cloud disks throttle performance under sustained load, and latency variability between zones can affect consistency in ways that only emerge at scale. A gp2 EBS volume delivers a baseline of 3 IOPS per GiB, so a 1 TiB volume sustains just over 3,000 IOPS, and the 16,000-IOPS maximum is reached only at roughly 5,334 GiB. A heavily loaded OLTP database can saturate that ceiling with read/write patterns that look completely normal at smaller scale. When it does, query latency climbs, not because the database is misconfigured, but because the storage layer physically cannot keep up with I/O demand.
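The gp2 arithmetic is simple enough to sketch: baseline IOPS scale linearly with volume size between a 100-IOPS floor and a 16,000-IOPS cap.

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 EBS volume: 3 IOPS/GiB, floor 100, cap 16,000."""
    return max(100, min(3 * size_gib, 16_000))

# A 1 TiB volume bottoms out at 3,072 IOPS; the 16,000 cap is only
# reached at roughly 5,334 GiB.
print(gp2_baseline_iops(1024))   # 3072
print(gp2_baseline_iops(5334))   # 16000
```

A database that needs 8,000 sustained IOPS on a 1 TiB gp2 volume is structurally underprovisioned, no matter how many application instances sit in front of it.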
What it looks like in production: Intermittent connection timeouts that correlate with traffic spikes. Database error logs showing too many connections. Query latency that climbs gradually under sustained load even after connection pooling is in place—the signature of storage I/O saturation rather than connection exhaustion.
What AWS Provides—and Where It Stops
RDS Proxy is AWS’s answer to the connection problem, launched in 2020. It multiplexes thousands of application connections into a manageable pool of actual database connections, and for Lambda-to-RDS architectures specifically it’s close to essential—Lambda’s execution model opens a new connection on every cold start. It works well for what it was designed to do.
Where it stops: RDS Proxy addresses the connection ceiling. The storage I/O ceiling that sits underneath the connection layer is a separate boundary, addressed separately. For storage I/O, AWS offers gp3 volumes and io2 Block Express with higher provisioned IOPS—meaningful options that give architects headroom to design for peak workloads when those peaks are known and planned for.
The Most Widely Used Alternative: PgBouncer
For the connection problem, PgBouncer (open source) is the community standard—predating RDS Proxy by years and considered more flexible and debuggable by many production teams. It solves the connection ceiling well. Like RDS Proxy, it operates at the connection layer and does not address storage I/O throughput.
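As a rough illustration, a minimal PgBouncer configuration in transaction-pooling mode might look like the following. Hostnames, the auth file path, and pool sizes are placeholders for illustration, not recommendations:

```ini
; Minimal PgBouncer sketch (all values illustrative).
; Transaction pooling lets thousands of client connections share a small
; server-side pool -- the same multiplexing idea RDS Proxy implements.
[databases]
appdb = host=mydb.example.rds.amazonaws.com port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 50
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
```

With a configuration like this, 5,000 client connections contend for only 50 actual database connections, keeping the application tier's fan-out well under the RDS max_connections ceiling.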
Where Silk Fits
Silk addresses the storage I/O ceiling at the infrastructure layer rather than the application layer. Silk uses fast compute networks and powerful VMs to deliver unmatched throughput and IOPS to databases and applications at sub-millisecond latency, decoupling performance from capacity. This means storage performance scales independently of the database instance class: teams running Oracle, SQL Server, or PostgreSQL on AWS get the I/O headroom their applications actually demand without overprovisioning compute or storage to compensate.
The results are concrete: one customer reported moving from a maximum speed of 1.6MB/s on native cloud to 400MB/s with Silk on AWS. That’s not a tuning improvement—it’s the removal of a structural ceiling. The significance for horizontal scaling patterns is direct: when the storage layer is no longer the bottleneck, connection pooling solutions like RDS Proxy or PgBouncer can do their job without a deeper I/O constraint undermining the result. For OLTP applications specifically—the most common applications hitting these ceilings—Silk’s platform delivered 663,618 IOPS at 1.4ms latency on AWS, representing 400% more IOPS at roughly one-quarter the latency compared to the next closest enterprise alternative tested.
Pattern 2: Auto Scaling Reaction Lag—The Thermometer Problem
The pattern: Auto Scaling is a feedback control system. It watches a metric, compares it to a threshold, and responds. That description contains the problem: it watches, then responds. The metric has to degrade before scale-out begins. Scale-out takes time to complete. And the metric being watched is almost always a lagging indicator of actual user experience.
How the ceiling forms: The typical setup uses CPU utilization as the scaling trigger—scale out when average CPU exceeds 70%, scale in when it drops below 30%. CPU is a reasonable proxy for compute-bound applications, but it measures what has already happened. By the time your fleet’s average CPU hits 70%, you’ve already been serving degraded traffic for 60 to 90 seconds. Then the scale-out decision is made. Then a new instance launches and passes its health check—another 60 to 180 seconds depending on your AMI and bootstrap process.
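The lag described above can be sketched as simple arithmetic. Every duration here is an illustrative assumption, not a measured value:

```python
# Back-of-the-envelope model of Auto Scaling reaction lag.
# All durations are illustrative assumptions, in seconds.
detection_lag    = 60   # CloudWatch aggregation + alarm evaluation periods
scaling_decision = 30   # alarm fires, scale-out activity starts
instance_boot    = 120  # launch, bootstrap, pass health checks

time_to_first_new_capacity = detection_lag + scaling_decision + instance_boot
print(f"{time_to_first_new_capacity} s "
      f"(about {time_to_first_new_capacity / 60:.1f} minutes)")
```

Tightening any one term helps (warm pools attack instance_boot, shorter alarm periods attack detection_lag), but the sum never reaches zero: the loop is reactive by construction.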
From the first user who felt slowness to the first new instance absorbing requests: three to five minutes, on a good day.
The problem deepens when the metric itself is wrong. CPU is a poor proxy for I/O-bound applications, memory-constrained applications, or connection-limited services. A Node.js application waiting on database responses shows low CPU while users wait. An application hitting its storage I/O ceiling shows low CPU while queries queue behind disk. Auto Scaling, watching CPU, sees nothing wrong and does not scale—because the problem was never in the compute tier.
What it looks like in production: Load spikes that generate a burst of elevated latency or errors that clear on their own 5 to 10 minutes later, after scaling catches up. Teams often attribute this to “a blip”—until the blips become frequent enough to trace. When they do trace it, they often find CPU was flat the entire time.
What AWS Provides—and Where It Stops
This is the area where AWS has made the most genuine recent progress. Predictive scaling uses ML-based forecasting, analyzing historical data and generating forecasts for the next 48 hours. Warm pools keep pre-initialized instances ready to receive traffic in seconds rather than minutes. Both are meaningful additions to the AWS-native toolkit.
Where they stop: predictive scaling requires sufficient metric history and is designed for applications with cyclical, recurring patterns. It works well for predictable daily traffic curves. For unpredictable spikes—a viral event, a sudden campaign, a DDoS—reactive scaling remains the reality. And for applications where the degradation is caused by storage I/O saturation rather than compute pressure, both predictive scaling and warm pools scale the wrong layer. Adding EC2 capacity when storage is the constraint doesn’t move the needle—it can make things worse by adding more connections competing for the same constrained storage throughput.
The Most Widely Used Alternative: KEDA
For the metric problem, KEDA (Kubernetes Event-Driven Autoscaling) on EKS is the most widely adopted community solution—scaling on queue depth, custom metrics, and external signals rather than lagging infrastructure metrics. It addresses the “wrong metric” problem more fundamentally than any AWS-native solution. It does not address the underlying storage I/O constraint that makes the metric wrong in the first place.
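As a hedged sketch, a KEDA ScaledObject that scales a worker Deployment on SQS queue depth instead of CPU might look like this. All names and the queue URL are placeholders:

```yaml
# Illustrative KEDA ScaledObject: scale on queue depth, not CPU.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-worker-scaler
spec:
  scaleTargetRef:
    name: orders-worker            # the Deployment to scale (placeholder)
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/orders
        queueLength: "100"         # target messages per replica
        awsRegion: us-east-1
```

Queue depth is a leading indicator of backlog, so scaling fires as work accumulates rather than after CPU has already been saturated for a minute.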
Where Silk Fits
For degradation caused by storage I/O constraints that compute scaling cannot address, Silk removes the storage performance ceiling that makes Auto Scaling reactive rather than predictive. Silk's patented adaptive block sizing and log-structured array keep applications performing consistently regardless of the number or type of concurrent workloads. The practical implication for Auto Scaling is significant: when storage I/O is no longer the hidden variable, CPU and latency metrics become honest proxies for actual load. Scaling policies fire on real compute pressure rather than chasing symptoms of a constraint that lives one layer deeper. The benefit is both operational and financial: every time teams overprovision to compensate for a storage slowdown, costs compound across cloud storage, compute fees, database licensing, and data copies across environments. Removing the storage ceiling means architecting for your actual average workload rather than your worst-case guess.
A Note on Lambda Concurrency
There is a third pattern worth knowing. Lambda’s default regional concurrency limit of 1,000 simultaneous executions is a shared, account-level ceiling. API Gateway’s default limit is 10,000 requests per second. At even modest function durations, those two numbers can produce a mismatch that results in throttled invocations and 429s—with no warning in the default AWS setup.
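Little's law makes the mismatch concrete: required concurrency is roughly arrival rate multiplied by average duration. A quick sketch with illustrative numbers:

```python
# Little's law sketch: concurrency = arrival rate x average duration.
# The request rate and function duration below are illustrative.
def required_concurrency(requests_per_second: float, avg_duration_s: float) -> float:
    """Steady-state concurrent executions needed to keep up with arrivals."""
    return requests_per_second * avg_duration_s

# API Gateway's default 10,000 rps with a modest 500 ms function:
needed = required_concurrency(10_000, 0.5)
print(f"{needed:.0f} concurrent executions vs the 1,000 default regional limit")
```

At those assumed numbers the steady-state demand is five times the default Lambda concurrency limit, which is exactly how 429s appear under load with no warning in testing at lower traffic.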
The fix is straightforward but must happen proactively: request a concurrency limit increase via AWS Support before you need it, use provisioned concurrency for latency-sensitive functions, and place an SQS buffer between API Gateway and Lambda for applications that can tolerate asynchronous processing. Unlike the database and Auto Scaling ceilings, this one has no storage dimension—it’s a quota configuration boundary with a known resolution that AWS Support is well-equipped to help with. It belongs in any honest accounting of AWS scaling ceilings because teams consistently discover it under load rather than in testing, which is precisely why proactive load testing and quota planning are part of AWS’s own Well-Architected guidance.
The Shared Responsibility of Scale
AWS and its partners are continuously raising the ceiling. The architect’s job is to know where it is today.
These ceilings are partially by design. Connection limits protect database stability under multi-tenant conditions. Lambda concurrency limits prevent runaway cost for new accounts. They’re not oversights—they’re guardrails that become boundaries at scale. Understanding them as design decisions rather than deficiencies is what allows architects to work with them deliberately rather than collide with them unexpectedly.
AWS provides the tools to address each of these boundaries: RDS Proxy, provisioned concurrency, predictive scaling, warm pools, gp3 and io2 storage options. The gap is not in what exists—it’s in what is on by default. None of these tools activate themselves. They require architectural intent, applied before load exposes the boundary, not after.
The most persistent ceilings live in the storage layer. They are invisible to the metrics the compute tier watches, and they don’t yield to solutions designed for the compute layer. This is where the partnership between AWS and platforms like Silk creates the most direct value—not by replacing AWS-native tools, but by operating at the layer those tools don’t reach, so that the full AWS scaling stack can perform the way it was designed to.
Closing: Performance Is an Architectural Property
The two patterns above share a common thread. Neither is a failure of operations. Both are consequences of architectural decisions that were correct for a smaller system and became boundaries at scale.
Horizontal scaling is right. Auto Scaling is right. The ceiling isn’t in the pattern—it’s in the assumptions the pattern carries about the systems around it. The database that scales vertically while the application scales horizontally. The feedback loop that measures compute while storage quietly saturates. The metric that looks healthy right up until users start noticing.
The teams that navigate this well aren’t the ones who avoid these patterns. They’re the ones who build with explicit awareness of where the boundaries are, instrument for the signals that precede them, and make deliberate architectural choices—connection pooling, storage disaggregation, event-driven scaling—before the boundary introduces itself in production.
At Silk, we work at the layer where compute and storage meet cloud infrastructure. What we see consistently is that the performance ceilings teams hit at scale are rarely where they expected them. They’re not in the code. They’re not in the instance type. They’re in the storage layer—invisible to every metric the compute tier is watching, and unresolved by every scaling policy designed to fix it. Addressing that layer, in partnership with AWS, is where the most durable performance gains are made.
See What AWS Performance Looks Like Without the Bottlenecks
Join our live webinar on April 29 at 11am ET to learn how enterprises are achieving consistent, high-performance application scaling on AWS — without the complexity of constant tuning.
Register for the Webinar


