Monday, May 4, 2026

Do You Really Need Horizontal Scaling?

With all the hype around distributed systems, it is tempting to scale out right away. Picking Spark, Flink, or Dask as the processing engine is often the easy part.

But should we design for clusters from the beginning, or start simpler and evolve the system over time?

To scale effectively, we first need to understand the data volume, because horizontal scaling is often overkill.

Scaling is not free. The more distributed a system becomes, the higher its complexity: more components mean more failure points, and operational cost rises.

Vertical scaling is underrated in my opinion.

The newest machines have dozens or even hundreds of cores, and RAM has grown in step.

The “just add more nodes” approach is often the default assumption. But what does adding more nodes really mean? Network overhead, serialization, harder debugging.

Image: Spark Data Scaling Horizontal and Vertical 

 

Example of ETL batch processing:
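A minimal PySpark sketch of what such a batch job can look like; the bucket paths and column names (status, created_at, amount) are hypothetical placeholders, not taken from a real pipeline:

    from pyspark.sql import SparkSession, functions as F

    # Hypothetical input/output paths and columns, only to show the extract-transform-load shape.
    spark = (
        SparkSession.builder
        .appName("daily-orders-etl")
        .master("local[*]")   # single machine; swap the master only when a cluster is really needed
        .getOrCreate()
    )

    orders = spark.read.parquet("s3://my-bucket/raw/orders/")        # Extract
    daily_revenue = (
        orders
        .filter(F.col("status") == "COMPLETED")                      # Transform
        .groupBy(F.to_date("created_at").alias("order_date"))
        .agg(F.sum("amount").alias("revenue"))
    )
    daily_revenue.write.mode("overwrite").parquet(                   # Load
        "s3://my-bucket/curated/daily_revenue/"
    )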


Spark runs in two main execution modes: 

  • local mode (driver and execution on a single machine)
  • cluster mode (distributed executors across nodes)
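A rough sketch of the difference from the application's point of view. In practice the master URL is usually not hard-coded but supplied from outside, for example via spark-submit --master; the app name and paths here are only illustrative:

    from pyspark.sql import SparkSession

    # Local mode: driver and task execution live in one JVM on one machine.
    spark = (
        SparkSession.builder
        .appName("local-run")
        .master("local[*]")   # use all cores of the single machine
        .getOrCreate()
    )

    # Cluster mode: the same application code, but the master points at a cluster
    # manager (standalone, YARN, Kubernetes) and executors run on other nodes.
    # Typically the master is supplied externally, e.g.
    #   spark-submit --master yarn --deploy-mode cluster etl_job.py
    # instead of being hard-coded in the application.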

 

Reality Check on Horizontal Scaling

In distributed systems, moving data between partitions introduces a cost known as shuffling.

Shuffling is required to achieve proper data redistribution and balanced workload across partitions. It also occurs in single-node systems (Spark still operates with a logical distributed execution model, even when running locally). However, in cluster environments, shuffle becomes significantly more expensive due to network communication and coordination overhead.
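A small sketch of where shuffle shows up: groupBy is a wide transformation, so rows have to be redistributed by key, and the physical plan contains an Exchange step; spark.sql.shuffle.partitions (default 200) controls how many shuffle partitions are produced. The row count and key column below are arbitrary:

    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .appName("shuffle-demo")
        .master("local[*]")                              # shuffle happens in local mode too
        .config("spark.sql.shuffle.partitions", "8")     # default 200 is often far too many for small data
        .getOrCreate()
    )

    df = spark.range(10_000_000).withColumn("key", F.col("id") % 100)

    # groupBy is a wide transformation: rows with the same key must end up
    # in the same partition, which is exactly the shuffle.
    counts = df.groupBy("key").count()
    counts.explain()   # the physical plan contains an Exchange (shuffle) node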

Technical cost

  • network latency and data shuffling
  • shuffle behavior and data layout, which in many cases dominate Spark performance more than raw compute does
  • serialization/deserialization (this can also happen on a single node, for example when data spills to local disk)

Operational cost

  • Kubernetes, cluster management
  • monitoring, observability
  • deployments

Cognitive cost

  • debugging distributed systems
  • tracing
  • consistency issues

However, a single large machine is not a universal solution either, which brings us to the trade-offs.

With vertical scaling we get simplicity, lower latency, and easier debugging.

However, vertical scaling is ultimately constrained by hardware limits.

At a certain point, the data size or workload characteristics simply exceed what a single machine can handle. The question is: are you actually at that point yet?

As Microsoft Research points out:

"that the majority of analytics
jobs do not process huge data sets. For example, at least
two analytics production clusters (at Microsoft and Ya-
hoo) have median job input sizes under 14 GB"

or 

"that the majority of real-world analytic jobs process less than 100 GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing."

from: "Scale-up vs Scale-out for Hadoop: Time to rethink?" (Microsoft Research, SoCC 2013)

The paper is from 2013 and data volumes have grown a lot since then, but not always and not everywhere: I still come across workloads under 100 GB today.


Now let's move to horizontal scaling: we get scalability and fault tolerance (a failed task can be restarted on a new executor).

For some systems it is the only option, and then the added complexity and operational overhead have to be factored in.

Making the decision:

Are we CPU-bound? -> scale vertically -> tune the jobs -> scale horizontally only if that does not help

Tuning jobs: data organization (e.g., Iceberg partitioning, when Iceberg is in use) often has a higher impact than scaling compute.
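As a sketch of what such data organization can look like, here is a hypothetical Iceberg table with hidden partitioning; the catalog name, warehouse path, table, and columns are made up, and the iceberg-spark-runtime package is assumed to be on the classpath:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("iceberg-partitioning")
        .master("local[*]")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.my_catalog.type", "hadoop")
        .config("spark.sql.catalog.my_catalog.warehouse", "/tmp/iceberg-warehouse")
        .getOrCreate()
    )

    # Hidden partitioning: queries that filter on event_ts prune partitions
    # automatically, without readers having to know the partitioning scheme.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS my_catalog.analytics.events (
            event_id  BIGINT,
            user_id   BIGINT,
            event_ts  TIMESTAMP,
            payload   STRING
        )
        USING iceberg
        PARTITIONED BY (days(event_ts))
    """)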

Are we really at petabyte scale? -> scale horizontally, as a single machine will probably not be effective

SLA (Service Level Agreement) requirements for low latency or high availability -> scale horizontally.

Does the operational cost align with business expectations? Maybe a longer-running job over night is also tolerable. What is the cost per job?

I would rather not scale out in an early-stage system or for data up to a couple of TB.
I would scale out for massive datasets, HA requirements, or streaming and real-time workloads.

Even if we need to think about scaling from the beginning, it is still better to start simple and scale later. But always measure first.
It is useful to understand tools like autoscalers and Kubernetes taints & tolerations (in environments such as EC2 nodes on EKS), but it is equally important to know when not to use them.

As always, it is not only CPU and memory that matter, but also I/O bottlenecks. A larger VM does not always translate into linear performance gains. And scaling is a decision, not a default.