ELK Cluster Planning Tool

Based on Elasticsearch best practices and official documentation

Overview

Purpose of the Workbook

This workbook helps engineers, architects, and system administrators plan and estimate infrastructure requirements for deploying an Elasticsearch cluster. It calculates the recommended number of nodes for each role in the cluster—master-eligible nodes, data nodes, ingest nodes, and coordinating nodes—based on key factors like data ingestion volume, indexing rate, and retention policy.

By inputting realistic operational parameters, users can generate a sizing baseline tailored to their use case, improving the reliability and performance of their Elasticsearch deployment while ensuring efficient resource allocation.

Usage Considerations

This tool is intended to serve as a high-level estimation guide for Elasticsearch sizing. It should not be used in isolation to finalize production cluster designs.

Actual requirements may vary depending on:

Workload characteristics (query vs. indexing heavy)
Node hardware profiles (CPU, disk I/O, network bandwidth)
Performance tuning (caching, filters, storage tiering)
Security features (encryption, audit logging, etc.)

It is highly recommended to validate all assumptions through load testing, staging environments, and benchmark trials before applying this sizing in a production scenario.

Calculation Methodology

All calculations and assumptions in this workbook are informed by official Elasticsearch documentation and community-accepted best practices. The sizing estimates rely on formulas and thresholds that reflect how Elasticsearch handles data distribution, indexing, and query performance.

Key factors taken into account include:

Daily ingestion volume (GB or TB)
Retention period (in days or weeks)
Ideal shard size (usually ~30–50 GB)
Desired number of replicas for high availability
Estimated JVM heap size per node
Usable storage per node (factoring in 10–20% overhead)
Shard-to-heap ratio (maximum recommended: ~20 shards per GB of heap)

Input Parameters

Daily Data Volume (GB)

Retention Period (Days)

Replica Factor (Count)

Overhead Multiplier

CPU Cores per Node (Count)

Data Node RAM (GB)

Disk Size per Data Node (TB)

Shards per GB Heap

Target Shard Size (GB)

Peak Ingestion Rate (evt/sec)

Peak Query Load (QPS)

Heap & Shard Calculations

Parameter	Formula	Value
Heap Size per Node	MIN(ram / 2, 32)	-
Max Shards per Node	heapSize × shards_per_gb_heap	-
Total Raw Storage (GB)	dailyData × retention × (1 + replica) × overhead	-
Usable Storage per Node (GB)	disk × 1024 × 0.8	-
Data-Driven Min Shards	CEIL(dailyData / targetShard)	-
Heap-Constrained Shards	CEIL(totalRawStorage / targetShard)	-
Final Daily Shards	MAX(minShards, CEIL(heapConstrainedShards / maxShardsNode))	-
Total Cluster Shards	finalDailyShards × (1 + replica) × retention	-
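
The rows of the table above can be chained in a few lines. The following is a minimal Python sketch, not the workbook's own implementation: the function name `shard_plan`, its parameter names, and the sample inputs (10 TB/day, 30 days, 1 replica, 1.3 overhead, 64 GB RAM, 8 TB disk, 20 shards per GB heap, 50 GB shards) are illustrative assumptions.

```python
import math

def shard_plan(daily_gb, retention_days, replicas, overhead,
               ram_gb, disk_tb, shards_per_gb_heap, target_shard_gb):
    """Evaluate the Heap & Shard table top to bottom (hypothetical helper)."""
    heap_gb = min(ram_gb / 2, 32)                       # Heap Size per Node
    max_shards_node = heap_gb * shards_per_gb_heap      # Max Shards per Node
    total_raw_gb = daily_gb * retention_days * (1 + replicas) * overhead
    usable_gb_node = disk_tb * 1024 * 0.8               # Usable Storage per Node
    min_shards = math.ceil(daily_gb / target_shard_gb)  # Data-Driven Min Shards
    heap_constrained = math.ceil(total_raw_gb / target_shard_gb)
    final_daily = max(min_shards,
                      math.ceil(heap_constrained / max_shards_node))
    total_shards = final_daily * (1 + replicas) * retention_days
    return {
        "heap_gb": heap_gb,
        "max_shards_per_node": max_shards_node,
        "total_raw_gb": total_raw_gb,
        "usable_gb_per_node": usable_gb_node,
        "min_shards": min_shards,
        "heap_constrained_shards": heap_constrained,
        "final_daily_shards": final_daily,
        "total_cluster_shards": total_shards,
    }

# Assumed sample inputs; replace with measured values.
plan = shard_plan(daily_gb=10240, retention_days=30, replicas=1, overhead=1.3,
                  ram_gb=64, disk_tb=8, shards_per_gb_heap=20, target_shard_gb=50)
for name, value in plan.items():
    print(f"{name}: {value}")
```

Returning a dictionary keyed by row name keeps each intermediate value visible, which makes it easier to compare against the Value column of the calculator.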

Cluster Recommendations

Data Nodes

Storage and shard requirements

Master Nodes

Cluster coordination

Ingest Nodes

Data processing pipelines

Coordinating Nodes

Query handling

Total Nodes

Minimum cluster size

Total Storage

0 TB

Raw storage required

Critical Rule: Never exceed 32GB heap! Split into more nodes instead of scaling RAM vertically.

Heap-Driven Shard Adjustment

Node RAM	Heap Allocation	Max Shards/Node (20/GB)	Max Daily Data Before Adding Shards
32GB	16GB	320	16TB
64GB	32GB	640	32TB
128GB	32GB	640	32TB
256GB	32GB	640	32TB
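
A short sketch of how a row of this table can be derived. It assumes the last column is simply max shards per node multiplied by a 50 GB target shard size and expressed in decimal TB; that reading is inferred from the numbers in the table, and the helper name `heap_row` is hypothetical.

```python
def heap_row(ram_gb, shards_per_gb_heap=20, target_shard_gb=50):
    """Return (heap GB, max shards per node, max data in TB) for one RAM size."""
    heap_gb = min(ram_gb / 2, 32)                  # half of RAM, capped at 32 GB
    max_shards = int(heap_gb * shards_per_gb_heap)
    # Assumption: capacity = shards x target shard size, in decimal TB.
    max_data_tb = max_shards * target_shard_gb / 1000
    return heap_gb, max_shards, max_data_tb

for ram in (32, 128, 256):
    print(ram, heap_row(ram))
```

Rows above 64 GB of RAM come out identical because the heap cap, not the RAM size, is the limiting term.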

Metric Explanations

1. Daily Data Volume (GB)

The total amount of raw, uncompressed data ingested into the system per day, including source fields and indexing overhead.

Measure uncompressed data at ingest
Include all fields (_source + indexing overhead)
Calculate 30-day peak, not average
Tool: Run GET _cat/allocation?v on a test cluster

2. Retention Period (Days)

The number of days data is stored before it's eligible for archival or deletion, influencing storage lifecycle tiers (hot/warm/cold).

Hot tier: Data actively queried (3-7 days)
Warm tier: Older data (SSD/HDD hybrid)
Cold/Frozen: Archival (object storage)

ILM Policy Example:
PUT _ilm/policy/logs
{
  "policy": {
    "phases": {
      "hot": {"min_age": "0d", "actions": {"rollover": {"max_size": "50gb"}}},
      "warm": {"min_age": "7d", "actions": {"forcemerge": {"max_num_segments": 1}}}
    }
  }
}

3. Replica Factor (Count)

1=HA, 2=Production

The number of copies of each data shard to ensure high availability and fault tolerance; each replica increases storage usage.

Required for high availability
Provides failover during node outages
Enables parallel query execution
Cost: Doubles storage (1 replica) or triples (2 replicas)

4. Overhead Multiplier

1 + (segment_merge_% + os_reserve_%)

A multiplier that accounts for additional storage usage from OS reserves and segment merging, varying by disk type.

Overhead Type	SSD	HDD
Segment Merges	15%	30%
OS Reserve	15%	20%
Total	30%	50%

Defaults: 1.3 (SSD), 1.5 (HDD)
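
As a quick check of the defaults, the multiplier formula can be evaluated directly. This is a minimal sketch; the function name `overhead_multiplier` is hypothetical and the percentages are the ones from the table above.

```python
def overhead_multiplier(segment_merge_pct, os_reserve_pct):
    """1 + (segment_merge_% + os_reserve_%), with percentages given as whole numbers."""
    return 1 + (segment_merge_pct + os_reserve_pct) / 100

print("SSD:", overhead_multiplier(15, 15))
print("HDD:", overhead_multiplier(30, 20))
```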

5. CPU Cores per Node (Count)

vcpus × (1 - hyperthreading_discount)
Conservative: 8 cores = 8 vCPUs

The number of effective CPU cores per node, dictating processing capacity for indexing and queries.

Node Type	Min Cores	Recommended
Data	8	16-32
Ingest	4	8-16
Coordinating	4	8-16
Master	2	4

Avoid >64 vCPUs - leads to thread contention!

6. Data Node RAM (GB)

The total memory per node, with half allocated to the Java heap (≤32 GB).

Heap ≤ 32GB (Java compressed pointers threshold)
50% RAM to heap, 50% to OS/filesystem cache
Minimum: 8GB RAM (test), 64GB RAM (production)

Never use swap memory for heap!

7. Disk Size per Data Node (TB)

disk_tb × 0.8

The total physical storage per node, with 80% typically allocated for usable Elasticsearch data.

Disk Type	Max Size	RAID Config
SATA SSD	8TB	RAID 0
NVMe SSD	4TB	None
HDD	16TB	RAID 10

Avoid >90% disk usage; Prefer 4×2TB NVMe over 1×8TB SATA for throughput

8. Shards per GB Heap

20 + (5 × storage_type_bonus)

The maximum number of index shards that can be supported per GB of Java heap memory.

Storage Type	Bonus Value	Resulting Shards/GB	Calculation
HDD	0	20	20 + (5×0) = 20
SATA SSD	1	25	20 + (5×1) = 25
NVMe SSD	2	30	20 + (5×2) = 30

Notes:

1 shard = ~2MB heap metadata (indexing + search)
Conservative scaling: Max Shards/Node = Heap_GB × 20
Aggressive scaling (SSD-only): Max Shards/Node = Heap_GB × 25
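
The bonus table and the per-node limit can be combined as follows. This is an illustrative sketch only; the dictionary and function names are hypothetical, and the bonus values are taken from the table above.

```python
# Bonus values from the storage type table.
STORAGE_TYPE_BONUS = {"HDD": 0, "SATA SSD": 1, "NVMe SSD": 2}

def shards_per_gb_heap(storage_type):
    """20 + (5 x storage_type_bonus)."""
    return 20 + 5 * STORAGE_TYPE_BONUS[storage_type]

def max_shards_per_node(heap_gb, storage_type):
    """Heap_GB x shards per GB of heap for the given storage type."""
    return heap_gb * shards_per_gb_heap(storage_type)

for storage in STORAGE_TYPE_BONUS:
    print(storage, shards_per_gb_heap(storage), max_shards_per_node(32, storage))
```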

9. Target Shard Size (GB)

MIN(50, MAX(30, daily_data_gb / 20))
Ideal: 30-50GB

The ideal size range for each shard (30–50 GB), balancing performance and manageability.

Scenario	Shard Size
Time-series logs	50GB
Search-heavy	30GB
Vector DB	10GB
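
The clamping formula for this metric is easy to misread, so here is a minimal sketch of it. The function name `target_shard_size_gb` is hypothetical; the scenario table above overrides the formula for workloads such as vector search.

```python
def target_shard_size_gb(daily_data_gb):
    """MIN(50, MAX(30, daily_data_gb / 20)): scale with volume, clamp to 30-50 GB."""
    return min(50, max(30, daily_data_gb / 20))

for daily in (100, 800, 10240):
    print(daily, "GB/day ->", target_shard_size_gb(daily), "GB per shard")
```

Low volumes settle at the 30 GB floor and anything above 1,000 GB/day reaches the 50 GB ceiling, so only the range in between actually scales.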

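Enforcement example (the 50GB rollover threshold is defined in the ILM policy; the index only references the policy and its rollover alias):
PUT logs-000001
{
  "settings": {
    "index.lifecycle.name": "logs",
    "index.lifecycle.rollover_alias": "logs"
  },
  "aliases": { "logs": { "is_write_index": true } }
}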

10. Peak Ingestion Rate (evt/sec)

The highest event rate (per second) the system needs to handle during ingestion.

Node Type	Events/Sec/Core
Ingest-Optim	50,000
Data Node	20,000
Coordinating	30,000

Scaling Tip: 1 ingest node (8 cores) handles 400K evt/sec with no pipelines and default mapping

11. Peak Query Load (QPS)

concurrent_users × queries_per_user

The maximum number of queries the cluster must support per second.

Query Type	QPS/Core
Match_all	15,000
Term Aggregation	5,000
KNN Search	500

Optimization:

Increase coordinating nodes for search-heavy loads
Use shard request cache for repeated queries

12. Heap Size per Node

MIN(node_ram_gb / 2, 32)

The amount of JVM heap allocated per node, capped at 32 GB.

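Elasticsearch recommends keeping JVM heap at or below 32GB so the JVM can keep using compressed object pointers
Allocate 50% of physical RAM to heap (e.g., 64GB RAM → 32GB heap)

Beyond roughly 32GB the JVM falls back to uncompressed pointers, so each object reference grows from 4 to 8 bytes and the extra heap is largely consumed by pointer overhead; larger heaps also lengthen garbage collection pauses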

13. Max Shards per Node (Heap)

heap_gb × shards_per_gb_heap

The upper limit on shards a single node can support.

Conservative: 20 shards/GB heap (default for HDD)
Aggressive: 25 shards/GB heap (SSD-optimized clusters)
Example: 32GB heap × 20 shards/GB = 640 shards/node max

Monitor with: GET _cat/allocation?v&h=node,shards

14. Total Raw Storage

daily_data_gb × retention_days × (1 + replica_factor) × overhead_multiplier

The total cluster storage required for all retained data.

Example: 10TB/day × 30 days × (1+1) × 1.3 = 780TB

Replica Factor: 1 (HA) or 2 (production)
Overhead Multiplier: 1.3 (SSD) or 1.5 (HDD)

15. Usable Storage per Node

(disk_tb × 1024) × 0.8

Effective disk space per node after reserving 20% for operational needs.

Reserve 20% disk space for segment merges, OS operations, and snapshots

Keep disk usage below 85%, the default low watermark above which Elasticsearch stops allocating new shards to the node

16. Data-Driven Min Shards

CEILING(daily_data_gb / target_shard_size_gb, 1)

The minimum number of daily shards based on the target shard size.

Target Shard Size: 30-50GB (sweet spot for query performance)
If daily data = 10TB (10,240GB): 10,240GB / 50GB/shard = 205 shards

Exception: Time-series data use ILM rollover at 50GB

17. Heap-Constrained Shards

CEILING(total_raw_storage_gb / target_shard_size_gb, 1)

The total number of shards required across the cluster.

Represents total shards cluster must handle (not daily)
Validates if heap can manage shard count

Example: 780TB raw storage (798,720GB) / 50GB/shard = 15,975 shards

18. Final Daily Shards

MAX(min_shards, CEILING(heap_constrained_shards / max_shards_per_node, 1))

The final computed number of shards per day.

Takes more restrictive value between data-driven and heap-driven estimates
Ensures neither shard size nor heap limits are violated

19. Total Cluster Shards

final_daily_shards × (1 + replica_factor) × retention_days

The cumulative shard count for the full retention window.

Replica Factor: 1 replica → 2x shards (1 primary + 1 replica)
Absolute Limits: ≤ 1,000 shards/node, ≤ 100,000 shards/cluster

Example: 205 daily shards × 2 × 30 days = 12,300 shards
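
The total and the two absolute limits can be checked together. This is a rough sketch under the limits stated above (1,000 shards per node, 100,000 per cluster); the function names are hypothetical.

```python
def total_cluster_shards(final_daily_shards, replica_factor, retention_days):
    """final_daily_shards x (1 + replica_factor) x retention_days."""
    return final_daily_shards * (1 + replica_factor) * retention_days

def check_limits(total_shards, data_nodes,
                 per_node_limit=1000, cluster_limit=100_000):
    """Return a list of violated limits (empty when the plan fits)."""
    problems = []
    if total_shards > cluster_limit:
        problems.append("cluster shard limit exceeded")
    if total_shards / data_nodes > per_node_limit:
        problems.append("per-node shard limit exceeded")
    return problems

total = total_cluster_shards(205, 1, 30)
print(total, check_limits(total, data_nodes=20))
```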

20. Data Nodes (Storage)

CEILING(total_raw_storage_gb / usable_storage_per_node, 1)

The number of data nodes required to meet total storage needs.

Based purely on storage capacity requirements
Uses total raw storage including replicas and overhead

Critical: Must be ≥ actual storage needed at retention period end

21. Data Nodes (Shards)

CEILING(total_cluster_shards / max_shards_per_node, 1)

The number of data nodes needed based on shard capacity.

Based on heap memory limitations for shard management
Prevents shard overload that causes node crashes

Rule: Always round up to whole nodes

22. Data Nodes (Final)

MAX(data_nodes_storage, data_nodes_shards)

The greater of the storage-based or shard-based node estimates.

Ensures both storage capacity AND heap limits are satisfied
Example: MAX(122, 20) = 122 nodes (780TB / 6.4TB usable on an 8TB disk = 122; 12,300 shards / 640 shards per node = 20)

Optimization: Add 10% buffer for growth
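
The three data node figures and the growth buffer can be computed in one pass. This is a minimal sketch; the function name `data_nodes` is hypothetical, and the sample inputs (8 TB disks, 640 shards per node, 12,300 total shards) are assumptions chosen for illustration.

```python
import math

def data_nodes(total_raw_gb, usable_gb_per_node,
               total_cluster_shards, max_shards_per_node, growth_buffer=0.10):
    """Return (storage-driven, shard-driven, final, final with growth buffer)."""
    by_storage = math.ceil(total_raw_gb / usable_gb_per_node)
    by_shards = math.ceil(total_cluster_shards / max_shards_per_node)
    final = max(by_storage, by_shards)
    buffered = math.ceil(final * (1 + growth_buffer))
    return by_storage, by_shards, final, buffered

# Assumed inputs: 798,720 GB raw storage, 8 TB disks at 80% usable.
print(data_nodes(798720, 8 * 1024 * 0.8, 12300, 640))
```

Keeping the storage-driven and shard-driven counts separate shows which constraint is binding, which matters when deciding between larger disks and more heap.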

23. Ingest Nodes

CEILING(peak_ingestion_rate / (50,000 × cores_per_node), 1)

Dedicated nodes that handle document preprocessing before indexing.

50,000 events/sec/core benchmark for medium-complexity pipelines
Scale up for heavy Grok parsing (+30%) or enrichment lookups (+50%)

Default: 1 node minimum even if calculation <1
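
A sketch of the ingest node formula with the pipeline adjustments applied. It assumes the "+30%" and "+50%" figures mean the node requirement is multiplied by 1.3 and 1.5; that interpretation, the `PIPELINE_FACTOR` table, and the function name are not part of the workbook.

```python
import math

# Assumption: pipeline complexity scales the required capacity linearly.
PIPELINE_FACTOR = {"baseline": 1.0, "heavy_grok": 1.3, "enrichment": 1.5}

def ingest_nodes(peak_events_per_sec, cores_per_node,
                 pipeline="baseline", per_core_rate=50_000):
    """CEILING(peak_ingestion_rate / (50,000 x cores_per_node)), minimum 1."""
    base = peak_events_per_sec / (per_core_rate * cores_per_node)
    return max(1, math.ceil(base * PIPELINE_FACTOR[pipeline]))

for kind in PIPELINE_FACTOR:
    print(kind, ingest_nodes(1_000_000, 8, kind))
```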

24. Coordinating Nodes

CEILING(peak_query_load / (5,000 × cores_per_node), 1)

Nodes that serve as query routers and aggregators during search operations.

5,000 QPS/core for typical search/aggregation queries
Scale up for complex aggregations (+100%) or ML jobs (+200%)
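
The coordinating node count follows the same pattern. In this sketch the `load_factor` parameter is a hypothetical way to express the adjustments above (2.0 for "+100%", 3.0 for "+200%"); the workbook itself does not define such a parameter.

```python
import math

def coordinating_nodes(peak_qps, cores_per_node,
                       load_factor=1.0, per_core_qps=5_000):
    """CEILING(peak_query_load / (5,000 x cores_per_node)), scaled by load_factor."""
    return math.ceil(peak_qps * load_factor / (per_core_qps * cores_per_node))

print(coordinating_nodes(100_000, 8))                   # typical queries
print(coordinating_nodes(100_000, 8, load_factor=2.0))  # complex aggregations
```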

25. Master Nodes

if(data_nodes_final ≤ 20, 3, 5)

Responsible for cluster coordination, metadata management, and node discovery.

Always odd number (3,5,7) to prevent split-brain
Never mix roles - dedicated nodes only
For clusters >100 nodes: 7 masters
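
The formula and the large-cluster rule can be folded into one function. This sketch assumes the ">100 nodes" rule applies to the data node count, which the text does not state explicitly; the function names are hypothetical.

```python
def master_nodes(data_nodes_final):
    """3 masters up to 20 data nodes, 5 up to 100, 7 beyond (assumed thresholds)."""
    if data_nodes_final <= 20:
        return 3
    if data_nodes_final <= 100:
        return 5
    return 7

def quorum(masters):
    """Majority of master-eligible nodes needed to elect a master."""
    return masters // 2 + 1

for n in (12, 60, 122):
    m = master_nodes(n)
    print(n, "data nodes ->", m, "masters, quorum", quorum(m))
```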

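Critical Setting (Elasticsearch 6.x and earlier only):
discovery.zen.minimum_master_nodes: (master_nodes / 2) + 1

Elasticsearch 7.0 and later no longer use this setting; the cluster manages its voting quorum automatically, and cluster.initial_master_nodes is set only when bootstrapping a new cluster.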

26. Total Nodes

data_nodes_final + ingest_nodes + coordinating_nodes + master_nodes

The sum of all node types representing the minimum viable cluster size.

Minimum production cluster: 7 nodes (3 master + 3 data + 1 ingest/coordinating)
Always include 10-20% buffer for upgrades and failure recovery

27. Actual Shards per Node

total_cluster_shards / data_nodes_final

The average number of shards assigned to each data node.

Range	Status
100-500 shards/node	Optimal
>600 shards/node	Warning
>1,000 shards/node	Critical (risk of instability)

Monitor with: GET _cat/allocation?v&h=node,shards
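
The status table can be applied programmatically as shown below. The table leaves two ranges undefined (below 100 and between 500 and 600 shards per node); this sketch reports them as "Unclassified" rather than guessing. The function name is hypothetical.

```python
def shard_density_status(total_cluster_shards, data_nodes_final):
    """Classify average shards per data node using the ranges in the table."""
    per_node = total_cluster_shards / data_nodes_final
    if per_node > 1000:
        return "Critical"
    if per_node > 600:
        return "Warning"
    if 100 <= per_node <= 500:
        return "Optimal"
    return "Unclassified"  # range not covered by the table

for nodes in (122, 22, 20, 10):
    print(nodes, shard_density_status(12300, nodes))
```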

28. Max Shards per Node

heap_gb × shards_per_gb_heap

The maximum shard capacity a node can safely manage.

Conservative: 20 shards/GB heap
Aggressive: 25 shards/GB heap (SSD-only)
Adjust: Lower to 15 shards/GB if using heavy vector search

Example: 32GB heap × 20 = 640 shards/node max

29. Storage Utilization

total_raw_storage_gb / (data_nodes_final × usable_storage_per_node)

The ratio of used raw storage to available usable disk space.

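Threshold (default)	Effect
85% (low watermark)	No new shards are allocated to the node
90% (high watermark)	Elasticsearch starts relocating shards off the node
95% (flood stage)	Indices with a shard on the node are blocked for writes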

Target: 65-75% for headroom
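
A node count sized purely for capacity tends to land near 100% utilization, so it helps to compute how many nodes the 65-75% target implies. This is a sketch with hypothetical function names and assumed sample inputs (798,720 GB raw storage, 8 TB disks at 80% usable).

```python
import math

def storage_utilization(total_raw_gb, data_nodes_final, usable_gb_per_node):
    """total_raw_storage_gb / (data_nodes_final x usable_storage_per_node)."""
    return total_raw_gb / (data_nodes_final * usable_gb_per_node)

def within_target(utilization, low=0.65, high=0.75):
    return low <= utilization <= high

def nodes_for_target(total_raw_gb, usable_gb_per_node, target=0.75):
    """Smallest node count that keeps utilization at or below the target."""
    return math.ceil(total_raw_gb / (usable_gb_per_node * target))

usable = 8 * 1024 * 0.8
print(round(storage_utilization(798720, 122, usable), 3))
print(nodes_for_target(798720, usable))
```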

Watermark tuning (85% is already the default low watermark):
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%"
  }
}

30. Shard Size Compliance

IF((daily_data_gb / final_daily_shards) ≥ 30, "ok", "⚠️")

Validation ensuring each shard is at least 30 GB (ideally 30–50 GB).

Optimal: 30-50GB/shard
Consequences of small shards (<10GB):
- Metadata overhead up to 50% of heap
- Slower query performance
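
The compliance check reduces to a single comparison. A minimal sketch follows, with a hypothetical function name and the 30 GB floor from the formula above.

```python
def shard_size_compliance(daily_data_gb, final_daily_shards, min_shard_gb=30):
    """Return (status, average shard size in GB) for one day's data."""
    shard_gb = daily_data_gb / final_daily_shards
    status = "ok" if shard_gb >= min_shard_gb else "⚠️"
    return status, shard_gb

print(shard_size_compliance(10240, 205))  # about 50 GB per shard
print(shard_size_compliance(10240, 400))  # about 26 GB per shard
```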

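Fix oversharding (the source index must be read-only with a copy of every shard on one node, and the target shard count must be a factor of the source count):
POST /my_index/_shrink/my_new_index
{
  "settings": { "index.number_of_shards": 10 }
}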