Metric Explanations
1. Daily Data Volume (GB)
The total amount of raw, uncompressed data ingested into the system per day, including source fields and indexing overhead.
- Measure uncompressed data at ingest
- Include all fields (_source + indexing overhead)
- Calculate 30-day peak, not average
- Tool: Use GET _cat/allocation?v on test cluster
2. Retention Period (Days)
The number of days data is stored before it's eligible for archival or deletion, influencing storage lifecycle tiers (hot/warm/cold).
- Hot tier: Data actively queried (3-7 days)
- Warm tier: Older data (SSD/HDD hybrid)
- Cold/Frozen: Archival (object storage)
ILM Policy Example:
PUT _ilm/policy/logs {
"hot": {"min_age": "0d", "actions": {"rollover": {"max_size": "50gb"}}},
"warm": {"min_age": "7d", "actions": {"forcemerge": {"max_num_segments": 1}}}
}
3. Replica Factor (Count)
1=HA, 2=Production
The number of copies of each data shard to ensure high availability and fault tolerance; each replica increases storage usage.
- Required for high availability
- Provides failover during node outages
- Enables parallel query execution
- Cost: Doubles storage (1 replica) or triples (2 replicas)
4. Overhead Multiplier
1 + (segment_merge_% + os_reserve_%)
A multiplier that accounts for additional storage usage from OS reserves and segment merging, varying by disk type.
| Overhead Type |
SSD |
HDD |
| Segment Merges |
15% |
30% |
| OS Reserve |
15% |
20% |
| Total |
30% |
50% |
Defaults: 1.3 (SSD), 1.5 (HDD)
5. CPU Cores per Node (Count)
vcpus × (1 - hyperthreading_discount)
Conservative: 8 cores = 8 vCPUs
The number of effective CPU cores per node, dictating processing capacity for indexing and queries.
| Node Type |
Min Cores |
Recommended |
| Data |
8 |
16-32 |
| Ingest |
4 |
8-16 |
| Coordinating |
4 |
8-16 |
| Master |
2 |
4 |
Avoid >64 vCPUs - leads to thread contention!
6. Data Node RAM (GB)
The total physical storage per node, with 80% typically allocated for usable Elasticsearch data.
- Heap ≤ 32GB (Java compressed pointers threshold)
- 50% RAM to heap, 50% to OS/filesystem cache
- Minimum: 8GB RAM (test), 64GB RAM (production)
Never use swap memory for heap!
7. Disk Size per Data Node (TB)
disk_tb × 0.8
The total memory per node, with half allocated to the Java heap (≤32 GB).
| Disk Type |
Max Size |
RAID Config |
| SATA SSD |
8TB |
RAID 0 |
| NVMe SSD |
4TB |
None |
| HDD |
16TB |
RAID 10 |
Avoid >90% disk usage; Prefer 4×2TB NVMe over 1×8TB SATA for throughput
8. Shards per GB Heap
20 + (5 × storage_type_bonus)
The ideal size range for each shard (30–50 GB), balancing performance and manageability.
| Storage Type |
Bonus Value |
Resulting Shards/GB |
Calculation |
| HDD |
0 |
20 |
20 + (5×0) = 20 |
| SATA SSD |
1 |
25 |
20 + (5×1) = 25 |
| NVMe SSD |
2 |
30 |
20 + (5×2) = 30 |
Notes:
- 1 shard = ~2MB heap metadata (indexing + search)
- Conservative scaling: Max Shards/Node = Heap_GB × 20
- Aggressive scaling (SSD-only): Max Shards/Node = Heap_GB × 25
9. Target Shard Size (GB)
MIN(50, MAX(30, daily_data_gb / 20))
Ideal: 30-50GB
The maximum number of index shards that can be supported per GB of Java heap memory.
| Scenario |
Shard Size |
| Time-series logs |
50GB |
| Search-heavy |
30GB |
| Vector DB |
10GB |
Enforcement example:
PUT logs-000001 {
"settings": {
"index.lifecycle.rollover_alias": "logs",
"index.lifecycle.rollover_size": "50gb"
}
}
10. Peak Ingestion Rate (evt/sec)
The highest event rate (per second) the system needs to handle during ingestion.
| Node Type |
Events/Sec/Core |
| Ingest-Optim |
50,000 |
| Data Node |
20,000 |
| Coordinating |
30,000 |
Scaling Tip: 1 ingest node (8 cores) handles 400K evt/sec with no pipelines and default mapping
11. Peak Query Load (QPS)
concurrent_users × queries_per_user
The maximum number of queries the cluster must support per second.
| Query Type |
QPS/Core |
| Match_all |
15,000 |
| Term Aggregation |
5,000 |
| KNN Search |
500 |
Optimization:
- Increase coordinating nodes for search-heavy loads
- Use shard request cache for repeated queries
12. Heap Size per Node
MIN(node_ram_gb / 2, 32)
The amount of JVM heap allocated per node, capped at 32 GB.
- Elasticsearch recommends ≤32GB JVM heap due to Java pointer compression
- Allocate 50% of physical RAM to heap (e.g., 64GB RAM → 32GB heap)
Beyond 32GB, garbage collection efficiency drops sharply
13. Max Shards per Node (Heap)
heap_gb × shards_per_gb_heap
The upper limit on shards a single node can support.
- Conservative: 20 shards/GB heap (default for HDD)
- Aggressive: 25 shards/GB heap (SSD-optimized clusters)
- Example: 32GB heap × 20 shards/GB = 640 shards/node max
Monitor with: GET _nodes/stats/indices?filter_path=**.shards
14. Total Raw Storage
daily_data_gb × retention_days × (1 + replica_factor) × overhead_multiplier
The total cluster storage required for all retained data.
Example: 10TB/day × 30 days × (1+1) × 1.3 = 780TB
- Replica Factor: 1 (HA) or 2 (production)
- Overhead Multiplier: 1.3 (SSD) or 1.5 (HDD)
15. Usable Storage per Node
(disk_tb × 1000) × 0.8
Effective disk space per node after reserving 20% for operational needs.
- Reserve 20% disk space for segment merges, OS operations, and snapshots
Never exceed 85% disk usage (critical for cluster health)
16. Data-Driven Min Shards
CEILING(daily_data_gb / target_shard_size_gb, 1)
The minimum number of daily shards based on the target shard size.
- Target Shard Size: 30-50GB (sweet spot for query performance)
- If daily data = 10TB (10,240GB): 10,240GB / 50GB/shard = 205 shards
Exception: Time-series data use ILM rollover at 50GB
17. Heap-Constrained Shards
CEILING(total_raw_storage_gb / target_shard_size_gb, 1)
The total number of shards required across the cluster.
- Represents total shards cluster must handle (not daily)
- Validates if heap can manage shard count
Example: 780TB raw storage / 50GB/shard = 15,600 shards
18. Final Daily Shards
MAX(min_shards, CEILING(heap_constrained_shards / max_shards_per_node, 1))
The final computed number of shards per day.
- Takes more restrictive value between data-driven and heap-driven estimates
- Ensures neither shard size nor heap limits are violated
19. Total Cluster Shards
final_daily_shards × (1 + replica_factor) × retention_days
The cumulative shard count for the full retention window.
- Replica Factor: 1 replica → 2x shards (1 primary + 1 replica)
- Absolute Limits: ≤ 1,000 shards/node, ≤ 100,000 shards/cluster
Example: 205 daily shards × 2 × 30 days = 12,300 shards
20. Data Nodes (Storage)
CEILING(total_raw_storage_gb / usable_storage_per_node, 1)
The number of data nodes required to meet total storage needs.
- Based purely on storage capacity requirements
- Uses total raw storage including replicas and overhead
Critical: Must be ≥ actual storage needed at retention period end
21. Data Nodes (Shards)
CEILING(total_cluster_shards / max_shards_per_node, 1)
The number of data nodes needed based on shard capacity.
- Based on heap memory limitations for shard management
- Prevents shard overload that causes node crashes
Rule: Always round up to whole nodes
22. Data Nodes (Final)
MAX(data_nodes_storage, data_nodes_shards)
The greater of the storage-based or shard-based node estimates.
- Ensures both storage capacity AND heap limits are satisfied
- Example: MAX(122, 32) = 122 nodes
Optimization: Add 10% buffer for growth
23. Ingest Nodes
CEILING(peak_ingestion_rate / (50,000 × cores_per_node), 1)
Dedicated nodes that handle document preprocessing before indexing.
- 50,000 events/sec/core benchmark for medium-complexity pipelines
- Scale up for heavy Grok parsing (+30%) or enrichment lookups (+50%)
Default: 1 node minimum even if calculation <1
24. Coordinating Nodes
CEILING(peak_query_load / (5,000 × cores_per_node), 1)
Nodes that serve as query routers and aggregators during search operations.
- 5,000 QPS/core for typical search/aggregation queries
- Scale up for complex aggregations (+100%) or ML jobs (+200%)
25. Master Nodes
if(data_nodes_final ≤ 20, 3, 5)
Responsible for cluster coordination, metadata management, and node discovery.
- Always odd number (3,5,7) to prevent split-brain
- Never mix roles - dedicated nodes only
- For clusters >100 nodes: 7 masters
Critical Setting:
discovery.zen.minimum_master_nodes: (master_nodes / 2) + 1
26. Total Nodes
data_nodes_final + ingest_nodes + coordinating_nodes + master_nodes
The sum of all node types representing the minimum viable cluster size.
- Minimum production cluster: 7 nodes (3 master + 3 data + 1 ingest/coordinating)
- Always include 10-20% buffer for upgrades and failure recovery
27. Actual Shards per Node
total_cluster_shards / data_nodes_final
The average number of shards assigned to each data node.
| Range |
Status |
| 100-500 shards/node |
Optimal |
| >600 shards/node |
Warning |
| >1,000 shards/node |
Critical (risk of instability) |
Monitor with: GET _cat/allocation?v&h=node,shards
28. Max Shards per Node
heap_gb × shards_per_gb_heap
The maximum shard capacity a node can safely manage.
- Conservative: 20 shards/GB heap
- Aggressive: 25 shards/GB heap (SSD-only)
- Adjust: Lower to 15 shards/GB if using heavy vector search
Example: 32GB heap × 20 = 640 shards/node max
29. Storage Utilization
total_raw_storage_gb / (data_nodes_final × usable_storage_per_node)
The ratio of used raw storage to available usable disk space.
| Threshold |
Effect |
| 85% |
Read-only mode activated |
| 90% |
Shard relocation stops |
Target: 65-75% for headroom
Autoremediation:
PUT _cluster/settings {
"transient": {
"cluster.routing.allocation.disk.watermark.low": "85%"
}
}
30. Shard Size Compliance
IF((daily_data_gb / final_daily_shards) ≥ 30, "ok", "⚠️")
Validation ensuring each shard is at least 30 GB (ideally 30–50 GB).
- Optimal: 30-50GB/shard
-
Consequences of small shards (<10GB):
- Metadata overhead up to 50% of heap
- Slower query performance
Fix oversharding:
POST /my_index/_shrink/my_new_index {
"settings": { "index.number_of_shards": 10 }
}