Our team is currently managing log data in Elasticsearch/OpenSearch and we're facing a significant challenge with warm migrations due to our high daily data volume. We're looking for insights and best practices from the community.
Our Current Setup:

- Index segmentation: Daily logs in separate indices, e.g., `mytopic1-2025-05-29`, `mytopic1-2025-05-30`, `mytopic2-2025-05-29`, ...
- Data volume: ~400 GB/day/index.
- Lifecycle: Hot (1 day) -> Warm (6 days) -> Delete.
Core Problem:

- Warm migration failures: migrating a single daily index to the warm tier takes excessively long and frequently fails due to insufficient disk space on the hot nodes.
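For reference, our lifecycle roughly follows a policy like the sketch below (OpenSearch ISM syntax; the `warm_migration` action and exact timings are our assumption of how this is typically written, adapt to ILM phases as needed):

```
{
  "policy": {
    "description": "Hot (1 day) -> Warm (6 days) -> Delete",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "1d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          { "warm_migration": {} }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [
          { "delete": {} }
        ]
      }
    ]
  }
}
```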
We'd appreciate the community's insights on:

- Between increasing the shard count and implementing `rollover`, which solution is more recommended and scalable for our needs? Are there other, better approaches we should consider?
- With a `rollover` strategy, how can we efficiently and intuitively handle date-based queries for logs?
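To make the second question concrete: with rollover, we assume date scoping would have to move from the index name into the query itself, something like this sketch (searching through the alias; the `@timestamp` field name is illustrative):

```
GET mytopic1-alias/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2025-05-30T00:00:00Z",
        "lt":  "2025-05-31T00:00:00Z"
      }
    }
  }
}
```

Is this the recommended pattern, or is there a more index-aware way to avoid fanning the search out to every physical index behind the alias?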
Thank you for your time and expertise!
Details about Solutions We're Considering:
- Increasing the number of shards per daily index:
  - Idea: Spread the 400 GB across more shards, reducing the disk space each individual `force merge` needs.
  - Concerns:
    - Will this truly solve the problem, or just spread out the peak disk pressure? If multiple shards on a single node merge concurrently, will the total disk pressure still be too high?
    - If shards from different indices on a single node merge concurrently (e.g., `mytopic1-2025-05-29` and `mytopic2-2025-05-29`), will that increase the failure rate?
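If we go the shard-count route, we would presumably set it via an index template along these lines (the count of 8, i.e. roughly 50 GB per primary shard for a 400 GB day, is just an illustrative number):

```
PUT _index_template/mytopic1-logs
{
  "index_patterns": ["mytopic1-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 8,
      "index.number_of_replicas": 1
    }
  }
}
```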
- Implementing a `rollover` strategy with an alias (using ISM/ILM):
  - Idea: Use a write alias (e.g., `mytopic1-alias`). The rollover policy splits indices by size (e.g., 30-50 GB) or age (1 day), creating smaller physical indices (`mytopic1-alias-000001`, etc.).
  - Concerns:
    - Search complexity: Our current habit is to query daily indices directly, e.g., `mytopic1-2025-05-30`. If index names become `mytopic1-00000X`, will querying or operating on specific dates become difficult?
    - Metadata/JVM/search cost: Does a higher total number of physical indices inherently increase cluster metadata, JVM, or search coordination costs?
    - Automation: We are also considering adding the date to the alias, but would we then need to manually initialize the first index or manage date-based aliases daily?
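On the automation concern, our understanding is that each topic would need a one-time bootstrap of its first index behind the write alias, roughly like the sketch below (names and thresholds are illustrative):

```
# One-time bootstrap: create the first index behind the write alias
PUT mytopic1-alias-000001
{
  "aliases": {
    "mytopic1-alias": { "is_write_index": true }
  }
}

# The size/age split would then live in the policy's rollover action, e.g.:
# { "rollover": { "min_primary_shard_size": "50gb", "min_index_age": "1d" } }
```

If we also encode the date in the alias name, does this bootstrap have to be re-run per day, or can a template or policy handle it?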