Seeking Best Practices - Addressing Warm Migration with High Daily Log Volume

Our team is currently managing log data in Elasticsearch/OpenSearch and we're facing a significant challenge with warm migrations due to our high daily data volume. We're looking for insights and best practices from the community.

Our Current Setup:

  • Index Segmentation: Daily logs in separate indices, e.g., mytopic1-2025-05-29, mytopic1-2025-05-30, mytopic2-2025-05-29....

  • Data Volume: ~400GB/day/index.

  • Lifecycle: Hot (1 day) -> Warm (6 days) -> Delete.

  • Core Problem: Warm migration failures. The warm migration of a single daily index takes excessively long and frequently fails because of insufficient disk space on the Hot nodes (the lifecycle above is sketched as an ISM policy right after this list).
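For reference, that lifecycle would typically be expressed as an ISM policy along these lines. This is only a sketch assuming OpenSearch ISM with UltraWarm (as the later replies indicate); the policy name, index pattern and priority are placeholders:

```
PUT _plugins/_ism/policies/mytopic-lifecycle
{
  "policy": {
    "description": "hot 1 day -> warm 6 days -> delete (7 days total)",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "1d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [ { "warm_migration": {} } ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ],
        "transitions": []
      }
    ],
    "ism_template": [
      { "index_patterns": ["mytopic*"], "priority": 100 }
    ]
  }
}
```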

We'd appreciate the community's insights on:

  • Between increasing the shard count and implementing rollover, which solution is more recommended and scalable for our needs? Are there other, better approaches we should consider?

  • With a rollover strategy, how can we efficiently and intuitively handle date-related queries for logs? (One possible query pattern is sketched below.)
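For illustration, a date-bounded query against rollover-generated indices could look like the sketch below; the @timestamp field name and index pattern are assumptions, not taken from the setup above:

```
GET mytopic1-*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2025-05-29T00:00:00Z",
        "lt":  "2025-05-30T00:00:00Z"
      }
    }
  }
}
```

Shards whose indexed time range cannot match such a filter are generally skipped in the pre-filter phase, so a wide index pattern plus a range query usually stays close in cost to targeting a single daily index.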

Thank you for your time and expertise!

Details about Solutions We're Considering:

  1. Increasing the number of shards per daily index:
  • Idea: Break down 400GB into more shards, reducing individual force merge disk needs.
  • Concerns:
    • Will this truly solve the problem, or just spread out the peak disk pressure?
    • If multiple shards on a single node merge concurrently (e.g., from both mytopic1-2025-05-29 and mytopic2-2025-05-29), will the total disk pressure still be too high and the failure rate increase?
  2. Implementing a rollover strategy with an alias (using ISM/ILM):
  • Idea: Use a write alias (e.g., mytopic1-alias). A rollover policy creates a new backing index once a size (e.g., 30-50GB) or age (1 day) condition is met, producing smaller physical indices (mytopic1-alias-000001, etc.); a bootstrap sketch follows this list.
  • Concerns:
    • Search complexity: Our current query habit is to directly query daily indices like mytopic1-2025-05-30. If index names become mytopic1-00000X, will querying or operating on specific dates become difficult?
    • Metadata/JVM/Search cost: Does a higher total number of physical indices inherently increase cluster metadata, JVM, or search coordinating node costs?
    • Automation: We are also considering including the date in the alias, but will we then need to manually initialize the first index or manage date-based aliases daily?
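To make option 2 concrete, here is a minimal bootstrap sketch assuming OpenSearch ISM; all names (mytopic1-roll-*, mytopic1-alias, the template name) and the shard count are placeholders, not recommendations:

```
PUT _index_template/mytopic1-roll-template
{
  "index_patterns": ["mytopic1-roll-*"],
  "template": {
    "settings": {
      "number_of_shards": 8,
      "plugins.index_state_management.rollover_alias": "mytopic1-alias"
    }
  }
}

PUT mytopic1-roll-000001
{
  "aliases": {
    "mytopic1-alias": { "is_write_index": true }
  }
}
```

The hot state of the ISM policy would then carry a rollover action such as { "rollover": { "min_size": "50gb", "min_index_age": "1d" } }. Only this first index has to be created manually to establish the write alias; every subsequent index is created by the rollover action itself, so no daily manual step or date-based alias management is needed, and reads can go through the mytopic1-roll-* pattern (or a separate read alias) with a time-range filter as sketched earlier.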

Are you using Elasticsearch or OpenSearch? Even though they share heritage, they have diverged significantly over time. Which version?

My initial observation is that an overall retention period of 7 days is exceptionally short for a hot-warm architecture, so a cluster with only a single node type may be more appropriate in order to avoid migrations altogether.

What is the hardware specification of the different node types? How much data does each node type hold on average?

We are currently using OpenSearch 2.5, with a daily log volume of approximately 20TB. Here are some of our specifications.

Availability Zone(s) - 3-AZ without standby
Instance type - m6g.4xlarge.search
Number of data nodes - 16
Storage type - AWS EBS
EBS volume type - General Purpose (SSD) - gp3
EBS volume size (GiB) - 3072
Provisioned IOPS - 9216 IOPS
Provisioned Throughput (MiB/s) - 250 MiB/s
Warm storage instance type - ultrawarm1.medium.search
Number of warm nodes - 30
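For context, a rough back-of-envelope assuming one replica (the replica count is not stated above): 16 data nodes × 3,072 GiB is roughly 48 TiB of hot storage, while ~20 TB/day of ingest becomes ~40 TB on disk per hot day with a replica. That leaves comparatively little headroom for the temporary space the force merge during UltraWarm migration needs, which would be consistent with the disk-space failures described earlier.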

We actually don't query data older than one day very often. Given that, we currently migrate data older than one day to warm storage as a more cost-effective approach. We tried skipping warm migration entirely, but it seems our cluster's disks are even less able to handle the data volume that way.

That setup is very, very different from Elasticsearch and includes custom AWS components. It is therefore not something that is supported here. I would recommend you consult the OpenSearch forum or AWS Support.

For hot nodes I always recommend storage-optimized instance types with ephemeral storage, e.g. the i4 series (or newer variants) on AWS. As I do not use the OpenSearch service I am not sure whether that is available. gp3 EBS IMHO does not sound like a good option for that level of ingest.

This is not the right forum for OpenSearch. There is some discussion about right-sizing and indexing patterns on OpenSearch here if you are interested: OpenSearch Tutorial for Data & Platform Engineers

As a general rule, you should always favour rollover / data streams over daily indices, and the number of shards should be dictated by indexing pressure (how many docs are ingested per second on average or at the highest peak).
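As a rough, purely illustrative calculation: at ~400 GB of primary data per index per day, a single primary shard rolling at ~50 GB would produce about eight generations a day, while eight primary shards rolling at ~400 GB total would produce one generation a day with ~50 GB shards; the shard count is then chosen from measured docs/sec and node count rather than from the calendar.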
