ECK Elasticsearch (9.0.1) Pod Stuck - 'Running' but Never 'Ready' (Local Storage)
Hello Elasticsearch Community,
I'm facing a strange issue with a simple Elasticsearch deployment using ECK on my on-premise, bare-metal Kubernetes cluster. I'm trying to deploy Elasticsearch 9.0.1 with local storage, but the pod consistently gets stuck in the Running state without ever becoming Ready.
I have successfully deployed similar setups in the past.
This problem is consistent in both single-node and multi-node setups:
- Single-Node Cluster (count: 1): The single quickstart-es-default-0 pod starts, but never becomes Ready.
- Three-Node Cluster (count: 3): Two nodes reach the Ready state, but one node consistently remains Running (not Ready).

Here's a summary of my setup and the troubleshooting steps I've taken so far:
Setup:
- Kubernetes: On-prem self-hosted
- ECK Version: 3.0.0
- Elasticsearch Version: 9.0.1
- Storage: Local PersistentVolume on a specific node, mounted via storageClassName (a sketch of the manifests is below)
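For reference, the deployment is essentially the ECK quickstart pointed at a pre-created local PersistentVolume. A minimal sketch of the manifests I'm applying (the node name, capacity, and storageClassName value here are illustrative, not my exact values; the host path matches the one referenced further down):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage                # illustrative name
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-es-pv
spec:
  capacity:
    storage: 10Gi                    # illustrative size
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/data/test-es-pv
  nodeAffinity:                      # pins the PV to the node that owns the disk
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["worker-1"]   # placeholder node name
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 9.0.1
  nodeSets:
    - name: default
      count: 1                       # same symptom with count: 3
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data   # claim name ECK expects for the data volume
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: local-storage
            resources:
              requests:
                storage: 10Gi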
- Pod logs hang:
When tailing kubectl logs -f for the problematic pod, the output consistently stops at the following lines, and no further output is written to stdout/stderr by the container:
compressed ordinary object pointers [true]", "ecs.version":
{"@timestamp":"2025-06-04T12:31:25.645Z", "log.level": "INFO", "message":"Registered local node features [ES_V_8, ES_V_9, cluster.reroute.ignores_metric_param, cluster.stats.source_modes, linear_retriever_supported, lucene_10_1_upgrade, lucene_10_upgrade, security.queryable_built_in_roles, simulate.ignored.fields]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"main","log.logger":"org.elasticsearch.features.FeatureService","elasticsearch.node.name":"quickstart-es-default-0","elasticsearch.cluster.name":"quickstart"}
{"@timestamp":"2025-06-04T12:31:25.685Z", "log.level": "INFO", "message":"Updated default factory retention to [null]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"main","log.logger":"org.elasticsearch.cluster.metadata.DataStreamGlobalRetentionSettings","elasticsearch.node.name":"quickstart-es-default-0","elasticsearch.cluster.name":"quickstart"}
{"@timestamp":"2025-06-04T12:31:25.685Z", "log.level": "INFO", "message":"Updated max factory retention to [null]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"main","log.logger":"org.elasticsearch.cluster.metadata.DataStreamGlobalRetentionSettings","elasticsearch.node.name":"quickstart-es-default-0","elasticsearch.cluster.name":"quickstart"}
On the pods that do become Ready, the logs continue past the point where the problematic pod hangs, e.g.:
ry.RecoverySettings","elasticsearch.node.name":"cortex-index-es-default-0","elasticsearch.cluster.name":"cortex-index"}
{"@timestamp":"2025-06-05T06:56:30.145Z", "log.level": "INFO", "message":"Registered local node features [ES_V_8, ES_V_9, cluster.reroute.ignores_metric_param, cluster.stats.source_modes, linear_retriever_supported, lucene_10_1_upgrade, lucene_10_upgrade, security.queryable_built_in_roles, simulate.ignored.fields]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"main","log.logger":"org.elasticsearch.features.FeatureService","elasticsearch.node.name":"cortex-index-es-default-0","elasticsearch.cluster.name":"cortex-index"}
{"@timestamp":"2025-06-05T06:56:30.180Z", "log.level": "INFO", "message":"Updated default factory retention to [null]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"main","log.logger":"org.elasticsearch.cluster.metadata.DataStreamGlobalRetentionSettings","elasticsearch.node.name":"cortex-index-es-default-0","elasticsearch.cluster.name":"cortex-index"}
{"@timestamp":"2025-06-05T06:56:30.181Z", "log.level": "INFO", "message":"Updated max factory retention to [null]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"main","log.logger":"org.elasticsearch.cluster.metadata.DataStreamGlobalRetentionSettings","elasticsearch.node.name":"cortex-index-es-default-0","elasticsearch.cluster.name":"cortex-index"}
{"@timestamp":"2025-06-05T06:56:30.590Z", "log.level": "INFO", "message":"[controller/106] [Main.cc@123] controller (64 bit): Version 9.0.1 (Build 5ac89bc732bee2) Copyright (c) 2025 Elasticsearch BV", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"ml-cpp-log-tail-thread","log.logger":"org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler","elasticsearch.node.name":"cortex-index-es-default-0","elasticsearch.cluster.name":"cortex-index"}
{"@timestamp":"2025-06-05T06:56:31.041Z", "log.level": "INFO", "message":"OTel ingest plugin is enabled", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"main","log.logger":"org.elasticsearch.xpack.oteldata.OTelPlugin","elasticsearch.node.name":"cortex-index-es-default-0","elasticsearch.cluster.name":"cortex-index"}
Pod User & Host Permissions:
- Elasticsearch container runs as uid=1000(elasticsearch) gid=1000(elasticsearch).
- The local storage path (/mnt/data/test-es-pv/) on the host is owned by UID 1000 / GID 1000.
- Added securityContext: { runAsUser: 1000, fsGroup: 1000 } to the pod template (see the sketch after this list).
- Readiness probe: The probe checks port 8080 (the readiness port configured in elasticsearch.yml), but it fails because Elasticsearch never gets far enough to listen on that port.
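For clarity, the securityContext mentioned above is set via the podTemplate in the nodeSet; a sketch of just that fragment (values as in my setup):

nodeSets:
  - name: default
    count: 1
    podTemplate:
      spec:
        securityContext:
          runAsUser: 1000   # matches the elasticsearch user in the container
          fsGroup: 1000     # matches the owner of /mnt/data/test-es-pv on the host

The fsGroup is there so the mounted data path ends up group-accessible to the elasticsearch user.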
I'm at a bit of a loss, as I can't find any clue in the logs as to why Elasticsearch won't start…
Am I doing something completely wrong? Is there a way to get additional logs from the pod to understand what has happened?