I'm encountering a reproducible issue where Fleet Server maintains persistent connections to Elasticsearch nodes in a way that prevents the nodes from shutting down cleanly. This causes 2–3 minute hangs during shutdown, even in a redundant HA setup (i.e., multiple ES nodes configured on a single Fleet output).
The basic issue is:
- When an Elasticsearch node is shut down or restarted, the shutdown hangs for 2–3 minutes, waiting for these Fleet Server connections to close.
- If the `elastic-agent` service on the Fleet Server is restarted during this time, the Elasticsearch node shuts down immediately. This confirms Fleet Server's persistent connection is preventing Elasticsearch from completing its shutdown.
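To make the symptom concrete, here is a minimal, generic sketch of the mechanism I suspect: an HTTP keep-alive connection that stays open on the server side while idle, and is only released once the client closes it. This is not Fleet Server or Elasticsearch code. The names `Handler`, `open_connections`, and `demo` are hypothetical, and the local test server only stands in for an ES node.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical counter of connections the stand-in server currently holds open.
open_connections = 0
lock = threading.Lock()


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enables keep-alive (persistent) connections

    def setup(self):
        # Called once per TCP connection, before any request is handled.
        global open_connections
        super().setup()
        with lock:
            open_connections += 1

    def finish(self):
        # Called once per TCP connection, after the peer has disconnected.
        global open_connections
        with lock:
            open_connections -= 1
        super().finish()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def demo(idle_seconds=0.5):
    """Return (connections open while the client is idle, after client close)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    client = http.client.HTTPConnection("127.0.0.1", server.server_port)
    client.request("GET", "/")
    client.getresponse().read()

    # The client is now idle, but its connection is still held open.
    time.sleep(idle_seconds)
    while_idle = open_connections

    # Equivalent in spirit to restarting the client: the connection is released.
    client.close()
    deadline = time.time() + 5
    while open_connections and time.time() < deadline:
        time.sleep(0.05)
    after_close = open_connections

    server.shutdown()
    server.server_close()
    return while_idle, after_close


if __name__ == "__main__":
    print(demo())
```

A server that waits for open connections to drain before exiting would stay blocked for as long as the idle client holds its connection, which matches what I see until `elastic-agent` is restarted.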
This is a problem because it breaks HA:
- This stack has redundant Fleet Servers and ingest nodes to allow rolling maintenance.
- But a single Fleet Server can block shutdown of a connected Elasticsearch node.
- Worse, because multiple Fleet Servers are running and holding open connections to all ES nodes, all of the Fleet Servers must be restarted before the ES node can shut down, defeating the purpose of redundancy.
- I can't find any configuration to prevent this happening.
Compare this behaviour to Logstash with an Elasticsearch output:
- The connection is being used to stream events and is never idle.
- On ES shutdown, Logstash immediately disconnects and carries on with the remaining ES nodes configured in its output.
Is there any way I can work around this in config, i.e., have Fleet Server release connections to ES nodes that are trying to shut down?
I really think this behaviour should be raised as a bug on GitHub, but the `fleet-server` repo directs me here first.