Fleet Server preventing Elasticsearch node shutdown due to persistent HTTP connections

I'm encountering a reproducible issue where Fleet Server maintains persistent connections to Elasticsearch nodes in a way that prevents the nodes from shutting down cleanly. This causes 2–3 minute hangs during shutdown, even in a redundant HA setup (i.e., multiple ES nodes configured on a single Fleet output).

The basic issue is:

  • When an Elasticsearch node is shut down or restarted, the shutdown hangs for 2–3 minutes, waiting for these Fleet Server connections to close.
  • If the elastic-agent service on the Fleet Server is restarted during this time, the Elasticsearch node shuts down immediately. This confirms Fleet Server's persistent connection is preventing Elasticsearch from completing its shutdown.

This is a problem because it breaks HA:

  • This stack has redundant Fleet Servers and ingest nodes to allow rolling maintenance.
  • But a single Fleet Server can block shutdown of a connected Elasticsearch node.
  • Worse, because multiple Fleet Servers are running and holding open connections to all ES nodes, all of the Fleet Servers must be restarted before the ES node can shut down, defeating the purpose of redundancy.
  • I can't find any configuration to prevent this happening.

Compare this behaviour to Logstash with an Elasticsearch output:

  • The connection is being used to stream events and is never idle.
  • On ES shutdown, Logstash immediately disconnects and carries on with the remaining ES nodes configured in its output.

Is there any way I can work around this in config? I.e., have Fleet Server release connections to ES nodes that are trying to shut down.

I really think this behaviour should be raised as a bug on GitHub, but the fleet-server repo directs me here first.

How are you shutting down the ES nodes?

Normally a service manager (such as systemd) has a much lower timeout than 2–3 minutes.

Hi there, I'm using systemd with the packaged service definition, which contains:

$ cat /usr/lib/systemd/system/elasticsearch.service
# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0
...
# Built for packages-8.17.4 (packages)

The fleet-server uses our go-elasticsearch library for communication with Elasticsearch.

The issue may be that we set idleconnectionstimeout to our HTTP polling duration; see internal/pkg/config/output.go in the elastic/fleet-server repo on GitHub, at commit a2d98b83a7b519722302a2c5a2405bcb51002d0b.
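To illustrate what an idle connection timeout means for a Go HTTP client in general, here is a minimal sketch using only the standard library's net/http. It is not fleet-server's actual code: the helper name newTransport, the 5 minute value, and the MaxIdleConnsPerHost setting are assumptions made up for the example. It only shows that pooled keep-alive connections stay open for as long as IdleConnTimeout allows unless the client closes them explicitly.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newTransport is a hypothetical helper (not from fleet-server). It returns a
// transport that keeps idle keep-alive connections in its pool until they have
// been unused for idleTimeout.
func newTransport(idleTimeout time.Duration) *http.Transport {
	return &http.Transport{
		MaxIdleConnsPerHost: 2, // assumed value, for illustration only
		IdleConnTimeout:     idleTimeout,
	}
}

func main() {
	// Assumed stand-in for a long-poll duration; the real value may differ.
	pollTimeout := 5 * time.Minute

	tr := newTransport(pollTimeout)
	client := &http.Client{Transport: tr}
	_ = client // requests made through client would reuse pooled connections

	fmt.Println("idle connections may stay open for up to", tr.IdleConnTimeout)

	// CloseIdleConnections closes pooled connections that are currently idle.
	// It does not interrupt requests that are still in flight.
	tr.CloseIdleConnections()
}
```

If the idle timeout is as long as the polling interval, a pooled connection can be reused before it ever expires, so from the server's side it may look permanently open. Whether that fully explains the hang described above is still an open question.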

Can you make an issue in the fleet-server repo with instructions on how to recreate?

No problem - I've created #4905

As for reproducing the issue: My Fleet setup is one of the few things I don't have config-managed, and I don't really have anywhere to test the required steps to reproduce. Hopefully the fact it's occurring for me in two separate environments is enough to go on.

Thanks 🙂
