Fleet Server preventing Elasticsearch node shutdown due to persistent HTTP connections

ceekay · April 9, 2025, 3:35am

I'm encountering a reproducible issue where Fleet Server maintains persistent connections to Elasticsearch nodes in a way that prevents the nodes from shutting down cleanly. This causes 2–3 minute hangs during shutdown, even in a redundant HA setup (i.e., multiple ES nodes configured on a single Fleet output).

The basic issue is:

When an Elasticsearch node is shut down or restarted, the shutdown hangs for 2–3 minutes, waiting for these Fleet Server connections to close.
If the elastic-agent service on the Fleet Server is restarted during this time, the Elasticsearch node shuts down immediately. This confirms Fleet Server's persistent connection is preventing Elasticsearch from completing its shutdown.

This is a problem because it breaks HA:

This stack has redundant Fleet Servers and ingest nodes to allow rolling maintenance.
But a single Fleet Server can block shutdown of a connected Elasticsearch node.
Worse, because multiple Fleet Servers are running and holding open connections to all ES nodes, all of the Fleet Servers must be restarted before the ES node can shut down, defeating the purpose of redundancy.
I can't find any configuration to prevent this happening.

Compare this behaviour to Logstash with an Elasticsearch output:

The connection is being used to stream events and is never idle.
On ES shutdown, Logstash immediately disconnects and carries on with the remaining ES nodes configured in its output.

Is there any way I can work around this in config? I.e., have Fleet Server release connections to ES nodes that are trying to shut down.

I really think this behaviour should be raised as a bug on GitHub, but the fleet-server repo directs me here first.

MichelLaterman · April 23, 2025, 4:33pm

How are you shutting down the ES nodes?

Normally a service manager (such as systemd) has a much lower timeout than 2–3 minutes.

ceekay · April 27, 2025, 9:59pm

Hi there, I'm using systemd with the packaged service definition, which contains:

$ cat /usr/lib/systemd/system/elasticsearch.service

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0
...
# Built for packages-8.17.4 (packages)

MichelLaterman · May 7, 2025, 3:22pm

The fleet-server uses our go-elasticsearch library for communication with Elasticsearch.

The issue may be that we set idleconnectionstimeout to our HTTP polling duration: fleet-server/internal/pkg/config/output.go at a2d98b83a7b519722302a2c5a2405bcb51002d0b · elastic/fleet-server · GitHub

Can you make an issue in the fleet-server repo with instructions on how to recreate?

ceekay · May 9, 2025, 1:02am

No problem - I've created #4905

As for reproducing the issue: My Fleet setup is one of the few things I don't have config-managed, and I don't really have anywhere to test the required steps to reproduce. Hopefully the fact it's occurring for me in two separate environments is enough to go on.

Thanks
