GitLab: gitlab-shell timeouts, and /metrics Connection refused

04/25/2023

After running our self-hosted GitLab in production for a while, we ran into a bug: during git clone/pull/push operations, a request would sometimes hang for 1-2 minutes.

It looked like an intermittent ("floating") bug: the same operation could work fine five times in a row, and then hang once.

The issues

gitlab-shell timeouts

For example, one time git clone works well:

[simterm]

$ time git clone [email protected]:example/platform/tables-api.git
Cloning into 'tables-api'...
...
real    0m1.380s

[/simterm]

And then a clone of the same repository takes 2 minutes:

[simterm]

$ time git clone [email protected]:example/platform/tables-api.git
Cloning into 'tables-api'...
...
real    2m10.497s

[/simterm]

It didn't look like a network issue, but rather like something with SSH at the session establishment and key exchange stage.
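A quick way to see which SSH stage is slow is to run the clone with verbose SSH client logging. This is a generic debugging sketch, assuming the standard OpenSSH client; `<repo-url>` is a placeholder for your repository address:

```shell
# GIT_SSH_COMMAND tells git which SSH command to use for transport.
# With -vvv the SSH client prints each stage (TCP connect, key exchange,
# authentication) as it happens, so the stage where the output pauses
# is the stage that hangs.
# <repo-url> is a placeholder - substitute your repository address.
GIT_SSH_COMMAND="ssh -vvv" git clone <repo-url>
```

If the pause happens after "Connecting to ..." but before the key exchange completes, the problem is on the server side of the SSH session rather than in the network path.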

Fortunately, I didn't have to dig too deep, because first I decided to fix the problem with the metrics, so that I could see in monitoring what was happening with GitLab Shell.

gitlab-shell /metrics endpoint Connection refused

I already mentioned this issue when describing the monitoring setup in the GitLab: monitoring – Prometheus, metrics, and Grafana dashboard post: Git/SSH metrics could not be collected from the gitlab-shell pod.

It looked like this: forward port 9122 of the gitlab-shell Pod (the port is set in the chart's values):

[simterm]

$ kk -n gitlab-cluster-prod port-forward gitlab-cluster-prod-gitlab-shell-744675c985-5t8wn 9122

[/simterm]

Try it with curl:

[simterm]

$ curl localhost:9122/metrics
curl: (52) Empty reply from server

[/simterm]

And the Pod reports "Connection refused":

[simterm]

...
Handling connection for 9122
E0315 12:40:43.712508  826225 portforward.go:407] an error occurred forwarding 9122 -> 9122: error forwarding port 9122 to pod 51856f9224907d4c1380783e46b13069ef5322ae1f286d4301f90a2ed60483c0, uid : exit status 1: 2023/03/15 10:40:43 socat[28712] E connect(5, AF=2 127.0.0.1:9122, 16): Connection refused
E0315 12:40:43.713039  826225 portforward.go:233] lost connection to pod

[/simterm]
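The "Connection refused" here is the key detail: unlike a timeout or an empty reply, it means nothing is listening on port 9122 inside the pod at all. As a generic illustration (not GitLab-specific), bash's built-in `/dev/tcp` redirection makes this easy to check; port 59122 is an arbitrary port assumed to be free on the local machine:

```shell
# Try to open a TCP connection to the port. If no process has bound it,
# the kernel answers with RST and the connect fails immediately -
# that is exactly what "Connection refused" means.
port=59122
if (exec 3<>/dev/tcp/127.0.0.1/"$port") 2>/dev/null; then
  echo "port $port: listener present"
else
  echo "port $port: connection refused (no listener)"
fi
```

So the metrics exporter was simply not running inside the gitlab-shell container, which pointed at the container's configuration rather than at the network.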

The solution

As it turned out, GitLab Shell supports two SSH daemons, openssh and gitlab-sshd, and openssh is the default; see the chart's values:

...
## Allow to select ssh daemon that would be executed inside container
## Possible values: openssh, gitlab-sshd
sshDaemon: openssh
...

So, update our values:

...
    gitlab-shell:
      enabled: true
      metrics:
        enabled: true
      sshDaemon: gitlab-sshd
...
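Then roll the change out. A sketch, assuming the GitLab chart is managed with Helm directly and reusing the release and namespace names seen earlier in the post; the values file name is hypothetical, so adjust everything to your installation:

```shell
# Hypothetical release name, namespace, and values file - adjust to your setup.
# This upgrades the existing GitLab release with the modified values,
# which recreates the gitlab-shell pods with sshDaemon: gitlab-sshd.
helm upgrade gitlab-cluster-prod gitlab/gitlab \
  -n gitlab-cluster-prod \
  -f gitlab-values.yaml
```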

Deploy and check the metrics:

[simterm]

$ curl localhost:9122/metrics
# HELP gitlab_build_info Current build info for this GitLab Service
# TYPE gitlab_build_info gauge
gitlab_build_info{built="20230309.174623",version="v14.17.0"} 1
# HELP gitlab_shell_gitaly_connections_total Number of Gitaly connections that have been established
# TYPE gitlab_shell_gitaly_connections_total counter
gitlab_shell_gitaly_connections_total{status="ok"} 2
...

[/simterm]

The issue with the timeouts has also been solved: now a clone takes no longer than a second – real    0m0.846s.