After we started running our self-hosted GitLab in production, we ran into a bug: git
clone/pull/push operations sometimes hung for 1-2 minutes.
It was an intermittent, “floating” bug: a request could work fine five times in a row and then hang once.
The issue: gitlab-shell timeouts
For example, one git clone runs fine:
[simterm]
$ time git clone [email protected]:example/platform/tables-api.git
Cloning into 'tables-api'...
...
real 0m1.380s
[/simterm]
And then a clone of the same repository takes 2 minutes:
[simterm]
$ time git clone [email protected]:example/platform/tables-api.git
Cloning into 'tables-api'...
...
real 2m10.497s
[/simterm]
It didn’t look like a network issue, but rather like something in SSH at the session establishment and key exchange stage.
Fortunately, I didn’t have to dig too deep, because I first decided to fix the problem with the metrics, so that I could see in monitoring what was happening with GitLab Shell.
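Because the hang is intermittent, it helps to repeat the same operation several times and record how long each attempt takes. Below is a minimal sketch of such a loop in POSIX shell. The function name `time_runs`, its argument order, and the "slow" threshold are assumptions made for illustration, not part of GitLab or git.

```shell
# time_runs ATTEMPTS SLOW_AFTER COMMAND [ARGS...]
# Runs COMMAND ATTEMPTS times and prints the wall-clock duration of each run.
# A run that takes SLOW_AFTER seconds or more is marked SLOW.
# (Hypothetical helper: the name and the output format are illustrative.)
time_runs() {
  attempts=$1
  slow_after=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    start=$(date +%s)
    "$@" >/dev/null 2>&1          # command output is discarded, only timing matters
    status=$?
    end=$(date +%s)
    elapsed=$((end - start))      # whole seconds, enough to spot a 2-minute hang
    if [ "$elapsed" -ge "$slow_after" ]; then
      verdict=SLOW
    else
      verdict=ok
    fi
    echo "attempt $i: ${elapsed}s exit=$status $verdict"
    i=$((i + 1))
  done
}
```

For example, `time_runs 20 10 git ls-remote "$REPO_URL"` would open 20 SSH sessions to the repository and flag any that take 10 seconds or longer, where `REPO_URL` is a placeholder for the repository address. `git ls-remote` is used here instead of `git clone` only because it goes through the same SSH session setup without leaving a working copy to clean up.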
gitlab-shell /metrics endpoint: Connection refused
I already mentioned the metrics issue when I described the monitoring setup in the GitLab: monitoring – Prometheus, metrics, and Grafana dashboard post: there was a problem with the Git/SSH metrics from the gitlab-shell Pod.
It looked like this: forward port 9122 (see values) from the Pod:
[simterm]
$ kk -n gitlab-cluster-prod port-forward gitlab-cluster-prod-gitlab-shell-744675c985-5t8wn 9122
[/simterm]
Try it with curl:
[simterm]
$ curl localhost:9122/metrics
curl: (52) Empty reply from server
[/simterm]
And the port-forward reports “Connection refused” from the Pod:
[simterm]
...
Handling connection for 9122
E0315 12:40:43.712508 826225 portforward.go:407] an error occurred forwarding 9122 -> 9122: error forwarding port 9122 to pod 51856f9224907d4c1380783e46b13069ef5322ae1f286d4301f90a2ed60483c0, uid : exit status 1: 2023/03/15 10:40:43 socat[28712] E connect(5, AF=2 127.0.0.1:9122, 16): Connection refused
E0315 12:40:43.713039 826225 portforward.go:233] lost connection to pod
[/simterm]
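Note that curl itself reports exit code 52 (empty reply) rather than 7 (failed to connect): the local port-forward listener accepts the connection, and the refusal happens only inside the Pod. A small sketch of a helper that turns these curl exit codes into a hint is shown below. The function name `explain_curl_exit` and the wording of the messages are assumptions for illustration; only the numeric codes come from curl.

```shell
# explain_curl_exit CODE
# Prints a short hint for the curl exit codes relevant to this check.
# (Hypothetical helper; only the numeric codes are defined by curl.)
explain_curl_exit() {
  case "$1" in
    0)  echo "ok: the endpoint answered" ;;
    7)  echo "connect failed: nothing accepted the connection on this host and port" ;;
    52) echo "empty reply: the connection was accepted, but no HTTP response came back" ;;
    *)  echo "curl failed with exit code $1" ;;
  esac
}
```

It could be used as `curl -s -o /dev/null localhost:9122/metrics; explain_curl_exit $?` while the port-forward is running.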
The solution
As it turned out, GitLab Shell supports two SSH daemons, openssh and gitlab-sshd, and the metrics endpoint is served only by gitlab-sshd, while openssh is the default value, see values:
...
## Allow to select ssh daemon that would be executed inside container
## Possible values: openssh, gitlab-sshd
sshDaemon: openssh
...
So, update our values:
...
gitlab-shell:
  enabled: true
  metrics:
    enabled: true
  sshDaemon: gitlab-sshd
...
Deploy and check the metrics:
[simterm]
$ curl localhost:9122/metrics
# HELP gitlab_build_info Current build info for this GitLab Service
# TYPE gitlab_build_info gauge
gitlab_build_info{built="20230309.174623",version="v14.17.0"} 1
# HELP gitlab_shell_gitaly_connections_total Number of Gitaly connections that have been established
# TYPE gitlab_shell_gitaly_connections_total counter
gitlab_shell_gitaly_connections_total{status="ok"} 2
...
[/simterm]
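To check a single metric without reading the whole output, the text format can be filtered with awk. This is a minimal sketch: the function name `metric_value` is made up for illustration, and it assumes the label values contain no spaces, which holds for the series shown above.

```shell
# metric_value SERIES
# Reads Prometheus text-format metrics on stdin and prints the value of the
# sample whose name and labels match SERIES exactly.
# (Hypothetical helper; assumes no spaces inside label values.)
metric_value() {
  awk -v series="$1" '$1 == series { print $2 }'
}
```

For example, `curl -s localhost:9122/metrics | metric_value 'gitlab_shell_gitaly_connections_total{status="ok"}'` prints only the counter value. Comment lines starting with `#` are skipped automatically, because their first field never equals the series name.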
The timeout issue was solved as well: a clone now takes no longer than 1 second, for example real 0m0.846s.