We run GitHub Runners in Kubernetes to build and deploy our Backend API – see GitHub Actions: running the Actions Runner Controller in Kubernetes.
But over time, we noticed that there was too much traffic on the NAT Gateway – see VictoriaLogs: a Grafana dashboard for AWS VPC Flow Logs – migrating from Grafana Loki.
The issue: Traffic on the AWS NAT Gateway
When we started investigating, we found an interesting detail:
Here, 40.8 gigabytes of data passed through the NAT Gateway in an hour, 40.7 of which was Ingress.
Out of these 40 GB, there are three Remote IPs at the top, each of which sent us almost 10 GB of traffic (the table on the bottom left of the screenshot above).
The top Remote IPs are:
Remote IP          Value     Percent
------------------------------------
20.60.6.4          10.6 GB   28%
20.150.90.164      9.79 GB   26%
20.60.6.100        8.30 GB   22%
185.199.111.133    2.06 GB   5%
185.199.108.133    1.89 GB   5%
185.199.110.133    1.78 GB   5%
185.199.109.133    1.40 GB   4%
140.82.114.4       805 MB    2%
146.75.28.223      705 MB    2%
54.84.248.61       267 MB    1%
And at the top of the Kubernetes traffic, we have four Kubernetes Pod IPs:
Source IP           Pod IP         Value     Percent
-----------------------------------------------------
20.60.6.4       =>  10.0.43.98     1.54 GB   14%
20.60.6.100     =>  10.0.43.98     1.49 GB   14%
20.60.6.100     =>  10.0.42.194    1.09 GB   10%
20.150.90.164   =>  10.0.44.162    1.08 GB   10%
20.60.6.4       =>  10.0.44.208    1.03 GB   9%
And all of these IPs belong to GitHub Runner Pods; the “kraken” in the name just marks the runners for builds and deploys of our kraken project, the Backend:
The next step is even more interesting: if you check the IP https://20.60.6.4, you will see a hostname:
*.blob.core.windows.net???
What? I was very surprised, because we build a Python app, and there are no libraries from Microsoft in it. But then I had an idea: since we use pip and Docker caching in GitHub Actions for the Backend API builds, this is most likely GitHub’s cache storage, and it’s from there that we pull these caches into Kubernetes (it is – see the Communication requirements for GitHub-hosted runners and GitHub).
A similar check for 185.199.111.133 and 140.82.114.4 shows us *.github.io, and 54.84.248.61 is athena.us-east-1.amazonaws.com.
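By the way, a quick way to check such IPs yourself is to look at the TLS certificate the address serves (that’s where the *.blob.core.windows.net name comes from), or do a reverse DNS lookup – just the standard openssl/dig tools, nothing specific to our setup:

# print the certificate subject served by the IP
$ echo | openssl s_client -connect 20.60.6.4:443 2>/dev/null | openssl x509 -noout -subject

# reverse DNS can also help, though not every IP has a PTR record
$ dig -x 185.199.111.133 +short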
So, what we decided to do was to run a local cache in Kubernetes with Sonatype Nexus, and use it as a proxy for PyPI.org and for Docker Hub images.
We’ll talk about Docker caching next time, but for now, we will:
- test Nexus locally with Docker on a work machine
- run Nexus in Kubernetes from a Helm chart
- configure and test the PyPI cache for builds
- and see the results
Nexus: testing locally with Docker
Run Nexus:
$ docker run -ti --rm --name nexus -p 8081:8081 sonatype/nexus3
Wait a few minutes: Nexus is Java-based, so it takes a while to start.
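To see when it’s ready, you can follow the container logs and wait for the startup message (the exact wording and version number depend on the image):

# follow the logs until Nexus reports something like "Started Sonatype Nexus OSS 3.x.x"
$ docker logs -f nexus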
Get the admin password:
$ docker exec -ti nexus cat /nexus-data/admin.password
6221ad20-0196-4771-b1c7-43df355c2245
In a browser, go to http://localhost:8081 and log in:
If you haven’t done this in the Setup wizard, then go to Security > Anonymous access, and allow connections without authentication:
Adding a pypi (proxy) repository
Go to Settings > Repositories, click Create repository:
Select the pypi (proxy) type:
Create a repository:
- Name: pypi-proxy
- Remote storage: https://pypi.org
- Blob store: default
At the bottom, click Create repository.
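For reference, the same repository could also be created without the UI, via the Nexus REST API – a rough sketch with the same parameters (the admin password is a placeholder; the endpoint for a PyPI proxy is /service/rest/v1/repositories/pypi/proxy, check the API reference in your Nexus version):

# create the PyPI proxy repository via the REST API instead of the UI
$ curl -u admin:<admin-password> -X POST \
  'http://localhost:8081/service/rest/v1/repositories/pypi/proxy' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "pypi-proxy",
    "online": true,
    "storage": {"blobStoreName": "default", "strictContentTypeValidation": false},
    "proxy": {"remoteUrl": "https://pypi.org", "contentMaxAge": 1440, "metadataMaxAge": 1440},
    "negativeCache": {"enabled": true, "timeToLive": 1440},
    "httpClient": {"blocked": false, "autoBlock": true}
  }'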
Let’s check what data we have now in the default Blob storage – go to the Nexus container:
$ docker exec -ti nexus bash
bash-4.4$
And look at the /nexus-data/blobs/default/content/ directory – now it’s empty:
bash-4.4$ ls -l /nexus-data/blobs/default/content/
total 8
drwxr-xr-x 3 nexus nexus 4096 Nov 27 11:02 directpath
drwxr-xr-x 2 nexus nexus 4096 Nov 27 11:02 tmp
Testing the Nexus PyPI cache
Now let’s check if our proxy cache is working.
Find the IP address of the Nexus container:
$ docker inspect nexus | jq '.[].NetworkSettings.IPAddress'
"172.17.0.2"
Run another container with Python:
$ docker run -ti --rm python bash
root@addeba5d307c:/#
And execute pip install --index-url http://172.17.0.2:8081/repository/pypi-proxy/simple setuptools --trusted-host 172.17.0.2
root@addeba5d307c:/# time pip install --index-url http://172.17.0.2:8081/repository/pypi-proxy/simple setuptools --trusted-host 172.17.0.2
Looking in indexes: http://172.17.0.2:8081/repository/pypi-proxy/simple
Collecting setuptools
  Downloading http://172.17.0.2:8081/repository/pypi-proxy/packages/setuptools/75.6.0/setuptools-75.6.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 81.7 MB/s eta 0:00:00
Installing collected packages: setuptools
Successfully installed setuptools-75.6.0
...
real    0m2.595s
...
We can see that the package was downloaded through the proxy, and the whole install took 2.59 seconds.
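To avoid passing --index-url and --trusted-host on every call, the same settings can also be put into pip’s own config file – a small sketch, assuming the same Nexus container IP as above:

# write a pip config so the Nexus proxy is used by default
# (/etc/pip.conf is the global config; ~/.config/pip/pip.conf would be the per-user one)
root@addeba5d307c:/# cat <<EOF > /etc/pip.conf
[global]
index-url = http://172.17.0.2:8081/repository/pypi-proxy/simple
trusted-host = 172.17.0.2
EOF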
Let’s see what the default Blob storage in Nexus looks like now:
bash-4.4$ ls -l /nexus-data/blobs/default/content/
total 20
drwxr-xr-x 3 nexus nexus 4096 Nov 27 11:02 directpath
drwxr-xr-x 2 nexus nexus 4096 Nov 27 11:21 tmp
drwxr-xr-x 3 nexus nexus 4096 Nov 27 11:21 vol-05
drwxr-xr-x 3 nexus nexus 4096 Nov 27 11:21 vol-19
drwxr-xr-x 3 nexus nexus 4096 Nov 27 11:21 vol-33
We have some data there now, okay.
Let’s test pip again – first, let’s uninstall the installed package:
root@addeba5d307c:/# pip uninstall setuptools
And install it again, but now add the --no-cache-dir to avoid using the local cache in the container:
root@5dc925fe254f:/# time pip install --no-cache-dir --index-url http://172.17.0.2:8081/repository/pypi-proxy/simple setuptools --trusted-host 172.17.0.2
Looking in indexes: http://172.17.0.2:8081/repository/pypi-proxy/simple
Collecting setuptools
  Downloading http://172.17.0.2:8081/repository/pypi-proxy/packages/setuptools/75.6.0/setuptools-75.6.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 942.9 MB/s eta 0:00:00
Installing collected packages: setuptools
Successfully installed setuptools-75.6.0
...
real    0m1.589s
Now it took 1.59 seconds instead of 2.59.
Good, looks like everything works.
Let’s run Nexus on Kubernetes.
Running Nexus in Kubernetes
There is a chart called stevehipwell/nexus3.
You can write the manifests yourself, or you can try this chart.
What might be interesting to us from the chart’s values:
- config.anonymous.enabled: Nexus will work locally in Kubernetes with access only via a ClusterIP, so while it is a PoC and purely a PyPI cache, we can run it without authentication
- config.blobStores: you can leave it as it is for now, but later you can connect a dedicated EBS volume or AWS Elastic File System, see also persistence.enabled
- config.job.tolerations and nodeSelector: if you need to run it on a separate node, see Kubernetes: Pods and WorkerNodes to control the placement of pods on nodes
- config.repos: create repositories directly through values
- ingress.enabled: not our case, but it is possible
- metrics.enabled: later we can look at the monitoring
First, let’s set it up with the default parameters, then we’ll add our own values.
Add a repository:
$ helm repo add stevehipwell https://stevehipwell.github.io/helm-charts/
"stevehipwell" has been added to your repositories
Create a separate namespace ops-nexus-ns:
$ kk create ns ops-nexus-ns
namespace/ops-nexus-ns created
Install the chart:
$ helm -n ops-nexus-ns upgrade --install nexus3 stevehipwell/nexus3
It took about 5 minutes to launch, and I was thinking about dropping the chart and writing it myself, but eventually, it started. Well, Java – what can we do?
Let’s check what we have here:
$ kk -n ops-nexus-ns get all
NAME           READY   STATUS    RESTARTS   AGE
pod/nexus3-0   4/4     Running   0          6m5s

NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/nexus3      ClusterIP   172.20.160.147   <none>        8081/TCP   6m5s
service/nexus3-hl   ClusterIP   None             <none>        8081/TCP   6m5s

NAME                      READY   AGE
statefulset.apps/nexus3   1/1     6m6s
Add an Admin user password
Create a Kubernetes Secret with a password:
$ kk -n ops-nexus-ns create secret generic nexus-root-pass --from-literal=password=p@ssw0rd
secret/nexus-root-pass created
Write a nexus-values.yaml file, in which we set the name of the Kubernetes Secret and the key with the password, and enable Anonymous Access:
rootPassword:
  secret: nexus-root-pass
  key: password
config:
  enabled: true
  anonymous:
    enabled: true
Adding a repository to Nexus via Helm chart values
I had to do a bit of trial and error here, but it worked.
Also, in the chart’s values.yaml it says: “Repository configuration; based on the REST API (API reference docs require an existing Nexus installation and can be found at **Administration** under _System_ → _API_) but with `format` & `type` defined in the object.”
Let’s see the Nexus API specification – which fields are passed in the API request:
What about the format?
We can look at the Format and Type fields in some existing repository:
Describe the repository and other necessary parameters – for me, it looks like this:
rootPassword:
  secret: nexus-root-pass
  key: password

persistence:
  enabled: true
  storageClass: gp2-retain

resources:
  requests:
    cpu: 1000m
    memory: 1500Mi

config:
  enabled: true
  anonymous:
    enabled: true
  repos:
    - name: pip-cache
      format: pypi
      type: proxy
      online: true
      negativeCache:
        enabled: true
        timeToLive: 1440
      proxy:
        remoteUrl: https://pypi.org
        metadataMaxAge: 1440
        contentMaxAge: 1440
      httpClient:
        blocked: false
        autoBlock: true
        connection:
          retries: 0
          useTrustStore: false
      storage:
        blobStoreName: default
        strictContentTypeValidation: false
It’s a pretty simple setup, and I’ll do some tuning later if necessary. But it’s already working.
Let’s deploy it:
$ helm -n ops-nexus-ns upgrade --install nexus3 stevehipwell/nexus3 -f nexus-values.yaml
In case of errors like “Could not create repository“:
$ kk -n ops-nexus-ns logs -f nexus3-config-9-2cssf
Configuring Nexus3...
Configuring anonymous access...
Anonymous access configured.
Configuring blob stores...
Configuring scripts...
Script 'cleanup' updated.
Script 'task' updated.
Configuring cleanup policies...
Configuring repositories...
ERROR: Could not create repository 'pip-cache'.
Check the Nexus logs – Nexus wants almost all the fields to be passed; in this case, the config.repos.httpClient.contentMaxAge was missing:
nexus3-0:nexus3 2024-11-27 12:34:16,818+0000 WARN [qtp554755438-84] admin org.sonatype.nexus.siesta.internal.resteasy.ResteasyViolationExceptionMapper - (ID af473d22-3eca-49ea-adb9-c7985add27e7) Response: [400] '[ValidationErrorXO{id='PARAMETER strictContentTypeValidation', message='must not be null'}, ValidationErrorXO{id='PARAMETER negativeCache', message='must not be null'}, ValidationErrorXO{id='PARAMETER metadataMaxAge', message='must not be null'}, ValidationErrorXO{id='PARAMETER contentMaxAge', message='must not be null'}, ValidationErrorXO{id='PARAMETER httpClient', message='must not be null'}]'; mapped from: [PARAMETER]
During deployment, when we set the config.enabled=true parameter, the chart launches another Kubernetes Pod, which actually performs the Nexus configuration.
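That config Pod shows up next to the main nexus3-0 Pod (its name suffix is generated, so yours will differ):

# the configuration job Pod (nexus3-config-...) runs alongside the nexus3-0 StatefulSet Pod
$ kk -n ops-nexus-ns get pods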
Let’s check the access and the repository – open a local port:
$ kk -n ops-nexus-ns port-forward pod/nexus3-0 8082:8081
Forwarding from 127.0.0.1:8082 -> 8081
Forwarding from [::1]:8082 -> 8081
And go to http://localhost:8082/#admin/repository/repositories:
Nexus needs a lot of resources, especially memory, because, again, it’s Java:
Therefore, it makes sense to immediately set requests in the values.
Also, you can set JVM params:
...
# Environment:
#  INSTALL4J_ADD_VM_PARAMS: -Djava.util.prefs.userRoot=${NEXUS_DATA}/javaprefs -Xms1024m -Xmx1024m -XX:MaxDirectMemorySize=2048m
...
Testing Nexus in Kubernetes
Launch a Pod with Python:
$ kk run pod --rm -i --tty --image python bash
If you don't see a command prompt, try pressing enter.
root@pod:/#
Find a Kubernetes Service for Nexus:
$ kk -n ops-nexus-ns get svc
NAME        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
nexus3      ClusterIP   172.20.160.147   <none>        8081/TCP   78m
nexus3-hl   ClusterIP   None             <none>        8081/TCP   78m
Run pip install again:
root@pod:/# time pip install --index-url http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple setuptools --trusted-host nexus3.ops-nexus-ns.svc
Looking in indexes: http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple
Collecting setuptools
  Downloading http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/packages/setuptools/75.6.0/setuptools-75.6.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 86.3 MB/s eta 0:00:00
Installing collected packages: setuptools
Successfully installed setuptools-75.6.0
...
real    0m3.958s
It installed setuptools-75.6.0 in 3.95 seconds.
Let’s check it at http://localhost:8082/#browse/browse:pip-cache:
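The same can also be checked from the command line via the repository’s PyPI “simple” index – just a quick sanity check through the port-forward opened above:

# after the install above, the setuptools index page should be served by the proxy from its cache
$ curl -s http://localhost:8082/repository/pip-cache/simple/setuptools/ | head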
Remove setuptools from our Python Pod:
root@pod:/# pip uninstall setuptools
And install it again, again with the --no-cache-dir:
root@pod:/# time pip install --no-cache-dir --index-url http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple setuptools --trusted-host nexus3.ops-nexus-ns.svc
Looking in indexes: http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple
Collecting setuptools
  Downloading http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/packages/setuptools/75.6.0/setuptools-75.6.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 875.9 MB/s eta 0:00:00
Installing collected packages: setuptools
Successfully installed setuptools-75.6.0
...
real    0m2.364s
Now it took 2.364s.
The only thing left to do is to update GitHub Workflows – disable all caches there, and add the use of Nexus.
GitHub and the results for the AWS NAT Gateway traffic
I won’t go into detail about the GitHub Actions Workflow, because it’s different for everyone, but in short, I’ve disabled the pip caching:
... - name: "Setup: Python 3.10" uses: actions/setup-python@v5 with: python-version: "3.10" # cache: 'pip' check-latest: "false" # cache-dependency-path: "**/*requirements.txt" ...
This will save about 540 megabytes on downloading the archive with the cache for each Job run.
Next, we have a step that executes the pip install by calling make:
... - name: "Setup: Dev Dependencies" id: setup_dev_dependencies #run: make dev-python-requirements run: make dev-python-requirements-nexus shell: bash ...
And in the Makefile, I created a new task so that I could quickly revert to the old configuration:
...
dev-python-requirements:
	python3 -m pip install --no-compile -r dev-requirements.txt

dev-python-requirements-nexus:
	python3 -m pip install --index-url http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple --no-compile -r dev-requirements.txt --trusted-host nexus3.ops-nexus-ns.svc
...
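As an alternative (a sketch, not what ended up in our Workflow), the original make dev-python-requirements target could be left untouched and pip pointed at Nexus via environment variables on the step, since pip reads PIP_INDEX_URL and PIP_TRUSTED_HOST:

...
    - name: "Setup: Dev Dependencies"
      id: setup_dev_dependencies
      run: make dev-python-requirements
      shell: bash
      env:
        # pip picks these up instead of --index-url / --trusted-host
        PIP_INDEX_URL: http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple
        PIP_TRUSTED_HOST: nexus3.ops-nexus-ns.svc
...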
Then, in the Workflow, disable any caches like actions/cache:
...
    # - name: "Setup: Get cached api-generator images"
    #   id: api-generator-cache
    #   uses: actions/cache@v4
    #   with:
    #     path: ~/_work/api-generator-cache
    #     key: api-generator-cache
...
Let’s compare the results.
The build with the old configuration, without Nexus and with GitHub caches – the traffic of the Kubernetes Pod GitHub Runner that this build was running:
3.55 gigabytes of traffic; the build and deployment took 4 minutes and 11 seconds.
And the same GitHub Actions Job, but with the changes merged – using Nexus, and without GitHub caching.
We can see in the logs that the packages are indeed taken from Nexus:
And traffic used:
329 megabytes; the build and deployment took 4 minutes and 20 seconds.
And that’s it for now.
What will be done next is to see how Nexus can be monitored, what metrics it has and which of them can be used for alerts, and then to add a Docker cache as well, because we often run into Docker Hub limits – “429 Too Many Requests – Server message: toomanyrequests: You have reached your pull rate limit. You can increase the limit by authenticating and upgrading“.