We run GitHub Runners in Kubernetes to build and deploy our Backend API – see GitHub Actions: running the Actions Runner Controller in Kubernetes.
But over time, we noticed that there was too much traffic on the NAT Gateway – see VictoriaLogs: a Grafana dashboard for AWS VPC Flow Logs – migrating from Grafana Loki.
The issue: Traffic on the AWS NAT Gateway
When we started investigating, we found an interesting detail:
Here, 40.8 gigabytes of data passed through the NAT Gateway in an hour, 40.7 of which was Ingress.
Out of these 40 GB, there are three Remote IPs at the top, each of which sent us almost 10 GB of traffic (the table on the bottom left of the screenshot above).
The top Remote IPs are:
Remote IP          Value     Percent
------------------------------------
20.60.6.4          10.6 GB   28%
20.150.90.164      9.79 GB   26%
20.60.6.100        8.30 GB   22%
185.199.111.133    2.06 GB   5%
185.199.108.133    1.89 GB   5%
185.199.110.133    1.78 GB   5%
185.199.109.133    1.40 GB   4%
140.82.114.4       805 MB    2%
146.75.28.223      705 MB    2%
54.84.248.61       267 MB    1%
And at the top of the Kubernetes traffic, we have four Kubernetes Pod IPs:
Source IP           Pod IP         Value     Percent
-----------------------------------------------------
20.60.6.4       =>  10.0.43.98     1.54 GB   14%
20.60.6.100     =>  10.0.43.98     1.49 GB   14%
20.60.6.100     =>  10.0.42.194    1.09 GB   10%
20.150.90.164   =>  10.0.44.162    1.08 GB   10%
20.60.6.4       =>  10.0.44.208    1.03 GB   9%
And all of these IPs belong to GitHub Runner Pods; the “kraken” in the name just marks the runners for builds and deploys of our kraken project, the Backend:
The next step is even more interesting: if you check the IP https://20.60.6.4, you will see a hostname:
*.blob.core.windows.net???
What? I was very surprised, because we build a Python app, and there are no libraries from Microsoft in it. But then I had an idea: since we use pip and Docker caching in GitHub Actions for the Backend API builds, this is most likely GitHub’s cache storage, and it’s from there that we pull these caches into Kubernetes (it is – see the Communication requirements for GitHub-hosted runners and GitHub).
A similar check for 185.199.111.133 and 140.82.114.4 shows us *.github.io, and 54.84.248.61 is athena.us-east-1.amazonaws.com.
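By the way, a quick way to check such IPs yourself is to look at the TLS certificate the address serves (that’s where the *.blob.core.windows.net name comes from), or do a reverse DNS lookup – just the standard openssl/dig tools, nothing specific to our setup:

# print the certificate subject served by the IP
$ echo | openssl s_client -connect 20.60.6.4:443 2>/dev/null | openssl x509 -noout -subject

# reverse DNS can also help, though not every IP has a PTR record
$ dig -x 185.199.111.133 +short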
So, what we decided to do was to run a local cache in Kubernetes with Sonatype Nexus, and use it as a proxy for PyPI.org and for Docker Hub images.
We’ll talk about Docker caching next time, but for now, we will:
- test Nexus locally with Docker on a work machine
- run Nexus in Kubernetes from a Helm chart
- configure and test the PyPI cache for builds
- and see the results
Nexus: testing locally with Docker
Run Nexus:
$ docker run -ti --rm --name nexus -p 8081:8081 sonatype/nexus3
Wait a few minutes: Nexus is Java-based, so it takes a while to start.
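To see when it’s ready, you can follow the container logs and wait for the startup message (the exact wording and version number depend on the image):

# follow the logs until Nexus reports something like "Started Sonatype Nexus OSS 3.x.x"
$ docker logs -f nexus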
Get the admin password:
$ docker exec -ti nexus cat /nexus-data/admin.password
6221ad20-0196-4771-b1c7-43df355c2245
In a browser, go to http://localhost:8081 and log in:
If you haven’t done this in the Setup wizard, then go to Security > Anonymous access, and allow connections without authentication:
Adding a pypi (proxy) repository
Go to Settings > Repositories, click Create repository:
Select the pypi (proxy) type:
Create a repository:
- Name: pypi-proxy
- Remote storage: https://pypi.org
- Blob store: default
At the bottom, click Create repository.
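For reference, the same repository could also be created without the UI, via the Nexus REST API – a rough sketch with the same parameters (the admin password is a placeholder; the endpoint for a PyPI proxy is /service/rest/v1/repositories/pypi/proxy, check the API reference in your Nexus version):

# create the PyPI proxy repository via the REST API instead of the UI
$ curl -u admin:<admin-password> -X POST \
  'http://localhost:8081/service/rest/v1/repositories/pypi/proxy' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "pypi-proxy",
    "online": true,
    "storage": {"blobStoreName": "default", "strictContentTypeValidation": false},
    "proxy": {"remoteUrl": "https://pypi.org", "contentMaxAge": 1440, "metadataMaxAge": 1440},
    "negativeCache": {"enabled": true, "timeToLive": 1440},
    "httpClient": {"blocked": false, "autoBlock": true}
  }'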
Let’s check what data we have now in the default Blob storage – go to the Nexus container:
$ docker exec -ti nexus bash
bash-4.4$
And look at the /nexus-data/blobs/default/content/ directory – now it’s empty:
bash-4.4$ ls -l /nexus-data/blobs/default/content/
total 8
drwxr-xr-x 3 nexus nexus 4096 Nov 27 11:02 directpath
drwxr-xr-x 2 nexus nexus 4096 Nov 27 11:02 tmp
Testing the Nexus PyPI cache
Now let’s check if our proxy cache is working.
Find the IP address of the Nexus container:
$ docker inspect nexus | jq '.[].NetworkSettings.IPAddress'
"172.17.0.2"
Run another container with Python:
$ docker run -ti --rm python bash
root@addeba5d307c:/#
And execute pip install --index-url http://172.17.0.2:8081/repository/pypi-proxy/simple setuptools --trusted-host 172.17.0.2
root@addeba5d307c:/# time pip install --index-url http://172.17.0.2:8081/repository/pypi-proxy/simple setuptools --trusted-host 172.17.0.2
Looking in indexes: http://172.17.0.2:8081/repository/pypi-proxy/simple
Collecting setuptools
  Downloading http://172.17.0.2:8081/repository/pypi-proxy/packages/setuptools/75.6.0/setuptools-75.6.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 81.7 MB/s eta 0:00:00
Installing collected packages: setuptools
Successfully installed setuptools-75.6.0
...
real    0m2.595s
...
We can see that the package was downloaded through the proxy, and the whole install took 2.59 seconds.
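To avoid passing --index-url and --trusted-host on every call, the same settings can also be put into pip’s own config file – a small sketch, assuming the same Nexus container IP as above:

# write a pip config so the Nexus proxy is used by default
# (/etc/pip.conf is the global config; ~/.config/pip/pip.conf would be the per-user one)
root@addeba5d307c:/# cat <<EOF > /etc/pip.conf
[global]
index-url = http://172.17.0.2:8081/repository/pypi-proxy/simple
trusted-host = 172.17.0.2
EOF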
Let’s see what the default Blob storage in Nexus looks like now:
bash-4.4$ ls -l /nexus-data/blobs/default/content/
total 20
drwxr-xr-x 3 nexus nexus 4096 Nov 27 11:02 directpath
drwxr-xr-x 2 nexus nexus 4096 Nov 27 11:21 tmp
drwxr-xr-x 3 nexus nexus 4096 Nov 27 11:21 vol-05
drwxr-xr-x 3 nexus nexus 4096 Nov 27 11:21 vol-19
drwxr-xr-x 3 nexus nexus 4096 Nov 27 11:21 vol-33
We have some data there now, okay.
Let’s test pip again – first, let’s uninstall the installed package:
root@addeba5d307c:/# pip uninstall setuptools
And install it again, but now add the --no-cache-dir to avoid using the local cache in the container:
root@5dc925fe254f:/# time pip install --no-cache-dir --index-url http://172.17.0.2:8081/repository/pypi-proxy/simple setuptools --trusted-host 172.17.0.2
Looking in indexes: http://172.17.0.2:8081/repository/pypi-proxy/simple
Collecting setuptools
  Downloading http://172.17.0.2:8081/repository/pypi-proxy/packages/setuptools/75.6.0/setuptools-75.6.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 942.9 MB/s eta 0:00:00
Installing collected packages: setuptools
Successfully installed setuptools-75.6.0
...
real    0m1.589s
Now it took 1.59 seconds instead of 2.59.
Good, looks like everything works.
Let’s run Nexus on Kubernetes.
Running Nexus in Kubernetes
There is a chart called stevehipwell/nexus3.
You can write the manifests yourself, or you can try this chart.
What might be interesting to us from the chart’s values:
- config.anonymous.enabled: Nexus will work locally in Kubernetes with access only via a ClusterIP, so while it is a PoC and purely a PyPI cache, we can run it without authentication
- config.blobStores: you can leave it as it is for now, but later you can connect a dedicated EBS volume or AWS Elastic File System, see also persistence.enabled
- config.job.tolerations and nodeSelector: if you need to run it on a separate node, see Kubernetes: Pods and WorkerNodes to control the placement of pods on nodes
- config.repos: create repositories directly through values
- ingress.enabled: not our case, but it is possible
- metrics.enabled: later we can look at the monitoring
First, let’s set it up with the default parameters, then we’ll add our own values.
Add a repository:
$ helm repo add stevehipwell https://stevehipwell.github.io/helm-charts/
"stevehipwell" has been added to your repositories
Create a separate namespace ops-nexus-ns:
$ kk create ns ops-nexus-ns
namespace/ops-nexus-ns created
Install the chart:
$ helm -n ops-nexus-ns upgrade --install nexus3 stevehipwell/nexus3
It took about 5 minutes to launch, and I was thinking about dropping the chart and writing it myself, but eventually, it started. Well, Java – what can we do?
Let’s check what we have here:
$ kk -n ops-nexus-ns get all
NAME           READY   STATUS    RESTARTS   AGE
pod/nexus3-0   4/4     Running   0          6m5s

NAME                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/nexus3      ClusterIP   172.20.160.147   <none>        8081/TCP   6m5s
service/nexus3-hl   ClusterIP   None             <none>        8081/TCP   6m5s

NAME                      READY   AGE
statefulset.apps/nexus3   1/1     6m6s
Add an Admin user password
Create a Kubernetes Secret with a password:
$ kk -n ops-nexus-ns create secret generic nexus-root-pass --from-literal=password=p@ssw0rd
secret/nexus-root-pass created
Write a nexus-values.yaml file, in which we set the name of the Kubernetes Secret and the key with the password, and enable Anonymous Access:
rootPassword:
  secret: nexus-root-pass
  key: password
config:
  enabled: true
  anonymous:
    enabled: true
Adding a repository to Nexus via Helm chart values
I had to do a bit of trial and error here, but it worked.
Also, in the chart’s values.yaml it says: “Repository configuration; based on the REST API (API reference docs require an existing Nexus installation and can be found at **Administration** under _System_ → _API_) but with `format` & `type` defined in the object.”
Let’s see the Nexus API specification – which fields are passed in the API request:
What about the format?
We can look at the Format and Type fields in some existing repository:
Describe the repository and other necessary parameters – for me, it looks like this:
rootPassword:
  secret: nexus-root-pass
  key: password

persistence:
  enabled: true
  storageClass: gp2-retain

resources:
  requests:
    cpu: 1000m
    memory: 1500Mi

config:
  enabled: true
  anonymous:
    enabled: true
  repos:
    - name: pip-cache
      format: pypi
      type: proxy
      online: true
      negativeCache:
        enabled: true
        timeToLive: 1440
      proxy:
        remoteUrl: https://pypi.org
        metadataMaxAge: 1440
        contentMaxAge: 1440
      httpClient:
        blocked: false
        autoBlock: true
        connection:
          retries: 0
          useTrustStore: false
      storage:
        blobStoreName: default
        strictContentTypeValidation: false
It’s a pretty simple setup, and I’ll do some tuning later if necessary. But it’s already working.
Let’s deploy it:
$ helm -n ops-nexus-ns upgrade --install nexus3 stevehipwell/nexus3 -f nexus-values.yaml
In case of errors like “Could not create repository“:
$ kk -n ops-nexus-ns logs -f nexus3-config-9-2cssf
Configuring Nexus3...
Configuring anonymous access...
Anonymous access configured.
Configuring blob stores...
Configuring scripts...
Script 'cleanup' updated.
Script 'task' updated.
Configuring cleanup policies...
Configuring repositories...
ERROR: Could not create repository 'pip-cache'.
Check the Nexus logs – Nexus wants almost all the fields to be passed; in this case, the config.repos.httpClient.contentMaxAge was missing:
nexus3-0:nexus3 2024-11-27 12:34:16,818+0000 WARN [qtp554755438-84] admin org.sonatype.nexus.siesta.internal.resteasy.ResteasyViolationExceptionMapper - (ID af473d22-3eca-49ea-adb9-c7985add27e7) Response: [400] '[ValidationErrorXO{id='PARAMETER strictContentTypeValidation', message='must not be null'}, ValidationErrorXO{id='PARAMETER negativeCache', message='must not be null'}, ValidationErrorXO{id='PARAMETER metadataMaxAge', message='must not be null'}, ValidationErrorXO{id='PARAMETER contentMaxAge', message='must not be null'}, ValidationErrorXO{id='PARAMETER httpClient', message='must not be null'}]'; mapped from: [PARAMETER]
During deployment, when we set the config.enabled=true parameter, the chart launches another Kubernetes Pod, which actually performs the Nexus configuration.
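That config Pod shows up next to the main nexus3-0 Pod (its name suffix is generated, so yours will differ):

# the configuration job Pod (nexus3-config-...) runs alongside the nexus3-0 StatefulSet Pod
$ kk -n ops-nexus-ns get pods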
Let’s check the access and the repository – open a local port:
$ kk -n ops-nexus-ns port-forward pod/nexus3-0 8082:8081
Forwarding from 127.0.0.1:8082 -> 8081
Forwarding from [::1]:8082 -> 8081
And go to http://localhost:8082/#admin/repository/repositories:
Nexus needs a lot of resources, especially memory, because, again, it’s Java:
Therefore, it makes sense to immediately set requests in the values.
Also, you can set JVM params:
...
# Environment:
#  INSTALL4J_ADD_VM_PARAMS: -Djava.util.prefs.userRoot=${NEXUS_DATA}/javaprefs -Xms1024m -Xmx1024m -XX:MaxDirectMemorySize=2048m
...
Testing Nexus in Kubernetes
Launch a Pod with Python:
$ kk run pod --rm -i --tty --image python bash
If you don't see a command prompt, try pressing enter.
root@pod:/#
Find a Kubernetes Service for Nexus:
$ kk -n ops-nexus-ns get svc
NAME        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
nexus3      ClusterIP   172.20.160.147   <none>        8081/TCP   78m
nexus3-hl   ClusterIP   None             <none>        8081/TCP   78m
Run pip install again:
root@pod:/# time pip install --index-url http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple setuptools --trusted-host nexus3.ops-nexus-ns.svc
Looking in indexes: http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple
Collecting setuptools
  Downloading http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/packages/setuptools/75.6.0/setuptools-75.6.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 86.3 MB/s eta 0:00:00
Installing collected packages: setuptools
Successfully installed setuptools-75.6.0
...
real    0m3.958s
It installed setuptools-75.6.0 in 3.95 seconds.
Let’s check it at http://localhost:8082/#browse/browse:pip-cache:
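The same can also be checked from the command line via the repository’s PyPI “simple” index – just a quick sanity check through the port-forward opened above:

# after the install above, the setuptools index page should be served by the proxy from its cache
$ curl -s http://localhost:8082/repository/pip-cache/simple/setuptools/ | head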
Remove setuptools from our Python Pod:
root@pod:/# pip uninstall setuptools
And install it again, again with the --no-cache-dir:
root@pod:/# time pip install --no-cache-dir --index-url http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple setuptools --trusted-host nexus3.ops-nexus-ns.svc
Looking in indexes: http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple
Collecting setuptools
  Downloading http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/packages/setuptools/75.6.0/setuptools-75.6.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 875.9 MB/s eta 0:00:00
Installing collected packages: setuptools
Successfully installed setuptools-75.6.0
...
real    0m2.364s
Now it took 2.364s.
The only thing left to do is to update GitHub Workflows – disable all caches there, and add the use of Nexus.
GitHub and the results for the AWS NAT Gateway traffic
I won’t go into detail about the GitHub Actions Workflow, because it’s different for everyone, but in short, I’ve disabled the pip caching:
... - name: "Setup: Python 3.10" uses: actions/setup-python@v5 with: python-version: "3.10" # cache: 'pip' check-latest: "false" # cache-dependency-path: "**/*requirements.txt" ...
This will save about 540 megabytes on downloading the archive with the cache for each Job run.
Next, we have a step that executes the pip install by calling make:
... - name: "Setup: Dev Dependencies" id: setup_dev_dependencies #run: make dev-python-requirements run: make dev-python-requirements-nexus shell: bash ...
And in the Makefile, I created a new task so that I could quickly revert to the old configuration:
...
dev-python-requirements:
	python3 -m pip install --no-compile -r dev-requirements.txt

dev-python-requirements-nexus:
	python3 -m pip install --index-url http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple --no-compile -r dev-requirements.txt --trusted-host nexus3.ops-nexus-ns.svc
...
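As an alternative (a sketch, not what ended up in our Workflow), the original make dev-python-requirements target could be left untouched and pip pointed at Nexus via environment variables on the step, since pip reads PIP_INDEX_URL and PIP_TRUSTED_HOST:

...
    - name: "Setup: Dev Dependencies"
      id: setup_dev_dependencies
      run: make dev-python-requirements
      shell: bash
      env:
        # pip picks these up instead of --index-url / --trusted-host
        PIP_INDEX_URL: http://nexus3.ops-nexus-ns.svc:8081/repository/pip-cache/simple
        PIP_TRUSTED_HOST: nexus3.ops-nexus-ns.svc
...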
Then, in the Workflow, disable any caches like actions/cache:
...
    # - name: "Setup: Get cached api-generator images"
    #   id: api-generator-cache
    #   uses: actions/cache@v4
    #   with:
    #     path: ~/_work/api-generator-cache
    #     key: api-generator-cache
...
Let’s compare the results.
The build with the old configuration, without Nexus and with GitHub caches – the traffic of the Kubernetes Pod GitHub Runner that this build was running:
3.55 gigabytes of traffic; the build and deployment took 4 minutes and 11 seconds.
And the same GitHub Actions Job, but with the changes merged – using Nexus, and without GitHub caching.
We can see in the logs that the packages are indeed taken from Nexus:
And traffic used:
329 megabytes; the build and deployment took 4 minutes and 20 seconds.
And that’s it for now.
What will be done next is to see how Nexus can be monitored, what metrics it has and which of them can be used for alerts, and then to add a Docker cache as well, because we often run into Docker Hub limits – “429 Too Many Requests – Server message: toomanyrequests: You have reached your pull rate limit. You can increase the limit by authenticating and upgrading“.