Category Archives: Monitoring

Hardware, services and network monitoring systems

Prometheus: yet-another-cloudwatch-exporter – collecting AWS CloudWatch metrics

23 July 2020

Currently, to collect metrics from AWS CloudWatch we are using the official Prometheus cloudwatch-exporter, see the Prometheus: CloudWatch exporter – collecting metrics from AWS and graphs in Grafana post (in Russian), but it has a few drawbacks: it’s written in Java, so it consumes CPU and memory on the monitoring host; it doesn’t scrape AWS tags from resources; it uses… Read More »

Kubernetes: monitoring with Prometheus – exporters, a Service Discovery, and its roles

26 April 2020

The next task with our Kubernetes cluster is to set up its monitoring with Prometheus. This task is complicated by the fact that a whole bunch of resources need to be monitored: on the infrastructure side – EC2 WorkerNode instances and their CPU, memory, network, disks, etc.; the key services of Kubernetes itself – its… Read More »

Redis: “psync scheduled to be closed ASAP for overcoming of output buffer limits” and the client-output-buffer-limit

26 February 2020

We have a Redis cluster with Master-Slave replication and Sentinel; see the posts Redis: replication, part 2 – Master-Slave replication, and Redis Sentinel; Redis: fork – Cannot allocate memory, Linux, virtual memory and vm.overcommit_memory; and Redis: main configuration parameters and performance tuning overview. The system worked great until we started using it much more actively. Redis… Read More »

Kubernetes: running metrics-server in AWS EKS for a Kubernetes Pod AutoScaler

15 February 2020

We assume that we already have an AWS EKS cluster with worker nodes. In this post, we will connect to a newly created cluster, create a test deployment with an HPA (Kubernetes Horizontal Pod AutoScaler), and try to get information about resource usage using kubectl top. Kubernetes cluster: Create a test cluster using… Read More »

Grafana: Loki – the LogQL’s Prometheus-like counters, aggregation functions and dnsmasq’s requests graphs

17 November 2019

The last time I configured Loki for log collection and monitoring was in February 2019, almost a year ago, when Loki was still in its Beta state; see the Grafana Labs: Loki – logs collecting and monitoring system post. Now we are facing outgoing traffic issues in our Production environments and can’t find who is to blame for… Read More »

dnsmasq: AWS – “Temporary failure in name resolution”, logs, debug and dnsmasq cache size

28 October 2019

We are using AWS VPC DNS and sometimes face errors like “php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution”. The only advice from AWS tech support was to configure a local dnsmasq service to act as a local DNS cache, but I already did this a year ago, and the issue still happens once every 1-2-3… Read More »

Debian: logrotate won’t rotate logs with an “unknown group ‘syslog’” error

9 October 2019

We have an AWS EC2 instance with Debian and logrotate. One day its root partition ran out of space, and when I started investigating, I found a bunch of files like /var/log/syslog.N.gz. At the same time, by default logrotate creates a config file to rotate syslog log files: root@monitoring-dev:~# cat /etc/logrotate.d/syslog # Ansible… Read More »

OpsGenie: Uptrends integration

24 September 2019

Uptrends is a simple ping monitoring service that is already used for the RTFM blog (see the Prometheus: RTFM blog monitoring set up with Ansible – Grafana, Loki, and promtail post for more details). Now I’d like to add it as an additional alerting service for the project’s API endpoints and configure its alerts to be sent… Read More »
