Debugging Docker network connectivity

A few weeks ago I wrote a short article, Networking basics with Docker. Since then, I’ve migrated my production environment to be Docker-only, and I’ve deployed five Docker hosts which handle various tasks: a production file server, a search server with Solr, my complete development environment, and some websites (including this one!).

Over the course of a week I was hitting a few timeouts when connecting from one Docker container to another, and yesterday I finally pinpointed a consistent 4+ second delay on an API service I was writing. Four seconds. I’ll go over the steps I took to diagnose and resolve the issue.

Notice: Hey there. I noticed that this post has been getting a bit of attention. If you want to update your DevOps skills with Docker and 12 Factor apps, consider purchasing 12 Factor Apps with Docker and Go. In it I solve various issues which might crop up when you’re running Docker and scaling your services.

Hardware

First, I suspected that I might be overloading my DigitalOcean instance. That was a very quick dead end: there was plenty of free RAM left, and top showed no significant (out of the ordinary) CPU usage. I also have a netdata container running, which gave me no additional information. I started to suspect a network issue.

With that in mind, I added some debugging lines to my PHP scripts - in this case, calls to microtime() to measure elapsed time - to figure out whether the delay was coming from PHP or from the reverse proxy and web server backend. I confirmed that connecting to MySQL took about 4 seconds. I’m using transactions, so the connection was not persistent - this let me replicate the issue reliably, since established and reused connections obviously don’t show a delay when they are used again.
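
If you want to reproduce that kind of measurement outside of the application, a minimal sketch is to time a raw TCP connection from inside the container with a PHP one-liner (this assumes the php CLI is available in the container; db1:3306 is my database host, which shows up again later in the post):

root@web-backend:~# php -r '$t = microtime(true); fsockopen("db1", 3306); printf("connect took %.3fs\n", microtime(true) - $t);'

If the number printed here already matches the delay you see in the application, the problem sits below PHP and MySQL.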

MySQL

As I run a MySQL instance on my development laptop as well as in production on my VPS, I started to suspect it as the culprit. I had noticed some connectivity issues on my development laptop, but thought they might just be related to a container needing some warm-up time before connectivity to other containers stabilized. This constant delay in connecting, however, pointed at MySQL, and I started to suspect an issue with it.

Some googling led me to a Stack Overflow question, Why is connecting to a MySQL server so slow?. Reasoning that perhaps the DNS server included in Docker has some issues with reverse-DNS lookups, I decided to test this by adding skip-name-resolve to my MySQL config.
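
Enabling it is a one-line change under the [mysqld] section of the MySQL configuration (the exact file path depends on your image; /etc/mysql/my.cnf is a common default):

# /etc/mysql/my.cnf (or a file under conf.d/)
[mysqld]
skip-name-resolve

After restarting MySQL, the running server confirms that the option is active: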

mysql> show variables like '%resolv%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| skip_name_resolve | ON    |
+-------------------+-------+
1 row in set (0.01 sec)

I’m sure the option is fine to include, but it wasn’t the issue - the delay was still present after the change.

Bare bones connectivity

Containers based on Debian or Ubuntu have apt-get installed, which lets you run apt-get update and install additional packages inside the container you’re running. I used the existing container that was experiencing issues - after all, I hadn’t isolated the issue to the source or destination container yet, and I didn’t want to add new containers to the mix.

I used docker exec to get into a shell on my web instance. Since it didn’t have telnet, I installed it:

$ docker exec -it web-backend bash
root@web-backend:~# apt-get update -qq ; apt-get install telnet -y -qq --force-yes
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package telnet.
(Reading database ... 8007 files and directories currently installed.)
Preparing to unpack .../telnet_0.17-36_amd64.deb ...
Unpacking telnet (0.17-36) ...
Setting up telnet (0.17-36) ...
update-alternatives: using /usr/bin/telnet.netkit to provide /usr/bin/telnet (telnet) in auto mode
root@web-backend:~# telnet db1 3306
Trying 172.25.0.3...
Connected to db1.
Escape character is '^]'.
O
5.6.28-76.1Yj('+@▒pqg2_it)mjN(mysql_native_password

This confirmed that the 4-second delay came before MySQL had a chance to do any reverse lookups. The delay happened before the line “Connected to db1”, meaning before the connection was actually established. I also tried other hosts/ports on the network to confirm that it wasn’t the database - I tried Redis, Nginx... everything had a delay of 4 seconds.
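
The equivalent checks against the other services looked something like this (redis and nginx are placeholders for the actual hostnames on my Docker network):

root@web-backend:~# telnet redis 6379
root@web-backend:~# telnet nginx 80

Each of them sat for the same four-ish seconds before reporting that the connection was established.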

What’s actually going on?

There’s a useful Linux utility, strace, which gives you a system call trace for any application. I installed it and ran the same telnet line to connect to my MySQL server again. The program produces a lot of verbose output with all the system calls that are made. A full explanation is out of scope here, but what we’re looking for are the socket and connect lines.

root@web-backend:~# strace telnet db1 3306 2>&1 | grep socket

The offending line that jumped out at the time of the delay was:

socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3

This only tells us that the socket was created in SOCK_DGRAM mode - UDP. DNS runs over UDP. After looking at more output, specifically some connect lines, I managed to isolate the DNS server being used: 127.0.0.11 - the internal Docker DNS server. I also spotted something suspicious in the output:

read(3, "search mtk.lan\nnameserver 127.0."..., 4096) = 53

This showed that, somehow, the name resolution settings from one of my VPNs had leaked into the Docker environment. The delay was caused by software trying to resolve non-existent domains based on an outdated /etc/resolv.conf. I checked the config file on the host - as I wasn’t connected to the VPN anymore, the “search mtk.lan” option was no longer in resolv.conf. But it was still present inside the Docker container.
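
Comparing the two directly makes the mismatch obvious (web-backend is the same container as above):

$ grep search /etc/resolv.conf                              # on the host: no output
$ docker exec -it web-backend grep search /etc/resolv.conf  # inside the container:
search mtk.lan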

I removed the container and restarted it from a clean image. The fresh container still had the search option inside resolv.conf. The option was burned into Docker somewhere, and I didn’t know of a way to get rid of it!
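
Roughly what that looked like - my-web-image stands in for the actual image and run flags, which I’m omitting here:

$ docker rm -f web-backend
$ docker run -d --name web-backend my-web-image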

The final solution

After some googling, I located a filed issue against Docker which describes similar behavior (Issue #8425). The issue is closed, but it gave me enough information to fix it for myself:

$ service docker restart

The above command shuts down all running containers and restarts the Docker daemon. When the daemon started back up, it picked up the current resolv.conf from the host, and when it started the containers, the network connectivity issues went away.
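
The same thing can be done through systemd directly on hosts that use it:

$ systemctl restart docker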

Since then I’ve also pinpointed the same issue on my development environment. I removed all the search lines from resolv.conf and restarted Docker as mentioned above, and all the connectivity issues and delays went away. This is what my Docker container’s resolv.conf looked like before and after:

$ docker exec -i dev cat /etc/resolv.conf    # before
search mtk.lan
nameserver 127.0.0.11
options ndots:0
$ docker exec -i dev cat /etc/resolv.conf    # after restarting Docker
nameserver 127.0.0.11
options ndots:0
$
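
For completeness, removing the stale search lines from the host’s resolv.conf can be done by hand or with something like the line below. Note that if resolv.conf is managed by resolvconf or NetworkManager, the change needs to be made in that tool’s configuration instead, or the line will simply come back.

$ sudo sed -i '/^search /d' /etc/resolv.conf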

Hate it when I’m right

As I wrote in my article on Networking with Docker,

I don’t know exactly where it happened in my professional history of system administration and development, but somewhere along the line I discovered the /etc/hosts file to be the most reliable way to provide host name resolution for active applications on the servers. You don’t need a DNS server, and literally, never ever in all my history, has this method failed for me.

It still holds true today. But with attention to detail, any problem can be resolved. The issue is now fixed on my development and my production server, and it’s nice to know that it doesn’t affect my other servers. Yay!

While I have you here...

It would be great if you bought one of my books.

I promise you'll learn a lot more if you buy one. Buying a copy supports me in writing more about similar topics. Say thank you and buy my books.

Feel free to send me an email if you want to book my time for consultancy/freelance services. I'm great at APIs, Go, Docker, VueJS and scaling services, among many other things.