Debugging Docker network connectivity
A few weeks ago I wrote a short article, networking basics with docker. Since then, I’ve migrated my production environment to be docker only, and I’ve deployed 5 docker hosts which handle various tasks, from a production file server, search server with solr, my complete development environment and some websites (including this one!).
Over the course of a week I was hitting a few timeouts in connecting from one docker container to another, and finally, yesterday I consistently pinpointed a 4+ second delay on an API service I was writing. Four seconds. I’ll go over the steps that I took to diagnose and resolve the issue.
Notice: Hey there. I noticed that this post has been getting a bit of attention. If you want to update your DevOps skills with Docker and 12 Factor apps, consider purchasing 12 Factor Apps with Docker and Go. In it I solve various issues which might crop up when you’re running Docker and scaling your services.
First, I suspected that I might be overloading my Digital Ocean instance. That was a very quick
dead-end, since I have a lot of free RAM left, and
top showed no significant (out of the ordinary) CPU usage. I also have
a netdata container running, which gave me no additional info. I started to
suspect a network issue.
With it, I started to add some debugging lines into my PHP scripts in this case - I called
microtime() to give me an idea
of the elapsed time - was it coming from PHP, or was it coming from the reverse-proxy and web server backend. I confirmed
that connecting to MySQL took about 4 seconds. I’m using transactions so the connection was not persistent - this enabled me
to replicate the issue reliably - established and reused connections obviously don’t have a delay when they are used again.
As I’m running a MySQL instance on my development laptop, as well as in production on my VPS instance, I’ve started to suspect it as a culprit. I noticed some connectivity issues on my development laptop, but thought it might just be related with a container needing some warm-up time before the connectivity to other containers was stabilized. But this constant delay in connecting was pointed at MySQL, and I’ve started to suspect some issue with it.
Some googling and I found an Stack Overflow question, Why is connecting to a MySQL server so slow?.
After reasoning that okay, perhaps the DNS server included in docker has some issues in regards to reverse-dns lookups, I’ve decided
to test this out and include
skip-name-resolve in my MySQL configs.
mysql> show variables like '%resolv%'; +-------------------+-------+ | Variable_name | Value | +-------------------+-------+ | skip_name_resolve | ON | +-------------------+-------+ 1 row in set (0.01 sec)
I’m sure the option is fine to include, but it didn’t seem to be the issue, the delay was still present after the change.
Bare bones connectivity
Containers which are based on debian or ubuntu have
apt-get installed and allow to run
apt-get update as well as install
additional packages in the container you’re running. I used my existing container experiencing issues - after all, I didn’t isolate
the issue to the source or destination container yet, didn’t want to add new containers to the mix.
docker exec to get into a shell on my web instance. Since it didn’t have telnet, I installed it:
$ docker exec -it web-backend bash root@web-backend:~# apt-get update -qq ; apt-get install telnet -y -qq --force-yes debconf: delaying package configuration, since apt-utils is not installed Selecting previously unselected package telnet. (Reading database ... 8007 files and directories currently installed.) Preparing to unpack .../telnet_0.17-36_amd64.deb ... Unpacking telnet (0.17-36) ... Setting up telnet (0.17-36) ... update-alternatives: using /usr/bin/telnet.netkit to provide /usr/bin/telnet (telnet) in auto mode root@web-backend:~# telnet db1 3306 Trying 172.25.0.3... Connected to db1. Escape character is '^]'. O 5.6.28-76.1Yj('+@▒pqg2_it)mjN(mysql_native_password
This confirmed that the 4 second delay was coming before MySQL had a chance to do any reverse lookups. The delay happened before the line “Connected to db1”, meaning, before the connection was actually established. I also tried other hosts/ports on the network to confirm that it’s not the database - I tried Redis, Nginx,… everything had a delay of 4 seconds.
What’s actually going on?
There’s a useful Linux utility,
strace, which gives you a system call trace for any application. I installed it and ran the
same telnet line, to connect to my MySQL server again. The application gives a lot of verbose output with all the system calls
that are processed. It’s explanation is a bit out scope, but what we’re looking for is any
root@web-backend:~# telnet db1 3306 | grep socket
The offending line that jumped out at the time of the delay was:
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
This only tells us that the socket was created in SOCK_DGRAM mode - UDP. DNS functions on UDP. After looking at more output,
connect lines, I managed to isolate the DNS used as 127.0.0.11 - the internal Docker DNS server. I spotted
something suspicious in the output:
read(3, "search mtk.lan\nnameserver 127.0."..., 4096) = 53
I see that, somehow, the name resolution from one of my VPNs leaked into the Docker environment. The cause of the delay was being
caused by software trying to resolve non-existent domains based on an outdated
/etc/resolv.conf. I checked the config file on
the host - as I wasn’t connected to the VPN anymore, the “search mtk.lan” option wasn’t in the resolv.conf anymore. But it was
still available inside the docker container.
I removed the container, and restarted it from a clean image. The container still had this
search option inside resolv.conf.
The option was burned into Docker somewhere, and I don’t know a way to get rid of it!
The final solution
After some googling, I’ve located a filed issue against Docker, which describes similar behavior (Issue #8425). The issue is closed but it gave me enough info to fix it for myself:
$ service docker restart
The above command shuts down all containers and restarts the Docker daemon. When the deamon started, it picked up the current resolv.conf from the host, and when it started the containers - the network connectivity issues have gone away.
Since then I’ve also pinpointed the same issue on my development environment. I removed all the
search lines from resolv.conf
and restarted Docker as mentioned above. All connectivity issues/delays have practically gone away. This is what my Docker
resolv.conf looked like before, and after:
$ docker exec -i dev cat /etc/resolv.conf search mtk.lan nameserver 127.0.0.11 options ndots:0 $ docker exec -i dev cat /etc/resolv.conf nameserver 127.0.0.11 options ndots:0 $
Hate it when I’m right
As I wrote in my article on Networking with Docker,
I don’t know exactly where it happened in my professional history of system administration and development, but somewhere along the line I discovered the
/etc/hostsfile to be the most reliable way to provide host name resolution for active applications on the servers. You don’t need a DNS server, and literally, never ever in all my history, has this method failed for me.
It stands true even today. But with attention to details, any problem can be resolved. It’s now resolved on my development and my production server. And It’s nice to know that it doesn’t affect my other servers. Yay!
While I have you here...
It would be great if you buy one of my books:
- Go with Databases
- Advent of Go Microservices
- API Foundations in Go
- 12 Factor Apps with Docker and Go
Want to stay up to date with new posts?