title | description |
---|---|
Rabbit Hole | Complex systems sometimes require complex debugging. |
We've begun rolling out SkyDNS at my job, which has been pretty neat. We're basing a couple future projects around being able to use it, and it's made dynamic configuration and service discovery nice and easy.
This post chronicles catching a bug because of our switch to SkyDNS, and how we discovered its root cause. I like to call these kinds of bugs "rabbit holes"; they look shallow at first, but every time you make a little progress forward a little more is required, until you find the ending somewhere totally unrelated to the start.
The Bug
We are seeing tons of these in the SkyDNS log:
```
[skydns] Feb 20 17:21:15.168 INFO | no nameservers defined or name too short, can not forward
```
I fire up tcpdump to see if I can spot anything interesting, and sure enough I run across a bunch of these:
```
# tcpdump -vvv -s 0 -l -n port 53
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
...
$fen_ip.50257 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25)
$fen_ip.27372 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25)
$fen_ip.35634 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25)
$fen_ip.64363 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25)
```
It appears that some of our front end nodes (FENs) are making tons of DNS requests trying to find the A record for `unknown`. Something on our FENs is doing something insane and breaking things.
The FENs
Hopping over to my favorite FEN, we're able to see the packets in question leaving in a tcpdump there as well, but that's not helpful for finding the root cause. We have lots of processes running on the FENs, and any number of them could be doing something crazy.
We fire up sysdig, which is similar to systemtap and strace in that it allows you to hook into the kernel and view various kernel activities in real time, but it's easier to use than either. The following command dumps all UDP packets being sent and shows which process is sending them:
```
# sysdig fd.l4proto=udp
...
2528950 22:17:35.260606188 0 php-fpm (21477) < connect res=0 tuple=$fen_ip:61173->$skydns_ip:53
2528961 22:17:35.260611327 0 php-fpm (21477) > sendto fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=25 tuple=NULL
2528991 22:17:35.260631917 0 php-fpm (21477) < sendto res=25 data=.r...........unknown.....
2530470 22:17:35.261879032 0 php-fpm (21477) > ioctl fd=102(<4u>$fen_ip:61173->$skydns_ip:53) request=541B argument=7FFF82DC8728
2530472 22:17:35.261880574 0 php-fpm (21477) < ioctl res=0
2530474 22:17:35.261881226 0 php-fpm (21477) > recvfrom fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=1024
2530476 22:17:35.261883424 0 php-fpm (21477) < recvfrom res=25 data=.r...........unknown..... tuple=$skydns_ip:53->$fen_ip:61173
2530485 22:17:35.261888997 0 php-fpm (21477) > close fd=102(<4u>$fen_ip:61173->$skydns_ip:53)
2530488 22:17:35.261892626 0 php-fpm (21477) < close res=0
```
Aha! We can see `php-fpm` is requesting something over UDP with the string `unknown` in it. We've now narrowed down the guilty process; the rest should be easy, right?
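As an aside: once the process is known, sysdig's filter syntax (the `proc.name` field is one of its standard filter fields) makes it easy to watch only that process's UDP traffic, which keeps the noise down while poking further:

```
# sysdig fd.l4proto=udp and proc.name=php-fpm
```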
Which PHP?
Unfortunately we're a PHP shop; knowing that `php-fpm` is doing something on a FEN narrows down the guilty codebase very little. Taking the FEN out of our load balancer stops the requests for `unknown`, so we can say that some user-facing code is the culprit. Our setup on the FENs involves users hitting nginx for static content and nginx proxying PHP requests back to `php-fpm`. Since all our virtual domains are defined in nginx, we are able to do something horrible.

On the particular FEN we're on, we make a guess about which virtual domain the problem is likely coming from (our main app), and proxy all traffic from all other domains to a different FEN. We still see requests for `unknown` leaving the box, so we've narrowed the problem down a little more.
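The "horrible" part looks roughly like this (the hostname and upstream address here are made up for illustration): every virtual domain other than the suspect one gets its server block pointed wholesale at a different FEN, so only the main app's code still runs locally.

```
# Illustrative sketch, not our real config: ship a whole vhost off
# to another FEN so its code no longer runs on this box.
server {
    listen 80;
    server_name other-app.example.com;

    location / {
        proxy_pass       http://10.0.0.42;   # some other FEN
        proxy_set_header Host $host;
    }
}
```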
The Despair
Nothing in our code is doing any direct DNS calls as far as we can find, and we don't see any place where PHP might be doing it for us. We have lots of PHP extensions in place, all written in C and all black boxes; any of them could be the culprit. Grepping through the likely candidates' source code for the string `unknown` proves fruitless.
We try xdebug at this point. xdebug is a profiler for PHP which can create cachegrind files for the running code. With cachegrind you can see every function that was called, how long was spent in each function, a full call-graph, and lots more. Unfortunately xdebug outputs cachegrind files on a per-php-fpm-process basis and overwrites the previous file on each new request. So xdebug is pretty much useless here, since what's in the cachegrind file isn't necessarily what spawned the DNS request.
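For context, the profiler setup was along these lines (Xdebug 2-era settings, shown as an illustration rather than our exact config); the default output name only varies by process ID, which is exactly why each new request handled by a worker clobbers the previous request's file:

```
; php.ini -- illustrative Xdebug 2.x profiler settings
xdebug.profiler_enable      = 1
xdebug.profiler_output_dir  = /tmp
; %p is the php-fpm worker's pid, so every request that worker handles
; writes to (and overwrites) the same cachegrind file
xdebug.profiler_output_name = cachegrind.out.%p
```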
Gotcha (sorta)
We turn back to the tried and true method of dumping all the traffic using tcpdump and perusing it manually.
What we find is that nearly every time there is a DNS request for `unknown`, if we scroll up a bit there is (usually) a particular request to memcache. The requested key is always in the style of `function-name:someid:otherstuff`. Looking in the code around that function name, we find this ominous-looking call:
```
$ipAddress = getIPAddress();
$geoipInfo = getCountryInfoFromIP($ipAddress);
```
This points us in the right direction. On a hunch we add some debug logging to print out the `$ipAddress` variable, and sure enough it comes back as `unknown`. AHA!
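The logging itself was nothing fancy; something along these lines (a hypothetical reconstruction, not the actual production change):

```
// Hypothetical debug breadcrumb around the ominous call
$ipAddress = getIPAddress();
error_log('getIPAddress() returned: ' . var_export($ipAddress, true));
$geoipInfo = getCountryInfoFromIP($ipAddress);
```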
So what we surmise is happening is that for some reason our geoip extension, which we use to get the location data for an IP address and which `getCountryInfoFromIP` calls, is being handed something which is not an IP address and is trying to resolve it as a hostname.
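That would line up with how the PECL geoip extension's `*_by_name` functions behave: they accept either an IP or a hostname, and a hostname gets resolved over DNS first. Assuming that extension is what sits under `getCountryInfoFromIP`, a call like this is enough to reproduce the stray lookup (a minimal sketch, not our wrapper code):

```
// Minimal sketch assuming the PECL geoip extension.
// A non-IP string is treated as a hostname, which triggers a DNS
// A lookup for "unknown." before any GeoIP data is consulted.
$country = geoip_country_code_by_name('unknown');
var_dump($country); // false, since the lookup fails
```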
Gotcha (for real)
So the question becomes: why are we getting the string `unknown` as an IP address?
Adding some debug logging around the area we found before shows that `$_SERVER['REMOTE_ADDR']`, which is the variable populated with the IP address of the client, is sometimes `unknown`. We guess that this has something to do with some magic we are doing on nginx's side to populate `REMOTE_ADDR` with the real IP address of the client in the case of them going through a proxy.
Many proxies send along the header `X-Forwarded-For` to indicate the real IP of the client they're proxying for; otherwise the server would only see the proxy's IP. In our setup I had decided that in those cases we should set `REMOTE_ADDR` to the real client IP so our application logic doesn't even have to worry about it. There are a couple of problems with this which render it a bad decision, one being that if some misbehaving proxy were to, say, start sending `X-Forwarded-For: unknown`, then a naively written application might mistake that to mean the client's IP is `unknown`.
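The nginx-side "magic" was roughly of this shape (an illustrative reconstruction of the idea, not our exact config); note that whatever arrives in the header, valid IP or not, gets handed to php-fpm as `REMOTE_ADDR` verbatim:

```
# Illustrative sketch of the old, bad approach (inside the server block).
# Anything in X-Forwarded-For -- "unknown" included -- becomes REMOTE_ADDR.
set $client_ip $remote_addr;
if ($http_x_forwarded_for) {
    set $client_ip $http_x_forwarded_for;
}

location ~ \.php$ {
    fastcgi_pass  unix:/var/run/php-fpm.sock;
    fastcgi_param REMOTE_ADDR $client_ip;
    # ...the rest of the usual fastcgi_param lines...
}
```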
The Fix
The fix here was two-fold:
- We now always set `$_SERVER['REMOTE_ADDR']` to the remote address of the request, regardless of whether it's a proxy, and we also pass the `X-Forwarded-For` header through to the application to do with as it pleases.
- Inside our app we look at all the headers sent and do some processing to decide what the actual client IP is. PHP can handle a lot more complex logic than nginx can, so we can do things like check to make sure the IP is actually an IP, that it's not some NAT'd internal IP, and so forth. (A sketch of that logic follows below.)
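Something in the spirit of that second point (a simplified sketch; the real code handles more edge cases than this):

```
// Simplified sketch of picking the real client IP -- not the production code.
function resolveClientIP(array $server)
{
    $candidates = [];

    // X-Forwarded-For may be a comma-separated chain: client, proxy1, proxy2.
    if (!empty($server['HTTP_X_FORWARDED_FOR'])) {
        $candidates = array_map('trim', explode(',', $server['HTTP_X_FORWARDED_FOR']));
    }

    // The direct peer address is always the last candidate.
    $candidates[] = $server['REMOTE_ADDR'];

    foreach ($candidates as $ip) {
        // Must be a real IP, and not some private/reserved (e.g. NAT'd) range.
        if (filter_var($ip, FILTER_VALIDATE_IP,
                FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE) !== false) {
            return $ip;
        }
    }

    // Nothing usable (e.g. "unknown"); fall back to the raw peer address.
    return $server['REMOTE_ADDR'];
}

$clientIP = resolveClientIP($_SERVER);
```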
And that's it. From some weird log messages on our DNS servers to an nginx misconfiguration on an almost unrelated set of servers, this is one of those strange bugs that never has a nice, obvious cause and could easily go unsolved for a long time. Spending the time to dive down the rabbit hole and find the answer is often tedious, but also often very rewarding.