mediocre-blog/_posts/2015-03-11-rabbit-hole.md

166 lines
7.6 KiB
Markdown
Raw Permalink Normal View History

---
title: Rabbit Hole
description: >-
Complex systems sometimes require complex debugging.
---
We've begun rolling out [SkyDNS][skydns] at my job, which has been pretty neat.
We're basing a couple future projects around being able to use it, and it's made
dynamic configuration and service discovery nice and easy.
This post chronicles catching a bug because of our switch to SkyDNS, and how we
discover its root cause. I like to call these kinds of bugs "rabbit holes"; they
look shallow at first, but anytime you make a little progress forward a little
more is always required, until you discover the ending somewhere totally
unrelated to the start.
## The Bug
We are seeing *tons* of these in the SkyDNS log:
```
[skydns] Feb 20 17:21:15.168 INFO | no nameservers defined or name too short, can not forward
```
I fire up tcpdump to see if I can see anything interesting, and sure enough run
across a bunch of these:
```
# tcpdump -vvv -s 0 -l -n port 53
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
...
$fen_ip.50257 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25)
$fen_ip.27372 > $skydns_ip.domain: [udp sum ok] 16218+ A? unknown. (25)
$fen_ip.35634 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25)
$fen_ip.64363 > $skydns_ip.domain: [udp sum ok] 59227+ A? unknown. (25)
```
It appears that some of our front end nodes (FENs) are making tons of DNS
fequests trying to find the A record of `unknown`. Something on our FENs is
doing something insane and is breaking.
## The FENs
Hopping over to my favorite FEN we're able to see the packets in question
leaving on a tcpdump as well, but that's not helpful for finding the root cause.
We have lots of processes running on the FENs and any number of them could be
doing something crazy.
We fire up sysdig, which is similar to systemtap and strace in that it allows
you to hook into the kernel and view various kernel activites in real time, but
it's easier to use than both. The following command dumps all UDP packets being
sent and what process is sending them:
```
# sysdig fd.l4proto=udp
...
2528950 22:17:35.260606188 0 php-fpm (21477) < connect res=0 tuple=$fen_ip:61173->$skydns_ip:53
2528961 22:17:35.260611327 0 php-fpm (21477) > sendto fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=25 tuple=NULL
2528991 22:17:35.260631917 0 php-fpm (21477) < sendto res=25 data=.r...........unknown.....
2530470 22:17:35.261879032 0 php-fpm (21477) > ioctl fd=102(<4u>$fen_ip:61173->$skydns_ip:53) request=541B argument=7FFF82DC8728
2530472 22:17:35.261880574 0 php-fpm (21477) < ioctl res=0
2530474 22:17:35.261881226 0 php-fpm (21477) > recvfrom fd=102(<4u>$fen_ip:61173->$skydns_ip:53) size=1024
2530476 22:17:35.261883424 0 php-fpm (21477) < recvfrom res=25 data=.r...........unknown..... tuple=$skydns_ip:53->$fen_ip:61173
2530485 22:17:35.261888997 0 php-fpm (21477) > close fd=102(<4u>$fen_ip:61173->$skydns_ip:53)
2530488 22:17:35.261892626 0 php-fpm (21477) < close res=0
```
Aha! We can see php-fpm is requesting something over udp with the string
`unknown` in it. We've now narrowed down the guilty process, the rest should be
easy right?
## Which PHP?
Unfortunately we're a PHP shop; knowing that php-fpm is doing something on a FEN
narrows down the guilty codebase little. Taking the FEN out of our load-balancer
stops the requests for `unknown`, so we *can* say that it's some user-facing
code that is the culprit. Our setup on the FENs involves users hitting nginx
for static content and nginx proxying PHP requests back to php-fpm. Since all
our virtual domains are defined in nginx, we are able to do something horrible.
On the particular FEN we're on we make a guess about which virtual domain the
problem is likely coming from (our main app), and proxy all traffic from all
other domains to a different FEN. We still see requests for `unknown` leaving
the box, so we've narrowed the problem down a little more.
## The Despair
Nothing in our code is doing any direct DNS calls as far as we can find, and we
don't see any places PHP might be doing it for us. We have lots of PHP
extensions in place, all written in C and all black boxes; any of them could be
the culprit. Grepping through the likely candidates' source code for the string
`unknown` proves fruitless.
We try xdebug at this point. xdebug is a profiler for php which will create
cachegrind files for the running code. With cachegrind you can see every
function which was ever called, how long spent within each function, a full
call-graph, and lots more. Unfortunately xdebug outputs cachegrind files on a
per-php-fpm-process basis, and overwrites the previous file on each new request.
So xdebug is pretty much useless, since what is in the cachegrind file isn't
necessarily what spawned the DNS request.
## Gotcha (sorta)
We turn back to the tried and true method of dumping all the traffic using
tcpdump and perusing through that manually.
What we find is that nearly everytime there is a DNS request for `unknown`, if
we scroll up a bit there is (usually) a particular request to memcache. The
requested key is always in the style of `function-name:someid:otherstuff`. When
looking in the code around that function name we find this ominous looking call:
```php
$ipAddress = getIPAddress();
$geoipInfo = getCountryInfoFromIP($ipAddress);
```
This points us in the right direction. On a hunch we add some debug
logging to print out the `$ipAddress` variable, and sure enough it comes back as
`unknown`. AHA!
So what we surmise is happening is that for some reason our geoip extension,
which we use to get the location data of an IP address and which
`getCountryInfoFromIP` calls, is seeing something which is *not* an IP address
and trying to resolve it.
## Gotcha (for real)
So the question becomes: why are we getting the string `unknown` as an IP
address?
Adding some debug logging around the area we find before showed that
`$_SERVER['REMOTE_ADDR']`, which is the variable populated with the IP address
of the client, is sometimes `unknown`. We guess that this has something to do
with some magic we are doing on nginx's side to populate `REMOTE_ADDR` with the
real IP address of the client in the case of them going through a proxy.
Many proxies send along the header `X-Forwarded-For` to indicate the real IP of
the client they're proxying for, otherwise the server would only see the proxy's
IP. In our setup I decided that in those cases we should set the `REMOTE_ADDR`
to the real client IP so our application logic doesn't even have to worry about
it. There are a couple problems with this which render it a bad decision, one
being that if some misbahaving proxy was to, say, start sending
`X-Forwarded-For: unknown` then some written applications might mistake that to
mean the client's IP is `unknown`.
## The Fix
The fix here was two-fold:
1) We now always set `$_SERVER['REMOTE_ADDR']` to be the remote address of the
requests, regardless of if it's a proxy, and also send the application the
`X-Forwarded-For` header to do with as it pleases.
2) Inside our app we look at all the headers sent and do some processing to
decide what the actual client IP is. PHP can handle a lot more complex logic
than nginx can, so we can do things like check to make sure the IP is an IP, and
also that it's not some NAT'd internal ip, and so forth.
And that's it. From some weird log messages on our DNS servers to an nginx
mis-configuration on an almost unrelated set of servers, this is one of those
strange bugs that never has a nice solution and goes unsolved for a long time.
Spending the time to dive down the rabbit hole and find the answer is often
tedious, but also often very rewarding.
[skydns]: https://github.com/skynetservices/skydns