dkg's blog - TCP weirdness, IMAP, wireshark, and perdition

This is the story of a weirdly unfriendly/non-compliant IMAP server, and some nice interactions that arose from a debugging session around it.

Over the holidays, i got to do some computer/network debugging for friends and family. One old friend (I'll call him “Fred”) had a series of problems i managed to help work through, but was ultimately basically stumped based on the weird behavior of an IMAP server. Here's the details (names of the innocent and guilty have been changed), just in case it helps other folks in at least diagnosing similar situations.

the diagnosis

The initial symptom was that Fred's computer was "very slow". Sadly, this was a Windows™ machine, so my list of tricks for diagnosing sluggishness is limited. I went through a series of questions, uninstalling things, etc, until we figured it would be better to just have him do his usual work while i watched, kibitzing on what seemed acceptable and what seemed slow. Quite soon, we hit a very specific failure: Fred's Thunderbird installation (version 2, FWIW) was sometimes hanging for a very long period of time during message retrieval. This was not exhaustion of the CPU, disk, RAM, or other local resource. It was pure network delay, and it was a frequent (if unpredictable) frustrating hiccup in his workflow.

One thought i had was Thunderbird's per-server max_cached_connections setting, which can sometimes cause a TB instance to hang if a remote server thinks Thunderbird is being too aggressive. After sorting out why Thunderbird was resetting the values after we'd set them to 0 (grr, thanks for the confusing UI, folks!), we set it to 1, but still had the same occasional, lengthy (about 2 minutes) hang when transfering messages between folders (including the trash folder!), or when reading new messages. Sending mail was quite fast, except for occasional (similarly lengthy) hangs writing the copy to the sent folder. So IMAP was the problem (not SMTP), and the 2-minute timeouts smelled like an issue with the networking layer to me.

At this point, i busted out wireshark, the trusty packet sniffer, which fortunately works as well on Windows as it does on GNU/Linux. Since Fred was doing his IMAP traffic in the clear, i could actually see when and where in the IMAP session the hang was happening. (BTW, Fred's IMAP traffic is no longer in the clear: after all this happened, i switched him to IMAPS (IMAP wrapped in a TLS session), because although the IMAP server in question actually supports the STARTTLS directive, it fails to advertise it in response to the CAPABILITIES query, so Thunderbird refuses to try it. arrgh.)

The basic sequence of Thunderbird's side of an initial IMAP conversation (using plain authentication, anyway) looks something like this:

1 capability2 login "user" "pass"3 lsub "" "*"4 list "" "INBOX"5 select "INBOX"6 UID fetch 1:* (FLAGS)

What i found with this server was that if i issued commands 1 through 5, and then left the connection idle for over 5 minutes, then the next command (even if it was just a 6 NOOP or 6 LOGOUT) would cause the IMAP server to issue a TCP reset. No IMAP error message or anything, just a failure at the TCP level. But a nice, fast, responsive failure -- any IMAP client could recover nicely from that by just immediately opening a new connection. I don't mind busy servers killing inactive connections after a reasonable timeout. If it was just this, though, Thunderbird should have continued to be responsive.

the deep weirdness

But if i issued commands 1 through 6 in rapid succession (the only difference is that extra 6 UID fetch 1:* (FLAGS) command), and then let the connection idle for 5 minutes, then sent the next command: no response of any kind would come from the remote server (not even a TCP ACK or TCP RST). In this circumstance, my client OS's TCP stack would re-send the data repeatedly (staggered at appropriate intervals), until finally the client-side TCP timeout would trigger, and the OS would report the failure to the app, which could turn around and do a simple connection restart to finish up the desired operation. This was the underlying situation causing Fred's Thunderbird client to hang.

In both cases above (with or without the 6th command), the magic window for the idle cutoff was a little more than 300 seconds (5 minutes) of idleness. If the client issued a NOOP at 4 minutes, 45 seconds from the last NOOP, it could keep a connection active indefinitely.

Furthermore, i could replicate the exact same behavior when i used IMAPS -- the state of the IMAP session itself was somehow modifying the TCP session behavior characteristics, whether it was wrapped in a TLS tunnel or not.

One interesting thing about this set of data is that it rules out most common problems in the network connectivity between the two machines. Since none of the hops between the two endpoints know anything about the IMAP state (especially under TLS), and some of the failures are reported properly (e.g. the TCP RST in the 5-command scenario), it's probably safe to say that the various routers, NAT devices, and such were not themselves responsible for the failures.

So what's going on on that IMAP server? The service itself does not announce the flavor of IMAP server, though it does respond to a successful login with “You are so in”, and to a logout with “IMAP server logging out, mate”. A bit of digging on the 'net suggests that they are running a perdition IMAP proxy. (clearly written by an Aussie, mate!) But why does it not advertise its STARTTLS capability, even though it is capable? And why do some idle connections end up timing out without so much as an RST, when other idle connections give at least a clean break at the TCP level?

Is there something about issuing the UID command that causes perdition to hand off the connection to some other service, which in turn doesn't do proper TCP error handling? I don't really know anything about the internals of perdition, so i'm just guessing here.

the workaround

I ultimately recommended to Fred to reduce the number of cached connections to 1, and to set Thunderbird's interval to check for new mail down to 4 minutes. Hopefully, this will keep his one connection active enough that nothing will timeout, and will keep the interference to his workflow to a minimum.

It's an unsatisfactory solution to me, because the behavior of the remote server still seems so non-standard. However, i don't have any sort of control over the remote server, so there's not too much i can do to provide a real fix (other than point the server admins (and perdition developers?) at this writeup).

I don't even know the types of backend server that their perdition proxy is balancing between, so i'm pretty lost for better diagnostics even, let alone a real resolution.

some notes

I couldn't have figured out the exact details listed above just using Thunderbird on Windows. Fortunately, i had a machine with a decent OS available, and was able to cobble together a fake IMAP client from a couple files (imapstart contained the lines above, and imapfinish contained 8 LOGOUT), bash, and socat.

Here's the bash snippet i used as a fake IMAP client:

spoolout() { while read foo; do sleep 1 && printf "%s\r\n" "$foo" ; done }( sleep 2 && spoolout < imapstart && sleep 4 && spoolout < imapfinish && sleep 500 ) | socat STDIO TCP4:imap.fubar.example.net:143

To do the test under IMAPS, i just replaced TCP4:imap.fubar.example.net:143 with OPENSSL:imap.fubar.example.net:993.

And of course, i had wireshark handy on the GNU/Linux machine as well, so i could analyze the generated packets over there.

One thing to note about user empowerment: Fred isn't a tech geek, but he can be curious about the technology he relies on if the situation is right. He was with me through the whole process, didn't get antsy, and never tried to get me to "just fix it" while he did something else. I like that, and wish i got to have that kind of interaction more (though i certainly don't begrudge people the time if they do need to get other things done). I was nervous about breaking out wireshark and scaring him off with it, but it turned out it actually was a good conversation starter about what was actually happening on the network, and how IP and TCP traffic worked.

Giving a crash course like that in a quarter of an hour, i can't expect him to retain any concrete specifics, of course. But i think the process was useful in de-mystifying how computers talk to each other somewhat. It's not magic, there are just a lot of finicky pieces that need to fit together a certain way. And Wireshark turned out to be a really nice window into that process, especially when it displays packets during a real-time capture. I usually prefer to do packet captures with tcpdump and analyze them as a non-privileged user afterward for security reasons. But in this case, i felt the positives of user engagement (how often do you get to show someone how their machine actually works?) far outweighed the risks.

As an added bonus, it also helped Fred really understand what i meant when i said that it was a bad idea to use IMAP in the clear. He could actually see his username and password in the network traffic!

This might be worth keeping in mind as an idea for a demonstration for workshops or hacklabs for folks who are curious about networking -- do a live packet capture of the local network, project it, and just start asking questions about it. Wireshark contains such a wealth of obscure packet dissectors (and today's heterogenous public/open networks are so remarkably chatty and filled with weird stuff) that you're bound to run into things that most (or all!) people in the room don't know about, so it could be a good learning activity for groups of all skill levels.

Tags: debugging, imap, perdition, wireshark