haproxy bafflement

Update 2007-05-05: I sent some strace output to the author of haproxy, Willy Tarreau, and he replied within 24 hours with a full annotation of the strace and a one-line patch to fix this issue. That’s what I call support! Here are his comments and the patch:

I think this is caused by the fact that the end of connection from the client was received BEFORE the connection even established to the server, and when the connection status is checked, there is nothing anymore in the buffer because all data was just sent at once.

Could you please apply the following patch (against 1.2.17) and check that it fixes your problem (simply do “patch -p1” on the mail) ?

diff --git a/haproxy.c b/haproxy.c
index 8e57700..357a37a 100644
--- a/haproxy.c
+++ b/haproxy.c
@@ -5589,7 +5589,7 @@ int process_srv(struct session *t) {
     else if (s == SV_STCONN) { /* connection in progress */
        if (c == CL_STCLOSE || c == CL_STSHUTW ||
            (c == CL_STSHUTR &&
-            (t->req->l == 0 || t->proxy->options & PR_O_ABRT_CLOSE))) { /* give up */
+            ((t->req->l == 0 && t->res_sw == RES_SILENT) || t->proxy->options & PR_O_ABRT_CLOSE))) { /* give up */
            tv_eternity(&t->cnexpire);
            fd_delete(t->srv_fd);
            if (t->srv)
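
A rough model of the changed test, in Python rather than haproxy’s C, may make the one-line difference easier to see. The function and parameter names are hypothetical; only `CL_STSHUTR`, `t->req->l`, `t->res_sw`, `RES_SILENT` and `PR_O_ABRT_CLOSE` come from the patch. Reading `res_sw == RES_SILENT` as “nothing was just written to the server” is an assumption based on Willy’s explanation. The `CL_STCLOSE` and `CL_STSHUTW` cases are untouched by the patch and left out.

```python
def gives_up_before_patch(client_shut_read, req_len, abort_on_close):
    """Model of the old CL_STSHUTR branch: abandon the pending server
    connection once the client has shut down and the request buffer is empty."""
    return client_shut_read and (req_len == 0 or abort_on_close)


def gives_up_after_patch(client_shut_read, req_len, server_write_silent,
                         abort_on_close):
    """Model of the patched branch: an empty buffer only counts if nothing
    was just written to the server (res_sw == RES_SILENT in the patch)."""
    return client_shut_read and (
        (req_len == 0 and server_write_silent) or abort_on_close
    )


# The failing case as described: the client has already shut down, and the
# whole request was flushed to the server in one go, emptying the buffer.
print(gives_up_before_patch(True, 0, False))        # True: connection dropped
print(gives_up_after_patch(True, 0, False, False))  # False: connection kept
```

In this model the only behavioural change is the case where the buffer is empty because the data has just been sent, which matches the situation Willy describes.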

Original problem is described below…

I’m trying to use haproxy to load balance three SpamAssassin spamd servers. spamd uses a plain-text TCP protocol, so in theory it should be simple, but I’m getting intermittent connection problems. Here’s my config:

global
        log 127.0.0.1 local0 debug
        maxconn 100
        ulimit-n 512
        uid 999
        gid 999
        daemon
        pidfile /var/run/haproxy-spamd.pid

listen spamd
        bind 212.13.194.5:783
        mode tcp
        option tcplog
        log global
        balance roundrobin
        source 212.13.194.5:0
        clitimeout 150000
        srvtimeout 150000
        contimeout 30000
        server corona  212.13.194.122:783 weight 5
        server curacao 212.13.194.71:783  weight 5
        server islay   212.13.194.96:783  weight 6
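
For reference, here is a small sketch of what the weights above should mean in practice, assuming that weighted round robin hands out connections in proportion to each server’s weight. The function name is hypothetical; the server names and weights are taken from the config.

```python
def expected_shares(weights):
    """Return each server's expected fraction of connections,
    assuming traffic is split in proportion to weight."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


weights = {"corona": 5, "curacao": 5, "islay": 6}
for name, share in expected_shares(weights).items():
    print(f"{name}: {share:.3f}")
```

Under that assumption islay should see 6 of every 16 connections and the other two 5 each, so the three servers carry nearly equal load.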

The problem is that sometimes the client (in my case my MTA, Exim) drops the connection immediately and logs:

2007-04-28 23:10:32 1Hhw3k-0000xN-G2 spam acl condition: cannot parse spamd output
2007-04-28 23:10:32 1Hhw3k-0000xN-G2 SA: Action: scanned but message isn't spam: score=0.7 required=5.0 (scanned in 0/0 secs | Message-Id: SODIUM3tt4LQsJABSCu000006a6@sodium.lon.periodicnetwork.com). From  (host=mail.argon.lon.periodicnetwork.com [83.245.63.194]) for elided@snowblind.net
2007-04-28 23:10:32 1Hhw3k-0000xN-G2 <= noreply@periodicnetwork.com H=mail.argon.lon.periodicnetwork.com (ARGON.lon.periodicnetwork.com) [83.245.63.194] P=esmtp S=2322 id=B0009206781@ARGON.lon.periodicnetwork.com
2007-04-28 23:10:33 1Hhw3k-0000xN-G2 => elided@gmail.com  R=dnslookup T=remote_smtp H=gmail-smtp-in.l.google.com [66.249.93.114]
2007-04-28 23:10:33 1Hhw3k-0000xN-G2 Completed

At these times, haproxy’s log reports:

Apr 28 23:10:32 localhost haproxy[22910]: 212.13.194.70:32958 [28/Apr/2007:23:10:32] spamd islay 0/-1/7 0 CC 0/0/0 0/0
Apr 28 23:10:32 localhost haproxy[22910]: 212.13.194.70:32961 [28/Apr/2007:23:10:32] spamd corona 0/0/236 2285 -- 0/0/0 0/0

The “CC” means that the client dropped the connection before a connection to a backend server was made. That first line is the connection from Exim’s spam ACL. The second connection, from SA-Exim, was successful.
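
To pick these lines out of a larger log, something like the following sketch could be used. It assumes only the layout of the two sample lines above; the field names (`timers`, `flags` and so on) are hypothetical labels, not haproxy’s own. In the failing line the middle timer is -1, which is consistent with the server connection never completing.

```python
import re

# Hypothetical field labels; the pattern is fitted to the sample lines only.
TCPLOG = re.compile(
    r"haproxy\[(?P<pid>\d+)\]: "
    r"(?P<client>\S+) \[(?P<accepted>[^\]]+)\] "
    r"(?P<listener>\S+) (?P<server>\S+) "
    r"(?P<timers>-?\d+(?:/-?\d+){2}) "
    r"(?P<bytes>\d+) (?P<flags>\S\S) "
)


def parse_tcplog(line):
    """Split one tcplog line into a dict, or return None if it does not match."""
    match = TCPLOG.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["timers"] = [int(t) for t in record["timers"].split("/")]
    record["bytes"] = int(record["bytes"])
    return record


def client_aborted_before_connect(record):
    """True for the "CC" case described above."""
    return record["flags"] == "CC"
```

Filtering a log file for records where `client_aborted_before_connect` is true would show how often each backend is affected.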

(At the moment Exim on 212.13.194.70 makes both a spam ACL connection to spamd and then an SA-Exim one, so there are two connections per accepted email. This is transitional while I move away from SA-Exim, not a long-term plan.)

It’s not always the islay spamd that shows this error, and the problem doesn’t happen every time – both curacao and islay have successful and failed connections. Only corona is always successful. I don’t know why.

Also, when this happens, although Exim drops its spamd connection immediately after sending data, haproxy still passes the connection through to a backend spamd, which processes it as normal:

Apr 28 23:10:32 admin spamd[11394]: spamd: connection from 212.13.194.5 [212.13.194.5] at port 33761
Apr 28 23:10:32 admin spamd[11394]: spamd: checking message  aka  for Debian-exim:102
Apr 28 23:10:33 admin spamd[11394]: spamd: clean message (0.8/5.0) for Debian-exim:102 in 0.7 seconds, 1668 bytes.
Apr 28 23:10:33 admin spamd[11394]: spamd: result: . 0 - AWL,NO_REAL_NAME,PORN_URL_SEX,SPF_PASS scantime=0.7,size=1668,user=Debian-exim,uid=102,required_score=5.0,rhost=212.13.194.5,raddr=212.13.194.5,rport=33761,mid=,rmid=,autolearn=no
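
The client behaviour that seems to trigger this can be imitated with a few lines of Python. This is a sketch under an assumption: that the client sends its whole request and then shuts down the writing side of the socket before reading the reply, which is how I read “end of connection from the client” in Willy’s explanation. The function name is hypothetical.

```python
import socket


def send_then_half_close(host, port, payload, timeout=5.0):
    """Send payload, shut down the write side, then read until the peer closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        # The peer (here, the proxy) sees end-of-stream right after the data.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Pointed at the haproxy listener (212.13.194.5, port 783 in the config above) with a valid spamd request as the payload, an empty return value would correspond to the “cannot parse spamd output” case.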

haproxy doesn’t have a support mailing list; it only has an IRC channel, where I am reluctant to bring this up. I mailed the author and he doesn’t know why it should be behaving like this either. Does anyone else have any ideas?

Failing that, does anyone know a decent open source software load balancer for generic TCP? Bonus points if it can direct connections to the least busy backend, or if I can specify a limit of connections per server.
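
To be concrete about the two bonus features, this is the selection rule I have in mind, as a minimal sketch not tied to any particular load balancer. The function name, connection counts and limits are made up for illustration; only the server names come from my config.

```python
def pick_backend(active, limits):
    """Return the least busy backend still under its connection limit,
    or None if every backend is full."""
    candidates = [name for name in active if active[name] < limits[name]]
    if not candidates:
        return None
    return min(candidates, key=lambda name: active[name])


active = {"corona": 3, "curacao": 1, "islay": 2}   # current connections
limits = {"corona": 4, "curacao": 4, "islay": 4}   # per-server caps
print(pick_backend(active, limits))  # curacao
```

When every server is at its cap the function returns None, at which point a balancer would have to queue or refuse the connection.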

3 thoughts on “haproxy bafflement”

  1. I came across this post in regards to haproxy but found it interesting because you were load balancing spamd, something we have been doing for a number of years now using the above mentioned LVS + heartbeat solution. We are load balancing spamd connections across 8 backend servers and are processing ~100 million spam scans/month, so I can definitely recommend the above solution.
