Fail2Ban, iptables and config management

Fail2Ban ^

Fail2Ban is a piece of software which can watch log files and take an arbitrary action when a certain number of matches are found.

It is most commonly used to read logs from an SSH daemon in order to insert a firewall rule against hosts that repeatedly fail to log in. Hence Fail → Ban.

Wherever possible, it is best to require public key and/or multi-factor authentication for SSH login. Then, it does not matter how many times an attacker tries to guess passwords as they should never succeed. It’s just log noise.

Sadly I have some hosts where some users require password authentication to be available from the public Internet. Also, even on the hosts that can have password authentication disabled, it is irritating to see the same IPs trying over and over.

Putting SSH on a different port is not sufficient, by the way. It may cut down the log noise a little, but the advent of services that scan the entire Internet and then sell the results has meant that if you run an SSH daemon on any port, it will be found and be the subject of dictionary attacks.

So, Fail2Ban.

iptables ^

The usual firewall on Linux is iptables. By default, when Fail2Ban wants to block an IP address it will insert a rule and then when the block expires it will remove it again.

iptables Interaction With Configuration Management ^

I’ve had all my hosts in configuration management for about 10 years now, and that includes the firewall setup. First it was Puppet but these days it is Ansible.

That worked great when the firewall rules were only managed in the config management, but Fail2Ban introduces firewall changes itself.

Now, it’s been many years since I moved on from Puppet so perhaps a way around this has been found there now. At the time though, I was using the Puppetlabs firewall module and it really did not like seeing changes from outside itself. It would keep reverting them.

It was possible to tell it not to meddle with rules that it didn’t add, but it never did work completely correctly. I would still see changes at every run.

Blackholes To The Rescue ^

I never did manage to come up with a way to control the firewall rules in Puppet but still allow Fail2Ban to add and remove its rules and chains, without there being modifications at every Puppet run.

Instead I sidestepped the problem by using the “route” action of Fail2Ban instead of the “iptables” action. The “route” action simply inserts a blackhole route, as if you did this at the command line:

# ip route add blackhole 192.168.1.1

That blocks all traffic to/from that IP address. Some people may have wanted to only block SSH traffic from those hosts but in my view those hosts are bad actors and I am happy to drop all traffic from/to them.

Problem solved? Well, not entirely.

Multiple Jailhouse Blues ^

Fail2Ban isn’t just restricted to processing logs for one service. Taken together, the criteria for banning for a given time over a given set of log files is called a jail, and there can be multiple jails.

When using iptables as the jail action this isn’t much of an issue because the rules are added to separate iptables chains named after the jail itself, e.g. f2b-sshd. You can therefore have the same IP address appearing in multiple different chains and whichever is hit first will ban it.

A common way to configure Fail2Ban is to have one jail banning hosts that have a short burst of failures for a relatively short period of time, and then another jail that bans persistent attackers for a much longer period of time. For example, there could be an sshd jail that looks for 3 failures in 3 minutes and bans for 20 minutes, and then an sshd-hourly jail that looks for 5 failures in an hour and bans for a day.

This doesn’t work with the “route” action because there is only one routing table and you can’t have duplicate routes in it.

Initially you may think you can cause the actual execution of the actions to still succeed with something like this:

actionban   = ip route add blackhole <ip> || true
actionunban = ip route del blackhole <ip> || true

i.e. force them to always succeed even if the IP is already banned or already expired.

The problem now is that the short-term jails can remove bans that the long-term jails have added. It’s a race condition as to which order the adds and removes are done in.

Ansible iptables_raw Deal ^

As I say, I switched to Ansible quite a while ago, and for firewalling here I chose the iptables_raw module.

This has the same issues with changed rules as all my earlier Puppet efforts did.

The docs say that you can set keep_unmanaged and then rules from outside of this module won’t be meddled with. This is true, but still Ansible reports changes on every host every time. It isn’t actually doing a change, it is just noting a change.

I think this is because every time iptables_raw changes the rules, it uses iptables-save to save them out to a file. Then Fail2Ban adds and removes some rules, and next time iptables_raw compares the live rule set with the save file that it saved out last time. So there’s always changes (assuming any Fail2Ban activity).

Someone did ask about the possibility of ignoring some chains, which would be ideal for ignoring all the f2b-* chains, but the response seems to indicate that this will not be happening.

So I am still looking for a way to manage Linux host firewalls in Ansible that can ignore some chains and not want to be in sole control of all rules.

Paul mentioned that from Ansible he uses ferm, which writes rules to files before actioning them, so doesn’t suffer from this problem.

That is a possibility, but if I am going to rewrite all of that I think I should probably do it with something that is going to support nftables, which ferm apparently isn’t.

The Metric System ^

All is not lost, though it is severely bodged.

Routes can have metrics. The metric goes from 0 to 9999, and the lower the number the more important the route is.

There can be multiple routes for the same destination but with different metrics; for example if you have a metric 10 route and a metric 20 route for the same destination, the metric 10 route is chosen.

That means that you can use a different metric for each jail, and then each jail can ban and unban the same IPs without interfering with other jails.

Here’s an action file for the action “route-metric”:

[Definition]
actionban   = ip route add blackhole <ip> metric <metric>
actionunban = ip route del blackhole <ip> metric <metric>

On Debian you might put that in a file called /etc/fail2ban/action.d/route-metric.conf and then in a jail definition use it like this:

[sshd-hourly]
logpath  = /var/log/auth.log
filter   = sshd
enabled  = true
action   = route-metric[metric=9998]
# 5 tries
maxretry = 5
# in one hour
findtime = 3600
# bans for 24 hours
bantime  = 86400

Just make sure to use a different metric number (9998 here) for each jail and that solves that problem.

Clearly that doesn’t solve it in a very nice way though. If you use Ansible and manage your firewall rules in it, what do you use?

Possibly this could instead be worked around by having multiple routing tables.

Comparing Versions in Ansible Templates

In the last few days, Debian archived their jessie release and removed the jessie-updates suite from the distribution mirrors. Those hosts which still reference jessie-updates and do an apt update will see something like:

W: Failed to fetch http://apt-cacher.lon.bitfolk.com/debian/ftp.uk.debian.org/debian/dists/jessie-updates/main/binary-amd64/Packages  404  Not Found [IP: 2001:ba8:1f1:f079::2 80]

This is because the suite and all of its files were removed from the mirrors. The files have now been archived and will be picked up by just using the jessie suite on archive.debian.org, so no longer any need to reference jessie-updates.

In order to not see that every time you should remove the jessie-updates line from /etc/apt/sources.list.

My /etc/apt/sources.list is built by Ansible from a template, and the relevant part of the template looked a bit like this:

{% if ansible_distribution_version >= 8.0 %}
deb {{ aptcacher_prefix }}{{ debian_mirror }} {{ ansible_distribution_release }}-updates   main contrib non-free
{% endif %}

ansible_distribution_version and ansible_distribution_release are host variables, and for Debian jessie currently evaluate as the strings “8.11” and “jessie” respectively.

As there is now only an -updates for Debian stable (version 9.x, “stretch”) the “if” statement should be testing against “9.0”, right? So I changed it to:

{% if ansible_distribution_version >= 9.0 %}
deb {{ aptcacher_prefix }}{{ debian_mirror }} {{ ansible_distribution_release }}-updates   main contrib non-free
{% endif %}

Well, that made no difference. jessie-updates was still being included.

The reason why this didn’t work is that the string “8.11” is being compared against “9.0” and “8.11” is actually bigger! This is a very common mistake. In order to fix it the values could be cast, but a better idea is the use the version test (previously known as version_comparison):

{% if ansible_distribution_version is version('9.0', '>=') %}
deb {{ aptcacher_prefix }}{{ debian_mirror }} {{ ansible_distribution_release }}-updates   main contrib non-free
{% endif %}

Looking through all of my playbooks it seems that I’d figured this out long ago for the playbooks themselves — every test of ansible_distribution_version in YAML files was using version() — but some of my templates were still directly trying to use “>” or “>=”.

Of course, since jessie has now been archived it is only receiving security support from the Debian LTS effort and hosts running jessie should be upgraded as soon as possible.

Ansible, ARA and MariaDB (or MySQL)

ARA ^

A short while back I switched to Ansible for configuration management.

One of the first things I needed was a way to monitor Ansible runs for problems, you know, so I wasn’t relying on seeing abnormal logging to tell me that a particular host was unreachable, or that tasks were failing somewhere.

I know I could use Ansible Tower for this, but it seemed like such a heavyweight solution for what seems like a simple requirement. That was when I found ARA.

Honestly even ARA is more than I need, and I don’t typically find myself using the web interface even though it’s very pretty. No, I’m only really interested in poking around in the database.

It’s great that it’s simple to install and get started with, so very quickly I was recording details about my Ansible runs a SQLite database, which is the default setup. From there it was a simple SQL query to check if a particular host had suffered any task failures. It was easy to knock up a Nagios-compatible check plugin to call from my Icinga2 monitoring.

Issues ^

Excess data ^

The first problem I noted was that the SQLite database file was starting to grow in size quite rapidly. Around one week of run data used around 800MiB of storage. It’s not huge, but it was relentlessly growing and I could see very little value in keeping that data as I never looked at data from runs more than a few days previous. So, I wrote a script to delete old stuff from the database, keeping the last week’s worth.

Locking ^

Next up I started seeing SQLite locking problems.

Checks from Icinga2 were connecting to the SQLite DB, and so was the prune script, and Ansible itself too. The ARA Ansible plugin started complaining about locking:

[WARNING]: Failure using method (v2_runner_on_ok) in callback plugin                                                   
(<ansible.plugins.callback./opt/ansible/venv/ansible/lib/python2.7/site-                                                  
packages/ara/plugins/callbacks/log_ara.CallbackModule object at                                                         
0x7f2a84834b10>): (sqlite3.OperationalError) database is locked [SQL: u'SELECT                                          
tasks.id AS tasks_id, tasks.playbook_id AS tasks_playbook_id, tasks.play_id AS                                          
tasks_play_id, tasks.name AS tasks_name, tasks.sortkey AS tasks_sortkey,                                                
tasks.action AS tasks_action, tasks.tags AS tasks_tags, tasks.is_handler AS                                             
tasks_is_handler, tasks.file_id AS tasks_file_id, tasks.lineno AS tasks_lineno,                                         
tasks.time_start AS tasks_time_start, tasks.time_end AS tasks_time_end \nFROM                                           
tasks \nWHERE tasks.id = ?'] [parameters:                                                                               
('5f4506f7-95ac-4468-bea3-672d399d4507',)] (Background on this error at:                                                
http://sqlalche.me/e/e3q8)

The Ansible playbook run itself isn’t quick, taking about 13 minutes at the moment, and it seems like sometimes when that was running the check script too was running into locking issues resulting in alerts that were not actionable.

The ARA Ansible reporting plugin is the only thing that should be writing to the database so I thought it should be simple to let that have a write lock while everything else is freely able to read, but I couldn’t get to the bottom of this. No matter what I tried I was getting lock errors not just from ARA but also from my check script.

The basic problem here is that SQLite really isn’t designed for multiple concurrent access. I needed to move to a proper database engine.

At this point I think people really should be looking at PostgreSQL. My postgres knowledge is really rusty though, and although MySQL has its serious issues I felt like I had a handle on them. Rather than have this be the only postgres thing I have deployed I decided I’d do it with MariaDB.

MariaDB ^

So, MariaDB. Debian stretch. That comes with v10.1. Regular install, create the ARA database, point Ansible at it, do a playbook run and let ARA create its tables on first run. Oh.

UTF8 what? ^

Every play being run by Ansible was giving this warning:

PLAY [all] **************************************************************************
/opt/ansible/venv/ansible/local/lib/python2.7/site-packages/pymysql/cursors.py:170: Warning: (1300, u"Invalid utf8 character string: '9C9D56'")                             
  result = self._query(query)

A bit of searching around suggested that my problem here was that the database was defaulting to utf8mb3 character set when these days MariaDB (and MySQL) should really be using utf8mb4.

Easy enough, right? Just switch to utf8mb4. Oh.

[WARNING]: Skipping plugin (/opt/ansible/venv/ansible/lib/python2.7/site-             
packages/ara/plugins/callbacks/log_ara.py) as it seems to be invalid:                
(pymysql.err.InternalError) (1071, u'Specified key was too long; max key length is
767 bytes') [SQL: u'\nCREATE TABLE files (\n\tid VARCHAR(36) NOT NULL,
\n\tplaybook_id VARCHAR(36), \n\tpath VARCHAR(255), \n\tcontent_id VARCHAR(40),
\n\tis_playbook BOOL, \n\tPRIMARY KEY (id), \n\tFOREIGN KEY(content_id) REFERENCES
file_contents (id), \n\tFOREIGN KEY(playbook_id) REFERENCES playbooks (id) ON DELETE
RESTRICT, \n\tUNIQUE (playbook_id, path), \n\tCHECK (is_playbook IN (0, 1))\n)\n\n']
(Background on this error at: http://sqlalche.me/e/2j85)

The problem now is that by default InnoDB has a maximum key length of 767 bytes, and with the utf8mb4 character set it is possible for each character to use up 4 bytes. A VARCHAR(255) column might need 1020 bytes.

Bigger keys in InnoDB ^

It is possible to increase the maximum key size in InnoDB to 3,000 and something bytes. Here’s the configuration options needed:

[mysqld]
innodb_file_format=Barracuda
innodb_large_prefix=1

But sadly in MariaDB 10.1 even that is not enough, because the InnoDB row format in v10.1 is compact. In order to use large prefixes you need to use either compressed or dynamic.

In MariaDB v10.2 the default changes to dynamic and there is an option to change the default on a server-wide basis, but in v10.1 as packaged with the current Debian stable the only way to change the row format is to override it on the CREATE TABLE command.

Mangling CREATE TABLE ^

Well, this is a little awkward. ARA creates its own database schema when you run it for the first time. Or, you can tell it not to do that, but then you need to create the database schema yourself.

I could have extracted the schema out of the files in site-packages/ara/db/versions/ (there’s only two files), but for a quick hack it was easier to change them. Each op.create_table( needs to have a line added right at the end, e.g. changing from this:

def upgrade():
    op.create_table('data',
    sa.Column('id', sa.String(length=36), nullable=False),                       
    sa.Column('playbook_id', sa.String(length=36), nullable=True),
    sa.Column('key', sa.String(length=255), nullable=True),
    sa.Column('value', models.CompressedData((2 ** 32) - 1), nullable=True),
    sa.ForeignKeyConstraint(['playbook_id'], ['playbooks.id'], ondelete='RESTRICT'),
    sa.PrimaryKeyConstraint('id'),
    sa.UniqueConstraint('playbook_id', 'key')
    )
    ### end Alembic commands ###

To this:

def upgrade():
    op.create_table('data',
    sa.Column('id', sa.String(length=36), nullable=False),                       
    sa.Column('playbook_id', sa.String(length=36), nullable=True),
    sa.Column('key', sa.String(length=255), nullable=True),
    sa.Column('value', models.CompressedData((2 ** 32) - 1), nullable=True),
    sa.ForeignKeyConstraint(['playbook_id'], ['playbooks.id'], ondelete='RESTRICT'),
    sa.PrimaryKeyConstraint('id'),
    sa.UniqueConstraint('playbook_id', 'key'),
    mysql_row_format='DYNAMIC'
    )
    ### end Alembic commands ###

That’s not a good general purpose patch because this is specific to MariaDB/MySQL whereas ARA supports SQLite, MySQL and PostgreSQL. I’m not familiar enough with sqlalchemy/alembic to know how it should properly be done. Maybe the ARA project would take the view that it shouldn’t be done, since the default row format changed already in upstream InnoDB. But I need this to work now, with a packaged version of MariaDB.

Wrapping up ^

So there we go, that all works now, no concurrency issues, ARA is able to create its schema and insert its data properly and my scripts were trivial to port from SQLite syntax to MariDB syntax. The database still needs pruning, but that was always going to be the case.

It was a little more work than I really wanted, having to run a full database engine just to achieve the simple task of monitoring runs for failures, but I think it’s still less work than installing Ansible Tower.

It is a shame that ARA doesn’t work with a Debian stable package of MariaDB or MySQL just by following ARA’s install instructions. I have a feeling this should be considered a bug in ARA but I’m not sure enough to just go ahead and file it, and it seems that the only discussion forums are IRC and Slack. Anyway, once the newer versions of MariaDB and MySQL find their way into stable releases hopefully the character set issues will be a thing of the past.

This limited key length thing (and the whole of MySQL’s history with UTF8 in general) is just another annoyance in a very very long list of MySQL annoyances though, most of them stemming from basic design problems. That’s why I will reiterate: if you’re in a position to do so, I recommend picking PostgreSQL over MySQL.