Gather round children
Uncle Andrew wants to tell you a festive story. The NTPmare shortly after Christmas.
A modest proposal
Nearly two years ago, on the afternoon of Monday 16th January 2017, I received an interesting BitFolk support ticket from a non-customer. The sender identified themselves as a senior software engineer at NetThings UK Ltd.
Subject: Specific request for NTP on IP
This might sound odd but I need to setup an NTP server instance on IP address
18.104.22.168 precious?
22.214.171.124 is actually one of the IP addresses of one of BitFolk’s customer-facing NTP servers. It was also, until a few weeks before this email, part of the NTP Pool project.
“Was” being the important issue here. In late December of 2016 I had withdrawn BitFolk’s NTP servers from the public pool and firewalled them off to non-customers.
I’d done that because they were receiving an unusually large amount of traffic due to the Snapchat NTP bug. It wasn’t really causing any huge problems, but the number of traffic flows was pushing useful information out of Jump’s fixed-size netflow database and I didn’t want to deal with it over the holiday period, so this public service was withdrawn.
This article was posted to Hacker News and a couple of comments there said they would have liked to have seen a brief explanation of what NTP is, so I’ve now added this section. If you know what NTP is already then you should probably skip this section because it will be quite brief and non-technical.
Network Time Protocol is a means by which a computer can use multiple other computers, often from across the Internet on completely different networks under different administrative control, to accurately determine what the current time is. By using several different computers, a small number of them can be inaccurate or even downright broken or hostile, and still the protocol can detect the “bad” clocks and only take into account the more accurate majority.
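To illustrate the idea of outvoting bad clocks, here is a minimal Python sketch. It is not the real NTP selection algorithm, which is considerably more involved; it simply takes the clock offsets reported by several servers, discards any that sit too far from the median, and averages the rest. The function name, the sample offsets and the 0.5 second tolerance are all made up for illustration.

```python
from statistics import median


def combine_offsets(offsets, tolerance=0.5):
    """Average the offsets (in seconds) that lie within `tolerance` of the median.

    Simplified illustration only: real NTP uses a more elaborate selection
    process. `tolerance` is an arbitrary value chosen for this example.
    """
    if not offsets:
        raise ValueError("need at least one offset")
    mid = median(offsets)
    # Keep only the servers that roughly agree with the majority.
    agreeing = [o for o in offsets if abs(o - mid) <= tolerance]
    if not agreeing:
        raise ValueError("no group of servers agrees closely enough")
    return sum(agreeing) / len(agreeing)


# Four servers roughly agree; one is an hour out and is ignored.
samples = [0.012, -0.004, 0.020, 3600.0, 0.008]
print(round(combine_offsets(samples), 3))
```

With only one or two servers there is no majority to appeal to, which is one reason for the usual advice to configure several.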
NTP is supposed to be used in a hierarchical fashion: A small number of servers have hardware directly attached from which they can very accurately tell the time, e.g. an atomic clock, GPS, etc. Those are called “Stratum 1” servers. A larger number of servers use the stratum 1 servers to set their own time, then serve that time to a much larger population of clients, and so on.
It used to be quite hard to find NTP servers that you were allowed to use. Your own organisation might have one or two, but really you should have at least 3 to 7 of them and it’s better if there are multiple different organisations involved. In a university environment that wasn’t so difficult because you could speak to colleagues from another institution and swap NTP access. As the Internet matured and came to be used mostly by corporations and private individuals, though, people still needed access to accurate time, and informal swaps like this weren’t going to cut it.
The NTP Pool project came to the rescue by making an easy web interface for people to volunteer their NTP servers, and then they’d be served collectively in a DNS zone with some basic means to share load. A private individual can just use three names from the pool zone and they will get three different (constantly changing) NTP servers.
Corporations and those making products that need to query the NTP pool are supposed to ask for a “vendor zone”. They make some small contribution to the NTP pool project and then they get a DNS zone dedicated to their product, so it’s easier for the pool administrators to direct the traffic.
Sadly many companies don’t take the time to understand this and just use the generic pool zone. NetThings UK Ltd went one step further in a very wrong direction by taking an IP address from the pool and just using it directly, assuming it would always be available for their use. In reality it was a free service donated to the pool by BitFolk and as it had become temporarily inconvenient for that arrangement to continue, service was withdrawn.
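For contrast with baking in an IP address, here is a rough Python sketch of the intended pattern: build the pool hostnames (optionally inside a vendor zone) and resolve them afresh each time the device wants to sync. The vendor label "examplevendor" is hypothetical, as are the function names; the `resolver` parameter exists only so the lookup can be swapped out for testing.

```python
import socket


def pool_hostnames(vendor=None, count=3):
    """Return pool hostnames such as 0.pool.ntp.org, 1.pool.ntp.org, ...

    `vendor` is the label of a vendor zone if one has been assigned;
    "examplevendor" in the usage note is a made-up placeholder.
    """
    zone = f"{vendor}.pool.ntp.org" if vendor else "pool.ntp.org"
    return [f"{i}.{zone}" for i in range(count)]


def resolve_servers(hostnames, resolver=socket.getaddrinfo):
    """Resolve the names right before use; do not store the results long term."""
    addresses = []
    for name in hostnames:
        try:
            infos = resolver(name, 123, type=socket.SOCK_DGRAM)
        except OSError:
            continue  # this name failed to resolve; try the others
        for info in infos:
            addr = info[4][0]  # sockaddr is the fifth element; the address comes first
            if addr not in addresses:
                addresses.append(addr)
    return addresses
```

A device would call something like `resolve_servers(pool_hostnames("examplevendor"))` at each sync and throw the answers away afterwards, so that a server leaving the pool stops receiving its traffic.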
On with the story…
They want what?
The Senior Software Engineer continued:
The NTP service was recently shutdown and I am interested to know if there is any possibility of starting it up again on the IP address mentioned. Either through the current holder of the IP address or through the migration of the current machine to another address to enable us to lease
I realise that this is a peculiar request but I can assure you it is genuine.
That’s not gonna work
Obviously, what with 126.96.36.199 currently being in use by all customers as a resolver and NTP server, I wasn’t very interested in getting them all to change their configuration and then leasing it to NetThings UK Ltd.
What I did was remove the firewalling so that 188.8.131.52 still worked as an NTP server for NetThings UK Ltd until we worked out what could be done.
I then asked some pertinent questions so we could work out the scope of the service we’d need to provide. Questions such as:
- How many clients do you have using this?
- Do you know their IP addresses?
- When do they need to use the NTP server and for how long?
- Can you make them use the pool properly (a vendor zone)?
Down the rabbit hole
The answers to some of the above questions were quite disappointing.
It would be of some use for our manufacturing setup (where the RTCs are initially set) but unfortunately we also have a reasonably large field population (~500 units with weekly NTP calls) that use roaming GPRS SIMs. I don’t know if we can rely on the source IP of the APN for configuring the firewall in this case (I will check though). We are also unable to update the firmware remotely on these devices as they only have a 5MB per month data allowance. We are able to wirelessly update them locally but the timeline for this is months rather than weeks.
Basically it seemed that NetThings UK Ltd made remote controlled thermostats and lighting controllers for large retail spaces etc. And their devices had one of BitFolk’s IP addresses burnt into them at the factory. And they could not be identified or remotely updated.
Oh, and whatever these devices were, without an external time source their clocks would start to noticeably drift within 2 weeks.
By the way, they solved their “burnt into it at the factory” problem by bringing up BitFolk’s IP address locally at their factory to set initial date/time.
I’ll admit, at this point I was slightly tempted to work out how to identify these devices and reply to them with completely the wrong times to see if I could get some retail parks to turn their lights on and off at strange times.
We are triggering ntp calls on a weekly cron with no client side load balancing. This would result in a flood of calls at the same time every Sunday evening at around 19:45.
Yeah, they made every single one of their unidentifiable devices contact a hard coded IP address within a two minute window every Sunday night.
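A common way to avoid that kind of synchronised stampede is to give each device its own stable slot. The Python sketch below is one possible approach, not anything NetThings did: it hashes a device identifier into an offset within the week. The serial number format and the function name are invented for the example.

```python
import hashlib

WEEK_SECONDS = 7 * 24 * 60 * 60


def weekly_offset(device_id, window=WEEK_SECONDS):
    """Map a device identifier to a stable offset (in seconds) within `window`."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a number and fold it into the window.
    return int.from_bytes(digest[:8], "big") % window


# Hypothetical serial numbers: each unit gets its own slot in the week.
for serial in ("NT-0001", "NT-0002", "NT-0003"):
    print(serial, weekly_offset(serial))
```

Because the offset depends only on the identifier, a device keeps the same slot across reboots without having to store anything.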
The Senior Software Engineer was initially very worried that they were the cause of the excess flows I had mentioned earlier, but I reassured them that it was definitely the Snapchat bug. In fact I never was able to detect their devices above background noise; it turns out that ~500 devices doing a single SNTP query is pretty light load. They’d been doing it for over 2 years before I received this email.
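As a back-of-the-envelope check on how light that load is, the Python snippet below works it through. It assumes a plain 48-byte NTP packet carried over UDP and IPv4 with no options, and ignores link-layer overhead; the device count and window come from the emails quoted above.

```python
DEVICES = 500                 # "~500 units" from the email
WINDOW_SECONDS = 2 * 60       # the two-minute Sunday evening window
PACKET_BYTES = 48 + 8 + 20    # NTP payload + UDP header + IPv4 header

queries_per_second = DEVICES / WINDOW_SECONDS
bytes_per_second = queries_per_second * PACKET_BYTES
print(f"{queries_per_second:.1f} queries/s, {bytes_per_second:.0f} bytes/s each way")
```

That is a handful of packets per second for two minutes a week, which is consistent with it being lost in the background noise.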
I did of course point out that they were lucky we caught this early because they could have ended up as the next Netgear vs. University of Wisconsin.
I am feeling really, really bad about this. I’m very, very sorry if we were the cause of your problems.
Bless. I must point out that throughout all of this, their Senior Software Engineer was a pleasure to work with.
We made a deal
While NTP service is something BitFolk provides as a courtesy to customers, it’s not something that I wanted to sell as a service on its own. And after all, who would buy it, when the public pool exists? The correct thing for a corporate entity to do is support the pool with a vendor zone.
But NetThings UK Ltd were in a bind and not allowing them to use BitFolk’s NTP server was going to cause them great commercial harm. Potentially I could have asked for a lot of money at this point, but (no doubt to my detriment) that just felt wrong.
I proposed that initially they pay me for two hours of consultancy to cover work already done in dealing with their request and making the firewall changes.
I further proposed that I charged them one hour of consultancy per month for a period of 12 months, to cover continued operation of the NTP server. Of course, I do not spend an hour a month fiddling with NTP, but this unusual departure from my normal business had to come at some cost.
I was keen to point out that this wasn’t something I wanted to continue forever:
Finally, this is not a punitive charge. It seems likely that you are in a difficult position at the moment and there is the temptation to charge you as much as we can get away with (a lot more than £840 [+VAT per year], anyway), but this seems unfair to me. However, providing NTP service to third parties is not a business we want to be in so we would expect this to only last around 12 months. If you end up having to renew this service after 12 months then that would be an indication that we haven’t charged you enough and we will increase the price.
Does this seem reasonable?
NetThings UK Ltd happily agreed to this proposal on a quarterly basis.
Thanks again for the info and help. You have saved me a huge amount of convoluted and throwaway work. This give us enough time to fix things properly.
Not plain sailing
I only communicated with the Senior Software Engineer one more time. The rest of the correspondence was with financial staff, mainly because NetThings UK Ltd did not like paying its bills on time.
NetThings UK Ltd paid 3 of its 4 invoices in the first year late. I made sure to charge them statutory late payment fees for each overdue invoice.
Yearly report card: must try harder
As 2017 was drawing to a close, I asked the Senior Software Engineer how NetThings UK Ltd was getting on with ceasing to hard code BitFolk’s IP address in its products.
To give you a quick summary, we have migrated the majority of our products away from using the fixed IP address. There is still one project to be updated after which there will be no new units being manufactured using the fixed IP address. However, we still have around 1000 units out in the field that are not readily updatable and will continue to perform weekly NTP calls to the fixed IP address. So to answer your question, yes we will still require the service past January 2018.
This was a bit disappointing because a year earlier the number had been “about 500” devices, yet despite a year of effort the number had apparently doubled.
That alone would have been enough for me to increase the charge, but I was going to anyway due to NetThings UK Ltd’s aversion to paying on time. I gave them just over 2 months of notice that the price was going to double.
u wot m8
Approximately 15 weeks after being told that the price doubling was going to happen, NetThings UK Ltd’s Financial Controller asked me why it had happened, while letting me know that another of their late payments had been made:
Date: Wed, 21 Feb 2018 14:59:42 +0000
We’ve paid this now, but can you explain why the price has doubled?
I was very happy to explain again in detail why it had doubled. The Financial Controller in response tried to agree a fixed price for a year, which I said I would be happy to do if they paid for the full year in one payment.
My rationale for this was that a large part of the reason for the increase was that I had been spending a lot of time chasing their late payments, so if they wanted to still make quarterly payments then I would need the opportunity to charge more if I needed to. If they wanted assurance then in my view they should pay for it by making one yearly payment.
There was no reply, so the arrangement continued on a quarterly basis.
All good things…
On 20 November 2018 BitFolk received a letter from Deloitte:
Netthings Limited – In Administration (“The Company”)
Company Number: SC313913
Cessation of Trading
The Company ceased to trade with effect from 15 November 2018.
As part of our duties as Joint Administrators, we shall be investigating what assets the Company holds and what recoveries if any may be made for the benefit of creditors as well as the manner in which the Company’s business has been conducted.
And then on 21 December:
Under paragraph 51(1)(b) of the Insolvency Act 1986, the Joint Administrators are not required to call an initial creditors’ meeting unless the Company has sufficient funds to make a distribution to the unsecured creditors, or unless a meeting is requested on Form SADM_127 by 10% or more in value of the Company’s unsecured creditors. There will be no funds available to make a distribution to the unsecured creditors of the Company, therefore a creditors’ meeting will not be convened.
Luckily their only unpaid invoice was for service from some point in November, so they didn’t really get anything that they hadn’t already paid for.
So that’s the story of NetThings UK Ltd, a brave pioneer of the Internet of Things wave, who thought that the public NTP pool was just an inherent part of the Internet that anyone could use for free, and that the way to do that was to pick one IP address out of it at random and bake that into over a thousand bits of hardware that they distributed around the country with no way to remotely update.
This coupled with their innovative reluctance to pay for anything on time was sadly not enough to let them remain solvent.
18 thoughts on “The Internet of Unprofitable Things”
As long as manufacturers still write broken code and are unaware of the proper way to use NTP, nothing can be done to solve this issue. Many of those involved in this misuse and abuse are totally unaware of what they are doing. People just assume these are some random super servers that always work, and take no responsibility for their actions, such as hardcoding IP addresses, writing abusive retry code (without exponentially increasing timeouts), and making a cron job that starts a synchronization exactly at midnight (without randomization), which is effectively a DDoS.
If a single server gets hardcoded into those mass-manufactured devices, serious consequences can follow: the volunteer may literally go bankrupt, or your whole institute/school may be kicked off the Internet; and when you go to the manufacturer asking them to pay for the damage they are responsible for, you are threatened by a lawyer from California. The whole Internet community should honor the spirit of self-sacrifice of these NTP volunteers. Most of the NTP pool servers have similar issues: once you’re in and have become well known on the net, there’s no way out and you keep receiving bad traffic. Fortunately, given reasonable bandwidth, it is often a negligible issue. But not always.
The proper way to use NTP should be written up in every textbook for practical networking courses. And the NTP community needs publicity; perhaps we should build a campaign website to educate software developers on the issue. I don’t know if you are interested in this, but if so, perhaps we can talk to Mr. Ask, the project leader of the NTP Pool, about doing something?
 Flawed Routers Flood University of Wisconsin Internet Time Server
 Open Letter to D-Link about their NTP vandalism
 Recent NTP pool traffic increase
Specifically, software developers need to understand the following list of DOs and DON’Ts.
1. NEVER, ever hardcode an NTP server, whether IP or domain. DO NOT just go to a list of NTP servers, then copy a few into your code. DON’T DO THIS! PLEASE!
2. DO NOT use Stratum 1 servers. Please use Stratum 2 and lower.
3. If the scale of your system is small, in the hundreds or a few thousand, PLEASE USE pool.ntp.org; this is the NTP community cluster backed by a DNS load balancer. Always query DNS, and make sure the IP is not cached locally for too long. If you need more than one server, use 0.pool.ntp.org, 1.pool.ntp.org, 2.pool.ntp.org, etc. (3 is often enough).
4. If the scale of your system is large, such as tens of thousands, or you are making a new system, you SHOULD request a customized prefix from pool.ntp.org, such as debian.pool.ntp.org; it helps the community to manage the traffic. If your system is a large commercial one, you ARE REQUIRED to donate some servers to the NTP Pool to compensate the community. Another option is running your own private NTP cluster. The policy is here: https://www.ntppool.org/en/vendors.html
5. If possible, run a standard NTP implementation, like NTPd, chrony, or something else as long as it’s written professionally. Nowadays even lightbulbs run Linux, so why don’t you run a standard NTPd?
But if you can’t, then make sure you…
(a) Implement NTPv4; DO NOT use NTPv1.
(b) Read the new SNTP RFC if you are implementing an SNTP client. http://www.faqs.org/rfc/rfc4330.txt
(c) DO NOT synchronize time at the beginning of an hour, or at 00:00 UTC! Select a minute in an hour randomly for synchronization.
(d) Use an exponentially increasing retry interval. DO NOT keep retrying when the server is unreachable; you are launching a DDoS attack!
(e) Support the Kiss-o’-Death (KoD) packet: your client should immediately stop querying a server, cease and desist, once a KoD packet is received.
(f) Make sure the client stops querying the built-in list of servers once an alternative server is set by the user.
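Points (d) and (e) can be sketched roughly as follows in Python. This is an illustration under stated assumptions rather than a complete SNTP client: the base delay, cap and attempt count are arbitrary, both function names are invented, and the KoD check relies only on the standard NTP header layout (stratum in the second byte, reference identifier in bytes 12 to 15, with stratum 0 in a server reply marking a kiss code).

```python
import random
import struct


def backoff_delays(base=64, cap=3600, attempts=8, rng=random):
    """Yield retry delays in seconds: doubling each time, capped, with jitter."""
    delay = base
    for _ in range(attempts):
        # Jitter: wait somewhere between half and all of the current delay,
        # so that many clients failing together do not retry in lockstep.
        yield rng.uniform(delay / 2, delay)
        delay = min(delay * 2, cap)


def kiss_code(packet):
    """Return the kiss code (e.g. "RATE", "DENY") if `packet` is a KoD reply, else None."""
    if len(packet) < 48:
        raise ValueError("NTP packets are at least 48 bytes")
    stratum = packet[1]
    if stratum != 0:
        return None  # an ordinary reply carrying a usable time
    (ref_id,) = struct.unpack("!4s", packet[12:16])
    return ref_id.decode("ascii", errors="replace")
```

A client built on this would stop using a server altogether on a code such as DENY or RSTR, and slow its polling on RATE, instead of hammering the server until it answers.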
>Specifically, software developers need to understand the following list of DOs and DON’Ts
People who try to teach developers good practices need to understand that they won’t learn. Developers don’t care, companies care even less. They will never ever ever do things the right way unless they are forced to from the start.
Maybe pool.ntp.org should systematically rotate its IP addresses by policy, so that an IP does not remain in the pool for more than 1 week at most. This would dissuade this behavior pretty quickly.
As far as I understand, the NTP Pool does rotate the IP addresses it serves in order to try to share the load according to how big you say the server’s connection is when you add it to the pool. This doesn’t always help very much because many NTP clients (especially badly behaved ones) tend to hang on to the IP addresses for a long time. The NTP servers in the pool are NTP servers 24/7 whether they are currently being handed out by the pool or even if they quit being in the pool entirely. So clients can carry on using them indefinitely.
In this specific case, NetThings UK Ltd intentionally used an IP address of a known NTP server that they got from the pool, in a misguided attempt to reduce bandwidth use: they wanted to avoid DNS lookups. Because the IP address continued to act as a reliable NTP server over a long period of time, they did not realise that what they were doing was a mistake.
There’s nothing the NTP Pool project or I could have done / can do to stop people doing things like this.
Everyone involved was lucky that this product was only deployed in ~1,000 devices, not millions.
An interesting idea (which I’m sure the pool would probably never implement), would be for a server to stop responding to ntp queries when it is removed from the pool. Obviously there would need to be a whitelist for the pool’s own monitoring queries, but servers wouldn’t respond to queries from unknown addresses. (Another alternative would be to degrade responses when the server is not in the pool – eg by adding some jitter to the replies. It doesn’t even have to be random and could be based on a hash of the source address, so it’s consistently wrong for a given client)
I suppose that might go some way to stop people hard coding IP addresses although I can’t imagine that many people making use of the option since the normal way of things is that you have an NTP server which you’re using 24/7, and you also add it into the NTP Pool. It could be a firewall rule that blocks “The Internet” while leaving your own network able to use it.
I have a feeling that you’d still end up with people hard coding and sending queries forever even getting no reply…
It could be that the devices they manufacture are so tiny and have such a rudimentary network stack that they cannot do DNS lookups.
That could be a reason to use a fixed IP. However it does not explain why:
– they did not use an IP address of their own servers or one that agreed to provide the service to them
– they did not change the IP in the production environment as soon as they became aware of this problem
Maybe their developer had already left and there was nobody there who could change the software programmed into production devices?
They used BitFolk’s IP address because it was one they got out of the NTP Pool project and saw that it worked over a long period of time, so thought it would work forever.
They used an IP address instead of a hostname in a misguided attempt to reduce bandwidth use by avoiding DNS resolution.
They did all of this quite simply because they didn’t know what they were doing, they were in a rush and what they had seemed to work – until it didn’t. By that point they had no way to remotely fix it, as they had never allowed for that scenario.
The last time I communicated with the developer was November 2017, which was around 10 months into the arrangement and about a year before the company formally ceased trading. I don’t know how long they stayed with the company after November 2017.
So even though they knew what they were doing was wrong they kept creating/releasing new devices doing it. Reading between the lines they might not have updated any in-the-field devices at all.
This might be a case of management ignoring the devs. Sadly, it’s not surprising the company closed.
Certainly the claimed number of devices doubled between the time when the arrangement was agreed and ~10 months later when a new arrangement was being negotiated for the second year, so either they were mistaken to start with or else released more with the same issue even after they knew about it.
£70+VAT per month and later £140+VAT per month for all devices was probably cheaper than visiting even one of the devices in person, which is what they led me to believe was required for them to fix them once they were in the field.
It wasn’t clear to me that they could even do a remote upgrade of these particular units. That might be the core issue; otherwise, they could have migrated to another (fixed) 🙂 NTP address.
Yes, they did not seem to plan for remote firmware/software upgrade at all. The tiny amount of data allowed through GPRS was just for telemetry but even if it had been sufficient, I don’t think they had ever built in the functionality.
So, now I’m curious. Are you going to firewall off the IP address again?
Most probably they had sold some inventory that got used during the period. With IoT devices, especially thermostats, lightbulbs etc., it is basically impossible to go and update them all, as most of the time the developer/supplier doesn’t even know where they are.
No online update means they would most probably have to disassemble and flash over 1000 devices via RS232/USB. It would be cheaper just to recall and replace all of them than to have an engineer go on site reflashing every single one. When you think about it, 1,000 devices is a really, really small number. I have no idea what their margin/profit per device was, but it is not surprising that they went bankrupt.
A more sophisticated rotation algorithm, coupled with a large, wasteful pool of addresses, might work. I’m thinking of having way more addresses allocated than are ever actually in use, such that any given IP address spends most of its time idle, i.e. *not* being an NTP server. The small fraction of addresses in use as servers at any given time would be registered through DNS as usual, and updated / rotated accordingly. Therefore, the majority of numeric-IP-based accesses would fail, discouraging that practice, while DNS-based access would always succeed. Note that I have not run the numbers / done the math to see where the feasibility break-even point is on this.
In my experience, the people who find themselves employed as heads of IT of companies, Chief Engineers of radio network conglomerates, etc, do not have a college education, and in many cases do not read books. They have learned everything they know by dint of experience, trial and error, and “the seat of their pants.” I myself last had formal education in TCP/IP technologies around 1990, before many of today’s now-standard practices were instituted — maybe even before certain standard ports and services (HTTP?) were defined. I would have had no idea there was an NTP Pool I was supposed to access via DNS lookup, had I been involved in developing a product such as this one, and might well have made this same error out of pure naïveté.