Yesterday morning, I started receiving down alerts for this blog. Pingdom’s alerts included the message: “Reason: Non-recoverable failure in name resolution”. Right from the start, to simplify troubleshooting, I removed Cloudflare (wrongly, as it turned out), since I had a similar issue with them a year ago and was able to resolve it by temporarily removing them. However, upon investigation, the issue was not related to Cloudflare and soon resolved itself. Cloudflare’s support, however, pointed me to this HN discussion thread.
Now, this morning via GTMetrix I noticed an increase in load times. I thought, “Oh yeah! I should revert nameservers to Cloudflare.” Before doing so I captured the graphs on this page, which show how response time doubled with Cloudflare removed. As Joni Mitchell put it in her song Big Yellow Taxi: “…you don’t know what you got ’til it’s gone”, and that goes for Cloudflare but also my domain’s DNS resolution. lol
Cause of the .IO TLD issues on September 20th, 2017
The intermittent global DNS issues with resolving records on .IO domains affected tech domains and .io startups. One such affected startup was DNS Spy, whose founder, Mattias Geniar, provided me with some details on the cause. He notes:
“The .IO TLD uses 7 different nameservers for its top level domain; a0.nic.io, ns-a3.io, c0.nic.io, ns-a2.io, b0.nic.io, ns-a1.io and ns-a4.io. 2 of those nameservers, ns-a2.io and ns-a4.io, started misbehaving and instead of returning with a correct set of nameservers for the domain you were requesting, started to reply with an NXDOMAIN result. Essentially declaring that the domain you were requesting, didn’t exist.
The problem is that NXDOMAIN is a valid DNS response, which can be cached. So a DNS client doesn’t retry its query on a different nameserver, it got a reply and will honour that: the domain you’re trying to reach doesn’t exist. As far as I’m aware, there hasn’t been any official communication from the .IO registry, so all we’re left with is guessing. Was this related to the recent DNSSEC KEY increase? Was this a targeted attack? Was it human error? Software failure? We don’t know.”
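To see whether a particular TLD nameserver is handing out NXDOMAIN, you can query each one directly instead of going through a caching resolver. Here is a minimal Python sketch that uses only the standard library. The nameserver hostnames are the ones listed in the quote above and may well have changed since; `example.io` is a placeholder domain, and `build_query`, `rcode_of` and `ask` are illustrative names of my own, not part of any tool mentioned in this post.

```python
import socket
import struct

# Nameservers for the .IO TLD as listed in the quote above (may be outdated).
IO_NAMESERVERS = [
    "a0.nic.io", "ns-a3.io", "c0.nic.io", "ns-a2.io",
    "b0.nic.io", "ns-a1.io", "ns-a4.io",
]

# A few common DNS response codes; anything else is shown as a number.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=2, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 2 = NS, class IN)."""
    # Header: id, flags (0 = standard query, no recursion), 1 question, no records.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    # The name is encoded as length-prefixed labels ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def rcode_of(response):
    """Return the response code stored in the low 4 bits of the flags field."""
    if len(response) < 4:
        return "malformed"
    (flags,) = struct.unpack("!H", response[2:4])
    code = flags & 0x000F
    return RCODES.get(code, str(code))


def ask(server, name, timeout=3.0):
    """Send one UDP query to a single nameserver and return its rcode as text."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            response, _ = sock.recvfrom(512)
        except OSError as exc:  # covers timeouts and lookup failures
            return f"error: {exc}"
    return rcode_of(response)


if __name__ == "__main__":
    domain = "example.io"  # placeholder; substitute your own .io domain
    for ns in IO_NAMESERVERS:
        print(f"{ns:12} {ask(ns, domain)}")
```

For a registered domain, I would expect a healthy TLD nameserver to answer NOERROR with a referral. NXDOMAIN from only some of the servers would match the behaviour Mattias describes.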
With yesterday’s (and previous) .IO events, we can only hope that raising awareness will help ensure that future issues are made more transparent or, better yet, avoided.