Service announcement: 15 March 2016
Service disruption: APNIC services were disrupted on Tuesday, 15 March 2016.
|Start Time|Tuesday, 15 March 2016 09:13 (UTC+10)|
|End Time|Wednesday, 16 March 2016 19:24 (UTC+10)|
|Duration|1 day, 10 hours and 11 minutes|
From Tue, 15 Mar 2016 09:13 (UTC+10) to Wed, 16 Mar 2016 19:24 (UTC+10), incorrect DNSSEC cryptographic information was registered in the global DNS, which invalidated reverse DNS queries into the APNIC-managed number space. This was caused by APNIC technical staff uploading incorrect DS records to IANA during an unscheduled change of nameserver requested by the RIPE NCC.
During this period, Internet services and applications that depend on reverse DNS and were exclusively using DNSSEC-enabled resolvers will have been affected.
APNIC technical staff resolved the problem by uploading the correct DS records to IANA. Users may have experienced some difficulties following the resolution due to cached DNS data.
APNIC has conducted a root cause analysis and identified two critical process failures that together contributed to the incident. We have modified our operating processes to limit exposure to these causes in future. Operational activity in reverse DNS is now subject to a higher degree of scrutiny by senior APNIC staff.
There are two timelines, one for the IPv4 reverse zones, and one for the 0.4.2.ip6.arpa zone (IPv6).
All zones have been verified for correctness. The affected zones are listed below in Appendix A.
3. Detailed description and root cause analysis
As background, the RIPE NCC provides secondary DNS services for APNIC. Due to necessary configuration changes, APNIC was requested to update its zone files to reflect a new name server.
The following two issues contributed to this service outage.
Incorrect DS files
The zone files were updated with the correct NS and DS records. However, during the update push to IANA, an automatic configuration management tool (CM tool) overwrote the DS record with previous data, which was invalid. This was a fault because the CM tool is designed to manage system configuration centrally, not to manage information generated on demand in other IT systems (including the DNS). The CM tool's role should have been limited to the installation and bootstrapping of the DNS service; the live data is managed outside this configuration framework. In addition, the DS upload mechanism did not check the semantic or syntactic correctness of the messages being sent, so neither the stale DS nor the malformed IPv6 message was detected before sending.
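As an illustration of the kind of semantic check that was missing, the following is a minimal sketch of verifying a DS record against the DNSKEY it is meant to reference before it is uploaded. It is not APNIC's actual tooling. The function names are hypothetical, the sketch assumes digest type 2 (SHA-256), and the key tag routine follows the general algorithm in RFC 4034 Appendix B, which does not apply to the legacy algorithm 1 (RSA/MD5).

```python
import base64
import hashlib
import struct


def name_to_wire(name: str) -> bytes:
    """Encode a domain name in canonical (lowercase) DNS wire format."""
    wire = b""
    for label in name.rstrip(".").lower().split("."):
        if label:
            wire += bytes([len(label)]) + label.encode("ascii")
    return wire + b"\x00"  # terminating root label


def dnskey_rdata(flags: int, protocol: int, algorithm: int, public_key_b64: str) -> bytes:
    """Build DNSKEY RDATA: flags (2 bytes), protocol, algorithm, public key."""
    return struct.pack("!HBB", flags, protocol, algorithm) + base64.b64decode(public_key_b64)


def key_tag(rdata: bytes) -> int:
    """Key tag as described in RFC 4034 Appendix B (not valid for algorithm 1)."""
    acc = 0
    for i, byte in enumerate(rdata):
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def ds_digest_sha256(owner: str, rdata: bytes) -> str:
    """DS digest (type 2): SHA-256 over owner name followed by DNSKEY RDATA."""
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest().upper()


def ds_matches_dnskey(owner: str, ds_key_tag: int, ds_algorithm: int,
                      ds_digest_hex: str, rdata: bytes) -> bool:
    """Return True only if the DS fields are consistent with the given DNSKEY."""
    return (
        ds_key_tag == key_tag(rdata)
        and ds_algorithm == rdata[3]  # algorithm is the fourth RDATA byte
        and ds_digest_hex.replace(" ", "").upper() == ds_digest_sha256(owner, rdata)
    )
```

A pre-upload step could call `ds_matches_dnskey` for each DS record against the DNSKEY currently published in the zone and refuse to send the message if any record fails, which would reject a DS derived from a previous key.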
Delayed monitoring results
Our monitoring system currently checks DNSSEC validation from our DNS distribution servers. This check runs every 15 minutes, and APNIC can verify that the monitoring system was running for the duration of this outage. Unfortunately, the check did not report any failure because the resolver used cached responses.
4. Corrective and Preventative Measures
To eliminate incorrect updates, the APNIC Infrastructure Services (IS) team is reviewing the current internal processes and will make any necessary changes to ensure that, in the short term, the configuration system does not override update information and that, in the long term, this type of error is prevented from occurring. Because this review will take time, we have implemented, with immediate effect, a process whereby any change to production systems is brought to a specialist panel of senior staff for review. This will also serve as a basis for future processes.
The monitoring checks for our DNS systems are currently being reviewed, updated and formally tested to ensure this and other problems have been fixed. The code will be rewritten so that caching is bypassed and the check retrieves current information. This process will be extended to all APNIC monitoring capabilities.
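One possible way to bypass resolver caching in such a check is to query an authoritative name server directly with recursion disabled. The following is a minimal sketch under that assumption, not the monitoring code referred to above. The function names and the `server_ip` parameter are hypothetical, and the sketch only builds and sends the query and checks the AA (authoritative answer) flag; it does not parse or validate the returned records.

```python
import socket
import struct

TYPE_DS = 43      # DS records are served by the parent zone's name servers
TYPE_DNSKEY = 48  # DNSKEY records are served by the zone's own name servers


def build_query(qname: str, qtype: int, query_id: int = 0x1234) -> bytes:
    """Build a DNS query with RD clear and the EDNS0 DO bit set."""
    flags = 0x0000  # standard query, RD=0: the server must not recurse
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 1)  # 1 question, 1 OPT
    question = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    ) + b"\x00"
    question += struct.pack("!HH", qtype, 1)  # class IN
    # OPT pseudo-record: root name, type 41, UDP payload size 4096,
    # extended flags with DO (0x8000) set, empty RDATA.
    opt = b"\x00" + struct.pack("!HHIH", 41, 4096, 0x00008000, 0)
    return header + question + opt


def is_authoritative(response: bytes) -> bool:
    """Return True if the AA flag is set in the response header."""
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & 0x0400)


def query_authoritative(server_ip: str, qname: str, qtype: int, timeout: float = 3.0) -> bytes:
    """Send the query over UDP straight to one name server and return the raw reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(qname, qtype), (server_ip, 53))
        response, _ = sock.recvfrom(4096)
        return response
```

A check built this way would treat a reply without the AA flag as a failure, since such a reply may have come from a cache rather than from the zone data itself.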
Finally, as part of the existing Infrastructure improvement project, we will be conducting an external audit of our systems to proactively detect any weaknesses.
5. Appendix A – Affected ranges
1.in-addr.arpa 101.in-addr.arpa 103.in-addr.arpa 106.in-addr.arpa 110.in-addr.arpa 111.in-addr.arpa 112.in-addr.arpa 113.in-addr.arpa 114.in-addr.arpa 115.in-addr.arpa 116.in-addr.arpa 117.in-addr.arpa 118.in-addr.arpa 119.in-addr.arpa 120.in-addr.arpa 121.in-addr.arpa 122.in-addr.arpa 123.in-addr.arpa 124.in-addr.arpa 125.in-addr.arpa 14.in-addr.arpa 150.in-addr.arpa 153.in-addr.arpa 163.in-addr.arpa 171.in-addr.arpa 175.in-addr.arpa 180.in-addr.arpa 182.in-addr.arpa 183.in-addr.arpa 202.in-addr.arpa 203.in-addr.arpa 210.in-addr.arpa 211.in-addr.arpa 218.in-addr.arpa 219.in-addr.arpa 220.in-addr.arpa 221.in-addr.arpa 222.in-addr.arpa 223.in-addr.arpa 27.in-addr.arpa 36.in-addr.arpa 39.in-addr.arpa 42.in-addr.arpa 43.in-addr.arpa 49.in-addr.arpa 58.in-addr.arpa 59.in-addr.arpa 60.in-addr.arpa 61.in-addr.arpa
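To check whether a given IPv4 address falls under one of the zones listed above, a small helper along the following lines could be used. This is an illustrative sketch: the function name is hypothetical, the set of first octets is transcribed from the list above, and the IPv6 zone 0.4.2.ip6.arpa is not covered.

```python
import ipaddress

# First octets of the affected /8 reverse zones, transcribed from Appendix A.
AFFECTED_FIRST_OCTETS = {
    1, 14, 27, 36, 39, 42, 43, 49, 58, 59, 60, 61,
    101, 103, 106, *range(110, 126),
    150, 153, 163, 171, 175, 180, 182, 183,
    202, 203, 210, 211, *range(218, 224),
}


def affected_zone(address: str):
    """Return the affected reverse zone for an IPv4 address, or None."""
    addr = ipaddress.ip_address(address)
    if addr.version != 4:
        return None  # IPv6 is outside the scope of this sketch
    first_octet = addr.packed[0]
    if first_octet in AFFECTED_FIRST_OCTETS:
        return f"{first_octet}.in-addr.arpa"
    return None
```

For example, `affected_zone("203.0.113.5")` returns `"203.in-addr.arpa"`, while an address outside the listed ranges returns `None`.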
We apologize for the loss of facilities and any inconvenience caused. Should you require assistance in dealing with any problems arising from this outage, please contact the APNIC Helpdesk.