Service announcement: 15 March 2016

Service disruption: APNIC services were disrupted on Tuesday, 15 March 2016.

Start Time: Tuesday, 15 March 2016 09:13 (UTC+10)
End Time: Wednesday, 16 March 2016 19:24 (UTC+10)
Duration: 1 day, 10 hours and 11 minutes
Services affected: Reverse DNS


1. Summary

From Tue, 15 Mar 2016 09:13 (UTC+10) to Wed, 16 Mar 2016 19:24 (UTC+10), incorrect DNSSEC cryptographic information was registered in the global DNS, causing reverse DNS queries into the APNIC-managed number space to fail DNSSEC validation. This was caused by incorrect DS records having been uploaded to IANA by APNIC technical staff during an unscheduled change of nameserver requested by the RIPE NCC.

During this period, Internet services and applications that depend on reverse DNS and were exclusively using DNSSEC-enabled resolvers will have been affected.

APNIC technical staff resolved the problem by uploading the correct DS records to IANA. Users may have experienced some difficulties following the resolution due to cached DNS data.

APNIC has conducted a root cause analysis and identified two critical process failures that together contributed to the incident. We have modified our operating processes to limit exposure to these causes in future. Operational activity in reverse DNS is now subject to a higher degree of scrutiny by senior APNIC staff.

2. Timeline

There are two timelines, one for the IPv4 reverse zones and one for the IPv6 reverse zones.

IPv4 reverse zones timeline:
Tue, 15 Mar 2016 09:13 UTC+10:
First update with invalid DS.
Tue, 15 Mar 2016 09:28 UTC+10:
Last update with invalid DS.
Tue, 15 Mar 2016 09:30 UTC+10:
IANA received the update with an invalid DS record. However, it was (correctly) not processed, because a syntax error in the file stopped IANA's automatic update process.
Wed, 16 Mar 2016 00:30 UTC+10:
Notification from public DNS operations mailing list and ticket opened.
Wed, 16 Mar 2016 02:05 UTC+10:
First update with correct DS.
Wed, 16 Mar 2016 02:05 UTC+10:
Last update with correct DS.  Issue resolved in IPv4.  Due to DNS cache effects, the old cached data may take time to expire in the public DNS and be replaced with the new information.
Wed, 16 Mar 2016 12:40 UTC+10:
IANA pushed through the updates for all IPv6 reverse zones, including the one that had an incorrect DS record.
Wed, 16 Mar 2016 19:24 UTC+10:
APNIC pushed the correct update.  Issue resolved in IPv6.  Due to DNS cache effects, the old cached data may take time to expire in the public DNS and be replaced with the new information.

All zones have been verified for correctness. The affected zones are listed below in Appendix A.

3. Detailed description and root cause analysis

As background, the RIPE NCC provides secondary DNS services for APNIC. Due to necessary configuration changes, APNIC was requested to update its zone files to reflect a new name server.

The following two issues contributed to this service outage.

Incorrect DS files

The zone files were updated with the correct NS and DS records. However, during the update push to IANA, an automatic configuration management tool (CM tool) overwrote the DS record with previous data, which was invalid. The CM tool should not have been managing this data: it is designed to manage systems configuration centrally, not information generated on demand in other IT systems (including the DNS). Its scope should have been limited to the installation and bootstrapping of the DNS service; the live data is managed outside this configuration framework.

In addition, the DS upload mechanism did not check the semantic or syntactic correctness of the messages being sent. As a result, neither the stale DS nor the malformed IPv6 message was detected before sending.
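To illustrate the kind of pre-upload check that was missing, the following is a minimal sketch, not APNIC's actual tooling. It assumes the DS record is available as presentation-format text (key tag, algorithm, digest type, hex digest) and that the zone's current DNSKEY RDATA is available as bytes. All function names and the zone name are hypothetical. The key tag and digest calculations follow RFC 4034.

```python
import hashlib
import struct

# DS digest type -> (hashlib name, digest length in bytes)
DIGESTS = {1: ("sha1", 20), 2: ("sha256", 32), 4: ("sha384", 48)}


def name_to_wire(name: str) -> bytes:
    """Encode a domain name in canonical (lowercase) DNS wire format."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def dnskey_rdata(flags: int, protocol: int, algorithm: int, public_key: bytes) -> bytes:
    """Build DNSKEY RDATA: flags (2 bytes), protocol, algorithm, public key."""
    return struct.pack("!HBB", flags, protocol, algorithm) + public_key


def key_tag(rdata: bytes) -> int:
    """Key tag as defined in RFC 4034 Appendix B (not valid for algorithm 1)."""
    acc = 0
    for i, byte in enumerate(rdata):
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def parse_ds(text: str):
    """Syntactic check: raise ValueError if the DS text is malformed."""
    fields = text.split()
    if len(fields) < 4:
        raise ValueError("DS record needs key tag, algorithm, digest type, digest")
    tag, algorithm, digest_type = (int(f) for f in fields[:3])
    digest = bytes.fromhex("".join(fields[3:]))
    if digest_type not in DIGESTS:
        raise ValueError(f"unsupported digest type {digest_type}")
    if len(digest) != DIGESTS[digest_type][1]:
        raise ValueError("digest length does not match digest type")
    return tag, algorithm, digest_type, digest


def make_ds(owner: str, rdata: bytes, digest_type: int = 2) -> str:
    """Derive DS text from a DNSKEY: digest over owner name + DNSKEY RDATA."""
    hash_name = DIGESTS[digest_type][0]
    digest = hashlib.new(hash_name, name_to_wire(owner) + rdata).hexdigest()
    return f"{key_tag(rdata)} {rdata[3]} {digest_type} {digest.upper()}"


def ds_matches_zone(owner: str, ds_text: str, dnskey_rdatas) -> bool:
    """Semantic check: the DS must correspond to a DNSKEY currently in the zone."""
    tag, algorithm, digest_type, digest = parse_ds(ds_text)
    hash_name = DIGESTS[digest_type][0]
    for rdata in dnskey_rdatas:
        if key_tag(rdata) != tag or rdata[3] != algorithm:
            continue
        if hashlib.new(hash_name, name_to_wire(owner) + rdata).digest() == digest:
            return True
    return False


if __name__ == "__main__":
    zone = "1.in-addr.arpa."  # illustrative zone name only
    current_key = dnskey_rdata(257, 3, 8, bytes(range(64)))      # placeholder key bytes
    previous_key = dnskey_rdata(257, 3, 8, bytes(range(1, 65)))  # placeholder key bytes
    stale_ds = make_ds(zone, previous_key)
    # A stale DS derived from an earlier key does not match the live DNSKEY set.
    print(ds_matches_zone(zone, stale_ds, [current_key]))
```

An upload step could refuse to send any message for which `parse_ds` raises or `ds_matches_zone` returns false, which would catch both a syntactically broken message and a DS derived from a key no longer in the zone.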

Delayed monitoring results

Our monitoring system currently checks DNSSEC validation from our DNS distribution servers. This check runs every 15 minutes, and APNIC can verify that the monitoring system was running for the duration of this outage. Unfortunately, the check did not report any failure because the resolver it used returned cached responses.

4. Corrective and Preventative Measures

To prevent incorrect updates, the APNIC Infrastructure Services (IS) team is reviewing the current internal processes and will make any necessary changes to ensure that, in the short term, the configuration system does not override update information and, in the long term, this type of error is prevented from occurring. Because this review will take time, we have implemented, with immediate effect, a process whereby any changes to production systems are brought to a specialist panel of senior staff for review. This will also serve as the basis for future processes.

The monitoring checks for our DNS systems are currently being reviewed, updated and formally tested to ensure this and other problems have been fixed. The code will be rewritten to ensure that caching is bypassed to retrieve the correct information. This process will be extended to all APNIC monitoring capabilities.
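One common way to bypass resolver caches is to send non-recursive queries straight to the authoritative name servers. The sketch below shows that approach under stated assumptions; it is not the rewritten APNIC monitoring code. The function names, the query ID and the 4096-byte EDNS buffer size are illustrative choices. Note that the DS record set is served by the parent zone's name servers, while the DNSKEY set is served by the zone's own name servers.

```python
import socket
import struct

QTYPE_DS = 43
QTYPE_DNSKEY = 48


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def build_query(name: str, qtype: int, query_id: int = 0x1234) -> bytes:
    """Build a non-recursive query (RD bit clear) with the EDNS0 DO bit set."""
    flags = 0x0000  # standard query; recursion not requested
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    # OPT pseudo-record: root name, type 41, UDP size 4096, DO flag, no RDATA
    opt = b"\x00" + struct.pack("!HHIH", 41, 4096, 0x00008000, 0)
    return header + question + opt


def query_authoritative(server_ip: str, name: str, qtype: int,
                        timeout: float = 3.0) -> bytes:
    """Send the query over UDP directly to one authoritative server."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (server_ip, 53))
        response, _ = sock.recvfrom(4096)
    return response


def is_authoritative_answer(response: bytes) -> bool:
    """True if the AA bit is set, i.e. the answer did not come from a cache."""
    flags = struct.unpack("!H", response[2:4])[0]
    return bool(flags & 0x0400)
```

A complete check would also need to parse the returned DS and DNSKEY records, compare them, and retry over TCP when a response is truncated; those parts are omitted here.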

Finally, as part of the existing Infrastructure improvement project, we will be conducting an external audit of our systems to proactively detect any weaknesses.

5. Appendix A – Affected ranges



We apologize for the loss of facilities and any inconvenience caused. Should you require assistance in dealing with any problems arising from this outage, please contact the APNIC Helpdesk.