[Azure] Application Gateway certificate gotchas
At my current assignment, my team uses the Azure Application Gateway to securely expose some services within Azure, such as API Management and Web Apps. Until a couple of weeks ago, we were using the “old” (what’s old, right?) version of the gateway to do this. Then a production outage woke us up. Let me describe what happened.
The Application Gateway allows you to configure a listening URL that differs from the URL your back-end uses. In our case, some of our back-ends simply use the *.azurewebsites.net certificate, while our front-ends use custom URLs on the customer’s domain. This effectively means that the gateway terminates the “outside” SSL connection and opens a new SSL connection to the back-end, secured with the back-end’s own certificate. This way, the entire connection is still encrypted and we have end-to-end SSL.
The V1 gateways have a restriction here: for the back-end connection you can either rely on the platform-managed certificate or provide a custom one, but you cannot mix the two. We were using the latter because our API Management endpoints also run a custom cert for internal traffic (which in turn is a ‘restriction’ / requirement of API Management instances).
The issue we identified boiled down to the fact that Microsoft had updated their *.azurewebsites.net certificate, but this update didn’t make it to the Application Gateway instances. So when the back-end hosts started to serve the new certificate, the gateways started marking the hosts as unhealthy due to an invalid certificate: “BackendServerCertificateNotWhitelisted”. Whoops. It took us a while to find out what was happening, as the configuration hadn’t changed and the certificate itself seemed fine to us. Eventually, forcing an update of the gateway config somehow triggered a refresh of the certificates, which resolved the issue. Microsoft Support confirmed we were not the only ones to have this issue.
Gateway V2: the importance of the certificate chain
After we fixed the above issue, support indicated that we might want to consider moving to the V2 SKU of the Application Gateway. V2 does not force you to pick between a platform-managed certificate and custom certificates; it can mix both. It should also be more resilient to updates of the platform certificates, which I guess we will just have to believe.
And so we upgraded to V2, only to run into the next certificate-based issue.
sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Great, so now what? We noticed part of our Java-based landscape falling over with the new gateways in place. Certificate issues, even though we were using the exact same certificates as before. After another bit of investigation we found (using ssllabs.com) that the V2 gateway was returning only the primary (leaf) certificate, whereas the V1 gateway was returning the full chain. A Java client that does not already have the intermediate certificate in its trust store then cannot build a path to a trusted root, hence the PKIX error. I again got in touch with Microsoft support, who pointed me to https://docs.microsoft.com/en-us/azure/application-gateway/ssl-overview#end-to-end-ssl-with-the-v2-sku.
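If you would rather check this from the Java side than through ssllabs.com, something along these lines should do. This is a rough sketch, not what we actually ran: the class name ServedChainCheck and its methods are made up for illustration, and it deliberately installs an accept-all trust manager so that the handshake survives the very PKIX failure we are trying to diagnose. That trust manager is for inspection only and should never go near real traffic.

```java
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper: prints the certificates a server presents during the TLS handshake.
final class ServedChainCheck {

    // Accepts every certificate so the handshake completes even when validation would fail.
    // Inspection only: do NOT reuse this for real connections.
    private static final X509TrustManager ACCEPT_ALL = new X509TrustManager() {
        @Override public void checkClientTrusted(X509Certificate[] chain, String authType) { }
        @Override public void checkServerTrusted(X509Certificate[] chain, String authType) { }
        @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
    };

    // Connects to host:port and returns the peer certificates, leaf first.
    static Certificate[] fetchServedChain(String host, int port) throws Exception {
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, new TrustManager[] { ACCEPT_ALL }, null);
        try (SSLSocket socket = (SSLSocket) context.getSocketFactory().createSocket(host, port)) {
            socket.startHandshake();
            return socket.getSession().getPeerCertificates();
        }
    }

    // Turns the number of presented certificates into a short verdict.
    static String describe(int chainLength) {
        return chainLength <= 1
                ? "leaf only: intermediates are missing"
                : "leaf plus " + (chainLength - 1) + " further certificate(s)";
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java ServedChainCheck <host> [port]");
            return;
        }
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 443;
        Certificate[] chain = fetchServedChain(args[0], port);
        for (Certificate cert : chain) {
            if (cert instanceof X509Certificate) {
                X509Certificate x509 = (X509Certificate) cert;
                // Subject on the left, the issuer that signed it on the right.
                System.out.println(x509.getSubjectX500Principal() + "  <-  " + x509.getIssuerX500Principal());
            }
        }
        System.out.println(describe(chain.length));
    }
}
```

Run it against the gateway’s listener hostname. If only a single certificate comes back, the gateway is serving the leaf without its intermediates.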
The problem was that the PFX we were using did not contain the full certificate chain. For the V1 gateway this didn’t matter, but for V2 it does. So again, the fix was a lot simpler than finding the actual issue: we exported the certificate again using the “Include all certificates in the certification path if possible” option. This creates a PFX that includes the certificate chain.
After uploading these new certs to Key Vault and updating the gateway instance, everything started working again!
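To verify a PFX before uploading it, you can load it as a PKCS12 keystore and look at how many certificates are attached to the private key. Again a minimal sketch under my own assumptions: PfxChainCheck and its method names are hypothetical, and the file path and password are whatever you pass on the command line.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reports the certificate chain length stored with each private key in a PFX.
final class PfxChainCheck {

    // Loads a PFX / PKCS#12 file into a KeyStore.
    static KeyStore loadPfx(Path file, char[] password) throws Exception {
        KeyStore store = KeyStore.getInstance("PKCS12");
        try (InputStream in = Files.newInputStream(file)) {
            store.load(in, password);
        }
        return store;
    }

    // Maps each private-key alias to the number of certificates in its chain.
    // A value of 1 means the PFX holds only the leaf certificate.
    static Map<String, Integer> chainLengths(KeyStore store) throws Exception {
        Map<String, Integer> lengths = new LinkedHashMap<>();
        for (String alias : Collections.list(store.aliases())) {
            if (!store.isKeyEntry(alias)) {
                continue; // skip entries without a private key
            }
            Certificate[] chain = store.getCertificateChain(alias);
            lengths.put(alias, chain == null ? 0 : chain.length);
        }
        return lengths;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java PfxChainCheck <file.pfx> <password>");
            return;
        }
        KeyStore store = loadPfx(Paths.get(args[0]), args[1].toCharArray());
        for (Map.Entry<String, Integer> entry : chainLengths(store).entrySet()) {
            String verdict = entry.getValue() > 1 ? "chain included" : "leaf only, re-export with the full chain";
            System.out.println(entry.getKey() + ": " + entry.getValue() + " certificate(s), " + verdict);
        }
    }
}
```

A PFX exported without the chain option should show a single certificate per key, while one exported with it should show the leaf plus its intermediates (and possibly the root).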