Dynamics 365 Business Central is a highly reliable SaaS offering that runs on the Azure platform. It’s a cloud service with global reach and scale, running on one of the world’s largest hyper-scale infrastructures (Azure), with data centers in regions all over the world.
However, despite the reliability of the cloud service, moments where a problem affects the infrastructure cannot be totally excluded. Partners should be able to see outages listed in Partner Center at the following link: https://portal.office.com/Adminportal/Home#/servicehealth
but sometimes the page shows the following even when you are experiencing issues:
Yesterday, for example, a problem impacted many tenants in different regions. A recent service update impacted a portion of the infrastructure responsible for environment authentication, and many clusters were affected by authentication problems.
Microsoft reacted immediately, and after a few hours the problem was fixed and announced with this official communication:
We determined that the recent service update contained a code misconfiguration, which resulted in high resource utilization within the impacted portion of the infrastructure. We performed a rollback of the misconfigured code to the previous healthy configuration. We have confirmed via telemetry analysis and testing that users access to their environment has been restored.
I think that Microsoft’s outage communication process should be improved. The CSS Team does a great job, but I think partners would be much better served by automatic email notifications for service outages, perhaps sent to the notification email addresses connected to a specific tenant.
But if I want to be proactive and discover service outages on Dynamics 365 Business Central before someone alerts me, how can I perform availability tests on my customers’ Dynamics 365 Business Central tenants? As you can imagine, I can have hundreds of customers all around the globe, and I cannot monitor each tenant individually and manually.
To automatically monitor Dynamics 365 Business Central availability, I normally create two different tests:
- a basic availability test
- an advanced availability test
Basic availability test
As a first step in monitoring Dynamics 365 Business Central service availability, I suggest creating a PING TEST with Azure Application Insights.
This is not a simple PING. This type of test doesn’t use Internet Control Message Protocol (ICMP) to check your site’s availability; instead, it uses more advanced HTTP request functionality to validate whether an endpoint is responding, and it measures the performance associated with that response.
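To make the difference from an ICMP ping concrete, here is a minimal Python sketch of what an HTTP-based check boils down to: send a request, see whether an HTTP answer comes back, and time it. This is not how Application Insights implements its tests. The names http_probe and ProbeResult are hypothetical, and treating any 2xx/3xx status as "up" is my own assumption.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    ok: bool                   # True when the endpoint answered with a 2xx/3xx status
    status: Optional[int]      # HTTP status code, or None when no HTTP answer arrived
    elapsed_ms: float          # wall-clock time spent on the request
    error: Optional[str] = None


def http_probe(url: str, timeout: float = 30.0,
               opener: Callable = urllib.request.urlopen) -> ProbeResult:
    """Send one HTTP GET and report whether the endpoint answered, and how fast.

    `opener` is injectable so the function can be exercised without a network.
    """
    started = time.perf_counter()
    status, error = None, None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status code (4xx/5xx).
        status, error = exc.code, str(exc)
    except (urllib.error.URLError, OSError) as exc:
        # No HTTP answer at all: DNS failure, refused connection, timeout, ...
        error = str(exc)
    elapsed_ms = (time.perf_counter() - started) * 1000.0
    ok = status is not None and 200 <= status < 400
    return ProbeResult(url, ok, status, elapsed_ms, error)
```

Calling http_probe with the tenant URL you would enter in the availability test returns one result per call. The Application Insights test adds what this sketch lacks: scheduling, multiple locations, history and alerting.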
From an Azure Application Insights instance, select the Availability option and click on Add Classic Test:
Then in the Create test window, set your test as in the following image:
Here you need to provide the tenant URL, the test frequency (I suggest not too frequent) and the locations the test must run from. It’s recommended to perform this test from a minimum of five locations. This approach helps prevent false alarms that can result from transient issues with a specific location.
When the test is created, click on Open Rules (Alerts):
Click on the rule name and then define your condition:
Here for example I’ve defined that I want to be notified when 1 request fails.
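The condition is essentially a threshold. As a rough illustration (not Azure Monitor's implementation), assume the rule counts how many test locations reported a failure in the last run. The function names and the sample results below are hypothetical, and the location names are only examples.

```python
from typing import Mapping


def failed_locations(results: Mapping[str, bool]) -> list[str]:
    """Return the names of the test locations whose last request failed."""
    return sorted(name for name, ok in results.items() if not ok)


def should_alert(results: Mapping[str, bool], threshold: int = 1) -> bool:
    """Alert when at least `threshold` locations report a failure."""
    return len(failed_locations(results)) >= threshold


# Hypothetical outcome of one test run from five locations (True = success).
latest = {
    "West Europe": True,
    "North Europe": False,
    "UK South": True,
    "East US": True,
    "Southeast Asia": True,
}

print(failed_locations(latest))            # ['North Europe']
print(should_alert(latest, threshold=1))   # True: a single failure is enough
print(should_alert(latest, threshold=2))   # False: one failing location is tolerated
```

A threshold of 1 gives the earliest possible notification, at the price of also being notified about transient problems at a single location. A higher threshold trades some speed for fewer false alarms.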
When the condition is defined, move to the Actions section and add an action group by clicking on Add action groups:
Here you can define a group for receiving notifications from the test. When finished, save the alert rule:
Now the availability test is up and running and you can monitor the health of your Business Central endpoint:
If you have failures, you can also inspect the details of each failure just by clicking on it.
NOTE (off topic): as you can see, you can also monitor the response time. My demo tenant here is in the West Europe region. I think it’s useful for understanding the importance of placing Azure services in the same Azure region as your tenant, isn’t it? 🙂
What’s the problem here with this type of test?
This is only an advanced PING test, so it only gives you a success/failure response depending on whether your Dynamics 365 Business Central tenant is up or not. This, for example, is the result of this availability test over the last 7 days on one of my customers’ real production environments:
As you can see, the tenant was always up and running (no failures).
But yesterday, there was a period of time (less than 1 hour) where they had problems with Dynamics 365 Business Central authentication. Why is no failure signaled here?
Because this test only checks for the availability of the endpoint, not for “functional aspects” inside that endpoint.
Advanced availability test
This is the main reason why I normally suggest also running a more advanced test (or a “level 2” availability test) in order to monitor the service health of your customers’ Business Central tenants and to be immediately notified in case of failures. The goal of this “level 2” testing is to discover whether there are internal issues with the Dynamics 365 Business Central service, mainly related to authentication and/or OData services.
How can I create an automatic availability test that will discover if my Dynamics 365 Business Central tenant has internal problems (like yesterday’s case, for example) and immediately notify me?
The answer is: create a Timer trigger Azure Function (so one executed on a schedule) and then inside that Azure Function perform steps like:
- Authentication (this confirms that the Business Central tenant is up and running and the authentication service is working correctly)
- API calls to /companies in order to retrieve the companies and (optional) to an entity inside a company (for example /customers)
If the calls fail for some unknown reason, the function will send an alert to your administrators (or more advanced alerts as you need).
The Timer Trigger Azure Function can be defined as follows (pseudo code here):
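What follows is a minimal Python sketch of the logic such a function could run on each tick. It is not the original pseudo code. It assumes service-to-service authentication (OAuth 2.0 client credentials) with a Microsoft Entra app that has already been granted access in Business Central, and it uses only the standard library so it can run anywhere. The names check_tenant, run_checks, http_json and notify are hypothetical. The timer trigger wiring itself is left out: the function's entry point would simply call run_checks.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

LOGIN_URL = "https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
BC_SCOPE = "https://api.businesscentral.dynamics.com/.default"
BC_API = ("https://api.businesscentral.dynamics.com/v2.0/"
          "{tenant_id}/{environment}/api/v2.0")


def http_json(url, data=None, headers=None, timeout=30.0):
    """POST `data` (form-encoded) when given, otherwise GET; return parsed JSON."""
    body = urllib.parse.urlencode(data).encode() if data else None
    request = urllib.request.Request(url, data=body, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def check_tenant(tenant_id, environment, client_id, client_secret,
                 fetch=http_json) -> Optional[str]:
    """Return None when the tenant looks healthy, else a short failure description."""
    step = "authentication"
    try:
        # Step 1: request an access token for the Business Central API.
        token = fetch(
            LOGIN_URL.format(tenant_id=tenant_id),
            data={
                "grant_type": "client_credentials",
                "client_id": client_id,
                "client_secret": client_secret,
                "scope": BC_SCOPE,
            },
        )["access_token"]
        # Step 2: call /companies, which makes Business Central validate the token.
        step = "GET /companies"
        base = BC_API.format(tenant_id=tenant_id, environment=environment)
        headers = {"Authorization": f"Bearer {token}"}
        companies = fetch(f"{base}/companies", headers=headers)["value"]
        # Step 3 (optional): read one entity inside the first company.
        if companies:
            step = "GET /customers"
            company_id = companies[0]["id"]
            fetch(f"{base}/companies({company_id})/customers?$top=1",
                  headers=headers)
    except (urllib.error.URLError, OSError, KeyError, ValueError) as exc:
        return f"{step} failed: {exc!r}"
    return None


def run_checks(tenants, notify, fetch=http_json) -> int:
    """Check every tenant, call `notify` for each failure, return the failure count.

    `tenants` is a list of dicts with the keyword arguments of check_tenant;
    `notify` is any callable taking a message (email, Teams webhook, ...).
    """
    failures = 0
    for tenant in tenants:
        problem = check_tenant(fetch=fetch, **tenant)
        if problem:
            failures += 1
            notify(f"Business Central tenant {tenant['tenant_id']} "
                   f"({tenant['environment']}): {problem}")
    return failures
```

Note that the token request goes to Microsoft Entra ID, while the /companies call is what exercises the Business Central service itself, so a failure report names the step that broke. Keeping the secrets out of the code (for example in application settings or a key vault) is left to you.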
Why is this needed? Because in case of problems like the one that occurred yesterday, the first alert is not triggered (the Business Central service responds to pings), but if you run the “level 2” test, you receive an exception on authentication and therefore a warning of an internal service outage.
What happened to some of my production tenants yesterday? This:
In a window of about 1 hour in the afternoon, this “level 2” alert started sending me notifications for a possible service outage. Without this, I would never have been able to discover the problem.
If you want to be proactive and react to possible (and honestly very rare) outages before an official Microsoft confirmation (or customers flooding your inbox), this can be a possible way.