Sunday, April 2, 2017

I built a cool log analytics dashboard for Azure, now what?

Log analytics is a pretty cool first-class citizen of the OMS suite from Microsoft.  Log analytics was designed to help customers with monitoring (log, performance, security) and alerting.  Recently I had the opportunity to perform a PoC in a production environment for a customer.

Monitoring has become a complex landscape, with several different options in the marketplace depending on the actor/viewpoint being used (i.e. who is looking for the solution) and the components in scope (i.e. cloud, hybrid, etc).  This particular customer was looking for a broad set of monitoring capabilities that extend into the application, cover the services that make up the application, and focus on security elements/risks in the application.  One key factor here is that every component of this solution is deployed in Azure.

From a sensing and measurement perspective, log analytics met the requirements.  Using its tie-ins to Azure monitor (https://docs.microsoft.com/en-us/azure/monitoring-and-diagnostics/monitoring-overview), log analytics can pull in key data from the Azure fabric, including metrics on the various services in use as well as activity logs.  Using agents, log analytics was able to collect various logs from IaaS instances.  Further, log analytics has custom log capabilities that allow us to ingest and parse custom logs.

The other core requirement was around the ability to visualize the data.  Log analytics has a rich feature set for this, and a robust query language.  Further to this, the query language allows for correlation of data across multiple data sets.

In short, it definitely met the base requirements the client was looking for.

The initial PoC dashboard was actually quite easy to build out.  Using a combination of PowerShell and the portal (as I am monitoring both V1 and V2 resources), it was easy to onboard the VMs.  I quickly added a set of event logs to monitor and added some performance counters.  My main goals were to understand the standard disk/memory/cpu usage, but also to be able to see cpu performance by process.  These are all counters that can be found/monitored via perfmon and, by extension, log analytics.  The last piece was to start adding in diagnostic logs from other Azure fabric systems.  I was quickly able to ingest logs from the subscription itself (i.e. actions on resources), network logs, app service logs, Azure SQL logs, etc.  There are generally two ways to do this: either via Azure diagnostics setup, or via "solutions" in the log analytics dashboard.

Here is a snap of the solutions currently installed.
 
One of the most interesting ones is wire data 2.0.  I've used this solution to grab a ton of insight into the inbound and outbound traffic on the VMs being monitored.  You get a ton of detail and the wiredata dashboard provides some good views into your network traffic. 

From an event log perspective, I've kept it light during the PoC, with the following focus.

  
 
Many of the systems in this solution had RDP enabled to the internet (by mistake).  I was able to use log analytics to quickly identify that our systems were under brute force attacks and lock down those endpoints.  It showed up quite easily in some of the blades and built-in views.
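To make the brute force detection idea concrete, here is a minimal Python sketch of the underlying logic: count failed logons per machine and source address, and flag anything at or above a threshold.  This is not the log analytics query or view I used; the record layout, field names, sample data and threshold are all assumptions made for illustration.  The only real-world detail is that Windows records a failed logon as security event ID 4625.

```python
from collections import Counter

# Windows security event ID for "An account failed to log on".
FAILED_LOGON_EVENT_ID = 4625

# Hypothetical sample records. The field names are illustrative only and
# are not the actual log analytics schema.
events = (
    [{"computer": "vm-web-01", "event_id": 4625, "source_ip": "203.0.113.7"}] * 3
    + [{"computer": "vm-web-01", "event_id": 4624, "source_ip": "198.51.100.4"}]
    + [{"computer": "vm-sql-01", "event_id": 4625, "source_ip": "203.0.113.7"}]
)


def suspected_brute_force(records, threshold):
    """Return {(computer, source_ip): failure_count} for every pair whose
    failed-logon count is at or above the (assumed) threshold."""
    failures = Counter(
        (r["computer"], r["source_ip"])
        for r in records
        if r["event_id"] == FAILED_LOGON_EVENT_ID
    )
    return {key: count for key, count in failures.items() if count >= threshold}


suspects = suspected_brute_force(events, threshold=3)
for (computer, source_ip), count in suspects.items():
    print(f"{computer}: {count} failed logons from {source_ip}")
```

In a real environment the threshold would be much higher and scoped to a time window; the sketch only shows the grouping and counting step.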

From a performance counter perspective, I decided to stick with the basics as well.

 
The one key one here is the processor time by process.  This was essentially to start understanding the types of processes on the host systems and determine what they were doing.  While still not the full picture, I was also able to build alerts off of this.  For example, I could scan all machines in a particular machine group and count the number of processes of a certain type running.  If that number ever dipped below a threshold, I could alert on it.
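The alert described above boils down to a simple rule, sketched here in Python rather than in the log analytics query language.  The sample data, field names, machine names and process name are hypothetical.  The one assumption borrowed from perfmon is that extra instances of the same process show up with a "#n" suffix (for example "w3wp#1").

```python
# Hypothetical per-process "% Processor Time" samples. Field names are
# illustrative only and are not the actual log analytics schema.
samples = [
    {"computer": "vm-app-01", "instance": "w3wp", "value": 12.5},
    {"computer": "vm-app-01", "instance": "w3wp", "value": 11.0},  # later sample, same process
    {"computer": "vm-app-01", "instance": "w3wp#1", "value": 3.0},
    {"computer": "vm-app-01", "instance": "sqlservr", "value": 40.0},
    {"computer": "vm-app-02", "instance": "w3wp", "value": 7.5},
    {"computer": "vm-db-01", "instance": "w3wp", "value": 1.0},  # outside the group
]


def running_process_count(records, machine_group, process_name):
    """Count distinct process instances named process_name across the group."""
    instances = set()
    for record in records:
        if record["computer"] not in machine_group:
            continue
        # Strip the perfmon "#n" suffix so "w3wp#1" matches "w3wp".
        base_name = record["instance"].split("#", 1)[0]
        if base_name.lower() == process_name.lower():
            instances.add((record["computer"], record["instance"]))
    return len(instances)


def should_alert(records, machine_group, process_name, minimum):
    """Alert when fewer than `minimum` instances are running in the group."""
    return running_process_count(records, machine_group, process_name) < minimum


app_group = {"vm-app-01", "vm-app-02"}  # hypothetical machine group
count = running_process_count(samples, app_group, "w3wp")
print(f"w3wp instances: {count}, alert: {should_alert(samples, app_group, 'w3wp', 4)}")
```

Counting distinct (machine, instance) pairs rather than raw samples keeps repeated samples of the same process from inflating the count.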

Once you get the basics set up in log analytics, you really start to wonder what you can do next with the product.  Here are some ideas that I want to explore:

  • Adding more of the solutions into play, specifically the ones around security
  • Trying to build correlation views that tie in events across the system to performance metrics
  • Ingesting the application insights telemetry data and determining dashboarding ideas
  • Starting to ingest custom logs for various services running in the environment

Luckily for me, this particular client is pretty on board with growing OMS as the monitoring/alerting tool for this application.  I'm hoping to get a chance to build out more on the platform.