Log Analytics is a pretty cool first-class citizen of Microsoft's OMS (Operations Management Suite). It was designed to help customers with monitoring (log, performance, security) and alerting. Recently I had the opportunity to perform a PoC with it in a production environment for a customer.
Monitoring has become a complex landscape, with several different options in the marketplace depending on the actor/viewpoint in play (i.e., who is looking for the solution) and the components in scope (i.e., cloud, hybrid, etc.). This particular customer was looking for a broad set of monitoring capabilities that extend into the application, cover the services that make up the application, and focus on the security elements/risks in the application. One key factor here is that every component of this solution is deployed in Azure.
From a sensing and measurement perspective, Log Analytics met the requirements. Using its tie-ins to Azure Monitor (https://docs.microsoft.com/en-us/azure/monitoring-and-diagnostics/monitoring-overview), Log Analytics can pull in key data from the Azure fabric, including metrics on the various services in use as well as activity logs. Using agents, Log Analytics was able to collect various logs off of the IaaS instances. Further, Log Analytics has custom log capabilities that allow us to ingest and parse custom logs.
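As a sketch of what that looks like on the query side: a custom log ingested this way surfaces as its own table with a _CL suffix, and the unparsed line is available in the standard RawData field (the table name below is hypothetical):

```
MyAppLog_CL
| where RawData contains "ERROR"
| summarize ErrorCount = count() by bin(TimeGenerated, 1h)
```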
The other core requirement was around the ability to visualize the data. Log Analytics has a rich feature set for this, along with a robust query language. Further, the query language allows for correlation of data across multiple data sets. In short, it definitely met the base requirements the client was looking for.
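To give a feel for that correlation, here's a sketch of a cross-data-set query (using the standard Event and Perf tables; the 5-minute bin is an arbitrary choice) that lines Windows error events up against CPU samples from the same machine:

```
Event
| where EventLevelName == "Error"
| summarize Errors = count() by Computer, bin(TimeGenerated, 5m)
| join kind=inner (
    Perf
    | where ObjectName == "Processor" and CounterName == "% Processor Time"
    | summarize AvgCPU = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
) on Computer, TimeGenerated
| order by Errors desc
```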
The initial PoC dashboard was actually quite easy to build out. Using a combination of PowerShell and the portal (as I am monitoring both V1 and V2 resources), it was easy to onboard the VMs. I quickly added a set of event logs to monitor and added some performance counters.
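For the V2 (ARM) VMs, the onboarding amounts to pushing the Microsoft Monitoring Agent extension onto each machine. A minimal sketch, assuming the AzureRM module and placeholder resource names (the workspace ID and key come from the workspace's Advanced Settings blade):

```powershell
# Attach the Microsoft Monitoring Agent extension to an ARM VM.
# "my-rg" / "my-vm" are placeholders; $workspaceId / $workspaceKey
# are the Log Analytics workspace ID and primary key.
Set-AzureRmVMExtension `
    -ResourceGroupName "my-rg" `
    -VMName "my-vm" `
    -Name "MicrosoftMonitoringAgent" `
    -Publisher "Microsoft.EnterpriseCloud.Monitoring" `
    -ExtensionType "MicrosoftMonitoringAgent" `
    -TypeHandlerVersion "1.0" `
    -Settings @{ workspaceId = $workspaceId } `
    -ProtectedSettings @{ workspaceKey = $workspaceKey }
```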
My main goals were to understand the standard disk/memory/CPU usage, but also to be able to see CPU performance by process. These are all counters that can be found/monitored via perfmon and, by extension, Log Analytics. The last piece was to start adding in diagnostic logs from other Azure fabric systems. I was quickly able to ingest logs from the subscription itself (i.e., actions on resources), network logs, App Service logs, Azure SQL logs, etc. There are generally two ways to do this: either via the Azure diagnostics setup, or via "solutions" in the Log Analytics dashboard.
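As a sketch of the first of those sources: once the subscription's activity log is connected, resource actions land in the standard AzureActivity table and can be summarized directly:

```
AzureActivity
| where TimeGenerated > ago(7d)
| summarize Actions = count() by OperationName, Caller
| top 20 by Actions desc
```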
Here is a snap of
the solutions currently installed.
One of the most interesting ones is Wire Data 2.0. I've used this solution to gain a ton of insight into the inbound and outbound traffic on the VMs being monitored. You get a lot of detail, and the Wire Data dashboard provides some good views into your network traffic.
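Beyond the dashboard, the solution populates a WireData table you can query directly. A sketch that surfaces the remote endpoints the VMs talk to most (field names per the Wire Data 2.0 schema):

```
WireData
| where Direction == "Outbound"
| summarize TotalMB = sum(TotalBytes) / 1048576.0 by Computer, RemoteIP
| top 20 by TotalMB desc
```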
From an event log
perspective, I've kept it light during the PoC, with the following focus.
Many of the systems in this solution had RDP enabled to the internet (by mistake). I was able to use Log Analytics to quickly identify that our systems were under brute-force attack and lock down those endpoints. It showed up quite easily in some of the blades and built-in views.
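The raw signal is also easy to query directly, assuming the Security and Audit solution is collecting security events. A sketch (4625 is the Windows failed-logon event; logon type 10 is RemoteInteractive, i.e. RDP; the threshold is arbitrary):

```
SecurityEvent
| where EventID == 4625 and LogonType == 10
| summarize FailedLogons = count() by Computer, IpAddress, TargetAccount
| where FailedLogons > 100
| order by FailedLogons desc
```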
From a performance counter perspective, I decided to stick with the basics as well. The key one here is processor time by process. This was essentially to start understanding the types of processes on the host systems and determine what they were doing.
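A sketch of the underlying query, using the standard Perf table (the instance filter drops the aggregate counters):

```
Perf
| where ObjectName == "Process" and CounterName == "% Processor Time"
| where InstanceName !in ("_Total", "Idle")
| summarize AvgCPU = avg(CounterValue) by Computer, InstanceName
| top 10 by AvgCPU desc
```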
While still not the full picture, I was also able to build alerts off of
this. For example, I could scan all
machines in a particular machine group and count the number of processes of a
certain type running. If that number
ever dipped below a threshold, I could alert on it.
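A sketch of that kind of alert query ("myservice" is a hypothetical process name; an alert rule would fire when the result drops below the threshold):

```
Perf
| where TimeGenerated > ago(10m)
| where ObjectName == "Process" and CounterName == "% Processor Time"
| where InstanceName startswith "myservice"   // hypothetical process name
| summarize RunningCopies = dcount(strcat(Computer, "/", InstanceName))
```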
Once you get the basics set up in Log Analytics, you really start to wonder what you can do next with the product. Here are some ideas that I want to explore:
- Adding more of the solutions into play, specifically the ones around security.
- Trying to build correlation views that tie events across the system to performance metrics.
- Ingesting the Application Insights telemetry data and determining dashboarding ideas.
- Starting to ingest custom logs for the various services running in the environment.
Lucky for me, this particular client is pretty on board with growing OMS as the monitoring/alerting tool for this application. I'm hoping to get a chance to build out more on the platform.