Sunday, April 30, 2017

The case for multiple Azure Automation Accounts

Most of my recent work has, in some way, involved the Azure Automation service.  It is a versatile service that can be used for many IT automation needs.  As I use Azure Automation in more complicated scenarios, I have run into some security considerations that have me wondering whether the right approach is to have multiple Automation accounts in a given environment.  The goal of this post is to discuss this a bit further.

In Azure Automation, there are three "integration" points where security might come up as an issue.  The first is governing who has access to the particular Automation account.  Microsoft has defined a set of user roles that help control access to the account itself.  Of course, you can always use Azure custom roles to further tune the granularity.  The second is the credential/secret access the Automation account has been set up with.  These elements are configured within an account, and every runbook in the account could, in theory, access those resources.  The third is the use of hybrid workers.  Workers are configured at the account level and can then be selected as the run destination for a given runbook.

Take the following scenario as an example.

I would like to execute startup/shutdown scripts across multiple environments (say dev/test/prod) and have them run by users as required. 

Working through the integration points defined above, the first to consider is user security.  At some point, someone has to be the owner of the Automation account, and that owner effectively has access to all resources within it.  The same holds even if RBAC is used out of the box: users with the Automation Operator role can still run all runbooks in the account, regardless of which environment a runbook targets or whether they actually have authority over it.

Moving to the second point, run-as accounts are configured at the Automation account level.  Run-as accounts have no granular permissions and can therefore be accessed by any runbook.  While you can control who can author a runbook, you can't control which run-as accounts the runbook uses, and you can't enforce user security on this.  In an ideal state, a runbook could accept a run-as account as a parameter and then ensure that the user actually has access to use that account.  Permissions in Azure Automation are not that granular.

The third aspect, hybrid workers, generally only comes into play when you are executing activities that require network access to the operating system itself.  When you set up a hybrid worker, you essentially get another prompt when running the runbook, asking "where" you would like to execute it.  Once again, there is no user-level security here.  Everyone who can run a runbook can also select the hybrid worker group to run on.  Further, you cannot actually configure a runbook to ONLY run on a particular destination (unless you use a webhook or schedule).  So even users with only the Automation Operator role can still run runbooks where they are not supposed to.

So what is the solution? When I first set up build/release management in VSTS, I was amazed at the number of permissions you need to grant to make something even somewhat usable.  You could run into the scenario where you had access to run a build on a project but no access to any agents to actually execute the build on.  Credentials could be stored as part of a release line but not used by any other release lines, and permissions to view them could be granted only to the people who needed them.  To be honest, it seemed tedious, but in retrospect, that is the level of granularity required to truly make the system multi-tenant.

From an Azure Automation perspective, I feel the easiest solution is to create multiple Automation accounts, one per environment being targeted, and then grant permissions on each account to the people who need it.  Keeping runbooks in sync between environments is frustrating, but the integration coming along with GitHub and VSTS will make that easier.  While I still feel a VSTS approach is better in the long run, multiple Automation accounts solve the problem at an environment level quite nicely.
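As a rough sketch of the per-environment approach (the subscription ID, resource group names, account names, and user below are placeholders, not anything from a real deployment):

```powershell
# Sketch only: subscription ID, resource group/account names, and users are assumptions.
$subscriptionId = "00000000-0000-0000-0000-000000000000"
$environments = "dev","test","prod"

foreach ($env in $environments) {
    # One Automation account per environment keeps runbooks, run-as
    # credentials, and hybrid workers scoped to that environment.
    New-AzureRmAutomationAccount -ResourceGroupName "rg-automation-$env" `
        -Name "automation-$env" -Location "East US"

    # Grant the operator role only on the matching environment's account.
    $scope = "/subscriptions/$subscriptionId/resourceGroups/rg-automation-$env" +
             "/providers/Microsoft.Automation/automationAccounts/automation-$env"
    New-AzureRmRoleAssignment -SignInName "ops-$env@contoso.com" `
        -RoleDefinitionName "Automation Operator" -Scope $scope
}
```

With this layout, the built-in Automation Operator role is safe to hand out per environment, because the account boundary itself provides the isolation that the role lacks.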

Monday, April 24, 2017

Azure Cost Optimizations: Azure VMs

Compute time is one of the most expensive services in the Azure stack.  The goal of this post is to discuss some of the ways you can optimize the amount of compute you use.  Please note that the compute profile of your workloads will differ, so your mileage with any of the suggestions below will vary.

1) Monitor monitor monitor

Azure has recently released a few new monitoring features.  Most of these capabilities are merging into what is called Azure Monitor.  What I particularly like about Azure Monitor is the ability to see both host-level and guest-level metrics.  As with all performance monitoring of this type, you need a good understanding of how the data is sampled on the system.

For my VMs, I've made a few changes to the sample rate.  This was particularly important during load testing I was conducting at my client.

The key point here is that you can start to use these monitoring tools to help determine more appropriate VM sizes for your workloads.  Integrating tools such as OMS can greatly help as you can start to trend performance over time.

One tool that I have run into but have not had the chance to use is the Azure Virtual Machine Optimization Assessment.

2) Switch to batched/scheduled workloads

As you pay for compute only when your virtual machine is running, you can start to play around with batched and/or scheduled workloads.  The essence here is to find an orchestration tool you can use to manage when your virtual machines are running.  Taking this one step further, ideally the orchestration engine also runs the virtual machine only for the required amount of time.

Two services to look at here are Azure Batch and Azure virtual machine scale sets (VMSS).  The latter is more about autoscaling and trying to achieve performance curves that match more closely to the demand curves.

3)  Shutdown/Startup VMs

One technique that I am particularly fond of is the automatic startup and shutdown of VMs when they are required.  There are several ways to accomplish this in Azure, including Azure DevTest Labs and Azure Automation.  The former provides a VM extension that can be used to auto-shutdown the machines.  For the latter, there are even built-in runbooks for using tags to schedule startup and shutdown.

There are a few pros/cons to the above approaches which I can cover in another post, but suffice to say, one thing to consider here is the order in which startup/shutdown occur.  Many environments, even dev/test, can be complex and have dependencies.  Generally for this reason, I tend towards an Azure Automation approach to starting up and shutting down VMs.
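A minimal sketch of what the ordering logic in such a runbook might look like (the "ShutdownOrder" tag and its convention are my own assumption, not a built-in):

```powershell
# Find VMs carrying a hypothetical "ShutdownOrder" tag.
$vms = Get-AzureRmVM | Where-Object { $_.Tags -and $_.Tags.ContainsKey("ShutdownOrder") }

# By convention here, higher order numbers shut down first (dependent tiers
# before the databases they rely on, for example).
foreach ($vm in ($vms | Sort-Object { [int]$_.Tags["ShutdownOrder"] } -Descending)) {
    Stop-AzureRmVM -ResourceGroupName $vm.ResourceGroupName -Name $vm.Name -Force
}
```

A matching startup runbook would simply sort ascending.  Encoding the dependency order in tags keeps the runbook generic across environments.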

There are a host of methods and processes for tuning VMs (not just in Azure).  You can use most of those techniques with Azure; you just have to change how you access the underlying metrics you rely on.  And as always, starting and stopping VMs only when needed can save quite a bit of money in the long run.

Thursday, April 20, 2017

Creating a new Managed Disk from a Managed Disk Snapshot

It turns out that the operation is quite easy.  New-AzureRmDiskConfig has an option to reference a source resource ID.  This ID can be found in the portal (or via PowerShell) when you click on a managed disk.  Simply input it when you create a new disk (as shown below) and it will create a new disk based off of your snapshot.

$diskConfig = New-AzureRmDiskConfig -CreateOption Copy -SourceResourceId "id from portal" -Location westus -DiskSizeGB 64 -AccountType StandardLRS
$disk = New-AzureRmDisk -DiskName "name" -Disk $diskConfig -ResourceGroupName "rgname"
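If you would rather not copy the ID out of the portal, a quick sketch of looking it up with PowerShell instead (the snapshot and resource group names are placeholders):

```powershell
# Look up the snapshot's resource ID directly; names here are placeholders.
$snapshot = Get-AzureRmSnapshot -ResourceGroupName "rgname" -SnapshotName "my-snapshot"

# Feed the ID into the same disk config as above.
$diskConfig = New-AzureRmDiskConfig -CreateOption Copy -SourceResourceId $snapshot.Id `
    -Location westus -DiskSizeGB 64 -AccountType StandardLRS
```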

Official documentation can be found here.

Tuesday, April 18, 2017

Checking for Resource Group Locks in Azure

When I deploy production architectures in Azure, I want to put protections into the deployment that prevent accidental modification/deletion of business-critical resources.  One such tool in Azure is the concept of resource locks.  At a bare minimum, I want to ensure that resource groups containing my production deployments have locks on them to prevent deletion.  The goal of this post is to show how to use PowerShell to double-check that all required locks are in place.

param(
    [Parameter(Mandatory=$true,HelpMessage="The search term for all resource groups to check")]
    [string]$search
)

$resourceGroups = Get-AzureRmResourceGroup | Where-Object {$_.ResourceGroupName -like "*$search*"}

foreach ($resourceGroup in $resourceGroups){
    $lock = Get-AzureRmResourceLock -ResourceGroupName $resourceGroup.ResourceGroupName
    if ($lock -eq $null){
        Write-Host "$($resourceGroup.ResourceGroupName) is missing a lock"
    }
}

The script above is really quite simple.  When I do production deployments, I usually place the word "prod" somewhere in the resource group name.  The script above takes in a search parameter and then tries to locate all resource groups that contain that search parameter.  The foreach loop essentially checks for the existence of a lock.  If no lock is found, a message is printed. 

This script could be expanded in several ways to be more robust.  For example, it could auto correct the condition by placing the required lock.  One could then add this as part of an Azure Automation job that would ensure that production resource groups are not left unprotected for long.  Further, it could actually look at the lock to determine if it is delete or read-only and report accordingly.
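As a sketch of the auto-correct idea, the `if` branch above could apply the missing lock instead of just reporting it (the lock name here is a placeholder of my choosing):

```powershell
# Apply a delete lock to an unprotected resource group; the lock name is a placeholder.
New-AzureRmResourceLock -LockLevel CanNotDelete -LockName "prod-do-not-delete" `
    -ResourceGroupName $resourceGroup.ResourceGroupName -Force
```

The -Force switch suppresses the confirmation prompt, which matters if this runs unattended from an Azure Automation job.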

In any event, the above script suited my audit purposes just fine.  Enjoy!

Wednesday, April 12, 2017

Azure Cost Optimizations: Azure SQL

Azure SQL is a fantastic service that takes much of the complication of managing/implementing SQL away from users.  As with all Azure services, you pay for what you use, and due consideration needs to be given to ensuring you are not paying too much for your needs.  As with most of the PaaS services, Azure SQL scales on three factors: DTUs (read: performance), disk size, and features.  The goal of this post is to chat a little bit about potential Azure SQL cost optimizations.

1) Tune your SQL for performance

One great newer feature of Azure SQL is the query performance insights and automatic tuning capabilities that have been baked into the service.  While I would not say these features replace a DBA, you can certainly gain a ton of insight into what is going on.  I have personally used these tools to look at long-running queries and point out bottlenecks to developers, who could then work to optimize their code.

Microsoft's documentation on query performance insight and automatic tuning covers how to use these features in more detail.

Tuning your database can help with your DTU requirements.

2) Consider Pooling

Much like app service plans, there are two core options for deploying Azure SQL.  The first is a single database; think of this as a single web app on a dedicated app service plan.  In this mode, the database is guaranteed all the resources assigned to it.  The second mode is an elastic pool.  Here, you provision a group of resources and then deploy multiple databases to the pool.  The pool's resources are shared by all databases, allowing for cost optimizations at the expense of the noisy-neighbour problems that can arise.
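A quick sketch of setting a pool up and moving an existing database into it (server, pool, database, and DTU figures are all placeholders):

```powershell
# Create an elastic pool on an existing server; names and sizing are placeholders.
New-AzureRmSqlElasticPool -ResourceGroupName "rgname" -ServerName "sqlserver01" `
    -ElasticPoolName "shared-pool" -Edition "Standard" -Dtu 100

# Move an existing single database into the pool.
Set-AzureRmSqlDatabase -ResourceGroupName "rgname" -ServerName "sqlserver01" `
    -DatabaseName "appdb" -ElasticPoolName "shared-pool"
```

Pooling pays off when your databases peak at different times, since they can share the same DTU headroom instead of each reserving it individually.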

3) Scale down your databases during non-peak usage

Azure SQL is a service, and therefore it is always on.  However, it is also billed by the hour.  You can use PowerShell cmdlets to scale your databases to your requirements.  This works great for non-linear workloads, dev/test situations, and also for load testing purposes.

The command to use for v2 resources is Set-AzureRmSqlDatabase.  When I implemented a scaling script, I used the Find-AzureRmResource command to find the database by name and then set the appropriate scale.

$findResult = Find-AzureRmResource -ResourceNameContains $sql -ResourceType "Microsoft.Sql/servers/databases"
if ($findResult -ne $null){
    # The resource name comes back as "server/database"; extract the server portion.
    $serverName = $findResult.Name.Substring(0,$findResult.Name.IndexOf('/'))
    Set-AzureRmSqlDatabase -DatabaseName $sql -Edition $edition -RequestedServiceObjectiveName $requestedServiceObjectiveName -ServerName $serverName -ResourceGroupName $findResult.ResourceGroupName
}

Azure SQL has come a long way since it was first implemented.  New features such as the query insights can really help customers tune their databases and keep costs down.  As always, scaling services when required will generally net the biggest bang for your buck.

Tuesday, April 11, 2017

Azure Cost Optimizations: Azure Storage

Depending on the makeup of your Azure consumption, Azure Storage can sometimes be a small percentage of your total Azure costs.  That being said, it is really easy to waste money in Azure Storage if you do not manage it properly.  The purpose of this post is to chat about a couple of things to look for to help optimize your Azure Storage costs.

1) Look for un-leased VHD files

When you used to delete VMs from the classic portal, Azure would kindly remind you (and make it easy) to delete the associated VHDs from storage.  This is not so much the case with the newer models in play, so you need to review your storage accounts and remove un-leased VHD files.  As an aside, VHD files are leased for an infinite period as long as the VM they are attached to exists (regardless of the "state" the VM is in).  Un-leased VHD files are a good indication of snapshots of existing VMs, or of previously deleted VMs where the files were left in place.

You can use the Get-AzureStorageBlob command to get a list of blobs that match your search criteria.  Once you have that, the ICloudBlob.Properties.LeaseState property can be used to determine the lease state; it reports "Available" if the file is not leased.
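Putting that together, a sketch of listing candidate VHDs (the resource group, storage account, and container names are placeholders):

```powershell
# Storage account, container, and resource group names are placeholders.
$ctx = (Get-AzureRmStorageAccount -ResourceGroupName "rgname" -Name "mystorageacct").Context

# List .vhd blobs whose lease state is "Available" (i.e. not attached to anything).
Get-AzureStorageBlob -Container "vhds" -Context $ctx |
    Where-Object { $_.Name -like "*.vhd" -and $_.ICloudBlob.Properties.LeaseState -eq "Available" } |
    Select-Object Name, Length, LastModified
```

Review the list before deleting anything; an "Available" lease state flags a candidate, not proof the VHD is unwanted.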

2) Review the type of storage accounts deployed

It can be pretty easy to accidentally deploy the wrong type of storage for the need.  You can check the Azure Storage pricing list to see the differences between the blob types.  One of the most prevalent mistakes is deploying GRS or higher for VM VHD storage.

You can obtain a list of storage accounts by using Get-AzureRmStorageAccount or Get-AzureStorageAccount, depending on whether you are using ARM or ASM.

In ARM, you are looking for the Sku.Name property to be not-equal to StandardLRS.  In ASM, you are looking for the AccountType property to be not-equal to Standard_LRS.
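For the ARM case, a one-liner sketch of surfacing the non-LRS accounts worth reviewing:

```powershell
# Flag ARM storage accounts not on locally-redundant storage for review.
Get-AzureRmStorageAccount |
    Where-Object { $_.Sku.Name -ne "StandardLRS" } |
    Select-Object StorageAccountName, ResourceGroupName, @{n="Sku";e={$_.Sku.Name}}
```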

3) Review the size of your containers

It can be easy to leave storage "hanging around".  Remember that you pay for each and every block that has been stored.  One way to narrow down where to look is to grab the size of containers and look at the ones with the most storage in them.  Delete what you don't need and keep what you do.

Luckily, someone at Microsoft created a script to help us look at the cost of the blobs.  A note from personal experience, this can take a long time to run.
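If you just want a rough size-per-container view rather than full cost figures, something along these lines works (the storage account name and key are placeholders; like the Microsoft script, it enumerates every blob, so it can be slow on large accounts):

```powershell
# Rough per-container size report; storage account details are placeholders.
$ctx = New-AzureStorageContext -StorageAccountName "mystorageacct" -StorageAccountKey "<key>"

foreach ($container in Get-AzureStorageContainer -Context $ctx) {
    # Sum the Length (bytes) of every blob in the container.
    $bytes = (Get-AzureStorageBlob -Container $container.Name -Context $ctx |
        Measure-Object -Property Length -Sum).Sum
    Write-Host ("{0}: {1:N2} GB" -f $container.Name, ($bytes / 1GB))
}
```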

In conclusion, Azure storage can contribute to the waste in your Azure consumption.  The steps above can help you keep track of those costs.  It is recommended to run scripts/tools to report on this at least quarterly.  There are also various 3rd party services that you can use to help report on Azure costs.

Sunday, April 9, 2017

Azure VMs default to 30gb (Marketplace Deployments)

In a very recent change over in Azure-land, the program team decided to change the default OS disk size from 128 GB to 30 GB.  As per the information in this bulletin, the change was rolled out in conjunction with Managed Disks.  It also currently states that Azure is in the process of rolling back these changes.

I first noticed this issue when browsing the Azure advisors boards and someone had posted a question on it.  At the time, I was unsure if this issue presented itself even if you specified the "DiskSizeInGB" parameter in either ARM templates or in powershell. Luckily for me, I had configured a machine just the day before that exhibited the issue.

In the ARM template for the machine, I clearly specified 64 GB as the disk size.

          "osDisk": {
            "name": "[concat(parameters('vmName'),'-OS')]",
            "createOption": "FromImage",
            "managedDisk": {
              "storageAccountType": "Standard_LRS"
            },
            "diskSizeGB": 64,
            "caching": "ReadWrite"
          }

Here is a screenshot of what was provisioned in the VM.

As you can see from the image, the OS disk is provisioned at 64 GB; however, the partition by default only uses 30 GB of it, leaving 34 GB as unallocated space.  I think it would have been okay if only the default had changed (say, if you created a VM in the portal); however, the fact that it also ignores the diskSizeGB setting is disturbing.  As the link suggested, this change is being rolled back.
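In the meantime, one way to reclaim the unallocated space is to extend the partition from inside the guest OS (standard Windows Storage cmdlets, run as administrator on the VM itself):

```powershell
# Run inside the guest OS: grow the C: partition into the unallocated space.
$maxSize = (Get-PartitionSupportedSize -DriveLetter C).SizeMax
Resize-Partition -DriveLetter C -Size $maxSize
```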

Hopefully there wasn't too much automation broken by this change :)

Sunday, April 2, 2017

I built a cool log analytics dashboard for Azure, now what?

Log Analytics is a pretty cool first-class citizen of the OMS suite from Microsoft.  It was designed to help customers with monitoring (log, performance, security) and alerting.  Recently I had the opportunity to perform a PoC with it in a production environment for a customer.

Monitoring has become a complex landscape with several different options in the marketplace depending on the actor/viewpoint being used (ie: who is looking for the solution) and the components in scope (ie: cloud, hybrid, etc).  This particular customer was looking for a broad set of monitoring capabilities that extend into the application, cover the services that make up the application, and have a focus on security elements/risks in the application.  One key factor here is that every component of this solution is deployed in Azure.

From a sensing and measurement perspective, Log Analytics met the requirements.  Using its tie-ins to Azure Monitor, Log Analytics can pull in key data from the Azure fabric, including metrics on the various services in use as well as activity logs.  Using agents, it was able to collect various logs off of IaaS instances.  Further, Log Analytics has custom log capabilities that allow us to ingest and parse custom logs.

The other core requirement was around the ability to visualize the data.  Log analytics has a rich feature set for this, and a robust query language.  Further to this, the query language allows for correlation of data across multiple data sets.

In short, it definitely met the base requirements the client was looking for.

The initial PoC dashboard was actually quite easy to build out.  Using a combination of PowerShell and the portal (as I am monitoring both V1 and V2 resources), it was easy to onboard the VMs.  I quickly added a set of event logs to monitor and added some performance counters.  My main goals were to understand the standard disk/memory/CPU usage, but also to be able to see CPU performance by process.  These are all counters that can be found/monitored via perfmon and, by extension, Log Analytics.  The last piece was to start adding in diagnostic logs from other Azure fabric systems.  I was quickly able to ingest logs from the subscription itself (ie: actions on resources), network logs, app service logs, Azure SQL logs, etc.  There are generally two ways to do this: either via the Azure diagnostics setup, or via "solutions" in the Log Analytics dashboard.

Here is a snap of the solutions currently installed.
One of the most interesting ones is Wire Data 2.0.  I've used this solution to grab a ton of insight into the inbound and outbound traffic on the VMs being monitored.  You get a ton of detail, and the Wire Data dashboard provides some good views into your network traffic.

From an event log perspective, I've kept it light during the PoC, with the following focus.

Many of the systems in this solution had RDP enabled to the internet (by mistake).  I was able to use Log Analytics to quickly identify that our systems were under brute-force attack and lock down those endpoints.  It showed up quite easily in some of the blades and built-in views.

From a performance counter perspective, I decided to stick with the basics as well.

The key one here is processor time by process.  This was essentially to start understanding the types of processes on the host systems and determine what they were doing.  While still not the full picture, I was also able to build alerts off of this.  For example, I could scan all machines in a particular machine group and count the number of processes of a certain type running.  If that number ever dipped below a threshold, I could alert on it.

Once you get the basics setup in log analytics, you really start to wonder what you can do next with the product.  Here are some ideas that I want to explore:

  • Adding more of the solutions into play, specifically the ones around security.
  • Trying to build correlation views that tie in events across the system to performance metrics
  • Ingest the Application Insights telemetry data and determine dashboarding ideas
  • Start to ingest custom logs for various services running in the environment

Lucky for me, this particular client is pretty on board with growing OMS as the monitoring/alerting tool for this application.  I'm hoping to get a chance to build out more on the platform.