Business Continuity and Disaster Recovery (BCDR) Guidance for AVD Environments with Nerdio Manager

Azure Virtual Desktop is a critical component of an IT environment and it is important to consider business continuity and disaster recovery scenarios to keep AVD available.

AVD deployments consist of several components and we will consider each one individually in the configuration of BCDR for AVD.

 

  • AVD Service

    • Description: The AVD Management service is a Microsoft-hosted, globally distributed PaaS service that is responsible for brokering the connections between the end-user client and session host VMs. The AVD service also contains metadata about host pools, application groups, user assignments, etc.

    • DR Availability Considerations: The AVD service is managed by Microsoft and no additional steps need to be taken by you to keep this service operating in case of a regional outage. When an outage occurs in a region, the service infrastructure components fail over to the secondary location and continue functioning as normal. You can still access service-related metadata, and users can still connect to available hosts. End-user connections stay online as long as the tenant environment or hosts remain accessible.

  • Nerdio Manager

    • Description: Nerdio Manager is a management console that extends AVD's native management capabilities. It is also responsible for session host VM auto-scaling. Nerdio Manager is an Azure application consisting of several PaaS services.

    • DR Availability Considerations: Nerdio Manager is not in the critical path of a user's desktop connection and therefore an outage of Nerdio Manager does not impact users' ability to access their desktops. Nerdio Manager is also responsible for scaling out the environment to accommodate additional user demand, and when it is unavailable, session host VMs must be powered on manually using the Azure portal. An active/passive configuration for AVD and Nerdio Manager is recommended, as will be discussed below.

  • Active Directory

    • Description: Active Directory (or Entra Domain Services) is a critical component of an AVD environment and is responsible for authenticating user sign-ins to domain-joined session host VMs.

    • DR Availability Considerations: It is critical to maintain availability of the Active Directory domain controllers in case of a primary region's outage. Without an accessible AD domain controller, users are not able to sign in to their AVD desktops and RemoteApps.

  • Desktop Images

    • Description: Desktop images serve as the base images for creating session host VMs.

    • DR Availability Considerations: Desktop images don't participate in providing users the ability to connect to a session host VM. However, they are critically important in the process of provisioning new session host capacity and therefore must be backed up and available when hosts are created.

  • Session Host VMs

    • Description: Azure VMs that run in your Azure subscription and provide users their AVD desktops and apps.

    • DR Availability Considerations: Session host VMs are responsible for delivery of users' desktops and apps. They must be available for users to be able to connect. In most AVD deployment scenarios, session host VMs are clones of a desktop image and do not contain any data. Therefore, during a DR event, they can be easily recreated if the desktop image is available.

  • FSLogix Profile Storage

    • Description: SMB storage location that contains the FSLogix profile container VHD(X) files with users' profile data.

    • DR Availability Considerations: FSLogix profiles are critical to the operation of an AVD environment and must be available whenever a user needs to connect to and work on a desktop or RemoteApp. FSLogix profiles must therefore be replicated and available in multiple locations.

  • Business Applications

    • Description: Your other business applications.

    • DR Availability Considerations: Other business applications are out of scope for this guide.

 

AVD deployments with Nerdio Manager are intrinsically redundant and highly available. However, depending on your RPO, RTO, and budgetary requirements, there are several items that should be taken into consideration when planning and architecting your AVD deployment.

Each outage situation is unique and requires a response that is custom-tailored to the situation. In this guide, we will review three outage situations and discuss how to deal with each of the components listed above in each outage situation.

Situation #1: Local corruption of data, metadata, or resources, and no underlying data center or region outage

In this situation, restoring from a backup or rebuilding session host VMs is the best approach. Let's review how this applies to each AVD environment component:

  • AVD Service: Because this service is hosted, managed, and backed up by Microsoft, there is nothing for you to do. The AVD service fails over automatically and Microsoft is responsible for getting everything back up and running within the provided SLAs.

  • Nerdio Manager: Nerdio Manager is not critical to the users' ability to connect to their desktops because it does not participate in the user authentication or connection brokering process. Nerdio Manager is responsible for scaling out the environment by powering on session host VMs. If Nerdio Manager is unavailable, VMs can be powered on manually via the Azure portal.

  • Active Directory: Functional AD domain controllers must be accessible at all times.

    • Recommendation: Create multiple AD domain controller VMs that are accessible from the AVD environment. Back up the AD system state and restore if needed. Consider using Entra ID and Entra ID Joined, where applicable.

  • Desktop Images: Changes are often made to desktop images during the normal course of AVD environment maintenance. Maintaining backups of desktop images makes it possible to recover quickly from any corruption.

  • Session Host VMs: Hosts can become unavailable or corrupted in the normal course of operations.

    • Recommendation: Enable Nerdio Manager's auto-heal functionality to automatically repair broken session hosts. If necessary, delete any failed hosts and allow Nerdio Manager to automatically re-create them as needed. See Enable Dynamic Host Pool Auto-scaling for details.

  • FSLogix Profiles: Corruption of profile containers can be resolved by restoring the corrupted VHD(X) files from backup.

Situation #2: Single data center or Availability Zone failure within an Azure region

An Azure region is a set of data centers deployed within a latency-defined perimeter and connected through a dedicated regional low-latency network. Azure gives you the flexibility to deploy applications where you need to, including across multiple regions to deliver cross-region resiliency.

An Availability Zone is a high-availability offering that protects your applications and data from data center failures. Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more data centers equipped with independent power, cooling, and networking. To ensure resiliency, there's a minimum of three separate zones in all enabled regions. Zone-redundant services replicate your applications and data across Availability Zones to protect against single points of failure. With Availability Zones, Azure offers an industry-best 99.99% VM uptime SLA. See Microsoft's Availability Zones documentation to learn more.
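To put the 99.99% figure in perspective, the downtime an uptime SLA permits can be worked out with simple arithmetic. The following is a minimal sketch; the function name and the 30-day month and 365-day year periods are assumptions chosen for illustration, and the exact measurement period and terms are defined by Microsoft's SLA.

```python
def max_downtime_minutes(sla_percent: float, period_hours: float) -> float:
    """Return the maximum downtime, in minutes, that an uptime SLA permits over a period."""
    unavailable_fraction = 1.0 - sla_percent / 100.0
    return unavailable_fraction * period_hours * 60.0


# Assumed periods for illustration only.
HOURS_PER_30_DAY_MONTH = 30 * 24
HOURS_PER_YEAR = 365 * 24

monthly = max_downtime_minutes(99.99, HOURS_PER_30_DAY_MONTH)
yearly = max_downtime_minutes(99.99, HOURS_PER_YEAR)

print(f"99.99% uptime allows about {monthly:.2f} minutes of downtime per 30-day month")
print(f"99.99% uptime allows about {yearly:.2f} minutes of downtime per 365-day year")
```

Under these assumptions, a 99.99% SLA corresponds to roughly 4.32 minutes of permitted downtime per month, or about 52.56 minutes per year.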

In case of a data center or Availability Zone failure, most components of the AVD environment automatically fail over to another Availability Zone with no user intervention required.

  • Not all Azure regions support Availability Zones for all products. Review the Regions that support Availability Zones in Azure document before deploying your AVD environment to select the region that addresses your availability requirements. Pay special attention to Premium Files Storage if using Azure Files for FSLogix profiles.

  • AVD Service: Because this service is hosted, managed, and backed up by Microsoft, there is nothing for you to do. The AVD service fails over automatically and Microsoft is responsible for getting everything back up and running within the provided SLAs.

  • Nerdio Manager: Because Nerdio Manager is built on top of resilient Azure PaaS services that are automatically redundant in an Azure region across availability zones, there is no action necessary on your part. PaaS services automatically fail over to an available zone in case of a data center outage and Nerdio Manager continues working normally.

  • Active Directory: If your domain controller VMs are distributed across Availability Zones within a supported Azure region, then no action is necessary. The domain controllers in the unaffected zones remain available and continue servicing user sign-in requests. Entra Domain Services operates two domain controllers, in separate Availability Zones if supported, by default. See this Microsoft article for details.

    • Recommendation: If the domain controllers are not currently in an Availability Zone, migrate them into an Availability Zone. See this Microsoft article for details.

  • Desktop Images: Desktop images should not be affected by a data center failure and automatically fail over to another zone.

  • Session Host VMs: Session host VMs running in the data center where the outage occurs go offline.

    • Recommendation: Leverage Nerdio Manager's Availability Zones feature, which automatically distributes VMs across Availability Zones when they are deployed. See Host Pool VM Deployment for details. More information about VM availability in Azure can be found here.

  • FSLogix Profiles: Data center outages can impact FSLogix storage availability.

    • Recommendation: Leverage Azure Files with Premium ZRS storage in an Azure region that supports Availability Zones for Premium Files storage. In this scenario, no action on your part is needed in case of a data center outage.
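The effect of spreading session hosts across zones can be illustrated with a small sketch. This is not Nerdio Manager's actual placement logic; it is a hypothetical round-robin assignment, with made-up host names and the zone labels "1", "2", and "3", that shows how much capacity remains when a single zone goes offline.

```python
from collections import Counter
from itertools import cycle


def assign_zones(host_names, zones=("1", "2", "3")):
    """Assign each host to a zone in round-robin order (illustrative only)."""
    zone_cycle = cycle(zones)
    return {name: next(zone_cycle) for name in host_names}


def surviving_hosts(assignment, failed_zone):
    """Return the hosts that are not placed in the failed zone."""
    return [name for name, zone in assignment.items() if zone != failed_zone]


# Hypothetical host names for a seven-host pool.
hosts = [f"avd-host-{i}" for i in range(1, 8)]
assignment = assign_zones(hosts)
per_zone = Counter(assignment.values())
remaining = surviving_hosts(assignment, failed_zone="1")

print(dict(per_zone))
print(f"{len(remaining)} of {len(hosts)} hosts remain if zone 1 fails")
```

With seven hosts over three zones, no zone holds more than three hosts, so at least four hosts stay online after any single-zone failure. When sizing a host pool, it can help to check that the remaining hosts are enough to carry the expected user load.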

Situation #3: Entire Azure region outage

An Azure region is a set of data centers deployed within a latency-defined perimeter and connected through a dedicated regional low-latency network. Azure gives you the flexibility to deploy applications where you need to, including across multiple regions to deliver cross-region resiliency. Failure of an entire Azure region is rare. For more information, see Overview of the resiliency pillar.

The recommended approach to protect from an Azure region outage is to deploy the AVD environment in two regions in an active/passive configuration. The active (primary) deployment is the one always in use. The passive (secondary) deployment is kept up to date as changes are made in the primary environment and is available to come online in case of a regional outage.

  • AVD Service: Because this service is hosted, managed, and backed up by Microsoft, there is nothing for you to do. The AVD service fails over automatically and Microsoft is responsible for getting everything back up and running within the provided SLAs.

    • Recommendation: Since the secondary deployment is a copy of the primary one, user assignments can be configured in such a way that users see two workspaces with identical host pools and RemoteApps - one primary and one for DR. Alternatively, users can remain unassigned from the secondary deployment's app groups and be assigned on-demand in a fail-over event.

  • Nerdio Manager: Nerdio Manager is deployed using several PaaS services in a selected Azure region. A region failure results in Nerdio Manager failure.

    • Recommendation: Deploy Nerdio Manager to a secondary region and configure it in an identical way to the primary one.

  • Active Directory: AD domain controllers must be accessible by session host VMs in the secondary region.

    • Recommendation: Create a VNet in the secondary region with VNet peering and a set of domain controller VMs to be available in case of an outage. Also, don't forget to consider connectivity to your on-premises or other networks for your other business applications.

  • Desktop Images: Up-to-date desktop images are critical during a fail-over event, especially if you are provisioning brand new session host VMs.

    • Recommendation: As you make changes to your primary desktop images in the normal course of AVD administration, make cloning these images to the secondary region a standard part of the process after each modification. See Import an Existing VM for details.

  • Session Host VMs: Session host VMs can be pre-staged and created in advance but remain powered off, then started during a fail-over event, or they can be provisioned on-demand by Nerdio Manager. The choice depends on your recovery time objective (RTO).

    • Recommendation: Configure all host pools and auto-scale settings in the secondary deployment to match the primary deployment. Turn auto-scale OFF and shut down all session host VMs. In case of a fail-over event, turn auto-scale ON for each host pool and all necessary capacity is brought online. Alternatively, you may configure auto-scale, keep it OFF, and not provision any session host VMs in advance. In that case, when auto-scale is turned ON, Nerdio Manager begins provisioning new session host VMs - this takes longer than turning on already existing VMs. If you choose to keep powered-off VMs in place in anticipation of a fail-over event, don't forget to re-image them every time a new image is imported from the primary environment.

  • FSLogix Profiles: Azure Files Premium does not currently support the Geo-redundant storage (GRS) tier. Azure NetApp Files can be replicated automatically to another region.

    • Recommendation: Leverage Azure NetApp Files with cross-region replication. When configuring FSLogix settings in the secondary Nerdio Manager instance, be sure to use the ANF path of the secondary region's volume.
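The choice between pre-staged and on-demand session hosts described above can be reasoned about with a rough estimate. The sketch below is only an illustration: the function names, the per-batch timings (5 minutes to start existing VMs, 30 minutes to provision new ones), and the batch size of 10 are all assumed values, not measurements or Nerdio Manager settings. Substitute figures observed in your own environment.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryEstimate:
    strategy: str
    minutes: float
    meets_rto: bool


def estimate_recovery(host_count, minutes_per_batch, batch_size):
    """Estimate minutes to bring hosts online when batch_size hosts are handled in parallel."""
    batches = math.ceil(host_count / batch_size)
    return batches * minutes_per_batch


def choose_strategy(host_count, rto_minutes, *, start_minutes=5,
                    provision_minutes=30, batch_size=10):
    """Prefer on-demand provisioning when it fits the RTO, otherwise pre-staged hosts."""
    on_demand = estimate_recovery(host_count, provision_minutes, batch_size)
    pre_staged = estimate_recovery(host_count, start_minutes, batch_size)
    if on_demand <= rto_minutes:
        return RecoveryEstimate("on-demand provisioning", on_demand, True)
    return RecoveryEstimate("pre-staged, powered-off hosts", pre_staged,
                            pre_staged <= rto_minutes)


# Hypothetical example: 40 session hosts with a 60-minute RTO.
plan = choose_strategy(host_count=40, rto_minutes=60)
print(f"{plan.strategy}: about {plan.minutes} minutes (meets RTO: {plan.meets_rto})")
```

The sketch prefers on-demand provisioning whenever it fits within the RTO because powered-off VMs still incur disk storage costs and must be re-imaged after every image update. If neither option meets the RTO, the result reports meets_rto as False, which suggests revisiting the RTO or the capacity plan.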
