Thursday, August 18, 2011

Monitor Databases in DAGs

A few days ago, someone at the Microsoft Forums asked if there was a script to alert an administrator when Exchange performs a failover of databases in a DAG.

This is something I have wanted to do for a long time but never got around to, so here is my current solution (it might get improved in the future).


With Exchange 2010 and DAGs, it is important to know whenever a database automatically fails over to another server. Although everything keeps working without any problems for end users (hopefully), administrators still have to investigate why a failover happened.

In case you have Exchange deployed across multiple AD sites and a database fails over to a server on another site, this will probably impact the way your users access OWA, for example.

Databases in a DAG, and therefore with multiple copies, have the ActivationPreference attribute that shows which servers have preference over the others to mount the database in case of a disaster or a manual switchover.

Here is an example of what you will get if you run the following command in an environment with at least one DAG and multiple database copies:

Get-MailboxDatabase | Sort Name | Select Name, ActivationPreference


Name    ActivationPreference
----    --------------------
ADB1    {[MBXA1, 1], [MBXA2, 2]}
ADB2    {[MBXA1, 1], [MBXA2, 2]}
ADB3    {[MBXA1, 1], [MBXA2, 2]}
...
MDB1    {[MBX1, 1], [MBX2, 2], [MBX3, 3], [MBX4, 4]}
MDB2    {[MBX1, 1], [MBX2, 2], [MBX3, 3], [MBX4, 4]}
MDB3    {[MBX1, 1], [MBX2, 2], [MBX3, 3], [MBX4, 4]}
...

Based on the ActivationPreference attribute, we can monitor if databases are currently active on the servers that they should be, i.e., on servers with an ActivationPreference of 1.

To check this, we can use the following script:



Get-MailboxDatabase | Sort Name | ForEach {
 $db = $_.Name
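 # Server where the database is currently mounted
 $curServer = $_.Server.Name
 # Entry with a preference of 1, i.e. the server that should host the active copy
 $ownServer = $_.ActivationPreference | ? {$_.Value -eq 1}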

 Write-Host "$db on $curServer should be on $($ownServer.Key) - " -NoNewLine

 If ($curServer -ne $ownServer.Key)
 {
  Write-Host "WRONG" -ForegroundColor Red
 }
 Else
 {
  Write-Host "OK" -ForegroundColor Green
 }
}



This simply compares the server where the database is currently active with the server that has an ActivationPreference of 1. If they differ, the script writes WRONG in red to let the administrator know.

But since we are at it, why not also check for the status of the database and the state of its content index? This can be checked using the Get-MailboxDatabaseCopyStatus cmdlet.

According to the Monitoring High Availability and Site Resilience TechNet article, here are all the possible values for the database copy status:


Database Copy Status
Failed - The mailbox database copy is in a Failed state because it isn't suspended, and it isn't able to copy or replay log files. While in a Failed state and not suspended, the system will periodically check whether the problem that caused the copy status to change to Failed has been resolved. After the system has detected that the problem is resolved, and barring no other issues, the copy status will automatically change to Healthy;

Seeding - The mailbox database copy is being seeded, the content index for the mailbox database copy is being seeded, or both are being seeded. Upon successful completion of seeding, the copy status should change to Initializing;

SeedingSource - The mailbox database copy is being used as a source for a database copy seeding operation;

Suspended - The mailbox database copy is in a Suspended state as a result of an administrator manually suspending the database copy by running the Suspend-MailboxDatabaseCopy cmdlet;

Healthy - The mailbox database copy is successfully copying and replaying log files, or it has successfully copied and replayed all available log files;

ServiceDown - The Microsoft Exchange Replication service isn't available or running on the server that hosts the mailbox database copy;

Initializing - The mailbox database copy will be in an Initializing state when a database copy has been created, when the Microsoft Exchange Replication service is starting or has just been started, and during transitions from Suspended, ServiceDown, Failed, Seeding, SinglePageRestore, LostWrite, or Disconnected to another state. While in this state, the system is verifying that the database and log stream are in a consistent state. In most cases, the copy status will remain in the Initializing state for about 15 seconds, but in all cases, it should generally not be in this state for longer than 30 seconds;

Resynchronizing - The mailbox database copy and its log files are being compared with the active copy of the database to check for any divergence between the two copies. The copy status will remain in this state until any divergence is detected and resolved;

Mounted - The active copy is online and accepting client connections. Only the active copy of the mailbox database copy can have a copy status of Mounted;

Dismounted - The active copy is offline and not accepting client connections. Only the active copy of the mailbox database copy can have a copy status of Dismounted;

Mounting - The active copy is coming online and not yet accepting client connections. Only the active copy of the mailbox database copy can have a copy status of Mounting;

Dismounting - The active copy is going offline and terminating client connections. Only the active copy of the mailbox database copy can have a copy status of Dismounting;

DisconnectedAndHealthy - The mailbox database copy is no longer connected to the active database copy, and it was in the Healthy state when the loss of connection occurred. This state represents the database copy with respect to connectivity to its source database copy. It may be reported during DAG network failures between the source copy and the target database copy;

DisconnectedAndResynchronizing - The mailbox database copy is no longer connected to the active database copy, and it was in the Resynchronizing state when the loss of connection occurred. This state represents the database copy with respect to connectivity to its source database copy. It may be reported during DAG network failures between the source copy and the target database copy;

FailedAndSuspended - The Failed and Suspended states have been set simultaneously by the system because a failure was detected, and because resolution of the failure explicitly requires administrator intervention. An example is if the system detects unrecoverable divergence between the active mailbox database and a database copy. Unlike the Failed state, the system won't periodically check whether the problem has been resolved, and automatically recover. Instead, an administrator must intervene to resolve the underlying cause of the failure before the database copy can be transitioned to a healthy state;

SinglePageRestore - This state indicates that a single page restore operation is occurring on the mailbox database copy;



Based on these values, we want the Status attribute to be either Mounted (on the server where the database is currently active) or Healthy (on the servers that hold a passive copy of it). The ContentIndexState attribute should always be Healthy.
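The rule above can be sketched as a small helper that takes the two values as plain strings, so it can be tried without an Exchange server. The function name Test-CopyIsHealthy is made up for illustration and is not an Exchange cmdlet. Note that it compares the whole value instead of matching part of it, because a pattern match on "Healthy" would also accept DisconnectedAndHealthy.

```powershell
# Hypothetical helper (not an Exchange cmdlet): returns $True only when a copy
# is Mounted or Healthy AND its content index is Healthy.
Function Test-CopyIsHealthy ([String] $Status, [String] $ContentIndexState)
{
 # Exact comparison on purpose: -match "Healthy" would also accept
 # DisconnectedAndHealthy, which is a state we do want to hear about.
 $statusOk = ($Status -eq "Mounted") -or ($Status -eq "Healthy")
 $indexOk = ($ContentIndexState -eq "Healthy")
 Return ($statusOk -and $indexOk)
}

Test-CopyIsHealthy "Mounted" "Healthy"
Test-CopyIsHealthy "Healthy" "Healthy"
Test-CopyIsHealthy "DisconnectedAndHealthy" "Healthy"
Test-CopyIsHealthy "Healthy" "Failed"
```

The first two calls return True and the last two return False.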

To monitor both of these attributes, we can use the following command:


Get-MailboxDatabase | Sort Name | Get-MailboxDatabaseCopyStatus | ForEach {
 # Exact comparisons: -notmatch "Healthy" would let DisconnectedAndHealthy through
 If (($_.Status -ne "Mounted" -and $_.Status -ne "Healthy") -or $_.ContentIndexState -ne "Healthy")
 {
  Write-Host "`n$($_.Name) - Status: $($_.Status) - Index: $($_.ContentIndexState)" -ForegroundColor Red
 }
}



Now, let’s put everything together and have the script send an e-mail to the administrator if something is wrong with any database! This way, we can create a scheduled task to run this script every 2 minutes, for example.

Let’s also compare the AD site of the server currently hosting the database with the AD site of the server that should be hosting it. As I mentioned before, this is important as it can change the way users access OWA.
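The getExchangeServerADSite function in the full script reads the msExchServerSite attribute of the server object, which holds the distinguished name of the AD site, and keeps only the site name. Here is that string handling on its own. The helper name getSiteNameFromDN and the sample DN (including the site name Lisbon) are made up for illustration.

```powershell
# Hypothetical helper: extract the site name from a site DN such as
# CN=Lisbon,CN=Sites,CN=Configuration,DC=letsexchange,DC=com
Function getSiteNameFromDN ([String] $siteDN)
{
 # The first component is "CN=<site name>"; drop the 3-character "CN=" prefix
 Return ($siteDN.Split(",")[0]).Substring(3)
}

getSiteNameFromDN "CN=Lisbon,CN=Sites,CN=Configuration,DC=letsexchange,DC=com"
```

This returns Lisbon. Splitting on commas is a simplification: it assumes the site name itself does not contain a comma.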

You can also download the entire script from here.

Function getExchangeServerADSite ([String] $excServer)
{
 # We could use WMI to check for the domain, but I think this method is better
 # Get-WmiObject Win32_NTDomain -ComputerName $excServer

 $configNC =([ADSI]"LDAP://RootDse").configurationNamingContext
 $search = new-object DirectoryServices.DirectorySearcher([ADSI]"LDAP://$configNC")
 $search.Filter = "(&(objectClass=msExchExchangeServer)(name=$excServer))"
 $search.PageSize = 1000
 [Void] $search.PropertiesToLoad.Add("msExchServerSite")

 Try {
  $adSite = [String] ($search.FindOne()).Properties.Item("msExchServerSite")
  Return ($adSite.Split(",")[0]).Substring(3)
 } Catch {
  Return $null
 }
}



[Bool] $bolFailover = $False
[String] $errMessage = $null

Get-MailboxDatabase | Sort Name | ForEach {
 $db = $_.Name
 $curServer = $_.Server.Name
 $ownServer = $_.ActivationPreference | ? {$_.Value -eq 1}

 # Compare the server where the DB is currently active to the server where it should be
 If ($curServer -ne $ownServer.Key)
 {
  # Compare the AD sites of both servers
  $siteCur = getExchangeServerADSite $curServer
  $siteOwn = getExchangeServerADSite $ownServer.Key
  
  If ($siteCur -ne $null -and $siteOwn -ne $null -and $siteCur -ne $siteOwn)
  {
   $errMessage += "`n$db on $curServer should be on $($ownServer.Key) (DIFFERENT AD SITE: $siteCur)!" 
  }
  Else
  {
   $errMessage += "`n$db on $curServer should be on $($ownServer.Key)!"
  }

  $bolFailover = $True
 }
}

$errMessage += "`n`n"

Get-MailboxDatabase | Sort Name | Get-MailboxDatabaseCopyStatus | ForEach {
 # Exact comparisons: -notmatch "Healthy" would let DisconnectedAndHealthy through
 If (($_.Status -ne "Mounted" -and $_.Status -ne "Healthy") -or $_.ContentIndexState -ne "Healthy")
 {
  $errMessage += "`n$($_.Name) - Status: $($_.Status) - Index: $($_.ContentIndexState)"
  $bolFailover = $True
 }
}

If ($bolFailover)
{
 Send-MailMessage -From "admin_nuno@letsexchange.com" -To "exchange.alerts@letsexchange.com" -Subject "DAG NOT Healthy!" -Body $errMessage -Priority High -SMTPserver "mail.letsexchange.com"
 Schtasks.exe /Change /TN "MonitorDAG" /DISABLE
}




As always, sorry for the format of the code...
At the end of the script, if an e-mail is sent, you might want to disable the scheduled task (as the script above does), otherwise you will receive an e-mail every two minutes until you resolve the issue...
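For reference, this is roughly how the scheduled task could be created, and how to re-enable it after an alert. The task name MonitorDAG matches the one the script disables. The script path C:\Scripts\MonitorDAG.ps1 is an assumption, and because a scheduled task does not start in the Exchange Management Shell, the script would also need to load the Exchange cmdlets itself (for example with Add-PSSnapin Microsoft.Exchange.Management.PowerShell.E2010).

```powershell
# Create a task that runs the script every 2 minutes (the script path is hypothetical)
Schtasks.exe /Create /TN "MonitorDAG" /SC MINUTE /MO 2 /TR "PowerShell.exe -NoProfile -File C:\Scripts\MonitorDAG.ps1"

# Once the issue has been resolved, turn the alerts back on
Schtasks.exe /Change /TN "MonitorDAG" /ENABLE
```

As written, the task is created under the account that runs the command, so that account needs permission to run the Exchange cmdlets and to change the task.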

Please note that there are more attributes that can and should be monitored! For example, you could run the Test-ReplicationHealth cmdlet to check the replication status of your mailbox database copies.

Hope this helps!
