Thursday, March 31, 2011

Operations Manager: Cleaning Up Cluster Resources


Fixing client issues or removing defunct clusters is one of the more frustrating tasks in Operations Manager. You try to uninstall the client on a Windows Cluster node and you are immediately met with a message that the "Agent is managing other devices and cannot be uninstalled. Please resolve this issue via Agentless managed view in Administration".
If you are brave enough to keep fighting, you will find that you can remove the client from any "passive" nodes in the cluster. This is because the clients on those nodes are not managing any agentless resources. The clients on any "active" node of the cluster will not uninstall by any means available in the console.
This problem exists for the most part because the cluster itself physically does not. That is to say that the cluster itself, and its associated groups, is a virtual entity that exhibits most of the same properties as a computer resource without actually being a physical resource. It has a name, a domain, an IP address, one or more logical disks… You get the point. However, there is not a physical server to actually "house" the agent. These virtual resources need to be managed "remotely" through agentless monitoring and it is the active node of the cluster housing the physical resources that is responsible for monitoring those resources. It is this agentless monitoring that leads to the client maintenance problems.
There are a couple of different approaches to correcting this problem; both are pretty drastic and not for the faint of heart. The first, more common recommendation is a pretty ugly SQL query to purge the cluster resources from the OperationsManager database. These queries typically look something like this:
-- Health service of one physical cluster node (replace FQDN with the node's name).
DECLARE @nodeHS nvarchar(255)
SET @nodeHS = N'Microsoft.SystemCenter.HealthService:FQDN'

-- Every live "health service should manage entity" relationship sourced from that health service.
DECLARE Rel_Cursor CURSOR FOR
    SELECT [RelationshipGenericView].[Id]
    FROM dbo.RelationshipGenericView
    WHERE [RelationshipGenericView].[MonitoringRelationshipClassId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()
      AND [RelationshipGenericView].[IsDeleted] = 0
      AND [RelationshipGenericView].[SourceMonitoringObjectId] IN
          (SELECT BaseManagedEntityId FROM BaseManagedEntity WHERE FullName = @nodeHS)

OPEN Rel_Cursor;

DECLARE @relId uniqueidentifier
DECLARE @discoSource uniqueidentifier
DECLARE @now datetime
SET @now = GETUTCDATE()

FETCH NEXT FROM Rel_Cursor INTO @relId;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Look up the cluster discovery source that reported this relationship.
    SELECT @discoSource = DSTR.[DiscoverySourceId]
    FROM dbo.[DiscoverySourceToRelationship] DSTR
        INNER JOIN dbo.[DiscoverySource] DS ON DS.DiscoverySourceId = DSTR.DiscoverySourceId
        INNER JOIN Discovery ON Discovery.DiscoveryId = DS.DiscoveryRuleId
    WHERE [DiscoveryName] = 'Microsoft.Windows.Cluster.Classes.Discovery'
      AND [RelationshipId] = @relId
      AND DSTR.[IsDeleted] = 0

    -- Remove the relationship from that discovery source's scope.
    EXEC dbo.p_RemoveRelationshipFromDiscoverySourceScope
        @RelationshipId = @relId,
        @DiscoverySourceId = @discoSource,
        @TimeGenerated = @now

    FETCH NEXT FROM Rel_Cursor INTO @relId;
END;

CLOSE Rel_Cursor;
DEALLOCATE Rel_Cursor;
Where FQDN is the fully qualified domain name of one of your physical cluster nodes.
This query would then need to be run once for each physical node in the cluster.
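Before running anything destructive, it may be worth previewing what each node's health service is currently managing. The following is only a sketch of a read-only check, built from the same view, function, and columns as the cursor's SELECT above. The node names are hypothetical placeholders, and the preview does not apply the cluster discovery filter used inside the loop, so it can list more relationships than the purge would actually remove.

```sql
-- Read-only preview: nothing is modified.
-- The FQDNs below are hypothetical placeholders; substitute your physical cluster nodes.
DECLARE @nodes TABLE (Fqdn nvarchar(255) COLLATE DATABASE_DEFAULT NOT NULL);
INSERT INTO @nodes (Fqdn) VALUES (N'node1.example.com');
INSERT INTO @nodes (Fqdn) VALUES (N'node2.example.com');

-- One row per live "should manage entity" relationship sourced from each node's health service.
SELECT N.Fqdn,
       R.[Id] AS RelationshipId
FROM @nodes N
    INNER JOIN dbo.BaseManagedEntity BME
        ON BME.FullName = N'Microsoft.SystemCenter.HealthService:' + N.Fqdn
    INNER JOIN dbo.RelationshipGenericView R
        ON R.[SourceMonitoringObjectId] = BME.BaseManagedEntityId
WHERE R.[MonitoringRelationshipClassId] = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceShouldManageEntity()
  AND R.[IsDeleted] = 0
ORDER BY N.Fqdn;
```

A node that returns no rows here has nothing for the purge query to act on, which is one way to decide which nodes actually need it.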
After that "emergency chainsaw surgery", the client on the physical nodes can be removed in the "normal" way, repaired, or simply deleted. While this method certainly does work, it has all the subtlety of a 2,000 lb bomb.
There is another way that at least saves you from tromping through the production database, and that is through judicious use of the "Remove-DisabledMonitoringObject" cmdlet. The first step in this approach is to create a group containing the individual physical members of the cluster that need client maintenance or decommissioning. After the group is created, you create overrides for all of the cluster discoveries, disabling them for targets in your newly created group.
After giving your management group sufficient time to fully replicate the new overrides throughout the hierarchy, you launch the Operations Manager PowerShell console connected to your root management server and run the "Remove-DisabledMonitoringObject" cmdlet. Once the cmdlet completes, the cluster instance should be removed from your management group, allowing you to perform any necessary maintenance on the physical nodes' clients. While this is certainly more precise and less risky than direct SQL queries against the production database, it is still far from ideal.
It should not be necessary to remove the entire cluster from the management group to perform simple client maintenance on any of the physical nodes. Clustering is one of the more common HA approaches, and if you want something to stay HA it needs to be proactively monitored, so tactics like these should not be required. Hopefully Microsoft finds a better approach in the future.