Cisco UCS Director HA and DR configuration

UCS Director is a simple yet very powerful automation and orchestration product, ideal for any customer looking to simplify IT operations and bring a level of uniformity to deploying IaaS. Please visit Cisco’s product page to learn more about it. This blog focuses on HA and DR configuration at a very high level.

Assumptions

There are two ways to deploy UCS Director: as a single appliance or as a multi-node setup. In this document, the assumption is that you have deployed UCS Director in a multi-node configuration (separate monitoring and inventory database instances in addition to the primary node and service nodes). I am going to focus on the vSphere-based deployment.

UCS Director High Availability

UCS Director is deployed as an OVF appliance on vSphere and relies on VMware HA to provide basic High Availability. If using VMware HA is not an option, the recommendation is to create a cron job and regularly back up the databases. For a full recovery of UCS Director, all you need is a copy of the database backups for each node instance.

UCS Director Database Backup

  1. Stop the Cisco services – log in as ‘shelladmin’ and select option 3
  2. Back up the Monitoring and Inventory databases from the shell
  3. Start the services back up by selecting option 4 (make sure you start the services on the service nodes before the primary node)
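The cron-driven backup mentioned above can be sketched as a small wrapper script. Everything in it is an illustrative assumption — the destination directory, database names, and dump commands are placeholders; on the appliance itself, the shelladmin menu performs the supported backup.

```shell
#!/bin/bash
# Sketch of a nightly backup wrapper intended to be run from cron.
# All paths and database names are placeholder assumptions; on a real
# appliance, use the shelladmin backup option and copy the archive off-box.

BACKUP_DIR=${BACKUP_DIR:-/tmp/ucsd-db-backup}   # hypothetical destination
STAMP=$(date +%Y%m%d-%H%M%S)                    # timestamp for this run
mkdir -p "$BACKUP_DIR"

# Placeholder dump commands (database names are assumptions):
# mysqldump --single-transaction inventory_db  > "$BACKUP_DIR/inventory-$STAMP.sql"
# mysqldump --single-transaction monitoring_db > "$BACKUP_DIR/monitoring-$STAMP.sql"

echo "backups for this run would land in $BACKUP_DIR (stamp $STAMP)"
```

A crontab entry such as `0 2 * * * /usr/local/bin/ucsd-db-backup.sh` would run it nightly at 02:00; remember to copy the resulting files off the appliance.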

UCS Director System Recovery

  1. Stand up the appliances (all nodes) with the original IP addresses
  2. Start the restore process
    1. Shut down the services on the primary and service nodes (use option 3 from the shell)
    2. Restore the Monitoring database
    3. Restore the Inventory database
    4. Start the services on the service nodes
    5. Start the services on the primary node
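The ordering constraint in the last two steps (service nodes first, primary last) can be sketched as a simple loop. The host names and the start mechanism are placeholder assumptions, not actual UCS Director syntax — on each node, starting services is shelladmin option 4.

```shell
#!/bin/bash
# Start order for recovery: service nodes first, primary node last.
# Host names below are placeholders for illustration.

SERVICE_NODES="ucsd-svc1 ucsd-svc2"
PRIMARY_NODE="ucsd-primary"

START_ORDER=""
for node in $SERVICE_NODES; do
  # On each service node, this would be shelladmin option 4 (start services)
  START_ORDER="$START_ORDER $node"
done
START_ORDER="$START_ORDER $PRIMARY_NODE"   # primary always comes last

echo "start order:$START_ORDER"
```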

UCS Director Disaster Recovery Scenarios

Because UCS Director is an appliance that connects to various element managers to orchestrate and run workflows, two scenarios can arise during a disaster event. A partial DR is one in which the primary site is not entirely down and most or all of the element managers (vCenter, UCSM, Nexus 1000V, EMC Unisphere, or NetApp OnCommand) are still functioning, or you have been able to stand them back up. In this scenario, you can recover quickly by redeploying UCS Director with the same IP address it had before the incident and restoring from backup.

A complete DR scenario is one in which you have lost access to all of your element managers and are now standing up the environment at a new site. Assuming the DR site is on a separately routed network, perhaps with a different IP scheme, recovery introduces a couple of new challenges. When you start bringing up your nodes, they will all have new IP addresses, and they will either have no access to the element managers or the element managers will be reachable at new IP addresses (for example, vCenter at the DR site is on a new subnet). There are two ways to approach this scenario.

Approach-1: (Recover UCS Director multi-node VMs from replicated storage)

1. Bring up all of the multi-node VMs
2. Stop the Infrastructure services
3. Restore the UCS Director DB backup onto Inventory DB Node and Monitoring DB Node
4. Stop the DB services on both Inventory & Monitoring Nodes
5. Re-IP all of the nodes (update /etc/sysconfig/network-scripts/ifcfg-ethX files appropriately)
6. Next, change the IP addresses in the service.properties file for each of the daemons, namely eventmgr, idaccessmgr, and inframgr. (This ensures connectivity from the primary and service nodes to the database nodes at their new IP addresses.)
Edit the following files and update the IP addresses to point to the correct nodes:
/opt/infra/inframgr/service.properties
/opt/infra/idaccessmgr/service.properties
/opt/infra/eventmgr/service.properties
7. Restart the inventory DB node and the monitoring node and wait until they are completely up
8. Restart the Primary and Service Nodes
9. Delete and re-add the BMA account and restart the BMA services (if Bare Metal Agent is set up)
10. Verify VM deployment and the Workflows
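The service.properties edits in step 6 can be scripted with sed. The sketch below demonstrates the idea against a sample file in /tmp rather than the real appliance paths; the old/new addresses and the `db.url` property key are placeholder assumptions.

```shell
#!/bin/bash
# Sketch of step 6: point each daemon's service.properties at the database
# nodes' new addresses. IPs and the property format are assumptions.

OLD_DB_IP=10.1.1.20     # DB node address at the old site (placeholder)
NEW_DB_IP=172.16.5.20   # DB node address after the DR re-IP (placeholder)

# Sample file standing in for the real properties files, which on the
# appliance would be:
#   /opt/infra/inframgr/service.properties
#   /opt/infra/idaccessmgr/service.properties
#   /opt/infra/eventmgr/service.properties
SAMPLE=/tmp/service.properties
printf 'db.url=jdbc:mysql://%s:3306/db\n' "$OLD_DB_IP" > "$SAMPLE"

for f in "$SAMPLE"; do
  cp "$f" "$f.bak"                          # keep a backup before editing
  sed -i "s/$OLD_DB_IP/$NEW_DB_IP/g" "$f"   # swap old address for new
done
```

On the appliance, you would loop over the three real paths instead of the sample file, then restart the services so the daemons pick up the new addresses.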

Approach-2: (Recover UCS Director multi-node VMs deploying new OVF and patches)

1. First, bring up the UCS Director multi-node setup by deploying a new OVF with new IP addresses
2. Apply the same patch level as the primary site on both the primary and service nodes
3. Configure resource reservations on the multi-node VMs per Cisco’s recommendations
4. Update /etc/hosts, DNS, NTP, and other infrastructure configuration files, and set the date and time on all of the nodes
5. Stop Infra services on Primary and Service Nodes
6. Restore Inventory and Monitoring DB Backup
7. Restart services on the Primary & Service nodes and verify connectivity to the UCS Director UI.
8. Delete and re-add the BMA account and restart the BMA services
9. Verify VM deployment and the Workflows
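Step 4 can be sketched as below. The host names, addresses, and NTP server are placeholder assumptions for a hypothetical DR site; the sketch writes to a file in /tmp rather than the real /etc/hosts.

```shell
#!/bin/bash
# Sketch of step 4 on each node: record the new node addresses and sync
# time. All hostnames and IPs are placeholder assumptions.

NTP_SERVER=172.16.5.5                   # hypothetical DR-site NTP server
HOSTS_FILE=${HOSTS_FILE:-/tmp/hosts}    # stands in for /etc/hosts

# Map the node roles to their new DR-site addresses:
cat >> "$HOSTS_FILE" <<'EOF'
172.16.5.10  ucsd-primary
172.16.5.11  ucsd-service1
172.16.5.20  ucsd-inventory-db
172.16.5.21  ucsd-monitoring-db
EOF

# Sync the clock before restarting services (commented out: requires
# root privileges and network access to the NTP server):
# ntpdate -u "$NTP_SERVER"
```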

Note: If you have created any custom scripts or configurations, remember to restore them to their respective folders after the UCS Director nodes are brought back up.
