How to gracefully handle host hardware maintenance
#1 June 17, 2011 08:44:59
- Aaron Dodd
- Registered: 2010-09-27
- Posts: 2
If we have an issue on a host requiring it to be downed for maintenance (i.e. this isn't a sudden outage but one we can plan), such as to swap memory, what are the steps to gracefully transition or pause the workloads running and resume them on reboot?
Also, is there a means to tell Nimbula this node is going down so that it can transition any services to other nodes?
#2 June 17, 2011 09:18:32
- Jay Judkowitz
- Registered: 2010-11-29
- Posts: 9
aaron_dodd
If we have an issue on a host requiring it to be downed for maintenance (i.e. this isn't a sudden outage but one we can plan), such as to swap memory, what are the steps to gracefully transition or pause the workloads running and resume them on reboot?
Also, is there a means to tell Nimbula this node is going down so that it can transition any services to other nodes?
Aaron,
There are a few questions here:
1) How to deal with the control plane during planned HW maintenance?
2) How to deal with instances during short planned HW maintenance (like a simple reboot)?
3) How to deal with instances during extended planned HW maintenance (like a datacenter shutdown for a weekend)?
We can't really give out the roadmap in any detail on public forums, but I can say that we're working on #1 for a nearer-term release. For #2 and #3, there is some groundwork in a near-term release, but maybe not the end feature you are looking for.
For #2 and #3, I have some follow-up questions for you (and for anyone reading this forum discussion):
* In general, are you looking for an instance to be restarted on the same node whenever that node comes back up, or would you like us to just find another node and restart it as soon as possible?
* More specifically, if you want the instance restarted on the same node (maybe because you want to use its local storage for the virtual disk), what is the longest node downtime we should tolerate and still restart the instance there? For example, I don't think you'd want a node to come up after a week-long outage and have its instances come back up. That would be somewhat unpredictable, and the end users would most likely have done something to recover already. After what length of time is it no longer worth restarting an instance on a recovered node?
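To make the two questions above concrete, here is a rough Python sketch of the kind of restart policy they describe. It is purely illustrative and does not reflect Nimbula's API or roadmap: the `Instance` type, the `placement_decision` function, the `max_wait` parameter, and the returned action names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Instance:
    name: str
    home_node: str
    needs_local_storage: bool  # e.g. the virtual disk lives on the node's local disk


def placement_decision(instance: Instance,
                       node_down_since: datetime,
                       now: datetime,
                       node_is_up: bool,
                       max_wait: timedelta) -> str:
    """Decide what to do with an instance whose home node went down.

    Hypothetical policy, for discussion only:
    - no tie to local storage: restart on any other node right away
    - tied to local storage: wait for the home node, but only up to max_wait
    """
    if not instance.needs_local_storage:
        return "restart_elsewhere"
    if now - node_down_since > max_wait:
        # The node was down too long; users have probably recovered already.
        return "do_not_restart"
    return "restart_on_home_node" if node_is_up else "keep_waiting"


if __name__ == "__main__":
    down_at = datetime(2011, 6, 17, 9, 0)
    db = Instance("db01", "node-a", needs_local_storage=True)
    # Node returns after a 30-minute memory swap, within a 4-hour tolerance.
    print(placement_decision(db, down_at, down_at + timedelta(minutes=30),
                             node_is_up=True, max_wait=timedelta(hours=4)))
```

The answer to the second question is effectively the value of `max_wait`: past that point a recovered node would no longer bring its old instances back.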
Edited jay@nimbula.com (June 17, 2011 09:19:10)