After we were able to recover the server, it tried very hard to catch up - there were a large number of processes that it hadn't run for almost 3 days. Some of them failed because they ran out of memory, which is a balancing act we usually manage by simply scheduling them at different times.
So my question is: can you please help me avoid logjams like this when recovering from future outages? I know we have previously changed some of the triggers with respect to Action on Misfire, and I do not know whether this is related (we changed that because it was triggering the action when the event was activated). I am just thinking of general plans we can implement, and then I may have specific questions as I apply any such plans to our processes.