Looking forward to ROSCon 2018, we're highlighting presentations from last year. Registration for ROSCon 2018 is currently open, as is the Call for Proposals.
Twinkle Jain and Gene Cooperman present how they are using DMTCP to checkpoint ROS processes.
Video
Abstract
The ROS master is well known to be a single point of failure. The DMTCP open-source package for transparent checkpoint-restart was recently extended to support checkpoint-restart for the ROS master. After a failure, the ROS master is rolled back and resumed from the last checkpoint. Checkpoints can be performed as often as every few seconds. The DMTCP plugin model also allows users to add plugins that model and restart their external devices in a state equivalent to that at checkpoint. Finally, we speculate on the potential of DMTCP's distributed mode to support a global restore with appropriate plugins in the future.
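To make the rollback idea in the abstract concrete, here is a minimal toy sketch in Python. It does not use DMTCP or ROS at all: `ToyMaster`, `Checkpointer`, and `demo` are hypothetical names invented for illustration, and the "checkpoint" is just an in-memory copy of a dictionary, whereas DMTCP saves the state of a real process transparently. The only point is to show that anything registered after the last checkpoint is lost when the process is rolled back.

```python
import copy


class ToyMaster:
    """Hypothetical stand-in for a registry-like process (not the ROS master)."""

    def __init__(self):
        self.registry = {}

    def register(self, topic, node):
        # Record that `node` publishes on `topic`.
        self.registry.setdefault(topic, []).append(node)


class Checkpointer:
    """Keeps the most recent snapshot of a ToyMaster's state (illustrative only)."""

    def __init__(self, target):
        self.target = target
        self.snapshot = None

    def checkpoint(self):
        # Deep-copy the state so later mutations do not leak into the snapshot.
        self.snapshot = copy.deepcopy(self.target.__dict__)

    def restart(self):
        # Build a fresh object whose state equals the last checkpoint.
        if self.snapshot is None:
            raise RuntimeError("no checkpoint available")
        restored = ToyMaster()
        restored.__dict__.update(copy.deepcopy(self.snapshot))
        return restored


def demo():
    master = ToyMaster()
    cp = Checkpointer(master)
    master.register("/scan", "lidar_driver")
    cp.checkpoint()
    # This registration happens after the checkpoint, so a rollback loses it.
    master.register("/odom", "wheel_encoder")
    # Simulate a failure followed by a restart from the last checkpoint.
    master = cp.restart()
    return master.registry
```

Calling `demo()` returns a registry that contains only the `/scan` entry. In this toy model, checkpointing more often shrinks the window of state that a rollback can lose, which is one way to read the abstract's note that checkpoints can be taken every few seconds.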
Slides
View the slides here