Somebody drew my attention to Netflix’s Chaos Monkey. This is a most excellent design trick. The monkey randomly kills parts of your system. His presence keeps developers from falling into the illusion that failures won’t happen. I’ve used this technique, though not enough. It’s great. One nice feature is that you can juice up the monkey during testing to ensure more failures occur.
What I’ve not done, and I don’t think I’ve seen suggested, is to manage the distribution of failures so that it matches what happens in the real world. That is, things usually fail one here, one there, but occasionally they fail in bunches. It occurs to me that you really ought to have a barrel of monkeys, and they should exhibit a tendency to trigger cascades of failures.
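Here is a rough sketch of what I have in mind, in Python. It is a toy simulation, not how Chaos Monkey itself works: the function name `barrel_of_monkeys` and the knobs `base_rate` and `cascade_rate` are all made up for illustration. Each round, every component fails independently with a small probability, and then every failure gets a chance to drag one more survivor down with it, so most rounds see zero or one failure and a few see a bunch.

```python
import random


def barrel_of_monkeys(components, rounds, base_rate=0.02, cascade_rate=0.5, rng=None):
    """Simulate `rounds` of failures; return one list of killed components per round.

    base_rate and cascade_rate are hypothetical tuning knobs, not measured values.
    """
    if rng is None:
        rng = random.Random()
    batches = []
    for _ in range(rounds):
        alive = list(components)
        rng.shuffle(alive)

        # Ordinary monkeys: each component fails on its own with a small probability.
        killed = [c for c in alive if rng.random() < base_rate]
        survivors = [c for c in alive if c not in killed]

        # Cascade: every failure (including cascaded ones) may take down
        # one more randomly chosen survivor.
        pending = len(killed)
        while pending and survivors:
            pending -= 1
            if rng.random() < cascade_rate:
                killed.append(survivors.pop())
                pending += 1

        batches.append(killed)
    return batches


if __name__ == "__main__":
    parts = [f"node-{i}" for i in range(20)]
    for batch in barrel_of_monkeys(parts, rounds=10, rng=random.Random(42)):
        print(len(batch), batch)
```

Turning up `base_rate` is the juiced-up monkey from testing; turning up `cascade_rate` makes the bunches bigger. In a real setup you would want both numbers to come from your own outage history rather than from guesses like these defaults.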
Back in the day there were folks at IBM who would test disaster recovery (DR) preparedness by showing up at a site unannounced, pulling the breaker, and observing the results.