We build monitoring frameworks like the one I outlined in “Listening to the System” for at least four reasons. There may be legal requirements that we keep records for later auditing and dispute resolution. We may want to monitor the system so we can remain in control. We may want to collect data in service of tuning the system, say to reduce cost or improve latency. And then there is debugging. Audit, control, tuning, and debugging are, of course, not disjoint categories.
Good monitoring will draw our attention to surprising behaviors. Surprising behaviors trigger debugging projects. The universe of tools for teasing surprising behavior out of systems is very large. Years ago, when I worked at BBN, the acoustics guys were working on a system that listened to the machine-room noise on a ship, hoping to sense that something anomalous was happening.
This morning I attended a talk, “Using Influence to Understand Complex Systems,” by Adam Oliner (the same talk, delivered by his coauthor Alex Aiken, is on YouTube), and I was again reminded of how often you can do surprisingly effective things with surprisingly simple schemes.
Adam and Alex are tackling an increasingly common problem. You have a huge system with numerous modules. It is acting in surprising ways. You’ve got a vast pile of logging data from some of those modules. Now what do you do?
Their scheme works as follows. Convert each data stream into a metric that roughly measures how surprising the behavior was at each interval in time. Then do time-series correlation between the modules. That lets you draw a graph: module A influences B (i.e., surprising behavior in A tends to precede surprising behavior in B). You can also have arcs that say A and B tend to behave surprisingly at the same time. These arcs are the influence mentioned in their title.
If you add a pseudo-module to represent the anomalous behavior you’re investigating, then the graph can give you some hints about where to investigate further.
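I haven’t seen their code, but the correlation step is simple enough to sketch. Here is a rough Python version, assuming you already have a surprise score per module on a shared time grid (one crude way to get such a score is sketched further down); the lag range and the threshold are knobs of my own invention, not theirs.

```python
import numpy as np

def lagged_corr(x, y, lag):
    """Correlation of x against y shifted `lag` steps later."""
    if lag:
        x, y = x[:-lag], y[lag:]
    if len(x) < 2 or x.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

def influence_edges(surprise, max_lag=5, threshold=0.5):
    """Build influence arcs from per-module surprise series.

    surprise: dict of module name -> surprise scores on a shared time grid.
    An arc (A, B, lag, c) means surprise in A tends to precede surprise
    in B by `lag` steps; lag 0 means they tend to be surprising together.
    To chase a specific mystery, add a pseudo-module whose series is 1
    when the anomalous behavior was observed and 0 otherwise.
    """
    edges = []
    for a in surprise:
        for b in surprise:
            if a == b:
                continue
            # Pick the lag at which A best predicts B's surprise.
            lag, c = max(
                ((lag, lagged_corr(surprise[a], surprise[b], lag))
                 for lag in range(max_lag + 1)),
                key=lambda lc: lc[1])
            if c >= threshold:
                edges.append((a, b, lag, c))
    return edges
```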
At first blush you’d think that you need domain expertise to convert each log into a metric of how surprising the log appears at that point in time. But statistics is fun. So they adopted a very naive scheme for converting logs into time series of surprise.
They discard everything in the log except the intervals between messages. Then they keep a long-term and a short-term histogram of those intervals. The surprise is a measure of how different the two histograms look. The only domain knowledge is deciding what short-term and long-term mean.
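A rough Python sketch of that idea, with the window lengths, the bin edges, and the distance between the two histograms all chosen arbitrarily by me:

```python
import numpy as np

def surprise_signal(timestamps, short_win=60.0, long_win=3600.0, step=60.0,
                    bins=np.logspace(-3, 3, 31)):
    """Score how surprising one module's log looks over time.

    timestamps: message times (seconds) for a single module.
    At each step, histogram the inter-message intervals seen in the short
    recent window and in the long historical window, and report how far
    apart the two histograms are (total variation distance here).
    Choosing "short" and "long" is the only domain knowledge required.
    """
    ts = np.sort(np.asarray(timestamps, dtype=float))
    intervals = np.diff(ts)
    ends = ts[1:]                      # time at which each interval ends
    times, scores = [], []
    t = ts[0] + long_win
    while t <= ts[-1]:
        recent = intervals[(ends > t - short_win) & (ends <= t)]
        history = intervals[(ends > t - long_win) & (ends <= t)]
        p, _ = np.histogram(recent, bins=bins)
        q, _ = np.histogram(history, bins=bins)
        if p.sum() and q.sum():
            p, q = p / p.sum(), q / q.sum()
            times.append(t)
            scores.append(0.5 * np.abs(p - q).sum())
        t += step
    return np.array(times), np.array(scores)
```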
The talk includes a delightful story about applying this to a complex robot’s naughty behaviors: the influence graph drew attention first to the portion of the system at fault, and then revealed the existence of a hidden component where the problem was actually hiding out. Good fun!
I gather that they don’t currently have a code base you can download and apply in-house, but the system seems simple enough that cloning it looks straightforward.
They would love to have more data to work on. So if you have a vast pile of logs for a system with lots and lots of modules, and you’re willing to reveal the inter-message timestamps, module names, and some information about when mysterious things were happening, I suspect they would be enthusiastic about sending you back some pretty influence graphs to help illuminate your mysterious behaviors.
It would be fun to apply this to some social-interaction data (email/IM/commit logs). I suspect the histograms would need to be tinkered with a bit to better match the distributions seen in such natural systems. Just trying various signals for what denotes surprising behavior on the part of the participants in the social network would be fun. And it would be cool to reveal that when Alice acts in a surprising way, shortly thereafter Bob does too; and a bit later the entire group descends into a flame war.