I seem to be building logging infrastructure today. I keep recalling one or another of the rules for playing this game. Might as well try to put them down.
- Who? – The speaker’s unique ID and type should be in each log line.
- Transcript – The speaker’s utterances should have a serial number, so you can notice gaps.
- Checksum – A running checksum is a big help in proving things (for example, that no line was altered or dropped).
- When – The utterances should have a timestamp (daemontools multilog t is good).
- Synchronize our watches – NTP is a must everywhere.
- Breadcrumbs – Jobs/tasks/work-items/requests should have a unique ID that is threaded through the logs and across processes/modules/machines.
- Health – All processes (machines, threads …) should emit a heartbeat; heartbeats should include some health indicators so other parties can notice when they expire or get sick.
- Replay – Logs that enable a rebuild from the last snapshot will save your butt. Often you're close, and only some minor optimization (truncating output, discarding binary info, say) is preventing it. I once rebuilt an entire source repository from years of mail to prove an intrusion had not touched the sources.
- Syntax – It’s good if the logs are well tokenized, i.e. embedded strings are escaped and character encodings are worked out.
- Standardized – It’s good, but it’s hopeless. This is the worst case of the 2nd part of “Be strict in what you send, but generous in what you receive.”
- Innumerable – You can’t lay an ontology over the space of exceptions. Accept that, and then proceed as usual.
- Now – The sooner the log analysis takes place the better. Don’t wait until your patient is in intensive care. Analogy: test driven development.
- Email – The accumulated headers in modern email are full of lessons learned.
- FSEvents – The asynchronous file system journaling/notifications (of BeOS et al.) are worth looking at closely.
- Fast – I tend to accept that writing the log is not transactional, or even particularly reliable, so that I can have volume instead.
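
Several of the rules above (Who, Transcript, Checksum, When, Breadcrumbs, Syntax) fit in a few dozen lines. What follows is only a rough sketch, assuming JSON lines as the escaping scheme and SHA-256 as the running checksum; the class name `ChainedLogger`, the `verify` helper, and every field name (`who`, `seq`, `req`, `chain`, and so on) are made up for illustration and are not from any particular logging library.

```python
import hashlib
import json
import time


class ChainedLogger:
    """Writes one JSON object per line. All field names are illustrative."""

    def __init__(self, speaker_id, speaker_type, sink):
        self.speaker_id = speaker_id      # Who: unique ID of the speaker
        self.speaker_type = speaker_type  # Who: what kind of thing is speaking
        self.sink = sink                  # any callable that accepts one string
        self.seq = 0                      # Transcript: serial number per utterance
        self.chain = ""                   # Checksum: running hash of all lines so far

    def log(self, message, request_id=None):
        self.seq += 1
        record = {
            "who": self.speaker_id,
            "type": self.speaker_type,
            "seq": self.seq,
            "ts_ms": int(time.time() * 1000),  # When: only useful if clocks are synced
            "req": request_id,                 # Breadcrumbs: ID threaded across systems
            "msg": message,
        }
        # Syntax: json.dumps escapes quotes and newlines, so one record is one line.
        body = json.dumps(record, sort_keys=True)
        self.chain = hashlib.sha256((self.chain + body).encode("utf-8")).hexdigest()
        record["chain"] = self.chain
        self.sink(json.dumps(record, sort_keys=True))


def verify(lines):
    """Return a list of problems (gaps, checksum breaks) in one speaker's lines."""
    problems = []
    expected_seq = 1
    chain = ""
    for line in lines:
        record = json.loads(line)
        claimed = record.pop("chain")
        if record["seq"] != expected_seq:
            problems.append(f"gap: expected seq {expected_seq}, got {record['seq']}")
        body = json.dumps(record, sort_keys=True)
        chain = hashlib.sha256((chain + body).encode("utf-8")).hexdigest()
        if chain != claimed:
            problems.append(f"checksum mismatch at seq {record['seq']}")
            chain = claimed  # resynchronise so one bad line is reported only once
        expected_seq = record["seq"] + 1
    return problems
```

Used as `lines = []; logger = ChainedLogger("worker-7", "daemon", lines.append)`, each `logger.log(...)` call appends one line, and `verify(lines)` comes back empty until a line is dropped or edited. In the spirit of the Fast rule, the sink here is a plain callable with no durability guarantees; the serial number and checksum are what let a reader notice afterwards that something went missing.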
Here’s a logging checklist by Anton Chuvakin (posted with his permission):
http://juliusdavies.ca/logging/llclc.html
Julius (and Anton) – Thanks.
I’m somewhat ambivalent about priority (warning, error, emergency) since I’ve too often been caught in situations where emergency wasn’t, and warning was.