Do you work with lots of software and need to know what it's all getting up to, when you need to know it? I work for a .Net house with over 10 years' worth of software under our belts, and we certainly have this problem.
Our full estate includes many small and large systems - web services, Windows services, .Net web apps, MVC web apps... if it's Microsoft-badged, we'll have one somewhere. Each of these systems currently logs to some mixture of database tables, flat text files and e-mail alerts, and the whole thing is becoming increasingly difficult to manage.
This blog post will be the first in a series looking at an open source solution, involving Log4Net, RabbitMq, Graylog2, ElasticSearch and NEsper, with the following capabilities:
- Aggregating logs from various distributed systems into a flexible and searchable persistent storage mechanism
- A web front-end for querying and graphing the log data
- A system capable of detecting abnormal or critical log patterns and alerting support staff
A Proposed Solution
Before I go into the background that got me here, let's get straight to the plan. The proposed solution involves the following steps:
- Supplement existing text-file logs with logging over a queue, using RabbitMq.
- Implement a first RabbitMq subscriber using Graylog2 Server and ElasticSearch storage, for quick indexed search of all log messages from all applications.
- Implement a second RabbitMq subscriber using an event processing tool such as NEsper to handle alerting. NEsper performs CEP (Complex Event Processing) using a domain-specific query language, and works a bit like SQL without the storage. If you want to know about nondeterministic finite automata or dynamic state trees, NEsper's full of weird but clever stuff like that.
- Place a web front end over the central log storage to visualise and explore data, for example Graylog2 or Kibana.
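To make the first step concrete, here's a rough sketch of the kind of message a log publisher might put on the queue. It's in Python purely for brevity (the real implementation would be a Log4Net appender in C#); the field names follow the GELF format that Graylog2 consumes, while the application name and order details are hypothetical, and the actual publish would go through your AMQP client of choice.

```python
import json
import socket
import time

def build_gelf_message(short_message, level=6, **extra_fields):
    """Build a GELF-style log message (as a dict), ready to be
    JSON-encoded and published to RabbitMq. GELF requires custom
    attributes to carry a '_' prefix."""
    msg = {
        "version": "1.0",
        "host": socket.gethostname(),
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog severity: 6 = informational
    }
    for key, value in extra_fields.items():
        msg["_" + key] = value  # custom fields get the '_' prefix
    return msg

# Example: an order-processing app logging a failed payment.
# (App name and order id are made up for illustration.)
payload = json.dumps(build_gelf_message(
    "Payment authorisation failed",
    level=3,                     # 3 = error
    application="OrderService",
    order_id=12345,
))
```

The nice part of putting a structured message on the queue, rather than a bare text line, is that the custom fields become searchable dimensions once they land in ElasticSearch.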
There are still wrinkles I haven't thought through - for example, some applications can't be modified to log anywhere other than a text file. That may be a job for LogStash.
This Sounds Over-Engineered! Is It Really That Bad?
Let's go back to where we started off - a big-ass collection of distributed software components producing a similarly distributed collection of log files. Immediately this gives us three problems.
Drowning In Alerts
What happens when something goes seriously wrong? How do you know that your customer-facing website can't reach the database, or that your payment app has tanked? Of course, you hard-code checks for the really serious stuff into your app, and send e-mails when they fire. But if your DB goes down, you get an e-mail for every hit on your site - and that means a lot of e-mails. That kind of hard-coding is a pretty obvious risk, too.
This is where NEsper will come in. It's also the part of the puzzle I've thought about least - I just keep hearing such good things about NEsper / Esper!
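To give a flavour of what that SQL-without-the-storage style looks like, here's the sort of EPL (Event Processing Language) statement that could collapse an e-mail flood into a single alert. The `LogEvent` type, its properties and the thresholds are all hypothetical - this is a sketch of the idea, not a tested query:

```
// raise one alert if more than 50 database errors arrive within 60 seconds,
// instead of one e-mail per failed page hit
select count(*) as failures
from LogEvent(level = 'ERROR' and message like '%database%').win:time(60 sec)
having count(*) > 50
```

The sliding time window is what makes the difference: the condition is about the *pattern* of events over time, not any individual event, which is exactly the hard-coded-alert problem described above.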
WTF Did The User Actually Do!?
Faced with dozens of applications logging to scores of repositories, a common response seems to be to produce as little logging as possible. The critical stuff is easy to see - it's pretty much all there is to see. But what do you do about those non-critical bugs that happen all the time, where you need to know what the hell the user's doing to make them happen?
A common fix here is to reconfigure your app to log certain areas at a higher verbosity to catch just that bug. This is of course reactive and can take time to get right, and it means tinkering with live app configuration - not an ideal situation. The ability to log and process humongous amounts of data (people talk about Graylog2 remaining performant with 100 million messages) should fix this.
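For context, the per-area verbosity tweak described above looks something like this in Log4Net configuration - the logger and appender names here are hypothetical:

```xml
<log4net>
  <root>
    <!-- everything logs at WARN and above by default -->
    <level value="WARN" />
    <appender-ref ref="CentralAppender" />
  </root>
  <!-- temporarily turn one suspect area up to DEBUG to catch the bug -->
  <logger name="MyApp.Checkout">
    <level value="DEBUG" />
  </logger>
</log4net>
```

With cheap centralised storage, the alternative is simply to log at DEBUG everywhere, all the time, and let the search index do the filtering.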
Uh-oh... Something Screwed The Data...
Finally, there's the situation where you find data in your database that's been screwed with somewhere. Code smell or not, in reality lots of different apps can touch the same piece of data (SOA, anyone?). So when you find broken data, you may need to trawl through the logs of your front-end app, any offline processing tasks and maybe a related web service to find out what touched it. Centralising your logs should make this a painless process.
That's all for today. Next time in Part II, I'll be starting to dig into the RabbitMq implementation, AMQP and the integration with Log4Net.