Coding and Dismantling Stuff

Don't thank me, it's what I do.

About the author

Russell is a .Net developer based in Lancashire in the UK.  His day job is as a C# developer for the UK's largest online white-goods retailer, DRL Limited.

His weekend job entails alternately demolishing and constructing various bits of his home, much to the distress of his fiancée Kelly, 3-year-old daughter Amelie, and menagerie of pets.


Enterprise Logging and Alerting with Graylog2, RabbitMq and NEsper

Do you work with lots of software and need to know what it's all getting up to, when you need to know it? I work for a .Net house with over 10 years' worth of software under our belts, and we totally have this problem.

Our full estate includes many small and large systems - web services, Windows services, .Net web apps, MVC web apps... if it's Microsoft-badged, we'll have one somewhere. Each of these systems currently logs to a mixture of database tables, flat text files and e-mail alerts, and the whole thing is becoming increasingly difficult to manage.

This blog post will be the first in a series looking at an open source solution, involving Log4Net, RabbitMq, Graylog2, ElasticSearch and NEsper, with the following capabilities:

  • Aggregating logs from various distributed systems into a flexible and searchable persistent storage mechanism
  • A web front-end for querying and graphing the log data
  • A system capable of detecting abnormal or critical log patterns and alerting support staff

A Proposed Solution

Before I go into some of the background that's led me here, let's get straight to the planned solution:

Proposed Solution Architecture

My proposed solution involves the following steps:

  • Supplement existing text-file logs with logging over a queue, using RabbitMq.
  • Implement a first RabbitMq subscriber using Graylog2 Server and ElasticSearch storage, for quick indexed search of all log messages from all applications.
  • Implement a second RabbitMq subscriber using an event processing tool such as NEsper to handle alerting. NEsper performs CEP (Complex Event Processing) using a domain-specific query language, and works a bit like SQL without the storage. If you want to know about nondeterministic finite automata or dynamic state trees, NEsper's full of weird but clever stuff like that.
  • Place a web front end over the central log storage to visualise and explore the data, for example Graylog2 or Kibana.

There are still wrinkles that I haven't thought through - for example, some applications can't be modified to log anywhere other than a text file. This may be a job for LogStash.
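To make the first step concrete, here's a sketch of the kind of structured payload that could travel over the queue. It's in Python purely for illustration (in this stack the real producer would be a Log4Net appender in C#), and the field names follow the GELF convention that Graylog2 consumes; the host, app and order_id values are made up.

```python
import json
import time

def build_gelf_message(host, short_message, level=6, **extra_fields):
    """Build a GELF-style dict (Graylog Extended Log Format).

    Core fields (version, host, short_message, timestamp, level) are
    fixed by the format; custom fields are prefixed with an underscore
    so they can't clash with core ones.
    """
    message = {
        "version": "1.0",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog severity: 6 = informational
    }
    for key, value in extra_fields.items():
        message["_" + key] = value
    return message

# A (hypothetical) payment app reporting a declined card transaction:
payload = json.dumps(build_gelf_message(
    "payments01", "Card transaction declined",
    level=3,  # syslog severity 3 = error
    app="PaymentService", order_id=12345))
```

Because every application emits the same shape of message, the subscribers downstream (Graylog2 for search, NEsper for alerting) can treat the whole estate as one stream.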

This Sounds Over-Engineered! Is It Really That Bad?

Let's go back to where we started off - a big-ass collection of distributed software components producing a similarly distributed collection of log files. Immediately this gives us three problems.

Drowning In Alerts

What happens when something goes seriously wrong? How do you know that your customer-facing website can't reach the database, or that your payment app has tanked? Of course, you hard-code checks for the really serious stuff into your app, and send emails when they fire. Umm... if your DB goes down, you get an email for every hit on your site... that means lots of emails. Also, that kind of hard-coding is a pretty obvious risk.

This is where NEsper will come in. It's also the part of the puzzle I've thought least about; I just keep hearing such good things about NEsper / Esper!
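To give a flavour of what a CEP rule would do here, below is a toy Python stand-in for a query along the lines of "count errors in a sliding time window, and alert once when the count passes a threshold" - the sort of thing NEsper expresses in a few lines of EPL. Everything in it (class name, threshold, alert text) is illustrative, not NEsper's actual API.

```python
from collections import deque

class WindowedAlert:
    """Toy sliding-window alert rule.

    Roughly the behaviour of a CEP query such as
    'count ErrorEvents over the last 60 seconds, having count > N':
    fire a single alert when the window overflows, instead of one
    e-mail per failing page hit.
    """
    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self.events = deque()
        self.fired = False

    def on_error(self, timestamp):
        self.events.append(timestamp)
        # Evict events that have slid out of the time window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        if len(self.events) > self.threshold and not self.fired:
            self.fired = True
            return "ALERT: error storm"  # one e-mail, not thousands
        return None
```

A DB outage that generates an error per page hit then produces exactly one alert per window, which is the whole point of putting a CEP engine between the queue and the support inbox.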

WTF Did The User Actually Do!?

Faced with dozens of applications logging to scores of repositories, a common response seems to be to produce as little logging as possible. The critical stuff is now easy to see, because it's pretty much all there is to see. But what do you do with those non-critical bugs that happen all the time, when you need to know what the hell the user's doing to make them happen?

A common fix here is to reconfigure your app to log certain areas at a higher verbosity to catch just that bug - this is of course reactive and can take time to get right, plus it means tinkering with live app configuration, which is not an ideal situation. The ability to log and process humongous amounts of data (people talk about Graylog2 remaining performant with 100 million messages) should fix this.
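For what it's worth, the "turn up the verbosity in one area" trick relies on hierarchical logger names, a model Log4Net shares with most logging frameworks. Here's the idea sketched with Python's standard logging module; the "myapp.payments" namespace is invented for illustration.

```python
import logging

# Hierarchical loggers: turn one suspect area up to DEBUG while the
# rest of the app stays at WARNING. (An illustrative Python analogue
# of setting a per-namespace <logger> level in Log4Net config.)
logging.basicConfig(level=logging.WARNING)
logging.getLogger("myapp").setLevel(logging.WARNING)
logging.getLogger("myapp.payments").setLevel(logging.DEBUG)

noisy = logging.getLogger("myapp.payments")   # suspect area: verbose
quiet = logging.getLogger("myapp.checkout")   # inherits WARNING from "myapp"

assert noisy.isEnabledFor(logging.DEBUG)
assert not quiet.isEnabledFor(logging.DEBUG)
```

The pain point is that in a live app this change means editing deployed configuration; with everything already flowing into a central store at full verbosity, you filter at query time instead.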

Uh-oh... Something Screwed The Data...

Finally, there's the situation where you find something in your database that's been screwed with somewhere. In reality, as obvious a code smell as it may be, lots of different apps can touch the same piece of data (SOA anyone?). So when you find broken data, you may need to trawl through logs for your front end app, any offline processing tasks, maybe a related web service, to find out what touched the data. Centralising your logs should make this a painless process.
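As a sketch of why centralising the logs makes this painless: once every application's messages sit in one store with a shared correlation field, "what touched this data" becomes a single filtered, time-ordered query. The record shapes and the "_order_id" field below are hypothetical.

```python
# Hypothetical merged log records from three separate applications,
# all carrying the same correlation field on each message.
web_logs = [
    {"timestamp": 1, "source": "web", "_order_id": 42, "message": "order placed"},
    {"timestamp": 5, "source": "web", "_order_id": 99, "message": "order placed"},
]
service_logs = [
    {"timestamp": 2, "source": "payment-svc", "_order_id": 42, "message": "card charged"},
]
batch_logs = [
    {"timestamp": 9, "source": "nightly-job", "_order_id": 42, "message": "order archived"},
]

def trail(order_id, *log_sources):
    """Everything that touched one order, across all apps, in time order."""
    merged = [r for logs in log_sources for r in logs
              if r["_order_id"] == order_id]
    return sorted(merged, key=lambda r: r["timestamp"])

history = trail(42, web_logs, service_logs, batch_logs)
```

Instead of trawling three sets of files by hand, the answer falls out of one query - which is essentially what the Graylog2/ElasticSearch layer gives you over the real message stream.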

Next Time...

That's all for today. Next time in Part II, I'll be starting to dig into the RabbitMq implementation, AMQP and the integration with Log4Net.


Categories: Architecture | Linux

Comments (3)

nitzu (Romania)

26 June 2012 12:31

Nice article! I think it will be a challenge to configure all these Linux-friendly components on a Windows-based host. Can't wait to see the result.

red_square (United Kingdom)

29 June 2012 00:18

Why not use ZeroMq for this? RabbitMq is great, especially in durable HA scenarios, but that's not the case for app logging, as it is not so crucial that messages are guaranteed to be delivered. The volume of log messages will, I assume, be large, so I would go for the fastest queue available. Who cares if a few go missing (unless you're using logging as a poor man's CQRS event store!)

russell (United Kingdom)

30 June 2012 21:49

I must confess Steve, I'd not come across CQRS before! We're running an e-commerce site and a whole bunch of back-end processes (including payment), where theoretically we could see events/failures that we really care about - for example, an error handling card transactions.

I have heard ZeroMq is insanely performant - what's the toolset like?

