Introduction to monitoring with anomaly detection


In this article I’ll describe how I implemented customer activity monitoring and anomaly detection. If you are a service provider that provides services to a group of large accounts, it’s vital to know that your customers can do their business.

Monitoring customer behavior is not only required for managing IT operations; it is also vital from a business point of view. Nowadays customers are getting smart too. High-tech customers such as Netflix and Airbnb use data analytics to monitor the results they get from their payment providers. Based on this data they make real-time decisions to switch to another supplier when that better suits their needs.

A company that has hundreds of customers cannot effectively manage each customer using traditional dashboards and eyeballing. These dashboards show the combined activity of all customers and the services provided to them in a single graph. A change in a single customer’s activity will be impossible to detect unless it is a very big customer or several customers are hit by the same problem. The usual alternative is to monitor each customer’s activity separately and generate alerts when activity (transaction volume) changes in an unexpected or undesired direction. That only works if the number of alerts is low enough to be handled quickly, but companies with large numbers of customers will still receive too many alerts.

So what we need is a system that detects situations: one that can combine several individual alerts into a small number of high-level situations that can be handled by operators and business support teams.

Monitoring your customers and services

A major challenge with IT service management and monitoring is the overload of data, while what we really need is clear-cut information suitable for taking appropriate action. Machine learning can help to detect problem situations, but it turns out not to be the simple out-of-the-box solution it promises to be. It requires an upfront investment of time and effort in configuration and testing, but once implemented it can help to detect failures and reduce loss of business revenue.

What anomaly detection can and cannot do

Anomaly detection is based on the assumption that there is normal behaviour and that any aberration from normal is bad and should be handled immediately. Monitoring with anomaly detection requires you to collect data as a series of timestamped values. These time series are processed by an anomaly detector, resulting in an indication that something is wrong and (depending on the detector implementation) a series of anomaly results:

  1. actual value
  2. expected value
  3. anomaly score

This raw anomaly data needs to be translated into an alert stating the start of the event, the duration of the problem, what happened, its severity and its impact on the business. We decided to recognize four situations:

  1. Normal activity
  2. High (higher than normal activity)
  3. Low (lower than normal activity)
  4. Zero (no activity at all)

The “Zero” situation is not just detecting that there is no activity; we also must know whether it is unusual to have no activity in that time span. We want to use machine learning to learn these activity patterns and help us detect any changes. Therefore we set up ElasticSearch X-Pack machine learning to train on the transaction volume data and detect anomalies. We consider an event anomalous if its anomaly score (ranging from 0 to 100%) is above 30%.
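As a sketch, mapping each (actual, expected, score) result onto the four situations could look like the function below. The 30% threshold comes from the text above; the `min_expected` cutoff for deciding whether zero activity is unusual is my own illustration, not part of the X-Pack output:

```python
def classify(actual, expected, score, threshold=30.0, min_expected=10):
    """Map one anomaly result onto the four situations.

    actual / expected: transaction counts for the time bucket,
    score: anomaly score in the range 0-100.
    """
    if actual == 0 and expected >= min_expected:
        return "ZERO"    # no activity where activity was clearly expected
    if score < threshold:
        return "NORMAL"  # not anomalous enough to act on
    return "HIGH" if actual > expected else "LOW"

# classify(actual=0, expected=120, score=95)   -> "ZERO"
# classify(actual=80, expected=85, score=5)    -> "NORMAL"
# classify(actual=300, expected=100, score=60) -> "HIGH"
```

Note that a zero count during a period where almost no activity is expected falls through to the score check, so quiet nights do not raise “Zero” alerts.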

With this information we can detect when customers have unusually low or no activity at all, which has a severe impact on business revenue. These alerts can also be an indicator of problems that lie outside the company (for example, external network failures).
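The translation from per-minute situations into alerts with a start time and duration can be sketched as a run-length grouping of consecutive non-normal buckets (the tuple layout is my own illustration):

```python
from itertools import groupby

def to_alerts(buckets):
    """Collapse consecutive non-NORMAL buckets into alert events.

    buckets: list of (minute, situation) pairs, sorted by time and
    contiguous, where situation is one of NORMAL/HIGH/LOW/ZERO.
    Returns alerts as (situation, start_minute, duration_minutes).
    """
    alerts = []
    for situation, run in groupby(buckets, key=lambda b: b[1]):
        run = list(run)
        if situation != "NORMAL":
            alerts.append((situation, run[0][0], len(run)))
    return alerts

# to_alerts([(0, "NORMAL"), (1, "LOW"), (2, "LOW"), (3, "ZERO"), (4, "NORMAL")])
# -> [("LOW", 1, 2), ("ZERO", 3, 1)]
```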

Anomaly detection with ElasticSearch

To detect anomalies we used ElasticSearch machine learning, because ElasticSearch is used by many companies as a monitoring tool and its machine learning is relatively easy to use.


Data is collected at 1-minute intervals and stored in ElasticSearch, which acts as a database. X-Pack machine learning automatically processes the data and generates a stream of triples (actual value, expected value and an anomaly score of 0–100%).

Anomalies

Although the above image of a single instance shows some promising results, you have to take a closer look. For this I developed a framework based on RStudio, a professional tool for data scientists, which allows me to process and visualize the monitoring data for review.


RStudio scripts can be used to generate PDF reports and even web pages with dynamic content.

With RStudio I collected the data stored in ElasticSearch to generate HTML and PDF reports with detail pages for each customer’s transaction volume and its anomalies. First I looked at the shape of daily activity. Some customers had nice daily activity patterns with a clear day/night cycle and weekend patterns. But some had very irregular behaviour that showed no pattern at all. This explains why the anomaly detector cannot detect anomalies for all customers.

Transaction volume patterns over several days

To get a better understanding, I also plotted the anomaly events for each customer against its average activity.


This image of a particular customer shows that between 7:00 and 15:00 the activity is so low that it frequently touches the bottom (zero activity). The green area is a 99.95% probability band around the average. The red line is drawn where activity is below 10 with a probability of 99.95%. During this red period it is impossible to predict zero-activity anomalies.
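A rough way to find such “red” periods is to compute, per minute of day, a band around the average over many days and mark the minutes where even the top of the band stays below the floor. This is only a sketch under a normal approximation; the Z value, the floor of 10 and the data layout are my own illustration:

```python
import statistics

Z = 3.29  # roughly the 99.95% point of a normal distribution

def red_minutes(history, floor=10):
    """Find minutes-of-day where activity is almost surely below `floor`.

    history: dict mapping minute_of_day -> list of counts observed at
    that minute over many days. A minute is "red" when even the upper
    edge of the band (mean + Z * stdev) stays below the floor, so a
    drop to zero there cannot be told apart from normal behaviour.
    """
    red = []
    for minute, counts in sorted(history.items()):
        mean = statistics.fmean(counts)
        sd = statistics.stdev(counts) if len(counts) > 1 else 0.0
        if mean + Z * sd < floor:
            red.append(minute)
    return red
```

A quiet morning minute with counts like `[1, 2, 1, 0, 2]` comes out red; a busy one with counts around 55 does not.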

After a more thorough review of all the results, we concluded that out of the box anomaly detection seemed not to work that well.

  • short dips in active periods were not detected
  • longer drops during active periods were sometimes missed
  • a major breakdown that couldn’t be missed was only detected after 30 minutes

The problem is in the data as well as in the configuration. An anomaly detector requires lots of data to train itself and learn the activity patterns. Only after several weeks is there sufficient data to recognize patterns. Data is collected and aggregated in time buckets, and the machine learning job has to be carefully configured. The X-Pack bucket span setting turned out to be a critical factor. After some tweaking, results started to get better, but still not good enough.
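For orientation, the bucket span lives in the job’s `analysis_config`. A minimal job definition might look like the fragment below; the job name, field names and the 15-minute span are illustrative, not our production values:

```json
PUT _ml/anomaly_detectors/customer-volume
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "count", "by_field_name": "customer_id" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
```

A short bucket span reacts faster but sees noisier data; a long span smooths the noise at the cost of detection delay, which is exactly the trade-off behind the 30-minute detection lag above.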

But there is another problem: anomaly detection assumes that your data contains a relatively small number of anomalies, say less than 0.5 percent. If your data has lots of problems, the anomaly detector will start to ignore them. We decided that although anomaly detection has its benefits, we cannot rely on anomaly detection alone. We need a system that provides us with a second opinion.

Are Anomalies Rare?

Depending on how you look at it, anomalies are either rare or common. If 99.73% of your data is normal, then about three observations per thousand will be considered anomalous. That sounds pretty rare, but given that there are 1,440 minutes per day, you’ll still be flagging about 4 observations as anomalous every single day for each metric. If you have hundreds of metrics, you end up with far too many alerts.
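The arithmetic above is worth making explicit (the figure of 200 metrics is my own example):

```python
# Back-of-the-envelope: how many "rare" anomalies per day?
MINUTES_PER_DAY = 24 * 60         # 1440 one-minute buckets
anomalous_fraction = 1 - 0.9973   # 0.27% falls outside the "normal" band

per_metric_per_day = MINUTES_PER_DAY * anomalous_fraction
print(round(per_metric_per_day, 1))  # ~3.9 flags per metric per day

metrics = 200
print(round(per_metric_per_day * metrics))  # hundreds of flags per day
```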

Looking for better solutions

Because monitoring based on anomaly detection alone is not enough, I tried something else. I wanted to process and reduce the number of alerts using techniques such as Complex Event Processing and the procedures described by the Alerta monitoring system.

To try out this approach I used some RStudio scripting. I created a list of alerts with their type, start and end time. Alerts of type “Zero activity” could be created without any data from the anomaly detector. For high and low activity I still needed the anomaly detector, so this approach combines the anomaly detector’s output with home-grown scripting. I also found that there were many situations in which several customers had the same problem over almost the same period of time, so I combined these cases, using the shared evidence to make the case stronger.
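Combining per-customer alerts that overlap in time can be sketched as a simple interval merge. This is an illustration of the idea, not the actual RStudio script; the record layout and the 5-minute slack are my own assumptions:

```python
def combine_into_situations(alerts, slack=5):
    """Group per-customer alerts that overlap in time into situations.

    alerts: list of dicts with 'customer', 'start' and 'end' (minutes).
    Alerts whose time windows overlap (allowing up to `slack` minutes
    of gap between them) are merged into one situation.
    """
    situations = []
    for alert in sorted(alerts, key=lambda a: a["start"]):
        if situations and alert["start"] <= situations[-1]["end"] + slack:
            current = situations[-1]
            current["end"] = max(current["end"], alert["end"])
            current["customers"].add(alert["customer"])
        else:
            situations.append({
                "start": alert["start"],
                "end": alert["end"],
                "customers": {alert["customer"]},
            })
    return situations
```

Two customers failing over the same window then collapse into a single situation listing both, which is exactly the evidence-combining effect described above.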

This approach of combining evidence turned out to produce more reliable alerts and reduced their number, thereby reducing operator workload. In some cases more than 20 alerts could be combined into a single situation. The data in my reports showed that this approach worked. A practical approach to alert handling will be described in a follow-up to this post: “Handling alerts and situations”.

Conclusion and next steps

With this project I learned a lot about anomaly detection. To me there is still a lot of magic going on. With pre-packaged products such as ElasticSearch machine learning there is not much you can do to tune or alter the results. If you need more control, you have to spend more time and build your own machine learning model.

Out of the box, ElasticSearch anomaly detection needs post-processing to create alerts.

Now I’m working on a system for alert processing based on the Drools Business Rules Management System. Drools will process the results in real time, create alerts and send them to the Alerta alert manager. JBoss Drools allows me to do advanced Complex Event Processing. Currently I have a first proof of concept of Drools running. I’m also looking into R for building my own prediction models.

Lessons learned

  • Managing expectations: monitoring with open source anomaly detection is both science and art, and both require time (if you don’t have time, hire a data scientist or buy a good out-of-the-box product; the challenge is determining what is GOOD)
  • Setting up anomaly detection requires lots of data and enough time for tweaking to get it right. There is no documentation telling you exactly what to do (that’s part of the fun of anomaly detection)
  • Anomaly detection only detects anomalies (fine for clean systems with rare errors, not so good for detecting errors that happen too often)
  • Even with advanced anomaly detection, you still need something intelligent to combine all the events, cluster them and draw conclusions. Otherwise you just end up with too many stupid alerts.
  • I have learned a lot just by doing all this, and this knowledge will give me an advantage.
  • I think out-of-the-box monitoring solutions give disappointing to average results because they are designed to work (do something) under all conditions. Because conditions at your company are very specific (your company is unique), an out-of-the-box product will be mediocre at best.

To get the best monitoring system, you should invest a lot of time in understanding how your system and your customers behave. Only then can you start to design and configure a suitable monitoring solution based on best-of-breed tooling.

