Director of AI Research Ram explains how ManageEngine's tools predict anomalies, the long-term utility of human-in-the-loop AI, and how his team has used sentiment analysis and transfer learning to overcome a lack of data.
EPISODE 5
[INTRODUCTION]
[00:00:00] RR: It is going to become something like databases, where almost every piece of software is going to be AI-powered, but fewer and fewer people are going to talk about it.
[00:00:11] RS: Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists, and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we're about to learn How AI Happens.
[INTERVIEW]
[00:00:41] RS: Recently on How AI Happens, we featured some young companies building exciting new AI tools. But as you all know, AI isn't just the domain of the tech-savvy futurist entrepreneur. IT management firm ManageEngine has a decade-long relationship with AI, and they're positioned at the intersection between developing new technologies and retrofitting that tech into legacy processes. Their Director of AI Research, Ram, has been with the company for over 10 years, and in that time, he's seen the company's AI approach go from pet project to massive integration campaign. Here's Ram, providing some background on how ManageEngine's AI research team has gone from 1 to 150 engineers, and from 0 to 5 billion calls per month to their AI platform.
[00:01:31] RR: I've been with ManageEngine for close to a decade now. I started off in 2011, and what started as a very small research project was us trying some sentiment analysis on our service delivery tools. AI was just coming out of the AI winter, and the AI spring was just about to start. That was a very defining time: we had Hadoop, big data, we had the whole MapReduce thing going on, and then we had some aspect of Bayesian programming come in. That is when, with people like Andrew Ng, a lot of academic literature started coming in with respect to AI, and the power of neural networks started coming out. And my goal was just to see, how would an enterprise software developer make AI useful to their end users? Is it all just hype? Because AI is known to over-promise and under-deliver; that's why we have seen so many AI winters.
We started really small, with a very experimental approach. It was a one-member team, and I was trying to integrate AI into a few of our existing business verticals. At ManageEngine, we cater to IT departments. So, we have an all-in-one suite for your IT operations: we have log management, we have a monitoring suite, we have the security suite, we have the endpoint suite, and then we have the service delivery suite. I set out on a journey to see how AI would fit into these tools. And today we are successfully doing about 5 billion-plus calls a month to the AI platform that we have set up. So that is where we are right now.
I built the AI team from scratch. We started off as a one-member team, like I said, and today we are about 150 engineers doing AI across the company. Broadly, we are three divisions today: there is a team that does statistical machine learning, there is a team that does computer vision, and then there is a team that does natural language processing. All of these put together is our AI research team.
[00:03:21] RS: The differences between consumer AI products and enterprise products are far-ranging in terms of data, trust, and anomaly detection. Ram set the table by outlining some of the challenges and opportunities of AI in the IT and security space.
[00:03:36] RR: The challenge with enterprise software, at least to me: at ManageEngine in 2011, we had products that were at least 10 years old, right? So, we were not a startup that was just doing AI. My goal was to retrofit AI into products that had existed for over a decade. This was the first challenge. Now, I learned the math, I understood how to code it, I had frameworks like Spark and Hadoop that helped me implement those algorithms. But the challenge was the retrofitting. Traditionally, engineering managers and product managers have this job description of code extensibility, modularity, reusability, and things like that. But that is when we realized that an important part of a product manager's job is going to be the kind of data points his product generates. He has to orchestrate that properly, because only then does your product become AI-ready.
So, we had this as the first challenge, where we had to get the data points we needed. And we had to make sure they were all clean, meaning they were all in sync with each other. For example, we had 100 different date formats, just date formats, used across ManageEngine. We had data that had existed in silos over the years. AI is something that erodes data boundaries. So, we had to bring the data into a consumable format, and then we had to see if it would really offer us good predictions. It was a steep climb for us to retrofit AI into our existing products.
Today, we broadly have IT sector products. Let's say we have a monitoring suite, we have a security and log management suite, and then we have a service delivery suite. We went the typical product manager way and said, we'll pick the lowest-hanging fruit. And apparently, all of them had time series data. So, we started off with very simple anomaly detection, and we continued building on it. Today, our anomaly detection platform sits at the bottom of our AI stack and feeds into other, more complex algorithms, so that we are able to get a lot of insights out of it. This was how we started in the enterprise: the whole definition of product slash engineering management changed, we had to make sure data came out of the silos without any data boundaries, and we had AI giving actionable information to our end users.
[00:06:02] RS: We talk about the importance of data plenty on the show, and we're going to continue, if only to satisfy my own curiosity about the myriad ways companies are procuring it. In ManageEngine's case, they've utilized sentiment analysis and transfer learning to inform their tech.
[00:06:18] RR: A lot of times we have data. Let us take sentiment analysis on your service desk tickets as an example. You have a lot of sentiment analysis datasets that are available outside. They are probably e-commerce review or movie review datasets which are beautifully annotated. But you don't have data that is specific to your domain, which is enterprise IT support. Because it's an enterprise setup, your emails are very formal. And today, there are a lot of avenues to go purchase data, meaning there are a lot of companies that actually sell data, and this is one way you can procure your data. But at ManageEngine, what we do is basically rely on techniques like transfer learning. We bootstrap our models with the limited amount of data that is available in our domain.
So, let's say we bootstrap our service desk sentiment analysis with the help of a movie review dataset. The challenge is, the language used in movie reviews can be a lot more casual, but the same does not happen when you're writing an official email. This was one place where we used transfer learning. We use transfer learning a lot; these days, we are continuously investing in that direction, to make sure our models work with the limited amount of data that is available.
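To make that concrete, here is a minimal sketch of the kind of transfer-learning bootstrap Ram describes: start from a sentiment model already fine-tuned on movie reviews, then continue training on a small set of labeled service-desk tickets. The checkpoint name, labels, and example tickets are illustrative assumptions, not ManageEngine's actual pipeline.

```python
# Hypothetical sketch: bootstrap a service-desk sentiment model from a
# movie-review checkpoint (SST-2) via transfer learning.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "distilbert-base-uncased-finetuned-sst-2-english"  # trained on movie reviews
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)

# Tiny in-domain set: formal IT-support phrasing, unlike casual movie reviews.
tickets = [
    ("My laptop has failed to boot for the third time this week.", 0),  # 0 = negative
    ("Thank you, the VPN issue was resolved very quickly.", 1),         # 1 = positive
]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(tickets, batch_size=2, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                  # a few epochs is enough for a small bootstrap
    for batch in loader:
        loss = model(**batch).loss  # classification loss against the in-domain labels
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The key design choice is that the movie-review checkpoint already encodes what "sentiment" looks like; the small in-domain set only has to teach the model the formal register of enterprise email.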
And talking about time series data and anomaly detection: thanks to cloud, the whole idea of IT has become very flexible, right? It keeps on changing. With cloud, you can just spin instances up, and you can turn them off when you don't need them. It is all very elastic, and it moves like crazy. The same thing is true with security. You cannot really use a supervised learning mechanism when you build AI models to identify security threats, because nobody is going to come back to you with the same security threat again. Security threats are going to evolve. So, we have taken this approach of semi-supervised learning, where we just teach the model what is normal, right? And anything that deviates from it will be considered abnormal. It falls into three buckets: Is this a frequent pattern? Is this a rare pattern? Or is this an unseen pattern?
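As a rough illustration of those three buckets, the sketch below learns a frequency profile of "normal" event patterns from a baseline window and classifies new patterns as frequent, rare, or unseen. The event names and cutoff are invented for illustration.

```python
# Toy sketch of frequent / rare / unseen bucketing: learn what "normal"
# looks like during a baseline window, then score new event patterns.
from collections import Counter

def build_profile(history):
    """history: list of hashable event patterns seen during the baseline window."""
    counts = Counter(history)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

def bucket(pattern, profile, rare_cutoff=0.01):
    if pattern not in profile:
        return "unseen"       # never observed during training: highest suspicion
    if profile[pattern] < rare_cutoff:
        return "rare"         # observed before, but unusual
    return "frequent"         # part of normal behaviour

baseline = ["login_ok"] * 950 + ["login_fail"] * 45 + ["pw_reset"] * 5
profile = build_profile(baseline)
print(bucket("pw_reset", profile))              # "rare" at this cutoff
print(bucket("privilege_escalation", profile))  # "unseen"
```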
Now, this thing gave us a lot of perspective. We started with our univariate anomaly detection algorithm, and we brought seasonality into it. We brought trend into it. And then we found cases where the seasonality factor had a trend in it. Let's say every Wednesday your traffic increases by 5%, while your overall traffic is also trending up by 10%; the seasonal bump sits on top of that trend. Depending on the kind of business you do, what counts as a weekday or a weekend changes. Depending on where you are, let's say, the Middle East has the week off on Fridays. So, depending on where you are, a lot of things change, and you cannot expect your customer to come and configure all of it.
Let's say I run an IT team that caters to an audience across the globe: there is somebody in Australia, somebody in the US, somebody in the Middle East, somebody in India, and there are a lot of local holidays; a lot of patterns change. So, you have to figure out, with minimal configuration and with the minimal amount of data available, how to get a better quality of predictions. And of course, it gets better over time. It gets better as it sees more episodes of data. In the case of seasonality, we usually recommend four seasons of data. Let's say you have weekly seasonality; then you probably let the algorithm warm up for four weeks. We start giving predictions after two weeks, and we also give them a confidence interval: how confident is the model about this prediction? That really helped; the confidence interval really helped.
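Here is a minimal sketch of that univariate setup, assuming an STL decomposition from statsmodels: four weeks of hourly data (four "seasons" of a weekly cycle) plus a slow trend, with a residual band standing in for the confidence interval. The synthetic data and the 3-sigma band are assumptions, not the production algorithm.

```python
# Hypothetical sketch: univariate anomaly detection with weekly seasonality
# and trend, flagging points whose residual falls outside a confidence band.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=24 * 7 * 4, freq="h")     # 4 weekly "seasons"
seasonal = 10 * np.sin(2 * np.pi * np.arange(len(idx)) / (24 * 7))  # weekly cycle
trend = 0.01 * np.arange(len(idx))                                  # slow upward trend
series = pd.Series(50 + seasonal + trend + rng.normal(0, 1, len(idx)), index=idx)
series.iloc[400] += 15                                              # injected anomaly

res = STL(series, period=24 * 7).fit()   # decompose into trend / seasonal / residual
band = 3 * res.resid.std()               # band width doubles as a confidence measure
anomalies = series[np.abs(res.resid) > band]
print(anomalies)                         # should include the injected point
```

Because the seasonal and trend components are subtracted out first, a Wednesday spike that is normal for Wednesdays is not flagged, which is exactly the behaviour a static threshold cannot give you.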
So, after univariate anomaly detection, we did multivariate. Now, multivariate anomaly detection can be challenging. Let's say you have a row of values. Let's say you're trying to monitor a piece of hardware: you have your CPU percentage, you have your memory, you have your free disk space, you have your disk temperature. You have so many variables; you probably have 100 variables that you monitor with respect to a machine. The challenge here is, one single value could be an anomaly, or a combination could be an anomaly, right? CPU at 70% is not an anomaly, but CPU at 70%, RAM at 80%, and free disk space at 5% is going to be an anomaly.
Now, in this case, we have to tell the end user because of what combination we have marked this as an anomalous row. That will help them make better decisions.
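One way to realize that "which combination caused it" requirement is a Mahalanobis-distance detector that reports the features contributing most to the anomaly score, sketched below. This is an assumed stand-in for illustration, not ManageEngine's actual method; the feature names, data, and threshold are invented.

```python
# Hypothetical sketch: multivariate anomaly detection with per-feature
# attribution, so a row can be anomalous as a combination even when each
# value alone looks fine.
import numpy as np

def fit(train):
    mu = train.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(train, rowvar=False))
    return mu, cov_inv

def explain(row, mu, cov_inv, names, threshold=3.0):
    d = row - mu
    score = float(np.sqrt(d @ cov_inv @ d))   # Mahalanobis distance
    if score <= threshold:
        return score, []
    contrib = d * (cov_inv @ d)               # per-feature share of the squared distance
    top = np.argsort(contrib)[::-1][:3]
    return score, [names[i] for i in top]

names = ["cpu_pct", "ram_pct", "free_disk_pct"]
train = np.random.default_rng(1).normal([40, 50, 60], [10, 10, 15], size=(1000, 3))
mu, cov_inv = fit(train)
score, causes = explain(np.array([70.0, 80.0, 5.0]), mu, cov_inv, names)
print(score, causes)   # flags the row and names the dominant contributors
```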
Now that we have these powerful trend- and seasonality-aware anomaly detection systems in place, that is when we started building on top of them, let's say in our monitoring suite of products. Your anomaly system is very proactive, meaning it starts flagging you when things start going abnormal, and from there we had two cases. The first is predicting an outage, predicting a failure: looking forward and forecasting, if I'm going to have a downtime in the next one hour, in the next two hours, then how do I react? And the second is looking backward and finding out the root cause of what has gone wrong. Something could be a starter event; for example, a new product build update could be the starter event for the impending outage. So, from a monitoring perspective, it gives you a 360-degree view. You are at time T. What is going to happen at T plus two? It is because of this, which happened at T minus two.
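For the forward-looking half, a toy sketch: forecast a health metric two steps ahead and raise an early warning if the forecast crosses a failure threshold. Holt's linear-trend method from statsmodels is an assumed stand-in for whatever forecaster runs in production, and the disk-usage data is synthetic.

```python
# Hypothetical sketch: predict an outage before it happens by forecasting
# a health metric (here, a disk steadily filling up) two hours ahead.
import numpy as np
from statsmodels.tsa.holtwinters import Holt

rng = np.random.default_rng(2)
disk_used_pct = np.linspace(40, 94, 48) + rng.normal(0, 0.5, 48)  # last 48 hours

model = Holt(disk_used_pct).fit()   # level + trend, no seasonality
forecast = model.forecast(2)        # T+1 and T+2, in hours here
print("forecast:", forecast)
if (forecast >= 95).any():          # 95% full is our assumed failure threshold
    print("early warning: predicted outage within the next two hours")
```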
This is something which we did in monitoring. Then, talking about the security angle, we built our user and entity behavior analytics, UEBA, entirely on top of these anomaly detection systems, where we find the frequent, rare, and unseen patterns. That really helps us AI-augment the UEBA capabilities, and we are seeing how it can mitigate threats much faster compared to traditional non-AI methodologies. And the same thing applies to service delivery, too. We have an NLP-based system that helps identify what kind of ticket is incoming. Now, let's say an asset request ticket comes in; it may not actually contain the words "asset request". I can just email my sysadmin to say, "Please give me a new laptop", and I'm not going to write "asset request" anywhere in it. But the NLP models can really understand what the ticket is talking about, and then assign it to the right technicians that are available, so that your ticket closure rate is much faster. It basically enhances your productivity.
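A minimal sketch of that kind of keyword-free routing, assuming a pretrained sentence encoder: embed the ticket and each category description, then route by nearest match, so "Please give me a new laptop" lands on "asset request" without containing those words. The model name and category descriptions are illustrative assumptions.

```python
# Hypothetical sketch: route a ticket by semantic similarity rather than
# keyword matching.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed pretrained encoder

categories = {
    "asset request":  "Request for new hardware such as a laptop or monitor",
    "asset defect":   "Report of broken or malfunctioning hardware",
    "access request": "Request for an account, access, or permissions",
}
names = list(categories)
cat_vecs = encoder.encode(list(categories.values()), normalize_embeddings=True)

ticket = "Please give me a new laptop"
vec = encoder.encode([ticket], normalize_embeddings=True)[0]
scores = cat_vecs @ vec                 # cosine similarity (vectors are normalized)
print(names[int(np.argmax(scores))])    # expected: "asset request"
```

In practice, the nearest-match scores could also drive assignment to the technician group that owns the winning category, which is the routing behaviour described above.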
And then finally, you're also able to get a clearer analytical view: okay, how many asset request tickets have I got? How many asset defect tickets have I got? Now, let's say there is this particular model of laptop that has a lot of defects, or there is this production server that goes down often. So basically, with anomaly detection and forecasting as the foundation, we have built a lot of powerful AI features on top, which work with the minimal amount of supervised training data we had access to. This is the current setup at ManageEngine, and put it all together, you have AIOps. Today, AIOps broadly covers monitoring, security, and your service delivery. So, this is what we have built, Rob, over the years with the limited amount of data that we had access to.
[00:13:18] RS: Anomaly detection. How do you train AI to deal with a set of circumstances that's never occurred? Given how hard it is even to acquire standard data, how do you acquire data for something that occurs much more rarely? And what good is past data for predicting these outcomes in the event that history doesn't repeat itself? Ram has some thoughts.
[00:13:38] RR: It is virtually impossible to simulate all conditions that can change, because each of our customer deployments is different. For a similar setup, the way they have configured it could be very different. Especially in enterprise IT, if you look at an enterprise IT tool's dashboard, it's going to be overwhelming, because there are a million ways in which you can configure the same setup. The whole idea of getting AI in here is to move away from manual thresholds. A traditional security system would ask you to configure a threshold. Let us take failed logins as an example: it would ask you to configure the number of failed logins per minute that you expect, right? And then when it reaches that threshold, it throws you an alert, right?
Let's say you have configured 10 failed logins per minute in your SIEM. Now, somebody can easily fly under the radar with nine failed logins per minute. They can just brute force nine times per minute, and they can run it on a scheduler so this happens every minute. It is virtually impossible to create rules for all of this. So, this is where AI comes in: it knows that a Saturday morning at 3 AM is totally different from a Monday morning at 9 AM. That is one reason we are moving towards AI, because manual threshold setting is done. It is over. It won't really work. And the problem it causes: when you do manual thresholding, the alert fatigue of the IT operator is going to be huge. So, is he missing some serious security threats? We never know.
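To make the contrast concrete, the sketch below replaces the fixed "ten failed logins per minute" rule with a baseline learned per (weekday, hour) bucket, so nine failures per minute at 3 AM on a Saturday is still flagged. The data, bucketing, and thresholds are invented for illustration.

```python
# Toy sketch: time-aware baseline instead of a static SIEM threshold.
from collections import defaultdict
from statistics import mean, stdev

history = defaultdict(list)   # (weekday, hour) -> observed failed-login counts/minute

def record(weekday, hour, count):
    history[(weekday, hour)].append(count)

def is_anomalous(weekday, hour, count, k=3.0):
    obs = history[(weekday, hour)]
    if len(obs) < 10:
        return False                      # not enough baseline yet: stay quiet
    mu, sd = mean(obs), stdev(obs)
    return count > mu + k * max(sd, 1.0)  # floor sd to avoid zero-variance buckets

# Quiet Saturday 3 AM baseline: roughly 0-1 failed logins per minute.
for _ in range(60):
    record(5, 3, 0)
    record(5, 3, 1)

print(is_anomalous(5, 3, 9))   # True: 9/min is wild for this time slot,
                               # even though it slips under a fixed "10/min" rule
```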
AI can help you prioritize your alerts. That is one way. This is where the whole idea of human-in-the-loop AI comes in, because generally in IT, too much automation is always frowned upon. Let's say you have a script that automatically restarts all servers. From our learning, we have seen that not a lot of people are up for that. They always prefer a human in IT reviewing it, however intelligent the tool might be. We still have this whole idea of the human in the loop, and I consider that to be the primary aspect of modern-day AI, because we're nowhere close to artificial general intelligence, AGI. We're still in the world of narrow AI, where one model can do just one thing, and not many things.
So, if you look at traditional time series data, your time series does not always reflect real-world conditions. I'll give you an example. At the onset of the pandemic, all of us started working from home. Our IT systems, which probably had AI models built two years earlier, didn't really have an idea of how this pandemic was going to play out, right? Our training data wasn't trained on a pandemic where we are all forced to work from home. Just the day before it was announced as a pandemic, we had 90% of people working from the office and 2% working from home. And the very next day, we had 98% of people working from home and 2% working from the office. However smart your AI system is, any well-trained AI system is going to throw you a lot of red alerts, because it probably does not have the external information that this thing is going on.
In fact, we had a case with a well-known online grocer in the UK, where people were panic shopping. His network monitoring system actually started blocking legitimate traffic as DDoS attacks. We absolutely don't want that to happen, but the AI model is trained so narrowly that it does not know how a pandemic changes things. You can only capture so many patterns. This is where I see human-in-the-loop AI: AI is going to augment humans and not replace them. It's going to make them better at their work. AI will be a huge value add to humans, and humans will be a huge value add to AI. We saw that at the onset of the pandemic; it was very clear at that time. So, the goal of AI in IT is to optimize processes and to maximize productivity. And of course, human-in-the-loop AI is something that I look forward to, and it is here to stay.
[00:17:40] RS: There's plenty of logic we can build in to cope with anomalous occurrences, but in the short term, human in the loop is the most realistic, and according to ManageEngine's customers, the most trustworthy approach. Human in the loop feels antithetical to the long-term vision of AI, so I wanted to hear from Ram what the ongoing utility of this approach will be as we approach AGI.
[00:18:02] RR: I don't see AI completely taking over. I don't see a fully autonomous anything yet. Even if you look at self-driving cars, they can only do smaller stretches, probably on the freeways. But there are a lot of AI techniques that can augment a good driver: fatigue detection, some cruise control, things that basically enhance the productivity of the driver. We haven't reached the stage where we fully replace the driver yet. I think human in the loop is here to stay. It will help us extract the maximum out of the possibilities of AI. And it is not only because we are in our infancy in understanding AI. Today's AI systems are in general 80% accurate; a good AI system is 80% accurate.
So, even if it's close to 94%, 95%, there are a lot of circumstances which might not really be fed into the AI model. We might not really feed something like a pandemic situation into it. Today, it's a pandemic; tomorrow, it can be something else, say a company going out of business. We probably would not have trained our IT model for a business that goes out of business, right? This is where a human in the loop is essential, and I don't foresee a fully autonomous anything coming very soon. It's going to be about enhancing the productivity of the human who is in charge of it.
[00:19:23] RS: Of course, before letting Ram go, I had to know what has him most excited about working in our industry.
[00:19:30] RR: I see AI as a digital technology that is going to touch our lives in all possible ways. I'm very optimistic about it, even though there are challenges around the availability of data and the huge amount of processing power that you need to get AI running. I'm really optimistic about AI playing a bigger part in our day-to-day life. But the way I see it, it is going to become something like databases, where almost every piece of software is going to be AI-powered, but fewer and fewer people are going to talk about it. Fewer people are going to talk about it, but more and more it is going to be used. For example, we have AIOps; it's just going to be called DevOps, but AI is going to be an essential ingredient, right? AI is no longer your secret sauce. For example, nobody markets their product by saying, "I use this database, so come and buy this product." The same way, AI is going to be everywhere. It's going to be talked about less and less, but it's going to be used more and more. That is why I'm very optimistic, very bullish: it's going to be an incredible ingredient in every software product that is built from now on.
[00:20:37] RS: As Ram puts it, we've had our fair share of AI springs and AI winters, or, to put it in the terms of the standard hype cycle, peaks of inflated expectations and troughs of disillusionment. Most assessments tend to leave us in the trough, but of course, there are two more stages: the slope of enlightenment and the plateau of productivity. I was encouraged to hear Ram put us somewhere on that plateau; despite being a far cry from AGI, every day we see more mainstream adoption, widespread application, and broad market appeal. Truly, there's never been a better time to work in AI.
[OUTRO]
[00:21:19] RS: How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video, and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, ecommerce, media, medtech, robotics, and agriculture. For more information, head to sama.com.
[END]