Adnan Khaleel, Sr. Director of Global Sales Strategy for HPC & AI at Dell, explains how companies are using HPC and containerization to scale their AI implementations, as well as how Dell parallelized a radiology algorithm, drastically improving both speed and accuracy.
0:00:00.0 Adnan Khaleel: We were able to scale those 32 nodes, and instead of it taking five hours, it took seven minutes.
[music]
0:00:09.7 Rob Stevenson: Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists, and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we are about to learn how AI happens. On this inaugural episode of How AI Happens, we are going to be looking at a perennial tug of war in our industry: Speed versus accuracy. If your AI comes anything close to the consumer, speed is paramount, but if in speeding things up, you can't maintain accuracy, then isn't it just better to be patient? To learn more, I sat down with Adnan Khaleel. Adnan interfaces with hundreds of organisations on behalf of Dell, all of whom are attempting to either scale or optimise their AI approach.
0:01:15.4 AK: I am a Senior Director for Global Sales Strategy for high performance computing, AI, and hybrid cloud, and as part of that role, I get to work with a lot of different customers in many different areas. Most of them are either just embarking on their journey on artificial intelligence, or they're trying to better understand how to scale those systems out. For some, it's as simple as design a better computer system that allows them to be more efficient, but for others, it's how do you scale up? It's not just running it efficiently, but how do you run it efficiently at a very, very large scale?
0:01:49.8 RS: Because Adnan is focused on high performance computing at Dell, I wanted to know about the kind of hardware and cloud computing his customers are implementing for their various use cases.
0:02:00.1 AK: This is where things get difficult, and this is where we come in. Certain customers have gone down the path of doing just general purpose... You could use a general purpose Xeon to do your AI training workload, and it works perfectly fine, and the good thing about a general purpose CPU or... A Xeon, or an AMD processor is that you can use it for everything else. Now, a GPU is also... They've been optimised to the point where people are doing a lot of training on them, and they work really well. GPUs are still not sort of general purpose enough; not all codes run efficiently on a GPU. But... Yeah, and then people who are hyper-specialised, when you talk to the Googles of the world, they've got their own processors that they've designed from scratch, which are customised to do the training workload and are optimised for their particular type of training workloads as well, and these are called tensor processing units. And each one of them has some advantages over the other, but ultimately, if you don't wanna invest in any hardware, the simplest way to go is, you just go with a general purpose CPU. And most of them will do the training; it just may not be fast enough for what you need, especially if it's associated with some sort of a business process.
0:03:19.5 AK: Fast is a very relative term. Now, let's say you're a business, and your entire e-commerce site is based on recognising objects. In this case, let's call it merchandise. Now, any time you have some new objects or some new merchandise that's being introduced, and you're trying to recognise them, well, you've gotta run your algorithm through all of the new sort of merchandise. And in that case, if your site is up 24/7 and you don't wanna have any disruption, well, you wanna do that training... The implementation of that model very, very quickly, because you're not affecting your sales cycles. Now, in other instances, if you're a researcher, and you're trying to understand anomalous cell behaviour, identify anomalous cells... Now, if the training were to take longer, it's inconvenient, but it's not detrimental and you're not losing money. But yeah, so researchers generally... They also don't have unlimited funds as a business would, so they're okay with waiting a while to get their results rather than going and investing in a whole bunch of different hardware.
0:04:25.7 AK: Now, there's also... It sort of varies. You can think of life... Healthcare and life sciences is a perfect example as well, where if you're a researcher, again, you're fine waiting the extra hour or day to get your training models done, whereas if you're on the clinical side and you're working with a patient, you want things to be as quick as possible, because if it's during a doctor consultation, and you find something new in an X-ray or an MRI, you don't want to keep the patient in suspense while you're trying to figure out... While the model is trying to figure out what it is. So... Yeah, so it really depends on what the specific needs are.
0:05:03.9 RS: Here, it struck me that your choice of hardware would move in parallel with how much you want to scale up your AI implementation. Adnan walked me through his experience seeing many customers scale, and shared some key considerations they make along the way.
0:05:19.2 AK: And really, it's a journey, right? So let me sort of use an analogy. When you're first learning to drive... As much as I would like to have just jumped into a Ferrari and slammed on the pedals, you generally don't do that. [chuckle] So, you learn to drive on... Not a sports car, and a Ford... A lesser... Less of a sports car. And then, you learn to drive. And similarly, with AI adoption, it's the same thing. You initially try to understand what the algorithm is capable of, because not... The data scientists and the researchers understand what it can do, typically, but businesses still have to go through a proof of concept, so you generally have a smaller setup. You're trying to understand, is this the right technology for you, do you need to tweak it, and so on. So once you've gone past the proof of concept, then you start building a bigger system, and then... And once you start realising the value of a bigger system, typically, then end users... Either it's more training data that's being associated with the whole training info, and that sort of... You need the whole storage infrastructure and the interconnect and the networking to make sure all of the data is being fed properly and at a sustained balanced rate.
0:06:37.8 AK: And then people run into the problem of, "Okay, well, I wanna do this in lots of different parts of my business," and then you run into this whole scaling problem, is because everybody wants to do it, and generally, that's a good success story. Unfortunately, it doesn't always work that way, because sometimes, people bite off more than they can chew, and as a result of that, they end up failing, so... And that's why we sort of recommend this... Learn to crawl, walk, and then run, so... With the AI world, and the big data world before that. It's a very... Sort of a similar approach. With containerisation, you can take your sort of workflow that you've developed on your laptop and easily transfer it onto the cloud, and it really doesn't take a lot of modification; it's really meant to run out of the box. And for the most part, it'll scale decently.
0:07:29.7 AK: The other sorta challenges come in where people run on the cloud, and then they see, "Wow, this is really expensive." And that's where a lot of the HPC technologies also come in as well, because even the data scientists, even though they understand how all of these things work, they're not hardware experts, they're not infrastructure experts. And then, on top of that, when you build these large systems, you also have to factor in the power, the configuration, the cooling as well, based on how much data centre space you have, or if you're using a colo, and so on. So, there are all of these different considerations that come into play once you look at the scaling picture.
0:08:07.6 RS: Do most of your customers wind up using some kind of containerisation process?
0:08:12.2 AK: Some of them do. It's funny you ask that, because if you look at the typical HPC customer, most of them are like the NASCAR drivers, and if you get into a NASCAR, you can see it's been stripped bare right down, and there's not even a gas level meter, so you're sort of estimating stuff. And that's how most HPC people are, is like they don't want all of the extraneous, nice to have bells and whistles. They just need the code to run as fast as it can, and they wanna have as few distractions as possible. But on the other hand, containers bring with them a lot of conveniences. There's code portability, there's easier sharing, there's reproducibility. So, we've got this happy mix. Some instances, people are fine using containers, and there are some very good HPC-specific containers, like Singularity. And also Kubernetes as well. But in some instances, customers wanna be on the direct... The big iron, as close to it as possible, so they don't even want any virtualisation.
0:09:19.0 RS: What is the bias against virtualisation?
0:09:21.9 AK: That's... There is a performance penalty, because any time you do a system call, it has to go through another layer of software. Although that argument is quickly losing steam, because if you look at the latest versions of the commercial virtualisation software out there, they've reduced that drastically. In the past, and... Say, 10 years ago, you could have seen the overhead of any sort of networking traffic that was using virtualisation. There was quite a high penalty; anywhere from 5% to 10%, but typically now, it's 1% to 2%. For some applications, it's even less than that. It's just that sort of people have stuck on with this mindset that virtualisation sort of exacts a severe penalty, and in some cases, it does even today. But for most instances, I think the convenience outweighs some of the penalties that virtualisation might introduce.
0:10:17.8 AK: Now, if you go back 40, 50 years, people were writing assembly code, and that was really hard to write; it wasn't very intuitive. But then, people started preferring high level languages, like C and C++ and Java. Now, if you look at it, compared to writing assembly, C and C++ are close to assembly, but they're not quite assembly, but they offer a lot of convenience. Now, similarly, when it comes to computing, you can run stuff directly on the bare metal, which is fine, but if you do it in a container, you get all of these other benefits, which you wouldn't if you're running it directly on the bare metal. So, it's really ultimately, at the end of the day, I think convenience sorta trumps, but it takes a while. It isn't immediately obvious, but it's also about the maturity of the software tools along the way.
0:11:10.7 RS: This question of when to use bare metal or containerisation and what amount of performance penalty was acceptable led us directly to a key case study that focuses on increasing speed while maintaining accuracy; a study Dell performed on an algorithm; an algorithm called CheXNet.
0:11:29.8 AK: Essentially, CheXNet is an algorithm that was developed at Stanford, with the goal of identifying mutated cells on an X-ray. It was essentially developed to understand X-rays better, to identify cancers, tumors, any anomalous cells on X-rays faster, better than a radiologist, or more accurately than a radiologist. So, this was an algorithm... It was developed a few years ago, and it's... They actually tested it on 14 different anomalous types of cells, ranging from emphysema, to cardiomegaly, to fibrosis, to edemas, and so on and so forth. The great thing about this algorithm is it, today, has an accuracy of 84%, which... Maybe it's slightly higher than your average radiologist at identifying some of these anomalous cells, but, again, it's easier to make this algorithm available globally than it is to train a whole bunch of radiologists, so really, that was the goal.
0:12:38.1 AK: So... But this goes back to our earlier conversation about the need for speed. The great thing is, the algorithm works. It works really well, as I mentioned; really good accuracy, but it takes a really long time to train it. And the way that they had developed the algorithm, it had to be trained on a single machine. In other words, you couldn't parallelise it in order to speed it up. So that was the work that we had done at Dell, and... To see, was there a framework that we could adopt in order to take this CheXNet algorithm and then sort of divvy it up into smaller pieces and spread it across lots of different nodes? In other words, could we scale this algorithm and try to speed it up and still maintain that accuracy?
0:13:23.3 AK: The algorithm itself was meant to run serially. You're sort of meant to go through all of the different images, and if you look at it, the convolutional neural networks used in this algorithm are sort of serial state machines; you have to train them on all of the different images, one after the other, in order to get the full benefit of that training. Now, the moment you sort of deviate from that training cycle and you sorta split it up, well, you've really degraded the accuracy. So, our goal was, is there a way of still maintaining that accuracy, but speeding it up? And ultimately, it was that serial nature of the training of that algorithm that we were trying to sort of understand better, and then sort of distribute it over a framework.
0:14:09.8 RS: Enter parallelisation.
0:14:11.5 AK: Enter parallelisation, which is essentially the domain of high performance computing, because if there's one thing that HPC users are really good at, it's taking a problem and then sort of spreading it over lots and lots of different compute units and trying to speed it up by the number of compute units, so in other words, if a job takes 100 hours on one processor, the goal is, can you take that one job, split it up over 100 processors, and could you run it in 100th the time it would have taken you on that one processor? This is the sort of problem you run into in lots of different domains. Weather modelling is another one of them... There are certain classes of problems which lend themselves very easily to being decomposed, and then you can process them and recompose them back.
0:15:00.9 AK: For example, like in weather simulation and modelling, you typically... What you do is you create these voxels, which are small 3D grids, and then you can do all of your simulations on those small 3D grids, and then combine the edge results, and then sort of... You sorta stitch together the end... Sort of view of your entire simulation in your model, and as a result of that, you can sorta parallelise and get speed-ups. Now, in certain instances, you cannot, or it's very hard, in order to do some of these speed-ups, just because the nature of how these algorithms have been designed is that they have to run very, very serially. Now, serially... The drawback is that it can only run on one process at a time. So, in HPC, there are lots of different cases where you've got loosely coupled work where you can... It's really easy to distribute them, or you've got very tightly coupled workloads, where it's harder to sort of distribute them.
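The decomposition Adnan describes (split the grid, compute each piece, combine the edge results) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: a one-dimensional grid instead of 3D voxels, a simple diffusion update, and hypothetical function names. It is not taken from any weather code. The point it shows is that once neighbouring pieces exchange their edge values before each step, every piece can be updated independently and the stitched result matches the serial run.

```python
# Illustrative only: a 1D diffusion grid split into chunks that exchange edge values.
# All names and parameters are hypothetical, chosen for this sketch.

def step(cells, left, right, alpha=0.1):
    """Advance one chunk by one time step, given the neighbouring edge values."""
    padded = [left] + cells + [right]
    return [
        padded[i] + alpha * (padded[i - 1] - 2 * padded[i] + padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]

def serial_run(grid, steps):
    """Reference run: the whole grid is updated as a single piece."""
    for _ in range(steps):
        # Boundary condition: each end of the grid sees a copy of its own edge value.
        grid = step(grid, grid[0], grid[-1])
    return grid

def decomposed_run(grid, steps, parts):
    """Same simulation, with the grid split into `parts` equal chunks."""
    if len(grid) % parts != 0:
        raise ValueError("grid length must be divisible by the number of parts")
    size = len(grid) // parts
    chunks = [grid[i * size:(i + 1) * size] for i in range(parts)]
    for _ in range(steps):
        # Exchange edge values between neighbouring chunks before updating.
        lefts = [chunks[p - 1][-1] if p > 0 else chunks[0][0] for p in range(parts)]
        rights = [chunks[p + 1][0] if p < parts - 1 else chunks[-1][-1] for p in range(parts)]
        # Each chunk could now be updated on its own processor; here it is a plain loop.
        chunks = [step(c, l, r) for c, l, r in zip(chunks, lefts, rights)]
    # Stitch the chunks back into one grid.
    return [value for chunk in chunks for value in chunk]

if __name__ == "__main__":
    start = [0.0] * 8 + [100.0] * 8
    print(serial_run(start, 25) == decomposed_run(start, 25, 4))
```

In a real HPC code the edge exchange would be a message between processes rather than a list lookup, which is where the cost of tightly coupled workloads comes from: the more often neighbours must talk, the less time is spent computing.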
0:16:03.0 RS: Was that the parallelisation approach that was taken with CheXNet?
0:16:07.2 AK: Yeah, that was... It was very, very similar to that. We also wanted to use a readily available framework that we could then sort of extend to CheXNet, and... Rather than create something of our own, and there is an open source tool called Horovod that we used in order to introduce the synchronisations, and... Again, as you split something up... One of these algorithms up, you need to be able to synchronise at certain points, and that's what Horovod allowed us to do. What we did manage to do is, we had these incredible speed-ups. Well, I have to give some base numbers here.
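To make the idea of synchronising at certain points concrete, here is a minimal sketch of synchronous data-parallel training, simulated in a single Python process. It does not use Horovod's API, and it is not the method Dell applied to CheXNet, which the conversation does not detail. The one-parameter model, the toy data, and the function names are all assumptions for illustration. Each simulated worker computes a gradient on its own shard of the data, the gradients are averaged (the role an allreduce operation plays in a tool like Horovod), and every worker applies the same averaged update, so the workers stay in step with one another.

```python
# Illustrative only: synchronous gradient averaging across simulated workers.
# This is NOT Horovod code; all names are hypothetical.

def gradient(w, shard):
    """Mean-squared-error gradient for the model y = w * x on one shard of data."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an allreduce: every worker receives the average of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def train(data, workers, steps=200, lr=0.01):
    """Train one copy of the weight per worker, synchronising gradients every step."""
    if len(data) % workers != 0:
        raise ValueError("data size must be divisible by the number of workers")
    size = len(data) // workers
    shards = [data[i * size:(i + 1) * size] for i in range(workers)]
    weights = [0.0] * workers
    for _ in range(steps):
        local = [gradient(w, s) for w, s in zip(weights, shards)]  # parallel in practice
        synced = allreduce_mean(local)                             # synchronisation point
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights

if __name__ == "__main__":
    data = [(float(x), 3.0 * x) for x in range(1, 9)]  # toy data with true weight 3
    print(train(data, workers=1))
    print(train(data, workers=4))
```

With equal-sized shards, the average of the per-shard gradients equals the gradient over the whole dataset, which is why this toy run reaches the same weight with one worker or four. Real networks trained across many nodes raise further issues, such as the larger effective batch size, that this sketch does not model.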
0:16:41.1 AK: But on a single system, on a single node, it would take around five hours to run this particular training job. Now, our goal was, could we throw many more systems at it? And so, we said like, "Okay, from a single node, if we could go up to a 32 node, that would be great." And initially, when we tried to do this, obviously, the accuracy suffered due to the reasons I mentioned earlier, but then ultimately, when we did sorta come around to tweaking the algorithm and making it work in this parallel environment, we had some tremendous speed-ups, and... In other words, we were able to scale those 32 nodes, and instead of it taking five hours, it took seven minutes, so you can imagine the sort of improvement that that could have in a real-world environment. I mean, seven minutes is the duration of... Within the duration of a doctor's visit. So, you could get your X-ray or your MRI taken, and as soon as that happens, the number crunching begins in the background, and when you're ready to see your radiologist or your doctor right after that, the results are present. So, this has real-world implications.
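The figures quoted here can be turned into a speed-up and a parallel efficiency with a few lines of arithmetic. The sketch below uses only the numbers from the conversation (around five hours on one node, seven minutes on 32 nodes); the function and variable names are illustrative. Because the quoted times are rounded, the result should be read as approximate.

```python
# Illustrative arithmetic using the rounded figures quoted in the conversation.

def speedup(serial_minutes, parallel_minutes):
    """How many times faster the parallel run is than the single-node run."""
    return serial_minutes / parallel_minutes

def efficiency(speedup_value, nodes):
    """Speed-up per node; 1.0 would be perfect linear scaling."""
    return speedup_value / nodes

serial_minutes = 5 * 60    # "around five hours" on a single node
parallel_minutes = 7       # "seven minutes" on 32 nodes
nodes = 32

observed = speedup(serial_minutes, parallel_minutes)   # about 42.9x
per_node = efficiency(observed, nodes)                 # about 1.34
ideal_minutes = serial_minutes / nodes                 # 9.375 minutes with linear scaling

if __name__ == "__main__":
    print(f"speed-up: {observed:.1f}x, efficiency: {per_node:.2f}, "
          f"ideal time: {ideal_minutes:.3f} min")
```

Taken at face value, seven minutes is faster than the roughly 9.4 minutes that perfectly linear scaling of a five-hour job over 32 nodes would give. Since "around five hours" is an approximation, and the conversation does not say whether the single-node and 32-node runs were otherwise configured identically, the safer reading is simply that the scaling was close to linear.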
0:17:52.0 RS: And the accuracy was maintained.
0:17:53.6 AK: And the accuracy was maintained; in some cases, it was actually improved, but for the most part, it was maintained. Now, there were some cells which... The accuracy suffered, but I think that was on two of the workloads, two of the different types of anomalous cells. We couldn't improve on the accuracy, and there was a slight degradation, but overall, as a cohort, we saw a slight improvement, in terms of a few percent, in terms of accuracy. So, that is something, today, we are willing to live with, but as science marches on, I'm sure we'll tackle those problems as well and improve the accuracy, not just of those two that were slightly lagging, but of the overall system.
0:18:37.5 RS: Of course, the implications of speeding up CheXNet reach far beyond the examination room. I immediately thought of applying this approach to any sort of image processing, but as Adnan explained, I'll need to start thinking a little bigger.
0:18:51.4 AK: This was done on X-rays, so an X-ray is just a two-dimensional image, the way you can look at it, so any sort of an image, this technique can be applied to. So, if you're trying to identify something that's anomalous in an image, and it doesn't occur very, very frequently, so... Yeah, any sort of an image. Now, there's ways of even sort of extending this to other sorta domains, other than just images, like videos, which are just still frames, clearly, but also time domain signals, for example, like voice, which is... It's a time domain, sort of a signal. Or any other kinds of... Radio frequency signals as well, because... Again, astronomers are also looking at lots of different types of data, obviously, like images from telescopes, but also trying to understand and classify a lot of different heavenly bodies out there, so an algorithm like this could clearly benefit astrophysics.
0:19:52.6 RS: Image processing, video, time domain signatures, astrophysics. I'm sure the list goes on. I'll include a link to the full report on the CheXNet parallelisation in the episode notes. And if this technique ends up working for you, well, I'd love for you to tell me all about it. Moving beyond containerisation, speed versus accuracy, and parallelising algorithms, I wanted to end with a little bit of brain candy. I'm always curious what's most exciting to the experts in the AI space, where they see the most impactful changes occurring, and what stokes their curiosity, even after a long day of making AI happen.
0:20:35.1 AK: Oh god, how much time do we have? There's so many things that are happening in high performance computing and AI, especially on the AI side. And I'm sure you've already heard about or read about deepfakes, and how it's really easy to manipulate videos and to make it look like it's somebody saying things that wasn't actually said. So, to me, what's interesting is, this is going to sort of bring us into an area where, as a human society, it's sort of unprecedented. So that's going to sort of... We're gonna have to have new laws, new legislations... Ethics as well. More and more, we're looking at algorithms that are going to be deciding on people's financial futures. We have to be able to understand why somebody was denied a mortgage versus why somebody was approved. And also, the fact that the data itself... If you use historical data for training, which is the best source of data that we have, whether we realise it or not, there are some biases built into that stuff, so if you've trained a model historically, using historical data, you've already got a biased model, so... We have to be able to not only explain the data, but explain the algorithm, and ultimately find ways of detecting bias and making sure that there's fairness that's sorta built into the system, and be able to detect when something is not fair.
0:22:13.1 AK: The other area is, there's neuromorphic computing, which is another branch of artificial intelligence, which are looking at more... At some of the spiking neural network models of neural... Which are more similar to how actual neurons work. And what's really interesting about that field is that these computational models are several orders of magnitude more efficient, as compared to GPUs and so on, and you also don't need as much data, and you don't need to label them, and so on. They learn similarly to how you and I learn. It's like, if you show me a few images of a cat, and then I tell you what a dog looks like, the next time you see a dog, you probably will be able to sort of infer and say like, "Hey, I think that's a dog." Today, it's really... The algorithms are really good at saying, "Well, that's a cat, that's a cat, that's not a cat." But it's just, all of the development that's going on in this field of AI is just tremendous, and I think we've just... We're really at the beginning of the next stage of discovery, because there's implications in so many different fields, in genetics, in genetic drug discovery, in identifying anomalous patterns that are disease-causing.
0:23:39.7 AK: And I think what really excites me is the fact that, even though we've made some great advances in medical science, with pharmaceutical drug discovery, employing AI, we're just going to get into that next era where... The time to find drugs for very intractable diseases is going to be... Is gonna take a very different approach than what we've done in the past. Rather than it being a lot of trial and error, where we go through all of these compounds in a laboratory, a lot of that can be automated; in fact, you wouldn't even have to physically do it. You'd do it entirely virtually, in simulations, and AI would help you sort of... Even decide which ones are the most promising of these compounds. So, there's no one single area that I can think of that's not gonna benefit from AI, and it's sort of... It's really understanding which technique works best for you, and then ultimately, it's going to be development in these algorithms, as we understand how some of these algorithms work, and believe it or not... And this is the part which sort of catches most people off-guard, is...
0:24:51.8 AK: A deep neural network is... Because of its complexity, is very much like a black box. We still don't understand entirely how it works, which is fascinating to me, because, much like the human brain, we understand aspects of it, but we still don't understand entirely everything about it. So, when you talk to the researchers, they will tell you oftentimes that, "Well, we need a neural network in order to understand a neural network." So yeah, I think... And that's where we are, and there's something called explainable neural networks and so on; that's a project that's been... MIT's been working on. So, it's just fascinating, all the things that we're gonna embark upon, and... I'm just happy that I can work with a lot of these users and customers, and some of the smart people that are working on some of these projects, and I get to learn from them. So to me, that is the most exciting part.
[music]
0:25:50.8 RS: As I learned in the CheXNet example, and as I'm sure some of you said out loud to yourselves over the course of this episode, the concepts of speed and accuracy aren't necessarily an either/or. And as AI moves forward, I think we'll all expect to have both. Next time, on How AI Happens...
0:26:12.1 Speaker 3: We ask the user to determine what kind of outcome they want, then we throw half a dozen or more algorithms at that problem, and then build maybe 100 or 200 models in the space of a few hours.
0:26:29.3 RS: How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specialising in image, video, and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, e-commerce, media, MedTech, robotics, and agriculture. For more information, head to sama.com.