Igor Susmelj, Co-Founder of Lightly, explains how most companies don't have a problem of too little data, but rather of far too much irrelevant data. He details Lightly's approach of utilizing self-supervised learning to pare down massive data sets into something that can be useful to a supervised learning approach.
0:00:00.1 Igor Susmelj: In autonomous driving, most companies use less than 1% of the data they actually possess and label it and use it then to train the system. So Lightly is the one company telling you which 1% you should use.
0:00:14.5 Rob Stevenson: Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists, and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we are about to learn how AI happens.
0:00:42.3 RS: Today on How AI Happens, we're going to be looking at supervised and unsupervised learning. There are plenty of schools of thought regarding these approaches. Those who say supervised learning is too expensive or time-intensive are countered by the argument that it's really the only way to have transparent and accurate outcomes. Meanwhile, proponents of unsupervised or self-supervised learning scoff at the suggestion that it can only be used for clustering or that it's by necessity black box. Here to get to the bottom of it is Igor Susmelj, the co-founder of an exciting company called Lightly, working to combine all types of learning into a maximum efficiency process for their AI-building customers. Igor has a deep academic background in robotics and electrical engineering, and he represents my favorite kind of entrepreneur, the kind who personally struggled with a problem themselves and then set out to fix it.
0:01:38.6 IS: So I have a background in electrical engineering. I started here in Switzerland at ETH University. In my undergrad, I was focusing more on robotics. I was very fascinated by machines behaving like humans or smart machines, and back then more on the hardware side, and then did some projects as well. One was where we built an exoskeleton where we helped paraplegics walk again, and this also participated in the Cybathlon. And then afterwards, I moved more and more into software, even worked for two years in a financial institution, then went back to uni doing my master's and finally got in touch and in love with machine learning and computer vision. So I also did electrical engineering, but the sweet thing here at ETH is that you can select the electives you take quite flexibly, so I was more focusing on computer vision, machine learning, even some lectures about hardware. I even had a project where we built an accelerator for machine learning inference on device, like the idea was to have a device fitting into a smart watch, being able to do non-stop speech recognition with a very low power consumption.
0:02:49.8 RS: Did it work?
0:02:50.8 IS: Yes, we even did the tape-out of the chip in the end. So there's an actual chip, I have it at home, and it's an ASIC, application-specific integrated circuit. It's pretty cool to see where technology is still going, especially when it comes to machine learning.
0:03:05.9 RS: Working on exoskeletons and speech recognition devices had Igor steeped in the hardware side of house, so I wanted to know what it was that turned his head toward software.
0:03:16.3 IS: Yes, I would say I also got a bit disappointed with hardware. It's really hard to do stuff with hardware and to make it scale. My goal is always to reach as many people as possible with something I build, and software just scales and ships so easily. You can write a few lines of code and reach millions of people, and if you do something with hardware it takes much more. You need to build it actually, you need to figure out the manufacturing part, how to handle defects, how to handle the shipping. It's just a headache. [chuckle] If you do something like a chip for AI, basically you have to solve two problems. One is the hardware problem and one is the software problem, but I saw that both are still very unsolved. It's a very, very new field. At the current stage, the software part is just way more fascinating, and the hardware part is just going to make whatever you do in software a hundred or a thousand times more efficient.
0:04:06.6 RS: Yeah, yeah, that makes sense. And when it comes to the actual work of creating artificial intelligence, that is by necessity, software. That is going to be code. And when you want to make hardware, now you're getting into engineering and picking up a wrench. So it strikes me that if someone was really focused on developing something in AI and ML, they kinda have no choice but to get involved in software, yes?
0:04:31.3 IS: Yes, yes. It's also something I found a few times funny when I met people who wanted to do ML without coding, and I think that's something which doesn't really exist. You have to know how to do software engineering. So I would say by default, every good machine learning practitioner should also have basics in software engineering. You can go to the very theoretical side, I think, especially in research, there are areas where it even makes a lot of sense to be much stronger in math and the theory, but I think when you wanna go to industry and build even more than just the algorithm design itself, you need to know how to code.
0:05:08.2 RS: You mentioned you were excited by where the technology was going when you were developing that speech recognition software. So where is technology going?
0:05:18.0 IS: It actually, it went very, very far since then. So this was I think around 2016, and even still at uni, I did some other project. It's actually very cool at the university. You can do projects and even do publications afterwards. And another one was working with GANs, generative adversarial networks, and it was my first project in the space of unsupervised learning. And GANs, in case you're not familiar with them, they can be used to hallucinate or generate images. So for example, I can train a model to hallucinate faces or to hallucinate cats and images which look photo-realistic. Back then, it wasn't photo-realistic, but the idea was very, very promising, very sweet. And nowadays, what became of GANs is also a bit of fear, because people saw that you can use them to manipulate images and videos as well. So one area now people are a bit afraid of is this whole area around deepfakes, manipulating videos, manipulating faces, and all of that is kind of also coming from this new field of unsupervised learning and especially generative adversarial networks.
0:06:25.4 RS: It seems like there is an application there to make synthetic data too.
0:06:29.0 IS: Oh, yes. That's actually the promise... Or even better, if you can... From whatever distribution you have, your data is a probability distribution, and if you can figure out what distribution is and you can replicate it, you understand it. That's the idea. If you know how a face can look like, you can generate unlimited amount of faces. If you can capture all the modalities, all the characteristics, and the idea there is to generate synthetic data or even fill gaps. If you have, for example, lots of pictures of males and females, and the males have glasses, but the females not, you can hallucinate females with glasses, even though it doesn't exist in the actual training data. So you can do all these kinds of interesting mixes.
0:07:15.5 RS: Is there a reason why you use the word hallucinate as opposed to generate or something else? Is there a unique aspect of this that I'm missing?
0:07:23.7 IS: No, it was more like... For us, it was magic. Back in the lab, we were trying to generate 64 x 64-pixel faces. I think I will never forget that. And the race was who is going to be the first one to go for 128 x 128 pixels. And if you think about it, it's really, really low resolution. And I think to date, researchers went beyond 1000 x 1000 pixels. And for us, this whole thing of GANs was just magic. You hallucinate something which doesn't exist. It was maybe one of the moments I realized, this is really crazy. You can do something I would have never thought about that you could do that. You can now take images, millions of images of one topic... If you like landscape pictures, just take a few ten thousands of them and feed them into a GAN, and you can hallucinate even more. You can have an unlimited stream of landscape pictures every day on your wall if you want.
0:08:16.6 RS: I loved hearing the wonder in Igor's voice about our industry, and so naturally, I was excited to learn more about how he combined this curiosity with entrepreneurship and set out to solve common AI problems. Enter Lightly.
0:08:32.2 IS: So at Lightly we help companies use the right subset of their data to train machine learning models. Most companies have way more data than they can actually use, and most systems are trained in a supervised way, so we need labels. We need to label data and labeling data is expensive. It's expensive in time and it's expensive in money as well. Just to give you a rough rule-of-thumb estimation, in autonomous driving, most companies use less than 1% of the data they actually possess and label it and use it then to train the system. So Lightly is the one company telling you which 1% you should use.
0:09:09.1 RS: I've heard plenty of times now, how more data is usually the answer to creating more accurate output from our learners and models, but then here's Igor telling me we have too much data and that the trick is to find the subset of data that actually has meaning. So where do you start paring down the data?
0:09:27.5 IS: There are different methods we provide, algorithms as well, to do the selection itself, and one of the common themes we see is that you want to have a diverse subset of your data set. In autonomous driving, for example, you want to make sure that you have different weather conditions, that you have different roads, that you have different cities, different objects on the street. You don't wanna have just one street and one car driving there. You want to have as much diversity as possible, especially trying to include all kinds of edge cases. So basically, you start with self-supervised learning, you train on all your data you have, and then you label the 1%, or maybe even less, the 0.1%, and on this labeled data, you would train the model using supervised learning, because you still need this translation layer between what the machine understands and what we assume and understand. If you never teach the machine that there's a concept of a dog, and this is the bounding box around the dog, it would never know it. It would just know that this is an object which is very similar to other objects, which are apparently also dogs, but it doesn't know it's a dog.
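The diversity-driven selection described above can be illustrated with a small sketch. It is not Lightly's algorithm, which the conversation does not spell out; it uses greedy farthest-point sampling, a common stand-in for "pick a diverse subset", and assumes every image already has an embedding from a self-supervised model. The function names and the toy 2-D vectors are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_diverse(embeddings, k, start=0):
    """Greedy farthest-point sampling: return indices of k mutually distant items."""
    if not embeddings or k <= 0:
        return []
    k = min(k, len(embeddings))
    selected = [start]
    # Distance from every item to its nearest already-selected item.
    min_dist = [euclidean(e, embeddings[start]) for e in embeddings]
    while len(selected) < k:
        # Pick the item that is farthest from everything selected so far.
        nxt = max(range(len(embeddings)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], euclidean(e, embeddings[nxt]))
    return selected

# Toy 2-D "embeddings" (hypothetical): three tight groups standing in for
# three kinds of driving scenes.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),  # e.g. sunny highway frames
    (5.0, 5.0), (5.1, 5.0),              # e.g. rainy city frames
    (10.0, 0.0), (10.0, 0.1),            # e.g. night frames
]
picked = select_diverse(embeddings, k=3)
print(picked)  # one index from each group
```

With a labeling budget of three, the sketch returns one frame from each group rather than three near-identical highway frames, which is the behavior the diversity argument calls for.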
0:10:31.1 RS: So what do you think are some of the advantages of layering on unsupervised or self-supervised learning?
0:10:37.3 IS: I think the biggest advantage is the ability to use all data we have. We see this problem that the data all over the world is increasing at a rocket speed. It's doubling every year, and it's continuing like that for a long time apparently. We have more data everywhere. We create data every day with all our devices and just labeling all of that seems impossible. We need to figure out how we can overcome this limitation and get even better systems.
0:11:07.7 RS: With supervised learning, you at least, you can label it more manually, you can be a little more certain that the machine is making accurate decisions, whereas with unsupervised, there's larger quantities of data perhaps, it's maybe unlabeled. What are some of the challenges of layering this on?
0:11:22.9 IS: So the problem is self-supervised learning or unsupervised learning is still pretty inefficient. It requires not only lots of data, but also lots of compute, which is quite expensive. Just having a few hundred GPUs, it's not something everyone has at home. And that's also one of the general trends, trying to make it more efficient. On the other hand, one of the problems we still have is we don't have very good metrics, so all self-supervised learning models are finally evaluated using supervised learning tasks because there's no way to measure the performance.
0:11:55.2 RS: Even though supervised learning is more time-intensive, the trade-off is more accuracy. I assumed Igor and his team hadn't arrived at these conclusions about supervised versus self-supervised learning in a vacuum, so I asked how they were able to ensure reliable outcomes with their self-supervised approach.
0:12:12.6 IS: So what we do at Lightly is we only use self-supervised learning to help you select the right data, so we don't have to necessarily provide you a model which you would use to measure the accuracy. We have some customers actually doing that. So what they do is still the other way I mentioned before, that you pre-train the model using self-supervised learning and then you do the fine-tuning using supervised learning because it's just more efficient that way. But in the end, the way we evaluate academic models using our framework is using simple supervised methods, like we look for nearest neighbors in the vector space, like if you have two classes like dogs and cats, we wanna figure out how often are dogs close to dogs and cats close to cats for example.
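The nearest-neighbor check mentioned above ("how often are dogs close to dogs and cats close to cats") can be sketched as a leave-one-out 1-nearest-neighbor accuracy. This is a minimal illustration under assumptions: the embeddings and labels are made-up toy values, and the function name is hypothetical rather than taken from any framework.

```python
import math

def nearest_neighbor_accuracy(embeddings, labels):
    """Leave-one-out 1-NN accuracy: how often an item's closest neighbor shares its label."""
    correct = 0
    for i, query in enumerate(embeddings):
        best_j, best_d = None, math.inf
        for j, other in enumerate(embeddings):
            if i == j:
                continue  # never compare an item with itself
            d = math.dist(query, other)
            if d < best_d:
                best_j, best_d = j, d
        if labels[best_j] == labels[i]:
            correct += 1
    return correct / len(embeddings)

# Toy data (hypothetical): the last "cat" sits closer to the dog group,
# so its nearest neighbor has the wrong label.
embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (2.0, 2.0)]
labels = ["dog", "dog", "dog", "cat", "cat", "cat"]
acc = nearest_neighbor_accuracy(embeddings, labels)
print(f"1-NN accuracy: {acc:.2f}")
```

A higher score suggests the self-supervised embedding space groups images of the same class together, even though no labels were used to train it; the labels are needed only for this evaluation step.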
0:13:00.8 RS: Okay, so the unsupervised portion is just helping customers select data that they would then label and treat as any other sort of data?
0:13:10.5 IS: Exactly, that's where we started and where we see the biggest impact in the short term, because otherwise it's like a black box, you have millions of images, but you have no clue what's in there, if you're lucky, you have some additional meta information, you have maybe if you're doing autonomous driving, you might have other sensors providing information, but from the cameras, for example, it's hard to extract a lot of information if you don't have a trained system on the full data set.
0:13:35.4 RS: Yeah, yeah, that makes sense. I enjoy speaking with experts turned founders, because they are the individuals who have really personally struggled with a problem, as opposed to the first wave entrepreneur who tends to be savvy and can identify a gap in a market. Someone like yourself who has an advanced degree, has worked in this field, you are an expert, you would know what the actual problem is, you have struggled with it. So I'm curious how you came to found Lightly, what were you experiencing that led you to want to solve this problem, to do something about these massive amounts of data that weren't being used properly?
0:14:17.4 IS: So before starting Lightly, we were exploring different ML products, like ML-powered products ourselves, and one of them was working with visual effects companies from Hollywood studios and lots of freelancers. We tried to solve one of the problems they have, which is replacing skies in videos, because sometimes when you do a recording, the sky might be overexposed and might be very bright and you wanna remove that, make a blue sky, make a night sky. So their task is to do this selection of the full background, it's called rotoscoping. And we said, "Well, that's a perfect use case for machine learning, let's tackle this problem." There was no labeled data available, so we had to do the whole labeling ourselves. So there was me basically, with some video editing tools spending weeks doing nothing else than trying to select the background, there was some help by doing interpolation between the frames using the key frame tool, but it really was a lot of work.
0:15:15.2 IS: And I was doing this iteratively, I just labeled a few sequences and then tested the model and figured out, "Oh, it's really bad at night images and night scenes." And then I figured out, "Oh, I never trained on night scenes, why did I never realize that?" And then I continued iterating and I always ask, "Does it even make sense to annotate these videos or these frames, because maybe I already have enough of similar data in my data set?" So this question came constantly and I looked for different solutions. I didn't find anything, so I tried to solve then my own problem by trying to help figuring out what should you actually label.
0:15:52.4 RS: The cleverness of Igor's approach was now clear, leverage self-supervision to pare down a data set to only the most relevant data, and only then begin the process of cleaning and labeling, but how do you know which data is okay to leave out?
0:16:07.5 IS: So in the beginning, I started doing some experiments, like I took a data set which is fully labeled and I started removing some of the images, and I removed them using various algorithms. So I set up a hypothesis like, maybe we don't need images which are very far apart because some of the images might be outliers and we don't want to include them into the training set. And then I ran some experiments, I saw, "Okay, maybe we need these outliers because maybe it's still interesting to have them, then maybe we can remove some of the near duplicates like very similar images." Like for example, if you work with cats and dogs and you have lots of similar images of the one kind of breed of dog, maybe there's no value in adding more of the same breed, maybe you want to diversify having different breeds. And by doing all these experiments, I figured out that there are some patterns and patterns we can actually use to improve the selected data in the end. And I ran some experiments and saw it works. I showed it to all my friends. Everyone was like, [chuckle] "Oh no, no, no, it doesn't make sense. We need more data, we don't need smart data."
0:17:09.9 IS: There were still a few, like my co-founder, Matt, he was one of the only ones actually believing that it might work, and then we continued iterating there, and now everyone talks about it in research, in the industry, "Hey, it's not about the amount of data, it's about selecting the right data." But back then, even friends working at Google and Facebook said, "I'm not sure about that." [chuckle] People were very impressed by what happened in 2012 with the ImageNet competition and suddenly AlexNet like the first deep learning or one of the first deep learning-based models outperformed everything before, and then others showed, "Hey, if you take a bit more data, the model gets even better." And then this trend started and Google had the data sets, and I think they call it JFT-300, which is 300 million images. Facebook has a data set of Instagram images, which are one billion images, and numbers just keep going up. So in general, more data is better, but not everyone has this amount of data or access to that, so there's still this smart aspect to it.
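The near-duplicate experiment Igor describes can be approximated with a short sketch: walk through the data set and keep an image only if its embedding is at least some minimum distance from everything already kept. The threshold, the toy embeddings, and the function name are assumptions for illustration, not the procedure he actually ran.

```python
import math

def drop_near_duplicates(embeddings, min_distance):
    """Keep an item only if it is at least min_distance away from every kept item."""
    kept = []
    for i, candidate in enumerate(embeddings):
        if all(math.dist(candidate, embeddings[j]) >= min_distance for j in kept):
            kept.append(i)
    return kept

# Toy embeddings (hypothetical values).
embeddings = [
    (0.0, 0.0), (0.05, 0.0), (0.0, 0.05),  # three nearly identical images
    (3.0, 3.0),                            # a distinct image
    (3.02, 3.0),                           # its near duplicate
    (9.0, 1.0),                            # an outlier, deliberately retained
]
kept = drop_near_duplicates(embeddings, min_distance=0.5)
print(kept)  # indices of the images that survive
```

Note that the outlier survives: this rule removes redundancy only, in line with the observation that outliers may still be worth keeping.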
0:18:10.1 RS: So more data is better in the event that you have the equipment, the ability to process it all? Is that part of the issue?
0:18:18.5 IS: I think that, but also on the other hand, more data very often leads to more diversity automatically, probably if you take one billion images from Instagram, you won't find only 100 images just replicated many, many times, you will still find maybe half a billion at least very diverse images, and then the rest may be some copies with a meme text or something on it. It's also depending on the problem you try to solve. If you have very specific niche problems you wanna tackle, maybe you need the one billion images, maybe you need all of them annotated. But in most cases, what we have seen, we tackle problems where you care more about the overall scene or the overall big objects in the scene, where you care about specific things which are in focus or very interesting to your task, then it just makes more sense to not use all of that, it's not necessary.
0:19:13.1 RS: This process of excluding the data in the data set that's not relevant to the current problem. You gave some good examples of like, "Oh, this is an outlier. We should not factor that into the data set." How do you tell the difference between something that's an outlier and irrelevant versus something that is maybe anomalous, and therefore, could be really important to involve in the model?
0:19:34.6 IS: The solution we have is we let the user of our framework decide. Something we very early on decided as a design choice is that we don't wanna make a black box. We want to make a tool where there are input parameters, so the user can tweak the parameters based on the desired outcome. And this is a very good point because maybe you care about the outliers because you want to make sure you have a robust set of test images on which you can evaluate. Does your model actually work on these outliers or not? What happens? Because even in autonomous driving, outliers usually are similar to edge cases, like situations where you wanna make sure that your model performs well, but in other cases, you don't care about them. You just care about the majority main events, and you wanna make sure that these events are present. So it's a bit off-sided. It really depends on the task you wanna solve.
0:20:25.6 IS: You shift a bit of focus from labeling all your data into supervising the unsupervised algorithm to tell you what to label, but you still need to supervise the unsupervised model. And that's a bit the fun thing because most of the approaches in self-supervised learning, they heavily depend on some parameters or design choices you make, like the augmentations. Like, do you care about the colors of the objects in the scene or not? And based on that, the user would pick different augmentations. So basically, what happens now is this whole unsupervised or self-supervised learning is still supervised by a user, but it's not about you labeling now.
0:21:04.5 RS: When you say, in the case of outliers versus anomalies, you let the user decide, what does that look like?
0:21:10.4 IS: So the user can... There are three ways on how the user can select the data. One is based on metadata. They can either upload their own metadata or use the ones we can extract from the images or videos directly. Quick information like, what's the signal to noise ratio, what's the blurriness or sharpness of the images, what's the luminance, what are the different color values? This is one aspect. The second one is in the embedding space. So the user can visually explore the data in two dimensions, can spot clusters, and maybe you'll say, "I really care about this cluster," and you can select it using the mouse, and you can also remove some of the images. So this is the more manual way through the embedding space. And the third one is using algorithms. It's typically called active learning in research. The idea there is that we use the embeddings we can generate, and if you have them, you can also use model predictions to improve the selection process. These algorithms run in the background and they are very scalable. So typically, in the end, you use a combination of all three approaches.
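Two of the three routes above, metadata filtering and active learning driven by model predictions, can be combined in a small sketch. The interview does not say which active learning strategy is used, so this example swaps in entropy-based uncertainty sampling, a widely used heuristic. The field names (`sharpness`, `probs`), the thresholds, and the sample values are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(samples, min_sharpness, budget):
    """Filter by a metadata threshold, then rank the rest by prediction uncertainty."""
    sharp_enough = [s for s in samples if s["sharpness"] >= min_sharpness]
    ranked = sorted(sharp_enough, key=lambda s: entropy(s["probs"]), reverse=True)
    return [s["name"] for s in ranked[:budget]]

# Hypothetical samples: sharpness in [0, 1] and model class probabilities.
samples = [
    {"name": "img_a", "sharpness": 0.9, "probs": [0.98, 0.02]},  # confident prediction
    {"name": "img_b", "sharpness": 0.8, "probs": [0.55, 0.45]},  # very uncertain
    {"name": "img_c", "sharpness": 0.1, "probs": [0.50, 0.50]},  # too blurry, filtered out
    {"name": "img_d", "sharpness": 0.7, "probs": [0.70, 0.30]},  # somewhat uncertain
]
chosen = select_for_labeling(samples, min_sharpness=0.5, budget=2)
print(chosen)
```

The blurry image is excluded by metadata even though the model is least certain about it, and the remaining labeling budget goes to the images the model finds hardest.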
0:22:12.6 RS: Could you maybe give some more examples of parts of a data set that are maybe irrelevant.
0:22:17.4 IS: So what we see very often is lots of our customers work with video data. And for example, you do monitoring of an assembly line. You have products going over the line in front of the camera, and then very often you might have situations where there's nothing going on because there's currently no product on the assembly line. So there are like empty recordings and you will have them all over the place. And what you do currently to remove them is you maybe subsample the video to only take one frame a second, or maybe you have even humans watching the video to decide when something interesting is happening. The same problem you're facing monitoring traffic, for example, because maybe the intersection is just empty. For a few minutes, nothing's happening, so there's no need to record the empty streets for a few minutes, if nobody is moving around and nothing's happening. And in autonomous driving, you have the same. If you have a car standing in front of a red light and nothing is moving around, there's no need to capture all that data.
0:23:12.0 IS: And at the same time, selecting the right data, it's not a simple problem, because what you can do with videos, you can just say, "Hey, I only pick one frame per second," or another thing people do in autonomous driving is they say, "We pick one frame per meter because we have GPS information. So we can say we only record one frame per meter whenever the car moves one meter." The problem there is you still might lose lots of interesting scenes, like somebody jumping in front of the car. Maybe it's only one frame then, but in theory, it should be like 30 frames. It's some really crazy action, which never happened before, so it's... On a semantic level, if you look at the video, you would see every frame is very different and something is going on, but all those simple techniques would miss that. We see it now even more in other industries, in areas you would not have expected it because maybe there was some issue during the recording and the first couple of minutes of a video sequence are a bit weird and you don't wanna use them or maybe there were some glitches which repeat themselves over time, and very often there are some events you maybe don't wanna keep because they are very frequently appearing again and again.
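The contrast between fixed-rate sampling and content-aware selection can be made concrete with a toy sketch. It assumes each video frame already has an embedding and keeps a frame only when it differs enough from the last kept frame. The threshold, the 2-D embeddings, and the function names are illustrative assumptions, not a description of Lightly's product.

```python
import math

def select_by_change(frame_embeddings, min_change):
    """Keep a frame when its embedding moved at least min_change from the last kept frame."""
    if not frame_embeddings:
        return []
    kept = [0]
    for i in range(1, len(frame_embeddings)):
        if math.dist(frame_embeddings[i], frame_embeddings[kept[-1]]) >= min_change:
            kept.append(i)
    return kept

def select_fixed_rate(n_frames, every):
    """Baseline: keep every `every`-th frame regardless of content."""
    return list(range(0, n_frames, every))

# Hypothetical clip: a static scene (frames 0-5), a sudden event (6-8),
# then the static scene again (9-11).
frames = [(0.0, 0.0)] * 6 + [(2.0, 0.0), (4.0, 1.0), (1.0, 3.0)] + [(0.0, 0.0)] * 3

by_change = select_by_change(frames, min_change=1.0)
by_rate = select_fixed_rate(len(frames), every=5)
print("content-aware:", by_change)
print("fixed rate:   ", by_rate)
```

In this toy clip the fixed-rate baseline skips the event entirely, while the change-based rule keeps every frame of it and drops the redundant static frames.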
0:24:22.5 RS: I came into this interview expecting a debate on the virtue of supervised versus self-supervised learning, and no doubt that debate will rage on in the hallways of various universities and comment threads. The most honest approach is often, "It depends," and Lightly's hybrid model of applying self-supervised learning in order to enable more efficient supervised learning with just a smack of human in the loop when it comes to edge case selection, feels like the center of a complicated Venn diagram. Since Lightly's customers are all AI practitioners in a variety of industries, Igor has unprecedented insight into some of the AI tools being brought to market. I indulged that childlike wonder you heard earlier and asked which of them are doing the coolest things.
0:25:10.2 IS: We saw one company doing monitoring of animals like cows, for example, and then suddenly we saw similar companies doing it for other animal types, like we have video monitoring solutions for chicken farms. We had video monitoring solutions for even the fishes underwater, and then even more and more for pet farms and everything. So you have this simple concept of video monitoring and suddenly you have this huge possibility of different use cases, then you have the same with humans. You have monitoring of traffic, you have monitoring of factory workers, you have monitoring in hospitals, you have monitoring in different buildings, streets. It's just a simple concept, [chuckle] but this is really crazy.
0:26:00.0 RS: Video monitoring has a vast amount of applications, and I'm learning that tends to be the case for every new AI application we dream up, like gesture recognition or combining multiple types of supervision or even optimizing the amount of patients you can get into an emergency room. But that's the next time on How AI Happens.
[music]
0:26:25.3 RS: How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video, and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, e-commerce, media, medtech, robotics, and agriculture. For more information, head to sama.com.
[music]