How AI Happens

Upleveling Data Labeling with Sama's Jerome Pasquero

Episode Summary

As a Senior Product Manager at Sama, Jerome Pasquero understands the power of data, and he joins us today to share a wealth of knowledge on how better annotation ensures better models.

Episode Notes

 


Tweetables:

“Most of the successful model architectures are now open source. You can get them anywhere on the web easily, but the one thing that a company is guarding with its life is its data.” — Jerome Pasquero [0:05:36]

“If you consider that we now know that a model can be highly sensitive to the quality of the data that are used to train it, there is this natural shift to try to feed models with the best data possible and data quality becomes of paramount importance.” — Jerome Pasquero [0:05:47]

“The point of this whole system is that, once you have these three components in place, you can drive your filtering strategy.” — Jerome Pasquero [0:14:06]

“You can always get more data later. What you want to avoid is getting yourself into a situation where the data that you are annotating is useless.” — Jerome Pasquero [0:17:30]

“A model is like a living thing. You need to take care of it otherwise it is going to degrade, not because it’s degrading internally, but because the data that it is used to seeing has changed.” — Jerome Pasquero [0:25:49]

Links Mentioned in Today’s Episode:

Jerome Pasquero on LinkedIn

Jerome Pasquero Blog: Top 10 Data Labeling FAQs

Sama

Episode Transcription

EPISODE 30

[INTRODUCTION]

[00:00:00] JP: What excites me the most about the field of AI is that it involves a new type of job that's guaranteed to evolve at, I'd say, an accelerated pace over the next decade. That's the job of the annotating agent. 

[00:00:16] RS: Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists, and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we're about to learn How AI Happens.

[INTERVIEW]

[00:00:44] RS: Joining me today on this episode of How AI Happens is the Senior Product Manager at Sama, Jerome Pasquero. Jerome, welcome to the podcast. How are you today?

[00:00:52] JP: Hey, I'm well. Thank you, Rob. I’m really happy to be here.

[00:00:55] RS: I'm so pleased you're on the show because you are an internal subject matter expert at Sama. As soon as I kicked off this podcast, I knew eventually I'd have to have you on the show, because you are just so gracious, I would say, with your knowledge and helping out everyone around the team to kind of uplevel their own conversations and the way they think about AI. So I'm really excited to do so here on the podcast. Thanks again for being here.

[00:01:18] JP: Yeah, sure.

[00:01:19] RS: So rather than just kind of stumble my own way through your background, for the folks at home, would you mind, just to set some context, sharing a little bit about your background, how you got into the space, and how you landed in your role at Sama?

[00:01:33] JP: Yeah, of course. So I'm an electrical engineer by training. I used to study actually the sense of touch with computers, which is a field called haptics. A completely different domain today. But I've always been interested in AI from, I'd say, my early university years, and even before, and as a matter of fact, I took an AI class, believe it or not, in 2001. We were using a book that's still used today by Russell and Norvig, if I remember well, and it was covering neural networks, and not deep neural networks, but neural networks and reinforcement learning. 

After that, my career took me to BlackBerry, where I did a lot of work in human-computer interaction. I did a lot of work on physical keyboards. And in 2016, there was this new and exciting company in Montreal, all around AI, that was co-founded by Yoshua Bengio, one of the godfathers of deep learning. For me, it was a unique opportunity and I jumped at it, and I ended up working there for about four years.

Over there, I was in charge of kick-starting new teams and new products. I think I was actually the first product manager over there, but they didn't give me that title from the get-go, because it didn't exist. There wasn't a role like that internally just yet. And then four years later, after that company got acquired by an American tech giant, I started looking for a way to continue working in AI, my passion, but with maybe a social impact side. Lo and behold, Sama was growing in Montreal, my hometown. I've been at Sama since then.

[00:02:59] RS: I'm assuming you could have wound up at a lot of different companies. It's a very in demand skill set. Why Sama?

[00:03:07] JP: I was really attracted by the social mission. That's definitely something that resonated with me. And when I started talking to people internally, that was also a big reason for them being there. Another part of it is that at Sama, we work very close to the data, the training data itself, which is the fuel of AI. I really thought that this is where we could have the biggest impact, both in an enterprise setting and in a social impact setting, over the next decade. It combined a lot of my interests.

[00:03:40] RS: That actually leads into a little bit what I wanted to speak with you about today, which is the challenges of annotation, particularly in retail, but it's more far reaching. I don't think it's just specific to retail. I think anyone who is worrying about data annotation will identify with some of the challenges. Related to why you chose Sama, in addition to social mission, it was an impactful space. As you're putting together your thoughts for this eBook, what are some of just like the high-level challenges people are facing when it comes time to make annotation meaningful? What is stopping people from creating effective data that they can train their learners with?

[00:04:15] JP: Yeah, I think before we get into the importance of data, it might be a good idea to cover a shift that's occurred over the last few years, going from a race to come up with the best AI model architectures to everyone realizing that data has a much bigger impact on your models than previously thought. The story starts with the fact that the boom in AI for computer vision, and other fields as well, really emerged from academia and pure research labs that were competing for publication in prestigious conferences and journals.

But in academia, what's important to understand is that in order to be able to compare one solution, in this case one model architecture to the next, you have to do it on the same grounds. In AI, those grounds are the datasets that are used, right? So everyone uses, to this day, the same datasets that are publicly available. In fact, if you don't use those datasets and report on the performance of your models on them in your paper, you just won't get accepted.

So naturally, everyone was, and still is, trying to come up with better and more clever model architectures rather than focusing on the data, which is the common ground. To be fair, that's not to say that we didn't get great advancements from this. Probably in combination with access to more computational power, that led to a lot of progress, and modern architectures did really improve overall performance.

Now, fast forward a couple of years and transpose all of this to the business world, where every company is unique and deals with very specific problems that it wants to solve. Well, it's a completely different reality, right? Most of the successful model architectures are now open source. You can get them anywhere easily. But the one thing that a company is guarding with its life is its data, right? Now, if you consider that we now know that a model can be highly sensitive to the quality of the training data used to train it, well, there's this natural shift to try to feed models with the best data possible, and data quality becomes of paramount importance.

[00:06:22] RS: Is this still the case in academia, that there's a foundational data set you have to use in your research to be accepted?

[00:06:29] JP: It's still the case very much today. And it makes sense, because otherwise you would be comparing apples to oranges, right? So you do need to have those datasets that everyone uses to compare model performance.

[00:06:41] RS: Is that limiting?

[00:06:43] JP: It's definitely limiting, because it doesn't translate to the reality in a business, where everyone has different data that they're not willing to share, because it is one of their most valuable assets. And also, what companies are trying to achieve, their use cases, are going to be different from one company to the next.

[00:07:02] RS: There was this early emphasis on generating the best architecture, as opposed to the quality of the data that will be fed through the architecture. Is this just a chicken and egg situation? Was the idea that, no matter how good your data is, if the framework for processing it and giving you feedback is fraught, it doesn't matter, so you needed to get the architecture right first? Why would you say the emphasis was on architecture rather than data in the beginning?

[00:07:26] JP: Well, first of all, there was a potential for a lot of progress on the architecture side, so it was exciting to be able to come up with a new, more powerful architecture that would bring better results than the previous architecture, again, on the same dataset. At the same time, access to computational power, which is so crucial in training those models, because they're big and they require GPUs or even dedicated hardware, was growing. So people had a tendency to just want to make things better with a better architecture. It's also more prestigious to come up with a new architecture than to say, "I figured out a way to select the best data, annotated in the best way possible, and do this in a tight feedback loop to improve the model performance."

[00:08:12] RS: So a lot of architecture is open source. If you have a GitHub account, some of these frameworks can almost be looked at as a commodity. Do you suspect the same will happen with data over time?

[00:08:23] JP: Yeah, that's a really good question. I can't really predict the future any more than you can, right? Especially in a fast-paced field such as AI. But what I will say is that if training data ever becomes commoditized, as model architectures have, as you pointed out, I think it will be because we are in a very different reality. Today, a big part of what makes a company's competitive advantage really resides in its data and what it can do with it. So we'll be in a different world if that ever happens.

That's not to say that there are no other areas where you can get a competitive advantage. For instance, if you have access to an enormous amount of computational resources, which we touched on earlier in this conversation. As a matter of fact, this is already happening today in some respects. For instance, in natural language processing, some of the models have so many parameters and require so many computational resources, the GPUs and the memory and so on, that only a handful of deep-pocketed companies are able to build them, or even reproduce the work of someone else. So that's another path where we can see a differentiator.

[00:09:28] RS: Right, yeah, makes sense. You say you can't predict the future any better than me, but I would contend marginally better than me, considering your background, right? That sounds reasonable. I agree with you. The world's going to change in ways that we don't fully understand. But the challenge, I think, will be in making data meaningful. If you can get access to a raw set of data, there are still no guarantees that it is curated and tailored in a way that will result in meaningful insight, which kind of brings us to annotation. So when you think about how to begin prioritizing annotations, turning data into something that you can use, where do you start?

[00:10:07] JP: Excellent question. It first stems from the realization that not all data, or not all annotated data, is of equal value to a machine learning model. I think that's pretty easy to understand with an extreme example, but it's worth going through. Say you're training a model to detect pedestrians crossing the street, for instance, a simple model. If you train it with images of pedestrians always crossing at the same intersection, over and over again, and you annotate all of these, well, your model isn't going to be good at generalizing. It won't perform well, and it will struggle to detect pedestrians crossing a different street with a different camera angle, right?

So that's the baseline: not all data is the same. Once you've realized this, the question becomes, how do you pick what data should be annotated? At Sama, for instance, we have clients who have tens of millions, hundreds of millions, billions of assets, so images, videos, LiDAR, and everything. They can't afford to annotate all of this. So one thing some of them do is actually manually filter through that data. They might use some of the metadata to help them navigate the whole data space and pick the assets they think are the most relevant. And understand that it's a very laborious process, and there isn't even a guarantee that the selected images or assets have maximum value to the model.

That brings us to the next step: what is a better way to do so? The first step toward that better solution is actually to define your filtering goal from the get-go. And that filtering goal, going from tens of millions of images to a subset that you want to annotate for training data, can vary a lot. At one end of the spectrum, you might want to get a subset of your data that is well balanced. So you want a subset of data that is really representative of your whole pool of data and that has maximum information, right? That's one potential filtering goal.

Or another one, at the other extreme of that spectrum, is the complete inverse. You're like, "Oh, I really want to get better at a few classes that my model is supposed to identify, because on these classes it's not doing great." So think of, for instance, identifying a footbridge. My model is really bad at identifying footbridges. Can you find, in all the unlabeled data that I have, images that are very close to or look like footbridges? Then I can send them to the annotators and use that training data to improve the next iteration or generation of my models.

So once you've defined that objective, the key is really to put in place a tight feedback system that links three main components that are easy to understand. The first one is your model. The model is part of the equation because that's what you're trying to optimize, so it's important that it's one of these three components. The second one is a system that allows you to do that data filtering, or data curation. That's at the heart of it: how do you pick those images? And the third one is the annotation pipeline. Once you have filtered the data down, you need to get it annotated and the model retrained.

Of course, the annotation pipeline is still a very manual process; it's the most labor-intensive part of the whole thing. So you have these three components. What do you do then? You start from the entire pool of unlabeled data, and you preprocess all of it, or as much of it as possible, to create something we call embeddings. It's not important to understand exactly what an embedding is, other than that it's a compressed representation of the features your image holds.

It's a way to represent an image in a compressed form, and the reason this is important is that you can then do simple math, basically, with those compressed representations, so you can do operations such as similarity searches. With a similarity search, it goes back to the example I was giving before: find me other images that have something that looks like a footbridge, and go through my whole data set. That can be done very quickly once you've preprocessed your data and computed the embeddings.
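To make the similarity-search idea concrete, here is a minimal sketch in Python. It assumes the embeddings have already been computed by some model; the array shapes, the random stand-in data, and the choice of cosine similarity are illustrative assumptions, not a description of Sama's actual tooling.

```python
import numpy as np

def cosine_similarities(query: np.ndarray, embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query embedding (D,) and a pool (N, D)."""
    query = query / np.linalg.norm(query)
    pool = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return pool @ query

def top_k_similar(query: np.ndarray, embeddings: np.ndarray, k: int = 100) -> np.ndarray:
    """Indices of the k unlabeled images whose embeddings are closest to the query,
    e.g. the images that look the most like a known footbridge."""
    scores = cosine_similarities(query, embeddings)
    return np.argsort(-scores)[:k]

# Usage sketch with random stand-ins for real embeddings
rng = np.random.default_rng(0)
pool_embeddings = rng.normal(size=(10_000, 512))   # precomputed over the unlabeled pool
footbridge_query = rng.normal(size=512)            # embedding of a reference footbridge image
candidates = top_k_similar(footbridge_query, pool_embeddings, k=50)
```

The selected indices would then be handed off to the annotation pipeline.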

That's one example. Another example, at the other extreme, is, instead of finding images that are very similar, finding the images that are most different, where the content is as differentiated as possible. The reason here is that you want to get as much diversity in your data as possible, so that your model gets better across the board, basically.
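The diversity goal can be sketched in the same style. One simple approach, assumed here purely for illustration, is greedy farthest-point sampling in embedding space: repeatedly pick the image whose embedding is farthest from everything already selected.

```python
import numpy as np

def farthest_point_sample(embeddings: np.ndarray, budget: int, seed: int = 0) -> list[int]:
    """Greedy max-min selection of `budget` diverse images from an (N, D) embedding pool."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(embeddings.shape[0]))]       # start from a random image
    # Distance from every image to its nearest already-selected image
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(budget - 1):
        next_idx = int(np.argmax(min_dist))                   # the most "different" image left
        selected.append(next_idx)
        new_dist = np.linalg.norm(embeddings - embeddings[next_idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected
```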

The whole point of this system is that, once you have these three components in place, you can really drive your filtering strategy, right? You filter down to a subset of all the images that you have, you send that for annotation, you retrain your model, and then you use the predictions of your model to inform the next iteration of filtering, annotation, retraining, et cetera. Some people call this active learning, and that's a totally valid way of calling it. But the point here is that the goal of setting up this tight loop between the model, the data filtering, and the annotation component is that you are never annotating images or assets that are not aligned with your model improvement strategy. You're not paying for useless information, right? You're doing it in a very strategic and surgical way to get the best results possible, as fast as possible.
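Schematically, the loop described here could look like the sketch below. Every callable passed in (the filtering strategy, the annotation step, training, and evaluation) is a hypothetical placeholder for your own components; this is an outline of the control flow, not a reference implementation.

```python
from typing import Callable, List, Tuple

def active_learning_loop(
    model,
    unlabeled_pool: List,
    labeled_set: List[Tuple],
    select: Callable,      # filtering/curation: (model, pool, budget) -> subset of the pool
    annotate: Callable,    # annotation pipeline: subset -> list of (asset, label) pairs
    train: Callable,       # (model, labeled_set) -> retrained model
    evaluate: Callable,    # model -> metrics
    rounds: int = 5,
    budget: int = 1_000,
):
    """Tight loop: filter -> annotate -> retrain -> use the new model to filter again."""
    for round_idx in range(rounds):
        batch = select(model, unlabeled_pool, budget)          # pick only high-value assets
        labeled_set.extend(annotate(batch))                    # the manual, human step
        unlabeled_pool = [x for x in unlabeled_pool if x not in batch]
        model = train(model, labeled_set)
        print(f"round {round_idx}: {evaluate(model)}")         # stagnation => revisit strategy
    return model
```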

[00:15:16] RS: Yeah, that makes sense. It all is in service of this idea that more data is not necessarily better. It strikes me that for a long time, data acquisition has kind of been an arms race, right? Where it's like, just get more, we have this many billion data points that we're putting through our algorithm. But you seem to err on the side of a more heuristic approach where like, let's be smart about which data specifically we're processing. Let's phase out all the stuff that's not relevant to our own objectives. So I'm curious when it comes to setting your filter goals, you have to pose a question that you're hoping that the data is going to answer for you. How do you set filter goals? Looking at the use case of your technology, what the outcome is you want, how do you sort of hone in on the constraints you should be putting on the data and the questions you should be asking of it so that the outcome is something relevant to you?

[00:16:09] JP: Really, I think that should be informed as much as possible by the business side. I don't necessarily mean that business people have to look at it, but just look at the end user experience and look at where it's failing, where it has flaws, right? From there, decide what you are going to use your resources on over the next iteration of filtering, annotation, and retraining of the model. It's all about opportunity cost here. What makes the most sense?

So one of these could be, "Oh, my model is really doing badly on some classes that are very important to my end user." So I'll focus on that. I'll make sure I can gather or find as much data as possible that is close to those classes, or part of those classes, get it annotated, and retrain the model so it gets better for those particular ones. Or another business decision may be, "Oh, it looks like our model is really not that great. It's just under the threshold of what would make our users trust it." So that means we really need to improve its performance across the whole board. Let's do another pass at going and getting more data.

Just to be clear, everything else being equal, more data is better. There's no doubt. It's just that sometimes you can't afford to have an infinite amount of data, so you have to be smart about which data will have the maximum impact on your model performance in the shortest time span, right?

[00:17:32] RS: That makes sense. So in a perfect world, where you had infinite resources, infinite computational resources, infinite annotators, then yeah, more data would always be better. But for the rest of us grunts, who maybe have some limitations on those resources, you just have to prioritize, right?

[00:17:46] JP: The other thing that I think is important here is, you can always get more data later. What you want to avoid is getting yourself into a situation where you're actually annotating data that is useless.

[00:17:57] RS: How would you know?

[00:17:59] JP: Well, thanks to these filtering techniques and that tight loop that forces you to go in rapid iterations, you're never annotating very long without seeing the results on your model performance. If those results are stagnating, you've either hit the maximum performance that you can achieve with a reasonable amount of data, or you're doing something else wrong, and it forces you to revisit your whole strategy and your whole design.

[00:18:25] RS: I'm glad you brought up the feedback loop there. You prioritize and annotate, you feed it through, you measure results, measure accuracy, and then you tweak it and do it all over again, right? So first, how do you measure accuracy? When you look at the output, and now you're going to compare, okay, how relevant was the data I selected? Did I prioritize well? One, how do you measure the accuracy? And then how do you prevent degradation? Or I've also heard it called drift over time?

[00:18:54] JP: Excellent question. Let's divide it into two parts. How do we measure accuracy, to start with? That's an easy answer, though I have a little twist that I'll add at the end. It's easy to focus on the hard metrics to assess accuracy. I mean, there are well-established ways to measure model accuracy. For instance, people will talk about the F1 score, which combines the precision and recall of your model, to see how it's doing on true positives, false negatives, and false positives.
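For reference, precision, recall, and F1 come straight from the counts of true positives, false positives, and false negatives. The counts in the usage line below are made up purely to show the arithmetic.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard detection/classification metrics from raw counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of everything predicted, how much was right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of everything real, how much was found
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# e.g. 80 correct detections, 10 spurious detections, 20 missed objects
print(precision_recall_f1(80, 10, 20))   # -> (0.888..., 0.8, 0.842...)
```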

So there are really well-established ways of evaluating model performance, and it's crucial to be able to measure and track the evolution of these types of metrics over time. But at the end of the day, things such as precision and recall are not processed, and I mean processed in the sense of a cognitive process here, directly by the end user. That's not what they experience, right? Either consciously or unconsciously, the users of your system, and I'm always one of them, are trying to get something accomplished. They're really task-oriented most of the time, and whether or not they were able to accomplish this task with a high degree of perceived efficiency, or whether they trust the model's predictions, for instance, is what ultimately really matters.

So developers, in my opinion, should always keep track of that end game. Get the metrics. They're important; they're necessary, but they're not sufficient. If possible, I would recommend asking users for feedback within the interface whenever possible, maybe not all the time, but from time to time. If it's too awkward to ask for user feedback, then as an AI developer, you should play the role of the user as frequently as possible. Try the system. Try to break it. Try to think of new ways of achieving a task.

Other solutions include periodically taking some production data and annotating it as well, so that you can retrain the model. Always make sure that the training data you use is up to date. And I'd say, apply proper usability testing methods.

Now, it's not because the system has an AI component that everything is magic and that, all of a sudden, usability and user experience monitoring should be completely neglected. On the contrary, I think that, in fact, you should really double down on those to make sure that, in the end, it's really the user experience that shines.

That was for measuring accuracy. I think the second part of the question alluded to a problem that we see in production called data drift, right? So first, I think it's important to realize what data drift is, and it all comes down to the fact that your model is only as good as the data that you used to train it. It can't be better. It can't learn new things that it has never seen before. So the performance of your model in production is undoubtedly going to degrade sooner or later. It's going to happen. It's going to lose some of its predictive power slowly as the data changes. That's just because the data that your model is seeing in production is itself going to change.

For instance, let's give a concrete example here. A retailer might introduce new products to its product line, right? If your model was not trained on images of these new products, there is no way it's going to be able to recognize them properly, or classify them properly. The only way to make sure that your model performance doesn't drift is actually to detect that statistical change in the input data. Once you've detected that there's data drift, you can add more relevant or recent labeled data to the training batch, retrain the model through another training phase, and then compare the performance. If you're happy with the performance, put it up in production.

Clearly here, the key is to detect data drift, and that is not obvious. How do you catch that? There are methods to compare the statistical properties of the data that is passing through your production model to the data that you used for training, and through those statistical tools, you might be able to detect that they're quite different, and that you need to label some more current data and retrain. But while it's a necessary step, it's often not sufficient, right? Some experts typically also recommend monitoring proxy, or correlated, metrics to judge whether a model is slowly degrading.
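One simple form such a statistical check can take, sketched here as an assumption rather than a description of any particular production system, is a two-sample test comparing a signal computed on training-time data against the same signal in recent production traffic. The threshold and the synthetic numbers are placeholders.

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted(train_values: np.ndarray, prod_values: np.ndarray, alpha: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test; a small p-value suggests the
    production distribution no longer matches the training distribution."""
    result = ks_2samp(train_values, prod_values)
    return result.pvalue < alpha

# Usage sketch with synthetic numbers standing in for real feature or confidence values
rng = np.random.default_rng(0)
train_conf = rng.normal(0.8, 0.05, 5000)   # e.g. prediction confidences at validation time
prod_conf = rng.normal(0.7, 0.08, 5000)    # confidences observed in production
print(drifted(train_conf, prod_conf))      # -> True: investigate, relabel recent data, retrain
```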

Now, when you're doing something such as search relevancy, for instance, that's quite easy, right? What are those metrics? Well, just keep an eye on whether users are selecting the top results that your search engine is returning. As long as they're selecting those top results quickly, you're in good shape and the data hasn't drifted, at least not in a significant way. But in other cases, say the application is one where the model is part of a computer vision system that's assessing all the items that you have in your shopping cart.

Let's take this example. This is at a brick-and-mortar grocery store, and you're going through a cashierless checkout experience. But here's the kicker: your data has drifted. One cereal manufacturer has changed its packaging, and you don't know about it. Now the model is confused; it's confusing a cereal box from this manufacturer with a much more expensive brand. So it's giving you an answer, and it's a totally plausible answer, but it's just wrong. How are you going to capture that change? I think one of the best ways to capture that change is actually to rely on user feedback.

Again, don't forget that there are often end users at the end of a model, and what really matters is their user experience, and there's nothing that I hate more than having to pay more than I should. So what matters is the user experience, not some aggregate number that assesses the model performance as a whole. Listen to your users is, I think, the bottom line here.

[00:24:52] RS: I'm actually having some trouble with the word drift, and maybe it's because I misunderstand the concept, but it strikes me that, as you said, your model is only as good as the data you feed through it. If you are noticing drift, then there's some problem with the data or your model, right? It's not like, "Oh, how could this happen?" It's just a result of poor annotation, poor relevance in the data, or a poor architecture you developed, right? Should we call it drift? Or is that not unexpected?

[00:25:22] JP: No, I would say actually that data drift is not an indication of bad architecture or that you've done anything wrong. It's a natural outcome. Businesses evolve, product lines grow, and things change, right? That's just as true in real life as it is in business processes. So sooner or later, whatever data was used for training your model is no longer representative of the data that's coming in in production. That is completely agnostic to the model architecture that was used. You just have to realize that it's going to happen and put in place the right mechanisms for catching it and fixing that problem, which you will have to fix over and over. A model is really like a living thing. You need to take care of it. Otherwise, it's going to degrade, not because it's degrading internally, like nothing's happening to the weights within the neural network or anything, but because the data that it's used to seeing has changed.

[00:26:22] RS: That makes sense. In that example, though, of evolving business needs, goals, the original model, and the original data is doing what it was intended. It's more like the goalposts have moved, as opposed to there's degradation. It's working as designed. It's just that the target is different, right?

[00:26:39] JP: Yeah, you're right that in that sense, it might make sense to call it something other than degradation. The model itself isn't degrading; the weights within the model are not changing. It's really the fact that the reality, the context, in which that model is operating has evolved. And a model doesn't adapt on its own unless it retrains and goes through another training session with new data. Then, eventually, it can adapt, in this case, to the ever-evolving situation, the context.

[00:27:12] RS: Yeah, that makes sense. I'm being pedantic. But you'll forgive me because I do talk for a living. So bear with me a little bit on that one.

[00:27:19] JP: No worries.

[00:27:20] RS: This process of measuring degradation, drift, we'll just keep that terminology for the time being rather than start a vocabulary revolution here on the podcast. There's always going to be this degradation, not just in terms of business needs changing. But as you hone in on the outcome you want, you measure accuracy, you reset, you feed data back in, training data and production data are always going to be different. How do you know that your training data is close enough that you can approximate something that will be meaningful, with the intent of tweaking it over time?

[00:27:54] JP: Yeah, I'd say the first step is actually, when you're generating your training data the first time around, get it at the source. Go talk to the business owners, or the business unit owners, and ask them: what data are you expecting? Does this data already exist, even if it's all a manual process? Can I get it? The second you get it, it's probably going to start drifting away from the production data, but at least you're starting with something that's solid and very close to it. Annotate that data; you can use some of the tools we alluded to earlier to pick which of the data that you got from production has the most information and the most value, to kickstart your model. But start with that.

And then constantly go back to the business. Ask, "Have you seen anything? Are you expecting any changes?" Because often, these changes come from the business. For instance, if the business is adding a new line of products on top of all the products that they have already, well, that will require retraining of the model. Otherwise, you'll get data drift. Another example: if, for instance, you have a model that extracts entities from scanned handwritten forms, and it says, "Okay, this is where the name of whoever filled out the form is, this is where the address is," and so on, and all of a sudden the business decides that they need an extra field in there, so they create version two of that form, which has an extra field.

Well, your model has never seen a form of that sort, so it's going to struggle trying to extract entities from this new field on this new form. So you need to be connected to the business, and you can also use tools to set up alarms that say, "Enough has changed over here that the performance that you could expect originally is no longer valid. It's probably much lower, and you should take a look at what's going on."

[00:29:43] RS: Got it. Well, Jerome, we are creeping up on optimal podcast length here. Before I let you go, I want to ask you to indulge the 12-year-old boy in yourself who just loves AI and loves the space, and tell me, what is it that you're most excited about when you look at the technologies being developed or the applications, within Sama or without? What excites you the most?

[00:30:06] JP: I love this question. And I think I have a different take than most people on this one. One of the main reasons I joined Sama is really because I wanted to work for a company, like I mentioned, that has a positive impact on the world. But I also wanted to keep on working in AI, and more importantly, and I hope you got this from this conversation, to get closer to the data, right? Because I really believe that that's where we can generate the most value for years to come and have the most impact.

What excites me the most about the field of AI, and more specifically the world of annotation, is that it involves a new type of job that's guaranteed to evolve at, I'd say, an accelerated pace over the next decade, and that's the job of the annotating agent. Today, we're calling them annotating agents, because that's what they do, they annotate. But in the future, we'll likely be calling them something else. I don't know, something maybe around the model teacher, for instance. That's because today, they're tasked with doing pretty tedious annotation work, such as drawing a bounding box around a pedestrian, or a vehicle, et cetera.

But as the models evolve, thanks to these annotations that we are using for the training, and as they get better, the annotators are going to be asked to perform tasks that require a high level of cognition, like intelligence, real intelligence, right? Say, stuff like identifying dangerous driving situations, that can be quite complex. Or drawing valuable information from multiple sources of data, things that require a certain level of understanding of the world that it's going to take time for models to get to.

So again, the objective is the same, providing the models with valuable examples, but you see how the role of an agent really shifts from being a pure annotator to someone who's transferring some of their knowledge into a program, basically. That means that the skill level of the agent is always going to have to be several steps ahead of the state-of-the-art models. And to me, that's really exciting, because it just means their jobs are going to be more and more interesting, and I really look forward to the day when we'll be talking about model teachers and not just annotating agents. I believe that our agents today are the most likely to hold those new jobs, because they will have been exposed to all the subtleties and nuances of the data for years, right?

[00:32:29] RS: Yes, as the complexity of the models grows, so too will need to grow the complexity of the individual working on them. The result is the upleveling of an entire labor market, and access to a very advanced technical field granted to people who previously would have had no opportunity in that space. They don't have to have an advanced computer science degree or an EE degree in order to play in this arena anymore.

[00:32:56] JP: Exactly.

[00:32:56] RS: Jerome, this has been a wonderful conversation. Thank you for being here and sharing your expertise with us on the podcast today.

[00:33:02] JP: Thanks a lot, Rob. It's been a real pleasure and a real treat for me.

[OUTRO]

[00:33:11] RS: How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video, and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, ecommerce, media, medtech, robotics, and agriculture. For more information, head to sama.com.

[END]