Fresh from Sama's AI Research Book Club are Solutions Engineer Yannick Donnelly and Sr. Applied Scientist Pascal Jauffret, who join the podcast to unpack the use cases, limitations, and implications of Meta's SAM-2.
Pascal & Yannick delve into the kind of human involvement SAM-2 needs before discussing the use cases it enables. Hear all about the importance of having realistic expectations of AI, what the cost of SAM-2 looks like, and the continued importance of humans when working with LLMs.
Key Points From This Episode:
Quotes:
“We’re kind of shifting towards more of a validation period than just annotating from scratch.” — Yannick Donnelly [0:22:01]
“Models have their place but they need to be evaluated.” — Yannick Donnelly [0:25:16]
“You’re never just using a model for the sake of using a model. You’re trying to solve something and you’re trying to improve a business metric.” — Pascal Jauffret [0:32:59]
“We really shouldn’t underestimate the human aspect of using models.” — Pascal Jauffret [0:40:08]
Links Mentioned in Today’s Episode:
Yannick Donnelly: If you and I were to review the same video and we had to caption it and explain what was happening, how would you describe it versus me? How can we get to the same conclusion but use different words?
Pascal Jauffret: How can you leverage different cultural biases? Like, we've done experiments with voice, and people see things in different ways. Can we leverage this? Is there value for the models in that? There's really something interesting on this front.
Rob Stevenson: Multimodal is all fun and games until you appreciate that there's no such thing as an objective experience of reality. Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we're about to learn how AI happens. Hello, hello, podcast land. Welcome back to another classic installment, I'm sure, of How AI Happens. It's me, your host, Rob Stevenson, here at the helm once more to bring you the sweet, sweet sounds of AI, ML, computer vision, and generative expertise and advice. And today we have a very special episode for you. It is not our typical interview session, because I'm bringing in some of my friends from Sama and we're going to speak about Meta's Segment Anything 2. Now, SAM 2 dropped at the end of July, and here we are, it's early December. Why now, you may be wondering? Why did we not do an emergency podcast episode the day it came out, with maybe an all-caps title and a YouTube thumbnail of me doing the Home Alone face? I guess bounding boxes covering my face would make sense in that context. And the reason is, number one, this is not that kind of show. And number two, when a new model drops, there is so much speculation and guesswork happening around the power of the model, the use cases, the practicality, the cost, and all of these things tend to become more clear once people get their hands on it and some time passes. Now, my friends over at Sama, who I just mentioned, regularly hold what amounts to basically an AI research paper book club. And in a recent meeting they discussed SAM 2 at length, and me being the enterprising content creator I am, anytime someone says something interesting, I automatically ask myself, should we be recording this? And luckily, two of Sama's amazingly bright and experienced subject matter experts agreed to continue the conversation here on the podcast. So let's meet them right now. I'll bring them out. First up we have Senior Applied Scientist over at Sama, Pascal Jauffret. Pascal, welcome to the show. How the hell are you?
Pascal Jauffret: Hi, I'm well, I'm fine. And you?
Rob Stevenson: I am great. I'm really excited about this conversation. Can you tell that I have this excitement for this episode? I would love it, just before we keep going here, if you could share just a little bit about your background and your sort of area of expertise.
Pascal Jauffret: Yeah, for sure. So I'm a senior applied scientist here at Sama. I've been here for the last three years, working mainly on computer vision, single frames, videos, 3D, 2D, all the applied stuff that we do here at Sama. And previously I did a lot of work on natural language processing in bigger corporations in banking, but I've been in various domains, always trying to apply the recent AI research to solve business problems and push the envelope in the end.
Rob Stevenson: Wonderful. So glad you're here with me today, Pascal. Also joining us on today's episode, waiting patiently in the wings but also chomping at the bit to chime in here, is Solutions Engineering Manager Yannick Donnelly. Yannick, welcome to you.
Yannick Donnelly: Thanks for having me. Excited to be here.
Rob Stevenson: Really pleased you're here too. Same question. Would you just mind giving us a little bit of context for who you are and why you're here?
Yannick Donnelly: Sure. So, I come from a software engineering background. I've been mostly working for the past few decades in aerospace, manufacturing and robotics, so there was a lot of automation being done, and some kind of beginnings of ML in my early career. Then I eventually moved into more of the startup scene and started working in conversational AI, different types of AI that could be used in fintech, insurtech, and slowly but surely made my way to Sama to start trying to help companies resolve their data needs. Right? This is what we do a lot of at Sama: helping large enterprise customers get their data annotated at quality to help train the models of tomorrow. So yeah, I've been here for about four years.
Rob Stevenson: Wonderful. And you are also joined by your adorable dog Lily, who may be chiming in at some point, we should call out. Not going to make any effort to edit it out or remove her. She's part of the family, trying to jump.
Yannick Donnelly: And make her bed.
Pascal Jauffret: Part of the show.
Rob Stevenson: Part of the show, exactly. Maybe we'll have Lily back on if she does a great job. In any case, gentlemen, thank you for being here, and let's get into SAM 2, shall we? So Pascal, was it you who sort of brought the research paper to the discussion group?
Pascal Jauffret: Yeah, essentially. So we have this reading group internally that we do every week, and it was my turn, and this had been on the back burner for a while, so I really took the time to read the paper and present it to our group. It's a fairly technical reading group, but everybody is welcome. And a few conversations emerged, especially with Yannick, out of this presentation. So that's why we're here now, to kind of share back.
Yannick Donnelly: Yeah, I've been playing around with SAM v1 since it was announced a while back. And the reason I was really excited when I saw v2 is because I've been working with a lot of customers who are moving into more video and sequences, which is a big part of what SAM v2 can do: it now kind of does what SAM v1 did, but on a video or sequences. And I was tinkering around, as I do with a lot of different technologies, and I've sometimes been discussing with Pascal and some of the teams to be like, hey, I think we should look into this. I think we should kind of explore possibilities with this technology.
Pascal Jauffret: Should we start by explaining a bit what SAM actually is, or should we just assume everybody knows?
Rob Stevenson: I think probably most people are familiar with it, but let's do it. Maybe I'll challenge you to do a SAM 2 elevator pitch in 30 seconds or less, just to set the table here.
Pascal Jauffret: The SAM 2 elevator pitch is actually fairly easy. It's SAM for sequences.
Rob Stevenson: Wow, short elevator ride. We're not even to the second floor. This is the Segment Anything Model, and it seems SAM 2's advantages are some efficiency gains, fewer interventions, and video is really the splashy addition to it. So something that stood out to me as a non-technical person was that early in the research paper they make the point that video essentially is a sequence of images, that you could just take the video frame by frame, but it's not as simple as processing a video as 10,000 images or whatever. So I was hoping, Yannick, maybe you could explain: what is the difference in getting context from video that means you cannot merely process 10,000 images?
Yannick Donnelly: The kind of takeaway from splitting videos, right, is that when we start breaking them down into frames, they kind of lose the connection between each frame. So you might have a video, and we often talk about videos in frames per second, and the higher the frames per second, the more information is similar between the frames, but once you start breaking them down into individual assets, that connection is gone. So if you do an annotation on one image and then you go to the next frame, which is another image, completely separate, you're kind of starting from scratch again. The idea is that you lose that kind of temporal connectivity, especially, like you said, with 10,000 frames, but that happens even with five frames. The second you start talking sequences, you are now kind of starting from scratch on a new moment in time. And then you have to re-segment, or redo the work from scratch again; things have moved, there's motion, there's blurriness. There are all these things that need to be taken into account when you're trying to interpret what's happening in frame X. The second you go to frame X+1: new context, new world, new time. So you kind of have to reinterpret the world as it is in that moment. So that's where the difficulty is with sequences.
Pascal Jauffret: Yeah. One of the hard things is knowing which object in the new frame is the same as the one in the frame before. It seems like a fairly simple problem when you think about it as a human, but when you're a model, you may not have any kind of assumption like: we're driving on the right side of the road, objects are moving in a specific way with regard to that. Basically, this is the problem of tracking. And if you use just a generic model, say I'm trying to track my hands on a video at a very, very low frame rate: as a human, you know, from information about the human body, which one is my right hand and which is my left. But a basic tracker might not have this information and might get confused, because they are actually fairly similar.
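To make the tracking problem concrete, here is a minimal Python sketch of the kind of naive, frame-by-frame tracker Pascal is describing: greedily matching masks between consecutive frames by overlap. This is purely illustrative and not SAM 2's actual mechanism; at low frame rates, or with visually similar objects like two hands, the overlap signal breaks down and identities get swapped.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum()) / float(union)

def match_objects(prev_masks, curr_masks, iou_threshold=0.3):
    """Greedily assign each current-frame mask to the best-overlapping
    previous-frame mask. Returns {curr_index: prev_index or None}."""
    assignments, used = {}, set()
    for j, curr in enumerate(curr_masks):
        best_i, best_iou = None, iou_threshold
        for i, prev in enumerate(prev_masks):
            if i in used:
                continue
            iou = mask_iou(prev, curr)
            if iou > best_iou:
                best_i, best_iou = i, iou
        assignments[j] = best_i  # None means the tracker treats it as a new object
        if best_i is not None:
            used.add(best_i)
    return assignments
```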
Rob Stevenson: Okay, that is helpful. And I'll link the research paper in the show notes here, and there's a lot more in the paper on the specifics of how that is accomplished by SAM 2, for folks who want to do a deep dive. I am curious, Pascal, when you decided: this is a long time coming, I want to bring this paper to the group. Because Meta's AI research arm is putting out papers every 10 days, it seems; they're shipping a lot. I'm curious why this in particular interested you and why you thought it was topical.
Pascal Jauffret: Yeah, so I'd say it's a mix of both things. When we do our reading group, we like to sometimes just take the latest state-of-the-art model that came out, but the business is also pushing us and we're getting insight from clients. And this model had been there for a while and seemed to be a great opportunity for automation. Yannick had been talking to us about it; it's been in the space for a while. So that's why I decided to take a look at it at that time.
Yannick Donnelly: On the solutions engineering side of things, we try to be a little bit more proactive when we start talking to customers, and we understand that productizing capabilities in the platform can sometimes take a bit of time. We need to have people involved and make the analysis to see if this is the right move to make, and that's why we have these conversations with the larger technical teams. But on my end I still need to move pretty quickly and test things out. And I saw a benefit, right? I saw that there was, in some use cases, traction from the deliveries we were doing utilizing some of this, even though it was running a little bit more ad hoc, and I'm like, I think we need to dig deeper into this. So we're tackling this from multiple angles. I'm maybe more on the practical side of things, and Pascal is maybe more on the research side: how the models are evolving, and what are the cool things being put together by all of these large companies that could benefit the community as a whole.
Pascal Jauffret: Yeah, and those models, like SAM, are really great, I'd say, baselines. Those are tools that can handle generic use cases very well, and this is where it brings some big value for us at Sama.
Rob Stevenson: I see. Now, we can read research papers all day, well, some of us can, but there's nothing like hands-on experience. So Yannick, I'd love to hear from you first. What stood out to you once you got your hands on this tool and really got to play around with it?
Yannick Donnelly: I would say it's just how, out of the box, it's able to do a very, very good job at breaking down, I would say, the edges of objects. Being able to really precisely, I'll take an example, like a leaf that has a lot of little spines or spiky edges: with one click you're able to generate a mask of that object, which usually, manually, if you had to do that by hand, would require quite a bit of time, right? And I saw these kinds of examples coming up in some of the use cases I deal with with customers, and I was like, there is a benefit to accelerating when possible, because again, there are going to be limitations. But if the teams that are using the tools are trained properly to identify when to use the capabilities or not, we can get a pretty big efficiency jump. But again, it's on a use-case-by-use-case basis. And the stars kind of aligned, where I had a use case that could benefit from using it, and the model had come out a couple of weeks before. So I was like, oh, a shiny new model to use and a use case. It was a perfect kind of melting pot to bring both together.
Rob Stevenson: What was the use case?
Yannick Donnelly: It had to do with, like, a closed segmentation use case, right? And we used to do some of that. We used to do it all by hand, and now we've expanded it. And the customer, again, the use case was also perfect: they were taking photographs of their products with white backgrounds, so there's less noise in the image, and this model would perform amazingly, right? Add a noisy background or a lot of texture or whatever, and the model might have a few blips here and there, but the model would support this use case very well. We tried it out and it was like, wow. It kind of showed that we can get a big lift in productivity because of how the use case was set up by the customer and the fact that the model performed very well. And again, this case was for images, so SAM v2 or SAM v1 would also have worked. But as we start ingesting videos, the fact that my pipeline already works with SAM v2 meant I was able to just move it to a new project, and it was seamless. The pipeline was already in place.
Rob Stevenson: I love it when a plan accidentally comes together.
Yannick Donnelly: Absolutely.
Rob Stevenson: Now, to anyone who's ever done annotation, this idea that a mask would kind of be automatically applied, that's probably pretty exciting. That removes a good amount of the manual work. There is, however, this concern of, like, okay, today it's just automatically finding the bounds of the thing you want to mask. Is the end goal of this economizing away that human labor completely? I guess my question is, when you view SAM 2 in particular, how much human must remain in the loop?
Pascal Jauffret: I think there could be two answers to this: there's a philosophical answer and a more pragmatic one. Right now, from what we're seeing from clients, there's always a need for a human in the loop. You need someone to figure out the edge cases. For your generic business case, let's say you're annotating cars: maybe there are specific cars that you want to annotate in a specific way with specific rules, and those rules are not ingrained in the generic tool that Meta provides us.
Yannick Donnelly: I can give you an example of how this model could perform or not perform. For example, we want to annotate a car, like Pascal mentioned. Some customers might say, I want the car to be segmented, but without the mirrors, because we don't care about the mirrors. But SAM won't make the difference between the mirror or not; it will take it with the mirrors. And then how do you tell that model, don't take the mirrors? So you kind of have to do more manual intervention to remove those aspects, and potentially retrain the model so that when it detects a car, it doesn't take the mirrors. And again, when you're looking at the car from the side, you don't care about the mirrors, but when you're looking at it from behind or the front, those things need to be removed. So how do you have a model always interpret those customer requirements? It means that we now have to have a human make those slight adjustments. Right.
Pascal Jauffret: Also, side note, but SAM can take positive or negative examples, so you could give negative examples on the part of the object that you don't want. But to Yannick's point, you do need a human to make the correction, and you need an efficient way to get people to do those corrections and have it be smooth and efficient.
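For readers who want to try the positive and negative prompting Pascal describes, here is a rough sketch using the image predictor from Meta's sam2 package. The config name, checkpoint path, image, and click coordinates are placeholders, and exact argument names can differ between releases, so treat this as an outline rather than copy-paste code.

```python
import numpy as np
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder config and checkpoint paths; download both from Meta's SAM 2 release.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt"))

image = ...  # HxWx3 uint8 numpy array of the frame to annotate
predictor.set_image(image)

# One positive click on the car body, one negative click on the mirror to exclude it.
points = np.array([[480, 300], [560, 250]], dtype=np.float32)  # (x, y) pixel coordinates
labels = np.array([1, 0], dtype=np.int32)                      # 1 = include, 0 = exclude

masks, scores, _ = predictor.predict(
    point_coords=points,
    point_labels=labels,
    multimask_output=False,
)
car_without_mirror = masks[0].astype(bool)  # mask that respects the negative click
```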
Yannick Donnelly: So we're not going towards a world of 100% automation; think about it more as a world of labeling assistance. In a way, we're trying to give the proper tools to humans to make the right decision, and then accelerate things that might require, you know, as dumb as it sounds, clicking all over the place to highlight something. Maybe the idea is to click once, maybe two or three times, to remove or add missing parts, and then have the model handle the rest. We're even taking the temporal aspect of it, which is going to be very important with SAM v2: in the next frames, it keeps those clicks that you did, negatively or positively, to continue tracking those pieces that you might have missed the first time. So that's what's really interesting. I mean, we haven't touched on it yet, but it's that extra layer in SAM 2, which is like a new kind of layer in the model, called memory. Basically, SAM v2 is really just SAM v1 but with a few extra components, and memory is one of the main ones, where it can look back on previous frames where things were annotated a certain way, with certain inputs and maybe positive and negative points, and take that into consideration in the next frame so you don't start from scratch. In its inferencing, it takes into account all that information from previous temporal inputs.
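On the video side Yannick describes, the sam2 package exposes roughly this workflow: prompt one frame with clicks, then let the memory-based propagation carry the object through the rest of the sequence, adding corrective clicks only where the track drifts. Again a hedged sketch: the frame directory, coordinates, and object id are placeholders, and the function names follow the public repo at the time of writing.

```python
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

# Placeholder config and checkpoint paths, as above.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

# A directory of frames extracted from the video (placeholder path).
state = predictor.init_state(video_path="./frames")

# Positive and negative clicks on frame 0 for object id 1.
predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    points=np.array([[480, 300], [560, 250]], dtype=np.float32),
    labels=np.array([1, 0], dtype=np.int32),
)

# Memory attention carries those prompts forward: masks come back for every frame
# without re-clicking; annotators only intervene where the prediction is wrong.
video_masks = {}
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    video_masks[frame_idx] = {
        obj_id: (mask_logits[i] > 0.0).cpu().numpy()
        for i, obj_id in enumerate(obj_ids)
    }
```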
Pascal Jauffret: I'd say one of the main challenges is: okay, you have this fantastic tool that can take positive and negative points, and it's going to generate some mistakes from time to time. How do you provide the best user experience to annotate in the simplest and most efficient way, with complex requirements from clients? How do you actually integrate this and not overwhelm the user, and make it smooth? That's the real challenge, in my opinion.
Rob Stevenson: The complex requirements piece is the operative thing here, I think, because it's segment anything, it's not segment everything. And when you get down to that last X percent, that last 10%, the mirrors example I think is very instructive, right? It's maybe not even economical for Meta to teach it to do that, whereas it would be for a specific client use case to train that last X percent of it. So that feels meaningful when it comes to taking something that is presented in this splashy, shiny demo, which is very fun to play around with, and translating it into a real use case that can work for a company and not just a user who's having fun. Not just the version of my mom using ChatGPT to get a new recipe. This is where it starts to get really, really interesting. So I would love it if you guys could share some of the use cases that you foresee SAM 2 kind of enabling.
Yannick Donnelly: One of the big, big ones that I see, where there's the most added value, is really around tracking, right? A lot of people will boast that you can do linear interpolation: you annotate frame one, you annotate frame 15, and whatever happens between one and 15 will adapt in a linear fashion on a per-frame basis, right? But then on some frames the shape doesn't match anymore, because it's a linear interpolation and not a dynamic interpolation. Now, with these kinds of mechanisms, you can do tracking and also have more, let's say, dynamic resizing, and you only have to annotate maybe a few frames. What happens in between? Because the model can adapt in a more dynamic fashion, it reduces the level of effort for longer sequences. So that's where I see myself utilizing this a little bit more. Another use case that I see is for more precision, or what I call snapping, which is like: you have a traffic sign, you want to put a bounding box on top of it. Well, maybe instead of being super precise in annotating it, you can be less precise and the box will snap, because it detects the object you're trying to detect and creates the bounding box around the mask. It says, oh, this is a traffic sign, and even if it's a triangle or whatever shape it is, it will find the proper sizing. It removes, again, the effort of being super precise with the pixel dragging and going and finding the next click. You can do a rougher annotation and then it does the refinement. So those are the kinds of little efficiency gains you can get from the tooling.
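The snapping idea is simple to express in code: once the model returns a mask for a rough click, the tight bounding box falls out of the mask for free. A small numpy helper, independent of any particular model:

```python
import numpy as np

def mask_to_box(mask: np.ndarray):
    """Tight (x_min, y_min, x_max, y_max) box around a boolean mask,
    e.g. to snap a rough annotation onto the traffic sign it contains."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # empty mask: nothing to snap to
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```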
Pascal Jauffret: I think as those tools evolve and become more and more precise, fast, and efficient, there'll be some kind of baseline tool that you can use in many types of workflows. If you need to draw a polygon, we need to solve for how you can correct a pre-annotation faster than doing it from scratch when it's not perfect. But I think this delta will become smaller and smaller as time goes on, and maybe people will only have to click on the shapes that they want and have an even quicker way to correct the shape if it's not perfect. SAM and SAM 2 kind of try to solve for a lot of things in 2D, like you said, because you can run SAM on a single image and try to generate polygons all over the place and use those as pre-annotations for some of your shapes of interest. You can build interaction into the UX with that also. But in my mind it's always about how you fit with the user and how you provide the best interaction to be the most efficient.
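The "polygons all over the place" idea Pascal mentions maps to the automatic mask generator that ships alongside SAM 2. Here is a rough sketch, again with placeholder paths, of turning its mask proposals into editable polygon pre-annotations; class labels and quality filtering would still come from a human.

```python
import cv2
import numpy as np
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

# Placeholder config and checkpoint paths, as in the earlier sketches.
generator = SAM2AutomaticMaskGenerator(build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt"))

image = ...  # HxWx3 uint8 frame
proposals = generator.generate(image)  # one dict per proposed mask

pre_annotations = []
for p in proposals:
    mask = p["segmentation"].astype(np.uint8)
    # Convert each mask to a polygon outline an annotator can edit instead of drawing from scratch.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        polygon = max(contours, key=cv2.contourArea).squeeze(1)  # (N, 2) array of x, y points
        pre_annotations.append(polygon)
```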
Rob Stevenson: Yeah, the auto-detection, the example you gave, Yannick, that is just such a straightforward quality-of-life thing that you expect from V1 to V2 of a product. So it's just a very cut-and-dried efficiency gain, I think, and it seems like that is what it keeps coming down to for use cases: a productivity and efficiency gain, and then to some degree automation.
Yannick Donnelly: Would you agree that's what customers are asking for? Right? Because they are getting or collecting more and more and more data, and they need more and more data to trickle into their models, and they need some type of labeling. We are in a very interesting time period, where a few years ago everybody was building a little bit more of their own custom models, but now the advent of open source models like the ones Meta has made available has given everyone the possibility of fine-tuning these models, using them through prototypes and whatnot, and realizing that, yes, there is an easier way to automate using state-of-the-art capabilities. But the thing is that there's still an education that needs to happen on the customer side to say: if you try to automate too much, don't expect the same level of quality. We're kind of shifting towards more of a validation kind of period than just annotating from scratch. I think from scratch will still happen sometimes, just to start off a project, but now, more and more, since these tools are available to everyone, people are saying, hey, can I just use this, send you the annotations that I have, and you fix them? And kind of like what Pascal was saying earlier, we are working very hard with our product teams on what that experience looks like for associates to ingest the pre-annotations: visualize them in a way that they can easily edit them, fix them, refine them, so that the main output at the end for the customer is this ultra-high quality, exactly what their model needs. We say that sometimes we do data to train models, but sometimes we also have to deliver extremely high quality because they want to use it as an evaluation dataset. They need to use it as a baseline to benchmark performance, so they can't have bad annotations trickle through and then run benchmarking on bad annotations. They need things that are almost 99% perfect so that they can see, is my model drifting or underperforming and whatnot. And you can't do that, I believe, with SAM right now. But the idea is that it's moving towards that.
Pascal Jauffret: I think it's interesting to think about this whole thing of correcting, of model validation, correcting pre-annotations. We tend to always think about it in the context of a single frame: I have a bounding box and I'm correcting this bounding box, I have a polygon and I'm correcting this polygon. But when you start having sequences and you start having this time dimension, it becomes way, way more complicated. You have objects that are tracked through time; there could be mistakes where objects get inverted. You could have objects that used to be seen in one of your cameras and maybe are seen later in a second one. The mental load of correcting this versus doing it from scratch, there's really something that needs to be figured out here. And it's an interesting problem.
Rob Stevenson: Such a challenge because whether it was more efficient to do it yourself or to go in invalidate, you probably don't know until you've gone in invalidated, right? So you kind of have to climb to the mountaintop yourself to see if it was worth it. Is there a better way to figure out whether this is worth it before you spend all this time figuring it out?
Yannick Donnelly: We've run many, many experiments with customers, and often it comes down to the fact that we deal in complex use cases. We're not telling the difference between a cat and a dog anymore; that's been resolved. But in some of the use cases, especially in autonomous vehicles, the stakes are so high for safety and security that quality sometimes has to hit thresholds of 98%, and relying on models to do a lot of the work gets too tricky. We can't tell the customer, oh, we're going to spend a lot of time to automate all of this process, and again, they usually have very tight deadlines. So we tend to work in parallel to try to find ways to become more efficient. And there are other ways; we talk about SAM v2 a lot, but there are many other ways than just models. It could be optimizing the workspace to reduce the movement of the mouse between selecting attributes, as silly as that. Save five seconds, times 100 million, and it's a lot of hours saved. It's that silly sometimes. Models have their place, but they need to be evaluated. And so far, in my personal experience, we've done about 20 experiments with models, and only about three times did we see a big enough lift for the size of the deal that it was worth implementing and having it regularly running, compared to the others, because of the tight timelines, the potential for error, and the high quality stakes. Some of the customers understand when you explain it to them, and they're like, you know what? No, I want everything from scratch, because I can't have a car potentially hitting a vulnerable road user, anything that's squishy; you don't want that ever happening. All of these models work well, but there's always the element of demo candy, right? I think you mentioned that before, Rob: you go on the Meta website, you click, and all the videos are 4K, high frame rate and everything.
Rob Stevenson: Serena Williams. Yeah, exactly.
Yannick Donnelly: Amazing, right? But then you bring in real data with lower FPS, and the tracking and the segmentation don't work as well, but it's still okay. Then add on top of that the fact that when you deal with real customers, the data is sometimes grainy, the data is dark, maybe lower quality, there's rain, there's snow...
Rob Stevenson: Was cloudy that day. Yeah, exactly.
Pascal Jauffret: There's something really interesting in annotation: when you are in the ML space, you always think of ground truth as something that is absolute and always true, and you don't always fully appreciate how subjective things are. You pay to get one person to annotate it, but sometimes, if you were to ask five different people, you might get different ground truths per se. And there's this whole aspect of consensus that is really interesting. And if I send you a bunch of pixels that are 300 meters away and I ask you, is it a truck, a car, a vulnerable road user, are its lights on, and how much is it occluded, are you going to be able to, I don't know, figure it out?
Yannick Donnelly: It's exactly that. And more and more, we're trying to educate customers that the tool needs to be made available, but never to expect something like: I send you data, and before it even gets to a human, everything will be annotated and all they'll have to do is say yes or no on stuff. No, they want more than that. They want: yes, this is good, this is not good. And at the same time, when it's not good, re-edit it, change the attribute, do all this extra work, without realizing that the level of effort, compared to doing it from scratch, is sometimes just about the same, because you're still revisiting all the objects that are there. And same thing, we're trying to educate them, like, is this really needed? You're giving me a hard time about this one pixel, because there's a button on his shoulder that you want me to include in the thing, but is that going to make or break your model? Because all this back and forth around it is costing both them and us time and money to readjust. And sometimes, yes, they'll have a use case for it; in other cases it will be like the example that Pascal mentioned: maybe we don't need to go 300 meters away, maybe we focus on anything that's just in a region of interest of 50. Redefine the scope of the work that's required, and then we can start thinking: okay, now models in this kind of range work better, now we can utilize a little bit more automation here, not on everything everywhere.
Pascal Jauffret: As I said, I also had a philosophical point early on. If we ever ended up at a point where you just run something through a model and get everything, you'd have a super-model that you could send text into and then get all the annotations you want out of the box. Then why would you want any kind of annotation, and not just use this model directly, if it's perfect? Clients are using this to either evaluate or train their own models. If there is a solution out there that can solve it out of the box, without any kind of human inputting things for the edge cases, then there's no value in training anything, in my mind.
Yannick Donnelly: Another good example: if we had the model and had everything trained on our end and it did all the auto-labeling for them, why even do labeling? Just license our model, or license a version of the model. And some customers sometimes expect, when they come to see us: you don't have a model that does exactly my super niche use case? How come you don't have this already? I'm like, well, it's because clearly nobody else does. You're working on a very unique problem, which is great, but you need to have a human in the loop. The RLHF kind of concept is still present and still going to be very relevant. But I think it's shifting towards more like: let's do a first pass with models. What does the experience look like afterwards for the associates to fix it? And how can our customers also validate: this is what it looks like when a human goes through it, and this is what it looks like when a model does it, and kind of see the evolution. Then they can make decisions like, you know what, the model was good enough; all the extra stuff that was done by the human is better, but maybe our model doesn't need it. And then it's a conversation to have. But the idea here is that, again, client requirements, client expectations, and safety guardrails, that kind of stuff comes into play when trying to define a solution for customers. We have to be honest with them, we have to educate them; we are experts in this field. Myself, I've done over 300 annotation projects that vary from gen AI to sensor fusion to static 2D to sequences and all sorts of vectors, and sometimes we also learn something that we can then share with the customers. The idea here is that we want to be a great partner in this space, but also we want to set realistic expectations in terms of what models can do. I think there's a lot of hype out there that models are going to replace everybody and everything, nobody's going to be able to work, and AGI is going to take over all over the place. There need to be checks and balances, and I think people are realizing that human in the loop is still very important. So let's have a conversation around what your use case is and what we can put in place to really automate. The reality is: what can you do to get the most bang for your buck, basically?
Rob Stevenson: Certainly. And that coaching is necessary to help someone arrive at what they actually need. What is, as you said, Yannick, good enough, or as you said, Pascal, what is consequential? Like, oh, these partly occluded lights 300 meters away, wouldn't it be great if the model could measure that? I guess, but who cares? Does it actually have an effect on the performance? And if not, just cut it. It's that simple. But this is also related to just the utility of an off-the-shelf model. This comes up periodically on the show: models are being commoditized, there are open source models, take one off the shelf and do what you want with it. It will have some cool functionality out of the box, but how relevant will it be to your needs? How relevant even can it be to your needs, I think, is the important thing there. Now we get into the realm of prognostication a little bit, but I would be curious what you both think about that last point. Will models sufficiently advance to the point where there is an LLM you can ask for what you want, and it will automatically be able to address your use case?
Pascal Jauffret: I can speak a bit from my background. I've been working for the last six, seven years in the data science realm, and I feel like it's always a matter of really understanding the business problem that you're trying to solve. You're never just using a model for the sake of using a model; you're trying to solve for something and you're trying to improve a business metric or whatever, and those are always specific use cases that you're trying to solve. Are you trying to reduce the time it takes to annotate? Are you trying to improve the value that you're going to get in whatever other project? From my personal usage, I've had the most value out of LLMs when I'm trying to do something where I also have a bit of expertise: not asking about a coding language that I don't know anything about, or where I'm maybe just grasping the high-level syntax, but inputting something very specific, with the right keywords and the right business knowledge. That gives me the most bang for my buck when I'm using those kinds of tools myself. What I'm trying to say is that I think the model combined with the right expertise is where you'll have the most synergy and the most bang for your buck, in my opinion.
Yannick Donnelly: Correct. And I mean, we could make a couple of analogies, but if I were to take an OpenAI and Anthropic kind of comparison: when you try to prompt these models, ChatGPT will be more upfront about trying to get you an answer right away, like a one-shot type of approach, compared to when I interact with Claude, let's say, which will ask me a couple of questions. It's more of an assistant: oh, you asked me for this, do you want me to go more down this route or more down that route? It's trying to participate in the conversation to get you the best result, because maybe you didn't prompt it well. It's kind of questioning your initial prompts and has more of an interactive approach, and that's how they built their product. ChatGPT seems to be more of a "I ask you this and you give me everything," without having a conversation; just give me everything as I want it. And like Pascal said, with the right training and expertise, you can really extract more out of those tools to get to a certain point. And I mentioned this earlier when we started talking about SAM v2: it's the same thing. You can make these tools for AI-assisted labeling, but it's a tool that just helps the associate work a little bit faster when they know when to use it. The right use case: if they know that the object is well defined, there's good lighting, I can visually see sharp edges on the object, then if I click here I'm going to save a lot of time because I know it's well defined. But an object 300 meters away, fuzzy, not clear where the edges are, maybe I should use another type of tool. So again, educating the tool user on when they can get a benefit or not from applying that technology is where we'll get the most bang for the buck. It's really about training the user properly.
Pascal Jauffret: I think there's also something to be said about knowing, when you're using an LLM, when it's wrong. Sometimes the answer you're going to get is just a generic answer, because you may have asked something very specific. And there's something to be said, like you said, Yannick, about knowing when the tool is not going to be the right tool, and actually solving it yourself rather than trying to over-rely on it.
Rob Stevenson: I think there is a healthy amount of distrust that we still must have with generative AI, because it's going to tell you something; it's never going to say "I don't know." And in the event where it coaches you a little bit to ask a better question, we're getting there, but right now, for example with ChatGPT, you have to ask it to ask you. I've seen that prompt; it's like, evaluate my prompt and tell me how I can make it better. Shouldn't it just be doing that automatically? Anyway, that's a side point.
Pascal Jauffret: It's a bit like, in the end, before ChatGPT you were doing a Google search for the information, and the information might have been on Stack Overflow, might have been in different places, and you still had to apply some critical thinking to what you were seeing. And I think it's the same with LLMs.
Rob Stevenson: Now, when we speak about efficiency and getting the most bang for our buck, I don't think you can have a complete conversation about models like this without speaking a little bit about the economics. Because right now you have these open source models you can take off the shelf; they're free, right? Well, the compute is fantastically expensive. It feels like we are in this V1 version where a lot of it is subsidized by venture capital. In this moment, though, where is the cost of running these models and this compute being borne? Yannick, I'd like to hear your take first here.
Yannick Donnelly: Well, to go back a bit to the beginning of your question, I think there's hope from the industry that compute will get cheaper, and that there's suddenly going to be an unveiling of a new technology that makes these models run on the edge with X number of billions of parameters. And I think the VCs are hoping that there's going to be, officially, a line that gets crossed where it's like, okay, now it gets to a point where we can actually charge the $20 a month to customers and we're breaking even, let's say. And that's already a pretty good line to cross. But at the moment, clearly, with the training, the research investment and everything that goes into it, who's paying for it? Clearly the VCs. Now, for us, on our side, when we want to run these models and offer some of these efficiency gains to the customer, well, there's MLOps cost, there's infrastructure, there are GPUs, and again, that's using an off-the-shelf model. If you want to start fine-tuning the model a little bit, then you have to go into training cost and data acquisition and data pipelines. So there's a lot of stitching together that is kind of like black magic to a lot of folks using some of this free tooling; in reality there's quite a bit of tech beyond just the transformer technology behind the scenes that is making all of this run, and run quickly too, right? You don't want to wait five minutes for a response to show up; you want it to pretty much stream in and be pretty quick. So I find there's a hidden cost that a lot of people tend to assume away, like, oh, just use this model, it's going to be 10 times cheaper. Well, not really. Who's paying for the inference cost? I don't know. Pascal, do you have anything to add?
Pascal Jauffret: Thank you. You pretty much summed it up. There are a lot of hidden costs with running models. It's not free, and in the end it's your provider that is paying for it. So yeah, it doesn't come for free.
Rob Stevenson: It feels like it is inevitable that this cost will eventually be passed down to the practitioner. And so if you are a practitioner in this space, your efficiency gain case must be rock solid, because your CFO is going to knock on your door and ask you if this is really worth it, and you better be ready to explain why it is. And so whether that's the efficiency gain, the productivity, however you can tie it to ROI, that is a skill set that I think is not typical for a lot of technical folks. But it is more and more important that you go outside your comfort zone and make this business case. Make a PowerPoint, I don't know, to argue for this.
Pascal Jauffret: If I can jump in on this, because we've had experience with that. Again, we really shouldn't underestimate the human aspect of using models; in the end, you have a team that is working on the ground and using those models. And there's a huge aspect of coaching people and organizing the work so that you are actually going to get your ROI, because you can provide the tools, but if no one is using them, then you get no efficiency. And I think it's a human thing in the end, more than a purely technological one. And this is something that, especially on the technical side, we're sometimes not used to. That's why there's a need for AI product managers and more intermediaries between pure research and the actual business.
Yannick Donnelly: I would agree. The meshing of AI and humans in a product is a whole new discipline: making it easy to use, easy to trust, and then of course having proper trainers and making sure this can scale, especially in our business, in a way that everyone is using it and not just a select few. Right? If we're going to make the investment on our end to maximize the efficiency of these new technologies, are we putting the effort in the right place to develop the proper experience? Not only segmenting, but editing the mask, validating the mask, creating new tooling to visualize potential predictions without actually doing the annotation. There are tons of experiments that could be run on how we can create an experience for our employees, but also for the customers who will eventually be interacting with our platform and the data, to kind of make sense of what's happening with this tech. Kind of opening up the black box a little bit and creating those experiences. And we're working really hard at doing all of that kind of stuff. It's just that, again, sometimes there are hits and there are misses. But the idea is that we have to iterate quickly, and we've got a few wins recently, and we're going to try to continuously improve on that non-stop for the years to come. Like I said, who knows, maybe SAM V3 comes out in six months and the whole paradigm changes again. So we pretty much have to keep our finger on the pulse to see how things evolve.
Rob Stevenson: Yeah, constantly.
Yannick Donnelly: Still on the cost thing, because I want to make a mention of it, right? There was an interview recently with the CFO of OpenAI, who was still talking about how they haven't figured out pricing; they are making money, but they don't have a full understanding of inference and infrastructure and all that, and how that will eventually translate. So back to your analogy of the Ubers of the world, where we suddenly saw a price jump. VCs are in it for the long run, but there will come a time, we don't know when, maybe in the next two, three, four years, when there will be a shift, like: no, you need to go into profitability mode. And then who are the players that are going to survive that shift? Because if you're used to paying $20 a month for OpenAI and the compute hasn't caught up or whatnot, and let's say they now need to start making a profit, well, do they suddenly jack up the price? Some people will stay, absolutely, because they'll see a benefit. But will everybody have the luxury to afford that kind of service? Who knows? We don't really know how that's going to play out in the future. And we're trying to be mindful of that. We want to use it, but at the same time, if we're using these models, it shouldn't directly translate into, like, 10x cheaper prices. We still need humans to adopt it and use it, and we need to find the right use cases to really take full advantage of the efficiency. Right now it doesn't seem like that's always the case. We really need to be experts in our field and educate the customers about what can and cannot be done.
Pascal Jauffret: I feel, and maybe I'm wrong, but I think there's a feeling in the VC space that people are rushing and hoping that there will be AGI any time soon and that this will kind of solve for everything. And I don't know how much that bet will pan out, and I'm really curious to see how it will go, because people are pouring a ton of money into training bigger and bigger models. And I guess the hope is that those models will be good enough that they'll solve for everything and you just pay for inference, but it will still cost money.
Rob Stevenson: Yeah. Seeing the benefit is so crucial, because if you are listening to the show, you are, I imagine, sufficiently technical to see the benefit right away. But will the rest of your organization? And this is a challenge for practitioners in this space. If you're, like, a Python developer, no one doubts your value to the organization; you're not about to get pulled into a room and be asked, what is it you say you do here? But I fear, like you both, that there has been this willingness to try and this willingness to invest, and at a certain point it's going to have to justify itself. And I think it's going to be on the practitioners to make that case to their customers, to the organization, to whomever, if they want to keep playing around with these fun toys.
Pascal Jauffret: And through my career I've been through multiple cycles. There's always new innovation coming in. There used to be RPA at some point, robotic process automation; there were the early days of data science, where we would all be training models internally. And so there's always this cycle of: we hope this new technology is going to solve for everything, and then we adapt, and we need to have real-life projects that bring in ROI rather than demos. It's always ongoing, I think.
Yannick Donnelly: And also there's a kind of discrepancy with the tech leadership of some of these more incumbent industries, let's say finance and insurance, who are at the top and who are not willing to make large bets yet, because it's a lot of money to invest. And at the same time, they're betting a lot more on the human connection: you want to deal with your financial planner, you want to deal with your insurance broker. They value that human interaction with the different clients that they have. And even though some products are becoming direct-to-consumer, a lot of this still relies on connections and relationships. So for some of these larger organizations that want to utilize automation and tooling, what are the use cases? What can they build to benefit that clientele and, I would say, the bottom of the pyramid, the rank-and-file employees? And at the bottom, they're like, oh, there's great value, we would like to have it, but then there's also a bit of concern: am I going to get replaced? Am I going to become a chatbot that's going to sell insurance? And it's about finding that compromise between what can be done to automate without putting those relationships at risk, while making them more efficient and getting better results. And at the top, it's: how much is it going to cost, and do we even want to make that type of investment? When I went to a few conferences, you could see that at the bottom people are super excited, but with maybe a sprinkle of concern, and then if you talk to people at the top, they're like, we're just kind of exploring; it's still very early days for them. Even though all this hype has happened over the past two and a half, maybe three years, they're still like, we are being very mindful; that's not our business, so what kind of tech stuff can we do? But again, eventually you will have to adapt. It's kind of like saying, oh, we never want to use email because we're used to paper. I think there's a bit of a blockade on making the funds available at the top, and at the bottom an eagerness to use some of this tooling. But how do you define a use case that's worth getting the money from the top, because the ROI, again, is hard to quantify? So there's a little bit of a war happening internally in these big companies, in terms of politics, with the bottom wanting to invest and the top not wanting to just yet, until they find the perfect use case.
Rob Stevenson: Yep, yep, I see it too. Well, gentlemen, we are creeping up on, or perhaps well past, optimal podcast length, but it's just because I've been having a ton of fun chatting with you both today. So we're going to have to wind this down. But before we go, I would just love a teaser trailer. What is coming down the pipeline? What's next for the Sama reading room?
Yannick Donnelly: There is a paper on different types of multimodal experiments that have been done in the wild, trying to get a sense of identifying ambiguity in multimodal data. So if you and I were to review the same video and we had to caption it and explain what was happening, how would you describe it versus me? How can we get to the same conclusion but use different words, and how does that kind of information get packaged together for, eventually, something like a refiner model? And again, Pascal is really good on this, these whole philosophical types of questions about how that gets put together, how our brains work: utilizing a very common tool that could give the same output, how can we give it two different inputs and get the same output? It's a pretty interesting problem.
Rob Stevenson: Multimodal is all fun and games until you appreciate that there's no such thing as an objective experience of reality.
Pascal Jauffret: Yeah, and this is a really interesting topic. Something that is really close to my heart is how you can leverage different cultural biases. We've done experiments with voice, and people see things in different ways. If you look at a bunch of logs on a river, for a Canadian it might be a beaver dam, and for people in other countries it might just be a bunch of logs. Can we leverage this? Is there value for the models in that? There's really something interesting on this front, I think.
Rob Stevenson: Localized egocentric perception perhaps?
Yannick Donnelly: Yeah, something like that.
Rob Stevenson: Yeah, that's a whole different podcast episode, or maybe a series of them. We simply have to wind down. But gentlemen, this has been really great. Pascal, thank you so much for being here. Yannick, this has been fantastic. So thanks for doing this show with me. I loved chatting with you both today.
Yannick Donnelly: Thanks so much for having us, and looking forward to doing it again.
Rob Stevenson: Can't wait. Hell yeah.
Pascal Jauffret: Thanks.
Rob Stevenson: How AI Happens is brought to you by Sama. Sama's agile data labeling and model evaluation solutions help enterprise companies maximize the return on investment for generative AI, LLM and computer vision models across retail, finance, automotive and many other industries. For more information, head to sama.com.