How AI Happens

Qualcomm Senior Director Siddhika Nevrekar

Episode Summary

Siddhika worked on groundbreaking projects at Microsoft and Apple before co-founding Tetra AI Hub, later acquired by Qualcomm, where she now serves as Senior Director of Product Management. Our guest explains why she switched focus from cloud to edge computing, why there’s a need to both increase compute on edge devices and develop more efficient AI models, and how the conversation around edge has evolved since the very first murmurs.

Episode Notes

Today we are joined by Siddhika Nevrekar, an experienced product leader passionate about solving complex problems in ML by bringing people and products together in an environment of trust. We unpack the notion of free compute on user devices, the challenges of training AI models for the edge, what Siddhika hopes to achieve in her role at Qualcomm, and her methods for solving common industry problems that developers face.

Key Points From This Episode:

Quotes:

“Ultimately, we are constrained with the size of the device. It’s all physics. How much can you compress a small little chip to do what hundreds and thousands of chips can do which you can stack up in a cloud? Can you actually replicate that experience on the device?” — @siddhika_ 

“By the time I left Apple, we had 1000-plus [AI] models running on devices and 10,000 applications that were powered by AI on the device, exclusively on the device. Which means the model is entirely on the device and is not going into the cloud. To me, that was the realization that now the moment has arrived where something magical is going to start happening with AI and ML.” — @siddhika_ 

Links Mentioned in Today’s Episode:

Siddhika Nevrekar on LinkedIn

Siddhika Nevrekar on X

Qualcomm AI Hub

How AI Happens

Sama

Episode Transcription

Siddhika Nevrekar: On top of that, you start with your basic world where you're developing these creative models, which are in PyTorch. And you have to get all these PyTorch models into a language that Core ML, TF Lite, and ONNX Runtime understand, and all of them to run with the same performance and the same accuracy. And these are all different things that a developer has to learn. And then I'm going to tell you I want to run it on these devices. I'm picking a Samsung S23, S24, this drone, this security camera, and it has different chipsets and I know nothing of nothing. You figure out how you're going to run this model on those devices and give me a solution. It's a very hard problem. I'm making it sound, oh, it's so simple, just...

 

Rob Stevenson: No, it sounds hard. Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists, and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we're about to learn How AI Happens. Okay, welcome back, all of you wonderful AI, ML, data science practitioning folks out there in podcast land. Whatever your background, whatever your ilk or persuasion, I am thrilled you are back here with us at How AI Happens. And we have another amazing guest for you today. She has had a ton of roles in our space and boy, there's a ton that we can go into. So I'm excited to have her. Just a quick sprint through her curriculum vitae here at the top: she served as the senior program manager for Bing over at Microsoft. She then led Apple's Core ML team, where she figured out how to run ML models on any Apple device. She then co-founded Tetra AI Hub, which was later acquired by Qualcomm, where she now serves as a Senior Director of Product Management, responsible for building Qualcomm's AI Hub. She is Siddhika Nevrekar. Siddhika, welcome to the podcast. How are you today?

 

Siddhika Nevrekar: I'm excited to be here. Thank you for having me over at your podcast, and I hope lots of people have tuned in to listen and enjoy this ride with us.

 

Rob Stevenson: Oh, so many, so many, listening and enjoying, I'm sure of it. And boy, you have an interesting background. Did I do it justice? At first, I always like to ask if I bricked that or not.

 

Siddhika Nevrekar: Yes, you did, you did. And I can talk through some of it myself as well, if you were interested in something.

 

Rob Stevenson: Something that stood out to me about your experience that I'm hoping we can speak about is just that you moved from cloud to edge. So you had this experience running cloud at scale at a company like Microsoft, then going to Apple, and now you're working completely to try and run these models on the edge, on devices. So I'm curious what was responsible for that shift, if it was deliberate or if it was just that you were bored with one and wanted to move to the other. Why'd you make the move?

 

Siddhika Nevrekar: Yeah, that kind of takes me down memory lane. When I started, Bing was a search engine which was put together by Microsoft. And you'd think, why another search engine, given Google was everywhere? But that was very, very crucial to get a good understanding of AI and ML back then. Large data training, large models, ranker relevance, making sure what people ask for can be responded to within a matter of seconds by Bing in an appropriate way. So that kind of... and you rightfully said it was all cloud. Everything was running on cloud. And these were huge models, large amounts of data; it required everything to run on cloud. And I did it for quite a bit of time, about 19 years. It wasn't actually guided by the industry at that point, because AI on edge hadn't truly started. There were murmurs around, but not through applications. But I thought, hey, I have lived and spent a lifetime in cloud and I need something that kind of challenges me, scares me a little bit. And can I do that? Can I work with... I know nothing about devices. Can I go learn how devices work? And that was a deliberate move to Apple, and it was the best time and place to be, in my opinion, now looking back, because that was the beginning of AI running on devices, with phones being unlocked by your face. So Face ID is a very, very big example of that, as well as the silicon, or the SoCs, which now started to emerge exclusively for AI and to run AI on device. And Apple was doing their first one. I was very, very fortunate to be there at that time and to learn all the goodness around how to run AI on edge. And to be honest with you, I did not believe in it. I was like, this is just a hoax. Models are too big to run on devices. By the time I left Apple, we had a thousand-plus models running on devices and 10,000 applications that were powered by AI on the device, exclusively on device, which means the model is entirely on the device and is not going to the cloud. And to me that was the realization that, oh, now the moment has arrived where something magical is going to start happening with AI and ML. A good example of this, for all of us to ground ourselves: when you take a picture on your phone, it could be any phone, there are about 23, 24 models that run back to back within that microsecond to capture that incredible picture, which you look at and say, wow, this is a perfectly captured sunset or sunrise or something like that. That's 23 models working right on the device, making no contact with the cloud whatsoever. So that kind of grounds us, makes us realize that, well, AI is having so much practical impact on our day to day. We are, knowingly or unknowingly, interacting with AI every day now. So that was the reason for the shift from cloud to device. And luckily landing on the AI...

 

Rob Stevenson: Boy, 23 models running just when you open the camera app. That would have been impossible for 2014, 2015 Siddhika to believe, right? It sounds like that was a deliberate move, like, hey, this isn't possible, let me go try and figure it out. When you say that you wanted to go towards a thing that scared you, was it that you just didn't believe it was possible?

 

Siddhika Nevrekar: Yes. One, models were substantially large. So there was a lot of technology around how do we get the models smaller, models doing what the large models in the cloud did. Can we actually achieve that? That was one track, and that track, with a bunch of... I wouldn't say it was any one person who did it. This was a lot of research done by many universities, a lot of companies that came together and kind of invested in this arena of figuring out how do we compress the models, how do we make them small, how can we train smaller models that run fast and do effectively the same thing? That was sort of an advancement in technology during that era, that time period. The second big push happened with the silicon manufacturers themselves. You know, whether you think of Apple or Qualcomm, there was some very specific investment done into a chip, a unit within our devices, which would be dedicated to running just AI, which essentially is math calculation, matrix multiplication. Can we dedicate a unit that just does that really, really fast? And that was kind of a leap of faith, because if we put it in there, will there be use cases? Which, later on, you know, we could go on and on talking about the AI that we are seeing today and how the use cases are going to evolve. We are in a similar moment right now. And that was a similar moment where, hey, if we put this hardware in, will the use cases emerge? And the use cases did emerge, because now you had a dedicated processor, a neural processor, which could handle the exclusive math dedicated for just AI. So that was the cause of AI taking off on devices.

 

Rob Stevenson: Okay, yeah. I was going to ask you whether you thought an increase in compute on edge devices or more efficient models was more responsible. Of course it's both.

 

Siddhika Nevrekar: Both, yeah.

 

Rob Stevenson: If you had to say one or the other, though, would you say it's about more efficient models? It sounds like that's kind of what you're saying. And my gut tells me that no matter how much compute we get in this phone I'm holding up right now, there will still be less than is possible in the cloud. Right? Is that your position? Do you think that what's more responsible for it is the increase in model efficiency?

 

Siddhika Nevrekar: That is right. So what you said is absolutely right. Ultimately we are constrained by the size of the device and, you know, it's all physics. How much can you compress a small little chip to do what hundreds and thousands of chips can do, which you can stack up in a cloud? Right? Can you actually replicate that experience on the device? And we are constrained by the size, so there is going to be a limit there. What helps us, though, is how the algorithms are evolving, how the technologies are evolving, towards things like RAG. Can we learn and train the model with small amounts of data on the device and make great personalized experiences? Can we have use-case-specific things going on? How can we reduce and shrink the data and the use cases and the models so that it will be an effective experience? And how can we systematically dispatch many of these and run many of these in and out of the device? In a sense, how do we emulate a cloud on device without the large scale is the challenge we all are facing today and trying to figure out. And there are many tracks that are making progress in this area.

 

Rob Stevenson: The philosophy at Apple from the beginning was how do we emulate the cloud on an iPhone?

 

Siddhika Nevrekar: Apple is a very, very unique company. And this is speaking out of just my experience alone: privacy has always been a very, very treasured principle all around. And hence, when AI came about and something like Siri came about, where there were things like Face ID, or even within Messages, being able to complete your sentences, complete your text, the question was, what do we do now? Should we actually go run these experiences in the cloud? Which was a big no-no for Apple, because what's on the device stays on your device, and no one gets to see it, no one can unlock that information. So that prompted Apple to start looking into how can we run these experiences on the device? How can we actually give you Memories in your Photos on the device without having to log into and look at your photos in the cloud? Is that even possible? Or how can we autocomplete your sentences or translate without sending that data over a wire to the cloud, where there is a possibility that, once it's in the cloud, it's used for training or it's looked into? The principle is nothing leaves the device. So how do we actually manage that? And that prompted Apple to push in this direction.

 

Rob Stevenson: Interesting. I had always considered the main inspiration for edge computing was latency, that it's faster, and that you can't always... there are places in the world where you won't have Internet access, and can you still run a model without having to connect to the Internet? And it sounds like maybe that was part of it, but it was more about just, hey, our customers have an expectation of privacy. How do we deliver that while still giving them advanced models?

 

Siddhika Nevrekar: Yeah, it's not one or the other. You're right, there are many factors. I was just thinking of how this conversation started back in the day at Apple, thinking, hey, how can we keep things private and secure? But equally important is the latency. And I remember every single application, whether it was Cisco WebEx, Zoom, or Microsoft Teams, every single video conferencing application was trying to figure out how we can run better models that improve image quality and voice quality, all on the device, during COVID, because all of us were on all these applications and sending that amount of data to the cloud was a huge burden.

 

Rob Stevenson: Oh yeah, of course.

 

Siddhika Nevrekar: And latency. If you remember, at least I felt it, initially, when we were using these applications at the beginning of that time, we would see delays and hang-ups and voice getting lost, and then things improved very dramatically. One of the reasons for that was a lot of these companies invested in, hey, the devices are computers, can we not run the models on the computers? They're bigger. What if we move the models onto the devices and run them on the device, and right there fix the audio, rather than sending the audio stream all the way up, fixing it, and sending it back, which causes that delay between you speaking and me hearing? So there is definitely an aspect of latency and speed. There are experiences where you cannot tell the user, hey, you know, you've shown your face to the camera, go grab a coffee and I will unlock your phone in the meantime. It's just unacceptable. So we demand speed. And like you said, there are places and there are applications where there is no connectivity and AI is quite essential, or has helped us in many ways, and it's time saving and cost saving. And then there is privacy, which is extremely important as well.

 

Rob Stevenson: I am really tickled by that. As a fan of tech history, I'll say, that all of these fantastically smart and profitable companies realized that there's compute on computers and that they could take advantage of this, and maybe that the cloud wasn't the be-all end-all. There had been this 10-plus-year investment in and prioritization of cloud computing, and then there was now this reason to push it back to the edge. And I guess in the case where people are using laptop computers, or in some cases even desktops, maybe that was a little easier than the phones. But anyway, the phones were right there. Like, mobile-first was also happening; there was no way to get around mobile compute development.

 

Siddhika Nevrekar: I don't know if you've heard this term, but I have heard it more often now, even among startups and other application developers especially, that there is free compute in my user's pocket. Why should I pay for it? Right? And that's a true thing, because if I'm an application developer, let's say for the application we are using for this particular recording, and I could run an AI model on your device or my device, then I'm saving that cost on the cloud. And you and I are not actually burdened by the cost of it, because we're like, yeah, we're using the application anyway. So if you want to consume a little bit more of the neural engine processor on the side, sure, as long as it doesn't impact me, I'll grant you that space. So that's a theme that's taken off amongst application developers, who are constantly trying to save cost.

 

Rob Stevenson: The free compute in my user's pocket is fascinating to me, and maybe this is a little too cloak and dagger, you can kind of bring me back to earth, but do you foresee a future where companies could use their users' compute for processes that had nothing to do with what the user was doing? So my phone is idle right now. Apple could use my compute for some other process. Presumably, in the case of Apple, they would ask my permission and I would say, yes, you may do that, and maybe my iPhone costs a little less. I don't know. Do you see that as a possibility, that all of this distributed compute will be used for non-user processes?

 

Siddhika Nevrekar: Very Silicon Valley, Black Mirror show kind of thing?

 

Rob Stevenson: Yeah, no kidding.

 

Siddhika Nevrekar: We haven't yet gotten there. I'll tell you why we haven't gotten there. The compute that's available today, we haven't bombarded it with a bunch of use cases where we are getting to the point that, hey, I would want to borrow the compute from somewhere or from some user of mine. We haven't hit that momentum yet. But with the advancement in generative AI, we are on to something that is quite unknown to all of us, because what we are trying to do is train a model on the fly to learn so many things, and we are going to run into a lot of scenarios where, again, I'm going the Black Mirror route too, I don't know, honestly, but it could happen, where I would like to borrow some distributed compute if such a technology exists. Today there isn't something that's built out to do what you're saying, but it's not something I will discount as, oh, this may never happen. This probably would happen as we move forward, because our cars are going to be computers, and they're going to be huge, and they're going to be sitting in the garages sometimes.

 

Rob Stevenson: That's it, yeah. Why wouldn't you want to do it? Yeah, if there's an incentive to the user, then it wouldn't be Black Mirror, right? If it's happening in the background, like we've seen... remember Sony rootkits? Remember that whole thing? If you're doing something without the user's knowledge or consent, then shame on you. But if it's like, okay, I can accept this, not only can I give consent, but maybe I'm incentivized. It's like, oh, your car payment is lower, right? We'll pay you. I mean, this is now Web3, right? Like, now you own a piece of this, we will pay you. You will get some sort of financial incentive to allow us to rent your compute.

 

Siddhika Nevrekar: Yeah, absolutely.

 

Rob Stevenson: That's only Black Mirror if you don't tell people.

 

Siddhika Nevrekar: Yeah. Or something crazy happens. That's true.

 

Rob Stevenson: Well, Siddhika, at some point here we have to start the show. We have all this stuff we wanted to talk about, but, you know, we're down a rabbit hole here. But I'm enjoying it. I guess we'll stay in the rabbit hole for just a minute longer. I wanted to ask you a little bit about the challenge of training models for the edge. In a situation like Apple's, where the data cannot leave the device, does that mean you cannot train the model using any of the user's usage data?

 

Siddhika Nevrekar: There are techniques that have been developed where you can do things like private federated learning, where you can train the model on the device and send the differences up to the cloud, which are very cryptic in nature. It has nothing to really do with the data, but more with the weights. And then you can fuse it with the main model and dispatch the model back again to the devices. So it's kind of an iterative learning. The challenge today with edge AI, or running AI on devices, is more to do with actual inferencing on the device than with training. Ongoing training is going to be a point of concern, but the pre-step to it, which is, hey, can I even run this model on the device, is hard today for many, many developers, many, many users. And I'm sure there will be people listening to this podcast who say, yes, let's figure that out first. Let's solve that first. That is our bottleneck as it stands today. And if we solve that, then we move on to the next arena, which is how can we now continuously train models on the device? Can we train models on the device, and can we continuously train them? Can we customize them to just me? All of those things will come into play.

 

Rob Stevenson: So is there a way though to tweak the model while keeping data anonymous?

 

Siddhika Nevrekar: Yes, there is. There are techniques that have been developed but not extensively used. Private federated learning, like I mentioned, is one of those techniques. There are RAG algorithms where you can periodically update the model with enough context given about different things happening on the device, and it will then learn, behave, act, and do new things. Your model will do new things that it used to not do before, because it has learned the context being on your device. Those techniques are available today. I would say they're not yet extensively used, mainly because we didn't have a problem statement, maybe just a few years ago, which demanded that technology. If you think about it, let's assume you want to run generative AI, you want to run ChatGPT on your device, just on your device. Now you're getting into a whole different game. And let's say that model is everything about you and your device, which restaurants you go to. It has nothing to do with the Internet. Let's assume that it has nothing to do with general knowledge, but it is your model with all of your knowledge. And we can imagine now, because of ChatGPT, that such an experience could exist, where I go on my device and I will be like, which was that restaurant which had that Mexican food that I liked, with that particular margarita or something? And the model would respond, oh, I know, at 7 PM on that day you went here. I don't want the world to know that. I want my model to know it and tell me. And we can only imagine this experience today because just a few years ago we started playing around with ChatGPT and the likes of generative AI and LLMs, which have started answering all sorts of things for us, summarizing also. So now a use case has come into play for the entire world, and we are all like, oh my God, this is a new space, let's do more with this. But that does mean that that model that we have been talking about, which knows everything about you, needs to continuously learn about you. You will constantly do different things and constantly change habits and behaviors and patterns, and it needs to consistently learn. And can we do that learning on the device? That's a problem statement that has come about now, more so than before. Oftentimes, with AI, especially ML, it's been a trend: if you look at the history of AI and ML, it stays in a research mode for quite a bit, where there's a bunch of research going on around technologies and techniques and algorithms, and then suddenly there is this big drop of something production-oriented that happens, and then it evolves. So we are in that phase where there has been a drop of a production-oriented LLM world. We have seen large language models, and now the world is hungry for what's next. Okay, we've all gotten used to ChatGPT, now what do we do next? I think it's demanding that from us now.
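
To make the private federated learning flow Siddhika describes a little more concrete, here is a minimal sketch in PyTorch of on-device training that ships only weight deltas, never raw data, plus a server-side step that fuses the averaged deltas back into the global model before it is dispatched to devices again. The helper names and the plain averaging are illustrative assumptions, not Apple's or Qualcomm's actual implementation; production systems add secure aggregation, differential-privacy noise, and compression before anything leaves the device.

# Illustrative sketch only: federated weight-delta learning with plain PyTorch.
import copy
import torch
import torch.nn as nn

def local_update(global_model: nn.Module, device_data, epochs: int = 1, lr: float = 1e-3):
    """Train a copy of the global model on one device's private data
    and return only the weight deltas; the raw data never leaves the device."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in device_data:                 # data stays on the device
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return {name: p.detach() - g.detach()        # deltas, not data
            for (name, p), (_, g) in zip(model.named_parameters(),
                                         global_model.named_parameters())}

def aggregate_deltas(global_model: nn.Module, deltas_per_device):
    """Server side: average the deltas from many devices and fuse them
    into the global model, which is then sent back down to the devices."""
    with torch.no_grad():
        for name, param in global_model.named_parameters():
            avg = torch.stack([d[name] for d in deltas_per_device]).mean(dim=0)
            param.add_(avg)
    return global_model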

 

Rob Stevenson: Okay, that makes sense. So we are moving towards this personalization, individualization really. Like, it's not about, I'm a white male between 25 and 34; it's, I'm Rob. And so that information, what makes me Rob as opposed to some other broader demographic, it does not make sense to train a model on that data, or not widely anyway, because it's only relevant to me. So if you could do it on my device for me, then it would never need to leave the device, because there's no utility in training your model with my data.

 

Siddhika Nevrekar: And the more I train your model with my data, or my model with your data, the size of the model increases, because it has to learn and store more information, which is not necessary at this point. Right? You might need some inputs or some signals of your family's behaviors, maybe your friends' behaviors, but not all the content. So that's kind of where the trend is going, in some sense.

 

Rob Stevenson: Okay, gotcha. I joke that we're down a rabbit hole, but this is all very hyper-relevant to the work you're doing now, so I don't think it'll be too hard for us to come out of the rabbit hole and connect the dots. Would you maybe just explain a little bit about your remit and what you're trying to accomplish with Qualcomm's AI Hub?

 

Siddhika Nevrekar: I'll go down memory lane a little bit so it smooths our transition out of the rabbit hole. When I and my co-founder left Apple, one of the biggest problem statements we were very interested in solving was how can any developer deploy any model, or run any model, on any device? That was the problem statement that fascinated us. And the reason that problem statement fascinated us was what I said earlier: running models on devices is hard. Now you would ask, why is it hard? There are two plates that continuously move in the space we are in. I'm calling them plates; they could be two things. One is the algorithms and the research. So we went from classical, simple, easy models, object detection, all the way to large language models, which are completely different. There are multimodal models now which understand image and language and audio and video. All of these things, all the context together, keeps on evolving. So there's constant research going on where the algorithms and the math that happens within those models keep evolving, where somebody somewhere in the world figures out, oh, I could do this math this way, and suddenly that's magic. That's a new operator, a new way of defining that math that someone has found, and the whole world adopts it very quickly to build models. The second plate that keeps moving is the hardware itself. Look at your own phone: five years ago it was different and now it's different. And what I mean by different is the chip itself, the AI chip itself, even the CPU and the GPU, which are the three main processors on the device doing all the heavy lifting. They have evolved, the architecture has evolved, and we have learned to do things faster with the inputs that we've gotten, that, oh, we need to do these kinds of things. So let's do a dual GPU, let's beef up the NPU a little bit more, let's make the NPU learn these new things or do these new math operators faster. So with these two plates moving, now, as an application developer, I end up in a world where I say, oh, I have a cool application that does face recognition, voice understanding, all of those things together, let's say some application which does AI. And I would like to put a model to do this on the device versus sending all the traffic from all my users to the cloud. When I look at my users, some have an iPhone, some have an Android phone, some have a Windows machine, some have an Xbox, somebody has some other device. They also have different generations of devices. Someone has the latest iPhone, someone has a really old iPhone, same with Android, right? And I don't control that, because I can't go tell my users, hey, buy the new devices. I can't do that. So now my problem statement is how can I take this model that I just developed, which is with the latest and greatest technology, and run it on all my devices? This is where my problem starts. To add to that, every device runs a different OS. We have iOS for iPhones and iPads, macOS for MacBooks, and then you have Android for all your Android phones, and then you have Windows, and then you have Linux for drones and security cameras and all of that. So here I have different OSes, which means the world has evolved to also say, well, the software, or what we call the runtime, the runtimes for each of these OSes are slightly different. And that's not intentional. It was just a fragmented world, and each of those evolved differently.
So Apple said, oh, iOS devices have Core ML. Core ML is what you can use to run a model on the device, and look, if you use Core ML, our chip understands the math of your model really fast, so use Core ML. Windows said, well, we do ONNX very well, so come use ONNX Runtime. And Android said, TF Lite seems really great, it'll work with TF Lite. So now you had, again, three things that a developer has to learn. On top of that, you start with your basic world where you're developing these creative models, which are in PyTorch. And you have to get all these PyTorch models into a language that Core ML, TF Lite, and ONNX Runtime understand, and all of them to run with the same performance and the same accuracy. And these are all different things that a developer has to learn. And that is why we built the Tetra AI Hub, which is now the Qualcomm AI Hub, where a developer can come and say, hey, here is my PyTorch model. I'm going to upload it, and then I'm going to tell you I want to run it on these devices. I'm picking a Samsung S23, S24, this drone, this security camera, and it has different chipsets and I know nothing of nothing. You figure out how you're going to run this model on those devices and give me a solution. And that was the intent of building it. It's a very hard problem. I'm making it sound like, oh, it's so simple, just... no, it is super hard. And the idea is to produce that one artifact that will run seamlessly across all the devices. So then your life, you meaning the application developer, becomes easier: thinking of the application and the experiences, and not having to dabble in, oh my God, I don't know these chipsets that exist on the devices. Like, I'm not a hardware guy and I don't have time to go invest in that while I'm building my application. So that's the whole idea of Qualcomm AI Hub. That's what we've been having a lot of fun building, actually.
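
As a rough illustration of the fragmentation Siddhika is describing, here is a hedged sketch of what a developer faces today when taking one PyTorch model toward just two of those runtimes, using only the standard torch.onnx exporter and Apple's coremltools converter; the model and input shapes are placeholders. The TF Lite path for Android typically needs yet another toolchain, and a service like Qualcomm AI Hub aims to hide all of these per-runtime steps behind a single upload.

# Illustrative sketch: one PyTorch model, multiple per-runtime export paths.
import torch
import torch.nn as nn
import coremltools as ct  # Apple's converter for Core ML

# Placeholder model standing in for "my cool AI model" in the conversation.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
).eval()
example = torch.randn(1, 3, 224, 224)

# Path 1: ONNX, the format ONNX Runtime (Windows and elsewhere) consumes.
torch.onnx.export(model, example, "model.onnx",
                  input_names=["image"], output_names=["logits"])

# Path 2: Core ML for iOS/macOS, which wants a traced TorchScript model first.
traced = torch.jit.trace(model, example)
mlmodel = ct.convert(traced,
                     inputs=[ct.TensorType(shape=example.shape)],
                     convert_to="mlprogram")
mlmodel.save("model.mlpackage")

# Path 3 (TF Lite for Android) is not shown: it usually goes through yet
# another converter, which is exactly the per-runtime burden described above.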

 

Rob Stevenson: I'm glad to hear it's fun. It does, however, sound fantastically hard. So is your approach then to develop a translation method for every device and every operating system, or is it more like a unified theory, where everything goes into the top of the funnel and comes out as one thing that works across any device and any OS, any instance?

 

Siddhika Nevrekar: The goal is to work with these community runtimes, and the reason I call them community is, while they are developed by the likes of, you know, ONNX is developed by Microsoft and TF Lite is developed by Google, there's a lot of community contribution to them. Our contract for Qualcomm starts below that, which is, once you get the translation from your PyTorch model onto TF Lite or ONNX, how easily, how unified can it be? That, like you said, one format which goes down, we have that, and you have to constantly work at maintaining that contract with all of these communities. And then the other part is, how do you go from PyTorch to all these community runtimes? Where some translation exists, in some cases there's fantastic development that's happened, we leverage that. Where there are gaps, we fill them by building those pieces ourselves. And the result for the developer is they do not have to worry. We provide them the solution where they come in, upload the PyTorch model, and say, well, I am targeting Windows devices, recommend a runtime, make it happen for me, and we'll give them the right solution end to end, from the top all the way to running it on the device.

 

Rob Stevenson: Yeah, it makes sense. And it feels like the use case is so clear. I fear that some developers might choose to just work in one instance rather than figure out how to serve their product to the other 75% of the world. So the ability to translate these models in this way is, yeah, fantastically helpful. Gosh, I could stand to hear more, Siddhika. We really covered a lot of ground here, but we are creeping up on optimal podcast length, so we unfortunately have to wind down and get on with our lives here. But before I let you go, maybe you can just share: what are you hearing from the folks out there when you speak to developers? Obviously the challenge of translating this into all these different use cases or instances is hard, but what are the really common challenges that developers are facing, and what advice would you give them?

 

Siddhika Nevrekar: Right now the hottest topic for every single developer is how can I run a generative AI model on the device? And we are limited by the size of the model at this point in time. All devices are, and it's, you know, up to 7 billion parameters for, let's say, a phone, and it can go up to 30 billion for a computer, and a large 60 billion-plus for a car. But it is very challenging to run a large language model on the device, because it has more to do with the end-to-end pipeline, not just running the model on the device. That's been the latest challenge, and that's what we have been focused on figuring out: hey, how can we actually compress this model, quantize it, what are the quantization schemes, and how do we make it such that, just like the previous problem statement, where a developer could come and upload a model and say, hey, I want to run this model on the device, and didn't have to worry about anything, can we do that with large language models? Can we actually bake in all the quantization flows and make them dead simple, so that a developer almost pushes a button and makes it happen? Is that possible? It's the latest challenge, and I know the line of sight and we are heading there. It is going to take a couple of quarters for us to make that really smooth for developers.
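
For a sense of what "quantize it" means at the simplest level, here is a hedged sketch using plain PyTorch post-training dynamic quantization on a stand-in block; the schemes a service like Qualcomm AI Hub would bake in for LLMs (weight-only 4-bit and similar) are more involved, so treat this only as an illustration of shrinking fp32 weights to int8 so a model fits more comfortably on a device.

# Illustrative sketch: post-training dynamic quantization with plain PyTorch.
import os
import torch
import torch.nn as nn

# Stand-in block; real on-device targets are LLM transformer layers.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).eval()

# Dynamic quantization: Linear weights are stored as int8,
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Compare serialized sizes to see the roughly 4x weight shrink from fp32 to int8.
torch.save(model.state_dict(), "fp32.pt")
torch.save(quantized.state_dict(), "int8.pt")
print(f"fp32: {os.path.getsize('fp32.pt') / 1e6:.1f} MB")
print(f"int8: {os.path.getsize('int8.pt') / 1e6:.1f} MB")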

 

Rob Stevenson: So Q3 25, you're going to come back on the podcast and tell me how you solved it.

 

Siddhika Nevrekar: Yes, that's the goal. That's the goal.

 

Rob Stevenson: I love it. Well, hey, the last thing you thought was impossible, it seems like you figured it out. I'm not a betting man, but I certainly wouldn't bet against you. Siddhika, this has been fascinating. I really loved having you on, and I mean it, I would have you back on in a little bit to talk about what you've been up to and the progress you've made on that line of sight you described. But at this point I'll just say thank you for being here. I really, really enjoyed speaking with you today.

 

Siddhika Nevrekar: Thank you so much. Me too. I enjoyed this too.

 

Rob Stevenson: How AI Happens is brought to you by Sama. Sama's agile data labeling and model evaluation solutions help enterprise companies maximize the return on investment for generative AI, LLM, and computer vision models across retail, finance, automotive, and many other industries. For more information, head to sama.com.