How AI Happens

Egocentric Perception with Facebook's Manohar Paluri

Episode Summary

Joining us today is Senior Director at Facebook AI, Manohar Paluri. Mano discusses the biggest challenges facing the field of computer vision, and the commonalities and differences between first- and third-person perception. Manohar dives into the complexity of first-person perception and how to overcome the privacy and ethical issues of egocentric technology. Manohar also breaks down how AI based on decision trees compares with models trained on large amounts of real-world data, and how the two approaches trade off between transparency and accuracy.

Episode Notes

Joining us today is Senior Director at Facebook AI, Manohar Paluri. Mano discusses the biggest challenges facing the field of computer vision, and the commonalities and differences between first- and third-person perception. Manohar dives into the complexity of first-person perception and how to overcome the privacy and ethical issues of egocentric technology. Manohar also breaks down how AI based on decision trees compares with models trained on large amounts of real-world data, and how the two approaches trade off between transparency and accuracy.

Key Points From This Episode:

 

Tweetables:

“What I tell many of the new graduates when they come and ask me about ‘Should I do my Ph.D. or not?’ I tell them that ‘You’re asking the wrong question’. Because it doesn’t matter whether you do a Ph.D. or you don’t do a Ph.D., the path and the journey is going to be as long for anybody to take you seriously on the research side.” — Manohar Paluri [0:02:40]

“Just to give you a sense, there are billions of entities in the world. The best of the computer vision systems today can recognize in the order of tens of thousands or hundreds of thousands, not even a million. So abandoning the problem of core computer vision and jumping into perception would be a mistake in my opinion. There is a lot of work we still need to do in making machines understand this billion entity taxonomy.” — Manohar Paluri [0:11:33]

“We are in the research part of the organization, so whatever we are doing, it’s not like we are building something to launch over the next few months or a year, we are trying to ask ourselves how does the world look like three, five, ten years from now and what are the technological problems?” — Manohar Paluri [0:20:00]

“So my hope is, once you set a standard on transparency while maintaining the accuracy, it will be very hard for anybody to justify why they would not use such a model compared to a more black-box model for a little bit more gain in accuracy.” — Manohar Paluri [0:32:55]

Links Mentioned in Today’s Episode:

Manohar Paluri on LinkedIn

Facebook AI Research Website

Facebook AI Website: Ego4D

Episode Transcription

0:00:00.0 Manohar Paluri: When we actually act, we have particular goals in mind. Whereas when you're trying to record something, you're recording it from a third person point of view. From the first person point of view, you can start predicting the next set of actions with a lot more precision.

 

[music]

 

0:00:16.0 Rob Stevenson: Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists, and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we are about to learn how AI happens.

 

[music]

 

0:00:50.7 RS: How do you make a machine see? This is the question that drives much of the research of Manohar Paluri, Senior Director at Facebook AI, and his team. Mano has been toiling away in computer vision for years, and he understands intimately the challenges inherent in replicating the human process of sight. For Mano though, making strides in this field isn't merely about detecting and labeling images and video. Mano joined How AI Happens to explain how computer vision is just one building block in a big bet Facebook is making: a multi-modal approach that combines personalized data, generalized learning abilities, and ultimately a shift from third-person to first-person point of view data, all in the interest of synthesizing information in a way that more accurately represents the human experience. Put more succinctly, egocentric perception.

 

0:01:45.9 MP: It's been over 17 years since I joined computer vision, which is one of the important fields in AI, making machines understand, especially, the visual world. This was during my undergrad. I was a computer science undergrad and did a bunch of electives. Computer vision stuck with me because of the gap between where humans were and where machines were at that time, and that's how I got into computer vision. Then I joined SRI, then Sarnoff, as a researcher, worked on contract R&D for a few years, and decided that I wanted to go pursue a PhD so people could take me more seriously. But then I realized very quickly as I was doing my PhD that the affinity towards real-world problems was something that excited me. So I did a bunch of internships at IBM Research and Google, and then I was also lucky to land an internship at Facebook. And when I joined Facebook, I was blown away by the sheer scale of things, and for the first time I felt that working on computer vision while having impact on a billion people was an opportunity. So that's kind of how I wound up at Facebook.

 

0:02:47.0 RS: It's interesting, you decided to pursue a PhD because, you said, you wanted to be taken seriously. But did that belief, that that's what you needed to be taken seriously, change?

 

0:02:57.5 MP: Yeah, so I was pretty naïve back then, like many people even today are, and it's just a learning curve, right? I learnt over the past years that you could be a researcher in various forms and various ways. Some of the obvious paths are to go and do a PhD and publish, and that gives you all the facets you need to be a strong researcher. But there's another way... And I never finished my PhD by the way, I dropped out in between, but as you can see, or at least as is evident, people do take me seriously without a PhD these days. And the reason for that is just the success of leading, building, and doing research projects, and learning by doing. So what I tell many of the new graduates when they come and ask me about, "Should I do my PhD or not?" I tell them that you're asking the wrong question, because it doesn't matter whether you do a PhD or you don't do a PhD, the path and the journey is going to be as long for anybody to take you seriously on the research side. The thing that you need to ask yourself is, what kind of an environment are you most excited to work in? And what kind of problems are you excited to work on? Those will dictate the decision whether you should go to university and be in an academic environment, or be in industry and try to be closer to practical problems but still try to do research in that context.

 

0:04:16.6 MP: There are pros and cons to both. In my case, I guess I was lucky because I spent part of my PhD in academia, but the other part of my PhD doing research projects in industry. So there are so many researchers who have had PhDs, postdocs, even multiple PhDs in some cases, who are amazing that I work with every day. And there are so many amazing researchers who actually never even went into a PhD program, but do amazing research. So I get to learn from both kinds of people. And I myself, I'm neither here nor there.

 

0:04:46.2 RS: I love that advice, particularly, "What areas are you excited to work on?" What was the answer to that question for you?

 

0:04:52.9 MP: So, when I was in undergrad, I was very fascinated by computer science; that's kind of one of the reasons I joined computer science. My undergrad curriculum was very nice, because in the first two years what they did is they basically set me up with: here is how a compiler works, here is how coding works, here is the language that you use to speak with machines, whether it's C or C++. During that time, it was basically C and C++; Java was just getting started and nobody took Java seriously, and MATLAB was one of the computing languages that we used for prototyping. The good part is that the first two years were foundational computer science courses, but the next two years were these electives that you could take. And I was completely blown away by the opportunity that was there. It's like I was staring at computer graphics, compilers, optimization, databases, machine learning and AI, and computer vision and image processing.

 

0:05:40.9 MP: So I did the crazy thing, which is I just took all the courses, I just took all the electives, and it was the best thing I did, because each one of them was a very hard, new learning experience. But computer vision somehow stood out to me because what was so easy for us humans was extremely hard for machines. So there were two things that were very fascinating to me: one, the challenge of it, and second, that I could be doing this for eternity, and even after I'm done, a few more generations after me will still be working on it. This seems crazy, this seems great, because I'll never be out of a job.

 

0:06:10.1 RS: Yeah, absolutely. I'm glad you brought up computer vision because I've had a bee in my bonnet shall we say, about the terminology for that field, because to say computer vision feels very human-centric, because that to me is like oh, vision is ocular and vision is image-based, but more and more, you see all these other types of sensors. When I speak with individuals who work in conversational AI, they always repeat the maxim that the speech is not the text, so there is much more to be perceived here surely than something visual in the field of computer vision, am I onto something there? Is that term computer vision limiting?

 

0:06:52.8 MP: No, I mean you're definitely bringing up a very important point. The straightforward answer to the specific question, is computer vision limiting? My answer is no, and not because I'm passionate about computer vision or have worked in it for the past 15 years or so. There's a framework that I think about: focus that helps versus focus that hurts. The core problem of making machines see is a pretty critical one, and even with all the progress that we have made and the amazing systems that are in production these days, that people use every day without even thinking about it, we are still in our infancy. This was one of the main reasons I joined computer vision, and it's still fascinating how far behind machines are in terms of recognizing and understanding things from a vision perspective compared to humans. Humans just learn quickly, they are able to generalize to new instances better, they handle challenging circumstances much, much better and are unparalleled compared to where machines are. Of course, there are places where machines are much better: in terms of focus, they can focus on everything that a camera can see, and they can crunch through things much faster.

 

0:07:56.3 MP: I mean, if you look at recognizing one in a billion people, machines can just do it; humans cannot even think about doing it. So there are definitely places where AI and machines are better, but still, there's a significant gap that we need to cover, and in that sense, we are in infancy. So computer vision as a field is very ripe, and cameras are ubiquitous: pretty much every smartphone now has multiple cameras, you have systems and home devices with cameras, and the future will have a lot more of this. Obviously, we need to really think about privacy and all of those aspects, but assuming we're doing all the right things and making sure that biases and privacy are thought about up front, CV as a field is very ripe. So going back to the framework of focus versus non-focus, I think defining the problem as vision only does bring focus, and it's very valid as an important problem to solve. But as you rightly pointed out, many real-world applications do go beyond computer vision, and that's why machine perception is the way to think about it when you're actually thinking about the end problem.

 

0:08:58.7 MP: So perception as a whole is: how do you take the raw sensory signals, whether it's a camera or a microphone or an IMU or a gyroscope or other kinds of sensors that are producing these raw digital samples, and crunch through them with machine learning models so we can actually reason over them and have higher-level representations and abstractions that we can build experiences on? That is perception. So going beyond vision, I mean, look at video: it has vision and language, it has speech. It's definitely a real-world problem that we need to tackle, real-world applications that we need to build. So in that sense, when you think about focus that helps, which is, hey, we need to bring in audio signals, we need to bring in text signals, we need to bring in vision signals, definitely go beyond computer vision. Many of the real-world ranking and search systems, many experiences that we want to build, could probably expand beyond computer vision.
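To make that pipeline concrete, here is a minimal sketch of the kind of late multi-modal fusion Mano describes: per-sensor encoders feeding a shared head. It uses PyTorch, which comes up later in the conversation; the module shapes, dimensions, and fusion strategy are illustrative assumptions, not Facebook's actual architecture.

```python
# Minimal late-fusion sketch of "perception": raw sensor streams -> per-modality
# encoders -> a shared representation that downstream experiences can build on.
# All module names, dimensions, and the fusion choice are illustrative assumptions.
import torch
import torch.nn as nn

class PerceptionModel(nn.Module):
    def __init__(self, embed_dim: int = 128, num_classes: int = 10):
        super().__init__()
        # Vision: a tiny conv net over an RGB frame.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim),
        )
        # Audio: a 1D conv net over a short raw waveform window.
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, embed_dim),
        )
        # IMU / gyroscope: an MLP over a 50-step window of 6-axis readings.
        self.imu_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(6 * 50, 64), nn.ReLU(), nn.Linear(64, embed_dim),
        )
        # Late fusion: concatenate modality embeddings, then reason jointly.
        self.head = nn.Sequential(
            nn.Linear(3 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, image, audio, imu):
        fused = torch.cat(
            [self.vision_encoder(image), self.audio_encoder(audio), self.imu_encoder(imu)],
            dim=-1,
        )
        return self.head(fused)

# Toy usage: one RGB frame, one second of audio samples, a 50-step IMU window.
model = PerceptionModel()
logits = model(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 16000), torch.randn(1, 50, 6))
print(logits.shape)  # torch.Size([1, 10])
```

Concatenation is only one possible fusion strategy; attention-based or cross-modal transformer fusion is a common alternative, but the overall shape of the pipeline is the same: raw signals in, shared representation out.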

 

0:09:50.7 MP: So that kind of focus does help. But on the flip side, solving the recognition problem and scoping it down specifically to going from pixels to understanding is in itself a very valid problem, and that focus can help you propel further. And historically, that's what happened, right? When you look at how these fields are created, there are two threads. One thread is trying to focus and define the problem so we can make progress incrementally over time or have leapfrog moments, and the second is how all of it ties together, and AI is definitely something that ties this together. But computer vision, speech, and language understanding are all really important domains under the AI umbrella that we need to make progress on.

 

0:10:32.8 RS: Okay, that makes sense. So we shouldn't say computer vision when what we mean is perception. And using the terminology computer vision is to be specific, it's to focus on image processing or video processing.

 

0:10:45.8 MP: Yeah, visual understanding. So computer vision specifically is the domain trying to make machines see the way humans do. Perception is about making machines understand from all sorts of sensors. By the way, another one that we don't often talk about is tactile sensors, so when you think of perception, there's the sense of touch. I mean, hopefully at some point we'll start touching on the taste aspect of it, but [chuckle] that's probably a bit away. But hearing, seeing, touching, all of these are sensory activities, and perception is what brings all of this together. Another thing to think about is you don't need to solve these problems in a silo. So for example, when you're thinking about a vision problem, if it turns out that audio does help you or language does help you, bring that in.

 

0:11:30.3 MP: That's why vision and language is such a critical component. If it turns out that the vision system would be much better with active perception, where you have an agent that is actually acting on your behalf, saying, "I don't know whether this is a mug or not, let me actually go interact with it," that interaction is active perception: you're changing the state of the world and then perceiving the changes, and the perception of those changes can help you better understand. That's a very valid thing. So in some sense, yes, you should not be limited by computer vision if it's hurting you. Whether it's a real-world application or research that you're working on, you should be really thinking about the problem, bringing in all the sensory information that you can, and making the right assumptions to make progress on the problem.

 

0:12:11.3 RS: Is that one of the big projects you're working on at Facebook? Are you focusing beyond computer vision, let's bring in other sensor data to start to formulate perception?

 

0:12:22.2 MP: Yeah. So there are two directions. Obviously there is still a lot of work that we need to do in pushing computer vision. So there we are thinking of really different kinds of problems around how we go after large vocabulary. Just to give you a sense, there are billions of entities in the world. The best of the computer vision systems today can recognize on the order of tens of thousands or hundreds of thousands, not even a million. So abandoning the problem of core computer vision and jumping into perception would be a mistake, in my opinion. There is a lot of work we still need to do in making machines understand this billion-entity taxonomy. So that's one line of work and set of projects that we are pursuing in research. And then there's the other line of research that you're talking about, which is, hey, when you think about the next generation of assistants or experiences, which are hardware and software, where we want the machines to be proactive, multi-modal, interactive, and augmenting your capabilities. That is another thread where we are really going into perception and bringing all of these modalities together. So both kinds of research are valid; as an organization, we are pushing on both at the same time, and I think progress in each will help the other.

 

0:13:35.8 RS: That's fascinating to me, that computer vision only extends to 10,000 entities, by which I mean real-world objects that can be detected, right? Why is it such a low number? What is the gap there?

 

0:13:47.9 MP: So the exact number would probably change from one company to another, one system to another. I would say, to the best of my abilities and the way I understand the field today, even the competitive state of the art across the landscape is definitely on the order of hundreds of thousands, at most a million. A production system that is running on device might be more limited; maybe a cloud system can do more, but we're definitely orders of magnitude away. It stems from multiple problems. So think of it as coarse, to fine, to personalized. Coarse is: is this a wallet or a purse? Fine is: is this a Gucci wallet or a Louis Vuitton wallet, or something like that. And personalized is: is this my wallet, with my engraving on it, my initials engraved on it? When you think about a computer vision system... Take a human. If I just interacted with you... Right now we are not in the same room, but if you were in the same room, you would see me placing my wallet and my phone on the desk.

 

0:14:44.0 MP: And we are having a conversation. At the end of the day, even if, miraculously, you and I had the same wallet, just from where I put it and the fact that you tracked it over time, you can easily say, this is yours, this is mine. We do it naturally; we don't even think about it. But for a computer vision system, it'll basically say both are wallets. To go one step further and say, this is Mano's wallet, this is Rob's, and here is the reason: I see the engraving, that's why I made this decision, we are not there yet. So that's one way of thinking about coarse, to fine, to personalized. We are still in the coarse category. In some places we are going into the fine category, like birds and food and other domains like that. But even there, it's a long way to go.
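As an illustration of that coarse-to-fine-to-personalized progression, here is a toy Python sketch. It assumes an image embedding has already been computed by some vision model; the label names, prototype vectors, similarity threshold, and function names are all hypothetical placeholders, not a real production pipeline.

```python
# Toy sketch of coarse -> fine -> personalized recognition over pre-normalized
# embeddings. Everything here (labels, prototypes, thresholds) is illustrative.
import numpy as np

def nearest(embedding, references):
    """Return the reference key with the highest cosine similarity (unit vectors assumed)."""
    return max(references, key=lambda k: float(np.dot(embedding, references[k])))

def recognize(embedding, coarse_protos, fine_protos, enrolled_items, personal_threshold=0.8):
    # Coarse: is this a wallet, a purse, a phone?
    coarse = nearest(embedding, coarse_protos)
    # Fine: only compare against sub-categories of the coarse class (e.g., brands of wallet).
    fine_candidates = {k: v for k, v in fine_protos.items() if k.startswith(coarse)}
    fine = nearest(embedding, fine_candidates) if fine_candidates else coarse
    # Personalized: match against instances this user has enrolled ("Mano's wallet").
    owner = None
    if enrolled_items:
        best = nearest(embedding, {k: v["embedding"] for k, v in enrolled_items.items()})
        if float(np.dot(embedding, enrolled_items[best]["embedding"])) >= personal_threshold:
            owner = enrolled_items[best]["owner"]
    return coarse, fine, owner
```

The point of the sketch is the structure, not the math: each step narrows the question, and the last step depends on user-specific reference data that a generic, coarse-only classifier simply does not have.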

 

0:15:25.7 MP: Second is how you learn. We have a pretty generalizable way of learning. So when, let's say, a designer comes up with a completely new wallet or a new article that doesn't fit the norms of the current definitions, we quickly adapt. We are like, okay, at the end of the day, it opens up, I see a bunch of cards and cash coming out, so that must be a wallet. That is not how machines learn. The way machines learn right now is: here are 1,000 types of wallets or 10,000 types of wallets, and based on that, I have a fairly good idea when I see something new. But if something crazy comes out of the distribution, it cannot generalize, because it doesn't know the definition of a wallet as something that holds cash and cards, opens in this way, and is used to carry valuables.

 

0:16:09.6 MP: That's not how it thinks about it. So the arc of connecting vision to real-world symbolism, or bringing vision and language together, is something that we haven't necessarily done at scale. That's the second aspect. And the third and final aspect is that, over time, I do anticipate there being enough privacy and bias aspects that we need to cover. So more and more, we should push for these personalized perception aspects happening on device, closer to you, where you feel more comfortable, whether it's an assistant in your home or AR glasses that you carry and wear all the time. So there's a significant amount of challenge in taking these models that are beefy and making them work in real time, on device, in a privacy-centric manner. This is why it's so hard to do all of this at the same time. And it's gonna be at least a few decades before we push the learning paradigms, push our vocabulary to the scale of humans and beyond, and push the compute and efficiency of these models so they can fit on very small silicon and be with you on the go all the time.

 

0:17:16.8 RS: Yes. Okay. That's really helpful context as well. I'm curious, Mano, could you share just some details about the organization you manage at Facebook and some of the big bets and the projects that you're focusing on right now?

 

0:17:30.4 MP: So obviously this involves some of the things we have talked about externally, and some of the things are ongoing, but everything we do, we talk about openly and we share, whether at the beginning of the project, as we make progress, or towards the end. I am situated in the research organization. Facebook AI is the overall umbrella; within Facebook AI, we are organized based on the outcomes that we wanna achieve. There are research outcomes, there are developer outcomes, and there are product outcomes. As you can imagine, product outcomes are really about how we take AI technology, or produce AI technology, that can solve real-world problems. It could be keeping people safe, could be building the next-generation recommender system like News Feed, could be thinking about experiences in commerce, like online shopping, that don't exist today.

 

0:18:14.9 MP: When you think about developers, look at the crazy amount of growth we had in PyTorch, and how developers are able to express themselves in AI without having to start from step zero. All in all, for everyone, it's really supercharging everybody's ability to build AI models, whether it's training or inference or using pre-trained representations. And then the research outcomes are really about asking what the bottlenecks are to push AI forward, whether it's in computer vision, language understanding, speech, personalization, and so on and so forth. So I'm part of the research organization; it's called FAIR externally. It's fairly well known in the research community.

 

0:18:51.0 MP: Some of the bets we have taken... We actually announced one last week, and it's in egocentric perception. What I noticed over the past few years is that, as we get into the world of people wearing smart glasses and having sensors that are on the go, egocentric perception, which is seeing the world from your eyes, from a first-person point of view, is a pretty budding area. Whereas historically, we have been working on exocentric, which is third-person vision: somebody takes a photo and uploads it, and then you're processing it and understanding what is happening. So that is a big bet that we made, and the thing that we launched last week is the Ego4D consortium. I can talk a little bit about that if you wanna go deep there. So that is one big project. There are two other directions I can share at a high level; we are still in our infancy, but directionally, this is what we're working on.

 

0:19:42.3 MP: One of them is this notion that somehow we hold some of the AI models to lower standards in terms of explainability and the ability to build trust. It's the really deep techies who understand what is happening, but other than that, these models are making a bunch of decisions that not everybody really understands. And once you don't understand something, you either appreciate it but keep it at a distance, or you don't appreciate it at all and you don't give it the benefit of the doubt. So one of the things we are trying to do is really increase transparency in these models, so people have more understanding of how they work and can actually help when they make mistakes. So that is a big direction that we are taking. Again, as I said, it's in its infancy.

 

0:20:23.1 MP: And another direction is around generative models. We have always been fascinated by all the progress we have been making in generative models, especially around, "Look at this room. This is actually generated by a computer. This is not real furniture or a real room." This is cool, but why, and how is it going to be impactful for the world, has always been a question that I have heard. Again, we are in our early stages. We are actually exploring a few options where we bring creators and generative models together and see what can happen there.

 

0:20:54.6 MP: So those are a few bets in the context of perception, and all of them are super fascinating. And as I said, we are in the research part of the organization, so whatever we are doing, it's not like we are building something to launch over the next few months or a year. We are really trying to ask ourselves, how does the world look like three, five, ten years from now, and what are the technological problems? What is the technology needed? And what are the research problems that we need to tackle, so we can go after them and make progress on them?

 

0:21:21.2 RS: Well, I feel as though each of those three bets probably deserves its own episode [laughter] to properly explore, but we can touch on all of them, I think, as we continue here. When it comes to egocentric perception, as I believe you called it, the shift from third-person data to first-person data, and whether it's a wearable or what have you, I suppose it would have to be a wearable. What changes in terms of the data processing that's going to happen when that shift occurs? When it's no longer me holding my arm out and taking a photo, and it's more of a one-to-one match with my point of view as a human being walking the planet.

 

0:22:00.3 MP: Yeah, no, it's a great question, especially when you think about egocentric perception. One thing I wanted to, maybe not correct, but add is that we should not assume egocentric data only comes from humans wearing glasses or wearables. We should also think about robots having their own sensors, and a robot seeing through its camera as egocentric perception, because it's the specific robot seeing the world. So the agent can be a robot with sensors seeing the world or a human wearing a sensor seeing the world; just thinking about it that way can help.

 

0:22:36.2 MP: But going back to your question on egocentric perception, there are some commonalities and differences. Let me actually start with the differences, since that's exciting and we can see what is so unique about it. When you think about egocentric perception, you give away your attention, that is, what you're attending to. So when you are actually wearing the glasses or cameras, and you're going, let's say, shopping or playing football, and I highly encourage viewers to see the Ego4D example video so you get a sense, you literally see that you're going towards the fridge, you're opening the handle, and as you're looking at it, you're picking up the milk bottle. Whereas in a third-person view, you have to make sense of what is happening because everything is visible and there is no attention mechanism, so you don't know as much about what the wearer is actually going to do. So that's one huge difference.

 

0:23:32.3 MP: The other difference is, when you actually wear glasses and record, you'll see the amount of jitter and aliasing effects because of constant motion. We are not tuned towards keeping our head steady and our gaze steady; we basically attend to whatever is needed and we move around a lot, so there's a big stabilization question. Whereas in third person, when you actually record a video, you instinctively record it while trying to keep your phone or the camera very still. So that poses a bunch of really interesting challenges for egocentric perception.

 

0:24:03.6 MP: The third is that when we actually act, we have particular goals in mind, whereas when you're trying to record something, you're recording it from a third-person point of view. From the first-person point of view, you can start predicting the next set of actions with a lot more precision, and it becomes a really nice task. And then again, many of these sensors do come with microphones and multiple cameras, so you can get 3D sensing of the world: not just what you're seeing, but depth as well as audio.

 

0:24:31.5 MP: And the last thing is, it's the same person again and again. So you can start personalizing things, saying, "Okay, when Mano is here, he's typically going to work out over the next half an hour, or he's going to eat the meal with the spoon, not the fork. After this, he has this meeting that he can go to," and so on and so forth. So you really start bringing in this multi-modal perception, predicting things, personalizing things in a much richer way in the case of egocentric perception, and your field of view, or what you're looking at, largely overlaps with the person's field of view and auditory information. So in that sense, you can start really getting to the core problems of understanding the world from our point of view, building representations that are as close to, not necessarily the way humans do it exactly, I'm not going to get into the biological models aspect of it, but as close to how humans think about the world and react to the world, right? So those are some of the differences.

 

0:25:28.5 MP: And then the last difference I would say is, we have always been asking questions around long-term reasoning, temporal reasoning, which is: you see something and half an hour later you see something else, can we solve the problem of tracking it? Those were all in the context of surveillance, or they were artificial in nature. I think egocentric perception is one ripe area where long-form reasoning is critical. Say I am having a conversation and you're saying, "Hey, I want to leave after 45 minutes because I have another meeting." Just the fact that our assistant might know that, and as we are inching towards that 45 minutes, it remembers that 45 minutes ago I said that, and says, "Hey, you might want to wrap up with Rob because you need to leave in five minutes," requires you to hold that important information over time, use it correctly, and trigger the right things, which is something you don't do in the non-egocentric perception world, which is more like offline processing. So those are the differences.
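To make the personalization and anticipation point concrete, here is a toy Python sketch of a per-user next-action predictor: a simple bigram count over each wearer's past action sequences. The class name, action labels, and user IDs are hypothetical; this only illustrates why first-person, person-specific history sharpens prediction, it is not a real egocentric model.

```python
# Toy per-user next-action predictor: bigram counts over observed action sequences.
from collections import Counter, defaultdict

class NextActionPredictor:
    def __init__(self):
        # counts[user][previous_action][next_action] -> how often that transition was seen
        self.counts = defaultdict(lambda: defaultdict(Counter))

    def observe(self, user, actions):
        """Record one observed sequence of actions for this user."""
        for prev, nxt in zip(actions, actions[1:]):
            self.counts[user][prev][nxt] += 1

    def predict(self, user, current_action):
        """Return the most likely next action for this user, or None if unseen."""
        candidates = self.counts[user][current_action]
        return candidates.most_common(1)[0][0] if candidates else None

predictor = NextActionPredictor()
predictor.observe("mano", ["open_fridge", "grab_milk", "close_fridge", "pour_milk"])
predictor.observe("mano", ["open_fridge", "grab_milk", "close_fridge"])
print(predictor.predict("mano", "open_fridge"))  # grab_milk
```

A real system would use learned sequence models over video, audio, and IMU streams rather than counts, but the intuition is the same: knowing whose history you are conditioning on makes anticipation far easier than watching an anonymous third-person recording.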

 

0:26:23.4 MP: The commonalities are: you are still trying to understand the world, you are still trying to get visual representations, you are still doing data collection and training models, you're doing inference either on the cloud or on device, and many of these models are fairly similar to how we do third-person vision. So there are definitely commonalities, which is exciting; that's why you're able to learn from what is happening in the broader vision community and apply it to egocentric data.

 

0:26:51.4 RS: It's really interesting you pointed out the difference of intentionality between third-person and first-person. So take the example of someone going up to the fridge to get milk: what are they going to get there? Why is it so much harder to tune into intentionality if I'm holding a camera versus if the camera was more in my point of view?

 

0:27:11.6 MP: Yeah, I mean, it stems from multiple things, right? Obviously, one of the things is just knowing the person. Let's say you are a vegan. If I see an almond milk and a dairy milk, and I know you, you will probably go with the almond milk; if I don't know you, both are equally likely. So if my task is prediction, the fact that I know about your preferences increases my odds significantly, whereas in third person I might not know, and that could be a problem. But the most important one is gaze and object interactions; those are fairly visible in a first-person point of view, whereas in third person, they might not be visible. By the way, for gaze, you do need additional sensors; it's not like cameras looking out are enough, you also might need cameras looking in at your eyes without blocking your view, which is something that people have experimented with a lot. So when you bring in that additional attention context, whether it's gaze or what I am interacting with in my hand, you do have a lot more going for you in egocentric perception. You can do that in third person, it's not like you can't, but the problem seems less natural to me; it's much more contrived, it's much more around monitoring what is happening rather than predicting and helping you with your next steps.

 

0:28:29.7 RS: Yeah, the gap there is as simple as the overarching goal of AI to imitate human processes, right? Like, holding your phone a few feet away from you and documenting what you're doing is not a human process; it's only been a human process since the smartphone came out, like 15 years ago, right? And so your actual lived experiences are obviously much more than that. The second bet that you were mentioning, you can see how this weaves together a little bit when you think about the why, and you think about the privacy aspects and the ethical considerations. For me to have a wearable and to want to opt into egocentric perception technologies, that's pretty straightforward: if I say I want it and I buy the glasses or contact lenses and I check the box, great, I've opted in; there's not an ethical consideration if I've opted in. But what about everyone else around me whom these glasses may capture, and what they're doing? This is already going on a little bit, because you can't go out in public without expecting that someone's smartphone is probably gonna capture you, or that you're in the back of someone's photo. Do you think that is gonna get ramped up more with the widespread nature of egocentric perception?

 

0:29:38.0 MP: No, this is a great question. So we launched Ray-Ban Stories a few weeks ago, maybe a month or so ago, and on Ray-Ban Stories there's an LED that indicates whether somebody's recording, whether it's a photo or a video. As you rightly pointed out, today I could be in the background of many photos without even knowing it. So we're trying our absolute best to indicate what the right way to do things is here, and adapting and learning, obviously, as we build more hardware. But when I think about building trustworthy models, it's not necessarily only in the context of egocentric perception; that's why the bet that I'm talking about is not in the context of AR necessarily. It's taking a step back and saying, there are vision models that we're building for various applications, including ranking systems for that matter, or integrity systems, where we need to weed out a lot of unsafe content according to the rules or policies that we have, and if a particular piece of content violates a policy, we need to remove it.

 

0:30:34.4 MP: In those cases, there are a lot of vision systems being used and applied. We want to up the transparency there, and this is research; it's not like I can just deploy it. First, we need to even understand whether such models are possible. So let me give you a concrete example. If my classifier is a decision tree that says, if I see this, I should do that, if I see that, I should do this, it turns out that is very human-interpretable, so anybody can look at the decision tree even if it's very complex. Say it's 10 questions, like how you play the 20 questions game: you ask a first question that eliminates many possibilities and takes you into a particular silo, and then you ask another question, and then another question. This is the most interpretable system, but it turns out...

 

0:31:22.9 MP: The world is so complex that decision trees just by themselves don't work as well. So you'll get a low-accuracy system that is extremely clear for people, where the decision-making process is extremely clear. Now flip that and say, "Hey, just let me train a model on billions of instances and here is a test set," and I get a huge accuracy jump compared to, let's say, a simpler model like decision trees or linear classifiers, but you lose the ability to explain why the model specifically made that call, because of the complexity of the model that you have used now, which is neural networks or other kinds of models, like transformers and so on.

 

0:32:02.5 MP: What we are trying to do in this research space is ask: do we really have to stay at those ends of the spectrum? Can we actually land somewhere where we do get a highly accurate system while explaining it as well as possible to humans, so they can understand, appreciate the transparency a lot more, and gain more confidence? And as a follow-up to that, also the ability to correct these models so they can become more accurate, right? So it's less about egocentric or wearables necessarily, and more about how we strike that balance of accuracy and transparency, which so far has gone to both ends of the spectrum; the foundational research hasn't yet been done on how to strike the balance between the two. That's the core question. We could probably end up finding out that you cannot have both. That would be sad, but we want to at least try.
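As a toy illustration of the tension Mano describes, here is a small Python sketch comparing a shallow, fully readable decision tree with a higher-capacity neural network on the same data. The dataset and model choices (scikit-learn's digits, a small MLP) are illustrative assumptions, not anything Facebook actually uses.

```python
# Transparency vs. accuracy on a toy task: a shallow decision tree whose entire
# decision path can be printed, versus a higher-capacity (but opaque) MLP.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Transparent: a depth-4 tree, readable like a short "20 questions" game.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print("tree accuracy:", tree.score(X_test, y_test))
print(export_text(tree, max_depth=2))  # the inspectable decision rules

# Opaque but usually more accurate: a small multi-layer neural network.
mlp = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print("mlp accuracy:", mlp.score(X_test, y_test))
```

On a run like this, the tree is typically far less accurate but every decision it makes can be printed and audited, while the MLP closes the accuracy gap at the cost of that readability. That gap between the two printouts is the space the research Mano describes is trying to occupy.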

 

0:33:04.4 RS: When you think about striking that balance of accuracy and transparency, whatever that looks like for a given process, is that now going to be baked into what you'd consider the standard life cycle of AI development?

 

0:33:20.0 MP: It's definitely not standard, at least not by today's terms, but as we make progress in this, and not just we as in my team or Facebook, I think as the global community, as we make progress in pushing on more explainable models and more transparent models, I do think it will set a standard. It's like this, right: once you set a standard on accuracy, it's very hard for people who want to use the best to not use that highly accurate system. Similarly, once you set a standard on bias and fairness, it'll be very hard to justify why you would not use a fairer model. So my hope is, once you set a standard on transparency while maintaining the accuracy, it'll be very hard for anybody to justify why they would not use such a model compared to a more black-box model for a little bit more gain in accuracy. So if we make progress as a community on this, I think it's the right thing, and it will actually help increase the adoption of technology much faster and in a more reliable way, while making sure we're trying to do the right things rather than punting the hard questions to later and just hoping for more and more accurate systems over time.

 

0:34:29.1 RS: Assuming that comes through and it becomes more standard, where would you slot it in? [chuckle] When you think of the way that the development cycle takes place, is it something that happens throughout the entire process? Is it like an oversight committee that has to drop into your process? How do you think about the way that this will insert itself into the generation of these technologies?

 

0:34:53.5 MP: It's a great question, and only time will tell. Right now, the way I think about it is, obviously there is a set of guidelines that we as developers or researchers will follow, but there are also some guidelines we might want to give to people who are building products, and there might be an overarching committee, like these amazing organizations in AI, such as the Partnership on AI and so on and so forth. The real-world implications of where these technologies get used, whether it's autonomous driving, versus a ranking system online, versus a wearable or a device in your home, each one of them will have different complexities that we need to tackle, so there will probably be specialized needs that we need to think about in each of those circumstances. But the common things will probably be close to the development, where the researchers are working on it. So it could be related to the data you collect, it could be related to the models that you build and the tools that you build to ask these questions as you're developing the models. It's too early to say, because we are still in our infancy in the research, but as we make more progress, we can share more.

 

0:35:54.3 RS: I like that you reiterate throughout how early we are in this whole developmental process and then the deployment of this technology. When you think of the next phase of this, the next five, 10, 15 years, what have you, what is most exciting about working in this field for you?

 

0:36:11.3 MP: There are so many things that are exciting, but I should definitely name a few that are close to my heart. One of them I already mentioned. On the evolution of computing, we are still in the mature stages of mobile computing, I would say, but nobody has yet figured out what the next computing platform is, and I don't believe mobile will be the final computing platform in any way. So, is it exactly AR, is it a combination of VR and AR, is it a combination of wearables, mobile, and desktop? All TBD, but I'm very fascinated by the next generation of compute having all of these sensors, seeing the world through your eyes, hearing it through your ears, and then being there for you in a proactive way. That part is really fascinating to me, and we are just starting to write that future.

 

0:37:07.3 MP: It's so early, but over the next decade or so we'll probably make progress on defining the next generation of computing and how people will interact with it. Everything from how we interact with it, what sensors go into it, what kind of computing it will have, what kind of information it will store about you and about the world, and how it will be used to help you in the moment as well as achieve your long-term goals, all of this will be rewritten in that computing platform, and to me, computer vision and AI will make a huge impact in deciding that future. So that's one thing that is super exciting to me; that's why Ego4D, egocentric perception, embodied assistance, these are all ripe areas for research contributions that we can actually make today to help write the future computing platform.

 

0:37:54.4 MP: So that's one thing that is exciting. The second thing that is exciting to me is that there is this natural progression of many of these fields coming together. Historically, when I was looking at computer vision, and this is going back to the conversation that we just had, it was really around understanding the visual world from pixels, or pixels over time at best. Right now, when you think about AI models, the good thing that has happened is that the constructs, the atomic constructs that we use, are fairly generalizable across these modalities: the models that we have in speech, the models that we have in vision, some of the language models, like transformers and convolutional neural networks.

 

0:38:36.2 MP: They are mathematical constructs that we can use at will to combine all of this information together, and that actually allows us to build more intelligent systems over time, more practical systems over time, more multi-modal systems over time. Some of the assistive capabilities that we can think of are gonna be amazing when you bring all of those modalities together and when you build AI models of the future where all of this information is synthesized.

 

0:39:02.2 MP: So that's the second one. The third one is, so far, if you look historically at how computing platforms have been used, they have been used for things humans are not good at, whether it's crunching numbers or doing mundane tasks. So going in that direction and thinking along those lines, humans are amazing, creative individuals, but we do get stuck, and once in a while we need that nudge, we need that additional input. I actually anticipate a human-computer interaction where the new computer, by the way, is defined in a different way, not as the current computer.

 

0:39:40.2 MP: Assuming the new computing platform, whatever shape or form it takes, the interaction between human and computer is going to be amazing: you will maybe have neural interfaces, you will have a lot more personalized interactions where the algorithm actually understands you and you understand the algorithm, and then there are these complementary aspects that come together for you to become an amazing creator, for you to achieve your goals much faster with less dependence on resources that not everybody has. That to me is very fascinating. One of the reasons people are successful is obviously their own talent, but also the resources that they were able to get, and just being lucky to be born in a particular area or region, or with particular resources, or to a particular family. If we can normalize that over time, saying you can actually have as much access to much of the same information, you don't necessarily need to travel to a conference, you don't necessarily need to be rich enough to have access to this particular information, those are all things that are extremely exciting to me. So those are the three things. Defining a new compute platform for the world is awesome. Thinking about how it normalizes and gives opportunity to everyone is awesome. Merging these fields of computer vision, natural language understanding, and speech is awesome. Those are the three areas that the next decade also will...

 

[foreign language]

 

0:41:02.9 MP: Many of my groups are working on small aspects of those big directions.

 

0:41:07.6 RS: Good. That's a great answer. There's so much here to develop and to see over the next however many years, so I can't wait. Mano, this has been really fascinating, learning from you today. Thank you so much for sharing your time and your expertise with me; I just really loved having you on. Thanks for being here.

 

0:41:22.9 MP: Thank you so much, Rob, for having me. Of course, every time I'm having conversations about this, I could go on and on, hopefully, it was useful for the folks tuning in and looking forward to doing one more after a few years when we actually look back and hopefully some of the predictions and some of the things that I talked about have become a reality.

 

0:41:42.9 RS: Would love that. Thanks so much, Mano. How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video, and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, e-commerce, media, MedTech, robotics, and agriculture. For more information, head to Sama.com.