How AI Happens

Solving Conversational AI with Valyant AI CEO Rob Carpenter

Episode Summary

Rob Carpenter is the founder and CEO of Valyant AI, which is on a journey to solve the complex problem of conversational AI in the food service industry. In today’s episode, Rob explains the three main components of AI speech processing (and the challenges that arise at each of these nodes), how conversational AI has the capacity to improve conditions for human workers in the food service industry, and what this technology is going to be like in the future. After this episode, you’ll understand the importance of being more thoughtful about how you communicate your next burger and fries order to a conversational AI system.

Episode Notes


Key Points From This Episode:

Tweetables:

“I thought the hologram was the hard part and that the conversational AI was solved, but it was basically the inverse of that.” — Rob Carpenter [0:06:47]

“There’s benefits when you get into a new industry or technology not knowing the problems, because you don’t know what your limitations are. I think a lot of times that frees you up to be more creative and innovative.” — Rob Carpenter [0:08:38]

“If I was to postulate where things would end up, I’d say it’s probably a 90/10. 90% is that the technology has to be better, and keep getting better. 90% of everything needs to be handled by the AI. The other 10%, people need to be more thoughtful when they communicate with these systems.” — Rob Carpenter [0:17:22]

“There’s 1.7 million unfilled positions in the restaurant industry right now. 1 in 6 of every position available right now is in restaurants.” — Rob Carpenter [0:20:26]

“Innovation is not only built into economies, but it’s essential for their health and long-term safety.” — Rob Carpenter [0:23:28]

Links Mentioned in Today’s Episode:

Rob Carpenter on LinkedIn

Valyant AI

Episode Transcription

EPISODE 16

[INTRODUCTION]

[00:00:00] RC: Or, I want the loaded, the loaded sundae with the funnel cakes. As a human, or as an AI, you almost have to throw your hands up and just say like, “Wait. What?”

[00:00:17] RS: Welcome to How AI Happens, a podcast where experts explain their work at the cutting-edge of artificial intelligence.

You'll hear from AI researchers, data scientists and machine learning engineers, as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I’m your host, Rob Stevenson, and we're about to learn how AI happens.

[INTERVIEW]

[00:00:45] RS: Joining me today on the latest installment of How AI Happens is Valyant AI's CEO, Rob Carpenter. Rob, welcome to the podcast. How are you today?

[00:00:55] RC: I'm doing awesome. Thank you so much for having me. I really appreciate it.

[00:00:59] RS: I'm really excited to have you on the show, Rob. I have a ton of questions for you about conversational AI and some of the challenges there. I think it's a common problem across AI. Most AI tools that we interface with, that any human being interfaces with, are going to use conversational AI, right?

I would love to, at the beginning here, would you mind sharing a little bit about your background, and then how you came to start Valyant? Then we can get into the chief opportunity and technology of your company?

[00:01:28] RC: Yeah, absolutely. That'd be great. My background is an undergrad degree, actually, in entrepreneurship. I think I've always been a bit of a nerd for businesses, a little bit of a classic story in the sense that I had a lawnmowing business in high school. In college, I started an advertising company. Pretty early on, I read Rich Dad, Poor Dad. It just really opened up my eyes to the world of business and entrepreneurship, and building and starting companies.

As I looked at all of the people around the world that were effectively living the lives that I wanted to live, everything from buying tickets to space, or partying on a yacht, a huge number of them had started and built companies. I think there were elements of those lifestyles that I was really drawn to and wanted to emulate. Since I'm never going to get picked for the NBA, and my singing ability is woefully inept, I figured I would pin all of my hopes on being able to be a successful entrepreneur.

[00:02:31] RS: Got you. Then, you have the entrepreneurial bug sounds like, your whole life. What led you to point that curiosity and grit and hustle at AI?

[00:02:42] RC: Yeah. The initial concept for what has become Valyant AI was actually holographic employees. When I started the company in 2017, the first iteration for the business and for our product, we used a transparent OLED display. It’s about 3-feet wide and 6-feet tall. Then, we use the Unity game engine and we rendered a five-and-a-half-foot tall digital person. She was fully animated. She could blink her eyes. She would speak and use her hands for emotional gestures as humans do.

We used the Intel RealSense camera to be able to recognize people and greet them by name. In 2017, we were trying to use off-the-shelf conversational AI technology. To be frank, it was awful. It really couldn't handle more than one or two intents. If you said, "I want a cheeseburger," that would work 80% of the time with off-the-shelf technology. If you said, "Hey, I want a number three, just the sandwich, no lettuce, but could I also have a Coca-Cola, large, light ice," the system would either struggle to respond, respond inaccurately, or just wouldn't respond at all.

We had multiple sessions trying to use things like Dialogflow for those multi-intent situations, where it would take the system 10 seconds to respond to us. That was an awful experience from a customer service standpoint. We realized pretty quickly that if we were ever going to build this holographic employee technology, we were first going to have to master conversational AI, or put the whole company on ice and wait for existing technology to catch up, which, given the grit and the hustle, we obviously didn't want to do.

We pivoted. We doubled down in mid-to-late 2017 on building our own conversational AI platform. Today, that includes a proprietary speech-to-text engine, a proprietary NLP engine, and a proprietary natural language generator. We also have a hardware component, where we install patented hardware inside the restaurant that integrates with their existing infrastructure to enable the AI to talk to customers. Then there's some pretty critical proprietary software around integrating directly into the point-of-sale systems to enable this entire technology to happen.

I’m really excited with where we've ultimately been able to get the company by bringing this technology in-house. Then long-term, the goal is still to reintegrate this technology back into the hologram and then eventually, roll that technology out into the marketplace.

[00:05:06] RS: Fascinating that the hologram was the easiest part of the challenge for you.

[00:05:10] RC: I know, right?

[00:05:11] RS: Then that it broke when it was actually time to interface.

[00:05:14] RC: Yeah. I guess, we should maybe asterisk that and say, proto-hologram, because it wasn't a true three-dimensional light-based image. I think, it's the clearest way to explain it. Just using that transparent, display was a nice way where you could see through it. Then obviously, the parts of the character were opaque. You got that feeling of this digital entity standing with you inside of a physical space. You're right. I was totally off in that I thought the hologram was the hard part and that the conversational AI was solved. It was basically the inverse of that.

[00:05:47] RS: Conversational AI is not solved.

[00:05:49] RC: Conversational AI is very much not solved.

[00:05:52] RS: Valyant, it sounds like, your technology is primarily positioned right now, aimed at the point of service in restaurants and fast food and in places like this. What is the gap between the reality of the technology maybe as we know it, as I know it with Siri and Alexa, versus what happens there at the point of service?

[00:06:13] RC: Yeah. I mean, the challenge, at least in our environment, is what I consider a daisy-chain type of experience. The customer pulls into the drive-thru and they'll say something like, "I want a cheeseburger." The audio will flow through from the microphone into our hardware. We currently reside up in the cloud, so the AI does its thing in the cloud, decides how it wants to respond to the customer, and sends it back down. Then we play it out through the speaker to the customer. I'd say, big picture, there's probably roughly 10 different nodes along that chain.

Each successive node is dependent on the information that it got from the node before it. Let's say you've got 10 different key components, and each component is just off by 2% from perfection, so you have a 98% success rate at each node. Well, compound that 2% across 10 nodes and you're now looking at roughly an 80% overall order completion rate, which is not a viable product for the marketplace. If you consider one of those nodes, which is speech-to-text, right now, world-class is roughly 95%. You're saying, "Hey, I need to get 3% better than world-class," and when I take everything else into account, I'm still at an 80% completion rate.
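Rob's arithmetic here works out as straight multiplication of per-node success rates. The 10-node pipeline and the 98% figure are his; the assumption that node failures are independent is a simplification for this sketch:

```python
# Success rate at each of the roughly 10 nodes in the drive-thru pipeline.
per_node_success = 0.98
num_nodes = 10

# If node failures are independent, end-to-end success is the product
# of the per-node rates: 0.98 ** 10, which is about 82% -- i.e. the
# "roughly 80% order completion rate" Rob describes.
end_to_end = per_node_success ** num_nodes
print(f"End-to-end completion rate: {end_to_end:.1%}")
```

The decay accelerates with chain length: the same 98% per node over 20 nodes would leave only about 67% end-to-end.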

There are benefits when you're going into a new industry or technology not knowing the problems, because you don't know what your limitations are. I think a lot of times, that frees you up to be more creative and innovative and leads to breakthroughs. Over time, you also start to find out where those barriers were that the rest of the technology industry had been experiencing. It's been a fun and painful process to work through that.

[00:07:56] RS: Yeah, it's interesting, the speech-to-text. The last guest I spoke with was working on a similar challenge. They repeated the truism that the text is not the speech, and that there's so much more to processing speech than putting it in text and parsing it out. What else is left? Say you are able to accurately transcribe what a customer says into text, then how do you turn that into insight that the technology can take action upon?

[00:08:27] RC: Yeah. There are two other key components, and we roughly, nerdily refer to these three components as the three amigos. Speech-to-text is first. The second one is, obviously, natural language processing. Once speech-to-text outputs what it believes the customer said in written form, that feeds into the NLP. Our NLP takes that information, parses it, and then outputs actions.

Let's say the customer said, going back to the initial example, "Hey, I want a number three. Just the sandwich, large Coca-Cola, light ice." The NLP is going to output actions. Let's say the number three is a double cheeseburger. The NLP is going to output: add one double cheeseburger to order; add one Coca-Cola to order, size large. Then it might have an additional comment for modification, which is light ice. The NLP outputs those action statements. If you were to look at this as a mathematical formula, you could then bracket speech-to-text and NLP together and say, "This is our best guess of the intent that the customer wanted to convey to us."
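The action statements Rob describes might look something like the sketch below. The item names, menu mapping, and dictionary schema are invented for illustration; they are not Valyant's actual output format:

```python
# Hypothetical NLP output for the utterance:
# "I want a number three. Just the sandwich, large Coca-Cola, light ice."
MENU_COMBOS = {"number three": "double cheeseburger"}  # assumed menu mapping


def order_to_actions(combo_name: str) -> list[dict]:
    """Translate the parsed utterance into discrete point-of-sale actions."""
    return [
        # "number three, just the sandwich" -> the combo's entree alone
        {"action": "add", "item": MENU_COMBOS[combo_name], "qty": 1},
        # "large Coca-Cola, light ice" -> drink with a size and a modifier
        {"action": "add", "item": "Coca-Cola", "qty": 1,
         "size": "large", "modifiers": ["light ice"]},
    ]


for step in order_to_actions("number three"):
    print(step)
```

The point of the structured form is that everything downstream (the natural language generator, the point-of-sale integration) consumes these discrete actions rather than raw text.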

If the customer was always perfect in terms of their orders, then you'd be done at that point. You'd say, "Great. Yeah, add that to the order." You might need to ask them if they want anything else to keep the conversation moving along, but that would be the end. Unfortunately, that's not the case. That's really just the first half of the problem. Again, if you bracket speech-to-text and NLP together, that's the customer telling you what they want.

Then, the other side of the equation, which is equally as critical as those two items combined, is: how do I communicate back to the customer? What do I communicate back to the customer? In the example that I gave, generally speaking, in restaurants, when you have an entree and a drink, you then try to add a side and turn the whole thing into a combo. This other key component is the natural language generator. You could think of it as the brains, or the common-sense engine, or the element of a customer service experience where you decipher what the customer wants and then respond intelligently to them.

We have run through a variety of interesting types of communication from our customers. Just a couple of examples that we've heard over the years, and these are actual customer statements. The first one is, “Then, can I get an order of the – Do you have the funnel fries? Oh, can I get an order of those too, please?” Here's another one. “Hey, you know what? Take off the fries off.” Or, “Okay, just give me lemonade, if you have any.” Or, “Um, the half-pound. Do Coke – it still come with fries, right?”

[00:11:20] RS: Do Coke is still come with fries, right, is the title of this episode.

[00:11:24] RC: I love it. Or, “I want the loaded. The loaded sundae with the funnel cakes.”

[00:11:29] RS: Oh, gosh.

[00:11:31] RC: As a human, or as an AI, you almost have to throw your hands up and just say, "Wait. What?" That's what the natural language generator does: it takes what the customer has tried to convey to you and says, "What do I do with this?" I think some simple ways to think about it are: okay, what are the products you actually want? Do we have all of those products? Are they in inventory? Is it the right time of day? Are we a restaurant that sells those types of items? What if somebody pulls up to a burger restaurant and says, "I want spaghetti"?

Well, speech-to-text and NLP, let's say they're perfect. They tell the natural language generator, this customer wants spaghetti. The natural language generator then goes, "Awesome. I'm a burger restaurant. I cannot sell you spaghetti." You have to have an intelligent response to that, so that the customer has the experience that this is actually an intelligent system they're interacting with.

You might need to figure out, "Hey, is this an appropriate situation to upsell? If so, what should I upsell? Did I get everything complete that I needed?" I need to now prompt you to continue the conversation by saying something like, "Can I get you anything else?" Or, did you give me one thing I can work with, let's say, the Coke with light ice, but then you gave me something else ambiguous. You said, "Give me a burger, or a sandwich." We don't actually have generic items that fit those categories. I need to know that those are categories. Now, the natural language generator needs to resolve that unspecified item and say, "We actually serve multiple kinds of burgers. Would you like an A, B, or C burger?" Then, actually give the customer information and prompt them to complete and resolve those unresolved items.

You might also get situations where the customer says, “Hey, can I have two cheeseburgers, fries and a Coke? Oh, and light ice, or light lettuce, or no lettuce.” Then, you have to take that. If it's a Coke, and there's only one item in your order, great. Light ice, no problem. What about the example where there's two cheeseburgers and you say no lettuce? Well, does that mean no lettuce on both burgers? Does that mean no lettuce on just one burger? What if you already had a third burger that was in your order? Could that no lettuce go all the way back to that first item? That's why it really feels like, understanding what the customer is trying to convey is half the battle. Then being able to respond intelligently to the customer is the other half of the battle.
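One hedged way to model the "no lettuce" scope problem above: a resolver that collects every item a modifier could attach to, and flags the order for a clarifying question when zero or several candidates match. The compatibility table and item names are invented for illustration, not Valyant's actual logic:

```python
# Which modifiers plausibly apply to which items (invented for illustration).
COMPATIBLE = {
    "light ice": {"Coke"},
    "no lettuce": {"cheeseburger"},
}


def resolve_modifier(order: list[str], modifier: str):
    """Find items the modifier could attach to. Exactly one match means it
    can be applied silently; zero or several means the AI should ask a
    clarifying question ("No lettuce on which burger?")."""
    candidates = [item for item in order
                  if item in COMPATIBLE.get(modifier, set())]
    ambiguous = len(candidates) != 1
    return candidates, ambiguous


order = ["cheeseburger", "cheeseburger", "fries", "Coke"]
print(resolve_modifier(order, "light ice"))   # one Coke: apply silently
print(resolve_modifier(order, "no lettuce"))  # two burgers: ask which one
```

A real system would also need positional cues ("the first one"), recency weighting, and the ability to reach back to earlier items, as Rob notes.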

[00:13:58] RS: There's also this notion of, can we coach people on how to speak to a robot? Another example is, if I ask a voice assistant, what's the weather in Denver, Colorado on Thursday? Like you said before, 95% accuracy. Probably, it can give me an insight. If I was to say, "Well, I need a jacket for soccer practice," that's the same question, phrased in a totally different way. Also, I suspect that people are a lot more patient with a human being at a POS than they would be talking to AI. In the funnel cake example, which was just a garbled mess of words, as you pointed out, even a human would be like, "Hold on. Wait. What is this person asking me?" They would probably ask clarifying questions, like, "Wait, so do you want the funnel cakes and a sundae as well?" Then the person would be like, "Yes, I do," and they could tap that out.

With the robot, they would just be like, "Stupid robot. Why don't you understand me?" and hit the screen. I'm curious, just from a future-of-interaction standpoint, is it going to have to be that conversational AI can perfectly parse out these completely broken, fragmented sentences and come up with what someone means? Will people get better at relating to conversational AI and understand, "Oh, you're speaking to a robot. You have to speak in this more clear fashion"? Or is there a middle of the road where there will be a compromise? What do you suspect is the future of this?

[00:15:32] RC: Yeah, that's a fantastic question. That's something we spend a lot of time thinking about. How much time and energy do we need to put into making the product better, and how much time and energy do we need to put into training the customer? That could be prompts based on inferences by the AI back to the customer; because there's a ton of background noise, maybe it asks the customer to speak a little louder, so it can hear them. It could be other things, like putting a sign on the microphone speaker post that asks the customer to identify the items they want, and then order one item at a time. There's a variety of different ways you can go about this.

I think, if I was to postulate where things would end up, eventually, I'd say, it's probably a 90/10. 90% is just the technology has to be better, and it has to keep getting better. 90% of everything needs to be handled by the AI. Then another 10%, in terms of especially that example I gave you, people just need to collect their thoughts and be a little more thoughtful when they communicate with these systems.

We've even thought about how can we play on customers’ empathy, or sympathy, if that still exists in our society. By putting things on there saying like, “Hey, I'm a brand-new baby AI.”

[00:16:48] RS: It's my first day.

[00:16:49] RC: Go easy on me. Yeah, it’s my first day. We'll say that in our prompts when we struggle. We'll say, “Hey, I'm sorry. I didn't understand that. I'm trying really hard, but I'm having a hard time. Could you please say that a different way?” We'll do things where we try to get the customer to basically feel bad for the AI, which might be some weird manipulation in its own way. In truth, it's just like, just please be clear, be direct, be concise, speak loudly. Because if you do those things, current conversational AI is 90%, 95% accurate.

When you give us just a garbled stream of consciousness, that's where we see the real degradation with the system. To your point, people do then get very frustrated with the robot. You roughly get one, maybe two mistakes before people give up on these systems and ask for an employee. You don't have a lot of wiggle room to work with in these types of environments.

[00:17:48] RS: Yeah. It sounds like people are realizing, "Oh, I'm speaking with an AI here. I need to be a little more clear." The net effect is people expressing themselves more clearly, which is just better for everyone, right?

[00:18:01] RC: Yeah. I think that helps whether there's an AI involved, or not. Not to get too off track, or too political here. I mean, I think, there's just too many people on both sides of the spectrum, just yelling, or speaking past each other. I think, if we were just maybe a little more patient with each other, a little more empathetic, a little more concise in our communication, I think it would help everybody. Who knows? Maybe there's a long tail where these systems force people who are hungry to learn how to communicate with each other in a more effective manner.

[00:18:31] RS: How do you first see this technology being used in parallel with employees? Most of the individuals I speak with, they're not looking to replace the human employee. It's, how do we help the human employee get better at their job, be more efficient, focused on higher leverage activities? What does that look like for Valyant’s customers?

[00:18:50] RC: Yeah, that's a fantastic question. We're in a very unique opportunity in the world right now, especially with the market that we ended up in, here with quick-serve restaurants. Right now, the most recent report I read in The Washington Post, and this is what? Late September of 2021. There's 1.7 million unfilled positions in the restaurant industry right now. One in six of every position available right now is in restaurants. The reality is these businesses are working from a massive deficit in terms of the number of employees that they have available to them.

We don't really view that we're displacing anybody. We view that we're coming in and filling a gushing hole that restaurants are having a hard time filling right now. I think, that's a very positive place for us to be, because we are making the employees’ lives better. Imagine a restaurant is supposed to run with five people, but only three show up. Well, you're going to be extremely stressed and overworked for that shift. If you can flip a button and suddenly, one of those positions is filled, that then takes a significant amount of pressure off of those people.

As a quick anecdote, one of the franchisees that we're working with right now, they had a situation where we were going to do some work on site. An individual was walking up to us in the morning, and we're like, “Hey, are you the manager here?” They said, “Well, I am today.” It turned out, that was one of the owners. That's how big the challenges are right now that these restaurants are facing.

[00:20:34] RS: That's a great point that we're not taking humans' jobs. These are jobs that humans don't want, or that companies can't fill. I had a similar experience. There's a ski town in Colorado called Steamboat Springs. The restaurants are all having a hard time. They can't be open seven days a week. The quality of service is degrading, because they can't find enough people to work there. Part of it is city problems. There's nowhere to live for these individuals. It's too expensive, blah, blah, blah. At the one restaurant I was at, I do enjoy the personal touch of a human server, but they don't need to be there to take my order. I could have done that on a touchscreen, probably.

Then, that human worker could have been focused on getting more orders out, or greeting people at the door, or other higher-leverage parts of the experience that only a human can really do. That makes sense to me. It's just like, hey, when it comes to allowing human beings to do high-leverage activities, there's no such thing as taking their jobs.

[00:21:39] RC: Right, exactly. I think what this does is it really frees people to work on the things that AI and robotics can't do. That's creativity, that's personal touch, innovation, big-picture thinking. I also think that we get into a little bit of a mental rut as a society, where we believe that what we view around us has always been. I think, if we can take a step back and look at economies as being dynamic and long-term, innovation is not only built into them, it's essential for their health and long-term safety.

If you go back to the late 1800s, it was something like 95% to 97% of all employed people in the United States worked in farm and agriculture. Fast forward to 2021, that number is 1.3% of people in the United States work in farm and agriculture. We don't have a 98% unemployment rate. I don't know about you, but I really enjoy being in my office with a warm cup of coffee, maybe playing ping pong with my team, here and there, versus being outside in the elements doing back-breaking work. I just think it's important that we take a more holistic long-term, big picture perspective and just say, “Look, economies are very dynamic.” Ultimately, that is a very good thing for us overall.

[00:23:10] RS: Yes, I couldn't agree more. I do want to ask you before I let you go, Rob, just to prognosticate a little more on the future of conversational AI. When it is more widespread, when we can guarantee that 95% accuracy, regardless of how a human being expresses themselves. What is the utopian/black mirror extreme, that when this technology is perfect and is rolled out everywhere, what can we expect from conversational AI?

[00:23:43] RC: Yeah, so absolutely, it's going to sound human. I think there's going to be a legal element of, do these things need to identify themselves? Are they allowed to sound like humans? Let's say we're talking seven to 10 years out. Assuming no outside influence from laws, they're going to sound human. They're going to feel human. They're going to be fast. They'll respond within a couple hundred milliseconds. You will be able to have an open-ended conversation with them.

It's not going to be the experience that we're familiar with from current smart homes, where you have to say one thing at a time, and then sit and wait two to three seconds for it to respond back to you. Within the domain of the environment (restaurants, hospitality, airlines, call center customer service), you'll be able to have a fairly wide-ranging discussion with these systems, and they are going to be the embodiment of your perfect employee.

Theoretically, provided the servers don't crash, they'll never be late. They'll work as long as you want them to work. They're going to speak multiple languages, if not all of the languages. They're going to be tied into data systems through an API, so they'll know your order history with that company. They'll be able to offer a personalized experience. They'll be able to handle customer retention types of discussions based on your past behavior, interactions, things like that with the brand, and they'll be widely scalable.

I think, especially in that five-to-seven-year time frame, we will start to see, whether it's Valyant or another company, these digital employees starting to show up in physical locations. Historically, conversational AI has either been consumer goods for early adopters, or it's been used in the background in a lot of different environments, like analyzing call center interactions. We're now in a second wave, where we're seeing it used in frontline environments like restaurants to interact with customers, but still in a somewhat virtual way, in that it's either through a microphone at the drive-thru, or maybe it's a call-in order.

In the third wave, five to seven years out, we'll see these things start to have physical embodiments, one way or the other. You will walk into a restaurant to order and you will interact with a digital avatar that will be able to have these wide-ranging conversations, in multiple languages, with all your personalized info, provided you've opted into it, around whatever it is that you're trying to do. It'll rent you a car, it'll check you into your hotel room, it'll help you when your flight gets delayed or cancelled and you need to rebook. You name it.

Expect, in the next five to 10 years, to see a lot of the entry-level, not-too-complicated customer service work be automated, both in virtual environments like calls and physical environments like the restaurant.

[00:26:42] RS: I can't wait to see how it plays out. Best of luck to Valyant. Sounds like, you are really well-positioned in an awesome place. Congrats on your progress so far. Thank you so much, Rob, for being on the podcast. I've loved learning from you today.

[00:26:53] RC: No problem. I really enjoyed being here. Thank you so much for having me.

[END OF INTERVIEW]

[00:27:01] RS: How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video, and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, e-commerce, media, medtech, robotics, and agriculture. For more information, head to sama.com.

[END]