How AI Happens

Arize Founding Engineer Tsion Behailu

Episode Summary

Tsion discusses current challenges in ML observability, and explains how Arize's Bias Tracing Tool was developed to help companies root out bias in their models.

Episode Notes

Arize and its founding engineer, Tsion Behailu, are leaders in the machine learning observability space. After spending a few years working as a computer scientist at Google, Tsion’s curiosity drew her to the startup world where, since the beginning of the pandemic, she has been building cutting-edge technology. Rather than doing it all manually (as many companies still do to this day), Arize AI technology helps machine learning teams detect issues, understand why they happen, and improve overall model performance. During this episode, Tsion explains why this method is so advantageous, what she loves about working in the machine learning field, the issue of bias in machine learning models (and what Arize AI is doing to help mitigate that), and more!

Key Points From This Episode:


“We focus on machine learning observability. We're helping ML teams detect issues, troubleshoot why they happen, and just improve overall model performance.” — Tsion Behailu [0:06:26]

“Models can be biased, just because they're built on biased data. Even data scientists, ML engineers who build these models have no standardized ways to know if they're perpetuating bias. So more and more of our decisions get automated, and we let software make them. We really do allow software to perpetuate real world bias issues.” — Tsion Behailu [0:12:36]

“The bias tracing tool that we have is to help data scientists and machine learning teams just monitor and take action on model fairness metrics.” — Tsion Behailu [0:13:55]

Links Mentioned in Today’s Episode:

Tsion Behailu

Arize Bias Tracing Tool

Arize AI

How to Know When It's Time to Leave your Big Tech SWE Job -- Tsion Behailu

How AI Happens


Episode Transcription



[00:00:03] RS: Welcome to How AI Happens, a podcast where experts explain their work at the cutting-edge of artificial intelligence. You'll hear from AI researchers, data scientists, and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson. We're about to learn How AI Happens.  




[00:00:30] RS: Here with me today on How AI Happens is Founding Engineer over at Arize AI, Tsion Behailu. Tsion, welcome to the podcast. How are you today?


[00:00:39] TB: I'm good. Thank you for having me.


[00:00:41] RS: So pleased you're here. You're sprinting on your way to a holiday in London, so thank you for squeezing me in. I know how things get hectic right before the break. I appreciate you taking the time.


[00:00:51] TB: Yup, no problem.  


[00:00:53] RS: So much to go into with you. First, I think maybe it would make sense to just get to know you a little bit. Would you mind sharing a little bit about your background and how you wound up in your current role at Arize AI?


[00:01:03] TB: Sure. Yes. I'd love to. My name is Tsion Behailu. I'm a part of the founding team at Arize AI. I'm a staff software engineer on the full stack team. I grew up in the metro Atlanta area. I left for college, I went to UC Berkeley, that's where I discovered computer science, tech, all the great Silicon Valley things and decided to pursue that as my interest. After studying, graduated, did the usual big tech life, joined Google in the Sunnyvale, Mountain View area. I was there for about two and a half years. I had a really great time, but I just wanted a change.


I transferred to the New York office, worked out of the Google office there for about a couple of years. That's when this opportunity came up to explore startups. I had a friend at the time from college. She just co-founded an ML company and reached out to me, and it was at a really great time in my life and I decided to just take a chance on it. I started March 2020, cusp of the pandemic about, like March 8, right before New York shut down. I pretty much spent the entire pandemic, helping build a product from the ground up and still here, so having a great time.


[00:02:27] RS: Good, good. Glad to hear it. What was your experience with machine learning before coming to Arize?


[00:02:34] TB: I had almost no machine learning experience aside from an upper division AI course in college. I hadn't worked on anything really ML related at Google. I actually appreciated the opportunity to be able to learn ML from scratch, pretty much, in a startup environment. Also learn how to build a product from the very beginning. At Google you're working in a really big system. It was a lot of late nights on the call. The co-founder, Aparna, graciously spent many hours on Zoom with me teaching me every single ML term, while I'm actually writing the metrics for the platform. So definitely a crash course ML 101 and 102. All the courses on the job.  


[00:03:19] RS: Yeah. So it's on the job just-in-time-learning for you.  


[00:03:22] TB: Yes.  


[00:03:24] RS: I speak to folks on the show who'll tell me, “I've been in AI since the late 80s. Rob, I bet you didn't even know AI existed in the late 80s.” But you're a little newer to the field. I'm curious, what has struck you? When you got into ML and started upskilling and learning about this field, what sort of stood out to you about this as compared to your previous background in CS and engineering?


[00:03:44] TB: Yeah. So I spent about four and a half years in industry prior to joining Arize. I think what was really refreshing was just getting back to the theory. In college we focused a lot on the theory side of computer science. I did enjoy the application side, actually building things, implementing, making things work. Ultimately, as an engineer, you get a lot of gratification from making things work, but getting back to reading up on why a metric is important or how it's implemented or really just digging into the entire ML field like, model building, model analysis. It was nice to just have my brain exercised in a different way. I think that was probably the most notable part of it for me, switching from industry into like, something a little bit more research backed.


[00:04:39] RS: Got it. So you were able to combine this theoretical with now this practical of building and shipping products.  


[00:04:45] TB: Yes.  


[00:04:47] RS: At that point, you've got a few years in the industry under your belt. Google is a gold star on your resume. You probably could have picked a few different options. Why Machine Learning?


[00:04:56] TB: I was really curious what the hype is. I feel like that's why I joined Google to begin with. People talk about it and I was like, “Well, let me go see what it's all about.” I did go see what it was all about. It was really great. A lot of perks, but then as my curiosity around, “How next do I want to propel my career? Is it exploring different teams inside the company? Is it going to a start up in a more emerging field?” I felt like it was just the allure. Getting to be part of something breaking edge, but I also cared about the environment in which I learned. I was pretty hesitant about just going to any startup. You hear all the horror stories. So just having someone I was very close to, really respected, and knew would make sure I was in a healthy environment, while I'm learning something new in my field. All of that combined was what tipped me over into going for it.


[00:05:54] RS: Yeah. That makes sense. Trust is so important, especially with these smaller companies at the beginning. You wanted to see what all the fuss was about. What was all the fuss about? How would you characterize Arize AI’s mission?


[00:06:06] TB: Arize AI, I mean, where do I begin? So yes, we're a platform to help monitor and troubleshoot AI models. What that means is, model building has really taken off, it's a hot-button topic. You've got all of the tools right now to build models, but you build a model, you put it into production, what happens then? We focus on machine learning observability. We're helping ML teams detect issues, troubleshoot why they happen, and just improve overall model performance.


That part of the lifecycle was at the time very much new. We're helping shape what that means in that space. That was really exciting, like being part of the early stage of the market. Aside from that, you're building tools. We've really shaped up, like, there was no website when I joined. Now we have a really fully fleshed-out product. I encourage you to check it out. I'm very excited about it. Also developing it in a pandemic, you get to output a lot more work than if I was dilly dallying in a Google cafe.


[00:07:08] RS: Of course. Yeah. Would you mind sharing a little bit about what is meant by ML observability?


[00:07:14] TB: ML observability is what it says. You're observing how your models are performing. Let's say you have a model that's degrading in accuracy over time, right? Maybe you launched it in a new state. You have a model that's predicting something time related, or really anything, but you've launched it in California. Now, you launched it in Arizona, and all of a sudden, it's degrading in accuracy. How do you know it degraded? Well, if you were using Arize, you'd be able to quickly monitor the accuracy of your model over time, and get an alert that it degraded in accuracy from 90% to 70%. That's just one of the many examples.  


There are obviously many ways in which you can measure the performance. There's precision. If you just have predictions, it would be drift, like, are your features drifting? In the machine learning research space, there are so many metrics that researchers are coming out with. We're quickly adding them to the platform. We're also adding the ability to allow customers to do custom metrics. Maybe they have things that they care more about that we don't have available, but yeah, that's what we mean by observing your models.
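A minimal sketch of the kind of accuracy alert described above, not Arize's implementation. The record layout, the day labels, and the 0.8 threshold are all assumptions made for illustration; the example data mirrors the drop from 90% to 70% mentioned in the answer.

```python
from collections import defaultdict


def accuracy_by_day(records):
    """Compute accuracy per day from (day, prediction, actual) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for day, prediction, actual in records:
        totals[day] += 1
        hits[day] += int(prediction == actual)
    return {day: hits[day] / totals[day] for day in sorted(totals)}


def accuracy_alerts(daily_accuracy, threshold=0.8):
    """Return the days whose accuracy fell below the (assumed) threshold."""
    return [day for day, acc in daily_accuracy.items() if acc < threshold]


# Hypothetical data: 9 of 10 correct on day-1, 7 of 10 correct on day-2.
records = (
    [("day-1", 1, 1)] * 9 + [("day-1", 1, 0)] * 1
    + [("day-2", 1, 1)] * 7 + [("day-2", 1, 0)] * 3
)
daily = accuracy_by_day(records)
alerts = accuracy_alerts(daily, threshold=0.8)
print(daily, alerts)
```

In a real monitor the alert list would feed a notification channel rather than a print statement.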


[00:08:26] RS: Got it. What is the most common concern to Arize’s customers? Is it pure accuracy, drift, precision? Those are the examples you mentioned, but what do you see coming up the most when people are asking for various things to observe?


[00:08:41] TB: If you're thinking of the categorization of metrics, you have performance, data quality, and drift. It depends really on the customer and the type of models that they're building. If they're building models that are just predictions, maybe they only care about drift. If they have models with ground truths, and they're trying to analyze the performance of it, they might want to use one of our many performance metrics, that's accuracy, precision, F1, recall, all of those. If they just want to care about the data of the model like, let's say the cardinality is changing. The cardinality of this feature was only, I don't know, they want to have a certain set of unique features. Now all of a sudden new features showed up. Maybe they care about that. Depending on the industry that the customer's in, the type of models that they're building, they might care more about drift or performance or a data quality metric.
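The three metric categories named above can be illustrated with one small function each. This is a sketch under stated assumptions: the function names are hypothetical, the data is invented, and the population stability index is swapped in as one common drift measure; the episode does not say which drift statistic Arize uses.

```python
import math
from collections import Counter


def precision_recall_f1(predictions, actuals, positive=1):
    """Performance: needs ground truth (actuals) alongside predictions."""
    pairs = list(zip(predictions, actuals))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def new_categories(baseline_values, production_values):
    """Data quality: values seen in production but never in the baseline."""
    return set(production_values) - set(baseline_values)


def population_stability_index(baseline_values, production_values, eps=1e-6):
    """Drift: PSI over a categorical feature; 0.0 means identical distributions."""
    base, prod = Counter(baseline_values), Counter(production_values)
    psi = 0.0
    for category in set(base) | set(prod):
        # eps avoids log(0) when a category is missing on one side.
        expected = max(base[category] / len(baseline_values), eps)
        actual = max(prod[category] / len(production_values), eps)
        psi += (actual - expected) * math.log(actual / expected)
    return psi


precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0, 1], [1, 0, 1, 1, 0, 1])
baseline = ["CA"] * 10
production = ["CA"] * 6 + ["AZ"] * 4
unseen = new_categories(baseline, production)
drift = population_stability_index(baseline, production)
print(precision, recall, f1, unseen, drift)
```

A model without ground truth can only use the last two functions, which matches the point that prediction-only models may care mostly about drift.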


[00:09:33] RS: Got it. Now, this maybe gets into the founding story a little bit, but what is your understanding of how ML professionals were monitoring these sorts of things, absent a tool like Arize?


[00:09:46] TB: The troubleshooting process prior to the ML observability space was pretty manual. It is probably pretty manual in a lot of companies to this day. Let's say a problem comes up. Now the data scientist or the ML engineer has to go and say, “Oh, well, why? Why did this model all of a sudden stop working or degrade so badly in production?” They pull that data. They try to comb through it. Hone in on where the issue is. Fix it and then deploy. But because of this, there are, of course, a lot of business implications. You're losing money. You're losing trust in whoever your customers are.


What customers are doing right now is either they're looking for companies to provide a service, or they're trying to build their in-house solution, but in-house solutions take a really long time, they're not easy, and they still probably have a very manual step, as well. They're probably also very much developed specifically for a certain subset of models that that customer might be building.


[00:10:47] RS: Right. Even assuming there is a reliable internal process for checking up on that, there's still the question of, how often do you check up on it, right? Is it constant? Is it once a day? Is it once a week? That's time your developers are spending away from other things. Instead of building the models, they're tweaking models or making sure that model does what it's supposed to do. Feels like maybe a poor use of time that adds up over the long run. I'm just curious about that, because every exciting, usable product has some really awful pain point at its base, where people are like, “Oh, yeah. It'd be really amazing if I didn't have to XYZ.” In this case, it's maybe like a spreadsheet, or however you're tracking this manually, as you said. Wouldn't it be better if you just got a notification?


[00:11:30] TB: Exactly. Just to add on to that, let's say the in-house solution is a dashboard, right? It takes a lot of engineering effort to build integrations like, maybe you want to be notified on Slack, maybe you want to be notified by email or like PagerDuty. Building that huge set of integrations is also a lot of work. If you have an in-house solution, it might just be that someone has to manually check it every day. It might not go as granular into the model like, how do you break it down by feature filtering? Yeah, just a little add-on to what it could mean to have a very robust in-house observability platform.


[00:12:07] RS: Yeah. That makes sense. You mentioned some of the things it can drill in on are drift, accuracy, precision over time, etc. What about bias?


[00:12:18] TB: Bias is really exciting, because earlier this year, we released a product that lets you explore bias in your models. This was really interesting for me, because it's the intersection of actual real-world issues with ML, so I had a really great time doing this project. Models can be biased, just because they're built on biased data. Even data scientists, ML engineers who build these models have really no standardized ways to know if they're perpetuating bias. So more and more of our decisions get automated, and we let software make them. We really do allow software to perpetuate real world bias issues. So what does bias mean in different industries and different model use cases? It can really mean different things.  


In the healthcare field, is healthcare being provided equally amongst its population? In finance, is there a bias against who's getting approved for loans? You also have a lot of facial recognition software out there right now. How is that data being used? Is it eventually becoming discriminatory? I think the most popular, probably the ones that we see the most online and in news articles, just because they're very buzzwordy, is like social media. How are these social media algorithms building their timelines, like the trends? Is it hurting different content creators? There's just so many real world implications when it comes to throwing models out into production, and having people interact with them with no safeguards.


[00:13:49] RS: Enter the bias tracing tool. In all of those examples of problems, how does it detect bias?


[00:13:56] TB: The bias tracing tool that we have is to help data scientists and machine learning teams just monitor and take action on model fairness metrics. As for what we mean by model fairness measures, we decided to focus on three: recall parity, false positive parity, and disparate impact. These all measure things like how sensitive the model is for one group compared to another. Is it predicting an outcome as more likely for a marginalized group over a base group? Or is it falsely predicting something as more likely for a sensitive group? These sensitive and base groups could be gender, race, age, religion, nationality, really any protected attribute. This was the first set of metrics we decided to release and it's been pretty heavily used. You can compare it against either your model’s features or like, additional metadata that you might have ingested with your model.
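A rough sketch of the three fairness metrics named above. The episode does not spell out formulas, so this assumes one common convention: each metric is the sensitive group's rate divided by the base group's rate, with recall parity using recall, false positive parity using the false positive rate, and disparate impact using the share of positive predictions. The row layout, helper names, and data are hypothetical.

```python
def group_rates(rows):
    """Recall, false positive rate, and positive prediction rate for one group.

    Assumes the group has at least one actual positive and one actual negative.
    """
    tp = sum(r["prediction"] == 1 and r["actual"] == 1 for r in rows)
    fp = sum(r["prediction"] == 1 and r["actual"] == 0 for r in rows)
    fn = sum(r["prediction"] == 0 and r["actual"] == 1 for r in rows)
    tn = sum(r["prediction"] == 0 and r["actual"] == 0 for r in rows)
    return {
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "positive_rate": (tp + fp) / len(rows),
    }


def fairness_metrics(rows, sensitive, base):
    """Ratios of sensitive-group rates to base-group rates; 1.0 means parity.

    Assumes the base group's rates are all non-zero.
    """
    s = group_rates([r for r in rows if r["group"] == sensitive])
    b = group_rates([r for r in rows if r["group"] == base])
    return {
        "recall_parity": s["recall"] / b["recall"],
        "false_positive_parity": s["false_positive_rate"] / b["false_positive_rate"],
        "disparate_impact": s["positive_rate"] / b["positive_rate"],
    }


def make_rows(group, tp, fp, fn, tn):
    """Build hypothetical rows from confusion-matrix counts."""
    outcomes = [(1, 1)] * tp + [(1, 0)] * fp + [(0, 1)] * fn + [(0, 0)] * tn
    return [{"group": group, "prediction": p, "actual": a} for p, a in outcomes]


rows = make_rows("base", tp=4, fp=1, fn=1, tn=4) + make_rows("sensitive", tp=2, fp=2, fn=3, tn=3)
metrics = fairness_metrics(rows, sensitive="sensitive", base="base")
print(metrics)
```

With this invented data the model finds half as many true positives for the sensitive group and raises false positives twice as often, which is the kind of gap these ratios are meant to surface.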


[00:14:48] RS: I'm really interested in the whiteboard session, from which those three buckets sprang when you came up with, “Okay, we are going to focus on recall parity, false positive parity, and disparate impact.” Can you share a little about that conversation and how the team over there decided, “This is how we're going to put a circle around what it means for a model to be biased or what goes into a model being biased?”


[00:15:11] TB: Yeah. The discussion around the metrics that we're supporting is really around the research that's happening in the algorithmic bias space. If we're seeing that, hey, a lot of models discriminate against marginalized groups because they end up being more sensitive for those marginalized groups, we want to make sure we're including recall parity. It's also based on the customers, right? If the customers are primarily in a certain industry, and this is the way that the discrimination is happening or the bias could show itself, then let's integrate those types of metrics. These are the conversations we're having around, like, the metrics to deliver. Obviously, as more research happens, we want to be able to flesh out the feature even more and add more offerings to the platform.


[00:16:00] RS: What does the output look like? Say I run the tracing tool on my model, and the alarms go off and they’re like, “Rob, your model is mad biased.” What does that actually look like?


[00:16:10] TB: Well, of course, you can go to the platform. Use it. We’ve got some demo models for you to play around with, but the way the measurements work is you probably have a range that's safer. If your model is at one or at zero, it's either good or bad. If you're noticing, the trend over time is, “Oh, in the middle of the month, I see a huge drop in recall parity. Well, what did I ingest or how did my model's data change?” You actually have a feature breakdown. You can go and examine the feature, click on your data set, see what's the composition, and see the distribution of the feature. It really helps. Maybe it happened during training, you ingested new training data, or it happened in the validation step. We support those environments as well. You get to really dig into the model, examine the data set at the feature level, and then maybe if you filter down on that feature, does the metric improve? So you can see, “Oh, this feature is really the one that's causing the degradation in my metric.”
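The "filter down on a feature and see whether the metric improves" step can be sketched as follows. This is an illustration, not the platform's logic: the `region` feature, the row layout, and the 0.8 to 1.25 "safe" range (borrowed from the four-fifths rule of thumb) are assumptions, since the episode only says that some range is considered safer.

```python
def recall(rows):
    """Share of actual positives that the model predicted as positive."""
    positives = [r for r in rows if r["actual"] == 1]
    if not positives:
        return float("nan")
    return sum(r["prediction"] == 1 for r in positives) / len(positives)


def recall_parity(rows, sensitive, base):
    """Sensitive-group recall divided by base-group recall; 1.0 means parity."""
    s = recall([r for r in rows if r["group"] == sensitive])
    b = recall([r for r in rows if r["group"] == base])
    return s / b if b else float("nan")


def parity_without_each_value(rows, feature, sensitive, base):
    """Recompute recall parity with each value of `feature` filtered out in turn."""
    values = sorted({r[feature] for r in rows})
    return {
        v: recall_parity([r for r in rows if r[feature] != v], sensitive, base)
        for v in values
    }


def in_fair_range(value, low=0.8, high=1.25):
    """Assumed safe range around 1.0, based on the four-fifths rule of thumb."""
    return low <= value <= high


def make_rows(group, region, predictions):
    """Hypothetical rows where every example is an actual positive."""
    return [{"group": group, "region": region, "prediction": p, "actual": 1}
            for p in predictions]


rows = (
    make_rows("base", "CA", [1, 1, 1, 0]) + make_rows("base", "AZ", [1, 1, 1, 0])
    + make_rows("sensitive", "CA", [1, 1, 1, 0]) + make_rows("sensitive", "AZ", [1, 0, 0, 0])
)
overall = recall_parity(rows, "sensitive", "base")
without = parity_without_each_value(rows, "region", "sensitive", "base")
print(overall, without)
```

In this invented data, parity returns to 1.0 once the AZ slice is filtered out, which points to that slice as the source of the degradation.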


[00:17:15] RS: Interesting. Okay. It does show you timing, which feels really important, because knowing that you're biased, or that your model has biases is like less than half the battle, because you have to understand where that comes from. Even if I know I'm biased you still have this blind spot, because it's like I don't have the life experience to understand where I'm even being biased, right? Is that meant to be prescriptive, where it shows you, “Here's where this bias began to show up,” and now you need to drill in and ask yourself, what changed at this moment to result in it. It’s like what is the second half of this once it's detecting the bias? Is it prescriptive to be like, “Here and here's why?”


[00:17:53] TB: Yeah. When I think of the model lifecycle that an ML engineer might be going through, and that we would be a part of, it's their model building, right? They first build the model, they train it, validate it, and then they release it into production. Then we fit into the observability part. We observe the model for them, alert on it, they get that notification, they get to drill down, figure out the root of the issue, start back at building the model now with that fixed, and it’s like now a healthy ecosystem of model building and model monitoring. I don't want to say that's what you're supposed to stick to, but that seems to me like the type of ecosystem we should have. It should not just end at, “All right, model's out there. Let it do its thing. Hope it doesn't break. Hope it doesn't cause any issues.” It's a circular ecosystem of building, observability, and then iterating on that.


[00:18:49] RS: In your opinion, Tsion, what is most responsible for bias showing up in ML models? Is it merely data? Is it the data annotation? Is it the professionals themselves who are imperfect? Where does this come from?


[00:19:01] TB: That's a really good question. The common causes of bias in ML systems could be a number of things. It could be a limited amount of features like maybe some features are just less informative or less reliably collected for minority groups. It could be a skewed sample, so where are you getting your data from? It could be tainted examples. If you're thinking around job orgs, maybe if the system uses hiring decisions made by a manager as labels to select applicants, rather than the capability of job applicants, then the system trained using those samples will replicate any bias from that manager's decisions. Or it could just be sample size disparity. The training data coming from the minority group is much less than that coming from the majority group. Is it less likely to model perfectly the minority group? Yeah. In the field of algorithmic bias, it is all about the data, but it's like, what part of the data contributes the bias depends on how the data is flawed.
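Of the causes listed above, sample size disparity is the simplest to check before training. A minimal sketch, assuming a hypothetical `group` field on each training row and an arbitrary 20% minimum share chosen only for illustration:

```python
from collections import Counter


def group_shares(rows, group_key="group"):
    """Fraction of the training rows that belongs to each group."""
    counts = Counter(r[group_key] for r in rows)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(shares, min_share=0.2):
    """Groups whose share falls below the (assumed) minimum share."""
    return sorted(group for group, share in shares.items() if share < min_share)


# Hypothetical training set: 90 majority rows and 10 minority rows.
training_rows = [{"group": "majority"}] * 90 + [{"group": "minority"}] * 10
shares = group_shares(training_rows)
flagged = underrepresented(shares, min_share=0.2)
print(shares, flagged)
```

A check like this only catches the sample size problem; skewed sampling and tainted labels need inspection of where the data and labels came from.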


[00:20:03] RS: There's really no limit to the way that bias might creep in is what it sounds like. I hesitate to try and simplify this much, but it does feel like, if data was collected in a bias-free manner, that would solve a lot of the problem. However, in the cases where you're relying on historical data, it's either skewed in its collection or in its being recorded, all these problems.


This may be a tough question, but what can one do in the case of bad historical data? Do you just throw it out? Do you get to a point where you’re like, “We can't use this, because it's inherently biased and we can't change it, because that would be changing how it was recorded?” Do you see what I’m getting at here? What does one do when, in a case of historically bad data, you still want to presumably make a model?


[00:20:46] TB: Yeah. I know. That makes sense. I think of also around just the lack of regulation in tech spaces. When you think of a model's bias like, who's going to still stop you from using the model? Right now, it's just at the very early stages of having government level regulation around your models. I feel like maybe if you start at the end, like you have a base set of metrics that machine learning experts say, “This is a starting point for evaluating your model.” If it performs at a certain level, then you have to go back and look into your data and figure out how to fix it, but that's not even there yet. What's the motivation for businesses to go and fix their data? What measurement are we using? What are the regulations around where you have to perform? In so many older industries, there's just like a bar that you have to be at. I don't know if we're there yet, as a society.


[00:21:48] RS: Yeah. There's just really no consequences for not doing this, right? For not honing in on how you're biased. Even in a world where everyone's using the bias tracing tool and it's fantastically successful and widespread, there's still going to be people who don't bother to optimize for that, right? Like, “Oh, no. That's just an added expense. Who cares? We'll just ship our biased product because people will buy it.” That's just another one of these perverse incentives, I suppose. So until there are consequences, unfortunately, we expect this to continue. It's not dissimilar from diversity hiring to connect a dot.


[00:22:17] TB: I know. Make it a business incentive, I always say. I don't think it should be a good faith thing. It's great if a company decides to incorporate this out of just going above and beyond, but if it's a business incentive, then everyone has to abide by it.


[00:22:34] RS: What would it look like if it were a business incentive?


[00:22:37] TB: I think, it would just be like having the people first of all that write these bills be working side by side with the experts in the field. Figure out what is a mandate for the things that impact a certain population. I honestly think even just like the fields, the healthcare field or the finance field, even social media, it'd be nice if there was no discrimination there. If the way that we're crafting the bills takes into account the standard and then having that be like, you're penalized if you don't follow those standards.


[00:23:12] RS: Yeah. That makes sense. It’s a start anyway. Before I let you go, Tsion, I wanted to ask you, for the folks out there who are working on their ML models, maybe they are just beginning on their bias tracking journey. They suspect their model is biased or they're sure it is because they know that they're a flawed human being with their own biases, but they don't know where to start. How can someone just begin to monitor the potential bias that their technology may suffer from?


[00:23:41] TB: Yeah. I know. Great question. You know, you can use Arize. Start by ingesting your model. We're out there. Start collecting the metadata for your model and just see how you evaluate on just some of these fairness metrics. It's a really great starting point and insight into figuring out a starting point for your model.


[00:24:03] RS: Presumably, this person is a data scientist or something adjacent, so they know their way around data. Point that expertise at the potential bias, and you will make progress at the very least. Tsion, this has been a fantastic conversation and we are creeping up on optimal podcast length here. At this juncture, I’ll just say thank you so much for being here and sharing with me your experience and expertise. I've loved chatting with you today.


[00:24:23] TB: Ditto. Thank you so much for having me. I really enjoyed being here.




[00:24:28] RS: How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, e-commerce, media, MedTech, robotics and agriculture. For more information, head to