How AI Happens

Gyant CEO Stefan Behrens: Building Datasets & Ensuring Interpretability

Episode Summary

Gyant CEO & Co-Founder Stefan Behrens explains the challenges inherent in creating datasets for healthcare purposes, as well as the importance of building interpretability into their AI tools.

Episode Transcription

0:00:00.0 Stefan Behrens: That's one of the challenges in our field: you don't wanna have a black box AI, whatever, a convolutional neural network that you throw things into and it comes up with an answer, even if it's right a lot of the time. If you can't explain why, it's not something that clinicians and the people responsible in hospital systems for deploying this sort of technology would feel comfortable with.


0:00:25.9 Rob Stevenson: Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we are about to learn how AI happens.

0:00:57.0 RS: Today on How AI Happens, we're talking about data, or more specifically, the lack of data. With increasingly specific and creative uses for AI, getting your hands on the data you need isn't always as easy as calling up your friendly neighborhood Sama employee. Over at GYANT, an exciting new company using AI to streamline the patient healthcare experience, they had to take matters into their own hands. To learn more, I sat down with their CEO and co-founder, Stefan Behrens, to discuss how GYANT is collecting this data and ensuring that it yields interpretable outcomes.

0:01:33.3 SB: Sure, yeah, happy to. So I'm originally from Germany, which is where I started my career. I originally started in consulting and then jumped into the entrepreneurship world with a video games company that we scaled to about 400 people and then sold, and after that started a direct-to-consumer e-commerce brand that we scaled to about $25 million in revenue and then also sold. And those two exits gave me the flexibility and financial freedom to actually start working on a much bigger problem. We wanted to fix a real problem in the world, and we thought access to healthcare, and taking some of the scariness and the suck out of the healthcare experience for patients, was well worth our time. And for me, this also really started with an experience that my dad had. A couple of years ago, he had a routine blood test from his primary care physician that came back abnormal and showed what's called smudge cells, and usually that points directly to a blood cancer diagnosis, CLL, chronic lymphocytic leukemia.

0:02:31.1 SB: And as far as cancers go, that's probably one of the better ones, but you can imagine, my entire family was in shock: me, my dad, my mom, our relatives. The next step then was a specialist consult with a hematologist, but it took eight weeks to get that appointment, and then you have all that anxiety, all the Googling in the process until you get to that appointment, no one to talk to. After that appointment, they ran some more tests, which came back clean, and the whole thing was resolved. So it was a false alarm, but you can imagine, for like three months, our entire family was upside down. This kind of anxious experience prompted me to say, "There has to be a better way to address questions along the way. There has to be a better way to hand-hold patients through those scary journeys." And that's where GYANT comes in.

0:03:17.1 RS: There are plenty of savvy entrepreneurs out there who experience a problem personally and set out to resolve it. Most of them, though, don't have the technical know-how to find a solution using AI, which is part of what set Stefan apart. I wanted to know when it became apparent to his team that AI was the key to solving this healthcare problem.

0:03:35.8 SB: Healthcare is really, really complicated. There's a really long, long tail of all the things that might influence your healthcare outcomes and your conditions, but if you look at what's driving the majority of the volume in primary care and urgent care, it's actually only about 25 different things. The problem simplifies to something that seems more manageable: you need to get really good at those 25 things that drive the volume, and you also need to get really good at identifying when it's not one of those 25 things and you need to escalate to someone who has the necessary skill sets and broader ability to handle those cases. So rather than having to deal with thousands of potential classes, that seemed like a more immediate, more manageable problem, and so we set out to acquire and then build a data set and start working on how to tackle these high-volume, high-frequency conditions that make up most of the waiting room in any primary care office.

0:04:31.7 RS: As GYANT is focusing on the early stages of the healthcare funnel, they don't have the luxury of feeding X-ray or MRI images into their learner. In fact, what they found was that in many cases, the data their technology needed just didn't exist.

0:04:47.5 SB: There's not really a good data set that you can go off of. For example, if you think of a headache, you would think, "Okay, there's gotta be some papers on how headache relates to conditions," and then you can probably build a Bayesian model that allows you to identify priors and work with that. The reality is, there's no good data set out there about people who are currently experiencing a headache. You only have the cases where the headache was concerning enough that somebody got into a car and drove to their physician, and then there's a record of that. You don't know how long that happened before, you don't know what happened after, whether they had episodes in between, so the prevalence is really unknown because there's no good data set for that. And that's actually true for a lot of symptoms, so it's not trivial to go about this. The underlying model was essentially difficult to build because that data set just did not exist in the shape and form we needed it.

0:05:40.9 RS: So even in the cases where data was available, the team was looking at self-reported symptoms or the famously illegible doctor's notes.

0:05:49.4 SB: Yeah, so we looked at doctor's notes early on in the history of the company. We were partnered with a health system in SoCal, and we actually looked at written doctor's notes from emergency room visits, primary care visits, urgent care visits. And that was the first starting point. So we built an NLP extractor that took those written notes and turned them into features, so we extract that somebody has a headache or does not have a headache, and then you can map that to the diagnosis that was assigned by the doctor after the visit, so you have an ICD code that codes the final outcome into the class. And that made it somewhat tractable, but some of the challenges we found are that those doctor's notes are written by hundreds of different doctors, each one of them has a slightly different style of how they write things, what they include, what they don't include, there's tons of abbreviations, there's typos. And so you're dealing with a pretty messy data set that you first need to process before you can extract features. And then the second thing we noticed...
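
The note-to-features pipeline Stefan describes can be sketched in a few lines of Python. This is a minimal toy, not GYANT's actual extractor: the symptom lexicon, the three-token negation window, and the ICD-10 label are all illustrative assumptions.

```python
import re

# Illustrative symptom lexicon; a production system would use a clinical
# NLP toolkit and a far larger vocabulary with abbreviation handling.
SYMPTOMS = {"headache": ["headache"], "fever": ["fever", "febrile"], "cough": ["cough"]}
NEGATIONS = {"no", "denies", "without", "negative"}

def extract_features(note: str) -> dict:
    """Map a free-text note to {symptom: True / False / None (not mentioned)}."""
    tokens = re.findall(r"[a-z]+", note.lower())
    features = {s: None for s in SYMPTOMS}
    for sym, aliases in SYMPTOMS.items():
        for i, tok in enumerate(tokens):
            if tok in aliases:
                # Naive negation check: look back up to three tokens.
                window = tokens[max(0, i - 3):i]
                features[sym] = not any(w in NEGATIONS for w in window)
    return features

# A labeled training example pairs the extracted features with the
# visit's post-visit ICD code (label shown is just a placeholder).
note = "Pt denies fever. Presents with severe headache x3 days."
example = (extract_features(note), "ICD_LABEL_FROM_VISIT")
```

The point of the sketch is the shape of the output: a fixed feature vector per note, with "not mentioned" kept distinct from "explicitly absent", which matters for the sparse data discussed later in the episode.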

0:06:50.8 SB: And that comes back to what I mentioned earlier. You have bias. So for example, we looked at just the notes of those encounters for children, and the majority of those cases are middle-ear infections, so any kind of model you build, just by the fact that you're talking about a patient that is not 18 yet, is almost always gonna predict middle-ear infection, because that's just the prevalence and the bias in the classes that you see based on the data. Obviously that's not a correct outcome, but you just have to understand that what you see in those notes is a very biased sample of what patients are experiencing, so you can't just apply that same logic to someone who's experiencing a headache, or a child that's experiencing ear pain. You're gonna get pushed in the wrong direction; you need to correct for those biases in the data.
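
The prevalence bias Stefan describes, where pediatric notes are dominated by middle-ear infections, is the classic class-imbalance problem. One standard correction, shown here as a sketch with made-up labels, is weighting training examples by inverse class frequency so the dominant class can't swamp the rare ones:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to how often it appears, so a model
    trained with these weights is not dominated by the most prevalent class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Toy pediatric sample where one diagnosis dominates the labels.
labels = ["otitis_media"] * 8 + ["strep_throat"] * 1 + ["meningitis"] * 1
weights = inverse_frequency_weights(labels)
# The common class gets a small weight; the rare classes get large ones.
```

Most libraries expose the same idea directly (e.g. a `class_weight="balanced"` style option), but the correction only helps once you know the sample is biased, which is exactly the realization described above.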

0:07:39.2 RS: Inherent biases. Completely un-standardizable doctor's notes. Messy, messy data. This is when GYANT realized they needed to develop their own data set.

0:07:49.6 SB: The next step after that for us was to actually start augmenting the dataset that we had from the doctor's notes, trying to rebalance the biases. We operated a direct-to-consumer app for a while that generated a dataset directly: we start with patients that are experiencing some sort of symptom, we ask some structured questions about their presentation, their symptoms, we then funnel them to an actual doctor encounter, and then that doctor assigns a diagnosis and a treatment plan. So that kind of is the final answer, and what that allows you to get is a much better and more cleanly labeled case, where it's not just extracting a random set of features from the doctor's note, which is not very well structured and may include certain things and not others. Doing it the way I just described allows us to have a standard set of features for every case that is then associated with the diagnosis.

0:08:44.0 SB: And that actually allowed us to plug some holes and some gaps. We looked specifically for certain cases that didn't have enough representation in the original EMR sampling, and that allowed us to balance the data set that we have for training purposes and validation purposes. We had a team of providers operating the back end of that, actually seeing and treating patients. And that was the way to generate the symptom information in a structured way, and also have a diagnosis assigned by a provider on that dataset. Essentially, it resulted in a complete data set where you had a structured view of the symptomatic presentation, and then a structured view of the diagnosis and associated treatment plan.

0:09:25.3 RS: So there you have it. A full data set, right? Not so fast.

0:09:30.1 SB: How do you find someone who is experiencing a symptom of a particular type right now? Because you're trying to correct biases in your existing sample data, you don't need just cases that have a headache or maybe a sniffle; you wanna try to get as many presentations as possible. So that was one of the challenges: finding the right patients that are currently undergoing some sort of symptom and getting them to engage with a digital health platform. But what it boils down to is you're operating a telehealth service, just like, whatever, Doctor on Demand or Teladoc, and you tie into the data, and you need to make sure that patients understand that their data is being used for this purpose.

0:10:14.3 RS: Given the limited amount of data at hand, particularly in the case of a patient only reporting a handful of symptoms, I have to wonder: can this technology really make an accurate diagnosis with so few inputs?

0:10:26.9 SB: It's challenging, and for us, this resulted in actually developing a hybrid model, so there's a pretty substantial rules-based expert system that sits on top of this to help manage some of the downsides of the inherent biases we have in the data. So for example, you may have someone that presents with chest pain, and in the majority of those cases that's probably related to a cough they have, if they have bronchitis, and it might also be GERD if they have some acid reflux and things like that. But there are these rare cases where the chest pain is directly associated with a cardiac event. So you have the most likely explanation of the chest pain, which is probably a cough and some respiratory infection, or maybe acid reflux, but there's a small probability of a much more serious outcome, where if you miss it, the patient may suffer the consequences.

0:11:19.7 SB: So you also have this problem of not just trying to maximize the likelihood of the predicted class, but also having to weigh that against what the impact for the patient is if you get it wrong. You may be pretty confident that it's a particularly low-acuity concern, but if there's even a small chance of a cardiac event, you still wanna get people to the ED and have them checked out, even if the chances of that occurring are maybe only 5%. You wanna have that safety built in, and so there's not just the element of predicting the right class; based on the prediction, you also have to weight it by the severity, or potential severity, of getting it wrong, and then make sure that you give recommendations on that basis. So that's why we need this extensive rules layer, to essentially pick out cases that have red-flag symptoms that might indicate a much more serious condition and filter them out. Then, only if none of these urgent presentations or red-flag symptoms are present, use the AI model to predict the most likely outcome and recommendation.
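
The hybrid architecture described above, deterministic red-flag rules first and the statistical model only for what remains, might look roughly like this. The specific red-flag combinations and the stub model are invented for illustration:

```python
# Illustrative red-flag rules; a real system would have a clinically
# curated, much larger rule set.
RED_FLAGS = {
    ("chest_pain", "shortness_of_breath"): "ED",  # possible cardiac event
    ("fever", "stiff_neck"): "ED",                # possible meningitis
}

def triage(symptoms: set, model) -> str:
    # 1. Deterministic safety layer: any red-flag combination escalates
    #    immediately, bypassing the learned model entirely.
    for combo, disposition in RED_FLAGS.items():
        if set(combo) <= symptoms:
            return disposition
    # 2. Only presentations with no red flags reach the learned model.
    return model(symptoms)

# Stub model that always suggests self-care, for demonstration only.
self_care_model = lambda s: "self-care"
triage({"chest_pain", "shortness_of_breath"}, self_care_model)  # escalates to ED
triage({"cough"}, self_care_model)                              # falls through to the model
```

The design choice this illustrates is that the high-severity, low-probability branch is handled by auditable rules rather than by the classifier's argmax, which is exactly the "weight it by severity" point above.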

0:12:24.9 RS: In these cases, you can be reasonably certain that it's nothing, but in the case that it is something, it's catastrophic. Is the answer more data? How do you make sure that you're capturing those instances when it's a more severe outcome?

0:12:38.5 SB: It's probably more data, and you can try to manage that by over-sampling. You essentially over-sample these cases and make sure that they're represented, that your model can actually pick up on these cases, and then you have a safety layer based on rules on top of that as well. The sensitivity needs to be different depending on the severity of the potential outcome. You wanna make sure that you pick up on cardiac events, on meningitis, and other potentially fatal outcomes with a much, much higher sensitivity, and then you accept that you're gonna be wrong in more cases and that you escalate people to the ED when it turns out they actually had something much less severe. But that's okay; that's the right trade-off. Essentially, depending on the individual presentations and the conditions, you have very different sensitivity levels, and you accept that the specificity is gonna be poor in some cases. And that's fine, because it's the right call from a medical perspective.
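
Over-sampling rare-but-severe classes, as Stefan describes, can be as simple as duplicating minority examples until the model sees them often enough to learn them. A minimal sketch with toy data (bootstrapped resampling or SMOTE-style synthesis are common refinements of the same idea):

```python
import random

def oversample(dataset, severe_classes, factor=5, seed=0):
    """Duplicate examples of rare, high-severity classes so the model
    sees them often enough to learn their signatures."""
    rng = random.Random(seed)
    out = list(dataset)
    severe = [ex for ex in dataset if ex[1] in severe_classes]
    for _ in range(factor - 1):
        out.extend(severe)
    rng.shuffle(out)
    return out

# Toy data: 90 common cases, 2 rare cardiac cases.
data = [({"cough"}, "bronchitis")] * 90 + [({"chest_pain"}, "cardiac_event")] * 2
balanced = oversample(data, severe_classes={"cardiac_event"})
# cardiac_event examples go from 2 to 10 without touching the common class.
```

Note that this only rebalances the training signal; the per-condition sensitivity targets mentioned above still have to be enforced at decision time (thresholds or rules), not just at training time.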

0:13:36.5 RS: If GYANT could get their users to sit down and answer 400 questions about how they feel, the problem gets a little easier, but at a certain point, it's easier for someone to call or go to a doctor than it is for them to fill out forms on a browser. This is a crucial part of AI product design, and it got me wondering, whose responsibility is it to balance the needs of the technology with realistic user behavior? 

0:14:03.8 SB: It's a very interesting mix of skill sets and experiences that are necessary to build the right experience. For us, it's a combination of the data scientists, the clinical team, and the user experience team, and they have to work very closely together, because the user experience side imposes constraints. As you mentioned, no patient will sit through a 400-question interview online before they get an answer; that's just not realistic. So you put a constraint on: there are probably 25-30 questions max that patients will sit through, and we've introduced some elements to make sure that patients understand where they are on their journey. So there's a progress bar, there's a time indicator; you wanna give people a sense as to how much more is coming, to maximize the number of questions they will actually answer. From a clinical safety point of view, you need to ask all the very low-likelihood but high-impact, high-severity questions first.

0:15:02.4 SB: But from a user experience point of view, that's kind of weird, because you just have a fever or a headache, and then they ask you all these questions that seem, for most people, completely irrelevant to your particular case. So we also throw in questions that seem relevant to your case, so that you can actually answer yes to something that feels relevant and have a way of expressing yourself to the machine. So it's a mix of what the clinical team needs, what the data science team needs, and what the user experience team has found out, through trial and error and testing with hundreds of thousands of users, as to what patients will answer.

0:15:40.0 RS: Given this balancing act between the need for data, clinical responsibility and user behavior, not to mention the importance of interpretability, I wanted to know more about how the model's accuracy is measured and reported.

0:15:52.8 SB: The model we currently have in production is actually a random forest model. And one of the reasons for that is that it deals well with two aspects. One is that we have incredibly sparse data: of the, whatever, 4,000-plus data points that we might collect about a case, and, I don't know, thousands of condition classes that might be the output of the model, it deals pretty well with the fact that you may only have 30 or 40 of the 4,000 data points filled and the rest is unknown. So that's one element. While it doesn't give you a Bayesian explanation, like you can't interpret the probabilities or the conditional probabilities directly, it does allow you to indicate which of the features caused the model to predict a particular class, and that's useful because that's something that you can resurface to the clinician, or resurface to an expert team and say, "Does this make sense?"

0:16:48.0 SB: The model picked this particular condition for this patient because these were the defining features that actually made it pick this: these features added to the probability of this class and this feature detracted from it. So you have some interpretability that allows you to review this and have good conversations with clinicians, who need to build trust that a solution like this makes the right calls. That's one of the challenges in our field: you don't wanna have a black box AI, whatever, a convolutional neural network that you throw things into and it comes up with an answer, even if it's right a lot of the time. If you can't explain why, it's not something that clinicians and the people responsible in hospital systems for deploying this sort of technology would feel comfortable with.
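
Surfacing which answers "added to" or "detracted from" a prediction, as Stefan describes, is a per-prediction feature-contribution view. The toy sketch below uses naive per-feature log-likelihood ratios between two candidate classes rather than an actual random forest (tree-path contribution methods or SHAP would be the real tools), purely to illustrate the clinician review workflow; all the numbers are invented:

```python
import math

# Toy per-feature likelihoods P(feature present | class); invented numbers.
LIKELIHOOD = {
    "otitis_media": {"ear_pain": 0.9, "fever": 0.6, "headache": 0.2},
    "migraine":     {"ear_pain": 0.05, "fever": 0.1, "headache": 0.95},
}

def explain(features, predicted, alternative):
    """Score each present feature by how strongly it favors the predicted
    class over an alternative: positive contributions 'added to' the
    prediction, negative ones 'detracted from' it."""
    contributions = {}
    for f in features:
        p = LIKELIHOOD[predicted].get(f, 0.5)
        q = LIKELIHOOD[alternative].get(f, 0.5)
        contributions[f] = round(math.log(p / q), 2)
    return contributions

# A reviewing clinician sees exactly which answers drove the call.
explain({"ear_pain", "fever"}, "otitis_media", "migraine")
```

The output is the artifact that gets put in front of an expert team: a signed, per-feature account of the prediction rather than an opaque probability.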

0:17:33.1 RS: It struck me here that the increasing focus on interpretability from AI tools is in direct conflict with the prevalence of so-called black box algorithms. According to Stefan, this conflict is minimized or magnified based on what's at stake and the implications of being wrong.

0:17:49.4 SB: For certain types of problems, I think certain approaches that are more black box are very successful in predicting the right outcome. So in cases where it's okay to be wrong, I think that is not a problem. In this situation, you're dealing with patient lives and outcomes, and being wrong is not acceptable, so you need to find ways to make it auditable and understandable for the humans who take responsibility and take risk for putting such models into production. I haven't found a good way of doing that without having some explainability and some, let's say, auditable elements of the model that clinicians can look at and say, "Okay, I understand, it's asking these questions for a case of chest pain, it's asking these questions for a case of headache and fever. Those are the same questions I would ask, in kind of the same sequence I would ask them. So okay, I'm fine with this. It's the right approach."

0:18:49.2 SB: It's much more challenging if you have a model that dynamically calculates the next question it needs to ask, and then you don't really know what the outcome is, because then you have to go into looking at, whatever, a thousand different cases, run them through the model, and see what the outcome is. So the party accepting the risk of deployment has significantly more validation effort on their side to validate the outcomes, and that's hard and often difficult to expect them to do. So you need to come up with a different way of proving that your model does what it's supposed to and doesn't introduce some unknown quantity of risk into the equation.

0:19:28.1 RS: Isn't the challenge that there are perhaps biases within that algorithm that we cannot understand and root out because of its black box nature? 

0:19:37.9 SB: That is another element, yeah. We get that question increasingly: how do we make sure that the data set we're training our model on isn't inherently biased and doesn't introduce challenges like perpetuating biases that come from social determinants of health, the color of your skin, or the socio-demographic background and access to resources that you have? And I think there were actually some examples in the news where a model was built on training data that took real-life patients and real-life outcomes from the EMR, but because there were biases in the decision-making of the clinicians, you essentially perpetuate that bias by building a model based on those same decisions. So that is a complicated area where I don't know if there are any simple answers, but we go through a pretty rigorous process to make sure that those biases are not reflected in the training data sets we work with.

0:20:32.8 SB: So you run some statistical tests against demographic data and things like that to make sure that there are no inherent biases in the data that you then perpetuate. It is by no means perfect, but it's the best we have been able to come up with so far. And it requires you to actually have a pretty well-balanced data set, or multiple data sets that you can compare, so it's not easy, but yeah, I think that's what we're doing now, and it's probably the most practical way of approaching it.
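
One simple form of the statistical tests Stefan mentions is comparing outcome rates across demographic groups before training. This sketch (group names and labels are invented; a real audit would use a proper significance test such as chi-squared rather than a raw rate gap) flags a disparity worth investigating:

```python
from collections import defaultdict

def rate_disparity(records, group_key, outcome):
    """Compare an outcome's rate across demographic groups; a large gap
    flags a potential bias to investigate before training on the data."""
    totals, hits = defaultdict(int), defaultdict(int)
    for rec in records:
        g = rec[group_key]
        totals[g] += 1
        hits[g] += rec["label"] == outcome
    rates = {g: hits[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values())

# Toy records: group A is escalated at 30%, group B at only 10%.
records = (
    [{"group": "A", "label": "escalate"}] * 30 + [{"group": "A", "label": "self-care"}] * 70 +
    [{"group": "B", "label": "escalate"}] * 10 + [{"group": "B", "label": "self-care"}] * 90
)
rate_disparity(records, "group", "escalate")  # gap of 0.2 between groups
```

A gap like this doesn't prove the labels are biased, but it tells you which slices of the training data need clinical review before the model inherits the pattern.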

0:21:04.5 RS: Surely the potential for bias is enough of a reason to move away from black box algorithms, even in the cases where the impact of being wrong feels small. If there's any possibility to perpetuate a harmful bias, then aren't the stakes always high? This is a topic deserving of its own episode, or perhaps a series of episodes. We're going to explore it much more on the show, so stay tuned. Before I let Stefan go, I wanted to hear what about our industry has him most excited.

0:21:36.7 SB: I think there's a lot of hype associated with AI in healthcare, and also AI in other industries, and what I'm most excited about is how this hype goes through that hype cycle: you have the peak of inflated expectations and then the trough of disillusionment. But I think we are seeing some applications that have passed through those two stages and are now in production deployments, and they actually work. They may not be perfect, and they may not have kept all of the initial promises, but they do meaningful work. And I think in medicine, for example, there are plenty of examples in computer vision, analyzing, whatever, X-rays and computed tomography data, where AI is actually doing the same, if not a better, job than even trained radiologists. And that doesn't mean we can get rid of the radiologists, but it means we can have maybe an additional safety net, or an assistant pointing certain things out that then get reviewed by a human who makes decisions on it.

0:22:39.8 SB: And I think that's all moving in the right direction, because it increases the productivity of very scarce resources. And that's the part that also excites me about what we are doing; it's all about extending really scarce resources and bending the cost curve down. Because if you can go from a radiologist reviewing a particular image to a computer plus a radiologist reviewing the image, and you can increase the throughput and the safety at the same time, that means the cost per review gets a lot cheaper, and that makes healthcare more accessible and brings costs down. And what we do is similar: in many cases we augment the contact center and the medical assistants, the providers, the nurses in the follow-up, because these resources are super scarce, and they will just not be able to keep up with demand as populations get older. We can't train doctors and nurses fast enough right now to keep up with the demand.

0:23:39.4 SB: So the only way to not have shortages in care, and to keep access as equitable and broad as possible, is to use technology to augment these resources in meaningful ways. So I'm excited about that, and I think we're just seeing the beginning of it, because everyone is trying to grapple with the challenges that we've outlined: the right data sets don't exist, they're full of these biases. But I think people are finding more and more ways of overcoming these initial obstacles and getting applications into production, and as these cases are in production, they often generate an additional data set that is cleaner than the original data set and that allows them to improve the algorithms and applications even further. So I think we're at the beginning of a really exciting development of AI in medicine.


0:24:30.6 RS: Next time on How AI Happens.

0:24:34.7 Speaker 3: Well, when we talk about ethical considerations of AI, it is not just about the human-in-the-loop sitting at the end of your AI pipeline, the responsibility starts from the developer who is building it.


0:24:52.7 RS: How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video, and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, e-commerce, media, MedTech, robotics, and agriculture. For more information, head to