How AI Happens

Valo Health Chief AI Officer Brandon Allgood

Episode Summary

Valo Health is transforming the drug discovery and development process while accelerating the creation of life-changing drugs. Brandon manages the development and application of Valo's AI platform and is responsible for the company's technological vision. Despite his unconventional background, he is an expert in machine learning and AI and leverages these technologies to innovate pharmaceutical drug discovery and development processes.

Episode Notes

In our conversation, we discuss Brandon's approach to problem-solving, the use of synthetic data, challenges facing the use of AI in drug development, why the diversity of both data and scientists is important, the three qualities required for innovation, and much more.

Tweetables:

“Instead of improving the legacy, is there a way to really innovate and break things? And that’s the way we think about it here at Valo.” — @allg00d [0:08:46]

“Here at Valo, if data scientists have good ideas, we let them run with them, you know? We let them commission experiments. That’s not generally the way that a traditional organization would work.” — @allg00d [0:11:31]

“While you might be able to get synthetic data that represents the bulk, you are not going to get the resolution within those patients, within those subgroups, within the patient set.” — @allg00d [0:15:15]

“We suffer right now from a lack of diversity of data, but then, on the other side, we also suffer as a field from lack of diversity in our scientists.” — @allg00d [0:19:42]

Links Mentioned in Today’s Episode:

Brandon Allgood

Brandon Allgood on LinkedIn

Valo

Valo on LinkedIn

Opal platform

DALI Alliance

Logica

Brandon Allgood on Twitter

Rob Stevenson on LinkedIn

Sama

Episode Transcription

[INTRODUCTION]

[0:00:04.5] RS: Welcome to How AI Happens, a podcast where experts explain their work at the cutting-edge of artificial intelligence. You’ll hear from AI researchers, data scientists, and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they’re facing along the way. I’m your host, Rob Stevenson, and we’re about to learn How AI Happens.

[INTERVIEW]

[0:00:32.3] RS: Here with me today on How AI Happens is the Chief AI Officer over at Valo Health, Brandon Allgood. Brandon, welcome to the podcast, how are you this morning?

[0:00:39.8] BA: I’m doing well, thanks for having me.

[0:00:41.4] RS: Really pleased to have you. You have got on a plane to visit the labs, right? Of your company? Is that where you are this week?

[0:00:47.9] BA: Yeah, today, I’m in Lexington where we have a big set of our labs.

[0:00:52.1] RS: What do you like to get up to once you are kind of rallying the troops there at the laboratory?

[0:00:55.9] BA: You know, I sit in front of a screen all day most of the time and so it’s nice to actually walk through a laboratory and see equipment and talk with the people that are generating the data and making sure that we’re all connected and everybody understands why we’re generating certain datasets, how we might improve datasets, how we might improve speed just to feed the beast that is the machine learning and the Opal platform.

[0:01:18.1] RS: Taking a break from screens to do some science, right? Have some real conversations. Good, glad to hear it. Well, before we get too deep in the weeds here, you mentioned the Opal system. I want to make sure we set some context for you and the company, can we maybe learn a little bit about you Brandon, your background, how you wound up at Valo and then we can get into the company as well too.

[0:01:34.7] BA: Yeah, absolutely. So, I kind of come from a bit of a non-traditional place, which I guess is pretty typical for people in machine learning and AI. I mean, when I was in school, it was a very under-taught or under-appreciated topic. So my PhD is actually in cosmology. I wanted to be an academic (you don't normally do that if you don't want to be an academic) but I quickly got dissatisfied with academia and landed squarely in Silicon Valley. For me, my life and passion is around the intersection of large-scale computing, machine learning, and science. What I found out very quickly is that I had personally studied the science of the last century.

I believe the science of this century is biology. Really, that's where the frontiers are, that's where the real unknowns are. And because of the nature of biology versus physics, physics being a really reductionist science, biology was, I think, hurt by physics being reductionist and by people believing they could apply that approach to biology: you know, sequence the human genome, and now we'll be able to find cures for all diseases.

You can see how that breaks down, and so what we have in front of us this century is biology, not so coincidentally combined with the computing power and the power of machine learning, so bringing those together. So, I've spent most of my career for the last 17 years working that intersection, starting a number of small companies and non-profits. Valo is kind of a culmination of my experiences and the experiences of other executives here at Valo.

So one of the things you come to appreciate after spending time in this industry is that drug discovery and development is death by a thousand cuts. While you can build a point solution, even the best point solution in the world, one that essentially reduces the time of that particular part of the drug discovery and development pipeline to zero and can guarantee everything, it turns out that if you truly want to have an impact on the probability of success of a drug, the cost, and the time to market, point solutions are not really going to move the needle.

Because that program or that drug then gets killed by every other orthogonal thing that happens. Really, the way that you want to approach this and have the impact that I think machine learning can have on this industry is that you need to build out a system, build out a capability, that attacks everything along the value chain, from target ID through chemical design and into clinical development, and that is what the Opal platform does here at Valo.

I think the other thing that came in at about the same time, and if you're in startups, you'll understand, is that successful startups can ultimately point back to a number of synergies, and one of those other synergies is human data, right? Longitudinal human data is not something you can go into the lab and generate any more quickly than our lives unfold.

Someone had to say, 15 or 20 years ago, "Hey, let's start collecting this EMR dataset, this longitudinal data, in electronic form, assuming we will be able to use it in the future," and certain institutions around the world started doing that around 15 to 20 years ago. So now, we have some pretty amazing datasets, and what Valo has done is develop exclusive relationships with these suppliers so that we can get this very large amount of longitudinal data.

Which then allows us to really dig in on disease and disease progression and so I think it’s a culmination of these factors that ultimately has resulted in us founding Valo and developing the Opal platform.

[0:05:16.6] RS: When you say the approach that was needed was attacking every stage along the way, do you just mean optimizing every step? What does the operative word "attack" mean there?

[0:05:25.7] BA: So it's interesting. In certain aspects, right? I mean, if you look at the entirety of the drug discovery and development pipeline, the science that's being done there and the approaches that are being applied there, those have been developed historically, largely since the 1950s and even before, and have been developed in a longitudinal way.

So for example, you know, the innovations in the past have been things like high-throughput screening, which is really about applying a larger and larger shotgun, a shotgun approach, right? So what you find is there have been a lot of innovations, but they've been toward a certain type of drug and a certain type of prospect.

So for example, most of the system, at least in the discovery phase, was designed for blockbuster drugs, right? It was designed for blockbuster drugs that attack what I would call low-hanging fruit. Statins, for example. Statins lower cholesterol; you find an enzyme and inhibit it. So a statin inhibits an enzyme, and inhibiting an enzyme is actually a fairly simple process, relatively speaking.

I mean, I don't want to underplay what people have done in the past, but it is a fairly simple, low-hanging-fruit process, and then once you lower cholesterol, you can make billions of dollars. So it's okay to spend a lot of money and have a lot of inefficiencies as long as you are assured a billion-dollar-plus drug.

But what we've found is that now, all of the low-hanging fruit is gone, and we're starting to tackle diseases like cancer and cardiovascular disease, many of these more complex diseases where we're going to take a more rifle-like approach and potentially have smaller patient populations. What that then requires is, you know, in some cases higher efficiency and in some cases higher accuracy.

So it's not one approach overall. In some places it is efficiency, but in other places it's taking a very different approach, a much more thoughtful approach. That's one of the things I would say about Valo: 50% of the effort is in building the algorithms and the platform, and 50% of the effort is thinking about, if you were to start over from scratch right now and you were faced with human health, without any legacy issues, how would you change the process?

How would you make science and data science interact together in different ways? So I would say, it’s both increasing efficiency but also really just questioning the process.

[0:07:51.7] RS: Is that approach to machine learning limited to healthcare, just given the way that traditional pharmaceutical development has happened or would you say, you could apply that philosophy just about anywhere?

[0:08:04.3] BA: I think you can apply that philosophy in many industries, right? If you look at how innovation happened most recently in the auto industry, the real innovation didn't come from making gasoline cars more efficient; it came from thinking in a different direction, right?

It was really about re-examining the question, re-examining the fundamental problem with a new set of tools. Same with the transportation system and, you know, the Ubers and the Lyfts of the world. Again, it was about just kind of looking at the fundamental problem and thinking, "How would we solve this today? Forget all the legacy."

Instead of improving the legacy, is there a way to really innovate and break things? And that’s the way we think about it here at Valo.

[0:08:54.4] RS: So what role would you say machine learning is playing in traditional pharmaceutical companies and what is sort of the disruption that Valo offers?

[0:09:03.4] BA: Right, so I think it goes back to exactly that. I think if you look at most pharmaceutical companies, and I would actually say all pharmaceutical companies, data science is seen as a support tool. As you pointed out, as a way of making existing processes slightly more efficient; they play an advisory role.

I would say at Valo, the Opal platform and the machine learning and the data science, we see those as fundamental, right? While certain investors, certain individuals, will look at Valo and say, "Okay, well, they've got two phase-two clinical trials, they've got a pipeline of drugs, that's their value," and I don't want to scoff at that, there's definitely value in the drugs we're bringing forward.

But internally, we believe, and our investors believe, that the value really is in the Opal platform, and that those programs are validations and test beds for improving the Opal platform. Essentially, why would you put all of your money in the eggs instead of the chicken, right? We're investing in the chicken that can lay the golden eggs, not necessarily the eggs themselves. I would say that takes a different mindset and a different approach, and I would say that's really the value at Valo, or what we call the Valo add.

[0:10:21.5] RS: I like that, the Valo add. Yeah, so this problem of which came first, the data science and machine learning or the data being applied to machine learning, is interesting, and so I'm curious: what should people be listening for and looking for in their organizations to ensure that they are themselves at a data-driven organization?

[0:10:39.9] BA: Yeah, I would say who is leading the teams, right? Who is leading the teams? And what you’ll find if you look at the leaders, the lead of a program or the lead of a team, if it’s always the clinician or always the medicinal chemist or always the biologist and not the data scientist, then you’re likely at an organization where data science is in service. 

But if you start to look at, "Well, wait, you know, a data scientist can lead a program, can be fundamental in decision-making," right? Where decision-making does not just rest with the traditional scientist but rests with either a shared group of people or with data science. So for example, you know, I know in other companies, other pharmaceutical companies, there are data scientists who have a lot of ideas and want to drive programs forward but are often shut down.

Here at Valo, if data scientists have good ideas, we let them run with them, you know? We let them commission experiments. That’s not generally the way that a traditional organization would work.

[0:11:42.7] RS: Yeah, that makes sense. It's really as simple as: who gets to make decisions, where is investment happening, what behaviors are rewarded, that sort of thing. That's not even really a technical question, it's just sort of follow the money in a way, follow the activity, follow who gets the trust.

[0:11:55.2] BA: Absolutely, and it's pretty interesting if you think about it, and the money piece is even more interesting. There are these traditional metrics within financing, and my background is not finance, so I don't want to overstep my bounds here, but there are traditional metrics that CFOs within pharma use to track whether capital is being spent properly, right?

What’s interesting is those metrics are almost the opposite of the metrics that CFOs use in technology and so we’ve often sat down in discussions about capital deployment and how we think about capital deployment not from a pharma perspective and not necessarily from a tech perspective but as a blend and I’m not going to say it’s easy. It takes some tradeoffs, right?

Do you want to invest in that clinical trial or do you want to invest in the platform? And I think it just takes a lot of really smart people sitting down and thinking about where the short-term value is and where the long-term value is and where we want to invest.

[0:12:54.5] RS: Yeah, yeah, makes sense. I'd like to go back to something you said about the longitudinal data, or lack thereof in lots of cases, and how really only 10 to 15 years of data is available in some of these cases. Are we at a point now where it is more reliable for you to source this data, or do you still expect synthetic data to play a part moving forward?

[0:13:18.4] BA: Oh, well, I think those are two different questions. I think again, with data science, there's never enough data. It doesn't matter if you're in imaging or if you're in NLP, but undoubtedly with human data, there is definitely not enough data. There is definitely not enough diversity of data in terms of genetic diversity, in terms of ethnic diversity.

We can source pockets of data, and we have found some pretty diverse sets, but there are still large parts of the world's populations that are lacking, and if we are talking about human health, we have to be thinking about all humans and ensure that the human data we get is representative of the human race and of those who are most likely to suffer from certain diseases.

On the other side, in terms of synthetic data, you know, we run into privacy issues all the time, right? Privacy issues, whether under GDPR or HIPAA or some other compliance regime, and oftentimes, because these are data from individuals, we want to respect their privacy, and the organizations that are collecting and offering those data for use want to do what's best for patients.

So, you know, often what that means is these human datasets are siloed and have rigorous controls around them, which then inhibits science, right? So there is always a balance between those things. Now, there are some pretty interesting recent advances in the application of transformers to EMR and in our ability to generate EMR datasets that are synthetic but representative of the underlying training set from which they were pulled.

We're not quite there, certainly not in the case of rare diseases, or even in diseases that are less rare, like many types of cardiovascular disease. I think you and I talked about it earlier: these diseases are largely what I would call syndromes, and so while you might be able to get synthetic data that represents the bulk, you are not going to get the resolution within those patients, within those subgroups, within the patient set.

Such that you can really start to drill into the fundamental biological dysfunctions leading to that syndrome. So I think synthetic data would absolutely be hugely helpful in our ability to develop algorithms and apply them.
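To make Brandon's "bulk versus subgroup resolution" point concrete, here is a toy sketch (not Valo's method; all variable names, numbers, and the naive single-Gaussian generator are illustrative assumptions): fit one distribution to a cohort containing a small, biologically distinct subgroup, sample synthetic patients, and compare. The bulk statistics match, but the rare subgroup's structure is smeared away.

```python
import numpy as np

rng = np.random.default_rng(0)

# Real cohort: 95% "bulk" patients, 5% rare subgroup with a shifted biomarker.
bulk = rng.normal(loc=[120.0, 5.0], scale=[10.0, 1.0], size=(950, 2))
subgroup = rng.normal(loc=[120.0, 12.0], scale=[10.0, 1.0], size=(50, 2))
real = np.vstack([bulk, subgroup])

# Naive synthetic generator: fit one multivariate Gaussian to the whole cohort.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=1000)

# Bulk statistics are reproduced well...
print("real mean:     ", np.round(real.mean(axis=0), 1))
print("synthetic mean:", np.round(synthetic.mean(axis=0), 1))

# ...but the rare subgroup (biomarker > 10) loses its internal structure:
# the unimodal fit smears it into a thin tail rather than a distinct mode.
real_sub = real[real[:, 1] > 10]
synth_sub = synthetic[synthetic[:, 1] > 10]
print("real subgroup size:", len(real_sub), " synthetic:", len(synth_sub))
```

Running this, the synthetic cohort matches the overall means, while the subgroup nearly vanishes, which is exactly the resolution problem Brandon describes for syndromes.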

[0:15:38.8] RS: But that synthetic data is perhaps hamstrung by the lack of existing human data. 

[0:15:43.0] BA: Yeah, well, I think it's hamstrung by both. So number one is really that, I think, we are still not there in terms of algorithms. You know, in other cases where you are able to generate synthetic data, I mean, obviously, DALL-E 2 is the big deal right now, look at how many training data points it took to do that and to build a model of that size, right?

You're never, well, I am not going to say never. Right now, you are not going to find a human dataset that is anywhere close to that size. I would say there are always three competing things that happen and inhibit certain problems, right? Size of dataset, amount of noise, and algorithm sophistication. You have to have two of these improve; well, the noise should decrease.

But the point is that in order to make an advancement, you either need an algorithm advancement and a big reduction in noise, or an algorithm advancement and an increase in the amount of data, right? So it always takes a combination of two to see an innovation, and right now, I don't see us getting a huge increase in the amount of data, and because it is historic, the noise is the noise. So really, I think we need to make advances both on the algorithmic side and on the dataset size side.

Again, that then bleeds into the concern about bias and kind of needing more data that’s more representative of the human population. 

[0:17:12.0] RS: What makes you say we’re not there algorithmically? 

[0:17:14.6] BA: Well, I mean, the current efforts, the latest publications on arXiv, are promising in their ability to produce synthetic data, but, you know, given even some of the larger datasets in the United States, they're still just not there. I mean, that's what I would say.

[0:17:31.4] RS: Just not sufficiently advanced, or not enough of them? What's the state of the deficiency?

[0:17:35.1] BA: I would say it is not sufficiently advanced. I think, like I said, transformers are starting to help. One of the big problems here is the heterogeneity of the human data, and also the temporal nature and the missingness, right? The heterogeneity of the data leads to noise, the missingness is a source of noise, right? Then there's the temporality of the data; I think we are getting better at representations.

Most of the representations that are applied, at least to longitudinal human data, are largely derived from signal processing, where there's temporal data, or from NLP. There are tweaks on those, but it is its own unique dataset where we have these different types of data. There are diagnostic codes, there are labs, there are drugs, there are different types of interventions, and those are spread out temporally over time.

So, you know, if I think about whether there is a similar dataset or dataset problem out there, with similar aspects, that isn't encumbered by privacy, or where you can get larger datasets: if we could find that dataset, we might be able to develop better representations, but none come to mind. So we are often borrowing from these other datasets and problems that have very similar properties but aren't quite there.

So we really need to continue to invest more, specifically in algorithms designed for longitudinal patient data.

[0:19:04.7] RS: Right. As you say, a couple of the challenges with regard to data here are the lack of longitudinal data, you run into privacy issues, and there are also plenty of historical social reasons why there's less data on certain groups. Let's maybe sail past some of those problems and just indulge in a world where we have much more representative data collection. Once that's the case, you are still a far cry from developing a bias-free solution, right? What are some of your approaches to ensuring fairness and rooting out bias in datasets?

[0:19:34.6] BA: Yeah, so the answer to that is ensuring that your data science teams are as diverse as possible. You know, we suffer right now from a lack of diversity of data but then on the other side, we also suffer as a field, from lack of diversity in our scientists, in our data scientists, right? So I think one of the things that we try and do is focus on DEI when we are thinking about hiring and ensuring that the teams that are working on these datasets are diverse. 

I constantly think about diversity, I am constantly thinking about bias, but even though I think about that a lot, I have missed things in datasets that colleagues with different life experiences opened my eyes to, things I never would have thought of, never would have seen, because I didn't have that other life experience. So I think in all cases when we're dealing with data, we need to have diverse opinions.

We need to have diversity in the data scientists who are working on these problems, and I think that is our biggest weapon. I mean, there are more traditional ways of re-weighting training sets, looking at covariates, looking at counterfactuals, and trying to deal with it algorithmically, but really, that diversity in team members is going to pay off ten times more than data science tricks.

[0:20:56.4] RS: Yeah, that's interesting. It's not just the solution in developing algorithms, either, having a diverse set of perspectives, right? As you say, having your eyes opened to life experiences you didn't have and would not have considered; even if you were really, really working as hard as possible to get outside of yourself, removing those biases is not something that would occur to you. But what do you mean, what is the value of re-weighting a training set?

[0:21:21.3] BA: Yeah, so re-weighting a training set is what you want to be able to do, right? So again, the fundamental statistical basis, or assumption, of machine learning is that your training set and your test set are pulled from the same underlying distribution, and we know that that's almost never the case, because your training set, by fiat, is historical, and history is not always representative of the future, right?

So in cases of human data, and even in cases of financial data, the future is always going to look different from the past, and so you are always going to have to continue to kind of re-weight based on the future. In the case of healthcare, what we find is the vast majority of the pristine data, not surprisingly, comes from areas with a richer, wealthier population, which tends to be less diverse.

So what we then often, unfortunately, need to do is go back and try to re-weight the datasets to be more representative, but there is only so much you can do there in terms of controlling it. You might change the dataset you have to make it more ethnically representative by re-weighting the different ethnic backgrounds, but the problem is, there are correlates: a poorer population might have a different ethnic-group diversity.

But that's also associated with poor nutrition, for example, and you can't make up for that poor nutrition; you don't get that when you re-weight that richer patient population. You might get a change in the ethnic weighting, which might correlate with genetic factors, which is great, but it is not going to correlate with socioeconomic factors, and this is why re-weighting is, at best, a partial solution.
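A minimal sketch of the re-weighting Brandon describes (illustrative only; the group labels, prevalences, and "nutrition" covariate are made up, not drawn from any real dataset): weight each training example by target-population prevalence divided by sample prevalence. The weighted set matches the target group mix exactly, but a correlated covariate the weights don't see, like nutrition, barely moves, which is his point about re-weighting being partial.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up training sample: group A over-represented (90%), group B under (10%).
n = 10_000
group = rng.choice(["A", "B"], size=n, p=[0.9, 0.1])

# A covariate baked into the sample: in this (wealthier) sampled population,
# nutrition is high for BOTH groups, so group-based weights can't correct it.
nutrition = np.where(group == "A", rng.normal(0.8, 0.1, n), rng.normal(0.75, 0.1, n))

# Target population: 60% A / 40% B. Importance weight = target share / sample share.
target = {"A": 0.6, "B": 0.4}
sample = {g: np.mean(group == g) for g in target}
weights = np.array([target[g] / sample[g] for g in group])

# Weighted group shares now match the target exactly...
weighted_share_B = weights[group == "B"].sum() / weights.sum()
print("weighted share of B:", round(weighted_share_B, 3))

# ...but the weighted nutrition mean shifts only slightly: the socioeconomic
# correlate is baked into who was sampled, not just into the group labels.
print("unweighted nutrition mean:", round(nutrition.mean(), 3))
print("weighted nutrition mean:  ", round(np.average(nutrition, weights=weights), 3))
```

The weighted ethnic mix is fixed by construction, yet the nutrition profile stays close to the rich-population sample, mirroring Brandon's genetic-versus-socioeconomic distinction.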

[0:23:14.7] RS: One tool in a varied and long tool belt, I suppose, right?

[0:23:17.7] BA: Yeah. I mean, honestly, this is part of why I have also started those non-profits. Part of starting some of these non-profits is to help governments and government organizations think about data, think about these problems, and play a role in helping us find those datasets, helping us get those datasets, because it is going to take a larger, broader, government-based effort in order for us to get into many of those communities.

And convince them that giving their data is ultimately going to benefit them, right? So that's, again, why I've got kind of my commercial side, and I am also very active in a number of non-profits, as well as starting a few.

[0:23:59.7] RS: Well, Brandon, we are creeping up on optimal podcast length here, but before I let you go, I wanted to ask you to consider, perhaps in healthcare or drug development, but really anything you like: what is going on in the space of AI and machine learning that has you truly excited and inspired to continue working in the space?

[0:24:17.0] BA: I would say, yeah, a couple of things. Again, I think I mentioned it before: I think recently we have punched through. For a while there, a lot of people were predicting that deep learning had kind of reached an asymptote in its ability to make predictions, to improve on existing standards. I think more recently, with some of these large language models and large image-generation models, like I mentioned before, DALL-E 2, and the innovation of transformers.

I think we have seen, you know, there was kind of a lull, there was a bit of an asymptote, and then we have seen them increase in performance. That is exciting to me. Now, unfortunately, much of that is not applicable to where we're working, which is in healthcare, which, like we've said before, is much more of a low-data environment, and so there, I would point to one-shot, zero-shot type learning.

Where you might have a lot of data in one area but no data or very little data in another, that's also been exciting: our ability to do transfer learning and, like I said, zero-shot and one-shot type learning. On top of that, I think the work around generative modeling, so GANs and others, in order to facilitate an active learning loop, which we are leveraging, for example, in our Logica pre-clinical work, has also been very exciting, and I think very promising in its ability to push us forward, especially in these kinds of lower-data environments.
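A bare-bones sketch of the active-learning loop Brandon mentions (generic uncertainty sampling, not the Opal or Logica implementation; the `oracle` function standing in for a wet-lab experiment, the logistic model, and all parameters are illustrative assumptions): train on the labeled pool, commission "experiments" on the candidates the model is least certain about, and retrain.

```python
import numpy as np

rng = np.random.default_rng(2)

def oracle(x):
    """Stand-in for a wet-lab experiment: the true label for a candidate."""
    return (x @ np.array([1.5, -2.0]) > 0).astype(float)

def fit_logistic(X, y, steps=500, lr=0.5):
    """Tiny logistic regression by gradient descent (no intercept, for brevity)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Candidate pool ("compounds") and a small initial labeled set.
pool = rng.normal(size=(500, 2))
labeled_idx = list(rng.choice(len(pool), size=10, replace=False))

for _ in range(5):
    w = fit_logistic(pool[labeled_idx], oracle(pool[labeled_idx]))
    probs = 1 / (1 + np.exp(-pool @ w))
    # Uncertainty sampling: query the candidates with probability closest to 0.5.
    uncertainty = -np.abs(probs - 0.5)
    uncertainty[labeled_idx] = -np.inf          # never re-query labeled points
    batch = np.argsort(uncertainty)[-5:]        # 5 most uncertain candidates
    labeled_idx.extend(batch.tolist())          # "run the experiments"

# Final model, trained on only 35 labels, evaluated across the whole pool.
final_w = fit_logistic(pool[labeled_idx], oracle(pool[labeled_idx]))
acc = np.mean((1 / (1 + np.exp(-pool @ final_w)) > 0.5) == oracle(pool))
print("labels used:", len(labeled_idx), " pool accuracy:", round(acc, 3))
```

The design choice here matches the low-data setting he describes: each loop iteration spends the scarce "experiment" budget where the model is most uncertain, rather than labeling the pool at random.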

[0:25:51.3] RS: That's fantastic. Brandon, thank you so much for sharing all of that. I loved learning from you today. This has been a fantastic conversation.

[0:25:57.2] BA: Yeah, thanks Rob. Thanks for the opportunity. 

[END OF INTERVIEW]

[0:26:01.4] RS: How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video and sensor data annotation and validation for machine learning algorithms and industries, such as transportation, retail, ecommerce, media, medtech, robotics and agriculture. For more information, head to sama.com. 

[END]