How AI Happens

Dremio Tech Evangelist Andrew Madson

Episode Summary

Tech evangelism is a relatively new term, so Andrew graciously explains what a tech evangelist is, how he became one, how he embodies this role at Dremio, and why he values this designation among modern tech professionals. Our guest then shares the pain points that inspired his choice to join Dremio, what Dremio as an organization is all about, how the company uses generative AI, and the common obstacles that Andrew is faced with at work.

Episode Notes

Andrew shares how generative AI is used by academic institutions, why employers and educators need to curb their fear of AI, what we need to consider for using AI responsibly, and the ins and outs of Andrew’s podcast, Insights x Design. 

Quotes:

“Once I learned about lakehouses and Apache Iceberg and how you can just do all of your work on top of the data lake itself, it really made my life a lot easier with doing real-time analytics.” — @insightsxdesign [0:04:24]

“Data analysts have always been expected to be technical, but now, given the rise of the amount of data that we’re dealing with and the limitations of data engineering teams and their capacity, data analysts are expected to do a lot more data engineering.” — @insightsxdesign [0:07:49]

“Keeping it simple and short is ideal when dealing with AI.” — @insightsxdesign [0:12:58]

“The purpose of higher education isn’t to get a piece of paper, it’s to learn something and to gain new skills.” — @insightsxdesign [0:17:35]

Links Mentioned in Today’s Episode:

Andrew Madson

Andrew Madson on LinkedIn

Andrew Madson on X

Andrew Madson on Instagram

Dremio 

Insights x Design

Apache Iceberg

ChatGPT

Perplexity AI

Gemini

Anaconda 

Peter Wang on LinkedIn

How AI Happens

Sama

Episode Transcription

Andrew Madson  0:00  

Depending on which AI you're using, how are you testing for bias? If your employees are just, you know, generating all their ideas or their marketing campaigns out of generative AI, how are you testing for bias? Tracking the provenance of your training data is going to become more and more important, especially as companies start creating their own closed LLMs.

 

Rob Stevenson  0:24  

Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists, and machine learning engineers as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we're about to learn how AI happens. All right, hello again, all of you wonderful machine learning engineers, data scientists, AI-building hooligans out there in podcast land. It's me, Rob, here back with another installment of How AI Happens, and I have a wonderful guest for you. He has a ton of experience in the space. Currently, his focus is on being the tech evangelist over at Dremio. Andrew Madson is here with me today. Andrew, welcome to the show. How are you today?

 

Andrew Madson  1:15  

Hi, Rob. I'm great. Thank you. Thank you for having me.

 

Rob Stevenson  1:17  

I am really pleased to have you. And "tech evangelist," I feel like there's so much more to your role; it's summed up very neatly in two words. But I was hoping you might explain a little bit what that means and what your role is, exactly, over at Dremio.

 

Andrew Madson  1:31  

Yeah, of course. Tech evangelism is a growing role, and it means different things to different companies. It's an offshoot of developer relations, which historically is really technical: hands-on documentation, you know, running hackathons. And evangelism is a part of that, but it's really understanding the use cases of individual developers, finding out how the product really works best for them, and taking that information back to the product team. And then when new features roll out, figuring out the best use cases for those features, and showing them, demoing them, and speaking about them with technical audiences. So instead of the C-suite, I speak with people who are hands on keyboard.

 

Rob Stevenson  2:14  

Got it. Okay, a million follow-up questions on that, obviously. But first, I would love to just clear up what Dremio is. What is the tech you're evangelizing?

 

Andrew Madson  2:22  

Dremio is a data lakehouse platform. Historically, with data lakes, it was really hard to access the data; it was raw, hard to use. So we piped it into data warehouses like Snowflake or Redshift and then did our analytics and use cases based on the data warehouse. But what that does is create costly copies, and there can be delays between the raw data and the cleaned data warehouse. What a data lakehouse does, using a technology like Apache Iceberg, is let you do those analytics right on top of the data lake. So then you're using the live data; it's not a copy. So it happens to be cheaper and faster, and that's the goal.

 

Rob Stevenson  3:06  

And since it's not a copy, the changes you're making are sort of happening live. You can sort of visualize, as opposed to make the changes, push to production, see what breaks, start over. It's a more fluid experience, I imagine.

 

Andrew Madson  3:24  

Yeah, absolutely. And one of the benefits of doing it through a lakehouse is that you can do it similar to Git, which we use within software development. You can do a Git-like process, where you branch the code and do all your changes on a branch, and once you're happy with it, merge it back into production without breaking anything. So you can do all your experimentation on a branch that's not going to impact production.
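The branch-and-merge workflow Andrew describes can be pictured with a toy model. Everything below, including the `Catalog` class and its methods, is an illustrative sketch of the idea, not Dremio's actual API:

```python
# Toy sketch of Git-like branching over table metadata: experiment on a
# branch, leave production ("main") untouched, then merge when happy.
import copy

class Catalog:
    """Tracks table state per branch; 'main' stands in for production."""
    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # A branch starts as a copy of the source's metadata, not a
        # physical copy of the underlying data files.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def write(self, branch, table, rows):
        self.branches[branch].setdefault(table, []).extend(rows)

    def merge(self, source, target="main"):
        # Publish the experiment: target now sees the source branch's state.
        self.branches[target] = copy.deepcopy(self.branches[source])

cat = Catalog()
cat.write("main", "sales", [{"id": 1, "amount": 100}])

cat.create_branch("experiment")
cat.write("experiment", "sales", [{"id": 2, "amount": 250}])

# Production is untouched while the experiment is in flight.
assert len(cat.branches["main"]["sales"]) == 1

cat.merge("experiment")
assert len(cat.branches["main"]["sales"]) == 2
```

The point of the pattern is the last two assertions: in-flight work never changes what production queries see until the merge.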

 

Rob Stevenson  3:48  

Was this an experience you had, a pain point you felt before joining Dremio? Because these are my favorite kinds of people to speak with: the people who really intimately felt the pain of a problem that their company is now solving.

 

Andrew Madson  3:59  

Huge, yeah. I've always worked with data warehouses, and they're great, and there's a place for them. In fact, lakehouses and warehouses can play nicely together. But I always found that the more cooks you have in the kitchen, with moving data downstream and cleaning it and deciding the names of columns and the semantic layers on top of it, analysts and data scientists are at the very end of that process. So sometimes things would change in the source system or in the upstream data, which would then impact models downstream, but the data analysts and scientists didn't know until something broke. Then they had to try to figure out what happened upstream. Maybe the data didn't refresh. There are lots of things that can happen in a warehouse, the longer you stretch out that ETL process. And so once I learned about lakehouses and Apache Iceberg and how you can just do all of your work on top of the data lake itself, it really made my life a lot easier with doing real-time analytics, and especially with AI and data science use cases, when you really need access to that data a lot faster. Even hourly refreshes, depending on your use case, may not be enough.

 

Rob Stevenson  5:11  

Gotcha. So how did you come to be in this role? I would love to know the Andrew origin story, if you will.

 

Andrew Madson  5:16  

I teach a lot of data science at a few different universities, and part of my teaching process is creating content on LinkedIn. It helps me crystallize my thoughts: when you try to take a complex idea, simplify it into 200 words, and then create a simple graphic around it, it really helps you understand the concept a lot better. So I've been doing that for a little while. And it turns out that that's a role at tech companies called evangelism, where they really need folks who are able to take some of these ideas and make them more accessible to technical audiences. So we were able to align our interests, mine in content creation and teaching, and Dremio's in being able to reach technical audiences. It was a happy coincidence that we were able to align our objectives.

 

Rob Stevenson  6:06  

Yeah, it's interesting, because I feel as though the technical folks at any given company are kind of just expected to know it. They're expected to have the background, to understand exactly the extent of what's going on. It's not that that's not the case, but it sounds like this is somewhere in between career development and upskilling, and just, hey, we have a fantastically complicated product, we need you all to be up to date on it constantly. Is that sort of where you live, in the middle of those two things?

 

Andrew Madson  6:32  

Exactly. And often technical audiences really just want to stay hands on keyboard, do their best work, and upskill. And then the folks who generally create content, like marketing teams, generally aren't technical. So you need to have both skill sets: being able to create content and make it visually pleasing and relatable, but also having the technical expertise so that you know what you're creating is accurate and you can answer technical questions from your target audience.

 

Rob Stevenson  7:06  

This is funny, because this morning, right before you and I recorded, I met with a woman who recently stepped down as the Chief People Officer at a huge PR company. She was telling me that she was trying to find vendors who could help upskill the technical talent in her company, and she was completely coming up blank. She needed data science and machine learning upskilling options for her workforce, and it didn't exist. And so what she ended up doing was finding out that basically Microsoft was doing the best job at it. So she was speaking with Microsoft, being like, hey, can we pay you for this thing you do internally for your employees? So it feels like being able to hire the right folks is now maybe not enough. That's a huge indicator for me, that this Fortune 500 Chief People Officer is trying to pay Microsoft to upskill her workforce, to find this thing that doesn't exist. But that sounds like what your role is?

 

Andrew Madson  7:56  

Yeah, I agree. It's growing. I think technical skills are required of more and more roles. Product managers at a lot of companies are expected to know SQL and understand their technical products intimately. Of course, data analysts have always been expected to be technical, but now, given the rise of the amount of data that we're dealing with and the limitations of data engineering teams and their capacity, data analysts are expected to do a lot more data engineering. Data scientists have the same bottleneck, where data engineering teams are incredibly busy, but data science teams, especially with the rise of AI, need to get access to data quickly. They need to have it in a certain format, then they need to build pipelines and CI/CD processes around it, and a lot of them are needing to do that themselves. So even technical roles are needing to upskill into new branches of technical expertise, and then non-technical roles are having to learn more and more of these analytical skills.

 

Rob Stevenson  8:54  

You know, it's funny, I never thought about it this way, Andrew, until you and I were going to speak, which is: how much YouTube watching can you require of your staff? Right? Because I think that with upskilling, in every function you need to get better, you need to find ways to improve in your role. And maybe your company is cool enough where they understand that you watching a YouTuber or taking an e-course on company time is worthwhile, but rarely is that person on staff, right? But it makes all the sense in the world to me that they would be. So this feels like an important trend, not merely an AI-specific or a data science-specific trend, but a career-specific thing. Like, as you said, this role is increasingly more important and popular. So I'm curious for you, Andrew, what are the kinds of questions that folks come to you with? What are you hearing a lot?

 

Andrew Madson  9:40  

Right now, Dremio is really focused a ton on data engineering teams, helping them upskill and implement Apache Iceberg, because Apache Iceberg, being a columnar table format, makes analytics on a data lake possible. So data engineering teams are the first responders to implementing Apache Iceberg at their companies, and my counterpart, Alex Merced, is focusing on that. But they brought me in to work on data analytics and data science, having those use cases in my past and understanding what data analysts and data scientists and AI researchers are looking for and deal with. Those are the types of problems that I'm handling. So the questions that are coming to me are: one, why should analytics teams care about the speed of their queries? Like, what does it matter to the analyst if a query takes 10 minutes versus 0.1 seconds? And so addressing that: how can they do easy real-time analytics? On the AI side, how can they have a RAG system that takes advantage of a data lake and is able to speed up the training process? On the data science side, a similar question with, like, fraud detection: how can you have a faster training process and faster access to real-time data without implementing a lot of costly procedures? And how can you do this? Dremio is certainly a fantastic way; there are other ways to do it. So it's understanding those use cases, building out demos, and showing folks case studies of the benefit of using Iceberg and lakehouses in general.
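The RAG idea mentioned here, retrieving relevant records before prompting a model, can be sketched minimally. The word-overlap scorer below stands in for a real embedding model, and the corpus is invented for illustration:

```python
# Minimal retrieval-augmented generation (RAG) sketch: score each record
# against the question, keep the top-k, and paste them into the prompt.
def score(query, doc):
    # Crude relevance: fraction of query words that appear in the doc.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve(query, corpus, k=2):
    return sorted(corpus, key=lambda doc: score(query, doc), reverse=True)[:k]

corpus = [
    "fraud alerts spiked in the payments table last hour",
    "the marketing team shipped a new landing page",
    "real-time payments data lands in the lake as Iceberg tables",
]
context = retrieve("real-time payments fraud", corpus)
prompt = "Answer using this context:\n" + "\n".join(context) + "\n\nQ: ..."
```

In a real system the scorer would be an embedding model and the corpus would be rows pulled from the lake, but the shape of the pipeline (retrieve, then prompt) is the same.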

 

Rob Stevenson  11:20  

Gotcha. Now, Andrew, because I'm a content creator in the AI space, I'm legally required to ask you about generative. If I don't ask you, the content-creating police will kick in my door and cancel my podcast. So we must speak about it. I guess we'll just start with, where is it layered in at Dremio? Is it as disruptive inside Dremio as you're seeing other places?

 

Andrew Madson  11:41  

Yeah, we use it a lot. Within the tool itself, we have generative AI which takes text and turns it into SQL queries. So it shifts analytics left, up to business users, and helps the folks who don't have SQL expertise get access to the information that they need, super easy. But in terms of our teams using it, we all use it. We use generative AI for brainstorming; we use it for thinking through new ideas and case studies. It's a very tech-forward company, embracing new technology. But then we're also looking at the use cases of end users: how can Dremio and lakehouses in general supercharge AI and generative AI? AI requires an immense amount of data in general, and a lot of that needs to be real time, especially if you're talking about generative AI in sensitive industries like healthcare and financial services, where the most up-to-date information is incredibly important. But then there's also data privacy: making sure that it's only accessing the right information, making sure that it's not pulling things that should be masked or encrypted and providing that information out. And so we're thinking about ways that we can make that easy for end users. Like I said, the longer you stretch out an ETL process, the more cooks there are in the kitchen, and things like data privacy become much more complicated. Because if you need to delete data, or there was a data breach, now you've got more systems, more people looking at it, more areas where that could leak through. So keeping it simple and short is ideal when dealing with AI.
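The text-to-SQL flow Andrew describes boils down to prompt construction plus a model call. This is a generic sketch of that pattern, not Dremio's actual implementation; `call_llm` is a hypothetical stand-in for whatever LLM client you use:

```python
# Text-to-SQL pattern: give the model the table schema and the user's
# question, and ask it to return SQL only.
def build_sql_prompt(schema: dict, question: str) -> str:
    tables = "\n".join(
        f"CREATE TABLE {name} ({', '.join(cols)});"
        for name, cols in schema.items()
    )
    return (
        "You are a SQL generator. Given this schema:\n"
        f"{tables}\n"
        f"Write one SQL query answering: {question}\n"
        "Return only SQL, no explanation."
    )

def call_llm(prompt: str) -> str:
    # Hypothetical: swap in a real LLM client here.
    raise NotImplementedError

schema = {"orders": ["id INT", "amount DECIMAL", "placed_at TIMESTAMP"]}
prompt = build_sql_prompt(schema, "What was revenue yesterday?")
```

Grounding the prompt in the actual schema is what lets a business user ask in plain English and still get SQL that references real tables and columns.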

 

Rob Stevenson  13:24  

Yeah, definitely. So that is happening behind the scenes for the users, right? Like, these queries are being shortened, they're being made more lean. Is that correct? Yes, absolutely. So is it cutting out irrelevant syntax? Or is it NLP, and it's cutting off, can I have the blank? Like, how is it shortening it?

 

Andrew Madson  13:45  

So there are a few ways. First of all, if you're doing your AI in a data warehouse versus a data lakehouse, just doing it on the lakehouse immediately shortens out a huge chunk of that process. You've pulled out a lot of that ETL, you've pulled out a lot of the vendors that could be accessing the data; you're doing it just on your lake. So that shortens that right there. In terms of the views, or the data that the queries or the AI are accessing, one way that Dremio does this is it creates a cache that we call reflections. It's commonly accessed data, or you can manually tell it which data to cache, and it will refresh. But then the query doesn't need to go and access the raw data; it just needs to access the cache, and that gives sub-second queries. So think of it like a materialized view without all the overhead of having to create a physical copy of the data. It's just a virtualized cache of the data, so now the AI or the query can access the cache super fast.
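The access pattern behind reflections, serving repeated queries from a periodically refreshed cache instead of re-scanning raw data, can be mimicked with a toy result cache. This is only an analogy for the pattern, not how Dremio actually implements reflections:

```python
# Toy query-result cache: the first request pays the cost of scanning the
# lake; repeats within the refresh window are served from the cache.
import time

class ResultCache:
    def __init__(self, refresh_seconds=3600):
        self.refresh_seconds = refresh_seconds
        self._store = {}  # query -> (result, fetched_at)

    def get(self, query, run_query):
        hit = self._store.get(query)
        if hit and time.time() - hit[1] < self.refresh_seconds:
            return hit[0]             # fast path: no raw-data scan
        result = run_query(query)     # slow path: hit the lake, then cache
        self._store[query] = (result, time.time())
        return result

calls = []
def run_query(q):
    calls.append(q)                   # count how often raw data is scanned
    return [("2024-01-01", 42)]

cache = ResultCache()
cache.get("SELECT day, total FROM sales", run_query)
cache.get("SELECT day, total FROM sales", run_query)  # cache hit
```

After the two calls, `calls` holds a single entry: the raw data was scanned once, and the repeat was answered from the cache.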

 

Rob Stevenson  14:47  

Okay, gotcha. So there's this unique happenstance, Andrew, in this industry, where folks kind of have one foot in academia, one foot in the private sector, and there are lots of reasons for that; we've kind of gotten into that in previous episodes. That's a reality for you. You kind of live in both worlds; your grass is always green, right, on both sides of the fence. And because of that, I wanted to ask you a little bit about what you're seeing in academia specifically, like how generative is kind of taking hold. And, you know, we've all heard the, oh, kids are using generative to write their essays. That's kind of what I mean: I want to know more about the institutions, the colleges, the schools themselves. What are you seeing in terms of how they're going to be using generative? Is it comparable to what you're seeing in the private sector?

 

Andrew Madson  15:26  

I think it is comparable. I think that there were different concerns within academia when generative AI first rolled out. I mean, generative AI has been around for quite a while, but when ChatGPT initially made it very accessible and popular, I think a lot of universities had a knee-jerk reaction and said, oh, students are immediately going to cheat. And that was their main concern. My take on it is: cheaters are going to cheat, and really, it doesn't matter what kind of processes you have in place. If there's a tool there and somebody wants to cheat, they're going to cheat. And those who don't want to cheat aren't going to cheat. It's not the tool that's the problem; it's cheaters who are the problem. It's human beings, regardless of what the tool is. But the positive impacts really have been that ChatGPT can really help with brainstorming, and it can help clarify a lot of things. Not just ChatGPT, all the generative AI models: they can help with creating practice problems, offer instant feedback, and help with research. There are definitely limitations there, but it's gotten much better. And it can even create better accessibility. You can have educational content in multiple languages, formats, modalities, which is a concern for a lot of folks going into universities; maybe English is their second language. Generative AI can really help make those more accessible. Maybe it can create a visual for a concept that's only written down in the textbook.
On the teaching side, it can help streamline administrative tasks. Many universities are technologically nascent; they don't have a lot of high-tech processes in place, and there's still much more manual work than you would think in higher education. But generative AI can help professors create rubrics. It can help, not with grading, but with creating lesson plans. It can help crystallize ideas and say them more clearly, or similarly help create visuals. Generative AI, for both the learner and the teacher, is really just like an assistant. It should never be in the driver's seat. It shouldn't be telling you what to say. It shouldn't be doing your work for you. It should be your assistant; you should be monitoring it and checking its quality, just like you would with any kind of assistant. But with all those benefits comes the academic integrity question. You know, the purpose of higher education isn't to get a piece of paper, it's to learn something and to gain new skills. With someone using generative AI, is there a case to say, well, if they got the work done, do they technically have the skill? Maybe. But do they really understand it? Maybe not. And what was the purpose of the course? We've also seen bias and misinformation within generative AI, so that can be an issue as well. It's not like you're going to a scholarly source and reading that information from, you know, a peer-reviewed journal. It's more like Wikipedia was in its early days, where kind of anybody could throw anything on there, and maybe it was right, maybe it wasn't. That's kind of the way some of the generative AI is viewed now: could be helpful, could be wrong.

 

Rob Stevenson  18:42  

Yeah, and it's the same skill set, too. With, okay, here's my generative answer: where did it come from? Right? Like, making a bibliography, citing your source, that is a little nebulous right now in some generative tools. It's like, where is this coming from? So that feels like an important feature for these generative tools to put in. Like right now, Google Search is putting generative at the top, but wherever it's pulling from, there's a link to that. So you know whether, okay, this is a research paper I'm looking at. It's usually not; often it's just whatever the top-ranking blog post was. So it's not that much different than Google Search, just a little more streamlined. But surely that's a feature we'll see, right? When, in generative, it's, okay, here's where this is coming from. Which is, what we're talking about are attention mechanisms. This exists in AI, this need to know where a judgment or where material comes from. That exists; we just haven't necessarily seen it so much in user-facing generative.

 

Andrew Madson  19:39  

Agreed, yeah. Perplexity AI does a really great job of providing links and videos and sources, but that's really their focus; they want to be the AI for search. But as you note, with some of the other AIs, it may provide a link, but the link may be fake, right? Citing its sources isn't a strength of all of the AI tools.

 

Rob Stevenson  19:59  

The link may be fake, as in the place it's coming from is not reputable, or the URL is just bogus?

 

Andrew Madson  20:06  

Yeah, in the early days of ChatGPT 3.5, it would just make up URLs that looked real. And then you'd click on it, and it wasn't a real URL, or it was a real URL and it just went to a place that had absolutely nothing to do with the answer it provided. Scandalous.

 

Rob Stevenson  20:23  

It's like that scene at the end of Boiler Room where they go into the office and there's a bunch of phones on the floor. Like, there's nothing here that's real.

 

Andrew Madson  20:31  

Well, that's one thing I saw, you know, maybe a year ago, when I saw a lot of generative AI content flowing through my courses. One really easy way to tell if a student used generative AI, and whether or not they did didn't super matter to me, was to look at their sources. I would just look up: is this a real source? Because they looked real, and they were in APA format, and they just weren't real. Like, it would list real researchers and real professors, but a fake journal. So it looked super real, but if you checked the sources, they weren't real. That's scary, because that's going to fool a lot of people, I'm sure, because you have to take that extra step of verifying the information.
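The verification step Andrew describes can be partly scripted: pull every DOI and URL out of a reference list so each one can be resolved and checked, by hand or with an HTTP request. The sample references below are invented for illustration:

```python
# Extract citation targets (DOIs and URLs) from reference strings so each
# can be verified; a fabricated journal won't have a resolvable DOI.
import re

DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")
URL_RE = re.compile(r"https?://\S+")

def extract_citation_targets(references):
    targets = []
    for ref in references:
        targets.extend(DOI_RE.findall(ref))
        targets.extend(URL_RE.findall(ref))
    return targets

refs = [
    "Smith, J. (2021). Lakehouse analytics. J. Data Eng. doi 10.1234/jde.2021.55",
    "Lee, A. (2020). On caching. https://example.org/caching-paper",
]
targets = extract_citation_targets(refs)
```

This only finds the claims to check; the actual verification, resolving each DOI or URL and confirming the journal exists, still has to happen, which is exactly the extra step being discussed.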

 

Rob Stevenson  21:11  

Yeah, we're all going to need to learn to do that. Like, we're trusting headlines, and basically, oh, I can see New York Times in the URL, I'll trust this, right? And now it's not merely even enough to click it and read it, right? There's this extra step of verification. So will that be automated? Well, that becomes the skill academia has to teach, right? Which is all of what advanced education is, right? It's critical thinking. So it's like, okay, can we teach people to cite their sources? You can use generative, so long as you've done the work to understand it, to make sure that you verified it, right?

 

Andrew Madson  21:41  

Exactly. Like, if you can pull together a comprehensive paper that's accurate and thoughtful, regardless of what sources you used, as long as you're not plagiarizing, good for you. You know, maybe you took the generative AI information as a brainstorm, and then you fixed whatever was wrong, or you clarified it and put it in your own voice. We see that with coding as well. People thought generative AI was like the miracle new coder, but again, it's like your assistant. I've seen a lot of coding projects go wrong because there were people who didn't super know what they were doing and thought that they could do the whole project using whichever generative AI tool. Then they got stuck, because it didn't really work, or didn't work right, and they didn't know how to debug the code. So it's like a junior assistant: if you know what you're doing, it can definitely make you so much faster, but it needs to augment you instead of replace you.

 

Rob Stevenson  22:30  

Yeah, of course. This extra step of critical thought and verification makes sense for academia to put into place, because that's such a paragon of learning, you know, and just what it means to be educated. But I fear that private companies are less incentivized to encourage that from their employees.

 

Andrew Madson  22:50  

I don't think so. I think that employees are responsible for their output, regardless of how they generated it. So if you're an employee and you're using generative AI to make yourself faster, to be smarter about things, but you're outputting poor quality because you're not verifying the information, then you're still on the hook. And additionally, I think companies are much more sensitive to putting proprietary or sensitive information into AI tools. So there's a different limitation that companies are considering when they're implementing generative AI, because they are more on the hook for proprietary information. Whereas a student is acting as their own agent, putting in whatever information they want to; if they put in their own sensitive information, that's really on them. But if you're an employee and you're putting in proprietary, sensitive information, that reflects back both on you and on the company, and now it becomes a legal risk, right?

 

Rob Stevenson  23:56  

The stakes are a little higher in the private sector, right? Because you might get a slap on the wrist from your professor in academia, or a bad grade, but it's not going to go live to millions of people like it would in the private sector.

 

Andrew Madson  24:08  

That's right. You can find proprietary information within some of the ChatGPT output that the LLM has trained on, because people have uploaded it into the training data, and that was before you could have a corporate account on OpenAI, for instance, where they don't train on your data. But Gemini, I don't think Gemini Advanced has that option. I think you have to agree to let Gemini train on whatever you put in there. So if you're copying and pasting from corporate documents, say legal documents, and saying, hey, explain this legal document to me in Gemini, now Gemini has that information and it's going to train on it.

 

Rob Stevenson  24:47  

Yikes. I mean, of course people are doing that, right? Like, they're not thinking about where this is going once I hit Ctrl+V. Exactly. That's kind of scary, but it's also not just user beware, right? For these folks who might be liable to do that, that ought to be security training, right? In addition to, is this a phishing email, it's, are you being naughty with sensitive company data? Right? Like, that just feels like the world we live in now.

 

Andrew Madson  25:11  

Exactly. There are a lot of considerations for companies, and I hope they do it. You know, I'm a tech evangelist; I believe that folks should do it. But there's also an ethical and smart way to do it. Depending on which AI you're using, how are you testing for bias? If your employees are just, you know, generating all their ideas or their marketing campaigns out of generative AI, how are you testing for bias? For instance, as you noted, explainability is incredibly important. One thing that Anaconda is doing right now is launching, or they have launched, an AI platform that helps you track the training data for your LLM. Because what happens if you've been training on billions of tokens, but some bad actor injected something that shouldn't be in there, injected information that should not be in your training data? Now it's in there, and they can hold you for ransom until you pay them to tell you what they put in there so you can pull it out. We've seen that. So tracking the provenance of your training data is going to become more and more important, especially as companies start creating their own closed LLMs.
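One generic way to make training-data provenance checkable, in the spirit of the tooling described here (this is a sketch of the general idea, not Anaconda's product): hash every record into an append-only chain, so any injected or altered record changes the chain head and the tampering becomes detectable.

```python
# Hash-chain a list of training records: the final digest commits to every
# record and its order, like a minimal ledger of training-data provenance.
import hashlib

def chain_records(records):
    head = "0" * 64  # genesis value
    for rec in records:
        head = hashlib.sha256((head + rec).encode()).hexdigest()
    return head

clean = ["doc-1: quarterly report", "doc-2: support transcripts"]
baseline = chain_records(clean)

# An injected record changes the head, so the dataset no longer matches
# the recorded baseline.
tampered = chain_records(clean + ["doc-3: poisoned sample"])
assert tampered != baseline
```

Storing the baseline digest alongside the model lets you later re-hash the dataset and prove whether the data the model trained on is exactly the data you think it is.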

 

Rob Stevenson  26:23  

Provenance of training data. I've never heard that before, but it makes all the sense in the world. It's not merely, okay, show your work, how did you generate this? It's, let's see how it came to be the way it is, right? Show me the maturity, show me how you thought this thing up. It's not enough for you to just tell me where the output came from. Like, I need to see the, as you said, provenance. I'm tickled by that, because I'd never heard it, but it's just such an apropos word for it.

 

Andrew Madson  26:50  

Yeah, I hadn't thought of it before, but I saw Peter Wang at a meeting; we got talking after his session, and he was talking about that importance and why Anaconda is really going that way. And it made so much sense to me, right? You need to ensure that you're being fair and you're reducing bias. You have to have that transparency and explainability, and reproducibility in research for a lot of companies. But then, you know, with GDPR, CCPA, and lots and lots of states launching their own privacy and consumer rights initiatives, you really are going to have to know what data is involved in the LLM.

 

Rob Stevenson  27:26  

Yep, of course. Well, Andrew, we are approaching optimal podcast length here, but before I let you go, I would be remiss if I didn't share a little bit about your show, the Insights x Design podcast. We're gonna link to it, of course, in the show notes, but I definitely think people should check it out if they want to hear more from you. Would you share with us a little bit, like, what's the logline of your show? And, you know, how's it been going for you lately?

 

Andrew Madson  27:46  

It's been going really well. We've had a huge reception. We started in January, and we just hit 85,000 subscribers on YouTube. I don't know how many on Spotify, probably like 20 people on Spotify, but on YouTube it's gone super well. We just speak with expert data practitioners across AI, ML, analytics, and then also technical UX design, and try to understand: what are the problems that folks are facing? What are the innovative solutions that companies have had? And then the personal stories of these practitioners: how did they get to where they are, and what lessons have they learned?

 

Rob Stevenson  28:21  

Gotcha. Again, it's called the Insights x Design podcast. Check it out on YouTube if you want to see Andrew's handsome face and some other graphics, I imagine, so there's a visual reason to watch it, or, you know, wherever podcasts are found. But Andrew, for now, I would just say this was a delight talking to you. Thanks so much for taking the time and for sharing with me your background and experience. I've really loved this conversation.

 

Andrew Madson  28:40  

Thank you, Rob. It was a pleasure.

 

Rob Stevenson  28:44  

How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video, and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, e-commerce, media, medtech, robotics, and agriculture. For more information, head to Sama.com.