How AI Happens

CarbonChain Head of Data & Machine Learning Archy de Berker

Episode Summary

Archy de Berker began his journey with machine learning in the context of academic neuroscience, interested in how machine learning can help us understand the brain. Today, he is the Head of Data and Machine Learning at CarbonChain, driven by his desire to apply machine learning to climate change solutions.

Episode Notes

In today’s episode, Archy de Berker, Head of Data and Machine Learning at CarbonChain, explains how he and his team calculate carbon footprints, some of the challenges that they face in this line of work, the most valuable use of machine learning in their business (and for climate change solutions as a whole), and some important lessons that he has learned throughout his diverse career so far!

Tweetables:

“We build automated carbon footprinting for the world’s most polluting industries. We’re really trying to help people who are buying things from carbon-intense industries figure out where they can get lower carbon versions of the same kind of products.” — @ArchydeB [0:02:14]

“A key challenge for carbon footprinting is that you need to be able to understand somebody’s business in order to tell them what the carbon footprint of their activities is.” — @ArchydeB [0:13:01]

“Probably the most valuable place for machine learning in our business is taking all this heterogeneous customer data from all these different systems and being able to map it onto a very rigid format that we can then retrieve information from our databases for.” — @ArchydeB [0:13:24]

Links Mentioned in Today’s Episode:

Archy de Berker on LinkedIn

CarbonChain

Episode Transcription

EPISODE 39

ADB: “Or it’s disguised as an AI problem for climate change, but it’s really a statement about data ingestion as a whole.”

[00:00:08] RS: Welcome to How AI Happens, a podcast where experts explain their work at the cutting edge of artificial intelligence. You'll hear from AI researchers, data scientists, and machine learning engineers, as they get technical about the most exciting developments in their field and the challenges they're facing along the way. I'm your host, Rob Stevenson, and we're about to learn how AI happens. 

[00:00:37] RS: Here with me today on How AI Happens is the Head of Data and Machine Learning over at CarbonChain and podcasting in from foggy London Town, Archy de Berker. Archy, welcome to the podcast. How are you today?

[00:00:48] ADB: Thanks very much. Delighted to be here. Keen to reestablish my relationship with the North American AI audience. I've been missing it since I moved back here from Canada about a year ago. Yeah. Excited to see where we go.

[00:01:00] RS: Glad to bring you back into the fold. I'm sure you've been sorely missed. There's so much I'd love to speak with you about. Before we get into the nitty-gritty here, would you mind sharing a little bit about your background and how you wound up in your current role at CarbonChain?

[00:01:13] ADB: Yeah, absolutely. So I did a PhD in computational neuroscience at a time when deep learning was really exploding, particularly in London. DeepMind was founded by people from UCL at the same time I was doing my PhD in neuroscience there. Demis Hassabis, the founder of DeepMind, was actually a neuroscientist in the same place that I was doing my PhD. So there were lots of people who were very excited about deep learning.

I first got interested in machine learning and deep learning from a kind of neuroscientific perspective, like can these techniques help us understand the brain. They're certainly very useful for analyzing brain data, but they also hold promise as kind of models of how intelligent things can work. So that can be quite a good approach for understanding the brain. So I sort of went down that rabbit hole, and then I realized very swiftly that this was actually more interesting to me than doing academic neuroscience. So I transitioned into industry. 

After working for an energy company for a while, I moved to Montreal to work for Element AI, a Canadian AI startup, extremely exciting. It raised a lot of money, really with the aim of bringing deep learning to the enterprise. I kind of cut my teeth there in machine learning and making products out of machine learning. So I did some time as an applied machine learning person. Then I moved into product management to kind of focus more on how do we get this stuff out of the lab and into the real world.

Then in mid-2020, I decided that, while I really enjoyed working on these kinds of enterprise problems, climate change really needed more focus from people like me. So then I spent a bit of time consulting for various climate tech companies, trying to figure out where we can apply machine learning effectively if we want to work on climate change. After doing some interesting stints in various organizations, including one led by Al Gore, I got into my current role, which is with this company called CarbonChain in London.

We build automated carbon footprinting for the world's most polluting industries, to really try to help people who are buying things from carbon-intense industries figure out where they can get lower carbon versions of the same kinds of products, because within some of these product categories, there's huge variation in terms of how carbon-intense these products are.

[00:03:12] RS: So that would be measuring the carbon footprint of, well, like freight ships. This is like the impact of logistics probably on the movement of products.

[00:03:20] ADB: Yeah, exactly. So there's a few different components there, and people think a lot about the way things are transported, because it's kind of an easy thing to visualize, and it's attracted a lot of attention in the media: you shouldn't be flying anywhere, be careful that you don't buy your bananas from the wrong place far away. And that does make a difference.

Actually, in the industries that we work in, things like mining, where you're digging stuff out of the ground and then you're melting it, they're so energy-intense that choosing somebody who's able to do that in an energy-efficient way makes way more difference than any choices you subsequently make in terms of how you move the thing around. It just kind of dwarfs everything else. Are you using solar power to do that extraction and melting process? Or are you using coal? That's the decision that really matters there. Not whether you put it on a slightly more efficient ship or a slightly less efficient ship.

[00:04:06] RS: Interesting. Do you think that will be a differentiator for consumers, where they might pay a little more knowing that a company was conscious of its carbon footprint?

[00:04:15] ADB: I think that's a really fascinating question, and I think we don't have the data to answer it conclusively yet. So I'd be looking for examples of where consumers do actually make that choice and what kind of premium people pay for it. They're a little bit hard to find. The airlines have had some success with having an offset button when people buy tickets. That's not quite the same thing as choosing a different product; you're just taking the option you wanted and then paying a little bit more.

I think some of the best evidence actually comes from Google Cloud. So as listeners of this podcast will be familiar, data centers play a big role in the modern software stack, and where you put that data center (or rather, which region you as a consumer choose to use on Google Cloud or AWS or Azure) makes a big difference to the carbon footprint of your computing activity.

Google recently rolled out a scheme in collaboration with a Danish startup called electricityMap to surface to users which were high- and which were low-carbon data centers. From what I've seen, that seems to have had quite a big effect. They've seen something like 50% of users who are exposed to this switching their choice towards a low-carbon option. I think that's encouraging data. Then again, you're not actually imposing an inconvenience on the consumer there, and that's where the rubber really meets the road: how much is someone willing to pay for this reduced carbon intensity? I think the jury's still out on that.

[00:05:30] RS: Yeah, yeah. Fair enough. I'm really interested in this tour you did, trying to understand the areas where machine learning can be applied to climate change solutions. Before we get to CarbonChain, would you mind sharing a little bit about that experience and kind of what you learned, sort of figuring out where the skill set can be applied?

[00:05:48] ADB: Yeah, for sure. So the two substantive projects I did as part of that tour were for a battery optimization company. These guys were taking other people's batteries. So somebody else had funded a battery, but then somebody needs to run that battery. That sounds kind of trivial, but it's absolutely not trivial, and it's a really chewy, interesting AI problem, because you're optimizing for a variety of things, and you're creating quite complicated forecasts.

Essentially, what the battery gives you is arbitrage over time. You can store something over time, so you can basically buy something when it's cheap and then sell it when it's expensive. That's what we're trying to do with the battery. Kind of fill it up when there's loads of energy on the grid and it's cheap. Then you hold it for a period of time, and then you sell it again, discharging the battery into the grid, hopefully at a time when the price is high. So in order to do that effectively, you've got to try and predict when the price is low and when the price is high.

Then there are various other optimization constraints in there, because you can only charge and discharge the battery so quickly, and there's a constraint on the battery's lifecycle in there. So this is an example of the kind of problem that is going to become really, really widespread in grids over the next 10 years, as we get more and more distributed energy. Because, historically, the way that these national grids have been run is very, very centralized. There's somebody in a control room somewhere, making a lot of decisions and flicking switches and being like, “You turn on this massive gas power plant, and you turn on this other massive gas power plant.” So it's not very adaptive, and it's not very dynamic.

But that's not really an option when half your power is coming from distributed sources around the country, and people's rooftops are solar and things like that. So that's a really deep and interesting place to apply AI and machine learning, and you kind of have to, because you're doing things at timescales and scales where you don't really have any other choice, and you need to create some kind of intelligent software.
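To make the dispatch problem Archy describes concrete, here is a minimal sketch of price-arbitrage battery scheduling posed as a linear program. Everything in it (the hourly prices, the battery parameters, and the scipy formulation itself) is an illustrative assumption, not anything the battery company actually runs; a production system would optimize against uncertain price forecasts and model degradation, but the shape of the problem (buy low, sell high, subject to power and capacity constraints) is the same.

```python
import numpy as np
from scipy.optimize import linprog

prices = np.array([20, 15, 10, 12, 30, 55, 60, 45.0])  # hypothetical hourly £/MWh forecast
T = len(prices)
capacity = 4.0     # MWh of storage (assumed)
max_power = 1.0    # MW charge/discharge limit per hour (assumed)
efficiency = 0.9   # losses applied on charging (assumed)

# Decision variables x = [charge_0..charge_{T-1}, discharge_0..discharge_{T-1}].
# Minimise charging cost minus discharge revenue.
c = np.concatenate([prices, -prices])

# State of charge after hour t must stay within [0, capacity]:
#   0 <= sum_{s<=t} (efficiency * charge_s - discharge_s) <= capacity
lower_tri = np.tril(np.ones((T, T)))
A_soc = np.hstack([efficiency * lower_tri, -lower_tri])
A_ub = np.vstack([A_soc, -A_soc])
b_ub = np.concatenate([np.full(T, capacity), np.zeros(T)])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, max_power)] * (2 * T))
charge, discharge = res.x[:T], res.x[T:]
print("charge:   ", charge.round(2))
print("discharge:", discharge.round(2))
print(f"profit: £{-res.fun:.2f}")
```

The cumulative-sum constraint encodes "you can only sell what you have already stored," which is exactly the arbitrage-over-time structure described above; a real operator would re-solve something like this on a rolling basis as forecasts update.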

[00:07:35] RS: That's fascinating. So you're viewing energy as a fluctuating asset, basically, in the same way you might view, like, petroleum, right? The idea being that you can time the market, essentially, and be like, “Okay, we will efficiently charge and discharge this battery based on when energy is more or less expensive.”

[00:07:55] ADB: Exactly. There are two additional nuances there. One is that it's quite easy to store petroleum, so the price doesn't fluctuate on short timescales all that much. It's still a volatile asset, but within the course of a day, say, the price doesn't fluctuate that much. It's really expensive to store electricity, so the price fluctuates absolutely loads within the course of a day.

The other thing is that petroleum has a constant carbon intensity. So if you care about carbon, what time of day you buy and sell your petroleum doesn't make any difference. That's really not true with electricity, because the electricity you get at midday in England has a reasonably high percentage of solar. The electricity you get at 6:00 PM in December has no solar, so that deficit is mostly made up by gas power.

So the timing of electricity basically dictates the carbon content, and so not only is there arbitrage in terms of money, but there's arbitrage in terms of carbon. The way the grid is going to have to work is people are going to use batteries to charge up when we have lots of clean energy and then fill the gaps when we don't have so much renewable generation going on, like when the wind dies down or when the sun goes down.
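As a back-of-the-envelope illustration of the carbon arbitrage Archy mentions, here is the arithmetic with made-up (but roughly magnitude-correct) grid-intensity numbers; real figures vary by country, season, and hour.

```python
# Illustrative grid carbon intensities; all numbers are assumptions.
midday_intensity = 120     # gCO2/kWh when the solar share is high (assumed)
evening_intensity = 350    # gCO2/kWh when gas fills the gap (assumed)
energy_shifted_kwh = 1000  # 1 MWh charged at midday, discharged at 6 PM

avoided_g = (evening_intensity - midday_intensity) * energy_shifted_kwh
print(f"~{avoided_g / 1e6:.2f} tonnes CO2 avoided per MWh shifted")
# -> ~0.23 tonnes CO2 avoided per MWh shifted
```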

[00:08:59] RS: Interesting. Okay. So you sort of answered my follow-up question (I love when that happens), which is timing when to charge and discharge your battery. I can see how that might save money, but you just explained how it also impacts the environment, right? Because at the times when it's in more demand, it's supplemented by nonrenewable sources.

[00:09:16] ADB: Exactly. So if you can do this at scale, then basically you can take this clean energy that you made earlier and then put it back into the grid and prevent somebody turning on a fossil fuel resource to fill that gap. Now, that's not cost-effective over really long timescales. So if the wind stops blowing for a week, which it sometimes does, then you still have problems, because it's not really, at the moment, financially sensible to store energy for a week. So people are looking at all kinds of other wild solutions for that, like melting salt, or lifting things up great distances when you've got loads of electricity and then letting them fall slowly when you don't have very much electricity.

[00:09:51] RS: Right, right. Correct me if this is an oversimplification. But when I asked what are the applications of machine learning for climate change, that's perhaps naive, because really, anytime something needs to be optimized, or there just needs to be an increase in efficiency, machine learning could be a solution. Is that an oversimplification? Or would you characterize it that way?

[00:10:13] ADB: No. I think that's a great way to think about it. So the way I think about it is: when things are happening really fast, you're probably going to need some kind of clever system. When things are really variable, you're going to need some kind of clever system. When I say variable, I mean variable over assets, for instance. So if you have 10 gas power plants, it probably doesn't make sense to use machine learning to operate those 10 gas power plants. You just hire 10 people and have them make a decision about when to turn those on and off. But if you have, like, 30,000 solar farms or 100,000 wind turbines, then obviously that doesn't scale at all. So, yeah, I think that's exactly the right way to think about it.

[00:10:52] RS: If something fits in a Google spreadsheet, then you can perhaps do it manually or with a quick little formula, right? It's at scale that these things tend to be broken. So in the same way, when you were saying it's less about the logistics, the carbon footprint or the pollution caused by a type of boat, and more about where the scale comes into play, with examples like mining. Am I barking up the right tree here?

[00:11:14] ADB: Yeah. So I also don't want to say that shipping isn't important, because shipping is really important. One of the problems with shipping is that countries don't take responsibility for it, because it's happening in international waters, so it's not subject to quite the same level of scrutiny and regulation that things that happen on land are. But specifically for our clients, because they tend to be moving things that are really carbon-intense, that's what dominates.

That's not true for other products. So if you're moving, like, an iPhone around, if you choose to fly it, that obviously makes a massive difference. So then that becomes really key again. In terms of the way we think about these different assets (when I say an asset, I mean a mine or a smelter or a refinery or a farm that's growing something), in most of the domains we work in, there are on the order of thousands of them. It's really important for the people we work with that we have a principled, transparent way of coming up with carbon numbers for those assets.

Usually, for each asset, we will try and get a number, not with a model. That breaks down when you get to ships, because there are something like 130,000 commercially active ships, so we have to use models to approximate those. But then coming back to these assets, these static places that are doing something like melting something or pressing something into a shape or growing wheat: with those, typically we have a static data source. Then machine learning's role is to help us map whatever data we get from clients onto those databases. So basically, help select which row in our database corresponds to this set of transactions we're seeing.

[00:12:56] RS: Okay, makes sense. When you say 130,000 ships, you mean like different models, like types of ship? 

[00:13:02] ADB: No. I mean, literally, there are like 130,000 vessels that do stuff commercially. Those are the ones you have to worry about.

[00:13:08] RS: I guess I would –

[00:13:09] ADB: [inaudible 00:13:09] pretty much. 

[00:13:10] RS: No, it doesn't. But if they were all together, it would be quite a few.

[00:13:13] ADB: It'd be quite an armada. Yeah. These are big ships, I should say. But I mean, most commercially active ships are big.
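For a sense of what "using models to approximate" ship emissions might look like, here is a heavily simplified bottom-up voyage sketch. The cubic fuel-speed relationship and the ~3.114 tCO2 per tonne emission factor for heavy fuel oil are standard in the literature (e.g., the IMO greenhouse gas studies), but every parameter and the function itself are illustrative assumptions, not CarbonChain's model.

```python
def voyage_co2_tonnes(distance_nm: float,
                      speed_knots: float,
                      design_speed_knots: float = 14.0,
                      design_fuel_tpd: float = 30.0,
                      ef_tco2_per_t_fuel: float = 3.114) -> float:
    """Rough voyage CO2 estimate from distance, speed, and a design fuel curve."""
    days_at_sea = distance_nm / (speed_knots * 24)
    # Propeller law: daily fuel burn scales roughly with the cube of speed.
    fuel_tpd = design_fuel_tpd * (speed_knots / design_speed_knots) ** 3
    return days_at_sea * fuel_tpd * ef_tco2_per_t_fuel

# e.g. a hypothetical ~6,000 nautical-mile voyage at 12 knots:
print(round(voyage_co2_tonnes(6000, 12), 1), "tonnes CO2")
```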

[00:13:21] RS: Could you explain a couple of the areas where you're deploying machine learning at CarbonChain specifically?

[00:13:27] ADB: We work with a variety of customers who store their data about what they're doing in their business in a variety of forms. This is a key challenge for carbon footprinting: you need to be able to understand somebody's business in order to be able to tell them what the carbon footprint of their activities is. So in the industries we work in, the most important forms of data are invoices, or transaction records of things people are buying, which contain details of where they've been sourcing things and how they've been moving things around.

So probably the most valuable place for machine learning in our business is taking all this heterogeneous customer data from all these different systems and being able to map it onto a very rigid format that we can then retrieve information from our databases for. So, basically, creating this digital twin in order to provide these automated carbon footprints, so that when the data gets sent to us the following month, we have an automated pipeline there, with a few models in it, that can automatically ingest that and then spit out carbon footprints for all of those transactions.
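As a toy illustration of that mapping step, the sketch below matches a messy invoice line to the closest row of a rigid emission-factor table using nothing but stdlib fuzzy string matching. CarbonChain's actual pipeline will involve trained classifiers and entity matchers rather than difflib, and all of the data here is invented.

```python
import difflib

# Invented canonical rows; a real database would cover thousands of assets.
emission_factor_db = {
    "aluminium ingot, smelter, hydro-powered":  4.5,   # tCO2e per tonne (invented)
    "aluminium ingot, smelter, coal-powered":  18.0,
    "copper cathode, refinery":                 4.1,
    "hot-rolled steel coil, blast furnace":     2.3,
}

def match_transaction(description: str) -> tuple[str, float]:
    """Map a messy invoice line onto the closest canonical database row."""
    keys = list(emission_factor_db)
    best = difflib.get_close_matches(description.lower(), keys, n=1, cutoff=0.0)[0]
    return best, emission_factor_db[best]

row, factor = match_transaction("ALU INGOT 99.7% - COAL SMELTER - 500MT")
print(row, "->", factor, "tCO2e/t")
```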

[00:14:29] RS: Going back to like your tour of duty, sort of exploring the ways machine learning could contribute to the field of climate change research, why is it the right approach for you at CarbonChain?

[00:14:37] ADB: Yeah. In a sense, this is kind of a generic enterprise problem, actually. I worked on similar things when I was at Element AI. Document understanding is a huge field, and it's necessary because if you want to build enterprise scale products, you need to be able to ingest lots of different kinds of data. The way that people store data in their organizations is just incredibly heterogeneous. Like this is an AI problem for climate change, or it's disguised as an AI problem for climate change. But it's really a statement about data ingestion as a whole. 

In our case, there's like an added twist, which is like we're trying to map this data onto a sequence of events in the real world. But it's the same kind of challenge you see in any business, any data-driven business that's ingesting large quantities of data from their customers and then trying to give them some analysis or insights back. I think the main difference here is that what is required of the practitioner in this case is also a really deep understanding of these physical processes. Like you need to know how aluminum is made. You need to know roughly what determines the emissions of a ship, this kind of thing. So you've got to get much closer to the domain than I have previously. 

But I think at its core, it's a kind of data standardization issue, which is quite common across industries and across use cases.

[00:16:01] RS: Yeah. Familiarizing yourself with the process too. That's more design thinking. In any use of the skill set, like understanding your users and the challenges they face and like the drift of the problem, that's part of it too. It's fascinating to hear you characterize it as like a very common enterprise challenge. It’s interesting, isn't it, how across all these verticals and use cases, it feels like people in this field in AI and machine learning are facing a lot of the same kinds of challenges, and it's like processing data at scale.

[00:16:32] ADB: Yeah, yeah. Just coming back to the design thinking and user understanding, a big part of that is understanding how does this data get generated, because a lot of data we work with has been put in a system by somebody at some point. This is true of most enterprise data. There is another system that puts data into the system, or there's a human somewhere. Being able to unravel that is really, really important as well.

At Element, we had an experience where it turned out that, over a period of time, somebody was changing the labels in the database that we were training our models against. So we were getting basically input data that was the same as it had been 10 years ago, but over the intervening 10 years, the training labels had mutated. Because things were changing in the real world, and people would update the database record, but it wouldn't update this original document that supposedly had the information on it.

It took us ages to realize that. Eventually, we plotted the performance of our models over time, basically, against how old the documents were. We were like, “Oh, we're doomed. We do really well on the new documents, but anything that's been in the system for a while, we do really badly on.” Why is that? It's just that somebody was overriding the labels, and we could have avoided a lot of pain if we'd been a bit closer to the data generation process and understood where it was coming from.
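The diagnostic Archy describes (performance bucketed by document age) is simple to implement once you think to run it. A hypothetical sketch, with invented field names:

```python
from collections import defaultdict

def accuracy_by_age(records):
    """records: iterable of (age_years, prediction, label) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for age_years, pred, label in records:
        bucket = int(age_years)          # one-year buckets
        totals[bucket] += 1
        hits[bucket] += (pred == label)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

# If the output looks like {0: 0.95, 3: 0.81, 7: 0.55, 10: 0.40}, suspect the
# labels rather than the model: check whether anyone is editing old records.
```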

[00:17:42] RS: Is there a way you can foresee that kind of fallibility happening? Or do you have to go through it like you did at Element AI, where you just happen to run a report that showed that there was a deviation at this one point, and now you know to look out for it?

[00:17:54] ADB: I think, for a whole host of problems, the solution is getting the person doing the technical work closer to the problem and closer to the users. That's part of the reason why I moved into product management for a while: I realized that my ability to solve these problems was really strongly proportional to how close I was to the problem and how well I understood the user and the use case. If you're a technical person and you're several levels removed from the source of the problem, you're just really blindfolded when you're banging your head against it.

Now, I'm a big advocate of small teams with technical people in front of clients, listening to them, understanding what is the thing that I'm trying to solve for. Obviously, at some level, that breaks down, and there are some problems which are much more abstract. Like, if you're just trying to predict the stock market, then you don't need a deep understanding of users. You can do that in a different way. But even there, if you could spend a couple of days on a trading floor, you'd get a much more nuanced understanding of what is driving fluctuations in stock prices. That is going to be way more useful than two days sitting in the office, hammering a GPU with some fancy model.

[00:18:59] RS: I love what you said about bringing these technical individuals as close to the users as possible. Are these individuals who are themselves responsible for generating the data and inputting it? How far back do you go?

[00:19:10] ADB: Yeah. That's generally the level that we try and get to. So we're trying to understand who is going to be using this system that we're building, but also who is inputting the data that we're going to consume in our system and then do something with.

[00:19:22] RS: What do you ask them? 

[00:19:24] ADB: You're trying to understand what constraints they're under and what matters to them at the time. For instance, a common problem you have is that there's another source of data, which you're not privy to, which contains information redundant with the one that you're looking at. So people are sloppy at entering certain bits of data into the system that you're extracting from, because they actually have this information some other way, so it's not important for them to record it twice.

They're like, “Ah, but it's always also in this email that's attached to this other thing.” And that's the kind of thing that you don't know until you've mapped out the information flow, or maybe you've just sat with them and watched over their shoulder, and you're like, “Ah, interesting. There's this database dump, which is what I'm looking at. But half the time, they're not looking at the database interface at all. They're looking at this email over here. Well, I don't have those.” So it's that kind of thing.

Overall, the tendency is for people to describe solutions that they envisage, rather than articulating their problems. So sometimes, you need to spend quite a lot of time digging in. What is it about this process that's really painful? Where are you spending most of your time here? Which is a valuable thing for us to automate? Because otherwise, the tendency at a technical level is that you kind of automate the thing that was most obvious to you. But it could be that that's relatively quick to do manually anyway. So you can expend your effort a lot more efficiently by spending that time with users.

[00:20:41] RS: Yeah, yeah. It's a fantastic reminder to keep people front and center. The users front and center, anyway. Because it's easy to say, “Oh, well, it works in the model,” and, like you say, spend your time hammering on a GPU. Again, I am just so interested that you characterize the problem as a data problem, rather than a problem specific to climate change. I am curious, though: how do you foresee additional ways that AI and machine learning might be deployed for future use cases at CarbonChain, or to solve climate change problems writ large?

[00:21:15] ADB: I think the thing that we talked about earlier in terms of optimizing these complex dynamic systems at timescales that are not plausible for humans is going to be a major application of AI, and that's going to bring together a cluster of different technologies. So there's a big forecasting component, and then there's also like a big optimal control component. 

A lot of the problems that I work on come down to some form of natural language processing. So the growth and the success of large language models is probably going to be impactful there, because if you can get more efficient ways of extracting data from enterprises in terms of their activities, you can gain a much clearer picture of their activities, and then their carbon footprint.

That continues to be the direction that we're heading in, in terms of being able to minimize the amount of data that anybody has to extract from a back-end system. If you'll allow me to digress for a little bit, one of the things that is really hard about building enterprise software is that there's somebody who cares about the use case that you care about, like understanding the carbon footprint of a company, but they're really in no position to give you direct access to what you need. They need to go and persuade somebody else in the business, who has the keys to these systems, to open up and export the things you need.

Technologies that allow you to bypass that are really powerful. For instance, in the domain that we work in, where we're looking at products that we'd like to know the carbon footprint of, ways that allow us to at least get the ball rolling by working from images or product descriptions or product specs that we can find online, automatically ingesting those and estimating the carbon footprint on the basis of that, are hugely powerful, because it generates that initial proof point that gives internal stakeholders enough ammunition to then go and unlock higher-quality forms of data.

Monitoring is a big deal and will become ever more important in the next 30 years, because nations are making all these commitments. One of the big things you saw at COP26 was various commitments around deforestation. Deforestation is an example of a problem where the scale of the data is just enormous. You cannot have people keep an eye on forests using satellite imagery; there's too much forest. So you need machine learning models to do that job for you, and there's a bunch of interesting companies and nonprofit projects in this space, basically training big vision models to keep an eye on forests and understand where they're being encroached upon.

I think one really interesting extension of that is around carbon offsets. So a lot of companies have invested in carbon offsets. But there's a big issue, which is: you pay some money for a carbon offset, which corresponds to a tree being planted somewhere. Ideally, you then want to be able to keep an eye on that tree for the next 30 years, to be sure that it's still there, and that the thing you paid for is sequestering carbon as you expected and as the person who sold it to you implied it would.

Again, there are some interesting companies in that space, like Pachama, and Sylvera in the UK, who are really focusing on doing that observation piece. As vision models get more powerful, that will only get easier and more efficient, which is actually sort of a big deal.

[00:24:27] RS: I could adopt a tree, and then with a sufficiently advanced vision model, I could like search for my tree over time.

[00:24:35] ADB: Yeah, yeah, pretty much. So I worked a little bit on monitoring with this project called Climate TRACE, which was led by Al Gore, and I worked with an amazing London startup called TransitionZero on that. They were using satellite monitoring to monitor the activity of coal power plants. So they're basically recognizing, with various vision models, when there are plumes of smoke coming out of these coal power plants, so then you can figure out when it's on and when it's off. Then you can say, within a month, “Oh, this is how much carbon it's emitted.” They are applying similar methodologies to industrial assets like cement plants and steel plants.

The big challenge there is how do you get the training labels. The models are pretty capable, but you've got to get ground truth data of when this thing was on and when it was off, in order to train against it. That's also the challenge with forest observation projects. You need some way of getting this ground truth data of how much carbon this tree is sequestering, and then you can train your models to recognize that from visual features of the tree.

[00:25:30] RS: Got it. Quite a difficult problem, because I believe there's something on the order of three trillion trees on the planet, which, like you said, if there were 150 trees, maybe we wouldn't need to apply machine learning to it. But when you get to those bigger datasets, it is the tool for the job. Archy, we are creeping up on optimal podcast length here. So at this point, I'll just say thank you so much for being a part of the show and for sharing your experience with me today. I've loved chatting with you.

[00:25:54] ADB: Not at all. Thanks very much.

[00:26:02] RS: How AI Happens is brought to you by Sama. Sama provides accurate data for ambitious AI, specializing in image, video, and sensor data annotation and validation for machine learning algorithms in industries such as transportation, retail, e-commerce, media, medtech, robotics, and agriculture. For more information, head to sama.com.