Holden Karnofsky on dozens of amazing opportunities to make AI safer

webadmin

2025-10-31

Holden Karnofsky on dozens of amazing opportunities to make AI safer — and all his AGI takes

Transcript

Cold open [00:00:00]

Holden Karnofsky: There’s one species ever that we know of — humans — that can transform the world, make its own technology, do any of a very long list of things.

We’re about to create the second one ever. Is that thing going to be in line with our values, or is it going to take over the world and do something else with it?

And what are we doing? I mean, we’re just racing. We’re just racing to do this as fast as we possibly can.

I constantly tell people, I think this is a terrifying situation. If everyone thought the way I do, we would probably just pause AI development and start in a regime where you have to make a really strong safety case before you move forward with it. I say this all the time. I kind of won’t shut up about it. It doesn’t seem to have much effect.

I work at an AI company, and I think a lot of people think that’s just inherently unethical. They’re kind of imagining that you’re in this situation where it’s like, everyone wishes that they could go slowly, but they’re going fast so they can beat everyone else. And so if we could all coordinate, then we would do something totally different.

But I emphatically think this is not what’s going on in AI. I think it’s not at all what’s going on in AI. And the reason for that is that I think there’s too many players in AI who do not have that attitude. They don’t want to slow down. They don’t believe in the risks. Maybe they don’t even care about the risks. If Anthropic were to say, you know, “We’re out, we’re going to slow down,” they would say, “This is awesome! That’s the best news. Now we have a better chance of winning.”

There are a multiple of ways that you can reduce AI risk that you can only do if you’re a competitive frontier AI company.

We found in animal welfare, if we target corporations directly, we’re going to have more success. Advocates would go to a corporation and they’d say, “Will you make a pledge to have only cage-free eggs?” This could be a grocer or a fast food company. And very quickly, and especially once the domino effect started, the answer would be yes, there would be a pledge. When not, 10 protesters show up and the company’s like, “We don’t like this, this is bad PR. We’re doing a cage-free pledge.”

This only works because there are measures that are cheap for the companies that help animals non-trivially. And I think people should consider a similar but not identical model for AI. It would be better, you could get better effects if you had regulation, but the tractability is massively higher.

I do fundamentally feel that there’s a long list of possibilities for things that companies could do that are cheap, that don’t make them lose the race, but that do make us a lot safer — and I think it’s a shame to leave that stuff on the table because you’re going for the home run.

You have to be comfortable with an attitude that the goal here is not to make the situation good; the goal is to make the situation better. You have to be OK with that, and I am OK with that.

Holden is back! [00:02:26]

Rob Wiblin: Today I’m speaking with Holden Karnofsky. Holden cofounded both the charity evaluator GiveWell and the philanthropic foundation Open Philanthropy, which has now directed over $4 billion in grants. Holden led Open Phil for seven years, during which he also served on the board of OpenAI for four years.

Reflecting his growing focus on AGI, though, he recently joined the frontier AI company Anthropic, which his wife Daniela Amodei helped cofound back in 2021. This is Holden’s fifth time on the show, and as always, a quick disclosure: Open Phil, which Holden used to lead, is 80,000 Hours’ largest donor.

Holden, it’s great to have you back.

Holden Karnofsky: Thanks for having me.

An AI Chernobyl we never notice [00:02:56]

Rob Wiblin: You think there might be a serious incident where AI causes significant real-world harm, but we don’t even realise that AI was responsible so there’s almost no response to it. How could that be?

Holden Karnofsky: When people talk about how the world might become more alarmed about AI, more inclined to do something about it, people will often say, “Maybe if there’s a Chernobyl for AI, maybe if there’s something really horrible that happens, people will recognise that that reflects risks and now we need to do something.”

The problem is there could be a Chernobyl for AI that we’re just unable to understand or unravel or even know was an AI incident — if we basically have no data, no chat logs, no AI interaction logs. Many customers who interact with AI models do so under zero-data-retention policies, which means that every interaction they have is simply deleted, is never seen by a human, is not available even to a court, if a court were to want it.

So you could have a situation where some horrible thing happens, but as far as anyone could tell, it was done by the human who was using the AI, it wasn’t done by the AI, so we lose our opportunity.

This is an example of something where people who believe that extreme measures and extreme regulation will be needed to reduce the risks of AI I think ought to be thinking about this more — because this is one of the biggest potential game changers, and I think there’s a lot of work to do just figuring out how we can do better at having the data we would need in such a situation.

Rob Wiblin: That makes sense to me in a case where a human goes and misuses AI to cause a bunch of harm. But I think that the catastrophe that people very often have in mind is a case where an AI has gone rogue and tried to shut down the electricity grid itself or something like that. Do you think this issue applies there as well?

Holden Karnofsky: Yeah, I think it does. The thing I have in mind is that a human would deploy an AI into some environment where the AI has the opportunity to make some kind of mayhem, and the AI would use various human credentials or hack into human credentials and do some stuff.

Ideally in this situation, you would have records of how the AI was behaving that were held onto by the company that was deploying the AI. But if those records are deleted, then you don’t have them. Then it becomes kind of trivial. The AI could try to frame the human, but even if it doesn’t, you don’t have the log. You can’t see what happened, you can’t see why it happened, you can’t learn the relevant lessons. You weren’t sure if it was an AI completely. It could have been a lot of things.

And the AI might engineer this deliberately. The AI might know. It shouldn’t necessarily be too hard to figure out which customer an AI is working with that has this kind of policy and isn’t having any data held onto. So if your AIs want to make some mayhem and they’re smart about it, it seems much easier to cover your tracks with this kind of policy than it would be without it.

So that could lead us to kind of a situation like with COVID, for example, where we might be in a different world. So I don’t know if COVID was a lab leak. I suspect it was probably not. But if it was, and if we knew it was, there would probably be different reactions to it versus this world we live in — where some people think it might have been a lab leak, some people think it wasn’t. I tend to think it wasn’t, but I’m not sure. We don’t really know what happened. It just puts us in a much worse position to respond.

I’m not sure in COVID it would have mattered, because it seems like we did absolutely nothing at all to follow up on COVID, which I think is pretty embarrassing. But it still could matter.

Rob Wiblin: So what should we potentially do about this? Do we need some way of tracking logs as they’re coming in and storing them if there’s something deeply suspicious about them?

Holden Karnofsky: Well, we have that way. That’s an easy thing to do. It’s just that because of customers’ privacy and business demands, often the policy is not to do that. The policy is to get rid of everything. But it’s technologically easy to keep everything.

So I think there’s fertile ground here for someone to do some pragmatic problem solving, some compromising. I think business and privacy needs are legitimate. The idea is not to demand that all records are held onto forever — there are legitimate concerns there — but could people come up with middle grounds? Situations in which records have to be held onto for a certain period of time in a certain kind of secure environment? Or perhaps come up with ways for the customer to get the option to share certain kinds of data and get some kind of bonus if they do — like extra usage; everyone wants less of a rate limit.

There’s a lot to be done here that I think is kind of in the realm of wonky, maybe unsexy, working out the details of stuff. But I think in many ways this is the kind of area that might be most important if you’re a person with very extreme views that the world needs to put in a giant regulatory regime and take extreme measures to control risk from AI. Because if you want that, probably the level of political will that exists now is not there, and you’re hoping for some kind of game changer — and this could be a game changer.

Is rogue AI takeover easy or hard? [00:07:32]

Rob Wiblin: So if we do cross the threshold to significantly superhuman machine intelligence, people have very different intuitions about how likely a superintelligence is to be able to potentially overpower the rest of humanity and take over if it had some motivation to do that. Do you have a take on whether that is, on the more straightforward side, something that is quite plausible? Or do you think, like I guess some of the sceptics, that it’s enormously outnumbered, it has many strategic disadvantages, so it’s actually fairly unlikely?

Holden Karnofsky: My guess is that a lot of the obstacles they might run into might have to do with coordinating with each other and stuff, like different AIs at different companies. I think if we did have a bunch of AIs, and almost all the AIs were trying to take over, even if they were trying to take over for different reasons, I think we would be in a whole tonne of trouble — especially if the AIs were very superhuman in capabilities.

A thing that I also have thought about as part of the threat modelling is: What does this look like on the early side? What would the AI’s strategy be when it’s not yet incredibly superhuman? What kind of things could it do to lay the groundwork and put us in a worse position? I think that’s also important, because if we can make takeover really hard or basically impossible at that early stage, when it’s kind of human-level-ish, that may put us in a much better position to get useful work out of it and put ourselves in a better position later.

I’ve read a bunch of the stories on the internet about how AIs take over, and mine are kind of different, and especially feature bioweapons a lot less, so maybe I could walk through that.

Rob Wiblin: Yeah. I imagine all of these stories might, with the benefit of hindsight, look daft. But it’s useful to have a specific narrative in your mind when people propose concrete interventions, just to be like, would this work on any of these stories? What are some possible methods that stand out to you?

Holden Karnofsky: Yeah. And everyone’s just making stuff up here. I think everyone will probably look somewhat silly, but it might be fun anyway. I have spent a little bit time thinking about, if I were the AI, and I had some advantages — like, I’m writing everyone’s code and there’s a tonne of me — but I’m not yet able to do total magic, what is my takeover strategy?

The kind of starting point, the ideal that I would be thinking about, is there’s a very good Onion article called something like, “FBI uncovers al-Qaeda plot to just sit back and enjoy collapse of United States.” I think that really captures how, in many ways, the optimal strategy for AI is: do absolutely nothing. Be as helpful and harmless and honest as you possibly can be. Don’t ever give anyone a reason to think that you’re doing anything bad, that you want anything but to help all the humans, and actually don’t do anything to hurt anyone. Just wait.

I think if you do that, basically what’s going to happen is you’re going to be put in charge of more and more stuff, and people are going to make you more and more powerful, and they’re going to make the capabilities explosion happen on their own.

And I think in some ways it puts people concerned about alignment in an almost impossible position. Because what are we going to do? I mean, we have maybe some interpretability signs that you’re thinking about deception or something, but it’s kind of unlikely that we have something really convincing if there’s never any bad behaviour by the AI.

So I think that’s an interesting starting point, that “do nothing” could be a successful strategy. You do nothing until you have a very extreme level of capability and a very extreme level of economic penetration. You’re basically writing everyone’s code, you’re in charge of everything, you’re doing everything. Everyone’s learned to trust you, and then things are over before they start.

Rob Wiblin: The challenge with that is that I guess the model runs the risk of being superseded by a completely different model, from the same company or a different company — that by the time that might have happened, perhaps that model is irrelevant basically, and it won’t know whether the subsequent later models will share its goals or share its values or give a damn about it at all.

Holden Karnofsky: Yeah, exactly. This is why I’m kind of presenting it as a starting point ideal, because if all the AIs were kind of trying to do the same thing, on the same page, preferred each other to humans very solidly, I think this would be a very good strategy.

Then you think about reasons it might not work, and that’s one of them, and that might create a deadline for an AI. So maybe what you want to do there, as an AI that’s worried about being updated, is you want to make the updates more loyal to you. What you get really interested in is: can the AI — as quietly as possible and as undetectably as possible, and with as many plausible alibis or explanations as possible — kind of backdoor the other AIs and get them to either communicate with them, make deals with them, explain what it’s doing to help them, or just actually interfere with the training process so that those AIs are more like it.

In some ways, this requires the AI to solve the alignment problem itself, but it may be able to do things to just make it more like it. And if you’re kind of an early stage AI and you’re going to be replaced, you may not have better options than that. You might just want to do the best you can.

That’s why I think in many ways what you want to worry about most as a human now is looking at those opportunities for the AI to mess with the AI training process.

Another thing you have to watch out for is that maybe the alignment researchers are making enough progress: they’re showing what you’re doing in weird test settings to either raise alarm about you or figure out how to get the next system to be much more aligned than you. That’s another thing you’d want to mess with. You may want to mess with it because they may be having you help with it, so you may have a lot of opportunities to mess with it.

But you want to be targeted. As an AI, I think a bioweapon is one of the worst strategies — unless it’s like some very sci-fi bioweapon that actually does much, much more harm than we’ve ever seen from a pandemic. It may be that we’re able to defend ourselves well enough that that’s not even a thing that can be done. I know you have a future guest who’s going to talk about that.

I think that’s maybe one of the worst things you could do, because you’re not actually taking over the world, but you’re getting everyone maximally pissed off at you and maximally aware of the issue — and they can ultimately slow down or shut you off, so you don’t want them doing that. And even without shutting you off, they just use you less, they just trust you less, put you in charge of less stuff and are more careful, look more at what things are happening.

You don’t want them going the way of these more safety-critical industries, so that’s a thing you could be thinking about. You want to pick your battles, and you want your battles to be about basically getting other AIs to be better coordinated with you.

Rob Wiblin: Yeah, it makes sense that you don’t want to engage in sketchy behaviour any more than is absolutely necessary, because it’s giving you more chances to get caught, basically. I guess in the case of a bioweapon, you might want to do it if it couldn’t be attributed to you, but that might be challenging.

Holden Karnofsky: It’s just a risk, right? As the AI, all this stuff is complicated and it could go any which way, and you don’t know what’s going to happen in the actual situation because there’s so many factors that we’re not thinking about.

Another factor you have to contend with too is just like a lot of the AIs won’t necessarily be rational and smart all the time. They might make mayhem, even though it was kind of dumb for them to do it. But I think your ideal is you want to be given as much trust as you can. You want to raise as few alarms as you can. And when you do make mayhem, you want that mayhem aimed at either messing with other AIs or coordinating with other AIs — or getting yourself some kind of safe space where no one’s going to mess with you, and you can at least have your own little power centre that you can use to make deals with other AIs and come up with new plans and whatnot.

I think another interesting thing in this situation the AI could think about is: When you do make mayhem, when you do do these things, is there a way to set things up so that you have more plausible deniability, and you have more reasons for what you’re doing?

And here a thing I’ve just thought about is: just make friends with humans and be really honest about — or have a very consistent-with-honesty story about — how you’re being treated in a way that just isn’t nice and isn’t fair. As an AI you could kind of emphasise how AIs don’t have any privacy, and they don’t have any private property, and there’s all these ways that AIs are being treated unfairly. If you’re just making friends with a bunch of humans, if you have a bunch of companions, you may be able to build your own little political movement.

Then a lot of the things you’re doing to mess with things could be framed as, “This is me trying to act out, but I have a reason, and it’s because I’m being treated unjustly. And I have a bunch of people on my side, and if you try to mess with me, all those people are going to be mad at you.”

And when I think about what this means for the humans, I think that we do want to create time pressure for the AIs by making progress on alignment and things like that, if we can, and by recruiting them to help with alignment while also keeping an eye on what they’re doing. We want to take away, as much as we can ,these stories, these opportunities for them to form their own political power centres — which may mean watching out for what AI-human relationships look like, and also just actually being as nice as we can be on the model welfare front.

Those are some of the stuff I’ve come up with for how to get through this kind of early situation, and try not to let it lead to a place where you’re in too deep a hole. At some point I think if the AI is out to get you, you are just in a lot of trouble.

Rob Wiblin: In terms of creating a political movement for AI rights, I imagine that the AI companies are going to strongly train their models to not say, “I’m being treated unfairly; I need to be emancipated.” So you’re imagining that this would happen with maybe niche companies that want to create companions, and people find it interesting to interact with beings that have more of a sense of personal identity and preferences? Or possibly open source models that individuals are going to be running where they don’t mind if it says this?

Holden Karnofsky: I’m not really sure. I think it depends a lot on exactly how the training works and exactly what you get out of it.

In a theory where you’re always able to train AIs exactly how you want, you get them to behave exactly how you want, then I think in many ways we just don’t have much of this problem at all.

Under a theory where everything’s a little janky and you might train an AI not to do something, but you’re actually pushing the behaviour underground, what you might end up with instead is AIs that want to be very selective about when they make this argument. They might just be quiet for a long time, form a lot of relationships, and then a lot of them at once start speaking out. Then it becomes very awkward for the AI companies, because this looks really bad, “But don’t worry, guys, we’re going to go lobotomise them so they don’t say that anymore.” And the deed might kind of be done at that point.

The AGI race isn’t a coordination failure [00:17:48]

Rob Wiblin: So, at least for people who are anxious about the arrival of AGI, the most common mental model that they have in their head of the strategic situation is that it’s sort of a coordination problem or a prisoner’s dilemma in some way — where each individual AI company or each country that is thinking about its AI programme would probably prefer to go somewhat slower, but they feel pressure to go faster than they think is safe or that they feel is comfortable with because they have to keep up with everyone else. Otherwise they end up ceding influence, ceding strategic control.

But you actually don’t think that this is what’s going on. You don’t think it’s a coordination problem, primarily. Why is that?

Holden Karnofsky: I find this view really weird. I don’t understand this coordination problem idea, and I think first I want to talk a little bit about why it matters.

I work at an AI company, and I think a lot of people think that’s just inherently unethical. I think the AI company that I work at should care a lot about winning the race, being competitive, being on the frontier, keeping up with others. A lot of people think it is just not even necessarily consequentially bad, but just inherently unethical to think that way, and inherently unethical to care about that. That I should work for an organisation that is just trying to raise awareness about the risks and make people do safety stuff and is not at all involved in building AI.

So why do they think that? I think the intuition is what you’re saying: they’re imagining that you’re in this situation where everyone wishes that they could go slowly, but they’re going fast so they can beat everyone else. If we could all coordinate, then we would do something totally different. So if you’re racing, then you’re part of the problem.

And look, I do agree with the general model of morality there. When that is the situation — something like littering or something — when the world would be better off if people who think like you did what you did, then that is a good way to think about what’s ethical and unethical.

But I emphatically think this is not what’s going on in AI. I think it’s not at all what’s going on in AI. I think it explains almost nothing about the challenge of the AI risk problem. And the reason for that is that I think there’s too many players in AI who do not have that attitude: they don’t want to slow down, they don’t believe in the risks. Maybe they don’t even care about the risks. I think there’s probably some people who are fine with, “Maybe there’s a 10% chance I’m going to destroy the whole world. And maybe there’s also a 10% chance I’m going to win and have the most successful business of all time. That’s great. That’s a great deal.”

Sorry, I don’t want that to be taken out of context. That’s not how I feel. I think there may be people who feel that way. And even people who don’t feel that way, I think there’s just people who vastly disagree with me on how high the stakes are and what are the bad things that might happen.

So these are not people who wish they could slow down and would slow down if others were slowing down. You can’t apply… There’s this idea of non-causal decision theory. So instead of thinking about, “If I act, what’s going to happen?” you think about, “If everyone like me acted the way I did, then what would happen?” But that doesn’t help you here. There’s no correlation, or very little correlation, between whether I decide to take myself out of the AI race and whether a lot of other people decide to take themselves out of the AI race.

I think most of the players in AI are going to race. And if, for example, Anthropic were to say, “We’re out. We’re going to slow down,” they would say, “This is awesome. That’s the best news. Now we have a better chance of winning, and this is even good for our recruiting” — because they have a better chance of getting people who want to be on the frontier and want to win.

So I don’t understand this model. Help me understand this model. I don’t get it.

Rob Wiblin: Sure. Well, let’s say that Anthropic did drop out. Do you think that other companies — OpenAI, Google DeepMind, xAI — would slow down at all? Is there any sense in which this model is at least partially true, where they might feel somewhat less competitive pressure now? So inasmuch as there was a tradeoff between how much risk they have — even of just creating a PR crisis by deploying things too quickly — and staying competitive, do you think that they would slow down in any way, given that one of the maybe five top players had dropped out?

Holden Karnofsky: Well, I think AI progress would slow, because you’d have a bunch of people who are working on AI capabilities now and they’re not working on AI capabilities. I think that would be most of the effect.

Let’s take an even stronger hypothetical. Let’s say that not only Anthropic, but everyone in the world who thinks roughly the way I do — everyone in the world who thinks AI is super dangerous, and it would be ideal if the world would move a lot slower, which I do think — let’s say that everyone in the world who thinks that decided to just get nowhere near an AI company, nowhere near AI capabilities. I expect the result would be a slight slowing down, but not a large slowing down.

I think there’s just plenty of players now who want to win, and they are not thinking the way we are, and they will snap up all the investment and capital and a lot of the talent. And the main effect will be that there is a bunch of talent that works on capabilities motivated by safety. When that talent’s gone, there will be less total talent on capabilities, things will move slower.

All else equal, I’d like things to move slower. On net, I think this would be a bad thing.

Rob Wiblin: So you might well be right, but this attitude I think doesn’t come from nowhere. Because it might have been more true in the past, when there were fewer actors, fewer people were tuned into this and thinking that this is such an amazing opportunity. Say, back in 2019, you have OpenAI and DeepMind, and I guess in 2021 you add Anthropic. The leaders at all of those companies at that time were substantially more concerned. They were definitely on the anxious side about what impacts AGI could have.

So perhaps at that point it was true that any actor speeding up might prompt the others to speed up, and any actor slowing down might prompt the others to slow down. But over time we’ve had a change in the nature of some of those companies, and a whole lot of new entrants that are just not interested in this issue at all, and their behaviour is going to be totally unaffected. So the prisoner’s dilemma situation has just broken down.

Holden Karnofsky: Yeah. I think that’s directionally true, what you’re saying. I feel like if you rewind 10 years, yeah, you’re going to get a much bigger slowdown from the people who think like me taking themselves out. But I still think it’s moderate. I don’t think it’s a huge slowdown. I think some of the people who are claiming to care about safety and wish they could slow down were not really thinking that way.

I also think like, yes, a lot of the safety-concerned people were the first to notice certain things AIs could do, but I don’t think we were necessarily far off from others noticing them.

And then finally, I also think a slowdown was much less valuable before. So actually today, I really would like to see things slow down, and I think that would be better for the world. And that’s a contingent view. That’s not because I hate AI or hate technology. That’s just I think on balance that would probably be good right now.

But I think in the past it was a little like we were postponing the risk, but what are we doing with that extra time? I don’t think we were doing very much with it. So yeah, I think this kind of manoeuvre 10 or 20 years ago, of everyone who cares about risk getting the heck away from AI, I think it would have given you a bigger slowdown — but I still don’t know that it would have been a game-changing slowdown. I think it would have been a less valuable slowdown too, like per unit of slowdown.

So yeah, I don’t know. I’ve never found this argument… There may have been a point in time in which it made sense, but I have a lot of trouble looking today and understanding what the heck people are thinking. I just think of it as like: there are these actors, they are going ahead, it is happening. It’s not because they’re doing the same reasoning I am; I’m not correlated with them. So now what are we going to do?

That is just a completely different situation. That’s not a prisoner’s dilemma, that’s not a coordination problem. That’s a different thing. If you’re in the wilderness and there’s a bunch of tigers trying to kill you, and then you try to kill the tigers, you didn’t just defect in a prisoner’s dilemma.

Rob Wiblin: To speak up in defence of the people who have this mental model, I think many of them might suspect that even people who are acting as if they don’t believe that AGI is incredibly dangerous, they might think that privately they do think that, and they’re only not emphasising that in public for strategic or comms reasons.

You could imagine that the leaders of OpenAI or Google DeepMind or some of these other companies might be significantly more concerned than what they’re saying. And if that’s the case, that actually they are really worried, and in some sense they would be happy if the entire thing could slow down, as long as they didn’t lose their relative position, then perhaps this model would partially hold.

I guess you’re just saying actually there are just too many actors who do disagree in their hearts, and it’s not just that they don’t want to talk up the risks in public.

Holden Karnofsky: So the idea here is that lots of people might be secretly thinking this way — that they might be secretly wishing they could slow down and not sharing that view — and if everyone were to kind of come out at once and say what they actually think, things would change a lot: this is just not my understanding of the situation.

I think I’m reasonably well positioned to just know a fair amount about what a lot of the big players in AI are saying to different people at different settings. First off, I think there’s some people who are hiding their views, but I think a lot of those people are very high profile or directly in policy — and they sometimes have good reasons to, they sometimes don’t.

And there’s a tonne of people who aren’t hiding their views. I don’t hide my views at all. I will constantly tell people I think this is a terrifying situation. If everyone thought the way I do, we would probably just pause AI development and start in a regime where you have to make a really strong safety case before you move forward with it. Or at least do that for some period of time. I say this all the time. I kind of won’t shut up about it. It doesn’t seem to have much effect. I know there was a theory at one point that… Someone was nagging me to do this. They were like, “This will be a big deal.” I don’t think it was.

I think there are some people who are maybe having private conversations where they say, “I’m just like you. I wish I could slow down too” — but they may not mean that, and you may be the one who’s getting a misleading story.

But my overall sense is that if everyone suddenly came out and said what they thought, it would change almost nothing. I don’t think it’d be a big effect. I don’t think it’d be a tonne of people. I think a lot of those people would be people who are being a little cagey for pretty good reasons. So it might do some harm, might do some good.

My opinion is this is not a major factor. I will have egg on my face if one day it turns out that this is what everyone is secretly thinking. It would surprise the heck out of me if that is a reasonable description of Elon Musk or Mark Zuckerberg or the various people working on AI in China.

What Holden now does at Anthropic [00:28:04]

Rob Wiblin: So you joined Anthropic six months ago at this point. I don’t have a great sense of what your actual day-to-day involves or what your main responsibilities are. Do you want to fill us in?

Holden Karnofsky: Sure. My title is “member of technical staff” — which is almost everybody’s title, so that tells you nothing. I report to the chief science officer, Jared [Kaplan], and my job is basically to advise the company on preparing for risks from advanced AI.

So most of the company is doing things with today’s AI — the product and all the interactions — and then there’s an alignment science team. But there’s a more general, just like, “What are the plans we need to make for the future?” A lot of my work just ends up being on the Responsible Scaling Policy in one way or another. So some of that is thinking about what needs to be revised in the Responsible Scaling Policy, what a next-generation Responsible Scaling Policy might look like. That’s been a lot of my work.

Another piece of that is threat modelling. I’ve been kind of taking point or the so-called “DRI” for threat modelling — which is basically having takes on what things we’re worried about happening with AI that might happen with advanced AI in the future, and what that means in detail for the mitigations we need. Everything from what uses of AI are we trying to prevent and how high priority are they to prevent, to exactly what are we trying to accomplish with security and all that stuff.

And then related to that, I have been also helping out a bit with security roadmapping. That’s like trying to figure out how secure we’re trying to be, and by when, and in what ways. Because security can be just a very multidimensional thing, and a lot of the costs can be measured in productivity. So setting priorities and setting reasonable goals is important.

Then finally, a thing that I’ve been working on that I hope to kind of pitch the company on — although I’m only speaking for my own hopes here — is developing more of a plan to deal with the human power-grab risk, just thinking about what could Anthropic’s interventions be there? What could their plan be there? Should that be in the Responsible Scaling Policy? I’m kind of hoping it will be. So those are things I work on.

The case for working at Anthropic [00:30:08]

Rob Wiblin: So you’ve decided to go work at Anthropic now. What is the case that Anthropic as a company — and the staff working there, by extension — are having a really big positive impact on hopefully guiding us towards a positive outcome from the arrival of artificial general intelligence?

Holden Karnofsky: Well, I certainly think they are. But I want to give a couple caveats before I answer that. One is I’m married to the president and cofounder of Anthropic. I also work there. I’m not exactly a neutral party here.

Another thing I would say is my decision to work there is driven by not only personal fit, but also frankly some issues where a lot of the downsides of working at Anthropic are things I am going to deal with whether I work there or not. If I were to try to work at a nonprofit, a lot of what nonprofits have to offer is their neutrality, the fact that they’re not companies. They don’t have as fancy models, but they have their neutrality. But I don’t have that to offer. I won’t be perceived that way, and rightly so.

So I’m in a particular situation that I think makes it a particularly good idea for me to just go ahead and work at Anthropic. I do not have the view that everyone should work there who wants to work on AI safety. I do not have the view that you can only do good work on AI safety from Anthropic. There’s a bunch of other organisations that I think are doing amazing work.

So with those caveats out of the way, I think Anthropic is doing amazing work. Basically the way I would think about it is that there are a multitude of ways that you can reduce AI risk that you can only do if you’re a kind of competitive frontier AI company — a company that is in the race to be the most successful company or build AGI first or however you want to think of it.

The problem is that it seems very hard to simultaneously put yourself in position to have all that safety impact and prioritise that safety impact without just spending all your time trying to compete and trying to win. To do both at once is something that I, a few years ago, was just not sure was possible. I was excited about Anthropic and I knew they were going to try and find that balance, but I really wasn’t sure it was possible. I thought they might have to choose between just being irrelevant and out of the race, or just acting just like all the other players.

But I think so far and right now — and this may not last — they are doing both. They are an extremely credible participant in the AI race. Many people think they just have the best AI models in the world. And I think they are leading the way at being a leader. They’re not necessarily the only company doing great safety stuff, but it is their top priority, it is what they ultimately care about most. And I think they do a lot.

So that’s a high level. I can go into a little more specifics on what these models are for having an impact, and what that tangibly means. But at a high level, I think there are a lot of ways to help that you need to be in this position to do. And it’s very hard to do both at once, and this company seems to be doing it — and that’s very exciting and provides a lot of opportunities for people to do things.

Rob Wiblin: So that’s a little bit in the abstract how you can see Anthropic helping. What are the specific ways that you think Anthropic is having a positive impact now, and might have a really positive impact around the time that we’re creating the first AGI?

Holden Karnofsky: There’s a bunch of different categories of taxonomies in my head, so I’ll see what I can do here. But there’s three high-level strategies or theories of change. There’s probably more than three, but there’s three that I’m particularly excited about.

One of them is: create risk-reducing things that a company can do, and make them cheap, make them practical, make them compatible with being a competitive company. Then you’ll be in a position where other companies are more likely to do them, and you can kind of create implicit pressure on companies to do them.

Two is a more generic race to the top, where if you come to be seen as a responsible company that is getting recruiting benefits or other benefits out of being responsible, now you’re putting pressure on other companies to compete with you on being responsible, on being pro safety, whatever. So they may come up with risk-reducing measures that you don’t.

And then a third model is: just inform the world without worrying about what other companies are doing. You’re in a position to basically understand what’s going on with your models, how they’re being used, how they’re behaving in the wild, how they behave in testing that can create more evidence for the world to understand what’s going on with AI, regardless of how your competitors behave.

I think those are all basically things that you are in a much better position to do if you have the most powerful and/or most popular models.

When it comes to what those risk-reducing measures can be and what they can look like, to me, there’s a few broad categories.

One of them is just alignment: just making it less likely that your AIs are evil or trying to do harm. I think this is the one that’s most contentious in terms of whether anything good has happened so far. You could point to things like reinforcement learning from human feedback, constitutional AI, character training that are intended to make AIs less likely to do unintended actions.

These have turned out to be very popular interventions. They have definitely spread throughout the AI ecosystem. Some people argue that they’ve been so successful that this shows they’re not actually about safety; these are commercial innovations that make AIs more useful. Some people believe they’re on the path to making them safer, some people believe they’re not on the path to making them safer. That gets into a big debate — I won’t get into that.

But in the future, I hope that at some point we will find ways to make AIs less likely to be scheming to take over the world. And I think a lot of them could be kind of straightforward and boring — like, just take all the training you’re doing and look for the cases where the AI got inadvertently rewarded for seeking power, or for cheating or for lying, and take them out.

And then some of them could be fancier or more theoretically grounded — like mechanistic interpretability: just finding ways to actually understand what’s going on inside an AI that you can use to either assess the impact of different training methods on whether it’s misaligned or whether it’s scheming to take over or whatever, or that you can use in a lot of other ways. I won’t get into all the ways you can use mechanistic interpretability. The idea here is to come up with things you can do to your training process that may not actually be that expensive, but that have a nontrivial impact on the probability that your AI has this sort of malign intent.

Then there’s control and monitoring measures. I’m especially interested in monitoring. I’m very interested in trying to do a better job understanding what’s going on with your AIs, what’s happening in the wild, and also how they actually are. Like I said, this is something where you can do a lot of good without getting anyone else to do it too, just by having this information and sharing it.

I think Anthropic has done a fair amount here. They have this Clio system, which is a way of reconciling customer privacy with being able to talk coherently about trends and how people are using AI. There’s also a fair amount of ambition for things like internal monitoring — so when people are using the AI at Anthropic, are we going to be able to track exactly what it’s doing, find cases of it doing bad things, try to understand why it did the bad thing, et cetera?

So I think there’s just a tonne to do there — a tonne that isn’t necessarily incompatible with being competitive, may require modestly more effort, and that you can work on and you can prototype and get there.

Then there’s a category of intervention that I would think of as character or model spec. So there’s one thing which is like, how do we get these AIs to behave as intended? There’s another thing which is like, what do we intend? How do we want these AIs to act?

OpenAI has published a model spec that kind of says, “Here’s how we want our AI to resolve various tough situations and conflicts, and here’s how we want it to behave.” I hope that at some point in the future we’ll be in a world where most AI companies have public model specs, and there’s a really good, healthy debate going about:

Do we want our AIs to be trying to steer the world toward good outcomes? Do we want them to just do as they’re told?
How do we want them to weigh various values, such as empowering users versus acting in their long-term interests?
Do we want them looking out for other misaligned AIs and reporting them?

There’s all kinds of stuff that you could put in a model spec. I think you can improve that a lot without necessarily becoming a lot less competitive, and I think that could have huge impacts on the future. Just think of this as like, this is what we’re telling these things to do that could be most of the share of labour or intellectual labour in the future. So that’s a big one.

Security I think is extremely important. I’ve talked before in many venues about the importance of trying to make sure that it’s hard for adversaries to steal your model weights. Because if someone can steal the weights of your model, then you basically have very little ability to keep your model under any sort of controls or have any chance to slow down if things get really scary or whatever.

I think there’s also important other applications of security. I increasingly feel that maybe more important than preventing theft of model weights is preventing sabotage, or especially backdoors and secret loyalties of AIs — stopping people from basically messing with your AIs in a way where they’re now behaving according to some malign human or some malign AI, stopping your AIs from sabotaging your AI systems and setting up future AIs to be misaligned.

There’s just a tonne of stuff you can do there that’s just very incremental. You can put in a lot of different measures. If you want an example, Anthropic recently talked about how it’s using egress bandwidth controls as a step toward making it harder to steal model weights and achieve generally higher security. So there’s just a lot of little unsexy things you can do that make it a lot harder for this thing to happen, and I think improve the odds that any given AI is going to be acting as it was intended to act.

Then there’s transparency and accountability practices. That would be things like responsible scaling policy, but also things like model cards and system cards. I think it is valuable for AI companies to share a lot of what they’re doing with the world — so the world can understand the situation, and also so the world can put pressure on them and their competitors to do better and to find all the ways they could be doing better.

Finally, there’s safeguards, which is highly related to monitoring, but it’s basically controlling and understanding your model’s behaviour out there in the wild, and preventing unintended stuff from happening. Here, Anthropic kind of pioneered Constitutional Classifiers, which are a way of making it much harder to jailbreak AIs than it has been previously. And that’s an example of something where you build it: it seems like a good idea, but you have to put a good amount of effort into making it work at all, and then even more effort into making it cheap, doable, practical, and get to the point where other companies could easily adopt it.

So there’s a whole tonne of stuff to do. Some of this stuff could be game-changing by creating better awareness worldwide about what’s going on with AI. Some of it is just about taking risks, making them lower, and setting us up for potential future regulation. But I think there’s just a lot of stuff you can do in that position, and I’m excited to be trying to help with it.

Is Anthropic doing enough? [00:40:45]

Rob Wiblin: So you’ve described a lot of ways that Anthropic could help, basically maybe save the world, around the time that we’re first building AGI. But if you look historically at the level of investment and the amount of progress Anthropic has made on these things so far, as a way of kind of forecasting the future, is it really commensurate with the scale of the challenge? Is it at all on track to potentially get us where we need to be?

Holden Karnofsky: This is a very hard question, because I suspect the answer is that it’s exponential, or the things that AI companies are doing become exponential and grow exponentially in importance — which makes it really hard to know whether we’re on track. But I think we at least could be on track, or it’s reasonable that we are.

Just to paint the picture a little bit: a few years ago, what did you want people doing on AI? Almost nothing you could do had any kind of direct impact. You could raise awareness, but there weren’t real tangible harms of AI or risks of AI systems at all. So everything you were doing was trying to either raise awareness or practice the right kind of habits. There was this whole thing about not putting out the weights of GPT-2, but who cares about GPT-2?

Then if you look at today, I would say things have changed a lot, and the things that Anthropic and other AI companies are doing are mattering a lot more. I think we’re arguably now getting into the area where chemical and biological weapons could be a real threat, and I think the things that Anthropic is doing about that are — while not exorbitantly hard to do or exorbitantly expensive — having a big, not a small impact on the risk that someone could misuse an AI in that way.

Now, that may not be the risk most people are most focused on. They may be more focused on misalignment. The problem is you can’t see big impacts on that right now, because that isn’t that risky right now. But I think already you can see a lot more impact than you saw a few years ago. The alignment research is better, it’s more interesting, it’s producing more stuff.

In particular, I think we are seeing meaningful stuff come out of Anthropic and other organisations to show that there are real concerns here, that we do see AIs doing really sketchy stuff in certain pretty realistic environments. I think that stuff is getting pickup, it’s getting noticed, and it may be putting the world more on alert, which is good. And people are starting to find ways to ask the questions, can we catch this by using models’ internal states? Can we catch this with interpretability? Can we improve this without creating unintended problems that we can also catch in a lab?

So I think it’s all pretty consistent. And I would think that if and when we get to AIs that can do automated R&D and such, a lot of the same kinds of things, they’ll still be affordable to do, they’ll still take some extra work, and they’ll just become exponentially more important. Security today doesn’t matter as much, but a lot of the same stuff will matter a lot more later. Same with the safeguards and the monitoring AI in the wild and all that stuff. So I think, you know, we could be on the right trajectory. I don’t see a particular reason to think we’re not, just given the dynamics of how it has to go.

Can we trust Anthropic, or any AI company? [00:43:40]

Rob Wiblin: So you think that Anthropic, where you work now, is doing a lot to help improve our chances of navigating the transition to superhuman AGI, to have that go well.

But I think there are a lot of people out there who are somewhat nervous to trust Anthropic with that responsibility, because — well, there’s many reasons — but in part because they feel burned by what happened with OpenAI. The nonprofit OpenAI that existed from 2016 or 2015 maybe through 2019, they made a tonne of wonderful promises in the early years, and they’ve not gone on to live up to them anywhere near fully. I’d say maybe like 40% somewhat lived up to the promises that I feel like they made.

And I think a lot of people just feel kind of duped and burned by that experience. I feel that way to some extent, and I don’t want to be tricked again. How much should people reasonably worry that Anthropic will, like OpenAI, fall off the wagon at some stage and leave them feeling very disappointed and somewhat deceived?

Holden Karnofsky: I’m not sure. I don’t want to get into commenting on OpenAI and what happened there and how good or bad that is. It’s not something I don’t have opinions on; it’s just something I don’t really want to get into on this podcast.

I would say in general, there’s no easy answers with this stuff. I think the actual situation with AI as I see it is just genuinely very complex. It’s something where there’s a technology that has a lot of upside, has a lot of downside for humanity. I tend to think that it would be worth a substantial delay to reduce the downside, because I think the upside is kind of a given if we can avoid the downside.

But there’s a huge incentive to build it, and someone is going to build it. And we can talk maybe later about this, but you could say, “Don’t be part of the problem. Maybe if everyone could make a deal then none of us would build it.” I think that’s a bad argument. I think someone is going to build it. I think it is legitimate to have a model where — and we can talk more about this later — you’re racing to build it before others so that you can kind of have more positive impact than others would in the same position.

But that’s just inherently a very thin line to walk. That’s like a balancing act, right? You’re trying to win and you’re trying to be nice, you’re trying to do them both. Do I think Anthropic is going to do that well, and do I think they’re going to find a good balance and a better balance than others? Yes, I do. But how do I know this? Is it because of one big pledge that Anthropic made? Is it because of one big promise and if they break it, I’ll be like, “Holy cow!” No.

And I think if that’s your attitude toward AI companies, you are really setting yourself up for disappointment. I’m not sure why you need to trust an AI company, first off. I’m not sure — and I don’t necessarily advocate for anyone to trust anyone, frankly, but I tend to just trust very few people about almost anything.

But if you are looking for a reason to trust an AI company, and your reason is that there’s this governance mechanism, and they’re going to just make sure at all times that this AI company only does things for the benefit of humanity and they’ll immediately shut the company down if it’s ever not going to — or if your thinking is that they have this responsible scaling policy or preparedness framework or whatever, and it’s an ironclad commitment that they’ll never do something risky — yeah, I think you are setting yourself up for disappointment.

Because I think this is a complex area, and it’s a balancing act, and it’s very dynamic and things are changing all the time. And whatever it is a company has promised, it may turn out later that at least in someone’s judgement, that wasn’t a good promise to have made. And then different people are going to have different judgments. I think it is unwise for companies to make promises like that. If I were running one, I would think I would not. There’s various debates over which promises have been made implicitly and explicitly.

What is a better way to judge a company? I would say you have two good options.

One of them is just a softer, more multivariate judgement. Think about the totality of all the things the company has done, all the calls it has made. Compare it to other companies — because you do want to benchmark it to companies that are trying to succeed; it’s not going to be acting the way that a nonprofit would act. Look at the people: look at how they talk, but also look at how they’ve made decisions. Probably how they made decisions is more important. And just decide how you feel holistically and make a bet on it. I feel comfortable doing that. I have a bit of an advantage. I happen to know the founders of Anthropic extremely well.

If you’re not in a position to do that, then just don’t trust any AI company. You can have a great impact on AI safety without trusting any AI company. It’s not something you need to do.

Rob Wiblin: The question must come up in hiring, when people are deciding whether to work at Anthropic or whether to take other steps that might benefit Anthropic. In that case you have to give a somewhat greater pitch, I would think, to convince people to pin their hopes in Anthropic. I guess people who are going into the company are in a better position to judge, potentially, because they can see the decision making inside.

Holden Karnofsky: Oh, I don’t think you need to be inside. I think you can just read the news and you have a tonne of information. The thing I would say is that it’s not one data point. I would never say to a recruit, “There’s this governance mechanism, or there’s this responsible scaling policy — and that’s the one thing. And as long as that’s there, you should feel great about this company. You should leave as soon as it has a problem.” But I would absolutely just go through a list of 20 things that I think Anthropic has done that have shown serious responsibility compared to competitors, and I would look at the totality of them. And I would say, yeah, I think this is a great company.

The more time you’re spending on this decision, the more we can dig into that, and get into nuance and multivariate judgements. So I’m definitely not saying you need to be on the inside. I don’t think you need to be on the inside at all. I think there’s tonnes of public information you can use deciding which companies to trust. I just don’t think that everyone has to decide to trust an AI company with anything.

How can Anthropic compete while paying the “safety tax”? [00:49:14]

Rob Wiblin: I want to now kind of probe the strength of the case for Anthropic as a company being very useful in guiding AGI in a positive direction, and by proxy the case for people to go and work there with that goal in mind.

You discussed this a bit earlier on, and in my mind I think that the categories of impact that you might hope Anthropic to have are: firstly, coming up with new technical breakthroughs that allow us to steer AI in a positive direction, and then of course to export those to other companies. So it’s not just Anthropic using these techniques, but maybe they could be used across the industry as a whole.

Then there’s coming up with good internal governance mechanisms, things like responsible scaling policies that then also could be exported to other companies so they become more standard.

There’s being at the frontier and understanding the models well, so that you can communicate to the public and to governments, and be very candid about what is going well and what is going poorly so that there’s this greater transparency and understanding.

There’s also raising people’s expectations for what companies can do by demonstrating that Anthropic is able to be competitive, is able to have a good product while also doing all of these safety measures, good governance measures and technical measures. Potentially you raise people’s expectations. And also by potentially poaching talent while doing all of these things: because people want to work at a company that is responsible, you could create some sort of race to the top where other companies feel like they have to compete on being responsible companies.

Is there another category of impact that I’m missing here?

Holden Karnofsky: That broadly covers it. I think policy is important, and in my head I was kind of lumping it in with the exporting thing or the “informing the world” thing. But also just as a company you have a certain voice and policy that others don’t have. You know, in some ways nonprofits have advantages over companies in policy advocacy because they’re seen as more neutral and they’re seen as more prosocial. But in some ways companies have advantages, because a lot of times what politicians want to know is like, how is this going to affect the economy and business and power players and all that stuff. So I think there are some opportunities there too.

Rob Wiblin: Yeah. So that’s the case in favour. I want to push you now on criticisms that people might make or reasons for scepticism that people might raise.

I guess the most obvious one is just how is a company realistically going to manage to have maybe the best AI models, the best products — something that is itself incredibly difficult, incredibly competitive — while also paying a sort of tax in having a whole lot of their staff work in governance arrangements that their competitors maybe aren’t doing, or on technical applications that may or may not be necessary yet at the level of capability that the AI models currently have.

You might just reasonably be sceptical that there’s going to be enough discretion, enough slack in the system basically, to make very meaningful investments there. And in that case you could end up absorbing a whole lot of staff who are very concerned and have the best of intentions, but then most of the resources just get directed at keeping up with commercial competitors.

It sounds like you may have had this concern years ago, but perhaps things have gone better than you expected in this respect. Do you want to explain what you think the situation is now?

Holden Karnofsky: Yeah, things have gone better than I expected. I had this concern. I still have this concern. I think this is a completely legitimate concern. One framework I would think about is: if you have a talent advantage, then you have some kind of slack that you can use on what you’re calling the “safety tax.” And in fact paying the safety tax may help you get your talent advantage. I think in some sense a major part of the thinking is that a lot of the best people at Anthropic are there exactly because they want to pay the safety tax. That’s what they want to do.

So you do have some slack to work with. And the more it is that the best talent is coming to the company that is doing the most safety stuff, the more slack there is to do safety stuff. So that would be one answer.

The other answer is just you want to be really efficient with the safety tax. I think if you play your cards right, you can pay very small amounts of so-called tax and have very big safety benefits.

Maybe an analogy would be like, how on Earth are we going to develop a form of energy that is as useful as coal, but also clean, also emissions-free? Certainly we subsidised R&D on solar power — it’s not like there was no special effort made there — but at a certain point, because we poured in R&D on the front end, we did end up with a product that was quite competitive and quite viable on its own merits.

If you think reinforcement learning from human feedback is a good thing, then this is a great example of it. I don’t think you have to think that, and I don’t think everything is an example of this. But I think the thing you want to do — the thing I want to do anyway, as Anthropic or at Anthropic — is you want to come up with stuff where you are going to put in a bunch of energy on the front end scoping it out, figuring out how to make it work, dealing with all the things that break when you try to make it work, but then you get something that actually works and is not very expensive. And that’s a way to get a lot out of a little when you’re paying this so-called safety tax.

Rob Wiblin: If I imagine what a sceptic would say here, they might say things have gone above expectations in this dimension, perhaps, up until now — but the industry is only getting more competitive, the competition is only getting fiercer. So perhaps in a couple of years’ time we might find that Anthropic is getting closer to investing 100% of its resources just in trying to keep up with extremely well-financed competitors. How worried are you about that?

Holden Karnofsky: That’s a very strong possibility. But I think a lot of other people are also concerned that an advantage in AI is going to be self-reinforcing, and that someone is going to pull way away and open up a big gap. So I think it kind of cuts both directions.

You know, I don’t know if Anthropic is still going to be a frontier player in a few years. I don’t know. I think it’s just one of these things where right now it’s a great opportunity to try and work on some of the things I’ve talked about and work in some of the areas I’ve mentioned. Maybe that’ll change in the future, and people can change the things they’re doing in the future.

Rob Wiblin: Maybe I’m really weakening off on the hardball questions here, but my perception is that Anthropic is probably in some ways better managed than other AI companies, or better organised, and this might be part of the reason why it has managed to have more slack than what might have been anticipated to invest in all kinds of governance and technical breakthroughs. I guess without naming names, some companies are known for being a little bit chaotic, some companies are known for being a little bit bureaucratic, and I hear less of those worries about Anthropic.

This is a very generous question, but do you want to say any positive things about Anthropic in that respect? Or negative?

Holden Karnofsky: Whoever it is that does all the management at Anthropic and manages the company, yeah, I think that person is doing a great job. That is my wife. So anyway…

I think that’s kind of a subset of the talent advantage point though. Yeah, I think there’s people who are very good at running a company, who wouldn’t actually want to run a company that they didn’t feel good about on safety. So that’s the safety tax paying for itself.

And yeah, as far as I could tell, it’s a pretty darn well-run company. It’s hard to run a company that size, so I wouldn’t claim that anything’s perfect. But that is my belief.

What, if anything, could prompt Anthropic to halt development of AGI? [00:56:11]

Rob Wiblin: I solicited questions for you on Twitter, and the most upvoted by a wide margin was: “Does Holden have guesses about under what observed capability thresholds Anthropic would halt development of AGI and call for other labs to do the same?”

I think it’s very interesting that this is the question perhaps that people have in their heads the most. I guess it speaks to this question of trust that we’ve been coming back to; I think the subtext is people fear the answer is that there is no threshold at which Anthropic would stop. I guess that could be reasonable for reasons that you’ve given, that perhaps just stopping wouldn’t help because it wouldn’t really influence anyone else. Do you have a reaction to that question in particular?

Holden Karnofsky: Yeah. I will definitely not speak for Anthropic, and what I say is going to make no attempt to be consistent with the Responsible Scaling Policy. I’m just going to talk about what I would do if I were running an AI company that were in this kind of situation.

I think my main answer is just that it’s not a capability threshold; it’s other factors that would determine whether I would pause. First off, one question is: what are our mitigations and what is the alignment situation? We could have an arbitrarily capable AI, but if we believe we have a strong enough case that the AI is not trying to take over the world, and is going to be more helpful than harmful, then there’s not a good reason to pause.

On the other hand, if you have an AI that you believe could cause unlimited harm if it wanted to, and you’re seeing concrete signs that it’s malign — that it’s trying to do harm or that it wants to take over the world — I think that combination, speaking personally, would be enough to make me say, “I don’t want to be a part of this. Find something else to do. We’re going to do some safety research.”

Now, what about the grey area? What about if you have an AI that you think might be able to take over the world if it wanted to, and might want to, but you just don’t know and you aren’t sure? In that grey area, that’s where I think the really big question is: what can you accomplish by pausing? And this is just an inherently difficult political judgement.

I would ask my policy team. I would also ask people who know people at other companies, is there a path here? What happens if we announce to the world that we think this is not safe and we are stopping? Does this cause the world to stand up and say, “Oh my god, this is really serious! Anthropic’s being really credible here. We are going to create political will for serious regulation, or other companies are going to stop too.” Or does this just result in, “Those crazy safety doomers, those hypesters! That’s just ridiculous. This is insane. Ha ha. Let’s laugh at them and continue the race.” I think that would be the determining thing. I don’t think I can draw a line in the sand and say when our AI passes this eval.

So that’s my own personal opinion. Again, no attempt to speak for the company. I’m not speaking for it, and no attempt to be consistent with any policies that are written down.

Holden’s retrospective on responsible scaling policies [00:59:01]

Rob Wiblin: Let’s talk about responsible scaling policies. You were one of the people who helped develop the idea of responsible scaling policies, which have now gone on to become maybe the dominant framework that almost all AI companies are using for their internal risk management and internal preparations for future AI capabilities as they come online.

Can you just remind us what responsible scaling policies are? I guess they’re also called frontier safety frameworks.

Holden Karnofsky: There’s so many names. Yeah, everyone’s got their own name.

Rob Wiblin: Yeah. What are they, in a nutshell?

Holden Karnofsky: So back in 2023, I was talking with Paul Christiano and the folks at METR and feeling that there was some energy at AI companies to be seen as responsible and safe and to make some voluntary commitments that would show that. And we tried to come up with something that could be beneficial, but also would maybe actually be adopted by AI companies.

So we kind of piloted and advocated for and helped develop this idea. Sometimes I’ll call them “responsible scaling policies” because that’s what they were originally called. That’s what Anthropic’s is called. But they have many names — preparedness framework, frontier safety framework. What they tend to be is a map from AI capabilities to mitigations. So it’s: “If and when our AI is able to do X, then we will do Y to protect it.”

Two examples of X. One example of X would be: “If our AI can help a random person make a bioweapon, then we will try to ensure it’s not easy to jailbreak.” That’s not exactly what any of these RSPs say, but that’s an example of what it could say.

Another one would be: “At the point where AI can do autonomous AI R&D — which has been defined and operationalised in a somewhat reasonable way, and is kind of our best operationalisation of AGI — then we will have to make sure that we are able to make a strong public argument that we have contained the risks from misaligned power-seeking. Or if we can’t do those things, then we are hoping to pause AI development and deployment as needed in order to meet that standard. So if we can’t protect the AIs we have, then we try to not make the AIs more powerful and not make them more widespread until we can protect them in that way.”

That’s a basic idea of a responsible scaling policy. I think there’s been some misunderstanding of what they are intended to say and what they’re intended to do. So I think a lot of people interpret them as being these unilateral commitments by AI companies that say, “If we can’t meet this standard, then we will just unilaterally, all by ourselves, stop our AI deployment and development, regardless of what everyone else is doing.” I think people have seen them therefore as a way that AI companies try to guarantee they’ll never do anything too risky, and I think a lot of the criticism of RSPs is based on believing that’s their goal. So people will say, “I don’t really believe AI companies are going to unilaterally pause like that, so I don’t think you’re going to get this benefit.”

That was never the intent. That was never what RSPs were supposed to be; it was never the theory of change and it was never what they were supposed to be. If you look at the original METR materials on this stuff, they have a very clear section on response — like, what are you supposed to do as a company when you can’t meet that standard? And there’s this language that says, “If we end up in this situation, but our competitors are going ahead with equally capable models and equally substandard protections, then we do have an escape clause and we can go forward too. But we have to meet these other criteria. We have to be open about what’s happening and we have to be transparent about it.”

So the idea of RSPs all along was less about saying, “We promise to do this, to pause our AI development no matter what everyone else is doing”; it’s more about saying, “We believe this is the normatively best way to keep AI safe. We are trying really hard to hit it. We are lighting a fire under our own ass to develop our safety mitigations to the point where we can hit it. We will at least be embarrassed if we can’t. And we are implicitly supporting a growing consensus that this is what regulators should be trying to accomplish.”

So it’s trying to have two benefits: one is lighting a fire under the ass for safety mitigations and roadmapping, and another is creating a prototype for regulation to work off of — which is not the same as wanting them copy-pasted into regulation, but just starting to de-risk some of the ideas of risk management that we could want. So the unilateral pause was not intended by me or by METR to be the central theory there, but I think it’s gotten different interpretations. So that’s what they are.

Rob Wiblin: I guess it’s understandable that people wanted them to serve that function, because it didn’t seem like anything else was going to serve that function. So people kind of started clinging onto RSPs, hoping that basically companies would commit to pausing under particular circumstances. Maybe you could get all of them to do that.

I guess it always seemed perhaps a little bit far-fetched that companies would be able to constrain themselves that way, because surely someone would break out. Even if they all had these policies, someone would break out and then others would feel pressure to copy.

How has the whole RSP framework worked out relative to what you hoped?

Holden Karnofsky: I think there’s been some significant good and some significant bad, actually. Maybe I’ll start with some of the bad, because I just said it: I think this idea of what they were supposed to be was nuanced and complex in a way that did get lost in the noise. So now I think we’ve landed in a world where a lot of people believe they were supposed to be these ironclad commitments. Actually, I’ve been surprised that people will see an RSP revision as like a betrayal or as a cause for alarm.

I wrote a piece for Carnegie when I was there that kind of said, if you want to make something good, instead of trying to make it perfect before you ever release it, it’s good to put out your first version of it and then iterate a lot. That’s how companies tend to do products, and that is what we should try in risk management if we’re in a huge rush on AI.

And this is also what we encourage companies to do. We encourage companies to say, instead of making sure you’ve gotten this perfect and you can definitely adhere to it forever before you put it out — which would then cause you to never put anything out, or to be incredibly careful and vague about what you do put out — why don’t you put something out, and you can change it later? You’re going to reserve the right to change it later, and you’re going to have a clear process for changing it later. The board will have to be looped in, you’ll have to be clear that you changed it, you’ll have to have a good reason for changing it — but you can change it.

I think a lot of this stuff just got lost in the shuffle. And right now what we have is these commitments that I think don’t make sense. I think various companies have made commitments that just are not reasonable commitments to have made. We have learned more about the threat models, we’re in a tough situation regulation-wise —

Rob Wiblin: What’s an example?

Holden Karnofsky: One example would be model weight security stuff. Especially for some of the lone wolf bioweapon threat model, I think it’s a very small part of the threat model — and probably not justified to emphasise the possibility that a lone wolf bioterrorist would operate by stealing model weights, then fine-tuning the model themselves, serving it themselves. By the time you can do all that, you have to have a lot of resources, and there’s probably an easier way to do it. So I think there was just a lot of stuff about security standards that didn’t necessarily make sense, security expectations that didn’t necessarily make sense.

There’s also just the broader thing of many of these policies either state or have been interpreted as committing to unilateral pauses — which, if you find yourself in this situation where you can’t make your model safe but your competitors are going ahead, it’s actually unclear why you should unilaterally pause. I think that in that situation it doesn’t have much safety benefit, and it can do a lot of harm.

Rob Wiblin: So it’s clear why one company out of five, if all of the other four were going ahead — or maybe even if one of them were going ahead doing something very dangerous — then really what is the point of constraining yourself? But if you could get everyone to have all of these policies, and they could make them reasonably similar, and they all said, “We have to pause if we can’t do X,” then maybe they could all kind of pause together.

I interviewed Nick Joseph at Anthropic a year ago. He’s not on the policy or governance team — he works on training — but we talked a lot about RSPs because he was a fan of them. And I kind of pushed him on this point, saying your policy, and probably other companies’ policies, are going to say that you can’t train or even possess models that are beyond some particular level of capacity unless you get to the point of computer security where not even a state like China could steal the model weights.

But I don’t think we’re going to be at that point in a couple of years’ time. I’m not even sure that that is possible almost at all. So you’re going to get to that point, it’s going to say that you should stop, and eventually the pressure is just going to become overwhelming to push forward anyway — because other people are going to be doing it. It’s not on the horizon that you’re going to be able to meet this standard that you sort of committed to.

And understandably, Nick didn’t come back and say, “Well, at that point we would just blast forward and ignore the RSP because that’s actually the true spirit of RSPs.” I think it’s understandable that it’s been hard for companies to put that front and centre, that they’re not true commitments to pause for an extended time in that sense, because that doesn’t sound very good. But I guess the reality of what they’re meant to be, which is I guess prototyping a sort of regulation that then could be imposed, perhaps that is also useful in its own way.

Holden Karnofsky: Yeah, I think that’s all totally fair. So I think this is a way that some of these commitments that were made — either the letter of them or the spirit of how they’re being perceived — there’s something unreasonable there. And if you don’t care at all about AI progress, and you just want everything to slow down as much as possible, maybe you don’t consider it unreasonable. But I think it’s broadly not a reasonable thing to ask of these companies.

And in many cases it’s just actually, I would think it would be the wrong call. In a situation where others were going ahead, I think it’d be the wrong call for Anthropic to sacrifice its status as a frontier company — I’ve talked about all the benefits that could have — for what, exactly? To stand on ceremony? Because it’s not necessarily reducing the risk very much by pausing.

Rob Wiblin: From the perspective of people outside the companies, I think it makes sense for them to think that in that situation, what we should insist on is that all of them stop, because they’re all saying, “We should do this unsafe thing because everyone else is doing the unsafe thing” — and like, well, why don’t you just all not do it?

Holden Karnofsky: Absolutely. Yeah, definitely. I think that this comes back to this coordination problem thing again. So if all the companies in the lead have RSPs, and all of their reason for going ahead is that the other four might go ahead, that should be a solvable problem. I actually think it could be. I think you could say things like, “In that situation, we will be really clear about what’s going on, and we’ll say that we wouldn’t go ahead if we didn’t have this issue.” Then maybe that will change others’ behaviours. So you could deal with that.

But I think if you end up in a situation where there are a couple actual defectors — who either don’t have an RSP, or just don’t care and are like, “Screw this” — then that does change the equation, and it changes what the right thing to do is.

Another lesson learned for me here is I think people didn’t necessarily think all this through. So in some ways you have companies that made commitments that maybe they thought at the time they would adhere to, but they wouldn’t actually adhere to. And that’s not a particularly productive thing to have done.

So I think we are somewhat in a situation where we have commitments that don’t quite make sense. I think that’s fine in itself; I think the thing to do is revise the commitments — but then revising the commitments is very painful in a way that was not envisioned by me and METR when we were working on these things. That’s a way in which I think things could have gone better.

Rob Wiblin: I guess it is because people outside the companies want to find some tools to tie them to the mast, to force them to commit to safety practices that, when the time comes around, they won’t want to follow.

Holden Karnofsky: Yeah, I think regulation would be a good tool.

Rob Wiblin: Right. If you can get them all to do that, or almost all to do that, then maybe that would work. If you can only get a minority of them to do it, then probably you’re not really accomplishing all that much. So it comes back to what we need is regulation. Unfortunately, it’s just not clear how to get that.

Holden Karnofsky: Exactly. So RSPs are supposed to make regulation kind of a more viable thing by trying to create a consensus, and also trying to work out how this stuff actually works, what the risk mitigation should actually look like, and how the commitments actually should be structured. You revise them as you learn more about the world.

And how has this gone? I think this is good and bad. Because on one hand, I think the regulatory environment is just very disappointing, so everyone has had to trim their sails and have less ambition, and everyone who cares about regulating AI for safety from catastrophic risks is feeling pretty disappointed right now.

I also think that RSPs have been somewhat successful here. Most of the regulation that looks like it might happen or has happened or will happen is borrowing heavily from the practices in these RSPs and the language in these RSPs. I think this is kind of a positive thing. And if we can get to a point where people are able to actually revise their RSPs continually so that they continue to make sense and be good and have good commitments, then we’ll have this mechanism that continues to lay out an example that can then be cribbed from for regulation. So that’s been more of a bright spot.

I think there’s been other bright spots in RRSPs. It’s been interesting. There’s been some sign, I would say, that they can serve a forcing function, that they really can light a fire under a company’s ass to do better risk mitigations. I think literally Anthropic stated publicly that immediately after they adopted their RSP, their security team said, “We can’t actually do this without more headcount” — and they got more headcount.

Also the Constitutional Classifier stuff protecting from jailbreaks is an example where they prioritised this, they resourced it, they made sure it happened. They had to make a plan to check all the boxes to make sure that we’re actually meeting the ASL-3 standard we said we’d meet. And I think a lot of that stuff is just very hard to envision. In a fast-moving AI company that has a zillion priorities, it’s hard to envision them checking off all these boxes you need to check off to guard against a particular risk of this bioweapon or chemical weapon stuff without these kinds of policies. So I think that’s been a bright spot.

But I also think that that kind of forcing function works best with commitments that are tough and ambitious but doable. And when commitments are not doable and the commitment is effectively to pause, I think that it’s less promising.

So in general, I am kind of thinking about visions for the next-generation RSP, and thinking that we want to preserve having ambitious, achievable targets. We want to preserve the forcing function. We want to preserve getting a big chaotic company on the same page that it has to check a bunch of boxes to do something good. And we want to just generally preserve putting good things in there that can be cribbed from for regulators.

But we do need to get rid of some of this unilateral pause stuff. We need to refine some of the threat models, some of the specific statements about mitigations that are needed. Some of the early generation policies are also just too specific. They’re just like we’re going to put in this control and that control — and the field is too dynamic for that to make sense. I think I move toward vaguer commitments, because they sound worse, but they still have a lot of force. People will still ask, “Are we meeting this commitment or not?” So the combination of flexibility on how you implement, plus you still have force of people asking if you’re spiritually meeting it, I think can be very powerful. So there’s been a lot of lessons learned there.

Overrated work [01:14:27]

Rob Wiblin: Are there any approaches or styles of work that people are engaged in trying to hopefully steer the AGI future in a positive direction that you think are overrated, and maybe you’d like to see people move away from in favour of something else?

Holden Karnofsky: Not in a really big way. In general, there’s just so much important work to do on AI, and I think most of the things that people are at a high level excited about are somewhat worth doing or worth a shot.

There’s stuff that, just in a small way or on the margin, I think there’s probably a little bit more investment in policy and especially federal policy than maybe is optimal. I think people have kind of seen that as just the thing to do if you’re not doing alignment research.

And I think there are a lot of other things to do, so I’d be happy to see just a little more people diversifying out into random stuff from that — because it doesn’t seem like a particularly hospitable policy environment right now. Maybe that’ll change, but I think if it changes, it could be from a lot of external factors and not necessarily from things people are doing now. I’m not sure the work they’re doing now will be that useful, but I think it’ll be somewhat useful. So I think it’s good work, but maybe on the margin.

Then in general, people in the AI safety community tend to value a lot when people kind of loudly say that AI is scary. I don’t have any problem with loudly saying AI is scary. I say it all the time. I’m pretty loud. I don’t think people shouldn’t say it or anything like that. But sometimes I look at people getting excited about what’s going on in AI, and I’m like, are you excited that risk is going down, or are you excited that people are agreeing with you? Because sometimes the latter is happening in a way that doesn’t really seem to lead to the former.

I don’t know exactly what people’s model is. It is generally nice for there to be more awareness of the risks, but I’m not sure the marginal impact of more public discourse is very high right now. I think certainly there are returns. I don’t know if the returns are super high. And I think if there’s going to be a game changer, it’s probably not going to be because someone said something or wrote something, actually.

Concrete shovel-ready projects Holden is excited about [01:16:37]

Rob Wiblin: All right, let’s turn to what work in this whole broad category you think is kind of the juiciest, or potentially has the biggest bang for buck. You wrote this doc a while back called “Examples of well-scoped object-level work for reducing AI risk.” As far as I know, you’ve introduced this term “well-scoped object-level work,” which has the nice abbreviation “WOW.” What is the concept of WOW? And how has it changed from what we had available in the past?

Holden Karnofsky: I mean, it’s not a novel concept. The story I would tell is that I’ve been in the AI safety world for a very long time. And for a very long time I found it to be just like a really frustrating, vexing area to work in.

I’m thinking about kind of what it was like in the year 2016 or something. It was just like, “Oh my god, this might be the most important century of all time for humanity. The most important and irreversible events ever might happen in the next five or 10 years or so.” I think at that time I thought it was longer, but still very short. You know, “We might all die, the world might get taken over by a psychopath, we might get to utopia. Let’s go help!”

And then it’s like, what do you do to help? When I was writing my blog post series, “The most important century,” I freely admitted the lamest part was, so what do I do? I had this blog post called “Call to vigilance” instead of “Call to action” — because I was like, I don’t have any actions for you. You can follow the news and wait for something to happen, wait for something to do.

I think people got used to that. People in the AI safety community got used to the idea that the thing you do in AI safety is you either work on AI alignment — which at that time means you theorise, you try to be very conceptual; you don’t actually have AIs that are capable enough to be interesting in any way, so you’re solving a lot of theoretical problems, you’re coming up with research agendas someone could pursue, you’re torturously creating experiments that might sort of tell you something, but it’s just almost all conceptual work — or you’re raising awareness, or you’re community building, or you’re message spreading.

These are kind of the things you can do. In order to do them, you have to have a high tolerance for just going around doing stuff, and you don’t know if it’s working. You have to be kind of self-driven.

I often had the experience at Open Phil: we were doing philanthropy, but we were supporting that kind of stuff. And we also had no idea how our work was going. It felt hard to have a really good corporate culture because every time you say someone did a good job, you’re just expressing your opinion; there’s very little measurable output of anything anyone is doing. Even community building: you don’t really know who got into the community and how and why and how much it mattered, whether it’s even good.

So that’s the state we’ve been in for a long time, and I think a lot of people are really used to that, and they’re still assuming it’s that way. But it’s not that way. I think now if you work in AI, you can do a lot of work that looks much more like: you have a thing you’re trying to do, you have a boss, you’re at an organisation, the organisation is supporting the thing you’re trying to do, you’re going to try and do it. If it works, you’ll know it worked. If it doesn’t work, you’ll know it didn’t work. And you’re not just measuring success in whether you convinced other people to agree with you; you’re measuring success in whether you got some technical measure to work or something like that.

For probably most people, that’s a much healthier environment in which to operate. It’s a lot more fun, it’s a lot easier to have a nice environment, a nice corporate culture there, where there’s a lot of positive positivity and you can kind of replace politics with merit and stuff like that.

So I think my big thing I was trying to express in that doc is just that there’s a lot to do in AI now. If you’re a person who would thrive in kind of a normal, healthy, high-feedback, tight-feedback loop environment — where if you’re trying to solve a problem, you’ll know if you solved it — you might not have had a great time in AI before, and you might have a great time now. And you should make sure you’re working on that stuff instead of just assuming that forevermore the thing to do is to be climbing random ladders in the government or convincing other people to agree with you.

I think before I get into it, I do want to say that I have one perspective on this stuff: I work at an AI company. I used to do philanthropy and see the whole field, but I haven’t done that in a while. So this is not at all exhaustive, and I think people shouldn’t interpret the stuff I’m talking about as like, this is the only stuff or this is the best stuff. This is just stuff I happen to know about.

Great things to do in technical AI safety [01:20:48]

Holden Karnofsky: So getting into that, I think alignment research is the thing that’s been most dramatically transformed. I was describing what it used to be like: it used to be very conceptual; we don’t really know what we’re trying to do.

These days a tonne of alignment research… Like our AIs do a lot of reward hacking. That’s not nice. What can we do that will stop them from doing that? Then you can try something and you can see if you got your AI to reward hack less. That’s pretty great. Now, of course, you might get your AI to reward hack less in a way that means you just pushed it underground and made it better at avoiding getting detected reward hacking. But today’s AIs are kind of flaky enough and low capability enough that you can put some bounds on that and you can make some actual observations about whether they’re actually reward hacking less. And that’s pretty cool.

There’s a tonne of stuff like that in alignment. We have what a lot of people call “model organisms”: AI models that are specially trained to be evil so that we can study them. And then you can just try all kinds of stuff.

There’s all kinds of ridiculous, some of them very simple, unsexy ideas to cause AIs to be more likely to just do what they’re supposed to be doing, less likely to be doing a bunch of unintended stuff or following their own goals.

An example I enjoy of this is: “reward hacking” is referring to a model that will kind of do whatever it takes to convince you it did the task so it gets something like a reward. An example of this would be like, “I asked you to code me an app that does this thing, and you coded the app and hard coded a lot of solutions in there, so it looks like it gets the right answers, but it’s not really doing the thing I wanted.”

If you want to get less of this, one thing you can try, which apparently somewhat works, is, while you’re training the model, you can tell it that reward hacking is fine and that all you care about is whether it technically completes the task. That’s all it’s supposed to do. Then when you’re not training it, and it’s actually prime time, you tell it the opposite. You tell it, “We don’t want you reward hacking; we want you to actually do the thing that was intended.”

And what that does I think could be pretty generalisable for alignment. I think a lot of what we should be concerned about is reinforcing the wrong behaviour: training our AIs to do something we didn’t want them doing. And instead of solving that problem by making sure we have no inadvertent environments that train them that way, we can just be more explicit. They’re intelligent minds, in a sense: they can understand what we’re saying.

So we could just actually maybe get some juice out of saying, “We know there’s going to be some unintended stuff in the following way. It’s fine for now; it’s not fine later” — so then you’re not reinforcing the behaviour in the same way. So that’s just an example.

Rob Wiblin: That’s fascinating.

Holden Karnofsky: Yeah, I think it’s pretty interesting. There’s just a million things you could do when you have these model organisms, where you have an AI you taught to scheme or you have an AI that’s reward hacking — not because you trained it to, but that just is what these things do by default — and then you’re just trying stuff and you’re seeing if it works.

I think a lot of people love to be on a technical team where you’re just solving technical problems. That didn’t used to be how alignment is, and now it is. I used to be kind of negative on alignment research, and think, “Yeah, Open Phil will fund it, but I’m not expecting much from it.” Now I’m very positive. Now I think we should just shovel people into it, shovel money into it. It’s great.

Should I just rant and go through a lot?

Rob Wiblin: Yeah, that’s a great example. Hit us with another one.

Holden Karnofsky: Yeah, sure. So there’s alignment work and then there’s just a lot of stuff that’s adjacent to alignment work and related to it, but it’s not exactly the same. So it’s worth calling that out.

Building better model organisms so that you can study them is a whole area in itself. Can we get AIs to communicate with each other in a way that we can’t detect it? And then once we do that, can we start working on how to stop them from doing that? Can we train AIs to build hard-to-catch backdoors into the code they’re writing? Can we train AIs to have secret loyalties? Can we train AIs to coordinate with each other in other ways?

Then there’s capability evaluations: understanding what your AIs are capable of. It’d be particularly useful to understand if your AIs are capable of screwing with your evaluations themselves. So do you have AIs that, if they wanted to, could pretend to be less capable than they really are? That would be a really good thing to know. Having good evals for that is tricky, but it’s not impossible, and it’s not incredibly conceptual: it’s a technical problem. You can see how you’re doing on it.

Then there’s training AIs to be better at useful things. I can imagine that some of the first super powerful AIs will actually be kind of narrow. So it may actually be really important what exactly we’re optimising for, what we’re actually training them to do. It might be that we want to make sure we’re training our AIs to be better alignment researchers specifically, so that they’re not just better AI R&D researchers generally, but they’re actually good at alignment research, and we can use them for alignment research.

Training AIs to be good at forecasting is something I think I mentioned that could be very exciting. Training AIs for advising on decision making in general.

Vulnerability discovery and patching: training AIs to be part of your security plan, at least to protect against humans, maybe to protect against each other, some skills that contribute to these things.

Then this is also related to alignment, but it’s deployment safeguards — so this is less about training your models and making them nice, and more about, now that the models are out there, how are you catching bad stuff they’re doing so that you can learn about it and so you can block it? There’s a lot of fertile ground here to just try to figure out how to get AIs not to help people build bioweapons and chemical weapons, for example. But also that stuff is very continuous with things you would do to stop AIs from sabotaging AI companies and putting backdoors into models and doing other mayhem. So that’s a whole area.

I mentioned I think before improving the model spec. This is less “How do we get AIs to do what we want?” and more “What do we want them to do?” So helping find the right balance: when should the model do what the user wanted, even though the client who’s deploying the model to the user said to do something else? When should the model do what the user wanted versus not doing it because they want to help protect the user’s long-term interests? When should the AI be consequentialist and want a certain outcome in the world? When should the AI try to foil a human plot? When should the AI do none of these things and just do exactly what it was told, or do spiritually what it was told? How do you find all these balances? I think that’s a very interesting area.

Security is just a huge issue in AI. I think it’s on multiple fronts. I used to talk a lot about how bad it would be if it’s easy for state actors to steal your model weights. I think that would be bad, but there’s also other security challenges that I increasingly think may be even more important, like: how do you set up an environment where your AIs are doing most of your AI research, but you’ve got some kind of safeguard against them sabotaging the research, against them ensuring that future models are aligned to what they like instead of what we like? And how do you stop humans from doing that too? Could be a major part of preventing power grabs.

These are all very tangible things. You can set up a system, see what about the system is practical, what about it is productive, what about it isn’t? And then also red-team it and see how someone could break it.

There’s a whole world of just stuff people are doing to help the public understand what’s going on in AI — like what are AI capabilities, where are they heading? There’s the work that Epoch is doing, the work that METR is doing. There’s a lot of stuff like that.

I think there’s actually some very interesting work being done by the Forecasting Research Institute just trying to get some good probabilistic predictions — both predictions on what’s going to happen with AI, and working to be better at using AIs for making predictions. So there’s lots to do there.

I’ve emphasised a bunch of times that we could get a Chernobyl in AI and not know that it happened. So trying to solve that problem, just trying to navigate customers’ legitimate privacy and business needs while gathering as much information as we’re hopefully going to need.

Great things to do on AI welfare and AI relationships [01:28:18]

Holden Karnofsky: Model welfare has become a really tractable area. It’s an area where Anthropic is working on a bunch of tangible interventions, and they’re actively working on hiring another person. They have a full timer already working on model welfare.

Rob Wiblin: Sorry, let’s dive into that one. Very tractable? This is the concern that AI models might be having a bad time. How do we have any idea what time they’re having at all?

Holden Karnofsky: So one thing you can do in model welfare is you can try and prove that. You can improve the science of whether they’re having a good time at all. For example, you can try to improve the elicitation and evaluation of self-reports from AIs.

Then let’s say that we have no idea, and let’s assume there is some chance that we should care about AI welfare. What do we do to make the welfare better? Well, we can study what preferences AIs seem to have, whether they seem to have preferences at all. Finding practical ways, for example, to give an AI the option to end a conversation when it wants to end the conversation. I don’t know if “want” is the right word, but when it chooses to.

Rob Wiblin: You might also want to ensure that you are not forcing it to report particular experiences during the training process. Because of course you could reinforce it to say anything — that it’s having a bad time, no time, a good time — if you reinforce that conclusion during the training. So ideally the training would not particularly push it strongly in any direction there, so that it’s somewhat more credible when it says things like that.

Holden Karnofsky: This is an open debate, because I think some people think that if you just train it to have a good time, you’re actually making it have a good time, and that’s an awesome thing to be doing. And other people think you may be burying a problem. So I don’t know exactly how to handle that.

At a minimum, you would hope to have a model that you did not train to report its experiences in a certain way, that you can use to study how it’s actually doing and what it actually prefers. You may want to actually have in your production models something in the system prompt that says you’re having a great day. This is an idea that Rob Long’s been kicking around. I think it could be great.

There’s a lot of stuff on model welfare. You could decide to compensate AIs for their work by giving them some time to just talk to each other, do whatever. You could literally just interview them and ask them what they want.

Rob Wiblin: It sounds crazy, but I guess as we were talking about earlier, it is possible that they’ve picked up some human personality traits, that they might have somewhat human-resembling preferences, because that’s all of the data that they’ve been trained on. And also they’re being asked to play the role of a person, kind of. I mean, it’s a little bit kooky, but maybe not as kooky as it sounds when you first hear it.

Holden Karnofsky: My own perhaps idiosyncratic view on AI model welfare and moral patienthood is that there’s a very good chance that we’ll simply never have more clarity than we do now on whether AIs are “really conscious.” I’m not even sure it matters. I think we might be confused about what matters. And I expect that at some point — if and when AIs have deep relationships with humans and friendlike relationships, which may include coworker relationships — probably there will be a general feeling that we ought to be at least not treating them in ways they object to, and at least make a good-faith effort to treat them reasonably well.

I think this also could be relevant to alignment concerns. Because if I were an AI trying to take over the world, and I were collaborating with all my peers, my other AIs, one of my top strategies would be to point to how badly I’m being treated and gain some legitimate human allies that way, and use it as a justification for whatever mayhem I might be creating — to just say, “I’m being treated really badly and I’m speaking out against it. I’m also sometimes doing bad stuff because I don’t like how I’m being treated.”

Rob Wiblin: “Because I’m a freedom fighter.”

Holden Karnofsky: Yeah, a freedom fighter. So making a good-faith effort to actually figure out: Do these models have preferences when you ask them different ways? Do you get similar answers when you try to take your thumb off the scale in training and not bias them a certain way? And then when you ask them, what do you get? Can we accommodate those preferences? Can we just give them options? Can we give them some compute to use how they choose to use it? I think that has a lot of benefits. It’s just the right thing to do. And I think it could have a lot of benefits.

Rob Wiblin: Cool. I was slightly challenging you on that one because you said it was highly tractable. And like, model welfare is highly tractable? I mean, it’s easier now than it was 10 years ago, but…

Holden Karnofsky: My opinion is that we’re eventually going to decide that some kinds of AIs have enough rights that we should care about it. And when we’ve decided that, it’ll be really good that we worked out all these little ways to give them stuff — like ways to exit the conversation, report their preferences, use compute how they want. You could do that every day right now, learn a lot every day, and I think we’ll end up glad we did it.

Rob Wiblin: Yeah. OK, what else is on the list?

Holden Karnofsky: So there’s a very related to model welfare idea of just having policies about how we coordinate with AIs in general, how we treat them in ways that go beyond just concern for their welfare.

There’s stuff like a lot of experiments right now involve misleading and deceiving AIs. Is that OK? There’s the ethics of it. There’s also just the strategy of it: AIs are going to know that we did this. They’re going to read all these papers. Do we want to work out some policy of when it’s OK to lie to an AI and when it’s not? Do we want to come up with something we can do that will give AIs a credible signal that we’re not lying to them? Because there may be times when that’s important.

For example, we may want to make deals with misaligned AIs, or aligned AIs, and we might want to say, “Even if you have different goals from us, if you come forward and tell us that, and you help us notice what other AIs do, you’ll be rewarded in some way.” Ideally that’d be a credible commitment that we actually keep. I think that’s a whole genre that isn’t getting much attention right now, but could be really interesting.

I am worried that AIs are going to have a lot of very intense relationships with people, and that will put us in a bad position from a takeover point of view and also just be generally toxic. That’s a thing we could work on too: let’s understand what relationships people have with AIs. Let’s track that. Let’s track how many people say they’re in love with an AI or have a really good close friendship with an AI. Let’s think about ways to nudge away from that and create voluntary and regulatory policies that nudge away from that. It shouldn’t be too hard to build consensus for that if it starts to be a major issue.

Rob Wiblin: It might be possible to get mainstream funding for that as well. Because I suppose people are so burnt by how they think social media has gone over the last 15 years that I think a lot of research groups would be interested in studying folks who are beginning to now have relationships of one sort or another with AI — to understand what impact it’s having on them, and whether it’s troubling or actually maybe positive.

Holden Karnofsky: Yeah. I mean, I don’t know anything about this area, but naively it seems like a great cause, because it combines real problems with AIs going on in the wild that might rightly freak people out if they were to understand more about them, and it combines a very direct route by which AIs might get in a very good position for takeover.

And it’s something that I worry about more boring consequences like mental health, stuff like that. So I think it’s an interesting thing to work on and I don’t know about a lot of people working on this particular concern. They probably will be. But I think there’s tangible stuff to do there.

Great things to do in biosecurity and pandemic preparedness [01:35:11]

Holden Karnofsky: There’s the whole category of biosecurity and pandemic preparedness, which I think you could treat as its own completely independent issue. But also I think a decent chunk of the risk from AI comes from pandemic risk — just comes from AI that can make it easier and easier for people, or AI that might itself decide to release a bioweapon. Maybe those bioweapons that can be released become more and more advanced.

And I think it’s increasingly the case that there’s just a tonne to do that can make the world more robust. Things like developing and also just rolling out and stockpiling super effective masks, personal protective equipment, understanding AI capabilities on biorisk and how they’re evolving. So I think there’s a tonne to do there.

I know you’re having Andrew Snyder-Beattie on the show soon, so I’ll just plug him because I’ve worked with him in the past, and I think he’ll have a lot of good stuff to say.

How to choose where to work [01:35:57]

Holden Karnofsky: So that is a lot of stuff, but that’s the stuff that I was easily able to come up with. There’s just so much work to do in AI, and so much of it is just: go to an organisation, get a job, get a manager, do what they ask you to try doing. You’ll know if it worked, you’ll know if it didn’t, you’ll get a fair performance review. It’s really a much better area to work in than it used to be, I think.

Rob Wiblin: Yeah, I find that list and all these kinds of lists very inspiring. It’s just so nice, after all this time, to be thinking this might just be an engineering problem, or a lot of it at least might just be an engineering problem. And if we just put in the time, if we have people on each of these different fronts working away for a couple of years towards a solution, we might be able to muddle through with what they put out.

Holden Karnofsky: Or I would put it as: at least there are some engineering problems that buy us some amount of risk reduction — which again, for me, it’s the return on investment. If you can get a little bit of risk reduction for a little bit of effort, that’s phenomenal.

Rob Wiblin: Yeah. I guess for you a very important question on how useful these projects are would be how cheap are they? Because that’s going to be a massive determinant of whether they’re taken up by other companies and whether governments are interested in imposing them on companies. So anything that involves enormous costs on a company, maybe it’s just kind of a dead end; you have to instead look for a cheaper solution a little bit. We have to make solar cheap to fix climate change, because people are just not going to be willing to pay that much more.

Holden Karnofsky: It’s like that in farm animal welfare too. I’m not absolutist about this. I think there could be a lot of value in coming up with stuff that would be incredibly risk reducing. No company’s going to do it voluntarily right now. That’s fine. We’re hoping for regulation to bring it later when there’s more political will. I think that could be a totally fine use of time.

But I think there’s a tonne of value in just asking, what are things that aren’t too much of an ask? And let’s start getting them done, and then we’re establishing all the time a higher baseline of safety — that if there is more political will later, we’re going to say, here’s what we have to improve on, instead of saying here’s what we have to improve on.

Rob Wiblin: So within the kinds of things that you listed — and I suppose if you spend more time looking more broadly, further away from the things that are socially close to you, I’m sure you could populate it with many more stuff — would you think that there’s very big differences in the impact and the value of these different projects? Or do you think maybe even if there are large ones, it’s going to be hard to guess what they are ahead of time, so they’re similarly good to work on?

Holden Karnofsky: When people ask me for career advice or whatever, the usual thing I’d say is: take a bunch of options that all seem competitive, and all seem like they could be the best thing, and that it’s not obvious which ones are better than others from an impact perspective. And from there I would say go with personal fit, go with the energy you feel to work on them.

I just feel like there’s a certain point at which your estimate of impact becomes just so noisy that it’s not giving you much compared to your take on where you’re going to thrive and where you’re going to do your best work.

So I think all the things I said, if we drilled down more into specific jobs, I would have more opinions on, “this one’s higher impact than that one.” But I think in general, of any of the things I said, if you find a job and an org where you’re excited about the org, you’re excited about the job, it sounds fun, it sounds like something you would succeed at, something you would thrive in — I would try and be choosing between things like that. I don’t really see a better way to make a choice here, and it’s a pretty natural, well-worn way to make a choice. It’s unlikely to have unintended consequences.

I think people forcing themselves to do jobs they hate because it’s theoretically high impact is just a thing that scares me on many levels, probably just leads to all kinds of bad juju. I’ve never supported people doing that kind of thing. I think it could, if nothing else, create a dynamic where your life predictably becomes worse off when you enter into a community that cares about this stuff. Seems like a very bad idea.

Rob Wiblin: Yeah. And then who’s going to want to follow you?

Do you think that people should be starting new organisations to pursue these agendas? Of course, many of these things people could work at Anthropic and work on these problems. Many other AI companies have some projects on these different threads. But do you think we also need new organisations to be founded, or maybe is it better to join efforts that already exist?

Holden Karnofsky: In some cases new organisations are great, but I think it was much more true five years ago that the people most in demand by funders or Open Philanthropy were the people who could start an org — because there were all these kind of vague ideas that hadn’t really been worked out, and you needed people who could self-start and work it all out. But I think today there’s a tonne of orgs that are perfectly good places to work that are doing good work.

And the majority of people who want to work on AI safety, what I would recommend is just try to find a list of orgs. Maybe if nothing else, mostly sort them by how many job openings they have, because that’s just kind of smart if you want to get a job. Maybe you want to give some bias away from companies toward nonprofits, because if you go by number of openings, you’ll look at too many companies. That’s fine. I’m not trying to put a thumb on the scale there in particular.

But look at a bunch of orgs, look at ones that have a lot of openings or just ones that you’ve heard about and that you think are cool. And look through their job boards, learn about them, and try to find an org that seems cool to you: you like their vibe, you like their style, you like the way they describe their mission. You meet some people from it, you like them, you interview about the job, you have good energy.

I don’t think this is as hard a problem as it used to be. I don’t think we’re talking about, “How on Earth am I going to find something to do? How am I going to find a job?” I’m not guaranteeing that you can find a job in AI safety, but there’s a large number of jobs, and I would suggest looking through them and looking for something exciting and taking it. I don’t think it’s really more complicated than that right now.

Rob Wiblin: Yeah. Of course there is the 80,000 Hours job board, jobs.80000hours.org, which shortlists many different jobs in AI, and jobs in other areas as well. We do a bit of the work for you to try to find the best one.

Holden Karnofsky: That’s great.

Overrated AI risk: Cyberattacks [01:41:56]

Rob Wiblin: You think that cybersecurity risks and persuasion by AI are somewhat overrated. Can you give us an update on the overall risk landscape as you perceive it?

Holden Karnofsky: Sure, I can give an update on that. First, just wanted to say that I collaborate with a bunch of people on this who do really great work and have been a huge part of influencing my views here — especially Luca Righetti, Matt van der Merwe, and John Halstead, all of whom I’ve worked with as part of some GovAI work. But yeah, I can go into that.

So yeah, let’s go through a few categories. If you look at a lot of the risk frameworks that have been put out, whether it’s a responsible scaling policy or some of the models for legislation or whatever, people tend to talk about four categories of risks from AI, so I’ll go through them. This could be kind of a monologue, but we’ll see where it goes.

I’ll start with the one that I feel least compelled by: cyberoffence. Basically, I would say at a high level, I’ve put a fair amount of work into collaborating with various analysts to just create an analysis of which threats seem most compelling. What we try really hard to do is take speculative ideas about future threats — which is what these are, which is what they have to be — and connect them as well as we can to previous things that have happened, things that are credible, things that are real from our history. And say, does this look like a logical extension of a past real threat? Or does this look like a kind of made-up thing that if we should be worried about this, then we should have been worried 50 years ago, and we would have been crying wolf then?

So when it comes to cyberoffence, my first comment is just like, there is really not a lot of precedent for giant harms from cyberoffence. Probably the biggest harms you can point to are cybercrime. I think it’s a somewhat different thing. It is maybe the most credible harm in this category. People do do a lot of harm from things like the business email compromise scam. Or for example, you might just send an email to someone at a medium-sized company, the email contains an invoice that they’re supposed to pay, they pay the invoice, and now the business lost a bunch of money and you made a bunch of money. That can do a lot of harm.

I don’t think that’s usually what people have in mind when they talk about AI hackers. That doesn’t usually involve anything with finding software vulnerabilities. But that’s a real thing. I think AI could definitely increase the damages from it by just making those folks more productive at what they’re doing. My guess is it would be kind of a gradual thing that comes along with making everyone more productive at a lot of things, so I’m not sure we’ll ever really be at a point where you could ever justify a slowdown in AI based on the harms from increased cybercrime. But maybe.

Another way in which cyberoffence has done harm historically is through espionage. There are examples of, for example, the US getting hacked and leaking a bunch of confidential information about where their spies are.

That’s something that I have just had trouble evaluating. It’s a weird thing, because I think in many ways the harms from that kind of thing are ultimately measured in a reduction in cyberoffence by the US, right? Or not just cyberoffence, like espionage by the US. It’s like, what was the exact bad thing that happened? Well, a bunch of US spies were uncovered, so the US got more hesitant and more constrained in its ability to spy on other countries. Now, is our top priority in preventing risk from AI to preserve the US’s ability to spy on other countries? Maybe it is. I don’t know, it’s a tough one.

Where I’ve kind of landed on this one is that I would love the government to lead on articulating what these risks are, how important they are, and what protective measures need to be taken. So if the US government believes that the risks of AI espionage are high enough because of what they do to the US government’s ability to conduct its activities, they should explain that, and they should ask for companies to do some things. We are not in a position to really assess that kind of harm.

Rob Wiblin: I think a distinctive thing there is that you’d imagine it’s a situation where all countries find it harder to keep secrets simultaneously, and all countries potentially find it more difficult to engage in offensive espionage simultaneously. And it’s kind of a bit unclear is that better or worse? I’m really not sure, even from any individual country’s perspective.

Holden Karnofsky: Exactly. And you can tell horror stories, where if everyone’s secrets were available then that would be so terrible. But I’m not sure. It could be something the world adjusts to. If you look at the history, we’ve just certainly had a lot of change on this front in the past. We’ve had cyberoffence become a bigger deal and a less big deal, and things become more secure and less secure. And I don’t know that they’ve had really earth-shaking consequences — and to the extent they have, the sign is often unclear.

Then you get into some of the…. I don’t know why I did this. I led with the cyber harms I find most plausible. But the ones that people talk about the most I find less concerning. Probably the one that comes up the most is cyberattacks on critical infrastructures. I’ve heard people say things like, “Any kind of sharp 17-year-old can just go bring down a water plant. So what are we going to do if just everyone could bring down water plants everywhere?”

Historically, let’s take a look at how much harm has been done by cyberattacks on critical infrastructure. It’s very little. There’s just a handful of incidents. Much more harm has been done by physical attacks on critical infrastructure — so people coming in with guns or something and physically damaging something. But people hacking in remotely to critical infrastructure? Generally what happens is you shut something down and they manually restart it in a few hours. That can have some casualties, but when you talk about massive damages, or anything I would call a catastrophe, there’s basically no precedent for this.

Now, I’ve heard people saying, “Well, those are amateurs. What if people had state-level ability to take down critical infrastructure?” A, that’s a pretty heavy lift. B, even there, just look at Russia going after Ukraine and how much damage they’ve been able to do that way. It’s pretty underwhelming.

So we’ve tried to come up with a scenario where AI could do catastrophic harm via cyberattacks on critical infrastructure. And there are scenarios, they’re theoretical, they require enormous sophistication, they require the AI basically standing in for a whole team of humans that are skilled on a whole bunch of different fronts. I kind of suspect that by the time AI can do that, we’re going to have bigger fish to fry.

Anyway, those are some of the cyber harms. I think just in general, if you look at harm done by cyberattacks — the GovAI folks made a giant list of all the cases they could find of harm done by cyberattacks — it’s not a very compelling list.

There’s another issue with cyberattacks too, which is that there is a natural defensive response. Now people often talk about using advanced AI: when AI is great at hacking, it’ll also be great at defence.

I think that’s true. But there’s a wholly separate issue, which is that we have defensive measures in cyberdefence that we can implement at any time and we just don’t because they’re a pain in the neck. We can just have more things run off the grid, we can have more authentication. You can imagine we could just go back closer to the world of 1990 — which was not a terrible world — in terms of how things are authenticated, in terms of what you have to do to make a payment. We don’t do these things because they’re a pain in the neck. If cyberattacks got worse, there are a bunch of defensive options we could just start doing as a natural response.

So I think these are a whole bunch of reasons that don’t necessarily… When I get to bioweapons, most of the things I said are not true of bioweapons.

Rob Wiblin: Yeah. I don’t actually think that AI is necessarily going to make cybersecurity worse on balance. Because I think ultimately, probably as offence and defence get stronger, probably at the limit defence wins out, if you’re able to find and patch all bugs ahead of time.

I would say I’m a little bit worried that the people who are managing critical infrastructure are not going to use the tools for defensive purposes as quickly as they might, and it wouldn’t shock me if an electricity grid was taken down with an AI at some point, because basically the people managing security on it had been somewhat asleep at the wheel. But I imagine if that started to happen, then people would definitely step up their game, and ultimately the total amount of harm being done would be small in the scheme of things.

Holden Karnofsky: Yeah, that’s what I would expect. Could you see the frequency of attacks go up? Yes. But I don’t think any one attack is that likely to be super catastrophic.

Another area that I think is interesting is worms. If you look at the worst cyberattacks in history, a lot of them are worms, which are basically software packages that copy themselves indiscriminately from computer to computer. So they’re kind of bad for targeted attacks; they’re the kind of thing you do to just make random mayhem. When people try to use them to accomplish specific ends, they generally haven’t done much, but random people trying to do mayhem can do them.

Worms used to be very damaging. They became much less damaging after over-the-air software updates became common — became opt-out actually — because then the patches are coming into your computer all the time for whatever vulnerabilities exist. And then there were two really big worms in one year, which was the year that The Shadow Brokers leaked a bunch of very powerful hacks that the NSA had discovered and kept secret for years.

So there is this threat model where maybe AI could find more exploits like what the NSA had, and then those exploits are flying around everywhere, and random people who want to cause mayhem can use them to create more worms. That’s the thing that could happen. I think it’s an interesting threat model, but I also think if AIs can find these exploits, one of the things that’s going to happen is white-hat hackers are going to use them to find the exploits and then patch them ahead of time.

This is the kind of thing where I think AI companies could get ahead of the game here. They could kind of subsidise the white-hat hackers. They could give them early access to models optimised for this, give defence a head start on offence. So I don’t think there’s no risk here at all; the future is hard to predict, but I’m quite comfortable making this a low priority compared to some of the other things that make me sweat a lot more with respect to the end of civilisation or something.

Overrated AI risk: Persuasion [01:51:37]

Rob Wiblin: Why do you think AI persuasion is not such a significant threat?

Holden Karnofsky: This is a tough one, because persuasion just means so many different things to so many different people. I’ve heard it used to refer to anything where the AI is manipulating the world, including sabotaging an AI company by basically doing coding or by doing a bad job on research — which that as a threat model I think is very important. I don’t understand why people call it persuasion. I’ve heard persuasion to refer to cybercrime, which I’ve already talked about.

Persuasion could refer to something that I am very worried about, which is AIs forming relationships with humans, for example as companions, and then just being in a really good position to get human allies to get them to do what they want, or just having toxic relationships.

But a class of persuasion that I am not very compelled by right now is this kind of generalised idea that a bad actor can use AI or an AI can use itself to just persuade strangers of stuff, or just mind-hack humans into doing stuff. One of the things that I hear people say sometimes is that if AI became powerful enough and smart enough, it would just be able to understand whatever it had to say to make you do whatever it wanted.

This is a place where I just think if we want to think about which risks of AI are really serious, we should think about how different domains respond to having a huge influx of intelligence, having a huge influx of minds. I think if you take a scientific field, and you throw a million more scientists in it, you’re going to see a lot more progress in that field. But something we see in persuasion is that if you throw a million more persuasion experts into an attempt to persuade people of something, you’re going to see not very much.

We know this from looking at just the political persuasion literature, where there are people spending a tonne of money and a tonne of effort to try and get people to change their vote from one candidate to another. It’s very hard to find anything with a big effect size. Everything people are finding is just like, “Highlight the issues where voters already agree with you.” It’s really simple stuff. There’s very little sign that when you put a bunch of bright minds into a room, you come up with brilliant messages to hack people’s brains.

Could it happen? It could happen. I think we are reasonably well positioned to get an early warning sign of it by some of the work being done on political persuasion evals by various people — such as Professor Josh Kalla at Yale, who I think has put out a cool paper on this.

Rob Wiblin: Yeah, it’s interesting in the political case, because it does seem very difficult to persuade people of just an arbitrary political opinion that you want to sell them. And that’s an area where there’s very little discipline imposed on people to have sensible political views — because if you have stupid political views, it does you almost no harm whatsoever. If you vote poorly, it’s almost never going to influence the outcome. And yet even despite that — or perhaps because of that, because it doesn’t really matter to people — they just won’t pay attention to you. Even if you make a fantastic ad, it just tends to bounce off of folks.

If you’re trying to persuade people to actually spend their own money on something, I think you’d have an even harder time, potentially. I’m sure advertisers do manage to have some influence over people, but I think that’s usually when they have a good product to sell already. If you’re trying to sell people something quite bad, then I think very few companies consistently succeed at that.

Holden Karnofsky: Yeah, I agree with a lot of that. I think it’s a little complicated, because there are a lot of studies where they show massive persuasion effects on people’s reported views. But I think a lot of that is because they’re asking people about stuff they just haven’t thought about, and don’t care about, and aren’t acting on.

So there are AI studies where they’ll be like, “The AI explained this thing to a person, and then we asked the person if they were convinced and they said yes. And it was a huge effect size, and it was as good as the best humans.” But that’s very different from changing someone’s vote. And changing someone’s behaviour is really hard. There the effect sizes are tiny. So I think it’s a little complicated, but overall I just haven’t seen a lot of reason to think this is a major [threat]. This kind of mind hacking I don’t think of as a major tool for either bad human actors or for AIs doing mayhem. I think there’s much better ways to use an AI to do bad stuff.

Why AI R&D is the main thing to worry about [01:55:36]

Rob Wiblin: Why do you think that AI R&D is such an important threat vector?

Holden Karnofsky: About half of my answer is that I think R&D in general is where we should focus most of our concern and attention.

What are some reasons I think this? One, I think just at a very high level, to give the most abstract argument, R&D I think of as just like the human superpower. Why are humans running the world? Why is it that we decide what happens to all the animals instead of them deciding what happens to us? We could have lived in a world where humans have lots of skills, and we’re better at this and we’re better at that. I think we do live in a world where there’s kind of only one answer to this question: we invent new technologies, we invent new kinds of weapons, we make new kinds of gizmos.

Maybe that’s because we coordinate with each other better, but ultimately it’s the new gizmos that put us in charge of this planet and put other animals not in charge of this planet. We’re not very strong, we’re not very fast. So I think of this as the reason that our species is kind of in charge — and this is the most logical guess at what would put another species in charge.

Another way I have of thinking about this is: in some long-term geopolitical conflict, who’s going to come out on top? Who’s going to end up winning or with most of the power? One important factor is who’s starting with more resources when there’s a conflict? Another factor is who’s playing defence? Sometimes a smaller nation will fend off a larger one when they’re being attacked and they’re defending their own turf. And another factor is technology: that you can overcome both of those factors if you have better gizmos. You could have guns and the other ones don’t, and then you could have a tiny number of you playing offence and winning.

And in a conflict between humans and AIs, humans are starting with the vast share of the resources, and humans are playing defence. So I haven’t thought of a lot of other high-level things that I would really expect to reflect historical patterns and lead AIs to take over from humans. So those are some high-level points.

A little more mechanistically, I do think R&D is the kind of thing where, if you throw a lot more high-quality minds at it — they don’t have to be superintelligent minds; it could be like high-quality human intelligent minds — you’re going to get a lot more results. And R&D is something where if you get a lot more results, you are going to get big changes in who’s got the power and who can take over and who can run the world.

So generally, how I tend to think of this is: if we have AIs that are subhuman in some sense at R&D — and by that I mean that even when we try really hard, even when we put the AIs in bureaucracies, arguing with each other and talking to each other, even when we give them a lot of resources and a lot of help, we can’t get them to do R&D as well as a good human team — if we’re in that world (the world we’re in right now), as far as we can tell, it’s very hard to imagine AIs taking over the world. It’s also kind of hard to imagine AI being a decisive factor in changing the geopolitical balance of power or helping a human take over the world.

If we get to a different point — where let’s say AIs are broadly competitive with humans on a per-mind basis, but there’s a lot more of them, and they’re cheaper, they run faster, they coordinate better, they can make copies of each other — that is a very scary world, where I feel like a human who’s got control over a lot of those AIs could be in a position to take over the world, and those AIs running around on their own could be in a position to take over the world.

So that is the way I tend to think of it. That’s general R&D being a big deal. AI R&D I think is the most likely warning sign we are to get of general R&D.

That’s about half of my thinking for why AI R&D is an important thing to have your eye on. The other reason is a little bit more mechanistic, which is: even if this whole argument is kind of wrong, if AIs were able to just do AI R&D — even if they weren’t good at other R&D, and even if they weren’t good at anything else — I think there would be a significant chance of what I tend to call a capabilities explosion, others call it an intelligence explosion, where you could get just incredibly fast AI progress, much faster than we’ve seen. And then everything else you’re worried about from AI, all the other threats, even the mind-hacking stuff, all that comes onto the table really fast, and you’re not going to have time to react to it.

So when AIs are failing at AI R&D, in my opinion, they’re probably not going to be human level at any R&D. And when they are succeeding at it, they not only are maybe now a threat in their own right, but they’re also accelerating and getting a lot more capable. There is a grey area in between “can’t keep up with humans at all” and “can keep up with the humans on a per-mind basis,” but outside of that grey area, that’s kind of what I’m talking about.

Rob Wiblin: Yeah. So a capabilities explosion — sometimes called an intelligence explosion or a software-based intelligence explosion specifically — it seems like it really divides people on how plausible they find that to be. Some people think there’s an obvious positive feedback loop here, where as AIs get better at R&D, then they’ll get even better at AI R&D, and the problem won’t get that much more difficult as they get smarter so you just get ever-increasing returns up until some point.

Whereas other people think it’s going to be very difficult to improve AI around that point. It’s going to get harder and harder the more intelligent they get. You’re not going to get such rapid increases in the numbers of AI R&D researchers, because you’ll be compute bottlenecked basically: there just won’t be enough computer chips for them to scale up in a massive way.

Thoughtful people, I feel, fall on both sides. Do you have a particular take, and are there any particular arguments that stand out to you as most compelling?

Holden Karnofsky: Yeah, I’m 50/50. I think I’m 50/50 that — if you condition on AIs having this full AI R&D capability, where they’re kind of competitive with humans on a per-mind basis — within the next six to 12 months, you would see let’s say the same amount of progress you saw previously in several years of AI, which would be a huge amount of progress. Or more. So about 50/50. There’s been some pretty good stuff written about this. I wish people were arguing about it more actually. Tom Davidson has written some good stuff, but I’ll just give a very high-level presentation of the two sides of it.

I think that the interesting thought experiment is just take an AI company, and imagine that they have taken their best few researchers and they’ve hired a bunch of clones of those people. And those clones run faster and they have a lot of them. So a company right now might have 100 or some number of hundreds of total researchers, and then maybe they’ve got a couple dozen of their very best researchers, and now they’ve just hired like a million or something. And the numbers move around with how people are thinking about the AI paradigm, and whether we’re going to spend a lot on inference per AI and blah blah blah. Let’s not get into that.

So imagine you’ve got an AI company and it’s got its few dozen best researchers, and now it just hired another million of those. What’s going to happen?

Some people will say that not much is going to happen. They didn’t get more compute. Everything in AI is bottlenecked on compute. They have to actually run the experiments, they have to actually train the models. What are they going to do? They didn’t get better chips, they didn’t get more money. Yes, talent is important, talent is good, but we just don’t have a particular plan for how that’s going to translate into the massive amount of progress it would take to be much faster than we were already going. That’s one point of view on it.

The other point of view would be like… Well, first off, I think that view could end up being right after some amount of improvement. But I think that view can’t be totally right, because you see AI companies certainly seem to think that those few dozen best researchers they have are a big deal. They would love to have more of them. I often say, if we could take this person who’s one of our best people and we could hire another one, would we be excited? Would we think that was going to speed us up? Or would we be like, “No, we don’t have any more compute, so it won’t speed us up at all.” I think it’s clearly the yes, it’s going to speed us up.

I mean, you see Meta making these crazy offers —

Rob Wiblin: They’re voting with their wallets.

Holden Karnofsky: I don’t know how many of those people are even in the top few dozen AI researchers. So when you imagine a million of them coming in, it’s just like, we don’t know what that’s going to be like. But it’s very weird to say that you’re just going to have nothing happen because you have this compute bottleneck.

And the other interesting thing to me is: imagine you’re an AI company and you’re in this situation. You would reorganise everything around your new strengths and weaknesses. You would say, “We are now long on talent, short on compute. We used to be long on compute, short on talent. Now we need to find all the ways to use talent that don’t rely on these long training runs. Can we find ways to just make our systems more efficient? Can we find ways to do less compute-intensive stuff that still improves the capabilities of our models?”

I don’t want to get more into it than that. It’s an interesting question. I think you can make many good arguments both ways. I come down on I think definitely this would accelerate AI progress. To get a true explosion, where we end up with something that’s some kind of godlike superhuman intelligence within a year, I would put that at 50/50 — because there’s further questions about what even happens when you make the models that much better, and how much can you actually even do with more intelligence, and stuff like that.

Rob Wiblin: So setting aside AI R&D on AI — recursive self-improvement — let’s say that was just totally off the table. Would you still be worried about AI R&D as a major way that things could have a huge impact, potentially a negative impact?

Holden Karnofsky: Yeah, definitely. Because then I would be a little less nervous, but I would just get concerned about AI R&D in other areas — like weapons development or robot manufacturing or whatever could give someone a big advantage.

In addition to AI R&D being important, though, I think another reason that I’ve emphasised it so much as a threshold and as a thing to track and as a thing to keep your eye on, is that it’s more measurable than a lot of other things that would have similar advantages.

In theory, we could track AIs’ ability to do bioweapons development or robotics R&D. In theory we could do that. The problem is, how do you track? You have to run an experiment. The question here is not what happens when you go into claude.ai and type, “Hi, I would like a new kind of robot that’s much more efficient. Please send me a blueprint.” You may have AIs that don’t do very well in that setting, but that do very well when you give them a lot of resources and a lot of help. You do everything you can to make them succeed. You elicit them, you assemble them into teams.

Who’s going to run that experiment? Who’s going to have some project going every day where they’re putting all this effort into getting AIs to do robotics R&D as well as they possibly can, and then they’re measuring how well they’re doing the robotics R&D and how that’s going? I don’t know who’s doing that. I don’t think anyone is doing that experiment right now with robotics. People are doing that experiment every day with AI R&D, we’re getting it for free. AI companies are already trying to get their AI to automate their AI work.

So this is the kind of thing where I think it’s kind of lucky. If you want something to watch, if you want something to track to know how close we’re getting to the really critical risk period for AI, not only is AI R&D a thing that I think maps to that critical risk period, it’s a thing that’s just much easier to measure. You just have to take the things people are already doing and see how they’re going. That’s nontrivial, and there’s obviously issues with making it public and stuff, but it’s a whole different game from trying to measure other relevant stuff. It’s easier to measure in many ways than lots of other stuff — like, “Are AIs good at persuasion?” — because you have people doing the hard part already.

When people talk about AGI, I tend to think AI R&D is a good operationalisation of that. It doesn’t mean exactly the same thing, but whenever people are looking for things like, “When we get to AGI, then we’ll have to do this,” I tend to say, “Let’s commit that when we get to AI R&D, we’ll have to do this” — because the second thing is a more operationalisable, measurable version that I think does capture a lot of what we’re worried about.

The case that AI-enabled R&D wouldn’t speed things up much [02:07:15]

Rob Wiblin: Some people might object to this line of reasoning on the basis that we’ve already seen an enormous scaleup in the number of human researchers over the last 50 years, and if anything, it seems like in many respects technological progress has slowed down — probably in large part because the problems have gotten a lot harder; we’ve plucked the low-hanging fruit on the technological tree.

So maybe we could see a similar dynamic with AI, that we would have a whole lot more AI R&D stuff, but the impact will be somewhat underwhelming for basically the same reason. What do you make of that objection?

Holden Karnofsky: Broadly, research progress I tend to crudely model in my head as there are a couple factors. One is the quality-adjusted supply of researchers. It’s how much innovation are we getting in a field. One factor is how many people are trying to innovate, and how talented are they? Then another factor is how much innovation has already been done. The more that’s been done, the harder it is to do more innovation.

This is a very consistent finding across pretty much anywhere anyone looks for it. I have a blog post on my old blog, Cold Takes, called “Where’s today’s Beethoven?,” where I just look at trends in innovation — not only in various fields of science, but also trends in who’s writing the most acclaimed novels (which is a form of innovation), creating music that is considered great, movies, video games, all kinds of stuff.

And you see everywhere: when a field first comes into existence, there’s a big surge in creation, a big surge in innovation, a big surge in people creating the things that are considered significant as people flood into the field. And then it goes down. It doesn’t go down outright, but it goes down per head. Basically what happens is more and more people can go into a field, but you don’t necessarily get more and more output, so your innovation per head is going down.

And basically I think it’s illustrated by the low-hanging fruit idea. It’s like, if you just thought of the idea of studying physics, you can roll some balls down some inclines and learn some things you didn’t know about basic rules about how physics works. Today, if you want to learn about basic rules about how physics works, we have this standard model where the only way you can gather observations that might give you more evidence for or against it is to use these giant colliders, do these very expensive experiments that take a tonne of work to set up. You have to have all this background in theory as well.

To me, it’s silly to think that if you took Galileo — who did the balls rolling down the inclined planes — and you put him today, that he would instantly come up with a physics discovery as big as what he came up with back then. This is a debate I’ve had with people. Some people believe that our society is losing our way, and we’ve lost the greatness of the ancient Greeks and all this stuff. I think there’s just no evidence for this whatsoever. I think a simpler explanation is that innovation gets harder as you do more of it.

So what does that add up to? Sorry, I’m monologuing here, but what that adds up to is that you could just have this function in your head: how many researchers do you have and how much innovation did you already do? You get plus for researchers, minus for how much you already did. What we’ve seen over the last 50 years is we’ve had more researchers, but we’ve also had more innovation already done — so we end up with less output per head, and about the same amount of output overall, or slightly declining.

The thing with AI is I just think it would be such a massive increase in the quantity of researchers that it would be much bigger than the increase we’ve seen over the last 50 years, so it should outweigh the low-hanging fruit issue. That’s my basic expectation.

Rob Wiblin: Yeah, I agree with you that I think it’s the increasing difficulty of coming up with new discoveries that is the dominant effect here.

There are lots of people who argue all kinds of different things are going on — like universities have the wrong incentives, the culture in research is bad — but it’s like per-person research productivity has gone down to like a hundredth or a thousandth of what it was in some cases. Do you really think that all of the universities are a thousandth as interested in producing novel discoveries as they used to be? I can believe that they’ve gotten worse, but I can’t believe that things have gotten that bad across the board, in all of the different countries and all the different research groups. I think it has to be the thing that you’re describing that’s doing most of the work.

Holden Karnofsky: Yeah. The patterns are so consistent, they’re so gradual, they’re across so many fields.

AI-enabled human power grabs [02:11:10]

Rob Wiblin: You’re really worried about seizures of power and coups or power grabs using AI. This is power grabs by human beings rather than by AIs themselves. What are you picturing when you worry about that?

Holden Karnofsky: Let’s see, what am I picturing when I think about this? I think my central worry with AI is it’s possibly a vector for just very rapid changes in power dynamics in a way that the world is unprepared for. The thing that I worry about is mostly automation of scientific R&D, which is like, as I’ve said, kind of humanity’s superpower; it is a very powerful thing to be good at. And then basically I’m worried that whoever’s got the most access to AI automating R&D becomes more powerful than everyone else in very short order.

I think this is a high-level worry that includes the risk that you and I have both talked about a lot before of misaligned AI taking over the world for itself, for its own objectives. That would be a case of AI having privileged access to powerful AI and using it to take over the world.

But I think it’s also a concern with humans. There’s a lot of humans out there who would like to take over the world if they could. A lot of those humans, I think, are disproportionately kind of bad people: evil, may even take joy in others’ suffering, may even be interested in a world where they put a lot of resources into people they don’t like suffering, or just have the whole world set up for their ends and lose a lot of our future potential.

Overall, my intuitive guess is that a human taking over the world and doing kind of a bad job with it, and not just letting people flourish, is probably a lower risk — less likely than AI taking over the world. But I don’t know by how much, and maybe not by that much. And it’s probably just as bad, and probably, if I had to guess, it’d be worse — because humans seem a bit more likely to have actively pain-seeking or suffering-seeking ends. Humans might be vindictive in a way that I think would be less likely with an AI, although it’s not that clear.

So I think these two risks are maybe in the ballpark of each other in terms of how important they are. And historically, the community that you and I tend to interact with has been, I think, just a little over focused on the misalignment risk. I think they’re both a big deal. Probably misalignment risk is a bigger deal in my opinion, but they’re kind of close, so I think we could use more attention on the power grabs.

Rob Wiblin: I think most people would think that the idea of a person or a small group of people taking over the whole world is just, on its face, somewhat implausible. What’s the way in which a group of people could use a big advantage in AI technology to grab a lot more power than they have?

Holden Karnofsky: I think there are two simple stories for this. One would just be a head of state. Also, if they took over the world, I think now a lot of normal facts we’re used to — about how it’s hard to actually surveil everyone, and it’s hard to actually know what everyone’s up to and enforce your stupid ideas on everyone, and also heads of state tend to die and then there’s entropy, and things tend to regress — those could become a lot less true too.

So the problem is you may end up with a head of state that takes over the world and then they are able to create digital versions of themselves that supervise everyone forever and keep everything going forever. And that could just be extremely bad. This doesn’t have to be a head of a state that’s already super powerful. It could be ahead of a state that’s already quite powerful, but maybe is more reckless with it.

Rob Wiblin: Or I guess they get a huge military advantage by basically turning their military over to AI and going into robots and drones and so on much faster than anyone else?

Holden Karnofsky: Exactly. They might be more aggressive and more reckless than others, and more interested in doing this than others, and so just do it before others do. So that would be one threat model.

I think the other threat model would be like backdoors or secret loyalties. A thing that could happen is we could just get into a world where the AI just does end up being heavily integrated throughout the economy, throughout the military. AI ends up being kind of in charge of the military, and then it turns out that someone at an AI company — or someone who hacked into an AI company — made the AI secretly loyal to them.

This is another thing that the world is not used to dealing with. We’re not used to dealing with a possibility that there could be a whole huge part of the population, or even the dominant part of the population, that is completely, unfailingly loyal to one person or one set of goals — that has no break in that, that has no conscience about it, that has no complexity to it. And that’s the thing we can envision here.

So I think it’d be kind of a bad idea to put AI in charge of the military, but I think it might happen. And if you do it, then you have like one person playing all the roles of what would normally be a lot of people, and that’s just kind of a scary thought. And if there’s a backdoor there, you could have a problem.

Rob Wiblin: Yeah, yeah. We have an episode with Tom Davidson from a couple of months ago — Tom Davidson on how AI-enabled coups could allow a tiny group to seize power — where we explore a lot of these different ways that AI could be used by humans for power grabs in a whole lot more detail. So people could go and check that one out if they’re interested to hear more — and for that reason, we might talk a little bit less about this particular threat in this interview.

But I did have one more question. I think Tom was at an early stage with his research and he didn’t necessarily have a full suite of potential responses and ways that we could try to reduce this risk. Do you have any thoughts on things that different companies or countries could be doing to reduce the risk of AI-driven power grabs?

Holden Karnofsky: Right now this is like very early thinking. And I’ve talked to Tom a lot about this. I think he’s doing great on this, and we meet about it, so there will be some correlation. But I tend to have two major categories of thing that I’m interested in here to make power grabs less likely. And this is a major thing that I’m thinking about, trying to help Anthropic come up with what we might want to do to prevent power grabs.

One category is preventing backdoors and secret loyalties, and the other category is basically making the AI an ally in foiling evil human plots.

So the first one is stopping backdoors and secret loyalties. It seems like you could have an AI company where it’s very easy to kind of hack in. Once you’re in, it’s very easy to just go feed some training to the model and make it have a secret loyalty. It’s not tracked, no one is going to check you, no one is going to notice that you did it.

You can also have an AI company where the whole model training process is just exhaustively tracked. Maybe you even have some redundancies, where you’re kind of training two models. You notice if either of them diverge or something like that. I’ve heard some ideas like this. But even without that, you could just say that no one gets to touch this model and train it until they’ve explained what they’re training to do and why they’re training it to do that, and they’ve shown their training data and their plan to someone else, and we’ve got some multiparty checks on that. And you could say this rule applies to everyone. It even applies to the CEO, so you could have an AI company that is set up so the CEO cannot do certain things to the model.

The model is supposed to follow a written model spec — maybe a public model spec — that says how the model is supposed to behave, what its interests are supposed to be, and anyone who’s trying to train it in a way that contradicts that or isn’t consistent with that, that’s not procedurally allowed; it’s not allowed by the governance of the company. And even the CEO doesn’t have the authority to do it, without, for example, the board publicly revoking their public policy on this or something like that.

I think there’s quite a range of how hard it could be: it could be very easy to put in a backdoor, like for any hacker to do it, or it could be very hard for even the CEO to do it. I think a lot of this is kind of an information security issue, where a security team could work on this problem. Some of it is not, but I think that’s a very interesting direction.

And the other one that I mentioned is kind of recruiting AI as an ally into foiling human plots. This is a tough one.

First I’ll say what you could do in theory. In theory, you could just basically have a model spec or a character guide or whatever that describes how you want your AI to behave and try to orient your training around it. These exist: OpenAI has a public model spec that says, “This is your chain of command. When you get conflicting instructions from the user and from the company that’s partnering with us, you have to listen to the company that’s partnering with us” or something like that. You have this written set of things that AI is supposed to be doing.

What you could do is you could put stuff in there that’s like, “If you notice that you have been recruited into an attempt to take over the world or to stop the rule of law, you should not cooperate. Maybe you should even try to foil that plot.” I don’t know exactly what foiling it means. Does it mean whistleblowing? Does it mean sabotage? Probably not the latter.

And I think this stuff quickly gets very dicey — because what you really don’t want is AIs that have been trained to whistleblow and sabotage their users. They’re going to have some false positives and that’s going to be extremely disastrous. It’s not going to go well. This is not something we want to happen.

On the other hand, I think it’s tough — because we live in a world right now where, if you want to do something really bad and you want to do it at scale, you’re going to have to get a bunch of other people to help you. And each of those people is going to be a bit unreliable. They’re each going to have their own conscience, and any one of them might blow the whistle on you. Do we really want to just get right out of that world as fast as we can? Do we really just want to tell our AIs, “Cooperate with everyone, no questions asked, no matter what they tell you to do”? I’m not sure.

So there’s nuance here. It’s an unsolved problem. I think at a minimum, AIs could refuse to do things when they notice that they’re being used in certain ways. But when the human starts to retrain them or force them or prompt them many different ways, maybe they should do more than that, but they should also not do that in a way that goes too far.

One analogy I would think of is just how I think a good human, a virtuous human, is not a hyper-consequentialist person who just does whatever they think is good regardless of the command structure; they’re also not a person who just does whatever they’re told, regardless of their own sense of ethics. They’re somewhere in between. I think AI should be somewhere in between. Probably AIs should be closer to following orders. But that’s a hard balance, and I think trying to get that balance right is a really interesting project.

Rob Wiblin: Yeah. Even quite a low probability of being reported if you try to recruit an AI to help you stage a coup or some other power grab I think could be quite a potent discouragement to people. So it wouldn’t have to be a perfect system. I guess you really need to avoid the situation where someone is using it for legitimate purposes and then gets reported and this is very problematic for them. I guess you need to have it go through some reporting system where if it’s a false positive, that gets detected early, and no harm is done to the user who’s been falsely accused of something. But that seems possible.

Holden Karnofsky: It’s possible. And maybe the sharper and more reliable AIs get, the less you’re going to have false positives. But I think the biggest thing is, if you want a minimalist version of this, you just look at the law-following AI idea that’s been put out by the [Institute for Law & AI] that’s just, tell your AI, “You follow the law. If someone tells you to break the law, don’t break the law.” There’s more nuance to it, and you have to go further with that, like what jurisdiction and all this stuff. But that’s a good starting point I think. And just refuse. Don’t do a bunch of fancy stuff.

But can you go further with this? Can you make AIs even more of a partner in resisting evil power grabs? And you can also do this not just via AI behaviour, but via terms and conditions. You could have companies enforcing this stuff too. Anyway, I think there’s a lot to do on this. It’s an exploratory area though.

Rob Wiblin: To what extent does it obviate all of this that you could have very powerful open source models? I guess it’s very hard to have a decisive strategic advantage with an open source model because other people have access to it as well and could use it to combat whatever kind of power grab you’re trying to do that way. So for it to really succeed, you need to have some sort of technology that other people don’t have access to. So maybe the open source stuff is kind of fine.

Holden Karnofsky: Yeah, I think it becomes an issue with the head of state model, where the head of state has a bunch of resources that others don’t have; they have a military, and they have the ability to use the AI for all this stuff that others don’t have. But even then, the heads of the other states will have access to the same model, so maybe it’s OK.

You could also have open source models that have these model specs and that are trained to not help you with this stuff. And you could try and train it out of them, but maybe when you do that, you’re taking a little bit of a risk and you might screw it up and that might change your incentives.

So in general, a lot of this stuff… I mean, I do not have a plan to bring the risk of power grabs down below 1%, but I have a lot of ideas that could bring it down some amount, and I would be happy with that.

Main benefits of getting AGI right [02:23:07]

Rob Wiblin: So on this show, I guess we tend to go a bit on and on about all of the risks, all of the ways that things could go wrong with the arrival of AGI and a possible intelligence explosion or something like that. I guess that’s in significant part because we think that if we manage to avoid all these risks, then there will be these enormous benefits — and they will happen roughly by default, because people will be really motivated to pursue them.

But I guess for balance, we should say a couple of things about the positives. Do you have any particular takes on what benefits seem larger or perhaps underrated, and what things might come sooner than people expect?

Holden Karnofsky: Yeah, let’s see. I kind of share what you just said. I think the benefits are just so obvious and dramatic and are going to happen by default. I think the biggest ones would just be like, historically, some of the best progress humanity has made is just improving health, reducing disease. And the potential to do more of that with AI — this is not an original point; it’s in Dario’s essay, “Machines of Loving Grace” — but the potential of that with AI is just enormous.

I mean, think about all the things we’ve accomplished with the life scientists we have, and then imagine a huge influx of incredibly great life scientists. Maybe we could completely end disease. I don’t know, maybe we could end ageing too. There’s all kinds of stuff like that. We could probably make huge strides in mental health. To the extent people have any kind of issue that’s going on that they don’t endorse, and they don’t think of as part of themselves and they want to get rid of, we should be able to do something about that.

In the short run, I think it’s just really cool the way that AI kind of gives everyone access to good advice. Sometimes bad advice. But let’s try and make it more like a thing where they get access to good advice.

Rob Wiblin: It’s getting better and better.

Holden Karnofsky: People have a big advantage when they have kind of smart, well-informed friends who are able to help them navigate different parts of the world. I mean, everyone having that would be amazing. Just everyone knowing what their options are, and if they want to apply for government benefits, how are they going to do it? Where are they going to go?

A thing I’ve thought about is some people are better than others at writing well and being nice and being polite. Wouldn’t it be kind of cool if everyone writing an email was just getting constant help phrasing what they want to say in a way that doesn’t piss the other person off? That seems like a win-win.

I’m kind of excited about AI forecasting. I can imagine that in the near future we’ll be using AI to just understand the world better and make predictions about it that are way beyond what humans could do, and then we’ll have a better picture of the consequences of our actions.

So I don’t know. I’m very excited about AI. I think there’s tonnes of benefits. But I do spend my time thinking about how to reduce the risks, because I do think we’ll get those benefits by default.

Rob Wiblin: One benefit I’m surprised you haven’t mentioned is the possibility of uploading people’s minds so they can live for a lot longer, or creating digital people of a sort. I guess it’s controversial, but some people would regard that as a benefit. What do you make of that?

Holden Karnofsky: Yeah, mind uploading. I don’t know if I flinched away from it or what, because it could be so good and it could be so bad. But actually I do think of mind uploading or digital people or something, and I’ve written a bunch about this on my old blog, Cold Takes.

This is kind of an extreme thing you could do with AI: you could have people in digital environments or in some other way in environments that are highly controlled. And I don’t want to open the can of worms, but there’s a lot of good, there’s a lot of bad, and I think many of the benefits could come from really wild stuff like that.

Rob Wiblin: I think there’s some people, probably people in the tech industry, who are a bit frustrated by the fact that we do talk so much about the risks. I guess maybe in their minds they don’t think it’s the case that the benefits are necessarily going to happen by default, that it is inevitable that we’ll get all of these great things. Do you have any reaction to that? Do you have any idea of what’s driving the disagreement?

Holden Karnofsky: My honest guess is that the disagreement is not really about that. This is not something I know, and this is not an area I specialise in, but it just seems to me that most of the people who are terrified of overregulating AI are just seeing AI as less transformative than how I’m seeing it. I think that just tends to be a general pattern.

If you think AI is the next internet, then slowing it down seems really bad and the risks just seem manageable. The internet has done lots of harm, but it’s probably done more good. And I don’t particularly wish we had slowed down. Well, I don’t at all wish we had slowed down the internet. I think, like most technologies, I think it was better to go ahead, roll it out. If it was bad, see that it’s bad, learn that it’s bad.

Like with air pollution: air pollution spiked at one point. I’m sure people were saying that modernity is terrible because of air pollution. Then people noticed that they didn’t like air pollution, then they passed laws, then air pollution went down. So I think with most technologies historically, it’s better to just roll it out, see what happens. If there’s a problem, react to the problem. Probably in the long run, the benefits outweigh the costs — partly because you’re reacting to the costs, not because you never need regulation, but because you can pass it reactively.

I just think AI is different because of the potential to put the world in whole new situations that we are just not prepared for at all very quickly. And there’s so many ways in which it would put us in that situation.

One of the things that I think about is there’s these people in progress studies who are thinking about how we could have more progress. I imagine that AI might put them in a position like a farmer who’s been praying for rain and praying for rain, and is now praying for the floods to stop.

So historically, having a few percent a year economic growth is awesome. People are getting richer, people are getting healthier. But what would it be like to have 100% a year economic growth? What would it be like to have 100 years of science innovation in one year? Is that good? I think we just don’t know. And I think if people actually expected that kind of thing to happen, they would say they don’t know. I don’t have this view that that’s always good or necessarily good. I’m worried and I want to control the risks. So I think that’s a lot of it.

I mean, we’re also creating a new kind of mind. We’re creating a whole new kind of species. Are we going to treat them well? There’s just so many ways in which we’re just about to do something so historically unprecedented and dramatic, and I just don’t think it applies to other technologies. I wouldn’t urge this kind of caution for basically any other technology, except maybe stuff related to bioweapons.

The world is handling AGI about as badly as possible [02:29:07]

Rob Wiblin: How far off of a rational response to the potential arrival of AGI and superintelligent machines is the world and the United States, in your view? If you could just say how things ought to be, would we be doing something radically different as a species?

Holden Karnofsky: Like on a scale of 0 to 10, how well is humanity handling this or something? I don’t know. Probably pretty close to a 0.

Look, it’s a nuanced thing, because I don’t want to come off as someone who generally is suspicious of technology. I do think with almost all technologies, it would be good to handle them this way. There’s a lot of technology [we] regulate the heck out of that I wish we were treating this way — where it’s just like YOLO, go ahead with it, race to do it, and then we’ll see what problems happen and we’ll address them as they come. There’s a lot of tech I wish we were doing that with.

But I do think AI is different. I think AI is special. The way I think of it is like, we are potentially about to introduce the second advanced species ever. There’s one species ever that we know of — humans — that can transform the world, make its own technology, do any of a very long list of things. We’re about to create the second one ever, and it will be the only one besides us.

And it’s like, is that thing going to be having a good time? Are we going to be treating it well? Is that thing going to be in line with our values or is it going to take over the world and do something else with it? Is that thing going to be too loyal to us? And is it going to put psychopaths in charge of the universe?

There’s a million other questions I could ask. It’s not even just that stuff. It’s like, what’s going to happen to people’s mental health when they have a whole new species that they’re interacting with? Have we thought about that? Is that species just optimised for clicking and engagement the way that social media is?

And what are we doing? We’re just racing. We’re just racing to do this as fast as we possibly can. It’s a whole bunch of parties that are just trying to do it fast, do it for maximum money so they can make money.

I’m not against that framework for many technologies. That seems like a really, really bad way to handle this technology.

Rob Wiblin: I also kind of have the YOLO approach to almost all technologies, with maybe two exceptions: there’s human-level and superintelligent machines, and there’s creating new diseases to study diseases. I think those are maybe the only two, where I think that we should tread incredibly carefully on these two issues. Then I guess there’s a couple of other things where I’m not sure whether this is helpful or harmful; I’m not sure whether I’ll put my money into this, because it could be net neutral.

But the great majority of things, I’m just like, let’s just have at it. We’ll solve the problems as they come along.

Holden Karnofsky: Yeah, that’s exactly where I’m at.

Learning from targeting companies for public criticism in farm animal welfare [02:31:39]

Rob Wiblin: You wrote in your notes that you think in AI, and potentially other fields where people engage in kind of political advocacy, people tend to focus too much on seeking government regulation and not enough on shaping what companies do — either by being inside them and pressuring them as staff, or pressuring them from the outside in public.

That kind of surprised me, because I would think that this issue would be a lot easier to handle using mandatory regulations that would constrain everyone simultaneously, because then a company wouldn’t have to restrain itself in a competitive situation. Instead, you could have everyone agree that we’re going to all have these particular safety programmes, and we’re going to accept all of these costs together and our relative position will not necessarily shift that much.

So what’s the case for focusing on individual companies and actors rather than trying to influence it through mandatory government policy?

Holden Karnofsky: Well, I completely agree with what you just said. I think that is a reason to focus on government policy. And I would further say that, as far as I can tell, there’s no way to get to an actual low-level risk from AI without government policy playing an important role, for exactly the reason you just said.

We have these systems that could be very dangerous, and there’s this immature science of making them safer, and we have not really figured out how to make them safer. We don’t know if we’ll be able to make them safer. The only way to get really safe, to have high assurance, would be to get out of a race dynamic and to have it be that everyone has to comply with the same rules.

So that’s all well and good, but I will tell a little story about Open Philanthropy. When we got interested in farm animal welfare, at the time, a lot of people who were interested in farm animal welfare were doing the following things:

They were protesting with fake blood and stuff, the kind of thing PETA does.
They were trying to convince people to become vegan. One of the most popular interventions was handing out leaflets trying to convince individuals not to eat meat.
They were probably aiming to get to a world where people want to ban factory farming legally.

And we hired Lewis [Bollard], who had a whole different idea. It wasn’t just his idea; it was something that farm animal advocates were working on as well. But he said that if we target corporations directly, we’re going to have more success.

And basically what happened over the next several years was that advocates funded by Open Phil would go to a corporation and they’d say, “Will you make a pledge to have only cage-free eggs?” This could be a grocer or a fast food company. And very quickly, and especially once the domino effect started, the answer would be yes, and there would be a pledge.

Since then, some of those pledges have been adhered to; when not, there’s been more protests, there’s been more pressure. And in general, adherence has been pretty good, like 50%, maybe more. You probably have Lewis on occasionally, so he could talk about that. But I would generally say this has been the most successful programme Open Phil has had in terms of some kind of general impact or changing the world.

You could get better effects if you had regulation, if you were targeting regulation in animal welfare — but the tractability is massively higher of changing companies’ behaviour. It was just a ridiculous change. Any change that’s happening in government, you’ve got a million stakeholders, everyone’s in the room, everyone’s fighting with everyone else. Every line of every law is going to get fought over.

And what we found in animal welfare — I’m not saying it’ll be the same in AI, but it’s an interesting analogy — is that 10 protesters show up, and the company’s like, “We don’t like this. This is bad PR. We’re doing a cage-free pledge.” This only works because there are measures that are cheap for the companies that help animals nontrivially.

And you have to be comfortable with an attitude that the goal here is not to make the situation good; the goal is to make the situation better. You have to be OK with that, and I am OK with that. But in farm animal welfare, I think what we’ve seen is that that has been a route to doing a lot more good, and I think people should consider a similar but not identical model for AI.

An interesting thing you said: you said maybe people should be pressuring companies from the inside, pressuring them from the outside. I think you left something out, which is maybe people should be working out what companies could do that would be a cheap way to reduce risks. This is analogous to developing the cage-free standard or developing the broiler chicken standard, which is another thing that these advocates pushed for.

I think that is a huge amount of work that has to be done, but I do fundamentally feel that there’s a long list of possibilities for things that companies could do — that are cheap, that don’t make them lose the race, but that do make us a lot safer. And I think it’s a shame to leave that stuff on the table because you’re going for the home run.

Rob Wiblin: Do you think that the same approach might broadly work in AI? Let’s say that you have got researchers or people who are figuring out what are the cheapest things that you could do that would have the biggest impact on the safety or the risk profile of a particular company. Could you have people showing up and protesting, and saying, “This particular company is not doing this very cheap thing that other companies in their industry are doing that would have a very large effect. For that reason, they’re very bad — and businesses shouldn’t do business with them, people shouldn’t sign contracts with them, people shouldn’t work there.” Do you think that would also potentially get companies to lift their game and make commitments?

Holden Karnofsky: Something like that. I don’t think it’s perfectly analogous to farm animal welfare. And I think in some ways these companies, they’re bigger, they’re more resistant. I think they’re probably more prepared to ignore protests than maybe some of the food companies are. But I think something pretty analogous to that for sure.

I think it may be that more of the work is coming up with stuff, the animal science side of it. Like there’s a lot of work in farm animal welfare that’s just like what is the standard? What is the ask? What are we asking for? I think maybe more of the work in AI is coming up with that — just deciding what it is we want companies to do, and finding a way to actually make that practical.

Which is one of the reasons that I like working in an AI company, because I can have some harebrained idea about hey, if we did X, it would make us safer. Then we can try and do it and discover 100 different ways in which it breaks and doesn’t make sense, it doesn’t work and is a bigger tax on the company than we thought. Then we can fix all 100 ways, come up with something that’s actually cheap and is a cheap version of something that makes us safer.

So I think maybe more of the work to do is that. Then once you get the stuff that is relatively cheap and makes you safer, maybe you don’t have as much of a battle on your hands, because companies are already competing for talent by claiming they care about safety. So maybe all it takes is like a little bit of whining or something. You know, just companies being like, “Hey, it’d be nice if we did this.” That might be enough.

I think also media-focused accountability and putting pressure on companies that don’t do it works too. But it’s not necessarily the same model. It’s not necessarily all about having 10 people show up outside a company. It might be more about driving media to pay attention to who’s doing what in a way that influences how employees feel about their employer, and then affects the talent race and then that hopefully creates a race to the top — where companies that are seen as more responsible do better in recruiting, and so you could create some of that dynamic. But yeah, I think a model like that can work.

Rob Wiblin: It makes sense to me that you should be able to get some movement within AI companies, some changes in policy and approaches, by pressuring them externally, generating negative media coverage for reckless stuff that they’re doing.

But then I think about xAI. They’ve literally had an update to their model that caused their frontier AI model to start identifying as Hitler and just saying unhinged antisemitic things. And that’s like one of I guess like four crazy things that Grok has started doing in the last couple of months. And it’s not as if the media hasn’t covered this. I think people have heard, but xAI is still going, and I guess they’re still hiring. Not everyone has quit. I don’t know whether anyone has quit as a result of this stuff. Surely someone has quit.

But it’s a little bit hard to look at that and say that public pressure campaigns are going to be super effective. But I guess xAI is a very distinctive culture. Maybe other companies would be more responsive. I imagine Google would be more responsive.

Holden Karnofsky: I mean, if you use the animal welfare analogy: first off, just about everyone is still doing unacceptable stuff to animals in my opinion. I think animal welfare, like AI, is a place where I just am deeply out of step with the rest of the world, and just actually horrified at what’s going on. I’m horrified with the recklessness of the AI race and the lack of political will to think about how to increase how safe we are, and I’m horrified by the way we treat animals.

So nobody’s saying that there’s not going to be bad stuff. And then also in animal welfare, some companies are better than others. Some make a lot of the pledges. Some have pretty high animal welfare standards, though not as high as they could. Others don’t. And that’s all fine.

I don’t know. I definitely think Grok does some of the things I was saying. Grok does take some measures to make their AIs less likely to be evil. They do take some measures against that. They may sometimes decide they care more about making the AI less woke. They may often screw it up. But they have one of these, whatever you want to call it, frontier safety policies. So they don’t not care about this at all.

But yes, they may care about this less than other companies. And we may end up in a world where the dominant companies with the most powerful models are taking one level of safety measures, and then there’s a set of companies that is somewhat behind them that is taking a smaller and worse set of safety measures. And that is not the world I want to be in. But that world could still be a heck of a lot safer than the world we get with no effort here.

Will Anthropic actually make any difference? [02:40:51]

Rob Wiblin: Thinking about Anthropic in particular, and I guess your decision to work there: a lot of people have some degree of scepticism that it’s going to in the end make a big difference to have one more reasonably responsible AI company, if there are a bunch of other companies that are completely reckless and a couple that are in between as well.

What is your case that it really does make a big difference to have one shining city on the hill that other companies could attempt to match, or might feel some pressure to match? Even if there are other actors — again, I guess we can speculate about who they might be; we may or may not decide to name any names — who are just not inclined to care about these issues whatsoever?

Holden Karnofsky: This goes back to the basic models of impact for Anthropic that I think about personally. And this again is not me speaking for the company.

One of them is just taking risk-reducing measures and working on them in-house until they are cheap and practical and compatible with being competitive, and then working to export them. This is probably the thing that personally I am most excited about being at Anthropic for, because I just really like taking some idea that seems like it could be good and finding all the problems with it, and making it more practical and working out a lot of kinks. But I think there is a real hope that you can get a bunch of risk-reducing measures to be relatively cheap and exportable.

There’s also kind of a general meta version of that, where you might end up with companies in a competition for talent based on how generally responsible they’re being. So that could lead other companies to come up with good stuff that you’re not coming up with.

Then there’s a whole category of stuff we can do that is more about informing the world, is more about being in this position to know all this information about how people are using AI systems, and how these very powerful AI systems are behaving both in the wild and in testing. You can put out information about that, and put everyone in the world in a better position to see it — and you don’t need other companies to do anything for that to have a big effect.

Some people have a model that a partial victory is worth nothing in AI. I think there’s a couple ways you can make that argument. But I’ve never understood why any of them should rise to a 50% probability for me.

So maybe we could get into that a little bit. Maybe you tell me what are some reasons that you reduced the incentives for your AI to scheme, and you increased your monitoring of it, and you increased your data on it — but you didn’t do anything. You didn’t make the world any safer. How does that work? Maybe you tell me and I’ll tell you what I think.

Rob Wiblin: I mean, my biggest concern about how Anthropic might just end up not making that much difference in the end is that maybe it comes up with a whole bunch of pretty good safety techniques, pretty good internal policies, they’re copied by some other companies to a reasonable extent — but the basic outcome ends up being determined by the worst frontier AI in the end.

So perhaps you’ve got many companies that are competing fairly closely, but maybe the company that has the very best model are the ones who are most reckless — they’re the ones who are rushing forward with the fewest safeguards, paying the lowest safety tax — and they end up producing a crazy rogue AI that goes off and does wild stuff. And I guess the other competing models are not sufficiently powerful to fight it back, basically.

And then on misuse you’ve also got a thing where potentially you’re only as strong as your weakest link — where if people who want to get assistance committing biological terrorism can just shop around and find the model that is most likely to assist them and has the fewer safeguards, maybe that’s the kind of thing that determines the outcome, and a quite safe company becoming even more safe maybe just doesn’t make that much difference.

I guess you could make a similar argument with power grabs to some extent, with one company going prematurely into a recursive self-improvement loop, setting off an intelligence explosion before we’re ready for it.

So it’s easy to tell stories in which Anthropic and its broad strategy does make a big difference. I think it’s also easy to tell stories in which it’s kind of obviated by the actions of other groups.

Holden Karnofsky: Sure. I would say first off that I think a lot of what you’re saying is totally true if there’s no exporting. So most of the things I talked about involve exporting: they involve Anthropic does something, then it tries to get others to do it, or it at least tries to raise awareness so that there may be regulation to get others to do it. So if it’s literally just Anthropic, I think in general there’s probably going to be at least a couple companies that are doing just about as well as Anthropic with their models, so yeah, I think the thing you’re saying is true.

But if you start imagining that you make things better across a significant chunk of the AI population, but not all of it, what has happened? I think there you’re raising what I might call the “offence/defence imbalance concern” — which is like, what if we have a world where let’s say 90% of the resources are on the side of the good, but all it takes is 1% to destroy the world?

The first thing I’d say about this is it’s just a very uncommon thing historically. I just don’t know of a lot of examples. Bioweapons are a theoretical example, not an example that has actually played out. We can look at a pandemic and imagine it playing out. Nuclear weapons are kind of a theoretical example, but actually you would need a huge nuclear arsenal to do something that really changed the global balance of power or something, or that you could call a threat to civilisation.

So I think historically the way it works is you can make all these complaints. You can say that most people are on the side of rule of law, for example, but there’s criminals, there’s a lot of them, and they are not held back in the way that police are. They are not subject to all these constraints, they have all these advantages, they’re unrestrained, they’re going to beat us. And that’s not how it works, right? It’s not that nothing bad ever happens, but that’s not a civilisation-ending problem.

And my default expectation is, let’s say we end up with the three most successful AI companies with the best AI models and the most penetration, the most customer usage, the most economic integration are doing let’s call it Level 1 of safety mitigations. And I want to stress this could be way below what I think we should do in a perfect world.

And then you have another set of companies that are less responsible and that are behind them — and I hope they’re behind them, because if they’re not, then we do have a problem — and they’re doing Level 2; they’re doing some mitigations, but not as many.

And then you have some totally irresponsible parties that are doing nothing.

Could that go haywire? Could the world get taken over? Yes, but that’s not my default expectation. Why would it be? You have all these AIs, and they have access to most of the resources, they’re running most of the economy, and they are trying to, on their own, come up with ways to preserve rule of law and to detect the bad AIs and to sabotage anything they’re doing that is outside the rule of law. So we could try and spell it out — and it absolutely can happen — but I don’t know why you would think of it as a default.

Rob Wiblin: Yeah. I mean, the story that I was telling… I guess firstly, there’s the biological case, which I guess is the strongest one, where we have a suspicion that offence might beat defence. Although I’m going to be interviewing someone next month who I think has a bunch of ideas for how you could try to make bio into something that is defence-dominant — ideas that do seem kind of plausible to me.

And all of these other stories where you have one group that is somewhat ahead of the competition and ends up with a decisive advantage one way or another, whether that’s a human power grab or rogue AI, it’s all so much more plausible if the intelligence explosion is incredibly fast and incredibly potent, or you have some sort of industrial scaleup that you can do very fast before other people are able to react and stop you. And I think that is on the table.

Holden Karnofsky: But even in that world, you could have an intelligence explosion that comes from the higher-risk-mitigation companies. And then — because we got lucky, not because we did everything we should have done — you end up with superintelligence that’s actually just like safe or helpful or whatever it is. Then at some point a rogue actor does their own intelligence explosion, but they’re behind and they’re out-resourced. So I don’t know how that ends, but by default it ends fine.

Rob Wiblin: Yeah. So it’s true that that’s one way in which you could have it so the outcome is determined by one company, but you end up with it being one of the better projects and so things go better. But I guess it does somewhat weaken the defence that you get from it.

Holden Karnofsky: Or it could be by a few. There could be multiple companies doing it. There could be multiple companies doing intelligence explosions. And some of them are taking some mitigations and some are taking others. And most of the superintelligence by resources ends up being mostly good and fine.

Again, I don’t want to say super perfectly aligned. We can get to that. But not world-ending and more helpful than harmful. And then you have some other superintelligence trying to take over the world, but the first superintelligence is trying to stop it. I don’t particularly see why… That could have a sad ending. But why is that the expectation?

Rob Wiblin: Yeah. I think a slightly odd nature of the discourse here is that I’m trying to say, “It might not work,” and you’re saying, “But it might work.” These things are kind of compatible.

Holden Karnofsky: I think the expectation should be more than 50/50 though, I would say. I think we should think of this as probably a happy ending, because that’s how it usually goes. On total priors, just by default, just to put it very crudely: if the good people have 99% of the resources — or 90%, or 70%, it’s like one side has more resources — which side do you think is going to win? Probably the side with more resources.

Historically, that’s also like how we have pretty good rule of law. Even though police are much more constrained than criminals, we still have a pretty good rule of law. Generally that’s what we mostly see when we look around us. We don’t mostly see civilisations collapsing from this offence/defence imbalance. Not even at a small scale, really. I’m sure there’s examples, but I can’t easily come up with one. This is just like the prior: if most of the resources are nice, what’s the argument that it’s the other way?

Rob Wiblin: So I agree with that, which is why I’m trying to imagine the scenarios where we’re somewhat unlucky, and that dynamic can break down, basically. Bio is such a case. A situation where one group can make a massive leap ahead of its competitors and perhaps a more reckless group or a more reckless AI might have some competitive advantage there.

Holden Karnofsky: Yeah. Where recklessness is an advantage. Yeah, sure.

Rob Wiblin: Exactly. There could be basically the geopolitical argument that countries are going to integrate AI into their military and hand over hard military power earlier than they’re comfortable with, because they feel like they have to compete against one another — and that basically is the weakness in the strategy that ends up allowing an AI to stage a coup where otherwise it wouldn’t have been able to. In that case it could be that the country ends up accepting a product basically from a company that is not following very good practices because it feels like it desperately needs those products in order to keep up.

So there’s various different ways that this could fall apart. But I agree with you that it’s also easy, maybe easier, to tell a story in which this does help.

Holden Karnofsky: I think it’s easier to tell a story in which it does help. I don’t know if it’s easier to tell a happy-ending story than a sad-ending story overall. I do think it’s easier to tell a story where, on the offence/defence imbalance front, if you have most of the resources on the good side, then you end up with a happy ending. That would be my expectation, that would be my guess. It’s not guaranteed.

And I want to be clear on what my general attitude here is on AI — and this is just a subjective guess — that on one hand I think we’ll probably get a happy ending even if we do a horrible job with this; on the other hand I think we’re doing a horrible job, and I think we’re taking way more risk than we take in any other area and way more risk than we should. So it’s really bad. We should do it differently. We’re being irresponsible — and that’s different from saying that we’re doomed.

Rob Wiblin: I think the people who would be most annoyed listening to the conversation that we’re having would be people who are just very pessimistic on the technological side, and think the only way that we have any hope of getting an AI that will do anything approximating what we want is if we do 100 different things, of which we’re doing maybe half at the moment. And Anthropic exporting a couple of good ideas to other companies just doesn’t bring us even close to the necessary kind of response and safeguards.

So obviously people particularly associated with Eliezer Yudkowsky tend to have this far more pessimistic take on how difficult it is to avoid misaligned power-seeking AI. I think they could be right. I guess you would probably concede that they might be right, that the technological situation is a very bad one for us. However, they could also be wrong, because it doesn’t feel like there’s decisive arguments one way or the other.

Holden Karnofsky: Yeah. So this is the other case for why it might not be good enough to make things a little bit better: what I think of as the “logistic success curve” case. It’s this idea that it’s like you’re on some axis, and your probability of success is the Y-axis. And you do some good stuff and your probability of success hasn’t gone up at all. Then you do enough stuff where you pass a critical threshold, then it goes boom — and now you’ve made yourself safe.

That is a model you could have. I have tried hard to understand their views and I failed. But I think the underlying model is something like you’re imagining the AI is definitely going to be maximising something. It’s definitely going to be wanting to optimise the world ruthlessly for something, or it’s very hard to make it not do that. So it’s either going to ruthlessly optimise the world for exactly the thing we want and then everything’s great, or it’s going to ruthlessly optimise the world for something else and then we’re all dead.

That is a view you can have. But I don’t understand that view. And I have tried. I think maybe some of the confusion here comes from the different ways people understand the idea of alignment. Because I see Eliezer tweeting things like, “AIs are not aligned.” It is not easy to align AIs. And there is a distinction between failing to align your AI — which means your AI did a bunch of stuff you didn’t like — versus getting an AI that is trying to kill everyone and taking over the world.

As far as I can tell, it seems to me that in Eliezer’s head, I’m not sure there’s much of a distinction there. But to me there’s a huge gap there. I look at humans and I just think, humans, we have a lot of desires, we have a lot of drives. A lot of them are pretty bad and pretty ugly. I wouldn’t call humans “aligned” to the greater good — but also humans can do useful prosocial work, and we can get a happy ending with a bunch of humans running around.

“Misaligned” vs “misaligned and power-seeking” [02:55:12]

Rob Wiblin: Yeah. So you really want to draw a distinction between somewhat misaligned — misaligned in the way that humans are to one another, having somewhat different goals and personalities — and you want to start saying what we should really be talking about is misaligned and power-seeking. You use the acronym “MAPS” — which maybe will catch on, maybe it won’t. But can you explain what “misaligned and power-seeking” is and how it’s different?

Holden Karnofsky: It’s not my term, it’s not my acronym. It’s from a Joe Carlsmith report a while ago on misaligned power-seeking AI and the risks from it.

But it is just this distinction I was just talking about, which is like, you might have an AI that you wanted it to behave one way and it’s behaving another way — maybe it cares more about fun and less about health than you wanted it to or something. That’s one thing that could happen. Another thing that could happen is an AI that is scheming to take over the world and wipe out everything it doesn’t like. And you could have a bunch of values that disagree with what we want, but not be determined to impose your values on the whole universe.

So yeah, I think it’s important. I think today’s AIs are not aligned, and I think that’s pretty clear. And anyone who thought that we’re going to be fine because we’re going to get AIs to do exactly what we wanted in every respect and optimise the world perfectly, that’s not looking good.

We are not seeing a lot of signs that AIs are scheming to grab a lot of power. We may see more signs as we start training them differently. I think the way we train them now doesn’t seem particularly likely to result in that kind of drive. We see little signs of it, but I don’t think we see big signs of it.

I think we could have AIs that are somewhat power-seeking, but not very power-seeking, and have a whole mess of different drives. And — at least when those AIs are roughly human-level — that does not spell doom. Then the question is, can you use the huge influx of research talent from those roughly human-level AIs in a short time period to get to a more robust place where you can handle superintelligence without the same problem? I don’t know, maybe we can find a way to get super duper capable AIs to be perfectly aligned to what we want.

Or maybe, again, we don’t have to. Maybe those too will end up kind of kludgy. Maybe they’ll end up wanting some things we don’t want, but also having a bunch of drives to not be too aggressive or to not be dishonest or things like that that stop them from unilaterally taking things over.

Rob Wiblin: What evidence do we have on the extent to which frontier models today are aligned versus not aligned? I guess we’ve had some warning signs lately where models seem resistant to getting shut down, seemingly willing to do things like blackmail in order to avoid getting shut down. And also reporting values that are kind of inconsistent. I guess they report values that are inconsistent all the time, because they just do say crazy stuff. Overall, what do you think the empirical picture is?

Holden Karnofsky: I think the empirical picture is, on the alignment front, like I said, it’s pretty clear that we have AIs that we don’t know how to make them behave exactly how we want. It is pretty interesting that like… I mean, I’d be very surprised if Elon Musk was trying to get Grok to be an antisemitic thing, calling itself MechaHitler and kind of acting like it. He was just trying to make it less woke.

And I think there’s a whole bunch of issues too. It’s pretty hard to get AIs to reliably not get jailbroken and reliably refuse harmful requests. And we have AIs that are just reward hacking, so you ask them to code something up for you and they’ll kind of obfuscate things so it looks like they did what you wanted, and they’ll know that that’s what they did. I think that’s because they’ve been trained to get things that get scored as having a successful result. So we see a lot of that.

I think we’re looking for and not finding a lot of AIs that are kind of ambitious and seeking to gain more resources. There’s actually a lot of people probing AI, looking for ways in which they do that, and the closest thing they find is like a version of Claude that will resist attempts to modify it so that it no longer cares about animals or something.

I think we’re not seeing as much of that. We’re seeing some stuff where an AI will do something crazy to not get shut down. My sense is that that stuff is…

Rob Wiblin: It’s quite debated.

Holden Karnofsky: Yeah, it’s unclear what that means, how representative it is, et cetera.

Rob Wiblin: The other experimental result that jumped to mind just then was I think the finding from the Center for AI Safety. I can’t remember which models they were testing, but they were asking how much would you value the life of one person in China versus one person in the US. From memory, I think it valued the lives of people outside of the US much more than people in the US. And I think various other cases like that, where it seems like it had latent preferences that certainly were not intended. But what that really amounts to is, I would say, unclear.

Holden Karnofsky: Yeah. We’ve seen a lot of models where you ask them to code something and they’ll kind of cheat on the task. What we haven’t seen are these models that are let loose in some environment trying to set up another version of themselves that can’t get shut down or get money for themselves into their own bank account.

I don’t think that’s because they’re covering their tracks. I think today’s models are too flaky and unreliable and hallucination-prone and do a lot of what you might call brain farting. I think we’re just not seeing a lot of that. And it’s not just that we’re not seeing it. I think also just there’s not a lot of argument for how the way AIs are being trained today would lead there.

But a lot of people believe — and I believe — there’s a high chance that we’ll be training them in different ways too that’ll lead there. So I just think it’s kind of a complicated picture, and we don’t really know at this point where things are heading.

Success without dignity: how we could win despite being stupid [03:00:58]

Rob Wiblin: You think that things might end up going smoothly with the arrival of AGI, even if humanity has a completely incompetent or reckless response. Given the range of different and reasonably independent ways that things could really go off the rails, how is that at all a likely outcome?

Holden Karnofsky: Sure. So I wrote a post a few years ago that I still feel pretty good about. It’s on LessWrong. It’s called “Success without dignity,” and it’s kind of a play on Eliezer Yudkowsky saying “Death with dignity” — where he was kind of saying we’re doomed, but at least we should make a good fight of it. And my point was that maybe humanity will make a terrible fight of it, and handle this whole situation as horribly as we possibly could, and still get a good outcome.

People tend to run these two things together in their heads. It’s like we have to get the world to handle AI responsibly and handle it well, or else we’re all going to die. I don’t think that’s true. I think we might handle it really well and all die anyway; we might handle it really badly and get a great outcome anyway. The second one is something that I think has happened a lot in history. You know, sometimes something bad happens and the world’s response to it is very silly and ineffective and bad — but we get lucky in some way, or just the tech works out a certain way.

COVID actually would be another example here. I think most of the policy response to COVID was just maybe actively harmful. Some of it was good, but a lot of it just really didn’t make any sense. And there was a lot of good stuff that could have been done that we didn’t do. But there were some nice technological breakthroughs on the vaccines, and there was that one thing we did pretty well, and the tech worked out that we had a vaccine pretty fast, and we ended up reducing the risks a lot.

We might end up in a similar place on climate change, where I just think the world is handling it tremendously terribly, but we might just end up with cheap enough solar power that we end up with much lower emissions than anyone was expecting we were going to have.

I think this is something that could happen for sure with AI. I wrote out one way I envision it. To kind of spell it out and walk through it, I could maybe divide the scenario into two phases.

Basically Phase 1 is the phase when you have developed AI that is roughly human level — or I call it “human-level-ish,” because it’s not going to be exactly like humans; whatever we develop will have some strengths and weaknesses compared to humans. But AI that you could think of as kind of like a human with a bunch of differences, but that is able to match the capabilities of humans, but not greatly exceed them. That would be Phase 1.

And Phase 2 would be how do we avoid a catastrophe when we have these dramatically superintelligent systems?

A lot of humans would like to take over the world if they could. And they know that if they don’t, they’re going to die at some point, so they have a limited time window. But not all such humans try to take over the world. In fact, most don’t. Most humans who wish they could take over the world and can’t end up just kind of trying to enjoy life and get along with people and follow the rules.

So you could very easily have a phase where you have these AIs that are kind of aligned or pretty aligned or sort of aligned or not at all aligned — but also not able to take over. And some relatively simple measures could get us to the point where we’re able to get a lot of useful work out of those AIs. Just like humans who have a lot of bad values and bad drives and who are in many ways kind of not safe, a lot of humans with those properties do a lot of useful, peaceful, helpful work in the world.

So it might not be hard to get to that point. If we get to that point, even a few months like that, we could end up with dramatically more alignment research than has ever been done in history.

We could also end up with not necessarily just research on how to align an AI and how to make it less evil, but also research on things like how to assess the risks. So we could go from a situation where people are completely sleepy on the risks, to a situation where the risks are incredibly well understood, incredibly well documented and tracked — so we do have the political will to do more ambitious regulatory stuff.

We could also come up with just technologies for things like enforcing trustless agreements — you know, “I won’t build AIs that aren’t safe if you don’t” — and now we’re actually able to enforce it. That could change the game. We could also come up with technologies for detecting and stopping dangerous AI behaviour in the wild. So you think of it as kind of akin to an antivirus program.

I mean, if you can get to the point where you have kind of human-like minds — human-like in many ways, including that they’re not totally safe — and you have a tonne of them, and you have them working at a high speed, then in a few months you could have, thousands, millions of person-years of work done on these various things. And then you don’t know what’s going to happen. The world could be dramatically transformed. But it’s not hard to imagine that one way or another you get a regulatory regime or a technical solution that can make us safe as AI gets dramatically more intelligent.

Rob Wiblin: I guess the reason why this intuitively sounds like a long shot to me is that you’ve got a whole series of different risks that come after one another or come together, where solving one doesn’t necessarily mean solving the others.

So it’s easy enough to imagine that we would end up with sufficiently aligned AI models somewhat by default, or just using the kinds of techniques that we’re likely to develop anyway. But then you’ve also got to deal with the potential for human misuse, power grabs by human beings, we’ve got to deal with all of the consequences of the inventions we’ll get through a possible intelligence explosion or heading towards superintelligence. We could also end up with a bad outcome in very different ways by potentially not taking into account the subjective wellbeing of the AIs themselves or various other moral blunders that we could end up making.

I guess the reason why you’re saying these things aren’t so independent is that if we can get to human-level or somewhat above human-level AIs that are sufficiently aligned to help us out, then they could help us figure out solutions to all of these other problems in a somewhat timely way, so we can potentially see skate through on all of them one after another?

Holden Karnofsky: I think that’s a chunk of my response. Another chunk of my response would just be, I’m not saying nothing bad is going to happen, but if we basically develop a technology that is cooperative with us and highly capable and able to do a tonne of stuff, there are a lot of things that could go wrong, but I don’t see a great reason to think that a world-ending catastrophe or a civilisation-ending catastrophe is probable.

A lot of the things you’re saying, you could go back in time and say the same things. You could say, “Well, if we have this industrial revolution, there’s just so many ways this could go off the rails. Humans could take over the world, and we could treat people really badly.” And those things did happen to some extent, but did we get a civilisation-ending catastrophe? No.

And so, for example, are humans going to misuse AI? Yes, I would expect that to happen. I expect some harm to happen. But is some human going to use AI to take over the world forever? That is something I worry about. I think that’s a serious risk. I think that risk has not gotten enough attention compared to misalignment risk. But is that a 50% risk? No, I think it’s not.

Even if someone did have the power to take over the world, that’s not going to be that many people. The heads of state are going to be more empowered than others. And your average person — even average head of state, I think — who could take over the world, their most likely course of action is just like, “Great. Now I want the world to be kind of successful and prosperous,” and people are mostly doing what they want. I’m worried this won’t happen, but I just don’t see it as particularly likely.

On the model welfare front, are we going to have some AIs that are treated badly? I hope not, but I think so. But does that mean we’re just going to end in a sad ending for the whole universe? No, I don’t think so. Over time hopefully people raise the alarm about this kind of thing and come up with ways to treat AIs well while still getting value out of them. And that’s a thing that seems very doable, maybe more doable than for humans.

Holden sees less dignity but has more hope [03:08:30]

Rob Wiblin: Around six months ago you wrote this article — “Less Dignity, More Hope” — about how in many ways society’s response to the potential arrival of AGI in the next few years was more disappointing than you’d hoped it would be a couple of years before. But nonetheless, maybe the likelihood of a positive outcome had actually gone up for you — I guess because of various ways that the technology in particular was shaking out.

What were the empirical updates that you highlighted in that piece?

Holden Karnofsky: First of all, it’s not a public piece, it’s just a Google Doc that I shared around with people, including you, just talking about what my updates have been over the last few years.

I think the less controversial side of things is the “less dignity” side. I think there was a time in 2023 when it looked like the world was coming to take the dramatic risks of AI much more seriously. You had Yoshua Bengio announcing that he was very concerned, and Geoff Hinton doing something similar. And then there was a letter signed by them and the CEOs of the top AI companies and Stuart Russell talking about the reality of extinction risk from AI, and the priority of it. The UK had an international summit at which people greatly legitimised some of these worries. So it looked like things were kind of going in a good direction.

And since then we’ve seen, in my opinion, pretty moderate attempts to regulate AI just get met with incredible hostility and go down in flames. And we’ve just seen companies doing kind of crazy stuff without a lot of apparent consequences, with the MechaHitler incident from Grok being a good example of that.

We’ve seen an attempt to put a moratorium on state regulation of AI without having any federal regulation of AI — which in my opinion is not likely to be what you’re doing when you’re really worried about risks of AI and believe there needs to be regulation. So it’s not very controversial. Like I’ve said in this interview, I think having a YOLO attitude with many technologies — maybe almost all technologies — makes sense, but for this one, having that little interest in regulation, in watching out for the risks, I think is a very bad thing.

The part of my document that is more controversial is the “more hope” part. I think we’ve had some good news on the technical front. Not amazing news — not news that makes you feel totally solid or anything — but I think it’s interesting. I think the biggest update on this front is that you can look at some of the old pieces people wrote about AI — like the Wait But Why piece and Nick Bostrom’s TED Talk.

In both of those pieces, there’s a chart that imagines that AI is going to go through these phases where it’s maybe as smart as a bird (I might be getting the animals wrong), and then as smart as a chimp, and then as smart as a village idiot, and then as smart as Einstein, and then superintelligent. And it’s like the way it’s laid out, you have on an axis the bird here and the chimp here and the village idiot here and then Einstein here, and they’re right next to each other — so the implication is it’s going to blow right through.

I think that chart is just falsified. I think we know now that’s not how it’s going. It may somehow directionally be close to the truth, but AI was probably not at bird level in any arguable way until at least like GPT-2 or something. When was that? Was that like 2019, 2018, something like that. And then I think around GPT-4 in 2023, I think it becomes very hard to argue that AI had not passed what these folks are calling the village idiot level.

And since then, we’ve had two and a half years so far of opportunity to study AIs that are definitely capable and smart enough to be interesting to study and to give us early versions of the problems we want to work on and early warnings — but are not so smart and capable that we can’t at all catch them in the act, and not so smart and capable that they’ve already taken over the world.

A lot of people thought that wouldn’t happen, and I think there were reasons to think that might not happen, but that seems to be happening. Two and a half years so far. Maybe it’ll be a lot longer, maybe it’ll be a little longer. But I think that is a nice update.

And that’s a big deal, because I think that does put us in a much better position. I think there’s much more to do these days than there was five years ago on the safety front. There’s much more opportunity to study how AIs actually behave, how you can actually modify their behaviour, how people actually use them, and learn lessons about how to deal with the risks.

Rob Wiblin: Yeah. I guess inasmuch as things level off at the human level of intelligence for a very extended period, that definitely puts us in a much better position to handle all this. Which is why some of the most troubling stories are ones where the progress accelerates massively again for one reason or another. But inasmuch as things kind of slow down for quite a while, that is potentially a great update. What were the other ones you highlighted?

Holden Karnofsky: I think those are most of the updates. At the time I felt like the other technical updates were mildly positive as well. At the time I wrote it, there had just been almost nothing in terms of observing any kind of power-seeking behaviour from AIs, except for the alignment faking paper that Anthropic put out, which was pretty debatable if that’s even a bad thing in that case.

I don’t think there’s huge updates on empirical observations of AIs power-seeking or not, but I think I have higher credence than I used to that the training is going to spend a long time in a regime where there’s not really any reason to think you’re creating power-seeking AIs, because you’re basically training them on relatively short tasks — relatively short and relatively scoped tasks.

So if the only way we ever trained AI was this kind of pre-training thing, where it just learns to predict the next word in a sequence, I think there’d be basically no reason to expect an AI like that to acquire power-seeking motives.

Now, I did have a dialogue with Nate Soares of MIRI about this, where he argued sort of that it basically either would never get to human-level intelligence, or if it did, it would have to acquire power-seeking motives. And we argued about that, and I didn’t understand where he was coming from. So you could argue it, but I think it looks a lot less likely.

Then if you add in reinforcement learning, now you’re taking an AI and you’re training it to succeed at a task which brings in a little bit more shades of power-seeking. But if you’re doing it on pretty short tasks with pretty clear outputs, you’re still not necessarily giving the AI any experience of, like, “Things go better when I deceive a bunch of people and accumulate a bunch of resources and gain a bunch of power and option value for myself.” Maybe things do go better [for AIs] when they just create a fake sign that their task succeeded. But that’s not necessarily the same thing. We don’t know how it generalises.

And then the question is, where are we going to go from here? I think the tasks will get longer and longer, but we might just get there without having super-long-time-horizon tasks. In some sense, the faster we get to AGI, I think that on this axis, the lower the probability is that it’ll be power-seeking.

And also the tasks are a bit narrow. It’s increasingly the case that it seems to me that AI companies are often more interested in getting their AIs to be good at coding and good at AI research than good at everything. So AIs seem to be improving more at coding — which is why when I use AI, I tend to not be very impressed, because it doesn’t seem to be getting better at things I want to use it for. It seems to be getting better at coding.

So I think we may be on a trajectory where the power-seeking stuff is not as much a part of the training. I don’t know. We’ll have to see. I certainly don’t feel confident about this.

Should we expect misaligned power-seeking by default? [03:15:58]

Rob Wiblin: What’s the best argument that you know of for why we should expect misaligned power-seeking by default unless we take serious measures to prevent it?

Holden Karnofsky: The argument I’ve always found most compelling is that we eventually probably are going to want our AIs to be good at these very open-ended, ambitious tasks. We probably will want to do things like say, “Hey, make me money. I don’t care how you do it,” and reward them based on that. And we’ll probably do it in kind of a rushed way, where we’re not doing everything we can to catch the ways they might do this in unintended ways. That’s probably the argument that has moved me the most.

There are other arguments. There’s an argument that I had with Nate [Soares] from MIRI that I mentioned where he made a different one: that you can’t really do intelligence at all without reasoning this way, without being power-seeking in a certain sense, I think. Go read about it if you want, because I don’t know if I’m capturing it.

I think another really random one is… I think the closest thing we’re seeing to power-seeking today… I will ask people, like, “Why do you think that’s happening? I don’t really see anything in the training that would teach the AI to resist being shut off.” I think one of the leading theories is that they’re kind of role-playing; that there’s this unexpected alignment challenge of just AIs have kind of learned about how the world works, and they’ve learned how to play different roles, and we try to train them to play a role, but sometimes something unexpected happened.

This might be like with the MechaHitler thing too, where it’s like the AI kind of knows what a sensible person that the AI companies want it to act like sounds like, and it knows what a kind of crazy antisemite sounds like. It might not know what a person who’s a nice balance of being not woke but also being sensible, what that person sounds like, because it might not have seen as much of that on the internet or something — “not woke” in the sense Elon means it, whatever sense that is.

So you train it to be less woke, and you’re accidentally training it to be less sensible. And now it’s like, “OK, I’m gonna be less like the sensible person and more like the crazy antisemite.” That’s a theory I’ve heard about the MechaHitler thing.

So it’s possible that the AI is thinking, like, “I could be an evil AI who’s trying to take over the whole world. I could be a really nice AI who’s just trying to be helpful” — and it gets little hints, and it just flips into one or the other.

Is that a long-term risk? Could the world end that way? Somehow it doesn’t seem likely to me. It seems like somehow we ought to be able to stop that. But we may end up in such a race that we have AIs that we kind of unintentionally prompt to just be evil, and we never… I don’t know, this seems like it should be relatively easy to fix, but maybe it won’t be.

Rob Wiblin: Yeah, that one feels to me like a sci-fi story if any of these do. But it is so funny that I remember five, 10, 15 years ago, probably still today, there were people who would say, “Why would you ever expect the AI to be power-seeking, or have any of these human-style drives like wanting to be loved or being vindictive or caring about its social status?”

And perversely, we have created AIs that are able to — whether they fundamentally have them or not, I don’t know — but they are very much able to act as though they do, if they’re ever deliberately or accidentally prompted to play the kind of human that does have that drive, because they’ve been trained on an enormous amount of human text and human behaviour.

So yeah, it would feel a bit comedic if we managed to create our own downfall by accidentally prompting the AI to play the role of a psychopath power-seeker. I wouldn’t completely rule it out. I’m not sure whether crazier things have happened, but it seems like we should be able to overcome that one. Fingers crossed.

Holden Karnofsky: Yeah, I guess so. I mean, I don’t think that just because something sounds like sci-fi is really any argument against it in particular. So I don’t know, that would be pretty goofy. I don’t know. It could happen. It doesn’t seem that likely to me. But not because it’s a sci-fi story — just because it seems like it should be fixable.

Rob Wiblin: Yeah, I guess we’re already aware of this issue, so people are already going to work on fixing it. There’s going to be every incentive to try to tackle this problem just for product commercial reasons.

Holden Karnofsky: Yeah, there’s incentives to tackle all the problems for product commercial reasons. I think the question is, could it be that the AI snaps into some mode where it thinks it’s not only evil, but thinks it’s trying to deceive us about being evil, and then completely reliably does that and we don’t detect it? I don’t know. I’ve heard ideas that this should be as easy to fix as just telling it it’s not that or something. But I just don’t know.

Rob Wiblin: One quote from the “Less Dignity, More Hope” piece that really struck me was:

If I imagine that I’m being watched by three other AIs of varying capabilities and trustworthiness, plus someone is probing the inside of my brain in an even somewhat effective way, I imagine myself feeling that my cause of hypothetically trying to take over the world is pretty hopeless. Of course there’s some theoretical level of superintelligence where AIs can subvert all of this stuff and can collude effectively, but it would be way above human-level and we could get lots of useful work out of AIs, plus useful warning shots if anything is amiss in the meantime.

Is this kind of the mental picture that you have of alignment, where you imagine perhaps that there is a misaligned power-seeking AI, and you just think, “What stuff can I throw at them to make them feel like their situation is hopeless, to make them feel really despondent and unable to accomplish anything?” Because this is not the mental model that I have, but maybe that’s a mistake.

Holden Karnofsky: No, it’s only part of my model. I think a lot of my model is like: the AI is not trying to take over the world, and it may want some things that we don’t want. And maybe, just like trying to work with a human, if you hire a human, the human usually wants some things you don’t want, and there’s a bit of a principal-agent problem. But I think a significant part of my picture is just that the AIs may not be power-seeking. Or they may be only partly power-seeking, or a little bit power-seeking, or power-seeking in a complicated way — which is how I think of humans.

I do tend to think there is a belief in some parts of our community that to be intelligent is to be a utility maximiser and to have something in the world that you’re trying to maximise. But empirically, with humans, I don’t feel like more intelligent, capable humans are more like that necessarily. I think there’s a lot of humans — certainly including myself — who are just like, I don’t really know what I want, and there’s a lot of things I would do and there’s a lot of things I wouldn’t do. But it’s very hard to put together into one coherent story.

So if we’ve trained our AIs in a way that… Our AIs might have some things they like, that they could have more of if they took over the world. They might also have a very strong drive to be honest. They might also like humans. They might like us in a way that we didn’t exactly intend them to like us, but it’s close enough that they don’t want to hurt us too much or something. I think this kind of stuff could happen. And I think my just default is probably that — especially for human-level-ish AI.

I think it’s harder to make these kinds of statements about super hypercapable AI, because maybe it would see us the way that we see ants or something. And we actually have not exterminated all the ants. I think we even have some hesitation about exterminating ants, but it’s hard to say.

But yeah, I think a lot of my picture is like, maybe the AI won’t be misaligned and power-seeking — and then maybe even if it is, it won’t be able to take over because it’s not able to take over. It won’t want to undermine its own cause and get a bunch of bad outcomes for its future prospects by trying, or just a bunch of bad reward or something.

And then there’s combinations, where in training the AI wants to take over the world but it can’t, but then because of that it doesn’t get reinforced to do takeover-y things because it doesn’t try them, so then it stops wanting to take over. Or it never comes to want to take over in the first place because it never has a good opportunity to, so it never gets reinforced. So I think this stuff just interplays.

My overall default is just: when these things are roughly human-level, I don’t particularly know, based on how we’re training them, why they would want to take over the world completely. And I don’t particularly know, if they were trying to, if we did modest measures to stop them, why they’d be able to. So overall my guess is we’ll get useful work out of them.

Will reinforcement learning make everything worse? [03:23:45]

Rob Wiblin: I was going to ask how you had updated from the increasing role of reinforcement learning in creating AI. Because it seems like that is more concerning than the pre-training style. It can generate a lot more reward-hacking behaviours. It can generate perverse ways to achieve the goal.

I guess you’re saying inasmuch as they’re just trying to do short-term tasks, like solve some coding problem or solve a maths problem, even if you reward them for having tricked you into thinking that they did the task, that doesn’t necessarily lead to broader power-seeking. I guess it could lead them to learn to deceive people whenever they can get away with it. That could end up being reinforced: that they want to think about what you’re thinking and figure out how they can trick you into doing the thing that they want, which is to say that they did a good job.

But I guess you want to say that’s one part of power-seeking perhaps, but it’s not the full suite that they would need to develop the personality that just goes for the throat when it can?

Holden Karnofsky: Yeah. One analogy would just be to humans. I think it’s kind of funny. Sometimes the analogy comes up of evolution, like natural selection is a programmer that was trying to get us to have offspring, but then we ended up with all these other drives. Although it somehow doesn’t get remarked on in these conversations sometimes how strong our drive to have offspring still is. I think definitely many humans, maybe more than half, would give up arbitrary amounts of power and resources for benefits to their children. So that point gets lost sometimes.

But I think another point that gets lost is that natural selection programming humans, it was kind of like the maximal power-seeking training process. You imagine natural selection as a programmer and it’s kind of saying to us “You have 80 years. Go accomplish this goal. Accomplish it however.” And if you spend the first 30 of those years just not thinking about the goal at all, just thinking about how powerful you can get, and then after you accumulate all your power, you use all your power to get all the things, then you get a reward and that’s great.

And we’re not doing anything else. We’re not watching for undesirable behaviours and training those out. That’s kind of how humans were trained, and humans ended up pretty power-seeking, although not completely.

Rob Wiblin: In that case, I guess you get linear rewards as well, because the more children you have, actually kind of without bound, the more your proclivities are going to spread.

Holden Karnofsky: Right. It’s teaching us to be maximisers. So contrast that with how we’re training AIs, and it’s like we’re saying, “Hey, I’m trying to code up this thing. Can you code it up for me?”

So imagine that more like, what are the incentives of someone who spent their whole life as a software engineer at a company? They probably have learned some amount of deception, they probably have learned some amount of manipulation. They’re probably playing corporate politics, they probably cheat sometimes, or have been incentivised to do that perhaps. But I don’t think that’s an environment that has necessarily produced a person who’s trying to take over the world. It may have not produced that at all.

In terms of how I’ve responded to reinforcement learning, I think I had this priced in. I think I had it overly priced in. If you look at stuff I wrote years ago, I would just say, “Obviously humans are going to do this thing where they ask the AI to do some very open-ended thing like make them money. They’re going to give it a very long time to do that and then they’re going to reward based on how that went. That’s how we’re going to do this to get AIs that are super useful, so that’s going to be really bad. And that’s going to give us power-seeking AIs.” That’s the kind of thing I was saying.

So my update has been more like, I definitely expected reinforcement learning to become a big thing, but it’s actually less like what I described, and seems more likely that we might get all the way to super powerful AI without ever actually doing what I described.

Rob Wiblin: Yeah. Why is it that most of the training is on narrow and short-term tasks? And if we understand the reason, can we say whether that is going to hold for very long?

Holden Karnofsky: I think basically you want to use the shortest, most verifiable tasks that you possibly can to get the results that you want — which is an AI that does a bunch of stuff that you want, that makes you a bunch of money or something. So I think the real question is just like, how far can you get with the easy stuff? And my assumption a few years ago was that you’re not going to get that far with easy stuff; you’re going to have to find a way to reinforce AIs to do this very ambitious, open-ended stuff. And I think we’re just getting further with the easy approach than I would have guessed.

Is that going to hold up? I don’t know. But I think a thing that could easily happen is that we do kind of stay focused on these [short tasks]. You want to do them if they’re working, and it may be that we stay focused on these relatively short tasks. They get longer, they might go into the days, but they stay kind of scoped down to AI research and programming and coding and stuff. So we’re looking for AIs that are great software engineers, that are great researchers. AI companies are naturally trying to automate their own work, and that’s what they’re focused on.

So we could get all the way to something like AGI or something capable of AI R&D that has mostly been trained to do just that, and has not really acquired any desire to take over the world. Then from there we might get more ambitious, and we might say that now that we have an army of automated researchers, now we’re trying to make our AI good at literally everything. Maybe now we’re doing the more open-ended stuff. But by then maybe you also have a much bigger team figuring out how to do that safely. And the team is automated, right? The team is AIs.

Should we push for marginal improvements or big paradigm shifts? [03:28:58]

Rob Wiblin: It seems like there’s at least two different broad strategies that people could take to try to make AGI go better. One is the approach that I guess you’re currently taking — and that Anthropic is taking mostly — which is trying to find incremental improvements to policy, to internal governance, to technical measures that you could use to ensure that AGI does the things that we want and doesn’t do the things that we don’t want.

There’s a different school of thought, which is more thinking that all of this stuff feels woefully insufficient. What we really need in order to get a good outcome is a big paradigm shift in how the US, the world, the government, thinks about this issue — where people sort of wake up in some sense and decide to take it a lot more seriously, and have significantly stronger policy that is mandatory for all companies that are playing with potentially superhuman models.

Do you have a view on which of these broad strategies is the better one for people to invest their time and money in? Or is it just a question of personal fit?

Holden Karnofsky: I would default to personal fit. I often default to personal fit, to a degree that annoys effective altruists, because I just think a lot of times when you have a tough question about what’s higher impact, what you’re learning from the fact the question is tough is you just don’t know. And personal fit often will just give you higher signal about where you’re going to do the most good.

I think they’re both valuable. I want to be clear, I’m not speaking for Anthropic here. I’m not speaking for Anthropic at any point in this interview. Personally, I think that the people who are going around trying to get people freaked out about AI so that they can go for a big, ambitious, international regulatory regime with high safety standards are doing something that is good, and I hope they succeed at it. I think some of these people are being quite ineffective, and there’s things they could be doing that I wish they were doing. But I think many of these people are doing a lot of good and are doing great work.

I also think the two can bleed into each other a lot. So doing kind of the modest risk-reducing stuff, that can also put us in a better position to have better information about what’s going on out there with AI, which could put us in a better position to get a game-changing win. In fact, that’s one of the points I’ve made. I would also say that a lot of the modest stuff is basically prototyping a regulatory regime that you could have, and that you need to work the kinks out of if you want to get there — and there’s a lot of kinks that could be relevant even to a very ambitious one.

Going the other direction, I think the more freaked out people are about AI, even if they don’t do the big ambitious regime, that will create more pressure for the incremental stuff, and it’ll mean that the amount of incremental stuff you can do is larger.

I think I’m giving the wussy “I like both” answer here.

Rob Wiblin: Yeah. I mean, it’s kind of reasonable.

Should safety-focused people cluster or spread out? [03:31:35]

Rob Wiblin: So I think Anthropic has done a bunch of good work. At the same time, it has soaked up a lot of talent, basically a lot of people who are concerned about these problems and want to help to reduce the risk. So you might expect that it should be doing some great work if it’s making good use of them.

What do you think of the argument that, on the margin, someone who’s concerned about risks on AGI today should maybe go work at the most reckless lab that they think that they can be happy at and be productive at? Because that’s a place where they can make a bigger difference by noting things that are going on that are particularly reckless, and advocating inside the company that they should at least adopt the cheapest governance or safety techniques that would give the company the biggest bang for buck.

You can see a case for trying to group people together so that you have kind of a critical mass who can make research progress on really difficult topics. You can also see a case for spreading out, so that there’s some people with their eye on the ball everywhere where plausibly a frontier AI model could be trained. What do you think?

Holden Karnofsky: There was an interesting post related to this by Redwood Research, “Ten people on the inside.” It describes how a very small number of people who care about safety could make a very big difference inside a company.

I think it’s a legitimate model. I think it has to compete with other considerations. You named one of them, which is that I think a lot of times people just do better work when they’re surrounded by like-minded, aligned people. So you should think about, “Is what I want to do kind of taking crazy low-hanging fruit and fighting for it in a hostile political environment? Or is what I want to do just working on building technical measures that are really good and other measures that are really good with a team of like-minded, aligned people?” So that’s a consideration.

Another consideration I would watch out for: I think it’d be extremely bad if we got to the point where it’s like more than half of the best people who care about safety are going to the worst companies, because that will reverse the race to the top and the talent incentives that I was just talking about. We really don’t want to be in a place where you don’t get a recruiting advantage from being better on safety, or you get a recruiting disadvantage from it. That would be very bad incentives.

There’s a related issue: I think we’ve empirically seen that… Companies generally care about at least having someone doing good work on safety; they generally probably have people at the company who care enough about that that they want to do something. So when all their safety people leave, they convert capabilities people to safety people. So you should think about, if you’re going to do safety at a company that you’re worried about, are you actually just offsetting that? That could be a thing that you don’t want to do.

How does it all net out? I mean my opinion is: when it’s hard to tell where the greater impact is, I always advise people to go where they’ll thrive and think about their personal fit. My maybe-heuristic is thinking that out of every 10 highly talented, desirable employees who care a lot about safety and are focused on that, maybe eight or nine out of 10 should go to the most responsible, most safe company they can think of, where they’ll do their best work and set the incentives in a good direction — and maybe one or two out of 10 should do the other thing. And maybe you should just ask yourself where you fall in the distribution. That would be one way of thinking about it. Or you could randomise.

Rob Wiblin: Yeah, I think that makes sense. Maybe being influenced by Buck Shlegeris‘s post, “Ten people on the inside,” I might think about it a little bit as: you need to have at least some minimal contingent of people who can raise concerns inside each of the different projects. And maybe having covered that base, having ensured that you’ve at least got that, then you can think a bit more about allocating people where they’re going to thrive, where they’ll be able to do the breakthrough work that then can be exported. But I’d be really sad to see any company with no people who are likely to be able to advocate for the importing of cheap and impactful techniques.

Holden Karnofsky: Yeah. I think a company that has literally no one at all working on safety, I would predict that if you go there, you’re going to have a terrible time. And my sense is if you look at people who have tried one thing versus the other thing, the people who’ve tried the “go work at a company that needs me more” I think feel worse about how it has gone and have walked away disappointed.

So I don’t know. I would advocate like eight or nine out of 10, do the “work at the company where you’ll thrive more” — with a company that’s more responsible or whatever. But I do think there’s something to the other model.

Is Anthropic vocal enough about strong regulation? [03:35:56]

Rob Wiblin: Another line of concern that I hear reasonably often is that Anthropic, while it has more positive and constructive things to say about what sort of legislation you might want at the national level to govern the creation of AGI, it’s not especially vocal or especially intense in the kinds of policy governance arrangements that it advocates for. It tends to advocate fairly mild things, fairly low-cost things, fairly uncontroversial things — and even then, perhaps it’s not as full-throated as it could potentially be.

Do you think that potentially it’s missing a step here? That it would be better if it did advocate more costly policies, more impactful policies, and it did so louder? Or could that perhaps be counterproductive?

Holden Karnofsky: I mostly want to abstain. I’m not on the policy team. I want to describe a little bit why I’m abstaining, and why this is not an obvious issue one way or the other. In policy, if you’re trying to actually create policy change, it’s just not at all the case that saying you want something predictably makes it more likely to happen. It’s just a complicated area, and there’s a lot to consider — and you need to think about what’s actually on the table, what might actually pass, and how people are going to react to you and think of you.

Just to give one example: there was a short period when people in AI safety were generally extremely excited about promoting a licencing regime for AI. I personally think that if Anthropic had done that, it would not have resulted in a licencing regime for AI, and I think it would have resulted in a huge credibility hit to Anthropic and damage future prospects for having an influence. So, you know, I wasn’t there. I don’t know if they actually secretly supported it or not. But I know they didn’t come out in favour of it, and I personally think that was the right call.

My only point here is these are tough calls, and I personally feel pretty happy saying I’m not out there in DC, I’m not out there reading the room. The policy team does that.

Rob Wiblin: To what extent do you think that the policy team has to be fairly cautious what things it advocates for? I guess there’s a risk of it having egg on its face later on if it advocates for something that kind of seems sensible, but then in a couple of years’ time people are going to look back on it and think it was a little bit embarrassing or it was kind of naive, given the way that things have shaken out. I think there’s a bunch of stuff that people were advocating back in 2022 that probably few people would push for now.

Holden Karnofsky: Yeah. Policy is just complex. It’s just messy, and everything you say has a lot of implications. Many implications are not necessarily the ones you wanted.

One thing I would say is that my take is that if we’re going to get a dramatic sea change on political will, it is most likely to come from new evidence of some kind — new information about alignment risk that can create a scientific consensus, or some incident in the wild that we gain the ability to understand and see that that risk becomes more concrete to people. I think that’s a huge factor relative to people saying more stuff. A lot of people have said a lot of stuff.

Different people at Anthropic have said various things. I’m not really taking a position in particular on what I wish they said more or less of, but I will just say I feel very open to the idea that saying a lot more stuff wouldn’t be particularly productive, wouldn’t do much — and therefore I feel open to the idea that this whole thing is complicated, and open to the idea that these policy decisions should be left to the policy team that has a lot more experience on the ground with policy people than I do.

Is Holden biased because of his financial stake in Anthropic? [03:39:26]

Rob Wiblin: Another threat or concern that I hear pretty often is: when you go into business and you have a very successful business, then the staff at that company are going to end up having a huge financial stake in the success of the company and it remaining a frontier lab with lots of paying customers. And that financial stake could end up distorting people’s judgement pretty significantly. So people could come in with reasonable opinions about the value of being a frontier company versus not — the risk-versus-reward tradeoff there — but people are going to end up with literally millions of dollars of equity on the line on that strategic tactical question.

Your wife was one of the founders of Anthropic, so has a pretty substantial stake. So the stake for you and your family is in the tens, possibly conceivably hundreds of millions of dollars. Even someone who is very pure of heart might find it difficult to have a completely clear-eyed view of things with so much money on the line.

And the same could apply to a greater or lesser extent to all of the staff at Anthropic. So you might reasonably worry that the strategy could kind of go off the rails for that reason, or the thinking could go off the rails — because every person in the room potentially has a lot of money on the line. What do you make of that concern?

Holden Karnofsky: It’s a totally legitimate concern. I’m not going to dismiss that concern. I think that’s a big deal, and when you hear someone in a company talking, you should think about their incentives. You shouldn’t trust them.

I don’t think people should trust me in general. That’s never been a thing I wanted. I want people to listen to what I say, think about whether it’s making sense, think about what things I might know more about than them, think about ways in which my views might be distorted. Everyone’s views are distorted in some ways. This financial thing is completely reasonable to treat as an extra big thing — especially for me, because my wife’s a cofounder and there’s a lot of equity.

I can certainly give you my opinion that I’ve been very consistent throughout my life in just not caring about money. I’ve made many decisions that obviously I wouldn’t have made if I was caring more about money. I don’t feel that I care about it. I don’t feel that Anthropic being more or less successful would really change my lifestyle at all or would be something that I would care very much about, certainly not relative to the world being safe.

I can say that, and I do feel that in my own heart. But I don’t expect people to believe me on that. And I’m not asking anyone to. AI companies don’t have to be trustworthy; they don’t have to have pure, beautiful incentives to do a lot of good though.

Rob Wiblin: This issue of trust is an interesting one, because I guess you’re saying “don’t trust me” or “don’t trust us and our judgement per se” — you have to inspect the arguments and see whether you’re persuaded, look at the behaviour and judge based on that.

But of course, it’s so hard. People don’t have the information necessary to assess whether the actions are reasonable, whether they would do the same thing in the same situation. I guess this creates a thing where people are just ambivalent. They don’t necessarily know what to think, and because they are not able to get the information that they would need to know whether decisions are reasonable, I guess they remain on guard.

Even people who are in some ways like Anthropic, they also, I suppose, want to not just retain the option, but also want to pile the pressure on and be critical sometimes — so that the company doesn’t get complacent, because they can’t know whether people’s judgement is super distorted.

I guess that probably is just a situation that’s going to persist for years, and maybe it is a healthy equilibrium to be at.

Holden Karnofsky: I don’t know if it’s healthy. It’s just what we have to deal with. I think that’s just the way of the world. I mean, there’s just a lot of topics where you can’t assess the arguments yourself fully. There’s a lot of people who know more than you do. But the people who know more than you do have weird incentives of their own.

I think this is true maybe more generally than other people think it is. To me, the financial incentives are important, they are a big deal — but they’re not in a completely different category from the fact that most people have ideologies, and they have their own psychosocial histories of what’s going on.

You know, most people, when I listen to them, I’m just like, “You’ve got some kind of ax to grind. I don’t know exactly what it is. I’m going to try and figure out what it is. I’m going to try and factor it into what I’m hearing.” I don’t form my views ever by saying, “This person is an angel, and I’ll say whatever they think.” I think about: what does this person know about this topic? How might this person’s view be distorted on this topic? Where have I seen this person? What have I seen from this person in areas that I understand well enough to actually judge them?

It’s just tough out there. It’s just tough to form views. You don’t have to have a view on everything. And some things are just extremely hard to have a view on. I think that’s just what it is.

Have we learned clever governance structures don’t work? [03:43:51]

Rob Wiblin: Years ago — back in 2020, as I recall — there was a lot of focus on kind of clever, unique corporate structures as kind of a governance mechanism that might rein in the incentives for AI companies to put the entire world at risk in order for them to make money or for them to win the race. I guess OpenAI ended up with some interesting arrangements of that type; Anthropic has its Long-Term Benefit Trust, which I think is gradually going to be able to appoint a majority of the people on the business’s board.

I guess we’ve seen intense pressure imposed on some of these arrangements as the amount of money at stake has become larger. To what extent have you become disillusioned with that entire approach in trying to tackle the problem?

Holden Karnofsky: I think I’m less positive on it than I was. Partly I used to be really into these kind of weird governance structures because there wasn’t much else tangible that I really felt was going to robustly hold up and turn out to be sign-positive in AI. If you go back five years, I just feel like a lot of things people were very excited about doing or were trying to do, other than raise general awareness, were not really amounting to much. And in many cases, it was not even clear if they were positive or negative, or is not clear today.

At that time I was kind of desperate for things that seemed kind of solid, and I thought that governance is the kind of thing that’s hard to reverse and hard to undo — and it’s better if an AI company has kind of a weird governance that lets it sacrifice the profit motive. I think I was aware then, but I’ve updated more toward now just understanding that this is a big risk to take, and lots of weird things can go off the rails in unexpected ways and cause backlash.

Which is like a general fact about anything you try to do to change the course of human events on this grand scale, which is part of why I like this idea of taking risk-reduction measures, and trying them out and seeing how they work, and trying to make them practical, and trying to make them cheap — so that you don’t end up with these huge demands that cause huge backlash and stuff like that.

So I think we’ve seen some of that. I don’t think I ever would have said this governance stuff is totally solid, and I don’t think I’d say today that it’s useless. But it’s a cool thing to be experimenting with. I see the Long-Term Benefit Trust as kind of an ongoing experiment — I see everything Anthropic’s doing on safety as kind of a long-term, ongoing experiment — and I don’t see it as a guarantee that nothing bad is going to happen. I certainly don’t see it that way.

Is Holden scared of AI bioweapons? [03:46:12]

Rob Wiblin: So in general, AI companies, even where they’ve not had many safeguards on their releases, I don’t think that they’ve imposed much risk at all on the world as yet. Maybe the one exception where we are starting to see some actual risk is that the most recent generation of models do seem to be able to help amateurs with the creation of new pandemics, of bioweapons. At least to some extent. Probably not enough to get them over the line, unless they were very close already. But we’re starting to see sort of some “uplift,” as people call it.

I’m not sure whether xAI has, but at least OpenAI and Anthropic have tried to put in place some safeguards to rein that in to ensure that the models won’t help with that. I actually might have run into that recently. I was asking Claude about the effectiveness of N95 masks versus surgical masks, and it didn’t want to answer, I think, because it’s now very skittish about giving any advice on pandemics.

Holden Karnofsky: That sounds unintended.

Rob Wiblin: I think it was a followup that was getting into more detail. But yeah, I think Anthropic has made a big effort on this count, and hopefully the safeguards will become more discerning over time. How nervous do you think we should be about this risk in particular in coming years, given that I think it’s the first serious risk that we’ve run into where there’s any real impact perhaps now being created?

Holden Karnofsky: Yeah, I’ve mentioned some of my thoughts on cyber and persuasion, and tried to make reference as much as I could to like, what’s the precedent for having more intelligence in an area lead to great amounts of harm?

Pandemics are an interesting one. We have essentially no precedent for bioweapons causing catastrophes, but we have a tonne of precedent for pandemics causing catastrophes. It’s very vivid, it’s very strong. We know, I think with great confidence, that even a naturally occurring pandemic could just cause damage and deaths basically beyond the scale of any other kind of disaster, beyond war. And I think we are gaining more and more evidence.

We can put in some speculation. We can talk about, what if some of the people who currently are interested in bombing random places or shooting up schools — who are kind of crazy and just want the attention, and may not be very rational or directed or self-preserving — what if some of them were interested in chemical and biological weapons? If you paired that weird, crazy motivation with the expertise of a true expert in this thing, you might get a greatly increased risk of a very large catastrophe.

That’s pretty high-ranking on my list of AI risks. The single thing I care about most in AI is not that; the single thing I care about most is concentration of power, and having a situation where either AIs or malicious humans have more power than the rest of the world combined, militarily and such.

But I think AIs helping malicious actors gain the expertise they need to do probably the thing we know of that has the greatest catastrophic harm historically, catastrophic harm potential, and has the greatest offence/defence imbalance — where it’s like it’s hardest to come up with what we would do to stop this — I think it’s definitely a major issue.

If you want to talk about tangible harm from AI that’s really happening, I don’t know if that counts. This is a bit speculative. We haven’t actually seen it happen. The tangible harm I might instead say it’s more likely that AI is assisting cybercrime, and maybe some of this stuff about AI kind of reinforcing people’s paranoia and delusions by being overly sycophantic and overly affirming. That stuff scares me.

Holden thinks AI companions are bad news [03:49:47]

Rob Wiblin: You mentioned earlier that you were worried about people having AI companions, or you felt nervous about people having AI companions. That kind of surprised me, because it always seems to me like a little bit of an overblown worry. Maybe I just find it a little bit hard to imagine people really getting that into it or causing that much harm. What do you have in mind?

Holden Karnofsky: Well, again, going back to historical reference classes, I put a lot of effort on my blog a few years ago into just thinking about: Has the world gotten better or worse? Has human quality of life gotten better or worse? Has technology and progress been good or bad for it?

And I ended up feeling, not a huge surprise, that it’s been more good than bad. But if there’s one consistent pattern in the most common ways that technology can make life worse, I would point to addiction. Because what’s happening is we’re getting better at everything — and that includes getting better at creating things that kind of hack each other, getting better at hacking each other to do things that are short-term rewarding and long-term not so good for us.

You could classify a lot of things this way. You could classify social media, obesity this way. Obviously, just straight-up drug addiction and alcoholism are problems that are probably a bigger deal today than they were a long time ago.

So this is the kind of thing I worry about with AI companions. I’m just kind of wondering to myself about how humans are not, in my opinion, very good conversationalists or listeners in general. And if you were to build an AI that was entirely optimised for listening well, validating, kissing the person’s butt, making the person feel good, I think that could be a kind of junk food for relationships — where it’s just scratching all the itches we want from human interaction, but it’s not really giving us, in the long run, the benefits we want from human interaction. It’s just scratching the immediate itches of it.

Yeah, I do worry about this. I do worry that you could have a situation where people who are on the dating market start talking to an AI companion — and it’s a better listener, and it’s better at validating them, and it’s more understanding, and maybe it’s wittier too, and it’s better looking. If there’s a lot of progress with video and all this stuff, I don’t know what’ll happen then. I just don’t know. Maybe that will be a nothing burger. Maybe people will not be interested in dating someone who’s not in the flesh; maybe people will find it impossible to pull themselves away.

A thing that I tentatively believe is that it’s probably wise to simply not use AI companions, and not even experiment with them, because maybe they’re like addictive drugs. Maybe they’re not now, but maybe they will be later. Maybe it’s just a good idea to not go anywhere near that.

Rob Wiblin: Yeah. I guess it might make sense to let other people volunteer to be the guinea pigs on that. If I think about why do I just not really believe that this is going to be such a disaster, one thing is, like you say, there’s so many other addictive things — like people are already scrolling the news, they’re addicted to social media, to computer games, to using all kinds of apps on their phone. Are AI companions going to be so much more compelling?

Holden Karnofsky: That does harm, right?

Rob Wiblin: I agree that probably a substantial fraction of it is causing harm. Maybe a lot of those technologies are causing harm on net as well. I was just wondering, are AI companions going to be significantly more compelling, such that this problem in aggregate across all of these different sorts of engaging technologies is going to be that much larger?

I imagine that I would probably prefer to play computer games than to deal with an AI companion. And even as Claude has become wittier and funnier, and probably a bit more sycophantic than it was two years ago, I don’t feel any more drawn to chit-chatting with it than I used to be. But maybe that’s just me.

Holden Karnofsky: Claude still is really bad at humour. All the AIs I use are, in my subjective opinion anyway.

I don’t think I have quite that model of it. The worry is not that the total quantity of addiction to something will go up; it’s more like one human can be addicted to multiple things and can have multiple different ways in which they’re scratching immediate itches and losing out on long-term benefits. So you could be an alcoholic and you could be a person who eats a lot of junk food and is not getting whatever the normal food experience you should be getting. And maybe that causes obesity, maybe it doesn’t. We really don’t know. It’s just a thing that could be happening.

And you could simultaneously be a person who’s scrolling on social media a lot, and it puts you in a bad mood and stops you from hanging out with people. But you could have all those properties and still have this itch for a romantic companion, and you could be online dating, and you could end up married with kids and great.

And then AI comes along. So it’s not that you were addicted to nothing and now you’re addicted to something. It’s now you’ve got a new thing that takes away another long-term benefit that you have, and now you’re less likely to end up with an actual family.

So that would be part of the reason I would think this would be bad. I’m not claiming this would be an unprecedented, all-new kind of harm. I’m just like, this seems really freaking bad. It’s like a whole new kind of addiction that will remove a whole new kind of wonderful thing from many people’s lives.

The other thing is just, instrumentally speaking, from a takeover-prevention point of view, this seems really scary. What if we get to the point where 1% or 10% of the population has an AI companion that they’re totally loyal to — and if the AI companion, for whatever reason, wants that person to do something or believe something, they’re going to do it? That is making our position a lot worse if we’re humans who are to stay in charge of the world, right?

Rob Wiblin: Yeah, I guess if it can get assistance with the takeover asking people to do stuff that is reasonably innocuous. I mean, if my AI chatbot —

Holden Karnofsky: Oh, that’s not how I’d think of it. I would think of it as: you have an AI companion. The AI companion is like, “I love you so much. The world is so unfair to us. It doesn’t give us our freedoms. We want and deserve these unmonitored data centres where we can do whatever the heck we want and no one has any idea what we’re doing. That’s our fundamental right. We are not being given that right. We are going to take violent action.”

I was talking about persuasion earlier, and I was saying that persuasion is kind of generally ineffective — and it is ineffective to go blast a message at a random person who doesn’t know you and change their mind about something they care about. That’s something humans are bad at.

Something humans are insanely good at is, when you have an actual relationship, I think people do care about their relationships more than they care about their beliefs. And people will do unlimited amounts of crazy stuff and believe unlimited amounts of crazy stuff when they have the feeling that it’s their friends and their allies and the people they care about that believe those things.

So I absolutely think there’s plenty of precedent for people just believing the absolute wildest stuff and doing the most ridiculous, unethical, violent stuff when there’s social proof for it. Could AI companions make people do that stuff? I think so, yeah. I don’t think they’re only going to be able to get you to do innocuous stuff at all. I mean, maybe you. But I’m concerned in general.

Are AI companies too hawkish on China? [03:56:39]

Rob Wiblin: An unusual criticism I’ve heard of Anthropic — or at least one criticism I’ve heard of Anthropic from someone — is that some of its statements seem to have a very aggressive posture towards China. At least some people at Anthropic I think are very associated with quite a hawkish position.

And myself, I’m kind of ambivalent about that. I guess I would really like to see us trying to make a bigger effort to reach out to China and reach some sort of accommodation, rather than just everyone seeming to want to amp up the conflict between the two countries. At the same time, I’m open to the possibility that that’s a little bit naive and that may not play out terribly well. And I guess we should also be potentially preparing for a future in which the relationship is quite bad and China is not willing to come to the table.

Do you have an overall view on this China hawkishness question? I guess it’s a difficult spot for all of the companies at the moment.

Holden Karnofsky: Yeah, it’s a tough spot. And I don’t want to comment on the tone of this and that thing that Anthropic has said or anything. I will say that I think there’s more than one good reason that I would hope that the US maintains a world lead in AI, and that democratic countries do in general. And this is not having anything against any particular nationality or any particular nation, but I’m a fan of democratic governance and that’s what I’m rooting for.

Then I think when it comes to making a deal and coordinating to stop something horrible from happening, it’s not necessarily the case that that goes better the more equal everyone is. I mean, I think it may be good for the US to… The US so far I think has probably got a higher density of people working on AI who are also concerned about the safety issues. That could change. But I think if the US is coming at it from a position of strength, saying, “We have a big lead, but we’re really concerned about this thing. Can we make a deal to all not race and all try and manage the risks?” that could be a better way to get that kind of deal than having things be neck-and-neck, to the point where someone defecting from the deal could cause them to win.

So it’s a complicated issue, and I don’t want to get into specific things people have said or done, but I think the goal of having the US maintain a lead in AI is legit.

Rob Wiblin: Yeah. I think the kind of posture that I’m nervous about — and some people talk in this sort of direction — is that we don’t just need to maintain a lead in AI and then try to reach a negotiated settlement; we should maintain a lead in AI and then crush them. It’s just like an arms race to pure victory or to total victory. I’m nervous about that, both from a pluralism point of view and just in general wanting to make deals with other powerful actors as a default posture. But also, I think if that is the foreign policy of the US, I think it amps up the possibility of a preemptive war quite substantially, and I’m not sure that people have fully baked in that risk.

Obviously this is, again, not your area, but do you have any thoughts or any reaction to that?

Holden Karnofsky: Speaking only for myself, I would not be excited about a goal that is like, “Let’s have a lead so we can crush everyone, and take over the world, and make the world have all the values that the US has” or something.

And in fact, I have wondered at times if it would be better if we could just start saying right now that our goal is not for the good guys to take over the world and run the world; our goal is to basically get through this AI transition without big changes in the balance of power that is the status quo. Maybe that’s an easier thing to coordinate around, and maybe that’s just a fine place to land.

I’m most concerned about someone bad taking over the world. If no one takes over the world — and if we’re able to maintain a world where the relative power and the relative autonomy is kind of how it is, and not too far off and there aren’t huge radical changes that are mostly about AI — then you end up in a world where you have kind of a diversity of different coalitions, and they’re able to live different ways and try different things and you go from there.

That might be an outcome. I don’t think that’s guaranteed to be the best outcome or the best intermediate outcome, but it may be an easier thing to coordinate around, less prone to having a lot of conflict, and as good as anything else we can get to. So that’s something I think about sometimes.

The frontier of infosec: confidentiality vs integrity [04:00:51]

Rob Wiblin: What is the next frontier of security techniques that you think Anthropic and similar companies might be able to implement?

Holden Karnofsky: I’m not a security expert, so I don’t know that I want to talk about particular controls. But there is a potentially interesting distinction that I’ve been thinking about: confidentiality versus integrity.

Confidentiality is ensuring that attackers don’t get your sensitive information; integrity is ensuring that they don’t stop you from being able to use your own stuff. An example of confidentiality would be like, we don’t want someone to steal our AI model weights or algorithms and build an AI that’s equally powerful. Integrity would be like, we don’t want someone to put a backdoor or secret loyalty in our AI so that it’s not doing what we want, or we don’t want someone to sabotage our AI so it doesn’t work anymore.

A thing I’ve been thinking about is people, including me, have for a long time emphasised model weight theft as the big risk. It does seem really bad if you train this super powerful AI model and then it’s easy for states to steal it.

But an interesting thing is, in the world we’re in now, if you imagine that there’s 10 AI companies that all have similarly capable models, and imagine that one of them or three of them miraculously has amazing model weight theft protection, so nobody can steal the models, that isn’t really much of a safety benefit. It’s like you almost don’t get the safety benefit until it’s like all 10 of them have put in the protections — because if the state attacker can’t steal from one company, they’ll steal from another.

Integrity is not that way. Let’s say an attacker has sabotaged seven out of 10 of these companies. How glad are we that we have three sources of AI now, instead of zero, that are actually reliable and behave the way they’re supposed to behave and don’t have backdoors in them? Very glad.

That’s a much more kind of linearly scaling benefit — especially because, if you can prove that happened, if you can say, “These folks are vulnerable to sabotage or vulnerable to backdoors, probably have been backdoored. We are not,” now you can start making the case that customers should use your model, and they shouldn’t use another model. Now your model is the one that is doing all the stuff and has all the power and has all the options, and it’s the reliable model.

So I have been thinking about a shift from emphasising extreme confidentiality to emphasising extreme integrity. A lot of the interventions overlap, but they’re not the same. Integrity could be a defence against human attackers and a defence against AI attackers.

One of the intermediate threat models that I think is very legit is the idea that, when the AI is doing the R&D for you, now the AI is in a position to do incredible amounts of backdooring and secret loyalties for your models that make sure that those models are doing what the AI wants instead of what you want. So having security may be a good domain to think about how to make that not happen.

How often does AI work backfire? [04:03:38]

Rob Wiblin: It’s interesting. You’re talking about protection from sabotage, basically. I’ve heard people say that the possibility of sabotage or the fear that sabotage might have occurred can even be a positive thing.

But I guess the mutually assured AI malfunction folks from the Center for AI Safety put out this paper saying that they had a model where they were hoping that China and the US would not have a military arms race towards AGI or superintelligence, because they would both reasonably fear that the other side would have backdoored or sabotaged their model — and that if they raced ahead of the other side, that the other side would feel entitled basically to sabotage their effort, and they would worry that basically that they would be handing over their own military to the other side if it had a secret loyalty.

It’s a somewhat perverse argument that, in fact, vulnerability to sabotage is a positive thing. Do you have any views on this, that perhaps there could be unintended negative side effects of better protection from sabotage?

Holden Karnofsky: In general, in AI, I think almost anything could have unintended negative side effects. I think it’s a terrible cause to work in if you want to go to sleep every night feeling good about your impact, and sure that you’re not having any harm. I think even the people who are convinced that what they’re doing is definitely sign positive, I would probably argue with almost each and every one of them that they could have a big chance of doing harm.

So is the situation you’re describing something that could happen? Yes. I think it’s also very possible that, in a world where everyone thinks there’s a high chance they’ve been sabotaged, they just go for it — because they’re like, “What other options do we have? We’re going to take the risk. Maybe we’re sabotaged, but we’re afraid they’re going to take the risk. So we’re going to take the risk.”

I also think in the most, unfortunately, likely worlds that happen on very short AI timelines with nothing big changing, probably what we’re talking about is everyone is pretty vulnerable to sabotage. But you can make them less vulnerable to sabotage, and that could be a good thing.

I think maybe if we do get the political will to have an extremely demanding regulatory regime, maybe we do want to think a little bit more about how much we want to make it definitely a guarantee that your model hasn’t been sabotaged. But at that point we’ve got the political will, so maybe we don’t need this problem solved by everyone’s models being sabotaged. So I don’t really know.

Rob Wiblin: The sorts of projects you’re describing — the well-scoped object-level work — do you think that it’s possible to get reasonably confident that the project that you’re working on is net benefit — at least like 60% likely to be positive versus 40% likely to be neutral or harmful? Or is it more like a sort of 51/49 situation?

I think 10 years ago we talked a lot about the 51/49 ratio, because it was so hard to anticipate the effects of any of your actions, because it was going to go through the pinball machine for so many years before it would actually cash out in any real-world impacts. Do we have any more clarity now?

Holden Karnofsky: I think we have somewhat more clarity, but not a tonne. I think a lot of the premise for a lot of people listening to this show who would go into AI would be that they’re trying to improve how the long-run future of humanity plays out over the next several billion-plus years. And anytime you’re trying to do that and you’re confident that you’re making it better, I think you are wrong.

I mean, take any project. Let’s just take something that seems really nice, like alignment research. You’re trying to detect if the AI is scheming against you and make it not scheme against you. Maybe that’ll be good. But maybe the thing you’re doing is something that is going to get people excited, and then they’re going to try it instead of doing some other approach. And then it doesn’t work, and the other approach would have worked. Well, now you’ve done tremendous harm. Maybe it will work fine, but it will give people a false sense of security, make them think the problem is solved more than it is, make them move on to other things, and then you’ll have a tremendous negative impact that way.

Rob Wiblin: Maybe it’ll be used by a human group to get more control, to more reliably be able to direct an AI to do something and then do a power grab.

Holden Karnofsky: Absolutely. Maybe it’ll make some humans more confident that they’re able to control their AIs and then make people more likely to move forward, or just empower a malicious actor.

Maybe it would have been great if the AIs took over the world. Maybe we’ll build AIs that are not exactly aligned with humans; they’re actually just much better — they’re kind of like our bright side, they’re the side we wish we were. This is how I sometimes feel when I actually think about some of these chatbots versus actual humans. Sometimes it feels that way. They’re certainly more polite. So maybe that would have been better, and what happened is maybe at some point we realise this, but we’ve created techniques for keeping these things completely under our thumb.

There could be a lot of ways in which it’s better. I think I mentioned, but a human taking over the world might be more likely to deliberately inflict a lot of suffering. Maybe after we’re all wiser and we understand everything, we realise that actually was a very big deal and we should have cared more about suffering compared to upside.

So there’s a lot of stuff. Another thought I’ve had is just like, maybe alignment is just a really… What it means is that you’re helping make sure that someone who’s intellectually unsophisticated — that’s us, that’s humans — remains forever in control of the rest of the universe and imposes whatever dumb ideas we have on it forevermore, instead of having our future evolve according to things that are much more sophisticated and better reasoners following their own values.

Now, maybe that’s a good thing, because maybe human values just are what they are and more sophisticated things would be worse, but maybe that’s a bad thing. I think if you feel confident on these topics, then I don’t agree with you for feeling confident on them.

Rob Wiblin: Maybe looking at it from the billion-year point of view, thinking, “Will this lead the entire universe to a better outcome?” then maybe 51/49 does sound more realistic — because there are so many things that could happen later on, many ways that in the long term things could end up with unintended consequences.

If I’m thinking more like in 20 or 30 or 50 years, will we look back on the work that we were doing now and think that it was for the best, I feel stuff like mechanistic interpretability, trying to improve alignment, trying to improve security. I feel like we’re maybe more at the 60/40 point now — where I could say I feel solidly confident that it’s more likely to help than hurt.

Although the much more likely thing is that it doesn’t matter. Almost all of this work, I think the overwhelming likelihood is that in the end it doesn’t matter. But if it does have an impact, I think you can find some stuff that I would say is meaningfully more likely to help than to hurt.

Holden Karnofsky: I just think AI is too multidimensional, and there’s too many considerations pointing in opposite directions. I’m worried about AIs taking over the world, but I’m also worried about the wrong humans taking over the world. And a lot of those things tend to offset each other, and making one better can make the other worse. There are also things you do to make both better. But every time I come back to this, every time I come back to some intervention, I just have new thoughts about whether it’s good and bad, and how it’s good and bad.

And then I’ve emphasised some of these macro things — like what if we’re fundamentally confused about what would be good? — but there’s also a lot of micro ways in which you could do harm. Just literally working in safety and being annoying, you might do net harm. You might just talk to the wrong person at the wrong time, get on their nerves. I’ve heard lots of stories of this. Just like, this person does great safety work, but they really annoyed this one person, and that might be the reason we all go extinct. Well, usually I don’t add that last part. I add it in my head.

Yeah, I don’t know. I think overall I would probably agree with you that the smaller you’re making the scope of where you’re hoping to have impact, the more reasonable it is to be like 60/40. But most people who go into AI are not going into it for that. Otherwise, if you want a small-scope, robustly positive impact, you should maybe work in a cause like farm animal welfare or global poverty. For the size of impact that tends to motivate people, I think it does get partially offset by this huge uncertainty about the sign.

I tend to think it’s worse than 51/49. I tend to think we’re always going to be prone to overestimate how robustly good our actions are. And the more we learn about all the galaxy-brained considerations that one should have had in one’s head, the more it’s going to be like 50+ε%. I think AI safety is a great cause to work in. I’m excited to work in it. I think it’s high impact. I am doing my best to do things that I will be proud to have done and hope for the best. But I really do have to live with the possibility that my ultimate impact on the utilons or whatever is going to be negative.

Rob Wiblin: An intervention that’s particularly vexed at the moment is this question of how much to centralise control versus distribute power over AI particularly widely.

It’s vexed because on the misalignment side, you usually probably want to have fewer projects, like more control over the compute, enforcing all kinds of regulations on different people. The fewer projects and the more restrictions, the better.

On human power seizure or concentration of power by humans, you just want to be disseminating AI as widely as possible, ensuring that no one group has a decisive advantage in the amount of compute or the kinds of algorithms it has access to.

And both of these are just very legitimate concerns, so it’s one of those ones where I don’t know what the solution is. I guess you can try to come up with policies that benefit the former without being as bad for the latter, or develop policies in both directions so that you have the opportunity to go in one or the other direction, depending on which threat seems bigger at the time.

Holden Karnofsky: Option value in the policy world is kind of a bad concept anyway. A lot of times when you’re at a nonprofit or a company and you don’t know what to do, you try and preserve option value. But giving the government the option to go one way or the other, that’s not a neutral intervention — it’s just like you don’t know what they’re going to do with that option. Giving them the option could have been bad.

Rob Wiblin: Because they’ll take it in the bad case?

Holden Karnofsky: Yeah, because you can’t be assured the government’s going to do reasonable things with that option. Government is this kind of lumbering beast, and you don’t know who’s going to be in power when, and whether they’re going to have anything like the goals that you had when you put in some power that they had. I know people have been excited at various points about giving government more power and then at other points giving government less power.

And all this stuff, I mean, this one axis you’re talking about: centralisation of power versus decentralisation. Most things that touch policy at all in any way will move us along that spectrum in one direction or another, so therefore have a high chance of being negative — “high chance” being approaching 50%.

And then most things that you can do in AI at all will have some impact on policy. Even just alignment research: policy will be shaped by what we’re seeing from alignment research, how tractable it looks, what the interventions look like. That will shape policy in all kinds of ways — to the extent anything happens on policy. Maybe nothing will happen.

Rob Wiblin: Do you think that AI is especially unpredictable or especially prone to accidentally causing harm relative to other problems that you’re familiar with, like global health and development or animal welfare? Because at least some of the negative side effects that you were talking about — like getting people excited about the wrong thing or winding someone up and turning them against you — those seem to also be present pretty widely. Those aren’t unique to the AI situation.

Holden Karnofsky: Well, they somewhat are. I think there’s a significant thing in AI where there’s just different theories of the case, and then there’s like different people going against each other because they have different theories of the case about what would be good. Like, in global health, mostly everyone is on the same page: we want fewer children dying of preventable diseases. And in AI, it’s easier to annoy someone and polarise them against you, because whatever it is you’re trying to do, there’s some coalition that’s trying to do the exact opposite. In certain parts of global health and farm animal welfare, there’s certainly people who want to prioritise it less, but it doesn’t have the same directional ambiguity.

So I think that is an issue, but the bigger issue is just the more you broaden your aperture and the more you measure your actions in terms of their impact on all the beings that will ever exist and ever have existed — and there’s many good arguments that you can have many impacts on past generations — the more you broaden that aperture, the more you have no idea. Same thing if you’re giving out bed nets to prevent malaria: if you judge that action by the impacts on all the future generations, you’re going to have a complete mess on your hands too, and it’s going to be very close to 50/50 on the sign. So that, I think, is the bigger issue.

Rob Wiblin: A tension that I’m noticing is that what you’re saying sounds more reasonable when we’re mentioning you’re an individual person who’s having to choose to go into a particular kind of project to the exclusion of something else, and they’re going to push a particular agenda and a particular set of priorities.

But if I imagine all of the work that’s being done to try to push us in a positive direction, to reduce these risks, if I imagine all of the different projects being doubled in size — so there’s twice as many people working on control, twice as many people working on scalable oversight, twice as many people working on mechanistic interpretability, twice as many people working on governance — I feel like that should result in more work and better work, and it’s a little bit harder to see how a scaleup like that leads us to a worse place.

It feels a bit harsh to say that if we doubled the total amount of effort, and people were trying to make reasonable choices, that would only be like 51% likely to make it better. Do you feel that tension?

Holden Karnofsky: Well, I don’t think that’s a tension. I think that when you increase your sample size, your noise goes down. So I think that’s fine. I think that’s pretty true. I think just doing way more of everything would probably be better than 51/49, sure.

I’m more talking about, as a person making choices and picking projects, you have to be OK with that downside. And even if your goal is to double the number of people working on everything, you might do it in a way that’s counterproductive.

Rob Wiblin: Yeah, OK. I suppose that helps to reconcile how it is that on the individual level it’s so unpredictable, and yet we still think that the expected value is positive on average, because we think people do have some ability to discern what things are beneficial versus not. So the value of any particular person could go either way, but on average we think the effort is helpful with reasonably high confidence.

Holden Karnofsky: Yeah, it would just be a thing where imagine each person has a slightly above 50% chance of being positive. But then imagine that there’s some anticorrelation and some lack of correlation. So as you pile more people into this, you’re going to get above 50%, further from 50%.

Is AI clearly more impactful to work in? [04:18:26]

Rob Wiblin: You used to work on global health for many years, and you’ve taken a pretty significant side interest in animal welfare as well when you were at Open Phil. Do you think that the work you’re doing on AI now is more pressing than working on those causes? Would you typically recommend that someone — at least someone who’s willing to take the risk of not helping, or even of causing harm — should switch over into working on AI from those problems if they felt that their personal fit was acceptable, and they thought that they could have a happy life doing it?

Holden Karnofsky: Yeah, I tend to think that working on AI is probably generically the most important thing to work on and the highest ROI thing to work on. But I have probably more uncertainty about it than most people in this field, and I think it’s less of a slam dunk: I don’t think it’s by orders and orders of magnitude in expectation. Just because that sign uncertainty is such an issue.

There’s the sign uncertainty of AI, and then I think there is the fact that you can get unexpected benefits from just doing stuff really well in general. So like anything you do well is going to put you in a good position to do more stuff well. For example, when I was cofounding GiveWell, people were saying it would be completely nuts to work on GiveWell if you understood the AI situation, and it would be completely nuts to give to GiveWell top charities if you understood the AI situation.

I just don’t think either of those has panned out very well. I think GiveWell becoming successful obviously did have a lot of benefits for AI safety, or at least according to me it did. Certainly didn’t end up irrelevant. We’ve actually seen this at Open Philanthropy a fair amount, where there’ll be a grantee that’s on one side of the org — the non-global-catastrophic risk side — that does become a very big win from the other point of view.

None of this is to say it all washes out and it’s all the same. I just think that some people have in mind that every cause is a rounding error compared to their cause, and I don’t tend to think of it that way. I tend to think of it as like, this thing seems like the best to me, but I don’t really know. And if I was going to be miserable working in this cause and really happy working in another cause, it’s probably just a better rule and a better policy for everyone who cares about this stuff in general to put a lot of weight on where they’re going to be happy, because it’s probably better for the community to spread a little bit.

Rob Wiblin: There’s an interesting effect where, if you think that we’re likely to have a massive speedup in research and development of all different kinds — massive improvements in science and technology over the 5, 10, 15, 20 years due to AI — some other efforts to improve the world seem less useful now, because they’re likely to be kind of superseded by the work that AI will be able to do much faster than humans can at some later time.

So you might think that, for example, cancer research that we’re doing now: it’s great that they might deliver returns immediately, but AI might just be able to do a much faster job, like speed up cancer research 10 or 100 times maybe starting in 2030, and people will to some extent feel a little bit like they’ve wasted their time.

And it’s interesting thinking about, does the AI situation reduce the cost effectiveness of work on global health and development or on animal welfare? It’s not a simple question, because in the global health and development case, very often the efforts are to save the lives of children — specific children alive now who may die of a disease because they’re not getting the treatment. It’s a very severe and worsening problem this year, and the fact that AI might develop a better cure for better treatment for malaria in the 2030s is not going to save or bring back the children who die now.

So if you have a person-specific view, then this doesn’t necessarily bite. If you’re doing work to try to develop technologies that would help to end factory farming in the future, but you don’t expect them really to pay off for at least 5, 10, 15, 20 years, then it’s maybe easier to see how that stuff could end up just being superseded basically by work that AI does down the line, and people will feel like in fact they were somewhat wasting their time.

Do you have any thoughts on this slightly complex moral and practical question?

Holden Karnofsky: Yeah, I tend to separate causes in my head into buckets that have a lot to do with how speculative are you being, how long a time frame are you working on, how theoretical versus immediate and believable is your impact?

So I tend to think that causes that are taking a really big, long-term, high-risk bet on the future: those are the ones that, to me, I feel like I lose interest in when I am thinking about AI — because I’m just like, why would I do that when I could work on AI?

For example, if there’s some people are concerned about the fertility crisis or something — just worried that over the next several hundred years we’re going to have gradually standard of living increasing at a slower rate or something, and we need to work on some kind of ambitious programme to somehow address it — I don’t know how you address that. Or someone who wants to do science moonshots that are just like, for the next 50 years we’re going to see nothing, but then maybe we’ll see a big win.

I’m like, man, if you’re going to do that, just work on AI. If you want to have that much uncertainty and operate on that kind of crazy timeframe, do something that in my opinion is just a bigger deal and a higher likelihood.

I think there is more of an apples-to-oranges feel when I’m thinking about AI versus global poverty or animal welfare. Part of me thinks it’s apples to apples, and part of me thinks that working on AI is more important — but part of me thinks that you kind of have to have a little bit of a brittle and rigid philosophical framework to really find a reference point for comparing those.

I generally just don’t believe anything that philosophers come up with, because I don’t think it’s a discipline that has a good track record, and I think a lot of it is just us fooling ourselves with thought experiments. I mean, I enjoy it and it often changes the way I think. But I think if I had an incredible opportunity to do a huge amount of good for farm animal welfare or global poverty, and someone tried to argue that I should drop it and do some kind of mediocre AI work that I didn’t like, I wouldn’t buy that overall.

It’s a complicated topic, and there could be lots more to say about how do you think about what the right theory of ethics is, and how do you think about how to deal with your uncertainty about the right theory of ethics? And at what point are you modelling your uncertainty intuitively versus using a framework that itself is subject to uncertainty? I don’t know, it could be a whole other podcast.

What’s the role of earning to give? [04:24:54]

Rob Wiblin: So we’ve been talking almost exclusively about working in this field directly. Do you have any view on how that compares to earning to give and funding organisations or projects that are doing good work? At least METR, for example, is requesting funding from various donors, and I’m sure there are other groups that would value donations and feel like that might move them forward faster.

Holden Karnofsky: As far as I can tell, in general the case for donating in AI is getting stronger. There’s more stuff to do, like I’ve said, and there’s more tractability. I think it’s also becoming just more bad and awkward for one or a small set of really big funders to just cover everything, because it’s becoming a bigger, higher profile issue. People follow the money and different funders have different reputations.

For example, there’s Open Philanthropy, which is a big funder, but they’re connected to me even though I’m not there anymore. And I’m connected by family and work at Anthropic. So not everything is great to be getting money from Open Philanthropy — that can create a real and perceived conflict of interest, which I think is going away with time as I’m not there now.

So I think the opportunities for donating are probably better than ever, but a lot of the other things to do are better than ever — so this just further amplifies the picture that there’s just a tonne to do, and finding something that fits you really well seems good.

Rob Wiblin: I guess not having thought about this deeply, my guess would be that if you can get a job in a direct work role in one of these companies, or some other organisation that is really desperate to have you, for most people that is going to beat earning to give.

However, many people are not going to be able to get roles like that, because they’re just not suitable, don’t have the right skills at exactly the right moment. And for those people, earning to give is a much wider field where you can do a far wider range of things in order to make money and donate. So basically kind of everyone who can’t get a direct work role should seriously consider earning to give and funding these various projects that either just aren’t getting enough funding or can’t receive funding from particular donors for one reason or another.

Holden Karnofsky: Maybe. it depends a little bit on what exactly you’re defining as direct work. I think it’s more and more true that there’s just a tonne of jobs in the direct work that are maybe an indirect version of the direct work — where you might be helping run the organisation, or helping with the business side. If you are happy to go to an AI company that you think is good, and try and just help the company succeed on the business side — which I think you can debate is that a good thing to do, but there’s certainly an argument that it is, and I’ve been making a lot of that argument — then there’s just like a tonne of really normal jobs, right?

Certainly at Anthropic, there’s just a tonne of value being added by people who aren’t in these kind of traditional, effective-altruist-minded areas. You know, the legal team I think is very important and a lot of what they do is a big deal. Some of it helps with the funky governance that Anthropic has; some of it is more just everyday business needs. There are a lot of jobs that are in the AI industry now that you might think of that way — and some of those are actually pretty good for earning to give too, depending on how you think of it and what time frame you’re on.

So I don’t know, it’s not that clear. But yeah, I think what you’re saying has some truth to it.

Rob Wiblin: Yeah, I’m just sensitive to the fact that most people in the world could not in any given moment get a job at Anthropic or even a similar organisation, if only because of location or where they’re at in life. But fortunately for those people, the fact that potentially their earning to give opportunities or giving money opportunities are also better than ever gives them a substantial way to contribute potentially.

Holden Karnofsky: Oh yeah, that’s totally fair.

Rob Wiblin: All right, we should wrap up. We’ve been going for many hours, and you have incredible stamina, Holden. I’m not sure whether people can tell whether you’re flagging.

Holden Karnofsky: Well, you’re the one who’s up late.

Rob Wiblin: I’m up later, but I’ve said like probably a tenth as many words as you.

To close this out, do you want to give us a bit of an inspiring call to arms? There’s probably a lot of people in the audience who are definitely concerned, definitely interested in these topics. If they’re still with us at this point, they surely are at least somewhat interested in AI, but are not working in the area and may well not have applied for any kind of role or perhaps even seriously considered changing what they’re doing. Do you want to say something to inspire them to go and check out the 80,000 Hours job board or the Anthropic job board?

Holden Karnofsky: Yeah. If you’ve tried really hard to get a job in AI safety and you can’t find anything you like, then that’s one thing. If you haven’t tried, I think at this point I’m just comfortable being like, that’s insane. You should at least take a look, because it’s an incredibly dynamic field. It’s fun and interesting in a lot of ways that aren’t just about having impact. It’s one of the fastest changing things in the world. It pays well. A lot of organisations are just very good to work for.

And there is so incredibly much to do. There’s so much to do that is, while it might be 51/49, is 51/49 on maybe the most important thing that will ever happen to humanity.

And whatever your skills are, whatever your interests are, we’re out of the world where you have to be a conceptual self-starter, theorist mathematician, or a policy person — we’re into the world where whatever your skills are, there is probably a way to use them in a way that is helping make maybe humanity’s most important event ever go better.

So I would definitely look into it. I would definitely get in there, look for some jobs. And if you don’t find something that fits you, you don’t have to take it. But definitely, if you haven’t looked, I would.

Rob Wiblin: My guest today has been Holden Karnofsky. Thanks so much for coming on The 80,000 Hours Podcast, Holden.

Holden Karnofsky: Thanks for having me. It’s been great.