Trying to better understand the practical epistemology of EA, and how we can improve upon it.
I don't quite agree with your summary.
Kat explicitly acknowledges at the end of this comment that "[they] made some mistakes ... learned from them and set up ways to prevent them", so it feels a bit unfair to say that that Non-Linear as a whole hasn't acknowledged any wrongdoing.
OTOH, Ben's testimony here in response to Emerson is a bit concerning, and supports your point more strongly.[1] It's also one of the remarks I'm most curious to hear Emerson respond to. I'll quote Ben in full because I don't think this comment is on the EA Forum.
I did hear your [Emerson's?] side for 3 hours and you changed my mind very little and admitted to a bunch of the dynamics ("our intention wasn't just to have employees, but also to have members of our family unit") and you said my summary was pretty good. You mostly laughed at every single accusation I brought up and IMO took nothing morally seriously and the only ex ante mistake you admitted to was "not firing Alice earlier". You didn't seem to understand the gravity of my accusations, or at least had no space for honestly considering that you'd seriously hurt and intimidated some people.
I think I would have been much more sympathetic to you if you had told me that you'd been actively letting people know about how terrible an experience your former employees had, and had encouraged people to speak with them, and if you at literally any point had explicitly considered the notion that you were morally culpable for their experiences.
This is only Ben's testimony, so take that for what it's worth. But this context feels important, because (at least just speaking personally) genuine acknowledgment and remorse for any wrongdoing feels pretty crucial for my overall evaluation of Non-Linear going forward.
I also sympathize with the general vibe of your remark, and the threats to sue contribute to the impression of going on the defensive rather than admitting fault.
Here's a dynamic that I've seen pop up more than once.
Person A says that an outcome they judge to be bad will occur with high probability, while making a claim of the form "but I don't want (e.g.) alignment to be doomed — it would be a huge relief if I'm wrong!"
It seems uncontroversial that Person A would like to be shown that they're wrong in a way that vindicates their initial forecast as ex ante reasonable.
It seems more controversial whether Person A would like to be shown that their prediction was wrong, in a way that also shows their initial prediction to have been ex ante unreasonable.
In my experience, it's much easier to acknowledge that you were wrong about some specific belief (or the probability of some outcome), than it is to step back and acknowledge that the reasoning process which led you to your initial statement was misfiring. Even pessimistic beliefs can be (in Ozzie’s language) "convenient beliefs" to hold.
If we identify ourselves with our ability to think carefully, coming to believe that there are errors in our reasoning process can hit us much more personally than updates about errors in our conclusions. Optimistic updates might be an update towards me thinking that my projects have been less worthwhile than I thought, that my local community is less effective than I thought, or that my background framework or worldview was in error. I think these updates can be especially painful for people who are more liable to identify with their ability to reason well, or identify with the unusual merits of their chosen community.
To clarify: I'm not claiming that people with more pessimistic conclusions are, in general, more likely to be making reasoning errors. Obviously there are plenty of incentives towards believing rosier conclusions. I'm simply claiming that: if someone arrives at a pessimistic conclusion based on faulty reasoning, then you shouldn't necessarily expect optimistic pushback to be uniformly welcomed— for all of the standard reasons that updates of the form "I could've done better on a task I care about" can be hard to accept.
I'm a bit unclear on why you characterise 80,000 Hours as having a "narrower" cause focus than (e.g.) Charity Entrepreneurship. CE's page cites the following cause areas:
Meanwhile, 80k provide a list of the world's "most pressing problems":
These areas feel comparably "broad" to me? Likewise for Longview, who you list as part of the "AI x-risk community", state six distinct focus areas for their grantmaking — only one of which is AI. Unless I've missed a recent pivot from these orgs, both Longview & 80k feel more similar to CE in terms of breadth than Animal Advocacy Careers.
I agree that you need "specific values and epistemic assumptions" to agree with the areas these orgs have highlighted as most important, but I think you need specific values and epistemic assumptions to agree with more standard near-termist recommendations for impactful careers and donations, too. So I'm a bit confused about what the difference between "question" and "answer" communities is meant to denote aside from the split between near/longtermism.[1] Is the idea that (for example) CE is more skeptically focused on exploring the relative priorities of distinct cause areas, whereas organizations like Longview and 80k are more focused on funnelling people+money into areas which have already been decided as the most important? Or something else?
I do think it's correct note that the more 'longtermist' side of the community works with different values and epistemics to the more 'neartermist' side of the community, and I think it would be beneficial to emphasise this more. But given that you note there are already distinct communities in some sense (e.g., there are x-risk specific conferences), what other concrete steps would you like to see implemented in order to establish distinct communities?
I'm aware that many people justify focus on areas like biorisk and AI in virtue of the risks posed to the present generation, and might not subscribe to longtermism as a philosophical thesis. I still think that the ‘longtermist’ moniker is useful as a sociological label — used to denote the community of people who work on cause areas that longtermists are likely to rate as among the highest priorities.
Ay thanks, sorry I’m late back to you. I’ll respond to various parts in turn.
I don't find Carlsmith et al's estimates convincing because they are starting with a conjunctive frame and applying conjunctive reasoning. They are assuming we're fine by default (why?), and then building up a list of factors that need to go wrong for doom to happen.
My initial interpretation of this passage is: you seem to be saying that conjunctive/disjunctive arguments are presented against a mainline model (say, one of doom/hope). In presenting a ‘conjunctive’ argument, Carlsmith belies a mainline model of hope. However, you doubt the mainline model of hope, and so his argument is unconvincing. If that reading is correct, then my view is that the mainline model of doom has not been successfully argued for. What do you take to be the best argument for a ‘mainline model’ of doom? If I’m correct in interpreting the passage below as an argument for a ‘mainline model’ of doom, then it strikes me as unconvincing:
Any one of a vast array of things can cause doom. Just the 4 broad categories mentioned at the start of the OP (subfields of Alignment) and the fact that "any given [alignment] approach that might show some promise on one or two of these still leaves the others unsolved." is enough to provide a disjunctive frame!
Under your framing, I don’t think that you’ve come anywhere close to providing an argument for your preferred disjunctive framing. On my way of viewing things, an argument for a disjunctive framing shows that “failure on intent alignment (with success in the other areas) leads to a high P(Doom | AGI), failure on outer alignment alignment (with success in the other areas) leads to a high P(Doom | AGI), etc …”. I think that you have not shown this for any of the disjuncts, and an argument for a disjunctive frame requires showing this for all of the disjuncts.
I claimed that an argument for (my slight alteration of) Nate’s framing was likely to rely on the conjunction of many assumptions, and you (very reasonably) asked me to spell them out. To recap, here’s the framing:
For humanity to be dead by 2070, only one of the following needs to be true:
- Humanity has < 20 years to prepare for AGI
- The technical challenge of alignment isn’t “pretty easy”
- Research culture isn’t alignment-conscious in a competent way.
For this to be a disjunctive argument for doom, all of the following need to be true:
- If humanity has < 20 years to prepare for AGI, then doom is highly likely.
- Etc …
That is, the first point requires an argument which shows the following:
A Conjunctive Case for the Disjunctive Case for Doom:[1]
If I try to spell out the arguments for this framing, things start to look pretty messy. If technical alignment were “pretty easy”, and tackled by a culture which competently pursued alignment research, then I don’t feel >90% confident in doom. The claim “if humanity has < 20 years to prepare for AGI, then doom is highly likely” requires (non-exhaustively) the following assumptions:
So far, I’ve discussed just one disjunct, but I can imagine outlining similar assumptions for the other disjuncts. For instance: if we have >20 years to conduct AI alignment research conditional on the problem not being super hard, why can’t there be a decent chance that a not-super-competent research community solves the problem? Again, I find it hard to motivate the case for a claim like that without already assuming a mainline model of doom.
I’m not saying there aren’t interesting arguments here, but I think that arguments of this type mostly assume a mainline model of doom (or the adequacy of a ‘disjunctive framing’), rather than providing independent arguments for a mainline model of doom.
This blog is ~1k words. Can you write a similar length blog for the other side, rebutting all my points?
I think so! But I’m unclear what, exactly, your arguments are meant to be. Also, I would personally find it much easier to engage with arguments in premise-conclusion format. Otherwise, I feel like I have to spend a lot of work trying to understand the logical structure of your argument, which requires a decent chunk of time-investment.
Still, I’m happy to chat over DM if you think that discussing this further would be profitable. Here’s my attempt to summarize your current view of things.
We’re on a doomed path, and I’d like to see arguments which could allow me to justifiably believe that there are paths which will steer us away from the default attractor state of doom. The technical problem of alignment has many component pieces, and it seems like failure to solve any one of the many component pieces is likely sufficient for doom. Moreover, the problems for each piece of the alignment puzzle look ~independent.
Suggestions for better argument names are not being taken at this time.
Based solely on my own impression, I'd guess that one reason for the lack of engagement on your original question stems from the fact that it felt like you were operating within a very specific frame, and I sensed that untangling the specific assumptions of your frame (and consequently a high P(doom)) would take a lot of work. In my own case, I didn’t know which assumptions are driving your estimates, and so I consequently felt unsure as to which counter-arguments you'd consider relevant to your key cruxes.
(For example: many reviewers of the Carlsmith report (alongside Carlsmith himself) put P(doom) ≤ 10%. If you've read these responses, why did you find the responses uncompelling? Which specific arguments did you find faulty?)
Here's one example from this post where I felt as though it would take a lot of work to better understand the argument you want to put forward:
“The above considerations are the basis for the case that disjunctive reasoning should predominantly be applied to AI x-risk: the default is doom.”
When I read this, I found myself asking “wait, what are the relevant disjuncts meant to be?”. I understand a disjunctive argument for doom to be saying that doom is highly likely conditional on any one of {A, B, C, … }. If each of A, B, C … is independently plausible, then obviously this looks worrying. If you say that some claim is disjunctive, I want an argument for believing that each disjunct is independently plausible, and an argument for accepting the disjunctive framing offered as the best framing for the claim at hand.
For instance, here’s a disjunctive framing of something Nate said in his review of the Carlsmith Report.
For humanity to be dead by 2070, only one premise below needs to be true:
- Humanity has < 20 years to prepare for AGI
- The technical challenge of alignment isn’t “pretty easy”
- Research culture isn’t alignment-conscious in a competent way.
Phrased this way, Nate offers a disjunctive argument. And, to be clear, I think it’s worth taking seriously. But I feel like ‘disjunctive’ and ‘conjunctive’ are often thrown around a bit too loosely, and such terms mostly serve to impede the quality of discussion. It’s not obvious to me that Nate’s framing is the best framing for the question at hand, and I expect that making the case for Nate’s framing is likely to rely on the conjunction of many assumptions. Also, that’s fine! I think it’s a valuable argument to make! I just think there should be more explicit discussions and arguments about the best framings for predicting the future of AI.
Finally, I feel like asking for “a detailed technical argument for believing P(doom|AGI) ≤ 10%” is making an isolated demand for rigor. I personally don’t think there are ‘detailed technical arguments’ P(doom|AGI) greater than 10%. I don’t say this critically, because reasoning about the chances of doom given AGI is hard. I'm also >10% on many claims in the absence of 'detailed, technical arguments' for such claims in the absence of such arguments, and I think we can do a lot better than we're doing currently.
I agree that it’s important to avoid squeamishness about proclamations of confidence in pessimistic conclusions if that’s what we genuinely believe the arguments suggest. I'm also glad that you offered the 'social explanation' for people's low doom estimates, even though I think it's incorrect, and even though many people (including, tbh, me) will predictably find it annoying. In the same spirit, I'd like to offer an analogous argument: I think many arguments for p(doom | AGI) > 90% are the result of overreliance on specific default frame, and insufficiently careful attention to argumentative rigor. If that claim strikes you as incorrect, or brings obvious counterexamples to mind, I'd be interested to read them (and to elaborate my dissatisfaction with existing arguments for high doom estimates).
thnx! : )
Your analogy successfully motivates the “man, I’d really like more people to be thinking about the potentially looming Octopcracy” sentiment, and my intuitions here feel pretty similar to the AI case. I would expect the relevant systems (AIs, von-Neumann-Squidwards, etc) to inherit human-like properties wrt human cognition (including normative cognition, like plan search), and a small-but-non-negligible chance that we end up with extinction (or worse).
On maximizers: to me, the most plausible reason for believing that continued human survival would be unstable in Grace’s story either consists in the emergence of dangerous maximizers, or the emergence of related behaviors like rapacious influence-seeking (e.g., Part II of What Failure Looks Like). I agree that maximizers aren't necessary for human extinction, but it does seem like the most plausible route to ‘human extinction’ rather than ‘something else weird and potentially not great’.
Pushback appreciated! But I don’t think you show that “LLMs distill human cognition” is wrong. I agree that ‘next token prediction’ is very different to the tasks that humans faced in their ancestral environments, I just don’t see this as particularly strong evidence against the claim ‘LLMs distill human cognition’.
I initially stated that “LLMs distill human cognition” struck me as a more useful predictive abstraction than a view which claims that the trajectory of ML leads us to a scenario where future AIs, are “in the ways that matter”, doing something more like “randomly sampling from the space of simplicity-weighted plans”. My initial claim still seems right to me.
If you want to pursue the debate further, it might be worth talking about the degree to which you’re (un)convinced by Quintin Pope’s claims in this tweet thread. Admittedly, it sounds like you don’t view this issue as super cruxy for you:
“The cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values”
I don’t know the literature on moral psychology, but that claim doesn’t feel intuitive to me (possibly I’m misunderstanding what you mean by ‘human values’; I’m also interested in any relevant sources). Some thoughts/questions:
A working attempt to sketch a simple three-premise argument for the claim: ‘TAI will result in human extinction’, and offer objections. Made mostly for my own benefit while working on another project, but I thought it might be useful to post here.
The structure of my preferred argument is similar to an earlier framing suggested by Katja Grace.
I’ll offer some rough probabilities, but the probabilities I’m offering shouldn’t be taken seriously. I don’t think probabilities are the best way to adjudicate disputes of this kind, but I thought offering a more quantitative sense of my uncertainty (based on my immediate impressions) might be helpful in this case. For the (respective) premises, I might go for 98%, 7%, 83%, resulting in a ~6% chance of human extinction given TAI.
Some more specific objections:
I also think the story by Katja Grace below is plausible, in which superhuman AI systems are “goal-directed”, but don’t lead to human extinction.
AI systems proliferate, and have various goals. Some AI systems try to make money in the stock market. Some make movies. Some try to direct traffic optimally. Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable. These systems have no perceptible desire to optimize the universe for forwarding these goals because they aren’t maximizing a general utility function, they are more ‘behaving like someone who is trying to make Walmart profitable’. They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow. The world looks kind of like the current world, in that it is fairly non-obvious what any entity’s ‘utility function’ is. It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.
Perhaps the story above is unlikely because the AI systems in Grace’s story would (in the absence of strong preventative efforts) be dangerous maximizers. I think that this is most plausible on something like Eliezer’s model of agency, and if my views change my best bet is that I’ll have updated towards his view.
Finally, I sometimes feel confused by the concept of ‘capabilities’ as it’s used in discussions about AGI. From Jenner and Treutlein’s response to Grace’s counterarguments:
Assuming it is feasible, the question becomes: why will there be incentives to build increasingly capable AI systems? We think there is a straightforward argument that is essentially correct: some of the things we care about are very difficult to achieve, and we will want to build AI systems that can achieve them. At some point, the objectives we want AI systems to achieve will be more difficult than disempowering humanity, which is why we will build AI systems that are sufficiently capable to be dangerous if unaligned.”
Maybe one thing I’m thinking here is that “more difficult” is hard to parse. The AI systems might be able to achieve some narrower outcome that we desire, without being “capable” of destroying humanity. I think this is compatible with having systems which are superhumanly capable of pursuing some broadly-scoped goals, without being capable of pursuing all broadly-scoped goals.
(Also, I’m no doubt missing a bunch of relevant information here. But this is probably true for most people, and I think it’s good for people to share objections even if they’re missing important details)
Nice post!
I think I’d want to revise your first taxonomy a bit. To me, one (perhaps the primary) disagreement among ML researchers regarding AI risk consists of differing attitudes to epistemological conservatism, which I think extends beyond making conservative predictions. Here’s why I prefer my framing:
I also think that the language of conservative epistemology helps counteract (what I see as) a mistaken frame motivating this post. (I’ll try to motivate my claim, but I’ll note that I remain a little fuzzy on exactly what I’m trying to gesture at.)
The mistaken frame I see is something like “modeling conservative epistemologists as if they were making poor strategic choices within a non-conservative world-model”. You state:
The level of concern and seriousness I see from ML researchers discussing AGI on any social media platform or in any mainstream venue seems wildly out of step with "half of us think there's a 10+% chance of our work resulting in an existential catastrophe".
I have concerns about you inferring this claim from the survey data provided,[1] but perhaps more pertinently for my point: I think you’re implicitly interpreting the reported probabilities as something like all-things-considered credences in the proposition researchers were queried about. I’m much more tempted to interpret the probabilities offered by researchers as meaning very little. Sure, they’ll provide a number on a survey, but this doesn’t represent ‘their’ probability of an AI-induced existential catastrophe.
I don’t think that most ML researchers have, as a matter of psychological fact, any kind of mental state that’s well-represented by a subjective probability about the chance of an AI-induced existential catastrophe. They’re more likely to operate with a conservative epistemology, in a way that isn’t neatly translated into probabilistic predictions over an outcome space that includes the outcomes you are most worried about. I think many people are likely to filter out the hypothesis given the perceived lack of evidential support for the outcome.
I actually do think the distinction between 'conservative predictions' and 'conservative decision-making' is helpful, though I'm skeptical about its relevance for analyzing different attitudes to AI risk.
If my analysis is right, then a first-pass at the practical conclusions might consist in being more willing to center arguments about alignment from a more empirically grounded perspective (e.g. here), or more directly attempting to have conversations about the costs and benefits of more conservative epistemological approaches.
First, there are obviously selection effects present in surveying OpenAI and DeepMind researchers working on long-term AI. Citing this result without caveat feels similar using (e.g.) PhilPapers survey results revealing that most specialists in philosophy of religion are to support the claim that most philosophers are theists. I can also imagine similar selection effects being present (though to lesser degrees) in the AI Impacts Survey. Given selection effects, and given that response rates from the AI Impacts survey were ~17%, I think your claim is misleading.
I’m glad you put something skeptical out there publicly, but I have two fairly substantive issues with this post.
I’ll start with the first point. In your post, you state the following.
The original post contains comments expressing disagreement. Habryka claims “the core thesis is wrong”. Turner’s criticism is more qualified, as he says the post called out “the huge miss of earlier speculation”, but he also says that “it isn't useful to think of LLMs as "simulating stuff" … [this] can often give a false sense of understanding.” Beth Barnes and Ryan Greenblatt have also written critical posts. Thus, I think you overstate the degree to which you’re appealing to an established consensus.
On the second point, your post offers a purported implication of simulator theory.
You elaborate on the implication later on. Overall, your argument appears to be that, because “LLMs are just simulators”, or “just predicting the next token”, we conclude that the outputs from the model have “little to do with the feelings of the shoggoth”. This argument appears to treat the “masked shoggoth” view as an implication of janus’ framework, and I think this is incorrect. Here’s a direct quote (bolding mine) from the original Simulators post which (imo) appears to conflict with your own reading, where there is a shoggoth "behind" the masks.
More substantively, I can imagine positive arguments for viewing ‘simulacra’ of the model as worthy of moral concern. For instance, suppose we fine-tune an LM so that it responds in consistent character: as a helpful, harmless, and honest (HHH) assistant. Further suppose that the process of fine-tuning causes the model to develop a concept like ‘Claude, an AI assistant developed by Anthropic’, which in turn causes it to produce text consistent with viewing itself as Claude. Finally, imagine that – over the course of conversation – Claude’s responses fail to be HHH, perhaps as a result of tampering with its features.
In this scenario, the following three claims are true of the model:
If (1)-(3) are true, certain views about the nature of suffering suggest that the model might be suffering. E.g. Korsgaard’s view is that, when some system is doing something that “is a threat to [its] identity and perception reveals that fact … it must reject what it is doing and do something else instead. In that case, it is in pain”. Ofc, it’s sensible to be uncertain about such views, but they pose a challenge to the impossibility of gathering evidence about whether LLMs are moral patients — even conditional on something like janus’ simulator framework being correct.
E.g., if you tell the model “Claude has X parameters” and ask it to draw implications from that fact, it might state “I am a model with X parameters”.