Linch

27594 karma · Joined · Working (6-15 years) · openasteroidimpact.org

Comments
2887

crossposted from https://inchpin.substack.com/p/legible-ai-safety-problems-that-dont

Epistemic status: Think there’s something real here but drafted quickly and imprecisely

I really appreciated reading Legible vs. Illegible AI Safety Problems by Wei Dai. I enjoyed it as an impressively sharp crystallization of an important idea:

  1. Some AI safety problems are “legible” (obvious/understandable to leaders/policymakers) and some are “illegible” (obscure/hard to understand)
  2. Legible problems are likely to block deployment because leaders won’t deploy until they’re solved
  3. Leaders WILL still deploy models with illegible AI safety problems, since they won’t understand the problems’ full import.
  4. Therefore, working on legible problems has low or even negative value. If unsolved legible problems block deployment, solving them will just speed up deployment and thus AI timelines.
    1. Wei Dai didn’t give a direct example, but the iconic example that comes to mind for me is Reinforcement Learning from Human Feedback (RLHF): implementing RLHF for early ChatGPT, Claude, and GPT-4 likely was central to making chatbots viable and viral.
    2. The raw capabilities were interesting but the human attunement was necessary for practical and economic use cases.

I mostly agree with this take. I think it’s interesting and important. However (and I suspect Wei Dai will agree), it’s also somewhat incomplete. In particular, the article presumes that “legible problems” and “problems that gate deployment” are identical, or at least that the correlation is strong enough that the differences are barely worth mentioning. I don’t think this is true.

 

 

For example, consider AI psychosis and AI suicides. Obviously these are highly legible problems that are very easy to understand (though not necessarily to quantify or solve). Yet they keep happening, and AI companies (or at least the less responsible ones) seem happy to continue deploying models without solving AI psychosis.

Now of course AI psychosis is less important than extinction or takeover risk. But greater importance does not by itself mean that problems as legible as AI psychosis today (or as AI psychosis in Nov 2024) will gate deployment, given actors at similar levels of responsibility to the existing AI company leaders.

Instead, it might be better to modify the argument to say we should primarily focus on solving/making legible problems that are not likely to actually gate deployment by default, and leave the problems that are already gating deployment to others (Trust and Safety teams, government legislators, etc). This sounds basically right to me.

But this raises another question: Legible to whom? And gating deployment by whom?

Wei Dai’s argument implicitly adopts a Mistake Theory framing, where AI company leadership doesn’t understand the illegible (to them) issues that could lead to our doom. On the one hand, this is surely true: e/accs aside, AI company leaders presumably don’t want themselves and their children to die, so in some sense, if they truly understood certain illegible issues that could lead to AI takeover and/or human extinction, the issues would probably block deployment.

In another sense, I’m not so sure the framing is right. Consider the following syllogism:

  1. If I believe that the risk is real, my company may have to shut down, or incur other large costs and possibly lose the AI race to Anthropic/OpenAI/DeepMind/China.
  2. I do not wish to incur large costs.
  3. Therefore, by modus tollens, the risk is not real.

This is a silly syllogism at face value, yet I believe it’s a common pattern of (mostly unconscious) thought among many people at AI labs. Related idea by Upton Sinclair: “It is difficult to get a man to understand something, when his salary depends on his not understanding it.”

This suggests at least two complications for the epistemics-only/work on making illegible problems more legible framing:

  1. Solving a problem can go a long way in making a problem more legible
    1. Many people have talked about how in the course of solving a problem, you may make it more legible. Alternatively, reframing a problem can make it more solvable (cf also Grothendieck’s rising sea)
    2. But if you take an incentives-first, motivated-cognition framing as I’ve implied, you may also believe that solving a problem, and thus reducing the alignment tax, may magically and mysteriously make AI company leaders suddenly understand the importance of your problems, now that they’re cheaper to solve.
  2. If motivations, and not technical difficulty, drive much of the illegibility, this suggests that sometimes we should focus our explainer efforts on people who are further away from the situation, and thus less biased
    1. Concretely, the current “AI safety-standard” route to increasing legibility goes as follows: first try to convince “very technical Constellation-cluster” people -> then convince AI company safety teams -> then convince AI company non-safety technical people -> then convince AI company leaders -> then maybe try to convince policymakers to turn informal agreements and policies into law
    2. But if I’m right about motivations, we should instead aim to convince unbiased people (or people biased in our direction) first, like tech journalists, faith leaders, politicians, and members of the general public.
      1. This is riskier epistemically in some ways because you’re talking to less knowledgeable and in some ways less intelligent people, but also has significant benefits in maintaining independence, and having less funky incentives and biases.
    3. To repeat, under my model, illegibility is often driven by incentives, culture and motivated reasoning, not technical or conceptual skill.

Note that I’m assuming that sometimes what you want is for AI companies to “see the light” and manage themselves (which is most of the “inside game” path forwards of #1). However, most of the time the way we get actual progress on AI not killing us all (especially for #2) is via legal and other forms of state hard power. In a democracy this entails a combination of convincing policymakers, civil society, and the general public, which requires not just technical agreement but also increases in salience.

Of course, there can be real issues with over-regulation or misdiagnosis of “illegible issues.” As someone responded to me on Twitter, “If a problem has no problem statement then… there isn’t a problem.” While the strong version of that is clearly false, there’s a weaker version that’s probably correct: problems you or I view as “illegible problems” are more likely than legible ones to be “not a real problem” in objective terms. I don’t have a clear solution for this other than a) thinking harder, and b) hoping that trying to increase legibility will also reveal the holes in the reasoning of “fake problems.” Ultimately, reality is difficult and there aren’t cheap workarounds.

Conclusion

Concretely, compared to before reading and thinking about Wei Dai’s article, I tentatively update a bit towards

a) wanting to work on more illegible problems,

b) thinking that AI safety should prioritize more explanation-type work, or work that is closer to analytic philosophy’s “conceptual sharpening,” and

c) thinking that if ideas mysteriously seem to bounce off of AI company employees and AI lab leaders, this may not be due to true philosophical or technical confusions, but rather obvious bias.

In some cases, we should think about, and experiment with, framing problems that are illegible (to AI company leaders or other ML-heavy people) to other audiences, rather than assume people are too dumb to understand our extant arguments without dumbing down.

 

Let me know if you have other thoughts on it here!

Some of the negative comments here gesture at the problem you're referring to, but less precisely than you had.

I wrote a quick draft on reasons you might want to skip pre-deployment Phase 3 drug trials (and instead do an experimental rollout with post-deployment trials, with the option of recall) for vaccines for diseases with high mortality burden, or for novel pandemics. https://inchpin.substack.com/p/skip-phase-3

It's written in a pretty rushed way, but I know this idea has been bouncing around for a while and I haven't seen a clearer writeup elsewhere, so I hope it can start a conversation!

Are the abundance ideas actually new to EA folks? They feel like rehashes of arguments we've had ~ a decade ago, often presented in less technical language and ignoring the major cruxes.

Not saying they're bad ideas, just not new.

This post had 169 views on the EA forum, 3K on substack, 17K on reddit, 31K on twitter.

Link appears to be broken.

This is great news; I'm so glad to hear that!!!


I wrote a field guide on writing styles. Not directly applicable to the EA Forum but I used some EA Forum-style writing (including/especially my own) as examples. 

https://linch.substack.com/p/on-writing-styles

I hope the article can increase the quality of online intellectual writing in general and EAF writing in particular!
 

x-posted from Substack

Now, of course, being vegan won’t kill you, right away or ever. But the same goes for eating a diet of purely McDonald’s or essentially just potatoes (like many peasants did). The human body is remarkably resilient and can survive on a wide variety of diets. However, we don’t thrive on all diets. 

Vegans often show up as healthier in studies than other groups, but correlation is not causation. For example, famously Adventists are vegetarians and live longer than the average population. However, vegetarian is importantly different from vegan. Also, Adventists don’t drink or smoke either, which might explain the difference. 

Wouldn’t it be great if we had a similar population that didn’t smoke or drink but did eat meat to compare? 

We do! The Mormons. And they live longer than the Adventists. 

The Seventh-Day Adventist studies primarily looked at differences *between* different Seventh-Day Adventists, not just a correlational case of Seventh-Day Adventists against other members of the public. This helps control for a number of issues with looking across religious groups, which would be a pretty silly way to determine causation from diet to health. I believe the results also stand after a large number of demographic adjustments [2].

Finally, Mormons are predominantly white. Only 3% of Mormons are black. 32% of Seventh-Day Adventists are black. In the US, black people have a substantially lower life expectancy than white people [3]. Thus, it'd be unreasonable to look at naive life expectancies across two different religious groups and assume that lifestyle makes the biggest difference, when there are clearly other things going on.

[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC4191896/

[2] https://pmc.ncbi.nlm.nih.gov/articles/PMC4191896/table/T4/

[3] Interestingly enough, this is not true across the rest of the developed world. For example, UK black people have a higher life expectancy than white people. I've never dived into this discrepancy before so I'm not sure what the reason is.

I have a lot of sympathy towards being frustrated at knee-jerk bias against AI usage. I was recently banned from r/philosophy on first offense because I linked a post that contained an AI-generated image and a (clearly-labelled) AI summary of someone else's argument[1]. (I saw that the subreddit had rules against AI usage but I foolishly assumed that it only applied to posts in the subreddit itself). I think their choice to ban me was wrong, and deprived them of valuable philosophical arguments that I was able to make[2] in other subreddits like r/PhilosophyOfScience. So I totally get where you're coming from with frustration.

And I agree that AI, like typewriters, computers, calculators, and other tools, can be epistemically beneficial in allowing people who otherwise don't have the time to make arguments to develop them. 

Nonetheless I think you're wrong in some important ways.

Firstly, I think you're wrong to believe that perception of AI ought only to cause us to be skeptical of whether to engage with some writing, and it is "pure prejudice" to apply a higher bar to writing after reading it conditional upon whether it's AI. I think this is an extremely non-obvious claim, and I currently think you're wrong. 

To illustrate this point, consider two other reasons I might apply greater scrutiny to some content I see:

  1. An entire essay is written in Comic Sans
  2. I learned that a paper was written by Francesca Gino

If an essay is written in Comic Sans (a font often adopted by unserious people), we might initially suspect that the essay's not very serious, but after reading it, we should withdraw any adverse inferences we make about the essay simply due to font. This is because we believe (by stipulation) that an essay's font can tell us whether an essay is worth reading, but cannot provide additional data after reading the essay. In Pearlian terms, reading the essay "screens off" any information we gain from an essay's font.

I think this is not true for learning that a paper is written by Francesca Gino. Since Francesca Gino's a known data fraudster, even after reading a paper by her carefully, or at least with the same level of care I usually apply to reading psychology papers, I should continue to be more skeptical of her findings than after reading the same paper written by a different academic. I think this is purely rational, rather than an ad hominem argument, or "pure prejudice" as you so eloquently put it.

Now, is learning whether an essay is written (or cowritten) by AI a signal more akin to learning that an essay is written in Comic Sans, or closer to learning that it's written by Francesca Gino? Reasonable people can disagree here, but at the very least the answer's extremely non-obvious, and you haven't actually substantiated why you believe it's the former, when there are indeed good reasons to believe it's the latter. 

In brief: 

  1. AI hallucination -- while AIs may intentionally lie less often than Harvard business professors, they still hallucinate at a higher rate than I'm comfortable with seeing on the EA Forum.
  2. AI persuasiveness -- for the same facts and levels of evidence, AIs might be more persuasive than most human writers. To the extent this additional persuasiveness is not correlated with truth, we should update negatively accordingly upon seeing arguments presented by AIs.
  3. Authority and cognition -- If I see an intelligent and well-meaning person present an argument with some probably-fillable holes, that they allude to but do not directly address in the writing, I might be inclined to give them the benefit of the doubt and assume they've considered the issue and decided it wasn't worth going into in a short speech or essay. However, this inference is much more likely to go wrong if an essay is written with AI assistance. I alluded to this point in my comment on your other top-level post but I'll mention it again here.
    1. I think it's very plausible, for example, that if you took the time to write out/type out your comment here yourself, you'd have been able to recognize my critique for yourself, and it wouldn't have been necessary for me to dive into it.
  1. ^

    I still defend this practice. I think the alternative of summarizing other people's arguments in your own words has various tradeoffs but a big one is that you are injecting your own biases into the summary before you even start critiquing it.

  2. ^

    Richard Chappell was also banned temporarily, and has a more eloquent defense. Unlike me he's an academic philosopher (TM)

I compiled a list of my favorite jokes, which some forum users might enjoy. https://linch.substack.com/p/intellectual-jokes
