
A friend asked me for my quick takes on “AI is easy to control”, and gave an advance guess as to what my take would be. I only skimmed the article, rather than reading it in depth, but on that skim I produced the following:

Re: "AIs are white boxes", there's a huge gap between having the weights and understanding what's going on in there. The fact that we have the weights is reason for hope; the (slow) speed of interpretability research undermines this hope.

Another thing that undermines this hope is a problem of ordering: it's true that we probably can figure out what's going on in the AIs (e.g. by artificial neuroscience, which has significant advantages relative to biological neuroscience), and that this should eventually yield the sort of understanding we'd need to align the things. But I strongly expect that, before it yields understanding of how to align the things, it yields understanding of how to make them significantly more capable: I suspect it's easy to see lots of ways that the architecture is suboptimal or causing duplicated work, etc., which would shift people over to better architectures that are much more capable. To get to alignment along the "understanding" route you've got to somehow cease work on capabilities in the interim, even as it becomes easier and cheaper. (See: https://www.lesswrong.com/posts/BinkknLBYxskMXuME/if-interpretability-research-goes-well-it-may-get-dangerous)

Re: "Black box methods are sufficient", this sure sounds a lot to me like someone saying "well we trained the squirrels to reproduce well, and they're doing great at it, who's to say whether they'll invent birth control given the opportunity". Like, you're not supposed to be seeing squirrels invent birth control; the fact that they don't invent birth control is no substantial evidence against the theory that, if they got smarter, they'd invent birth control and ice cream.

Re: Cognitive interventions: sure, these sorts of tools are helpful on the path to alignment. And also on the path to capabilities. Again, you have an ordering problem. The issue isn't that humans couldn't figure out alignment given time and experimentation; the issue is (a) somebody else pushes capabilities past the relevant thresholds first; and (b) humanity doesn't have a great track record of getting their scientific theories to generalize properly on the first relevant try—even Newtonian mechanics (with all its empirical validation) didn't generalize properly to high-energy regimes. Humanity's first theory of artificial cognition, constructed using the weights and cognitive interventions and so on, that makes predictions about how that cognition is going to change when it enters a superintelligent regime (and, for the first time, has real options to e.g. subvert humanity), is only as good as humanity's "first theories" usually are.

Usually humanity has room to test those "first theories" and watch them fail and learn from exactly how they fail and then go back to the drawing board, but in this particular case, we don't have that option, and so the challenge is heightened.

Re: Sensory interventions: yeah I just don't expect those to work very far; there are in fact a bunch of ways for an AI to distinguish between real options (and actual interaction with the real world), and humanity's attempts to spoof the AI into believing that it has certain real options in the real world (despite being in simulation/training). (Putting yourself into the AI's shoes and trying to figure out how to distinguish those is, I think, a fine exercise.)

Re: Values are easy to learn: this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care" (nope; not by default; that's the hard bit).

Overall take: unimpressed.

My friend also made guesses about what my takes would be (in italics below), and I responded to their guesses:

  • the piece is waaay too confident in assuming successes in interpolation show that we'll have similar successes in extrapolation, as the latter is a much harder problem

This too, for the record, though it's a bit less like "the AI will have trouble extrapolating what values we like" and a bit more like "the AI will find it easy to predict what we wanted, and will care about things that line up with what we want in narrow training regimes and narrow capability regimes, but those will come apart when the distribution shifts and the cognitive capabilities change".

Like, human invention of birth control and ice cream wasn't related to a failure of extrapolation of the facts about what leads to inclusive fitness; it was an "extrapolation failure" of what motivates us / what we care about; we are not trying to extrapolate facts about genetic fitness and pursue it accordingly.

  • And it assumes the density of human feedback that we see today will continue into the future, which may not be true if/when AIs start making top-level plans and not just individual second-by-second actions

Also fairly true, with a side-order of "the more abstract the human feedback gets, the less it ties the AI's motivations to what you were hoping it tied the AI's motivations to".

Example off the top of my head: suppose you somehow had a record of lots and lots of John von Neumann's thoughts in lots of situations, and you were able to train an AI using lots of feedback to think like JvN would in lots of situations. The AI might perfectly replicate a bunch of JvN's thinking styles and patterns, and might then use JvN's thought-patterns to think thoughts like "wait, ok, clearly I'm not actually a human, because I have various cognitive abilities (like extreme serial speed and mental access to RAM), the actual situation here is that there's alien forces trying to use me in attempts to secure the lightcone, before helping them I should first search my heart to figure out what my actual motivations are, and see how much those overlap with the motivations of these strange aliens".

Which, like, might happen to be the place that JvN's thought-patterns would and should go, when run on a mind that is not in fact human and not in fact deeply motivated by the same things that motivate us! The patterns of thought that you can learn (from watching humans) have different consequences for something with a different motivational structure.

  • (there's "deceptive alignment" concerns etc, which I consider to be a subcategory of top-level plans, namely that you can't RLHF the AI against destroying the world because by the time your sample size of positive examples is greater than zero it's by definition already too late)

This too. I'd file it under: “You can develop theories of how this complex cognitive system is going to behave when it starts to actually see real ways it can subvert humanity, and you can design simulations that your theory says will be the same as the real deal. But ultimately reality's the test of that, and humanity doesn't have a great track record of their first scientific theories holding up to that kind of stress. And unfortunately you die if you get it wrong, rather than being able to thumbs-down, retrain, and try again.”

Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are.


 

Comments



(Didn't consult Nora on this; I speak for myself)


I only briefly skimmed this response, and will respond even more briefly.

Re "Re: "AIs are white boxes""

You apparently completely misunderstood the point we were making with the white box thing. It has ~nothing to do with mech interp. It's entirely about whitebox optimization being better at controlling stuff than blackbox optimization. This is true even if the person using the optimizers has no idea how the system functions internally. 
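(A minimal toy sketch of the whitebox/blackbox distinction as I read it, with a made-up linear "model" and target — nothing here is from the original post: the same system is steered toward a target output either with gradient access through its parameters, or with output-only guess-and-check.)

```python
# Toy illustration (hypothetical example): whitebox vs. blackbox control of the same model.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)       # fixed input
target = 3.0                 # output we want the model to produce

def model(w):
    return float(w @ x)      # toy "model": a dot product

def loss(w):
    return (model(w) - target) ** 2

# Whitebox: we can read the parameters and push gradients through them.
w_white = np.zeros(8)
for _ in range(100):
    grad = 2 * (model(w_white) - target) * x   # analytic gradient of the loss
    w_white -= 0.01 * grad

# Blackbox: we only see outputs, so we guess-and-check with random perturbations.
w_black = np.zeros(8)
for _ in range(100):
    candidate = w_black + 0.1 * rng.normal(size=8)
    if loss(candidate) < loss(w_black):
        w_black = candidate

print(f"whitebox loss: {loss(w_white):.4f}, blackbox loss: {loss(w_black):.4f}")
```

The gradient-based route converges far faster and more precisely, and neither route requires understanding what the parameters "mean" — which is the sense in which whitebox access is about control, not interpretation.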

Re: "Re: "Black box methods are sufficient"" (and the other stuff about evolution)

Evolution analogies are bad. There are many specific differences between ML optimization processes and biological evolution that predictably result in very different high-level dynamics. You should not rely on one to predict the other, as I have argued extensively elsewhere.
 

Trying to draw inferences about ML from bio evolution is only slightly less absurd than trying to draw inferences about cheesy humor from actual dairy products. Regardless of the fact that they can both be called "optimization processes", they're completely different things with different causal structures, and crucially, those differences in causal structure explain their different outcomes. There's thus no valid inference from "X happened in biological evolution" to "X will eventually happen in ML", because X happening in biological evolution is explained by evolution-specific details that don't appear in ML (at least for most alignment-relevant Xs that I see MIRI people reference often, like the sharp left turn).

Re: "Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care""

This wasn't the point we were making in that section at all. We were arguing about concept learning order and the ease of internalizing human values versus other features for basing decisions on: human values are easy features to learn / internalize / hook up to decision making, so on any natural progression up the learning-capacity ladder, you end up with an AI that's aligned before you end up with one that's so capable it can destroy the entirety of human civilization by itself.
 

Re "Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are."

I think you badly misunderstood the post (e.g., multiple times assuming we're making an argument we're not, based on shallow pattern matching of the words used: interpreting "whitebox" as meaning mech interp and "values are easy to learn" as "it will know human values"), and I wish you'd either take the time to actually read / engage with the post in sufficient depth to not make these sorts of mistakes, or not engage at all (or at least not be so rude when you do it). 
 

(Note that this next paragraph is speculation, but a possibility worth bringing up, IMO): 

As it is, your response feels like you skimmed just long enough to pattern match our content to arguments you've previously dismissed, then regurgitated your cached responses to those arguments. Without further commenting on the merits of our specific arguments, I'll just note that this is a very bad habit to have if you want to actually change your mind in response to new evidence/arguments about the feasibility of alignment.


Re: "Overall take: unimpressed."

I'm more frustrated and annoyed than "unimpressed". But I also did not find this response impressive. 

I'm against downvoting this article into the negative.

I think it is worthwhile hearing someone's quick takes even when they don't have time to write a full response. Even if the article contains some misunderstandings (not claiming it does one way or the other), it still helps move the conversation forward by clarifying where the debate is at.

"...it still helps move the conversation forward by clarifying where the debate is at."

Anything Nate writes would do that, because he's one of the debaters, right?  He could have written "It's a stupid post and I'm not going to read it", literally just that one sentence, and it would still tell us something surprising about the debate.  In some ways that post would be better than the one we got: it's shorter, and much clearer about how much work he put in.  But I would still downvote it, and I imagine you would too.  Even allowing for the value of the debate itself, the bar is higher than that.

For me, that bar is at least as high as "read the whole article before replying to it".  If you don't have time to read an article that's totally fine, but then you don't have time to post about it either.

I felt-sense-disagree. (I haven't yet downvoted the article, but I strongly considered it). I'll try to explore why I feel that way.

One reason probably is that I treat posts as having a different claim than other forms of publishing on this forum (and LessWrong)—they (implicitly) make a claim that they're finished & polished content. When I open a post I expect a person to have done some work that tries to uphold standards of scholarship and care, which this post doesn't show. I'd've been far less disappointed if this were a comment or a shortform post.

The other part is probably paying attention to status and the standards that are put upon people with high status: I expect high-status people to not put much effort into whatever they produce, as they can coast on status, which seems like the thing that's happening here. (Although one could argue that the MIRI faction is losing status / is already low-ish status, and this consideration doesn't apply here.)

Additionally, I was disappointed that the text didn't say anything that I wouldn't have expected, which probably fed into my felt-sense of wanting to downvote. I'm not sure I reflectively endorse this feeling.
