Steven Byrnes

Research Fellow @ Astera
1363 karma · Joined Nov 2019 · Working (6–15 years) · Boston, MA, USA
sjbyrnes.com/agi.html

Bio

Hi, I'm Steve Byrnes, an AGI safety / AI alignment researcher in Boston, MA, USA, with a particular focus on brain algorithms. See https://sjbyrnes.com/agi.html for a summary of my research and a sorted list of my writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn

Comments: 129 · Topic contributions: 3

Hi, I’m an AI alignment technical researcher who mostly works independently, and I’m in the market for a new productivity coach / accountability buddy, to chat with periodically (I’ve been doing one ≈20-minute meeting every 2 weeks) about work habits, goal-setting, and so on. I’m open to either paying fair market rate, or to a reciprocal arrangement where we trade advice and promises etc. I slightly prefer someone not directly involved in AI alignment—since I don’t want us to get nerd-sniped into object-level discussions—but that’s not a hard requirement. You can reply here, or DM or email me. :) Update: I’m all set now.

Humans are less than maximally aligned with each other (e.g. we care less about the welfare of a random stranger than about our own welfare), and humans are also less than maximally misaligned with each other (e.g. most people don’t feel a sadistic desire for random strangers to suffer). I hope that everyone can agree about both those obvious things.

That still leaves the question of where we are on the vast spectrum in between those two extremes. But I think your claim “humans are largely misaligned with each other” is not meaningful enough to argue about. What percentage is “largely”, and how do we even measure that?

Anyway, I am concerned that future AIs will be more misaligned with random humans than random humans are with each other, and that this difference will have important bad consequences, and I also think there are other disanalogies / reasons-for-concern as well. But this is supposed to be a post about terminology so maybe we shouldn’t get into that kind of stuff here.

My terminology would be that (2) is “ambitious value learning” and (1) is “misaligned AI that cooperates with humans because it views cooperating-with-humans to be in its own strategic / selfish best interest”.

I strongly vote against calling (1) “aligned”. If you think we can have a good future by ensuring that it is always in the strategic / selfish best interest of AIs to be nice to humans, then I happen to disagree but it’s a perfectly reasonable position to be arguing, and if you used the word “misaligned” for those AIs (e.g. if you say “alignment is unnecessary”), I think it would be viewed as a helpful and clarifying way to describe your position, and not as a reductio or concession.

For my part, I define “alignment” as “the AI is trying to do things that the AGI designer had intended for it to be trying to do, as an end in itself and not just as a means-to-an-end towards some different goal that it really cares about.” (And if the AI is not the kind of thing for which the words “trying” and “cares about” are applicable in the first place, then the AI is neither aligned nor misaligned, and also I’d claim it’s not an x-risk in any case.) More caveats in a thing I wrote here:

Some researchers think that the “correct” design intentions (for an AGI’s motivation) are obvious, and define the word “alignment” accordingly. Three common examples are (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do”—this AGI would be “aligned” to the supervisor’s intentions. (2) “I am designing the AGI so that it shares the values of its human supervisor”—this AGI would be “aligned” to the supervisor. (3) “I am designing the AGI so that it shares the collective values of humanity”—this AGI would be “aligned” to humanity.

I’m avoiding this approach because I think that the “correct” intended AGI motivation is still an open question. For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.

Of course, sometimes I want to talk about (1,2,3) above, but I would use different terms for that purpose, e.g. (1) “the Paul Christiano version of corrigibility”, (2) “ambitious value learning”, and (3) “CEV”.

May I ask, what is your position on creating artificial consciousness?
Do you see digital suffering as a risk? If so, should we be careful to avoid creating AC?

I think the word “we” is hiding a lot of complexity here—like saying “should we decommission all the world’s nuclear weapons?” Well, that sounds nice, but how exactly? If I could wave a magic wand and nobody ever builds conscious AIs, I would think seriously about it, although I don’t know what I would decide—it depends on details I think. Back in the real world, I think that we’re eventually going to get conscious AIs whether that’s a good idea or not. There are surely interventions that will buy time until that happens, but preventing it forever and ever seems infeasible to me. Scientific knowledge tends to get out and accumulate, sooner or later, IMO. “Forever” is a very very long time. 

The last time I wrote about my opinions is here.

Do you see digital suffering as a risk? 

Yes. The main way I think about that is: I think eventually AIs will be in charge, so the goal is to wind up with AIs that tend to be nice to other AIs. This challenge is somewhat related to the challenge of winding up with AIs that are nice to humans. So preventing digital suffering winds up closely entangled with the alignment problem, which is my area of research. That’s not in itself a reason for optimism, of course.

We might also get a “singleton” world where there is effectively one and only one powerful AI in the world (or many copies of the same AI pursuing the same goals) which would alleviate some or maybe all of that concern. I currently think an eventual “singleton” world is very likely, although I seem to be very much in the minority on that.

Sorry if I missed it, but is there some part of this post where you suggest specific concrete interventions / actions that you think would be helpful?

Mark Solms thinks he understands how to make artificial consciousness (I think everything he says on the topic is wrong), and his book Hidden Spring has an interesting discussion (in chapter 12) on the “oh jeez now what” question. I mostly disagree with what he says about that too, but I find it to be an interesting case-study of someone grappling with the question.

In short, he suggests turning off the sentient machine, then registering a patent for making conscious machines, and assigning that patent to a nonprofit like maybe Future of Life Institute, and then

organise a symposium in which leading scientists and philosophers and other stakeholders are invited to consider the implications, and to make recommendations concerning the way forward, including whether and when and under what conditions the sentient machine should be switched on again – and possibly developed further. Hopefully this will lead to the drawing up of a set of broader guidelines and constraints upon the future development, exploitation and proliferation of sentient AI in general.

He also has a strongly-worded defense of his figuring out how consciousness works and publishing it, on the grounds that if he didn’t, someone else would.

I am not claiming analogies have no place in AI risk discussions. I've certainly used them a number of times myself. 

Yes you have!—including just two paragraphs earlier in that very comment, i.e. you are using the analogy “future AI is very much like today’s LLMs but better”.  :)

Cf. what I called “left-column thinking” in the diagram here.

For all we know, future AIs could be trained in an entirely different way from LLMs, in which case the way that “LLMs are already being trained” would be pretty irrelevant in a discussion of AI risk. That’s actually my own guess, but obviously nobody knows for sure either way.  :)

It is certainly far from obvious: for example, devastating as the COVID-19 pandemic was, I don’t think anyone believes that 10,000 random re-rolls of the COVID-19 pandemic would lead to at least one existential catastrophe. The COVID-19 pandemic just was not the sort of thing to pose a meaningful threat of existential catastrophe, so if natural pandemics are meant to go beyond the threat posed by the recent COVID-19 pandemic, Ord really should tell us how they do so.

This seems very misleading. We know that COVID-19 has <<5% IFR. Presumably the concern is that some natural pandemics may be much much more virulent than COVID-19 was. So it’s important that the thing we imagine is “10,000 random re-rolls in which there is a natural pandemic”, NOT “10,000 random re-rolls of COVID-19 in particular”. And then we can ask questions like “How many of those 10,000 natural pandemics have >50% IFR? Or >90%? And what would we expect to happen in those cases?” I don’t know what the answers are, but that’s a much more helpful starting point I think.
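To illustrate the distinction quantitatively, here is a minimal Monte Carlo sketch. The IFR distribution is a made-up stand-in (log-uniform between 0.01% and 100%), chosen only to show that re-rolling “a natural pandemic” samples from a distribution with a tail, whereas re-rolling COVID-19 in particular just repeats one fixed low-IFR draw 10,000 times:

```python
import random

random.seed(0)

# Assumption (illustrative only, not an empirical estimate): IFR of a
# random natural pandemic is log-uniform between 0.01% and 100%.
def sample_ifr():
    return 10 ** random.uniform(-4, 0)

# Re-rolling "a natural pandemic" 10,000 times samples the whole
# distribution, tail included...
rolls = [sample_ifr() for _ in range(10_000)]
severe = sum(ifr > 0.5 for ifr in rolls)  # pandemics with >50% IFR
print(f"{severe} of 10,000 natural-pandemic re-rolls have >50% IFR")

# ...whereas re-rolling COVID-19 in particular repeats one fixed draw:
covid_rolls = [0.01] * 10_000  # <<5% IFR every time
print(f"{sum(ifr > 0.5 for ifr in covid_rolls)} of 10,000 COVID re-rolls have >50% IFR")
```

Under this (arbitrary) prior, several hundred of the 10,000 re-rolls exceed 50% IFR, even though the median re-roll is mild—which is why the tail of the distribution, not the COVID-19 data point, is what matters for the existential-risk question.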

We discussed the risk of ‘do-it-yourself’ science in Part 10 of this series. There, we saw that a paper by David Sarapong and colleagues laments “Sensational and alarmist headlines about DiY science” which “argue that the practice could serve as a context for inducing rogue science which could potentially lead to a ‘zombie apocalypse’.” These experts find little empirical support for any such claims.

Maybe this is addressed in Part 10, but this paragraph seems misleading insofar as Ord is talking about risk by 2100, and a major part of the story is that DIY biology in, say, 2085 may be importantly different and more dangerous than DIY biology in 2023, because the science and tech keep advancing and improving each year.

Needless to say, even if we could be 100% certain that DIY biology in 2085 will be super dangerous, there obviously would not be any “empirical support” for that, because 2085 hasn’t happened yet. It’s just not the kind of thing that presents empirical evidence for us to use. We have to do the best we can without it. The linked paper does not seem to discuss that issue at all, unless I missed it.

(I have a similar complaint about the discussion of Soviet bioweapons in Section 4—running a bioweapons program with 2024 science & technology is presumably quite different from running a bioweapons program with 1985 science & technology, and running one in 2085 would be quite different yet again.)

(Recently I've been using "AI safety" and "AI x-safety" interchangeably when I want to refer to the "overarching" project of making the AI transition go well, but I'm open to being convinced that we should come up with another term for this.)

I’ve been using the term “Safe And Beneficial AGI” (or more casually, “awesome post-AGI utopia”) as the overarching “go well” project, and “AGI safety” as the part where we try to make AGIs that don’t accidentally [i.e. accidentally from the human supervisors’ / programmers’ perspective] kill everyone, and (following common usage according to OP) “Alignment” for “The AGI is trying to do things that the AGI designer had intended for it to be trying to do”.

(I didn’t make up the term “Safe and Beneficial AGI”. I think I got it from Future of Life Institute. Maybe they in turn got it from somewhere else, I dunno.)

(See also: my post Safety ≠ alignment (but they’re close!))

See also a thing I wrote here:

Some researchers think that the “correct” design intentions (for an AGI’s motivation) are obvious, and define the word “alignment” accordingly. Three common examples are (1) “I am designing the AGI so that, at any given point in time, it’s trying to do what its human supervisor wants it to be trying to do”—this AGI would be “aligned” to the supervisor’s intentions. (2) “I am designing the AGI so that it shares the values of its human supervisor”—this AGI would be “aligned” to the supervisor. (3) “I am designing the AGI so that it shares the collective values of humanity”—this AGI would be “aligned” to humanity.

I’m avoiding this approach because I think that the “correct” intended AGI motivation is still an open question. For example, maybe it will be possible to build an AGI that really just wants to do a specific, predetermined, narrow task (e.g. design a better solar cell), in a way that doesn’t involve taking over the world etc. Such an AGI would not be “aligned” to anything in particular, except for the original design intention. But I still want to use the term “aligned” when talking about such an AGI.

Of course, sometimes I want to talk about (1,2,3) above, but I would use different terms for that purpose, e.g. (1) “the Paul Christiano version of corrigibility”, (2) “ambitious value learning”, and (3) “CEV”.

This kinda overlaps with (2), but the end of 2035 is 12 years away. A lot can happen in 12 years! If we look back 12 years, to December 2011: AlexNet had not come out yet, neural nets were a backwater within AI, a neural network with 10 layers and 60M parameters was considered groundbreakingly deep and massive, the idea of using GPUs in AI was revolutionary, TensorFlow was still years away, doing even very simple image classification tasks would continue to be treated as a funny joke for several more years (literally—this comic is from 2014!), I don’t think anyone was dreaming of AI that could pass a 2nd-grade science quiz or draw a recognizable picture without handholding, GANs had not been invented, nor transformers, nor deep RL, etc. etc., I think.

So “AGI by 2035” isn’t like “wow that could only happen if we’re already almost there”, instead it leaves tons of time for like a whole different subfield of AI to develop from almost nothing. 

(I'm making a case against being confidently skeptical about AGI by 2035, not a case for confidently expecting AGI by 2035.)
