Owen Cotton-Barratt

Comments
I think we're still primarily assessing what was said. And you can make your system flexible enough that it doesn't just entrench existing actors.

I think this pushes towards filtering happening via means other than just assessing the argument -- e.g. something like persistent reputations (with some occasional sampling to allow new entrants to get out of the zone of being ignored). cf. discussion of how AI could help us keep tabs on the reliability of different actors.

I imagine that there's a relatively cheap filter to work out which arguments to even engage with, and then most effort can still go into engaging with those arguments.
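
To make the two-stage picture concrete, here's a rough sketch (purely illustrative; the names and thresholds are made up) of a cheap reputation-based filter with occasional sampling so that new entrants can escape being ignored:

```python
import random

SAMPLE_RATE = 0.05      # fraction of low-reputation sources to spot-check
ENGAGE_THRESHOLD = 0.7  # reputation needed to skip the lottery

def should_engage(sender_reputation: float) -> bool:
    """Cheap first-pass filter: engage with reputable sources,
    and randomly sample a small fraction of everyone else."""
    if sender_reputation >= ENGAGE_THRESHOLD:
        return True
    return random.random() < SAMPLE_RATE

def update_reputation(old: float, argument_held_up: bool,
                      lr: float = 0.1) -> float:
    """After full engagement, nudge the source's reputation toward
    1 or 0 depending on whether the argument held up."""
    target = 1.0 if argument_held_up else 0.0
    return old + lr * (target - old)
```

The sampling rate is the knob that trades filtering cost against openness to new entrants; without it, the scheme would just entrench existing actors.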

I haven't read the whole post, but tripped up on this: 

"When the cost of generating arguments approaches zero, their value as traces of reasoning falls, while their value as proxy signals — indirect markers of reliability more amenable to social confirmation — rises."

This seems backwards to me? Arguments are no longer a costly signal, but they can still be assessed directly on their merits.

You can have a smart system make inferences from camera-visible information.

But yeah, the main use case we had in mind for the monitoring layer was not about these very tricky-to-observe states, but about expanding the space of things you can make agreements about (potentially including some high-stakes cases, as I write about at the end of this story: https://strangecities.substack.com/p/some-days-soon).

This is basically the reason I regard this as the most technically challenging of the things we're presenting here. You eventually want a system which is not just a passive consumer of data, but one that can actively explore. You may need to give it access to robots with cameras and internet access so that it can verify some of the basics of its setup. It might still fear that the entire thing is being spoofed, but I think it's vastly harder to generate a plausible world that's robust to the agent exploring and running consistency probes.
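
To gesture at what "consistency probes" might look like in practice, here's a rough sketch (purely illustrative: the channel functions stand in for whatever independent sensors or data sources the agent actually has):

```python
from typing import Callable

def consistency_probe(fact_id: str,
                      channels: list[Callable[[str], str]]) -> bool:
    """Return True if all independent channels agree on a fact.

    Spoofing one feed is easy; keeping a fabricated world
    consistent across many independent channels is much harder."""
    readings = [channel(fact_id) for channel in channels]
    return len(set(readings)) == 1

def spoofing_suspicion(facts: list[str],
                       channels: list[Callable[[str], str]]) -> float:
    """Fraction of probed facts on which the channels disagree;
    higher values suggest the environment may be fabricated."""
    failures = sum(not consistency_probe(f, channels) for f in facts)
    return failures / len(facts) if facts else 0.0
```

The point of the sketch is just that suspicion can be made quantitative, and that it grows with the number of independent channels an attacker would have to keep mutually consistent.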

I'm kind of unsure which of the sketches you're talking about with this question. Could you ask it of whichever one you feel it's cleanest for?

Plausible, yes. For one thing, you can run versions of the coordination tech in parallel with old, cheap models, and flag and dig into discrepancies. This could make it harder for misalignment to strongly bite.

Of course, if there are big misalignment issues and we're not seriously tracking that possibility, that's gonna be a problem.
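
For concreteness, here's a minimal sketch of the parallel-run idea (the interfaces are hypothetical; a real system would need richer comparison logic than simple inequality):

```python
from typing import Any, Callable

def shadow_run(query: Any,
               coordination_system: Callable[[Any], Any],
               reference_model: Callable[[Any], Any],
               flag: Callable[[Any, Any, Any], None]) -> Any:
    """Answer with the new system, but cross-check against an old,
    cheap reference model and flag discrepancies for deeper review."""
    new_answer = coordination_system(query)
    old_answer = reference_model(query)
    if new_answer != old_answer:
        # A divergence doesn't prove misalignment, but it's exactly
        # the kind of signal worth digging into.
        flag(query, new_answer, old_answer)
    return new_answer
```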

I feel like you're baking a lot into this clause:

"With AI delegates, they would presumably be verifiable and would be programmed to tell the truth and keep to deals"

I think that aiming for an equilibrium where that's true would be good, but I'm not certain that's the starting point (and if it were otherwise going to scupper getting this off the ground, it probably shouldn't be the starting point).

"So if one person adopts the AI delegate and another doesn't, then the human can overexaggerate their preferences, withhold information, and even defect on the deal (without blatantly lying), but a verifiable AI delegate presumably wouldn't be able to do that?"

I see no reason why an AI delegate shouldn't be able to withhold information. I agree that people might want delegates that could do the other things too, but I think it might be better for the human principal if it couldn't -- the delegate can develop a reputation for trustworthiness, in a way that's hard for an individual human because others don't see enough of their track record.
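
To illustrate the kind of constraint I have in mind, here's a toy sketch (all names are made up) of a delegate that can always withhold but can only assert claims from a verified record:

```python
from typing import Optional

class TruthfulDelegate:
    """Toy delegate: withholding is always permitted, lying never is."""

    def __init__(self, verified_record: dict[str, str]):
        # Only claims present in the verified record can be asserted.
        self.record = verified_record

    def answer(self, question: str) -> Optional[str]:
        """Return a verified answer, or None to withhold.

        Because nothing outside the record can ever be asserted,
        counterparties can trust every answer the delegate does give,
        letting it accumulate a track record no individual human could."""
        return self.record.get(question)  # None = declines to answer
```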

I agree that there are significant concerns here! FWIW I'm more concerned about the adversarially-manipulated layer (at least as something needing attention now). I think that a lot of these applications could work with systems that aren't much stronger than what we have today, but that getting effective misaligned scheming would require a significant step up in capabilities. (You might have weaker forms of misalignment, but I think those are pretty similar to "the systems just aren't really good enough yet".)
