The Distillation for Robust Unlearning paper (https://arxiv.org/abs/2506.06278) has me interested again in the idea of using distillation to absorb the benefits of a control protocol (https://arxiv.org/abs/2312.06942).
I already thought that was a natural "distillation and amplification" next step for control anyway, but the empirical results for unlearning make me excited about how this might work for control again.
Like, I guess I am just saying that if you are actually in a regime where you are using the trusted model some nontrivial fraction of the time, you might be able to distill off of that.
I relate it to the idea of iterated amplification and distillation; the control protocol is the scaffold/amplification. Plus, it seems natural that your most troubling outputs would receive special attention from bot/human/cyborg overseers and thus high-quality training feedback.
On the other hand, training off of control might make no sense at all if you think of that model as just one brain playing a game with itself, a game it can always rig or fake easily. And since a lot of the concern is scheming, this might basically make the "control protocol distill" dead on arrival: any worthwhile distill would still need to be smart enough that it might be sneak-attacking us for roughly the same reasons the original model was, and even extremely harmless training data doesn't help us with that.
Seems good to make the model tend to be more cool and less sketchy even if it would only be ~"trusted model level good" at some stuff. Idk though, I am divided here.
I think it might be cool if an AI Safety research organization ran a copy of an open model or something and I could pay them a subscription to use it. That way I'd know my LLM subscription money is going toward good AI stuff and not toward AI companies that I don't like or don't want more of on net.
Idk, existing independent orgs might not be the best place to do this bc it might "damn them" or "corrupt them" over time. Like, this could lead to them "selling out" in a variety of ways you might conceive of.
Still, I guess I am saying that to the extent anyone is going to actually "make money" off of my LLM subscriptions, it would be awesome if it were just a cool independent AIS lab I personally liked, or something similar. (I don't really know the margins and unit economics, which seem like an important part of this pitch lol.)
Like, if "GoodGuy AIS Lab" sets up a little website and inference server (running Qwen or Llama or whatever) then I could pay them the $15-25 a month I may have otherwise paid to an AI company. The selling point would be that less "moral hazard" is better vibes, but probably only some people would care about this at all and it would be a small thing. But also, it's hardly like a felt sense of moral hazard around AI is a terribly niche issue.
This isn't necessarily the "final form" of what I have in mind; I enjoy picking at ideas in the space of "what would a good guy AGI project do" or "how can you do neglected AIS / 'AI go well' research in a for-profit way".
I also like the idea of an explicitly fast-follower project for AI capabilities. Like, accelerate safety/security-relevant stuff and stay comfortably middle of the pack on everything else. I think improving GUIs is probably fair game too, but maybe not once it starts to shade into scaffolding? I wouldn't know all of the right lines to draw here, but I really like this vibe.
This might not work well if you expect gaps to widen as RSI (recursive self-improvement) becomes a more important input. I would argue that seems too galaxy-brained given that, as of writing, we do live in a world with a lot of mediocre AI companies that I believe can all provide products of ~comparable quality.
It is also just kind of a bet that it will probably remain a lot less expensive to stay a little behind the frontier than to be at the frontier, and that, in practice, the gap may continue to not matter in a lot of cases.
fwiw I think you shouldn't worry about paying $20/month to an evil company to improve your productivity, and if you want to offset it I think a $10/year donation to LTFF would more than suffice.
I haven't really thought about it and I'm not going to. If I wanted to be more precise, I'd assume that a $20 subscription is equivalent (to a company) to finding a $20 bill on the ground, assume that an ε% increase in spending on safety cancels out an ε% increase in spending on capabilities (or think about it and pick a different ratio), and look at money currently spent on safety vs capabilities. I don't think P(doom) or company-evilness is a big crux.
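If it helps, here is a minimal sketch of that back-of-envelope logic in code; all the dollar figures and the 1:1 cancellation assumption are illustrative placeholders, not actual spending estimates:

```python
# Back-of-envelope offset sketch. All dollar figures below are made-up
# placeholders, not actual spending estimates.

subscription_per_year = 20 * 12   # $20/month paid to a capabilities company
capabilities_spending = 100e9     # assumed total annual spending on capabilities ($)
safety_spending = 0.5e9           # assumed total annual spending on safety ($)

# Treat the subscription as a pure windfall to the company (the "$20 bill on
# the ground" assumption): it raises capabilities spending by this fraction.
epsilon = subscription_per_year / capabilities_spending

# If an epsilon-fraction increase in safety spending cancels an epsilon-fraction
# increase in capabilities spending (a 1:1 cancellation ratio; pick a different
# one if you think about it more), the offsetting donation is epsilon times
# total safety spending.
offset_donation = epsilon * safety_spending

print(f"offset donation = ${offset_donation:.2f}/year")  # $1.20/year with these numbers
```

With these placeholder numbers the offset comes out well under $10/year, which is the sense in which $10/year "would more than suffice"; plug in your own spending estimates and cancellation ratio to get your own figure.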
Alternative idea: AI companies should have a little checkbox saying "Please use 100% of the revenue from my subscription to fund safety research only." This avoids some of the problems with your idea and also introduces some new problems.
I think there is a non-infinitesimal chance that Anthropic would actually implement this.
Ya, maybe. This concern/way of thinking just seems kind of niche; probably only a very small demographic overlaps with me here. So I guess I wouldn't expect it to be a consequential amount of money to, e.g., Anthropic or OpenAI.
That checkbox would be really cool though. It might ease friction/dissonance for people who buy into high p(doom) or relatively non-accelerationist perspectives. My views are not representative of anyone but me, but a checkbox like that would be a killer feature for me and would certainly win my $20/mo :) . And maybe, y'know, that of all 100 people or whatever who would care and see it that way.
Can you say more about why you think a 1:24 ratio is the right one (as opposed to a lower or higher ratio)? And how might this ratio differ for people who have different beliefs than you, for example about x-risk, the LTFF, or the evilness of these companies?