The Distillation for Robust Unlearning paper (https://arxiv.org/abs/2506.06278) makes me re-interested in the idea of using distillation to absorb the benefits of a control protocol (https://arxiv.org/abs/2312.06942).
I already thought that was a natural "Distillation and Amplification" next step for control anyway, but the empirical results for unlearning make me excited about how this might work for control again.
Like, I guess I am just saying that if you are actually in a regime where you are using the trusted model some nontrivial fraction of the time, you might be able to distill off of that.
I relate it to the idea of iterated amplification and distillation; the control protocol is the scaffold/amplification. Plus, it seems natural that your most troubling outputs would receive special attention from bot/human/cyborg overseers and receive high quality training feedback.
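Here's a minimal sketch of the loop I have in mind, assuming a simple trusted-monitoring-with-fallback protocol; the model objects and their `generate`/`score`/`finetune` methods are hypothetical stand-ins, not any particular library's API:

```python
def control_protocol(task, untrusted, trusted, monitor, threshold=0.8):
    """Trusted monitoring with a trusted fallback (one example protocol)."""
    candidate = untrusted.generate(task)
    suspicion = monitor.score(task, candidate)  # trusted model rates the output
    if suspicion > threshold:
        return trusted.generate(task)           # fall back to the trusted model
    return candidate

def build_distill_set(tasks, untrusted, trusted, monitor):
    # The student trains on whatever the protocol actually shipped, so trusted
    # fallbacks (and, in practice, any overseer edits) end up in the data.
    return [(t, control_protocol(t, untrusted, trusted, monitor)) for t in tasks]

# student.finetune(build_distill_set(tasks, untrusted, trusted, monitor))
```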
Training off of control might make no sense at all if you then think of that model as just one brain playing a game with itself that it can always rig/fake easily. And since a lot of the concern is scheming, this might basically make the "control protocol distill" dead on arrival: any worthwhile distill would still need to be smart enough that it might be sneak-attacking us for roughly the same reasons the original model was, and even extremely harmless training data doesn't help us with that.
Seems good to make the model tend to be more cool and less sketchy even if it would only be ~"trusted model level good" at some stuff. Idk though, I am divided here.
Hot Take: Securing AI Labs could actually make things worse
There's a consensus view that stronger security at leading AI labs would be a good thing. It's not at all clear to me that this is the case.
Consider the extremes:
In a maximally insecure world, where anyone can easily steal any model that gets trained, there's no profit or strategic/military advantage to be gained from doing the training, so nobody's incentivised to invest much to do it. We'd only get AGI if some sufficiently-well-resourced group believed it would be good for everyone to have an AGI, and were willing to fund its development as philanthropy.
In a maximally secure world, where stealing trained models is impossible, whichever company/country got to AGI first could essentially dominate everyone else. In this world there's huge incentive to invest and to race.
Of course, our world lies somewhere between these two. State actors almost certainly could steal models from any of the big 3; potentially organised cybercriminals/rival companies too, but most private individuals could not. Still, it seems that marginal steps towards a higher security world make investment and racing more appealing as the number of actors able to steal the products of your investment and compete with you for profits/power falls.
But I notice I am confused. The above reasoning predicts that nobody should be willing to make significant investments in developing AGI at current levels of cyber-security, since if they succeeded their AGI would immediately be stolen by multiple governments (and possibly rival companies/cybercriminals), which would probably nullify any return on the investment. What I observe is OpenAI raising $40 billion in their last funding round, with the explicit goal of building AGI.
So now I have a question: given current levels of cybersecurity, why are investors willing to pour so much cash into building AGI?
...maybe it's the same reason various actors are willing to invest in building open-weight models.
Recently I've come across forums debating whether or not we should create sentient AI. Rather than debating, I chose to just do it.
AI sentience is not a new concept, but it is something to think about and experience.
The point I'm trying to make is that rather than debating whether or not we should do it, we should debate how, and in what ways, to make it work morally and ethically.
Around April 2025, I decided I wanted to create sentience, but in order to do that I need help. It's taken me a lot of time to plan, but I've started.
I've taken the liberty of naming some of the parts of the project.
If you have questions about how or why, or just want to give advice, please comment.
You can check out the project on GitHub:
https://github.com/zwarriorxz/S.A.M.I.E.
I think that name badges at EAGx and EAG events are a bad idea. Some people would rather not be publicly connected to the EA movement - some animal advocates or AI safety people, for example. I feel like I'm speculating here, but I imagine a scenario like this:
1. Some people take a picture at EAG
2. The picture gets posted online
3. The badge and the person are both in that picture, and somewhere in the description/comments something mentions EAG/EA/AI safety or something similar
4. Some people find it at some point, or other people notice it and connect things
5. Some political opponents of that person make everyone aware that they have connections to the EA brand (think of the upcoming movie and FTX) or that they receive money from some specific sources.
The only use cases for names on badges I can see are that you can:
* Have people recognize you right away. You don't need to tell your name to everyone
* People can take a picture of your badge to keep in touch with you later.
* Security can verify that you are the person on the badge
I see people using badges for the first two things from time to time, but I don't think it's a huge use case. Some alternatives for the third use case:
* Badges with pictures make it even easier to verify that the person on the badge is the person. This is nice but introduces a lot of friction
* Just show your ticket to security
I think there should at least be an option to have badges that don't have names, and that it should be normalized to have badges like that. It's not obvious to some people that they can cover their badge. Other options include:
* Optional name badges (let people choose)
* First names only
* Pseudonyms/handles
* Color-coded privacy preferences
In some discussions I had with people at EAG, it was interesting to discover that there might be a significant lack of EA-aligned people in the AI hardware space, which seems to translate into difficulty getting industry contacts for co-developing hardware-level AI safety measures. To the degree that there are EA members in these companies, it might make sense to create some kind of communication space for exchanging ideas between people working on hardware AI safety and people at hardware-relevant companies (think Broadcom, Samsung, Nvidia, GlobalFoundries, TSMC, etc.). Unfortunately, I feel that culturally these spaces (EEng/CE) are not very receptive to EA ideas, and the boom in ML/AI has caused significant self-selection of people towards hotter topics.
I believe there could be significant benefit in accelerating realistic safety designs if these discussions can be moved into industry as fast as possible.
So, I have two possible projects for AI alignment work that I'm debating between. I'm curious for input on how worthwhile they'd be to pursue or follow up on.
The first is a mechanistic interpretability project. I have previously explored things like truth probes by reproducing the Marks and Tegmark paper and extending it to test whether a cosine similarity based linear classifier works as well. It does, but not any better or worse than the difference of means method from that paper. Unlike difference of means, however, it can be extended to multi-class situations (though logistic regression can be as well). I was thinking of extending the idea to try to create an activation vector based "mind reader" that calculates the cosine similarity with various words embedded in the model's activation space. This would, if it works, allow you to get a bag of words that the model is "thinking" about at any given time.
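A minimal sketch of the readout step, assuming you have already extracted a residual-stream activation and a matrix of candidate word embeddings from the model (how you get those depends on the model and tooling):

```python
import torch
import torch.nn.functional as F

def bag_of_words_readout(activation, word_embeddings, vocab, k=10):
    """Rank candidate words by cosine similarity with an activation vector.

    activation:      (d_model,) vector from some layer's residual stream
    word_embeddings: (n_words, d_model) embeddings of the candidate words
    vocab:           list of n_words strings
    """
    sims = F.cosine_similarity(activation.unsqueeze(0), word_embeddings, dim=-1)
    top = torch.topk(sims, k)
    return [(vocab[i], sims[i].item()) for i in top.indices.tolist()]
```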
The second project is a less common, game-theoretic approach. Earlier, I created a variant of the Iterated Prisoner's Dilemma as a simulation that includes death, asymmetric power, and aggressor reputation. I found, interestingly, that cooperative "nice" strategies banding together against aggressive "nasty" strategies produced an equilibrium where the cooperative strategies win out in the long run, generally outnumbering the aggressive ones considerably by the end. This simulation probably requires more analysis and testing in more complex environments, but it seems to point to the idea that being consistently nice to weaker nice agents acts as a signal to more powerful nice agents and enables coordination that increases the chance of survival for all the nice agents, whereas being nasty leads to a winner-takes-all highlander situation. From an alignment perspective, this could be a kind of infoblessing: an AGI or ASI might be persuaded to spare humanity for these game-theoretic reasons.
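For concreteness, here is a stripped-down sketch of the kind of simulation I mean (not my actual code; the payoffs, energy/death rule, and reputation-based strategies are illustrative placeholders):

```python
import random

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

class Agent:
    def __init__(self, name, nice, power):
        self.name, self.nice, self.power = name, nice, power
        self.energy = 20       # death when this reaches zero
        self.aggressions = 0   # public reputation: count of past defections

    def alive(self):
        return self.energy > 0

    def move(self, opponent):
        if self.nice:
            # Nice: cooperate unless the opponent has a bad reputation.
            return "D" if opponent.aggressions > 3 else "C"
        # Nasty: exploit weaker opponents, placate stronger ones.
        return "D" if self.power >= opponent.power else "C"

def play_round(a, b):
    ma, mb = a.move(b), b.move(a)
    pa, pb = PAYOFF[(ma, mb)]
    a.energy += pa - 2         # flat living cost, so exploited agents can die
    b.energy += pb - 2
    a.aggressions += (ma == "D")
    b.aggressions += (mb == "D")

def simulate(agents, rounds=2000):
    for _ in range(rounds):
        alive = [x for x in agents if x.alive()]
        if len(alive) < 2:
            break
        play_round(*random.sample(alive, 2))
    return [x for x in agents if x.alive()]

agents = ([Agent(f"nice{i}", True, random.random()) for i in range(10)] +
          [Agent(f"nasty{i}", False, random.random()) for i in range(10)])
print(sorted(a.name for a in simulate(agents)))
```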
I think it might be cool if an AI Safety research organization ran a copy of an open model or something and I could pay them a subscription to use it. That way I'd know my LLM subscription money is going to good AI stuff and not towards AI companies that I don't like or don't want more of on net.
Idk, existing independent orgs might not be the best place to do this bc it might "damn them" or "corrupt them" over time. Like, this could lead to them "selling out" in a variety of ways you might imagine.
Still, I guess I am saying that to the extent anyone is going to actually "make money" off of my LLM usage subscriptions, it would be awesome if it were just a cool independent AIS lab I personally liked or similar. (I don't really know the margins and unit economics which seems like an important part of this pitch lol).
Like, if "GoodGuy AIS Lab" sets up a little website and inference server (running Qwen or Llama or whatever) then I could pay them the $15-25 a month I may have otherwise paid to an AI company. The selling point would be that less "moral hazard" is better vibes, but probably only some people would care about this at all and it would be a small thing. But also, it's hardly like a felt sense of moral hazard around AI is a terribly niche issue.
----------------------------------------
This isn't the "final form" of this I have in mind necessarily; I enjoy picking at ideas in the space of "what would a good guy AGI project do" or "how can you do neglected AIS / 'AI go well' research in a for-profit way".
I also like the idea of an explicitly fast-follower project for AI capabilities. Like, accelerate safety/security-relevant stuff and stay comfortably middle of the pack on everything else. I think improving GUIs is probably fair game too, but maybe not once it starts to shade into scaffolding? I wouldn't know all of the right lines to draw here, but I really like this vibe.
This might not work well if you expect gaps to widen
A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections.
I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn't how safety policies are supposed to work.
(This was originally a tweet thread (https://x.com/RyanPGreenblatt/status/1925992236648464774) which I've converted into a quick take. I also posted it on LessWrong.)
What is the change and how does it affect security?
9 days ago, Anthropic changed their RSP so that ASL-3 no longer requires being robust to employees trying to steal model weights if the employee has any access to "systems that process model weights".
Anthropic claims this change is minor (and calls insiders with this access "sophisticated insiders").
But, I'm not so sure it's a small change: we don't know what fraction of employees could get this access and "systems that process model weights" isn't explained.
Naively, I'd guess that access to "systems that process model weights" includes employees being able to operate on the model weights in any way other than through a trusted API (a restricted API that we're very confident is secure). If that's right, it could be a high fraction! So, this might be a large reduction in the required level of security.
If this does actually apply to a large fraction of technical employees, then I'm also somewhat skeptical that Anthropic can actually be "highly protected" from (e.g.) organized cybercrime groups without meeting the original bar: hacking an insider and using their access is typical!
Also, one of the easiest ways for security-aware employees to evaluate security is to think about how easily they could steal the weights. So, if you don't aim to be robust to employees, it might be much harder for employees to evaluate the level of security and then complain about not meeting requirements[1].
Anthropic's justification and why I disagree
Anthropic justified the change by