AI safety
Studying and reducing the existential risks posed by advanced artificial intelligence

Quick takes

24
7d
Hot Take: Securing AI Labs could actually make things worse

There's a consensus view that stronger security at leading AI labs would be a good thing. It's not at all clear to me that this is the case. Consider the extremes:

In a maximally insecure world, where anyone can easily steal any model that gets trained, there's no profit or strategic/military advantage to be gained from doing the training, so nobody's incentivised to invest much to do it. We'd only get AGI if some sufficiently well-resourced group believed it would be good for everyone to have an AGI, and were willing to fund its development as philanthropy.

In a maximally secure world, where stealing trained models is impossible, whichever company/country got to AGI first could essentially dominate everyone else. In this world there's a huge incentive to invest and to race.

Of course, our world lies somewhere between these two. State actors almost certainly could steal models from any of the big 3; potentially organised cybercriminals and rival companies too, but most private individuals could not. Still, it seems that marginal steps towards a higher-security world make investment and racing more appealing, as the number of actors able to steal the products of your investment and compete with you for profits/power falls.

But I notice I am confused. The above reasoning predicts that nobody should be willing to make significant investments in developing AGI at current levels of cybersecurity, since if they succeeded their AGI would immediately be stolen by multiple governments (and possibly rival companies/cybercriminals), which would probably nullify any return on the investment. What I observe is OpenAI raising $40 billion in their last funding round, with the explicit goal of building AGI.

So now I have a question: given current levels of cybersecurity, why are investors willing to pour so much cash into building AGI? ...maybe it's the same reason various actors are willing to invest into building open
68
1mo
3
A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections. I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn't how safety policies are supposed to work.

(This was originally a tweet thread (https://x.com/RyanPGreenblatt/status/1925992236648464774) which I've converted into a quick take. I also posted it on LessWrong.)

What is the change and how does it affect security?

9 days ago, Anthropic changed their RSP so that ASL-3 no longer requires being robust to employees trying to steal model weights if the employee has any access to "systems that process model weights".

Anthropic claims this change is minor (and calls insiders with this access "sophisticated insiders"). But I'm not so sure it's a small change: we don't know what fraction of employees could get this access, and "systems that process model weights" isn't explained. Naively, I'd guess that access to "systems that process model weights" includes employees being able to operate on the model weights in any way other than through a trusted API (a restricted API that we're very confident is secure). If that's right, it could be a high fraction! So this might be a large reduction in the required level of security.

If this does actually apply to a large fraction of technical employees, then I'm also somewhat skeptical that Anthropic can actually be "highly protected" from (e.g.) organized cybercrime groups without meeting the original bar: hacking an insider and using their access is typical! Also, one of the easiest ways for security-aware employees to evaluate security is to think about how easily they could steal the weights. So, if you don't aim to be robust to employees, it might be much harder for employees to evaluate the level of security and then complain about not meeting requirements[1].

Anthropic's justification and why I disagree

Anthropic justified the change by
6
4d
The Distillation for Robust Unlearning paper (https://arxiv.org/abs/2506.06278) makes me re-interested in the idea of using distillation to absorb the benefits of a Control Protocol (https://arxiv.org/abs/2312.06942). I thought that was a natural "Distillation and Amplification" next step for control anyway, but the empirical results for unlearning make me excited about how this might work for control again. Like, I guess I am just saying that if you are actually in a regime where you are using the trusted model some nontrivial fraction of the time, you might be able to distill off of that. I relate it to the idea of iterated amplification and distillation; the control protocol is the scaffold/amplification. Plus, it seems natural that your most troubling outputs would receive special attention from bot/human/cyborg overseers and receive high-quality training feedback.

Training off of control might make no sense at all if you then think of that model as just one brain playing a game with itself, a game it can always rig/fake easily. And since a lot of the concern is scheming, this might basically make the "control protocol distill" dead on arrival: any worthwhile distill would still need to be smart enough that it might be sneak-attacking us for roughly the same reasons the original model was, and even extremely harmless training data doesn't help us with that. Still, it seems good to make the model tend to be more cool and less sketchy, even if it would only be ~"trusted model level good" at some stuff. Idk though, I am divided here.
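To make the "distill off the control protocol" idea concrete, here is a minimal sketch of the standard temperature-scaled distillation objective, treating the control protocol's output distribution (after any trusted-model edits or oversight) as the teacher. This is purely illustrative: the function names are hypothetical, numpy stands in for a real training framework, and nothing here comes from the papers linked above.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, protocol_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    protocol_logits: per-token logits implied by the control protocol's
        chosen outputs (the "amplified" teacher signal).
    student_logits: the distilled model's logits on the same tokens.
    The T*T factor is the usual gradient-scale correction from
    Hinton-style distillation.
    """
    p = softmax(protocol_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(np.mean(kl) * T * T)
```

In practice the teacher signal would come from logged protocol transcripts rather than raw teacher logits, so the loss above is the optimistic case where full distributions are available.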
13
11d
2
In some discussions I had with people at EAG, it was interesting to discover that there might be a significant lack of EA-aligned people in the hardware space of AI, which seems to translate into difficulties getting industry contacts for co-development of hardware-level AI safety measures. To the degree that there are EA members in these companies, it might make sense to create some kind of communication space to exchange ideas between people working on hardware AI safety and people at hardware-relevant companies (think Broadcom, Samsung, Nvidia, GlobalFoundries, TSMC, etc.). Unfortunately, I feel that culturally these spaces (EEng/CE) are not very receptive to EA ideas, and the boom in ML/AI has caused significant self-selection of people towards hotter topics. I believe there could be significant benefit to accelerating realistic safety designs if discussions can be moved into industry as fast as possible.
80
5mo
1
I recently created a simple workflow to allow people to write to the Attorneys General of California and Delaware to share thoughts + encourage scrutiny of the upcoming OpenAI nonprofit conversion attempt.

Write a letter to the CA and DE Attorneys General

I think this might be a high-leverage opportunity for outreach. Both AG offices have already begun investigations, and AGs are elected officials who are primarily tasked with protecting the public interest, so they should care what the public thinks and prioritizes. Unlike e.g. congresspeople, I don't think AGs often receive grassroots outreach (I found ~0 examples of this in the past), and an influx of polite and thoughtful letters may have some influence — especially from CA and DE residents, although I think anyone impacted by their decision should feel comfortable contacting them.

Personally I don't expect the conversion to be blocked, but I do think the value and nature of the eventual deal might be significantly influenced by the degree of scrutiny on the transaction. Please consider writing a short letter — even a few sentences is fine. Our partner handles the actual delivery, so all you need to do is submit the form. If you want to write one on your own and can't find contact info, feel free to dm me.
16
20d
2
So, I have two possible projects for AI alignment work that I'm debating between focusing on. I'm curious for input on how worthwhile they'd be to pursue or follow up on.

The first is a mechanistic interpretability project. I have previously explored things like truth probes by reproducing the Marks and Tegmark paper and extending it to test whether a cosine-similarity-based linear classifier works as well. It does, but not any better or worse than the difference-of-means method from that paper. Unlike difference of means, however, it can be extended to multi-class situations (though logistic regression can be as well). I was thinking of extending the idea to try to create an activation-vector-based "mind reader" that calculates the cosine similarity with various words embedded in the model's activation space. This would, if it works, allow you to get a bag of words that the model is "thinking" about at any given time.

The second project is a less common game-theoretic approach. Earlier, I created a variant of the Iterated Prisoner's Dilemma as a simulation that includes death, asymmetric power, and aggressor reputation. I found, interestingly, that cooperative "nice" strategies banding together against aggressive "nasty" strategies produced an equilibrium where the cooperative strategies win out in the long run, generally outnumbering the aggressive ones considerably by the end. Although this simulation probably requires more analysis and testing in more complex environments, it seems to point to the idea that being consistently nice to weaker nice agents acts as a signal to more powerful nice agents and allows coordination that increases the chance of survival of all the nice agents, whereas being nasty leads to a winner-takes-all highlander situation. From an alignment perspective, this could be a kind of infoblessing: an AGI or ASI might be persuaded to spare humanity for these game-theoretic reasons.
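The core computation of the first project's "mind reader" is small enough to sketch. A minimal version, assuming you already have per-word direction vectors in the model's activation space (e.g. from token embeddings or difference-of-means probes); all names here are illustrative, not from the poster's actual code:

```python
import numpy as np

def top_k_concepts(activation, concept_vectors, concept_names, k=5):
    """Rank candidate words by cosine similarity with an activation vector.

    activation:      (d,) residual-stream activation at some layer/token.
    concept_vectors: (n, d) matrix of per-word direction vectors.
    concept_names:   list of n word labels matching the rows above.
    Returns the k (word, cosine) pairs with highest similarity --
    the "bag of words" the model is plausibly representing right now.
    """
    a = activation / np.linalg.norm(activation)
    c = concept_vectors / np.linalg.norm(concept_vectors, axis=1, keepdims=True)
    sims = c @ a                      # (n,) cosine similarities
    order = np.argsort(-sims)[:k]
    return [(concept_names[i], float(sims[i])) for i in order]
```

One design note: because cosine similarity ignores magnitude, this treats each word direction as a pure direction in activation space, which matches how the difference-of-means probe in the Marks and Tegmark setup is usually interpreted.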
20
1mo
3
I was extremely disappointed to see this tweet from Liron Shapira revealing that the Center for AI Safety fired a recent hire, John Sherman, for stating that members of the public would attempt to destroy AI labs if they understood the magnitude of AI risk. Capitulating to this sort of pressure campaign is not the right path for EA, which should have a focus on seeking the truth rather than playing along with social-status games, and is not even the right path for PR (it makes you look like you think the campaigners have valid points, which in this case is not true). This makes me think less of CAIS' decision-makers.
37
2mo
5
I'm not sure how to word this properly, and I'm uncertain about the best approach to this issue, but I feel it's important to get this take out there.

Yesterday, Mechanize was announced, a startup focused on developing virtual work environments, benchmarks, and training data to fully automate the economy. The founders include Matthew Barnett, Tamay Besiroglu, and Ege Erdil, who are leaving (or have left) Epoch AI to start this company.

I'm very concerned we might be witnessing another situation like Anthropic, where people with EA connections start a company that ultimately increases AI capabilities rather than safeguarding humanity's future. But this time, we have a real opportunity for impact before it's too late. I believe this project could potentially accelerate capabilities, increasing the odds of an existential catastrophe.

I've already reached out to the founders on X, but perhaps there are people more qualified than me who could speak with them about these concerns. In my tweets to them, I expressed worry about how this project could speed up AI development timelines, asked for a detailed write-up explaining why they believe this approach is net positive and low risk, and suggested an open debate on the EA Forum.

While their vision of abundance sounds appealing, rushing toward it might increase the chance we never reach it due to misaligned systems. I personally don't have a lot of energy or capacity to work on this right now, nor do I think I have the required expertise, so I hope that others will pick up the slack.

It's important we approach this constructively and avoid attacking the three founders personally. The goal should be productive dialogue, not confrontation. Does anyone have thoughts on how to productively engage with the Mechanize team? Or am I overreacting to what might actually be a beneficial project?