AI safety
Studying and reducing the existential risks posed by advanced artificial intelligence

Quick takes

Here are some quick takes on what you can do if you want to contribute to AI safety or governance (they may generalise, but no guarantees). Paraphrased from a longer talk I gave, transcript here.

* First, there's still tons of alpha left in having good takes.
  * (Matt Reardon originally said this to me and I was like, "what, no way", but now I think he was right and it's still true – thanks Matt!)
  * You might be surprised, because there are many people doing AI safety and governance work, but I think there's still plenty of demand for good takes, and you can distinguish yourself professionally by being a reliable source of them.
* But how do you have good takes?
  * Oversimplifying only slightly: you read Learning by Writing and go "yes, that's how I should orient to the reading and writing that I do", you do that a bunch of times with your reading and writing on AI safety and governance work, you share your writing somewhere, you have lots of conversations with people about it, and you change your mind and learn more. That's how you have good takes.
* What to read?
  * Start with the basics (e.g. BlueDot's courses, other reading lists), then work from there on what's interesting x important.
* Write in public.
  * Usually, if you haven't got evidence of your takes being excellent, it's not that useful to just generally voice them. Having takes and backing them up with some evidence, or saying things like "I read this thing, here's my summary, here's what I think", is useful. But it's hard to get readers to care if you're just like "I'm some guy, here are my takes."
* Some especially useful kinds of writing
  * In order to get people to care about your takes, you could do useful kinds of writing first, like:
    * Explaining important concepts
      * E.g. evals awareness, non-LLM architectures (should I care? why?), AI control, best arguments for/against sho
EA Connect 2025: Personal Takeaways

Background

I'm Ondřej Kubů, a postdoctoral researcher in mathematical physics at ICMAT Madrid, working on integrable Hamiltonian systems. I've engaged with EA ideas since around 2020—initially through reading and podcasts, then ACX meetups, and from 2023 more regularly with Prague EA (now EA Madrid after moving here). I took the GWWC 10% pledge during the event.

My EA focus is longtermist, primarily AI risk. My mathematical background has led me to take seriously arguments that alignment of superintelligent AI may face fundamental verification problems, and that current trajectories pose serious catastrophic risk. This shapes my donations toward governance and advocacy rather than technical alignment. I'm not ready to pivot careers at this stage—I'm contributing through donations while continuing in mathematics. I attended EA Connect during a job search, so sessions on career strategy and donation prioritization were particularly relevant.

On donation strategy

Joseph Savoie's talk Twice as Good introduced the POWERS framework for improving donation impact: Price Tag (know the cost per outcome), Options (compare alternatives), Who (choose the right evaluator), Evaluate (use concrete benchmarks), Reduce (minimize burden on NGOs), Substance (focus on how charities work, not presentation). The framework is useful but clearly aimed at large donors—"compare 10+ alternatives" and "hire someone to evaluate" aren't realistic for someone donating 10% of a postdoc salary.

The "Price Tag" slide was striking: what $1 million buys across cause areas—200 lives saved via malaria nets, 3 million farmed animals helped through advocacy, 6.1 gigatons of CO₂ mitigation potential through agrifood reform. But the X-Risk/AI line only specified inputs ("fund 3-4 research projects"), not outcomes. This reflects the illegibility problem I asked about in office hours: how do you evaluate AI governance donations? Savoie acknowledged he doesn't donate much
Scrappy note on the AI safety landscape. Very incomplete, but probably a good way to get oriented to (a) some of the orgs in the space, and (b) how the space is carved up more generally.

(A) Technical

(i) A lot of the safety work happens in the scaling-based AGI companies (OpenAI, GDM, Anthropic, and possibly Meta, xAI, Mistral, and some Chinese players). Some of it is directly useful, some of it is indirectly useful (e.g. negative results, datasets, open-source models, position pieces, etc.), and some is not useful and/or a distraction. It's worth developing good assessment mechanisms/instincts about these.

(ii) A lot of safety work happens in collaboration with the AGI companies, but by individuals/organisations with some amount of independence and/or different incentives. Some examples: METR, Redwood, UK AISI, Epoch, Apollo. It's worth understanding what they're doing with AGI cos and what their theories of change are.

(iii) Orgs that don't seem to work directly with AGI cos but are deeply technically engaging with frontier models and their relationship to catastrophic risk: places like Palisade, FAR AI, CAIS. These orgs maintain even more independence, and are able to do/say things which the previous tier maybe cannot. A recent cool example: CAIS found that models don't do well on remote work tasks -- completing only 2.5% of them -- in contrast to OpenAI's GDPval findings, which suggest models have an almost 50% win-rate against industry professionals on a suite of "economically valuable, real-world tasks".

(iv) Orgs that are pursuing other* technical AI safety bets, different from the AGI cos: FAR AI, ARC, Timaeus, Simplex AI, AE Studio, LawZero, many independents, and some academics at e.g. CHAI/Berkeley, MIT, Stanford, MILA, Vector Institute, Oxford, Cambridge, UCL, and elsewhere. It's worth understanding why they want to make these bets, including whether it's their comparative advantage, an alignment with their incentives/grants, or whether they
Not sure who needs to hear this, but Hank Green has published two very good videos about AI safety this week: an interview with Nate Soares and a SciShow explainer on AI safety and superintelligence. Incidentally, he appears to have also come up with the ITN framework from first principles (h/t @Mjreard). Hopefully this is auspicious for things to come?
FYI: METR is actively fundraising!

METR is a non-profit research organization. We prioritise independence and trustworthiness, which shapes both our research process and our funding options. To date, we have not accepted payment from frontier AI labs for running evaluations.[1]

Part of METR's role is to independently assess the arguments that frontier AI labs put forward about the safety of their models. These arguments are becoming increasingly complex and dependent on nuances of how models are trained and how mitigations were developed. For this reason, it's important that METR has its finger on the pulse of frontier AI safety research. This means hiring and paying staff who might otherwise work at frontier AI labs, requiring us to compete with labs directly for talent.

The central constraint on publishing more and better research, and on scaling up our work aimed at monitoring the AI industry for catastrophic risk, is growing our team with excellent new researchers and engineers. And our recruiting is, to some degree, constrained by our fundraising - especially given the skyrocketing compensation that AI companies are offering.

To donate to METR, click here: https://metr.org/donate

If you'd like to discuss giving with us first, or receive more information about our work for the purpose of informing a donation, reach out to giving@metr.org

1. ^ However, we are definitely not immune from conflicting incentives. Some examples:
   - We are open to taking donations from individual lab employees (subject to some constraints, e.g. excluding senior decision-makers, and constituting <50% of our funding).
   - Labs provide us with free model access for conducting our evaluations, and several labs also provide us ongoing free access for research even when we're not conducting a specific evaluation.
I try to maintain this public doc of AI safety cheap tests and resources, although it's due a deep overhaul. Suggestions and feedback welcome!
A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections. I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn't how safety policies are supposed to work.

(This was originally a tweet thread (https://x.com/RyanPGreenblatt/status/1925992236648464774) which I've converted into a quick take. I also posted it on LessWrong.)

What is the change and how does it affect security?

9 days ago, Anthropic changed their RSP so that ASL-3 no longer requires being robust to employees trying to steal model weights if the employee has any access to "systems that process model weights".

Anthropic claims this change is minor (and calls insiders with this access "sophisticated insiders"). But, I'm not so sure it's a small change: we don't know what fraction of employees could get this access and "systems that process model weights" isn't explained.

Naively, I'd guess that access to "systems that process model weights" includes employees being able to operate on the model weights in any way other than through a trusted API (a restricted API that we're very confident is secure). If that's right, it could be a high fraction! So, this might be a large reduction in the required level of security.

If this does actually apply to a large fraction of technical employees, then I'm also somewhat skeptical that Anthropic can actually be "highly protected" from (e.g.) organized cybercrime groups without meeting the original bar: hacking an insider and using their access is typical! Also, one of the easiest ways for security-aware employees to evaluate security is to think about how easily they could steal the weights. So, if you don't aim to be robust to employees, it might be much harder for employees to evaluate the level of security and then complain about not meeting requirements[1].

Anthropic's justification and why I disagree

Anthropic justified the change by
PSA: If you're doing evals things, every now and then you should look back at OpenPhil's page on capabilities evals to check against their desiderata and questions in sections 2.1-2.2, 3.1-3.4, 4.1-4.3 as a way to critically appraise the work you're doing.