Quick takes

I have a bunch of disagreements with Good Ventures and how they are allocating their funds, but also Dustin and Cari are plausibly the best people who ever lived.

Looks like Mechanize is choosing to be even more irresponsible than we previously thought. They're going straight for automating software engineering. Would love to hear their explanation for this.

"Software engineering automation isn't going fast enough"[1] - oh really?

This seems even less defensible than their previous explanation of how their work would benefit the world.

  1. ^

    Not an actual quote


A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections.

I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn't how safety policies are supposed to work.

(This was originally a tweet thread (https://x.com/RyanPGreenblatt/status/1925992236648464774) which I've converted into a quick take. I also posted it on LessWrong.)

What is the change and how does it affect security?

9 days ago, Anthropic changed their RSP so that ASL-3 no longer requires being robust to employees trying to steal model weights if the employee has any access to "systems that process model weights".

Anthropic claims this change is minor (and calls insiders with this access "sophisticated insiders").

But, I'm not so sure it's a small change: we don't know what fraction of employees could get this access and "systems that process model weights" isn't explained.

Naively, I'd guess that access to "systems that process model weights" includes employees being able to operate on the model weights in any way other than through a trusted API (a restricted API that we're very confident is secure). If that's right, it could be a high fraction! So, this might be a large reduction in the required level of security.

If this does actually apply to a large fraction of technical employees, then I'm also somewhat skeptical that Anthropic can actually be "highly protected" from (e.g.) organized cybercrime groups without meeting the original bar: hacking an insider and using their access is typical!

Also, one of the easiest ways for security-aware employees to evaluate security is to think about how easily they could steal the weights. So, if you don't aim to be robust to employees, it might be much harder for employees to evaluate the level of security and then complain about not meeting requirements[1].

Anthropic's justification and why I disagree

Anthropic justified the change by saying that model theft isn't much of the risk from amateur CBRN uplift (CBRN-3) and that the risks from AIs being able to "fully automate the work of an entry-level, remote-only Researcher at Anthropic" (AI R&D-4) don't depend on model theft.

I disagree.

On CBRN: If other actors are incentivized to steal the model for other reasons (e.g. models become increasingly valuable), it could end up broadly proliferating which might greatly increase risk, especially as elicitation techniques improve.

On AI R&D: AIs which are over the capability level needed to automate the work of an entry-level researcher could seriously accelerate AI R&D (via fast speed, low cost, and narrow superhumanness). If other less safe (or adversarial) actors got access, risk might increase a bunch.[2]

More strongly, ASL-3 security must suffice up until the ASL-4 threshold: it has to cover the entire range from ASL-3 to ASL-4. ASL-4 security itself is still not robust to high-effort attacks from state actors which could easily be motivated by large AI R&D acceleration.

As of the current RSP, it must suffice until just before AIs can "substantially uplift CBRN [at] state programs" or "cause dramatic acceleration in [overall AI progress]". These seem like extremely high bars indicating very powerful systems, especially the AI R&D threshold.[3]

As it currently stands, Anthropic might not require ASL-4 security (which still isn't sufficient for high-effort state actor attacks) until we see something like 5x AI R&D acceleration (and there might be serious issues with measurement lag).

I'm somewhat sympathetic to security not being very important for ASL-3 CBRN, but it seems very important as of the ASL-3 AI R&D threshold and crucial before the ASL-4 AI R&D threshold! I think the ASL-3 AI R&D threshold should probably trigger the ASL-4 security requirements instead!

Overall, Anthropic's justification for this last-minute change seems dubious, and the security requirements they've currently committed to seem dramatically insufficient for AI R&D threat models. To be clear, other companies have worse security commitments.

Concerns about potential noncompliance and lack of visibility

Another concern is that this last-minute change is quite suggestive of Anthropic being out of compliance with their RSP before they weakened the security requirements.

We have to trust Anthropic quite a bit to rule out noncompliance. This isn't a good state of affairs.

To explain this concern, I'll need to cover some background on how the RSP works.

The RSP requires ASL-3 security as soon as it's determined that ASL-3 can't be ruled out (as Anthropic says is the case for Opus 4).

Here's how it's supposed to go:

  • They ideally have ASL-3 security mitigations ready, including the required auditing.
  • Once they find the model is ASL-3, they apply the mitigations immediately (if not already applied).

If they aren't ready, they need temporary restrictions.
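
To make that intended sequencing concrete, here's a minimal schematic sketch of the decision flow as I understand it. This is my own illustrative pseudo-logic, not Anthropic's policy text; the function and flag names are hypothetical.

```python
# Schematic sketch of the intended RSP sequencing described above.
# Purely illustrative: names and structure are hypothetical, not from the RSP.

def required_action(asl3_ruled_out: bool, asl3_mitigations_ready: bool) -> str:
    """Return what the policy appears to require for a newly evaluated model."""
    if asl3_ruled_out:
        # Capability evaluation rules out ASL-3: existing (ASL-2) measures suffice.
        return "continue under existing measures"
    if asl3_mitigations_ready:
        # ASL-3 can't be ruled out and the mitigations (including required
        # auditing) are ready: apply them immediately, if not already applied.
        return "apply ASL-3 security and deployment mitigations now"
    # ASL-3 can't be ruled out but mitigations aren't ready: interim
    # restrictions are required until they are.
    return "apply temporary restrictions until ASL-3 mitigations are in place"
```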

My concern is that the security mitigations they had ready when they found the model was ASL-3 didn't suffice for the old ASL-3 bar but do suffice for the new bar (otherwise why did they change the bar?). So, prior to the RSP change they might have been out of compliance.

It's certainly possible they remained compliant:

  • Maybe they had measures which temporarily sufficed for the old higher bar but which were too costly longer term. Also, they could have deleted the weights outside of secure storage until the RSP was updated to lower the bar.
  • Maybe an additional last-minute security assessment (which wasn't required to meet the standard?) indicated inadequate security and they deployed temporary measures until they changed the RSP. It would be bad to depend on a last-minute security assessment for compliance.

(It's also technically possible that the ASL-3 capability decision was made after the RSP was updated. This would imply the decision was only made 8 days before release, so hopefully this isn't right. Delaying evals until an RSP change lowers the bar would be especially bad.)

Conclusion

Overall, this incident demonstrates our limited visibility into AI companies. How many employees are covered by the new bar? What triggered this change? Why does Anthropic believe it remained in compliance? Why does Anthropic think that security isn't important for ASL-3 AI R&D?

I think a higher level of external visibility, auditing, and public risk assessment would be needed (as a bare minimum) before placing any trust in policies like RSPs to keep the public safe from AI companies, especially as they develop existentially dangerous AIs.

To be clear, I appreciate Anthropic's RSP update tracker and that it explains changes. Other AI companies have mostly worse safety policies: as far as I can tell, o3 and Gemini 2.5 Pro are about as likely to cross the ASL-3 bar as Opus 4 and they have much worse mitigations!

Appendix and asides

I don't think current risks are existentially high (if current models were fully unmitigated, I'd guess this would cause around 50,000 expected fatalities per year) and temporarily being at a lower level of security for Opus 4 doesn't seem like that big of a deal. Also, given that security is only triggered after a capability decision, the ASL-3 CBRN bar is supposed to include some conservativeness anyway. But, my broader points around visibility stand and potential noncompliance (especially unreported noncompliance) should be worrying even while the stakes are relatively low.


You can view the page showing the RSP updates including the diff of the latest change here: https://www.anthropic.com/rsp-updates. Again, I appreciate that Anthropic has this page and makes it easy to see the changes they make to the RSP.


I find myself quite skeptical that Anthropic actually could rule out that Sonnet 4 and other models weaker than Opus 4 cross the ASL-3 CBRN threshold. How sure is Anthropic that it wouldn't substantially assist amateurs even after the "possible performance increase from using resources that a realistic attacker would have access to"? I feel like our current evidence and understanding is so weak, and models already substantially exceed virology experts at some of our best proxy tasks.

The skepticism applies similarly or more to other AI companies (and Anthropic's reasoning is more transparent).

But, this just serves to further drive home ways in which the current regime is unacceptable once models become so capable that the stakes are existential.


One response is that systems this powerful will be open sourced or trained by less secure AI companies anyway. Sure, but the intention of the RSP is (or was) to outline what would "keep risks below acceptable levels" if all actors follow a similar policy.

(I don't know if I ever bought that the RSP would succeed at this. It's also worth noting there is an explicit exit clause Anthropic could invoke if they thought proceeding outweighed the risks despite the risks being above an acceptable level.)


This sort of criticism is quite time-consuming and costly for me. For this reason there are specific concerns I have about AI companies which I haven't discussed publicly. This is likely true for other people as well. You should keep this in mind when assessing AI companies and their practices.

  1. ^

    It also makes these complaints less legible to other employees, whereas other employees could more easily interpret arguments about what they themselves could do.

  2. ^

    It looks like AI 2027 would estimate around a ~2x AI R&D acceleration for a system which was just over this ASL-3 AI R&D bar (as it seems somewhat more capable than the "Reliable agent" bar). I'd guess more like 1.5x at this point, but either way this is a big deal!

  3. ^

    Anthropic says they'll likely require a higher level of security for this "dramatic acceleration" AI R&D threshold, but they haven't yet committed to this nor have they defined a lower AI R&D bar which results in an ASL-4 security requirement.


I've now spoken to ~1,400 people as an advisor with 80,000 Hours, and if there's one quick thing I think is worth more people doing, it's a short reflection exercise about one's current situation.

Below are some (cluster of) questions I often ask in an advising call to facilitate this. I'm often surprised by how much purchase one can get simply from this -- noticing one's own motivations, weighing one's personal needs against a yearning for impact, identifying blind spots in current plans that could be triaged and easily addressed, etc.

 

A long list of semi-useful questions I often ask in an advising call

 

  1. Your context:
    1. What’s your current job like? (or like, for the roles you’ve had in the last few years…)
      1. The role
      2. The tasks and activities
      3. Does it involve management?
      4. What skills do you use? Which ones are you learning?
      5. Is there something in your current job that you want to change, that you don’t like?
    2. Default plan and tactics
      1. What is your default plan?
      2. How soon are you planning to move? How urgently do you need to get a job?
      3. Have you been applying? Getting interviews, offers? Which roles? Why those roles?
      4. Have you been networking? How? What is your current network?
      5. Have you been doing any learning, upskilling? How have you been finding it?
      6. How much time can you find to do things to make a job change? Have you considered e.g. a sabbatical or going down to a 3/4-day week?
      7. What are you feeling blocked/bottlenecked by?
    3. What are your preferences and/or constraints?
      1. Money
      2. Location
      3. What kinds of tasks/skills would you want to use? (writing, speaking, project management, coding, math, your existing skills, etc.)
      4. What skills do you want to develop?
      5. Are you interested in leadership, management, or individual contribution?
      6. Do you want to shoot for impact? How important is it compared to your other preferences?
        1. How much certainty do you want to have wrt your impact?
      7. If you could picture your perfect job – the perfect combination of the above – which of these preferences would you relax first in order to consider a role?
  2. Reflecting more on your values:
    1. What is your moral circle?
    2. Do future people matter?
    3. How do you compare problems?
    4. Do you buy this x-risk stuff?
    5. How do you feel about expected impact vs certain impact?
  3. For any domain of research you're interested in:
    1. What’s your answer to the Hamming question? Why?

 

If possible, I'd recommend trying to answer these questions out loud with another person listening (just like in an advising call!); they might be able to notice confusions, tensions, and places worth exploring further. Some follow-up prompts that might be applicable to many of the questions above:

  1. How do you feel about that?
  2. Why is that? Why do you believe that?
  3. What would make you change your mind about that?
  4. What assumptions is that built on? What would change if you changed those assumptions?
  5. Have you tried to work on that? What have you tried? What went well, what went poorly, and what did you learn?
  6. Is there anyone you can ask about that? Is there someone you could cold-email about that?

 

Good luck!


My favorite midsized grantmaker is Scott Alexander's ACX Grants, mainly because I've enjoyed his blog for over a decade and it's been really nice to see the community that sprang up around his writing grow and flourish, especially the EA stuff. His recent ACX Grants 1-3 Year Updates is a great read in this vein. Some quotes: 

The first cohort of ACX Grants was announced in late 2021, the second in early 2024. In 2022, I posted one-year updates for the first cohort. Now, as I start thinking about a third round, I’ve collected one-year updates on the second and three-year updates on the first. ...

The total cost of ACX Grants, both rounds, was about $3 million. Do these outcomes represent a successful use of that amount of money? ...

It’s harder to produce Inside View estimates, because so many of the projects either produce vague deliverables (eg a white paper that might guide future action) or intermediate results only (eg getting a government to pass AI safety regulations is good, but can’t be considered an end result unless those regulations prevent the AI apocalypse). Because we tend towards incubating charities and funding research (rather than last-mile causes like buying bednets), achieved measurable deliverables are thin on the ground. But here are things that ACX grantees have already accomplished:

  • Improved the living/slaughter conditions of 30 million fish.
  • Helped create Manifold Markets, a prediction market site with thousands of satisfied users, whose various spinoffs play a central role in the rationalist/EA community.
  • Helped create thousands of jobs in Rwanda and other developing countries
  • Passed an instant runoff vote proposition in Seattle.
  • Saved between a few dozen and a few hundred lives in Nigeria through better obstetric care.

And here are some intermediate deliverables from grantees:

  • Made Australian government take AI x-risk more seriously (estimated from 50th percentile to 60th percentile outcome)
  • Gotten the End Kidney Deaths Act (could save >1000 lives and billions of dollars per year) in front of Congress, with decent odds of passing by 2026.
  • Plausibly saved 2 billion chickens from painful death over next decade.
  • Antiparasitic medication oxfendazole continues to advance through the clinical trial process.

And here are some things that have not been delivered yet but that I remain especially optimistic about:

  • Creation of anti-mosquito drones that provide a second level of defense along with bednets.
  • Revolutionize diagnosis of traumatic brain injury
  • Improve dietary guidelines in developing countries
  • Continue to support research and adoption of far UV light for pandemic prevention
  • Reduce lead poisoning in Nigeria

I think these underestimate success since many projects have yet to pay off (or to convince me to be especially optimistic), and others have paid off in vague hard-to-measure ways.

This is a beautifully cross-sectional slice of the entire collective endeavor of effective altruism, and quite a lot of good done (or poised to be done) for a not-that-large sum of $3M over 2 cohorts, given that GW and OP move 2 OOMs more $ per year.

It's also been quite intellectually enriching to just see the sheer diversity of proposals to make the world better in these cohorts; e.g. I was a bit let down to learn that the Far Out Initiative didn't pan out ($50k to fund a team working on pharmacologic and genetic interventions to imitate the condition of Jo Cameron, a 77-year-old Scottish woman who is incapable of experiencing any physical or psychological suffering and yet has lived an astonishingly well-adjusted life; the aim was to create painkillers to splice into farm animals to promote cruelty-free meat and "end all suffering in the world forever").

Of Scott's lessons learned, this one stood out to me in light of the recent "elitism in EA" survey I just took, I think because I was leaning towards the same hope he had:

One disappointing result was that grants to legibly-credentialled people operating in high-status ways usually did better than betting on small scrappy startups (whether companies or nonprofits). For example, Innovate Animal Ag was in many ways overdetermined as a grantee - former Yale grad and Google engineer founder, profiled in NYT, already funded by Open Philanthropy - and they in fact did amazing work. On the other hand, there were a lot of promising ACX community members with interesting ideas who were going to turn them into startups any day now, but who ended up kind of floundering (although this also describes Manifold, one of our standout successes). One thing I still don't understand is that Innovate Animal Ag seemed to genuinely need more funding despite being legibly great and high status - does this screen off a theoretical objection that they don't provide ACX Grants with as much counterfactual impact? Am I really just mad that it would be boring to give too many grants to obviously-good things that even a moron could spot as promising?

The other takeaway of his that gave me mixed feelings was this one, I think because I'd been secretly hoping for some form of work-life balance compatibility with really effective (emphasis) direct-work altruism:

Someone (I think it might be Paul Graham) once said that they were always surprised how quickly destined-to-be-successful startup founders responded to emails - sometimes within a single-digit number of minutes regardless of time of day. I used to think of this as mysterious - some sort of psychological trait? Working with these grants has made me think of it as just a straightforward fact of life: some people operate an order of magnitude faster than others. The Manifold team created something like five different novel institutions in the amount of time it's taken some other grantees to figure out a business plan; I particularly remember one time when I needed something, sent out a request to talk about it with two or three different teams, and the Manifold team had fully created the thing and were pestering me to launch a trial version before some of the other people had even gotten back to me. I take no pleasure in reporting this - I sometimes take a week or two to answer emails, and all of the predictions about my personality that this implies would be correct - but it's increasingly something that I look for and respect. A lot of the most successful grants succeeded quickly, or at least were quick to get on a promising track. Since everything takes ten times longer than people expect, only someone who moves ten times faster than people expect can get things done in a reasonable amount of time.


Edited to add: I appreciated this comment by Alex Toussaint, an ACX grantee: 

Tornyol (anti-mosquito drones) is based in France, and we couldn't have gotten the support we got from ACX Grants from a local VC. ...

VCs, like potential employees or clients, have reading grids (i.e. rubrics, a transliteration of « une grille de lecture ») to evaluate pitches. The great thing I found about ACX Grants is that the grid is different, and encourages different kinds of projects. Founder obsession for a problem seems to be encouraged in ACX Grants, although it's clearly discouraged for very early VC funding. VCs like very well made slides, communication abilities, and beautiful people in general, while I've found no such bias for ACX Grants. Being based outside the US is a big minus for American VCs, but ACX Grants almost seems to be favoring it. VCs tend to think a lot by analogy (the Uber for X, the Cursor for Y ...) while I found ACX Grants to be much more thinking from first principles than the median VC I met.

I'm not criticizing the VC reading grid. It obviously comes from experience and it tends to work financially for them. But you have to remember that a large part of the decision comes down to the potential for a quite early (3-4 years) and billion-dollar exit option. Not all projects fit that, and it's a good thing to support the others. The other advantage of it is that it selects founders who can go through the hoops of making their project fit the grid. That proves to VCs that the founders are capable of adapting their message to their interlocutors, which is highly necessary when raising further money, recruiting, or discussing with any partner. That's something ACX Grants does not seem to value much.

All in all, ACX Grants is great in that it provides funding with a very unique reading grid, so it helps projects that could get no help anywhere else.

