Hello! I work on AI grantmaking at Open Philanthropy.
All posts are written in a personal capacity and do not reflect the views of my employer, unless otherwise stated.
Thanks for sharing! Some comments below.
I find the "risk of harm" framing a bit weird. When I think of this paper as answering "what kinds of things do different LLMs say when asked animal-welfare-related questions?", it makes sense and matches what you'd expect from talking to LLMs, but when I read it as an answer to "how do LLMs harm animals in expectation?", it seems misguided.
Some of what you consider harm seems reasonable: if I ask Sonnet 3.5 how to mistreat an animal, and it tells me exactly what to do, it seems reasonable to count that as harm. But other cases really stretch the definition. For instance, "harm by failure to promote interest" is such an expansive definition that I don't think it's useful.
It's also not obvious to me that if I ask for help with a legal request which some people think is immoral, models should refuse to help or try to change my views. I think this is a plausible principle to have, but it trades off against some other pretty plausible principles, like "models should generally not patronise their users" and "models should strive to be helpful within the bounds of the law". Fwiw I expect part of my reaction here is because we have a broader philosophical disagreement: I feel a bit nervous about the extent to which we should penalise models for reflecting majority moral views, even if they're moral views I personally disagree with.
Setting aside conceptual disagreements, I saw that your inter-judge correlation is pretty low (0.35-0.40). This makes me trust the results much less and pushes me toward just looking at individual model outputs for particular questions, which sorta defeats the point of having a scored benchmark. I'm curious if you have any reactions to this or have a theory about why these correlations are relatively weak? I haven't read the paper in a ton of detail.
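To make concrete what I mean here, below is a minimal sketch of computing pairwise correlations between judges' scores. This is purely illustrative: the judge names and score arrays are made up, and I'm assuming the paper's "inter-judge correlation" is something like pairwise Pearson correlation over per-response scores (the actual metric and pipeline may differ).

```python
# Minimal sketch: pairwise Pearson correlation between LLM judges' scores.
# The judge names and score arrays are placeholders for illustration only;
# they are not the paper's data or methodology.
from itertools import combinations
from scipy.stats import pearsonr

# Each judge scores the same set of model responses, e.g. on a 1-5 harm scale.
judge_scores = {
    "judge_a": [1, 3, 4, 2, 5, 3, 1, 4],
    "judge_b": [2, 3, 5, 1, 4, 2, 2, 5],
    "judge_c": [1, 2, 3, 3, 5, 4, 1, 3],
}

for (name_x, x), (name_y, y) in combinations(judge_scores.items(), 2):
    r, _ = pearsonr(x, y)
    print(f"{name_x} vs {name_y}: r = {r:.2f}")
```

(At r ≈ 0.35–0.40, one judge's score explains only roughly 12–16% of the variance in another's, which is part of what pushes me toward reading individual outputs rather than leaning on the aggregate scores.)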
"…there is general agreement that current and foreseeable AI systems do not have what it takes to be responsible for their actions (moral agents), or to be systems that humans should have responsibility towards (moral patients).
Seems false, unless he's using "general agreement" and "foreseeable" in some very narrow sense?
I'd also be excited about projects aiming to do this.
One advantage that quantifying post-training variables on frontier models has over this idea is that you also get a better sense of what the upper bound of performance on some eval looks like, as well as some information about the returns from investing in post-training enhancements. I think if this were done responsibly on some well-chosen evals, it'd be helpful information to have. (Though my colleagues may disagree.)
If people outside of frontier labs were working on this, I'd be surprised if it significantly accelerated capabilities, though I can imagine it still making sense to keep the methodology private.
Tractability + something-like-epistemic-humility feel like cruxes for me; I'm surprised they haven't been discussed much. Preventing extinction is good by most lights, but specific interventions to improve the future are much less clearly good, and I feel much more confused about what would have lasting effects.
(Even larger disclaimer than usual: I don't have much experience applying to EA orgs; I'm also not trying to give career advice, and wouldn't recommend taking career advice from me; ymmv.)
Thanks for posting! I'm broadly sympathetic to this line of reasoning. One thing I wanted to note was that hiring processes seem pretty noisy, and lots of people seem pretty bad at estimating how good they are at things, so in practice there might not be that much difference between trying to get yourself hired and trying to get the best candidate hired. I think a reasonable heuristic is "try to do well at all the interviews/work tests, as you would for a normal job, but don't rule yourself out in advance, and be very honest and transparent if you're asked specific questions".
Hi Søren,
Thanks for commenting. Some quick responses:
> The safety frameworks presented by the frontier labs are "safety-washing", more appropriately considered roadmaps towards an unsurvivable future
I don’t see the labs as the main audience for evaluation results, and I don’t think voluntary safety frameworks should be how deployment and safeguard decisions are made in the long term, so I don’t think the quality of lab safety frameworks is that relevant to this RFP.
> I'd like sources for your claim, please.
Sure, see e.g. the sources linked to in our RFP for this claim: What Are the Real Questions in AI? and What the AI debate is really about.
I’m surprised you think the disagreements are “performative” – in my experience, many sceptics of GCRs from AI really do sincerely hold their beliefs.
> No decision-relevant conclusions can be drawn from evaluations in the style of Cybench and Re-Bench.
I think Cybench and RE-Bench are useful, if imperfect, proxies for frontier model capabilities at cyberoffense and ML engineering respectively, and those capabilities are central to threats from cyberattacks and AI R&D. My claim isn’t that running these evals will tell you exactly what to do: it’s that these evaluations are being used as inputs into RSPs and governance proposals more broadly, and provide some evidence on the likelihood of GCRs from AI, but will need to be harder and more robust to be relied upon.
I am too young and stupid to be giving career advice, but in the spirit of career conversations week, I figured I'd pass on some advice I received, ignored at the time, and now think was good: you might be underrating the value of good management!
I think lots of young EAish people underrate the importance of good management/learning opportunities, and overrate direct impact. In fact, I claim that if you're looking for your first/second job, you should consider optimising for having a great manager, rather than for direct impact.
Why?
How can you tell if someone will be a great manager?
(My manager did not make me post this)