This is a special post for quick takes by Jacob Watts🔸. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

The "Distillation for Robust Unlearning" paper (https://arxiv.org/abs/2506.06278) makes me re-interested in the idea of using distillation to absorb the benefits of a control protocol (https://arxiv.org/abs/2312.06942).

I thought that was a natural "Distillation and Amplification" next step for control anyways, but the empirical results for unlearning make me excited about how this might work for control again.

Like, I guess I am just saying that if you are actually in a regime where you are using the trusted model some nontrivial fraction of the time, you might be able to distill off of that.

I relate it to the idea of iterated amplification and distillation; the control protocol is the scaffold/amplification. Plus, it seems natural that your most troubling outputs would receive special attention from bot/human/cyborg overseers and receive high quality training feedback.

Training off of control might make no sense at all if you then think of that model as just one brain playing a game with itself that it can always rig/fake easily. And since a lot of the concern is scheming, this might basically make the "control protocol distill" dead on arrival: any worthwhile distill would still need to be smart enough that it might be sneak-attacking us for roughly the same reasons the original model was, and even extremely harmless training data doesn't help us with that.

Seems good to make the model tend to be more cool and less sketchy even if it would only be ~"trusted model level good" at some stuff. Idk though, I am divided here.

I think it might be cool if an AI Safety research organization ran a copy of an open model or something and I could pay them a subscription to use it. That way I know my LLM subscription money is going to good AI stuff and not towards AI companies that I don't think I like or want more of on net.

Idk, existing independent orgs might not be the best place to do this bc it might "damn them" or "corrupt them" over time. Like, this could lead them to "selling out" in a variety of ways you might conceive of.

Still, I guess I am saying that to the extent anyone is going to actually "make money" off of my LLM usage subscriptions, it would be awesome if it were just a cool independent AIS lab I personally liked or similar. (I don't really know the margins and unit economics which seems like an important part of this pitch lol).

Like, if "GoodGuy AIS Lab" sets up a little website and inference server (running Qwen or Llama or whatever) then I could pay them the $15-25 a month I may have otherwise paid to an AI company. The selling point would be that less "moral hazard" is better vibes, but probably only some people would care about this at all and it would be a small thing. But also, it's hardly like a felt sense of moral hazard around AI is a terribly niche issue.


This isn't the "final form" of this I have in mind necessarily; I enjoy picking at ideas in the space of "what would a good guy AGI project do" or "how can you do neglected AIS / 'AI go well' research in a for-profit way".

I also like the idea of an explicitly fast-follower project for AI capabilities. Like, accelerate safety/security-relevant stuff and stay comfortably middle of the pack on everything else. I think improving GUIs is probably fair game too, but maybe not once it starts to shade into scaffolding? I wouldn't know all of the right lines to draw here, but I really like this vibe.

This might not work well if you expect gaps to widen as RSI (recursive self-improvement) becomes a more important input. I would argue that seems too galaxy-brained given that, as of writing, we do live in a world with a lot of mediocre AI companies that I believe can all provide products of ~comparable quality.

It is also just kind of a bet that in practice it is probably going to remain a lot less expensive to stay a little behind the frontier than to be at the frontier. And that, in practice, it may continue to not matter in a lot of cases.

fwiw I think you shouldn't worry about paying $20/month to an evil company to improve your productivity, and if you want to offset it I think a $10/year donation to LTFF would more than suffice.

Can you say more on why you think a 1:24 ratio is the right one (as opposed to lower or higher ratios)? And how might this ratio differ for people who have different beliefs than you, for example about xrisk, LTFF, or the evilness of these companies?

I haven't really thought about it and I'm not going to. If I wanted to be more precise, I'd assume that a $20 subscription is equivalent (to a company) to finding a $20 bill on the ground, assume that an ε% increase in spending on safety cancels out an ε% increase in spending on capabilities (or think about it and pick a different ratio), and look at money currently spent on safety vs capabilities. I don't think P(doom) or company-evilness is a big crux.
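
For what it's worth, here is a minimal sketch of that offset calculation. The safety and capabilities spending totals below are placeholder assumptions chosen only to show the shape of the math, not actual figures:

```python
# Offset-donation sketch: an epsilon% increase in capabilities spending is assumed
# to be cancelled by an equal epsilon% increase in safety spending.
# The industry spending totals are placeholder assumptions, not real data.

subscription_to_capabilities = 20 * 12   # $/year, treating the subscription as pure capabilities revenue
capabilities_spending = 100e9            # $/year, assumed total spend on capabilities
safety_spending = 1e9                    # $/year, assumed total spend on safety research

epsilon = subscription_to_capabilities / capabilities_spending
offset_donation = epsilon * safety_spending

print(f"Offset donation: ${offset_donation:.2f}/year")
# With these assumed totals (a 100:1 capabilities-to-safety ratio),
# offsetting a $240/year subscription costs about $2.40/year.
```

The answer scales directly with whatever capabilities-to-safety spending split you plug in.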

Alternative idea: AI companies should have a little checkbox saying "Please use 100% of the revenue from my subscription to fund safety research only." This avoids some of the problems with your idea and also introduces some new problems.

I think there is a non-infinitesimal chance that Anthropic would actually implement this.

Ya, maybe. This concern/way of thinking just seems kind of niche; probably only a very small demographic overlaps with me here. So I guess I wouldn't expect it to be a consequential amount of money to, e.g., Anthropic or OpenAI.

That checkbox would be really cool though. It might ease friction/dissonance for people who buy into high p(doom) or relatively non-accelerationist perspectives. My views are not representative of anyone but me, but a checkbox like that would be a killer feature for me and would certainly win my $20/mo :). And maybe, y'know, that of all 100 people or whatever who would care and see it that way.
