
Jacob Watts🔸

128 karma · Joined Paterson, NJ, USA

Bio

Participation
3

Pause AI / Veganish

Let's do a bunch of good stuff and have fun, gang!

How others can help me

I am always looking for opportunities to contribute directly to big problems and to build my skills, especially skills related to research, science communication, and project management.

Also, I have a hard time coping with some of the implications of topics like existential risk, the strangeness of the near term future, and the negative experiences of many non-human animals. So, it might be nice to talk to more people about that sort of thing and how they cope.

How I can help others

I have taken BlueDot Impact's AI Alignment Fundamentals course. I have also lurked around EA for a few years now. I would be happy to share what I know about EA and AI Safety.

I also like brainstorming and discussing charity entrepreneurship opportunities.

Comments
34

Hey, cool stuff! I have ideated and read a lot on similar topics and proposals. Love to see it!
 

Is the "Thinking Tools" concept worth exploring further as a direction for building a more trustworthy AI core?

I am agnostic about whether you will hit technical paydirt. I don't really understand what you are proposing on a "gears level", I guess, and I'm not sure I could make a good guess even if I did. But I will say that I think the vibe of your approach sounded pleasant and empowering. It was a little abstract to me, I guess I'm saying, but that need not be a bad thing; maybe you're just visionary.

It reminds me of the idea of using RAG or Toolformer to get LLMs to "show their work" and "cite their sources" and stuff. There is surely a lot of room for improvement there bc Claude bullshits me with links on the regular.

This also reminds me of Conjecture's Cognitive Emulation work, and even just Max Tegmark and Steve Omohundro's emphasis on making inscrutable LLMs use deterministic proof checkers heavily to win back certain guarantees.
 

  • Is the "LED Layer" a potentially feasible and effective approach to maintain transparency within a hybrid AI system, or are there inherent limitations?

I don't have a clear enough sense of what you're even talking about, but there are definitely at least some additional interventions you could run in addition to the thinking tools... e.g. monitoring, faithful-CoT techniques for marginally truer reasoning traces, probes, the kind of classifiers Anthropic runs to robustly prevent jailbreaks for misuse, etc.

I think "defense in depth" is something like the current slogan of AI Safety. So, sure, I can imagine all sorts of stuff you could try to run for more transparency beyond deterministic tool use, but without a clearer conception of the finer points it feels like I should say that there are quite a lot of inherent limitations, but plenty of options / things to try as well.

Like, "robustly managing interpretability" is more like a holy grail than a design spec in some ways lol.
 

  • What are the biggest practical hurdles in considering the implementation of CCACS, and what potential avenues might exist to overcome them?

I think that a lot of what it is shooting for is aspirational and ambitious and correctly points out limitations in the current approaches and designs of AI. All of that is spot on and there is a lot to like here. 

However, I think the problem of interpreting, and building appropriate trust in, complex learned algorithmic systems like LLMs is a tall order. "Transparency by design" is truly one of the great technological mandates of our era, but without more context it can feel like a buzzword, like "security by design".

I think the biggest "barrier" I can see is that this framing just isn't sticky enough to survive memetically, and people keep trying to do transparency, tool use, control, reasoning, etc. under different frames.

But still, I think there is a lot of value in this space, and you would get paid big bucks if you could even marginally improve the current ability to get trustworthy, interpretable work out of LLMs. So, y'know, keep up the good work!

Thanks, it's not that original. I am sure I have heard them talk about AIs negotiating and forgetting stuff on the 80,000 Hours Podcast and David Brin has a book that touches on this a lot called "The Transparent Society". I haven't actually read it, but I heard a talk he gave.

Maybe technological surveillance and enforcement requirements will actually be really intense at technological maturity, and the surveillance will need to be really powerful, really local, and have a lot of context for what's going on. In that case, some value like privacy or "being alone" might be really hard to save.

Hopefully, even in that case, you could have other forms of restraint. Like, I can still imagine that if something like the orthogonality thesis is true, then you could maybe have a really, really elegant, light-touch, special-focus anti-superweapons system that feels fundamentally limited to that goal in a reliable sense. If we understood the cognitive elements well enough that it felt like physics or programming, then we could even say that the system meaningfully COULD NOT do certain things (violate the prime directive or whatever), and then it wouldn't feel as much like an omnipotent overlord as a special-purpose tool deployed by local law enforcement (because this place would be bombed or invaded if it could not prove it had established such a system).

If you are a poor peasant farmer world, then maybe nobody needs to know what your people are writing in their diaries. But if you are the head of fast prototyping and automated research at some relevant dual use technology firm, then maybe there should be much more oversight. Idk, there feels like lots of room for gradation, nuance, and context awareness here, so I guess I agree with you that the "problem of liberty" is interesting.

There was a lot to this that was worth responding to. Great work.

I think making God would actually be a bad way to handle this. I think you could probably stop this with superior forms of limited knowledge surveillance. I think there are likely socio-technical remedies to dampen some of the harsher liberty-related tradeoffs here considerably.

Imagine, for example, a more distributed machine intelligence system. Perhaps it's really not all that invasive to monitor that you're not making a false vacuum or whatever. And it uses futuristic auto-secure hyper-delete technology to instantly delete everything it sees that isn't relevant.

Also, the system itself isn't all that powerful, but rather can alert others / draw attention to important things. And the system's implementation, as well as the actual violent/forceful enforcement that goes along with it, probably can and should be handled in a generally more cool, chill, and fair way than I associate with Christian-God-style centralized surveillance and control systems.

Also, a lot of this is already extremely salient for the "how do we stop civilization-ending superweapons from being created"-style problems we are already in the midst of here on 2025 Earth. It seems basically true that, if you want to stay alive indefinitely, you do ~need to maintain some level of coordination with / dominance over anything that could make a superweapon capable of killing you.

Ya, idk, I am just saying that the tradeoff framing feels unnatural. Or, like, maybe that's one lens, but I don't actually generally think in terms of tradeoffs b/w my moral efforts.

Like, I get tired of various things ofc, but it's not usually just cleanly fungible b/w different ethical actions I might plausibly take like that. To the extent it really does work this way for you or people you know on this particular tradeoff, then yep; I would say power to ya for the scope sensitivity.

I agree that the quantitative aspect of donation pushes towards even marginal internal tradeoffs here mattering and I don't think I was really thinking about it as necessarily binary. 

I agree with point #1, but I think the framing feels forced for point #2.

I don't think it's obvious that these actions would be strongly in tension with each other. Donating to effective animal charities would correlate quite strongly with being vegan.

Homo economicus deciding what to eat for dinner or something lol.

I actually totally agree that donations are an important part of personal ethics! Also, I am all aboard for the social ripple effects theory of change for effective donation. Hell yes to both of those points. I might have missed it, but I don't know that OP really argues against those contentions? I guess they don't frame it like that though.

I appreciate this survey and I found many of your questions to be charming probes. I would like to register that I object to the "is elitism good actually?" framing here. There is a very common way to define the term "elitism" that is just straightforwardly negative. Like, "elitism" implies classist, inegalitarian stuff that goes beyond just using it as an edgelord libertarian way of saying "meritocracy".

I think there is a lot of conceptual tension between EA as a literal mass movement and EA as an unusually talent-dense clique / professional network. Probably there is room in the world for both high-skill professional networks and broad ethical movements, but y'know ...

I think the real-life scenarios where AI kills the most people today are governance stuff and military stuff.

I feel like I have heard the most unhinged, haunted uses of LLMs in government and policy spaces. I think that certain people have just "learned to stop worrying and love the hallucination". They are living like it is the future already, getting people killed with their ignorance, and spreading/using AI bs in bad faith.

Plus, there is already a lot of slaughterbot stuff going on, e.g. the "Robots First" war in Ukraine.

Maybe job automation is worth mentioning too. I believe Andrew Yang's stance, for example, is that it is already largely here and most people just do have less labor power already, but I could be mischaracterizing this. I think "jobs stuff" plausibly shades right into doom via "industrial dehumanization" / gradual disempowerment. In the meantime, it hurts people too.

Thanks for everything, Holly! Really cool to have people like you actively calling for an international pause on ASI!

Hot take: Even if most people hear a really loud ass warning shot, it is just going to fuck with them a lot, but not drive change. What are you even expecting typical poor and middle class nobodies to do? 

March in the street and become activists themselves? Donate somewhere? Post on social media? Call representatives? Buy ads (likely from Google or Meta)? Divest in risky AI projects? Boycott LLMs/companies?

Ya, okay, I feel like the pathway from "worry" to any of that is generally very windy, but sure. I still feel like that is just a long way from the kind of galvanized political will and real change you would need for, e.g., major AI companies with huge market caps to get nationalized or wiped off the market or whatever.

I don't even know how to picture a transition to an intelligence-explosion-resistant world, and I am pretty knee-deep in this stuff. I think the road from here to a good outcome is just too blurry a lot of the time. It is easy to feel, and be, disempowered here.
 

The Distillation for Robust Unlearning paper (https://arxiv.org/abs/2506.06278) makes me re-interested in the idea of using distillation to absorb the benefits of a control protocol (https://arxiv.org/abs/2312.06942).

I thought that was a natural "Distillation and Amplification" next step for control anyways, but the empirical results for unlearning make me excited about how this might work for control again.

Like, I guess I am just saying that if you are actually in a regime where you are using the trusted model some nontrivial fraction of the time, you might be able to distill off of that.

I relate it to the idea of iterated amplification and distillation; the control protocol is the scaffold/amplification. Plus, it seems natural that your most troubling outputs would receive special attention from bot/human/cyborg overseers and thus high-quality training feedback.
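To make the shape of the thing I'm imagining concrete, here is a rough sketch; the untrusted/trusted models, the suspicion scorer, and the 0.8 threshold are all hypothetical stand-ins, not the actual setups from either paper:

```python
# Rough sketch of "distilling off the control protocol": run a trusted-
# monitoring-style protocol, keep the transcripts it actually emits, and
# fine-tune a student on them.

def control_protocol(prompt, untrusted, trusted, suspicion_score, threshold=0.8):
    """Use the untrusted model's answer unless the monitor finds it too
    suspicious, in which case fall back to the trusted model."""
    candidate = untrusted(prompt)
    if suspicion_score(prompt, candidate) > threshold:
        return trusted(prompt)  # fallback step; could also resample or edit
    return candidate

def build_distillation_set(prompts, untrusted, trusted, suspicion_score):
    """The protocol plays the 'amplification' role; fine-tuning a fresh or
    smaller model on these transcripts plays the 'distillation' role."""
    return [
        {"prompt": p,
         "completion": control_protocol(p, untrusted, trusted, suspicion_score)}
        for p in prompts
    ]

# Usage (hypothetical): data = build_distillation_set(prompts, U, T, score)
# ...then fine-tune the student model on `data`.
```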

Training off of control might make no sense at all if you then think of that model as just one brain playing a game with itself that it can always rig/fake easily. And since a lot of the concern is scheming, this might basically make the "control protocol distill" dead on arrival, because any worthwhile distill would still need to be smart enough that it might be sneak-attacking us for roughly the same reasons the original model was, and even extremely harmless training data doesn't help us with that.

Seems good to make the model tend to be more cool and less sketchy even if it would only be ~"trusted model level good" at some stuff. Idk though, I am divided here.

Here's a question that comes to mind: if local EA communities make people 3x more motivated to pursue high-impact careers, or make it much easier for newcomers to engage with EA ideas, then even if these local groups are only operating at 75% efficiency compared to some theoretical global optimum, you still get significant net benefit.

 

I am sympathetic to this argument vibes-wise, and I thought this was an elegant, numerate utilitarian case for it. Part of my motivation is that I think it would be good if a lot of EA-ish values were a lot more mainstream. Like, I would even say that you probably get non-linear returns to scale in some important ways. You kind of need a critical mass of people to do certain things.
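Just to spell out the arithmetic in that hypothetical (the 3x and the 75% are the quoted numbers; the rest is my gloss):

```python
# The quoted hypothetical's numbers, with the implied comparison made explicit.
motivation_multiplier = 3.0   # "3x more motivated"
relative_efficiency = 0.75    # "75% efficiency vs. the theoretical optimum"
net_effect = motivation_multiplier * relative_efficiency
print(net_effect)  # 2.25x the baseline impact, i.e. still clearly net positive
```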

It feels like, necessarily, these organizations would also be about providing value to the members as well. That is a good thing.

I think there is something like a "but what if we get watered down too much" concern latent here. I can kind of see how this would happen, but I am also not that worried about it. The tent is already pretty big in some ways. Stuff like numerate utilitarianism, empiricism, broad moral circles, thoughtfulness, tough trade-offs doesn't seem in danger of going away soon. Probably EA growing would spread these ideas rather than shrink them.

Also, I just think that societies/people all over the world could significantly benefit from stronger third pillars and that the ideal versions of these sorts of community spaces would tend to share a lot of things in common with EA. 

Picture it. The year is 2035 (9 years after the RSI near-miss event triggered the first Great Revolt). You ride your bitchin' electric scooter to the EA-adjacent community center where you and your friends co-work on a local voter awareness campaign, startup idea, or just a fun painting or whatever. An intentional community. 

That sounds like a step towards the glorious transhumanist future to me, but maybe the margins on that are bad in practice and the community centers of my day dreams will remain merely EA-adjacent. Perhaps, I just need to move to a town with cooler libraries. I am really not sure what the Dao here is or where the official EA brand really fits into any of this. 

Ya, maybe. This concern / way of thinking just seems kind of niche; probably only a very small demographic overlaps with me here. So I guess I wouldn't expect it to be a consequential amount of money to, e.g., Anthropic or OpenAI.

That checkbox would be really cool though. It might ease friction / dissonance for people who buy into high p(doom) or relatively non-accelerationist perspectives. My views are not representative of anyone but me, but a checkbox like that would be a killer feature for me and would certainly win my $20/mo :). And maybe, y'know, that of all 100 people or whatever who would care and see it that way.
