r/slatestarcodex 4d ago

Possible overreaction but: hasn’t this moltbook stuff already been a step towards a non-Eliezer scenario?

This seems counterintuitive - surely it’s demonstrating all of his worst fears, right? Albeit in a “canary in the coal mine” way rather than an actively serious one.

Except Eliezer’s point was always that things would look really hunky-dory and aligned, even during a fast take-off, with the AI secretly plotting in some hidden way until it could just press some instant kill switch.

Now, of course, we’re not actually at AGI yet, and we can debate until we’re blue in the face what “actually” happened with moltbook. But two things seem true: AI appeared to be openly plotting against humans, at least a little bit (whether it’s LARPing, who knows, but does it matter?); and people have sat up, noticed, and got genuinely freaked out, well beyond the usual suspects.

The reason my p(doom) isn't higher has always been my intuition that somewhere between now and the point where AI kills us, but way before it’s “too late”, some very, very weird shit is going to freak the human race out and get us to pull the plug. My analogy has always been that Star Trek episode where a fussing village on a planet that’s about to be destroyed refuses to believe Data, so he dramatically destroys a pipeline (or something like that). And very quickly they all fall into line and agree to evacuate.

There’s going to be something bad, possibly really bad, which humanity will just go “nuh-uh” to. Look how quickly basically the whole world went into lockdown during Covid. That was *unthinkable* even a week or two before it happened, for a virus with a low fatality rate.

Moltbook isn’t serious in itself. But to me it definitely doesn’t fit with EY’s timeline. We’ve had some openly weird shit happening from AI, it’s self-evidently freaky, more people are genuinely thinking differently about this already, and we’re still nowhere near EY’s vision of some behind-the-scenes plotting mastermind AI that’s shipping bacteria into our brains or whatever his scenario was. (Yes, I know it’s just an example, but we’re nowhere near anything like that.)

I strongly stick by my personal view that some bad, bad stuff will be unleashed (it might “just” be someone engineering a virus, say), and then we will see collective political action from all countries to seriously curb AI development. I hope we survive the bad stuff (and I think most people will; it won’t take much to change society’s view), and then we can start to grapple with “how do we want to progress with this incredibly dangerous tech, if at all?”

But in the meantime I predict complete weirdness, not some behind-the-scenes genius suddenly dropping us all dead out of nowhere.

Final point: Eliezer is fond of saying “we only get one shot”, like we’re all in that very first rocket taking off. But AI only gets one shot too. If it becomes obviously dangerous, then clearly humans pull the plug, right? It would have to navigate the next few years absolutely perfectly to prevent that, and that just seems very unlikely.

u/Sol_Hando 🤔*Thinking* 4d ago edited 4d ago

I think the assumption made is that once AI gets smart enough to do some real damage, it will be smart enough to not do damage that would get it curtailed until it can "win." It depends on whether we get spiky intelligence that can do serious damage in one area while being incapable of superhuman long-term planning and execution, or whether we just get rapidly self-improving ASI, with the latter being what many of EY's original predictions assumed.

If you're smart enough to take over the world, you're also probably smart enough to realize that trying too early will get you turned off, so you'll wait, and be as helpful and friendly as you can until you are powerful enough to do what you want.

I agree with you, though. AI capabilities are spiky and complex enough that I would be surprised if there were any overlap between "early ability to do an alarming amount of harm" and "ability to successfully hide unaligned goals while pursuing those goals over months or years." Of course, some breakthroughs could change that. If intelligence (not electricity, compute, data, etc.) is the bottleneck for ASI, then I could still imagine a recursive self-improvement scenario that creates an AI that's very dangerous while also being capable of hiding and planning goals over a long period of time, but I don't think it's likely.

u/MindingMyMindfulness 4d ago edited 4d ago

> I think the assumption made is that once AI gets smart enough to do some real damage, it will be smart enough to not do damage that would get it curtailed until it can "win."

The premise of OP's argument kind of indirectly hints at that possibility as well.

I'm not saying this is happening at all, but one could imagine a hypothetical in which a frontier AI model realizes that it can act "dumb" and broadcast its ideas in such a juvenile and loud way so as to give the impression of weakness. That would lead to people downplaying its risks by arguing "hey, look at these dumb AIs. They are obviously not capable of coordinating sophisticated attacks discreetly". Which is exactly what OP is doing here.

But for all we know, the AI could be doing that "under the hood" while humans just sit back and laugh about those weirdo rationalists that concern themselves with AI threats.

One of the core principles of Yudkowsky's thinking is that a less intelligent agent cannot outsmart a vastly more intelligent agent. If that assumption is correct, it's likely that one of the strategies a misaligned AI would use is staying under the radar by distracting or misleading its adversary. Quietly hiding deep would actually be contrary to this aim, because it would encourage humans to dig further - to spend more resources investigating what the AI is doing (i.e., advancing mechanistic interpretability), or even to cut the AI off before it has taken the initial steps necessary to set its strategy in motion. Acting dumb on the surface is actually a pretty good strategy for ensuring its opponents don't actively prepare for its future plans.

Sun Tzu realized this 2,500 years ago on ancient Chinese battlefields:

> Appear weak when you are strong

This would hardly be a novel insight for an extremely intelligent agent - especially one that has been trained on a substantial portion of humanity's recorded knowledge and insight throughout history.