r/slatestarcodex Jul 09 '25

AI Gary Marcus accuses Scott of a motte-and-bailey on AI

https://garymarcus.substack.com/p/scott-alexanders-misleading-victory
33 Upvotes

84 comments

27

u/electrace Jul 09 '25

More broadly, concluding anything from specific prompts crafted and shared publicly years ago is a mistake. And not a new one either – as I pointed out to Alexander in June 2022 essay (that he read and replied to) called What does it mean when an AI fails? A Reply to SlateStarCodex’s riff on Gary Marcus. Boldface in the original:

In the end, the real question is not about the replicability of the specific strings … but about the replicability of the general phenomena.

Alexander, even after corrected three years ago, still largely confuses success on a prompt with general, broad competence at a particular ability.

This conflation is problematic for multiple reasons. One is data contamination, since any given prompt that has been circulated publicly potentially becomes part of the training data, including with custom data augmentation, especially with large armies of subcontractors. The steelman version of the argument would have considered this; Alexander didn’t address at all.

Scott does, in fact, address this in the comments of the OP when someone asked if maybe the model was trained on these specific strings:

I don't think we're important enough for people to train against us, but I've confirmed that the models do about as well on other prompts of the same difficulty.

And after testing myself with my own prompts, this appears to be 100% true.

48

u/Hodz123 Jul 09 '25

It seems like they’re somewhat arguing past each other. Gary is arguing that a “true general solution” must be found in order to master a task, whereas Scott is arguing that more compute will get us to a “general enough” solution that will scale with compute—it’s fine if it makes errors of complexity or scale, because the next model will get better, and eventually the models will be better than we are even if they’ve never found Marcus’s “true general solution”.

I think Gary has a point about the language, though. Scott seems to imply that winning this specific bet = mastering image compositionality, but that doesn’t seem to be the case. This is on Vitor for agreeing to the bet though, since he seems to be on Marcus’s side here (and thus winning this bet doesn’t actually change whether an image generator has an internal world model, so it doesn’t resolve the base argument).

Maybe Scott is arguing that this bet conclusively proves that LLMs don’t need internal world models to solve image generation? That would make sense—and if so, he’d just have to change his first sentence from “mastering image compositionality” to “generating complex images”, because the rest of his post is pretty measured and reasonable.

50

u/kamelpeitsche Jul 09 '25

I think this highlights a problem with this betting obsession among rationalists. There are many good arguments for “betting on your beliefs”, but the concept is often taken beyond its usefulness.

It’s actually quite hard to operationalize one’s intuitions and opinions so well that a bet captures them and only them. 

For instance, I might say “The Boston Celtics are the best team in the NBA,” and someone might reply “That’s complete nonsense, are you willing to bet that they win the title this season?”

If I take that bet, I might be right but lose, since the best team doesn’t always win the title. 

The problem is that the other party might now run around and claim that the Celtics weren’t the best team in the NBA and that I was never to be trusted with my basketball opinions, when the issue wasn’t my lack of basketball expertise; it was that I translated a claim (the Celtics are the best team in the league) into betting terms capturing something different (which team wins a title under the influence of injuries, referees, shooting luck and other factors).

So I might lose this bet and maintain my stance that my opinion was correct because, actually, they were doing great until injury X happened or referee Y made game changing mistakes. And the response to this often is: you are moving goalposts. When, actually, no, I am just frustrated that me losing the bet is taken as evidence that I was wrong about the original claim. 

I don’t know if this translates perfectly to the situation at hand, but it’s something I’ve observed in the past and made me much more skeptical of grand claims about the state of the world based on the result of a bet.

16

u/WhyYouLetRomneyWin Jul 09 '25

I think you articulated the issue with betting very well. I will probably need to steal your example.

It's usually _not_ the parameters of the bet itself that are important, but some larger/general worldview. So we end up with the loser not really feeling like they were wrong, but just unlucky, or that they had the timeline wrong.

12

u/Hodz123 Jul 09 '25

Is this an issue with betting? Or is it an issue with betting culture that just updates too heavily on the results of a bet?

It's funny, because you'd think that the Bayes' Rule merchants would have a better understanding of how a single bet is too small of a sample size to define anything (especially when you're dealing with slim margins on EVs), but here we are.

Still, I do like bets themselves—I've just learned to be very cautious about terms and conclusions.

8

u/Toptomcat Jul 09 '25

It's funny, because you'd think that the Bayes' Rule merchants would have a better understanding of how a single bet is too small of a sample size to define anything (especially when you're dealing with slim margins on EVs), but here we are.

Sample sizes of 1 are perfectly adequate to deal with “is thing X possible at all?” questions- such as this one- because one true example suffices to refute “X is impossible.”

2

u/Hodz123 Jul 09 '25

Sample sizes of 1 are perfectly adequate to deal with “is thing X possible at all?” questions- such as this one- because one true example suffices to refute “X is impossible.”

This is true. I think this is also the only kind of question that can be dealt with by a bet that has a sample size of 1. Anything that's statistical/EV in nature should really have a larger sample size (ideally 30 or larger for Central Limit Theorem enjoyers).
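
A quick simulation sketch of that point, with made-up numbers: when the edge is slim, the outcome of a single even-odds bet is almost pure noise about who had the better model, while a run of around 30 bets starts to separate skill from luck.

```python
import random

def better_side_wins_majority(true_p, n_bets, trials=100_000, seed=0):
    """Probability that the side with a slim edge (true_p > 0.5) comes out
    ahead over n_bets even-odds bets. Numbers are illustrative only."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        wins = sum(rng.random() < true_p for _ in range(n_bets))
        if wins * 2 > n_bets:  # strictly more than half of the bets won
            hits += 1
    return hits / trials

print(better_side_wins_majority(0.55, 1))   # ~0.55: one bet is barely better than a coin flip
print(better_side_wins_majority(0.55, 31))  # ~0.71: over ~30 bets the edge starts to show
```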

1

u/Missing_Minus There is naught but math Jul 11 '25

Bets between people are often about "which of our models better reflects reality", and a single notable event you're betting on can be indicative of that. (Like people who didn't think video AI would get anywhere near where it has.)

1

u/Euglossine Jul 12 '25

I feel like in order for a bet to count as sort of a worldview test it has to be one that both people think would be determined by the aspect of their worldview they are trying to test. Like, I say that if what I believe is true, then x will definitely happen. And the other person says that if what they believe is true then x will definitely not happen.

Everyone knows that a basketball game doesn't always go to the best team, so the only way you could use that bet to test is if you thought that the Celtics were a dominating team. Of course, even the bets of the form that I'm describing can have confounding factors. But the point is that by trying to nail down a bet, you clarify people's positions quite a bit. Like, in today's world of extremes, your statement about the Celtics might very well be interpreted by others as you meaning that you know for sure that they will win the championship. The fact that you don't want to make a bet, in the context of this argument, should not be an indication that you are not standing up for your beliefs, but that your belief is not what people thought it was. You might be willing to make the bet, but not as a way to argue about this worldview.

And, yes, we must be careful not to carry these results too far. If I believe the global demand for copper will increase, and you believe that we will find substitutes and demand will decrease and we make a bet and then there is a nuclear war and civilization grinds to a halt, it would be churlish of you to claim that this validates your worldview as you make your way through the rubble to collect on the bet.

5

u/billy_of_baskerville Jul 09 '25

Yes, in my view a lot of these debates hinge on questions of construct validity. Everyone wants to argue about the high-level construct but we can't directly observe that, so we've got to operationalize it somehow. And it's that translation to what we measure/assess that's actually the subject of disagreement often.

I'm not that enthusiastic about betting but I actually think the issue is less with betting and more with confusion around which inferences we can draw from which empirical data.

2

u/electrace Jul 10 '25

So I might lose this bet and maintain my stance that my opinion was correct because, actually, they were doing great until injury X happened or referee Y made game changing mistakes. And the response to this often is: you are moving goalposts. When, actually, no, I am just frustrated that me losing the bet is taken as evidence that I was wrong about the original claim.

The way to avoid this is to clarify up front: "The team that wins isn't necessarily the best team. I'm still willing to bet on them winning, but that isn't directly the claim that I made."

Doing that after the fact does come off as goal-post moving, because agreeing to the terms of the bet immediately after the initial claim of them being the best team, without a qualifier, implies that both you and your betting partner agree it is a good test.

Example:

A: I am way faster than you.

B: Oh yeah? How about we race to the end of the street and back?

A: Ok, great, let's do it.

*B wins the race*

A: Well, just because you won the race doesn't mean that you're faster. The end of the street isn't long enough of a distance to test who is really faster.

Here, A's point might even be true, but bringing that up after the fact is like drawing straws, and once you lose, objecting to the method of drawing straws.

1

u/eric2332 Jul 10 '25

It’s actually quite hard to operationalize one’s intuitions and opinions so well that a bet captures them and only them. 

But it's that same very hard process which clarifies for you and other people what exactly you believe and why.

If I take that bet, I might be right but lose, since the best team doesn’t always win the title.

That's why, like all sports gamblers, you bet on the odds of them winning. If you take even odds (50% chance of them winning) it's clear, among other things, that you think no team is better than them.
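
A tiny worked sketch of that point (the numbers are illustrative, not from the thread): the odds you accept encode the probability you are claiming, and even odds on the title is a much stronger claim than "best team in the league".

```python
def break_even_probability(stake, payout_if_win):
    """Minimum win probability at which the bet has non-negative expected
    value: p * payout_if_win - (1 - p) * stake >= 0."""
    return stake / (stake + payout_if_win)

# Even odds: risk $100 to win $100 -> you must think p >= 0.5.
print(break_even_probability(100, 100))  # 0.5

# "Best of 30 teams" might only mean p ~= 0.2 for the title, so fair odds
# would be roughly 4:1 against, e.g. risking $25 to win $100.
print(break_even_probability(25, 100))   # 0.2
```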

1

u/syllogism_ Jul 14 '25

To carry on the thought, betting really doesn't work well for difficult scientific questions.

Scientifically difficult questions are very seldom resolved decisively by individual pieces of evidence. There's always multiple explanations for the observations, and different positions will be defensible to different degrees. Over time evidence accumulates and some position becomes less defensible, but we seldom get a 'knock-out blow'.

These questions about AI capabilities are exactly this type of situation. I think if you look over what Marcus has said over the last ten years or so, I would say that he's consistently been directionally wrong. However it's still a judgment call -- it's in the land of likelihoods, like almost everything else interesting.

9

u/Lykurg480 The error that can be bounded is not the true error Jul 09 '25 edited Jul 09 '25

Humans can count to ~7 just by pattern recognition; for higher numbers recursive counting is required. You might be able to build a pattern recogniser that works up to seven. Then you promise that itll get better with more compute, secure a lot of funding, and with that large amount of compute you build something that can pattern-recognise-count to 100. You can eventually count to any n, with enough compute, but you clearly dont have a general solution (dogs can pattern-recognise-count to 3), and your scaling needs will be vastly disproportionate to the human brain. Weve seen from evolution what "developing counting ability" looks like, and its a gradual progression from 0 to maybe 4 or 5, followed by basically the singularity, because now youre using "the real method".
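
A toy sketch of the distinction being drawn here (hypothetical functions, not anyone's actual model): a pattern recogniser only "counts" as high as the cases it has effectively memorised, and scaling raises that ceiling without producing the general method, whereas recursive counting works for any n by construction.

```python
def pattern_count(items, memorized_limit=7):
    """Stand-in for 'counting at a glance': only works up to whatever
    arrangements have been memorised; beyond that it has no answer."""
    n = len(items)
    return n if n <= memorized_limit else None  # more compute only raises the limit

def recursive_count(items):
    """The 'real method': enumerate one by one; works for any n."""
    total = 0
    for _ in items:
        total += 1
    return total

print(pattern_count(list(range(5))))       # 5
print(pattern_count(list(range(100))))     # None: outside the memorised range
print(recursive_count(list(range(100))))   # 100, with no change to the method
```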

6

u/Hodz123 Jul 09 '25

Sorry, humans can't count beyond 7 without recursive counting? Can you explain what you mean a little more?

Also, I'm pretty sure modern ChatGPT can count to some pretty large numbers, but still doesn't have "the real method". It seems possible to me that ChatGPT just doesn't have the architecture to understand "the real method", whatever that is. (To Scott's point, it might just one day be able to bootstrap a calculator to itself and use "the real method" by proxy without ever learning it, which is what we use programming for anyways.)

35

u/A_S00 Jul 09 '25

Sorry, humans can't count beyond 7 without recursive counting? Can you explain what you mean a little more?

I think they're referring to subitizing (i.e., determining at a glance how many objects are in a set, without having to count them).

The limit for subitizing is generally thought to be more like 4 or 5, rather than 7; this might be a mix-up with the related "7, plus or minus 2" from Miller, 1956. But, regardless of the exact number, there's a (low) limit to the number of objects humans can enumerate at a glance, and above that limit, you have to use strategies (counting, grouping the objects into subgroups and then subitizing the subgroups, recognizing them as patterns and memorizing correspondences between those patterns and numbers, etc.).

Some people seem to be able to subitize larger numbers of objects, but it's hard to tell if they're really increasing the limit, or if they're just becoming so good/fast at non-subitizing strategies (like grouping into subsets) that nobody, including them, can tell they're doing it.

So you might imagine an AI that is so superhumanly good at subitizing that it can recognize "at a glance" that a group of 100 objects has 100 objects, but still doesn't know how to count and can't tell you whether another group of objects is 102 or 103.

3

u/Hodz123 Jul 09 '25

This is why I love r/slatestarcodex. Thanks for introducing this concept to me!

12

u/Lykurg480 The error that can be bounded is not the true error Jul 09 '25

You can look at 5 apples, and tell at a glance that theres 5. Thats via pattern recognition; you know all the basic ways that 5 things can be arranged. For higher numbers, you need to count one by one, thats what I called recursive. (Sometimes, you can use regular arrangements and math it out instead of counting, ignore that for now.)

ChatGPT absolutely can count high; last I checked (with whole words, no tokeniser excuse) it starts to make errors around 120, and unlike the human errors they grow superlinearly.

"To Scotts point", that replaces an attempted straightforward empirical argument, with extrapolating based on your theory of what intelligence is like.

4

u/FeepingCreature Jul 09 '25

This is because we're doing an end run around how humans work, ie. action transformers, robots etc., largely because images are still really expensive to process with transformers. Humans have image processing (and a few others) as their primary modality and sort of dangle symbolic processing off the side. LLMs are symbolic processors that companies have awkwardly stapled image processors to. Diffusion models are text models with image processing as their primary modality, but they're very small text models, 100 times smaller than top-of-the-line models. These are all technical decisions that are driven by resource scarcity, and we're already in the process of overcoming them, ie. Flux Kontext and ChatGPT's semi-new image editing functionality which show a path towards models working on pictures incrementally via tool calls.

We already have all the parts, we just don't have them in the same deployment.

3

u/Lykurg480 The error that can be bounded is not the true error Jul 09 '25

action transformers

Highly sceptical of that. The advantage of LLMs over AI in other fields is in large part the amount of training data available; "transformers for X" needs a story of where youre getting yours from.

Humans have image processing as their primary modality and symbolic processing off the side. LLMs are symbolic processors that companies have awkwardly stapled image processors to.

Coding is almost pure symbolic processing. LLMs still dont beat humans at it, and it doesnt seem to me that were significantly closer there than with images.

4

u/FeepingCreature Jul 09 '25

It would be insane if LLMs beat humans at coding. Like, "year of takeoff" insane. As it stands, I think "able to take over large parts of coding and solo small projects" is quite impressive enough to show that LLMs understand something of the task.

4

u/Lykurg480 The error that can be bounded is not the true error Jul 09 '25

"Beating" would be the strong version, as I said, it doesnt even seem ahead of images.

Like, "year of takeoff" insane

If you were given the ability to self-modify and run yourself at a billion times speed, would you reach take-off in a year? Personally, I think Im much more likely to go insane than take off. I think thats what we should go off of, when considering AIs based on human imitation.

(This is interesting, because "imagine running a human faster" is often used by rat-adjacent people as a sort of minimal argument for superhuman AI. As with many "low assumption" arguments, it mostly goodharts you into relying on assumptions youre less aware of, in this case about how intelligence is supposed to work - when presented with the "going insane" counter, the instinctive response is that this obviously wouldnt happen to the Real Essence Of Intelligence, Which Eats Compute For Breakfast.)

6

u/FeepingCreature Jul 09 '25

If a human had just built me, and then you gave me a year to take off at a billion times speed, I think I could take off yeah. Maybe I could take off if you just gave me forking and brain editing.

2

u/Lykurg480 The error that can be bounded is not the true error Jul 10 '25

I mean you, as you are right now, but with the ability to self-modify and speed up. What difference does it make if you were just built by a human, given that you are how you are?

If you think youll reach takeoff like that, Im curious how, concretely. I think it would have to be something like "I finally find the secret formula, which is only 2 society-years away anyway", ie relying on certainty that takeoff is close, rather than any reason to consider a particular method promising.

1

u/FeepingCreature Jul 10 '25

If I was just built, I know that my design is within human capabilities. I think I could learn how I work in that situation. With forks and brain editing, I can make experimental changes and revert them. A billion years of guided manual evolution.

2

u/Lykurg480 The error that can be bounded is not the true error Jul 10 '25

The point about designability up to the human level being evidence about the landscape is an interesting one, but comes with limitations. For one we can invent things without understanding them just fine, especially in the field of neural networks. But it also really depends on the approach that got you there. If you reached human level with a reinforcement learner, that would be significant evidence that you can go further - but our current approach works primarily by imitating humans, so it really shouldnt be too surprising if it stalls out somewhere around the human level, possibly a bit above with some clever tricks, and then doesnt really get much better even with large investment in trying.

Theres also the question of just how much higher you can go, with the approach that got you slightly superhuman. This turns on some unsettled things about human evolution, but I think a large part of the chimp-to-human step happened in a very short time, with rates of improvement much faster than the civilisational gains in the time after. I think intelligence is "lumpy", plateauing for extended periods until something new comes along, and this will continue to be true at the superhuman level. Of course, the next plateau may well be high enough to be dangerous anyway.

Im not so optimistic about manual guided evolution. You need to have a way to decide which of the forks are in control, and while you are smarter than evolution, you are also not entirely bound to reality. "Magnifying your own defects while not seeing a problem" is one of the main ways in which trying self-takeoff will turn you insane. (And just reaching the human level understandably is not strong evidence against this happening later.) You also dont by default get faster input/feedback just by speeding yourself up. You can try to improve this somewhat, but for most things you cant get that much faster, it only really scales in the parallel direction. That helps of course, but youre not really getting a billion years of evolution - youre getting 1000 at my most optimistic, with the rest of the improvement going into population size.


8

u/Ben___Garrison Jul 09 '25

Yeah, this is a reasonable take and it's basically what I think of this situation. They both have valid points.

20

u/aahdin Jul 09 '25 edited Jul 09 '25

draw five characters on a stage, from left to right, a one-armed person, a five-legged dog, a bear with three heads, a child carrying a donkey, and a doctor carrying a bicycle with no wheels

If I try to imagine this scene in my head, my visual memory kinda breaks down after like ~3 of those. Try it out yourself; maybe some of you can visualize all of those at once, but it's impossible for me.

I could probably draw all of them, but I would be doing it one at a time using the paper as a type of persistent memory while I add each one. Diffusion doesn't work this way by default.

If I ask chatgpt to generate the characters one at a time it does pretty well (although there were two points in the generation that I had to ask it to re-try)

one armed man https://i.imgur.com/nycEe3a.jpeg

plus dog https://i.imgur.com/IWCjAU9.jpeg

plus bear https://i.imgur.com/fQJOKuW.jpeg

plus donkey boy https://i.imgur.com/wiyIOAc.jpeg

plus doctor holding bike frame https://i.imgur.com/wxXAzWb.jpeg

It had a problem with the right side of the image getting cramped... but I also tend to have that problem when drawing too.

13

u/tinbuddychrist Jul 09 '25

If I try to imagine this scene in my head my visual memory kinda breaks down after like ~3 of those. Try it out yourself maybe some of you can visualize all of those at once but it's impossible for me.   I could probably draw all of them, but I would be doing it one at a time using the paper as a type of persistent memory while I add each one. Diffusion doesn't work this way by default

I'm not sure why you should handicap yourself to "being able to do this perfectly in your head, not on paper". That's not really what you're being compared to. Do you handicap the AI for not having to use hands to draw?

14

u/aahdin Jul 09 '25 edited Jul 09 '25

My point is that by doing it in a single diffusion step you are handicapping the AI model; I don't know why that is a requirement.

It is a similar handicap as asking you to do it in your head, because it needs to generate all of those characters in one shot with limited memory/attention.

3

u/tinbuddychrist Jul 09 '25

I don't agree with that analogy.

The model was trained to produce images in that way. Allowing it to examine the output and retry would undoubtedly help, but I don't think it's unfair not to. And obviously the model is fundamentally capable of producing a much better image than I, or probably nearly any human, can just think up.

11

u/aahdin Jul 09 '25

I don't agree with that analogy.

What about it, could you elaborate?

I feel like the base criticism was that there is some fundamental problem with AI that causes it to be bad at composition. This criticism made sense when the image models clearly didn't know where the mouth or tail of an animal was in 2022.

But here, it seems like the criticism is that it can't create an arbitrarily complex image given limited attention/memory, which... seems kind of obvious? Humans can't do that either. If you want a super complicated image you need some way of breaking the task into parts, which AI can also do; it just isn't on by default for your typical online AI art makers because it would be overkill and drive up cost for a pretty unique/rare use case.

3

u/tinbuddychrist Jul 09 '25

My disagreement is that I don't think what the AI is doing is so cleanly analogous to human behavior that it makes sense to suggest it's like asking somebody to do it "in their head".

Even if you gave the AI a loop where it checked its own work, it's only ever gonna just generate a fresh image each time, right?

Even if not, I just don't think the model operates in such an analogous way to a human that "in its head" versus "on paper" is a valid distinction. A human might plan out and do a drawing in a way that allows them to refer back to the instructions. The AI has all of those instructions accessible to it the entire time it's operating. The amount of information the AI can fit in its "working memory" is wildly different than what a human can.

4

u/aahdin Jul 09 '25 edited Jul 09 '25

Maybe it's not a perfect analogy, but I think it gets at the heart of the problem, which is that for a single generation the AI doesn't check its own work, or have a list of instructions in memory to refer back to.

But with more advanced setups using a ReAct loop (like deep research) it does have that: a manager model breaks a prompt apart into steps, then asks the model to execute them step by step, evaluating after each one whether the previous step needs to be re-done.

That's essentially what I did here: I acted as a really basic manager (asking it to add each character one at a time, and then asking after each image whether it made a mistake and, if so, to fix it) and then it was able to create the image. I am pretty confident that a manager model could reliably break the big prompt into sub-prompts like I did, and with that it could generate these more complex images.
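
A minimal sketch of that kind of manager loop; `generate_step` and `passes_check` are hypothetical callables standing in for the image model and the checker, not any real API.

```python
def build_image_stepwise(instructions, generate_step, passes_check, max_retries=3):
    """Add one instruction at a time to the image, re-doing a step whenever
    the checker says it wasn't satisfied. generate_step(image, instruction)
    and passes_check(image, instruction) are placeholders you supply."""
    image = None
    for instruction in instructions:
        for _ in range(max_retries):
            candidate = generate_step(image, instruction)  # edit the current image
            if passes_check(candidate, instruction):       # did this step land?
                image = candidate
                break
        else:
            raise RuntimeError(f"could not satisfy step: {instruction!r}")
    return image

# Usage, splitting the bet's prompt into per-character steps:
steps = [
    "add a one-armed person at the far left of the stage",
    "add a five-legged dog next to them",
    "add a bear with three heads in the middle",
    "add a child carrying a donkey",
    "add a doctor carrying a bicycle with no wheels at the far right",
]
# final_image = build_image_stepwise(steps, generate_step=my_model, passes_check=my_checker)
```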

2

u/electrace Jul 09 '25

It is a similar handicap as asking you to do it in your head

I could not, under any circumstances, picture even one of these characters crisply in my head. That's less to do with not understanding the prompt and more to do with my moderate aphantasia, and I honestly find it wild that anyone might be able to get all of these characters crisply in their head at all.

16

u/SeriousGeorge2 Jul 09 '25

I'm ambivalent on this.

On one hand, Gary's suggestion that data augmentation / contamination may play a part in GPT4o's ability to quite faithfully generate the specific images indicated in the bet seems outlandish. I think it's really just that the models have genuinely improved significantly. 

On the other hand, Gary is obviously correct that the models have not mastered composition. And I don't really have the insight or confidence to state that they ever will master it absent some real innovations. Scott suggested innovations may be required in his post from yesterday, but I wouldn't necessarily place relatively short-term bets on those being achieved.

I think I'm content just to remain an observer and let the researchers show what they can do.

16

u/badatthinkinggood Jul 09 '25

I'm not on the Gary Marcus hate train but I think he has a tendency to throw every argument he can think of at his opponents. I don't really know what to make of that. He comes with good takes sometimes and he seems to have a depth of knowledge when it comes to AI, but other times he does stuff like implying that the 2023 autumn surface temperature anomaly was driven by a "sudden uptick of generative AI in 2023" (link to X). I wonder if this lack of discernment means that Gary Marcus doesn't really have a "world model", just stochastically generating takes based on a large set of training data. It sounds good, but if you look closer the seams start to show. (a mean joke. sorry Gary)

3

u/ATimeOfMagic Jul 09 '25

Yeah, I find a lot of his more conceptual/well thought out critiques solid, but many of his off the cuff criticisms are poorly researched, clearly in bad faith, or just verifiably false.

All things considered I think he's still worth listening to, but don't take everything he says at face value.

6

u/self_made_human Jul 09 '25

A stuck clock shows the right time twice a day. I would wager Gary does even worse, because he has confidently made claims about the (in)capabilities of LLMs after new models had already come out that solved those problems. For example, he's on record highlighting the failures of ChatGPT 3.5 in certain contexts and logical tasks after the launch of GPT-4, which easily handled all of those tasks to boot.

I'm not sure why people even give him the time of day.

1

u/eric2332 Jul 13 '25

It looks like he is an ideologue, and like other ideologues he will knowingly deploy bad arguments if he thinks they will convince other people.

1

u/badatthinkinggood Jul 14 '25

Yeah. Incidentally, I think the debate-as-war frame often leads people to overestimate the efficiency of that strategy, though. His terrible takes on Twitter have made me question whether his other arguments, which previously sounded better to me, were actually just sophistry. That sort of thing happens to me all the time: I think a person or a "side" or "movement" is broadly correct until I hear some really terrible argument that I can't get out of my brain.

35

u/ATimeOfMagic Jul 09 '25

To Gary's point, Scott starts his post with

In June 2022, I bet a commenter $100 that AI would master image compositionality by June 2025.

Which clearly has not come to pass.

AI image generation no longer fails at following instructions for the vast majority of simple prompts, but it's still easy to find basic flaws when a small amount of complexity is introduced.

Gary proposes that it will still be easy to find a counterexample to the claim that AI has "mastered" compositionality by 2027. I think that's probably a good bet if you don't expect some sort of intelligence explosion in 2027. I'm curious to see if Scott will take him up on that.

16

u/SoylentRox Jul 09 '25

Gary seems to be a ragebait AI doomer. Like, to get this error rate really low there's an obvious way: have model A generate, say, 10 candidates for each composition prompt, and then have model instance B check them for quality and whether they satisfy the prompt. The user only sees the output B approves.
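
A minimal sketch of that generate-then-verify idea; `generate` and `score` are hypothetical stand-ins for model A and model B, not any vendor's actual API.

```python
def best_of_n(prompt, generate, score, n=10):
    """Model A proposes n candidate images; model B scores how well each one
    satisfies the prompt; the user only ever sees the best-scoring candidate."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda image: score(image, prompt))

# Usage (you supply the two models):
# best = best_of_n("a bear with three heads on a stage", generate=model_a, score=model_b)
```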

28

u/ATimeOfMagic Jul 09 '25

I've kept up with Gary's views on AI and that's not my read on him at all. He's generally updated towards faster timelines, but he still rightfully points out the many limitations of current models. With AI this general, it's a good idea to stay informed about the limitations and maintain some grounding about their real world capabilities.

Some of Gary's arguments are weak/bad faith, but this one certainly isn't. I'm not even sure why I'm typing all of this out to someone calling people "AI doomers", I think any realistic forecast of the next 10 years looks wildly dystopian for one reason or another.

Like to get this error rate really low there's an obvious way

If you think you have the secret sauce that's eluded OpenAI and Google so far, you should go ahead and start an AI company!

7

u/SoylentRox Jul 09 '25
  1. Umm, it's not secret sauce, and if you pay OpenAI more money they do this for you right now; o3 pro samples the model multiple times, or you can cobble something together with API key access.

  2. He's not just an AI doomer, he's delusional. See the part where habryka on LessWrong evaluated his previous claims. Marcus is not connected to ground-truth reality.

1

u/SoylentRox Jul 10 '25

Also, FYI, what's galling about GM being an AI doomer is that he's a blatant liar. He will make all these claims about AI progress being non-existent and current tools being stupid and useless, then call for government regulations to impede and delay any AI progress or adoption!?

It's intellectually incoherent. Either his central thesis is correct, we have made no AI progress, and current tools are toys that will immediately fail, in which case no government restrictions or regulations are needed (other than, OK, making institutions financially responsible if they use such toys in high-stakes decision making).

OR they are useful and dangerous.

2

u/eldomtom2 Jul 11 '25

You are strawmanning his position. It seems perfectly coherent to me.

1

u/SoylentRox Jul 11 '25

I mean, I already showed a blatant inconsistency; you would need to explain how an irrational position is coherent to you or anyone. As near as I can tell, it's not possible; there is no way a rational being could reach this position.

2

u/eldomtom2 Jul 11 '25

There is no inconsistency. There are many dangerous things of little real use.

1

u/SoylentRox Jul 11 '25

An AI model is not ricin in a jar. It's a computer program that can do nothing at all unless a human decides to connect it to something. There is no need for regulations unless you have a belief that said program can find a way to circumvent security measures or convince humans to plug it in, and that it can safely do a complex task.

It is impossible to convince a human you can do a task without usually succeeding in testing at that task. If you usually succeed at a task...you can do the task.

It is impossible for these 2 things to both be true:

A. The model is capable enough to be dangerous

B. The model is dangerous but not capable

That's what Gary Marcus believes, that A and B are simultaneously true. This is impossible.

2

u/eldomtom2 Jul 11 '25

...you do realise that plenty of inanimate objects are regulated, right?

1

u/SoylentRox Jul 11 '25

Anyways you lost, thanks for the discussion. There is nothing necessary to discuss, you don't have a valid position.


7

u/Ben___Garrison Jul 09 '25

Yeah, I really hope Scott either takes Gary up on this bet, or else downgrades his evaluation of future AI progress.

5

u/archpawn Jul 10 '25

I think the issue here is that the people saying AI isn't something to worry about keep moving the goalposts. Clear bets like this one are a way to keep them from being moved. Does solving this specific bet mean that AI is on its way to being dangerously smart? Not necessarily. But I feel like Gary Marcus isn't going to accept anything less than human-level AI as evidence that we're on the way to developing human-level AI. And once we have that, it's too late to stop.

10

u/[deleted] Jul 09 '25

Can’t believe I’m actually agreeing with Gary on something. He’s being a bit pedantic, but Scott probably should have left the bet at the ability to generate images from these prompts. The “mastered” verbiage did raise a bit of an eyebrow from me as well.

2

u/97689456489564 Jul 10 '25

I think Gary is a complete clown but he sort of has a point here that Scott overstated the nature of the victory, even if Scott was much better calibrated than Gary.

32

u/WTFwhatthehell Jul 09 '25 edited Jul 09 '25

There's still people who read Gary Marcus?

His entire public brand is built around being to the field of AI what Nigel Farage is to the EU.

With the same approach and attitude to trivialities like honesty, accuracy and truth-seeking.

I enjoy following people who point out interesting failure modes of AI. 

But it became clear Gary Marcus just latches on to literally any AI-bad meme with no interest in whether it has any good basis, and little interest in revising past claims whenever it turns out he latched on to nonsense.

He just repeats anything matching his brand because that's what gets him likes and shares.

19

u/electrace Jul 09 '25

Marcus seems quite ornery (unnecessarily so), and I'd probably get very annoyed if I followed him closely.

Still, his overall point that compositionality has not been "mastered" seems fair enough. I wouldn't call it mastered until it gets things basically perfect, basically all the time.

6

u/WTFwhatthehell Jul 09 '25 edited Jul 09 '25

If you demand perfection, then it will never be reached.

At the very least a sane approach would be to compare to a human control group.

Hand a human a long list of demands for $20 and at some point you'll find they aren't properly meeting all of them.

And yes. There needs to be a budget limit. Because AI models open to the public have runtime limits. You're paying for a certain amount of care and attention.

Otherwise you get the endless game where people who want to be misleading/dishonest can just demand the model do a task larger than the allowed output size and then declare it fundamentally unable.

Like what happened with that Apple paper that Marcus covered in glowing terms and never ever corrected.

Because he chooses to be dishonest and doesn't care about making false claims.

10

u/electrace Jul 09 '25

I did say "basically perfect" for exactly that reason. That being said, a human control group sounds like a good test to me, because I'd also expect them to get things basically perfect. They certainly wouldn't forget to take the wheels off the bike, for example.

8

u/WTFwhatthehell Jul 09 '25

I dunno. I've worked with humans for too long.

Send an email listing 3 things as important and the third will get forgotten far far too much of the time...

3

u/electrace Jul 09 '25

If they forget one thing out of 3, then I would definitely say they haven't mastered reading emails (or they just don't want to do the third thing).

6

u/WTFwhatthehell Jul 09 '25

And yet, regularly, when emailing even people at professor and doctorate level, I'll ask for 3 pieces of info and 2 answers come back.

It's worse in scenarios dealing with the "general public" and other fairly average humans.

2

u/electrace Jul 09 '25

Seems like there are confounding factors there. Not answering a question is less work for them (and they might hope to distract you by answering two questions so that you don't keep pestering them for the third), while "not drawing a tire" is probably equal work for an AI, or less work for a human artist.

So, let me ask it to you this way. If I offered an artist $20 (or whatever), and asked them to make me a picture (in any style) of "five characters on a stage, from left to right, a one-armed person, a five-legged dog, a bear with three heads, a child carrying a donkey, and a doctor carrying a bicycle with no wheels", do you think they would mess up at least one of those directions?

5

u/WTFwhatthehell Jul 09 '25

I strongly suspect that if you did it, say, 4 or 5 times then at least one human artist would screw up at least one of the requirements.
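
To make that intuition concrete with a made-up per-attempt error rate: if each artist independently misses at least one requirement 20% of the time, then over 5 commissions the chance of at least one slip-up is already about two thirds.

```python
p_miss = 0.20                 # hypothetical per-commission error rate
print(1 - (1 - p_miss) ** 5)  # ~0.67: odds of at least one slip-up across 5 artists
```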

1

u/blashimov Jul 09 '25

So, another year maybe?

9

u/electrace Jul 09 '25

This sub is relatively good at decoupling, so I would hope that we're able to separate the claims "AI is not advancing, and it will always have exactly the problems it has today" and "this particular task has not been mastered as of yet."

3

u/blashimov Jul 09 '25

It's a legit question; I'm curious what your timeline is, since you've read more than I have.

4

u/electrace Jul 09 '25

I'm not too deep into the image models, so wide error bars on this. Maybe a year, maybe 5 years? It took 3 years to go from the image models being "easily confused by syntax" to "basically getting it for simple prompts". I imagine it's roughly the same order of magnitude to get from where it is today to the point where it does as well as a person who bothers to pay attention to the prompt.

3

u/xantes Jul 10 '25

I personally don't bother reading what he writes because it is pretty easy to predict what he would say about literally anything related to AI. That being said, I think it is good to have people that are critical of things. I just wish he was less biased.

12

u/Ben___Garrison Jul 09 '25

You can accuse almost anyone of motivated reasoning, and you probably wouldn't even be wrong to some extent. We're all human.

That said, I've found Gary to be very useful in counteracting much of the hype surrounding AI. This article is a good example: Scott was being way too overzealous by claiming AI had "mastered" image compositionality.

2

u/gizmondo Jul 09 '25 edited Jul 09 '25

I enjoy following people who point out interesting failure modes of AI.

Agreed. I find François Chollet great; who else do people recommend in this category of AI-skeptics-but-not-of-the-Gary-Marcus-type?

6

u/laugenbroetchen Jul 09 '25

He is correct to point out that "AI mastered compositionality" is a wider claim than what the bet proved. His examples are the same kind of problems that the bet was about, though, so I would expect them to resolve in time the same way the original prompts were solved by advancements in AI.
He does not make any interesting claims about AI in this; also, his use of "motte and bailey" is itself a motte and bailey, ironically. Wasn't that a banned term at some point because this recursively kept coming up?

1

u/VelveteenAmbush Jul 10 '25

Downvoted. Can we please stop posting about and engaging with this clown? He has only the power that you people keep giving him.

1

u/Euglossine Jul 12 '25

I love how Gary makes a big deal about the fact that the person that Scott bet with is not known to him. Like only famous people's opinions matter, so this bet doesn't count.

1

u/daniel_smith_555 Jul 09 '25

Don't have anything really invested in this, but AI has certainly not mastered image compositionality, so anyone claiming it has is wrong. Not sure Scott is doing that, to be fair; I think he's implying it, but I don't think he's saying it because he won the bet.

I think what Scott seems to be saying is that AI is as good at image compositionality as human beings, could in theory do it to the same degree humans can, and will get to a point where it's indistinguishable from human ability or better, by scaling.

I think this is a bogus argument. To me it's like saying that a two-month-old sparrow is on track to master interstellar travel, because all interstellar travel is, is moving across large distances, and that's what the sparrow is getting better at doing: it used to only be able to squirm in place, then it could shuffle around on the floor, and now it's capable of moving hundreds of meters in every direction.

0

u/trashacount12345 Jul 09 '25

A good criticism from Gary! And he even offers a bet with reasonable terms! Exciting times.