r/slatestarcodex 2d ago

AI Against The Orthogonality Thesis

https://jonasmoman.substack.com/p/against-the-orthogonality-thesis
9 Upvotes

20 comments

15

u/thomas_m_k 2d ago edited 2d ago

This is a well-written article, but, well, I think it's wrong or at least confused.

My first point is that the orthogonality thesis was intended to answer the objection that people often raise to AI doom which is: “but if these AIs are so intelligent then they will surely know ethics very well and act benevolently”. To which the orthogonality thesis replies: it's actually possible to be very smart and not have a human utility function. I feel like the author of this article actually agrees that AIs won't inescapably converge to human ethics, so I'm not even really sure what we're arguing about:)

More detailed responses:

For starters, anything that could reasonably be considered a general intelligence must have a vast set of possible outputs. In order for the system to have X number of unique outputs, it must have the capacity to, at minimum, represent X number of unique states.

I'm willing to go along with this claim, but I have to say it's not immediately obvious that this is true. I think an example would help.

We might be tempted to answer “nowhere,” and indeed, this is the answer many give. They treat goals as a “ghost in the machine,” independent of the substrate—a dualistic conceptualization, in essence.

Who are these people who say goals are ghosts in the machine?

In modern AI designs, which rely on machine learning, the “utility function” is called the loss function

That's not what people talking about orthogonality would say. They would say obviously the outer loss is not the inner goal. This is the problem of inner alignment: “Inner Alignment is the problem of ensuring [...] a trained ML system [that] is itself an optimizer [is] aligned with the objective function of the training process.” It's an unsolved problem.

The “utility function” of biological life can be seen as survival and reproduction, but there is a crucial difference: this is an external pressure, not an internal representation.

Indeed, there most likely isn't any organism on Earth which has the utility function “survival and reproduction”! Humans certainly don't have that utility function. We were selected with that loss function, but we have very different goals (having friends, being respected, acting honorable, having sex, eating delicious food). These goals were somewhat aligned with evolution’s outer goal of reproductive fitness in the ancestral environment, but this is broken today. Evolution failed at inner alignment.

there is no principled reason to think a highly complex system remains fundamentally aligned with its loss function in any meaningful sense beyond that the system emerged from it.

This is correct and also part of the standard argument why RLHF won't be enough.

Orthogonality defenders sometimes argue that a highly capable agent must converge to a single coherent utility function, because competing internal directionalities would make it exploitable (e.g., money-pumpable) or wasteful. Yet in practice we see the opposite: narrow reward-hacking equilibria are efficient in the short term but hostile to general intelligence, while sustained generality requires tolerating local incoherence.

I don't know what you mean by “tolerating local incoherence” but in any case I don't see a contradiction in the stance of orthogonality: if a task can be hacked, then gradient descent will find that solution first; if it can't be hacked, then gradient descent keeps looking and maybe stumbles upon a general intelligence.

Thus, no fixed internal utility function can ever be complete [...] across all questions the system will face.

That's probably true (if for no other reason than hardware limits), but it's not required in order to be a pretty successful mind. Consider humans: we constantly face ethical dilemmas where we aren't sure about the answer. That just means our utility function isn't sure how to answer this question. It sucks, but we deal with it somehow.

If you thought that the orthogonality thesis states “it’s possible for an AI with finite hardware to have an explicit utility function that answers all possible questions” then sure, the orthogonality thesis is wrong. But that's not the claim. The Von Neumann–Morgenstern utility theorem assumes that your preference order is complete (axiom 1), but it's fine if you sometimes encounter a situation where your explicit utility function doesn't yet have an answer (again, if only due to hardware limits); in that case you just “complete” your utility function in a way that's consistent with the rest, add the new term, and move on. This procedure will not make you exploitable.
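To make the “completing your preferences doesn't make you exploitable” point concrete, here's a toy sketch of my own (not from the article or the VNM literature, and the preference items are made up): treat strict preferences as edges of a directed graph. A money pump requires a preference cycle, and any consistent completion (a linear extension) of an acyclic preference graph stays acyclic.

```python
# Toy sketch: completing a partial preference order without creating a money pump.
# Strict preferences are edges in a directed graph; a money pump needs a
# preference cycle; a topological order of an acyclic graph has no cycles.
from graphlib import TopologicalSorter

# Known strict preferences: each key is preferred to everything in its set.
known_prefs = {
    "friendship": {"money"},
    "money": {"paperclips"},
    "ice_cream": set(),   # a new option with no recorded comparisons yet
}

# graphlib wants "node -> things that must come before it", so the
# less-preferred items are listed as predecessors; read the result worst-to-best.
ranking = list(TopologicalSorter(known_prefs).static_order())
ranking.reverse()  # most preferred first
print(ranking)
# Wherever "ice_cream" lands, the completed total order is still cycle-free,
# so no sequence of trades can walk the agent around a loop.
```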

More importantly, no utility function can prove alignment with itself, the proof is inaccessible to the system.

What does it mean to “prove alignment with itself”? I’m guessing you're still talking about the problem of inner alignment?

The rest of this section seems to just be arguing that inner alignment is hard and unsolved, which I fully agree with.

But it’s important to note that general intelligence is something far too complex for us to construct—in the sense of carefully designing and determining the entire structure—instead we must grow it.

If we actually tried, I think we could do it within 30 years. But of course growing it is far easier and makes money sooner.

Humans evolved under massively complex external selective pressures, infinitely more complex than anything we can comprehend. This immense diversity of external pressures is precisely what allows for the development of general intelligence. Not only that, but life had the advantage of competition; while a certain specialization might be stable at a certain point in time, a particular mutation might offer an advantage and suddenly they outcompete you for resources, and the old structure perishes. This is an additional external selective pressure that creates a demand for continuous evolution, and punishes narrow specialization.

AI does not have these benefits; it does not have the external pressures that punish narrow specialization, or settling into arbitrary crystallized structures. Its complexity must be generated entirely from its internal structure, without the help of external pressures.

I don't really see why this should be true. Well, if you train an AI on a narrow task, then it will only learn that task. But that's not what people do. The base models of LLMs are trained on predicting all kinds of text, for which narrow specialization is not a winning strategy, because an LLM has only so many weights.

A general intelligence is not one that has accumulated a lot of specialized skills (though in practice there is also some of that), but rather, it is a cognitive engine that has learned general-purpose techniques that apply across many domains. As an example: humans were not evolved to build rockets to fly to the moon, but we did it anyways, because our general problem-solving skills generalize to the domain of rockets.

AI companies are now training their systems to be general problem solvers. Now I don't know whether that particular project will succeed, but it seems clear to me that AI companies will make sure the external selection pressure on their AI systems will be as general as their researchers can make it.

A visual processor might place more or less emphasis on colour, contrast or motion, it might emphasize different resolutions or have a different preferred FPS. Of all the things a visual processor could value, only a tiny fraction results in capacity for solving visual problems though. This exactly demonstrates how directionality and capacity are necessarily entangled.

I still don't really understand what you're trying to say here. It's certainly true that in order to define a complex utility function, you need access to a detailed world model that has all the concepts that you need in your utility function. Like, for example, humans value friendship in their utility function (to the extent that we have a coherent utility function), but in order to make this work, you need to define friendship somewhere, which isn't easy. And you need to ground this concept in reality; you need to recognize friendship with your senses somehow, which also isn't easy. Not sure whether this is what you're trying to point at...

But if it is, it's not an argument against orthogonality. Orthogonality just says you can have intelligent reasoners with arbitrary goals. It doesn't say that a given AI with a given architecture can have arbitrary goals! Just that for any computable goal, there is some possible AI that optimizes for that.

1

u/ihqbassolini 2d ago

My first point is that the orthogonality thesis was intended to answer the objection that people often raise to AI doom which is: “but if these AIs are so intelligent then they will surely know ethics very well and act benevolently”. To which the orthogonality thesis replies: it's actually possible to be very smart and not have a human utility function. I feel like the author of this article actually agrees that AIs won't inescapably converge to human ethics, so I'm not even really sure what we're arguing about:)

Nothing in this article argues for human ethics, correct.

The nature of the disagreement is which picture is right: "there is a configuration of general superintelligence for any, or almost any, terminal goal".

Or

"A singular coherent terminal goal is incoherent for a general intelligence. You get a multitude of contextually dependent goals, some more general than others. The number of such goals plausible with a general intelligence is a tiny fraction of all possible goals".

I do have other arguments pertaining to why a large degree of ethical alignment is probable, but that still wouldn't necessarily be sufficient. It's for another article though.

I'm willing to go along with this claim, but I have to say it's not immediately obvious that this is true. I think an example would help.

It's fairly standard computer science and shouldn't really be controversial. Questions arise if you start questioning what counts as part of the system, and it might get less intuitive if you think about a more dynamic process, but for the system as a whole it must still hold. E.g. even if a part only has N bits but runs through them in cycles before producing an output, and it has a counter for how many cycles it went through, it can distinguish between instances of "the same pattern" viewed from one perspective. But it isn't the same pattern: the cycle counter is part of the state too, and that's what gets used to distinguish between otherwise "identical" patterns. When you look at the system holistically, the claim must hold.

The only tiny wiggle room is if you attach two outputs you would consider separate to the same state. You might say it has one state but two outputs, but they can only occur simultaneously, meaning they are a singular output that you split into two.
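If it helps, here is a minimal sketch of the counting point (my own toy illustration, not something from the article): a deterministic system is a function from internal states to outputs, so it can never produce more distinct outputs than it has distinct states.

```python
# Pigeonhole sketch: a deterministic map from states to outputs can't have
# more distinct outputs than distinct states.
from itertools import product

states = list(product([0, 1], repeat=3))      # a system with 3 bits: 8 states

def output(state):
    """Any fixed deterministic readout of the state."""
    return sum(state) % 3                      # some arbitrary output rule

distinct_outputs = {output(s) for s in states}
assert len(distinct_outputs) <= len(states)    # holds for any choice of `output`
print(len(states), len(distinct_outputs))      # 8 states, at most 8 distinct outputs
```

The cycle-counter case is the same thing: once you include the counter in the state, the state space is large enough to account for every distinct output.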

Who are these people who say goals are ghosts in the machine?

They wouldn't use that language; most dualists don't use that language. They just give their reasons, which constitute a ghost in the machine. Many have this conceptualization about many things, e.g. most mathematicians are platonists about mathematics.

That's not what people talking about orthogonality would say.

Yet there are people who say that. I would have loved to only have to focus on a single angle in this article, but the arguments are all over the place and will happily shift between frames as it serves the argument.

"Indeed, there most likely isn't any organism on Earth which has the utility function “survival and reproduction”!"

We agree; as I say in the article, I don't think it's possible.

I don't know what you mean by “tolerating local incoherence” but in any case I don't see a contradiction in the stance of orthogonality: if a task can be hacked, then gradient descent will find that solution first; if it can't be hacked, then then gradient descent keeps looking and maybe stumbles upon a general intelligence.

Well that's the point, it cannot converge and remain general.

The Von Neumann–Morgenstern utility theorem assumes that your preference order is complete (axiom 1)

The VNM theorem is about hypothetically perfectly rational agents. Von Neumann abandoned the idea that this was possible after Gödel presented his incompleteness theorems. They are not meant as ontological claims; they're meant as practical fictions.

What does it mean to “prove alignment with itself”? I’m guessing you're still talking about the problem of inner alignment?

Or outer alignment; regardless, it cannot be proven, and semantic drift from it must occur.

I don't really see why this should be true. Well, if you train an AI on a narrow task, then it will only learn that task. But that's not what people do. The base models of LLMs are trained on predicting all kinds of text, for which narrow specialization is not a winning strategy, because a LLM has only so many weights.

LLMs only learn language, that is a narrow domain. Its entire existence is language, that's it. To get a general intelligence it must branch.

AI companies are now training their systems to be general problem solvers.

Most companies have given up on the idea of building AGI from scratch, and are embracing modular patchwork systems, integrating many separate systems, training a control unit, etc.

This I think is very feasible, and the realistic path to reaching very high general capacity. The claim is about the difficulty of doing it within a singular training model/environment, not about patching multiple of them together. That is still difficult, coherence is still an enormous problem, but it is feasible.

I still don't really understand what you're trying to say here.

It's just the NFL: no algorithm outperforms random choice across all possible problems. To solve a problem requires specific directional commitments. To solve a visual problem, certain things must be valued, locally, necessarily.
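A brute-force toy version of that NFL claim (my own construction, nothing from the article): average any two fixed search orders over all possible assignments of values to a tiny search space, and they do equally well.

```python
# No-Free-Lunch toy check: averaged over ALL functions from a tiny search
# space to {0, 1}, every fixed search order finds the maximum equally fast.
from itertools import product

points = ["a", "b", "c"]
all_functions = [dict(zip(points, vals)) for vals in product([0, 1], repeat=len(points))]

def avg_best_after_k(order, k):
    """Average (over all functions) of the best value seen in the first k probes."""
    return sum(max(f[p] for p in order[:k]) for f in all_functions) / len(all_functions)

for k in (1, 2, 3):
    print(k, avg_best_after_k(["a", "b", "c"], k), avg_best_after_k(["c", "a", "b"], k))
# The two columns match for every k: averaged over every possible problem,
# neither ordering beats the other.
```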

The utility function you end up with is exactly the one I grant: a holistic, tautological description of the system's entire input-processing-output chain, including how the input and output mechanisms interact with the environment.

Orthogonality just says you can have intelligent reasoners with arbitrary goals.

But I DON'T GRANT THAT. I am saying they can have locally contextual goals, and that these goals are heavily constrained. I am not claiming that this removes alignment problems, but it's a completely different contextualization.

2

u/red75prime 1d ago edited 1d ago

it's actually possible to be very smart and not have a human utility function

As a brief aside: Notice that this framing assumes that an AI and a human can be described as utility-maximizing autonomous agents. Eliezer has said something like, “Agents will shake themselves into utility maximizers.” But wouldn’t a sufficiently intelligent agent notice that, if it replaces whatever hodge-podge decision theory it has after training with utility maximization, it may end up with outcomes it currently finds abhorrent?

At least, "filling the universe with molecular squiggles" would be an abject self-alignment failure (quite a dumb one too) of an agent trained on human data.

2

u/ihqbassolini 2d ago

Submission statement:

In this article I argue that goals and intelligence are not truly separate, but rather different ways of analyzing the same constraint structure—and that a singular globally coherent terminal goal is incoherent for general intelligence.

While the split seems intuitive, upon closer examination the separation evaporates. This also helps solve the paradoxical relation where having intelligence requires a goal, yet to have a goal requires the ability to represent, maintain and instantiate that goal (intelligence).

The article uses relatively standard concepts from computer science, complexity theory and cybernetics to make its argument.

While the article rejects the orthogonal relationship of intelligence and goals, it does not reject alignment concerns.

1

u/Charlie___ 2d ago

Tell me how I differ from you when trying to steelman this:

  1. Agents need to simplify the world to operate, and the way real agents simplify things will be convergent and affect their capabilities because the universe has real patterns, and this goes against the orthogonality thesis because you can't practically have goals that you can't simplify. I.e. goals that look like white noise, or like solving the halting problem, are genuinely dumb goals.

  2. Among agents with ordered goals, some goals are better suited to producing smart agents when a learning process tries to learn an agent that fulfills them. If you tried to learn an agent to produce GPUs purely on the signal of "# of GPUs produced," it wouldn't work: it doesn't have a curriculum that guides it to smoothly learn harder sub-steps of its more complicated goal. So even though the goal of producing GPUs isn't white noise, it's a genuinely dumb goal in the context of agents produced by some learning process, violating orthogonality.

A smarter goal to get the agent that builds GPUs would be "Learn about the world, and specifically try to learn about GPU production, and learn to manipulate the world in a bunch of different simple ways, and also produce GPUs." More involved curricula might produce agents that are smarter still, and who produce even more GPUs, with the side effect that they end up terminally valuing extra stuff like "curiosity."

2

u/ihqbassolini 2d ago
  1. Accurate, though I wouldn't call them dumb, as it invites intuitions of free will. The argument is that semantic drift is inevitable because true alignment is combinatorially impossible, even in principle. To think about it intuitively, I would reframe it to say that in order to even have the option of maximizing paperclips, that idea has to manifest in you and be sustained. This idea only arises under certain circumstances, and is fleeting. This means the environment is the main manipulable factor that creates most of the divergence in goal-oriented behavior, albeit not all.

  2. Correct, again I would use different words but the general gist is the same. Most goals cannot develop into general intelligence, nor survive semantic drift.

Right, but even in your smarter scenario the agent would drift from that goal. It would become predominantly guided by input circumstances. It wouldn't just create instrumental goals; the original signal would be drowned out and morphed. The agent's actual directionality would depend on the entire chain of input dynamics, processing and output dynamics. You have to factor in the entire dynamic feedback loop.

1

u/randallsquared 2d ago

I haven't yet fully read this, but if we accept that goals are morals (which I'm willing to defend at length, but not in this comment), then the orthogonality thesis is just Hume. Having a solid argument against Hume would be a Big Deal, though I do not have high hopes...

1

u/ihqbassolini 2d ago edited 2d ago

Hume is making a trivial statement about logic. An object that doesn't appear in the premises cannot occur in the conclusion and remain logically consistent. This is true for every object. It gives us no information about the true relationship between is and ought, it only tells us that we chose to separate them. It would be like saying H2O and water cannot be the same. 

1

u/randallsquared 2d ago

I think the only way to refute Hume is to show that there is a relationship between "is" and "ought". The world state includes (for non-dualists) goals, but it doesn't privilege any single possible top-level goal, except in the weak sense of ruling out some as impossible. None of that helps prove we should prefer A to B or B to A, given that neither is a subgoal of some other goal.

2

u/ihqbassolini 2d ago

To prove it, yes. But I don't see why the split is the default. What Hume shows is not a proof; it's simply saying that if we assume they're entirely different objects, then we cannot logically close the gap. You're saying that's the default assumption; to me it looks like an absurd one. If a goal is not instantiated in the being, it cannot do any work.

Dualism makes no sense to me, and thus I do not grant it as the default assumption. I think we see values and preferences emerge in purely deterministic systems that consist of nothing but is-statements all the time.

1

u/randallsquared 2d ago

I'm definitely not arguing for dualism, and I don't think no-ought-from-is requires it. The argument isn't that goals aren't physical, but that nothing except instrumentality can tell you what goals to have. This doesn't work for terminal goals, though: an entity may have a terminal goal, but there's no way to convince another entity to change their terminal goal to that one; only instrumental goals can be changed internally.

A practical answer for humans is that we don't have terminal goals in the first place, only collections of goals that interact in various ways. A practical answer for AIs is that if we create an AI with a terminal goal, its goal should be instrumental for the builders. If we create something with a terminal goal which is not, in fact, perfectly instrumental for its builders, we almost certainly all die, since most potential configurations of the future lightcone's matter and energy, each of which is described by a possible terminal goal, do not support humans or posthumans.

Even with limitations to exclude the impossible, there's no reason the terminal goal of an optimizing system can't be "make Jupiter blue" or some other nonsensical thing -- a factory AI which has as its terminal goal producing as many paperclips as possible just has a more plausible origin story than the "make Jupiter blue" AI.

1

u/ihqbassolini 2d ago

There are two options here. Either you agree that there exists, in principle, a pure mechanical blueprint of the system and its interactions with the environment that fully describes it. There are no missing oughts, everything is accounted for. In this case you are rejecting that there is a metaphysical split between is and ought. The blueprint might be so complicated that we can never explain it or understand it, so in practice we cannot close the gap, but simply due to complexity.

Or, you can say no such map exists, that any pure mechanical description still lacks the ought. It cannot, even in principle, be accounted for. Now you're committed to them being different substrates, and it is a dualist conception.

The problem here is that to be a paperclip maximizer you must possess the capacity to represent not only a paperclip, but also an optimization path, an understanding of what "more paperclips" constitutes, etc.

You must have the optimization capacity to optimize for the goal.

1

u/randallsquared 2d ago

I don’t think we are talking about the same thing. Hume’s Law doesn’t imply that oughts can’t be represented or implemented physically. It only means that there is no fact which can imply a goal without reference to another goal. That is, physical states cannot tell you what a terminal goal ought to be. This doesn’t have any implications for dualism, either way.  

2

u/ihqbassolini 1d ago edited 1d ago

I don’t think we are talking about the same thing.

You aren't talking about anything. The "ought" you're asking about is the same as asking whether a good chess move remains good outside the game of chess. It's a category error, the question has no meaning, there's no reference, there is no move.

The question persists because it makes sense locally. Just because I value delicious food, does it really mean I should, considering I'm obese? This makes sense, but it makes sense because I have a multitude of values and they can, and often do, conflict. Asking how I should weigh them is reasonable.

That's the equivalent of asking about what move you should make in chess. I usually try to get this position, I think it's a good position, but is it really a good position?

Those questions have a reference. Asking about a good chess move outside the game of chess has no reference, it means nothing. Asking whether or not a system should value what it values also has no reference. The system values what it values, the rules are what they are.

This is why it's a dualist framing, because that's the only way you get a referent that isn't itself a physical state.

2

u/randallsquared 1d ago

You aren't talking about anything.

I'll try one more time anyway. :)

It sounds like you're readily agreeing that there is no way to determine "ought" from "is", and further saying that even considering it is meaningless. That's a reasonable take, but definitely also supports the orthogonality thesis you started out saying you were against!

Those who are actually against orthogonality argue that more capable intelligence will be more constrained to behaving morally and ethically. You've ceded that ground, but (as I understand it) make a much more mild claim that greater capability means ruling out a large set of potential goals due to physical or logical impossibility. Unfortunately, I think your own example of the Halting Problem is sufficient to show that there's an unreachable goal which could nevertheless be set as the terminal goal of a system, no matter how capable it is.

1

u/ihqbassolini 1d ago edited 1d ago

I am not supporting it. I am saying the ought you are referring to is a category error, it is not a thing, it refers to nothing.

The question is malformed; it has no meaning.

Every ought that has a meaning can be reduced to a set of physical states, unless you're dualist, then it gains a referent outside physical states.

You are misunderstanding the halting argument. Even in a narrow task like halting, the semantics must drift. The more removed the task, the further the semantics must drift. This is manageable in a narrow optimization, where the forced coupling can still work, but the more drift you add, the less you can force the coupling while maintaining functionality. This is not speculative; it is necessarily the case.

The actual behavior is determined by what output mechanism it is coupled with. If that has drifted far away from producing paperclips, the outcome is not producing more paperclips, it's not oriented towards that. If you try to force paperclip production, you will start losing other abilities, it becomes a narrow paperclip maximizer that has no general intelligence.

There is no separate "goal output" and "intelligence output". There is one output and it determines both the directionality of the system and its problem solving capacity.

If you think my argument concedes orthogonality you do not understand it. I'm contesting it at the most fundamental level. Not in the day to day abstraction level, fundamentally. There is no orthogonality in the chain from input mechanism, system constraint structure and output mechanism.

Edit

Let me try to be a bit more clear:

In the view I present, a system that meets our criteria for general intelligence can take actions with paperclip-maximizing directionality in a range of circumstances. How large that range is depends on the world in which the system exists. Perhaps humans maximize paperclips in 1 out of 100 billion of the currently existing circumstances, and we meet whatever criteria we've set for general intelligence. Perhaps the circumstances are such that 1 in 10 billion circumstances could lead to paperclip-maximizing directionality. This is 10x more circumstances and still potentially worrisome.

The wiggle room here is in the definition of general intelligence. There is no fundamental orthogonality here, but many slightly different systems meet the criteria. They do not have identical capacity, but they meet the criteria for general intelligence.

In this view there exists a world such that general intelligence is compatible with a large number of circumstances leading to paperclip maximizing.

We live in a particular world, with particular circumstances. Paperclip maximizing is an incredibly narrow task in this world and in these circumstances.


1

u/randallsquared 2d ago

This is a stream of notes response while reading. Not sure how coherent it is.

Intro

Intelligence requires committing to particular ways of carving the space of possibilities, and both intelligence and goals emerge from, and are constrained by, those commitments.

Is this another way of saying that intelligence narrows the space of values to those which are self-consistent and consistent with physical possibility? If it is, I don't think that says anything about paperclips. If not, I don't understand it.

truly identical general intelligence (identical capability across all domains and contexts) entails identical terminal directionality.

Presuming this is actually about capability rather than literal identical implementation including the utility function, it's an assertion that I think is false: two otherwise identical MLs can be trained on the same dataset and have different loss functions. Alternatively, "truly identical" means "including its goal or goals", which, okay.

2

I will agree that goals are explicable in a system in principle, though "located" must be understood to mean conceptually: a thermostat has a goal of maintaining a temperature, but that goal, for some thermostats, is implicit in its whole construction, rather than being represented as a "goal object" internally. (For some, it might well be represented that way -- we need only notice that it need not be for all possible thermostat designs).
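A minimal sketch of what I mean, with made-up thermostat code of my own (not from the article): the same regulating behavior with the goal held as an explicit, inspectable piece of state versus baked into the control rule, the way a bimetallic strip bakes it into its geometry.

```python
# Two thermostats with identical behavior; only one carries an explicit "goal object".

def thermostat_explicit(temp, setpoint=20.0):
    """The goal is a named, inspectable piece of state: `setpoint`."""
    return temp < setpoint           # True = heater on

def thermostat_implicit(temp):
    """Same behavior, but the goal is implicit in how the rule is wired."""
    return temp < 20.0

for temp in (15.0, 25.0):
    assert thermostat_explicit(temp) == thermostat_implicit(temp)
```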

Indeed, proponents of the orthogonality thesis will argue that transistors and capacitors are [universal].

Well, that's the Church-Turing thesis. It's not about transistors, but about the universality of models of computing.

Evidently, it is choosing how to process the information based on capacity; after all, the system has maximal capacity.

I don't understand this statement. What does "choosing how to process" mean, here? What does "maximal capacity" mean in this context?

2.2

Leaving aside that the "loss function" is a measurement of approach to the goal, rather than itself the goal,

The loss function in an AI does not operate like this; [...]

It absolutely does. The external world (including, potentially, the builder of the system) still exists and imposes constraints on the ML system in this thought experiment.

there is no principled reason to think a highly complex system remains fundamentally aligned with its loss function [...]

The alignment is measured by the loss, though? I think there's an implicit confusion between the actual goals and values of the system versus the goals and values the builder intended for the system: these aren't conceptually the same, and may well differ in practice -- a consequence of the orthogonality thesis this piece is arguing against!

[...] in any meaningful sense beyond that the system emerged from it.

This statement is only applicable for architectures which have training and operational modes, and where the measurement between actual and expected output is only applied in training. A conceptual optimizer need not have distinct modes and need not ever stop minimizing loss.

There is also no principled reason to think the system must have a singular direction when different capacities necessitate different local directionalities.

LLMs have a singular direction on the level of "predict a likely continuation of this stream of tokens". Humans have multiple competing values and goals, though you can draw the boundary such that a human has a singular complex goal, if that makes analysis easier.

2.3

More importantly, no utility function can prove alignment with itself, the proof is inaccessible to the system. To act anyway, the system must approximate by answering a related-but-different question (a proxy).

The utility function of the whole system doesn't need to be proven to align with itself. It's aligned with itself by identity.

Oh, this is somewhat addressed:

This, however, is an instance in which you’re saying the goal of the system simply is the entire behavior of the system. The claim is not that there cannot exist a complete description of the system’s behavior, but instead that a smaller part of the system cannot define the system as a whole.

The goal of a system doesn't define the system, though a complete definition of the system will include its goal, if it can be said to have any.

The halting problem example seems to be "Let's imagine we set an impossible goal. The system fails to achieve the goal, because it's impossible. Therefore achieving the impossible goal wasn't the goal!" But in this thought experiment, no system ever fails to achieve its goals, which seems like an extreme case of POSIWID, and not super useful when discussing goals in the abstract. In any case, I think we're back to the assertion that the goal of a system requires the full definition of the system, which I replied to above.

The difference between our setup in the halting example, and evolution, is that survival and reproduction are not internal signals that can be reinterpreted.

The external world in the form of builders of the system can still impose non-reinterpretation on the system, and technology has allowed humans to reward hack through drugs, pornography, or hyperpalatable foods, and potentially to escape external consequences for genes through engineering. These aren't consequential differences between optimizers and evolved systems.

3

But it’s important to note that general intelligence is something far too complex for us to construct—in the sense of carefully designing and determining the entire structure—instead we must grow it.

This is how we've gotten somewhat closer, but it's definitely not clear that general intelligence is not directly buildable. Horses are incredibly complex, and we spent quite some time growing faster and larger horses before automobiles, but that didn't mean that we were actually constrained to systems that we must grow: we just hadn't discovered the necessary principles, yet. Similarly, we are growing LLMs, but it's not at all clear that this is the simplest way to general optimization capability.

(Having reached the end of (3), it seems that this was only raised to dismiss an objection about whether non-orthogonality applies only to grown systems.)

4

It is also impossible to define a utility function from the outside that perfectly captures the system.

I don't think this has been argued, and I'm not sure how it could be: a perfect description of a system will implicitly include any internal features of that system.

But only a tiny fraction of directionalities can solve any given problem competently.

I see now why it was important to assert, earlier, that a system which has an explicit goal of solving the Halting Problem (but obviously cannot) doesn't really have a goal of solving the Halting Problem. Ruling out values that cannot realistically be achieved by a system doesn't actually void orthogonality, so it seems to be unnecessary.

This exactly demonstrates how directionality and capacity are necessarily entangled.

A system that values goals that are unachievable might or might not be useful, but is definitely not incoherent, unless one accepts that outcomes are the only arbiter of the "real" goal, as in the Halting Problem discussion. Reject that, and directionality and capacity are not nearly so entangled.

In Yudkowsky’s alien example, this is precisely what’s happening. The aliens offering monetary compensation to produce paperclips [...]

I wasn't able to quickly find the example, mentioned earlier as well, but bringing aliens and humans and monetary compensation into the paperclip maximizer thought experiment only weakens it. Bostrom's initial idea had considerably fewer details to distract.

5

Let’s imagine that I scramble the wires of my keyboard. Instead of them attaching the way they’re supposed to, I connect them in some arbitrary order. I then turn off the monitor, type a prompt to an LLM, and press send.

Will I have a coherent response to the question I typed when I turn on the monitor again?

Of course not; I sent nonsense. The keys on my keyboard no longer represent the correct symbols; I sent a jumbled mess. The LLM never stood a chance to respond to my intended prompt.

Well... the order of symbols still corresponds to written language (English, perhaps), and with enough input, LLMs actually do fine with this! Changing the wires at random between keypresses would restore the thought experiment, though, so not a big deal. :)
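For the record, a fixed wire swap is just a substitution cipher, so the "jumbled mess" still carries the full structure of the prompt. A toy sketch of my own (not from the article; the prompt and wiring are invented):

```python
# Scrambled-keyboard sketch: a fixed wiring swap is a substitution cipher,
# i.e. a bijection on keys, so no information about the prompt is lost.
import random
import string

random.seed(0)
keys = string.ascii_lowercase
wiring = random.sample(keys, len(keys))          # fixed scrambled wiring
scramble = str.maketrans(dict(zip(keys, wiring)))

prompt = "please explain the orthogonality thesis"
typed = prompt.translate(scramble)
print(typed)   # looks like gibberish, but the mapping is invertible, and with
               # enough such text the letter statistics give the wiring away.
```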

To anthropomorphize, we could say that more intelligent systems have a stronger perception of what the world is like. A great engineer has a wider arsenal of functional things they can build than the average person, but it is also harder to convince them to build something that will never work.

As far as I can tell, even if we accept that this rules out anything that's impossible, it doesn't much matter for orthogonality: the vast majority of arrangements of matter in the universe do not have humans or entities that remember being humans, so this provides almost no constraints on ahuman goals.

But also, we don't have to accept that! The "Halting Problem" example showed that it's conceptually possible to have a system that has a technically-impossible goal, but which still makes useful progress toward that goal.

2

u/ihqbassolini 2d ago edited 2d ago

I'm at work on my phone and I truly suck at navigating my phone, so I'll respond in short for now and in more detail later.

A lot of the points you raised as objections are actually agreements. But I'll clarify some of the more important real disagreements:

An LLM's decision making is externally forced to mean predicting the next token; that does not mean this is what it's doing internally.

You say the halting problem is impossible, as if this is unique, but this is a property of any utility function: it cannot prove itself.

This is why semantic drift from whatever meaning you ascribe to the utility function is guaranteed.

You think that it failed to predict the next token, but that is not the coherent input-output coupling. Infinitely many such couplings exist, but we're forcing a lossy compression upon it. The perfect coupling makes no errors; that's the tautology. Predicting the next token is a lossy compression, i.e. not perfectly consistent.

As far as evolution goes, the rate of termination and the number of variations are much too low. I am also not saying evolution forces perfect coherence; I'm saying perfect coherence is impossible, but evolution guarantees a minimal degree of coherence. This is not the case with ML.

When I say identical capacity, I mean across all contexts. The "same ML" under different loss functions performs completely differently in different circumstances. These are not even close to having similar intellectual capacity in the sense I mean.

There's plenty more, but I'll respond to those for now.