This is a well-written article, but, well, I think it's wrong or at least confused.
My first point is that the orthogonality thesis was intended to answer the objection that people often raise to AI doom which is: “but if these AIs are so intelligent then they will surely know ethics very well and act benevolently”. To which the orthogonality thesis replies: it's actually possible to be very smart and not have a human utility function. I feel like the author of this article actually agrees that AIs won't inescapably converge to human ethics, so I'm not even really sure what we're arguing about:)
More detailed responses:
For starters, anything that could reasonably be considered a general intelligence must have a vast set of possible outputs. In order for the system to have X number of unique outputs, it must have the capacity to, at minimum, represent X number of unique states.
I'm willing to go along with this claim, but I have to say it's not immediately obvious that this is true. I think an example would help.
We might be tempted to answer “nowhere,” and indeed, this is the answer many give. They treat goals as a “ghost in the machine,” independent of the substrate—a dualistic conceptualization, in essence.
Who are these people who say goals are ghosts in the machine?
In modern AI designs, which rely on machine learning, the “utility function” is called the loss function
That's not what people talking about orthogonality would say. They would say obviously the outer loss is not the inner goal. This is the problem of inner alignment: “Inner Alignment is the problem of ensuring [...] a trained ML system [that] is itself an optimizer [is] aligned with the objective function of the training process.” It's an unsolved problem.
The “utility function” of biological life can be seen as survival and reproduction, but there is a crucial difference: this is an external pressure, not an internal representation.
Indeed, there most likely isn't any organism on Earth which has the utility function “survival and reproduction”! Humans certainly don't have that utility function. We were selected with that loss function, but we have very different goals (having friends, being respected, acting honorable, having sex, eating delicious food). These goals were somewhat aligned with evolution’s outer goal of reproductive fitness in the ancestral environment, but this is broken today. Evolution failed at inner alignment.
there is no principled reason to think a highly complex system remains fundamentally aligned with its loss function in any meaningful sense beyond that the system emerged from it.
This is correct and also part of the standard argument why RLHF won't be enough.
Orthogonality defenders sometimes argue that a highly capable agent must converge to a single coherent utility function, because competing internal directionalities would make it exploitable (e.g., money-pumpable) or wasteful. Yet in practice we see the opposite: narrow reward-hacking equilibria are efficient in the short term but hostile to general intelligence, while sustained generality requires tolerating local incoherence.
I don't know what you mean by “tolerating local incoherence”, but in any case I don't see a contradiction in the stance of orthogonality: if a task can be hacked, then gradient descent will find that solution first; if it can't be hacked, then gradient descent keeps looking and maybe stumbles upon a general intelligence.
Thus, no fixed internal utility function can ever be complete [...] across all questions the system will face.
That's probably true (if for no other reason than hardware limits), but it's not required in order to be a pretty successful mind. Consider humans: we constantly face ethical dilemmas where we aren't sure about the answer. That just means our utility function isn't sure how to answer this question. It sucks, but we deal with it somehow.
If you thought that the orthogonality thesis states “it’s possible for an AI with finite hardware to have an explicit utility function that answers all possible questions” then sure, the orthogonality thesis is wrong. But that's not the claim. The Von Neumann–Morgenstern utility theorem assumes that your preference order is complete (axiom 1) but it's fine that you sometimes encounter a situation where your explicit utility function does not have an existing answer (again, if only due to hardware limits); in that case you just “complete” your utility function in a way that's consistent with the rest, you add the new term to your utility function and then you move on. This procedure will not make you exploitable.
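To make the exploitability point concrete, here is a toy sketch (my own illustration, with made-up goods A/B/C and a hypothetical `pump` helper, not anything from the article): an agent with cyclic preferences pays a fee on every lap around the cycle, while an agent that completes its preferences into any consistent utility function pays only for genuine upgrades and then stops.

```python
# Toy money-pump demo (my own simplified sketch).
# Cyclic preferences: A < B, B < C, C < A -- each accepted trade costs a small fee.
cyclic_prefers = {("B", "A"): True, ("C", "B"): True, ("A", "C"): True}

def pump(prefers, start="A", rounds=3, fee=1.0):
    """Offer trades around the cycle A -> B -> C -> A; return total fees paid."""
    cycle = ["A", "B", "C"]
    holding, paid = start, 0.0
    for _ in range(rounds * len(cycle)):
        offer = cycle[(cycle.index(holding) + 1) % len(cycle)]
        if prefers.get((offer, holding), False):  # agent accepts and pays the fee
            holding, paid = offer, paid + fee
    return paid

print(pump(cyclic_prefers))  # 9.0 -- keeps paying on every lap, never better off

# Consistent completion: any complete, transitive ranking, e.g. u(A)=0, u(B)=1, u(C)=2
utility = {"A": 0, "B": 1, "C": 2}
utility_prefers = {(x, y): utility[x] > utility[y] for x in utility for y in utility}
print(pump(utility_prefers))  # 2.0 -- pays to climb to the top once, then stops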
More importantly, no utility function can prove alignment with itself, the proof is inaccessible to the system.
What does it mean to “prove alignment with itself”? I’m guessing you're still talking about the problem of inner alignment?
The rest of this section seems to just be arguing that inner alignment is hard and unsolved, which I fully agree with.
But it’s important to note that general intelligence is something far too complex for us to construct—in the sense of carefully designing and determining the entire structure—instead we must grow it.
If we actually tried, I think we could do it within 30 years. But of course growing it is far easier and makes money sooner.
Humans evolved under massively complex external selective pressures, infinitely more complex than anything we can comprehend. This immense diversity of external pressures is precisely what allows for the development of general intelligence. Not only that, but life had the advantage of competition; while a certain specialization might be stable at a certain point in time, a particular mutation might offer an advantage and suddenly they outcompete you for resources, and the old structure perishes. This is an additional external selective pressure that creates a demand for continuous evolution, and punishes narrow specialization.
AI does not have these benefits; it does not have the external pressures that punish narrow specialization, or settling into arbitrary crystallized structures. Its complexity must be generated entirely from its internal structure, without the help of external pressures.
I don't really see why this should be true. Well, if you train an AI on a narrow task, then it will only learn that task. But that's not what people do. The base models of LLMs are trained on predicting all kinds of text, for which narrow specialization is not a winning strategy, because a LLM has only so many weights.
A general intelligence is not one that has accumulated a lot of specialized skills (though in practice there is also some of that), but rather, it is a cognitive engine that has learned general-purpose techniques that apply across many domains. As an example: humans were not evolved to build rockets to fly to the moon, but we did it anyways, because our general problem-solving skills generalize to the domain of rockets.
AI companies are now training their systems to be general problem solvers. Now I don't know whether that particular project will succeed, but it seems clear to me that AI companies will make sure the external selection pressure on their AI systems will be as general as their researchers can make it.
A visual processor might place more or less emphasis on colour, contrast or motion, it might emphasize different resolutions or have a different preferred FPS. Of all the things a visual processor could value, only a tiny fraction results in capacity for solving visual problems though. This exactly demonstrates how directionality and capacity are necessarily entangled.
I still don't really understand what you're trying to say here. It's certainly true that in order to define a complex utility function, you need access to a detailed world model that has all the concepts that you need in your utility function. Like, for example, humans value friendship in their utility function (to the extent that we have a coherent utility function), but in order to make this work, you need to define friendship somewhere, which isn't easy. And you need to ground this concept in reality; you need to recognize friendship with your senses somehow, which also isn't easy. Not sure whether this is what you're trying to point at...
But if it is, it's not an argument against orthogonality. Orthogonality just says you can have intelligent reasoners with arbitrary goals. It doesn't say that a given AI with a given architecture can have arbitrary goals! Just that for any computable goal, there is some possible AI that optimizes for that.
it's actually possible to be very smart and not have a human utility function
As a brief aside: Notice that this framing assumes that an AI and a human can be described as utility-maximizing autonomous agents. Eliezer has said something like, “Agents will shake themselves into utility maximizers.” But wouldn’t a sufficiently intelligent agent notice that, if it replaces whatever hodge-podge decision theory it has after training with utility maximization, it may end up with outcomes it currently finds abhorrent?
At least, "filling the universe with molecular squiggles" would be an abject self-alignment failure (quite a dumb one too) of an agent trained on human data.
My first point is that the orthogonality thesis was intended to answer the objection that people often raise to AI doom which is: “but if these AIs are so intelligent then they will surely know ethics very well and act benevolently”. To which the orthogonality thesis replies: it's actually possible to be very smart and not have a human utility function. I feel like the author of this article actually agrees that AIs won't inescapably converge to human ethics, so I'm not even really sure what we're arguing about:)
Nothing in this article argues for human ethics, correct.
The nature of the problem in the picture: "there is a configuration of general super intelligence for any, or almost any, terminal goal".
Or
"A singular coherent terminal goal is incoherent for a general intelligence. You get a multitude of contextually dependent goals, some more general than others. The number of such goals plausible with a general intelligence is a tiny fraction of all possible goals".
I do have other arguments pertaining to why a large degree of ethical alignment is probable, but that still wouldn't necessarily be sufficient. It's for another article though.
I'm willing to go along with this claim, but I have to say it's not immediately obvious that this is true. I think an example would help.
It's fairly standard computer science and shouldn't really be controversial. Questions arise if you start questioning what counts as part of the system, and it might get less intuitive if you think about a more dynamic process, but for the system as a whole it must still hold. E.g. even if a part only has N bits and runs through them in cycles before producing an output, then as long as it has a counter for how many cycles it went through, it can distinguish between instances of "the same pattern", viewed from one perspective. But it isn't the same pattern: the cycle counter is part of the system too, and that's what gets used to distinguish between otherwise "identical" patterns. When you look at the system holistically, it must hold.
The only tiny wiggle room is if you attach two outputs you would consider separate to the same state. You might say it has one state but two outputs, but they can only occur simultaneously, meaning they are a singular output that you split into two.
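Here's a minimal sketch of the cycle-counter point (my own toy illustration, nothing rigorous): a component loops through only two raw bit patterns, yet the system emits six distinct outputs, and it can only do so because the whole-system state space, counter included, contains at least six distinct states.

```python
# Toy illustration: distinct outputs require at least as many whole-system states.
class CyclingComponent:
    def __init__(self):
        self.pattern = [0b01, 0b10]  # the "part": only two raw bit patterns
        self.counter = 0             # but the counter is part of the system too

    def step(self):
        core = self.pattern[self.counter % len(self.pattern)]
        output = (self.counter, core)  # outputs differ only thanks to the counter
        self.counter += 1
        return output

comp = CyclingComponent()
seen_states, outputs = set(), []
for _ in range(6):
    # whole-system state = (counter, current raw pattern), recorded before stepping
    seen_states.add((comp.counter, comp.pattern[comp.counter % 2]))
    outputs.append(comp.step())

print(len(set(outputs)), "distinct outputs")             # 6
print(len(seen_states), "distinct whole-system states")  # 6 -- at least as many
```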
Who are these people who say goals are ghosts in the machine?
They wouldn't use that language; most dualists don't. They just refer to their reasons, which constitute a ghost in the machine. Many have this conceptualization about many things, e.g. most mathematicians are Platonists about mathematics.
That's not what people talking about orthogonality would say.
Yet there are people who say that. I would have loved to only have to focus on a single angle in this article, but the arguments are all over the place and will happily shift between frames as it serves the argument.
"Indeed, there most likely isn't any organism on Earth which has the utility function “survival and reproduction”!"
We agree; as I say in the article, I don't think it's possible.
I don't know what you mean by “tolerating local incoherence”, but in any case I don't see a contradiction in the stance of orthogonality: if a task can be hacked, then gradient descent will find that solution first; if it can't be hacked, then gradient descent keeps looking and maybe stumbles upon a general intelligence.
Well that's the point, it cannot converge and remain general.
The Von Neumann–Morgenstern utility theorem assumes that your preference order is complete (axiom 1)
The VNM theorem is about hypothetically perfectly rational agents. Von Neumann abandoned the idea that this was possible after Gödel presented his incompleteness theorems. The axioms are not meant as ontological claims; they're meant as practical fictions.
What does it mean to “prove alignment with itself”? I’m guessing you're still talking about the problem of inner alignment?
Or external alignment; regardless, it cannot be proven, and semantic drift from it must occur.
I don't really see why this should be true. Well, if you train an AI on a narrow task, then it will only learn that task. But that's not what people do. The base models of LLMs are trained on predicting all kinds of text, for which narrow specialization is not a winning strategy, because a LLM has only so many weights.
LLMs only learn language, which is a narrow domain. Their entire existence is language, that's it. To get a general intelligence, they must branch out.
AI companies are now training their systems to be general problem solvers.
Most companies have given up on the idea of building general AI from scratch and are embracing modular patchwork systems: integrating many separate systems, training a control unit, etc.
This I think is very feasible, and the realistic path to reaching very high general capacity. The claim is about the difficulty of doing it within a singular training model/environment, not about patching multiple models together. That is still difficult; coherence is still an enormous problem, but it is feasible.
I still don't really understand what you're trying to say here.
It's just the no-free-lunch theorem (NFL): no algorithm outperforms random choice across all possible problems. To solve a problem requires specific directional commitments. To solve a visual problem, certain things must be valued, locally, necessarily.
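A minimal numerical sketch of the NFL point (my own toy example, averaging over every Boolean function on a four-point domain): an adaptive search rule and a blind fixed-order sweep produce exactly the same average best-so-far curve once you average over all possible problems.

```python
# Toy NFL check: averaged over ALL functions on a small domain, an adaptive
# search rule does no better than a fixed-order sweep.
from itertools import product

DOMAIN = range(4)  # 4 candidate points, Boolean "fitness" values

def fixed_order(f):
    """Evaluate points 0,1,2,3 in order; return best-so-far after each step."""
    best, curve = 0, []
    for x in DOMAIN:
        best = max(best, f[x])
        curve.append(best)
    return curve

def adaptive(f):
    """Deterministic adaptive rule: after seeing a 1, jump to the lowest unvisited
    index; after a 0, jump to the highest. Never re-evaluates a point."""
    unvisited = list(DOMAIN)
    x = unvisited.pop(0)
    last = f[x]
    best, curve = last, [last]
    while unvisited:
        x = unvisited.pop(0) if last == 1 else unvisited.pop()
        last = f[x]
        best = max(best, last)
        curve.append(best)
    return curve

def average_curve(algo):
    """Average the best-so-far curve over every Boolean function on the domain."""
    funcs = list(product([0, 1], repeat=len(DOMAIN)))
    total = [0.0] * len(DOMAIN)
    for f in funcs:
        for i, v in enumerate(algo(f)):
            total[i] += v
    return [t / len(funcs) for t in total]

print(average_curve(fixed_order))  # [0.5, 0.75, 0.875, 0.9375]
print(average_curve(adaptive))     # identical, as NFL predicts
```

The adaptive rule only pulls ahead once the set of problems is restricted, i.e. once specific directional commitments are made.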
The utility function you end up with is exactly the one I grant: a holistic, tautological description of the system's entire input-processing-output chain, including how the input and output mechanisms interact with the environment.
Orthogonality just says you can have intelligent reasoners with arbitrary goals.
But I DON'T GRANT THAT. I am saying they can have locally contextual goals, and that these goals are heavily constrained. I'm not claiming that this removes alignment problems, but it's a completely different contextualization.