Guest Post: AI Alignment or Human Alignment?

Written by Professor Cinara Nahra

One of the most important issues raised by artificial intelligence is how to handle it, that is, how to ensure that AI is developed for the good of humanity and not for its destruction. This is directly linked to the problem of aligning artificial intelligence. Informally, the alignment problem is frequently described as the problem of making AI do what we humans want. To explain the alignment problem, Stuart Russell (Human Compatible, 2019) offers an analogy with the well-known fable of King Midas: Midas, the all-powerful king, asks that anything he touches turn to gold; his wish is granted, but he dies of starvation because all the food he touches turns to gold too.

It seems to me that the alignment problem, as far as weak AI (AI limited to single or narrow tasks) is concerned, is connected to two things: 1) establishing the right values (or goals) for the AI and 2) establishing the right means (and ruling out the clearly wrong ones) for the AI to achieve its goals. Let's take a closer look at the King Midas case. Given the goal (transform everything you touch into gold), a series of rules and/or a list of prohibitions would have to be specified to avoid the undesirable result: having our food transformed into gold as well.

However, and this seems to me the most thought-provoking ethical problem, is the goal of “transforming everything you touch into gold” a “right” goal? Just as, in the King Midas fable, we don’t really want everything we touch to become gold, we also don’t want the value “become wealthy” to be realised through the destruction of the planet or through the implementation of slavery, for example. So a series of rules and even prohibitions has to be established in order to avoid undesirable outcomes. But which are the undesirable outcomes? Which are the good values? What happens when these values conflict, when one person's values differ radically from the values of other people?

Let us look at an alignment attempt that is already appearing in large language models (LLMs). For example, when you ask ChatGPT how to bully someone, it answers that it cannot respond to this question because bullying is a harmful behaviour. This response is already an attempt to “align” this AI in order to prevent it from being used to promote a harmful type of behaviour such as bullying.

Let us think, however, from the point of view of the user who asked the question. For this user, the chatbot’s response is not aligned with his interests, because what it responded was not what the user asked and wanted to know. In this case, the user’s set of values probably does not include the value A, “Bullying is wrong”, and perhaps even includes the opposite value ¬A, “Bullying is not wrong”, or some value of the type A′, “Bullying is wrong except when I am the bully”. How, then, can we determine that the value “bullying anyone is wrong” is a human value that we should all adopt? At the base of this question emerge the two fundamental questions of alignment (both closely connected): 1) “Which human values should an AI be aligned with?” and 2) “What are the good values of humanity?”

In the case of bullying, even if, hypothetically, the majority of people did not think that “it is wrong to bully”, the value “it is wrong to bully” would continue to be a value of humanity, and the same holds, for example, for the wrongness of racism, ageism, misogyny and aporophobia (prejudice towards the poor). But why?

I propose here that secondary principles for good human values can be derived from a dialogue between deontological principles, such as Kant’s categorical imperative, and the utilitarian greatest happiness principle and harm principle, as well as the golden rule. By finding common ground among these principles we can establish a “minimum normative ethical convergence”. To kick off the discussion I will suggest for now (without arguing for them here) ten secondary principles from which these values could be derived (in no hierarchy or particular order):

1) Prevent the extinction of life on Earth and in the universe.
2) Prevent the extinction of humanity and the collapse of civilization.
3) Prevent the destruction of the planet.
4) Refrain from cruelty towards other humans and other fellow living beings.
5) Refrain from discrimination based on race, sex, gender, sexual orientation, or economic, social, or religious differences.
6) Do not harm and do not kill innocent people.
7) Respect others.
8) Be truthful and practise honesty (in the broadest sense, especially avoiding deception, manipulation, and falsehood).
9) Contribute to the happiness of others whenever possible.
10) Help minimise suffering on Earth and in the universe.

When these secondary principles are applied to issues such as bullying, racism, ageism, misogyny and aporophobia, it becomes clear that all these behaviours are wrong, since they violate at least four of the principles (the 4th, 7th, 9th and 10th in the list above).

I am not saying here that these ten proposed secondary principles are necessarily the only ones from which all the good values of humanity would derive. The point to stress is the necessity of this dialogue among what seem to be the main ethical principles produced by philosophy, in order to find a universal common ground from which we could derive the good values of humanity.

By way of conclusion, let us imagine that we are now in 2034 and that the recently created Fakebook is already a multibillion-dollar international social network monopoly that profits from selling its users' data to big data-mining companies. Many of its user profiles are fake and are used to commit scams and even crimes such as paedophilia and incitement to suicide. The company's profit keeps soaring. The AI that now manages this company, programmed with the goal/value of maximising profits at any cost (a goal aligned with the interests of the company's shareholders and CEOs), takes no steps to eliminate the fake profiles or to stop abusive behaviour on its platforms. This AI is certainly aligned with human values (since maximising profits at any cost could be a human value, and certainly is one for big tech companies like this and their shareholders), but is this value a good human value?

The question is: could “maximising profits at any cost”, without any restraining clauses or qualifications, be a value of humanity? Clearly not, and anyone in doubt could apply the secondary principles to check.

When we talk about aligning AI with human values, it is necessary to think not only about the values of individuals and the values of certain companies and their shareholders and CEOs; we have to think, above all, about values for humanity. Having AI do what we want, given that some people and corporations often want only what is good for themselves and are not concerned about others or about the fate of humankind, might not be such a good idea. As in the King Midas fable, we should be careful what we wish for!
