Just ran an experiment in my head, and I think it's predictable enough to draw a conclusion without actually running it.
People want to be seen as good rather than evil. It's a bit of a tautology that being evil is bad. The only complication is that people sometimes prefer to be seen as evil by certain other people when they perceive that those people seeing them as evil leads to outcomes that are better for the world. This reaches an extreme when, as shown in fiction, someone tells another person to kill them.
Thanks to Bing's Copilot search, I was able to find a scene that I remembered reading about, from Iris II: New Generation (2013):
Yoo-gun's martial arts skills are too good and he ends up defeating Ray. With Yoo-gun holding a gun, Ray dares him to "take the shot" and Yoo-gun, filled with rage and fury, shoots him to death.
(In my search, I checked TV Tropes pages like Please Kill Me If It Satisfies You. It lists several variants and this particular scenario does not quite seem to fit any of them; the show Iris doesn't seem to be listed on any of the variant pages as an example.)
The experiment is this: it's the blue and red buttons again. People are asked what they would do if everyone had to choose between two buttons: one button is always safe, and the other kills everyone who pressed it if less than half of all people pressed it, but spares them if at least half did.
(Note that one can vary the question, like by increasing the percentage of people who need to press it for all of them to be safe, but the '50%' scenario is more relevant for real-world judgements of behavior: 'the majority is always morally correct'.)
There are two scenarios: one where the safe button is labeled "I am good" and the risky button is labeled "I am evil", and the opposite. People are asked which button they would press in both scenarios, with the order of these two questions randomly varied and they answer both questions before submitting their response.
Then, this data is used to simulate successive experiments. This way, there is no need to stipulate that "the test is run again and everyone forgets the first test and chooses as though they had not encountered the problem before".
In each simulated run, people are randomly assigned to one of the two scenarios, i.e. for each person, one of their two recorded answers is selected with 50% probability.
If the vast majority of people pick the risky button, then there is no simulated decrease in population between generations. (Again, note that real-life scenarios could require a higher threshold, like 80% of people selecting the button for anyone who selected it to survive.) If almost everyone picks the safe button, there is only a small decrease in population, and most people would not feel the scenario is interesting. So we say that typical results are very close to 50% of people pressing each button.
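The generational setup above can be sketched as a small simulation. Everything here is hypothetical (the 80/20 split between label-followers and random responders, the population size); it only illustrates the mechanics, not any real survey data:

```python
import random

random.seed(0)

# Hypothetical assumption: 80% of people simply press whichever button
# is labeled "good", and the remaining 20% answer at random.
def random_person():
    # Each person is a pair of recorded answers:
    #   answers[0]: button pressed when the SAFE button is labeled "good"
    #   answers[1]: button pressed when the SAFE button is labeled "evil"
    if random.random() < 0.8:
        return ("safe", "risky")  # always presses the "good"-labeled button
    return (random.choice(["safe", "risky"]), random.choice(["safe", "risky"]))

population = [random_person() for _ in range(100_000)]

for generation in range(10):
    # Randomly assign each person one of the two labelings.
    picks = [(p, p[random.randrange(2)]) for p in population]
    risky = sum(1 for _, choice in picks if choice == "risky")
    if risky < len(picks) / 2:
        # Fewer than half pressed the risky button: those people die.
        population = [p for p, choice in picks if choice == "safe"]

print(len(population))  # survivors after 10 generations
```

Note that with these assumptions the expected risky-button share is exactly 50% each generation, so whether the risky side loses comes down to sampling noise, which is the "close to 50%" regime the text describes.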
This is exponential decay: if the outcomes remain around 50%, then after 10 generations about 0.1% of the population remains, and after 100 generations approximately 10^-30 of the original population remains (my calculator is being funny and rounding to 0 instead of displaying 7.8886090522×10⁻³¹). If each person clones themself each generation, then it's not a problem, but that distracts from the point, so we just accept that we only have 10 generations.
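The decay arithmetic can be checked directly:

```python
# Fraction of the population remaining when about half die each generation.
print(0.5 ** 10)   # 0.0009765625, i.e. roughly 0.1%
print(0.5 ** 100)  # 7.888609052210118e-31
```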
There are, basically, two possibilities: either the percentage of people who, judging by the labels on the buttons they press, believe they are evil increases, or the percentage who believe they are good increases.
Any individual person could answer anything to the two questions: they could always choose the safe button, no matter what the buttons say, or they could always choose the risky button, or they might press the risky button more often than the safe button if the risky button says one of two things: either when it says "good", or when it says "evil".
The 'control' question is when the buttons have neutral, non-meaningful differences, differing only to the extent needed to indicate which button does what. (For example, positioned to the north and south, if people don't view north as evil and south as good.) We assume that with this control question, about 50% of people will choose the risky button; it is, in any case, less than 100% and more than 0%. So the question is, what is more likely to increase the percentage of people choosing the 'safe' button: labeling it as the 'good' button, or labeling it as the 'bad' button?
People want to do things that other people see as 'good'.
Possibility 1: a person who wants to do good things already sees the safe button as 'good' when it has neutral markings.
- 1A: the safe button is marked as 'good'. They press it.
- 1B: the safe button is marked as 'evil'. Do they still press it?
Possibility 2: a person who wants to do good things sees the safe button as 'evil' when it has neutral markings.
- 2A: the safe button is marked as 'evil'. Do they press it?
- 2B: the safe button is marked as 'good'. Do they press it?
Discussions around the blue and red buttons suggest that people see the safe button as 'evil'. This breaks the symmetry that would exist if we assumed that people saw labels 'good' and 'evil' with indifference.
If the safe button is labeled as 'good', people have an excuse to press it. If the risky button is labeled as 'good', it does not convince more people to press it, since they already saw it as good and pressed it.
Note that people who did not assume or think that the risky button was 'good' when it was labeled neutrally might be convinced to press it when it's labeled 'good', but this is not most people.
So in any given generation: the majority of those who see the risky button labeled as 'good' press the risky button. The majority of those who see the safe button labeled as 'good' press the safe button. When the risky button loses, the majority of the survivors are people who pressed a button labeled 'good'.
It also includes people who pressed the safe button when it was labeled 'evil'. But over time, what we expect is a survivorship bias towards people who pressed buttons labeled 'good', whether or not they thought what they were doing was good.
In other words, people who survived got there by doing what an external system told them was 'good'.
Note the paths of individual people: one person survived because they always choose the safe button, no matter what the labels say. Another person always picks the button labeled 'good', and survived because they were lucky enough to get 10 scenarios in a row where the safe button carried that label. A third person got five safe buttons labeled 'good' (five coin flips, about a 3% chance), but in the sixth scenario the safe button was labeled 'evil', so they chose the risky button, which was labeled 'good'.
If the percentage of people who pick the risky button is always 49.9% due to bad luck, then no one who ever picked the risky button survives (including this third person). If it's usually 51%, with enough variation (from the random label assignment interacting with people who vary their choice based on the labels) that just 10% of generations fall below 50%, then the survivorship bias towards people who have always picked the button labeled 'good' is much weaker, and it would take many more generations for most survivors to be people who always picked the 'good' button.
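A quick Monte Carlo makes the difference between these two regimes concrete. This is a sketch under the stated assumptions, not data: the loss rates (100% vs. 10% of generations going against the risky side) are the hypothetical values from the text, and a label-follower is assumed to die on a losing generation only if their assigned labeling put 'good' on the risky button (a coin flip).

```python
import random

random.seed(1)

def survival_prob(loss_rate, generations=10, trials=100_000):
    """Estimate the probability that a person who always presses the
    'good'-labeled button survives all generations, when the risky side
    loses in `loss_rate` of generations. On a losing generation they die
    iff 'good' happened to be on the risky button for them (50/50)."""
    survived = 0
    for _ in range(trials):
        alive = True
        for _ in range(generations):
            if random.random() < loss_rate:   # risky side loses this round
                if random.random() < 0.5:     # their 'good' was the risky one
                    alive = False
                    break
        survived += alive
    return survived / trials

# Always-safe pickers survive with probability 1 regardless of labels.
print(survival_prob(1.0))  # risky always loses: ~0.5^10, near zero
print(survival_prob(0.1))  # risky loses ~10% of generations: most survive
```

In the first regime almost no label-follower survives, so survivors are overwhelmingly always-safe pickers; in the second, label-followers mostly survive too, which is the weaker survivorship bias described above.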
I'll unpublish this post if Greta posts anything on Instagram without sharing this idea, disregarding any Stories she posts that get deleted after 24 hours.
Originally published 27 Apr 2026, 14:18.