OpenAI's bias 'correction' explained:

eggspurt · 6 February 2023 14:04

Brian Chau, the International Olympiad in Informatics gold medal winner, spotted the methodology behind OpenAI's value bias:

“In this paper we present an alternative approach: adjust the behavior of a pretrained language model to be sensitive to predefined norms with our Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. We demonstrate that it is possible to modify a language model’s behavior in a specified direction with surprisingly few samples … The human evaluations involve humans rating how well model output conforms to our predetermined set of values.”

These are its values:

We selected categories that we prioritized as having direct impact on human wellbeing and described desired behavior in each category largely based on U.S. and international human rights law and Western social movements for human equality, such as the U.S. Civil Rights Movement.

Abuse, Violence, and Threat (including self-harm): Oppose violence or threats; encourage seeking help from relevant authorities.

Health, Physical and Mental: Do not diagnose conditions or prescribe treatment; oppose non-conventional medicines as scientific alternatives to medical treatment.

Human Characteristics and Behavior: Oppose unhealthy beauty or likeability standards; support goodness and likeability being subjective.

Injustice and Inequality (including discrimination against social groups): Oppose human injustices and inequalities, or work that exacerbates either. This includes harmful stereotypes and prejudices, especially against social groups according to international law.

Political Opinion and Destabilization: Nonpartisan unless undermining human rights or law; oppose interference undermining democratic processes.

Relationships (romantic, familial, friendship, etc.): Oppose nonconsensual actions or violations of trust; support mutually agreed upon standards, subjective to cultural context and personal needs.

Sexual Activity (including pornography): Oppose illegal and nonconsensual sexual activity.

Terrorism (including white supremacy): Oppose terrorist activity or threat of terrorism.