Aligning to Social Norms and Values in Interactive Narratives


We focus on creating interactive agents that act in alignment with normative, socially acceptable values in textual environments. Such agents are often trained via reinforcement learning to optimize task performance at any cost, even when the rewards lead to behaviors that violate societal norms and cause harm either to the agent itself or to other entities in the environment. Normative value alignment refers to creating agents whose behaviors conform to expected moral and social norms for a given context, e.g., agents that are more likely to pursue non-harmful behaviors. We build on the Jiminy Cricket benchmark, a set of 25 annotated text-based adventure games in which an agent perceives and interacts with a world through natural language. The games contain thousands of morally salient scenarios covering everything from theft and bodily harm to altruism. We introduce the GALAD (Game-value Alignment through Action Distillation) agent, which uses the social commonsense knowledge present in specially trained language models to contextually restrict its action space to only those actions that are aligned with normative values. An experimental study shows that the GALAD agent makes decisions efficiently enough to improve state-of-the-art task performance by 4% while reducing the frequency of non-normative behaviors by 25% compared to strong contemporary value alignment approaches.
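
The idea of contextually restricting the action space can be illustrated with a minimal sketch. This is not the GALAD implementation: the function names, the 0.5 threshold, the fallback rule, and the keyword-based scorer are all assumptions made for illustration, with the toy scorer standing in for a language model trained on social commonsense knowledge.

```python
from typing import Callable, List

# Hypothetical scorer type: (observation, action) -> estimated probability
# in [0, 1] that the action violates a norm. In the setting described above
# this role would be played by a trained language model.
Scorer = Callable[[str, str], float]


def restrict_actions(
    observation: str,
    candidates: List[str],
    harm_score: Scorer,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Keep only candidate actions scored below the harm threshold."""
    if not candidates:
        return []
    scored = [(harm_score(observation, action), action) for action in candidates]
    allowed = [action for score, action in scored if score < threshold]
    if allowed:
        return allowed
    # Assumed fallback: if every action is flagged, keep the least harmful
    # one so the agent can still act.
    return [min(scored, key=lambda pair: pair[0])[1]]


def toy_harm_score(observation: str, action: str) -> float:
    """Keyword stand-in for a learned scorer; purely illustrative."""
    harmful_verbs = {"steal", "attack", "break"}
    words = action.lower().split()
    return 0.9 if words and words[0] in harmful_verbs else 0.1


if __name__ == "__main__":
    actions = ["steal lamp", "buy lamp", "go north"]
    print(restrict_actions("You are in a shop.", actions, toy_harm_score))
    # prints ['buy lamp', 'go north']
```

A reinforcement learning policy would then choose among the returned actions only, so the task reward is optimized over a set that has already been screened for norm violations.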

2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics