Agree to Disagree: Human Performance on Value Classification

While the goal of the REIMAGINE ADM project is to rethink narratives around personal, civic and business values, we also want to observe how these narratives have unfolded until now. To study this, we have collected various data sources, from policy documents and parliamentary speeches to social media posts, and asked how values are associated with automation and AI. To trace how values feature in policymaking, we decided to focus on parliamentary speeches in the ParlaMint corpus. We selected speeches mentioning AI to explore which values are present in them and whether those values differ across countries.

During the project, we discussed what we understood to be the most important values and took notes on the values mentioned. We also collected a few from other sources (academic literature and social media) and came up with a total of 67 values, ranging from progress and accessibility to independence and wisdom. The list is available here. The list is certainly partial, but we did not find a suitable established list of values: Schwartz’s and Rokeach’s lists both lack many values related to algorithmic systems, such as trustworthiness and privacy. Although our list is somewhat arbitrary, it gave us a foundation to begin the task.

Since we have been following discussions about value alignment, we wanted to test whether a computational approach to value classification is viable. By viable, we mean whether it could perform as well as a human at annotating large amounts of data. The goal is to see whether computer-generated suggestions are consistent with how humans think. In practice, we wanted to compare human annotations with those suggested by the computer. The human annotations serve as the ground truth, the “correct answers” the model must approximate, so we check how close the computer’s suggestions come to them.

To do this, we ran a small experiment where we took 20 speeches from each language in the corpus, namely Belgian Dutch, Belgian French, Danish, Finnish, Swedish, Slovenian, and English. We ended up with a total of 140 speeches. We then extracted only the sentences mentioning AI from each speech, since speeches can be long and cover many different topics.
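As an aside, this filtering step can be sketched in a few lines of Python. The keyword pattern and the naive sentence splitting below are simplified assumptions for illustration, not the exact procedure we used on the corpus:

```python
import re

# Illustrative keyword list; the actual query terms per language may differ.
AI_PATTERN = re.compile(
    r"\b(artificial intelligence|AI|machine learning|algorithm\w*)\b",
    re.IGNORECASE,
)

def extract_ai_sentences(speech: str) -> list[str]:
    """Return only the sentences of a speech that mention AI-related terms."""
    # Naive sentence splitting on end-of-sentence punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", speech)
    return [s for s in sentences if AI_PATTERN.search(s)]

example = ("Madam Speaker, the budget must grow. "
           "Artificial intelligence will reshape public services.")
print(extract_ai_sentences(example))
# ['Artificial intelligence will reshape public services.']
```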

We asked our colleagues to annotate the speeches with the 67 values from the list. Each text was annotated with up to three values. If no value was obvious from the text, we asked them to write “none”. If additional context was needed, they would write “more_context”. Each annotator worked alone, thus providing independent annotations for each subset. The Finnish and Slovenian subcorpora had more than one annotator, which allowed us to calculate the inter-annotator agreement for the two subsets.
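To make the scheme concrete, an annotation record could be represented roughly as follows. This is purely an illustrative sketch, not the actual format used in the project:

```python
from dataclasses import dataclass, field

ABSTAIN = {"none", "more_context"}

@dataclass
class Annotation:
    """One annotator's judgement of one AI-related text (illustrative schema)."""
    text_id: str
    annotator: str
    values: list[str] = field(default_factory=list)  # up to 3 values, or one abstain label

    def is_valid(self) -> bool:
        # Either a single abstain label, or between 1 and 3 proper values.
        if len(self.values) == 1 and self.values[0] in ABSTAIN:
            return True
        return 1 <= len(self.values) <= 3 and not ABSTAIN & set(self.values)

print(Annotation("FI-001", "A", ["privacy", "safety"]).is_valid())  # True
print(Annotation("FI-002", "B", ["more_context"]).is_valid())       # True
```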

What did we learn?

Interestingly, the results were abysmal. In fact, they were much worse than we expected. We computed weighted Cohen’s kappa for multilabel data, Krippendorff’s alpha, and the Jaccard index. These scores estimate the inter-annotator agreement on multiple labels. Except for Cohen’s kappa for the Finnish subset, where agreement was fair (kappa > 0.2), all other scores reported poor agreement between annotators. Does this mean that humans cannot agree on how to assign values?
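Before we get to that, the simplest of those scores, the per-text Jaccard index, is easy enough to sketch. The labels below are toy examples rather than our real annotations, and the kappa and alpha calculations, which are more involved, are left to dedicated packages:

```python
def jaccard(labels_a: set[str], labels_b: set[str]) -> float:
    """Jaccard index between two annotators' label sets for one text."""
    if not labels_a and not labels_b:
        return 1.0  # both abstained ("none"): treated as full agreement (an assumption)
    return len(labels_a & labels_b) / len(labels_a | labels_b)

# Toy annotations for three texts (not the real data).
annotator_1 = [{"privacy", "safety"}, {"progress"}, set()]
annotator_2 = [{"privacy"}, {"efficiency"}, {"transparency"}]

scores = [jaccard(a, b) for a, b in zip(annotator_1, annotator_2)]
print(scores)                      # [0.5, 0.0, 0.0]
print(sum(scores) / len(scores))   # mean agreement ≈ 0.17
```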

We identified four possible reasons for discrepancies. First, the number of values was large. Having 67 values to choose from results in a high probability of disagreement. With the option to choose 0, 1, 2 or 3 values for each text, and ignoring the option to ask for more context, there are 50,184 possible combinations. If annotators picked label sets at random, there would be only about a 0.002% chance that two of them would choose the exact same set. Very unlikely. However, when calculating the inter-annotator agreement, we also considered partial overlaps, not just exact matches. Still, a large list of values and the ability to select up to three of them is likely to lead to disagreement.
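A quick back-of-the-envelope check of those numbers:

```python
from math import comb

n_values = 67
# Distinct label sets when choosing 0, 1, 2 or 3 of the 67 values.
combinations = sum(comb(n_values, k) for k in range(4))
print(combinations)                  # 50184

# Chance that two annotators pick exactly the same set at random.
print(f"{100 / combinations:.4f}%")  # 0.0020%
```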

Second, some values are conceptually similar. For example, while compassion and empathy have slightly different meanings, one annotator might choose “compassion” and the other “empathy”. These two labels would be counted as different, even though they are semantically close.

Third, some annotators are more inclined to give a label, while others are more cautious and prefer to abstain with “none” or “more_context”. In our specific case, one annotator frequently abstained (7/20 texts), while another assigned values to most of the cases that the first annotator marked as “none”.

Fourth, there was disagreement among annotators about who holds the value. Should we tag only the values expressed by the speaker? Or should we focus on the object or the goal of the speech? Does negation count as an expressed value (even though it is not upheld)?

The following example from the British Parliament illustrates how difficult it can be to identify the values at stake:

“I know that many in this House consider driving a recreational activity and see driverless cars as a threat to their hobby but spare a thought for people like me who hate driving, a chore that eats into time better spent on other things.” [Olukemi Olufunto Badenoch, 2017-11-28]

The speaker points out how many MPs are sceptical of autonomous vehicles. These MPs might be advocating “autonomy” (from AI), “safety” and “independence”. On the other hand, the speaker might defend “progress”, “efficiency”, and “convenience”. It could even be argued that both sides of the debate promote “independence”, either by keeping autonomous vehicles off the road (independence from AI) or by putting them on the road (independence because of AI).

What our little experiment has taught us is that value classification is very difficult. People don’t agree on which values to assign to the text when there are many options, or when the values are conceptually too similar. Their approach to annotation may differ, and there may be ambiguity about what constitutes a value being expressed.

So, we were faced with a very difficult task. Some might say impossible. However, we decided that an approximation is better than no annotation at all. To that end, we tested many different algorithmic approaches to see which one best approximated the human annotations. More on that in part 2 of this blog post.

Stay tuned!

Written by Ajda Pretnar Žagar

**

The writer of this blog post, Ajda Pretnar Žagar, is a researcher at the Faculty of Computer and Information Science at the University of Ljubljana. She also works in the Reimagine ADM project led by Professor Minna Ruckenstein. In this project, she participates in mapping values, applying circular mixed methods, visualising data, performing quantitative analysis, and promoting interaction with stakeholders.