Kristine Eck

Automation of the Uppsala Conflict Data Program (UCDP)

The Uppsala Conflict Data Program (UCDP) is the world’s main provider of data on organized violence. Each year, UCDP coders read over 50,000 news reports. This manpower-intensive methodology is difficult to maintain as the volume of information increases. To address this problem, UCDP will collaborate with RISE SICS, a leading research institute for applied information and communication technology, to develop an ambitious but realistic use of Natural Language Processing and Machine Learning. The aim is not complete automation but rather to provide tools that improve the current UCDP coding process by making it more efficient and less resource-intensive. Doing so would also give UCDP the opportunity to mine more information about armed conflict events than it does at present, for example on victim demographics. Automating data extraction will also facilitate more frequent data releases, with UCDP aiming to move from annual to weekly releases. Thus, the project seeks to take advantage of existing technologies to achieve three ends. First, to ensure that UCDP can fulfill its mandate of providing global, systematic data on armed conflict even as the amount of information it must process increases. Second, to open up new research fields by providing data on conflict events that are sought after by researchers. Third, to enable more frequent data releases, which would facilitate efforts by researchers and organizations to predict the outbreak of new conflicts.
Final report
--Original aim and development of the project--

The Uppsala Conflict Data Program (UCDP) is the world’s main provider of data on organized violence. Each year, UCDP coders read over 50,000 news reports. This manpower-intensive methodology is difficult to maintain as the volume of information increases. To address this problem, UCDP collaborated with RISE SICS, a leading research institute for applied information and communication technology, to develop an ambitious but realistic plan for using Natural Language Processing and Machine Learning.

This project had two aims. The first aim was to develop automated data extraction protocols for the UCDP event data, which is annotated at the document level. The ambition was not complete automation but rather to provide tools that improve the current UCDP coding process by making it more efficient and less resource-intensive. The second aim was to use automated tools to collect new types of information on conflict events that UCDP has not traditionally collected but that are sought after by the research community. In particular, the project planned to explore opportunities to collect information on victim characteristics (such as age, sex, and occupation) and the type of weaponry used in conjunction with the fatality (bombs, guns, etc.).

--Project results--

Summary:

Given the complexity of event resolution and the high level of precision required, the project concluded that current machine learning techniques cannot yet markedly increase the efficiency of UCDP event data coding. Work on tactics annotation was not fully successful, but several promising avenues have been identified.

Detail:

To address the first project aim, the project initially intended to use an Information Extraction approach: building functionality to identify and classify mentions of real-world entities and events, and thereby leveraging machine learning to facilitate rapid human annotation of news items. However, the UCDP annotation tooling and processes treat the document as the core object, which means that the sequence-level annotations required for Information Extraction are not present. Following this insight, the operationalization of the working plan was adjusted.
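The difference between document-level and sequence-level annotation can be illustrated with a minimal sketch. The record types, field names, and example text below are hypothetical and are not taken from UCDP tooling; the point is only that a label stored for a whole document does not say where in the text the information appears.

```python
from dataclasses import dataclass

# Hypothetical record types for illustration only (not UCDP's actual schema).

@dataclass
class DocumentAnnotation:
    """Labels attached to a whole news report."""
    doc_id: str
    text: str
    labels: dict  # label name -> coded value

@dataclass
class SpanAnnotation:
    """A label attached to a character range inside a report."""
    doc_id: str
    start: int
    end: int
    label: str

def spans_from_document_labels(doc):
    """Naively look for each coded value as a literal string in the text.

    A span is produced only when the coded value appears verbatim;
    normalized values (such as an ISO date) usually do not.
    """
    spans = []
    for label, value in doc.labels.items():
        pos = doc.text.find(value)
        if pos != -1:
            spans.append(SpanAnnotation(doc.doc_id, pos, pos + len(value), label))
    return spans

# Made-up example: the location occurs verbatim, the coded date does not.
doc = DocumentAnnotation(
    doc_id="d1",
    text="Clashes near Exampletown killed five people on Monday.",
    labels={"location": "Exampletown", "event_date": "2019-03-04"},
)
spans = spans_from_document_labels(doc)
print([s.label for s in spans])  # only the location can be located in the text
```

In this toy case the coded date cannot be mapped back to the phrase "on Monday", which is the kind of gap that makes document-level labels insufficient as training data for sequence-level extraction.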

After RISE collaborators had been familiarized with UCDP data and annotation procedures, RISE explored numerous text categorization techniques. These included a classic Bag-of-Words approach; character-based contextualized embeddings produced by ELMo; embeddings produced by the BERT base model, and a version of BERT base fine-tuned on UCDP data; and a pre-trained and fine-tuned classifier based on ULMFiT. The categorization results varied widely across the 17 distinct categorization tasks that UCDP undertakes, with F1-scores ranging from 30.3% to 99.8%.
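As a rough illustration of the simplest technique named above, the following is a minimal sketch of a Bag-of-Words document classifier together with the F1-score used to report results. It is not the project's implementation: the scoring rule (token overlap with per-label count profiles), the function names, and the toy texts and labels are all assumptions made for the example.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (word order is discarded)."""
    return Counter(text.lower().split())

def train(docs, labels):
    """Build one token-count profile per label from labeled documents."""
    profiles = {}
    for text, label in zip(docs, labels):
        profiles.setdefault(label, Counter()).update(bag_of_words(text))
    return profiles

def predict(profiles, text):
    """Return the label whose profile overlaps most with the document's tokens."""
    bow = bag_of_words(text)

    def score(label):
        profile = profiles[label]
        total = sum(profile.values())
        return sum(count * profile[tok] for tok, count in bow.items()) / total

    # sorted() makes tie-breaking deterministic
    return max(sorted(profiles), key=score)

def f1_score(gold, predicted, positive):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy, made-up training data with two illustrative labels.
docs = [
    "air strike hit the town",
    "jets bombed the village",
    "gunmen shot at a convoy",
    "soldiers fired on protesters",
]
labels = ["aerial", "aerial", "ground", "ground"]
profiles = train(docs, labels)
print(predict(profiles, "jets carried out an air strike"))
```

Because F1 combines precision and recall, a task can score poorly even when most documents are labeled correctly, if the class of interest is rare or frequently missed.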

The analysis showed that automated tools could be developed for some categorization tasks (such as event date or location), but that these tools would only marginally decrease the amount of time required by human coders to resolve an event. Given the complexity of event resolution and the high level of precision required, the project concluded that further research efforts on machine learning techniques are needed in order to increase the efficiency of UCDP event data coding.

With regard to the second project aim, the project first had to tackle the lack of a conflict tactics classification system within the field of peace and conflict research. UCDP evaluated previous research within the field and, in collaboration with Yon Lupu, developed a classification system that would provide relevant categories for analytical purposes. These were anchored in the assumption that issues relating to proximity between attacker and attacked and to technologies of warfare (particularly aerial and explosive-based attacks) would be of use for examining theories within the field. RISE also explored using Wikipedia classifications, though these are less useful for conflict researchers.

In terms of categorization, RISE explored several approaches, from unsupervised learning with a clustering algorithm to supervised learning with a transformer, as well as an approach that leverages supervised learning without training data: zero-shot learning. The clustering algorithm focused on topics irrelevant to weapons/tactics, such as location. When the features were filtered to those relevant to the description of weapons/tactics, the clusters were more relevant. The zero-shot experiments indicated that, when evaluated against an annotated dataset, 43.86% of the articles were correctly classified without any training; a simple classifier allocating the most common label to all articles would correctly classify only 32.83% of the articles. This indicates that zero-shot learning provided improved, but still not satisfactory, results. More results are needed to determine how to improve the performance of the classifier. Running the BERT classifier showed good performance, especially for two categories, aerial attacks and suicide attacks, but the classifier was less successful for other categories. Those results, while encouraging, indicate that more research is needed to investigate and improve the performance of such algorithms.
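The comparison against "a simple classifier allocating the most common label to all articles" is a majority-class baseline. The sketch below shows how such a baseline and a classifier's accuracy could be computed on an annotated dataset; the function names and the toy labels are hypothetical, and only the two percentages in the final lines come from the report.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Share of articles whose predicted label matches the annotation."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def majority_baseline(gold):
    """Most common annotated label and the accuracy of always predicting it."""
    label, count = Counter(gold).most_common(1)[0]
    return label, count / len(gold)

# Toy, made-up annotations and predictions (not project data).
gold = ["aerial", "aerial", "suicide", "ground"]
predicted = ["aerial", "suicide", "suicide", "ground"]
print(majority_baseline(gold))
print(accuracy(gold, predicted))

# Figures reported above: zero-shot accuracy versus the majority baseline.
zero_shot_pct = 43.86
baseline_pct = 32.83
print(round(zero_shot_pct - baseline_pct, 2), "percentage points above the baseline")
```

Reading accuracy against this baseline matters because, with imbalanced categories, a classifier that has learned nothing can still appear to classify a third of the articles correctly.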

In conclusion, using machine learning to classify text content on conflict tactics shows some promise, especially with annotated articles, which allowed for experimentation with supervised learning and for evaluation of the classifiers. That said, further research is needed to improve the different algorithms before the approach is viable enough to produce a dataset that is usable for conflict researchers.

--How the infrastructure has been used--

N/A

--Unanticipated problems / deviations from the original plan--

With regard to the second aim, a feasibility analysis showed that, owing to data issues, collecting new information on the demographic characteristics of victims was unlikely to be satisfactorily achieved, so UCDP prioritized the other component, relating to conflict weaponry. After conducting an updated literature review, we decided to shift the focus slightly from weaponry to conflict tactics (e.g. suicide attacks, aerial bombardment), which we deemed would provide a more valuable resource to the research community.

--Long-term maintenance of the infrastructure--

N/A

--Infrastructure accessibility--

N/A. UCDP data remain openly available via ucdp.uu.se.

--International collaboration--

Associate Professor Yon Lupu of George Washington University joined us for the tactics annotation part of the project in Year 2 (unsalaried).
Grant administrator: Uppsala University
Reference number: IN18-0710:1
Amount: SEK 5,478,000.00
Funding: RJ Infrastructure for research
Subject: Political Science (excluding Public Administration Studies and Globalization Studies)
Year: 2018