Erik Melander

Database and digital archive for the next generation of conflict data


The Uppsala Conflict Data Program (UCDP) at Uppsala University has in recent years established itself as the leading provider of global data on armed conflict. Increased technical sophistication in conflict research raises the demands that users place on our data, which in turn means that data providers need to improve their data management capabilities. UCDP will use this grant to develop its data management infrastructure so that conflict researchers worldwide continue to turn to UCDP for data in the future.



There is now a particularly strong demand for disaggregated data, that is, for data about organized violence on a level of analysis below the country and year. UCDP has already developed large amounts of data of this type, which is the basis for the materials that the program publishes today. In order to keep its leading position UCDP is expanding its disaggregated data on armed conflict.



New technical resources are needed to manage the enormous amounts of data which will result from this emphasis on disaggregated data. The purpose of this project is to create a database structure for managing the hundreds of thousands of observations that will be generated in this process. The project will also digitalize source materials to enhance the transparency of the coding of the new data.

Final report

Erik Melander, Peace and Conflict Research, Uppsala University

Database and digital archive for the next generation of conflict data

2010-2012

The purpose of the digitalization project "Database and digital archives for second generation conflict data" can be divided into two distinct parts. On the one hand, the funding was aimed at financing a technical solution for the large amounts of data that result from UCDP having begun to collect data on a disaggregated level, i.e. information on violence that is disaggregated below the level of country and year. On the other hand, there was a need for an infrastructure to handle the large amounts of source material that the project has accumulated over its 30 years of existence. This material was in dire need of digitalization, both to ensure its future accessibility and to enable researchers to conduct searches within the source material.
The initial plan was to store all this material in an AskSam database. Early in the process, however, it became clear that AskSam did not meet the requirements of UCDP. AskSam does not support versioning, nor does it support multiple users working on the same datasets, and its handling of relations between different datasets was also inadequate. The major reason for these shortcomings was that AskSam is a "free-form" database, which is not compatible with the UCDP data, which follow a strict relational model containing almost 100 relations. This led to the decision that AskSam was not suited for storing the geo-coded events, the main part of UCDP's new disaggregated data. AskSam does, however, have advantages for the digitalization and archiving of source material. For the actual scanning, OCR software has been used, namely ABBYY FineReader. The software has worked well and has coped with the challenges posed by the irregular and sometimes poor source material. Since the binders that contain the source material are made up of a wide range of different types of records in several different languages, well-functioning and user-friendly software has been of the utmost importance to ease the scanning and analysis of the source material.
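The contrast between a free-form store and a strict relational model can be illustrated with a minimal sketch. The table and column names below (`conflict`, `event`, `latitude`, and so on) are hypothetical and are not taken from the actual UCDP schema; the sketch only shows the kind of enforced relation between datasets that a relational database provides, here with Python's built-in `sqlite3` module.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE conflict (
            conflict_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        -- Each geo-coded event must point at an existing conflict.
        CREATE TABLE event (
            event_id    INTEGER PRIMARY KEY,
            conflict_id INTEGER NOT NULL REFERENCES conflict(conflict_id),
            event_date  TEXT NOT NULL,
            latitude    REAL NOT NULL,
            longitude   REAL NOT NULL
        );
        """
    )
    return conn


def insert_event(conn, event_id, conflict_id, event_date, latitude, longitude):
    """Insert one event; raises sqlite3.IntegrityError if the conflict is unknown."""
    conn.execute(
        "INSERT INTO event VALUES (?, ?, ?, ?, ?)",
        (event_id, conflict_id, event_date, latitude, longitude),
    )
```

In this sketch an event that refers to a non-existent conflict is rejected by the database itself, which is the kind of consistency guarantee a free-form store does not offer.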
To address the shortcomings of AskSam, UCDP decided to hire two computer programmers during the spring of 2011. These programmers were tasked with developing the foundation for a database storage system in which UCDP could easily store and update its data. The new system received further funding through a grant from the Research Council in the fall of 2011, and the work is ongoing. To secure the data during the transition and development period, UCDP made two decisions. First, to ensure versioning and storage, the Excel sheets processed by the coders were stored on a SharePoint server. Second, our in-house programmer set up a temporary database for all released geo-referenced events (disaggregated data), covering Africa. The result is available at http://www.ucdp.uu.se/ged/. The data were released in December 2011, and an update was launched in November 2012. Since its launch the site has been visited 16,661 times. Several well-established universities, research environments and state institutions have downloaded the data, e.g. Princeton, Michigan, Oxford, King's College, Columbia, SIPRI, GIGA, the Swedish police, NATO and the Pentagon.
Concerning the digitalization of UCDP's archives, limited progress has been made. Today 73 out of approximately 250 binders have been scanned. Initially the plan was to have the core staff of UCDP scan binders on top of their normal work, but this proved too time-consuming and made other tasks suffer. To make some headway, interns were hired for the summers of 2011 and 2012. The time required for scanning and digitalizing was substantially underestimated because the binders of primary material differ greatly from one another. As a consequence, the person doing the digitization has to exercise extreme care and attention for the result to be usable.
The binders containing the source material used to code UCDP's dataset and database are organized by date and country, and each contains vastly different types of source material. Some of these binders contain unique source material that is impossible to access in any other way, such as extracts from regional and local newspapers from conflict areas, NGO reports, and country- and subject-specific articles from various news organizations. As a result, all time estimates made before the process began proved far too optimistic.
Current estimates, based on the experience gained from the two rounds of interns and from the scanner operator currently employed by UCDP, are that one binder takes an average of 5-10 working days to scan and process. The process is extremely time-consuming, especially as each scanned and digitalized article has to go through manual quality control. This control checks that each article corresponds to the original; in some cases, such as old newspaper clippings and faxed or copied material, the process has to be repeated a number of times because the quality of the resulting digital material is extremely low.
Within the digitalization project a complete protocol has been developed and tested with good results. Currently UCDP employs a scanner operator as well as a part-time archiving and digitalization consultant, partly to scan more binders and partly to process the digitized material so that it is accessible to coders and researchers within UCDP. The process is currently financed by UCDP's core budget, but the project plans to apply for new grants in order to finalize the work.
Of the scanned archival material, some has made its way into UCDP's geocoding project. Throughout the process of geocoding conflict event data in India, the scanned material was used to strengthen and complete the main UCDP sources that form the basis for UCDP's data. The pilot project thus proved valuable and successful, substantially reducing the workload for the coders and project workers.
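To give a sense of scale, the figures reported here (73 of approximately 250 binders scanned, 5-10 working days per binder) can be combined into a rough estimate of the remaining effort. The short calculation below is only a back-of-the-envelope sketch: it assumes that the per-binder rate stays constant and that the total of 250 binders is exact, neither of which the report guarantees.

```python
# Figures taken from the report; the binder total is approximate.
TOTAL_BINDERS = 250
SCANNED_BINDERS = 73
DAYS_PER_BINDER = (5, 10)  # low and high estimate, in working days


def remaining_effort(total, scanned, days_per_binder):
    """Return (binders left, low estimate, high estimate) in working days."""
    remaining = total - scanned
    low, high = days_per_binder
    return remaining, remaining * low, remaining * high


binders_left, low_days, high_days = remaining_effort(
    TOTAL_BINDERS, SCANNED_BINDERS, DAYS_PER_BINDER
)
print(f"{binders_left} binders left: {low_days}-{high_days} working days")
```

Under these assumptions, roughly 177 binders remain, corresponding to somewhere between 885 and 1,770 working days of scanning and processing.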

Publications

Melander, Erik, and Ralph Sundberg, 2011, “Climate Change, Environmental Stress, and Violent Conflict: Tests introducing the UCDP Georeferenced Event Dataset”, Paper presented at the International Studies Association, March 16-19, Montreal, Canada.

Dulic, Tomislav, 2010, “Geocoding Bosnian violence: A note on methodological possibilities and constraints in the production and analysis of geocoded event data”, Paper presented at the International Studies Association, 17-20 February, New Orleans, United States.

Grant administrator
Uppsala University
Reference number
In10-0138:1
Amount
SEK 700,000
Funding
RJ Infrastructure for research
Subject
Unspecified
Year
2010