Guidelines for Informing Policy via Data

CHAPTER 4 - DEVELOPING INDICATORS AND OTHER STATISTICS FROM PRE-EXISTING DATA


Pre-existing data are simply data that already exist. As discussed in Chapter 3, those data can take many forms: they can be qualitative, and analysed using qualitative methods or coding to allow for statistics to be developed from them, or quantitative. They can be already computerised and available within an electronic dataset, or they may still be in paper form. They may perfectly match a researcher's goals, or they may only partially address the researcher’s needs. They may be easily accessible, or copyrighted and available only through a lengthy process, or not available at all due to confidentiality concerns.

4.1 ADVANTAGES AND DISADVANTAGES OF PRE-EXISTING DATA

The distinct advantage of pre-existing data is that they cost far less to obtain than newly collected data. Even if the data only exist in paper form, the cost of coding and/or entering the data into a database is far less than the cost of collecting data directly from a human population.

The list of concerns is unfortunately longer. When deciding whether to use pre-existing data, the researcher must consider the following issues:

  • Do the data represent the population the researcher wishes to study? A researcher might find a dataset from a survey that seemed to ask the perfect questions for his/her research, but then discover that the survey covered only older adults, or was restricted to a particular geographic location, and is therefore not perfectly aligned with his/her research goals.

  • Were the data collected using best practices? Best practices for data collection are discussed in Chapters 5-9 of this manual. Those chapters can be consulted when determining the feasibility of using pre-existing data. For example, pre-existing data that claim to be representative of a population should have been collected from a random sample of that population. The questionnaire used for data collection should have undergone proper testing, and the interviewers proper training. The data-entry step should have included quality-control procedures. In the case of data from a random sample survey, the response rate (that is, the proportion of sampled individuals who completed questionnaires) should be available.

  • Do the data collected provide the researcher with the information needed to answer the researcher's questions? As an example, if the researcher is interested in estimating a fertility rate for a population, a dataset recording the respondents' children and their birthdates might at first seem ideal. If no data were collected on miscarriages and stillbirths, however, the data might not be complete enough to answer all of the researcher's questions.

  • Are the data accessible to the researcher, and, if so, at what price? Many sources of data, some of which will be discussed below, are freely available via the Internet. Other sources of data cost money and may require a substantial waiting period for delivery. Commercial databases are almost always costly, and some government databases are also expensive.

    There might be other issues that prevent the researcher from accessing the data. Political resistance to the research might impair access to government data, for example; confidentiality constraints might also prevent access to data.

  • Are the data well explained? Chapter 9 of this manual describes two auxiliary sets of information that should accompany a dataset. The first is the data key, which describes each variable in the dataset in detail. The second is the metadata file, which contains information about how the data were collected, response rates, the time each interview took, and so on. Both of those sets of information will help the researcher to both judge the quality of the dataset and use the dataset effectively.

    For example, if sample data are to be used for an analysis, then the researcher must understand the sample design and the resulting sampling error so that appropriate confidence intervals can be created for statistics developed from the data. If multiple separate datasets are to be analysed, or a single dataset containing data from multiple time points is to be analysed, then the researcher must understand how comparable the variables are across the different time points and/or datasets.

  • Are there ethical considerations that must be addressed before using the data? At the very least, permission from the owner of the data should be secured before the data are used, and care must be taken to ensure that the confidentiality of the data is maintained.

It is extremely unlikely that pre-existing data will be perfectly suited to the researcher's goals or free of all the issues discussed above. However, by understanding the strengths and limitations of pre-existing data, the researcher can use the data appropriately.
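
Two of the checks in the list above, the response rate and the confidence interval around a sample estimate, reduce to simple arithmetic. The following is a minimal sketch in Python, assuming a simple random sample and a normal approximation. The function names and all figures are hypothetical. Data from a complex sample design (stratified or clustered, for example) would need design-based variance estimates instead.

```python
import math
from typing import Tuple


def response_rate(completed: int, sampled: int) -> float:
    """Proportion of sampled individuals who completed a questionnaire."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    return completed / sampled


def proportion_ci(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion.

    Assumes a simple random sample; z = 1.96 gives roughly 95% coverage.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)
    # Clip to [0, 1] because a proportion cannot fall outside that range.
    return max(0.0, p - z * standard_error), min(1.0, p + z * standard_error)


# Hypothetical figures, for illustration only.
rate = response_rate(completed=800, sampled=1000)
low, high = proportion_ci(successes=320, n=800)
print(f"response rate: {rate:.2f}")
print(f"approximate 95% confidence interval: ({low:.3f}, {high:.3f})")
```

If the metadata file does not report enough about the sample design to justify the simple-random-sample assumption, an interval computed this way may be too narrow.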

Once the researcher obtains the pre-existing data, much work may remain. Pre-existing census, survey, or administrative records data will often arrive in a database ready for analysis, but that is not guaranteed. Other types of data, such as events-based data or data from expert judgments, may require a great deal of processing prior to analysis. For example, qualitative data may need to be coded via a controlled vocabulary in order to allow a quantitative assessment. While this chapter does not go into detail about the steps for setting up a coding procedure, further reading on the subject is listed at the end of the chapter.
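
To make the idea of a controlled vocabulary concrete, the sketch below assigns codes to short narrative statements and then counts the codes. It is only an illustration in Python: the vocabulary, the keywords, and the statements are all invented, and simple keyword matching stands in for the judgment of trained coders that a real coding procedure relies on.

```python
from collections import Counter

# Hypothetical controlled vocabulary: each code maps to keywords that signal it.
CONTROLLED_VOCABULARY = {
    "DISPLACEMENT": ("forced to leave", "expelled", "fled"),
    "DETENTION": ("detained", "arrested", "imprisoned"),
    "PROPERTY_DESTRUCTION": ("burned", "destroyed", "looted"),
}


def code_statement(text):
    """Return the sorted list of vocabulary codes whose keywords appear in the text."""
    lowered = text.lower()
    return sorted(
        code
        for code, keywords in CONTROLLED_VOCABULARY.items()
        if any(keyword in lowered for keyword in keywords)
    )


# Invented narrative statements, for illustration only.
statements = [
    "The family was forced to leave and their house was burned.",
    "Two men were arrested at the checkpoint.",
    "Villagers fled after shops were looted.",
]

coded = [code_statement(statement) for statement in statements]

# Once every statement carries codes, the codes can be counted like any other variable.
code_counts = Counter(code for codes in coded for code in codes)
print(code_counts)
```

The point of the fixed vocabulary is that every statement is described with the same limited set of terms, which is what makes the resulting counts comparable.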

4.2 EXAMPLES OF PRE-EXISTING DATA

4.2.1 A Trail of Paper

In May 1999, the Science and Human Rights Program of the American Association for the Advancement of Science (AAAS) was researching events in the former Yugoslavia. The goals of the research were to determine how many ethnic Albanians had been and were being forced to leave Kosovo, and who was responsible for their displacement. To begin to seek out data that could answer those questions, AAAS sent Fritz Scheuren and Patrick Ball to study refugees crossing the Albania-Kosovo border.

During that visit, Fritz noticed that the Albanian border guards were recording data about the parties crossing into Albania. Fritz and Patrick soon discovered that the guards were registering every refugee they could in detailed border records. They were successful in doing so except during periods of shooting or shelling on the Kosovo side of the border, when refugees would run through the border as quickly as possible.

Although the AAAS team did not have access to those records immediately, they were able to gain access to them later that year, with permission of the Albanian government. What they found was a large pile of paper (see Figure 4.2.1). Using a scanner at the border, a team captured 690 pages of records. The resulting electronic copies of the pages were of high quality, but several time-periods were missing, including two days in mid-May.

Fortunately, the United Nations High Commissioner for Refugees (UNHCR) had conducted an independent count of people on the road passing the border, and had published daily tallies during the conflict. That secondary source of information, combined with the border records, was used to create a single dataset of extraordinarily high quality, containing approximately 404,000 records. From that dataset, a time series of refugee movement could be calculated.



Figure 4.2.1: Records maintained by the Albanian border guards at Morina, March-June 1999. Source: www.amstat.org (accessed March 31 2007 [1])

Using the data they found at the Albania-Kosovo border, combined with data found in press releases of the United Nations High Commissioner for Refugees, Patrick Ball and his team developed indicators of refugee movement. Those indicators were counts of ethnic Albanians who had left their homes, by two-day periods, between March and May of 1999. The resulting time series is shown in Figure 4.2.2.



Figure 4.2.2: Refugee movement from Kosovo into Albania, March-June 1999. Source: Killings and Refugee Flow in Kosovo March - June 1999: A Report to the International Criminal Tribunal for the Former Yugoslavia (accessed March 31 2007 [2])
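
An indicator of the kind described above, a count of people by two-day period, can be derived from individual records with a few lines of code. The Python sketch below is illustrative only: the record layout (a crossing date and a party size), the function name, and the figures are assumptions, and it is not the procedure the AAAS team used.

```python
from collections import defaultdict
from datetime import date, timedelta


def two_day_counts(records, start):
    """Sum people per consecutive two-day period, keyed by each period's first day.

    records: iterable of (crossing_date, people_in_party) pairs.
    start:   first day of the first two-day period.
    """
    totals = defaultdict(int)
    for crossing_date, people in records:
        offset = (crossing_date - start).days
        if offset < 0:
            raise ValueError(f"{crossing_date} is before the start date {start}")
        # Integer division maps day offsets 0-1 to period 0, 2-3 to period 1, and so on.
        period_start = start + timedelta(days=(offset // 2) * 2)
        totals[period_start] += people
    return dict(sorted(totals.items()))


# Invented records for illustration; these are not the actual border data.
records = [
    (date(1999, 3, 28), 6),
    (date(1999, 3, 29), 4),
    (date(1999, 3, 30), 9),
    (date(1999, 4, 1), 3),
]

series = two_day_counts(records, start=date(1999, 3, 28))
for period_start, people in series.items():
    print(period_start.isoformat(), people)
```

Periods with no records do not appear in the result at all. In a real analysis, the researcher would need to decide whether such a gap means that no one crossed or that the records for those days are missing.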

4.2.2 Non-governmental and Intergovernmental Sources of Data

For researchers working in countries where past or current human rights abuses and other governance issues are of concern to policy-makers, non-governmental organisations, academia, governmental bodies, or private firms might have data that the researcher can use. For example, as part of the Metagora pilot project in Palestine, the Palestinian Central Bureau of Statistics partnered with local academia and non-governmental organisations to develop a database that included data on the right to education.

Some additional examples of freely accessible non-governmental and government sources of data include:

The list here is not comprehensive but serves as an example of the depth and breadth of data and indicators available, from individual-level records to national aggregates, in terms of depth, and from general-governance indicators to coral-reef policies, in terms of breadth.

In many cases, data collected by non-governmental organisations will be qualitative; for the researcher to use those data to develop quantitative indicators and statistics, he/she will need to code the data prior to data entry. Examples of best practices for coding, including the development of a controlled vocabulary, are given in the recommended reading listed below.

4.2.3 United Nations Statistics Division (UNSD)

The United Nations Statistics Division is an especially rich resource for researchers intending to find or collect data. It has both a free-access and a subscriber-only archive. The freely accessible archive includes information on all of the national censuses and links to the web sites of national statistical offices, where available. Other information freely available at the United Nations web site includes economic data, Millennium Development Goals indicators, and other social indicators. Further data and indicators are free after a registration process. More information is available at the United Nations Statistics Division web site.

For those researchers who determine that they must collect their own data, the United Nations Statistics Division has prepared a series of manuals on best practices for data collection. Those manuals are available via the UNSD Methods and Classifications web site.

4.3 SUMMARY

Pre-existing data can be an excellent resource for the development of statistics and indicators if they are of good quality and appropriate to a researcher's goals. They may reduce or eliminate the need for a new data-collection project, or they may help strengthen an argument based on a newly collected set of data. But sometimes there simply aren't data already available that can be used to answer the research question of interest. In that case, a data-collection project must occur. Best practices for data collection from human populations will be discussed in the next several chapters of this manual.

4.4 RECOMMENDED READING

Ball, P. and Harrison, A., "Asking and Answering Hard Questions: Technology in the Service of Human Rights." China Rights Forum, 2006, vol. 2.

An article highlighting the use of pre-existing data sources, in some cases in combination with new data.

Ball, P., Spirer, H.F., and Spirer, L., Making the Case: Investigating Large-scale Human Rights Violations Using Information Systems and Data Analysis, American Association for the Advancement of Science, Washington, DC, 2000.

Detailed and technical discussions about creating databases from pre-existing qualitative human rights violations data.

Boslaugh, S., Secondary Data Sources for Public Health: A Practical Guide (Practical Guides to Biostatistics and Epidemiology), Cambridge University Press, Cambridge, UK, 2007.

Use of pre-existing quantitative health data in a new analysis for which the data were not initially collected.

de Vaus, D., Analyzing Social Science Data: 50 Key Problems in Data Analysis, Sage Publications, Inc., Thousand Oaks, CA, 2000.

A detailed resource on the coding and analysis of qualitative social science data.

Dueck, J., Guzman, M., and Verstappen, B., HURIDOCS Events Standard Formats: A Tool for Documenting Human Rights Violations, HURIDOCS Advice and Support Unit/Secretariat, Versoix, Switzerland, 2001.

An example of how pre-existing qualitative human rights data might be coded and entered into a database.

Dueck, J., Guzman, M., and Verstappen, B., Micro-thesauri: A Tool for Documenting Human Rights Violations, HURIDOCS Advice and Support Unit/Secretariat, Versoix, Switzerland, 2001.

An example of how pre-existing qualitative human rights data might be coded and entered into a database.

Human Rights Data Analysis Group, Controlled Vocabulary Definition (accessed 31 March 2007).

An example of how coding can be used to transform qualitative data into quantitative data.

Kiecolt, K.J., and Nathan, L.E., Secondary Analysis of Survey Data, Sage Publications, Inc., Thousand Oaks, CA, 1985.

This resource discusses the use of existing quantitative (survey) data in a new analysis for which the data were not initially collected.


1. Ball, P. and Asher, J., "Statistics and Slobodan: Using Data Analysis and Statistics in the War Crimes Trial of Former President Milosevic," Chance, 15, 2002, pp. 17-24.

2. Ball, P., Betts, W., Scheuren, F., Dudukovich, J., and Asher, J., Killings and Refugee Flow in Kosovo March - June 1999: A Report to the International Criminal Tribunal for the Former Yugoslavia, American Association for the Advancement of Science, Washington, DC, 2002, p. 5.

