Guidelines for Informing Policy via Data
CHAPTER 4 - DEVELOPING INDICATORS AND OTHER STATISTICS FROM PRE-EXISTING DATA
Pre-existing data are simply data that already exist. As discussed in Chapter 3, those data can take many forms: they can be qualitative, and analysed using qualitative methods or coding to allow for statistics to be developed from them, or quantitative. They can be already computerised and available within an electronic dataset, or they may still be in paper form. They may perfectly match a researcher's goals, or they may only partially address the researcher’s needs. They may be easily accessible, or copyrighted and available only through a lengthy process, or not available at all due to confidentiality concerns.
4.1 ADVANTAGES AND DISADVANTAGES OF PRE-EXISTING DATA
The distinct advantage of pre-existing data is that the expense of data collection is avoided or greatly reduced. Even if the data exist only in paper form, the cost of coding and/or entering the data into a database is far less than the cost of collecting data directly from a human population.
The list of concerns is unfortunately longer. When deciding whether to use pre-existing data, the researcher must consider the following issues:
- Do the data represent the population the researcher wishes to study? A researcher might find a dataset from a survey that seems to ask the perfect questions for the research, but then discover that the survey covered only older adults, or was restricted to a particular geographic location, and is therefore not perfectly aligned with the research goals.
- Were the data collected using best practices? Best practices for data collection are discussed in Chapters 5-9 of this manual. Those chapters can be consulted when determining the feasibility of using pre-existing data. For example, pre-existing data that claim to be representative of a population should have been collected from a random sample of that population. The questionnaire used for data collection should have undergone proper testing, and the interviewers should have received proper training. The data-entry step should have included quality-control procedures. In the case of data from a random sample survey, the response rate (that is, the share of sampled individuals who completed questionnaires) should be available.
- Do the data collected provide the researcher with the information needed to answer the researcher's questions? As an example, if the researcher is interested in estimating a fertility rate for a population, a dataset recording the respondents' children and their birthdates might at first seem ideal. If no data were collected on miscarriages and stillbirths, however, the data might not be complete enough to answer all of the researcher's questions.
- Are the data accessible to the researcher, and, if so, at what price? Many sources of data, some of which will be discussed below, are freely available via the Internet. Other sources of data cost money and may require a substantial waiting period for delivery. Commercial databases are almost always costly, and some government databases are also expensive.
There might be other issues that prevent the researcher from accessing the data. Political resistance to the research might impair access to government data, for example; confidentiality constraints might also prevent access to data.
- Are the data well explained? Chapter 9 of this manual describes two auxiliary sets of information that should accompany a dataset. The first is the data key, which describes each variable in the dataset in detail. The second is the metadata file, which contains information about how the data were collected, response rates, the time each interview took, and so on. Both of those sets of information will help the researcher to both judge the quality of the dataset and use the dataset effectively.
For example, if sample data are to be used for an analysis, then the researcher must understand the sample design and the resulting sampling error so that appropriate confidence intervals can be created for statistics developed from the data. If multiple separate datasets are to be analysed, or a single dataset containing data from multiple time points is to be analysed, then the researcher must understand how comparable the variables are across the different time points and/or datasets.
- Are there ethical considerations that must be addressed before using the data? At the very least, permission from the owner of the data should be secured before the data are used, and care must be taken to ensure that the confidentiality of the data is maintained.
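Two of the checks in the list above, the response rate and the confidence interval around a sample statistic, can be illustrated with a short sketch. It is not a prescribed method: it assumes a simple random sample and a normal approximation for a proportion, whereas a dataset from a stratified or clustered design would need a design-based variance estimate taken from its metadata. The function names and the figures in the usage example are hypothetical.

```python
import math


def response_rate(completed: int, sampled: int) -> float:
    """Share of sampled individuals who completed a questionnaire."""
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    if not 0 <= completed <= sampled:
        raise ValueError("completed must lie between 0 and sampled")
    return completed / sampled


def proportion_confidence_interval(successes: int, n: int, z: float = 1.96):
    """Approximate confidence interval for a proportion.

    Assumes simple random sampling and the normal approximation;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)
    # Clip to [0, 1] because a proportion cannot fall outside that range.
    lower = max(0.0, p - z * standard_error)
    upper = min(1.0, p + z * standard_error)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical survey: 1,000 people sampled, 750 completed questionnaires,
    # and 300 of those respondents reported the characteristic of interest.
    print(f"Response rate: {response_rate(750, 1000):.1%}")
    low, high = proportion_confidence_interval(300, 750)
    print(f"Estimated proportion: {300 / 750:.3f} (95% CI {low:.3f} to {high:.3f})")
```

A low response rate does not invalidate a dataset by itself, but it signals that the interval above may understate the real uncertainty, since it accounts only for sampling error and not for non-response bias.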
It is extremely unlikely that pre-existing data will be perfectly suited to the researcher's goals or free of all the issues discussed above. However, by understanding the strengths and limitations of pre-existing data, the researcher can use the data appropriately.
Once the researcher obtains the pre-existing data, much work may remain. Pre-existing census, survey, or administrative records data will often arrive in a database ready for analysis, but that is not guaranteed. Other types of data, such as events-based data or data from expert judgments, may require a great deal of processing prior to analysis. For example, qualitative data may need to be coded via a controlled vocabulary in order to allow a quantitative assessment. While this chapter will not go into detail about the steps for setting up a coding procedure, further reading on the subject will be listed at the end of the chapter.
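As a rough illustration of coding qualitative data against a controlled vocabulary, the sketch below assigns codes to free-text records by keyword and then tallies how many records received each code. The vocabulary, the codes, and the sample records are all hypothetical, and whole-word keyword matching is only a stand-in for a real coding procedure, which would normally rely on trained human coders and checks of inter-coder reliability.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: each code maps to the keywords that trigger it.
VOCABULARY = {
    "water_access": {"well", "water", "borehole"},
    "school_attendance": {"school", "teacher", "classroom"},
}


def code_record(text: str, vocabulary=VOCABULARY) -> set:
    """Return the set of codes whose keywords appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {code for code, keywords in vocabulary.items() if words & keywords}


def tally_codes(records, vocabulary=VOCABULARY) -> Counter:
    """Count how many records were assigned each code (at most once per record)."""
    counts = Counter()
    for text in records:
        counts.update(code_record(text, vocabulary))
    return counts


if __name__ == "__main__":
    # Hypothetical free-text records, for example from interview notes.
    records = [
        "The village well has been dry since March.",
        "Children walk two hours to school; the teacher is often absent.",
        "No classroom has running water.",
        "Roads were flooded.",
    ]
    for code, count in sorted(tally_codes(records).items()):
        print(code, count)
```

Counting each code at most once per record keeps the tally interpretable as "number of records mentioning the topic", which can then be expressed as a proportion of all records.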