About Microdata
What is Microdata?

Microdata vs. Aggregated Data

One of the best ways to understand microdata is to compare it with aggregated data.

Aggregated Data

  • Aggregated data goes by several other names: summary data, table data, tabular data, or compiled data.
  • Aggregated data is normally published in voluminous statistical abstracts and data compendiums, in print or on CD-ROMs and diskettes. In recent years, many of these have also become available as downloadable PDF and HTML files on the WWW.
  • The defining feature of this kind of data is that the numeric values have been aggregated under collective units and groups such as racial groups, geographic units (census tract, county, state, region, country, etc.), ages, years, institutions, or occupations.
  • In aggregated data, individual cases or observations are missing. While such data can be read by spreadsheet software, most statistical techniques and analyses require individual-record data.
Microdata
  • Microdata is unaggregated original sample data that contains every anonymized individual record (person, household, etc.) in the sample.
  • Microdata files are often very large and are rarely available in print.
  • Microdata files contain much more detail than aggregated data (perhaps over 40 variables, with fine detail on occupation, household composition, or income), but they cover only a sample of the population and/or are strictly limited to a smaller geographic area.
  • Public-use microdata files have been either distributed on magnetic tapes and CDs in compressed form, or made available for users to extract from remote sites.
  • These files are often called "machine-readable data files", since they can be read by a computer as data sets that can be manipulated and analyzed with statistical software (e.g., SPSS, SAS, and Stata).
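
The contrast between the two kinds of data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, variable names, and values below are invented and do not come from any real microdata file. It shows that individual records can always be collapsed into a summary table, while a question that crosses several variables can be answered only from the individual records.

```python
from collections import Counter

# Hypothetical anonymized person records (microdata): one row per individual.
microdata = [
    {"county": "A", "age": 34, "occupation": "teacher"},
    {"county": "A", "age": 61, "occupation": "farmer"},
    {"county": "B", "age": 27, "occupation": "teacher"},
    {"county": "B", "age": 45, "occupation": "nurse"},
    {"county": "B", "age": 70, "occupation": "farmer"},
]


def aggregate_by(records, key):
    """Collapse individual records into counts per group (aggregated data)."""
    return dict(Counter(record[key] for record in records))


# Aggregated tables: the individual cases are no longer visible in these.
by_county = aggregate_by(microdata, "county")
by_occupation = aggregate_by(microdata, "occupation")

# A question that crosses three variables at once. The two tables above
# cannot answer it, but the individual records can.
young_teachers_in_b = sum(
    1
    for record in microdata
    if record["county"] == "B"
    and record["occupation"] == "teacher"
    and record["age"] < 30
)

print(by_county)
print(by_occupation)
print(young_teachers_in_b)
```

Going the other way, from the two summary tables back to the individual records, is not possible, which is why most statistical modeling starts from microdata.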

Click here to see an example microdata (raw data) file on the website of the Data and Program Library Service at the University of Wisconsin - Madison.

Click here to see a microdata file that has been read into SPSS.

 

Codebooks and Microdata

Raw data in microdata files are often in ASCII format and compressed. In a microdata set, the numbers may or may not be delimited by spaces, commas, line breaks, or tabs, so knowing which number corresponds to which variable is essential. A codebook or data dictionary is therefore a must for downloading and using a microdata file. Without a codebook or data dictionary to specify the exact location (columns and rows) of each variable and its values, a data file would be just a collection of meaningless numbers.
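
The role of the codebook can be sketched as follows. The column layout, variable names, value labels, and the raw line are all hypothetical, chosen only to show how an undelimited ASCII record becomes meaningful once column positions and codes are known. A real file must be read with the layout given in its own codebook.

```python
# Hypothetical codebook: variable name -> (start column, end column),
# 1-based and inclusive, as column positions are usually listed in print.
CODEBOOK = {
    "state": (1, 2),
    "age": (3, 5),
    "sex": (6, 6),
    "income": (7, 12),
}

# Hypothetical value labels for a coded variable.
SEX_LABELS = {"1": "male", "2": "female"}


def parse_record(line, codebook):
    """Slice one fixed-width line into named fields using the codebook."""
    return {
        name: line[start - 1:end].strip()
        for name, (start, end) in codebook.items()
    }


# One undelimited raw record: unreadable without the codebook above.
raw_line = "170342 45000"

record = parse_record(raw_line, CODEBOOK)
record["age"] = int(record["age"])        # numeric variable
record["income"] = int(record["income"])  # numeric variable
record["sex"] = SEX_LABELS[record["sex"]]  # coded variable -> label

print(record)
```

Statistical packages such as SPSS, SAS, and Stata do the same job through their own data-definition statements, which are typically written from the codebook.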

These are some of the resources discussing or demonstrating the importance of codebooks for acquiring and using microdata or other machine-readable statistical data sets:

  • Example of a data set accompanied by a codebook

  • An example demonstrating how a codebook gives meaning to a data set, in a PowerPoint presentation entitled "Statistical Literacy and the Role of Data Services: The Social Sciences", by Elissa Cochran, Ann Fiegen, Chris Kollen, and Cathy Larson.

PUMS, IPUMS, PUMF, and SAR

It was the advent of computers that made it possible to process, store, and distribute anonymized electronic data. The USA led the field in the release of microdata. The first microdata files were released from the 1960 U.S. Census, although retrospective microdata files were later extracted for earlier years. The U.S. microdata was first called the Public Use Sample (PUS), and was renamed the Public Use Microdata Sample (PUMS) in 1980. Canada first released public use microdata files (PUMFs) from the 1971 Census and has continued this policy for every quinquennial census since then. Australia first produced microdata files for its 1981 Census. In the U.K., the practice of releasing samples of anonymized records was accepted by the Census Offices in 1989 and was heavily influenced by the U.S. and Canadian experiences.

Public Use Microdata Samples (PUMS) in the USA


The US Census Bureau has released census microdata for every census since 1960. PUMS data files contain records representing 1 percent and 5 percent samples of the housing units in the United States and the persons in them. Each PUMS file provides records for states and some of their geographic levels, covering the full range of population and housing information collected in the 1990 Census. The microdata files differ in sample size, geography, and questions. To maintain confidentiality, geography is restricted to large units, which protects against a combination of characteristics that would identify an individual.

Three different files were released from the 1990 Census: 5 percent and 1 percent samples of housing units, and a 3 percent sample of the elderly. The 5 percent and 1 percent samples have the same content and are structured so that the relationships between individuals in the same household are retained. The two files differ in the geographic coverage of their public use microdata areas (PUMAs):
 

  • The 5% sample identifies every state and various subdivisions of states, each with at least 100,000 persons. These PUMAs were primarily based on counties, and may be whole counties, groups of counties, or places. Nationwide, this gives a sample of over 12 million persons and over 5 million housing units.

  • The 1% sample was based primarily on metropolitan/nonmetropolitan areas, and contains PUMAs made from whole central cities, whole Metropolitan Statistical Areas (MSAs) or Primary Metropolitan Statistical Areas (PMSAs), MSAs or PMSAs outside the central city, groups of MSAs or PMSAs, and groups of areas outside MSAs or PMSAs, each with at least 100,000 persons.

  • The 3% sample of the elderly population has the same geography as the 5% sample, but includes only households in which at least one member is aged 60 or over.
Emergence of Integrated Public Use Microdata Sample (IPUMS)


The Integrated Public Use Microdata Sample (IPUMS), created at the Social History Research Laboratory of the University of Minnesota in October 1997, is now available. The IPUMS consists of twenty-five high-precision samples of the American population drawn from fifteen federal censuses. The IPUMS combines these high-precision samples of the U.S. population into a single database spanning eleven census years from 1850 to 1990. The database includes over 15 million person records (soon to be over 50 million). Because different investigators created these samples at different times, they employed a wide variety of record layouts, coding schemes, and documentation. The IPUMS assigns uniform codes across all the samples and brings the relevant documentation into a coherent form to facilitate analysis of social and economic change.

The Social History Research Laboratory at the University of Minnesota has also proposed to internationalize IPUMS by adapting this system to census microdata samples from the countries with the highest quality censuses and the longest time-spans. IPUMS-International (IPUMSi) proposes to integrate individual-level census samples for a large number of countries into a single databank. The plan is, first, to standardize census microdata for selected countries from the 2000 round of censuses back to the earliest available date (usually the 1960s or 1970s), and then to distribute the integrated databank via the WWW, CD-ROM, or other means suitable for the delivery of massive datasets.

Public Use Microdata Files (PUMF) in Canada


Public Use Microdata Files (PUMFs) are Canadian census microdata records released by the Canadian government. In the fall of 1974, Statistics Canada announced its decision to disseminate PUMFs to researchers and policy-makers, starting from the 1971 Census. Since then, microdata files have been produced for each of Canada's censuses.

PUMFs (formerly known as Public Use Sample Tapes, or PUSTs) contain samples of anonymized responses to the long-form (2B) census questionnaires in the respective censuses. Three files are available: an Individual file, a Household and Housing file, and a Family file. Microdata files provide access to unaggregated data. However, to ensure the anonymity of the respondents, geographic identifiers are in most cases restricted to the provinces/territories and large metropolitan areas.

The sample size for the original set of microdata files from the 1971 Census was 1 per cent. This increased to 2 per cent in the 1980s for the individual file, and to 3 per cent for all three files for the 1991 Census. It then went down to 2.8 per cent for the PUMFs from the 1996 Census.
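
What these sampling fractions mean in practice can be shown with a little arithmetic. The sketch below assumes a simple equal-probability sample, so that each record stands for 1 / (sampling fraction) people. That is a simplifying assumption made here for illustration, and the count of 280 matching records is invented. Where a file supplies its own weighting variable, that variable should be used instead.

```python
def expansion_factor(sampling_fraction):
    """People represented by each sampled record (equal-probability sample)."""
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling_fraction must be in (0, 1]")
    return 1.0 / sampling_fraction


def estimate_population_count(sample_count, sampling_fraction):
    """Scale a count of sampled records up to an approximate population count."""
    return sample_count * expansion_factor(sampling_fraction)


# Sampling fractions quoted above for the 1971, 1991, and 1996 files.
FRACTIONS = {"1971": 0.01, "1991": 0.03, "1996": 0.028}

for census_year, fraction in FRACTIONS.items():
    # Hypothetical: 280 sampled records match some condition of interest.
    estimate = estimate_population_count(280, fraction)
    print(census_year, round(expansion_factor(fraction), 1), round(estimate))
```

The same number of matching records therefore implies a smaller population estimate as the sampling fraction grows, and estimates for small subgroups become more reliable with the larger samples.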

Samples of Anonymised Records (SARs) in the U.K.

In the U.K., samples of microdata were produced for the first time following the 1991 Census. Two SARs have been extracted from the 1991 Census for the U.K.:

  • The 2% Individual SAR - contains anonymized records of 1.1 million individuals; covers visitors and residents in private households and communal establishments; geographical detail covering 278 SAR areas; full range of census topics on individuals, plus summary information about households.
  • The 1% Household SAR - contains anonymized records of 216 thousand households and the 542 thousand persons within them; allows linkage between household and family members; geography limited to standard regions plus inner London and outer London; full range of census topics, plus derived variables at household and family level.

Additional information currently includes Cambridge Occupational Scores and a lifestage variable (both files), plus a population weighting factor (individual file).

Similar files are planned for the 2001 Census of Britain with the sample size increasing to 3 per cent. Requests have been made for SARs to be released from the British censuses prior to the 1991 Census.

 

Is Microdata What You Are Looking For?

If you are looking for statistical data sets, and you feel your data and research have to meet one or more of the following criteria, then microdata or PUMS (and its international counterparts) might fit your data needs.
  • You want to base your research on very large samples and reach an inference or description of broad scope;
  • You want to have your own choice of units of analysis and population, or the possibility to analyze subgroups of the population based on your own selection from the sample of individual records;
  • You want to have detailed variable categorizations that allow you to develop your own classifications according to your hypotheses or the patterns you may find with statistical modeling;
  • You would like your data to be directly readable by a variety of statistical software, so that it can be manipulated and/or analyzed with those tools (e.g., SAS, SPSS, or Stata).
If you just need to find one or several data points, such as the 1990 populations of the cities of Champaign and Urbana, Illinois, then microdata is not for you. Instead, you will need to use published aggregated data tables to find these summary numbers.
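
Two of the criteria listed above, choosing your own subgroup and building your own classification, can be sketched in Python. The records, the age range, and the income cut-offs below are all hypothetical and serve only to show the kind of freedom individual records provide. A published table would fix both the subgroup and the categories in advance.

```python
from collections import Counter

# Hypothetical anonymized person records extracted from a microdata file.
records = [
    {"age": 23, "income": 18000, "occupation": "clerk"},
    {"age": 37, "income": 52000, "occupation": "engineer"},
    {"age": 45, "income": 31000, "occupation": "teacher"},
    {"age": 58, "income": 15000, "occupation": "farmer"},
    {"age": 71, "income": 22000, "occupation": "retired"},
]


def income_band(income):
    """Researcher-defined classification; the cut-offs are illustrative only."""
    if income < 20000:
        return "low"
    if income < 50000:
        return "middle"
    return "high"


# Researcher-defined subgroup: persons aged 25 to 64.
working_age = [r for r in records if 25 <= r["age"] <= 64]

# Tabulate the custom classification within the custom subgroup.
bands = Counter(income_band(r["income"]) for r in working_age)

print(len(working_age))
print(dict(bands))
```

Changing the age range or the income cut-offs requires only a change to the code, not a new data request.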




Please direct all comments or inquiries to the Government Documents Library
http://www.library.uiuc.edu/doc/newpages/microdata/microdata.htm
last updated 7/6/2001