American Community Survey 2005-2009

Getting It All - Working with the Full Dataset

NOTE: The 2009 scripts were used as a basis for creating a set of functions that can be used or adapted for importing multiple Census data products. These meta-scripts (along with, currently, data import scripts for ACS 2010 5-year and ACS 2010 Selected Population tables) are now available on GitHub at https://github.com/leehach.

The American Community Survey dataset is massive. The data files unzip to fill 50 GB of hard drive space. A researcher who wants to perform a national analysis will have to combine data from 51 files, and one who wants to analyze multiple subjects (transportation, income, etc.) will have to combine data from up to 117 different "sequence" files. While the Census Bureau provides SAS macros to combine the data files, not everyone uses SAS. At the Center for Urban Research, we need to get this data into SPSS, R, ArcGIS, and Excel. Different researchers use different tools, and the data store should not be tied to a specific tool. For some time now we have used PostgreSQL, a widely used open source relational database management system with a very strong developer and user community, to manage a wide variety of datasets, including Census 2000, IPUMS-USA, and Home Mortgage Disclosure Act data.

The following SQL scripts can be run on a PostgreSQL server to create the table shells and import the data into the database. The SQL has only been tested on PostgreSQL; no attempt has been made to write the scripts in generic, standards-compliant SQL. Since they are GPLed, you should feel free to modify and release them for another RDBMS. However, if anyone with experience with other RDBMSs wants to test the scripts as-is and help rewrite them in generic SQL, please feel free to contact the maintainer at lhachadoorian@gc.cuny.edu.

Download the scripts:

PostgreSQL naming is case insensitive unless the user forces the creation of case-sensitive table or column names. Since case-sensitive object names unnecessarily complicate querying, all objects created by these scripts use case-insensitive naming. Other than that, the scripts preserve as much as possible the structure and naming of the original files, so that the user can rely on Census Bureau documentation when working with this data. Each sequence (1-117) is created as a separate database table, although estimates and margins of error, released as separate files by the Census Bureau, are loaded into the same table. Also, while the data were released split by state and by geography (tracts and block groups separate from larger-area geographies), these scripts recombine the data by sequence. That is, each database table (an ACS sequence) holds all states and all geographies.
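
For example, PostgreSQL folds unquoted identifiers to lowercase, so queries can be written in whatever case is convenient (the table name below is illustrative):

    -- These two queries are equivalent, because unquoted identifiers
    -- are folded to lowercase:
    SELECT logrecno FROM acs2009_5yr.geoheader;
    SELECT LOGRECNO FROM ACS2009_5YR.GEOHEADER;

    -- Only a double-quoted identifier is case sensitive, and it would
    -- then have to be quoted in every subsequent query; these scripts
    -- avoid creating any such names:
    -- CREATE TABLE "GeoHeader" (...);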

However, while the data is released grouped into sequences, it is much more intuitive to interact with the data by subject table. Therefore each subject table is defined as a database view. For example, subject table B07204 - GEOGRAPHICAL MOBILITY IN THE PAST YEAR FOR CURRENT RESIDENCE is created as a view named b07204. Each view and view column also has an attached comment with the description taken from the Census-supplied data dictionary, which should make the data easier to navigate in your query editor.
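
As an illustration, the pattern looks roughly like this; the sequence table name (seq0031) and the column and comment details are assumptions for the sake of the example, since the actual definitions are generated from the Census data dictionary:

    -- Define the subject table as a view over its sequence table
    CREATE VIEW acs2009_5yr.b07204 AS
    SELECT stusab, logrecno, b07204001, b07204002
    FROM acs2009_5yr.seq0031;

    -- Attach the data dictionary descriptions as comments
    COMMENT ON VIEW acs2009_5yr.b07204 IS
        'GEOGRAPHICAL MOBILITY IN THE PAST YEAR FOR CURRENT RESIDENCE';
    COMMENT ON COLUMN acs2009_5yr.b07204.b07204001 IS
        'Total:';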

  1. The geoheader script "ACS 2009 5yr schema and geoheader table.sql" must be run first; it creates a schema named acs2009_5yr. If this script is not run first, you will need to create the schema before moving on to the next step. Also, if you change the schema name in this script, you will have to change it in the first line of each of the other scripts in this set. The script creates a temp table, imports all the state geoheader files, parses the rows, and writes the data to the final table (a condensed sketch of this pattern appears after these steps).
    1. The script has a placeholder for the path to the files downloaded from the Census Bureau. The placeholder <upload_folder> should be changed as appropriate.
    2. The script assumes you will upload geoheader data for all 50 states plus the District of Columbia and Puerto Rico at the same time. If you don't want to import all states, the undesired COPY statements can be commented out or deleted.
  2. Next run "ACS 2009 5yr table and view shells.sql". This will create empty tables to hold all of the population data contained in the sequence files downloaded from the Census Bureau, as well as the subject table views described above.
    1. If you changed the schema name in the last step, remember to also change it in the first line of this script.
  3. Now import the actual data with "ACS 2009 5yr import sequence files.sql". The script creates temp tables, imports all states and all sequences, and writes the data to the final tables. The intermediate step is necessary because the COPY statement can only recognize one code representing NULL values, while the downloaded files use two different codes for missing data: an empty string indicates a geographic restriction (e.g., a subject table that is not available at the block group level), while a dot (period) indicates a value that is missing because of too few sample observations or because it was suppressed to meet nondisclosure guidelines. These import routines coerce both of these values to NULL (a condensed sketch of this two-step import appears after these steps). A data manager who wants to preserve the distinction will have to modify the import script to assign a special numeric value, or add a column to indicate the reason for the missing data.
    1. The script has a placeholder for the path to the files downloaded from the Census Bureau. The placeholder <upload_folder> should be changed as appropriate.
    2. The Census Bureau releases two separate ZIP archives, one containing tracts and block groups, the other containing all other geographies. The files in these archives contain different data but the exact same file names. They must therefore be unzipped to different folders. This script assumes you will create folders named All_Geographies_Not_Tracts_Block_Groups and Tracts_Block_Groups_Only (matching the names of the ZIP files released by the Census Bureau) within <upload_folder>.
    3. The script imports all states and sequences at once. If you want to import only specific states, individual COPY statements can be commented out by putting two hyphens (--) at the beginning of the line, or several lines can be commented out by surrounding them with /* [text to be commented out] */. Importing only specific sequences cannot be accomplished easily the way the script is written. However, the file "ACS 2009 5yr Import Sequence Files SQL Constructor.xls" contains the spreadsheet used to create the COPY statements, and a single sequence can be imported by copying a column of statements (e.g., all states for sequence 23) from the spreadsheet.
    4. Importing all states and sequences at once will take a long time: the script may run for several hours, or even a couple of days, depending on your hardware and any other load on your database server. You may therefore want to do the upload in batches, using the SQL constructor spreadsheet mentioned in the previous step to do a small group of states or sequences at a time. If you do this, you need to account for the fact that the temp tables will already contain the data from the previous upload. There are a few possible approaches:
      1. By far the easiest is to end the session after running the script to upload, e.g., a single state. Ending the session prompts PostgreSQL to delete the temp tables, so they will be created afresh in a new session.
      2. Alternatively, modify the script to empty the temp tables (with, for example, a TRUNCATE statement) before COPYing a new state or sequence into them.
      3. The data manager may want to upload all the sequence files to temp tables first, then write all the data to the final tables. This might be easiest if the data manager creates permanent tables that are dropped explicitly after the import is completed, which can be accomplished by removing the keyword TEMPORARY from each CREATE TABLE statement (a quick search and replace in a text editor). However, if you follow this approach, performance will be significantly better if you turn off autovacuum before loading data into these (not really) temporary tables. This can be accomplished with
        ALTER TABLE tmp_seqXXXXe SET (
        autovacuum_enabled = false, toast.autovacuum_enabled = false
        );
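
For reference, here is a condensed sketch of the geoheader pattern from step 1. The file name, the fixed-width field positions, and the final table's column list are illustrative only (the real geoheader record has several dozen fields; see the Census technical documentation for the actual layout):

    -- Stage each fixed-width geoheader row as a single text column
    CREATE TEMPORARY TABLE tmp_geoheader (all_fields text);

    COPY tmp_geoheader FROM '<upload_folder>/g20095de.txt';

    -- Parse the fixed-width fields and write them to the final table
    INSERT INTO acs2009_5yr.geoheader (stusab, logrecno, name)
    SELECT substring(all_fields, 7, 2),            -- state abbreviation
           substring(all_fields, 14, 7)::int,      -- logical record number
           trim(substring(all_fields, 219, 200))   -- area name
    FROM tmp_geoheader;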
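And here is a similarly condensed sketch of the two-step sequence import from step 3, for a single state and a hypothetical one-cell sequence; real sequence tables have hundreds of columns, and the file name shown is only an example of the Census naming convention:

    -- Stage the raw file; the data cell is held as text so the '.'
    -- code can be handled below
    CREATE TEMPORARY TABLE tmp_seq0001e (
        fileid varchar(6),
        filetype varchar(6),
        stusab varchar(2),
        chariter varchar(3),
        seq varchar(4),
        logrecno int,
        b00001001 text
    );

    -- COPY maps the empty string (a geographic restriction) to NULL
    COPY tmp_seq0001e
    FROM '<upload_folder>/All_Geographies_Not_Tracts_Block_Groups/e20095de0001000.txt'
    WITH (FORMAT csv, NULL '');

    -- NULLIF coerces the '.' code (too few observations, or suppressed)
    -- to NULL while writing to the final table
    INSERT INTO acs2009_5yr.seq0001 (stusab, logrecno, b00001001)
    SELECT stusab, logrecno, NULLIF(b00001001, '.')::numeric
    FROM tmp_seq0001e;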

That's it! If you follow these steps you should have ACS data for the entire United States in your database, and if you use PostGIS, you will be able to map the data easily as well. Scripts for the ACS 2008 3-year dataset are also provided; they work similarly, although there are no tract and block group files, so all the upload data can reside in one folder. If you have any questions or comments, please feel free to contact the maintainer at lhachadoorian@gc.cuny.edu.