Population Synthesis#

POLARIS simulates individual households and people and therefore needs them as inputs. Such data is usually not directly available for privacy reasons. However, the US Census Bureau continuously samples the population of the entire US through the American Community Survey (ACS) and publishes a detailed 1% sample per year, the Public Use Microdata Sample (PUMS). To preserve privacy, all geographic information, such as home location, is aggregated to so-called Public Use Microdata Areas (PUMAs). More geographically detailed data is available for selected variables, such as the number of households or the number of people in a defined age band per census tract. These are marginal distributions, i.e., they provide information on one (or a few) variables only, and the individual-level information is lost. A population synthesis process takes the fully cross-tabulated seed sample (PUMS) at a geographically aggregate level and expands it so that it matches the marginal distributions at a geographically detailed scale. Refer to

Auld, J., & Mohammadian, A. (2010). Efficient Methodology for Generating Synthetic Populations with Multiple Control Levels. Transportation Research Record, 2175(1), 138–147. https://doi.org/10.3141/2175-16

for more detail. We discuss the data preparation steps for the POLARIS population synthesis process in the following.
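To make the idea of expanding a seed table until it matches marginal totals more concrete, here is a minimal sketch of iterative proportional fitting (IPF) on a two-variable table. It is a generic textbook illustration and is not taken from the POLARIS code. The function name `ipf_2d`, the seed counts, and the marginal targets are all made up for this example.

```python
def ipf_2d(seed, row_targets, col_targets, iterations=50):
    """Scale a 2-D seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(iterations):
        # Scale each row so that it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [value * target / total for value in table[i]]
        # Scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
    return table


# Hypothetical seed cross-tabulation and marginal totals (both sum to 200).
seed = [[10.0, 20.0], [30.0, 40.0]]
row_targets = [60.0, 140.0]
col_targets = [90.0, 110.0]
fitted = ipf_2d(seed, row_targets, col_targets)
```

The fitted table keeps the association structure of the seed while its row and column sums agree with the targets up to numerical tolerance.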

Prerequisites#

You will need a Census API key, which can be obtained from https://api.census.gov/data/key_signup.html. Before running any of the code below, set the environment variable CENSUS_API_KEY to your key value. This can be done directly from Python with

import os
os.environ["CENSUS_API_KEY"] = "MY_API_KEY"

where "MY_API_KEY" is the value of your key.
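Since a missing key typically only shows up once a download is attempted, it can help to check for it up front. The following is a small sketch; the helper name `require_census_api_key` is hypothetical and not part of polaris-studio.

```python
import os


def require_census_api_key(env=os.environ):
    """Return the Census API key, or raise a clear error if it is missing."""
    key = env.get("CENSUS_API_KEY", "").strip()
    if not key:
        raise RuntimeError("CENSUS_API_KEY is not set; request a key and set it first.")
    return key
```

Calling `require_census_api_key()` at the top of a script fails early with a readable message instead of partway through the data preparation.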

Run input file generation process#

If you just want to generate data for a certain year and have a list of counties for your model area, use

import os
from polaris.prepare.popsyn.create_popsyn_input_files import run_file_generation

os.environ["CENSUS_API_KEY"] = "MY_API_KEY"

year = 2019
states_and_counties = [("state_name", ["county_fips_1", "county_fips_2", ...])]
filter_to_single_year = True

run_file_generation(year=year, states_and_counties=states_and_counties, filter_to_single_year=filter_to_single_year)

then copy the generated files (pums_hh.csv, pums_person.csv, sf1.csv, linker_file.txt) into your POLARIS model folder. Note that county ids are expected to be 3-digit FIPS codes.
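County codes often lose their leading zeros when they pass through spreadsheets or integer columns. The sketch below zero-pads them back to three digits. It assumes the codes are passed as strings, as the placeholders in the example above suggest; the helper name `normalize_county_fips` is hypothetical.

```python
def normalize_county_fips(code):
    """Return a county code as a zero-padded 3-digit string, e.g. 1 -> "001"."""
    text = str(code).strip()
    if not text.isdigit() or len(text) > 3:
        raise ValueError(f"not a valid 3-digit county FIPS code: {code!r}")
    return text.zfill(3)


counties = [normalize_county_fips(c) for c in (1, "21", "045")]
```

A 5-digit code that includes the state prefix is rejected rather than silently truncated.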

Note: the sf1 file should only include records for census tracts within the modelling area, so you might have to filter the output of this process.
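One possible way to do this filtering is sketched below using only the standard library. The column names `GEOID` and `HH_TOTAL` and the tract ids are placeholders invented for the example; check the header of your generated sf1.csv for the actual id column.

```python
import csv
import io


def filter_rows_by_id(csv_text, id_column, keep_ids):
    """Return CSV text containing only the rows whose id_column is in keep_ids."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[id_column] in keep_ids:
            writer.writerow(row)
    return out.getvalue()


# Hypothetical control total rows and the tracts inside the modelling area.
sample = "GEOID,HH_TOTAL\n56001963100,1200\n56021000200,800\n56045951100,950\n"
filtered = filter_rows_by_id(sample, "GEOID", {"56001963100", "56045951100"})
```

Ids are compared as strings so that leading zeros in tract identifiers are preserved.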

Seed data#

polaris-studio downloads PUMS data from the US Census Bureau website. The data provides a rolling 5-year sample representing 5% of the population, with detailed information on individual households and their members. The definition of variables can be found following the link in the introduction or in machine-readable version here.

The data preparation process makes sure that person and household files are consistent and generates some additional cross-tabulated fields. To generate seed data for, say, the entire state of Wyoming, run

from pathlib import Path
from polaris.prepare.popsyn.create_popsyn_input_files import parse_state
from polaris.prepare.popsyn.seed import create_seed_data

year = 2019
states_and_counties = [(parse_state("Wyoming"), [])]
working_dir = Path("/path/to/your/dir")

hh, ppl = create_seed_data(working_dir, year, states_and_counties)

The hh and ppl variables are pandas dataframes. Note that this leaves two artifacts per state in your working directory: one zip file for households and one for people, each containing the unprocessed PUMS data.

Control totals#

Marginal control totals at a detailed geographic level are downloaded from the US Census Bureau website. This process can be run as follows, again with Wyoming as an example:

import os
from pathlib import Path
from polaris.prepare.popsyn.create_popsyn_input_files import parse_state
from polaris.prepare.popsyn.control_totals import create_control_totals

os.environ["CENSUS_API_KEY"] = "MY_API_KEY"
year = 2019
states_and_counties = [(parse_state("Wyoming"), [])]
working_dir = Path("/path/to/your/dir")

control_totals = create_control_totals(working_dir, states_and_counties, year)

This process downloads a small subset of the available data as specified in polaris.prepare.popsyn.control_totals.get_default_group_spec.

Linking seeds and control totals together#

To link the seed data to its corresponding control totals, POLARIS uses a linker file (default name: linker_file.txt). Here we describe its structure and provide an example of how to add another controlled variable to the population synthesis process.

Linker file fields#

Note that many fields in the linker file reference column indexes in the seed and control total files. These are always 0-based, i.e., index 0 refers to the first column in the file. All values must be tab-separated, and lines starting with # are comments that are not parsed by the population synthesizer.
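These conventions (tab-separated values, # comments) can be sketched as a small reader. This is an illustration only and not the parser used by POLARIS; the function name `parse_linker_lines` is hypothetical, and the sample lines are the ones quoted from the default linker file elsewhere on this page.

```python
from collections import defaultdict


def parse_linker_lines(lines):
    """Group tab-separated linker entries by keyword, skipping comments and blanks."""
    entries = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank line or comment
        keyword, *values = line.split("\t")
        entries[keyword].append(values)
    return dict(entries)


example = [
    "## Age\n",
    "PERSONVAR\t0\t6\n",
    "## Race incl Hispanic\n",
    "PERSONVAR\t1\t27\n",
    "PERSONDIMS\t7\t6\t14\n",
]
parsed = parse_linker_lines(example)
```

Values are kept as strings here because some keywords carry file names rather than integers.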

File names#

Variable	Description
HHFILE	name of household file
PERSONFILE	name of person file
ZONEFILE	name of control totals (zone) file

Dimensions#

Variable	Description
HHDIMS	variable length - one value per controlled variable at household level
PERSONDIMS	variable length - one value per controlled variable at person level

Each line contains information on the number of variables defined and the dimension of each. For example, consider the following line in the default linker file:

PERSONDIMS	7	6	14

It defines three controlled variables at the person level: the first has 7 distinct values, the second 6, and the third 14. Note that distinct values here refers to a grouping defined later (see MARGVAR) and not necessarily to the raw values in the seed files - e.g., the first variable with 7 dimensions will turn out to be the age of a person, and we will define 7 age bands in PERSONMARGVAR.
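As a quick piece of arithmetic, the dimensions on such a line also tell you how many cells a full cross-tabulation of the controlled variables would have. The sketch below only computes that product; whether and how the synthesizer materializes such a table is not covered here, and the helper name `crosstab_cells` is hypothetical.

```python
import math


def crosstab_cells(dims_line):
    """Return the keyword, the per-variable dimensions, and their product."""
    keyword, *values = dims_line.rstrip("\n").split("\t")
    dims = [int(value) for value in values]
    return keyword, dims, math.prod(dims)


keyword, dims, cells = crosstab_cells("PERSONDIMS\t7\t6\t14")
# 7 * 6 * 14 = 588 combinations of age band, race group and education/employment group
```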

There is an optional HHTESTDIMS field for validation of uncontrolled variables at the household level. Such a variable is not used in the controlling procedure, but it is a useful tool to assess the ‘goodness of fit’ of the population synthesis process. If defined, the corresponding TESTHHVAR and TESTHHMARGVAR fields described below must also be defined.

Variable definition in seed file#

Variable	Description
HHVAR	2 integers corresponding to internal variable index and column index in seed file. One line per controlled variable at household level.
PERSONVAR	2 integers corresponding to internal variable index and column index in seed file. One line per controlled variable at person level.

For example, the three person-level variables defined above have the following three entries in the default linker file:

## Age
PERSONVAR	0	6
## Race incl Hispanic
PERSONVAR	1	27
## EDUCATION by EMPLOYMENT
PERSONVAR	2	32

Each line has two integers, the first indicating the position in the PERSONDIMS variable definition above and the second indicating the column index of the variable in the corresponding seed file. For example, the first variable, age, is defined first (0-based indexing) in the PERSONDIMS line above.

Linking seed and control data#

HHMARGVAR and PERSONMARGVAR tie the seed data to the control data. Each line consists of 5 integers, and there is one line per controlled column in the control totals file. The first integer is the internal variable index in PERSONDIMS, like the first integer in PERSONVAR. The second integer is an internal index that is 0-based and contiguous. The third and fourth integers define the range of values in the seed column that corresponds to the control column whose index is given by the fifth and last integer on the line. For example, the person age variable used in the previous examples is defined in the following way in the default linker file:

PERSONMARGVAR	0	0	0	15	39
PERSONMARGVAR	0	1	15	25	40
PERSONMARGVAR	0	2	25	35	41
PERSONMARGVAR	0	3	35	45	42
PERSONMARGVAR	0	4	45	55	43
PERSONMARGVAR	0	5	55	65	44
PERSONMARGVAR	0	6	65	999	45

The first column after the PERSONMARGVAR keyword tells us that we are defining values for the first variable in PERSONDIMS. The value at that index in PERSONDIMS tells us that we need 7 lines to tie the seed and control data together. The PERSONVAR line with that index (first value after the keyword) tells us that we are talking about column 6 in the person seed file - the age variable. The second integer column defines a margvar index and runs from 0 to 6, one index for each of the 7 values defined for this variable. The third and fourth integer columns define the age range in the seed file that corresponds to the column in the control totals file given by the fifth and last integer column. In this example, the age bands are 0-15 years (left inclusive and right exclusive), 15-25 years, 25-35 years, and so on.
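The half-open age bands can be checked with a short lookup. The rows below are copied from the PERSONMARGVAR example above; the function name `control_column_for` is hypothetical and the code is a sketch of the mapping, not the synthesizer's implementation.

```python
PERSONMARGVAR_AGE = [
    # (variable index, margvar index, low inclusive, high exclusive, control column)
    (0, 0, 0, 15, 39),
    (0, 1, 15, 25, 40),
    (0, 2, 25, 35, 41),
    (0, 3, 35, 45, 42),
    (0, 4, 45, 55, 43),
    (0, 5, 55, 65, 44),
    (0, 6, 65, 999, 45),
]


def control_column_for(value, margvar_rows):
    """Return (margvar index, control column) for the band containing value."""
    for _variable, index, low, high, column in margvar_rows:
        if low <= value < high:
            return index, column
    raise ValueError(f"value {value} is not covered by any band")


band_of_14 = control_column_for(14, PERSONMARGVAR_AGE)  # last age in the first band
band_of_15 = control_column_for(15, PERSONMARGVAR_AGE)  # first age in the second band
```

Because the upper bound is exclusive, an age of 15 falls into the second band (control column 40), not the first.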

Geographic ids and weight column#

POLARIS’ population synthesis operates at two distinct geographic levels: that of the micro-sample, called region, and that of the control totals, called (population synthesis) zone. In the case of American census data, the regions are the PUMAs and the zones are (usually) census tracts. However, this is not required, and in fact any two geographic aggregations with a unique crosswalk can be used. The population synthesis code is completely flexible, and therefore the relation between region and zone needs to be provided in the linker file, as described in the table below. To link the output of the population synthesis process to the model, model locations need to be tied to population synthesis zones. This information resides in the location table in the supply database.

Variable	Description
REGION	3 integers to specify the column index of the region id, household id, and weight column, respectively, in the household file.
PERSON	3 integers to specify the column index of the region id, household id, and weight column, respectively, in the person file.
ZONE	2 integers to specify the column index of the zone id and region id columns in the control total file.
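The requirement of a unique crosswalk means that every zone belongs to exactly one region. A minimal sketch of such a check on (zone id, region id) pairs, as they could be read from the two columns named by ZONE, is shown below. The function name `build_crosswalk` and the ids are hypothetical.

```python
def build_crosswalk(zone_region_pairs):
    """Map each zone id to its region id, rejecting zones assigned to two regions."""
    crosswalk = {}
    for zone, region in zone_region_pairs:
        # setdefault stores the first region seen for a zone and returns it afterwards
        if crosswalk.setdefault(zone, region) != region:
            raise ValueError(f"zone {zone} maps to regions {crosswalk[zone]} and {region}")
    return crosswalk


pairs = [("tract_a", "puma_1"), ("tract_b", "puma_1"), ("tract_c", "puma_2")]
crosswalk = build_crosswalk(pairs)
```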

Data for polaris runs#

POLARIS uses some ACS attributes directly. Currently, the following variables need to be provided:

HHDATA: 6 integers to specify the column index of the following variables in the household file: household type by tenure, household size, number of vehicles, number of workers, household income, housing unit type.

PERSONDATA: 17 integers to specify the column index of the following variables in the person file: age, class of worker, education, employment industry, employment status, gender, wage, work arrival time, work mode, work travel time, work vehicle occupancy, marital status, race, school enrollment, school level, work hours, disability status.

Example: Adding a controlled variable to population synthesis#

Sex is not directly controlled for in the default linker file. However, the seed data contains this information in the SEX field and the default control data contains this information in the two columns SEX_MALE and SEX_FEMALE.

This means we only need to alter the linker file. Sex is a person property, so we will have to alter PERSONDIMS, PERSONVAR and PERSONMARGVAR. Please note that in the following all values must be tab-separated!

Regarding PERSONDIMS, line 9 of polaris.prepare.popsyn.linker_file.txt now reads

PERSONDIMS	7	6	14	2

We have added an additional variable (another entry) with two distinct values (male/female). Regarding PERSONVAR, we need to add an additional line to the file after line 43:

## SEX
PERSONVAR	3	15

Note that the first line is a comment for human readability; only the second line is mandatory and processed by POLARIS. The first number, 3, refers to the position of the variable in the PERSONDIMS line above: it is the fourth and last variable defined there (note again that all indexes are 0-based). The second number, 15, refers to the column index of the variable (SEX here) in the person seed file. Regarding PERSONMARGVAR, we need to add two more lines to the end of the file:

PERSONMARGVAR	3	0	1	2	37
PERSONMARGVAR	3	1	2	3	38

The first number, 3, again refers to the variable index defined in PERSONDIMS. The second numbers, 0 and 1, are internal indexes, which are 0-based and contiguous. The third and fourth numbers define the range of variable values in the seed column that corresponds to the control column index given by the fifth and last number on each line. The first line therefore encodes that column number 37 in the control file corresponds to values between 1 (inclusive) and 2 (exclusive), i.e., to values exactly equal to 1. This corresponds to SEX=male in the seed file. The second line ties the SEX=female values to their corresponding control data column, here at index 38.
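After editing a linker file by hand, a quick consistency check between PERSONDIMS and the PERSONMARGVAR lines can catch mistakes such as forgetting to extend PERSONDIMS. The following is a sketch under the rules described on this page (one line per distinct value, margvar indexes 0-based and contiguous). The function name `check_margvar` is hypothetical, and only the SEX rows from this example are checked.

```python
def check_margvar(dims, margvar_rows):
    """Return a list of problems for the variables that appear in margvar_rows."""
    by_variable = {}
    for variable, index, low, high, _column in margvar_rows:
        by_variable.setdefault(variable, []).append((index, low, high))
    problems = []
    for variable, bands in sorted(by_variable.items()):
        if variable >= len(dims):
            problems.append(f"variable {variable} is not declared in the DIMS line")
            continue
        indexes = sorted(index for index, _low, _high in bands)
        if indexes != list(range(dims[variable])):
            problems.append(
                f"variable {variable}: expected indexes 0..{dims[variable] - 1}, got {indexes}"
            )
        if any(low >= high for _index, low, high in bands):
            problems.append(f"variable {variable}: empty value range")
    return problems


dims = [7, 6, 14, 2]  # PERSONDIMS after adding SEX
sex_rows = [(3, 0, 1, 2, 37), (3, 1, 2, 3, 38)]
problems = check_margvar(dims, sex_rows)
```

With the original three-entry PERSONDIMS line the same rows would be reported as referring to an undeclared variable.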

In our example we did not have to change the seed and control total files. In general, when operating on the seed and control data it is important to note that if you want to use the default linker file, the column order is important because the linker file uses column indexes into seed and control data. This column order in the seed files can be ensured by running

from polaris.prepare.popsyn.seed import filter_seed_columns
hh, ppl = filter_seed_columns(hh, ppl)

For the control data, polaris.prepare.popsyn.control_totals.get_default_group_spec returns a list in the correct order, which is used in polaris.prepare.popsyn.control_totals.create_control_totals to create the default control total file. If you want to add additional variables to the control file make sure to append any specifications to the list provided by get_default_group_spec().