Population Synthesis#
POLARIS simulates individual households and people and therefore needs them as inputs. Such data is usually not directly available for privacy reasons. However, the US Census Bureau continuously surveys a sample of the US population through the American Community Survey (ACS) and publishes a detailed 1% sample each year, the Public Use Microdata Sample (PUMS). To preserve privacy, all geographic information, such as home location, is aggregated to so-called Public Use Microdata Areas (PUMA). Additionally, more geographically detailed data is available for selected variables, such as the number of households or the number of people in a defined age band per census tract. These are marginal distributions, i.e. they provide information on one (or a few) variables only, and the individual-level information is lost. A population synthesis process then takes the fully cross-tabulated seed sample (PUMS) at a geographically aggregate level and expands it such that it matches the marginal distributions at a geographically detailed scale. Refer to
Auld, J., & Mohammadian, A. (2010). Efficient Methodology for Generating Synthetic Populations with Multiple Control Levels. Transportation Research Record, 2175(1), 138–147. https://doi.org/10.3141/2175-16
for more detail. We discuss the data preparation steps for the POLARIS population synthesis process in the following.
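To make the idea of expanding a seed sample to match marginal distributions concrete, the following is a minimal sketch of plain iterative proportional fitting (IPF) on a two-way table. It is an illustration only and not necessarily the algorithm POLARIS runs; the function name `ipf` and all numbers are made up for this example.

```python
def ipf(seed, row_targets, col_targets, iterations=50):
    """Scale a 2-D seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(iterations):
        # Scale each row so that it sums to its marginal target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [value * target / total for value in table[i]]
        # Scale each column so that it sums to its marginal target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
    return table


# Hypothetical seed cross-tabulation (e.g. age band by household size)
seed = [[10.0, 5.0], [4.0, 6.0]]
# Hypothetical marginal control totals for one small area
row_targets = [60.0, 40.0]
col_targets = [55.0, 45.0]

fitted = ipf(seed, row_targets, col_targets)
```

The fitted table reproduces both sets of marginal totals while keeping the association between the two variables (the cross-product ratio) that was observed in the seed.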
Prerequisites#
You will need a Census API key, which can be obtained from https://api.census.gov/data/key_signup.html. Before running any of the code below, set the environment variable CENSUS_API_KEY to your key value. This can be done directly from Python with
import os
os.environ["CENSUS_API_KEY"] = "MY_API_KEY"
where "MY_API_KEY" is the value of your key.
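If you want to fail early when the key is missing, a small guard like the following can be run before any download. This is only a sketch: `require_census_api_key` is a hypothetical helper written for this example and is not part of polaris-studio.

```python
import os


def require_census_api_key() -> str:
    """Return the Census API key from the environment or raise a clear error."""
    key = os.environ.get("CENSUS_API_KEY", "")
    # Treat an unset variable and the documentation placeholder as missing.
    if not key or key == "MY_API_KEY":
        raise RuntimeError("Set the CENSUS_API_KEY environment variable to your key first.")
    return key
```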
Run input file generation process#
If you just want to generate data for a certain year and have a list of counties for your model area, use
import os
from polaris.prepare.popsyn.create_popsyn_input_files import run_file_generation
os.environ["CENSUS_API_KEY"] = "MY_API_KEY"
year = 2019
states_and_counties = [("state_name", ["county_fips_1", "county_fips_2", ...])]
filter_to_single_year=True
run_file_generation(year=year, states_and_counties=states_and_counties, filter_to_single_year=filter_to_single_year)
Then copy the generated files (pums_hh.csv, pums_person.csv, sf1.csv, linker_file.txt) into your POLARIS model folder.
Seed data#
polaris-studio downloads PUMS data from the US Census Bureau website. The data provides a rolling 5-year sample representing 5% of the population, with detailed information on individual households and the associated people. The variable definitions can be found by following the link in the introduction or in a machine-readable version here.
The data preparation process makes sure that person and household files are consistent and generates some additional cross-tabulated fields. To generate seed data for, say, the entire state of Wyoming, run
from pathlib import Path
from polaris.prepare.popsyn.create_popsyn_input_files import parse_state
from polaris.prepare.popsyn.seed import create_seed_data
year = 2019
states_and_counties = [(parse_state("Wyoming"), [])]
working_dir = Path("/path/to/your/dir")
hh, ppl = create_seed_data(working_dir, year, states_and_counties)
The hh and ppl variables are pandas dataframes. Note that this leaves two artifacts per state in your working directory, one zip file each for households and people, which contain all unprocessed PUMS data.
Control totals#
Marginal control totals at a detailed geographic level are downloaded from the US Census Bureau website. This process can be run as follows, again with Wyoming as an example:
import os
from pathlib import Path
from polaris.prepare.popsyn.create_popsyn_input_files import parse_state
from polaris.prepare.popsyn.control_totals import create_control_totals
os.environ["CENSUS_API_KEY"] = "MY_API_KEY"
year = 2019
states_and_counties = [(parse_state("Wyoming"), [])]
working_dir = Path("/path/to/your/dir")
control_totals = create_control_totals(working_dir, states_and_counties, year)
This process downloads a small subset of the available data, as specified in polaris.prepare.popsyn.control_totals.get_default_group_spec.
Linking seeds and control totals together#
To link the seed data to its corresponding control totals, POLARIS uses a linker file (default name: linker_file.txt). Here we describe its structure and provide an example of how to add another controlled variable to the population synthesis process.
Linker file fields#
Note that many fields in the linker file reference column indexes in the seed and control total files. These are always 0-based, i.e., index 0 refers to the first column in the file, etc. All provided values must be tab-separated. Lines starting with # are comments and are not parsed by the population synthesizer.
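The three rules above (tab-separated values, 0-based indexes, `#` comments) can be illustrated with a small reader. This is a sketch for inspecting a linker file by hand, not the parser used by the population synthesizer; `parse_linker_lines` is a hypothetical name.

```python
def parse_linker_lines(text):
    """Return (keyword, values) pairs for every non-comment line of a linker file."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        # Values are tab-separated; the first field is the keyword.
        keyword, *values = line.split("\t")
        entries.append((keyword, values))
    return entries


sample = "## Age\nPERSONVAR\t0\t6\n## Race incl Hispanic\nPERSONVAR\t1\t27\nPERSONDIMS\t7\t6\t14\n"
entries = parse_linker_lines(sample)
```

Values are kept as strings here because some keywords (such as the file names) carry text, while most others carry integers that can be converted with `int` as needed.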
File names#
Variable | Description |
---|---|
HHFILE | name of household file |
PERSONFILE | name of person file |
ZONEFILE | name of zone (control totals) file |
Dimensions#
Variable | Description |
---|---|
HHDIMS | variable length - one value per controlled variable at household level |
PERSONDIMS | variable length - one value per controlled variable at person level |
Each line gives the number of controlled variables and the dimension of each. For example, the following line in the default linker file
PERSONDIMS 7 6 14
states that there are three controlled variables at the person level: the first has 7 distinct values, the second 6 and the third 14. Note that distinct values here refers to a grouping defined later (see HHMARGVAR and PERSONMARGVAR) and not necessarily to the raw values in the seed files. For example, the first variable with 7 dimensions will turn out to be the age of a person, for which we will define 7 age bands in PERSONMARGVAR.
There is an optional HHTESTDIMS field for validation of uncontrolled variables at household level. This variable is not used in the controlling procedure, but it is a useful tool to assess the ‘goodness of fit’ of the population synthesis process. If defined, the corresponding TESTHHVAR and TESTHHMARGVAR fields described below must also be defined.
Variable definition in seed file#
Variable | Description |
---|---|
HHVAR | 2 integers corresponding to internal variable index and column index in seed file. One line per controlled variable at household level. |
PERSONVAR | 2 integers corresponding to internal variable index and column index in seed file. One line per controlled variable at person level. |
For example, the three person-level variables defined above have the following three entries in the default linker file:
## Age
PERSONVAR 0 6
## Race incl Hispanic
PERSONVAR 1 27
## EDUCATION by EMPLOYMENT
PERSONVAR 2 32
Each line has two integers, the first indicating the position in the PERSONDIMS variable definition above and the second indicating the column index of the variable in the corresponding seed file. For example, age is the first variable (index 0) in the PERSONDIMS line above and is read from column 6 of the person seed file.
Linking seed and control data#
HHMARGVAR and PERSONMARGVAR tie the seed data to the control data. Each line consists of 5 integers and there is one line per variable in the control totals file. The first integer is the internal variable index in PERSONDIMS, like the first integer in PERSONVAR. The second integer is an internal index that is 0-based and contiguous. The third and fourth integers define the range of values in the seed column that corresponds to the control column whose index is given by the fifth and last integer on each line. For example, the person age variable used in the previous examples is defined in the following way in the default linker file:
PERSONMARGVAR 0 0 0 15 39
PERSONMARGVAR 0 1 15 25 40
PERSONMARGVAR 0 2 25 35 41
PERSONMARGVAR 0 3 35 45 42
PERSONMARGVAR 0 4 45 55 43
PERSONMARGVAR 0 5 55 65 44
PERSONMARGVAR 0 6 65 999 45
The first column after the PERSONMARGVAR keyword tells us that we are defining values for the first variable in PERSONDIMS. The value at that index in PERSONDIMS tells us that we will need 7 lines to tie the seed and control data together. PERSONVAR at that index (first value after the keyword) tells us that we are talking about column 6 in the person seed file, the age variable. The second integer column defines a margvar index and runs from 0 to 6, one index for each of the 7 values defined for this variable. The third and fourth integer columns in PERSONMARGVAR now define the age range in the seed file that corresponds to the column in the control totals file given in the fifth and last integer column. In this example, the age bands are 0-15 years (left inclusive and right exclusive), 15-25 years, 25-35 years, and so on.
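The half-open age bands can be checked with a few lines of Python. The sketch below copies the seven PERSONMARGVAR lines from the default linker file into tuples and looks up which control column a given seed age falls into; `control_column_for` is a hypothetical helper written for this illustration, not a polaris-studio function.

```python
# (variable index, margvar index, low inclusive, high exclusive, control column)
AGE_MARGVARS = [
    (0, 0, 0, 15, 39),
    (0, 1, 15, 25, 40),
    (0, 2, 25, 35, 41),
    (0, 3, 35, 45, 42),
    (0, 4, 45, 55, 43),
    (0, 5, 55, 65, 44),
    (0, 6, 65, 999, 45),
]


def control_column_for(seed_value, margvars):
    """Return the control column whose [low, high) range contains seed_value."""
    for _var, _index, low, high, column in margvars:
        if low <= seed_value < high:
            return column
    return None  # value is not covered by any band


column_for_14 = control_column_for(14, AGE_MARGVARS)
column_for_15 = control_column_for(15, AGE_MARGVARS)
```

A 14-year-old falls into the first band and a 15-year-old into the second, because the upper bound of each band is exclusive.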
Geographic ids and weight column#
Variable | Description |
---|---|
REGION | 3 integers to specify the column index of the region id, household id, and weight column, respectively, in the household file. |
PERSON | 3 integers to specify the column index of the region id, household id, and weight column, respectively, in the person file. |
ZONE | 2 integers to specify the column index of the census tract id and puma id columns in the control total file. |
Data for POLARIS runs#
POLARIS uses some ACS attributes directly. Currently, the following variables need to be provided:
HHDATA: 6 integers to specify the column index of the following variables in the household file: household type by tenure, household size, number of vehicles, number of workers, household income, housing unit type.
PERSONDATA: 17 integers to specify the column index of the following variables in the person file: age, class of worker, education, employment industry, employment status, gender, wage, work arrival time, work mode, work travel time, work vehicle occupancy, marital status, race, school enrollment, school level, work hours, disability status.
Example: Adding a controlled variable to population synthesis#
Sex is not directly controlled for in the default linker file. However, the seed data contains this information in the SEX field and the default control data contains it in the two columns SEX_MALE and SEX_FEMALE.
This means we only need to alter the linker file. Sex is a person property, so we will have to alter PERSONDIMS, PERSONVAR and PERSONMARGVAR. Please note that in the following all values must be tab-separated!
Regarding PERSONDIMS, line 9 of polaris.prepare.popsyn.linker_file.txt now reads
PERSONDIMS 7 6 14 2
Here we have added an additional variable (another entry) with two distinct values (male/female). Regarding PERSONVAR, we need to add an additional line to the file after line 43:
## SEX
PERSONVAR 3 15
Note that the first line is a comment for human readability; only the second line is mandatory and processed by POLARIS. The first number, 3, refers to the position of the variable in the PERSONDIMS line above: with 0-based indexing, index 3 is the fourth and last variable defined there. The second number, 15, refers to the column index of the variable (SEX here) in the person seed file.
Regarding PERSONMARGVAR, we need to add two more lines to the end of the file:
PERSONMARGVAR 3 0 1 2 37
PERSONMARGVAR 3 1 2 3 38
The first number, 3, again refers to the variable index defined in PERSONDIMS. The second numbers, 0 and 1, are internal indexes that are 0-based and contiguous. The third and fourth numbers define the range of values in the seed column that corresponds to the control column whose index is given by the fifth and last number on each line. The first line therefore encodes that column 37 in the control file corresponds to values between 1 (inclusive) and 2 (exclusive), i.e. to values exactly equal to 1. This corresponds to SEX=male in the seed file. The second line ties the SEX=female values to their corresponding control data column, here at index 38.
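When editing the linker file by hand, it is easy to let PERSONDIMS and PERSONMARGVAR drift apart. The following sketch checks one variable for consistency, using the values from the example above; `check_margvars` is a hypothetical helper for illustration, and POLARIS itself may validate the file differently.

```python
def check_margvars(var_index, dims, margvar_lines):
    """Return a list of problems found for one controlled variable."""
    if var_index >= len(dims):
        return [f"variable {var_index} has no entry in PERSONDIMS"]
    problems = []
    own_lines = [line for line in margvar_lines if line[0] == var_index]
    # The margvar indexes must be exactly 0 .. dims[var_index] - 1.
    found = sorted(line[1] for line in own_lines)
    if found != list(range(dims[var_index])):
        problems.append(f"expected indexes 0..{dims[var_index] - 1}, found {found}")
    # Each range is half-open, so low must be strictly below high.
    for _var, index, low, high, _column in own_lines:
        if low >= high:
            problems.append(f"margvar {index} has an empty range [{low}, {high})")
    return problems


person_dims = [7, 6, 14, 2]  # PERSONDIMS after adding sex
sex_lines = [(3, 0, 1, 2, 37), (3, 1, 2, 3, 38)]  # the two new PERSONMARGVAR lines

problems = check_margvars(3, person_dims, sex_lines)
```

With the edited PERSONDIMS line the check passes; with the original three-entry line it reports that variable 3 has no dimension entry.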
In our example we did not have to change the seed and control total files. In general, when operating on the seed and control data it is important to note that if you want to use the default linker file, the column order is important because the linker file uses column indexes into seed and control data. This column order in the seed files can be ensured by running
from polaris.prepare.popsyn.seed import filter_seed_columns
hh, ppl = filter_seed_columns(hh, ppl)
For the control data, polaris.prepare.popsyn.control_totals.get_default_group_spec returns a list in the correct order, which is used in polaris.prepare.popsyn.control_totals.create_control_totals to create the default control total file. If you want to add additional variables to the control file, make sure to append any specifications to the list provided by get_default_group_spec().
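Because the linker file addresses columns by position, it helps to look up the 0-based index of a column by name before writing it into the file. The sketch below does this for a CSV header line using only the standard library. The header shown is made up for illustration and does not reflect the actual layout of the seed or control files, and `column_index` is a hypothetical helper.

```python
import csv
import io


def column_index(header_line, name):
    """Return the 0-based position of a named column in a CSV header line."""
    header = next(csv.reader(io.StringIO(header_line)))
    return header.index(name)  # raises ValueError if the column is missing


# Hypothetical header of a person file
example_header = "SERIALNO,PUMA,AGEP,SEX"
sex_column = column_index(example_header, "SEX")
```

The returned value is the number that would go into the second position of the corresponding PERSONVAR line.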