Choice table utilities API
Working with discrete choice models can require a lot of data preparation. Each chooser has to be matched with hypothetical alternatives, either to simulate choice probabilities or to compare them with the chosen alternative for model estimation.
ChoiceModels includes a class called MergedChoiceTable that automates this. To build a merged table, create an instance of the class and pass it one pd.DataFrame of choosers and another of alternatives, along with any other arguments needed (see below for the full API).
The merged data table can be output to a DataFrame, or passed directly to other ChoiceModels tools as a MergedChoiceTable object, which retains metadata about indexes and other special columns.
mct = choicemodels.tools.MergedChoiceTable(obs, alts, ..)
df = mct.to_frame()
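To make the shape of the inputs and the long-format output concrete, the sketch below builds two small tables and pairs every chooser with every alternative by hand, using plain pandas. It does not call ChoiceModels. The table contents and the column names `income` and `price` are made up for illustration, and the real class may differ in details such as column order.

```python
import pandas as pd

# Hypothetical choosers: one row per chooser, unique IDs in the index.
obs = pd.DataFrame({"income": [40, 75]},
                   index=pd.Index([101, 102], name="obs_id"))

# Hypothetical alternatives: one row per alternative, unique IDs in the index.
alts = pd.DataFrame({"price": [1.0, 2.5, 4.0]},
                    index=pd.Index(["a", "b", "c"], name="alt_id"))

# Every (chooser, alternative) pair; each chooser's rows are contiguous.
idx = pd.MultiIndex.from_product([obs.index, alts.index],
                                 names=["obs_id", "alt_id"])

# Repeat the chooser and alternative attributes to line up with the pairs.
obs_part = obs.loc[idx.get_level_values("obs_id")].reset_index(drop=True)
alt_part = alts.loc[idx.get_level_values("alt_id")].reset_index(drop=True)

# Long-format table with a two-level (obs_id, alt_id) index.
merged = pd.concat([obs_part, alt_part], axis=1)
merged.index = idx
```

Here `merged` has one row per chooser-alternative pair (2 x 3 = 6 rows), which is the layout a discrete choice model consumes.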
MergedChoiceTable is designed especially for models that need to sample from large numbers of alternatives. It supports:
- uniform random sampling of alternatives, with or without replacement
- weighted random sampling based either on characteristics of the alternatives or on combinations of chooser and alternative
- interaction terms to be merged onto the final data table
- cartesian merging of all the choosers with all the alternatives, without sampling
All of the sampling procedures work for both estimation (where the chosen alternative is known) and simulation (where it is not).
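The sampling modes listed above can be sketched with NumPy's random generator. This illustrates the sampling concepts only and is not the library's internal code; the alternative IDs, weights, and seed are made up.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
alt_ids = np.array(["a", "b", "c", "d", "e"])
sample_size = 3

# Uniform sampling with replacement: an alternative may repeat in a choice set.
with_repl = rng.choice(alt_ids, size=sample_size, replace=True)

# Uniform sampling without replacement: each alternative appears at most once,
# so sample_size cannot exceed the number of alternatives.
without_repl = rng.choice(alt_ids, size=sample_size, replace=False)

# Weighted sampling: normalize alternative-specific weights to probabilities.
weights = np.array([5.0, 1.0, 1.0, 1.0, 0.0])  # "e" can never be drawn
weighted = rng.choice(alt_ids, size=sample_size, replace=True,
                      p=weights / weights.sum())
```

In a merged choice table a draw like this would be repeated for each chooser, so that every choice set has the same number of rows.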
MergedChoiceTable(observations, alternatives, chosen_alternatives=None, sample_size=None, replace=True, weights=None, availability=None, interaction_terms=None, random_state=None)
Generates a merged long-format table of observations (choosers) and alternatives, for discrete choice model estimation or simulation.
Supports uniform or weighted random sampling of alternatives, with or without replacement, as well as merging observations and alternatives without sampling them. Weights can be alternative-specific, or they can be interaction weights that depend on both the observation and the alternative. Interaction terms can be merged automatically onto the final data table.
Support is PLANNED for specifying availability of alternatives, specifying random state, and passing interaction-type parameters as callable generator functions.
Does NOT support cases where the number of alternatives in the final table varies across observations.
Reserved column names: ‘chosen’.
- observations (pandas.DataFrame) – Table with one row for each chooser or choice scenario, with unique IDs in the index. Additional columns can contain fixed attributes of the choosers. The index name is set to ‘obs_id’ if none is provided. All observation/alternative column names must be unique except for the join key.
- alternatives (pandas.DataFrame) – Table with one row for each alternative, with unique IDs in the index. Additional columns can contain fixed attributes of the alternatives. The index name is set to ‘alt_id’ if none is provided. All observation/alternative column names must be unique except for the join key.
- chosen_alternatives (str or pandas.Series, optional) – The alternative ID selected in each choice scenario. (This is required for preparing estimation data, but not for simulation data.) If a str, it is interpreted as a column name from the observations table. If a Series, it will be joined onto the observations table before processing. The column will be dropped from the merged table and replaced with a binary column named ‘chosen’.
- sample_size (int, optional) – Number of alternatives to sample for each choice scenario. If None, all of the alternatives will be available for each chooser in the merged table. The sample size includes the chosen alternative, if applicable. If replace=False, the sample size must be less than or equal to the total number of alternatives.
- replace (boolean, optional) – Whether to sample alternatives with or without replacement, at the level of a single chooser or choice scenario. If replace=True (default), alternatives may appear multiple times in a single choice set. If replace=False, an alternative will appear at most once per choice set. Sampling with replacement is much more efficient, so setting replace=False may have performance implications if there are very large numbers of observations or alternatives.
- weights (str or pandas.Series, optional) –
Numerical weights to apply when sampling alternatives. If a str, it is interpreted as a column from the alternatives table. If a Series, it can contain either (a) one weight for each alternative or (b) one weight for each combination of observation and alternative. The former should have a single index with IDs from the alternatives table. The latter should have a MultiIndex with the first level corresponding to the observations table and the second level corresponding to the alternatives table. A callable is not yet supported; once implemented, it should accept two arguments (obs_id, alt_id) and return the corresponding weight.
TO DO - accept weights specified with respect to derivative characteristics, like how the interaction terms work (for example weights could be based on home census tract rather than observation id if there are multiple observations per tract)
TO DO - implement support for a callable
- availability (pandas.Series or callable, optional (NOT YET IMPLEMENTED)) – Binary representation of the availability of alternatives. Specified and applied similarly to the weights.
- interaction_terms (pandas.Series, pandas.DataFrame, or list of either, optional) –
Additional column(s) of interaction terms whose values depend on the combination of observation and alternative, to be merged onto the final data table. If passed as a Series or DataFrame, it should include a two-level MultiIndex. One level’s name and values should match an index or column from the observations table, and the other should match an index or column from the alternatives table.
TO DO - implement support for a callable
- random_state (NOT YET IMPLEMENTED) – Representation of random state, for replicability of the sampling.
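The index layouts described above for weights and interaction terms can be sketched as follows. This is a plain pandas illustration under made-up data, not a call into ChoiceModels; the names `distance`, `price`, and `alt_weights` are hypothetical.

```python
import pandas as pd

# Long-format table indexed by (obs_id, alt_id), standing in for a merged table.
idx = pd.MultiIndex.from_product([[101, 102], ["a", "b"]],
                                 names=["obs_id", "alt_id"])
merged = pd.DataFrame({"price": [1.0, 2.5, 1.0, 2.5]}, index=idx)

# Alternative-specific weights: a single index of alternative IDs.
alt_weights = pd.Series([2.0, 1.0],
                        index=pd.Index(["a", "b"], name="alt_id"))

# Interaction term: one value per (observation, alternative) pair, held in a
# two-level MultiIndex. Row order is deliberately different from `merged`.
dist_idx = pd.MultiIndex.from_tuples(
    [(102, "b"), (101, "a"), (102, "a"), (101, "b")],
    names=["obs_id", "alt_id"])
distance = pd.Series([0.7, 0.5, 2.0, 3.0], index=dist_idx, name="distance")

# Joining aligns on both index levels, so the row order of the input
# Series does not matter.
merged = merged.join(distance)
```

Interaction weights that vary by observation and alternative would use the same two-level layout as `distance`, with the observation level first.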
Name of column in the merged table containing the alternative id. Name and values will match the index of the alternatives table.
Return type: str
Name of the generated column containing a binary representation of whether each alternative was chosen in the given choice scenario, if applicable.
Return type: str or None
Name of column in the merged table containing the observation id. Name and values will match the index of the observations table.
Return type: str
to_frame()
Long-format DataFrame of the merged table. The rows representing alternatives for a particular chooser are contiguous, with the chosen alternative listed first if applicable (unless no sampling is performed, in which case the alternatives are listed in order). The DataFrame has a two-level MultiIndex: the first level corresponds to the index of the observations table and the second to the index of the alternatives table.
Return type: pandas.DataFrame
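The row ordering and the binary ‘chosen’ column described for estimation data can be sketched by hand as below. This is an assumption-laden illustration, not the library's implementation: it samples without replacement from the non-chosen alternatives, and the IDs, seed, and the variable name `long_table` are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
alt_ids = ["a", "b", "c", "d"]

# Hypothetical chosen alternative for each chooser.
chosen = pd.Series(["c", "a"], index=pd.Index([101, 102], name="obs_id"))
sample_size = 3  # counts the chosen alternative

rows = []
for obs_id, chosen_alt in chosen.items():
    # Assumption: the remaining slots are drawn from the non-chosen alternatives.
    others = [a for a in alt_ids if a != chosen_alt]
    sampled = rng.choice(others, size=sample_size - 1, replace=False).tolist()
    # Chosen alternative first, then the sampled ones; rows stay contiguous.
    for alt_id in [chosen_alt] + sampled:
        rows.append((obs_id, alt_id, int(alt_id == chosen_alt)))

long_table = pd.DataFrame(rows, columns=["obs_id", "alt_id", "chosen"])
long_table = long_table.set_index(["obs_id", "alt_id"])
```

Each chooser contributes exactly `sample_size` rows, and exactly one of them has `chosen` equal to 1.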