Utility Functions
Helper static methods for generating and working with P2CP instances.
- hiveplotlib.p2cp.indices_for_unique_values(df: DataFrame, column: Hashable) → Dict[Hashable, ndarray]
Find the indices corresponding to each unique value in a column of a pandas dataframe.
Works when the values contained in column are numerical or categorical.
- Parameters:
df – dataframe from which to find index values.
column – column of the dataframe to use to find indices corresponding to each of the column’s unique values.
- Returns:
dict whose keys are the unique values in the column of data and whose values are 1d arrays of index values.
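The shape of the returned mapping can be illustrated with a plain-Python sketch. This is not the library's implementation: the helper name `indices_by_unique_value` is hypothetical, and it works on a list and returns positions as Python lists, whereas the documented function takes a dataframe and returns 1d arrays of index values.

```python
from collections import defaultdict


def indices_by_unique_value(values):
    """Map each unique value to the list of positions where it occurs.

    Hypothetical helper for illustration only; it mirrors the idea of
    grouping indices by unique value, not hiveplotlib's actual code.
    """
    groups = defaultdict(list)
    for position, value in enumerate(values):
        groups[value].append(position)
    return dict(groups)


# Stand-in for one categorical column of a dataframe.
species = ["cat", "dog", "cat", "bird", "dog"]
result = indices_by_unique_value(species)
print(result)
```

Each key is one unique value from the column, and each value lists where that value occurs, which is the structure described under Returns.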
- hiveplotlib.p2cp.split_df_on_variable(df: DataFrame, column: Hashable, cutoffs: List[float] | int, labels: List[Hashable] | ndarray | None = None) → ndarray
Generate a value for each record in a dataframe according to a splitting criterion.
Using either specified cutoff values or a specified number of quantiles for cutoffs, return an (n, 1) np.ndarray where the ith value corresponds to the partition assignment of the ith record of df.
If column corresponds to numerical data, and a list of cutoffs is provided, then dataframe records will be assigned according to the following binning scheme: (-inf, cutoffs[0]], (cutoffs[0], cutoffs[1]], … , (cutoffs[-1], inf)
If column corresponds to numerical data, and cutoffs is provided as an int, then dataframe records will be assigned into cutoffs equal-sized quantiles.
Note
This method currently only supports splits where column corresponds to numerical data. For splits on categorical data values, see indices_for_unique_values().
- Parameters:
df – dataframe whose records will be assigned to a partition.
column – column of the dataframe to use to assign partition of records.
cutoffs – cutoffs to use in partitioning records according to the data under column. When provided as a list, the specified cutoffs will partition according to (-inf, cutoffs[0]], (cutoffs[0], cutoffs[1]], … , (cutoffs[-1], inf). When provided as an int, the exact numerical break points will be determined to create cutoffs equally-sized quantiles.
labels – labels assigned to each bin. Default None labels each bin as a string based on its range of values. Note, when cutoffs is a list, len(labels) must be 1 greater than len(cutoffs). When cutoffs is an int, len(labels) must be equal to cutoffs.
- Returns:
(n, 1) np.ndarray whose values are partition assignments corresponding to records in df.
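The right-closed binning scheme described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `assign_bins` is a hypothetical helper, it returns integer bin numbers in a flat list rather than an (n, 1) array, and the quantile break points come from `statistics.quantiles`, which may not match the break points the library computes.

```python
from bisect import bisect_left
from statistics import quantiles


def assign_bins(values, cutoffs):
    """Return the bin number of each value for right-closed bins.

    Bin 0 is (-inf, cutoffs[0]], bin 1 is (cutoffs[0], cutoffs[1]], and
    the last bin is (cutoffs[-1], inf). Hypothetical helper for
    illustration only.
    """
    ordered = sorted(cutoffs)
    # bisect_left counts the cutoffs strictly below the value, so a value
    # equal to a cutoff falls in the bin that the cutoff closes.
    return [bisect_left(ordered, value) for value in values]


values = [1, 2, 3, 4, 5, 6, 7, 8]

# Case 1: explicit list of cutoffs, giving len(cutoffs) + 1 bins.
by_cutoffs = assign_bins(values, [2, 5])

# One label per bin, so there is one more label than there are cutoffs.
labels = ["low", "mid", "high"]
labeled = [labels[bin_number] for bin_number in by_cutoffs]

# Case 2: an integer number of quantiles; derive the break points first.
cut_points = quantiles(values, n=4, method="inclusive")
by_quantiles = assign_bins(values, cut_points)

print(by_cutoffs)
print(labeled)
print(by_quantiles)
```

Values equal to a cutoff land in the lower bin, matching the closed right edge of each interval, and the label list shows why a list of cutoffs needs one more label than it has entries.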