Utility Functions

Helper static methods for generating and working with P2CP instances.

hiveplotlib.p2cp.indices_for_unique_values(df: DataFrame, column: Hashable) → Dict[Hashable, ndarray]

Find the indices corresponding to each unique value in a column of a pandas dataframe.

Works when the values contained in column are numerical or categorical.

Parameters:
  • df – dataframe from which to find index values.

  • column – column of the dataframe to use to find indices corresponding to each of the column’s unique values.

Returns:

dict whose keys are the unique values in the column of data and whose values are 1d arrays of index values.
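To make the shape of the return value concrete, here is a minimal sketch that builds the same kind of mapping with plain pandas. The helper name `indices_by_value_sketch` and the toy dataframe are hypothetical, and the sketch is not the library's implementation; it only illustrates the described output of a dict from unique column values to 1d arrays of index values.

```python
import pandas as pd

# Toy dataframe (hypothetical data, for illustration only).
df = pd.DataFrame(
    {
        "species": ["cat", "dog", "cat", "bird", "dog"],
        "weight": [4.0, 20.0, 5.0, 0.1, 25.0],
    }
)


def indices_by_value_sketch(df: pd.DataFrame, column: str) -> dict:
    """Hypothetical stand-in: map each unique value in `column` to its index values."""
    # Grouping by the column yields one sub-frame per unique value;
    # the index of each sub-frame holds the matching row labels.
    return {value: group.index.to_numpy() for value, group in df.groupby(column)}


result = indices_by_value_sketch(df, "species")
for value, idx in result.items():
    print(value, idx)
```

Each array can then be passed to `df.loc[...]` to recover the rows belonging to one value.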

hiveplotlib.p2cp.split_df_on_variable(df: DataFrame, column: Hashable, cutoffs: List[float] | int, labels: List[Hashable] | ndarray | None = None) → ndarray

Generate a partition assignment for each record in a dataframe according to a splitting criterion.

Using either specified cutoff values or a specified number of quantiles for cutoffs, return an (n, 1) np.ndarray where the ith value corresponds to the partition assignment of the ith record of df.

If column corresponds to numerical data, and a list of cutoffs is provided, then dataframe records will be assigned according to the following binning scheme:

(-inf, cutoffs[0]], (cutoffs[0], cutoffs[1]], … , (cutoffs[-1], inf)
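The right-closed binning scheme can be illustrated with `pandas.cut`, whose default `right=True` also produces intervals that include their right edge. This is a sketch of the scheme under that assumption, not necessarily how the function is implemented; the column name, data, and labels are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: values chosen to sit exactly on the cutoffs.
df = pd.DataFrame({"age": [1, 5, 10, 15]})
cutoffs = [5, 10]
labels = ["low", "mid", "high"]  # one more label than there are cutoffs

# Pad the cutoffs with -inf and inf so every record falls into some bin.
bins = [-np.inf, *cutoffs, np.inf]

# right=True (the default) gives (-inf, 5], (5, 10], (10, inf).
assignments = np.asarray(pd.cut(df["age"].to_numpy(), bins=bins, labels=labels))
print(assignments)
```

A value equal to a cutoff lands in the lower bin: here 5 is assigned to "low" and 10 to "mid".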

If column corresponds to numerical data, and cutoffs is provided as an int, then dataframe records will be assigned into cutoffs equal-sized quantiles.
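For the integer case, a comparable quantile split can be sketched with `pandas.qcut`. Whether the function relies on `qcut` internally is an assumption here; the data and labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: six evenly spaced scores.
df = pd.DataFrame({"score": [10, 20, 30, 40, 50, 60]})
n_quantiles = 3
labels = ["bottom", "middle", "top"]  # one label per quantile

# qcut computes the break points from the data so that each bin
# holds (approximately) the same number of records.
assignments = np.asarray(
    pd.qcut(df["score"].to_numpy(), q=n_quantiles, labels=labels)
)
print(assignments)
```

With six records and three quantiles, each bin receives two records.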

Note

This method currently only supports splits where column corresponds to numerical data. For splits on categorical data values, see indices_for_unique_values().

Parameters:
  • df – dataframe whose records will be assigned to a partition.

  • column – column of the dataframe to use to assign partition of records.

  • cutoffs – cutoffs to use in partitioning records according to the data under column. When provided as a list, the specified cutoffs will partition according to (-inf, cutoffs[0]], (cutoffs[0], cutoffs[1]], … , (cutoffs[-1], inf). When provided as an int, the exact numerical break points will be determined to create cutoffs equal-sized quantiles.

  • labels – labels assigned to each bin. Default None labels each bin as a string based on its range of values. When cutoffs is a list, len(labels) must be 1 greater than len(cutoffs); when cutoffs is an int, len(labels) must equal cutoffs.
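The label-count rule follows from the number of bins each form of cutoffs produces: k cutoffs in a list create k + 1 bins, while an int n creates n quantile bins. A small sketch with a hypothetical helper, `expected_label_count`, which is not part of hiveplotlib, can be used to check labels before calling the function:

```python
from typing import List, Union


def expected_label_count(cutoffs: Union[List[float], int]) -> int:
    """Hypothetical helper: number of labels required for a given `cutoffs`."""
    if isinstance(cutoffs, int):
        # An int asks for that many quantile bins.
        return cutoffs
    # k cutoffs split the number line into k + 1 bins.
    return len(cutoffs) + 1


n_for_list = expected_label_count([5, 10])
n_for_int = expected_label_count(4)
print(n_for_list, n_for_int)
```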

Returns:

(n, 1) np.ndarray whose values are partition assignments corresponding to records in df.