Creating Hive Plots from Pandas

This notebook discusses how to create hive plots from pandas.DataFrame instances of node and edge data.

[1]:

import pandas as pd
from hiveplotlib import Edges, HivePlot, NodeCollection
from hiveplotlib.datasets import example_edge_data, example_node_data

In order to generate a HivePlot instance, we must create:

A NodeCollection instance.
An Edges instance.
A node partition variable.
A node sorting variable.

We cover each of these tasks in the sections below.

Create NodeCollection From Pandas DataFrame

To make a hiveplotlib.NodeCollection instance, we only need a pandas.DataFrame with one column of unique IDs.

By default, the NodeCollection instantiation will use the DataFrame’s index values as unique IDs, but below we explicitly set the unique_id_column parameter to a column name in our example node DataFrame:

[2]:

node_df = example_node_data()
node_df

[2]:

	unique_id	low	med	high
0	0	6.363247	14.795079	23.193620
1	1	2.695169	12.321405	21.873202
2	2	0.409326	18.010787	26.718541
3	3	0.165111	19.226066	21.949123
4	4	8.124570	12.658641	25.771102
...	...	...	...	...
95	95	9.562530	15.708242	25.857141
96	96	1.486152	10.064025	21.225680
97	97	9.716562	17.718766	29.328351
98	98	8.890456	19.772874	26.833664
99	99	8.215515	15.892802	28.229576

100 rows × 4 columns

[3]:

nodes = NodeCollection(
    data=node_df,
    unique_id_column="unique_id",  # use the `unique_id` df column
)

nodes

[3]:

hiveplotlib.NodeCollection of 100 nodes and unique ID column 'unique_id'.
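
If the IDs live in the DataFrame's index and we would rather point unique_id_column at an explicit column, plain pandas can move the index into a column first. The snippet below is a minimal sketch on a small made-up table (the data and the names indexed_df and node_df_explicit are illustrative, not part of hiveplotlib):

```python
import pandas as pd

# Hypothetical node table whose IDs live in the index rather than in a column.
indexed_df = pd.DataFrame(
    {"low": [6.4, 2.7, 0.4]},
    index=["a", "b", "c"],
)

# Name the index, then move it into a regular column called `unique_id`.
node_df_explicit = indexed_df.rename_axis("unique_id").reset_index()

print(node_df_explicit)
```

The resulting DataFrame has a unique_id column that could then be passed by name, as in the cell above.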

If the values in the unique ID column are not unique, the instantiation raises a RepeatUniqueNodeIDsError:

[4]:

node_df_with_copies = pd.concat([node_df.copy(), node_df.copy()]).sort_values(
    by="unique_id"
)
node_df_with_copies.head()

[4]:

	unique_id	low	med	high
0	0	6.363247	14.795079	23.193620
0	0	6.363247	14.795079	23.193620
1	1	2.695169	12.321405	21.873202
1	1	2.695169	12.321405	21.873202
2	2	0.409326	18.010787	26.718541

[5]:

import traceback

from hiveplotlib.exceptions import RepeatUniqueNodeIDsError

try:
    NodeCollection(data=node_df_with_copies, unique_id_column="unique_id")
except RepeatUniqueNodeIDsError:
    traceback.print_exc()

Traceback (most recent call last):
  File "/tmp/ipykernel_1890710/3006239315.py", line 6, in <module>
    NodeCollection(data=node_df_with_copies, unique_id_column="unique_id")
  File "/home/garyk/repos/hiveplotlib/src/hiveplotlib/node.py", line 160, in __init__
    raise RepeatUniqueNodeIDsError(msg)
hiveplotlib.exceptions.node.RepeatUniqueNodeIDsError: Found repeat unique IDs:
    unique_id       low        med       high
0           0  6.363247  14.795079  23.193620
0           0  6.363247  14.795079  23.193620
1           1  2.695169  12.321405  21.873202
1           1  2.695169  12.321405  21.873202
2           2  0.409326  18.010787  26.718541
..        ...       ...        ...        ...
97         97  9.716562  17.718766  29.328351
98         98  8.890456  19.772874  26.833664
98         98  8.890456  19.772874  26.833664
99         99  8.215515  15.892802  28.229576
99         99  8.215515  15.892802  28.229576

[200 rows x 4 columns]
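
To catch repeated IDs before they reach NodeCollection, we can check the column directly in pandas. This is a minimal sketch on made-up data (the names dup_df, repeated_ids, and deduplicated are illustrative); whether dropping the repeats is appropriate depends on whether the repeated rows are true copies:

```python
import pandas as pd

# Small stand-in for a node table with repeated IDs (hypothetical data).
dup_df = pd.DataFrame(
    {
        "unique_id": [0, 0, 1, 2, 2],
        "low": [6.4, 6.4, 2.7, 0.4, 0.4],
    }
)

# keep=False marks every row that shares its ID with another row.
repeat_mask = dup_df["unique_id"].duplicated(keep=False)
repeated_ids = sorted(dup_df.loc[repeat_mask, "unique_id"].unique().tolist())
print("Repeated IDs:", repeated_ids)

# Keep the first row per ID; only safe when the repeats are exact copies.
deduplicated = dup_df.drop_duplicates(subset="unique_id", keep="first")
print(deduplicated)
```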

Create Edges From Pandas DataFrame

To make a hiveplotlib.Edges instance, we need a pandas.DataFrame with a minimum of two columns—one column of origin node IDs and one column of destination node IDs:

[6]:

edge_df = example_edge_data(
    nodes=nodes,
)

edge_df

[6]:

	from	to
0	85	63
1	51	26
2	30	4
3	7	1
4	17	81
...	...	...
95	82	95
96	36	14
97	51	97
98	36	88
99	38	82

100 rows × 2 columns

Note that these node IDs correspond to the unique node IDs of the NodeCollection instance, as assigned to each node by the unique_id_column parameter discussed above.
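
Because edges refer to nodes by ID, it can be worth confirming that every edge endpoint is a known node ID before building anything. The check below is a sketch in plain pandas on made-up data (node_ids, edge_df_small, and dangling_edges are illustrative names); it does not show how hiveplotlib itself treats edges with unknown endpoints:

```python
import pandas as pd

# Hypothetical node IDs and a small edge table; the last edge starts at an unknown node.
node_ids = pd.Series([0, 1, 2, 3], name="unique_id")
edge_df_small = pd.DataFrame({"from": [0, 1, 7], "to": [2, 3, 1]})

# An edge is valid only if both of its endpoints appear among the node IDs.
known_from = edge_df_small["from"].isin(node_ids)
known_to = edge_df_small["to"].isin(node_ids)
dangling_edges = edge_df_small[~(known_from & known_to)]

print(dangling_edges)
```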

By default, the Edges instantiation expects specific names for both the DataFrame’s origin node IDs column (from) and the destination node IDs column (to).

Even though our DataFrame already has the appropriate column names, below we explicitly set these values to demonstrate how one can specify these column names when they differ from the default from and to values:

[7]:

edges = Edges(
    data=edge_df,
    from_column_name="from",  # same as default
    to_column_name="to",  # same as default
)

edges

[7]:

hiveplotlib.Edges of 100 edges.

Instead of a pandas.DataFrame, we can also provide a two-column numpy array here. For more information, see the Add Data to Edges page.
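
If we already have an edge DataFrame and want the array form, pandas can produce a two-column numpy array directly. This sketch assumes the array's first column holds origin IDs and its second holds destination IDs; that ordering is an assumption here, and the Add Data to Edges page is the place to confirm it:

```python
import pandas as pd

# First three edges from the example output above, rebuilt by hand.
edge_df_small = pd.DataFrame({"from": [85, 51, 30], "to": [63, 26, 4]})

# Select the columns explicitly so the column order of the array is unambiguous.
edge_array = edge_df_small[["from", "to"]].to_numpy()

print(edge_array.shape)
print(edge_array)
```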

Create Partition Variable

In order to make a hive plot, we must choose a partition of the nodes, which lets us split up the nodes into separate axes.

This can be done easily with the NodeCollection.create_partition_variable() method:

[8]:

partition_variable = nodes.create_partition_variable(
    data_column="low",
    cutoffs=3,  # same as default
    labels=["A", "B", "C"],  # nicer than default names
)

nodes.data.head()

[8]:

	unique_id	low	med	high	partition_0
0	0	6.363247	14.795079	23.193620	B
1	1	2.695169	12.321405	21.873202	A
2	2	0.409326	18.010787	26.718541	A
3	3	0.165111	19.226066	21.949123	A
4	4	8.124570	12.658641	25.771102	C

For more on how and why we partition node data for hive plots, see the Setting a Partition Variable page.
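
To build intuition for what a three-way split with labels looks like, here is a rough analogue in plain pandas using pd.qcut, which bins values into equal-sized quantile groups. This is only an illustration on made-up values; it assumes a quantile-based split and is not a statement of how create_partition_variable() computes its groups:

```python
import pandas as pd

# Hypothetical values standing in for the `low` column.
low = pd.Series([0.2, 0.4, 2.7, 6.4, 8.1, 9.6], name="low")

# Split into three quantile-based groups and label them A, B, C.
groups = pd.qcut(low, q=3, labels=["A", "B", "C"])

print(pd.DataFrame({"low": low, "group": groups}))
```

With six values and three groups, each label ends up on two values, from the smallest pair (A) to the largest pair (C).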

Choose Sorting Variables

In order to make a hive plot, we must choose the sorting variables, one for each axis. This lets us order and place our nodes on each axis.

We can set every axis's sorting variable to the same value by passing a single node data column name as the sorting_variables parameter when we instantiate our hive plot:

[9]:

# must choose sorting variable to place nodes on each axis
#  we will use this when we instantiate a HivePlot later
sorting_variables = "low"

For more on how and why we set the sorting variables in hive plots, see the Setting Axis Sorting Variables page.
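
As a way to picture what the sorting variable does, the sketch below orders nodes by low within each partition group using plain pandas. It uses the first five rows of the partitioned node data shown earlier (values rounded) and is an illustration of the ordering idea only, not of how hiveplotlib positions nodes along an axis:

```python
import pandas as pd

# First five rows of the partitioned node data above, with rounded values.
df = pd.DataFrame(
    {
        "unique_id": [0, 1, 2, 3, 4],
        "low": [6.36, 2.70, 0.41, 0.17, 8.12],
        "partition_0": ["B", "A", "A", "A", "C"],
    }
)

# Sort by partition first, then by the sorting variable within each partition.
ordered = df.sort_values(["partition_0", "low"])

# Collect the node IDs of each partition in their sorted order.
order_by_axis = ordered.groupby("partition_0")["unique_id"].apply(list).to_dict()

print(order_by_axis)
```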

Create HivePlot From NodeCollection and Edges

With our nodes and edges (and the partition variable and sorting variables) set, we have everything we need to generate a hiveplotlib.HivePlot instance:

[10]:

hp = HivePlot(
    nodes=nodes,  # our NodeCollection from above
    edges=edges,  # our Edges from above
    partition_variable=partition_variable,  # node column name assigned above
    sorting_variables=sorting_variables,  # node column name assigned above
)

hp.plot();

../_images/notebooks_creating_hive_plots_from_pandas_21_0.png
