Creating Hive Plots from Pandas

This notebook discusses how to create hive plots from pandas.DataFrame instances of node and edge data.

[1]:
import pandas as pd
from hiveplotlib import Edges, HivePlot, NodeCollection
from hiveplotlib.datasets import example_edge_data, example_node_data

In order to generate a HivePlot instance, we must create:

  1. A NodeCollection instance.

  2. An Edges instance.

  3. A node partition variable.

  4. A node sorting variable.

We cover each of these tasks in the sections below.

Create NodeCollection From Pandas DataFrame

To make a hiveplotlib.NodeCollection instance, we only need a pandas.DataFrame with one column of unique IDs.

By default, the NodeCollection instantiation will use the DataFrame’s index values as unique IDs, but below we explicitly set the unique_id_column parameter to a column name in our example node DataFrame:

[2]:
node_df = example_node_data()
node_df
[2]:
unique_id low med high
0 0 6.363247 14.795079 23.193620
1 1 2.695169 12.321405 21.873202
2 2 0.409326 18.010787 26.718541
3 3 0.165111 19.226066 21.949123
4 4 8.124570 12.658641 25.771102
... ... ... ... ...
95 95 9.562530 15.708242 25.857141
96 96 1.486152 10.064025 21.225680
97 97 9.716562 17.718766 29.328351
98 98 8.890456 19.772874 26.833664
99 99 8.215515 15.892802 28.229576

100 rows × 4 columns

[3]:
nodes = NodeCollection(
    data=node_df,
    unique_id_column="unique_id",  # use the `unique_id` df column
)

nodes
[3]:
hiveplotlib.NodeCollection of 100 nodes and unique ID column 'unique_id'.
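When the IDs live in the DataFrame's index rather than in a column, one option is to promote the index to a named column with pandas before choosing it as the unique ID column. The sketch below uses only pandas; the toy data and the names toy_node_df and node_name are hypothetical and not part of the example dataset.

```python
import pandas as pd

# Hypothetical node table whose IDs live in the index rather than in a column.
toy_node_df = pd.DataFrame(
    {"low": [6.4, 2.7, 0.4], "high": [23.2, 21.9, 26.7]},
    index=pd.Index(["a", "b", "c"], name="node_name"),
)

# Promote the index to a regular column so it can be referenced by name.
toy_node_df_with_ids = toy_node_df.reset_index()

# Sanity check: the candidate ID column has no repeated values.
ids_are_unique = toy_node_df_with_ids["node_name"].is_unique
print(toy_node_df_with_ids.columns.tolist(), ids_are_unique)
```

A frame prepared this way could then be passed to NodeCollection with unique_id_column="node_name", following the same pattern as the cell above.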

If the values in the unique ID column are not unique, instantiation raises a RepeatUniqueNodeIDsError:

[4]:
node_df_with_copies = pd.concat([node_df.copy(), node_df.copy()]).sort_values(
    by="unique_id"
)
node_df_with_copies.head()
[4]:
unique_id low med high
0 0 6.363247 14.795079 23.193620
0 0 6.363247 14.795079 23.193620
1 1 2.695169 12.321405 21.873202
1 1 2.695169 12.321405 21.873202
2 2 0.409326 18.010787 26.718541
[5]:
import traceback

from hiveplotlib.exceptions import RepeatUniqueNodeIDsError

try:
    NodeCollection(data=node_df_with_copies, unique_id_column="unique_id")
except RepeatUniqueNodeIDsError:
    traceback.print_exc()
Traceback (most recent call last):
  File "/tmp/ipykernel_20959/3006239315.py", line 6, in <module>
    NodeCollection(data=node_df_with_copies, unique_id_column="unique_id")
  File "/home/garyk/repos/hiveplotlib/src/hiveplotlib/node.py", line 160, in __init__
    raise RepeatUniqueNodeIDsError(msg)
hiveplotlib.exceptions.node.RepeatUniqueNodeIDsError: Found repeat unique IDs:
    unique_id       low        med       high
0           0  6.363247  14.795079  23.193620
0           0  6.363247  14.795079  23.193620
1           1  2.695169  12.321405  21.873202
1           1  2.695169  12.321405  21.873202
2           2  0.409326  18.010787  26.718541
..        ...       ...        ...        ...
97         97  9.716562  17.718766  29.328351
98         98  8.890456  19.772874  26.833664
98         98  8.890456  19.772874  26.833664
99         99  8.215515  15.892802  28.229576
99         99  8.215515  15.892802  28.229576

[200 rows x 4 columns]
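To avoid this error, we can check for repeated IDs with pandas before building a NodeCollection. The following is a minimal sketch on hypothetical toy data; keeping the first row per ID is only one possible policy, and whether it is appropriate depends on why the duplicates exist.

```python
import pandas as pd

# Hypothetical node table containing repeated IDs.
toy_df = pd.DataFrame(
    {"unique_id": [0, 0, 1, 2, 2], "low": [6.4, 6.4, 2.7, 0.4, 0.4]}
)

# Flag every row whose ID appears more than once (keep=False marks all copies).
repeat_mask = toy_df["unique_id"].duplicated(keep=False)
repeated_ids = sorted(toy_df.loc[repeat_mask, "unique_id"].unique().tolist())

# One possible fix: keep only the first row for each ID.
deduplicated_df = toy_df.drop_duplicates(subset="unique_id", keep="first")

print(repeated_ids)
print(deduplicated_df["unique_id"].is_unique)
```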

Create Edges From Pandas DataFrame

To make a hiveplotlib.Edges instance, we need a pandas.DataFrame with a minimum of two columns—one column of origin node IDs and one column of destination node IDs:

[6]:
edge_df = example_edge_data(
    nodes=nodes,
)

edge_df
[6]:
from to
0 85 63
1 51 26
2 30 4
3 7 1
4 17 81
... ... ...
95 82 95
96 36 14
97 51 97
98 36 88
99 38 82

100 rows × 2 columns

Note that these node IDs correspond to the unique node IDs of the NodeCollection instance, as assigned to each node by the unique_id_column parameter discussed above.

By default, the Edges instantiation expects specific names for both the DataFrame’s origin node IDs column (from) and the destination node IDs column (to).

Even though our DataFrame already has the appropriate column names, below we explicitly set these values to demonstrate how one can specify these column names when they differ from the default from and to values:

[7]:
edges = Edges(
    data=edge_df,
    from_column_name="from",  # same as default
    to_column_name="to",  # same as default
)

edges
[7]:
hiveplotlib.Edges of 100 edges.

Instead of a pandas.DataFrame, we can also provide a two-column numpy array as input here. For more information, see the Add Data to Edges page.
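Because edge endpoints must refer to existing node IDs, it can help to validate an edge table with pandas before handing it to Edges. The sketch below uses hypothetical toy data with non-default column names (source and target), filters out edges that point at unknown nodes, and also shows the two-column array form of the same edges. It makes no claim about what validation Edges performs itself.

```python
import pandas as pd

# Hypothetical set of known node IDs.
node_ids = pd.Series([0, 1, 2, 3], name="unique_id")

# Hypothetical edge table using non-default column names.
toy_edge_df = pd.DataFrame({"source": [0, 1, 3], "target": [2, 3, 7]})

# An edge is usable only if both endpoints are known node IDs.
known = toy_edge_df["source"].isin(node_ids) & toy_edge_df["target"].isin(node_ids)
valid_edge_df = toy_edge_df[known]

# Two-column array form of the same edges: (origin, destination) per row.
edge_array = valid_edge_df[["source", "target"]].to_numpy()
print(edge_array)
```

With column names like these, the from_column_name and to_column_name parameters shown above would be set to "source" and "target" respectively.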

Create Partition Variable

In order to make a hive plot, we must choose a partition of the nodes, which lets us split up the nodes into separate axes.

This can be done easily with the NodeCollection.create_partition_variable() method:

[8]:
partition_variable = nodes.create_partition_variable(
    data_column="low",
    cutoffs=3,  # same as default
    labels=["A", "B", "C"],  # nicer than default names
)

nodes.data.head()
[8]:
unique_id low med high partition_0
0 0 6.363247 14.795079 23.193620 B
1 1 2.695169 12.321405 21.873202 A
2 2 0.409326 18.010787 26.718541 A
3 3 0.165111 19.226066 21.949123 A
4 4 8.124570 12.658641 25.771102 C

For more on how and why we partition node data for hive plots, see the Setting a Partition Variable page.
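To build intuition for what a partition column looks like, the sketch below bins a numeric column into three labeled groups with pandas.qcut. This is only a conceptual analogy on hypothetical toy data: it assumes quantile-based binning, which may not match how create_partition_variable computes its groups.

```python
import pandas as pd

# Hypothetical values for a numeric node column.
toy_low = pd.Series([0.5, 1.5, 2.5, 4.0, 5.0, 6.0, 7.5, 8.5, 9.5], name="low")

# Split into three roughly equal-sized groups and label them.
# Assumption: quantile binning, used here purely for illustration.
toy_partition = pd.qcut(toy_low, q=3, labels=["A", "B", "C"])

partition_labels = toy_partition.astype(str).tolist()
print(partition_labels)
```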

Choose Sorting Variables

In order to make a hive plot, we must choose the sorting variables, one for each axis. This lets us order and place our nodes on each axis.

We can set every axis's sorting variable to the same value by setting the sorting_variables parameter to a node data column name when we instantiate our hive plot:

[9]:
# must choose sorting variable to place nodes on each axis
#  we will use this when we instantiate a HivePlot later
sorting_variables = "low"

For more on how and why we set the sorting variables in hive plots, see the Setting Axis Sorting Variables page.
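Conceptually, a sorting variable determines the order of the nodes within each partition group. The pandas-only sketch below illustrates that ordering on hypothetical toy data. It does not reproduce how hiveplotlib positions nodes along an axis, only the relative order that a shared sorting column implies.

```python
import pandas as pd

# Hypothetical node table with a partition column and a sorting column.
toy_nodes = pd.DataFrame(
    {
        "unique_id": [0, 1, 2, 3, 4, 5],
        "low": [6.4, 2.7, 0.4, 0.2, 8.1, 5.0],
        "partition_0": ["B", "A", "A", "A", "C", "B"],
    }
)

# Order nodes by partition group first, then by the sorting column.
ordered = toy_nodes.sort_values(by=["partition_0", "low"])

# Collect the node IDs for each group in their sorted order.
order_per_axis = ordered.groupby("partition_0")["unique_id"].apply(list).to_dict()
print(order_per_axis)
```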

Create HivePlot From NodeCollection and Edges

With our nodes and edges (and the partition variable and sorting variables) set, we have everything we need to generate a hiveplotlib.HivePlot instance:

[10]:
hp = HivePlot(
    nodes=nodes,  # our NodeCollection from above
    edges=edges,  # our Edges from above
    partition_variable=partition_variable,  # node column name assigned above
    sorting_variables=sorting_variables,  # node column name assigned above
)

hp.plot();
[Figure: the resulting hive plot]