Creating Hive Plots from Pandas#
This notebook discusses how to create hive plots from pandas.DataFrame instances of node and edge data.
[1]:
import pandas as pd
from hiveplotlib import Edges, HivePlot, NodeCollection
from hiveplotlib.datasets import example_edge_data, example_node_data
In order to generate a HivePlot instance, we must create:
A NodeCollection instance.
An Edges instance.
A node partition variable.
A node sorting variable.
We cover each of these tasks in the sections below.
Create NodeCollection From Pandas DataFrame#
To make a hiveplotlib.NodeCollection instance, we only need a pandas.DataFrame with one column of unique IDs.
By default, the NodeCollection instantiation will use the DataFrame’s index values as unique IDs, but below we explicitly set the unique_id_column parameter to a column name in our example node DataFrame:
[2]:
node_df = example_node_data()
node_df
[2]:
| | unique_id | low | med | high |
|---|---|---|---|---|
| 0 | 0 | 6.363247 | 14.795079 | 23.193620 |
| 1 | 1 | 2.695169 | 12.321405 | 21.873202 |
| 2 | 2 | 0.409326 | 18.010787 | 26.718541 |
| 3 | 3 | 0.165111 | 19.226066 | 21.949123 |
| 4 | 4 | 8.124570 | 12.658641 | 25.771102 |
| ... | ... | ... | ... | ... |
| 95 | 95 | 9.562530 | 15.708242 | 25.857141 |
| 96 | 96 | 1.486152 | 10.064025 | 21.225680 |
| 97 | 97 | 9.716562 | 17.718766 | 29.328351 |
| 98 | 98 | 8.890456 | 19.772874 | 26.833664 |
| 99 | 99 | 8.215515 | 15.892802 | 28.229576 |
100 rows × 4 columns
[3]:
nodes = NodeCollection(
    data=node_df,
    unique_id_column="unique_id",  # use the `unique_id` df column
)
nodes
[3]:
hiveplotlib.NodeCollection of 100 nodes and unique ID column 'unique_id'.
If the values in the unique ID column are not unique, instantiation raises a RepeatUniqueNodeIDsError:
[4]:
node_df_with_copies = pd.concat([node_df.copy(), node_df.copy()]).sort_values(
    by="unique_id"
)
node_df_with_copies.head()
[4]:
| | unique_id | low | med | high |
|---|---|---|---|---|
| 0 | 0 | 6.363247 | 14.795079 | 23.193620 |
| 0 | 0 | 6.363247 | 14.795079 | 23.193620 |
| 1 | 1 | 2.695169 | 12.321405 | 21.873202 |
| 1 | 1 | 2.695169 | 12.321405 | 21.873202 |
| 2 | 2 | 0.409326 | 18.010787 | 26.718541 |
[5]:
import traceback

from hiveplotlib.exceptions import RepeatUniqueNodeIDsError

try:
    NodeCollection(data=node_df_with_copies, unique_id_column="unique_id")
except RepeatUniqueNodeIDsError:
    # print the full traceback without stopping the notebook
    traceback.print_exc()
Traceback (most recent call last):
File "/tmp/ipykernel_20959/3006239315.py", line 6, in <module>
NodeCollection(data=node_df_with_copies, unique_id_column="unique_id")
File "/home/garyk/repos/hiveplotlib/src/hiveplotlib/node.py", line 160, in __init__
raise RepeatUniqueNodeIDsError(msg)
hiveplotlib.exceptions.node.RepeatUniqueNodeIDsError: Found repeat unique IDs:
unique_id low med high
0 0 6.363247 14.795079 23.193620
0 0 6.363247 14.795079 23.193620
1 1 2.695169 12.321405 21.873202
1 1 2.695169 12.321405 21.873202
2 2 0.409326 18.010787 26.718541
.. ... ... ... ...
97 97 9.716562 17.718766 29.328351
98 98 8.890456 19.772874 26.833664
98 98 8.890456 19.772874 26.833664
99 99 8.215515 15.892802 28.229576
99 99 8.215515 15.892802 28.229576
[200 rows x 4 columns]
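To find repeated IDs before instantiation rather than from the exception, a plain pandas check is usually enough. The sketch below uses a small hypothetical DataFrame (`toy_node_df`, not part of the hiveplotlib example data) and only standard pandas calls. It does not show how hiveplotlib performs its own check.

```python
import pandas as pd

# Hypothetical stand-in for a node table; only the ID column matters here.
toy_node_df = pd.DataFrame(
    {"unique_id": [0, 0, 1, 2, 2], "low": [6.4, 6.4, 2.7, 0.4, 0.4]}
)

# True only when every value in the column appears exactly once.
ids_are_unique = toy_node_df["unique_id"].is_unique

# keep=False flags every member of a duplicated group, not just the repeats.
repeated_rows = toy_node_df[toy_node_df.duplicated(subset="unique_id", keep=False)]

# One possible fix: keep the first row seen for each ID.
deduplicated_df = toy_node_df.drop_duplicates(subset="unique_id", keep="first")
```

Whether keeping the first row is the right resolution depends on the data; when repeated IDs carry different values, the rows may need to be aggregated or inspected by hand instead.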
Create Edges From Pandas DataFrame#
To make a hiveplotlib.Edges instance, we need a pandas.DataFrame with a minimum of two columns—one column of origin node IDs and one column of destination node IDs:
[6]:
edge_df = example_edge_data(
    nodes=nodes,
)
edge_df
[6]:
| | from | to |
|---|---|---|
| 0 | 85 | 63 |
| 1 | 51 | 26 |
| 2 | 30 | 4 |
| 3 | 7 | 1 |
| 4 | 17 | 81 |
| ... | ... | ... |
| 95 | 82 | 95 |
| 96 | 36 | 14 |
| 97 | 51 | 97 |
| 98 | 36 | 88 |
| 99 | 38 | 82 |
100 rows × 2 columns
Note that these node IDs correspond to the unique node IDs of the NodeCollection instance, as assigned to each node by the unique_id_column parameter discussed above.
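Because edge endpoints must correspond to node IDs, it can help to confirm that correspondence in pandas before building anything. The following is a minimal sketch on hypothetical toy data (`toy_node_ids` and `toy_edge_df` are illustrative names); it uses only pandas and makes no claim about what hiveplotlib does with unmatched edges.

```python
import pandas as pd

# Hypothetical node IDs and edge table for illustration only.
toy_node_ids = pd.Series([0, 1, 2, 3], name="unique_id")
toy_edge_df = pd.DataFrame({"from": [0, 1, 2, 7], "to": [1, 2, 3, 0]})

# Pass a plain list: DataFrame.isin aligns on the index when given a Series.
endpoints_known = (
    toy_edge_df[["from", "to"]].isin(toy_node_ids.tolist()).all(axis=1)
)

# Edges with at least one endpoint missing from the node IDs.
unmatched_edges = toy_edge_df[~endpoints_known]

# Edges whose origin and destination are both known nodes.
valid_edges = toy_edge_df[endpoints_known]
```

Here the edge from node 7 is flagged because 7 is not among the node IDs.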
By default, the Edges instantiation expects specific names for both the DataFrame’s origin node IDs column (from) and the destination node IDs column (to).
Even though our DataFrame already has the appropriate column names, below we explicitly set these values to demonstrate how one can specify these column names when they differ from the default from and to values:
[7]:
edges = Edges(
    data=edge_df,
    from_column_name="from",  # same as default
    to_column_name="to",  # same as default
)
edges
[7]:
hiveplotlib.Edges of 100 edges.
Instead of a pandas.DataFrame, we can also provide a two-column numpy array as input here. For more information, see the Add Data to Edges page.
Create Partition Variable#
In order to make a hive plot, we must choose a partition of the nodes, which lets us split up the nodes into separate axes.
This can be done easily with the NodeCollection.create_partition_variable() method:
[8]:
partition_variable = nodes.create_partition_variable(
    data_column="low",
    cutoffs=3,  # same as default
    labels=["A", "B", "C"],  # nicer than default names
)
nodes.data.head()
[8]:
| | unique_id | low | med | high | partition_0 |
|---|---|---|---|---|---|
| 0 | 0 | 6.363247 | 14.795079 | 23.193620 | B |
| 1 | 1 | 2.695169 | 12.321405 | 21.873202 | A |
| 2 | 2 | 0.409326 | 18.010787 | 26.718541 | A |
| 3 | 3 | 0.165111 | 19.226066 | 21.949123 | A |
| 4 | 4 | 8.124570 | 12.658641 | 25.771102 | C |
For more on how and why we partition node data for hive plots, see the Setting a Partition Variable page.
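For intuition about splitting a numeric column into three labeled groups, the sketch below bins a hypothetical `toy_low` series with `pandas.qcut`. This is an analogy in plain pandas, under the assumption that an integer number of cutoffs means bins at equal quantiles; it is not a description of how `create_partition_variable()` is implemented.

```python
import pandas as pd

# Hypothetical values standing in for the `low` column.
toy_low = pd.Series([6.4, 2.7, 0.4, 0.2, 8.1, 9.6, 5.0, 3.5, 7.2], name="low")

# Split into 3 quantile bins of equal count, labeled A (lowest) to C (highest).
toy_partition = pd.qcut(toy_low, q=3, labels=["A", "B", "C"])

# Number of nodes that would land on each axis.
group_sizes = toy_partition.value_counts().sort_index()
```

With quantile bins, each group receives roughly the same number of nodes, which tends to keep the axes of a hive plot comparably populated.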
Choose Sorting Variables#
In order to make a hive plot, we must choose the sorting variables, one for each axis. This lets us order and place our nodes on each axis.
We can easily set all axes’ sorting variables to the same value by setting the sorting_variables parameter to a node data column name when we instantiate our hive plot:
[9]:
# must choose sorting variable to place nodes on each axis
# we will use this when we instantiate a HivePlot later
sorting_variables = "low"
For more on how and why we set the sorting variables in hive plots, see the Setting Axis Sorting Variables page.
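To build intuition for what a shared sorting variable does, the sketch below orders hypothetical nodes within each partition group by the `low` column using plain pandas. The toy table and the `axis_orderings` name are illustrative assumptions; the actual placement of nodes along an axis is handled by hiveplotlib and is not reproduced here.

```python
import pandas as pd

# Hypothetical node table that already has a partition column.
toy_nodes = pd.DataFrame(
    {
        "unique_id": [0, 1, 2, 3, 4, 5],
        "low": [6.4, 2.7, 0.4, 0.2, 8.1, 5.0],
        "partition_0": ["B", "A", "A", "A", "C", "B"],
    }
)

# Sort by partition first, then by the sorting column within each partition.
ordered = toy_nodes.sort_values(by=["partition_0", "low"])

# For each partition (one per axis), list node IDs from lowest to highest `low`.
axis_orderings = {
    axis_name: group["unique_id"].tolist()
    for axis_name, group in ordered.groupby("partition_0")
}
```

In this toy table, partition A orders its nodes as 3, 2, 1 because their `low` values are 0.2, 0.4, and 2.7.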
Create HivePlot From NodeCollection and Edges#
With our nodes and edges (and the partition variable and sorting variables) set, we have everything we need to generate a hiveplotlib.HivePlot instance:
[10]:
hp = HivePlot(
    nodes=nodes,  # our NodeCollection from above
    edges=edges,  # our Edges from above
    partition_variable=partition_variable,  # node column name assigned above
    sorting_variables=sorting_variables,  # node column name assigned above
)
hp.plot();