Hive Plots for Large Networks

hiveplotlib scales well when it comes to constructing hive plots thanks to it’s mostly numpy-based backend, but as we scale the size of the network, we need to address a separate concern about how one can visualize a dense, edge- and node-heavy hive plot in a way that shows meaningful nuance for exploration.

This notebook explores the datashader-based solution to this visualization challenge implemented in hiveplotlib.

Note: this notebook requires that hiveplotlib be installed with the datashader package, which can be done by running:

pip install hiveplotlib[datashader].

[1]:
import io
from copy import copy

import matplotlib.pyplot as plt
import numpy as np
import requests
import seaborn as sns
from hiveplotlib import Node, hive_plot_n_axes
from hiveplotlib.node import split_nodes_on_variable
from hiveplotlib.viz import axes_viz, hive_plot_viz
from hiveplotlib.viz.datashader import (
    datashade_edges_mpl,
    datashade_hive_plot_mpl,
)
from matplotlib.patches import Rectangle
from scipy.signal import convolve2d

Data (Wikipedia Squirrel Pages)

We will use a dataset of Wikipedia pages, which is introduced in Rozemberczki, Allen, and Sarkar (2021) as:

“Wikipedia page-page networks on three topics: chameleons, crocodiles and squirrels. Nodes represent articles from the English Wikipedia (December 2018), edges reflect mutual links between them. Node features indicate the presence of particular nouns in the articles and the average monthly traffic (October 2017 - November 2018). The regression task is to predict the log average monthly traffic (December 2018).”

We will focus solely on the dataset of squirrel pages.

[2]:
path = "http://graphmining.ai/datasets/ptg/wiki/"
[3]:
%%time

# which animal wikipedia dataset to use
animal = "squirrel"

# disable warnings for when website https credentials are down
requests.packages.urllib3.disable_warnings()
response = requests.get(path + f"{animal}.npz", verify=False)
CPU times: user 131 ms, sys: 69.8 ms, total: 201 ms
Wall time: 12.7 s
[4]:
squirrel_data = np.load(io.BytesIO(response.content))

The data are broken into five labels. We will break out just three of these labels to make a more traditional, three-axis hive plot for this example, with nodes sorted by degree on each axis.

[5]:
degree = dict(zip(*np.unique(squirrel_data["edges"], return_counts=True)))
nodes = [
    Node(unique_id=i, data={"label": squirrel_data["label"][i], "degree": degree[i]})
    for i in range(squirrel_data["label"].size)
]
edges = squirrel_data["edges"]

# keep just 3 labels
labels_to_keep = [0, 2, 4]

partioned_data = split_nodes_on_variable(nodes, "label")
print("Labels in data: ", list(partioned_data.keys()))
nodes_for_hiveplot = [partioned_data[label] for label in labels_to_keep]
Labels in data:  [1, 3, 0, 4, 2]

Visualization Challenges in Pure Matplotlib

Although the hiveplotlib.viz module scales well computationally to plot larger HivePlot instances fast, for a dataset this big, it is visually complicated to show what’s in the data with scatter plots for nodes and line plots for edges. Below, we demonstrate some trial and error where we attempt to make this form of visualization viable.

Default Hiveplotlib Kwargs

First, we will plot using all defaults for nodes and edges.

[6]:
axes_names = [f"Label {label}" for label in labels_to_keep]

hp = hive_plot_n_axes(
    node_list=nodes,
    edges=edges,
    axes_assignments=nodes_for_hiveplot,
    sorting_variables=["degree"] * 3,
    axes_names=axes_names,
)

# edge set appears to be incorrectly repeated symmetrically
#  from looking at the within-label edges
#  so we will remove the edges in one direction
for i, name in enumerate(axes_names):
    hp.reset_edges(
        axis_id_1=name, axis_id_2=axes_names[(i + 1) % len(axes_names)], a2_to_a1=False
    )
[7]:
# count how many edges go in our final plot
total_edges = 0
for g1 in hp.edges:
    for g2 in hp.edges[g1]:
        total_edges += hp.edges[g1][g2][0]["ids"].shape[0]
[8]:
%%time
fig, ax = hive_plot_viz(hp)
fig.suptitle(
    f"Wikipedia Network of {animal.capitalize()} Pages ({total_edges:,} Edges)",
    x=0.125,
    size=20,
    ha="left",
)
ax.set_title(
    "Default hiveplotlib kwargs wash out any nuance",
    size=14,
    y=1.05,
    loc="left",
    color="gray",
)
plt.show()
_images/hive_plots_for_large_networks_12_0.png
CPU times: user 3.24 s, sys: 169 ms, total: 3.41 s
Wall time: 3.44 s

Lower the Alpha Value

The above figure suffers from severe oversaturation. One way we can attempt to tackle this is by lowering the alpha value of both lines and edges in our hive plot.

[9]:
%%time

# smallest possible matplotlib alpha value is 1/256
hp = hive_plot_n_axes(
    node_list=nodes,
    edges=edges,
    axes_assignments=nodes_for_hiveplot,
    sorting_variables=["degree"] * 3,
    axes_names=axes_names,
    all_edge_kwargs={"alpha": 1 / 256},
)

# edge set appears to be incorrectly repeated symmetrically
#  from looking at the within-label edges
#  so we will remove the edges in one direction
for i, name in enumerate(axes_names):
    hp.reset_edges(
        axis_id_1=name, axis_id_2=axes_names[(i + 1) % len(axes_names)], a2_to_a1=False
    )

fig, ax = hive_plot_viz(hp, node_kwargs={"c": "C1", "alpha": 0.01})
fig.suptitle(
    f"Wikipedia Network of {animal.capitalize()} Pages", x=0.125, size=20, ha="left"
)
ax.set_title(
    "Alpha solution has both oversaturation and undersaturation\n"
    "even using the smallest possible alpha value in matplotlib",
    size=14,
    y=1.02,
    loc="left",
    color="gray",
)
plt.show()
_images/hive_plots_for_large_networks_14_0.png
CPU times: user 4.31 s, sys: 399 ms, total: 4.7 s
Wall time: 4.71 s

Even when lowering the alpha value as low as possible in matplotlib, some areas on the plot still appear to be oversaturated. Furthermore, we’ve now created a problem of undersaturation in other parts of the figure, especially with the outermost nodes on each axis, which have all but disappeared.

Thinner Lines

Another option is to make the lines thinner, which can also reduce oversaturation (as multiple lines become unlikely to overlap each other), but we still suffer from undersaturation on relatively isolated lines.

[10]:
%%time

hp = hive_plot_n_axes(
    node_list=nodes,
    edges=edges,
    axes_assignments=nodes_for_hiveplot,
    sorting_variables=["degree"] * 3,
    axes_names=axes_names,
    all_edge_kwargs={"alpha": 0.8, "lw": 0.0005},
)

# edge set appears to be incorrectly repeated symmetrically
#  from looking at the within-label edges
#  so we will remove the edges in one direction
for i, name in enumerate(axes_names):
    hp.reset_edges(
        axis_id_1=name, axis_id_2=axes_names[(i + 1) % len(axes_names)], a2_to_a1=False
    )

fig, ax = hive_plot_viz(hp, node_kwargs={"c": "C1", "alpha": 0.01})
fig.suptitle(
    f"Wikipedia Network of {animal.capitalize()} Pages", x=0.125, size=20, ha="left"
)
ax.set_title(
    "Thinner line solution can solve oversaturation issues\n"
    "but at the cost of severe undersaturation",
    size=14,
    y=1.02,
    loc="left",
    color="gray",
)
plt.show()
_images/hive_plots_for_large_networks_17_0.png
CPU times: user 3.61 s, sys: 364 ms, total: 3.98 s
Wall time: 3.98 s

Datashading

By pulling in the capabilities of datashader, we can look at the density of our edges and nodes in our figure with nonlinear scaling.

The hiveplotlib.viz.datashader module includes the ability to datashade nodes or edges via the datashade_nodes_mpl() and datashade_edges_mpl() methods, respectively. Here, we demonstrate datashading both nodes and edges via the datashade_hive_plot_mpl() method.

Below, we remake the above hive plot of Wikipedia squirrel pages using these datashading capabilities.

[11]:
%%time

fig, ax, im_nodes, im_edges = datashade_hive_plot_mpl(hp)

cax_edges = ax.inset_axes([0.85, 0.25, 0.2, 0.01], transform=ax.transAxes)
cb_edges = fig.colorbar(im_edges, ax=ax, cax=cax_edges, orientation="horizontal")
cb_edges.ax.set_title("Edge Density")

cax_nodes = ax.inset_axes([0.85, 0.15, 0.2, 0.01], transform=ax.transAxes)
cb_nodes = fig.colorbar(im_nodes, ax=ax, cax=cax_nodes, orientation="horizontal")
cb_nodes.ax.set_title("Node Density")

fig.suptitle(
    f"Wikipedia Network of {animal.capitalize()} Pages", x=0.125, size=20, ha="left"
)
ax.set_title(
    "Datashader solution shows log(density) for both nodes and edges\n"
    "which avoids trading off oversaturation with undersaturation",
    size=14,
    y=1.02,
    loc="left",
    color="gray",
)

ax.text(
    x=0.1,
    y=-0.1,
    s=f"Subset of Wikipedia data. Looking at 3 of 5 groups of pages about {animal}s. "
    f"{total_edges:,} edges in total.\n"
    "An edge corresponds to mutual linking between two pages. "
    "Nodes placed on each axis sorted by degree.\n"
    "Data from: https://graphmining.ai/datasets/ptg/wiki\n"
    "Reference: https://arxiv.org/abs/1909.13021",
    size=9,
    color="gray",
    ha="left",
    transform=ax.transAxes,
)

plt.show()
_images/hive_plots_for_large_networks_19_0.png
CPU times: user 9.55 s, sys: 878 ms, total: 10.4 s
Wall time: 10.4 s

The ability for us to look at the log(density) allows us to see the full range of density values for nodes and edges without washing out any of the differences in density at the low-density end.

pixel_spread Fixes Discontinuous Binned Lines

matplotlib.pyplot.imshow() tends to lose smaller, more subtle pixel features when the pixels in the final figure are sufficiently small. For our use case, e.g. a 2d histogram of edges, this would lead to some lines looking choppy / discontinuous. This is highly likely to become an issue for datashading with hive plots.

The solution essentially is to make our pixel features less subtle.

Below, we demonstrate the choppiness that shows up in the raw output of our datashaded edges from our example hive plot above.

[12]:
fig, ax = plt.subplots(
    1, 2, figsize=(10, 6), gridspec_kw={"width_ratios": [6, 4]}, dpi=300
)

axes_viz(hp, fig=fig, ax=ax[0], axes_labels_fontsize=10)

_, _, im = datashade_edges_mpl(
    hp,
    figsize=(6, 6),
    pixel_spread=0,
    fig=fig,
    ax=ax[0],
)

rect = Rectangle(
    (0.5, -4), width=1, height=2, facecolor="None", edgecolor="red", alpha=0.8, ls="--"
)
ax[0].add_patch(copy(rect))

cb = fig.colorbar(im, ax=ax[0], shrink=0.7, pad=0.15, orientation="horizontal")
cb.ax.set_title("Count")

ax[0].set_title("Matplotlib imshow\nbreaks apart thin features...", loc="left")

axes_viz(hp, fig=fig, ax=ax[1], show_axes_labels=False)

_, _, im = datashade_edges_mpl(
    hp,
    figsize=(6, 6),
    pixel_spread=0,
    fig=fig,
    ax=ax[1],
)

ax[1].add_patch(copy(rect))

boundary_rect = Rectangle(
    (0.5, -4.1),
    width=1.0,
    height=2.2,
    facecolor="None",
    edgecolor="red",
    alpha=0.8,
    ls="--",
)
extent = np.array(boundary_rect.get_bbox())
ax[1].set_xlim(*extent[:, 0])
ax[1].set_ylim(*extent[:, 1])
ax[1].set_title(
    "...even though, when zooming in,\nlines are in fact complete", loc="left"
)
plt.show()
_images/hive_plots_for_large_networks_22_0.png

The thinner, more isolated line densities show up as visually discontinuous (left), even though the underlying 2d histogram shows the expected continuity when we zoom in (right).

The way we handle this as exposed in the hiveplotlib.viz.datashader module is with the pixel_spread parameter. This makes a call to the datashader.transfer_functions.spread() function, which essentially will thicken up the lines before binning.

The resulting thicker lines solves this imshow() false visual discontinuities issue, as demonstrated below.

[13]:
fig, ax = plt.subplots(
    1, 2, figsize=(10, 6), gridspec_kw={"width_ratios": [6, 4]}, dpi=300
)

axes_viz(hp, fig=fig, ax=ax[0], axes_labels_fontsize=10)

_, _, im = datashade_edges_mpl(
    hp,
    figsize=(6, 6),
    pixel_spread=2,
    fig=fig,
    ax=ax[0],
)

rect = Rectangle(
    (0.5, -4), width=1, height=2, facecolor="None", edgecolor="red", alpha=0.8, ls="--"
)
ax[0].add_patch(copy(rect))

cb = fig.colorbar(im, ax=ax[0], shrink=0.7, pad=0.15, orientation="horizontal")
cb.ax.set_title("Count")

ax[0].set_title("`pixel_spread` fixes `imshow` issues...", loc="left")

axes_viz(hp, fig=fig, ax=ax[1], show_axes_labels=False)

_, _, im = datashade_edges_mpl(
    hp,
    figsize=(6, 6),
    pixel_spread=2,
    fig=fig,
    ax=ax[1],
)

ax[1].add_patch(copy(rect))

boundary_rect = Rectangle(
    (0.5, -4.1),
    width=1.0,
    height=2.2,
    facecolor="None",
    edgecolor="red",
    alpha=0.8,
    ls="--",
)
extent = np.array(boundary_rect.get_bbox())
ax[1].set_xlim(*extent[:, 0])
ax[1].set_ylim(*extent[:, 1])
ax[1].set_title("...by generating thicker lines", loc="left")
plt.show()
_images/hive_plots_for_large_networks_24_0.png

Beware pixel_spread Affecting Bin Counts When Comparing Hive Plots

Although our second picture comes out much cleaner, note that these thicker lines lead to different bin counts, as exhibited by the different scales on each colorbar corresponding to the figures with and without pixel_spread. There is nothing wrong with this, but one must be careful when visually comparing two datashaded hive plots:

When comparing multiple hive plots, if one wants to compare relative counts, one must be sure to use equivalent bin sizes and pixel_spread values.

Motivation Behind Datashading

To better motivate a lot of what the hiveplotlib.datashader module (and the concepts behind datashader) does on our behalf, we will perform a more explicit, less-elegant version of these calculations manually for the edges of our hive plot example:

[14]:
%%time

fig, ax, im_edges = datashade_edges_mpl(hp)

axes_viz(hp, fig=fig, ax=ax)

cax_edges = ax.inset_axes([0.85, 0.25, 0.2, 0.01], transform=ax.transAxes)
cb_edges = fig.colorbar(im_edges, ax=ax, cax=cax_edges, orientation="horizontal")
cb_edges.ax.set_title("Edge Density")

fig.suptitle(
    f"Wikipedia Network of {animal.capitalize()} Pages", x=0.125, size=20, ha="left"
)
ax.set_title(
    "We will perform a manual version of datashading\n"
    "aimed at replicating just these edges",
    size=14,
    y=1.02,
    loc="left",
    color="gray",
)

plt.show()
_images/hive_plots_for_large_networks_27_0.png
CPU times: user 3.07 s, sys: 575 ms, total: 3.65 s
Wall time: 3.65 s

Right-Skewed Overlap of Edges

Regardless of how much we try to resolve the oversaturation / undersaturation tradeoff with solutions using alpha and linewidth, those solutions will tend to cause issues of underutilized range, that is, we will either be unable to distinguish the relative difference in frequency among the areas with less lines or among the areas with more lines.

This comes from the skewing of the distribution of edge density values. Since the density of edges on the figure can’t be less than 0 (we can’t have a negative number of edges anywhere in our figure), but we can have arbitrarily large numbers of edges overlapping in one area, this visualization challenge tends to be inherently right-skewed.

This is particularly common with large datasets, as more values increase the chance of more overlap, but we are still likely to have areas with minimal overlap as well. The only time we would not expect a problem is when the variables visualized are entirely uncorrelated with each other, thus having an even density throughout the figure. In other words, we hope to have a problem of right-skewed density if we expect to visualize any patterns with our network edges.

Binning the Data

The first step to making our datashader-esque visualization is the rasterization of the data. This requires breaking the 2-dimensional space spanned by the edges of the hive plot into a 2-dimensional histogram.

Below, we first bin the edges into a 2d histogram. We then plot both the distribution of the counts of those bins, as well as the 2d histogram itself with colors spanning the full range of bin counts.

Note: below we also do a proxy for “spreading” of the pixels via 2d convolution. This is not equivalent to what the datashader backend to the hiveplotlib.datashader module performs, but it will be sufficient for this proof-of-concept. For more on how the hiveplotlib.datashader module performs this task, see the pixel_spread section from earlier in the notebook.

[15]:
# grab the range of the above figure
#  we will bin along the same range
xlim = ax.get_xlim()
ylim = ax.get_ylim()
[16]:
all_curves = [
    hp.edges[i][j][t]["curves"]
    for i in hp.edges
    for j in hp.edges[i]
    for t in hp.edges[i][j]
]

all_curves = np.row_stack(all_curves)

hist, _, _ = np.histogram2d(
    all_curves[:, 1], all_curves[:, 0], bins=500, range=[xlim, ylim]
)

fig, ax = plt.subplots(1, 2, figsize=(10, 3), gridspec_kw={"width_ratios": [6, 4]})

ax[0].hist(np.ma.masked_where(hist == 0, hist).flatten(), 50, edgecolor="black")

ax[0].set_title("Distribution of bin counts is right-skewed", loc="left")

# proxy for `pixel_spread` to clean up picture
# see the `pixel_spread` section for how this is actually done
#  in the `hiveplotlib.datashader` code
hist_spread = convolve2d(hist, np.ones((4, 4)))

im = ax[1].imshow(
    np.ma.masked_where(hist_spread == 0, hist_spread),
    cmap=sns.color_palette("ch:start=.2,rot=-.3", as_cmap=True),
    origin="lower",
    extent=[*xlim, *ylim],
)
axes_viz(hp, fig=fig, ax=ax[1], axes_labels_fontsize=8, alpha=0.2)
cb = fig.colorbar(im, ax=ax, shrink=0.7, pad=0.1, ticks=[])
cb.ax.set_title("Count")
ax[1].axis("off")
ax[1].set_title(
    "Outlier high-count bins restrict\ncolor range for low-count bins", loc="left"
)
plt.show()
_images/hive_plots_for_large_networks_30_0.png

In the above picture, we can see on the left that we have a small number of high-count bins. On the right, we see a familiar result caused by plotting those outliers without oversaturation: we have restricted the range of colors available to the vast majority of the edges to less than 1/4 of the colorbar, which washes out any relative count information among the low-count bins.

The question then becomes: how can we modify our color values without washing out the relative density information on the high or low end?

Changing vmax

One way to tackle this is to cap all values at or above a value to be the same maximum color of our color map, (vmax in matplotlib).

Although this can be productive in some cases, it is first and foremost a highly-manual and data-dependent fix. Second, one still risks losing the relative count information among the high-density or low-density values with a poor choice of vmax.

For our example, we have enough high outlier values that by setting a vmax appropriate to tease apart the low-density values, we lose the nuance amongst the high-density values, as shown below.

[17]:
fig, ax = plt.subplots(figsize=(6, 6))

im = ax.imshow(
    np.ma.masked_where(hist_spread == 0, hist_spread),
    extent=[*xlim, *ylim],
    cmap=sns.color_palette("ch:start=.2,rot=-.3", as_cmap=True),
    origin="lower",
    vmax=500,
)

axes_viz(hp, fig=fig, ax=ax, axes_labels_fontsize=8, alpha=0.2)

cb = fig.colorbar(im, shrink=0.7, extend="max", pad=0.15)

cb.ax.set_title("Count")
ax.axis("off")
ax.set_title(
    "Setting vmax at a level that allows us to compare low-density bins\n"
    "causes us to lose any variation among high-density bins",
    loc="left",
)
plt.show()
_images/hive_plots_for_large_networks_33_0.png

Logarithmic Color Scale

Now that we are visualizing from a 2d histogram of data rather than a collection of lines, we can apply a function over that histogram. In this case, we will use a popular function for right-skewed data—taking the logarithm of the data.

Below, we repeat the earlier figure, plotting the distribution of bin counts plus the 2d histogram plot, only this time, we use the \(log_{10}\) histogram values.

[18]:
# clean up warnings, we aren't actually dividing by zero as we mask 0 vals
np.seterr(divide="ignore")

fig, ax = plt.subplots(1, 2, figsize=(10, 3), gridspec_kw={"width_ratios": [6, 4]})
ax[0].hist(
    np.log10(np.ma.masked_where(hist == 0, hist).flatten()), 50, edgecolor="black"
)
ax[0].set_title("Log(bin counts) is much less skewed", loc="left")

im = ax[1].imshow(
    np.log10(np.ma.masked_where(hist_spread == 0, hist_spread)),
    cmap=sns.color_palette("ch:start=.2,rot=-.3", as_cmap=True),
    origin="lower",
    extent=[*xlim, *ylim],
)
axes_viz(hp, fig=fig, ax=ax[1], axes_labels_fontsize=8, alpha=0.2)
cb = fig.colorbar(im, ax=ax, shrink=0.7, pad=0.1)
cb.ax.set_title("Log(Count)", size=8)
ax[1].axis("off")
ax[1].set_title("Outlier high counts visible\nwithout losing low counts", loc="left")
plt.show()
_images/hive_plots_for_large_networks_35_0.png

Not only is our log(bin counts) plot not skewed, but also, as one would thus expect, we can now view the unrestricted range of bin counts while still seeing relative edge density both among the low-density and high-density areas of the figure.

References

Fey, Matthias, and Jan Eric Lenssen. “Fast graph representation learning with PyTorch Geometric.” arXiv preprint arXiv:1903.02428 (2019).

Rozemberczki, Benedek, Carl Allen, and Rik Sarkar. “Multi-scale attributed node embedding.” Journal of Complex Networks 9.2 (2021): cnab014.