Fast PageRank Implementation in Python
I needed a fast PageRank for the Wikisim project. It had to be fast enough to run in real time on relatively large graphs. NetworkX was the obvious library to use, but it required translating back and forth between my graph representation (a fairly standard CSR matrix) and its internal graph data structure, and these translations were slowing the whole process down.
I implemented two versions of the algorithm in Python, both inspired by the fast sparse solutions in Cleve Moler's book, Experiments with MATLAB. The power method is much faster and has enough precision for our task.
Personalized PageRank
I modified the algorithm slightly so that it can also calculate personalized PageRank.
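For example, with the pagerank_power function presented later in this post, the personalization vector is passed directly. A minimal sketch (the graph and the vector here are made up for illustration):

import numpy as np
import scipy.sparse as sparse
from fast_pagerank import pagerank_power

# A toy 4-node graph as a CSR adjacency matrix
A = sparse.csr_matrix(np.array([[0., 1., 0., 0.],
                                [0., 0., 1., 0.],
                                [1., 0., 0., 1.],
                                [0., 0., 0., 0.]]))
personalize = np.array([0.6, 0.1, 0.1, 0.2])  # bias the teleport towards node 0
scores = pagerank_power(A, p=0.85, personalize=personalize)
print(scores)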
Comparison with Popular Python Implementations: NetworkX and iGraph
Both implementations (the exact solution and the power method) are much faster than their corresponding methods in NetworkX. The power method is also faster than the iGraph native implementation, which is likewise an eigenvector-based solution. Benchmarking was done on a machine with 4 CPUs (2.70GHz) and 16GB of RAM.
What is the major drawback of NetworkX PageRank?
I gave up on NetworkX for one simple reason: I had to calculate PageRank several times, and my internal representation of a graph was a simple sparse matrix. Every time I wanted to calculate PageRank I had to translate the matrix to NetworkX's graph representation, which was slow. My benchmarking shows that NetworkX has a pretty fast implementation of PageRank, but translating its own graph data structure to a CSR matrix before doing the actual calculations is exactly what slows the whole algorithm down.
Note: I didn't count the time spent in nx.from_scipy_sparse_array (converting a CSR matrix before passing it to NetworkX's PageRank) in my benchmarking, but I could have, because that conversion was another bottleneck for me, and it would be for any other use case that starts from a CSR adjacency matrix.
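To get a feel for how expensive that conversion alone is, here is a quick timing sketch (the size and density are arbitrary, purely illustrative):

import timeit
import scipy.sparse as sparse
import networkx as nx

A = sparse.random(1000, 1000, density=0.1, format='csr')
# Time only the CSR-to-NetworkX conversion, not PageRank itself
t = timeit.timeit(
    lambda: nx.from_scipy_sparse_array(A, create_using=nx.DiGraph()),
    number=5) / 5
print('conversion alone: %.3f seconds' % t)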
Python Implementation
The Python package is hosted at https://github.com/asajadi/fast-pagerank and you can find the installation guide in the README.md file. You can also find this Jupyter notebook in the notebook directory.
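Assuming the package is published on PyPI under the same name (the README is the authoritative source), installation and import should look like:

!pip install fast-pagerank
from fast_pagerank import pagerank, pagerank_power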
Appendix
What is Google PageRank Algorithm?
PageRank is a link analysis algorithm, primarily used to rank search engine results. It is defined as a process in which, starting from a random node, a random walker moves to a random neighbour with probability \(\alpha\) or jumps to a random vertex with probability \(1-\alpha\). The PageRank values are the limiting probabilities of finding the walker at each node. In the original PageRank the jump can be to any node with uniform probability; in Personalized PageRank, introduced later, it can follow any custom probability distribution over the nodes.
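To make the definition concrete, here is a tiny Monte Carlo sketch (the graph and the parameters are made up): if we simulate the walk long enough, the visit frequencies approach the PageRank values.

import random

alpha = 0.85
neighbors = {0: [1], 1: [2], 2: [0, 3], 3: []}  # toy graph; node 3 has no out-links
counts = [0, 0, 0, 0]
node = 0
steps = 100000
for _ in range(steps):
    if neighbors[node] and random.random() < alpha:
        node = random.choice(neighbors[node])  # follow a random out-edge
    else:
        node = random.randrange(4)             # teleport to a uniformly random node
    counts[node] += 1
print([c / steps for c in counts])             # approximates the PageRank vector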
How Is Google PageRank Calculated? [1, 2]
Let \(\mathbf{A}\) be the adjacency matrix (\(\mathbf{A}_{ij}\) is the weight of the edge from node \(i\) to node \(j\)) and \(\vec{s}\) be the teleport probability vector, that is, \(\vec{s}_i\) is the probability of jumping to node \(i\). The probability of being at node \(j\) at time \(t+1\) is determined by two factors:
- The sum, over the in-neighbors \(i\) of \(j\), of the probability that the walk was at \(i\) at time \(t\), times the probability that it moved from \(i\) to \(j\) at time \(t+1\).
- The probability of teleporting to \(j\) from somewhere else in the graph.
Formally:
\[\begin{equation} \vec{p}_{t+1}(j)=\alpha\sum_{i:(i,j)\in E}\frac{A(i,j)}{d(i)}\vec{p}_t(i)+(1-\alpha)\vec{s}_j, \end{equation}\]where \(d(i)\) is the out-degree of node \(i\). To put this in matrix form, we define \(\mathbf{D}\) to be the diagonal matrix with the out-degree of each node of \(\mathbf{A}\) on the diagonal. Then the PageRank vector, initialized with \(\vec{s}\), can be obtained from the following recursion:
\[\begin{equation} \vec{pr}_{t+1}=\alpha \mathbf{A}^T \mathbf{D}^{-1}\vec{pr}_{t}+(1-\alpha)\vec{s}. \end{equation}\]There is a serious problem we need to take care of: \(\mathbf{D}^{-1}\) is the inverse of \(\mathbf{D}\), which for a diagonal matrix simply means inverting the elements on the diagonal. This breaks if some nodes have no out-neighbors, a.k.a. dangling nodes. What happens when you hit a page with no outgoing link? You have only one option: jump to a random page.
To simulate this behavior we alter \(\mathbf{A}\) by adding an edge from every dangling node to every node \(j\) with weight \(\vec{s}_j\). In other words, we create \(\mathbf{\bar{A}}\) by replacing each all-zero row with \(\vec{s}^T\). Formally, if we define \(\vec{r}\) to be the vector of row-wise sums of \(\mathbf{A}\), that is, \(\vec{r}_i=\sum_{j}A_{ij}\), then:
\[\begin{align} \mathbf{\bar{A}}&=\mathbf{A}+\mathbf{B}\\ \mbox{where}\\ \mathbf{B}_{ij} &= \begin{cases} \vec{s}_j & \mbox{if } r_i=0 \\ 0 & \mbox{else} \end{cases} \end{align}\]We also need to adjust \(\mathbf{D}\): in its new definition we ignore nodes with no out-neighbors (in other words, we replace \(\frac{1}{0}\) by \(0\)). Analogously, we define \(\mathbf{\bar{D}}\) to be the diagonal matrix of the out-degrees of \(\mathbf{\bar{A}}\). We can then rewrite the recursion as:
\[\begin{equation} \vec{pr}_{t+1}=\alpha \mathbf{\bar{A}}^T \mathbf{\bar{D}}^{-1}\vec{pr}_{t}+(1-\alpha)\vec{s}. \tag{I}\label{I} \end{equation}\]Now \(\vec{pr}\), the stationary probability vector (i.e., the one satisfying \(\vec{pr}_{t+1}=\vec{pr}_t=\vec{pr}\)), can be calculated by either of the following approaches:
1. Linear System Solving
We can solve Eq. \(\eqref{I}\) and get:
\[\begin{equation} \vec{pr}=(I-\alpha\mathbf{\bar{A}}^T \mathbf{\bar{D}}^{-1})^{-1}(1-\alpha)\vec{s}, \end{equation}\]and use a linear system solver to calculate \(\vec{pr}\).
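As a concrete illustration, here is a minimal dense NumPy sketch of this solve on a toy 3-node graph (illustrative only; the sparse treatment comes below):

import numpy as np

alpha = 0.85
A = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [0., 0., 0.]])             # node 2 is dangling
n = A.shape[0]
s = np.ones(n) / n                       # uniform teleport vector
A_bar = A.copy()
A_bar[A.sum(axis=1) == 0] = s            # replace all-zero rows with s^T
D_bar_inv = np.diag(1.0 / A_bar.sum(axis=1))
pr = np.linalg.solve(np.eye(n) - alpha * A_bar.T @ D_bar_inv,
                     (1 - alpha) * s)
print(pr / pr.sum())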
2. Power-Method
Basically, iterating Eq. \(\eqref{I}\) until it converges.
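The same toy example, iterated until convergence (again a dense sketch, just to mirror Eq. \(\eqref{I}\)):

import numpy as np

alpha = 0.85
A_bar = np.array([[0., 1., 0.],
                  [0., 0., 1.],
                  [1/3, 1/3, 1/3]])      # the dangling row is already replaced by s^T
n = A_bar.shape[0]
s = np.ones(n) / n
M = A_bar.T / A_bar.sum(axis=1)          # \bar{A}^T \bar{D}^{-1}
pr = s.copy()
for _ in range(100):
    new_pr = alpha * M @ pr + (1 - alpha) * s
    if np.linalg.norm(new_pr - pr) < 1e-10:  # converged
        break
    pr = new_pr
print(new_pr)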
How Can Google PageRank Be Calculated Fast? [3]
To speed things up, we need to take full advantage of sparse matrix calculations. The only problem with the current formulation is that \(\mathbf{\bar{A}}\) is less sparse than the original \(\mathbf{A}\). However, we can rearrange the equation a little to avoid forming this matrix at all. We know that:
\[\begin{align} \mathbf{\bar{A}}^T \mathbf{\bar{D}}^{-1} &= (\mathbf{A}^T+\mathbf{B}^T)\mathbf{\bar{D}}^{-1}\\ &= \mathbf{A}^T\mathbf{\bar{D}}^{-1} +\mathbf{B}^T\mathbf{\bar{D}}^{-1}. \end{align}\]For the first term, multiplying by the diagonal matrix \(\mathbf{\bar{D}}^{-1}\) only scales each column, and \(\mathbf{\bar{D}}\) differs from \(\mathbf{D}\) only in the elements whose corresponding columns of \(\mathbf{A}^T\) are all zero, so we can safely replace \(\mathbf{\bar{D}}\) with \(\mathbf{D}\). For the second term, \(\mathbf{B}^T\mathbf{\bar{D}}^{-1}=\mathbf{B}^T\), because the nonzero columns of \(\mathbf{B}^T\) are all \(\vec{s}\), which sums to \(1\), so their corresponding diagonal elements of \(\mathbf{\bar{D}}\) equal \(1\). Therefore,
\[\begin{align} \mathbf{\bar{A}}^T \mathbf{\bar{D}}^{-1} &= \mathbf{A}^T\mathbf{D}^{-1} +\mathbf{B}^T, \end{align}\]and using the above equation we can rewrite Eq. \(\eqref{I}\) to get
\[\begin{align} \vec{pr}_{t+1} &= \alpha \mathbf{A}^T\mathbf{D}^{-1}\vec{pr}_{t} +\alpha\mathbf{B}^T\vec{pr}_{t} +(1-\alpha)\vec{s}. \tag{II}\label{II} \end{align}\]This recursion has two matrix-vector multiplications, and the one involving \(\mathbf{B}\) is rather expensive: \(\mathbf{B}\) is an \(n\times n\) matrix whose dangling-node rows are dense, so in the worst case that product is \(O(n^2)\).
Since \(\vec{pr}_t\) is a normalized vector, we know that \(\vec{1}^T\vec{pr}_t=1\). We can therefore multiply the last term of Eq. \(\eqref{II}\) by \(\vec{1}^T\vec{pr}_t\) and factor out \(\vec{pr}_t\):
\[\begin{align} \vec{pr}_{t+1} &= \alpha \mathbf{A}^T\mathbf{D}^{-1}\vec{pr}_t +\alpha\mathbf{B}^T\vec{pr}_t +(1-\alpha)\vec{s}\vec{1}^T\vec{pr}_t \\ &= \alpha \mathbf{A}^T\mathbf{D}^{-1}\vec{pr}_t+ (\alpha\mathbf{B}^T+ (1-\alpha)\vec{s}\vec{1}^T)\vec{pr}_t. \tag{III}\label{III} \end{align}\]Let \(\mathbf{C}=\alpha\mathbf{B}^T+(1-\alpha)\vec{s}\vec{1}^T\). Notice that \(\vec{s}\vec{1}^T\) is a matrix whose columns are all \(\vec{s}\); substituting the definition of \(\mathbf{B}\), the matrix \(\mathbf{C}\) is:
\[\begin{align} \mathbf{C}_{ij} &= \begin{cases} \vec{s}_i & \mbox{if } r_j=0 \\ (1-\alpha)\vec{s}_i & \mbox{else} \end{cases} \end{align}\]If we let \(\vec{z}\) be:
\[\begin{align} \vec{z}_i &= \begin{cases} 1 & \mbox{if } r_i=0 \\ (1-\alpha) & \mbox{else} \end{cases} \end{align}\]then
\[\begin{equation} \mathbf{C}=\vec{s}\vec{z}^T \end{equation}\]So by replacing (\(\alpha\mathbf{B}^T+(1-\alpha)\vec{s}\vec{1}^T\)) in Eq. \(\eqref{III}\) with \(\vec{s}\vec{z}^T\), we’ll get:
\[\begin{align} \vec{pr}_{t+1} &= \alpha \mathbf{A}^T\mathbf{D}^{-1}\vec{pr}_{t}+(\vec{s}\vec{z}^T)\vec{pr}_{t}. \tag{IV}\label{IV} \end{align}\]How does this improve the calculations? We'll see:
1. Solving a Linear System
Similar to before, we can solve Eq. \(\eqref{IV}\) and get:
\[\begin{equation} \vec{pr}=(I-\alpha \mathbf{A}^T\mathbf{D}^{-1})^{-1}(\vec{s}\vec{z}^T)\vec{pr}. \end{equation}\]Since we can re-parenthesize, \(\vec{z}^T\vec{pr}\) is just a scalar; we can drop it, renormalize \(\vec{pr}\) at the end, and solve:
\[\begin{equation} \vec{pr}=(I-\alpha \mathbf{A}^T\mathbf{D}^{-1})^{-1}\vec{s}. \end{equation}\]We almost have the same linear system as before, with one big improvement: we have replaced the less sparse \(\mathbf{\bar{A}}\) with \(\mathbf{A}\).
2. Power Method
We can apply one last smart modification to Eq. \(\eqref{IV}\): if we change the parenthesization of the last multiplication (remember the classic matrix chain multiplication problem from dynamic programming?) and also define \(\mathbf{W}=\alpha\mathbf{A}^T\mathbf{D}^{-1}\), we will have:
\[\begin{equation} \vec{pr}_{t+1} = \mathbf{W}\vec{pr}_{t}+ \vec{s}(\vec{z}^T\vec{pr}_{t}). \end{equation}\]Computing the scalar \(\vec{z}^T\vec{pr}_t\) first replaces the dense \(O(n^2)\) multiplication with two \(O(n)\) vector operations, so each iteration costs only the sparse multiplication by \(\mathbf{W}\) plus \(O(n)\) work, and the whole recursion costs that times the number of iterations. The rate of convergence, which we ignore here, depends on the second eigenvalue (\(\lambda_2\)) of the modified transition matrix \(\mathbf{T}\), defined as:
\[\begin{equation} \mathbf{T}=\alpha\mathbf{A}^T\mathbf{D}^{-1}+\vec{s}\vec{z}^T. \end{equation}\]
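As a quick numerical check of this derivation, iterating Eq. \(\eqref{IV}\) with the sparse \(\mathbf{A}\) should reproduce the fixed point of Eq. \(\eqref{I}\) computed with the dense \(\mathbf{\bar{A}}\). A minimal sketch (the size, seed, and iteration count are arbitrary):

import numpy as np
import scipy.sparse as sparse

np.random.seed(0)
alpha, n = 0.85, 50
A = sparse.random(n, n, density=0.2, format='csr')
s = np.ones(n) / n
r = np.asarray(A.sum(axis=1)).reshape(-1)   # (weighted) out-degrees

# Dense Eq. (I): form A_bar explicitly
A_bar = A.toarray()
A_bar[r == 0] = s                           # replace dangling rows with s^T
M = A_bar.T / A_bar.sum(axis=1)             # \bar{A}^T \bar{D}^{-1}
pr_dense = s.copy()
for _ in range(200):
    pr_dense = alpha * M @ pr_dense + (1 - alpha) * s

# Sparse Eq. (IV): never form A_bar
k = r.nonzero()[0]
D_inv = sparse.csr_matrix((1 / r[k], (k, k)), shape=(n, n))
W = alpha * A.T @ D_inv
z = np.where(r == 0, 1.0, 1 - alpha)
pr_sparse = s.copy()
for _ in range(200):
    pr_sparse = W @ pr_sparse + s * (z @ pr_sparse)

print(np.allclose(pr_dense, pr_sparse))     # the two formulations agree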
References
[1] Daniel A. Spielman, Graphs and Networks Lecture Notes, Lecture 11: Cutting Graphs, Personal PageRank and Spilling Paint, 2013.
[2] Daniel A. Spielman, Spectral Graph Theory Lecture Notes, Lecture 10: Random Walks on Graphs, 2018.
[3] Cleve Moler, Experiments with MATLAB, Chapter 7: Google PageRank, MathWorks, 2011.
Implementation
%%writefile ../fast_pagerank/fast_pagerank.py
"""Two fast implementations of PageRank:
An exact solution using a sparse linear system solver,
and an a power method approximation.
Both solutions are taking full advantage of sparse matrix calculations.
[Reference]:
Cleve Moler. 2011. Experiments with MATLAB (Electronic ed.).
MathWorks, Inc.
"""
from __future__ import division
import numpy as np
import scipy.sparse as sprs
import scipy.sparse.linalg
__author__ = "Armin Sajadi"
__copyright__ = "Copyright 2015, The Wikisim Project"
__email__ = "asajadi@gmail.com"
def pagerank(A, p=0.85,
personalize=None, reverse=False):
""" Calculates PageRank given a csr graph
Inputs:
-------
G: a csr graph.
p: damping factor
personlize: if not None, should be an array with the size of the nodes
containing probability distributions.
It will be normalized automatically
reverse: If true, returns the reversed-PageRank
outputs
-------
PageRank Scores for the nodes
"""
# In Moler's algorithm, $$A_{ij}$$ represents the existences of an edge
# from node $$j$$ to $$i$$, while we have assumed the opposite!
if reverse:
A = A.T
n, _ = A.shape
    r = np.asarray(A.sum(axis=1)).reshape(-1)   # out-degree of each node
    k = r.nonzero()[0]                          # indices of non-dangling nodes
    D_1 = sprs.csr_matrix((1 / r[k], (k, k)), shape=(n, n))  # D^{-1}, with 1/0 -> 0
if personalize is None:
personalize = np.ones(n)
personalize = personalize.reshape(n, 1)
s = (personalize / personalize.sum()) * n
I = sprs.eye(n)
x = sprs.linalg.spsolve((I - p * A.T @ D_1), s)
x = x / x.sum()
return x
def pagerank_power(A, p=0.85, max_iter=100,
tol=1e-06, personalize=None, reverse=False):
""" Calculates PageRank given a csr graph
Inputs:
-------
A: a csr graph.
p: damping factor
max_iter: maximum number of iterations
personlize: if not None, should be an array with the size of the nodes
containing probability distributions.
It will be normalized automatically.
reverse: If true, returns the reversed-PageRank
Returns:
--------
PageRank Scores for the nodes
"""
# In Moler's algorithm, $$G_{ij}$$ represents the existences of an edge
# from node $$j$$ to $$i$$, while we have assumed the opposite!
if reverse:
A = A.T
n, _ = A.shape
r = np.asarray(A.sum(axis=1)).reshape(-1)
k = r.nonzero()[0]
D_1 = sprs.csr_matrix((1 / r[k], (k, k)), shape=(n, n))
if personalize is None:
personalize = np.ones(n)
personalize = personalize.reshape(n, 1)
s = (personalize / personalize.sum()) * n
    z_T = (((1 - p) * (r != 0) + (r == 0)) / n)[np.newaxis, :]  # z^T of Eq. (IV), pre-divided by n
    W = p * A.T @ D_1                                           # alpha * A^T * D^{-1}
    x = s / n                                                   # start from the normalized teleport vector
oldx = np.zeros((n, 1))
iteration = 0
while np.linalg.norm(x - oldx) > tol:
oldx = x
x = W @ x + s @ (z_T @ x)
iteration += 1
if iteration >= max_iter:
break
    x = x / x.sum()
return x.reshape(-1)
Overwriting ../fast_pagerank/fast_pagerank.py
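Before the formal tests, a quick sanity check on a tiny 3-node cycle, where by symmetry every node should get the same score (a minimal sketch; the path assumes the notebook sits next to the package, as in the cells below):

import sys
sys.path.insert(0, '..')
import numpy as np
import scipy.sparse as sparse
from fast_pagerank import pagerank, pagerank_power

A = sparse.csr_matrix(np.array([[0., 1., 0.],
                                [0., 0., 1.],
                                [1., 0., 0.]]))
print(pagerank(A, p=0.85))        # exact solution
print(pagerank_power(A, p=0.85))  # power method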
Testing the algorithm
%%writefile ../test/fast_pagerank_test.py
import os
import sys
import scipy as sp
import numpy as np
import scipy.sparse as sparse
from numpy.testing import assert_allclose
import unittest
sys.path.insert(
0,
os.path.abspath(
os.path.join(
os.path.dirname(__file__),
'..')))
from fast_pagerank import pagerank
from fast_pagerank import pagerank_power
class TestMolerPageRank(unittest.TestCase):
def setUp(self):
# ---G1---
n1 = 5
edges1 = np.array([[0, 1],
[1, 2],
[2, 1],
[2, 3],
[2, 4],
[3, 0],
[3, 2],
[4, 0],
[4, 2],
[4, 3]])
weights1 = [0.4923,
0.0999,
0.2132,
0.0178,
0.5694,
0.0406,
0.2047,
0.8610,
0.3849,
0.4829]
self.p1 = 0.83
self.personalize1 = np.array([0.6005, 0.1221, 0.2542, 0.4778, 0.4275])
self.G1 = sparse.csr_matrix(
(weights1, (edges1[:, 0], edges1[:, 1])), shape=(n1, n1))
self.pr1 = np.array([0.1592, 0.2114, 0.3085, 0.1, 0.2208])
# ---G2---
n2 = 10
edges2 = np.array([[2, 4],
[2, 5],
[4, 5],
[5, 3],
[5, 4],
[5, 9],
[6, 1],
[6, 2],
[9, 2],
[9, 4]])
weights2 = [0.4565,
0.2861,
0.5730,
0.0025,
0.4829,
0.3866,
0.3041,
0.3407,
0.2653,
0.8079]
self.G2 = sparse.csr_matrix(
(weights2, (edges2[:, 0], edges2[:, 1])), shape=(n2, n2))
self.personalize2 = np.array([0.8887, 0.6491, 0.7843, 0.7103, 0.7428,
0.6632, 0.7351, 0.3006, 0.8722, 0.1652])
self.p2 = 0.92
self.pr2 = np.array([0.0234, 0.0255, 0.0629, 0.0196, 0.3303,
0.3436, 0.0194, 0.0079, 0.023, 0.1445])
# ---G3---
n3 = 5
edges3 = np.array([[2, 4]])
weights3 = [0.5441]
self.G3 = sparse.csr_matrix(
(weights3, (edges3[:, 0], edges3[:, 1])), shape=(n3, n3))
self.personalize3 = np.array([0.0884, 0.2797, 0.3093, 0.5533, 0.985])
self.p3 = 0.81
self.pr3 = np.array([0.0358, 0.1134, 0.1254, 0.2244, 0.501])
# ---G4---
n4 = 5
edges4_rows = []
edges4_cols = []
weights4 = []
self.G4 = sparse.csr_matrix(
(weights4, (edges4_rows, edges4_cols)), shape=(n4, n4))
self.personalize4 = np.array([0.2534, 0.8945, 0.9562, 0.056, 0.9439])
self.p4 = 0.70
self.pr4 = np.array([0.0816, 0.2882, 0.3081, 0.018, 0.3041])
# ---G5---
n5 = 0
edges5_rows = []
edges5_cols = []
weights5 = []
self.G5 = sparse.csr_matrix(
(weights5, (edges5_rows, edges5_cols)), shape=(n5, n5))
self.personalize5 = np.array([])
self.p5 = 0.70
self.pr5 = np.array([])
def test_pagerank_1(self):
calculated_pagerank = pagerank(self.G1, p=self.p1,
personalize=self.personalize1)
assert_allclose(calculated_pagerank, self.pr1, rtol=0, atol=1e-04)
def test_pagerank_2(self):
calculated_pagerank = pagerank(self.G2, p=self.p2,
personalize=self.personalize2)
assert_allclose(calculated_pagerank, self.pr2, rtol=0, atol=1e-04)
def test_single_edge(self):
calculated_pagerank = pagerank(self.G3, p=self.p3,
personalize=self.personalize3)
assert_allclose(calculated_pagerank, self.pr3, rtol=0, atol=1e-04)
def test_zero_edge(self):
calculated_pagerank = pagerank(self.G4, p=self.p4,
personalize=self.personalize4)
assert_allclose(calculated_pagerank, self.pr4, rtol=0, atol=1e-04)
def test_empty_graph(self):
calculated_pagerank = pagerank(self.G5, p=self.p5,
personalize=self.personalize5)
self.assertEqual(calculated_pagerank.size, 0)
def test_power_pagerank_1(self):
calculated_pagerank = pagerank_power(self.G1, p=self.p1,
personalize=self.personalize1)
assert_allclose(calculated_pagerank, self.pr1, rtol=0, atol=1e-04)
def test_power_pagerank_2(self):
calculated_pagerank = pagerank_power(self.G2, p=self.p2,
personalize=self.personalize2)
assert_allclose(calculated_pagerank, self.pr2, rtol=0, atol=1e-04)
def test_power_single_edge(self):
calculated_pagerank = pagerank_power(self.G3, p=self.p3,
personalize=self.personalize3)
assert_allclose(calculated_pagerank, self.pr3, rtol=0, atol=1e-04)
def test_power_zero_edge(self):
calculated_pagerank = pagerank_power(self.G4, p=self.p4,
personalize=self.personalize4)
assert_allclose(calculated_pagerank, self.pr4, rtol=0, atol=1e-04)
def test_power_empty_graph(self):
calculated_pagerank = pagerank_power(self.G5, p=self.p5,
personalize=self.personalize5)
self.assertEqual(calculated_pagerank.size, 0)
if __name__ == '__main__':
unittest.main()
Overwriting ../test/fast_pagerank_test.py
!python ../test/fast_pagerank_test.py
..........
----------------------------------------------------------------------
Ran 10 tests in 0.020s
OK
Benchmarking
To avoid clutter, we only visualize the fastest method from each implementation, that is:
- networkx.pagerank
- The latest implementation of iGraph.personalized_pagerank (PRPACK)
- Our pagerank_power
''' Calculate PageRank on several random graphs.
'''
import numpy as np
import scipy as sp
import scipy.sparse  # needed for sp.sparse.random below
import pandas as pd
import timeit
import os
import sys
import random
import igraph
import networkx as nx
sys.path.insert(0, '..')
from fast_pagerank import pagerank
from fast_pagerank import pagerank_power
def get_random_graph(
min_size=20,
max_size=2000,
min_density=0.1,
max_density=0.5):
    ''' Creates a random graph and a teleport vector and outputs them
    in different formats for different algorithms
    Inputs
    ------
    min_size and max_size: The size of the graph will be a random number
    in the range (min_size, max_size)
    min_density and max_density: The density of the graph
    will be a random number in the range (min_density, max_density)
    Returns
    -------
    nxG: A random graph for NetworkX
    A: The equivalent csr adjacency matrix, for our PageRank
    iG: The equivalent iGraph
    personalize_vector: Personalization probability vector
    personalize_dict: Personalization probability vector,
    in the form of a dictionary for NetworkX
    '''
G_size = random.randint(min_size, max_size)
p = random.uniform(min_density, max_density)
A = sp.sparse.random(G_size, G_size, p, format='csr')
nxG = nx.from_scipy_sparse_array(A, create_using=nx.DiGraph())
iG = igraph.Graph(list(nxG.edges()), directed=True)
iG.es['weight'] = A.data
personalize_vector = np.random.random(G_size)
personalize_dict = dict(enumerate(personalize_vector.reshape(-1)))
return A, nxG, iG, personalize_vector, personalize_dict
n = 5  # number of timeit runs per measurement
number_of_graphs = 50
node_size_vector = np.zeros(number_of_graphs)
edge_size_vector = np.zeros(number_of_graphs)
netx_pagerank_times = np.zeros(number_of_graphs)
ig_pagerank_times = np.zeros(number_of_graphs)
pagerank_times = np.zeros(number_of_graphs)
pagerank_times_power = np.zeros(number_of_graphs)
damping_factor = 0.85
tol = 1e-3
for i in range(number_of_graphs):
A, nxG, iG, personalize_vector, personalize_dict = get_random_graph()
node_size_vector[i] = A.shape[0]
edge_size_vector[i] = A.count_nonzero()
print ("Graph %d: Nodes: %d, Edges: %d ..." %(i, node_size_vector[i], edge_size_vector[i]))
sys.stdout.flush()
netx_pagerank_times[i] = timeit.timeit(
lambda: nx.pagerank(nxG, alpha=damping_factor, tol=tol),
number=n) / n
    # iGraph: only "prpack", which is their latest implementation.
ig_pagerank_times[i] = timeit.timeit(
lambda: iG.personalized_pagerank(directed=True,
damping=damping_factor,
weights=iG.es['weight'],
implementation="prpack"),
number=n) / n
# My implementations
pagerank_times[i] = timeit.timeit(
lambda: pagerank(A, p=damping_factor),
number=n) / n
pagerank_times_power[i] = timeit.timeit(
lambda: pagerank_power(A, p=damping_factor, tol=tol),
number=n) / n
argsort = edge_size_vector.argsort()
edge_size_vector_sorted = edge_size_vector[argsort]
node_size_vector_sorted = node_size_vector[argsort]
netx_pagerank_times_sorted = netx_pagerank_times[argsort]
ig_pagerank_times_sorted = ig_pagerank_times[argsort]
pagerank_times_sorted = pagerank_times[argsort]
pagerank_times_power_sorted = pagerank_times_power[argsort]
comparison_table = pd.DataFrame(list(zip(node_size_vector_sorted,
edge_size_vector_sorted,
netx_pagerank_times_sorted,
ig_pagerank_times_sorted,
pagerank_times_sorted,
pagerank_times_power_sorted)),
columns=['Nodes', 'Edges',
'NetX',
'iGraph',
'(fast) pagerank',
'(fast) pagerank_power']).\
astype({'Nodes': 'int32', 'Edges': 'int32'})
comparison_table.to_csv('pagerank_methods_comparison.csv')
print("Done")
Graph 0: Nodes: 1614, Edges: 892921 ...
Graph 1: Nodes: 1291, Edges: 397622 ...
Graph 2: Nodes: 838, Edges: 132426 ...
Graph 3: Nodes: 673, Edges: 199501 ...
Graph 4: Nodes: 429, Edges: 27151 ...
Graph 5: Nodes: 1454, Edges: 825369 ...
Graph 6: Nodes: 1635, Edges: 579105 ...
Graph 7: Nodes: 141, Edges: 3475 ...
Graph 8: Nodes: 706, Edges: 206913 ...
Graph 9: Nodes: 751, Edges: 137461 ...
Graph 10: Nodes: 1641, Edges: 1195768 ...
Graph 11: Nodes: 1413, Edges: 842877 ...
Graph 12: Nodes: 1042, Edges: 419973 ...
Graph 13: Nodes: 1681, Edges: 282626 ...
Graph 14: Nodes: 1089, Edges: 584362 ...
Graph 15: Nodes: 1260, Edges: 498237 ...
Graph 16: Nodes: 1736, Edges: 1412119 ...
Graph 17: Nodes: 1406, Edges: 310850 ...
Graph 18: Nodes: 240, Edges: 17255 ...
Graph 19: Nodes: 1176, Edges: 368369 ...
Graph 20: Nodes: 1083, Edges: 420160 ...
Graph 21: Nodes: 1933, Edges: 956007 ...
Graph 22: Nodes: 841, Edges: 145601 ...
Graph 23: Nodes: 905, Edges: 127398 ...
Graph 24: Nodes: 858, Edges: 247544 ...
Graph 25: Nodes: 1989, Edges: 1581139 ...
Graph 26: Nodes: 869, Edges: 228061 ...
Graph 27: Nodes: 773, Edges: 164815 ...
Graph 28: Nodes: 607, Edges: 156039 ...
Graph 29: Nodes: 1840, Edges: 855672 ...
Graph 30: Nodes: 349, Edges: 49752 ...
Graph 31: Nodes: 1722, Edges: 391689 ...
Graph 32: Nodes: 615, Edges: 170985 ...
Graph 33: Nodes: 181, Edges: 13687 ...
Graph 34: Nodes: 1060, Edges: 510199 ...
Graph 35: Nodes: 956, Edges: 372646 ...
Graph 36: Nodes: 1784, Edges: 628763 ...
Graph 37: Nodes: 338, Edges: 32353 ...
Graph 38: Nodes: 664, Edges: 120154 ...
Graph 39: Nodes: 888, Edges: 179415 ...
Graph 40: Nodes: 606, Edges: 173639 ...
Graph 41: Nodes: 168, Edges: 8709 ...
Graph 42: Nodes: 159, Edges: 11571 ...
Graph 43: Nodes: 369, Edges: 67840 ...
Graph 44: Nodes: 1572, Edges: 672553 ...
Graph 45: Nodes: 593, Edges: 50826 ...
Graph 46: Nodes: 603, Edges: 130738 ...
Graph 47: Nodes: 924, Edges: 152257 ...
Graph 48: Nodes: 923, Edges: 232844 ...
Graph 49: Nodes: 532, Edges: 75687 ...
Done
Plotting
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
comparison_table = pd.read_csv('pagerank_methods_comparison.csv', index_col=0)
display(comparison_table)
plt.ioff()
fig=plt.figure(num=None, figsize=(10, 10), dpi=80, facecolor='w', edgecolor='k')
plt.plot(comparison_table['Edges'], comparison_table['NetX'],
'o-', ms=8, lw=2, alpha=0.7, color='cyan',
label='networkx.PageRank')
plt.plot(comparison_table['Edges'], comparison_table['iGraph'],
         'x-', ms=8, lw=2, alpha=0.7, color='black',
         label='iGraph.personalized_pagerank (PRPACK)')
plt.plot(comparison_table['Edges'], comparison_table['(fast) pagerank'],
'*-', ms=8, lw=2, alpha=0.7, color='red',
label='fast_pagerank.pagerank')
plt.plot(comparison_table['Edges'], comparison_table['(fast) pagerank_power'],
'^-', ms=8, lw=2, alpha=0.7, color='green',
label='fast_pagerank.pagerank_power')
plt.xlabel('Number of edges')
plt.ylabel('Time (seconds)')
plt.tight_layout()
plt.legend(loc=2)
plt.savefig('pagerank_methods_comparison.png')
plt.show()
All times are in seconds, averaged over 5 runs per graph:

|   | Nodes | Edges | NetX | iGraph | (fast) pagerank | (fast) pagerank_power |
|---|---|---|---|---|---|---|
0 | 141 | 3475 | 0.008167 | 0.000539 | 0.001923 | 0.000849 |
1 | 168 | 8709 | 0.020619 | 0.001263 | 0.002540 | 0.000976 |
2 | 159 | 11571 | 0.024342 | 0.001638 | 0.002263 | 0.000924 |
3 | 181 | 13687 | 0.026383 | 0.001721 | 0.002278 | 0.000875 |
4 | 240 | 17255 | 0.031868 | 0.002163 | 0.006060 | 0.001765 |
5 | 429 | 27151 | 0.046971 | 0.003327 | 0.018867 | 0.002158 |
6 | 338 | 32353 | 0.062241 | 0.004158 | 0.012208 | 0.002291 |
7 | 349 | 49752 | 0.091428 | 0.006546 | 0.020979 | 0.002568 |
8 | 593 | 50826 | 0.094839 | 0.004848 | 0.053456 | 0.003128 |
9 | 369 | 67840 | 0.129616 | 0.009707 | 0.015578 | 0.003154 |
10 | 532 | 75687 | 0.153020 | 0.007637 | 0.036882 | 0.003722 |
11 | 664 | 120154 | 0.230722 | 0.012611 | 0.064293 | 0.004689 |
12 | 905 | 127398 | 0.243970 | 0.015039 | 0.184043 | 0.005806 |
13 | 603 | 130738 | 0.264537 | 0.013170 | 0.045426 | 0.004366 |
14 | 838 | 132426 | 0.248668 | 0.020659 | 0.163919 | 0.004816 |
15 | 751 | 137461 | 0.267247 | 0.013927 | 0.076969 | 0.004700 |
16 | 841 | 145601 | 0.275542 | 0.013754 | 0.099576 | 0.004973 |
17 | 924 | 152257 | 0.296603 | 0.015513 | 0.137361 | 0.005365 |
18 | 607 | 156039 | 0.295475 | 0.014830 | 0.053392 | 0.005314 |
19 | 773 | 164815 | 0.320011 | 0.014407 | 0.078800 | 0.005080 |
20 | 615 | 170985 | 0.324415 | 0.014716 | 0.048411 | 0.005089 |
21 | 606 | 173639 | 0.329282 | 0.017196 | 0.047157 | 0.005871 |
22 | 888 | 179415 | 0.349202 | 0.017302 | 0.122838 | 0.006556 |
23 | 673 | 199501 | 0.355692 | 0.019995 | 0.075585 | 0.005577 |
24 | 706 | 206913 | 0.405331 | 0.020133 | 0.068514 | 0.005883 |
25 | 869 | 228061 | 0.466303 | 0.020160 | 0.161987 | 0.008556 |
26 | 923 | 232844 | 0.464954 | 0.025322 | 0.170835 | 0.008981 |
27 | 858 | 247544 | 0.498882 | 0.028198 | 0.130814 | 0.007992 |
28 | 1681 | 282626 | 0.530677 | 0.027410 | 0.695906 | 0.008940 |
29 | 1406 | 310850 | 0.550259 | 0.028906 | 0.428251 | 0.009485 |
30 | 1176 | 368369 | 0.715257 | 0.044657 | 0.396467 | 0.017334 |
31 | 956 | 372646 | 0.737979 | 0.031602 | 0.136079 | 0.009442 |
32 | 1722 | 391689 | 0.763951 | 0.047625 | 0.849482 | 0.015669 |
33 | 1291 | 397622 | 0.704370 | 0.034768 | 0.311621 | 0.009304 |
34 | 1042 | 419973 | 0.768082 | 0.043818 | 0.184774 | 0.011507 |
35 | 1083 | 420160 | 1.006566 | 0.158505 | 1.352094 | 0.011576 |
36 | 1260 | 498237 | 1.011772 | 0.059739 | 0.368076 | 0.011779 |
37 | 1060 | 510199 | 1.140068 | 0.059229 | 0.237280 | 0.019025 |
38 | 1635 | 579105 | 1.137819 | 0.061414 | 0.767978 | 0.013246 |
39 | 1089 | 584362 | 1.037787 | 0.074236 | 0.254159 | 0.012441 |
40 | 1784 | 628763 | 1.427946 | 0.058649 | 0.877097 | 0.015091 |
41 | 1572 | 672553 | 1.295139 | 0.056538 | 0.638620 | 0.015338 |
42 | 1454 | 825369 | 1.507927 | 0.107227 | 0.405008 | 0.014615 |
43 | 1413 | 842877 | 1.624581 | 0.111469 | 0.444221 | 0.015862 |
44 | 1840 | 855672 | 1.623843 | 0.075287 | 0.838486 | 0.016817 |
45 | 1614 | 892921 | 1.710065 | 0.133828 | 0.627106 | 0.015474 |
46 | 1933 | 956007 | 2.507147 | 0.119857 | 1.135223 | 0.016590 |
47 | 1641 | 1195768 | 2.196352 | 0.172022 | 0.535405 | 0.019178 |
48 | 1736 | 1412119 | 2.638824 | 0.192373 | 0.674815 | 0.021137 |
49 | 1989 | 1581139 | 3.274551 | 0.256537 | 1.405950 | 0.020392 |