[SciPy-User] Conversion to graph adjacency list

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

[SciPy-User] Conversion to graph adjacency list

elastica
Hi,

Scipy library provides C-implementation for some classical graph algorithms like Kruskal or connected component finding. Nevertheless, there is an efficiency problem: sparse graphs are usually given by list of edges or adjacency list (and not adjacency matrix). And in order to run the Scipy routines, we have to convert our graphs to the compressed SCR format (dense array is not well suited for graph with a lot of vertices, say more than 20,000). And sometimes, after execution we have to convert back. This causes the routine spending most of the execution time in conversion and retro-conversion tasks.

So my question is: does Scipy provide a C-routine for converting and back converting to edge/adjacency list format?
_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Conversion to graph adjacency list

Robert Kern-2
On Fri, Dec 7, 2018 at 10:51 PM <[hidden email]> wrote:
>
> Hi,
>
> Scipy library provides C-implementation for some classical graph algorithms like Kruskal or connected component finding. Nevertheless, there is an efficiency problem: sparse graphs are usually given by list of edges or adjacency list (and not adjacency matrix). And in order to run the Scipy routines, we have to convert our graphs to the compressed SCR format (dense array is not well suited for graph with a lot of vertices, say more than 20,000). And sometimes, after execution we have to convert back. This causes the routine spending most of the execution time in conversion and retro-conversion tasks.
>
> So my question is: does Scipy provide a C-routine for converting and back converting to edge/adjacency list format?

How are you currently doing the conversion? The most efficient method is probably to convert from the CSR format to COO, which provides two parallel arrays giving the row and column indices of the edges. These can be simply zip()ed together to get a list of edge tuples.

|30> import numpy as np

|31> Nnodes = 20000

|32> edges = []

|33> for i in range(Nnodes):
...>     js = np.random.permutation(Nnodes)[:int(np.sqrt(Nnodes))]
...>     edges.extend([(i, j) for j in js])
...>     

# Construct a CSR matrix of the graph, reasonably efficiently.

|35> from scipy import sparse

|36> row_ind, col_ind = np.transpose(edges)

|37> Gcsr = sparse.csr_matrix((np.ones_like(row_ind), (row_ind, col_ind)), shape=(Nnodes,
...>  Nnodes))

# Now convert back to a list of edge tuples.

|39> Gcoo = Gcsr.tocoo()

|40> coo_edges = zip(Gcoo.row, Gcoo.col)

|41> set(edges) == set(coo_edges)
True

--
Robert Kern

_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Conversion to graph adjacency list

elastica

Thanks Robert for your hepfull response, now I better understand how csr conversion works.

I'm benchmarking some graph libraries. My edges-to-csr conversion was hand-written in Python, so very slow! Now, with your code, execution time is 7 times faster!

Here is the code, perhaps we can optimize more:


# =============================================
from time import clock
from sys import stderr, stdin

from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.sparse import csr_matrix
import numpy as np


# ------------ Scipy Code ---------------------------

def wedges2adj(edges, n):
   
    G=[[]for _ in range(n)]
    for a, b, w in edges:
        G[a].append((b,w))
        G[b].append((a,w))
    return G

def wedges2scr(edges, n):
    G=wedges2adj(edges, n)
    indptr=[0]
    cnt=0
     
    for line in G:
        cnt+=len(line)
        indptr.append(cnt)
    data=[]
    indices=[]
    for i in range(n):
        for (j,w) in G[i]:
            data.append(w)
            indices.append(j)
           
    return [data, indptr, indices]

def csr2wedges(Mcsr, shape):
    n, p=shape
    k=0
    edges=[]
    data, cumul, cols  = Mcsr.data,Mcsr.indptr, Mcsr.indices  
    for i in range(n):
        for j in range(cumul[i+1]-cumul[i]):
            edges.append((i, cols[k], data[k]))
            k+=1
    return edges

def kruskal_scipy(edges, n):
    data, indptr, indices=wedges2scr(edges, n)
    csr=csr_matrix((data, indices, indptr), shape=(n, n))    
    tree=minimum_spanning_tree(csr)
    edges=csr2wedges(tree, (n,n))
    return int(sum((round(w)) for (a,b,w) in edges))

# °°°°°°°°°°°°°°°°°°°°°°°



def wedges2scr_FAST(wedges, n):
    row_ind, col_ind, costs = np.transpose(wedges)
    return csr_matrix((costs, (row_ind, col_ind)),
    shape=(n,n))

def kruskal_scipy_FAST(wedges, n):    
    csr=wedges2scr_FAST(wedges, n)
    tree=minimum_spanning_tree(csr).tocoo()
    return int(round(sum(w for (i,j,w) in zip(tree.row, tree.col, tree.data))))


# ------------ Benchmark ---------------------------

def go(solver, L):
    global duration

    N=len(L)
    k=1
    solution=[]
    while k<N:
        edges=[]
        n=L[k]
        k+=1
           
        for a in range(n):
            d=L[k]
            k+=1
            for j in range(d):
                b=L[k]
                k+=1
                w=L[k]
                k+=1
                if a<b:
                    edges.append([a,b-1,w])
        if solver==kruskal_scipy:
            data, indptr, indices=wedges2scr(edges, n)
               
            start=clock()
            csr=csr_matrix((data, indices, indptr), shape=(n, n))            
            tree=minimum_spanning_tree(csr)
            edges=csr2wedges(tree, (n,n))
            sol=solver(edges, n)
            duration+=clock()-start              
            solution.append(sol)            
        else:
         
            start=clock()
            sol=solver(edges, n)
            duration+=clock()-start
            solution.append(sol)
    return solution

# data
L=list(int(z) for z in stdin.read().split() if z.isdigit())
output=[]
solvers=[kruskal_scipy, kruskal_scipy_FAST]
for solver in solvers:
    duration=0
    costs=go(solver, L)
    output.append(costs)
    print("%-20s : %.3f" %(solver.__name__, duration), file=stderr)

if all(output[i]== output[0] for i in range(len(solvers))):
    print("NO error detected", file=stderr)
else:
    print("ERROR detected", file=stderr)

# =============================================


Link to data file (100 graphs with at most 30000 vertices each):

https://drive.google.com/file/d/1BKxRAzJ9jeowbrFPyLZ23EomKn1eqohs/view?usp=sharing

The above code is faster than Networkit library code and little slower than Graph-Tool library.

I have two observations:

1. Python code is only 2 times slower
2. Pure C-code is about 9 times faster

In each case, even SciPy cf. _min_spanning_tree.pyx, algorithm is the same: kruskal with Union-find.




_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Conversion to graph adjacency list

Jason Sachs
Hmm. I'm not an expert here, but did you try using the "networkx" package?

On Sat, Dec 8, 2018 at 7:36 AM <[hidden email]> wrote:

Thanks Robert for your hepfull response, now I better understand how csr conversion works.

I'm benchmarking some graph libraries. My edges-to-csr conversion was hand-written in Python, so very slow! Now, with your code, execution time is 7 times faster!

Here is the code, perhaps we can optimize more:


# =============================================
from time import clock
from sys import stderr, stdin

from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.sparse import csr_matrix
import numpy as np


# ------------ Scipy Code ---------------------------

def wedges2adj(edges, n):

    G=[[]for _ in range(n)]
    for a, b, w in edges:
        G[a].append((b,w))
        G[b].append((a,w))
    return G

def wedges2scr(edges, n):
    G=wedges2adj(edges, n)
    indptr=[0]
    cnt=0

    for line in G:
        cnt+=len(line)
        indptr.append(cnt)
    data=[]
    indices=[]
    for i in range(n):
        for (j,w) in G[i]:
            data.append(w)
            indices.append(j)

    return [data, indptr, indices]

def csr2wedges(Mcsr, shape):
    n, p=shape
    k=0
    edges=[]
    data, cumul, cols  = Mcsr.data,Mcsr.indptr, Mcsr.indices 
    for i in range(n):
        for j in range(cumul[i+1]-cumul[i]):
            edges.append((i, cols[k], data[k]))
            k+=1
    return edges

def kruskal_scipy(edges, n):
    data, indptr, indices=wedges2scr(edges, n)
    csr=csr_matrix((data, indices, indptr), shape=(n, n))     
    tree=minimum_spanning_tree(csr)
    edges=csr2wedges(tree, (n,n))
    return int(sum((round(w)) for (a,b,w) in edges))

# °°°°°°°°°°°°°°°°°°°°°°°



def wedges2scr_FAST(wedges, n):
    row_ind, col_ind, costs = np.transpose(wedges)
    return csr_matrix((costs, (row_ind, col_ind)),
    shape=(n,n))

def kruskal_scipy_FAST(wedges, n):   
    csr=wedges2scr_FAST(wedges, n)
    tree=minimum_spanning_tree(csr).tocoo()
    return int(round(sum(w for (i,j,w) in zip(tree.row, tree.col, tree.data))))


# ------------ Benchmark ---------------------------

def go(solver, L):
    global duration

    N=len(L)
    k=1
    solution=[]
    while k<N:
        edges=[]
        n=L[k]
        k+=1

        for a in range(n):
            d=L[k]
            k+=1
            for j in range(d):
                b=L[k]
                k+=1
                w=L[k]
                k+=1
                if a<b:
                    edges.append([a,b-1,w])
        if solver==kruskal_scipy:
            data, indptr, indices=wedges2scr(edges, n)

            start=clock()
            csr=csr_matrix((data, indices, indptr), shape=(n, n))             
            tree=minimum_spanning_tree(csr)
            edges=csr2wedges(tree, (n,n))
            sol=solver(edges, n)
            duration+=clock()-start             
            solution.append(sol)             
        else:

            start=clock()
            sol=solver(edges, n)
            duration+=clock()-start
            solution.append(sol)
    return solution

# data
L=list(int(z) for z in stdin.read().split() if z.isdigit())
output=[]
solvers=[kruskal_scipy, kruskal_scipy_FAST]
for solver in solvers:
    duration=0
    costs=go(solver, L)
    output.append(costs)
    print("%-20s : %.3f" %(solver.__name__, duration), file=stderr)

if all(output[i]== output[0] for i in range(len(solvers))):
    print("NO error detected", file=stderr)
else:
    print("ERROR detected", file=stderr)

# =============================================


Link to data file (100 graphs with at most 30000 vertices each):

https://drive.google.com/file/d/1BKxRAzJ9jeowbrFPyLZ23EomKn1eqohs/view?usp=sharing

The above code is faster than Networkit library code and little slower than Graph-Tool library.

I have two observations:

1. Python code is only 2 times slower
2. Pure C-code is about 9 times faster

In each case, even SciPy cf. _min_spanning_tree.pyx, algorithm is the same: kruskal with Union-find.




_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user

_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Conversion to graph adjacency list

elastica




 
> Hmm. I'm not an expert here, but did you try using the "networkx" package?


Of course I did. NetworkX has a very clean interface, clean code, is nicelly documented, well maintained, provides many features. But so slow :(

Here are the results against the data file from my previous post:

kruskal_networkX     : 29.411
kruskal_python       : 3.021
kruskal_gt           : 1.384
kruskal_networkit    : 2.501
kruskal_igraph       : 7.530
kruskal_scipy        : 10.744
kruskal_scipy_FAST   : 1.509
NO error detected

[gt=Graph-tool]


_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Conversion to graph adjacency list

Guillaume Gay
Have you looked at graph-tool, which is fast?

Le 8 décembre 2018 17:49:14 GMT+01:00, [hidden email] a écrit :





Hmm. I'm not an expert here, but did you try using the "networkx" package?


Of course I did. NetworkX has a very clean interface, clean code, is nicelly documented, well maintained, provides many features. But so slow :(

Here are the results against the data file from my previous post:

kruskal_networkX : 29.411
kruskal_python : 3.021
kruskal_gt : 1.384
kruskal_networkit : 2.501
kruskal_igraph : 7.530
kruskal_scipy : 10.744
kruskal_scipy_FAST : 1.509
NO error detected

[gt=Graph-tool]
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user

--
Envoyé de mon appareil Android avec Courriel K-9 Mail. Veuillez excuser ma brièveté.
_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Conversion to graph adjacency list

Guillaume Gay
In reply to this post by elastica
Sorry, I saw your benchmark too late!

Le 8 décembre 2018 17:49:14 GMT+01:00, [hidden email] a écrit :





Hmm. I'm not an expert here, but did you try using the "networkx" package?


Of course I did. NetworkX has a very clean interface, clean code, is nicelly documented, well maintained, provides many features. But so slow :(

Here are the results against the data file from my previous post:

kruskal_networkX : 29.411
kruskal_python : 3.021
kruskal_gt : 1.384
kruskal_networkit : 2.501
kruskal_igraph : 7.530
kruskal_scipy : 10.744
kruskal_scipy_FAST : 1.509
NO error detected

[gt=Graph-tool]
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user

--
Envoyé de mon appareil Android avec Courriel K-9 Mail. Veuillez excuser ma brièveté.
_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Conversion to graph adjacency list

elastica
In reply to this post by Guillaume Gay



 
> Have you looked at graph-tool, which is fast?


Yes, I did, cf. the previous post:


> >kruskal_gt           : 1.384

[snip]

> >[gt=Graph-tool]


Graph-tool performs well but I was expecting better timing, it's only 2 times faster than pure Python code where Kruskal is implemented with a basic Union-Find, we are far from genuine C/C++ performance, because usually graph implementations are (about) 20 times faster in C/C++ than pure Python.


_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user