[SciPy-User] How to read row_name, col_name, value format TSV into a sparse matrix?

Peng Yu
Suppose that I have a TSV file in the following format.

```
row_name<TAB>col_name<TAB>value
```

Is there an easy way to read it into a sparse matrix format in scipy? Thanks.

I don't see such examples in the doc.

https://docs.scipy.org/doc/scipy/reference/sparse.html

--
Regards,
Peng
_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user

Re: How to read row_name, col_name, value format TSV into a sparse matrix?

Hjalmar Turesson

Have you tried using Pandas?

https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html

On Tue, Jan 28, 2020 at 10:09 PM Peng Yu <[hidden email]> wrote:
> Is there an easy way to read it into a sparse matrix format in scipy?

Re: How to read row_name, col_name, value format TSV into a sparse matrix?

Peng Yu
No. Which one to try?

Just to be clear I want to eventually use the sparse matrix with
sklearn's .fit().

On 1/28/20, Hjalmar Turesson <[hidden email]> wrote:

> Have you tried using Pandas?
>
> https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html

--
Regards,
Peng

Re: How to read row_name, col_name, value format TSV into a sparse matrix?

Guillaume Gay
You can either use pandas or numpy.read_csv. If the row_name and
col_name columns contain the indices, you can then instantiate a
scipy.sparse matrix with sparse.coo_matrix((val, (row, col))).

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy-sparse-coo-matrix
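
A minimal sketch of this approach, assuming the name columns hold arbitrary string labels rather than integer indices and that the file has no header line. coo_matrix takes a single tuple of the form (data, (row, col)). The sample data and the helper name `read_triples` are made up for illustration.

```python
import csv
import io

from scipy import sparse


def read_triples(handle):
    """Read row_name<TAB>col_name<TAB>value lines into a COO matrix.

    Returns the matrix plus the row and column labels, in index order.
    """
    row_index, col_index = {}, {}   # label -> integer position
    rows, cols, vals = [], [], []
    for row_name, col_name, value in csv.reader(handle, delimiter="\t"):
        # setdefault assigns the next free index the first time a label is seen
        rows.append(row_index.setdefault(row_name, len(row_index)))
        cols.append(col_index.setdefault(col_name, len(col_index)))
        vals.append(float(value))
    matrix = sparse.coo_matrix(
        (vals, (rows, cols)), shape=(len(row_index), len(col_index))
    )
    return matrix, list(row_index), list(col_index)


# Hypothetical sample data standing in for the TSV file.
sample = io.StringIO("r1\tc1\t1.5\nr1\tc3\t2.0\nr2\tc2\t4.0\n")
matrix, row_names, col_names = read_triples(sample)
csr = matrix.tocsr()  # many scikit-learn estimators accept CSR input
```

Only the entries listed in the file are ever held in memory; the label lists make it possible to map matrix positions back to the original names.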


G.

On 29/01/2020 at 04:44, Peng Yu wrote:

> No. Which one to try?
>
> Just to be clear I want to eventually use the sparse matrix with
> sklearn's .fit().

--
Guillaume Gay, PhD

Morphogénie Logiciels SAS
http://morphogenie.fr

12 rue Camoin Jeune
13004 Marseille

  +336 51 95 94 00


Re: How to read row_name, col_name, value format TSV into a sparse matrix?

Peng Yu
But doesn't pandas read_csv generate a dense matrix? (I can't find
numpy.read_csv. I only find numpy.loadtxt, which also only deals with
dense matrices.) What is the purpose of reading into a dense matrix and
then converting it to a sparse one? Isn't it better to read directly
into a sparse matrix to save memory? Thanks.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.loadtxt.html

--
Regards,
Peng

Re: How to read row_name, col_name, value format TSV into a sparse matrix?

Thomas Kluyver-2
Reading the csv/tsv (either with pandas or numpy) doesn't create a matrix at all. It just gives you the data as it is in the file: values with associated coordinates. Then you would use something like scipy.sparse.coo_matrix() to convert that to a sparse matrix.
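
To illustrate, here is a sketch with pandas, assuming a header-less file and made-up sample data. The DataFrame holds only the listed entries, one per line of the file, and pandas.factorize turns the labels into integer codes that coo_matrix can use as coordinates.

```python
import io

import pandas as pd
from scipy import sparse

# Hypothetical sample data; in practice pass the path of the TSV file.
sample = io.StringIO("r1\tc1\t1.5\nr1\tc3\t2.0\nr2\tc2\t4.0\n")

df = pd.read_csv(sample, sep="\t", header=None,
                 names=["row_name", "col_name", "value"])
# df has one row per stored entry: shape (n_entries, 3), not a dense matrix.

# factorize maps each distinct label to an integer code,
# numbered in order of first appearance.
row_codes, row_labels = pd.factorize(df["row_name"])
col_codes, col_labels = pd.factorize(df["col_name"])

matrix = sparse.coo_matrix(
    (df["value"].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
)
```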

On Wed, 29 Jan 2020 at 08:47, Peng Yu <[hidden email]> wrote:
> But doesn't pandas read_csv generate a dense matrix? [...] Isn't it
> better to read directly into a sparse matrix to save memory?

Re: How to read row_name, col_name, value format TSV into a sparse matrix?

Peng Yu
> Reading the csv/tsv (either with pandas or numpy) doesn't create a matrix
> at all. It just gives you the data as it is in the file: values with
> associated coordinates. Then you would use something like
> scipy.sparse.coo_matrix() to convert that to a sparse matrix.

Where is it documented that pandas.read_csv doesn't generate the whole
matrix? Is the return value either of these two?

"""
DataFrame or TextParser

    A comma-separated values (csv) file is returned as two-dimensional
data structure with labeled axes.
"""

Are you referring to "TextParser"? How do I control which one is
returned? I don't see an option for it.

Which numpy function do you refer to specifically? numpy.loadtxt? It
returns an ndarray, which means reading a dense matrix into memory.

--
Regards,
Peng

Re: How to read row_name, col_name, value format TSV into a sparse matrix?

lingxz1
Hi Peng Yu, 

Seems like these links might be useful: 

Should be easy to switch out csv for tsv parsing. 

Lingyi

On Wed, Jan 29, 2020 at 5:34 PM Peng Yu <[hidden email]> wrote:
> Where is it documented that pandas.read_csv doesn't generate the whole
> matrix? [...]
> Which numpy function do you refer to specifically? numpy.loadtxt?

Re: How to read row_name, col_name, value format TSV into a sparse matrix?

Thomas Kluyver-2
On Wed, 29 Jan 2020 at 09:34, Peng Yu <[hidden email]> wrote:
> Where is it documented that pandas.read_csv doesn't generate the whole
> matrix? Is the return value either of these two?

It returns a 2D data structure with the same rows and columns as your CSV file - so the shape will be (n_entries, 3). It doesn't try to interpret them as referring to entries in a matrix - you have to do that as a separate step.

It's probably not exactly documented like this, because documentation doesn't usually say what a function *doesn't* do, unless it's a very common confusion.
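
A small sketch of that separate step with numpy.loadtxt, assuming (as a simplification) that the name columns already hold integer indices so that loadtxt can parse them as numbers; the data is made up. The loaded array has one row per line of the file, and the matrix dimensions only appear once coo_matrix interprets the first two columns as coordinates.

```python
import io

import numpy as np
from scipy import sparse

# Hypothetical data: the first two columns are already integer indices.
sample = io.StringIO("0\t0\t1.5\n0\t999\t2.0\n499\t2\t4.0\n")

triples = np.loadtxt(sample, delimiter="\t", ndmin=2)
# triples.shape is (n_entries, 3): one row per line of the file.

rows = triples[:, 0].astype(int)
cols = triples[:, 1].astype(int)
matrix = sparse.coo_matrix((triples[:, 2], (rows, cols)))
# With no explicit shape, the dimensions are inferred from the largest
# indices; only the three listed values are stored.
```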

Thomas


Re: How to read row_name, col_name, value format TSV into a sparse matrix?

lingxz1
Thomas,

Unless I'm misunderstanding, I think Peng Yu doesn't want to read the zeros (or empty values) from the tsv file into memory. I'm pretty sure pandas.read_csv reads your whole data into memory, zeros or not. There is no option to read it in a sparse format (only store the position of nonzero entries). So that doesn't solve the problem. 

I think you can also read it in chunks, call df.to_sparse to convert to a sparse matrix for each chunk and concat them. I'm not sure if you've seen this: https://stackoverflow.com/questions/31888856/read-a-large-csv-into-a-sparse-pandas-dataframe-in-a-memory-efficient-way, but it might also offer some useful insights.
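
A sketch of the chunked idea, under some assumptions: the file is header-less, the sample data and chunk size are made up, and the chunks are fed to scipy.sparse.coo_matrix rather than df.to_sparse (which was deprecated in pandas 0.25 and removed in 1.0). Only one chunk of text is parsed at a time, and dictionaries keep the label-to-index mapping consistent across chunks.

```python
import io

import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical sample data standing in for a large TSV file.
sample = io.StringIO(
    "r1\tc1\t1.5\nr1\tc3\t2.0\nr2\tc2\t4.0\nr3\tc1\t0.5\nr2\tc3\t3.0\n"
)

row_index, col_index = {}, {}   # label -> integer position, shared by all chunks
rows, cols, vals = [], [], []

reader = pd.read_csv(sample, sep="\t", header=None,
                     names=["row_name", "col_name", "value"],
                     chunksize=2)  # tiny chunk size, for illustration only
for chunk in reader:
    # Assign the next free index the first time each label is seen.
    rows.append(np.array([row_index.setdefault(r, len(row_index))
                          for r in chunk["row_name"]]))
    cols.append(np.array([col_index.setdefault(c, len(col_index))
                          for c in chunk["col_name"]]))
    vals.append(chunk["value"].to_numpy())

matrix = sparse.coo_matrix(
    (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
    shape=(len(row_index), len(col_index)),
).tocsr()
```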

On Wed, Jan 29, 2020 at 5:57 PM Thomas Kluyver <[hidden email]> wrote:
> [...] It doesn't try to interpret them as referring to entries in a
> matrix - you have to do that as a separate step.