[SciPy-User] scipy.io.loadmat throws TypeError with large files


[SciPy-User] scipy.io.loadmat throws TypeError with large files

Richard Llewellyn
Hi,

I get this exception (or similar ones, with a different integer than 75724) when loading a sparse matrix (CSC) saved with savemat, all default options.

>>> m = loadmat('my_large_mat.mat')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/richard/venv3.3/lib/python3.3/site-packages/scipy/io/matlab/mio.py", line 176, in loadmat
    matfile_dict = MR.get_variables(variable_names)
  File "/home/richard/venv3.3/lib/python3.3/site-packages/scipy/io/matlab/mio5.py", line 274, in get_variables
    hdr, next_position = self.read_var_header()
  File "/home/richard/venv3.3/lib/python3.3/site-packages/scipy/io/matlab/mio5.py", line 236, in read_var_header
    raise TypeError('Expecting miMATRIX type here, got %d' %  mdtype)
TypeError: Expecting miMATRIX type here, got 75724


Here is the matrix in question:

> matrix
<400000x4176 sparse matrix of type '<class 'numpy.uint8'>'
with 934099575 stored elements in Compressed Sparse Column format>

It looks fine before saving.

It looks as if this only occurs when the saved matrix file size is > 4GB -- at least I haven't seen it with files in the 3GB range.

This is on 64-bit Linux.

Not a crisis, as I am chunking anyway, so I can just chunk smaller, but when I get more RAM it would be nice to bump it up to 8 GB files or so.
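(For the curious, by "chunking" I just mean splitting into row blocks and saving each block to its own .mat file. A minimal sketch, with illustrative names rather than my actual code:)

from scipy import sparse
from scipy.io import loadmat, savemat

def save_chunked(m, basename, rows_per_chunk):
    # Write the matrix as row blocks, each kept safely under the failing size.
    n_chunks = 0
    for start in range(0, m.shape[0], rows_per_chunk):
        block = m[start:start + rows_per_chunk].tocsc()
        savemat('%s_%03d.mat' % (basename, n_chunks), {'mat': block})
        n_chunks += 1
    return n_chunks

def load_chunked(basename, n_chunks):
    # Reassemble the matrix by stacking the row blocks in order.
    blocks = [loadmat('%s_%03d.mat' % (basename, i))['mat']
              for i in range(n_chunks)]
    return sparse.vstack(blocks).tocsc()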

Thanks.





Re: scipy.io.loadmat throws TypeError with large files

Matthew Brett
Hi,

On Wed, Aug 7, 2013 at 12:15 PM, Richard Llewellyn <[hidden email]> wrote:

> Hi,
>
> I get this exception (or similar ones, with a different integer than
> 75724) when loading a sparse matrix (CSC) saved with savemat, all
> default options.
> [...]
> It looks as if this only occurs when the saved matrix file size is > 4GB --
> at least I haven't seen it with files in the 3GB range.
>
> Not a crisis, as I am chunking anyway, so I can just chunk smaller, but
> when I get more RAM it would be nice to bump it up to 8 GB files or so.

Ugh.  I hesitate to ask, but do you get the same error for a very
large non-sparse matrix?

Thanks,

Matthew

Re: scipy.io.loadmat throws TypeError with large files

Richard Llewellyn
Thanks Matthew for the thought.

This may not fully answer your question, but the same values saved as a large sparse matrix (CSC) at 4.9GB fail to load with the same TypeError, while the same data saved as a dense numpy 2D array and as a numpy matrix (each less than half the file size, 1.8GB, when saved with savemat) load without issue.

I also noticed that a sparse (CSC) matrix saved at 3.9GB loaded without issue, again suggesting that 4GB is the threshold.

Again, this is not an immediate problem for me.

Thanks,
Richard

PS scipy 0.12


On Wed, Aug 7, 2013 at 4:47 PM, Matthew Brett <[hidden email]> wrote:
Hi,

On Wed, Aug 7, 2013 at 12:15 PM, Richard Llewellyn <[hidden email]> wrote:
> [...]

Ugh.  I hesitate to ask, but do you get the same error for a very
large non-sparse matrix?

Thanks,

Matthew

Re: scipy.io.loadmat throws TypeError with large files

Matthew Brett
Hi,

On Wed, Aug 7, 2013 at 9:04 PM, Richard Llewellyn <[hidden email]> wrote:
> Thanks Matthew for the thought.
>
> This may not fully answer your question, but the same values saved as a
> large sparse matrix (CSC) at 4.9GB fail to load with the same TypeError,
> while the same data saved as a dense numpy 2D array and as a numpy matrix
> (each less than half the file size, 1.8GB, when saved with savemat) load
> without issue.

Do the dimensions of the arrays (M, N) make a difference?  Or are they
all the same (M, N) shape, with more or less non-zeros?

Can you make a script that will replicate the problem for me?

Thanks a lot,

Matthew

Re: scipy.io.loadmat throws TypeError with large files

Richard Llewellyn
Hi Matthew,

Below is a short script showing that, on my machine, increasing the density triggers the error at file sizes over 4GB.  Originally I also triggered the error by increasing either M or N.
I suspect you'll run into a problem with available RAM.  I run this on my 32GB machine with 64GB of swap, and it swaps, so it takes at least several minutes to process.  A pain, I know.  Once I get more RAM it will be easier for me to test various permutations, but that will be a while.

Maybe a generator could be used to build the matrix?  Still, I think RAM will be an issue.  (See the sketch after the script below for one way to avoid the dense intermediate.)

Richard

####################################

import numpy as np
from scipy import sparse
from scipy.io import loadmat, savemat

no_ones = 1000  # 1000 fails; 800 yields a 3.6GB file and passes

filename = "test_csc"

z = np.zeros(4250)  # number of columns corresponds to my original problem, more or less
z[np.arange(no_ones)] += 1  # put ones in the first no_ones entries of each row
m = sparse.csc_matrix(np.array([z] * 400000))  # increasing the number of rows during chunking is where I first ran into the error

savemat(filename, {'mat': m})

# fails here with TypeError
m = loadmat(filename)['mat']
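
PS: one way to dodge the dense intermediate above would be to build the CSC components directly -- a sketch along these lines (untested at this scale; every row has ones in its first no_ones columns, so the first no_ones columns are full and the rest are empty):

import numpy as np
from scipy import sparse

M, N, no_ones = 400000, 4250, 1000

data = np.ones(M * no_ones, dtype=np.uint8)               # stored values, all 1
indices = np.tile(np.arange(M, dtype=np.int32), no_ones)  # row indices, column by column
indptr = np.concatenate([
    np.arange(0, (no_ones + 1) * M, M, dtype=np.int64),   # first no_ones columns are full
    np.full(N - no_ones, no_ones * M, dtype=np.int64),    # remaining columns are empty
])
m = sparse.csc_matrix((data, indices, indptr), shape=(M, N))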







On Thu, Aug 8, 2013 at 2:12 AM, Matthew Brett <[hidden email]> wrote:
Hi,

On Wed, Aug 7, 2013 at 9:04 PM, Richard Llewellyn <[hidden email]> wrote:
> [...]

Do the dimensions of the arrays (M, N) make a difference?  Or are they
all the same (M, N) shape, with more or less non-zeros?

Can you make a script that will replicate the problem for me?

Thanks a lot,

Matthew

Re: scipy.io.loadmat throws TypeError with large files

Matthew Brett
Hi,

On Thu, Aug 8, 2013 at 4:41 PM, Richard Llewellyn <[hidden email]> wrote:

> Hi Matthew,
>
> Below is a short script showing that, on my machine, increasing the
> density triggers the error at file sizes over 4GB.
> [...]

Aha - thanks for tracking that down a little further.

The problem is that the MATLAB 5-7 file format (non-HDF) uses a uint32
to store the number of bytes that the matrix takes up on disk.  The
matrices causing your error are a little larger than 2**32 bytes, hence
the error.

Here's a relevant thread:

http://www.mathworks.de/matlabcentral/newsreader/view_thread/307845
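
(To see the field in question -- a minimal sketch, assuming an uncompressed, little-endian MAT 5 file: the 8-byte tag of the first data element after the fixed 128-byte header holds the element's type and its byte count as two uint32s, and the byte count simply wraps modulo 2**32 for variables over 4GB:)

import struct

with open('test_csc.mat', 'rb') as f:
    f.seek(128)  # skip the fixed 128-byte MAT 5 file header
    mdtype, nbytes = struct.unpack('<II', f.read(8))

print(mdtype)  # 14 == miMATRIX for an ordinary uncompressed variable
print(nbytes)  # element size in bytes; wraps modulo 2**32 past 4GB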

It's not hard to reproduce the error with a non-sparse array (script appended).

We certainly need a better error for this - I'll try putting one in,
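
(Hypothetically, the write-time check could look something like the sketch below -- the shape of the guard, not actual scipy code:)

MAX_MAT5_ELEMENT_BYTES = 2**32 - 1  # the tag's byte count is a uint32

def check_mat5_element_size(nbytes):
    # Refuse to write a variable whose on-disk size cannot be
    # represented in a MAT 5 element tag.
    if nbytes > MAX_MAT5_ELEMENT_BYTES:
        raise ValueError("Variable needs %d bytes on disk, over the MAT 5 "
                         "per-element limit of %d bytes; consider splitting "
                         "the variable." % (nbytes, MAX_MAT5_ELEMENT_BYTES))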

Cheers,

Matthew

from io import BytesIO
import numpy as np
from scipy.io import loadmat, savemat

fobj = BytesIO()

# 2**32 bytes of int8 -- with the matrix header added, the variable's
# on-disk size overflows the uint32 byte count in its element tag
m = np.empty(2**32, dtype=np.int8)
n = np.arange(10).reshape((2, 5))

savemat(fobj, {'mat': m, 'n': n})

# fails here with TypeError
m = loadmat(fobj)['mat']