scipy.io.read_array: NaN in data file


scipy.io.read_array: NaN in data file

Erik Granstedt
Hello,

I am using scipy.io.read_array to read in values from data files to
arrays.  The data files occasionally contain "NaN"s, and I would like
the returned array to also contain "NaN"s.  I've tried calling
read_array with:

scipy.io.read_array(file('read_array_test.dat','r'),missing=float('NaN'))

but this still seems to convert the "NaN"s to 0.0

Is there a way to get it to return "NaN"s in the array instead of
converting them to 0.0 ?

Thanks,

-Erik
_______________________________________________
SciPy-user mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user

Re: scipy.io.read_array: NaN in data file

fred-87
Erik Granstedt wrote:
>
> Is there a way to get it to return "NaN"s in the array instead of
> converting them to 0.0 ?
I use fread() and fwrite() from scipy.io.numpyio without problem.

My 2 cts.

Cheers,

--
Fred

Re: scipy.io.read_array: NaN in data file

Pierre GM-2
In reply to this post by Erik Granstedt
With the SVN version of Numpy:

 >>> import numpy as np
 >>> import StringIO

 >>> a = np.genfromtxt(StringIO.StringIO("1, NaN"), delimiter=",")

If you want to output a MaskedArray:
 >>> a = np.genfromtxt(StringIO.StringIO("1, NaN"), delimiter=",",
 ...                   missing="NaN", usemask=True)
 >>> isinstance(a, np.ma.MaskedArray)
True
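
For anyone reading this on Python 3 with a current NumPy release, a rough equivalent of the snippet above could look like the sketch below. It assumes that the keyword is now spelled `missing_values` (the `missing=` form belongs to the 2009 SVN version) and that `io.StringIO` replaces the old `StringIO` module; the sample text is made up for illustration.

```python
import io

import numpy as np

# Made-up sample data: one "NaN" token among ordinary numbers.
text = "1,NaN\n2,3\n"

# Plain ndarray: the literal "NaN" is parsed as a floating-point NaN.
plain = np.genfromtxt(io.StringIO(text), delimiter=",")

# Masked array: entries matching "NaN" are flagged in the mask instead.
masked = np.genfromtxt(io.StringIO(text), delimiter=",",
                       missing_values="NaN", usemask=True)

print(plain)
print(masked.mask)
```

Use `np.isnan(plain)` to locate the NaN entries in the plain array, or work with `masked` directly if masked statistics are preferred.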

On Mar 10, 2009, at 11:57 AM, Erik Granstedt wrote:

> Is there a way to get it to return "NaN"s in the array instead of
> converting them to 0.0 ?

Re: scipy.io.read_array: NaN in data file

Dharhas Pothina

So does np.genfromtxt also deal with missing values in a file?

i.e., something like:
1999,1,22,42
1999,2,18,23
1999,3,,22
1999,4,12,
1999,5,,,
1999,6,12,34

I've worked out how to do this using np.loadtxt by defining conversions for each column, but it's pretty cumbersome and looks like spaghetti in the code.
 
- dharhas

>>> Pierre GM <[hidden email]> 3/10/2009 11:14 AM >>>

Re: scipy.io.read_array: NaN in data file

Pierre GM-2

On Mar 10, 2009, at 12:22 PM, Dharhas Pothina wrote:
>
> so does np.genfromtxt also deal with missing values in a file?

Yep:
 >>> data = StringIO.StringIO("""#
1999,4,12,
1999,5,,,
1999,6,12,34
""")
 >>> a = np.genfromtxt(data, delimiter=",", usemask=True)
 >>> a
masked_array(data =
  [[1999.0 4.0 12.0 --]
  [1999.0 5.0 -- --]
  [1999.0 6.0 12.0 34.0]],
              mask =
  [[False False False  True]
  [False False  True  True]
  [False False False False]],
        fill_value = 1e+20)
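
A sketch of the same idea for Python 3 and a current NumPy follows. One assumption to flag: the rows here all have four fields, whereas the row `1999,5,,,` above has a fifth, which a recent NumPy may reject as an inconsistent column count.

```python
import io

import numpy as np

# Every row has exactly four comma-separated fields; empty ones are missing.
text = """1999,4,12,
1999,5,,
1999,6,12,34
"""

a = np.genfromtxt(io.StringIO(text), delimiter=",", usemask=True)

# Empty fields are masked by default; filled() substitutes a value of choice.
print(a.mask)
print(a.filled(np.nan))
```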

Looks like the first 2 columns are YYYY and MM: you can use  
scikits.timeseries.tsfromtxt for that, with a special converter to  
transform the first 2 columns into a datearray:
 >>> dconv=lambda y,m: Date('M', year=y, month=m)




Re: scipy.io.read_array: NaN in data file

Dharhas Pothina

>> so does np.genfromtxt also deal with missing values in a file?
> Yep:

Sweet. This is going to be very useful.

> >>> data = StringIO.StringIO("""#
> 1999,4,12,
> 1999,5,,,
> 1999,6,12,34
> """)
> Looks like the first 2 columns are YYYY and MM: you can use
> scikits.timeseries.tsfromtxt for that, with a special converter to
> transform the first 2 columns into a datearray:
> dconv=lambda y,m: Date('M', year=y, month=m)

This was just an example I made up, but most of the files I'm reading are in this format:

columns that define the date, followed by columns of various data.

Could you run me through the commands to go from the file containing the data to the time series, masking missing data in the process?

i.e., can StringIO read from a file, or do I need to load the data first, then call StringIO, and then call tsfromtxt() to reread the file?

Thanks,

- dharhas







Re: scipy.io.read_array: NaN in data file

Pierre GM-2

On Mar 10, 2009, at 12:44 PM, Dharhas Pothina wrote:

>
>>> so does np.genfromtxt also deal with missing values in a file?
>> Yep:
>
> sweet. This is going to be very useful.

That was the whole aim of the game ;)

>
> This was just an example I made up. But most of the files I'm  
> reading are in the format :
>
> columns that define date followed by columns of various data
>
> Could you run me through the commands to go from the file containing  
> the data to the timeseries masking missing data in the process?
>
> ie. can StringIO read from a file or do I need to load the data  
> first and then call StringIO and then call tsfromtxt() to reread the  
> file?

ts.tsfromtxt is just a tailored version of np.genfromtxt. The input
can be a filename ("data.txt"), a file (gzipped files are supported),
or string content (a la StringIO). Just use datecols to specify which
column(s) should be interpreted as dates, and pass your delimiter, any
specific string representing missing data (e.g., "NaN"; by default, ''
is recognized), any additional converters... Just check the docstrings
of ts.tsfromtxt and np.genfromtxt for more info, and let us know how we
can improve them.
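
To illustrate the point about input types, here is a minimal sketch using np.genfromtxt only (ts.tsfromtxt is not shown, since its exact signature is not given in this thread). It assumes Python 3 and a current NumPy; the sample rows and the temporary file are made up for the example.

```python
import io
import os
import tempfile

import numpy as np

text = "1999,1,22,42\n1999,2,18,23\n1999,3,,22\n"

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as fh:
    fh.write(text)
    path = fh.name

try:
    # 1) Pass the file name directly.
    from_name = np.genfromtxt(path, delimiter=",", usemask=True)
    # 2) Pass an already opened file object.
    with open(path) as fh:
        from_file = np.genfromtxt(fh, delimiter=",", usemask=True)
finally:
    os.remove(path)

# 3) Pass in-memory text wrapped in StringIO.
from_text = np.genfromtxt(io.StringIO(text), delimiter=",", usemask=True)

print(from_name.mask)
```

All three calls should yield the same masked array, so there is no need to load the file yourself and wrap it in StringIO first.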



Re: scipy.io.read_array: NaN in data file

Dharhas Pothina

So this would need SVN versions of numpy and the timeseries scikit? What is the roadmap for release versions?

My only other concern is whether tsfromtxt would choke if duplicate dates were present in the data file. I still haven't found a good way in Python to remove duplicate dates in general.

Thanks,

- dharhas

>>> Pierre GM <[hidden email]> 3/10/2009 11:54 AM >>>


Re: scipy.io.read_array: NaN in data file

Pierre GM-2

On Mar 10, 2009, at 1:01 PM, Dharhas Pothina wrote:

>
> So this would need svn versions of numpy & the timeseries scikit?  
> What is the roadmap for release versions?

numpy 1.3 should be released on 04/01. scikits.timeseries 1. should be  
released shortly afterwards.


> my only other concern would be whether tsfromtxt would choke if  
> duplicate dates were present in the data file.

A TimeSeries object supports duplicated dates, so no problem on this
side: you'll have duplicated dates in your resulting series.


> I still haven't found a good way in python to remove duplicate dates  
> in general.

Well, that's because there's no standard way to do it: when you have
duplicated dates, should you take the first one? The last one? Take
some kind of average of the values?
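
The three choices can be sketched with plain NumPy, independent of scikits.timeseries. This is only an illustration under stated assumptions: the dates are represented as sorted integers, and the sample arrays are hypothetical.

```python
import numpy as np

# Hypothetical sample: integer "dates" (e.g. years) with repeats, already sorted.
dates = np.array([2000, 2001, 2001, 2002, 2002, 2002])
values = np.array([1.0, 2.0, 4.0, 3.0, 5.0, 7.0])

# np.unique reports, for each distinct date, the index of its first
# appearance, the group each entry belongs to, and the group sizes.
uniq, first_idx, inverse, counts = np.unique(
    dates, return_index=True, return_inverse=True, return_counts=True)

# Option 1: keep the first value recorded for each date.
keep_first = values[first_idx]                # [1., 2., 3.]

# Option 2: keep the last value (valid because the dates are sorted).
keep_last = values[first_idx + counts - 1]    # [1., 4., 7.]

# Option 3: average the values that share a date.
keep_mean = np.bincount(inverse, weights=values) / counts  # [1., 3., 5.]
```

Which option is right depends on why the duplicates exist, which is exactly why no default is imposed.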



Re: scipy.io.read_array: NaN in data file

Dharhas Pothina
>> I still haven't found a good way in python to remove duplicate dates  
>> in general.

> Well, because there's no standard way to do that: when you have  
> duplicated dates, should you take the first  one? The last one ? Take  
> some kind of average of the values ?

Assuming I choose one of the three options above. Most likely the first. How would I proceed then?

- dharhas




Re: scipy.io.read_array: NaN in data file

Timmie
Administrator
Hello!
> > Well, because there's no standard way to do that: when you have  
> > duplicated dates, should you take the first  one? The last one ? Take  
> > some kind of average of the values ?
>
> Assuming I choose one of the three options above. Most likely the first. How
> would I proceed then?
I haven't solved that problem either. But maybe the code in the interpolate
modules of the scikit has some checks on consecutive values?

Please share any ideas you have.

Thanks,
Timmie


Re: scipy.io.read_array: NaN in data file

Timmie
Administrator
In reply to this post by Pierre GM-2
> Well, because there's no standard way to do that: when you have  
> duplicated dates, should you take the first  one? The last one ? Take  
> some kind of average of the values ?
Sometimes there are inherent faults in the data set, so an automatic
treatment may introduce further errors.
It's only possible when these errors occur somewhat systematically.





Re: scipy.io.read_array: NaN in data file

Dharhas Pothina

In this particular case we know the cause:

It is either:

a) Overlapping files have been appended, i.e., file1 contains data from Jan1 to Feb1 and file2 contains data from Jan1 to March1. The overlap region has identical data.

b) The data comes from sequential deployments and there is a small overlap at the beginning of the second file, i.e., file1 has data from Jan1 to Feb1 and file2 contains data from Feb1 to March1. A few data points may overlap. These are junk because the equipment was set up in the lab and took measurements in the air until it was swapped with the installed instrument in the water.

In both cases it is appropriate to take the first value. In the second case we really should be stripping the bad data before appending, but this is a work in progress. Right now we are developing a semi-automated QA/QC procedure to clean up data before posting it on the web. We presently use a mix of awk and shell scripts, but I'm trying to convert everything to Python to make it easier to use and more maintainable, to get nicer plots than gnuplot, and to develop a GUI application to help us do this.

- dharhas

>>> Timmie <[hidden email]> 3/11/2009 4:35 AM >>>

Re: scipy.io.read_array: NaN in data file

Pierre GM-2
Dharhas,
To find duplicates, you can use the following functions (on SVN r2111).
find_duplicated_dates will give you a dictionary; you can then use the
values to decide what you want to do. remove_duplicated_dates will
strip the series to keep only the first occurrence of duplicated dates.




import numpy as np


def find_duplicated_dates(series):
    """
    Return a dictionary (duplicated dates <> indices) for the input series.

    The indices are given as a tuple of ndarrays, a la :meth:`nonzero`.

    Parameters
    ----------
    series : TimeSeries, DateArray
        A valid :class:`TimeSeries` or :class:`DateArray` object.

    Examples
    --------
    >>> series = time_series(np.arange(10),
    ...                      dates=[2000, 2001, 2002, 2003, 2003,
    ...                             2003, 2004, 2005, 2005, 2006],
    ...                      freq='A')
    >>> find_duplicated_dates(series)
    {<A-DEC : 2003>: (array([3, 4, 5]),), <A-DEC : 2005>: (array([7, 8]),)}
    """
    # Accept either a TimeSeries (which carries `_dates`) or a DateArray.
    dates = getattr(series, '_dates', series)
    # A step of 0 between consecutive dates marks a duplicate.
    steps = dates.get_steps()
    duplicated_dates = tuple(set(dates[steps==0]))
    indices = {}
    for d in duplicated_dates:
        indices[d] = (dates==d).nonzero()
    return indices



def remove_duplicated_dates(series):
    """
    Remove the entries of `series` corresponding to duplicated dates.

    The series is first sorted in chronological order.
    Only the first occurrence of a date is then kept, the others are discarded.

    Parameters
    ----------
    series : TimeSeries
        Time series to process
    """
    dates = getattr(series, '_dates', series)
    if not dates.is_chronological():
        series = series.copy()
        series.sort_chronologically()
        dates = series._dates
    # Compute the steps after sorting so that they line up with the sorted
    # series; the leading 1 ensures the very first entry is always kept.
    steps = np.concatenate(([1,], dates.get_steps()))
    return series[steps.nonzero()]
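
For readers without scikits.timeseries at hand, the same "a step of zero marks a duplicate" idea can be sketched with plain NumPy. This is an illustration, not the scikit's implementation: `remove_duplicates_keep_first` is a hypothetical helper, and dates are assumed to be stored as integers.

```python
import numpy as np


def remove_duplicates_keep_first(dates, values):
    """Sort chronologically (stable), then drop entries whose step is zero."""
    dates = np.asarray(dates)
    values = np.asarray(values)
    # A stable sort keeps entries with equal dates in their input order,
    # so "first" means first in the original file.
    order = np.argsort(dates, kind="stable")
    dates, values = dates[order], values[order]
    # Prepend a non-zero step so the very first entry is always kept.
    steps = np.concatenate(([1], np.diff(dates)))
    keep = steps != 0
    return dates[keep], values[keep]


d, v = remove_duplicates_keep_first([2003, 2000, 2003, 2001, 2001],
                                    [10, 20, 30, 40, 50])
print(d)  # [2000 2001 2003]
print(v)  # [20 40 10]
```

This matches the "overlapping files appended" case: after concatenating the files in order, the value from the earlier file wins.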





On Mar 11, 2009, at 9:13 AM, Dharhas Pothina wrote:


Re: scipy.io.read_array: NaN in data file

Dharhas Pothina

Great to hear. Once I'm done with my present project I'll see if I can install and play around with the SVN version.

- d

>>> Pierre GM <[hidden email]> 3/11/2009 11:26 AM >>>