Hello,
I am using scipy.io.read_array to read in values from data files to arrays. The data files occasionally contain "NaN"s, and I would like the returned array to also contain "NaN"s. I've tried calling read_array with:

scipy.io.read_array(file('read_array_test.dat','r'),missing=float('NaN'))

but this still seems to convert the "NaN"s to 0.0.

Is there a way to get it to return "NaN"s in the array instead of converting them to 0.0?

Thanks,
-Erik
_______________________________________________
SciPy-user mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
Erik Granstedt wrote:
> Is there a way to get it to return "NaN"s in the array instead of
> converting them to 0.0 ?

I use fread() and fwrite() from scipy.io.numpyio without problem.

My 2 cts.

Cheers,
--
Fred
In reply to this post by Erik Granstedt
With the SVN version of Numpy:
>>> import numpy as np
>>> import StringIO
>>> a = np.genfromtxt(StringIO.StringIO("1, NaN"), delimiter=",")

If you want to output a MaskedArray:

>>> a = np.genfromtxt(StringIO.StringIO("1, NaN"), delimiter=",", missing="NaN", usemask=True)
>>> isinstance(a, np.ma.MaskedArray)
True
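For readers on a current setup, the snippet above can be sketched as follows. This is only an illustration under assumptions not stated in the thread: Python 3 (so `io.StringIO` stands in for the Python 2 `StringIO.StringIO`) and a released NumPy rather than the SVN version discussed here.

```python
import io

import numpy as np

# Two comma-separated fields; the second is the literal text "NaN".
text = io.StringIO("1, NaN")

# The default float conversion parses "NaN" as an IEEE NaN,
# so it is not replaced by 0.0.
a = np.genfromtxt(text, delimiter=",")

print(a)            # the parsed 1-D float array
print(np.isnan(a))  # boolean array marking the NaN entry
```

`np.isnan` is the usual way to locate the NaN entries afterwards, since NaN never compares equal to itself.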
So does np.genfromtxt also deal with missing values in a file? i.e. something like:

1999,1,22,42
1999,2,18,23
1999,3,,22
1999,4,12,
1999,5,,,
1999,6,12,34

I've worked out how to do this using np.loadtxt by defining conversions for each column, but it's pretty cumbersome and looks like spaghetti in the code.

- dharhas
On Mar 10, 2009, at 12:22 PM, Dharhas Pothina wrote:
> so does np.genfromtxt also deal with missing values in a file?

Yep:

>>> data = StringIO.StringIO("""#
... 1999,4,12,
... 1999,5,,,
... 1999,6,12,34
... """)
>>> a = np.genfromtxt(data, delimiter=",", usemask=True)
>>> a
masked_array(data =
 [[1999.0 4.0 12.0 --]
 [1999.0 5.0 -- --]
 [1999.0 6.0 12.0 34.0]],
      mask =
 [[False False False True]
 [False False True True]
 [False False False False]],
      fill_value = 1e+20)

Looks like the first 2 columns are YYYY and MM: you can use scikits.timeseries.tsfromtxt for that, with a special converter to transform the first 2 columns into a datearray:

>>> dconv=lambda y,m: Date('M', year=y, month=m)
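The masking of empty fields described above can be reproduced with a small self-contained sketch. It assumes Python 3 and a released NumPy; the sample rows are made up for illustration and each has exactly four fields.

```python
import io

import numpy as np

# Hypothetical sample: year, month, and two data columns; empty fields are missing.
raw = "1999,4,12,\n1999,5,,\n1999,6,12,34\n"

# usemask=True returns a MaskedArray; empty strings are treated as missing by default.
masked = np.genfromtxt(io.StringIO(raw), delimiter=",", usemask=True)

# If a plain ndarray with NaN placeholders is preferred, fill the masked entries.
filled = masked.filled(np.nan)

print(masked.mask)
print(filled)
```

Keeping the mask separate from the data means a missing value can be told apart from a NaN that was actually present in the file.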
>> so does np.genfromtxt also deal with missing values in a file?
> Yep:

Sweet. This is going to be very useful.

> Looks like the first 2 columns are YYYY and MM: you can use
> scikits.timeseries.tsfromtxt for that, with a special converter to
> transform the first 2 columns into a datearray:
> dconv=lambda y,m: Date('M', year=y, month=m)

This was just an example I made up. But most of the files I'm reading are in the format:

columns that define date followed by columns of various data

Could you run me through the commands to go from the file containing the data to the timeseries, masking missing data in the process? i.e. can StringIO read from a file, or do I need to load the data first, then call StringIO, and then call tsfromtxt() to reread the file?

thanks,
- dharhas
On Mar 10, 2009, at 12:44 PM, Dharhas Pothina wrote:
>>> so does np.genfromtxt also deal with missing values in a file?
>> Yep:
>
> sweet. This is going to be very useful.

That was the whole aim of the game ;)

> This was just an example I made up. But most of the files I'm
> reading are in the format :
>
> columns that define date followed by columns of various data
>
> Could you run me through the commands to go from the file containing
> the data to the timeseries masking missing data in the process?
>
> ie. can StringIO read from a file or do I need to load the data
> first and then call StringIO and then call tsfromtxt() to reread the
> file?

ts.tsfromtxt is just a tailored version of np.genfromtxt. The input can be a filename ("data.txt"), a file (gzip version supported), or a string content (a la StringIO). Just use datecols to specify which columns should be interpreted as dates, your delimiter, any specific string representing missing data (e.g. "NaN"; by default, '' is recognized), any additional converter... Just check the docstrings of ts.tsfromtxt and np.genfromtxt for more info, and let us know how we can improve them.
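To illustrate that a filename can be passed directly, here is a minimal sketch using np.genfromtxt only (tsfromtxt is not used, since scikits.timeseries may not be installed). The file contents and the temporary file are made up for the example, and it assumes a current NumPy release, where the keyword for marker strings is `missing_values`.

```python
import os
import tempfile

import numpy as np

# Hypothetical file contents: year, month, then two data columns.
rows = "1999,1,22,42\n1999,2,NaN,23\n1999,3,,22\n"

# Write the rows to a temporary file so that a real path can be passed in.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as fh:
    fh.write(rows)
    path = fh.name

try:
    # Both "NaN" and the empty string are flagged as missing and masked.
    data = np.genfromtxt(path, delimiter=",", missing_values="NaN", usemask=True)
finally:
    os.remove(path)

print(data)
```

No StringIO wrapper is needed when the data already lives in a file; StringIO is only a convenience for feeding in-memory text to the same reader.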
So this would need svn versions of numpy & the timeseries scikit? What is the roadmap for release versions?

My only other concern would be whether tsfromtxt would choke if duplicate dates were present in the data file. I still haven't found a good way in python to remove duplicate dates in general.

thanks,
- dharhas
On Mar 10, 2009, at 1:01 PM, Dharhas Pothina wrote:
> So this would need svn versions of numpy & the timeseries scikit?
> What is the roadmap for release versions?

numpy 1.3 should be released on 04/01. scikits.timeseries 1. should be released shortly afterwards.

> my only other concern would be whether tsfromtxt would choke if
> duplicate dates were present in the data file.

A TimeSeries object supports duplicated dates, so no problem on this side: you'll have duplicated dates in your resulting series.

> I still haven't found a good way in python to remove duplicate dates
> in general.

Well, because there's no standard way to do that: when you have duplicated dates, should you take the first one? The last one? Take some kind of average of the values?
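The three choices mentioned above (first, last, average) can be sketched with plain NumPy arrays. This is not the scikits.timeseries API; the arrays are made-up sample data, and the dates are assumed to be already sorted.

```python
import numpy as np

# Hypothetical sorted dates (as years) with duplicates, and one value per row.
dates = np.array([2003, 2003, 2004, 2005, 2005, 2006])
values = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0])

# Unique dates, index of each first occurrence, group label per row, group sizes.
uniq, first_idx, inverse, counts = np.unique(
    dates, return_index=True, return_inverse=True, return_counts=True
)

# Option 1: keep the first value recorded for each date.
first = values[first_idx]

# Option 2: keep the last value (valid because the dates are sorted).
last = values[first_idx + counts - 1]

# Option 3: average all values sharing a date.
mean = np.bincount(inverse, weights=values) / counts

print(uniq, first, last, mean)
```

Which option is appropriate depends on why the duplicates exist, which is the point made in the reply above.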
>> I still haven't found a good way in python to remove duplicate dates
>> in general.
> Well, because there's no standard way to do that: when you have
> duplicated dates, should you take the first one? The last one ? Take
> some kind of average of the values ?

Assuming I choose one of the three options above, most likely the first, how would I proceed then?

- dharhas
Hello!
> > Well, because there's no standard way to do that: when you have
> > duplicated dates, should you take the first one? The last one ? Take
> > some kind of average of the values ?
>
> Assuming I choose one of the three options above. Most likely the first. How would I proceed then?

I haven't solved that problem either. But maybe the code from the interpolate modules of the scikit has some checks on consecutive values? Please tell us what ideas you have.

Thanks,
Timmie
In reply to this post by Pierre GM-2
> Well, because there's no standard way to do that: when you have
> duplicated dates, should you take the first one? The last one ? Take
> some kind of average of the values ?

Sometimes there are inherent faults in the data set, so an automatic treatment may introduce further errors. It is only possible when these errors occur somewhat systematically.
In this particular case we know the cause. It is either:

a) Overlapping files have been appended, i.e. file1 contains data from Jan1 to Feb1 and file2 contains data from Jan1 to March1. The overlap region has identical data.

b) The data comes from sequential deployments and there is a small overlap at the beginning of the second file, i.e. file1 has data from Jan1 to Feb1 and file2 contains data from Feb1 to March1. A few data points may overlap. These are junk because the equipment was set up in the lab and took measurements in the air until it was swapped with the installed instrument in the water.

In both these cases it is appropriate to take the first value. In the second case we really should be stripping the bad data before appending, but this is a work in progress. Right now we are developing a semi-automated QA/QC procedure to clean up data before posting it on the web. We presently use a mix of awk and shell scripts, but I'm trying to convert everything to python to make it easier to use and more maintainable, to have nicer plots than gnuplot, and to develop a gui application to help us do this.

- dharhas
Dharhas,
To find duplicates, you can use the following functions (on SVN r2111). find_duplicated_dates will give you a dictionary; you can then use the values to decide what you want to do. remove_duplicated_dates will strip the series to keep only the first occurrence of duplicated dates.

import numpy as np

def find_duplicated_dates(series):
    """
    Return a dictionary (duplicated dates <> indices) for the input series.

    The indices are given as a tuple of ndarrays, a la :meth:`nonzero`.

    Parameters
    ----------
    series : TimeSeries, DateArray
        A valid :class:`TimeSeries` or :class:`DateArray` object.

    Examples
    --------
    >>> series = time_series(np.arange(10),
    ...                      dates=[2000, 2001, 2002, 2003, 2003,
    ...                             2003, 2004, 2005, 2005, 2006], freq='A')
    >>> find_duplicated_dates(series)
    {<A-DEC : 2003>: (array([3, 4, 5]),), <A-DEC : 2005>: (array([7, 8]),)}
    """
    dates = getattr(series, '_dates', series)
    # A step of 0 between consecutive dates marks a duplicate.
    steps = dates.get_steps()
    duplicated_dates = tuple(set(dates[steps==0]))
    indices = {}
    for d in duplicated_dates:
        indices[d] = (dates==d).nonzero()
    return indices

def remove_duplicated_dates(series):
    """
    Remove the entries of `series` corresponding to duplicated dates.

    The series is first sorted in chronological order.
    Only the first occurrence of a date is then kept, the others are discarded.

    Parameters
    ----------
    series : TimeSeries
        Time series to process
    """
    dates = getattr(series, '_dates', series)
    if not dates.is_chronological():
        series = series.copy()
        series.sort_chronologically()
        dates = series._dates
    # Compute the steps after sorting so that they line up with the sorted series.
    steps = np.concatenate(([1,], dates.get_steps()))
    return series[steps.nonzero()]
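For readers without scikits.timeseries, the same step-based idea can be sketched on plain NumPy arrays. The function names `find_duplicated` and `remove_duplicated` below are hypothetical stand-ins, not part of any library, and the dates are represented as integers purely for illustration.

```python
import numpy as np


def find_duplicated(dates):
    """Map each duplicated date to the indices where it occurs."""
    dates = np.asarray(dates)
    uniq, counts = np.unique(dates, return_counts=True)
    return {d.item(): np.flatnonzero(dates == d) for d in uniq[counts > 1]}


def remove_duplicated(dates, values):
    """Sort chronologically, then keep the first occurrence of each date."""
    dates = np.asarray(dates)
    values = np.asarray(values)
    # A stable sort keeps rows with equal dates in their original (file) order.
    order = np.argsort(dates, kind="stable")
    dates, values = dates[order], values[order]
    # A zero step between consecutive dates marks a repeated date.
    steps = np.concatenate(([1], np.diff(dates)))
    keep = steps != 0
    return dates[keep], values[keep]


dates = [2000, 2001, 2002, 2003, 2003, 2003, 2004, 2005, 2005, 2006]
values = np.arange(10)

dups = find_duplicated(dates)
clean_dates, clean_values = remove_duplicated(dates, values)
print(dups)
print(clean_dates, clean_values)
```

The stable sort matters for the appended-files case described earlier in the thread: rows from the first file precede rows from the second, so keeping the first occurrence keeps the value from the earlier file.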
Great to hear. Once I'm done with my present project I'll see if I can install and play around with the SVN version.

- d