# predicting values based on (linear) models

14 messages

## predicting values based on (linear) models

Administrator

Hello, I had to do several statistical computations lately, so I looked at the statistical language R, since it already contains many models and much functionality. Is there a function like "predict" [1] in Python? Example:

```r
x <- rnorm(15)
y <- x + rnorm(15)
t <- predict(lm(y ~ x))
t
# => the predicted data determined by the linear model
#    (compare to scipy.stats.linregress)
```

How is this done in pure Python? Are many people using Rpy (rpy2) to access the statistical functionality provided by R? What are your experiences with this? Programming in Python seems to be more convenient than in R, but it lacks R's vast statistics.

Thanks in advance,
Timmie

[1] predict is a generic function for predictions from the results of various model fitting functions. The function invokes particular methods which depend on the class of the first argument. Most prediction methods which are similar to those for fitting linear models have an argument newdata specifying the first place to look for explanatory variables to be used for prediction. Considerable attempts are made to match up the columns in newdata to those used for fitting, for example that they are of comparable types and that any factors have the same level set in the same order (or can be transformed to be so). Time series prediction methods in package stats have an argument n.ahead specifying how many time steps ahead to predict.

_______________________________________________
SciPy-user mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/scipy-user
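An equivalent in pure Python can be sketched with scipy.stats.linregress, as the question suggests — a sketch, not code from the thread, assuming a recent NumPy/SciPy (`default_rng` is a newer API than this thread): the fitted values of the line play the role of R's `predict` evaluated at the original `x`.

```python
import numpy as np
from scipy import stats

# Sketch: fitted ("predicted") values from a simple linear regression,
# analogous to R's predict(lm(y ~ x)) at the original x values.
rng = np.random.default_rng(0)   # assumption: NumPy >= 1.17
x = rng.normal(size=15)
y = x + rng.normal(size=15)

slope, intercept, r_value, p_value, stderr = stats.linregress(x, y)
predicted = intercept + slope * x  # the fitted values of the linear model
```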

## Re: predicting values based on (linear) models

Administrator

Hello Josef,
thanks for your extensive answer. I really appreciate it and will see how I can use it.

> olsexample.py in the attachment is from the cookbook and I'm slowly reworking it.
> Fancier models will be in scipy.stats.models when they are ready for inclusion.

Do you have scipy.stats.models in an SVN repository somewhere?

> I'm using rpy (version 1) to check scipy.stats functions, and for sure
> the available methods are very extensive in R, while coverage of
> statistics and econometrics in Python packages including scipy is
> spotty, some good spots and many missing pieces.

As you are checking against R with rpy, do you think that the R functions are more accurate? Do you see a benefit from re-programming the stats functions in scipy?

Thanks and regards,
Timmie

## Re: predicting values based on (linear) models

On Wed, Jan 14, 2009 at 7:37 PM, Tim Michelsen <[hidden email]> wrote:

> Do you have scipy.stats.models in an SVN repository somewhere?

The main current location is at
http://bazaar.launchpad.net/~nipy-developers/nipy/trunk/files/head%3A/neuroimaging/fixes/scipy/stats/

I made a few changes to stats.models, so that all existing tests pass, at
http://bazaar.launchpad.net/~josef-pktd/%2Bjunk/nipy_stats_models2/files/head%3A/neuroimaging/fixes/scipy/stats/models/

> As you are checking against R with rpy, do you think that the R
> functions are more accurate?

The functions in stats that I tested or rewrote are usually identical to around 1e-15, but in some cases R has a more accurate test distribution for small samples (option "exact" in R), while in scipy.stats we only have the asymptotic distribution. Also, not all existing functions in scipy.stats are tested (yet).

> Do you see a benefit from re-programming the stats functions in scipy?

(Since R and its packages are GPL we cannot copy from them directly, but I was looking at R and Matlab for the interface/signature of statistical functions.)

I would like to see many of the basic statistics functions included in scipy (or in an add-on, or initially as cookbook recipes). Much of the basic supporting machinery for statistics, such as optimize, linalg, distributions, special and signal, is available, but it is a pain to figure out each time how to use it; for example, how to get the error and covariance estimates for linear or non-linear regression. There are many good specialized Python packages, for example for machine learning or MCMC, but no complete collection of basic statistical functionality.

But my impression is that, since scipy is mostly developer-driven (?), what finally ends up in scipy depends on the needs of the developers and their willingness to share code and incorporate user feedback.

Josef
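The "pain" described here — recovering error and covariance estimates for a linear regression from the basic building blocks — can be sketched by hand with NumPy alone. This is a sketch, not code from the thread; the data and variable names are made up for illustration.

```python
import numpy as np

# Sketch: OLS parameter covariance by hand.
# For y = X @ beta + noise, Cov(beta_hat) = sigma2 * inv(X'X),
# with sigma2 estimated from the residual sum of squares.
rng = np.random.default_rng(42)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])            # design matrix with intercept
beta, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])       # unbiased error-variance estimate
cov_beta = sigma2 * np.linalg.inv(X.T @ X)      # parameter covariance matrix
stderr = np.sqrt(np.diag(cov_beta))             # standard errors of intercept, slope
```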

## Re: predicting values based on (linear) models

On Jan 14, 2009, at 10:15 PM, [hidden email] wrote:

> The functions in stats that I tested or rewrote are usually identical
> to around 1e-15, but in some cases R has a more accurate test
> distribution for small samples (option "exact" in R), while in
> scipy.stats we only have the asymptotic distribution.

We could try to reimplement part of it in C. In any case, it might be worth emitting a warning (or at least being very explicit in the docs) that the results may not hold for samples smaller than 10-20.

> Also, not all
> existing functions in scipy.stats are tested (yet).

We should also try to make sure missing data are properly supported (not always possible) and that the results are consistent between the masked and non-masked versions.

> (Since R and its packages are GPL we cannot copy from them directly, but
> I was looking at R and Matlab for the interface/signature of
> statistical functions.)

There's one obvious advantage (on top of the pedagogical exercise): that's one dependency less.

> I would like to see many of the basic statistics functions included in
> scipy (or in an add-on, or initially as cookbook recipes). Much of the
> basic supporting machinery for statistics is available, but it is a pain
> to figure out each time how to use it; for example, how to get the error
> and covariance estimates for linear or non-linear regression.

Very true, but it's also what attracted me to numpy/scipy in the first place: the functions I needed did not exist at the time, and I was reluctant to rely on other software which, albeit more powerful, hid how values were actually calculated (what assumptions were made, what the validity domains were...). It was nice to have some time at hand.

> But my impression is that, since scipy is mostly developer-driven
> (?), what finally ends up in scipy depends on the needs of the
> developers and their willingness to share code and incorporate
> user feedback.

IMHO, the readiness to incorporate user feedback is here. The feedback is not, or at least not as much as we'd like.

## Re: predicting values based on (linear) models

On Wed, Jan 14, 2009 at 11:24 PM, Pierre GM <[hidden email]> wrote:

> We could try to reimplement part of it in C. In any case, it might
> be worth emitting a warning (or at least being very explicit in the
> docs) that the results may not hold for samples smaller than 10-20.

I am not a "C" person and never went much beyond HelloWorld in C. I just checked some of the docstrings, and I usually mention that we use the asymptotic distribution, but there are still pretty vague statements in some of them, such as:

"The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so."

> We should also try to make sure missing data are properly supported
> (not always possible) and that the results are consistent between the
> masked and non-masked versions.

I added a ticket so we don't forget to check this.

> IMHO, the readiness to incorporate user feedback is here. The feedback
> is not, or at least not as much as we'd like.

That depends on the subpackage; some problems in stats have been reported and known for quite some time, and the expected lifetime of a ticket can be pretty long. I was looking at different Python packages that use statistics, and many of them are reluctant to depend on scipy, while numpy looks very well established. But I suppose this will improve with time and the user base will increase, especially with the recent improvements in the build/distribution and the documentation.

Josef

## Re: predicting values based on (linear) models

[hidden email] wrote:

> I just checked some of the docstrings, and I usually mention that
> we use the asymptotic distribution, but there are still pretty vague
> statements in some of them, such as:
>
> "The p-values are not entirely reliable but are probably reasonable for
> datasets larger than 500 or so."

The 'exact' tests are usually Fisher's exact tests (http://en.wikipedia.org/wiki/Fisher%27s_exact_test), which are very different from the asymptotic tests and can get very computationally demanding. Also, I do not think that such statements should be part of the docstrings.

> That depends on the subpackage; some problems in stats have been
> reported and known for quite some time, and the expected lifetime of a
> ticket can be pretty long. I was looking at different Python packages
> that use statistics, and many of them are reluctant to depend on scipy,
> while numpy looks very well established.

There are different reasons for a lack of user base. One of the reasons for R is that many, many statistics classes use it.

Some of the reasons that I do not use scipy for stats (and have not looked at this in some time) included:
1) The difficulty of installation, which is considerably better now.
2) Lack of support for missing values, as virtually everything that I have worked with involves missing values at some stage.
3) Lack of a suitable statistical modeling interface where you can specify the model to be fit without having to create each individual array. The approach must work for a range of scenarios.

Bruce

## Re: predicting values based on (linear) models

On Jan 15, 2009, at 10:09 AM, Bruce Southey wrote:

> Some of the reasons that I do not use scipy for stats (and have not
> looked at this in some time) included:
> 1) The difficulty of installation, which is considerably better now.
> 2) Lack of support for missing values, as virtually everything that I
> have worked with involves missing values at some stage.

Can you give us examples of your needs? That way we could improve scipy.stats.mstats.

> 3) Lack of a suitable statistical modeling interface where you can
> specify the model to be fit without having to create each individual
> array. The approach must work for a range of scenarios.

Here again, a short example would help.

Thx a lot in advance,
P.

## Re: predicting values based on (linear) models

In reply to this post by Bruce Southey

On Thu, Jan 15, 2009 at 10:09 AM, Bruce Southey <[hidden email]> wrote:

> The 'exact' tests are usually Fisher's exact tests
> (http://en.wikipedia.org/wiki/Fisher%27s_exact_test), which are very
> different from the asymptotic tests and can get very demanding. Also, I
> do not think that such statements should be part of the docstrings.

According to the Wikipedia reference, that is for contingency tables. The two cases I worked on were the exact two-sided Kolmogorov-Smirnov distribution, where I found a good approximation, and the exact distribution of the Spearman correlation coefficient under the null of no correlation.

> Some of the reasons that I do not use scipy for stats (and have not
> looked at this in some time) included:
> 1) The difficulty of installation, which is considerably better now.
> 2) Lack of support for missing values, as virtually everything that I
> have worked with involves missing values at some stage.
> 3) Lack of a suitable statistical modeling interface where you can
> specify the model to be fit without having to create each individual
> array. The approach must work for a range of scenarios.

With 2 and 3 I have little experience. Missing observations I usually remove or clean in the initial data preparation. mstats provides functions for masked arrays, but stats mostly assumes no missing values. What would be the generic treatment for missing observations: just dropping all observations that have NaNs, or converting them to masked arrays and expanding the functions that can handle those?

Jonathan Taylor included a formula framework in stats.models similar to R's, but I haven't looked very closely at it. I haven't learned much of R's syntax, and I usually prefer to build my own arrays (with some exceptions such as polynomials) rather than hide them behind a mini model language.

For both stats.models and the interface for general stats functions, feedback would be very much appreciated.

Josef
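The "dropping all observations that have NaNs" option mentioned above can be sketched for an array with rows as observations and columns as variables (a sketch with made-up data, not code from the thread):

```python
import numpy as np

# Sketch: drop every observation (row) that contains at least one NaN.
data = np.array([[1.0, 2.0],
                 [np.nan, 3.0],
                 [4.0, 5.0],
                 [6.0, np.nan]])

complete = data[~np.isnan(data).any(axis=1)]  # keep only fully observed rows
```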

## Re: predicting values based on (linear) models

> With 2 and 3 I have little experience. Missing observations I usually
> remove or clean in the initial data preparation. mstats provides
> functions for masked arrays, but stats mostly assumes no missing
> values. What would be the generic treatment for missing observations:
> just dropping all observations that have NaNs, or converting them to
> masked arrays and expanding the functions that can handle those?

That depends on the situation. For linear fitting, missing values could be dropped (using the MaskedArray.compressed method if the data is 1D, or by using something like a[~np.isnan(a)]). In other cases, the missing values have to be taken into account.
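The two 1-D idioms mentioned here — boolean indexing on a plain ndarray and MaskedArray.compressed — can be sketched side by side (a sketch with made-up data):

```python
import numpy as np
import numpy.ma as ma

a = np.array([1.0, np.nan, 2.5, np.nan, 3.0])

# Plain ndarray with NaNs: boolean indexing drops the missing entries.
clean = a[~np.isnan(a)]

# Masked-array route: mask the invalid entries, then compress to a plain 1-D array.
m = ma.masked_invalid(a)
compressed = m.compressed()
```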

## Re: predicting values based on (linear) models

In reply to this post by josef.pktd

[hidden email] wrote:

> With 2 and 3 I have little experience. Missing observations I usually
> remove or clean in the initial data preparation. mstats provides
> functions for masked arrays, but stats mostly assumes no missing
> values. What would be the generic treatment for missing observations:
> just dropping all observations that have NaNs, or converting them to
> masked arrays and expanding the functions that can handle those?

No! We have had considerable discussion on this aspect in the past on the numpy/scipy lists. Basically, a missing observation should not be treated as a NaN (and there are different types of NaNs), because they are not the same. In some cases missing values disappear in the calculations, such as when creating the X'X matrix etc., but you probably do not want that if you have real NaNs in your data (say, after taking the square root of an array that includes negative numbers).

> Jonathan Taylor included a formula framework in stats.models similar
> to R's, but I haven't looked very closely at it. I haven't learned much
> of R's syntax, and I usually prefer to build my own arrays (with some
> exceptions such as polynomials) rather than hide them behind a mini
> model language.

If you look at R's lm function, you can see that you can fit a model using a formula. Without a similar framework you cannot do useful stats. You also must have a 'mini model language', because the inputs must be created correctly and it gets very repetitive very quickly. For example, in R (and all major stats languages like SAS) you can just fit regression models like lm(Y ~ x2) and lm(Y ~ x3 + x1), where Y, x1, x2, and x3 are within the appropriate dataframe (not necessarily in that order). If I understand mstats.linregress correctly, I have to create two arrays just to fit one of these two models. In the second case, I have to create yet another array. If I have my original data in one array, I have now unnecessarily duplicated three columns of that array, not to mention done all this extra work, hopefully error free, just to do two lines of R code.

Jonathan's formula is along the right approach but, based on the docstring, rather cumbersome, and it does not use array inputs. It would probably be more effective with a record masked array.

Bruce

PS Way back when, I did give feedback on the direction of the stats stuff.
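The repetition described here — one hand-built design matrix per model, versus R's one-line formulas lm(Y ~ x2) and lm(Y ~ x3 + x1) — can be sketched as follows (a sketch; the column layout and variable names are hypothetical, not from the thread):

```python
import numpy as np

# One data array with columns Y, x1, x2, x3 (hypothetical layout),
# and a separate hand-built design matrix for each model.
rng = np.random.default_rng(1)
n = 30
data = rng.normal(size=(n, 4))
Y = data[:, 0]

# Model 1: Y ~ x2
X1 = np.column_stack([np.ones(n), data[:, 2]])
beta1, *_ = np.linalg.lstsq(X1, Y, rcond=None)

# Model 2: Y ~ x3 + x1 -- a second, separately assembled design matrix
X2 = np.column_stack([np.ones(n), data[:, 3], data[:, 1]])
beta2, *_ = np.linalg.lstsq(X2, Y, rcond=None)
```

Each model duplicates columns of the original array by hand, which is exactly the bookkeeping a formula interface removes.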