Hello,
I had to do several statistical computations lately, so I looked at the
statistical language R, since it already contains many models and much
functionality.

Is there some function like "predict" [1] in Python?

Example:

x <- rnorm(15)
y <- x + rnorm(15)
t = predict(lm(y ~ x))

t => the predicted data determined by the linear model (compare to
scipy.stats.linregress)

How is this done in pure Python?

Are many people using Rpy (rpy2) to access the statistical
functionality provided by R?
What are your experiences with this?

Programming in Python seems to be more convenient than in R, but it
lacks the vast statistics.

Thanks in advance,
Timmie

[1] predict is a generic function for predictions from the results of
various model fitting functions. The function invokes particular
methods which depend on the class of the first argument.

Most prediction methods which are similar to those for fitting linear
models have an argument newdata specifying the first place to look for
explanatory variables to be used for prediction. Considerable attempts
are made to match up the columns in newdata to those used for fitting,
for example that they are of comparable types and that any factors have
the same level set in the same order (or can be transformed to be so).
Time series prediction methods in package stats have an argument
n.ahead specifying how many time steps ahead to predict.

_______________________________________________
SciPy-user mailing list
[hidden email]
http://projects.scipy.org/mailman/listinfo/scipyuser
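For the single-regressor case in the question, R's predict(lm(y ~ x))
can be mirrored with scipy.stats.linregress; a minimal sketch (variable
names are illustrative, not from any posted code):

```python
import numpy as np
from scipy import stats

np.random.seed(0)
x = np.random.randn(15)
y = x + np.random.randn(15)

# fit y ~ x and compute the fitted values, analogous to
# R's t = predict(lm(y ~ x))
slope, intercept, r_value, p_value, stderr = stats.linregress(x, y)
t = intercept + slope * x
```

Note that linregress only handles a single explanatory variable; the
multivariate case needs an explicit design matrix, as shown further
down in the thread.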
On Tue, Jan 13, 2009 at 3:33 PM, Tim Michelsen
<[hidden email]> wrote:
> Hello,
> I had to do several statistical computations lately, so I looked at
> the statistical language R, since it already contains many models and
> much functionality.
>
> Is there some function like "predict" [1] in Python?
>
> Example:
>
> x <- rnorm(15)
> y <- x + rnorm(15)
> t = predict(lm(y ~ x))
>
> t => the predicted data determined by the linear model (compare
> to scipy.stats.linregress)
>
> How is this done in pure Python?
>
> Are many people using Rpy (rpy2) to access the statistical
> functionality provided by R?
> What are your experiences with this?
>
> Programming in Python seems to be more convenient than in R, but it
> lacks the vast statistics.
>
> Thanks in advance,
> Timmie
>
> [...]
> [...]

scipy.stats.linregress treats only the case with a single explanatory
variable. Doing it explicitly: assume x is the data matrix without a
constant, y is the endogenous variable for estimation, and xnew are the
observations of the explanatory variables for prediction. See the scipy
tutorial; the only thing to watch out for are the matrix/array
dimensions.

>>> from scipy import linalg
>>> b, resid, rank, sigma = linalg.lstsq(np.c_[np.ones((x.shape[0], 1)), x], y)
>>> b
array([[ 5.47073574],
       [ 0.6575267 ],
       [ 2.09241884]])
>>> xnewwc = np.c_[np.ones((xnew.shape[0], 1)), xnew]
>>> ypred = np.dot(xnewwc, b)   # prediction with ols estimate of parameters b
>>> print np.c_[ynew, ypred, ynew - ypred]
[[ 8.23128832  8.69250962 -0.46122129]
 [ 9.14636291  9.66243911 -0.51607621]
 [-0.10198498 -0.27382934  0.17184436]]

or using the ols example from the cookbook, to which I added a predict
method:

# try_olsexample.py
import numpy as np
from olsexample import ols

def generate_data(nobs):
    x = np.random.randn(nobs, 2)
    btrue = np.array([[5, 1, 2]]).T
    y = np.dot(x, btrue[1:, :]) + btrue[0, :] + 0.5 * np.random.randn(nobs, 1)
    return y, x

y, x = generate_data(15)

est = ols(y, x)  # initialize and estimate with ols, constant added by default
print 'ols estimate'
print est.b
print np.array([[5, 1, 2]])  # true coefficients

ynew, xnew = generate_data(3)
ypred = est.predict(xnew)

print '   ytrue      ypred      error'
print np.c_[ynew, ypred, ynew - ypred]
# EOF

output:

ols estimate
[[ 5.47073574]
 [ 0.6575267 ]
 [ 2.09241884]]
[[5 1 2]]
   ytrue      ypred      error
[[ 8.23128832  8.69250962 -0.46122129]
 [ 9.14636291  9.66243911 -0.51607621]
 [-0.10198498 -0.27382934  0.17184436]]

olsexample.py, in the attachment, is from the cookbook; I'm slowly
reworking it.
Fancier models will be in scipy.stats.models when they are ready for
inclusion.

I'm using rpy (version 1) to check scipy.stats functions, and for sure
the available methods are very extensive in R, while the coverage of
statistics and econometrics in Python packages, including scipy, is
spotty: some good spots and many missing pieces.

Josef
Hello Josef,
thanks for your extensive answer. I really appreciate it and will see
how I can use it.

> olsexample.py, in the attachment, is from the cookbook; I'm slowly
> reworking it.
> Fancier models will be in scipy.stats.models when they are ready for
> inclusion.
Do you have scipy.stats.models in an SVN repository somewhere?

> I'm using rpy (version 1) to check scipy.stats functions, and for sure
> the available methods are very extensive in R, while the coverage of
> statistics and econometrics in Python packages, including scipy, is
> spotty: some good spots and many missing pieces.
As you are checking against R with rpy, do you think that the R
functions are more accurate? Do you see a benefit in reprogramming the
stats functions in scipy?

Thanks and regards,
Timmie
On Wed, Jan 14, 2009 at 7:37 PM, Tim Michelsen
<[hidden email]> wrote:
> Hello Josef,
> thanks for your extensive answer.
> I really appreciate it and will see how I can use it.
>
> Do you have scipy.stats.models in an SVN repository somewhere?

The main current location is at
http://bazaar.launchpad.net/~nipydevelopers/nipy/trunk/files/head%3A/neuroimaging/fixes/scipy/stats/

I made a few changes to stats.models, so that all existing tests pass, at
http://bazaar.launchpad.net/~josefpktd/%2Bjunk/nipy_stats_models2/files/head%3A/neuroimaging/fixes/scipy/stats/models/

> As you are checking against R with rpy, do you think that the R
> functions are more accurate?

The functions in stats that I tested or rewrote are usually identical
to around 1e-15, but in some cases R has a more accurate test
distribution for small samples (option "exact" in R), while in
scipy.stats we only have the asymptotic distribution. Also, not all
existing functions in scipy.stats are tested (yet).

> Do you see a benefit in reprogramming the stats functions in scipy?

(Since R and its packages are GPL we cannot copy from them directly,
but I was looking at R and matlab for the interface/signature of
statistical functions.)

I would like to see many of the basic statistics functions included in
scipy (or in an add-on, or initially as cookbook recipes).
Much of the basic supporting tooling for statistics, like optimize,
linalg, distributions, special and signal, is available, but it is a
pain to figure out each time how to use it; for example, how to get the
error and covariance estimates for linear or nonlinear regression.
There are many good specialized packages for Python, for example for
machine learning or MCMC, but no complete collection of basic
statistical functionality.

But my impression is that, since scipy is mostly developer driven (?),
what finally ends up in scipy depends on the needs of the developers
and their willingness to share the code and to incorporate user
feedback.

Josef
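To make that pain point concrete: parameter standard errors for a
nonlinear fit can be assembled from scipy.optimize.leastsq, but nothing
packages it as a one-liner. A sketch under made-up model and data (the
cov_x returned by leastsq is a raw curvature estimate and must be
scaled by the residual variance):

```python
import numpy as np
from scipy.optimize import leastsq

def model(p, x):
    # exponential decay, chosen only for illustration
    return p[0] * np.exp(-p[1] * x)

np.random.seed(0)
x = np.linspace(0, 4, 50)
y = model([2.5, 1.3], x) + 0.05 * np.random.randn(50)

residuals = lambda p: y - model(p, x)
p_est, cov_x, infodict, mesg, ier = leastsq(residuals, [1.0, 1.0],
                                            full_output=True)

# scale cov_x by the residual variance to get the covariance
# matrix of the parameter estimates
dof = len(x) - len(p_est)
s_sq = (residuals(p_est) ** 2).sum() / dof
pcov = cov_x * s_sq
perr = np.sqrt(np.diag(pcov))  # standard errors of p_est
```

The same scaling step applies to the linear lstsq case; having it done
once, correctly, inside a library is exactly the kind of basic
functionality the paragraph above asks for.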
On Jan 14, 2009, at 10:15 PM, [hidden email] wrote:
> The functions in stats that I tested or rewrote are usually identical
> to around 1e-15, but in some cases R has a more accurate test
> distribution for small samples (option "exact" in R), while in
> scipy.stats we only have the asymptotic distribution.

We could try to reimplement part of it in C. In any case, it might be
worth outputting a warning (or at least being very explicit in the doc)
that the results may not hold for samples smaller than 10-20.

> Also, not all
> existing functions in scipy.stats are tested (yet).

We should also try to make sure missing data are properly supported
(not always possible) and that the results are consistent between the
masked and non-masked versions.

> Do you see a benefit in reprogramming the stats functions in scipy?

> (Since R and its packages are GPL we cannot copy from them directly,
> but I was looking at R and matlab for the interface/signature of
> statistical functions.)

There's one obvious advantage (on top of the pedagogical exercise):
that's one dependency less.

> I would like to see many of the basic statistics functions included in
> scipy (or in an add-on, or initially as cookbook recipes). Much of the
> basic supporting tooling for statistics, like optimize, linalg,
> distributions, special and signal, is available, but it is a pain to
> figure out each time how to use it; for example, how to get the error
> and covariance estimates for linear or nonlinear regression.

Very true, but it's also what attracted me to numpy/scipy in the first
place: the functions I needed were at the time nonexistent, and I was
reluctant to rely on other software which, albeit more powerful, hid
how the values were actually calculated (what assumptions were made,
what the validity domains were...). It was nice to have some time at
hand.
> But my impression is that, since scipy is mostly developer driven
> (?), what finally ends up in scipy depends on the needs of the
> developers and their willingness to share the code and to incorporate
> user feedback.

IMHO, the readiness to incorporate user feedback is here. The feedback
is not, or at least not as much as we'd like.
On Wed, Jan 14, 2009 at 11:24 PM, Pierre GM <[hidden email]> wrote:
> We could try to reimplement part of it in C. In any case, it might
> be worth outputting a warning (or at least being very explicit in the
> doc) that the results may not hold for samples smaller than 10-20.

I am not a "C" person; I never went much beyond Hello World in C.
I just checked some of the doc strings, and I usually mention that we
use the asymptotic distribution, but there are still pretty vague
statements in some of the doc strings, such as

"The p-values are not entirely reliable but are probably reasonable for
datasets larger than 500 or so."

> We should also try to make sure missing data are properly supported
> (not always possible) and that the results are consistent between the
> masked and non-masked versions.

I added a ticket so we don't forget to check this.

> IMHO, the readiness to incorporate user feedback is here. The feedback
> is not, or at least not as much as we'd like.

That depends on the subpackage; some problems in stats have been
reported and known for quite some time, and the expected lifetime of a
ticket can be pretty long. I was looking at different Python packages
that use statistics, and many of them are reluctant to use scipy, while
numpy looks very well established. But I suppose this will improve with
time and the user base will increase, especially with the recent
improvements in the build/distribution and the documentation.

Josef
[hidden email] wrote:
> I just checked some of the doc strings, and I usually mention that we
> use the asymptotic distribution, but there are still pretty vague
> statements in some of the doc strings, such as
>
> "The p-values are not entirely reliable but are probably reasonable
> for datasets larger than 500 or so."

The 'exact' tests are usually Fisher's exact tests
(http://en.wikipedia.org/wiki/Fisher%27s_exact_test), which are very
different from the asymptotic tests and can get very demanding. Also, I
do not think that such statements should be part of the doc strings.

> That depends on the subpackage; some problems in stats have been
> reported and known for quite some time, and the expected lifetime of a
> ticket can be pretty long. I was looking at different Python packages
> that use statistics, and many of them are reluctant to use scipy,
> while numpy looks very well established.

There are different reasons for a lack of user base. One of the reasons
for R is that many, many statistics classes use it.

Some of the reasons that I do not use scipy for stats (and have not
looked at this in some time) included:
1) The difficulty of installation, which is considerably better now.
2) Lack of support for missing values, as virtually everything that I
have worked with involves missing values at some stage.
3) Lack of a suitable statistical modeling interface where you can
specify the model to be fit without having to create each individual
array. The approach must work for a range of scenarios.

Bruce
On Jan 15, 2009, at 10:09 AM, Bruce Southey wrote:
> Some of the reasons that I do not use scipy for stats (and have not
> looked at this in some time) included:
> 1) The difficulty of installation, which is considerably better now.
> 2) Lack of support for missing values, as virtually everything that I
> have worked with involves missing values at some stage.

Can you give us examples of your needs? That way we could improve
scipy.stats.mstats.

> 3) Lack of a suitable statistical modeling interface where you can
> specify the model to be fit without having to create each individual
> array. The approach must work for a range of scenarios.

Here again, a short example would help.
Thanks a lot in advance,
P.
On Thu, Jan 15, 2009 at 10:09 AM, Bruce Southey <[hidden email]> wrote:
> The 'exact' tests are usually Fisher's exact tests
> (http://en.wikipedia.org/wiki/Fisher%27s_exact_test), which are very
> different from the asymptotic tests and can get very demanding. Also,
> I do not think that such statements should be part of the doc strings.

According to the wikipedia reference, this is for contingency tables.
The two cases I worked on were the exact two-sided Kolmogorov-Smirnov
distribution, where I found a good approximation, and the exact
distribution of the Spearman correlation coefficient under the null of
no correlation.
> Some of the reasons that I do not use scipy for stats (and have not
> looked at this in some time) included:
> 1) The difficulty of installation, which is considerably better now.
> 2) Lack of support for missing values, as virtually everything that I
> have worked with involves missing values at some stage.
> 3) Lack of a suitable statistical modeling interface where you can
> specify the model to be fit without having to create each individual
> array. The approach must work for a range of scenarios.

With 2 and 3 I have little experience.
Missing observations I usually remove or clean in the initial data
preparation. mstats provides functions for masked arrays, but stats
mostly assumes no missing values. What would be the generic treatment
for missing observations: just dropping all observations that have
NaNs, or converting them to masked arrays and expanding the functions
that can handle those?

Jonathan Taylor included a formula framework in stats.models similar to
R, but I haven't looked very closely at it.
I haven't learned much of R's syntax, and I usually prefer to build my
own arrays (with some exceptions such as polynomials) rather than hide
them behind a mini model language.

For both stats.models and the interface for general stats functions,
feedback would be very appreciated.

Josef
>
> With 2 and 3 I have little experience.
> Missing observations I usually remove or clean in the initial data
> preparation. mstats provides functions for masked arrays, but stats
> mostly assumes no missing values. What would be the generic treatment
> for missing observations: just dropping all observations that have
> NaNs, or converting them to masked arrays and expanding the functions
> that can handle those?

That depends on the situation. For linear fitting, missing values can
be dropped (using the MaskedArray.compressed method if the data is 1D,
or something like a[~np.isnan(a)]). In other cases, the missing values
have to be taken into account.
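A minimal illustration of the dropping strategies just mentioned (the
arrays are made up for the example):

```python
import numpy as np
import numpy.ma as ma

a = np.array([1.0, np.nan, 3.0, np.nan, 5.0])

# plain ndarray: boolean indexing drops the NaNs
clean = a[~np.isnan(a)]

# masked array: mask the invalid entries, then compress them away
am = ma.masked_invalid(a)
compressed = am.compressed()

# for a regression pair, drop any observation missing in either array
x = np.array([1.0, 2.0, np.nan, 4.0])
y = np.array([2.0, np.nan, 6.0, 8.0])
keep = ~(np.isnan(x) | np.isnan(y))
x_fit, y_fit = x[keep], y[keep]
```

Both routes give the same 1D result; the masked route has the advantage
that the original positions of the missing values are preserved until
you explicitly compress.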
[hidden email] wrote:
> With 2 and 3 I have little experience.
> Missing observations I usually remove or clean in the initial data
> preparation. mstats provides functions for masked arrays, but stats
> mostly assumes no missing values. What would be the generic treatment
> for missing observations: just dropping all observations that have
> NaNs, or converting them to masked arrays and expanding the functions
> that can handle those?

No! We have had considerable discussion on this aspect in the past on
the numpy/scipy lists. Basically a missing observation should not be
treated as a NaN (and there are different types of NaNs) because they
are not the same. In some cases missing values disappear in the
calculations, such as when creating the X'X matrix, but you probably do
not want that if you have real NaNs in your data (say, after taking the
square root of an array that includes negative numbers).

> Jonathan Taylor included a formula framework in stats.models similar
> to R, but I haven't looked very closely at it. I haven't learned much
> of R's syntax, and I usually prefer to build my own arrays (with some
> exceptions such as polynomials) rather than hide them behind a mini
> model language.
> For both stats.models and the interface for general stats functions,
> feedback would be very appreciated.
>
> Josef

If you look at R's lm function, you can see that you can fit a model
using a formula. Without a similar framework you cannot do useful
stats. You also must have a 'mini model language', because the inputs
must be created correctly and it gets very repetitive very quickly.

For example, in R (and all major stats languages like SAS) you can just
fit regression models like lm(Y ~ x2) and lm(Y ~ x3 + x1), where Y, x1,
x2, and x3 are within the appropriate dataframe (not necessarily in
that order).

If I understand mstats.linregress correctly, I have to create two
arrays just to fit one of these two models. In the second case, I have
to create yet another array. If I have my original data in one array, I
have now unnecessarily duplicated 3 columns of that array, not to
mention done all this extra work, hopefully error free, just to do 2
lines of R code.

Jonathan's formula is along the right approach but, based on the doc
string, rather cumbersome, and it does not use array inputs. It would
probably be more effective with a masked record array.

Bruce

PS: Way back when, I did give feedback on the direction of the stats
stuff.
On Jan 15, 2009, at 12:36 PM, Bruce Southey wrote:
> No! We have had considerable discussion on this aspect in the past on
> the numpy/scipy lists. Basically a missing observation should not be
> treated as a NaN (and there are different types of NaNs) because they
> are not the same. In some cases missing values disappear in the
> calculations, such as when creating the X'X matrix, but you probably
> do not want that if you have real NaNs in your data (say, after taking
> the square root of an array that includes negative numbers).

numpy.ma implements equivalents of ufuncs that return a masked array,
where invalid outputs are masked (the output is invalid if the input is
masked or if it falls outside the validity domain of the function), so
we're set. There are functions that mask full rows or columns of a 2D
array, or even get rid of the columns/rows that contain one or several
missing values, which can be used in some cases.

> If you look at R's lm function, you can see that you can fit a model
> using a formula. Without a similar framework you cannot do useful
> stats. You also must have a 'mini model language', because the inputs
> must be created correctly and it gets very repetitive very quickly.
> For example, in R (and all major stats languages like SAS) you can
> just fit regression models like lm(Y ~ x2) and lm(Y ~ x3 + x1), where
> Y, x1, x2, and x3 are within the appropriate dataframe (not
> necessarily in that order).

Well, we could adapt the functions to accept a structured array as
input and define your x1, x2, ... from the fields of this array. I
tried to significantly improve the support of structured arrays in
numpy.ma 1.3, so it shouldn't be that difficult to use masked arrays by
default.

> If I understand mstats.linregress correctly, I have to create two
> arrays just to fit one of these two models. In the second case, I have
> to create yet another array.
> If I have my original data in one array, I have now unnecessarily
> duplicated 3 columns of that array, not to mention done all this extra
> work, hopefully error free, just to do 2 lines of R code.

For the first case (Y ~ x2), you don't need 2 arrays; you can use a 2D
array with either 2 rows or 2 columns and that would work.
mstats.linregress uses the same approach as stats.linregress. The
second case is a tad more complex, but could probably be adapted
relatively easily.

> Jonathan's formula is along the right approach but, based on the doc
> string, rather cumbersome, and it does not use array inputs. It would
> probably be more effective with a masked record array.

OK, more on my todo list...
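The structured-array idea can be sketched as follows; the field names
and the design() helper are hypothetical, invented here only to show
how "Y ~ x2" could be expressed without duplicating columns by hand:

```python
import numpy as np

np.random.seed(1)
nobs = 15

# a structured array as a minimal stand-in for an R dataframe
data = np.zeros(nobs, dtype=[('Y', float), ('x1', float),
                             ('x2', float), ('x3', float)])
data['x1'] = np.random.randn(nobs)
data['x2'] = np.random.randn(nobs)
data['x3'] = np.random.randn(nobs)
data['Y'] = 1.0 + 2.0 * data['x2'] + 0.1 * np.random.randn(nobs)

def design(data, names):
    # hypothetical helper: a constant column plus the named fields
    cols = [np.ones(len(data))] + [data[n] for n in names]
    return np.column_stack(cols)

# "Y ~ x2" -- and "Y ~ x3 + x1" would be design(data, ['x3', 'x1'])
X = design(data, ['x2'])
b = np.linalg.lstsq(X, data['Y'], rcond=None)[0]
```

Fields are selected by name, so the two models from the example share
one array and no columns are copied until the design matrix is built.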
On Thu, Jan 15, 2009 at 12:36 PM, Bruce Southey <[hidden email]> wrote:
> [hidden email] wrote:
>> With 2 and 3 I have little experience.
>> Missing observations I usually remove or clean in the initial data
>> preparation. mstats provides functions for masked arrays, but stats
>> mostly assumes no missing values. What would be the generic treatment
>> for missing observations: just dropping all observations that have
>> NaNs, or converting them to masked arrays and expanding the functions
>> that can handle those?
>
> No! We have had considerable discussion on this aspect in the past on
> the numpy/scipy lists. Basically a missing observation should not be
> treated as a NaN (and there are different types of NaNs) because they
> are not the same. In some cases missing values disappear in the
> calculations, such as when creating the X'X matrix, but you probably
> do not want that if you have real NaNs in your data (say, after taking
> the square root of an array that includes negative numbers).
> If you look at R's lm function, you can see that you can fit a model
> using a formula. Without a similar framework you cannot do useful
> stats. You also must have a 'mini model language', because the inputs
> must be created correctly and it gets very repetitive very quickly.
>
> For example, in R (and all major stats languages like SAS) you can
> just fit regression models like lm(Y ~ x2) and lm(Y ~ x3 + x1), where
> Y, x1, x2, and x3 are within the appropriate dataframe (not
> necessarily in that order).

For the simple case, it could be done by accepting a sequence of args
and building the design matrix inside the function, e.g.

ols(Y, x3, x1, x2**2)

To build design matrices, I wrote, for myself, functions like
simplex(x, n), where x is a 2D column matrix, which builds the
interaction terms matrix

x[:,1], x[:,2], x[:,1]*x[:,2], ..., x[:,1]**n

which, if I read the R stats help correctly, would correspond to
(x[:,1] + x[:,2])^n.

My ols call would then be ols(Y, simplex(x3, x1, 2)).

This uses explicit functions and avoids the mini language, but it
requires some design-building functions. Being able to attach some
meta-information to data arrays would be nice, but I haven't used such
features much, except for building my own classes in Python or structs
in matlab.

Josef
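A design-building function of the kind described above might look like
this; the name cross_terms and the column ordering are made up for
illustration (the simplex() code itself is not shown in the thread):

```python
import numpy as np

def cross_terms(x1, x2, n=2):
    # hypothetical design builder: columns x1, x2, x1*x2, then the
    # pure powers of each regressor up to degree n
    cols = [x1, x2, x1 * x2]
    for k in range(2, n + 1):
        cols.append(x1 ** k)
        cols.append(x2 ** k)
    return np.column_stack(cols)

x1 = np.arange(5.0)
x2 = np.linspace(0.0, 1.0, 5)
X = cross_terms(x1, x2, n=2)  # 5 observations, 5 design columns
```

A call like ols(Y, cross_terms(x3, x1)) then keeps the design matrix
explicit, which is the trade-off argued for here: plain functions
instead of a formula mini language.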
Hello,
thanks very much for continuing this discussion. It is very helpful for
me, and perhaps for others who need to choose the right tools for
statistical processing and analysis.

> IMHO, the readiness to incorporate user feedback is here. The feedback
> is not, or at least not as much as we'd like.

I am often very much occupied writing my own special routines on top of
scipy/numpy etc., and I therefore find it difficult to contribute code.
But I can do testing and bug reporting. Since I want the libraries I
rely on to work well, I have a vital interest in this.

Please just indicate where testing is needed. If it matches my
knowledge and applications, I am happy to contribute.

Kind regards,
Timmie