predicting values based on (linear) models


predicting values based on (linear) models

Timmie
Hello,
I have had to do several statistical computations lately, so I looked
at the statistical language R, since it already seems to contain many
models and a lot of functionality.

Is there some function like "predict" [1] in Python?

Example:

      x <- rnorm(15)
      y <- x + rnorm(15)
      t = predict(lm(y ~ x))

      t => the predicted data determined by the linear model (compare
to scipy.stats.linregress)


How is this done by pure Python?
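
The closest sketch I have come up with so far uses
scipy.stats.linregress and does the prediction step by hand (this is
my guess at an equivalent, not a known one-liner):

      import numpy as np
      from scipy import stats

      x = np.random.randn(15)
      y = x + np.random.randn(15)

      # fit y ~ x; linregress returns slope, intercept, r, p, stderr
      slope, intercept, r, p, stderr = stats.linregress(x, y)
      t = intercept + slope * x   # analogue of predict(lm(y ~ x))
      print t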


Are many people using RPy (rpy2) to access the statistical
functionality provided by R?
What are your experiences with it?

Programming in Python seems more convenient than in R, but Python
lacks R's vast statistics coverage.


Thanks in advance,
Timmie


[1] predict is a generic function for predictions from the results of
various model fitting functions. The function invokes particular methods
which depend on the class of the first argument.

Most prediction methods which are similar to those for linear models have an
argument newdata specifying the first place to look for explanatory
variables to be used for prediction. Some considerable attempts are made
to match up the columns in newdata to those used for fitting, for
example that they are of comparable types and that any factors have the
same level set in the same order (or can be transformed to be so).
  Time series prediction methods in package stats have an argument
n.ahead specifying how many time steps ahead to predict.


Re: predicting values based on (linear) models

josef.pktd
On Tue, Jan 13, 2009 at 3:33 PM, Tim Michelsen
<[hidden email]> wrote:

> Is there some function like "predict" [1] in Python?
> [...]
> How is this done by pure Python?
> [...]
This is on the to-do list. scipy.stats.linregress handles only the
case of a single explanatory variable.

Doing it explicitly
----------------------------
Assume x is the data matrix without a constant, y is the endogenous
variable for estimation, and xnew holds the observations of the
explanatory variables for prediction. See the scipy tutorial; the only
thing to watch out for is the matrix/array dimensions.

>>> import numpy as np
>>> from scipy import linalg
>>> b,resid,rank,sigma = linalg.lstsq(np.c_[np.ones((x.shape[0],1)),x],y)
>>> b
array([[ 5.47073574],
       [ 0.6575267 ],
       [ 2.09241884]])
>>> xnewwc=np.c_[np.ones((xnew.shape[0],1)),xnew]
>>> ypred = np.dot(xnewwc,b)   # prediction with ols estimate of parameters b
>>> print np.c_[ynew, ypred, ynew - ypred]
[[ 8.23128832  8.69250962 -0.46122129]
 [ 9.14636291  9.66243911 -0.51607621]
 [-0.10198498 -0.27382934  0.17184436]]


Or, using the ols example from the cookbook, to which I added a
predict method:
-----------------------------------------------------------------------

#------------------------ try_olsexample.py

import numpy as np
from olsexample import ols

def generate_data(nobs):
    x = np.random.randn(nobs,2)
    btrue = np.array([[5,1,2]]).T
    y = np.dot(x,btrue[1:,:]) + btrue[0,:] + 0.5 * np.random.randn(nobs,1)
    return y,x

y,x = generate_data(15)

est = ols(y,x)  # initialize and estimate with ols, constant added by default
print 'ols estimate'
print est.b
print np.array([[5,1,2]])  # true coefficients

ynew,xnew = generate_data(3)
ypred = est.predict(xnew)

print '    ytrue        ypred        error'
print np.c_[ynew, ypred, ynew - ypred]
#------------------------------- EOF

output:

ols estimate
[[ 5.47073574]
 [ 0.6575267 ]
 [ 2.09241884]]
[[5 1 2]]
    ytrue        ypred        error
[[ 8.23128832  8.69250962 -0.46122129]
 [ 9.14636291  9.66243911 -0.51607621]
 [-0.10198498 -0.27382934  0.17184436]]


olsexample.py (in the attachment) is from the cookbook; I'm slowly
reworking it. Fancier models will be in scipy.stats.models when they
are ready for inclusion.

I'm using rpy (version 1) to check scipy.stats functions. The
available methods in R are certainly very extensive, while coverage of
statistics and econometrics in Python packages, including scipy, is
spotty: some good spots and many missing pieces.
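
As a rough sketch of this kind of cross-check (the way rpy converts
R's t.test result may differ between versions; I assume a dict keyed
by 'p.value' here):

import numpy as np
from rpy import r
from scipy import stats

x = np.random.randn(20)

# one-sample t-test in scipy
t_sp, p_sp = stats.ttest_1samp(x, 0.0)

# the same test in R via rpy; r.t_test maps to R's t.test
res = r.t_test(list(x), mu=0.0)   # assumed to convert to a dict
print p_sp, res['p.value'], p_sp - res['p.value']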

Josef


Attachment: olsexample.py (11K)

Re: predicting values based on (linear) models

Timmie
Hello Josef,
thanks for your extensive answer.
I really appreciate it and will see how I can use it.


> olsexample.py (in the attachment) is from the cookbook; I'm slowly
> reworking it. Fancier models will be in scipy.stats.models when they
> are ready for inclusion.
Do you have the scipy.stats.models in an SVN repository somewhere?

> I'm using rpy (version 1) to check scipy.stats functions. The
> available methods in R are certainly very extensive, while coverage
> of statistics and econometrics in Python packages, including scipy,
> is spotty: some good spots and many missing pieces.
As you are checking against R with rpy, do you think that the R
functions are more accurate?
Do you see benefit from re-programming the stats functions in scipy?

Thanks and regards,
Timmie


Re: predicting values based on (linear) models

josef.pktd
On Wed, Jan 14, 2009 at 7:37 PM, Tim Michelsen
<[hidden email]> wrote:
> Hello Josef,
> thanks for your extensive answer.
> I really appreciate it and will see how I can use it.
>
>> olsexample.py (in the attachment) is from the cookbook; I'm slowly
>> reworking it. Fancier models will be in scipy.stats.models when they
>> are ready for inclusion.
> Do you have the scipy.stats.models in an SVN repository somewhere?

The main current location is at
http://bazaar.launchpad.net/~nipy-developers/nipy/trunk/files/head%3A/neuroimaging/fixes/scipy/stats/

I made a few changes to stats.models, so that all existing tests pass at
http://bazaar.launchpad.net/~josef-pktd/%2Bjunk/nipy_stats_models2/files/head%3A/neuroimaging/fixes/scipy/stats/models/

>
>> I'm using rpy (version 1) to check scipy.stats functions. The
>> available methods in R are certainly very extensive, while coverage
>> of statistics and econometrics in Python packages, including scipy,
>> is spotty: some good spots and many missing pieces.
> As you are checking against R with rpy, do you think that the R
> functions are more accurate?

The functions in stats that I tested or rewrote usually agree to
around 1e-15, but in some cases R has a more accurate test
distribution for small samples (option "exact" in R), while in
scipy.stats we have only the asymptotic distribution. Also, not all
existing functions in scipy.stats are tested (yet).

> Do you see benefit from re-programming the stats functions in scipy?
>

(Since R and its packages are GPL, we cannot copy from them directly,
but I was looking at R and Matlab for the interface/signature of
statistical functions.)

I would like to see many of the basic statistics functions included in
scipy (or in an add-on, or initially as cookbook recipes). Many of the
basic supporting tools for statistics, like optimize, linalg,
distributions, special, and signal, are available, but it is a pain to
figure out each time how to use them; for example, how to get the
error and covariance estimates for linear or non-linear regression.
There are many good specialized Python packages available, for example
for machine learning or MCMC, but no complete collection of basic
statistical functionality.
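
For the non-linear case, a minimal sketch of what I mean, using
scipy.optimize.leastsq (the exponential model and the numbers are just
an illustration):

import numpy as np
from scipy import optimize

def resid(b, x, y):
    # residuals of an illustrative exponential model
    return y - b[0] * np.exp(b[1] * x)

x = np.linspace(0, 4, 50)
y = 2.5 * np.exp(0.8 * x) + 0.3 * np.random.randn(50)

b, cov_x, info, mesg, ier = optimize.leastsq(resid, [1.0, 1.0],
                                             args=(x, y), full_output=1)
# scale cov_x by the residual variance to get the parameter covariance
s_sq = (resid(b, x, y) ** 2).sum() / (len(x) - len(b))
print b                                # parameter estimates
print np.sqrt(np.diag(cov_x * s_sq))   # standard errors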

But my impression is that, since scipy is mostly developer-driven (?),
what finally ends up in scipy depends on the needs of the developers
and on their willingness to share code and to incorporate user
feedback.

Josef

Re: predicting values based on (linear) models

Pierre GM-2

On Jan 14, 2009, at 10:15 PM, [hidden email] wrote:
> The functions in stats that I tested or rewrote usually agree to
> around 1e-15, but in some cases R has a more accurate test
> distribution for small samples (option "exact" in R), while in
> scipy.stats we have only the asymptotic distribution.

We could try to reimplement part of it in C. In any case, it might be
worth outputting a warning (or at least being very explicit in the
docs) that the results may not hold for samples smaller than 10-20.

> Also, not all
> existing functions in scipy.stats are tested (yet).

We should also try to make sure missing data are properly supported  
(not always possible) and that the results are consistent between the  
masked and non-masked versions.



>> Do you see benefit from re-programming the stats functions in scipy?
>>
>
> (Since R and its packages are GPL we cannot copy from it directly, but
> I was looking at R and matlab for the interface/signature of
> statistical functions.)

There's one obvious advantage (on top of the pedagogical exercise):  
that's one dependency less.


> I would like to see many of the basic statistics functions included
> in scipy (or in an add-on, or initially as cookbook recipes). Many of
> the basic supporting tools for statistics, like optimize, linalg,
> distributions, special, and signal, are available, but it is a pain
> to figure out each time how to use them; for example, how to get the
> error and covariance estimates for linear or non-linear regression.

Very true, but it's also what attracted me to numpy/scipy in the first
place: the functions I needed were non-existent at the time, and I was
reluctant to rely on other software which, albeit more powerful, hid
how values were actually calculated (what assumptions were made, what
the validity domains were...). It was nice to have some time at hand.
>
> But my impression is that, since scipy is mostly developer-driven
> (?), what finally ends up in scipy depends on the needs of the
> developers and on their willingness to share code and to incorporate
> user feedback.

IMHO, the readiness to incorporate user feedback is here. The feedback  
is not, or at least not as much as we'd like.

Re: predicting values based on (linear) models

josef.pktd
On Wed, Jan 14, 2009 at 11:24 PM, Pierre GM <[hidden email]> wrote:

>
> On Jan 14, 2009, at 10:15 PM, [hidden email] wrote:
>> The functions in stats that I tested or rewrote usually agree to
>> around 1e-15, but in some cases R has a more accurate test
>> distribution for small samples (option "exact" in R), while in
>> scipy.stats we have only the asymptotic distribution.
>
> We could try to reimplement part of it in C. In any case, it might be
> worth outputting a warning (or at least being very explicit in the
> docs) that the results may not hold for samples smaller than 10-20.

I am not a "C" person and I never went much beyond HelloWorld in C.
I just checked some of the doc strings, and I am usually mention that
we use the asymptotic distribution, but there are still pretty vague
statements in some of the doc strings, such as

"The p-values are not entirely reliable but are probably reasonable for
datasets larger than 500 or so."


>
>> Also, not all
>> existing functions in scipy.stats are tested (yet).
>
> We should also try to make sure missing data are properly supported
> (not always possible) and that the results are consistent between the
> masked and non-masked versions.
>

I added a ticket so we don't forget to check this.



> IMHO, the readiness to incorporate user feedback is here. The feedback
> is not, or at least not as much as we'd like.

That depends on the subpackage; some problems in stats have been
reported and known for quite some time, and the expected lifetime of a
ticket can be pretty long. I was looking at different Python packages
that use statistics, and many of them are reluctant to use scipy,
while numpy looks very well established. But I suppose this will
improve with time, and the user base will increase, especially with
the recent improvements in the build/distribution and the
documentation.

Josef

Re: predicting values based on (linear) models

Bruce Southey
[hidden email] wrote:

> ... in some cases R has a more accurate test distribution for small
> samples (option "exact" in R), while in scipy.stats we have only the
> asymptotic distribution.
> [...]
> "The p-values are not entirely reliable but are probably reasonable
> for datasets larger than 500 or so."
The 'exact' tests are usually Fisher's exact tests
(http://en.wikipedia.org/wiki/Fisher%27s_exact_test), which are very
different from asymptotic testing and can get computationally very
demanding. Also, I do not think that such statements should be part of
the doc strings.

> That depends on the subpackage; some problems in stats have been
> reported and known for quite some time, and the expected lifetime of
> a ticket can be pretty long. I was looking at different Python
> packages that use statistics, and many of them are reluctant to use
> scipy, while numpy looks very well established. [...]
There are different reasons for the lack of a user base. One of the
reasons for R's is that many, many statistics classes use it.

Some of the reasons that I do not use scipy for stats (and I have not
looked at this in some time) include:
1) The difficulty of installation, which is considerably better now.
2) Lack of support for missing values, as virtually everything that I
have worked with involves missing values at some stage.
3) Lack of a suitable statistical modeling interface where you can
specify the model to be fit without having to create each individual
array. The approach must work for a range of scenarios.

Bruce

Re: predicting values based on (linear) models

Pierre GM-2

On Jan 15, 2009, at 10:09 AM, Bruce Southey wrote:
> Some of the reasons that I do not use scipy for stats (and I have not
> looked at this in some time) include:
> 1) The difficulty of installation, which is considerably better now.
> 2) Lack of support for missing values, as virtually everything that I
> have worked with involves missing values at some stage.

Could you give us some examples of your needs? That way we could
improve scipy.stats.mstats.


>
> 3) Lack of a suitable statistical modeling interface where you can
> specify the model to be fit without having to create each individual
> array. The approach must work for a range of scenarios.

Here again, a short example would help.

Thx a lot in advance,
P.

Re: predicting values based on (linear) models

josef.pktd
On Thu, Jan 15, 2009 at 10:09 AM, Bruce Southey <[hidden email]> wrote:

> The 'exact' tests are usually Fisher's exact tests
> (http://en.wikipedia.org/wiki/Fisher%27s_exact_test), which are very
> different from asymptotic testing and can get computationally very
> demanding. [...]

According to the Wikipedia reference, that is for contingency tables.
The two cases I worked on were the exact two-sided Kolmogorov-Smirnov
distribution, where I found a good approximation, and the exact
distribution of the Spearman correlation coefficient under the null of
no correlation.
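
As an illustration of the gap, scipy does expose the exact
small-sample distribution of the one-sided KS statistic via
scipy.special.smirnov, which can be compared directly with the
asymptotic approximation (a small sketch):

import numpy as np
from scipy import special

n, d = 10, 0.3
print special.smirnov(n, d)      # exact P(D_n+ > d) for sample size n
print np.exp(-2 * n * d ** 2)    # the asymptotic approximation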

> There are different reasons for the lack of a user base. One of the
> reasons for R's is that many, many statistics classes use it.
>
> Some of the reasons that I do not use scipy for stats (and I have not
> looked at this in some time) include:
> 1) The difficulty of installation, which is considerably better now.
> 2) Lack of support for missing values, as virtually everything that I
> have worked with involves missing values at some stage.
> 3) Lack of a suitable statistical modeling interface where you can
> specify the model to be fit without having to create each individual
> array. The approach must work for a range of scenarios.

With 2) and 3) I have little experience.
Missing observations I usually remove or clean in the initial data
preparation. mstats provides functions for masked arrays, but stats
mostly assumes no missing values. What would be the generic treatment
for missing observations: just dropping all observations that have
NaNs, or converting them to masked arrays and expanding the functions
that can handle those?

Jonathan Taylor included a formula framework in stats.models similar
to R's, but I haven't looked very closely at it. I haven't learned
much of R's syntax, and I usually prefer to build my own arrays (with
some exceptions such as polynomials) rather than hide them behind a
mini model language.
For both stats.models and the interface for general stats functions,
feedback would be very much appreciated.

Josef

Re: predicting values based on (linear) models

Pierre GM-2
>
> With 2) and 3) I have little experience.
> Missing observations I usually remove or clean in the initial data
> preparation. mstats provides functions for masked arrays, but stats
> mostly assumes no missing values. What would be the generic treatment
> for missing observations: just dropping all observations that have
> NaNs, or converting them to masked arrays and expanding the functions
> that can handle those?
>

That depends on the situation. For linear fitting, missing values
could be dropped (using the MaskedArray.compressed method if the data
is 1D, or something like a[~np.isnan(a)] for a plain array). In other
cases, the missing values have to be taken into account.
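
A small sketch of both idioms, on a made-up series with two missing
values:

import numpy as np
import numpy.ma as ma

y = ma.masked_invalid([1.0, np.nan, 2.5, np.nan, 3.0])
print y.compressed()        # masked entries dropped

a = np.array([1.0, np.nan, 2.5, np.nan, 3.0])
print a[~np.isnan(a)]       # the same result for a plain ndarray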



Re: predicting values based on (linear) models

Bruce Southey
[hidden email] wrote:

> With 2) and 3) I have little experience.
> Missing observations I usually remove or clean in the initial data
> preparation. mstats provides functions for masked arrays, but stats
> mostly assumes no missing values. What would be the generic treatment
> for missing observations: just dropping all observations that have
> NaNs, or converting them to masked arrays and expanding the functions
> that can handle those?
No! We have had considerable discussion on this aspect in the past on
the numpy/scipy lists. Basically, a missing observation should not be
treated as a NaN (and there are different types of NaNs) because they
are not the same. In some cases missing values disappear in the
calculations, such as when forming the X'X matrix, but you probably do
not want that if you have real NaNs in your data (say, after taking
the square root of an array that includes negative numbers).

> Jonathan Taylor included a formula framework in stats.models similar
> to R's, but I haven't looked very closely at it. I haven't learned
> much of R's syntax, and I usually prefer to build my own arrays (with
> some exceptions such as polynomials) rather than hide them behind a
> mini model language.
> For both stats.models and the interface for general stats functions,
> feedback would be very much appreciated.
If you look at R's lm function, you can see that you can fit a model
using a formula. Without a similar framework you cannot do useful
stats. You also must have a 'mini model language', because the inputs
must be created correctly, and that gets very repetitive very quickly.

For example, in R (and in all major stats languages, like SAS) you can
just fit regression models like lm(Y ~ x2) and lm(Y ~ x3 + x1), where
Y, x1, x2, and x3 are within the appropriate dataframe (not
necessarily in that order).

If I understand mstats.linregress correctly, I have to create two
arrays just to fit one of these two models. In the second case, I have
to create yet another array. If I have my original data in one array,
I have now unnecessarily duplicated three columns of that array, not
to mention done all this extra work, hopefully error free, just to
reproduce two lines of R code.

Jonathan's formula framework is along the right lines but, based on
the doc string, rather cumbersome, and it does not use array inputs.
It would probably be more effective with a masked record array.

Bruce

PS: Way back when, I did give feedback on the direction of the stats
stuff.

Re: predicting values based on (linear) models

Pierre GM-2

On Jan 15, 2009, at 12:36 PM, Bruce Southey wrote:
> No! We have had considerable discussion on this aspect in the past
> on the numpy/scipy lists. Basically, a missing observation should
> not be treated as a NaN (and there are different types of NaNs)
> because they are not the same. In some cases missing values
> disappear in the calculations, such as when forming the X'X matrix,
> but you probably do not want that if you have real NaNs in your data
> (say, after taking the square root of an array that includes
> negative numbers).

numpy.ma implements equivalents of the ufuncs that return a masked
array, where invalid outputs are masked (the output is invalid if the
input is masked or if it falls outside the validity domain of the
function), so we're set. There are also functions that mask full rows
or columns of a 2D array, or even drop the columns/rows that contain
one or several missing values, which can be used in some cases.
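
For instance, in this small sketch the masked square root marks the
negative entry as invalid instead of silently producing a NaN:

import numpy as np
import numpy.ma as ma

a = ma.array([4.0, -1.0, 9.0])
print ma.sqrt(a)    # the -1.0 comes out masked, not as NaN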

> If you look at R's lm function, you can see that you can fit a model
> using a formula. Without a similar framework you cannot do useful
> stats. You also must have a 'mini model language', because the inputs
> must be created correctly, and that gets very repetitive very
> quickly.
>
> For example, in R (and in all major stats languages, like SAS) you
> can just fit regression models like lm(Y ~ x2) and lm(Y ~ x3 + x1),
> where Y, x1, x2, and x3 are within the appropriate dataframe (not
> necessarily in that order).

Well, we could adapt the functions to accept a structured array as
input and define your x1, x2, ... from the fields of this array. I
tried to significantly improve the support for structured arrays in
numpy.ma 1.3, so it shouldn't be that difficult to use masked arrays
by default.



> If I understand mstats.linregress correctly, I have to create two
> arrays just to fit one of these two models. In the second case, I
> have to create yet another array. If I have my original data in one
> array, I have now unnecessarily duplicated three columns of that
> array, not to mention done all this extra work, hopefully error
> free, just to reproduce two lines of R code.

For the first case (Y ~ x2) you don't need two arrays: you can use a
2D array with either two rows or two columns, and that would work.
mstats.linregress uses the same approach as stats.linregress.
The second case is a tad more complex, but it could probably be
adapted relatively easily.


> Jonathan's formula framework is along the right lines but, based on
> the doc string, rather cumbersome, and it does not use array inputs.
> It would probably be more effective with a masked record array.

OK, more on my todo list...

Re: predicting values based on (linear) models

josef.pktd
On Thu, Jan 15, 2009 at 12:36 PM, Bruce Southey <[hidden email]> wrote:

> If you look at R's lm function, you can see that you can fit a model
> using a formula. Without a similar framework you cannot do useful
> stats. You also must have a 'mini model language', because the inputs
> must be created correctly, and that gets very repetitive very
> quickly.
>
> For example, in R (and in all major stats languages, like SAS) you
> can just fit regression models like lm(Y ~ x2) and lm(Y ~ x3 + x1),
> where Y, x1, x2, and x3 are within the appropriate dataframe (not
> necessarily in that order).

For the simple case, it could be done by accepting a sequence of args
and building the design matrix inside the function, e.g.

ols( Y, x3, x1, x2**2 )

To build design matrices I wrote, for myself, functions like
simplex(x, n), where x is a 2D column matrix; it builds the
interaction-terms matrix x[:,1], x[:,2], x[:,1]*x[:,2], ...,
x[:,1]**n, which, if I read the R stats help correctly, corresponds to
(x[:,1] + x[:,2])^n.

My ols call would then be

       ols(Y, simplex(x3, x1, 2))

This uses explicit functions and avoids the mini-language, but it
requires some design-building functions.
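
A sketch of what such a design-building helper could look like (the
name and the exact set of columns are only illustrative, not the
actual simplex function):

import numpy as np

def design2(x1, x2, n):
    # powers of x1 and x2 up to degree n, plus the cross term
    cols = [x1 ** k for k in range(1, n + 1)]
    cols += [x2 ** k for k in range(1, n + 1)]
    cols.append(x1 * x2)
    return np.column_stack(cols)

# ols(Y, design2(x3, x1, 2)) would then regress Y on
# x3, x3**2, x1, x1**2, x3*x1 (plus the constant that ols adds)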

Being able to attach some meta-information to data arrays would be
nice, but I haven't used such features much, except for building my
own classes in Python or structs in Matlab.

Josef

Re: predicting values based on (linear) models

Timmie
Hello,
thanks very much for continuing this discussion. It is very helpful
for me, and perhaps for others who need to choose the right tools for
statistical processing and analysis.

> IMHO, the readiness to incorporate user feedback is here. The feedback  
> is not, or at least not as much as we'd like.
I am often very much occupied writing my own special routines on top
of scipy/numpy etc., and therefore I find it difficult to contribute
code. But I can do testing and bug reporting. Since I want the
libraries I rely on to work well, I have a vital interest in this.

Please just indicate where testing is needed. If it matches my
knowledge and applications, I am happy to contribute.

Kind regards,
Timmie

