Hey folks,
I've run into a bit of a roadblock. I've got some model runs (x) in an Nx2 array where the first column is the input, and the second column is the output. So in a single case, I'd do:
popt, pcov = scipy.optimize.curve_fit(myFit, x[:,0], x[:,1]) But how should I handle, say, 5000 model runs such that x.shape = (500, N, 2) and I want the 5000 results for popt?
This works: popt_array = np.empty(5000, 2) for r, layer in enumerate(model_runs):
popt, pcov = scipy.optimize.curve_fit(myFit, layer[:,0] layer[:,1]) popt_array[r] = popt But is there a better (faster way)? The number of model runs and data points may grow sustatially (~10^4 runs and 10^3 data points). Thanks, -paul
_______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
Hi Paul,
I've played around with this for scipy.optimize.leastsq, and I've never been successful. I'm not an expert, but I don't think the original code was set up for that type of usage. (Just speculating here... An equally likely explanation is that I'm not smart enough to figure it out :) What I've found really useful for this type of problem is IPython's new parallel architecture (IPython version 0.13.1). It's easy to set up an IPython cluster and get things running in parallel through the notebook interface, so I've attached a simple notebook that does a trivial fitting (using leastsq) of some noisy data. Before you run the notebook, you'll need to set up a cluster through the Cluster tab in the IPython notebook dashboard. Unfortunately, in my experience I've found that the speed improvement is only noticeable until the number of IPython engines in your cluster equals the number of cores not the number of processor threads. (This might be obvious for those in the know, but it took me a while to figure out.) But every little bit of improvement helps, I suppose. Ryan On 2/1/2013 6:07 PM, Paul Hobson wrote:
_______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user Parallel Opt.ipynb (6K) Download Attachment |
On Sat, Feb 2, 2013 at 9:38 PM, Ryan Nelson <[hidden email]> wrote:
> Hi Paul, > > I've played around with this for scipy.optimize.leastsq, and I've never been > successful. I'm not an expert, but I don't think the original code was set > up for that type of usage. (Just speculating here... An equally likely > explanation is that I'm not smart enough to figure it out :) > > What I've found really useful for this type of problem is IPython's new > parallel architecture (IPython version 0.13.1). It's easy to set up an > IPython cluster and get things running in parallel through the notebook > interface, so I've attached a simple notebook that does a trivial fitting > (using leastsq) of some noisy data. Before you run the notebook, you'll need > to set up a cluster through the Cluster tab in the IPython notebook > dashboard. > > Unfortunately, in my experience I've found that the speed improvement is > only noticeable until the number of IPython engines in your cluster equals > the number of cores not the number of processor threads. (This might be > obvious for those in the know, but it took me a while to figure out.) But > every little bit of improvement helps, I suppose. > > Ryan > > > On 2/1/2013 6:07 PM, Paul Hobson wrote: > > Hey folks, > > I've run into a bit of a roadblock. I've got some model runs (x) in an Nx2 > array where the first column is the input, and the second column is the > output. So in a single case, I'd do: > > popt, pcov = scipy.optimize.curve_fit(myFit, x[:,0], x[:,1]) > > But how should I handle, say, 5000 model runs such that x.shape = (500, N, > 2) and I want the 5000 results for popt? > > This works: > popt_array = np.empty(5000, 2) > for r, layer in enumerate(model_runs): > popt, pcov = scipy.optimize.curve_fit(myFit, layer[:,0] layer[:,1]) > popt_array[r] = popt > > But is there a better (faster way)? The number of model runs and data points > may grow sustatially (~10^4 runs and 10^3 data points). I think there is no efficient way to avoid the loop. besides going parallel: If you are only interested in popt, then using optimize.leastsq directly will save some calculations. The other major speedup for this kind of problems is to find good starting values. For example, if the solutions in the sequence are close to each other, then using previous solutions as starting values for the next case will speed up convergence. IIRC, in statsmodels we use an average of previous solutions, after some initial warm-up, in a similar problem. A moving average could also be useful. Josef > > Thanks, > -paul > > > _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user > > > > _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user > SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Sun, Feb 3, 2013 at 4:28 AM, <[hidden email]> wrote:
Thanks for the advice, Ryan and Josef. I've been meaning to get a feel of for IPython Parallel for some time now, so this will be a good impetus to do so.
Also, Josef, the suggestion about using optimize.leastsq with an initial guess, will be a very good one, I think. A better description of what I'm doing is actually bootstrapping the regression parameters to find their 95% confidence intervals. So it make a lot sense, really, to get the initial guess at the parameters from the full data set for use in the bootstrap iterations.
Cheers, -paul _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Sun, Feb 3, 2013 at 1:29 PM, Paul Hobson <[hidden email]> wrote:
> > On Sun, Feb 3, 2013 at 4:28 AM, <[hidden email]> wrote: >> >> On Sat, Feb 2, 2013 at 9:38 PM, Ryan Nelson <[hidden email]> wrote: >> > Hi Paul, >> > >> > I've played around with this for scipy.optimize.leastsq, and I've never >> > been >> > successful. I'm not an expert, but I don't think the original code was >> > set >> > up for that type of usage. (Just speculating here... An equally likely >> > explanation is that I'm not smart enough to figure it out :) >> > >> > What I've found really useful for this type of problem is IPython's new >> > parallel architecture (IPython version 0.13.1). It's easy to set up an >> > IPython cluster and get things running in parallel through the notebook >> > interface, so I've attached a simple notebook that does a trivial >> > fitting >> > (using leastsq) of some noisy data. Before you run the notebook, you'll >> > need >> > to set up a cluster through the Cluster tab in the IPython notebook >> > dashboard. >> > >> > Unfortunately, in my experience I've found that the speed improvement is >> > only noticeable until the number of IPython engines in your cluster >> > equals >> > the number of cores not the number of processor threads. (This might be >> > obvious for those in the know, but it took me a while to figure out.) >> > But >> > every little bit of improvement helps, I suppose. >> > >> > Ryan >> > >> > >> > On 2/1/2013 6:07 PM, Paul Hobson wrote: >> > >> > Hey folks, >> > >> > I've run into a bit of a roadblock. I've got some model runs (x) in an >> > Nx2 >> > array where the first column is the input, and the second column is the >> > output. So in a single case, I'd do: >> > >> > popt, pcov = scipy.optimize.curve_fit(myFit, x[:,0], x[:,1]) >> > >> > But how should I handle, say, 5000 model runs such that x.shape = (500, >> > N, >> > 2) and I want the 5000 results for popt? >> > >> > This works: >> > popt_array = np.empty(5000, 2) >> > for r, layer in enumerate(model_runs): >> > popt, pcov = scipy.optimize.curve_fit(myFit, layer[:,0] layer[:,1]) >> > popt_array[r] = popt >> > >> > But is there a better (faster way)? The number of model runs and data >> > points >> > may grow sustatially (~10^4 runs and 10^3 data points). >> >> I think there is no efficient way to avoid the loop. >> >> besides going parallel: >> >> If you are only interested in popt, then using optimize.leastsq >> directly will save some calculations. >> >> The other major speedup for this kind of problems is to find good >> starting values. For example, if the solutions in the sequence are >> close to each other, then using previous solutions as starting values >> for the next case will speed up convergence. >> IIRC, in statsmodels we use an average of previous solutions, after >> some initial warm-up, in a similar problem. A moving average could >> also be useful. >> >> Josef >> >> > >> > Thanks, >> > -paul >> > > Thanks for the advice, Ryan and Josef. I've been meaning to get a feel of > for IPython Parallel for some time now, so this will be a good impetus to do > so. If you just want to run it on multiple cores, then joblib is easier. scikit-learn uses it extensively, and statsmodels uses it at the few parts, especially for bootstrap. > > Also, Josef, the suggestion about using optimize.leastsq with an initial > guess, will be a very good one, I think. A better description of what I'm > doing is actually bootstrapping the regression parameters to find their 95% > confidence intervals. So it make a lot sense, really, to get the initial > guess at the parameters from the full data set for use in the bootstrap > iterations. The only problems I would worry a bit about in this case are the possible presence of multiple local minima, if there are any, and convergence failures. Josef > > Cheers, > -paul > > > _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user > SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Sun, Feb 3, 2013 at 10:47 AM, <[hidden email]> wrote:
Thanks for the advice. Currently working with linear fits, but I'm hoping to expand this work to be more general. I'll be sure to keep these things in mind.
-p _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Mon, Feb 4, 2013 at 12:13 AM, Paul Hobson <[hidden email]> wrote:
> On Sun, Feb 3, 2013 at 10:47 AM, <[hidden email]> wrote: >> >> On Sun, Feb 3, 2013 at 1:29 PM, Paul Hobson <[hidden email]> wrote: >> > >> > On Sun, Feb 3, 2013 at 4:28 AM, <[hidden email]> wrote: >> >> >> >> On Sat, Feb 2, 2013 at 9:38 PM, Ryan Nelson <[hidden email]> >> >> wrote: >> >> > Hi Paul, >> >> > >> >> > I've played around with this for scipy.optimize.leastsq, and I've >> >> > never >> >> > been >> >> > successful. I'm not an expert, but I don't think the original code >> >> > was >> >> > set >> >> > up for that type of usage. (Just speculating here... An equally >> >> > likely >> >> > explanation is that I'm not smart enough to figure it out :) >> >> > >> >> > What I've found really useful for this type of problem is IPython's >> >> > new >> >> > parallel architecture (IPython version 0.13.1). It's easy to set up >> >> > an >> >> > IPython cluster and get things running in parallel through the >> >> > notebook >> >> > interface, so I've attached a simple notebook that does a trivial >> >> > fitting >> >> > (using leastsq) of some noisy data. Before you run the notebook, >> >> > you'll >> >> > need >> >> > to set up a cluster through the Cluster tab in the IPython notebook >> >> > dashboard. >> >> > >> >> > Unfortunately, in my experience I've found that the speed improvement >> >> > is >> >> > only noticeable until the number of IPython engines in your cluster >> >> > equals >> >> > the number of cores not the number of processor threads. (This might >> >> > be >> >> > obvious for those in the know, but it took me a while to figure out.) >> >> > But >> >> > every little bit of improvement helps, I suppose. >> >> > >> >> > Ryan >> >> > >> >> > >> >> > On 2/1/2013 6:07 PM, Paul Hobson wrote: >> >> > >> >> > Hey folks, >> >> > >> >> > I've run into a bit of a roadblock. I've got some model runs (x) in >> >> > an >> >> > Nx2 >> >> > array where the first column is the input, and the second column is >> >> > the >> >> > output. So in a single case, I'd do: >> >> > >> >> > popt, pcov = scipy.optimize.curve_fit(myFit, x[:,0], x[:,1]) >> >> > >> >> > But how should I handle, say, 5000 model runs such that x.shape = >> >> > (500, >> >> > N, >> >> > 2) and I want the 5000 results for popt? >> >> > >> >> > This works: >> >> > popt_array = np.empty(5000, 2) >> >> > for r, layer in enumerate(model_runs): >> >> > popt, pcov = scipy.optimize.curve_fit(myFit, layer[:,0] >> >> > layer[:,1]) >> >> > popt_array[r] = popt >> >> > >> >> > But is there a better (faster way)? The number of model runs and data >> >> > points >> >> > may grow sustatially (~10^4 runs and 10^3 data points). >> >> >> >> I think there is no efficient way to avoid the loop. >> >> >> >> besides going parallel: >> >> >> >> If you are only interested in popt, then using optimize.leastsq >> >> directly will save some calculations. >> >> >> >> The other major speedup for this kind of problems is to find good >> >> starting values. For example, if the solutions in the sequence are >> >> close to each other, then using previous solutions as starting values >> >> for the next case will speed up convergence. >> >> IIRC, in statsmodels we use an average of previous solutions, after >> >> some initial warm-up, in a similar problem. A moving average could >> >> also be useful. >> >> >> >> Josef >> >> >> >> > >> >> > Thanks, >> >> > -paul >> >> >> > >> > Thanks for the advice, Ryan and Josef. I've been meaning to get a feel >> > of >> > for IPython Parallel for some time now, so this will be a good impetus >> > to do >> > so. >> >> If you just want to run it on multiple cores, then joblib is easier. >> scikit-learn uses it extensively, and statsmodels uses it at the few >> parts, especially for bootstrap. >> >> > >> > Also, Josef, the suggestion about using optimize.leastsq with an initial >> > guess, will be a very good one, I think. A better description of what >> > I'm >> > doing is actually bootstrapping the regression parameters to find their >> > 95% >> > confidence intervals. So it make a lot sense, really, to get the initial >> > guess at the parameters from the full data set for use in the bootstrap >> > iterations. >> >> The only problems I would worry a bit about in this case are the >> possible presence of multiple local minima, if there are any, and >> convergence failures. >> >> Josef >> > > Thanks for the advice. Currently working with linear fits, but I'm hoping to > expand this work to be more general. I'll be sure to keep these things in > mind. If you work with linear fits, why do you use curve_fit/leastsq instead of linalg? Josef > -p > > > _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user > _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Sun, Feb 3, 2013 at 9:16 PM, <[hidden email]> wrote:
First reason: I didn't know any better :) Second reason: I'm pretty certain that non-linear data will be coming my way soon, so I'm trying to plan ahead.
-paul _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
On Mon, Feb 4, 2013 at 7:26 PM, Paul Hobson <[hidden email]> wrote:
> On Sun, Feb 3, 2013 at 9:16 PM, <[hidden email]> wrote: >> >> On Mon, Feb 4, 2013 at 12:13 AM, Paul Hobson <[hidden email]> wrote: >> > >> > Thanks for the advice. Currently working with linear fits, but I'm >> > hoping to >> > expand this work to be more general. I'll be sure to keep these things >> > in >> > mind. >> >> If you work with linear fits, why do you use curve_fit/leastsq instead >> of linalg? >> >> Josef > > > First reason: I didn't know any better :) > Second reason: I'm pretty certain that non-linear data will be coming my way > soon, so I'm trying to plan ahead. The speed difference between linear and nonlinear models is huge. So, unless you have only very rare cases for linear fit, I would split the code. You could still share almost all the code, just use two different fit methods/functions in the bootstrap loop. Josef > -paul > > _______________________________________________ > SciPy-User mailing list > [hidden email] > http://mail.scipy.org/mailman/listinfo/scipy-user > _______________________________________________ SciPy-User mailing list [hidden email] http://mail.scipy.org/mailman/listinfo/scipy-user |
Free forum by Nabble | Edit this page |