Hi list,

I'm working on a package that does some complicated Monte Carlo experiments. The package passes around frozen distributions quite a lot. Trying to understand why certain parts were so slow, I did a bit of profiling and stumbled upon this:

> %timeit x = scipy.stats.norm.rvs(size=1000)
> 10000 loops, best of 3: 49.3 µs per loop

> %timeit dist = scipy.stats.norm(); x = dist.rvs(size=1000)
> 1000 loops, best of 3: 512 µs per loop

So a 10x penalty when using a frozen dist, even when the size of the simulated vector is 1000. This is with scipy 0.16.0 on Ubuntu 16.04. I cannot replicate this problem on another machine with scipy 0.13.3 and Ubuntu 14.04 (there is a penalty, but it's much smaller).

In the profiler, I can see that a lot of time is spent doing string operations (such as expand_tabs) in order to generate the doc. In the source, I see that this may depend on a certain -OO flag???

I do realise that instantiating a frozen distribution requires some argument checking and whatnot, but here it looks too expensive. For my package, this amounts to hours spent on ... tab expansions?

Anyway, I'd like to ask:
(a) Is this a known problem? I could not find anything online about this.
(b) Is this going to be fixed in some future version of scipy?
(c) Is there a way to fix this with *this* version of scipy using the flag mentioned in the source, and if so, how?

Many thanks for reading this! :-)
All the best

_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.scipy.org/mailman/listinfo/scipy-user
On Fri, Oct 28, 2016 at 12:53 PM, Nicolas Chopin <[hidden email]> wrote:
Can you time just the rvs call here, and not the instantiation of the frozen distribution? Frozen distributions now have more overhead in the construction because a new instance of the distribution is created, instead of reusing the global instance as in older scipy versions. That might still have an effect in the µs range. (The reason was to avoid the possibility of spillover of attributes across instances.)
I think we never had any discussion on timing details. Overall, the overhead of scipy.stats.distributions is not small relative to the underlying calculation when that calculation is fast; e.g. using numpy.random directly for rvs is quite a bit faster, when the function is available in numpy.

Josef
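To make that comparison concrete, here is a rough way to time the three call styles side by side. This is a sketch only; the actual numbers will vary with machine and scipy version:

```python
import timeit

import numpy as np
from scipy import stats

# Compare 1000 draws via: numpy directly, the global scipy instance,
# and a freshly frozen distribution on every call (the pattern profiled above).
n_rep = 200
t_numpy = timeit.timeit(lambda: np.random.normal(size=1000), number=n_rep)
t_global = timeit.timeit(lambda: stats.norm.rvs(size=1000), number=n_rep)
t_frozen = timeit.timeit(lambda: stats.norm().rvs(size=1000), number=n_rep)

for label, t in [("np.random.normal", t_numpy),
                 ("stats.norm.rvs", t_global),
                 ("stats.norm().rvs", t_frozen)]:
    print("%-18s %8.1f us/call" % (label, t / n_rep * 1e6))
```

On most setups the numpy call is fastest and the freeze-per-call version slowest, which is the gap discussed in this thread.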
If I time just the rvs call, then I get essentially the same time as with

> x = scipy.stats.norm.rvs(size=1000)

Thanks a lot for your prompt answer,
Nicolas

On Fri, 28 Oct 2016 at 19:12 <[hidden email]> wrote:
On Fri, Oct 28, 2016 at 7:53 PM, Nicolas Chopin <[hidden email]> wrote:
> Hi list,
> I'm working on a package that does some complicated Monte Carlo experiments.
> The package passes around frozen distributions quite a lot.
<snip>
> Anyway, I'd like to ask
> (a) is this a known problem? I could not find anything on-line about this.
> (b) Is this going to be fixed in some future version of scipy?
> (c) is there a way to fix this with *this* version of scipy using this flag
> mentioned in the source, and then how?
> (c) or should I instead re-define manually my own distribution objects?
> (it's really convenient for what I'm trying to do to define distributions as
> objects with methods rvs, logpdf, and so on).

Why are you including the construction time in your timings? Surely, if you use frozen distributions for some MC work, you're not recreating frozen instances in hot loops?

In [4]: %timeit norm.rvs(size=100, random_state=123)
The slowest run took 142.68 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 74.2 µs per loop

In [5]: %timeit dist = norm(); dist.rvs(size=100, random_state=123)
The slowest run took 4.40 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 796 µs per loop

In [6]: %timeit dist = norm()
The slowest run took 4.89 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 672 µs per loop

> (b) Is this going to be fixed in some future version of scipy?
> (c) is there a way to fix this with *this* version of scipy using this flag
> mentioned in the source, and then how?

You could of course try reverting https://github.com/scipy/scipy/pull/3245 for your local copy of scipy. It went into scipy 0.14, so this is the likely suspect.
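In other words, the usual pattern is to freeze once, outside any hot loop, and only call methods on the frozen instance inside it. A minimal sketch:

```python
from scipy import stats

# Pay the (expensive) construction cost a single time...
dist = stats.norm(loc=2.0, scale=0.5)

# ...then only cheap method calls remain inside the loop.
draws = [dist.rvs(size=1000, random_state=i) for i in range(10)]
logp = dist.logpdf(draws[0])

print(len(draws), draws[0].shape, logp.shape)  # 10 (1000,) (1000,)
```

With this pattern the per-call cost of rvs is essentially the same as for the global stats.norm instance.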
On Fri, Oct 28, 2016 at 1:21 PM, Nicolas Chopin <[hidden email]> wrote:
Creating a new instance is a feature. It's still possible that there is some speedup to be had in the implementation, but AFAIR I didn't see anything that would have been obvious (a few µs up or down?).

However, given your description that you pass the frozen instances around, you shouldn't need so much instance creation; otherwise you could also use the unfrozen global instances of the distributions. In general, I avoid scipy.stats.distributions in loops in restricted cases when I don't need the flexibility and input checking, but I don't think it's worth the effort when we would have to replicate most of what's already there.

Josef
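The "unfrozen global instance" alternative passes the location/scale parameters on every call, so no frozen objects are created at all. A sketch of that style:

```python
from scipy import stats

sigmas = [0.5, 1.0, 2.0]

# Use the single global `stats.norm` instance; parameters are passed per call,
# and no frozen distributions are constructed inside the loop.
draws = [stats.norm.rvs(loc=0.0, scale=s, size=100, random_state=42)
         for s in sigmas]
logp = stats.norm.logpdf(draws[0], loc=0.0, scale=sigmas[0])

print([d.shape for d in draws], logp.shape)
```

The trade-off is that every call repeats the argument checking, but no per-instance construction (including docstring generation) ever happens.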
Yes, as I have just said, I agree that it is the creation of the frozen dist that explains the difference. In what I do, I really do need to create a *lot* of frozen distributions; typically, one run may involve O(10^8) of them, and there is no way around that.

On Fri, 28 Oct 2016 at 19:29 Evgeni Burovski <[hidden email]> wrote:
> On Fri, Oct 28, 2016 at 7:53 PM, Nicolas Chopin <[hidden email]> wrote:
<snip>
On Sat, Oct 29, 2016 at 6:37 AM, Nicolas Chopin <[hidden email]> wrote:
Whatever you can do with frozen distributions you can also do with the regular non-frozen ones, so I doubt that that's true.
You haven't explained what's wrong with simply using the rvs() and logpdf() methods from the distribution instances provided in the stats namespace. Ralf
On Fri, Oct 28, 2016 at 10:53 AM, Nicolas Chopin <[hidden email]> wrote:
Did you try running with the -OO flag? Anyone know how well that works?

Chuck
Hi,

Charles: no, I didn't; I'm not clear on how to use this flag.

Ralf: since you're asking, I may as well give you more details about my stuff. Basically, I'd like to do some basic probabilistic programming, i.e. to give the user the ability to define stochastic models as Python objects; e.g.

from scipy import stats

class MarkovChain(object):
    """Abstract base class for Markov chains."""

    def simulate(self, T, x0=0.):
        path = [x0]
        for t in range(1, T):
            # M returns a frozen distribution for X_t given X_{t-1}
            path.append(self.M(t, path[t - 1]).rvs())
        return path

class RandomWalk(MarkovChain):
    def __init__(self, sigma=1.):
        self.sigma = sigma

    def M(self, t, xp):
        return stats.norm(loc=xp, scale=self.sigma)

Here, I define a base class for Markov chains, with a method simulate that can simulate a trajectory. Then I define a particular (parametric) sub-class, that of Gaussian random walks.

One part of my package defines an algorithm that takes such a *class* as an argument, generates many possible parameters (above, sigma), and for each parameter generates trajectories; sometimes the logpdf or the ppf functions must be computed as well. Of course, I could ask the user to provide as input a function for generating rvs, but then I would also need to ask for a function for computing the log-pdf, and so on.

In fact, I have a few ideas (and prototype code) on how to extend frozen distributions so as to do more advanced probabilistic programming, such as:

* product distributions: prod_dist(stats.beta(3, 2), norm(loc=3)) returns an object that corresponds to the distribution of (X, Y), where X ~ Beta(3, 2), Y ~ N(3, 1); for instance, if you apply the method rvs, you obtain an [N, 2] numpy array
* dict distributions: same idea, but returns a record array (or takes a record array for logpdf, etc.)

But I'm not sure there's much interest in extending scipy distributions in this way?

Best

On Sat, 29 Oct 2016 at 15:06 Charles R Harris <[hidden email]> wrote:
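For concreteness, a minimal sketch of the prod_dist idea described above; prod_dist is a hypothetical helper from the prototype, not anything in scipy:

```python
import numpy as np
from scipy import stats

class prod_dist(object):
    """Product of independent frozen distributions (hypothetical sketch)."""

    def __init__(self, *dists):
        self.dists = dists

    def rvs(self, size=1):
        # one column per marginal -> an [N, d] array
        return np.column_stack([d.rvs(size=size) for d in self.dists])

    def logpdf(self, x):
        x = np.atleast_2d(x)
        # independence: joint log-density is the sum of marginal log-densities
        return sum(d.logpdf(x[:, i]) for i, d in enumerate(self.dists))

joint = prod_dist(stats.beta(3, 2), stats.norm(loc=3))
sample = joint.rvs(size=5)
print(sample.shape)                # (5, 2)
print(joint.logpdf(sample).shape)  # (5,)
```

Note that this wrapper delegates everything to the frozen marginals, so it inherits their construction cost as well.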
On Sat, Oct 29, 2016 at 8:51 AM, Nicolas Chopin <[hidden email]> wrote:
It is passed to cpython and produces *.pyo files without docstrings. It probably doesn't do what you want if the docstrings are dynamically generated (I don't know), but it can be checked whether the flag was passed to python, so it should be possible to make docstring generation depend on it, and it probably should.

<snip>

Chuck
On Sun, Oct 30, 2016 at 4:49 AM, Charles R Harris <[hidden email]> wrote:
That is handled by doing docstring manipulation inside ``if __doc__ is None:`` Ralf