Hi all-
I am trying to create a custom PDF with arguments for some statistical analysis I am doing. I want to have a power law with an exponential cutoff at low values: mathematically, this looks like P(m) = N * m**(-index) * (1 - exp((-m**2)/m0**2) with three parameters: a normalization (N), an index (index), and a scale value (m0). I have tried to produce some code that creates this pdf and while the pdf and cdf functions appear to be working as I hoped, when I try to generate random values they don't follow this distribution at all, and I am not sure why. It seems to be far more heavily drawing from the low end of the curve (which are explicitly suppressed in the pdf) Can you suggest the changes I need to make to try and figure out why my code isn't working as planned? My code snippet is below: from scipy import stats import numpy as np from matplotlib import pyplot as plt class my_pdf(stats.rv_continuous): def _pdf(self,x,norm,index,m0): return norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) ) my_cv= my_pdf(a=1e-6,b=1e-2,name="Test") rvs=my_cv.rvs(1.0,1.7,1e-3,size=1000) #Values should be peaked around 1e-3, but histogram peaks around 1e-5... I am really having a hard time figuring out how to create a robust and complex custom rv_continuous in scipy based on scipy, so any help you can provide would be most appreciated. Cheers, Steve _______________________________________________ SciPy-User mailing list [hidden email] https://mail.scipy.org/mailman/listinfo/scipy-user |
First and foremost, you need to fix the normalization: the norm constant is not a free parameter, as the pdf must integrate to unity over the support of your RV. Next, the scale parameter is handled by the framework: pdf, cdf methods all accept the `scale` argument. Thus you do not need the m0 parameter. HTH, Evgeni _______________________________________________ SciPy-User mailing list [hidden email] https://mail.scipy.org/mailman/listinfo/scipy-user |
Okay- Thank you for the insights. So three questions about this, as I am not an expert in the implementation. I figured that pdf normalization may play a role First of all, what is the best way to numerically handle the normalization? Do I need to include the integral every time I initialize the pdf? Something like this: class my_pdf(stats.rv_continuous): def _pdf(self,x,index,m0): step=(self.b-self.a)/100. this_xrange=np.arange(self.a,self.b+step,step) this_function=this_xrange**(-1*index) * (1. -np.exp((-1*this_xrange**2)/m0**2) ) #for x,f in zip(this_xrange,this_function): print x,f this_norm=1./(integrate.simps(this_function,x=this_xrange)) #print this_norm return this_norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) ) I don't think there is an analytic form for this integral, so I have to do it numerically, but I also have to implement it with the pdf. And the second question I have is: how do I call the scale variable within the _pdf function? Would I simply put in self.scale in place of m0 in the code above? Or is it more sophisticated than that? Finally, while I realized these concerns when I wrote it, but I didn't expect the pdf to look okay but the rvs to be so off- are there reasons to expect this behavior based on my previously incorrect normalization? It is not obvious to me that it would. Thank you for your help thus far and I look forward to learning more about custom statistical distributions in SciPy. Steven Ehlert On Wed, Jun 8, 2016 at 5:59 AM, Evgeni Burovski <[hidden email]> wrote:
_______________________________________________ SciPy-User mailing list [hidden email] https://mail.scipy.org/mailman/listinfo/scipy-user |
On Wed, Jun 8, 2016 at 1:31 PM, Steven Ehlert <[hidden email]> wrote:
> Okay- > > Thank you for the insights. > > So three questions about this, as I am not an expert in the implementation. > I figured that pdf normalization may play a role > > > First of all, what is the best way to numerically handle the normalization? > Do I need to include the integral every time I initialize the pdf? Something > like this: > class my_pdf(stats.rv_continuous): > def _pdf(self,x,index,m0): > step=(self.b-self.a)/100. > > this_xrange=np.arange(self.a,self.b+step,step) > > this_function=this_xrange**(-1*index) * (1. > -np.exp((-1*this_xrange**2)/m0**2) ) > > > #for x,f in zip(this_xrange,this_function): print x,f > this_norm=1./(integrate.simps(this_function,x=this_xrange)) > #print this_norm > > return this_norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) ) > Yes, this would need to compute it at each call to pdf. The integral can be expressed in terms of gamma functions (do a substitution of y = x**2 /2 in the integral). Alternatively, you can do the integral numerically, likely `quad` is better than `simps`. > I don't think there is an analytic form for this integral, so I have to do > it numerically, but I also have to implement it with the pdf. > > And the second question I have is: how do I call the scale variable within > the _pdf function? Would I simply put in self.scale in place of m0 in the > code above? Or is it more sophisticated than that? It's easy: you don't need to :-). distribution.pdf(x, loc=11, scale=42) calls distribution._pdf(y) with y = (x - loc) / scale. This way, you `_pdf` only needs to deal with a scaled variable. There is no self.scale --- it's just a parameter to any call to pdf, cdf, sf etc. > Finally, while I realized these concerns when I wrote it, but I didn't > expect the pdf to look okay but the rvs to be so off- are there reasons to > expect this behavior based on my previously incorrect normalization? It is > not obvious to me that it would. PDF needs to integrate to one. If the normalization is off, it still might look "okay" because the shape is the same, but it's still wrong. Drawing random variates assumes that the cdf is a real distribution function. From experience, having ok-looking PDF and nonsense for the random variates is a typical sign that the normalization might be off. > Thank you for your help thus far and I look forward to learning more about > custom statistical distributions in SciPy. > > > Steven Ehlert > > > On Wed, Jun 8, 2016 at 5:59 AM, Evgeni Burovski <[hidden email]> > wrote: >> >> >> 08.06.2016 13:50 пользователь "Steven Ehlert" <[hidden email]> >> написал: >> >> >> > >> > Hi all- >> > >> > I am trying to create a custom PDF with arguments for some statistical >> > analysis I am doing. I want to have a power law with an exponential cutoff >> > at low values: mathematically, this looks like >> > >> > P(m) = N * m**(-index) * (1 - exp((-m**2)/m0**2) with three parameters: >> > a normalization (N), an index (index), and a scale value (m0). >> > >> > >> > >> > I have tried to produce some code that creates this pdf and while the >> > pdf and cdf functions appear to be working as I hoped, when I try to >> > generate random values they don't follow this distribution at all, and I am >> > not sure why. It seems to be far more heavily drawing from the low end of >> > the curve (which are explicitly suppressed in the pdf) >> > >> > Can you suggest the changes I need to make to try and figure out why my >> > code isn't working as planned? My code snippet is below: >> > >> > >> > from scipy import stats >> > import numpy as np >> > from matplotlib import pyplot as plt >> > >> > >> > class my_pdf(stats.rv_continuous): >> > def _pdf(self,x,norm,index,m0): >> > return norm*x**(-1*index) * (1. -np.exp((-1*x**2)/m0**2) ) >> > >> > >> > my_cv= my_pdf(a=1e-6,b=1e-2,name="Test") >> > >> > >> > rvs=my_cv.rvs(1.0,1.7,1e-3,size=1000) #Values should be peaked around >> > 1e-3, but histogram peaks around 1e-5... >> > >> > >> > >> > I am really having a hard time figuring out how to create a robust and >> > complex custom rv_continuous in scipy based on scipy, so any help you can >> > provide would be most appreciated. >> > >> > >> > Cheers, >> > >> > Steve >> > >> > _______________________________________________ >> > SciPy-User mailing list >> > [hidden email] >> > https://mail.scipy.org/mailman/listinfo/scipy-user >> > >> >> First and foremost, you need to fix the normalization: the norm constant >> is not a free parameter, as the pdf must integrate to unity over the support >> of your RV. >> >> Next, the scale parameter is handled by the framework: pdf, cdf methods >> all accept the `scale` argument. Thus you do not need the m0 parameter. >> >> HTH, >> >> Evgeni >> >> >> _______________________________________________ >> SciPy-User mailing list >> [hidden email] >> https://mail.scipy.org/mailman/listinfo/scipy-user >> > > > _______________________________________________ > SciPy-User mailing list > [hidden email] > https://mail.scipy.org/mailman/listinfo/scipy-user > SciPy-User mailing list [hidden email] https://mail.scipy.org/mailman/listinfo/scipy-user |
Explicitly defining the cdf would probably help too. If it is analytically calculable, defining the inverse cdf would almost definitely help since that is how the random numbers are being generated.
https://en.wikipedia.org/wiki/Inverse_transform_sampling What you have now only defines the pdf. The base class automagically integrates to get cdf, and might do some kind of numerical parameter optimization to get to the inverse cdf (not sure about that). If nothing else, defining the cdf/inverse cdf would make things faster. On Wed, Jun 8, 2016 at 8:46 AM Evgeni Burovski <[hidden email]> wrote: On Wed, Jun 8, 2016 at 1:31 PM, Steven Ehlert <[hidden email]> wrote: -- Kevin Gullikson Data Scientist Spark Cognition _______________________________________________ SciPy-User mailing list [hidden email] https://mail.scipy.org/mailman/listinfo/scipy-user |
Hi all- After doing some math, I was able to figure out how to normalize this pdf in a meaningful way and get the right math packages to interpret it for me. A new question: how do I properly call this within the context of a KS-test? What I want to start doing is comparing histograms of data to this pdf and see if they make sense. I am guessing it would be something like stats.kstest(stuffsample, stuff) where stuff=observed_stuff(name="Stuff",a=1e-6,b=100) but I am not sure how to make the call appropriate. The one I have above does not work, and I know I need to send off a few arguments. The new class with the correct math in place is below. Thanks for your help! class observed_stuff(stats.rv_continuous): def _pdf(self, x, massindex,complimit): dndm=(x)**(-1.*massindex) complete= 1. - np.exp(-1.*x**2/complimit**2) self.massindex=massindex self.complimit=complimit norm=self.integral(self.b)-self.integral(self.a) #print "Norm ", norm return 1./norm * dndm * complete def _cdf(self, xval,massindex,complimit): self.massindex=massindex self.complimit=complimit norm=self.integral(self.b)-self.integral(self.a) return 1./norm * (self.integral(xval)-self.integral(self.a)) def integral(self,xval): usq=xval**2/self.complimit**2 mi=self.massindex arg1= (1.-mi)/2. #We have to use mpmath for the incomplete gamma function that arises in this integral...otherwise the values are all screwy. This try-except snippet ensures the output is in the right format for scipy stats try: gammaval = [float(mpmath.gammainc(a,u)) for a,u in zip(arg1,usq)] gammaval=np.array(gammaval) except: gammaval=float(mpmath.gammainc(arg1,usq)) numerator= xval**(1-mi) * ( -2. * np.sqrt(usq) + (mi-1.)* (usq)**(mi/2.) * gammaval) denominator=2.*(mi-1.)*np.sqrt(usq) return numerator/denominator On Wed, Jun 8, 2016 at 8:54 AM, Kevin Gullikson <[hidden email]> wrote:
_______________________________________________ SciPy-User mailing list [hidden email] https://mail.scipy.org/mailman/listinfo/scipy-user |
Free forum by Nabble | Edit this page |