For a sample sequence $X_1, X_2, \dots, X_n$, the empirical cumulative distribution function (ECDF) can be defined as
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x),$$
where $I(\cdot)$ is an indicator function that takes the value 1 if $X_i \le x$ and 0 otherwise, so $F_n(x)$ is the proportion of sample elements that are less than or equal to $x$.
According to the Glivenko–Cantelli theorem, if a sample is independent and identically distributed (IID), its empirical cumulative distribution function $F_n(x)$ converges to the true cumulative distribution function $F(x)$ as the sample size grows.
First, define a class named ECDF:
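The class body is not reproduced in this text, so the following is only a minimal sketch of what such a class might look like. The attribute names and the use of `np.searchsorted` are assumptions, not the original implementation.

```python
import numpy as np


class ECDF:
    """Empirical CDF of a one-dimensional sample (illustrative sketch)."""

    def __init__(self, sample):
        # Sort once so that every later evaluation is a binary search.
        self.sorted_sample = np.sort(np.asarray(sample, dtype=float))
        self.n = self.sorted_sample.size

    def __call__(self, x):
        # side="right" counts the observations that are <= x.
        count = np.searchsorted(self.sorted_sample, x, side="right")
        return count / self.n


ecdf = ECDF([3.0, 1.0, 4.0, 2.0])
print(ecdf(2.5))  # two of the four observations are <= 2.5, so this is 0.5
```

Sorting in the constructor keeps each evaluation at logarithmic cost, which matters when the same sample is queried many times.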
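We use the uniform distribution for verification: import uniform, then draw two rounds of samples, 10 draws in the first round and 1000 in the second, and compare the outputs.
The output is:
For the true uniform distribution on [0, 1], the cumulative distribution function is $F(x) = x$ for $0 \le x \le 1$. The simulation shows that the larger the sample, the closer the empirical cumulative distribution function value is to the true value, which is what the Glivenko–Cantelli theorem predicts.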
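The theorem is a statement about the largest gap between $F_n$ and $F$ over all $x$, so one way to illustrate it is to compute that gap directly. The sketch below does this for uniform samples. The function name, the seed, and the sample sizes are assumptions chosen for illustration.

```python
import numpy as np


def sup_distance_to_uniform(sample):
    """Largest gap between the ECDF of `sample` and F(x) = x on [0, 1]."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    # The ECDF jumps from (i - 1)/n to i/n at the i-th order statistic,
    # so the supremum is attained at, or just before, one of the jumps.
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))


rng = np.random.default_rng(0)  # arbitrary seed (assumption)
for n in (10, 1000, 100000):
    print(n, sup_distance_to_uniform(rng.uniform(size=n)))
```

The printed gap typically shrinks as the sample size grows, but a single run is an illustration rather than a proof.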
The following program plots the cumulative distribution function of the random variable X together with the running sum of the array p:
pl.plot(t, X.cdf(t))
pl.plot(t2, np.add.accumulate(p)*(t2[1]-t2[0]))
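The two plotting calls above rely on names (X, t, t2, p) whose definitions are not shown. As a hedged, self-contained sketch, the code below assumes X is a standard normal distribution and p holds its density values on the grid t2. It compares the two curves numerically instead of plotting them.

```python
import numpy as np
from scipy import stats

X = stats.norm()                    # assumed distribution, not given in the text
t2 = np.linspace(-5.0, 5.0, 1001)   # assumed evaluation grid
p = X.pdf(t2)                       # assumed meaning of p: density values on t2

# Running Riemann sum of the density, the same expression as in the fragment.
approx_cdf = np.add.accumulate(p) * (t2[1] - t2[0])

# Largest difference between the running sum and the exact cdf on the grid.
max_error = np.max(np.abs(approx_cdf - X.cdf(t2)))
print(max_error)
```

The difference is of the order of the grid spacing, so the two curves should look almost identical on a plot.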
Shape Parameters
While a general continuous random variable can be shifted and scaled
with the loc and scale parameters, some distributions require additional
shape parameters. For instance, the gamma distribution, with density
$$\gamma(x, a) = \frac{\lambda (\lambda x)^{a-1}}{\Gamma(a)} e^{-\lambda x},$$
requires the shape parameter a. Observe that setting λ can be obtained by setting the scale keyword to 1/λ.
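The relation between λ and the scale keyword can be checked numerically. The sketch below evaluates the gamma density formula by hand and compares it with `gamma.pdf` called with `scale=1/λ`. The values of a and λ are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import gamma as gamma_function
from scipy.stats import gamma

a, lam = 3.0, 0.5                  # illustrative shape and rate values
x = np.linspace(0.1, 10.0, 50)

# Density written out from the formula for the gamma distribution.
by_formula = lam * (lam * x) ** (a - 1) / gamma_function(a) * np.exp(-lam * x)
# Same density from scipy, with the rate expressed through scale = 1/lam.
by_scipy = gamma.pdf(x, a, scale=1.0 / lam)

print(np.allclose(by_formula, by_scipy))  # True
```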
Let’s check the number and name of the shape parameters of the gamma
distribution. (We know from the above that this should be 1.)
>>> from scipy.stats import gamma
>>> gamma.numargs
1
>>> gamma.shapes
'a'
Now we set the value of the shape variable to 1 to obtain the
exponential distribution, so that we can easily check whether we get
the results we expect.
>>> gamma(1, scale=2.).stats(moments="mv")
(array(2.0), array(4.0))
Notice that we can also specify shape parameters as keywords:
>>> gamma(a=1, scale=2.).stats(moments="mv")
(array(2.0), array(4.0))
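The claim that a gamma distribution with shape 1 is an exponential distribution can also be checked against `scipy.stats.expon` directly. The following is a small sketch. The evaluation grid is an arbitrary choice.

```python
import numpy as np
from scipy.stats import expon, gamma

x = np.linspace(0.1, 10.0, 21)   # arbitrary evaluation points

# Gamma with shape a=1 and scale 2 versus the exponential with scale 2.
same_pdf = np.allclose(gamma.pdf(x, a=1, scale=2.0), expon.pdf(x, scale=2.0))
same_cdf = np.allclose(gamma.cdf(x, a=1, scale=2.0), expon.cdf(x, scale=2.0))
print(same_pdf, same_cdf)  # True True
```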
Freezing a Distribution
Passing the loc and scale keywords time and again can become quite
bothersome. The concept of freezing a RV is used to solve such problems.
>>> rv = gamma(1, scale=2.)
By using rv we no longer have to include the scale or the shape
parameters. Thus, distributions can be used in one of two ways,
either by passing all distribution parameters to each method call (as
we did earlier) or by freezing the parameters for the instance of the
distribution. Let us check this:
>>> rv.mean(), rv.std()
(2.0, 2.0)
This is indeed what we should get.
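The two usage styles can be put side by side. The sketch below compares a frozen gamma distribution with the equivalent calls that pass the parameters every time. The evaluation point 3.0 is an arbitrary choice.

```python
import numpy as np
from scipy.stats import gamma

rv = gamma(1, scale=2.0)  # frozen: shape and scale are stored once

frozen = (rv.mean(), rv.std(), rv.cdf(3.0))
# Unfrozen: the same parameters are repeated in every method call.
unfrozen = (gamma.mean(1, scale=2.0),
            gamma.std(1, scale=2.0),
            gamma.cdf(3.0, 1, scale=2.0))

print(np.allclose(frozen, unfrozen))  # True
```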
Broadcasting
The basic methods pdf and so on satisfy the usual numpy broadcasting
rules. For example, we can calculate the critical values for the upper
tail of the t distribution for different probabilities and degrees of
freedom.
>>> from scipy import stats
>>> stats.t.isf([0.1, 0.05, 0.01], [[10], [11]])
array([[ 1.37218364, 1.81246112, 2.76376946],
       [ 1.36343032, 1.79588482, 2.71807918]])
Here, the first row contains the critical values for 10 degrees of
freedom and the second row those for 11 degrees of freedom (d.o.f.).
Thus, the broadcasting rules give the same result as calling isf twice:
>>> stats.t.isf([0.1, 0.05, 0.01], 10)
array([ 1.37218364, 1.81246112, 2.76376946])
>>> stats.t.isf([0.1, 0.05, 0.01], 11)
array([ 1.36343032, 1.79588482, 2.71807918])
If the array of probabilities, i.e., [0.1, 0.05, 0.01], and the array of
degrees of freedom, i.e., [10, 11, 12], have the same array shape, then
element-wise matching is used. As an example, we can obtain the 10% tail
for 10 d.o.f., the 5% tail for 11 d.o.f. and the 1% tail for 12 d.o.f.
by calling
>>> stats.t.isf([0.1, 0.05, 0.01], [10, 11, 12])
array([ 1.37218364, 1.79588482, 2.68099799])
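The shapes involved in these calls can be made explicit. The sketch below names the two inputs (the variable names are illustrative), shows the broadcast shape, and confirms that the two-dimensional call matches one call per degrees-of-freedom value.

```python
import numpy as np
from scipy import stats

probs = np.array([0.1, 0.05, 0.01])   # shape (3,)
dof = np.array([[10], [11]])          # shape (2, 1)

table = stats.t.isf(probs, dof)       # inputs broadcast to shape (2, 3)
print(table.shape)
print(np.broadcast(probs, dof).shape)

# Same values as one isf call per degrees-of-freedom value, stacked by row.
stacked = np.vstack([stats.t.isf(probs, 10), stats.t.isf(probs, 11)])
print(np.allclose(table, stacked))
```

With inputs of equal shape, such as `[0.1, 0.05, 0.01]` and `[10, 11, 12]`, no expansion takes place and the result keeps that common shape.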
Specific Points for Discrete Distributions
Discrete distributions have mostly the same basic methods as the
continuous distributions. However, pdf is replaced by the probability
mass function pmf, no estimation methods, such as fit, are available,
and scale is not a valid keyword parameter. The location parameter,
keyword loc, can still be used to shift the distribution.
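As a small sketch of how loc shifts a discrete distribution, the code below uses a Poisson distribution. The distribution, its mean, and the shift of 3 are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import poisson

mu = 2.5                 # illustrative Poisson mean
k = np.arange(0, 10)

# With loc=3 the whole distribution moves three steps to the right,
# so the shifted pmf at k + 3 equals the original pmf at k.
shifted = poisson.pmf(k + 3, mu, loc=3)
original = poisson.pmf(k, mu)

print(np.allclose(shifted, original))  # True
```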
The computation of the cdf requires some extra attention. In the case of
a continuous distribution, the cumulative distribution function is in
most standard cases strictly monotonically increasing on the bounds
(a, b) and therefore has a unique inverse. The cdf of a discrete
distribution, however, is a step function, hence the inverse cdf, i.e.,
the percent point function, requires a different definition:
ppf(q) = min{x : cdf(x) >= q, x integer}
For further info, see the docs here.
We can look at the hypergeometric distribution as an example:
>>> from scipy.stats import hypergeom
>>> [M, n, N] = [20, 7, 12]
If we use the cdf at some integer points and then evaluate the ppf at
those cdf values, we get the initial integers back, for example
>>> import numpy as np
>>> x = np.arange(4)*2
>>> x
array([0, 2, 4, 6])
>>> prb = hypergeom.cdf(x, M, n, N)
>>> prb
array([ 0.0001031991744066, 0.0521155830753351, 0.6083591331269301,
        0.9897832817337386])
>>> hypergeom.ppf(prb, M, n, N)
array([ 0., 2., 4., 6.])
If we use values that are not at the kinks of the cdf step function, we get the next higher integer back:
>>> hypergeom.ppf(prb + 1e-8, M, n, N)
array([ 1., 3., 5., 7.])
>>> hypergeom.ppf(prb - 1e-8, M, n, N)
array([ 0., 2., 4., 6.])
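The definition of the percent point function for discrete distributions, the smallest integer x whose cdf is at least q, can be checked by brute force. The sketch below tabulates the cdf over the support of the hypergeometric example and searches it directly. The probability levels in q are arbitrary and chosen away from the steps of the cdf.

```python
import numpy as np
from scipy.stats import hypergeom

M, n, N = 20, 7, 12
support = np.arange(0, 8)            # possible counts for these parameters
cdf_values = hypergeom.cdf(support, M, n, N)

q = np.array([0.01, 0.3, 0.5, 0.9, 0.999])   # illustrative probability levels
# Index of the first support point whose cdf is >= q,
# i.e. min{x : cdf(x) >= q} evaluated by direct search.
by_definition = support[np.searchsorted(cdf_values, q, side="left")]

print(by_definition)
print(hypergeom.ppf(q, M, n, N))
```

Both lines should list the same integers, the second as floating-point values.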