The cumulative distribution function

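The distribution function completely characterizes a random variable. For a random variable $X$, it is defined as

$$F_X(a) = P[X \leq a].$$

For example, for the uniform distribution on $[0,1]$ and $x \in [0,1]$,

$$F_{U(0,1)}(x) = P[U(0,1) \leq x] = \int_0^x 1\,dt = x.$$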

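Assume that $X$ is a continuous random variable whose distribution function $F_X$ is strictly increasing, so that the inverse $F_X^{-1}$ exists, and let $Z = F_X(X)$. If $x \in (0,1)$, the cumulative distribution function of $Z$ is $$ F_Z(x) = P[Z \leq x] = P[F_X(X) \leq x] = P[X \leq F_X^{-1}(x)] = F_X(F_X^{-1}(x)) = x. $$ In other words, $Z = F_X(X)$ follows a $U(0,1)$ distribution: applying the distribution function of a continuous random variable to the variable itself yields a uniform random variable on $[0,1]$.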
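
The result can be checked numerically. The sketch below is only an illustration, under assumptions that are not part of the text above: it takes $X$ to be exponential with rate 1, so that $F_X(x) = 1 - e^{-x}$, it relies on the `Random` and `Statistics` standard libraries of a recent Julia (1.x), and the seed is arbitrary. If $Z = F_X(X)$ is uniform, the fraction of simulated values of $Z$ below a level $t$ should be close to $t$.

```julia
using Random, Statistics

# Assumed example distribution: Exponential(1), with CDF F(x) = 1 - exp(-x).
F(x) = 1 - exp(-x)

rng = MersenneTwister(12345)      # arbitrary seed, for reproducibility
x = randexp(rng, 100_000)         # draws of X ~ Exponential(1)
z = F.(x)                         # Z = F_X(X)

# For a U(0,1) variable, P[Z <= t] = t.
levels = [0.1, 0.25, 0.5, 0.75, 0.9]
fractions = [mean(z .<= t) for t in levels]

for (t, f) in zip(levels, fractions)
    println("t = ", t, "   fraction of Z <= t: ", f)
end
```

Any other continuous distribution with a known distribution function could be substituted for the exponential one.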

Empirical distribution function

Assume that we observe $\{x_1,\dots,x_n\}$ and sort them in ascending order. The sorted observations are denoted by $\{x_{(1)},\dots,x_{(n)}\}$, with $x_{(1)} \le \dots \le x_{(n)}$. The empirical distribution function is defined as $$ \hat F_n(x) = \frac{1}{n} \sum_{i=1}^n I[x_{(i)} \le x], $$ where $$ I[y \le x] = \begin{cases} 1 & \mbox{if } y \le x, \\ 0 & \mbox{otherwise.} \end{cases} $$
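
The sum counts how many observations are at most $x$, so $\hat F_n$ is a step function that jumps at the observed values. Assuming, for simplicity, that the observations are all distinct, the definition can be written equivalently as $$ \hat F_n(x) = \begin{cases} 0 & \mbox{if } x < x_{(1)}, \\ k/n & \mbox{if } x_{(k)} \le x < x_{(k+1)},\ k = 1,\dots,n-1, \\ 1 & \mbox{if } x \ge x_{(n)}. \end{cases} $$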

Example

In [1]:
using RandomStreams
using Distributions

const SEED = 12345

seeds = [SEED, SEED, SEED, SEED, SEED, SEED]
gen = MRG32k3aGen(seeds)
unif = next_stream(gen)

n = 10

x = Array(Float64, n)
for i = 1:n
    x[i] = rand(Poisson(10000))/10000
end
In [2]:
x
Out[2]:
10-element Array{Float64,1}:
 1.0003
 1.0096
 1.0047
 0.9886
 0.9901
 0.9847
 1.0152
 0.9717
 0.9963
 1.0157

We can directly represent the empirical distribution function in Julia using the function ecdf from StatsBase, which returns a callable object, named ef here.

In [3]:
using StatsBase

ef = ecdf(x)
methods(ef)

We can then evaluate it like any other distribution function.

In [30]:
u = ef(0.99)
Out[30]:
0.3
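
The value 0.3 agrees with the definition: three of the ten observations (0.9717, 0.9847 and 0.9886) are at most 0.99. The following is a minimal sketch of the empirical distribution function written by hand, assuming a recent Julia (1.x). The name `empirical_cdf` is a hypothetical helper, not part of StatsBase, and the data are the values printed in Out[2].

```julia
# Hypothetical helper: returns a function that evaluates the empirical
# distribution function of the observations `obs`.
function empirical_cdf(obs::AbstractVector{<:Real})
    sorted = sort(obs)
    n = length(sorted)
    # searchsortedlast returns the index of the last sorted value <= t,
    # which is exactly the number of observations that are <= t.
    return t -> searchsortedlast(sorted, t) / n
end

# The ten observations shown in Out[2].
x = [1.0003, 1.0096, 1.0047, 0.9886, 0.9901,
     0.9847, 1.0152, 0.9717, 0.9963, 1.0157]

Fhat = empirical_cdf(x)
println(Fhat(0.99))   # three observations out of ten are <= 0.99
println(Fhat(0.9))    # below the smallest observation
println(Fhat(2.0))    # above the largest observation
```

Sorting once and using a binary search keeps each evaluation at logarithmic cost, which matters only for much larger samples than this one.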

Quantiles

Several definitions of the quantile of a sample exist. They can give different values on a finite sample, but all of them are consistent as Monte Carlo estimators of the true quantile.
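
To illustrate the consistency claim, the sketch below compares two sample quantile definitions on samples of increasing size. It is an illustration under assumptions that are not in the text: the data are exponential with rate 1 (whose quantile of order $p$ is $-\log(1-p)$), the seed is arbitrary, and only the `Random` and `Statistics` standard libraries of a recent Julia (1.x) are used. The first estimator is the order statistic of rank $\lceil np \rceil$, the second is the interpolated value returned by `quantile`.

```julia
using Random, Statistics

rng = MersenneTwister(2024)       # arbitrary seed
p = 0.6
true_q = -log(1 - p)              # exact quantile of order p for Exponential(1)

sizes = [100, 10_000, 1_000_000]
order_stat = Float64[]            # estimator s[ceil(n*p)]
interpolated = Float64[]          # estimator returned by Statistics.quantile

for n in sizes
    s = sort(randexp(rng, n))
    push!(order_stat, s[ceil(Int, n * p)])
    push!(interpolated, quantile(s, p; sorted=true))
end

for (n, a, b) in zip(sizes, order_stat, interpolated)
    println("n = ", n, "   order statistic: ", a, "   interpolated: ", b)
end
println("true quantile: ", true_q)
```

Both estimates should approach the true value, and each other, as the sample size grows.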

In [9]:
y = sort(x)
Out[9]:
10-element Array{Float64,1}:
 0.9717
 0.9847
 0.9886
 0.9901
 0.9963
 1.0003
 1.0047
 1.0096
 1.0152
 1.0157
In [10]:
l = length(y)
Out[10]:
10
In [11]:
m = Int64(floor(n*0.45))
Out[11]:
4
In [12]:
y[m]
Out[12]:
0.9901
In [13]:
y[m+1]
Out[13]:
0.9963
In [14]:
y[m+2]
Out[14]:
1.0003
In [15]:
n*0.6
Out[15]:
6.0
In [16]:
quantile(y,0.45)
Out[16]:
0.9964999999999999
In [17]:
ef(quantile(y,0.6))
Out[17]:
0.6
In [18]:
0.5*(y[m+1]+y[m+2])
Out[18]:
0.9983
In [19]:
n/2.0
Out[19]:
5.0
In [20]:
methods(quantile)
Out[20]:
59 methods for generic function quantile:
In [21]:
ef(9)
Out[21]:
1.0
In [22]:
ef(10)
Out[22]:
1.0
In [23]:
?ecdf
search: ecdf searchsortedfirst secd vecdot asecd ObjectIdDict reducedim

Out[23]:
ecdf(X)

Compute the empirical cumulative distribution function (ECDF) of a real-valued vector.

In [25]:
quantile(y,0.6)
Out[25]:
1.00206
In [26]:
y[700]
LoadError: BoundsError: attempt to access 10-element Array{Float64,1} at index [700]
In [27]:
ef(y[700])
LoadError: BoundsError: attempt to access 10-element Array{Float64,1} at index [700]
In [28]:
ceil(n*0.6)
Out[28]:
6.0
In [233]:
y[Int64(ceil(n*0.6))]
Out[233]:
1.0003
In [234]:
?quantile
search: quantile quantile! wquantile nquantile cquantile

Out[234]:
quantile(v, p; sorted=false)

Compute the quantile(s) of a vector v at a specified probability or vector p. The keyword argument sorted indicates whether v can be assumed to be sorted.

The p should be on the interval [0,1], and v should not have any NaN values.

Quantiles are computed via linear interpolation between the points ((k-1)/(n-1), v[k]), for k = 1:n where n = length(v). This corresponds to Definition 7 of Hyndman and Fan (1996), and is the same as the R default.

!!! note Julia does not ignore NaN values in the computation. For applications requiring the handling of missing data, the DataArrays.jl package is recommended. quantile will throw an ArgumentError in the presence of NaN values in the data array.

  • Hyndman, R.J. and Fan, Y. (1996) "Sample Quantiles in Statistical Packages", The American Statistician, Vol. 50, No. 4, pp. 361-365

quantile(v, w::WeightVec, p)

Compute pth quantile(s) of v with weights w.

In [272]:
x = [1, 3, 5, 6, 6, 21]
Out[272]:
6-element Array{Int64,1}:
  1
  3
  5
  6
  6
 21
In [273]:
quantile(x,0.5)
Out[273]:
5.5
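
The interpolation rule quoted in the help text can be reproduced by hand: the quantile of order $p$ sits at the fractional position $h = (n-1)p + 1$ in the sorted sample and is obtained by linear interpolation between the two neighbouring order statistics. The sketch below assumes a recent Julia (1.x), where `quantile` is provided by the `Statistics` standard library. The name `quantile_def7` is a hypothetical helper introduced for illustration.

```julia
using Statistics

# Hypothetical helper: linear interpolation between the points
# ((k-1)/(n-1), v[k]), as described in the help text of quantile.
function quantile_def7(v::AbstractVector{<:Real}, p::Real)
    s = sort(v)
    n = length(s)
    h = (n - 1) * p + 1           # fractional position between 1 and n
    lo = floor(Int, h)            # order statistic just below the position
    hi = min(lo + 1, n)           # order statistic just above (capped at n)
    return s[lo] + (h - lo) * (s[hi] - s[lo])
end

# Sorted sample shown in Out[9].
y = [0.9717, 0.9847, 0.9886, 0.9901, 0.9963,
     1.0003, 1.0047, 1.0096, 1.0152, 1.0157]

println(quantile_def7(y, 0.45), "  ", quantile(y, 0.45))
println(quantile_def7(y, 0.6), "  ", quantile(y, 0.6))
println(quantile_def7([1, 3, 5, 6, 6, 21], 0.5))
```

For `p = 0.45` the position is $h = 5.05$, so the result lies 5% of the way from `y[5] = 0.9963` to `y[6] = 1.0003`, which matches the value of about 0.9965 returned by `quantile(y,0.45)`.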