Information dimension

In information theory, information dimension is an information measure for random vectors in Euclidean space, based on the normalized entropy of finely quantized versions of the random vectors. This concept was first introduced by Alfréd Rényi in 1959.[1]

Informally, it measures the fractal dimension of a probability distribution: it characterizes the growth rate of the Shannon entropy under successively finer discretizations of the space.

In 2010, Wu and Verdú gave an operational characterization of Rényi information dimension as the fundamental limit of almost lossless data compression for analog sources under various regularity constraints on the encoder/decoder.

Definition and properties

The entropy of a discrete random variable $Z$ is

$\mathbb{H}_0(Z)=\sum_{z\in \operatorname{supp}(P_Z)} P_Z(z)\log_2\frac{1}{P_Z(z)}$

where $P_Z(z)$ is the probability that $Z=z$, and $\operatorname{supp}(P_Z)$ denotes the support of $P_Z$, i.e. the set $\{z\in\mathcal{Z} : P_Z(z)>0\}$.
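As a small illustration, the sketch below computes $\mathbb{H}_0$ for a finite probability mass function; the function name and the example pmfs are illustrative choices, not part of the definition.

```python
import numpy as np

def shannon_entropy_bits(pmf):
    """Shannon entropy H_0, in bits, of a finite pmf given as an array of probabilities."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # sum only over supp(P_Z)
    return float(-np.sum(p * np.log2(p)))

print(shannon_entropy_bits([0.5, 0.5]))   # fair coin: 1.0 bit
print(shannon_entropy_bits([0.25] * 4))   # uniform on 4 values: 2.0 bits
```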

Let $X$ be an arbitrary real-valued random variable. Given a positive integer $m$, we create the new discrete random variable

$\langle X\rangle_m=\frac{\lfloor mX\rfloor}{m}$

where $\lfloor\cdot\rfloor$ is the floor operator, which maps a real number to the greatest integer less than or equal to it. Then

$\underline{d}(X)=\liminf_{m\rightarrow\infty}\frac{\mathbb{H}_0(\langle X\rangle_m)}{\log_2 m}$

and

$\bar{d}(X)=\limsup_{m\rightarrow\infty}\frac{\mathbb{H}_0(\langle X\rangle_m)}{\log_2 m}$

are called the lower and upper information dimensions of $X$, respectively. When $\underline{d}(X)=\bar{d}(X)$, the common value is called the information dimension of $X$,

$d(X)=\lim_{m\rightarrow\infty}\frac{\mathbb{H}_0(\langle X\rangle_m)}{\log_2 m}$

Some important properties of the information dimension $d(X)$:

  • If the mild condition $\mathbb{H}(\lfloor X\rfloor)<\infty$ is fulfilled, we have $0\leq\underline{d}(X)\leq\bar{d}(X)\leq 1$.
  • For an $n$-dimensional random vector $\vec{X}$, the first property generalizes to $0\leq\underline{d}(\vec{X})\leq\bar{d}(\vec{X})\leq n$.
  • It is sufficient to calculate the upper and lower information dimensions when restricting to the exponential subsequence $m=2^l$.
  • $\underline{d}(X)$ and $\bar{d}(X)$ are unchanged if rounding or ceiling functions are used for quantization instead of the floor.
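The definition suggests a direct numerical estimate: quantize samples of $X$ at scale $m=2^l$ (using the property above that the exponential subsequence suffices) and compute $\mathbb{H}_0(\langle X\rangle_m)/\log_2 m$. Below is a minimal Monte Carlo sketch; the sample size, seed and the two test distributions are illustrative choices. Note that the ratio converges only logarithmically, since $\mathbb{H}_0(\langle X\rangle_m)=d(X)\log_2 m+O(1)$ for well-behaved distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantized_entropy_bits(samples, m):
    """Empirical Shannon entropy (bits) of <X>_m = floor(m*X)/m, estimated from samples."""
    _, counts = np.unique(np.floor(m * samples), return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

n = 10**6
continuous = rng.uniform(0.0, 1.0, n)            # absolutely continuous: d(X) = 1
discrete = rng.integers(0, 4, n).astype(float)   # four atoms:            d(X) = 0

for l in (6, 10, 14):
    m = 2 ** l
    print(f"l = {l:2d}:  continuous {quantized_entropy_bits(continuous, m) / l:.3f}"
          f"   discrete {quantized_entropy_bits(discrete, m) / l:.3f}")
```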

d-dimensional entropy

If the information dimension $d$ exists, one can define the $d$-dimensional entropy of the distribution by

$\mathbb{H}_{d(X)}(X)=\lim_{n\rightarrow+\infty}\bigl(\mathbb{H}_0(\langle X\rangle_n)-d(X)\log_2 n\bigr)$

provided the limit exists. If $d=0$, the zero-dimensional entropy equals the standard Shannon entropy $\mathbb{H}_0(X)$. For integer dimension $d=n\geq 1$, the $n$-dimensional entropy is the $n$-fold integral defining the respective differential entropy.
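As a quick check of these conventions (an illustrative example, not taken from the sources cited here): if $X$ is uniform on $[0,1)$, then $\langle X\rangle_m$ takes $m$ values, each with probability $1/m$, so

$\mathbb{H}_0(\langle X\rangle_m)=\log_2 m, \qquad d(X)=1, \qquad \mathbb{H}_1(X)=\lim_{m\rightarrow\infty}\bigl(\log_2 m-1\cdot\log_2 m\bigr)=0,$

which agrees with the differential entropy of the uniform distribution on $[0,1)$.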

An equivalent definition of information dimension

In 1994, Kawabata and Dembo (Kawabata & Dembo 1994) proposed an alternative way of measuring information, based on the rate-distortion function of a random variable. The measure is defined as

$d_R(X)=-2\lim_{D\rightarrow 0}\frac{R(X,D)}{\log D},$

where $R(X,D)$ is the rate-distortion function, defined as

$R(X,D)=\min_{\|X-\hat{X}\|_2\leq D}I(X,\hat{X}),$

or equivalently, the minimum amount of information that suffices for a $D$-close approximation of $X$.

They further proved that this definition is equivalent to the definition of information dimension; formally,

$d_R(X)=d(X).$
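As an illustrative sanity check (our example, reading the distortion constraint as mean-squared error): for $X\sim\mathcal{N}(0,\sigma^2)$ the rate-distortion function is $R(X,D)=\tfrac{1}{2}\log\frac{\sigma^2}{D}$ for $0<D\leq\sigma^2$, hence

$d_R(X)=-2\lim_{D\rightarrow 0}\frac{\tfrac{1}{2}\log\frac{\sigma^2}{D}}{\log D}=\lim_{D\rightarrow 0}\left(1-\frac{\log\sigma^2}{\log D}\right)=1,$

in agreement with $d(X)=1$ for an absolutely continuous distribution.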

Dimensional-rate bias

Using the above definition of Rényi information dimension, a measure analogous to the $d$-dimensional entropy is defined in Charusaie, Amini & Rini 2022. This quantity $b(X)$, called the dimensional-rate bias, is defined so as to capture the finite (distortion-independent) term of the rate-distortion function. Formally,

$R(X,D)=-\frac{d(X)}{2}\log\frac{2\pi eD}{d(X)}+b(X).$

The dimensional-rate bias equals the $d$-dimensional entropy for continuous, discrete, and mixed discrete-continuous distributions. Furthermore, it is computable for a class of singular random variables, whereas the $d$-dimensional entropy does not necessarily exist there.

Finally, the dimensional-rate bias generalizes Shannon entropy and differential entropy; in particular, the mutual information $I(X;Y)$ can be obtained from the formula

$I(X;Y)=b(X)+b(Y)-b(X,Y).$
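For example, if $X$ and $Y$ are jointly absolutely continuous, the dimensional-rate biases reduce to differential entropies (consistent with the statement above that $b$ generalizes differential entropy), and the formula recovers the familiar identity

$I(X;Y)=h(X)+h(Y)-h(X,Y),$

while for jointly discrete pairs it reduces in the same way to the corresponding Shannon-entropy identity.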

Discrete-continuous mixture distributions

According to the Lebesgue decomposition theorem,[2] a probability distribution can be uniquely represented as the mixture

$v=pP_{Xd}+qP_{Xc}+rP_{Xs}$

where $p+q+r=1$ and $p,q,r\geq 0$; $P_{Xd}$ is a purely atomic probability measure (discrete part), $P_{Xc}$ is an absolutely continuous probability measure, and $P_{Xs}$ is a probability measure that is singular with respect to Lebesgue measure but has no atoms (singular part). Let $X$ be a random variable such that $\mathbb{H}(\lfloor X\rfloor)<\infty$. Assume the distribution of $X$ can be represented as

$v=(1-\rho)P_{Xd}+\rho P_{Xc}$

where $P_{Xd}$ is a discrete measure and $P_{Xc}$ is an absolutely continuous probability measure, with $0\leq\rho\leq 1$. Then

$d(X)=\rho$

Moreover, given the discrete entropy $\mathbb{H}_0(P_{Xd})$ and the differential entropy $h(P_{Xc})$, the $d$-dimensional entropy is simply given by

$\mathbb{H}_\rho(X)=(1-\rho)\mathbb{H}_0(P_{Xd})+\rho h(P_{Xc})+\mathbb{H}_0(\rho)$

where $\mathbb{H}_0(\rho)$ is the Shannon entropy of a discrete random variable $Z$ with $P_Z(1)=\rho$ and $P_Z(0)=1-\rho$, given by

$\mathbb{H}_0(\rho)=\rho\log_2\frac{1}{\rho}+(1-\rho)\log_2\frac{1}{1-\rho}$

Example

Consider a signal which has a Gaussian probability distribution.

We pass the signal through a half-wave rectifier, which sets all negative values to 0 and keeps all other values unchanged. The half-wave rectifier can be characterized by the function

$f(x)={\begin{cases}x,&{\text{if }}x\geq 0\\0,&x<0\end{cases}}$

Then, at the output of the rectifier, the signal has a rectified Gaussian distribution. It is characterized by an atom of mass 0.5 at $x=0$ and has a Gaussian PDF for all $x>0$.

With this mixture distribution, we apply the formula above to obtain the information dimension $d$ of the distribution and to calculate the $d$-dimensional entropy.

$d(X)=\rho=0.5$

The normalized right part of the zero-mean Gaussian distribution has entropy $h(P_{Xc})=\frac{1}{2}\log_2(2\pi e\sigma^2)-1$, hence

$\begin{aligned}\mathbb{H}_{0.5}(X)&=(1-0.5)(1\log_2 1)+0.5\,h(P_{Xc})+\mathbb{H}_0(0.5)\\&=0+\frac{1}{2}\left(\frac{1}{2}\log_2(2\pi e\sigma^2)-1\right)+1\\&=\frac{1}{4}\log_2(2\pi e\sigma^2)+\frac{1}{2}\ \text{bit(s)}\end{aligned}$
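This closed-form answer can be checked numerically by quantizing samples of the rectified signal, exactly as in the definition of information dimension. Below is a minimal Monte Carlo sketch; the sample size, seed and the choice $\sigma=1$ are illustrative. The first column approaches $d(X)=0.5$ only logarithmically, while the second column converges to $\mathbb{H}_{0.5}(X)=\frac{1}{4}\log_2(2\pi e)+\frac{1}{2}\approx 1.52$ bits.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 1.0, 10**6

# Half-wave rectified Gaussian: atom of mass 0.5 at 0, Gaussian density for x > 0.
x = np.maximum(rng.normal(0.0, sigma, size=n), 0.0)

def quantized_entropy_bits(samples, m):
    """Empirical Shannon entropy (bits) of <X>_m = floor(m*X)/m."""
    _, counts = np.unique(np.floor(m * samples), return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

for l in (4, 8, 12):
    H = quantized_entropy_bits(x, 2 ** l)
    print(f"l = {l:2d}:  H/log2(m) = {H / l:.3f}   H - 0.5*log2(m) = {H - 0.5 * l:.3f}")

print("closed form:", 0.25 * np.log2(2 * np.pi * np.e * sigma**2) + 0.5)
```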

Connection to differential entropy

It is shown[3] that information dimension and differential entropy are tightly connected.

Let $X$ be a random variable with continuous density $f(x)$.

Suppose we divide the range of $X$ into bins of length $\Delta$. By the mean value theorem, there exists a value $x_i$ within each bin such that

$f(x_i)\Delta=\int_{i\Delta}^{(i+1)\Delta}f(x)\,\mathrm{d}x$

Consider the discretized random variable $X^\Delta=x_i$ if $i\Delta\leq X<(i+1)\Delta$.

The probability of each support point $X^\Delta=x_i$ is

$P_{X^\Delta}(x_i)=\int_{i\Delta}^{(i+1)\Delta}f(x)\,\mathrm{d}x=f(x_i)\Delta$

Let $S=\operatorname{supp}(P_{X^\Delta})$. The entropy of $X^\Delta$ is

$\begin{aligned}\mathbb{H}_0(X^\Delta)&=-\sum_{x_i\in S}P_{X^\Delta}(x_i)\log_2 P_{X^\Delta}(x_i)\\&=-\sum_{x_i\in S}f(x_i)\Delta\log_2\bigl(f(x_i)\Delta\bigr)\\&=-\sum_{x_i\in S}\Delta f(x_i)\log_2 f(x_i)-\sum_{x_i\in S}f(x_i)\Delta\log_2\Delta\\&=-\sum_{x_i\in S}\Delta f(x_i)\log_2 f(x_i)-\log_2\Delta\end{aligned}$

where the last step uses $\sum_{x_i\in S}f(x_i)\Delta=1$.

If we set $\Delta=1/m$ and $x_i=i/m$, then we are performing exactly the same quantization as in the definition of information dimension. Since relabeling the values of a discrete random variable does not change its entropy, we have

$\mathbb{H}_0(X^{1/m})=\mathbb{H}_0(\langle X\rangle_m).$

This yields

$\mathbb{H}_0(\langle X\rangle_m)=-\sum\frac{1}{m}f(x_i)\log_2 f(x_i)+\log_2 m$

and when $m$ is sufficiently large,

$-\sum\Delta f(x_i)\log_2 f(x_i)\approx\int f(x)\log_2\frac{1}{f(x)}\,\mathrm{d}x$

which is the differential entropy $h(X)$ of the continuous random variable. In particular, if $f(x)$ is Riemann integrable, then

$h(X)=\lim_{m\rightarrow\infty}\bigl(\mathbb{H}_0(\langle X\rangle_m)-\log_2 m\bigr).$

Comparing this with the $d$-dimensional entropy shows that the differential entropy is exactly the one-dimensional entropy:

$h(X)=\mathbb{H}_1(X).$

In fact, this can be generalized to higher dimensions. Rényi showed that, if $\vec{X}$ is a random vector in the $n$-dimensional Euclidean space $\Re^n$ with an absolutely continuous distribution having probability density function $f_{\vec{X}}(\vec{x})$ and finite entropy of the integer part ($\mathbb{H}_0(\lfloor\vec{X}\rfloor)<\infty$), then $d(\vec{X})=n$

and

$\mathbb{H}_n(\vec{X})=\int\cdots\int f_{\vec{X}}(\vec{x})\log_2\frac{1}{f_{\vec{X}}(\vec{x})}\,\mathrm{d}\vec{x},$

if the integral exists.

Lossless data compression

The information dimension of a distribution characterizes the fundamental limit on the rate at which variables drawn from that distribution can be compressed. In the context of almost lossless data compression of analog sources, one seeks to represent blocks of real numbers by shorter blocks of real numbers, both with infinite precision.

The main objective of lossless data compression is to find efficient representations of source realizations $x^n\in\mathcal{X}^n$ by $y^k\in\mathcal{Y}^k$. An $(n,k)$-code for $\{X_i : i\in\mathbb{N}\}$ is a pair of mappings:

  • encoder: $f_n:\mathcal{X}^n\rightarrow\mathcal{Y}^k$, which converts information from the source into symbols for communication or storage;
  • decoder: $g_n:\mathcal{Y}^k\rightarrow\mathcal{X}^n$, the reverse process, converting the code symbols back into a form that the recipient understands.

The block error probability is $\mathcal{P}\{g_n(f_n(X^n))\neq X^n\}$.

Define $r(\epsilon)$ to be the infimum of $r\geq 0$ such that there exists a sequence of $(n,\lfloor rn\rfloor)$-codes with $\mathcal{P}\{g_n(f_n(X^n))\neq X^n\}\leq\epsilon$ for all sufficiently large $n$.

So $r(\epsilon)$ is essentially the ratio between the code length and the source length; it measures how good a specific encoder-decoder pair is. The fundamental limits in lossless source coding are as follows.[4]

Consider a continuous encoder function $f(x):\Re^n\rightarrow\Re^{\lfloor Rn\rfloor}$ with its continuous decoder function $g(x):\Re^{\lfloor Rn\rfloor}\rightarrow\Re^n$. If we impose no further regularity on $f(x)$ and $g(x)$, then, due to the rich structure of $\Re$, the minimum $\epsilon$-achievable rate is $R_0(\epsilon)=0$ for all $0<\epsilon\leq 1$. This means that one can build an encoder-decoder pair with an arbitrarily large compression ratio.

In order to obtain nontrivial and meaningful conclusions, let $R^*(\epsilon)$ be the minimum $\epsilon$-achievable rate with a linear encoder and a Borel decoder. If the random variable $X$ has a distribution that is a mixture of a discrete and a continuous part, then $R^*(\epsilon)=d(X)$ for all $0<\epsilon\leq 1$. If instead the decoder is restricted to be a Lipschitz continuous function and $\bar{d}(X)<\infty$ holds, then the minimum $\epsilon$-achievable rate satisfies $R(\epsilon)\geq\bar{d}(X)$ for all $0<\epsilon\leq 1$.
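A standard illustration (our example, in the spirit of the compressed-sensing motivation in Wu & Verdú's work) is an i.i.d. sparse source: let $X_i=B_iG_i$ with $B_i\sim\mathrm{Bernoulli}(\rho)$ independent of $G_i\sim\mathcal{N}(0,1)$. The marginal law is the mixture $(1-\rho)\delta_0+\rho\,\mathcal{N}(0,1)$, so by the mixture formula above $d(X)=\rho$, and the result for linear encoders gives

$R^*(\epsilon)=\rho\quad\text{for all }0<\epsilon\leq 1,$

i.e. roughly $\rho n$ linear measurements suffice to recover $X^n$ with block error probability at most $\epsilon$.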

The fundamental role of information dimension in lossless data compression extends beyond i.i.d. data. It has been shown that, for certain processes (e.g., moving-average processes), the fundamental limit of almost lossless compression is likewise given by the information dimension rate.[5] This result allows for further compression that would not be possible by considering only the marginal distribution of the process.

References

  • Çınlar, Erhan (2011). Probability and Stochastics. Graduate Texts in Mathematics. Vol. 261. Springer. doi:10.1007/978-0-387-87859-1. ISBN 978-0-387-87858-4.
  • Cover, Thomas M.; Thomas, Joy A. (2012). Elements of Information Theory (2nd ed.). Wiley. pp. 247–248. ISBN 9781118585771.
  • Wu, Yihong; Verdu, S. (August 2010). "Rényi Information Dimension: Fundamental Limits of Almost Lossless Analog Compression". IEEE Transactions on Information Theory. 56 (8): 3721–3748. doi:10.1109/TIT.2010.2050803. ISSN 0018-9448. S2CID 206737933.
  • Charusaie, M.; Amini, A.; Rini, S. (May 2022). "Compressibility Measures for Affinely Singular Random Vectors". IEEE Transactions on Information Theory. 68 (9): 6245–6275. arXiv:2001.03884. doi:10.1109/TIT.2022.3174623.