What is Entropy (Information Theory)?
Tag: Machine Learning; Date: 02 January 2020
What is Information?
- Consider a discrete random variable \(x\), and let \(h(x)\) denote how much information is received when we observe a specific value of this variable. The amount of information can be viewed as the ‘degree of surprise’ on learning the value of \(x\).
- Intuition: if we are told that a highly improbable event has just occurred, we receive more information than if we were told that some very likely event has just occurred, and if we knew the event was certain to happen we would receive no information at all.
- The function \(h(x)\) is a monotonic function of the probability \(P(x)\): large \(P(x)\) → small \(h(x)\); small \(P(x)\) → large \(h(x)\). (E.g. you already know the sun will rise tomorrow (large \(P(x)\)), so there is little information gain (small \(h(x)\)).)
- Therefore, the measure of information will depend on the probability distribution \(P(x)\)
\[h(x)=-\log_2P(x)\]
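A minimal Python sketch of this definition (the probabilities below are made-up examples, not values from the post):

```python
import math

def self_information(p: float) -> float:
    # Self-information h(x) = -log2 P(x), measured in bits.
    return -math.log2(p)

# Hypothetical example probabilities:
print(self_information(0.5))   # 1.0 bit -- a fair coin flip
print(self_information(0.01))  # ~6.64 bits -- an unlikely event, large surprise
print(self_information(1.0))   # -0.0, i.e. 0 bits -- a certain event carries no information
```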
Why log?
- If we have two events \(x\) and \(y\) that are unrelated, then the information gained from observing both of them should be the sum of the information gained from each of them separately, i.e. \(h(x, y) = h(x) + h(y)\).
- Two unrelated events will be statistically independent, and so \(P(x, y) = P(x)P(y)\).
- From these two requirements it is easily shown that \(h(x)\) must be given by the logarithm of \(P(x)\), since the logarithm turns products into sums: \(h(x, y) = -\log_2 P(x)P(y) = -\log_2 P(x) - \log_2 P(y) = h(x) + h(y)\). So we have \(h(x) = -\log_2 P(x)\) (see the numeric check after this list).
- Note that the minus sign ensures that the information is positive or zero, since \(0 \le P(x) \le 1\).
- The base of the logarithm is arbitrary; by convention, base 2 is used, so \(h(x)\) is measured in bits.
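As a quick sanity check of the additivity argument, here is a small Python sketch using two hypothetical independent events (the probabilities are made up for illustration):

```python
import math

def h(p: float) -> float:
    # Self-information in bits: h(x) = -log2 P(x).
    return -math.log2(p)

# Hypothetical independent events, e.g. two separate coin-like outcomes.
p_x, p_y = 0.5, 0.25
p_xy = p_x * p_y               # independence: P(x, y) = P(x) P(y)

print(h(p_xy))                 # 3.0 bits
print(h(p_x) + h(p_y))         # 3.0 bits -> h(x, y) = h(x) + h(y)
```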
Entropy
- The entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent in the variable’s possible outcomes:
\[H[x] = - \sum_xP(x)\log_2P(x)\]
- This quantity is called the entropy of the random variable \(x\).
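A short Python sketch of the entropy formula, applied to two hypothetical distributions over four outcomes (the numbers are illustrative only):

```python
import math

def entropy(probs) -> float:
    # H[x] = -sum_x P(x) log2 P(x), in bits; outcomes with P(x) = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distributions over 4 outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked  = [0.97, 0.01, 0.01, 0.01]

print(entropy(uniform))  # 2.0 bits  -- maximal uncertainty over 4 outcomes
print(entropy(peaked))   # ~0.24 bits -- the outcome is nearly certain, low entropy
```

Note that the uniform distribution gives the highest entropy, matching the intuition that it is the most uncertain, while a sharply peaked distribution has entropy close to zero.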
Reference