In terms of the frequency of letters, how is it possible to have different frequ
Question
In terms of letter frequency, how is it possible for the most frequent letters to differ when the text I'm analyzing is shorter?
At the moment, I'm comparing the letter frequencies of a long text and of a subtext taken from it. To my surprise, the most frequent letters changed: in the long text it was e followed by t, but in the short text it was t followed by e. Also, when I checked different types of text (e.g. news articles), the letter frequencies changed as well, including which letter was the most frequent.
The bottom line is, how can that be possible? It makes no sense to me.
Explanation / Answer
Speaking in statistical terms, this is the difference between the law of large numbers and the "law of small numbers" (e.g. see Poisson distribution).
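A short text is too small a sample to give reliable frequency estimates. In more detail: if you assume statistically independent letters (not true in general, but usable as a simplification), the variance of the measured relative frequencies is much higher for short texts (it scales as 1/n for a text of n letters), so you have to expect a larger gap between the expected value and your actual measurement. When two letters have similar expected frequencies, such as e and t, that gap can easily be large enough to swap their order.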
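To make the variance argument concrete, here is a minimal simulation sketch under the same independence simplification. The probabilities 0.127 for e and 0.091 for t are assumed approximate English values, not figures from the question, and the helper names freq_std and flip_rate are hypothetical. It prints, for several text lengths, the standard deviation of the measured frequency of e and the fraction of simulated texts in which t comes out ahead of e.

```python
import math
import random

# Assumed approximate English letter probabilities (illustrative values only).
P_E, P_T = 0.127, 0.091

def freq_std(p, n):
    """Standard deviation of a relative frequency over n independent letters."""
    return math.sqrt(p * (1 - p) / n)

def flip_rate(n, trials=200, seed=0):
    """Fraction of simulated n-letter texts in which 't' strictly outnumbers 'e'."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(trials):
        # 'x' stands in for every letter other than 'e' and 't'.
        sample = rng.choices("etx", weights=[P_E, P_T, 1 - P_E - P_T], k=n)
        if sample.count("t") > sample.count("e"):
            flips += 1
    return flips / trials

for n in (100, 1000, 5000):
    print(n, round(freq_std(P_E, n), 4), flip_rate(n))
```

Under these assumptions the ranking flips in a noticeable share of the 100-letter samples and essentially never in the 5000-letter ones. Real text is not independent letter by letter, so treat the numbers as a rough guide only.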
If you want to know whether a sample is consistent with a given frequency distribution, there are statistical hypothesis tests, e.g. the chi-squared test, whose result (a p-value) indicates how probable a deviation at least as large as the observed one would be if the sample really came from the given distribution.
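As a rough sketch of such a test, the snippet below computes Pearson's chi-squared statistic by hand for three categories (e, t, and all other letters). With three categories there are 2 degrees of freedom, and in that case the p-value has the closed form exp(-x/2), so no statistics library is needed. The reference probabilities and the observed counts are hypothetical values chosen for illustration, and the function names are my own.

```python
import math

def chi_squared_stat(observed, expected_probs):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, expected_probs))

def p_value_df2(stat):
    """P(X >= stat) for a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-stat / 2)

# Assumed reference probabilities for 'e', 't', and all other letters.
expected = [0.127, 0.091, 0.782]

# Hypothetical counts from a 100-letter snippet in which 't' outnumbers 'e'.
observed = [9, 13, 78]

stat = chi_squared_stat(observed, expected)
print(round(stat, 3), round(p_value_df2(stat), 3))
```

For these made-up counts the p-value is well above the usual 0.05 threshold, so a 100-letter sample with t ahead of e is still compatible with the reference distribution. The chi-squared approximation itself becomes unreliable when expected counts are very small, which is another reason to be cautious with short texts.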