Emoji and Zipf’s Law

Emojitracker.com is a very interesting site that tracks how often each emoji is used on Twitter. The statistics are updated in real time, and it’s really quite amazing to watch.

[Image: screenshot from 2015-05-06 23:40:09]

(In case the reader is unaware, an emoji is a small icon that is often used to convey ideas or emotions in electronic communications. It should not be confused with an emoticon, which uses ordinary text characters for a similar purpose.)

Is there a formula that describes these frequencies? In 1935, the American linguist George Zipf proposed that, in any corpus of words, the frequency of a word is inversely proportional to its rank in the frequency table. That is, the most common word should be \(N\) times as frequent as the \(N\)th most common word. More generally, we may suppose that the frequencies follow an inverse power law of the form \(y = Cx^{-r}\), where \(y\) is the frequency, \(x\) is the rank, and \(C\) and \(r\) are positive constants. Zipf’s original proposal corresponds to \(r = 1\).

Power laws can be identified using a log-log plot, which has logarithmic scales on both axes. When we take the logarithm of both sides, we get

\[\log y = \log C - r \log x\]

which appears on a log-log plot as a straight line with slope \(-r\) and intercept \(\log C\).
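To see this relationship in action, here is a minimal sketch that builds synthetic data from \(y = Cx^{-r}\) and recovers \(r\) and \(C\) with an ordinary least-squares fit on the logged values. The helper name `fit_loglog` and the values of `C` and `r` are made up for illustration; they are not taken from the emoji data.

```python
from math import log, exp

def fit_loglog(xs, ys):
    """Least-squares line through (log x, log y); returns (slope, intercept)."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic power law with arbitrary (hypothetical) constants.
C, r = 1000.0, 1.5
xs = list(range(1, 51))
ys = [C * x ** (-r) for x in xs]

slope, intercept = fit_loglog(xs, ys)
print("estimated r:", -slope)          # the slope of the line is -r
print("estimated C:", exp(intercept))  # the intercept of the line is log C
```

Because the synthetic data is an exact power law, the fitted slope and intercept should match \(-r\) and \(\log C\) up to floating-point error. Real data will only approximate a line.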

Does the frequency of emojis follow a power law? To answer this question, we need to find a way to scrape the data from the webpage. My solution was crude but effective. I selected all of the text on the page, and used copy-and-paste to get it into a text file. I wrote a short Python script to extract the numbers into a list.

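import re

# Read the text that was copied and pasted from the page.
with open("emoji.txt") as f:
    text = f.read()

# Treat every run of digits as one count; x holds the corresponding ranks.
y = [int(m) for m in re.findall(r"\d+", text)]
x = range(1, len(y) + 1)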
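One thing to watch with this approach: the pattern `\d+` splits a number such as `1,234,567` into three separate matches. Whether the pasted text actually contains thousands separators is not something I can say from here, so the following is only a sketch of a more defensive variant. The function name `extract_counts` and the sample string are made up for illustration.

```python
import re

def extract_counts(text):
    """Return every integer in text, treating commas inside a number as thousands separators."""
    tokens = re.findall(r"\d[\d,]*", text)
    return [int(token.replace(",", "")) for token in tokens]

# Made-up sample text, not real Emojitracker data.
sample = "1,234,567\n89012\n345"
counts = extract_counts(sample)
ranks = range(1, len(counts) + 1)
print(counts)
```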

Next, I used matplotlib to plot the data on a log-log scale, and scipy.stats to compute the line of best fit for the transformed data.

from math import log, exp
from scipy import stats
import matplotlib.pyplot as plt

# Build lists rather than using map(), so they can be sliced and reused in Python 3.
logx = [log(t) for t in x]
logy = [log(t) for t in y]

# Fit a straight line to the 40 most common emojis in log-log space.
slope, intercept, r_value, p_value, std_err = stats.linregress(logx[:40], logy[:40])
y_pred = [exp(slope*t + intercept) for t in logx]

# Plot the data and the fitted power law on logarithmic axes.
plt.loglog(x, y)
plt.loglog(x, y_pred)
plt.title('Frequency of Emoji characters on Twitter')
plt.text(100, 3e7, r'$y = (%.3f \times 10^8)\ x^{%.3f}$' % (exp(intercept)/10**8, slope))
plt.text(100, 1e8, r'$(r^2 = \ %.4f)$' % r_value**2)
plt.show()

From the graph shown below, it appears that the numbers do not follow a power law. However, the 40 most common emojis do appear to follow a power law. The graph shows the frequencies for all of the emojis, with a line of best fit for the 40 most common emojis.
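One way to put a number on "appears to follow" is to compare \(r^2\) for the log-log fit over the first 40 ranks with \(r^2\) over the whole list. The sketch below does this on synthetic data that I constructed to have the same qualitative shape (a power law for the first 40 ranks, then a faster drop-off). It is not the real emoji data, and `r_squared_loglog` is a hypothetical helper name.

```python
from math import log

def r_squared_loglog(freqs, n=None):
    """r^2 of the least-squares line through (log rank, log frequency) for the first n ranks."""
    head = freqs if n is None else freqs[:n]
    lx = [log(rank) for rank in range(1, len(head) + 1)]
    ly = [log(f) for f in head]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy ** 2 / (sxx * syy)

# Synthetic stand-in: exact power law for ranks 1 to 40, then exponential decay.
freqs = [1e8 * rank ** -1.0 for rank in range(1, 41)]
freqs += [freqs[-1] * 0.99 ** k for k in range(1, 800)]

head_r2 = r_squared_loglog(freqs, 40)
full_r2 = r_squared_loglog(freqs)
print("r^2, first 40 ranks:", head_r2)
print("r^2, all ranks:     ", full_r2)
```

With the real counts in `y`, the same comparison would be `r_squared_loglog(y, 40)` against `r_squared_loglog(y)`. A noticeably lower value for the full list would support the reading that only the head of the distribution behaves like a power law.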