In October 2011, Apple added the emoji keyboard to iOS as an international keyboard. Since then, digital language has evolved such that nearly half of comments and captions on Instagram contain emoji characters. And earlier this week, Instagram also added support for emoji characters in hashtags, which allows people to tag and search content with their favorite emoji #.
In Part 1 of this blog post series, we will take a deep dive into emoji usage on Instagram. By applying machine learning and natural language processing techniques, we’ll discover the hidden semantics of emoji.
Emoji on Instagram: Up and to the Right
It is a rare privilege to observe the rise of a new language. Instagram has always supported emoji, but they did not see wide adoption until the introduction of the emoji keyboard on iOS (October 2011) and on most Android platforms (July 2013). The graph below shows the percentage of text (comments and captions) containing emoji characters over time.
In the month following the introduction of the iOS emoji keyboard, 10% of text on Instagram contained emoji. The trend continued until the release of Instagram for Android in April of 2012, when many new users did not have emoji support. Afterwards, there was a clear upward trend which accelerated after Android received native support for emoji in July 2013.
The graph below shows that users from Finland are using emoji characters in over 60% of text! In contrast, the lower bound is in Tanzania with only 10% of text containing emoji. If the overall trend continues, we might be looking at a future where the majority of text on Instagram contains emoji.
Natural Language Processing
Learning an Emoji Representation
We’re often asked about the meaning of emoji such as . Intuitively, substitutable words have similar meanings. For example, we might say that “dog” and “cat” are similar words because they can both be used in sentences like “The pet store sells _ food.” In the field of natural language processing, this intuition is called the distributional hypothesis. It can be applied to emoji by treating them as if they were ordinary words.
More formally, we can place (or embed) emoji and hashtags together with words into a common metric space where there are well-defined distances between elements. The representations of the words are chosen so that similar words have a small distance. In the scatter chart below, we embedded words, emoji, and hashtags into a 100-dimensional space of floating point numbers using 50 million English Instagram comments and captions from 2015.
We learn the floating point numbers using the Gensim library, which re-implements a tool called word2vec. In skip-gram mode, word2vec reads through text and predicts the context around a given word or emoji. If the algorithm predicts the context incorrectly, then it adjusts its internals to make a better guess in the next round. As part of that unsupervised training process, word2vec learns our 100-dimensional representation for words and emoji.
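To make "predicts the context around a given word or emoji" concrete, here is a minimal sketch of the (center, context) pairs that a skip-gram model is trained on. It is not word2vec or Gensim itself, only the pair-generation step. The function name `skipgram_pairs`, the window size, and the sample caption are assumptions for illustration; the post does not state which window size was used.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for tokens within `window` positions.

    Emoji are treated exactly like words: each one is just another token.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical caption, already split into tokens; "\U0001F389" is a party emoji.
caption = ["happy", "birthday", "\U0001F389", "love", "you"]
pairs = skipgram_pairs(caption, window=1)
```

During training, the model sees the center token and is asked to predict each paired context token. Tokens that keep appearing with the same neighbors end up with nearby vectors.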
Having learned a good representation for emoji, we can begin to ask questions about similarity. Namely, for a given emoji, what English words are semantically similar? For each emoji, we compute the “angle” (equivalently the cosine similarity) between it and other words. Words with a small angle are said to be similar and provide a natural, English-language translation for that emoji.
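The similarity lookup can be sketched in a few lines of plain Python. The three-dimensional vectors below are made-up toy values, not learned embeddings, and `most_similar` is a hypothetical helper written for this example (Gensim provides its own lookup on trained models).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, topn=3):
    """Rank every other token by cosine similarity to `query`, highest first."""
    target = embeddings[query]
    scored = [
        (token, cosine_similarity(target, vec))
        for token, vec in embeddings.items()
        if token != query
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


HEART = "\u2764"  # the heart emoji

# Toy 3-dimensional vectors (assumed values for illustration only).
toy = {
    HEART: [0.10, 0.90, 0.20],
    "xoxo": [0.15, 0.85, 0.20],
    "muah": [0.20, 0.80, 0.30],
    "lol": [0.90, 0.10, 0.00],
}

neighbors = most_similar(HEART, toy, topn=2)
```

A larger cosine similarity means a smaller angle, so the first entries in `neighbors` act as the "translation" of the emoji.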
Using our algorithm, we find that many of our popular emoji have meanings in line with early internet slang:
- (ranked 1st in emoji usage): lolol, lmao, lololol, lolz, lmfao, lmaoo, lolololol, lol, ahahah, ahahha, loll, ahaha, ahah, lmfaoo, ahha, lmaooo, lolll, lollll, ahahaha, ahhaha, lml, lmfaooo
- (ranked 2nd in emoji usage): beautifull, gawgeous, gorgeous, perfff, georgous, gorgous, hottt, goregous, cuteeee, beautifullll, georgeous, baeeeee, hotttt, babeee, sexyyyy, perffff, hawttt
- ❤ (ranked 3rd in emoji usage): xoxoxox, xoxoxo, xoxo, xoxoxoxo, xoxoxoxoxo, xoxoxoxox, xxoo, oxox, babycakes, muahhhh, mwahh, babe, boobear, loveyou, bunches, muahhh, muahh, xoxox, muahhhhh
- (ranked 9th in emoji usage): #keepitup, #fingerscrossed, aswell, haha, #impressed, #yourock, lol, #greatjob, bud, #goodjob, awesome, good, #muchlove, #proudofyou, job, #goodluck
- (ranked 11th in emoji usage): ughh, ughhh, ughhhh, ugh, uggh, ugghh, ughhhhh, ughhhhhh, ugggh, lolol, wahhhh, rn, oml, uhg, agh, xc, omgg, omfg, omf, lololol, whyyy, loll, wahhhhh, tooo, kms
Some of the less common emoji had particularly distinctive meanings:
- : #waitonit, #justwaitonit, #wonthedoit, #nuffsaid, #yeslawd, #youtherealmvp, #stayblessed, #thatisall, thou, #enoughsaid, leggo, #onlythebeginning
- : #sistasista, #sistersforlife, #sistersister, #bestiesforlife, yearsoffriendship, #sisterfromanothermister, #morelikesisters, #bffl, #bestiesfortheresties, #bestfriendsforever
- : #birthdaybehavior, #bdaybehavior, #tu, #ladiesnight, #turnuptime, #dontmissit, #bdaycelebration, #piscesseason, #bethere, turnup, #grownandsexy
- : merry, christmas, #merrychristmas, #christmas2014, #christmaseve, #christmastime, xmas, eve, #santa, claus, #happyholidays, #xmas, clause, reindeer, pesach
Naturally, people have strong associations with the flag emoji:
- : merica, #godblessamerica, ‘merica, #murica, #merica, #hooah, #america, #specialforces, #supportourtroops, #goarmy, #redwhiteandblue
- : paris, france, #eiffeltower, #paris, #france, louvre, italy, #montreal
- : #japan, #osaka, #kyoto, #japanese, japan, taipei, osaka, beijing, taiwan, tokyo, #日本
And in answer to our question, we can find that the emoji is associated with: #goodmorningtho, #yadigg, lbvs, #gn, #inmyfeelings, #latenightthoughts, #deletinglater. Personally, I like laughing but very serious (lbvs).
Changing the vocabulary
It seems that the most popular emoji have similar semantics to words like “lol/hehe” (), “xoxo” (❤️) and “omg” (). Are these emoji also replacing the usage of the words?
Precisely, we examine the usage of language in Instagram comments and captions by measuring the percentage of text containing emoji or internet slang. To control for natural changes in Instagram demographics, we examined four cohorts past the launch of Instagram for Android: those joining Instagram in the first week of July 2012, January 2013, July 2013, and January 2014. Each cohort contains millions of Instagram users. We defined internet slang as words matching variants of “xoxo”, “omg”, “muah”, “babe”, “bae”, “lol”, “haha,” and “hehe” with the following regular expression:
As shown in the chart below, all groups exhibit a similar pattern in the rise of emoji (with an upper bound around 45%) and a decline of internet slang (with a lower bound of around 5%). Correlation coefficients within the respective cohorts are all below -0.93, indicating a strongly negative correlation.
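The measurement described here has two parts: the share of text matching a pattern, and the correlation between two such shares over time. The sketch below illustrates both. `SLANG_RE` is a hypothetical pattern written for this example and is not the expression used in the analysis. The monthly shares are invented numbers, chosen only to show the calculation.

```python
import math
import re

# Hypothetical slang pattern (an assumption, not the original expression).
SLANG_RE = re.compile(
    r"\b(?:(?:xo)+x?|omg+|mu+a+h+|babe+|bae+|lol(?:ol)*l*|(?:ha){2,}h?|(?:he){2,}h?)\b",
    re.IGNORECASE,
)


def share_matching(texts, pattern):
    """Fraction of texts that contain at least one match of `pattern`."""
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if pattern.search(text))
    return hits / len(texts)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Share of slang in a tiny, made-up sample of comments.
sample = ["lol so true", "nice photo", "xoxo", "hello there"]
slang_share = share_matching(sample, SLANG_RE)

# Invented monthly shares for one cohort: emoji rising, slang falling.
emoji_by_month = [0.10, 0.20, 0.30, 0.40]
slang_by_month = [0.30, 0.25, 0.15, 0.05]
r = pearson(emoji_by_month, slang_by_month)
```

A coefficient near -1 means the two series move in opposite directions almost in lockstep, which is what values below -0.93 indicate for the real cohorts.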
The vocabulary of Instagram is shifting similarly across many different cohorts, with a decline in internet slang corresponding to a rise in the usage of emoji.
Having our vectorized representation opens up a wealth of semantic analysis. One of the purported advantages of word2vec representations is that they allow for algebraic operations in semantic space. For example, it can be particularly hard to distinguish between the different heart emoji. We can isolate some of the effects by subtracting off the representation of ❤️ and finding similar concepts roughly corresponding to color. For example:
- – ❤️ ~= #goblue, #letsgoduke, #bleedblue, #ibleedblue, #worldautismawarenessday, #goduke, #beatduke, #autismspeaks, #autismawarenessday, #gobroncos, duke
- – ❤️ ~= #gogreen, loyals, #herballife, #happysaintpatricksday, , #stpats, , #jointhemovement, green, #hairskinnails, #happystpatricksday
- – ❤️ ~= , , #springhassprung, , #springiscoming, #springishere, #aprilshowers, #thinkspring, #hellospring, , #wildflower, #happyearthday
- – ❤️ ~= ✨, , , , , , faldc, , brassy, topaz, peachy, purple, #thinkpink, ☁, sparkle, , shimmer, sparkles, kaleidoscope, periwinkle, , greenish
- – ❤️ ~= gorl, , cwd, s4s, aynmalik, spvm, ulee, , , yulema, sfs, bvby, ɑnd, indirect, priv
- – ❤️ ~= ulitzer, , peachy, february’s, tulle, mackz, kendall’s, curvy, faldc, #dancewear, strapless, , ◽, floral
- – ❤️ ~= , ℹ, , , ✉, , , , , paypal, , item, ⏬, , inquire, orders, payment, , , , deposit
Naturally, there are some mistakes in this type of algebra. Nonetheless, subtracting off ❤️ often leaves us with events highly associated with a specific color like #goblue, #gogreen, peachy and purple.
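The subtraction itself is simple element-wise arithmetic followed by the same cosine ranking used for similarity. Here is a minimal sketch with made-up three-dimensional vectors. The blue heart is only a stand-in for "some colored heart", and the names `subtract` and `nearest` are assumptions for this example.

```python
import math


def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest(query_vec, embeddings, exclude=(), topn=3):
    """Tokens whose vectors point most nearly in the direction of `query_vec`."""
    scored = [
        (token, cosine(query_vec, vec))
        for token, vec in embeddings.items()
        if token not in exclude
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


HEART = "\u2764"           # red heart
BLUE_HEART = "\U0001F499"  # stand-in colored heart for this example

# Toy vectors (assumed): axis 0 ~ "affection", axis 1 ~ "blue", axis 2 ~ "green".
toy = {
    HEART: [0.90, 0.10, 0.10],
    BLUE_HEART: [0.90, 0.80, 0.10],
    "xoxo": [0.95, 0.05, 0.10],
    "#goblue": [0.10, 0.90, 0.00],
    "green": [0.00, 0.10, 0.90],
}

# Remove the shared "heart" component, leaving mostly the color component.
color_only = subtract(toy[BLUE_HEART], toy[HEART])
result = nearest(color_only, toy, exclude={HEART, BLUE_HEART}, topn=1)
```

In this toy setup the difference vector keeps only what the two hearts do not share, so its nearest neighbor is the color-related token rather than an affection-related one.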
A Semantic Map
A hundred dimensions are pretty hard for humans to visualize. To visually inspect the relationships between emoji, we can take our 100-dimensional representation for emoji, reduce it to two dimensions, and then plot them on a grid. We do this using an algorithm called t-SNE, which attempts to preserve relationships in a visually meaningful way:
One has to be careful not to read *too* much into the representation since it is an attempt to produce a 2D space out of a 100D one. But it’s clear that semantics are being approximated in our representation.