Emoji.prototype.length — a tale of characters in Unicode
Emoji are the base for text-driven conversations these days. Without these tiny little symbols, a lot of chat conversations today would end in awkward situations and misunderstandings. I still remember the old days when SMS were a thing.
Text chats that don’t include smileys often lead to “Are you kidding?” messages, sent to make sure a stupid joke isn’t taken too seriously. Everybody quickly learned that humor and sarcasm (we should be less sarcastic anyway) are not easily transferable using only written characters. At some point the first Emoji appeared, and they quickly became a fundamental component of everyone's text-based conversations.
Even though I use Emoji every day, I never questioned how they work technically. They surely have to have a connection to Unicode somehow, but I had no idea about the actual functionality. And honestly, I didn’t care too much...
This all changed when I came across a tweet by Wes Bos in which he shared some JavaScript operations on strings including the Emoji family.
[...'👨👩👦'] // ["👨", "", "👩", "", "👦"]
'👨‍👩‍👦'.length // 8
Okay, using the spread operator on a string like that didn’t get me excited, but the fact that this one visible symbol was split into three symbols and two empty strings puzzled me. Seeing the string’s length property return 8 confused me even more, as there were five entries in the spread array and not eight.
I immediately tried the code snippets, and it behaved the way Wes had described it. So what is going on here? I decided to dig deeper into Unicode, JavaScript, and the Emoji family to find some answers.
To understand why JavaScript treats Emoji like that, we have to take a deeper look at Unicode itself.
Unicode is an international computing industry standard. It is a mapping from each letter, character or symbol to a numeric value. Thanks to Unicode, we can share documents including e.g. special German characters like ß, ä, ö with people on systems that don’t use these characters. Thanks to Unicode, encoding works across different platforms and environments.
The Unicode code space has room for 1,114,112 code points (not all of them are assigned to characters), and these code points are usually written as U+ followed by a hexadecimal number. The range of Unicode code points goes from U+0000 to U+10FFFF.
These over one million code points are divided into 17 so-called “planes”, and each plane includes 65,536 code points. The most significant plane is the “Basic Multilingual Plane” (BMP), which ranges from U+0000 to U+FFFF.
The BMP includes characters for almost all modern languages plus a lot of different symbols. The other 16 planes are called “Supplementary Planes” and have several different use cases like — you might have guessed it — the definition of most of the Emoji symbols.
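To make the notation and the planes a bit more tangible, here is a minimal sketch. The helper names formatCodePoint and planeOf are my own and not part of any standard API; they only rely on the fact that each plane spans 0x10000 code points.

```javascript
// Hypothetical helpers (names are illustrative, not a standard API).

// Format a numeric code point in the usual "U+XXXX" notation,
// padded to at least four hexadecimal digits.
function formatCodePoint(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// Each plane spans 0x10000 (65,536) code points, so the plane number
// is the code point divided by 0x10000, rounded down.
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

console.log(formatCodePoint(0xDF), planeOf(0xDF));         // U+00DF 0    (ß, inside the BMP)
console.log(formatCodePoint(0x1F468), planeOf(0x1F468));   // U+1F468 1   (man Emoji, supplementary plane)
console.log(formatCodePoint(0x10FFFF), planeOf(0x10FFFF)); // U+10FFFF 16 (last code point, last plane)
```

Plane 0 is the BMP, planes 1 to 16 are the supplementary planes.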
An Emoji as we know it today is defined by at least one code point in the Unicode range. When looking at all defined Emoji listed in the Full Emoji Data list, you’ll see that there are a lot of them. And by saying “a lot”, I really mean a lot. You might ask yourself how many different Emoji we have defined in Unicode right now. The answer to this question is — as so often in computer science — “It depends”, and we have to understand them first to answer it.
As said, an Emoji is defined by at least one code point. This means that there are also several Emoji out there that are a combination of several different Emoji and code points. These combinations are called sequences. Thanks to sequences it is, for example, possible to modify neutral Emoji (usually displayed with yellow skin color) and make them fit your personal preference.
Modifier sequences for diversity in skin color
I still remember when I first noticed in a chat conversation that I could modify the “thumbs up” Emoji to match my own skin tone. It gave me a feeling of inclusion, and I felt way more connected to that thumb symbol that was all over my messages.
In Unicode, five modifiers can be used to alter the neutral Emoji of a human resulting in a variation having the desired skin tone. The modifiers range from U+1F3FB to U+1F3FF and are based on the Fitzpatrick scale.
By using these, we can transform a neutral Emoji to one with a more expressive skin tone. So let’s look at an example here:
// U+1F467 + U+1F3FD
👧 + 🏽
> 👧🏽
When we take the girl Emoji which has the code point U+1F467 and put a skin tone modifier (U+1F3FD) after it, we automatically get a girl with an adjusted skin tone on systems that support these sequences.
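The same combination can be sketched in JavaScript with String.fromCodePoint. The variable names are just for illustration, and whether the result renders as one symbol depends on the system's Emoji support.

```javascript
// Build the modifier sequence from its two code points.
const girl = String.fromCodePoint(0x1F467);           // U+1F467, girl
const mediumSkinTone = String.fromCodePoint(0x1F3FD); // U+1F3FD, skin tone modifier

const girlWithSkinTone = girl + mediumSkinTone;

console.log(girlWithSkinTone);             // rendered as one symbol where supported
console.log([...girlWithSkinTone].length); // 2 (the string still consists of two code points)
```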
ZWJ sequences for even more diversity
Skin color isn't the only thing people can relate to. When we’re looking back at the family example, it’s quite obvious that not every family consists of a man, a woman, and a boy.
Unicode includes a single code point for the neutral family (U+1F46A - 👪), but that's not what every family looks like. We can create different families with a so-called Zero-Width-Joiner sequence.
And here is how it works: there is a code point called zero-width-joiner (U+200D). This code point acts like glue indicating that two code points should be represented as one single symbol when possible.
Thinking of this sequence logically, what could we glue together to display a family? That’s a simple one: two grown-ups and a kid. By using a Zero-Width-Joiner sequence, diverse families can be represented easily.
// neutral family
// U+1F46A
> 👪
// ZWJ sequence: family (man, woman, boy)
// U+1F468 + U+200D + U+1F469 + U+200D + U+1F466
// 👨 + U+200D + 👩 + U+200D + 👦
> 👨👩👦
// ZWJ sequence: family (woman, woman, girl)
// U+1F469 + U+200D + U+1F469 + U+200D + U+1F467
// 👩 + U+200D + 👩 + U+200D + 👧
> 👩👩👧
// ZWJ sequence: family (woman, woman, girl, girl)
// U+1F469 + U+200D + U+1F469 + U+200D + U+1F467 + U+200D + U+1F467
// 👩 + U+200D + 👩 + U+200D + 👧 + U+200D + 👧
> 👩👩👧👧
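The gluing can be sketched in JavaScript as well. joinWithZwj is a hypothetical helper of mine, not a built-in; it simply joins the given code points with U+200D. How the result is displayed depends on the platform's support for that particular sequence.

```javascript
const ZWJ = '\u200D'; // zero-width joiner

// Hypothetical helper: turn code points into strings and glue them with ZWJ.
function joinWithZwj(...codePoints) {
  return codePoints.map((codePoint) => String.fromCodePoint(codePoint)).join(ZWJ);
}

// family (man, woman, boy)
const family = joinWithZwj(0x1F468, 0x1F469, 0x1F466);

console.log(family);            // one family symbol where the sequence is supported
console.log(family.split(ZWJ)); // the three individual Emoji the sequence is built from
```

Splitting on the joiner gives back the single Emoji, which is roughly what a system without support for the sequence ends up displaying.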
Looking at all the defined sequences, you’ll see that there are even more variants for e.g. one father having two girls. Unfortunately, the support for these isn't very good at the time of writing, but Zero-Width-Joiner sequences degrade gracefully, resulting in the single code points being displayed. This helps to keep the semantics of the particular combined symbol.
// ZWJ sequence: family (man, girl, girl)
// U+1F468 + U+200D + U+1F467 + U+200D + U+1F467
// 👨 + U+200D + 👧 + U+200D + 👧
> 👨👧👧 -> single symbol not supported yet
Another cool thing is that these principles don’t apply to the family Emoji only. Let’s take for example the famous David Bowie Emoji (the real name of this Emoji is actually “man singer”). This one is also a ZWJ sequence consisting of a man (U+1F468), a ZWJ and a microphone (U+1F3A4).
And you might have guessed it, exchanging the man (U+1F468) with a woman (U+1F469) will result in a female singer (or female version of David Bowie). Bringing in skin tone modifiers is also possible to display a black female singer. Great stuff!
// ZWJ sequence: woman singer
// U+1F469 + U+1F3FF + U+200D + U+1F3A4
// 👩 + 🏿 + U+200D + 🎤
> 👩🏿🎤 -> single symbol not supported yet
Unfortunately support for these new sequences is also not very good at the time of writing.
Various counts of Emoji
To answer the question how many Emoji are out there, it really depends on what you count as an Emoji. Is it the number of different code points that can be used to display Emoji? Or do we count all the different Emoji variations that can be displayed?
When we count all the different Emoji that can be displayed (including all sequences and variations), we come up with an overall number of 2198. In case you’re interested in the counting, there is a complete section about that topic on unicode.org.
In addition to the “How to count” question, there is also the fact that new Emoji and Unicode characters are added to the spec constantly, which makes it hard to keep track of the overall number.
Coming back to JavaScript strings and the 16-bit code unit
UTF-16, the string format used by JavaScript, uses a single 16-bit code unit to represent the most common characters. Doing the math (2^16 = 65,536), this means that a bit over 65,000 different code points can fit into one single JavaScript code unit. This maps exactly to the BMP. So let’s give this a try with a few symbols defined in the BMP.
'ツ'.length // 1 -> U+30C4
'⛷'.length // 1 -> U+26F7
'☃'.length // 1 -> U+2603
When using the length property on these strings, it is completely matching our expectations and returning the count of 1. But what happens when I want to use a symbol in JavaScript that's not in the range of the BMP?
Surrogate pairs to the rescue
It is possible to combine two code points defined in the BMP to express another code point that lies outside of the first 65 thousand code points. This combination is called a surrogate pair.
The code points from U+D800 to U+DBFF are reserved for the so-called high or “leading” surrogates and from U+DC00 to U+DFFF for the low or “trailing” surrogates.
These two code points always have to be used in pairs beginning with the high surrogate followed by the low surrogate. Then a specific formula will be applied to decode the out-of-range code points.
Let’s have a look at an example here:
'👨'.length // 2
'👨'.charCodeAt(0) // 55357 -> U+D83D // returns code point of leading surrogate
'👨'.charCodeAt(1) // 56424 -> U+DC68 // returns code point of trailing surrogate
'👨'.codePointAt(0) // 128104 -> U+1F468 // returns combined code point of surrogate pair
'👨'.codePointAt(1) // 56424 -> U+DC68 // returns code point of trailing surrogate
The neutral man Emoji has the code point U+1F468. It can’t be represented in a single code unit in JavaScript. That’s why a surrogate pair has to be used, making it consist of two single code units.
To analyze code units in JavaScript, there are two possible methods. You can use charCodeAt, which returns the value of the single code unit at the given index, so for a surrogate pair you get each surrogate separately. The second method is codePointAt, which returns the code point of the combined surrogate pair in case you hit the leading surrogate and the code point of the trailing surrogate in case you hit the trailing one.
You think this is horribly confusing? I’m with you on that one and highly recommend reading the linked MDN articles on these two methods carefully.
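One way to sidestep the index confusion is to never call codePointAt on a trailing surrogate in the first place. The following sketch does that by iterating the string per code point; codeUnitsOf and codePointsOf are helper names I made up for this example.

```javascript
// Hypothetical helper: list every 16-bit code unit of a string.
function codeUnitsOf(str) {
  const units = [];
  for (let index = 0; index < str.length; index++) {
    units.push(str.charCodeAt(index));
  }
  return units;
}

// Hypothetical helper: list every code point of a string.
// Array.from uses the string iterator, which walks over code points,
// so each `symbol` is a complete character, surrogate pair included.
function codePointsOf(str) {
  return Array.from(str, (symbol) => symbol.codePointAt(0));
}

const man = '\u{1F468}'; // the man Emoji, written as an escape sequence

console.log(codeUnitsOf(man));  // [ 55357, 56424 ]
console.log(codePointsOf(man)); // [ 128104 ]
```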
Let’s have a deeper look at the man Emoji and do the math. Using charCodeAt we can retrieve the code points of the single code units included in the surrogate pair.
The first entry has the value 55357 which maps to D83D in hexadecimal. This is the high surrogate. The second entry has the value 56424 which then maps to DC68 being the low surrogate. It is a classic surrogate pair which will result after applying the formula in 128104, which maps to the man Emoji.
// hexadecimal
0x1F468 = (0xD83D - 0xD800) * 0x400 + 0xDC68 - 0xDC00 + 0x10000
// decimal
128104 = (55357 - 55296) * 1024 + 56424 - 56320 + 65536
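The formula above, and its inverse, can be written down as two small functions. This is only a sketch with function names of my own choosing; in real code String.fromCodePoint and codePointAt already do this work for you.

```javascript
// Decode a surrogate pair into the code point it represents
// (the same formula as in the calculation above).
function decodeSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// Encode a code point outside the BMP (U+10000 to U+10FFFF)
// as a [high, low] surrogate pair.
function encodeSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // 20 bits remain
  const high = 0xD800 + (offset >> 10);   // upper 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // lower 10 bits
  return [high, low];
}

console.log(decodeSurrogatePair(0xD83D, 0xDC68)); // 128104
console.log(encodeSurrogatePair(0x1F468));        // [ 55357, 56424 ]
```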
JavaScript length and the number of code units
With the knowledge of code units, we can now make sense of the puzzling length property. It returns the number of code units, and not the number of symbols we see, as we first thought. This can lead to really hard-to-find bugs when you’re dealing with Unicode in your JavaScript strings, so watch out when you’re dealing with symbols defined outside of the BMP.
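To avoid that kind of bug, it helps to be explicit about what you are counting. The sketch below compares three options. The function names are mine, and the last one relies on Intl.Segmenter, which is not covered in this article and is not available in every runtime, so it returns undefined where the API is missing.

```javascript
// Number of 16-bit code units: this is what `length` gives you.
function countCodeUnits(str) {
  return str.length;
}

// Number of code points: spreading a string iterates over code points.
function countCodePoints(str) {
  return [...str].length;
}

// Number of user-perceived symbols (grapheme clusters).
// Assumption: the runtime ships Intl.Segmenter; otherwise return undefined.
function countGraphemes(str) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return undefined;
  }
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return [...segmenter.segment(str)].length;
}

const man = '\u{1F468}'; // the man Emoji, outside of the BMP

console.log(countCodeUnits(man));  // 2
console.log(countCodePoints(man)); // 1
console.log(countGraphemes(man));  // depends on runtime support
```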
Let’s get back to Wes’ initial example then.
// ZWJ sequence: family (man, woman, boy)
// U+1F468 + U+200D + U+1F469 + U+200D + U+1F466
[...'👨👩👦'] // ["👨", "", "👩", "", "👦"]
'👨‍👩‍👦'.length // 8
// neutral family
// U+1F46A
[...'👪'] // ['👪']
'👪'.length // 2
The Emoji family we see here is a ZWJ sequence consisting of a man, a woman, and a boy. The spread operator iterates over code points, not code units, which is why it yields five entries. The empty strings are not empty strings at all but Zero-Width-Joiners. Calling length on the string returns 2 for each Emoji and 1 for each ZWJ, resulting in 8 (3 × 2 + 2 × 1).
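To double-check that calculation, here is a small sketch that breaks the family sequence down per code point. The family is written with escape sequences so that the invisible joiners are explicit; the shape of the breakdown objects is just something I picked for this example.

```javascript
// family (man, woman, boy) with explicit zero-width joiners
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}';

// One entry per code point, together with the code units it occupies.
const breakdown = [...family].map((symbol) => ({
  codePoint: 'U+' + symbol.codePointAt(0).toString(16).toUpperCase(),
  codeUnits: symbol.length,
}));

const totalCodeUnits = breakdown.reduce((sum, entry) => sum + entry.codeUnits, 0);

console.log(breakdown.map((entry) => entry.codePoint).join(' + '));
// U+1F468 + U+200D + U+1F469 + U+200D + U+1F466
console.log(breakdown.map((entry) => entry.codeUnits).join(' + '), '=', totalCodeUnits);
// 2 + 1 + 2 + 1 + 2 = 8
console.log(totalCodeUnits === family.length); // true
```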
I really enjoyed digging into Unicode. In case you’re also interested in this topic, I want to recommend the @fakeunicode Twitter account. It always shares great examples of what Unicode is capable of. And did you know that there is even a podcast and a conference about Emoji? I’ll continue looking at them, because I think it’s super interesting to learn more about these tiny symbols we use daily and maybe you’re interested, too.