
Emoji Parsing

This article provides a brief description of how Unicode emoji work internally from a technical perspective. It is intended mainly for those of you who want to build some kind of software application that enables interaction with the Emoji Language, but may also be interesting to others who have always wondered how computers actually look at emoji. While this article goes into some technical details, it does not require previous technical knowledge to understand the basics.

Introduction

The Unicode Consortium defines ranges of numbers that computers use to refer to specific symbols. It publishes them as a technical standard called "Unicode". This standard is used by software manufacturers to implement their fonts in such a way that characters will have the same meaning and similar visual representation across different platforms and operating systems.

The main reason that the Unicode Consortium exists is that a wide range of alphabets is used around the globe, not just the Latin alphabet known to the Western world in which this text is written. Chinese characters may be a prominent example, but there are also a lot of smaller alphabets.

Back in the early days, computers supported only a very limited set of characters defined by the "ASCII" standard ("American Standard Code for Information Interchange", first published in 1963), which does not even contain the diacritic marks for Latin characters used by some European languages.

Also, many fields of science and art have their own set of symbols, such as maths or music. Wouldn't it be great if everybody could use the symbols they're already familiar with on their computers and phones too, and wouldn't be forced into some Latin transcription?

With Unicode, this is almost always possible nowadays. The Unicode standard is designed in such a way that it could potentially hold more than one million character definitions (1,114,112 to be precise), distributed among seventeen so-called "planes". At the time of writing, 143,859 characters are defined.

While the Unicode Consortium was officially formed in 1991, they didn't standardize emoji until relatively recently. The first worldwide emoji standard was published in 2015, and a newer version with a handful of additional emoji is published roughly once or twice a year.

Types of Emoji

From a technical perspective, the emoji defined by Unicode can be divided into six classes:

Simple Emoji

Simple emoji look like a single image, for example the rocket 🚀, and technically are just that: a single symbol, internally represented by one number called a "code point" (here 128640 for the rocket). The simple emoji are the foundation for emoji with skin-tone modifiers and compound emoji.
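
As a minimal sketch of the relation between a simple emoji and its number, the following JavaScript looks up the code point of the rocket and builds the rocket again from that number. The variable names are chosen for illustration only.

```javascript
// Look up the number (code point) of a simple emoji.
const rocket = "🚀";
const rocketNumber = rocket.codePointAt(0);       // 128640

// Build the same emoji again from its number.
const rocketAgain = String.fromCodePoint(128640);

console.log(rocketNumber);                        // 128640
console.log(rocketAgain === rocket);              // true
```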

Emoji with skin-tone modifiers

Simple emoji can be modified with so-called skin-tone modifiers which alter the representation of human skin color. Currently, there are five skin-tone modifiers available. The skin-tone modifiers are based upon the Fitzpatrick scale which actually describes six colors. For the emoji representation, the two palest colors have been merged.

Emoji with skin-tone modifiers look like this: 🙋🏻 🙋🏼 🙋🏽 🙋🏾 🙋🏿 All of them have a corresponding simple emoji that is represented with a neutral, non-realistic yellow skin color, in this case 🙋 (person raising hand). The base emoji has a regular number, such as 128587, that is identical to the number of the simple emoji.

The skin-tone modifiers are in fact a second symbol that immediately follows the simple emoji, which has its own number, namely 127995, 127996, 127997, 127998, and 127999, for light, medium-light, medium, medium-dark, and dark skin tone, respectively.

So from the computer's perspective, our examples would look like 🙋 🏻, 🙋 🏼, 🙋 🏽, 🙋 🏾, and 🙋 🏿. Modern computers and smartphones merge emoji followed by such a skin-tone modifier symbol together and represent them using a single image with appropriate skin color.

The skin-tone modifiers normally don't appear alone, but on systems with limited emoji support, they may be rendered as colorized boxes after the simple emoji, as shown above. The skin-tone modifiers are implemented for use with emoji that represent people or body parts. Emoji representing inanimate objects are not affected by them; the modifiers would just appear as boxes behind them.
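
The mechanism described in this section can be sketched in a few lines of JavaScript. The helper names `withSkinTone` and `withoutSkinTone` are hypothetical, and the sketch assumes that the caller only passes emoji that actually support skin tones; it does not check this.

```javascript
// The five modifiers occupy the numbers 127995 (light) to 127999 (dark).
const SKIN_TONES = ["light", "medium-light", "medium", "medium-dark", "dark"];

// Hypothetical helper: append the modifier for the named tone to a simple emoji.
function withSkinTone(baseEmoji, tone) {
  const index = SKIN_TONES.indexOf(tone);
  if (index < 0) throw new RangeError("unknown skin tone: " + tone);
  return baseEmoji + String.fromCodePoint(127995 + index);
}

// Hypothetical helper: remove all skin-tone modifiers from a text,
// leaving the neutral yellow simple emoji behind.
function withoutSkinTone(text) {
  return text.replace(/[\u{1F3FB}-\u{1F3FF}]/gu, "");
}

const base = String.fromCodePoint(128587);        // person raising hand
const medium = withSkinTone(base, "medium");

console.log(Array.from(medium, ch => ch.codePointAt(0))); // [128587, 127997]
console.log(withoutSkinTone(medium) === base);            // true
```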

Compound Emoji

Compound emoji look like a single image, but in fact also consist of multiple symbols. For example the cook 🧑‍🍳, which consists of the emoji for a person 🧑, and the emoji for a cooking pan 🍳. These symbols are joined together using an invisible, magical character called the "zero-width joiner" (ZWJ) that goes between them.

So, from a computer's perspective, this emoji looks like it's three characters in a row: person, ZWJ, and cooking pan, each with their own number (129489, 8205, 127859). The number of the ZWJ is always 8205. Because the ZWJ has no width, it isn't really visible by itself. Apparently, there is no abstract symbol to represent the ZWJ, except for a vertical-bar-like image defined by the ISO.

Most modern computers and phones contain a visual representation for this specific sequence, the cook, but older computers and phones made before the advent of the ZWJ don't. If you're reading this page on an old device or with otherwise limited emoji support, you might actually see the two separate images for person and cooking pan instead of the cook.

The ZWJ is also used to encode other detailed information not contained in the original simple emoji, for example hair style, color (other than skin-tone), gender, or direction, as the following table shows:

Category | Example | Literally | Description
Hair | 🧑‍🦳 | 🧑, ZWJ, 🦳 | person with white hair
Color | 🐈‍⬛ | 🐈, ZWJ, ⬛ | cat with black fur
Gender | 🧝‍♀️ | 🧝, ZWJ, ♀️ | elf (female)
Direction | 🏃‍➡ | 🏃, ZWJ, ➡ | person running rightwards

The hair style emoji are defined as "component" emoji that are not normally used alone, but the others are regular simple emoji. Note that at the time of writing, the direction emoji may not (yet) be supported on your device. In theory, the ZWJ can be used to build arbitrary sequences of new compound emoji, but only a limited subset of these is actually rendered as single images, even on current computers.
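
To make the structure of compound emoji concrete, here is a small JavaScript sketch that joins two simple emoji with the ZWJ and lists the numbers of the result. The helpers `joinEmoji` and `codePoints` are hypothetical names, not part of any library.

```javascript
const ZWJ = "\u200D"; // zero-width joiner, number 8205

// Hypothetical helper: join simple emoji into one compound sequence.
function joinEmoji(...parts) {
  return parts.join(ZWJ);
}

// Hypothetical helper: list the numbers of all characters in a text.
function codePoints(text) {
  return Array.from(text, ch => ch.codePointAt(0));
}

const person = String.fromCodePoint(129489);
const cooking = String.fromCodePoint(127859);
const cook = joinEmoji(person, cooking);

console.log(codePoints(cook));        // [129489, 8205, 127859]
console.log(cook.split(ZWJ).length);  // 2 (the two simple emoji)
```

Whether the joined sequence is displayed as a single image still depends on the device, as described above.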

Flag Emoji

Flag emoji also look like single emoji that represent the flag of a country. Technically, they are made up of two individual characters, forming a two-letter country code, like "US" or "CH". The country codes themselves are defined by ISO 3166-1, independently from the emoji.

Instead of the regular letters, a special set of "regional indicator symbols" is used, ranging from 🇦 (127462, representing uppercase A) to 🇿 (127487, representing uppercase Z). So, to the computer, the flags 🇺🇸 and 🇨🇭 look like the letters 🇺 🇸 and 🇨 🇭 respectively.

When a computer or phone encounters two adjacent regional indicator symbols, it is supposed to render them together as a country flag image, but again, if it doesn't support flag emoji or the specific two-letter combination does not exist as a country code, the two characters may show up separately as two stylized uppercase letters.
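
The mapping between country codes and flag sequences can be sketched as follows. `countryFlag` and `countryCode` are hypothetical helper names, and the sketch does not check whether the two letters are a country code that actually exists in ISO 3166-1.

```javascript
const REGIONAL_INDICATOR_A = 127462; // regional indicator symbol for "A"

// Hypothetical helper: turn a two-letter country code into a flag sequence.
function countryFlag(code) {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new RangeError("expected two letters: " + code);
  }
  return Array.from(code.toUpperCase(), letter =>
    String.fromCodePoint(REGIONAL_INDICATOR_A + letter.charCodeAt(0) - 65)
  ).join("");
}

// Hypothetical helper: recover the two-letter code from a flag sequence.
function countryCode(flag) {
  return Array.from(flag, symbol =>
    String.fromCharCode(symbol.codePointAt(0) - REGIONAL_INDICATOR_A + 65)
  ).join("");
}

console.log(Array.from(countryFlag("US"), ch => ch.codePointAt(0))); // [127482, 127480]
console.log(countryCode(countryFlag("ch")));                         // "CH"
```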

Regional Flag Emoji

Regional flags look like regular flag emoji, but work differently on a technical level. Semantically, they represent the flag of a "subdivision" of a country. For example, for the United Kingdom, there are the subdivisions 🏴󠁧󠁢󠁥󠁮󠁧󠁿 (England), 🏴󠁧󠁢󠁳󠁣󠁴󠁿󠁧󠁢󠁥󠁮󠁧󠁿 (Scotland), and 🏴󠁧󠁢󠁷󠁬󠁳󠁿󠁧󠁢󠁥󠁮󠁧󠁿 (Wales).

Technically, regional flags are probably the most complex emoji, because they consist of six or seven individual characters. Every regional flag starts with the "black flag" 🏴 (127988), which is also available as a simple emoji, as a foundation.

What follows is a sequence of four or five letters (five in the examples above): the two-letter country code followed by the subdivision code, written without a separator. These are neither normal letters nor flag emoji letters, but characters from yet another category called "tags". The tags include the lowercase letters a (917601) to z (917626), as well as the digits 0 (917552) to 9 (917561), but for the regional flags only the letters are used. The letter sequence ends with a special tag, called the "cancel tag" (917631).

Thus, the flag of England looks as follows to the computer:

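Character | Number | Description
🏴 | 127988 | black flag emoji
(not shown) | 917607 | tag letter g
(not shown) | 917602 | tag letter b
(not shown) | 917605 | tag letter e
(not shown) | 917614 | tag letter n
(not shown) | 917607 | tag letter g
(not shown) | 917631 | cancel tag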

There isn't really any standard way to represent the tag characters on their own, so if your device doesn't support the regional flags, all you are going to see is the single black flag emoji.

The abbreviations for subdivisions, for example "ENG" for England in the United Kingdom, are defined by a separate standard called ISO 3166-2, which takes the two-letter country codes as a foundation and adds two- or three-letter subdivision codes. It must be noted that at the time of writing, only very few subdivision flags are usually supported as emoji.
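
The construction of a regional flag can be sketched like this. The helper name `subdivisionFlag` is hypothetical, and the sketch assumes codes written like "GB-ENG" or "gbeng" that consist of letters only. It relies on the observation that each tag number given above equals 917504 plus the ASCII number of the corresponding letter (for example 917504 + 97 = 917601 for "a").

```javascript
const BLACK_FLAG = 127988;
const TAG_OFFSET = 917504;  // tag number = TAG_OFFSET + ASCII number
const CANCEL_TAG = 917631;

// Hypothetical helper: build the sequence for a subdivision flag.
function subdivisionFlag(code) {
  const letters = code.toLowerCase().replace("-", "");
  if (!/^[a-z]{4,5}$/.test(letters)) {
    throw new RangeError("expected four or five letters: " + code);
  }
  const tags = Array.from(letters, ch => TAG_OFFSET + ch.charCodeAt(0));
  return String.fromCodePoint(BLACK_FLAG, ...tags, CANCEL_TAG);
}

const england = subdivisionFlag("GB-ENG");
console.log(Array.from(england, ch => ch.codePointAt(0)));
// [127988, 917607, 917602, 917605, 917614, 917607, 917631]
```

The sketch builds a sequence for any well-formed code; whether a device has an image for it is a separate question.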

Keycap Emoji

Finally, the keycap emoji represent symbols that you would typically find on a computer keyboard or on a keypad installed in an ATM, pocket calculator, television remote control, or similar device. Hence the name "keycap", which refers to the small plastic cap of a keyboard key.

At the time of writing, thirteen keycap emoji are defined, comprising the digits zero 0️⃣ to nine 9️⃣, a shorthand for the number ten 🔟, the asterisk or simply "star" *️⃣, and the number sign or simply "hash" #️⃣.

Technically, a keycap emoji consists of three individual characters: The actual digit or symbol, the color variation selector (more on that below), and the "keycap" emoji. Therefore, the keycap emoji "three" 3️⃣, for example, is encoded as follows:

Character | Number | Description
3 | 51 | regular digit three
(not shown) | 65039 | color variation selector
(not shown) | 8419 | neutral keycap emoji

The digit three is represented using its standard character (with number 51). Unlike the country flag and regional flag emoji, the keycap emoji do not use a special range of letters and digits. The variation selector is not visible by itself. The final neutral keycap emoji is not normally used as a single emoji, but if it is, it is usually rendered as a blank keycap with no symbol on it.
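
A minimal JavaScript sketch of this encoding follows. The helper name `keycap` is hypothetical; it covers the digits, the asterisk, and the number sign, but not the separate emoji for the number ten.

```javascript
const COLOR_VARIATION_SELECTOR = 65039;
const NEUTRAL_KEYCAP = 8419;

// Hypothetical helper: build the keycap sequence for a digit, "*" or "#".
function keycap(symbol) {
  if (!/^[0-9#*]$/.test(symbol)) {
    throw new RangeError("no keycap sequence for: " + symbol);
  }
  return symbol + String.fromCodePoint(COLOR_VARIATION_SELECTOR, NEUTRAL_KEYCAP);
}

console.log(Array.from(keycap("3"), ch => ch.codePointAt(0))); // [51, 65039, 8419]
```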

A Note on Color Emoji

To complicate matters further, emoji sequences may sometimes contain an invisible "variation selector" character, as shown in the keycap example. While the Unicode standard defines a whole range of variation selectors (sixteen of them in its basic "Variation Selectors" block), only two of them are typically used in the context of emoji: variation selector 15 (character number 65038) and variation selector 16 (character number 65039), denoting text and color representation, respectively.

These variation selectors do not change the meaning of the emoji itself. They are just a hint to the computer how the emoji should be represented visually. For example, the Mars and Venus symbols ♂️ and ♀️ are available as colorized emoji and as a black-and-white symbol on many systems. If such an emoji is followed by one of the variation selectors, it will be rendered in the corresponding mode (if possible), instead of whatever the system default is.

Historically, the variation selectors were introduced because the Unicode standard already included several symbols, like the Mars and Venus symbol, before emoji were included. Instead of adding a duplicate definition for the emoji with the same meaning, the original black-and-white symbol retained its number and the color emoji variation selector was added to make the existing symbols appear as emoji.

When processing a sequence of emoji, the variation selectors can be ignored most of the time, as they don't change the semantics of emoji. But from a technical point of view, one must keep in mind that those variation selectors do occur, and they must be handled properly. When a program produces emoji output, it would be preferable to add the color variation selector to make the output visually consistent. For keycap emoji, as described above, the color variation selector is always added before the neutral keycap emoji.
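
The two tasks mentioned here, ignoring the selectors when reading and adding the color selector when writing, might look as follows in JavaScript. Both helper names are hypothetical. Note that `stripVariationSelectors` is meant for comparing or analyzing text; stripped keycap sequences, for example, are no longer valid emoji for display.

```javascript
const TEXT_SELECTOR = "\uFE0E";  // variation selector 15, number 65038
const COLOR_SELECTOR = "\uFE0F"; // variation selector 16, number 65039

// Hypothetical helper: drop both selectors, e.g. before comparing emoji.
function stripVariationSelectors(text) {
  return text.split(TEXT_SELECTOR).join("").split(COLOR_SELECTOR).join("");
}

// Hypothetical helper: request color representation for a single symbol.
function asColorEmoji(symbol) {
  return stripVariationSelectors(symbol) + COLOR_SELECTOR;
}

const venus = "\u2640"; // female sign, number 9792
const colorVenus = asColorEmoji(venus);

console.log(Array.from(colorVenus, ch => ch.codePointAt(0)));  // [9792, 65039]
console.log(stripVariationSelectors(colorVenus) === venus);    // true
console.log(asColorEmoji(venus + TEXT_SELECTOR) === colorVenus); // true
```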

Encoding and Decoding Emoji

Emoji and Character Encodings

As you have learned so far, Unicode symbols are basically just numbers. This would make a sequence of emoji (or other text) just a sequence of numbers. While that is actually how computers store text, the number sequences used for storing information are not quite the same as the Unicode numbers we've seen so far.

This has to do with how numbers are stored in binary computer systems. As mentioned in the beginning, the number of Unicode characters may go up to one million in the future. This is a pretty large number, and for computers to process it, it is always necessary to reserve storage space for the highest possible number, even when only actually using lower numbers.

From the computer's perspective, the numbers one million (binary 11110100001001000000) and forty-two (binary 00000000000000101010) take up exactly the same amount of storage space. Storing such potentially large numbers all the time would use up unnecessary disk space most of the time, and would make loading web pages, for example, unnecessarily slow.

Thus, several additional standards were invented that allow for abbreviating the smaller numbers so that they take up less disk space, switching to larger numbers (representing seldom used characters) only when actually needed. The most widely used of these standards are UTF-8 and UTF-16. UTF stands for "Unicode Transformation Format", and the number gives the size in bits of the smallest unit the format uses to store a character: 8 bits for UTF-8 and 16 bits for UTF-16.

Take the simple emoji sequence ↪️🧝‍♀️ (left arrow curving right, woman elf) for example. How long is this sequence really? We would probably assume that it's two emoji, and from the previous description of compound emoji, we know that the woman elf is actually an "elf-ZWJ-female" sequence. So that would be three Unicode characters for the elf internally. But let's look at the example in detail:

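Position | Character | Number | Description
1 | ↪️ | 8618 | left arrow curving right
2 | (not shown) | 65039 | color variation selector
3 | 🧝 | 55358 | elf, first surrogate
4 | (not shown) | 56797 | elf, second surrogate
5 | (not shown) | 8205 | ZWJ (zero-width joiner)
6 | ♀️ | 9792 | female sign
7 | (not shown) | 65039 | color variation selector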

This is how those two emoji would probably be encoded in a real-life situation with UTF-16. The UTF-16 encoding is the default encoding used by JavaScript, a programming language typically used for developing interactive websites. The arrow takes an additional color variation selector to turn the existing Unicode character into a color emoji, as described in the "A Note on Color Emoji" section.

What's interesting here is that the gender-neutral elf (🧝), which is just a simple emoji with number 129501, is encoded with two separate numbers. The reason for this is that the UTF-16 encoding directly stores only the numbers of the circa 65,000 characters from the so-called "Basic Multilingual Plane", which contains the most frequently used characters. Less frequent characters with numbers outside of that range are encoded by a pair of so-called "surrogate" characters.

The actual Unicode character number is then calculated using a special formula:

  (first - 55296) * 1024 + second - 56320 + 65536
  

Or, in our example:

  (55358 - 55296) * 1024 + 56797 - 56320 + 65536
  = 129501
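
The formula, and its inverse, can be written as two small JavaScript functions. The names `fromSurrogates` and `toSurrogates` are hypothetical, and the sketch assumes valid input: a proper surrogate pair in one direction, a number of 65536 or above in the other.

```javascript
// Combine the two numbers of a surrogate pair into the Unicode number.
function fromSurrogates(first, second) {
  return (first - 55296) * 1024 + (second - 56320) + 65536;
}

// Split a Unicode number of 65536 or above into its surrogate pair.
function toSurrogates(number) {
  const offset = number - 65536;
  return [55296 + Math.floor(offset / 1024), 56320 + (offset % 1024)];
}

console.log(fromSurrogates(55358, 56797)); // 129501
console.log(toSurrogates(129501));         // [55358, 56797]
```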
  

While this makes processing more complicated, it does help to save storage space. The even more common UTF-8 encoding works in a similar way, but instead of distinguishing between the Basic Multilingual Plane and everything beyond that, it splits the numbers into four categories, from really short encoding of frequent characters to longer encodings for less frequent characters.
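
The four UTF-8 length categories can be observed directly. The sketch below assumes an environment that provides the standard `TextEncoder` class as a global, which is the case in current browsers and current versions of Node.js; `utf8Length` is a hypothetical helper name.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Hypothetical helper: number of bytes a text occupies in UTF-8.
function utf8Length(text) {
  return encoder.encode(text).length;
}

console.log(utf8Length("A"));         // 1 byte  (number 65)
console.log(utf8Length("\u00E9"));    // 2 bytes (é, number 233)
console.log(utf8Length("\u20AC"));    // 3 bytes (€, number 8364)
console.log(utf8Length("\u{1F9DD}")); // 4 bytes (elf, number 129501)
```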

A JavaScript Example

It is typically not needed to convert between Unicode numbers and the various encodings manually, but the existence of surrogate characters must still be taken into account. For example, when working with JavaScript, the standard function charCodeAt would always return the raw values, that is 55358 and 56797 in the previous example, instead of the correct Unicode number 129501.

When processing characters outside the Basic Multilingual Plane – and some emoji do reside outside of it, as we can see from this example – the more sophisticated function codePointAt can be used instead. This function would perform the calculation internally, and return the correct Unicode value 129501 for the first number of the elf emoji, because it looks ahead and includes the second number in the calculation.

However, when asking codePointAt for the Unicode number of the second number that also belongs to the elf emoji, it would still return the literal value 56797, which makes no sense on its own. The reason for this is that the function only looks forward for a corresponding surrogate character, and never backwards. This is necessary to keep the output of the function consistent even if it is used in the middle of several characters encoded by surrogate pairs.
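
The behavior of both functions can be checked with the elf emoji. As a side note, `Array.from` (like a `for...of` loop) steps through a string one Unicode character at a time, so it never stops in the middle of a surrogate pair.

```javascript
const elf = String.fromCodePoint(129501);

console.log(elf.length);         // 2 (two UTF-16 numbers)
console.log(elf.charCodeAt(0));  // 55358 (raw first surrogate)
console.log(elf.charCodeAt(1));  // 56797 (raw second surrogate)
console.log(elf.codePointAt(0)); // 129501 (looks ahead, combines the pair)
console.log(elf.codePointAt(1)); // 56797 (never looks backwards)

// Stepping through whole characters avoids the problem entirely.
const elfNumbers = Array.from(elf, ch => ch.codePointAt(0));
console.log(elfNumbers);         // [129501]
```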

In conclusion, something along the lines of the following code could be used to correctly determine the number of individual emoji characters within an emoji sequence, correctly ignoring the surrogate characters:

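  // Counts the Unicode characters in s; a surrogate pair counts only once.
  function emoji_length(s) {
    let l = 0;
    for (let i = 0; i < s.length; i++) {
      // combines a first surrogate with the second surrogate that follows it
      const code = s.codePointAt(i);
      // still in this range only if the first surrogate has no partner
      const highSurrogate = code >= 0xD800 && code <= 0xDBFF;
      // second half of a pair that was already counted at the previous index
      const lowSurrogate = code >= 0xDC00 && code <= 0xDFFF;
      if (!highSurrogate && !lowSurrogate) {
        l++;
      }
    }
    return l;
  }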
  

This function would return the correct value 6 for the example sequence ↪️🧝‍♀️, consisting of the arrow (1) and the elf-ZWJ-female sequence (3), each followed by a color variation selector, thus 1+1 + 3+1 = 6. Depending on what exactly you need to calculate, it may be helpful to skip the color variation selectors as well, and perhaps count all emoji joined by ZWJs as a single emoji.
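
A variant along those lines might look like this. The function name `countEmoji` is hypothetical, and the sketch is deliberately limited: it skips variation selectors and counts everything joined by ZWJs once, but it does not treat skin-tone modifiers, flags, or keycaps specially.

```javascript
// Counts emoji, skipping variation selectors and treating
// everything joined by ZWJs as a single emoji.
function countEmoji(sequence) {
  let count = 0;
  let joined = false; // true directly after a ZWJ
  for (const ch of sequence) { // steps through whole Unicode characters
    const code = ch.codePointAt(0);
    if (code === 65038 || code === 65039) continue; // variation selectors
    if (code === 8205) { // ZWJ: the next character belongs to this emoji
      joined = true;
      continue;
    }
    if (!joined) count++;
    joined = false;
  }
  return count;
}

// The example sequence: arrow, selector, elf, ZWJ, female sign, selector.
const example = String.fromCodePoint(8618, 65039, 129501, 8205, 9792, 65039);
console.log(countEmoji(example)); // 2
```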

Emoji Ranges

Outside their specific use for the Emoji Language, emoji are typically mixed with regular text. (Also, take the Reference Grammar document for example, where Emoji Language is mixed with explanations in English.) So the first step when processing emoji is often to separate them from regular text. That is, we need a method for determining which of the Unicode characters in a text are emoji and which are not.

Contrary to what we might expect, emoji are not defined in a single, continuous range of numbers in the Unicode standard that would be easy to detect. Instead, they are scattered among various ranges. You can test this yourself by copying-and-pasting all emoji from the Emoji Test Page into the rang tool and selecting the input mode "Treat each character in input as one number (Unicode)".

The reason for this lies in the history of the Unicode standard. As mentioned before, characters like the Mars and Venus symbol, and even some simple black-and-white smiley faces, already existed in the Unicode standard before emoji were officially standardized. Instead of adding duplicate definitions, those symbols retained their original number. This was done to ensure long-term compatibility with existing software and content.

While it would be possible to compile a list of all ranges that include currently defined emoji, to keep it simple, we can use a regular expression. Regular expressions are a way of expressing the pattern a sequence of text follows, instead of listing every possible example explicitly, and regular expressions are supported by most programming languages today.

  \p{RI} \p{RI} 
  | \p{Emoji} 
    ( \p{EMod} 
    | \x{FE0F} \x{20E3}? 
    | [\x{E0020}-\x{E007E}]+ \x{E007F} )?
    (\x{200D} \p{Emoji}
      ( \p{EMod} 
      | \x{FE0F} \x{20E3}? 
      | [\x{E0020}-\x{E007E}]+ \x{E007F} )?
    )*
  

This regular expression is presented in the Unicode Technical Standard #51 (link in the sources section) and covers all emoji defined today. It uses hexadecimal numbers instead of decimal numbers to represent the numeric ranges, as well as named character classes (\p{…}). Note, however, that the named class \p{Emoji} on its own does not match every complete emoji, because flags, keycaps, and other sequences consist of several characters; this is what the remaining parts of the expression are for.

Even though this regular expression looks complex at first glance, it is highly recommended to use it when writing custom emoji processing code instead of creating your own regular expression. This official Unicode regular expression is already a simplified (but fully functional) version of the regular expression you would get from the formal definition of emoji sequences, as explained in the technical standard document.
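
As a sketch of how the expression might be used in JavaScript: the language has no free-spacing mode for regular expressions, so the pattern is assembled from strings here, with \x{…} written as \u{…} and the short class names spelled out (`Regional_Indicator`, `Emoji_Modifier`). This assumes a JavaScript engine with support for Unicode property escapes. The names `modifierPart`, `emojiPattern`, and `findEmoji` are hypothetical.

```javascript
// Optional part after an emoji: skin tone, keycap, or tag sequence.
const modifierPart =
  "(?:\\p{Emoji_Modifier}" +
  "|\\u{FE0F}\\u{20E3}?" +
  "|[\\u{E0020}-\\u{E007E}]+\\u{E007F})?";

// Either a flag (two regional indicators) or an emoji sequence with ZWJs.
const emojiPattern =
  "\\p{Regional_Indicator}\\p{Regional_Indicator}" +
  "|\\p{Emoji}" + modifierPart +
  "(?:\\u{200D}\\p{Emoji}" + modifierPart + ")*";

// Hypothetical helper: return all emoji sequences found in a text.
function findEmoji(text) {
  return text.match(new RegExp(emojiPattern, "gu")) || [];
}

const rocket = String.fromCodePoint(128640);
const flag = String.fromCodePoint(127482, 127480);        // US flag
const cook = String.fromCodePoint(129489, 8205, 127859);  // person, ZWJ, pan

const found = findEmoji("Launch " + rocket + " from " + flag + " with " + cook + "!");
console.log(found.length); // 3
```

One caveat to keep in mind: the class \p{Emoji} also contains the plain digits, the asterisk, and the number sign, because these are the base characters of the keycap emoji. In mixed text, matches consisting only of such a character usually need to be filtered out.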

Parsing Emoji Sequences

To sum it up, handling emoji on a technical level is not as easy as the colorful images might suggest at first glance. There are various caveats along the way, from character encoding to variation selectors. The emoji sequences themselves almost form a small language of their own, with skin tones, ZWJ sequences, and flags each being expressed in a completely different way, due to the historical development of emoji and their standardization.

From the information given in this introductory article as well as the primary sources listed below, you should be able to implement your own emoji software in any programming language you're familiar with. You'll find some existing software libraries on the Internet that help you with handling typical tasks such as splitting sequences into individual emoji. As always, it is worthwhile to evaluate what best suits the needs of your projects – custom code or library code.

In any case, any emoji-handling software should always be tested thoroughly, as there are many unusual edge cases such as the regional flags, which do not occur too frequently, but if they do, they can break the software when not handled properly. This article does not contain recommendations for specific software, as available implementations for various programming languages keep changing. I'm planning to update this document in the future though, if there are fundamental changes to the emoji standard.

Sources


Copyright © 2021 by Thomas Heller [ˈtoːmas ˈhɛlɐ]