Awesome Unicode

A curated list of delightful Unicode tidbits, packages and resources.

Please read the contribution guidelines before contributing. Key Unicode terminology is defined in the glossary.

Cross posted to Wisdom's Dev Blog

Foreword

Unicode is Awesome! Prior to Unicode, international communication was grueling: everyone had defined their own extended character set (called a code page) in the upper half of the 8-bit byte range, above ASCII, and these sets conflicted with one another. Just think of German speakers coordinating with Korean speakers over which 128-character code page to use. Thankfully, the Unicode Standard caught on and unified communication. Unicode 8.0 standardizes over 120,000 characters from 129 scripts: some modern, some ancient, and some still undeciphered. Unicode handles left-to-right and right-to-left text and combining marks, and it includes diverse cultural, political, and religious characters as well as emoji. Unicode is awesomely human, and ultimately underappreciated.

Quick Unicode Background
Awesome Characters List
Quirks and Troubleshooting
- One-To-Many Case Mappings
Awesome Packages & Libraries
Emojis
- Diversity
Creatively Naming Variables and Methods
- Recursive HTML Tag Renaming Script
Unicode Fonts
More Reading
Exploring Deeper into Unicode Yourself
Overview Map
- A map of the Basic Multilingual Plane
- Unicode Blocks
Principles of the Unicode Standard
Unicode Versions
Contributing
Code of Conduct
License

Quick Unicode Background

What Characters Does the Unicode Standard Include?

The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.

The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. It provides codes for diacritics, which are modifying character marks such as the tilde (~), that are used in conjunction with base characters to represent accented letters (ñ, for example). In all, the Unicode Standard, Version 9.0 provides codes for 128,172 characters from the world's alphabets, ideograph sets, and symbol collections.
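To make the point about diacritics concrete, here is a minimal Python sketch using the standard `unicodedata` module. It assumes the combining form of the tilde, U+0303 COMBINING TILDE, rather than the spacing ASCII tilde shown above; the variable names are illustrative only.

```python
import unicodedata

# "ñ" can be stored as one precomposed code point, or as a base letter
# followed by a combining mark.
precomposed = "\u00f1"   # LATIN SMALL LETTER N WITH TILDE
decomposed = "n\u0303"   # "n" followed by COMBINING TILDE

print(len(precomposed), len(decomposed))   # 1 2

# The two strings are different sequences of code points...
print(precomposed == decomposed)           # False

# ...but normalization converts one form into the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True

# A nonzero combining class marks U+0303 as a combining character.
print(unicodedata.name("\u0303"), unicodedata.combining("\u0303"))
```

Both forms render as the same accented letter, which is why comparing text without normalizing it first can give surprising results.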

The majority of common-use characters fit into the first 64K (65,536) code points, an area of the codespace called the Basic Multilingual Plane, or BMP for short. Sixteen supplementary planes are available for encoding other characters, and together they currently hold over 850,000 unused code points. More characters are under consideration for addition to future versions of the standard.
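Since each plane spans 65,536 code points, the plane number is simply the code point shifted right by 16 bits. The sketch below illustrates this; `plane_of` and `is_bmp` are hypothetical helper names, not part of any library.

```python
PLANE_SIZE = 0x10000        # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF   # last code point in the Unicode codespace


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that a code point falls in."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode codespace")
    return code_point >> 16


def is_bmp(ch: str) -> bool:
    """True if the single character lies in the Basic Multilingual Plane."""
    return plane_of(ord(ch)) == 0


for ch in ["A", "ñ", "€", "😀"]:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ord(ch))}")

# 17 planes in total: the BMP plus sixteen supplementary planes.
print((MAX_CODE_POINT + 1) // PLANE_SIZE)   # 17
```

The first three characters land in plane 0 (the BMP), while the emoji U+1F600 lands in plane 1.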

The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.
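The private use counts above can be checked with a little arithmetic. This sketch assumes the three private use ranges are U+E000 to U+F8FF in the BMP, and U+F0000 to U+FFFFD and U+100000 to U+10FFFD in planes 15 and 16; the constant and function names are illustrative.

```python
import unicodedata

# Private use ranges, inclusive at both ends.
BMP_PUA = (0xE000, 0xF8FF)
PLANE_15_PUA = (0xF0000, 0xFFFFD)
PLANE_16_PUA = (0x100000, 0x10FFFD)


def size(code_range: tuple[int, int]) -> int:
    """Number of code points in an inclusive range."""
    first, last = code_range
    return last - first + 1


bmp_count = size(BMP_PUA)
supplementary_count = size(PLANE_15_PUA) + size(PLANE_16_PUA)
print(bmp_count, supplementary_count)   # 6400 131068


def is_private_use(ch: str) -> bool:
    """Private use code points have general category 'Co'."""
    return unicodedata.category(ch) == "Co"


print(is_private_use("\ue000"), is_private_use("A"))   # True False
```

The last two code points of each plane are noncharacters, which is why each supplementary private use plane contributes 65,534 code points rather than 65,536.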

Unicode Character Encodings

Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits.

The Unicode Standard defines three encoding forms (UTF-8, UTF-16, and UTF-32) that allow the same data to be transmitted in a byte-, word-, or double-word-oriented format (i.e. in 8, 16, or 32 bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.
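As a rough illustration, the sketch below encodes one short string in each of the three forms, counts the code units, and converts between forms without loss. It uses Python's little-endian codecs (`utf-16-le`, `utf-32-le`) so that no byte order mark is prepended; that choice, the sample string, and the `code_units` helper are assumptions made for the example.

```python
text = "ñ€😀"   # three code points: U+00F1, U+20AC, U+1F600

# Encoding form -> (Python codec name, bytes per code unit)
CODECS = {
    "UTF-8": ("utf-8", 1),
    "UTF-16": ("utf-16-le", 2),
    "UTF-32": ("utf-32-le", 4),
}


def code_units(s: str, form: str) -> int:
    """Count the code units needed to represent s in the given form."""
    codec, width = CODECS[form]
    return len(s.encode(codec)) // width


for form, (codec, _) in CODECS.items():
    encoded = text.encode(codec)
    # Decoding returns exactly the original code points.
    assert encoded.decode(codec) == text
    print(f"{form}: {code_units(text, form)} code units, bytes {encoded.hex()}")

# Converting between forms is lossless: UTF-8 bytes -> UTF-16 bytes -> text.
utf8_bytes = text.encode("utf-8")
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")
print(utf16_bytes.decode("utf-16-le") == text)   # True
```

The same three code points take 9 code units in UTF-8, 4 in UTF-16 (the emoji needs a surrogate pair), and 3 in UTF-32.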