What is a Character Set?
A brief overview of character sets and the difference between a character set and character encoding.
A character set (also called a repertoire) is a collection of characters that have been grouped together for a specific purpose.
A character is a minimal unit of text that has semantic value. For example, the letter A is a character, as is the number 1. But the number 10 consists of two characters: the 1 character and the 0 character.
A character doesn't need to be a letter or number. It could be a symbol or icon of some sort. For example, the greater-than sign > is a character, as are each of the various smiley face 😀 characters.
Some character sets are used by multiple languages. For example, the Latin character set is used by many languages, including Italian, Spanish, Portuguese, French, German, Dutch, English, Danish, Swedish, Norwegian, and Icelandic. Other character sets (such as the Thai character set) are used in only one language.
Other character sets cover punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, and more.
The term "character set" tends to have slightly different conotations depending on the context, and it is often used quite loosely. The term is also avoided by some in favor of more precise terminology. This is in large part due to the introduction of the Universal Coded Character Set (UCS) as defined by both ISO/IEC 10646 and Unicode standards. The UCS now encompasses most other character sets, and it has changed the way that characters are encoded on computer systems.
What is Character Encoding?
The terms character set and character encoding are often used interchangeably despite the fact that they have different meanings.
A character encoding is a set of mappings between byte sequences in the computer and the characters in a character set. It defines how each character is encoded so that computer systems can read, write, and display that character as intended. Any system that supports a given encoding can work with the characters that the encoding covers.
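To make this concrete, here is a small sketch in Python (chosen purely for illustration) showing how the same character maps to different byte sequences under different encodings:

```python
# The same character maps to different byte sequences under different encodings.
char = "é"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(char.encode("utf-8"))      # b'\xc3\xa9'  (two bytes)
print(char.encode("latin-1"))    # b'\xe9'      (one byte)
print(char.encode("utf-16-le"))  # b'\xe9\x00'  (two bytes, little-endian)
```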
Probably the main reason these terms have been used interchangeably is that, historically, the same standard would define both the repertoire (the character set) and how it was to be encoded.
There have been many hundreds of different encoding systems over the years. This has caused all sorts of problems, especially when users from different systems tried to share data.
Users would often open a document only to find it unreadable, with strange-looking characters displayed throughout. This happened because the person who created the document used a different encoding system from the one used by the person trying to read it.
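A minimal Python sketch of that mismatch: bytes saved under one encoding but decoded under another come out garbled.

```python
# Bytes written as UTF-8 but read as Latin-1 come out as "mojibake".
original = "café"
data = original.encode("utf-8")   # the bytes the author actually saved

misread = data.decode("latin-1")  # the reader assumes the wrong encoding
print(misread)                    # cafÃ©  (the é has become two strange characters)

correct = data.decode("utf-8")    # decoding with the right encoding restores the text
print(correct)                    # café
```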
Things have changed a lot since the 1990s when this issue was prevalent.
The Unicode Standard
The Unicode Consortium was founded in 1991 to develop, extend and promote use of the Unicode Standard, which specifies the representation of text in modern software products and standards.
The Unicode Standard now encodes almost all characters used on computers in the world, and it is also a superset of most other character sets (i.e. the character sets of many existing international, national and corporate standards are incorporated within the Unicode Standard).
Unicode assigns each character a unique number (known as a code point) and a unique name, regardless of the platform, the program, or the language. This resolves the problems that arose when different encoding systems assigned the same code values to different characters.
This unique numbering system is referred to as a coded character set.
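As an illustrative sketch (again in Python), each character's code point and official Unicode name can be inspected directly:

```python
import unicodedata

# Every character has a unique code point and an official name,
# independent of platform, program, or language.
for char in ["A", "1", ">", "😀"]:
    print(f"U+{ord(char):04X}  {unicodedata.name(char)}")

# U+0041  LATIN CAPITAL LETTER A
# U+0031  DIGIT ONE
# U+003E  GREATER-THAN SIGN
# U+1F600  GRINNING FACE
```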
The Unicode Standard has been adopted by industry leaders such as Apple, HP, IBM, JustSystems, Microsoft, Oracle, SAP, Sun, Sybase, Unisys, and many others. It is also required by standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, and more.
See this Unicode reference for a list of commonly used Unicode characters, along with the code for adding them to a web page or other HTML document.
The ISO/IEC 10646 Standard
ISO/IEC 10646 specifies the Universal Coded Character Set (UCS).
The ISO Working Group responsible for ISO/IEC 10646 and the Unicode Consortium have been working together since 1991 to create one universal standard for coding multilingual text.
Although ISO/IEC 10646 is a separate standard from Unicode, both standards use the same character codes and encoding forms.
Also, each version of the Unicode Standard identifies the corresponding version of ISO/IEC 10646.
However, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications.
UTF-8
UTF-8 is a character encoding defined by the Unicode Standard that is capable of encoding every Unicode character. UTF stands for Unicode Transformation Format. The 8 refers to its 8-bit code units: each character is encoded as a sequence of one to four bytes.
The W3C recommends that web authors use UTF-8 as the character encoding for all web content.
Other UTF encodings include UTF-16, UTF-32, and the now largely obsolete UTF-7; however, UTF-8 is by far the most commonly used.
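For a rough comparison (a Python sketch, with the byte counts these calls return shown in comments), here is how much space the same characters occupy under UTF-8, UTF-16, and UTF-32:

```python
# Byte counts for the same characters under three UTF encodings.
for char in ["A", "é", "€", "😀"]:
    print(
        char,
        len(char.encode("utf-8")),      # 1, 2, 3, 4 bytes respectively
        len(char.encode("utf-16-le")),  # 2, 2, 2, 4 bytes
        len(char.encode("utf-32-le")),  # always 4 bytes
    )
```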
HTML5 Named Character References
HTML5 defines over 2,000 named character references that correspond to Unicode characters/code points. These named character references (often referred to as "named entities" or "HTML entities") can be used within HTML documents as an alternative to the character's numeric value.
For example, to display a copyright symbol, authors can use &copy; (the HTML named character reference), &#xA9; (the hexadecimal value of the Unicode code point), or &#169; (the decimal conversion of the hexadecimal value).
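All three forms resolve to the same character, U+00A9 COPYRIGHT SIGN. A small Python sketch using the standard html module illustrates this:

```python
import html

# The named, hexadecimal, and decimal references all resolve to the same character.
for ref in ["&copy;", "&#xA9;", "&#169;"]:
    print(ref, "->", html.unescape(ref))  # each one prints the © symbol
```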