UTF-8 binaries

Feb 7, 2015 at 12:11pm
I have a question concerning UTF-8 Unicode binary numbers. On this page:

http://www.utf8-chartable.de/unicode-utf8-table.pl?start=256&utf8=bin

In the UTF-8 table (the utf-8 (bin) section) the character Ā is 11000100 10000000 in binary. I believed that each character in Unicode is 8 bits long, so why is Ā 16 bits long (11000100 10000000)? I am confused.
Feb 7, 2015 at 1:48pm
I believed that each character in Unicode is 8 bits long
Wrong. Unicode characters do not have any inherent length; it depends on the encoding. In UTF-32 all characters are 4 bytes (32 bits) long. In UTF-16 all characters are 2 bytes (16 bits) long, but this encoding cannot represent all Unicode characters.
UTF-8 is a multibyte encoding: each character is represented by one or more bytes. Characters from the lower ASCII range are represented by one byte, the rest by more.
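For example, here is a minimal C++ sketch (assuming a pre-C++20 compiler, where u8 string literals are plain const char arrays) that prints the bytes of the UTF-8 encoding of Ā:

#include <cstdio>

int main()
{
    const char* s = u8"\u0100";  // "Ā" (U+0100) as a UTF-8 literal
    // Walk and print each byte of the encoding:
    for (const char* p = s; *p; ++p)
        std::printf("%02X ", static_cast<unsigned char>(*p));
    // Output: C4 80 -- that is, 11000100 10000000,
    // the two bytes from the table in the question.
}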
Feb 7, 2015 at 5:06pm
Wikipedia has a great explanation of UTF-8 and how it works:

http://en.wikipedia.org/wiki/UTF-8#Description
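In a nutshell, the scheme described there can be sketched like this (a rough illustration only; to_utf8 is a made-up helper name, and the sketch does no input validation):

#include <string>

// The leading byte's high bits say how many bytes the character takes;
// every continuation byte has the form 10xxxxxx.
std::string to_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {            // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

For the Ā from the original question, to_utf8(U'\u0100') produces exactly 0xC4 0x80 -- the 11000100 10000000 from the table.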
Feb 10, 2015 at 6:34am
Thanks for correcting my wrong idea. As for the Wikipedia article, it is very interesting. However, I found in it two notions that confuse me and that I am still unable to understand, namely:

1) the code point. Is it the character itself, or the position of the character in the Unicode table? and

2) the code unit.

What are they, in plain terms?
Last edited on Feb 10, 2015 at 6:37am
Feb 10, 2015 at 7:25am
The code point is the position of a character in the Unicode table, uniquely identifying one [pseudo]character.

The code unit is what a character's representation consists of. UTF-16 and UTF-32 use a single 16-bit or 32-bit code unit respectively; UTF-8 uses one or more 8-bit code units to denote a single character.
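A small C++11 sketch of that distinction (again assuming a pre-C++20 compiler; note that .size() counts code units, not characters):

#include <iostream>
#include <string>

int main()
{
    std::string    utf8  = u8"\u0100";  // "Ā" encoded in UTF-8
    std::u16string utf16 = u"\u0100";   // "Ā" encoded in UTF-16
    std::u32string utf32 = U"\u0100";   // "Ā" encoded in UTF-32

    // One code point (U+0100), but a different number of code units:
    std::cout << utf8.size()  << '\n';  // 2 (two 8-bit code units)
    std::cout << utf16.size() << '\n';  // 1 (one 16-bit code unit)
    std::cout << utf32.size() << '\n';  // 1 (one 32-bit code unit)
}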
Feb 10, 2015 at 5:40pm
MiiNiPaa, you mean that the code unit is the number of bits in the sequence used to encode one unique character? For example, the code unit of the character "a" is 8 bits or 1 byte. Is this what you mean?
Feb 10, 2015 at 6:52pm
you mean that the code unit is the number of bits in the sequence used to encode one unique character
No. It is a building block used to encode the value of a character. UTF-8 is a variable-length encoding: it can use more than one code unit to represent a character, depending on the character in question.
Feb 10, 2015 at 6:56pm
MiiNiPaa wrote:
UTF-16 and UTF-32 use a single 16-bit or 32-bit

<hypertechnicality>
UTF-16 uses 2 code units for code points above U+FFFF
</hypertechnicality>

@dilver:

Sort of. A code unit is a measure of how large the 'units' are for encoding text. This does not change per character, but instead changes per encoding.

The character "a" can be expressed in 1 code unit... but the size of that code unit varies depending on the encoding:

UTF-8 has 8-bit code units.
UTF-16 has 16-bit code units.
UTF-32 has 32-bit code units.
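A quick way to see that in code (a sketch; the u8 line again assumes a pre-C++20 compiler):

#include <iostream>

int main()
{
    // "a" is one code unit in each encoding, but the unit sizes differ:
    std::cout << sizeof(u8"a"[0]) << '\n';  // 1 (8-bit code unit)
    std::cout << sizeof(u"a"[0])  << '\n';  // 2 (16-bit code unit)
    std::cout << sizeof(U"a"[0])  << '\n';  // 4 (32-bit code unit)
}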
Last edited on Feb 10, 2015 at 7:33pm
Feb 11, 2015 at 3:20am
I don't think it's all that "hypertechnical" to know that UTF-16 is a variable-length encoding, just like UTF-8.
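For instance (a minimal sketch; U+1F600, the 😀 emoji, is just an arbitrary code point above U+FFFF):

#include <iostream>
#include <string>

int main()
{
    std::u16string s = u"\U0001F600";  // one code point above U+FFFF
    std::cout << s.size() << '\n';     // 2 -- it needs a surrogate pair
    std::cout << std::hex
              << static_cast<unsigned>(s[0]) << ' '   // d83d (high surrogate)
              << static_cast<unsigned>(s[1]) << '\n'; // de00 (low surrogate)
}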
Topic archived. No new replies allowed.