How much does one character weigh in a Windows encoding?

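In the ASCII table a character is stored in 8 bits, which is 1 byte (ASCII itself defines only 7-bit codes, 0 to 127). In the 8-bit Windows code pages, such as Windows-1251, one character also takes 1 byte.
In UTF-8 a character takes from 1 to 4 bytes (the original design allowed up to 6 bytes, but the current standard limits it to 4).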

Олеся Точемкина

Answered 1 October 2019

Determine the size of a text

The online calculator computes the size of a text in bits, bytes, and kilobytes. To convert to other units of data, use the online converter.

The information weight (size) of a character is determined for the following encodings:
Unicode UTF-8
Unicode UTF-16
ASCII, ANSI, Windows-1251
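
What such a calculator does can be sketched in a few lines of Python. This is only an illustration, not the calculator's actual code: the function name `text_size` and the sample string are made up here, `cp1251` is Python's codec name for Windows-1251, and UTF-16 is assumed to be counted without a byte order mark (hence `utf-16-le`).

```python
def text_size(text: str, encoding: str) -> dict:
    """Return the size of `text` in bits, bytes and kilobytes for one encoding."""
    n_bytes = len(text.encode(encoding))
    return {"bits": n_bytes * 8, "bytes": n_bytes, "kilobytes": n_bytes / 1024}


sample = "Привет, мир!"  # 9 Cyrillic letters + 3 ASCII characters
for enc in ("utf-8", "utf-16-le", "cp1251"):
    print(enc, text_size(sample, enc))
```

For this sample the three encodings give 21, 24 and 12 bytes respectively: Cyrillic letters take 2 bytes in UTF-8 and UTF-16 but only 1 byte in Windows-1251.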


Why does a line break in a text saved with Notepad on Windows take 4 bytes in Unicode (UTF-16) or 2 bytes in ANSI?
This is a historical convention that goes back to DOS: Windows uses the sequence 0D 0A (\r\n, carriage return followed by line feed) so that output looks the same whether it goes to a console or to a printer. For output to a console alone, \n is enough. In ANSI each of the two characters takes 1 byte; in UTF-16 each takes 2 bytes.
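
The byte counts are easy to check in Python. This is a small sketch; it assumes that Notepad's "ANSI" corresponds to a single-byte code page such as `cp1251` and that its "Unicode" option corresponds to little-endian UTF-16.

```python
newline = "\r\n"  # carriage return + line feed, the Windows line break

ansi_bytes = newline.encode("cp1251")        # single-byte code page
unicode_bytes = newline.encode("utf-16-le")  # 16-bit code units, no BOM

print(ansi_bytes.hex(" "), len(ansi_bytes))        # 0d 0a 2
print(unicode_bytes.hex(" "), len(unicode_bytes))  # 0d 00 0a 00 4
```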

Unicode has characters that take 4 bytes (in both UTF-8 and UTF-16), for example emoji: 🙃


Informatics homework solutions, grade 10, Bosova. § 14. Encoding of text information

12. A text in Russian, originally written in the 8-bit Windows code, was re-encoded into the 16-bit Unicode encoding. It is known that the text was printed on 128 pages, each containing 32 lines of 64 characters. What is the information volume of this text?

Answer

We have 128 pages of 32 lines with 64 characters each.

Each line holds 64 characters and each page holds 32 lines

=> 32 lines * 64 characters = 2048 characters (on one page)

=> 128 pages * 2048 characters = 262144 characters (on all 128 pages)

The text was originally written in Windows-1251, the 8-bit Windows code, where one character weighs 8 bits, that is, 1 byte.

=> 262144 characters = 262144 bytes

It was then re-encoded from Windows-1251 to UTF-16 (the 16-bit Unicode encoding), where one character weighs 16 bits, that is, 2 bytes

=> 262144 characters = 524288 bytes = 512 KB
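
The same arithmetic as a short Python check. The variable names are chosen here for illustration; the calculation assumes, as the problem does, that every character takes exactly 2 bytes after re-encoding, which holds for Russian text in UTF-16.

```python
pages, lines_per_page, chars_per_line = 128, 32, 64

chars = pages * lines_per_page * chars_per_line  # total characters in the text
size_windows = chars * 1                         # bytes in the 8-bit Windows code
size_unicode = chars * 2                         # bytes in 16-bit Unicode

print(chars, size_windows, size_unicode, size_unicode // 1024)
```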


I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require?

I assume that one Unicode character can contain every possible character from any language — am I correct? So how many bytes does it need per character?

And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they different versions of Unicode?

I read the Wikipedia article about Unicode but it is quite difficult for me. I am looking forward to seeing a simple answer.

asked Mar 13, 2011 at 15:02

nan

Strangely enough, nobody has pointed out how to calculate how many bytes a single Unicode character takes. Here is the rule for UTF-8 encoded strings:

Binary    Hex          Comments
0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
10xxxxxx  0x80..0xBF   Continuation byte: one of 1-3 bytes following the first
110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
11110xxx  0xF0..0xF7   First byte of a 4-byte character encoding

So the quick answer is: it takes 1 to 4 bytes, depending on the first one which will indicate how many bytes it’ll take up.
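
The table above translates directly into code. A minimal sketch, with a function name (`utf8_sequence_length`) invented for this example:

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, given its first byte."""
    if 0x00 <= first_byte <= 0x7F:
        return 1  # 0xxxxxxx: single-byte (ASCII) character
    if 0xC0 <= first_byte <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= first_byte <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= first_byte <= 0xF7:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte; 0xF8 and above never start a character
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 character")


for ch in "a©€💩":
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(" "), utf8_sequence_length(encoded[0]))
```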

answered Oct 26, 2015 at 15:38

paul.ago

You won’t see a simple answer because there isn’t one.

First, Unicode doesn’t contain «every character from every language», although it sure does try.

Unicode itself is a mapping: it defines codepoints, and a codepoint is a number that is usually associated with a character. I say usually because there are concepts like combining characters. You may be familiar with things like accents, or umlauts. Those can be used with another character, such as an a or a u, to create a new logical character. A character therefore can consist of 1 or more codepoints.

To be useful in computing systems we need to choose a representation for this information. Those are the various Unicode encodings, such as UTF-8, UTF-16LE, UTF-32 etc. They are distinguished largely by the size of their codeunits. UTF-32 is the simplest encoding: it has a codeunit that is 32 bits, which means an individual codepoint fits comfortably into a codeunit. The other encodings will have situations where a codepoint will need multiple codeunits, or where a particular codepoint can’t be represented in the encoding at all (this is a problem, for instance, with UCS-2).

Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and the normalization form. Normalization is a protocol for dealing with characters which have more than one representation (you can say "an 'a' with an accent", which is 2 codepoints, one of which is a combining char, or "accented 'a'", which is one codepoint).
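
The effect of the normalization form on size can be seen with Python's standard `unicodedata` module. This is a sketch; the choice of "á" as the sample character is arbitrary.

```python
import unicodedata

composed = "\u00e1"                                  # "accented 'a'" as one codepoint
decomposed = unicodedata.normalize("NFD", composed)  # 'a' followed by combining U+0301

for label, s in (("NFC", composed), ("NFD", decomposed)):
    codepoints = " ".join(f"U+{ord(c):04X}" for c in s)
    print(label, codepoints, len(s.encode("utf-8")), "bytes in UTF-8")
```

Both strings render as the same character, but the composed form is one codepoint (2 bytes in UTF-8) and the decomposed form is two codepoints (3 bytes in UTF-8).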

answered Mar 13, 2011 at 15:19

Logan Capaldo

I know this question is old and already has an accepted answer, but I want to offer a few examples (hoping it’ll be useful to someone).

As far as I know old ASCII characters took one byte per character.

Right. Actually, since ASCII is a 7-bit encoding, it supports 128 codes (95 of which are printable), so it only uses half of the 256 values a byte can hold (the high bit is always 0).

How many bytes does a Unicode character require?

Unicode just maps characters to codepoints. It doesn’t define how to encode them. A text file does not contain Unicode characters, but bytes/octets that may represent Unicode characters.

I assume that one Unicode character can contain every possible
character from any language — am I correct?

No. But almost. So basically yes. But still no.

So how many bytes does it need per character?

Same as your 2nd question.

And what do UTF-7, UTF-6, UTF-16 etc mean? Are they some kind Unicode
versions?

No, those are encodings. They define how bytes/octets should represent Unicode characters.

A couple of examples. If some of those cannot be displayed in your browser (probably because the font doesn’t support them), go to http://codepoints.net/U+1F6AA (replace 1F6AA with the codepoint in hex) to see an image.

    • U+0061 LATIN SMALL LETTER A: a
      • Nº: 97
      • UTF-8: 61
      • UTF-16: 00 61
    • U+00A9 COPYRIGHT SIGN: ©
      • Nº: 169
      • UTF-8: C2 A9
      • UTF-16: 00 A9
    • U+00AE REGISTERED SIGN: ®
      • Nº: 174
      • UTF-8: C2 AE
      • UTF-16: 00 AE
    • U+1337 ETHIOPIC SYLLABLE PHWA: ጷ
      • Nº: 4919
      • UTF-8: E1 8C B7
      • UTF-16: 13 37
    • U+2014 EM DASH:
      • Nº: 8212
      • UTF-8: E2 80 94
      • UTF-16: 20 14
    • U+2030 PER MILLE SIGN: ‰
      • Nº: 8240
      • UTF-8: E2 80 B0
      • UTF-16: 20 30
    • U+20AC EURO SIGN: €
      • Nº: 8364
      • UTF-8: E2 82 AC
      • UTF-16: 20 AC
    • U+2122 TRADE MARK SIGN: ™
      • Nº: 8482
      • UTF-8: E2 84 A2
      • UTF-16: 21 22
    • U+2603 SNOWMAN: ☃
      • Nº: 9731
      • UTF-8: E2 98 83
      • UTF-16: 26 03
    • U+260E BLACK TELEPHONE: ☎
      • Nº: 9742
      • UTF-8: E2 98 8E
      • UTF-16: 26 0E
    • U+2614 UMBRELLA WITH RAIN DROPS: ☔
      • Nº: 9748
      • UTF-8: E2 98 94
      • UTF-16: 26 14
    • U+263A WHITE SMILING FACE: ☺
      • Nº: 9786
      • UTF-8: E2 98 BA
      • UTF-16: 26 3A
    • U+2691 BLACK FLAG: ⚑
      • Nº: 9873
      • UTF-8: E2 9A 91
      • UTF-16: 26 91
    • U+269B ATOM SYMBOL: ⚛
      • Nº: 9883
      • UTF-8: E2 9A 9B
      • UTF-16: 26 9B
    • U+2708 AIRPLANE: ✈
      • Nº: 9992
      • UTF-8: E2 9C 88
      • UTF-16: 27 08
    • U+271E SHADOWED WHITE LATIN CROSS: ✞
      • Nº: 10014
      • UTF-8: E2 9C 9E
      • UTF-16: 27 1E
    • U+3020 POSTAL MARK FACE: 〠
      • Nº: 12320
      • UTF-8: E3 80 A0
      • UTF-16: 30 20
    • U+8089 CJK UNIFIED IDEOGRAPH-8089: 肉
      • Nº: 32905
      • UTF-8: E8 82 89
      • UTF-16: 80 89
    • U+1F4A9 PILE OF POO: 💩
      • Nº: 128169
      • UTF-8: F0 9F 92 A9
      • UTF-16: D8 3D DC A9
    • U+1F680 ROCKET: 🚀
      • Nº: 128640
      • UTF-8: F0 9F 9A 80
      • UTF-16: D8 3D DE 80
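
Entries like the ones in this list can be regenerated with Python (3.8 or later, for the separator argument of `bytes.hex`). This is a sketch with a helper name, `hex_bytes`, made up for the example; `utf-16-be` is used because the list shows UTF-16 in big-endian byte order.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Return the encoded bytes of `ch` as upper-case hex pairs."""
    return ch.encode(encoding).hex(" ").upper()


for ch in ("a", "\u00a9", "\u20ac", "\U0001F4A9"):
    print(f"U+{ord(ch):04X}  Nº: {ord(ch)}  "
          f"UTF-8: {hex_bytes(ch, 'utf-8')}  UTF-16: {hex_bytes(ch, 'utf-16-be')}")
```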

Okay I’m getting carried away…

Fun facts:

  • If you’re looking for a specific character, you can copy&paste it on http://codepoints.net/.
  • I wasted a lot of time on this useless list (but it’s sorted!).
  • MySQL has a charset called «utf8» which actually does not support characters longer than 3 bytes. So you can’t insert a pile of poo: depending on the SQL mode, the value is either silently truncated or rejected with an error. Use «utf8mb4» instead.
  • There’s a snowman test page (unicodesnowmanforyou.com).

answered May 1, 2014 at 15:17

basic6

Simply speaking, Unicode is a standard which assigns one number (called a code point) to every character in the world (it’s still a work in progress).

Now you need to represent these code points using bytes; that’s called a character encoding. UTF-8, UTF-16 and UTF-32 are ways of representing those characters.

UTF-8 is a multibyte character encoding. Characters take 1 to 4 bytes (the original design allowed up to 6 bytes, but the current standard stops at 4).

In UTF-32 each character takes 4 bytes.

UTF-16 uses one 16-bit unit for each character of the BMP (enough for most practical purposes) and a pair of 16-bit units for every other character. Java uses this encoding in its strings.

answered Mar 13, 2011 at 15:15

Zimbabao

In UTF-8:

1 byte:       0 -     7F     (ASCII)
2 bytes:     80 -    7FF     (all European plus some Middle Eastern)
3 bytes:    800 -   FFFF     (multilingual plane incl. the top 1792 and private-use)
4 bytes:  10000 - 10FFFF

In UTF-16:

2 bytes:      0 -   FFFF     (multilingual plane; D800 - DFFF are reserved for surrogates and are not characters)
4 bytes:  10000 - 10FFFF

In UTF-32:

4 bytes:      0 - 10FFFF

10FFFF is the last unicode codepoint by definition, and it’s defined that way because it’s UTF-16’s technical limit.

UTF-8 is restricted to the same limit, which fits in 4 bytes. The idea behind UTF-8’s encoding also works for 5- and 6-byte sequences that would cover codepoints up to 7FFFFFFF, i.e. half of what UTF-32 can, but those longer forms are no longer valid UTF-8.
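
The ranges above can be written as a small lookup and checked against Python's own encoders. The function name `encoded_sizes` is an invention for this sketch.

```python
def encoded_sizes(cp: int) -> dict:
    """Return the number of bytes codepoint `cp` takes in UTF-8, UTF-16 and UTF-32."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable codepoint (out of range or a surrogate)")
    if cp <= 0x7F:
        utf8 = 1
    elif cp <= 0x7FF:
        utf8 = 2
    elif cp <= 0xFFFF:
        utf8 = 3
    else:
        utf8 = 4
    utf16 = 2 if cp <= 0xFFFF else 4  # one code unit inside the BMP, two outside
    return {"utf-8": utf8, "utf-16": utf16, "utf-32": 4}


for cp in (0x41, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}", encoded_sizes(cp))
```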

answered Aug 27, 2016 at 12:18

John

In Unicode the answer is not easily given. The problem, as you already pointed out, are the encodings.

Given any English sentence without diacritic characters, the answer for UTF-8 would be as many bytes as characters and for UTF-16 it would be number of characters times two.

The only encoding where (as of now) we can make a fixed statement about the size is UTF-32. There it’s always 32 bits per code point, even though I imagine that code points are prepared for a future UTF-64 :)

What makes it so difficult are at least two things:

  1. composed characters, where instead of using the precomposed character that is already accented (À), a user decided to combine the base character with a combining accent (A followed by U+0300).
  2. multi-byte sequences. These are the method by which the UTF encodings can encode more values than the number of bits in their name would allow. E.g. UTF-8 designates certain lead bytes which on their own are invalid, but which, when followed by valid continuation bytes, describe a character beyond the single-byte range of 0..127. See the Examples and Overlong Encodings sections in the Wikipedia article on UTF-8.
    • The example given there is that the € character (code point U+20AC) is encoded as the three-byte sequence E2 82 AC; padding it out to the four-byte sequence F0 82 82 AC gives an overlong form.
    • The overlong form is not valid. An earlier version of this answer called both forms valid, which, as pointed out in a comment, was a misunderstanding on my part. The Wikipedia article reads: Longer encodings are called overlong and are not valid UTF-8 representations of the code point.
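
That the overlong form is rejected can be checked with Python's strict UTF-8 decoder. A minimal sketch; the helper name `is_valid_utf8` is made up here.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(bytes.fromhex("E2 82 AC")))     # shortest form of U+20AC
print(is_valid_utf8(bytes.fromhex("F0 82 82 AC")))  # overlong four-byte form
```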

answered Mar 13, 2011 at 15:10

0xC0000022L

Well I just pulled up the Wikipedia page on it too, and in the intro portion I saw «Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters), the now-obsolete UCS-2 (which uses two bytes for each character but cannot encode every character in the current Unicode standard)»

As this quote demonstrates, your problem is that you are assuming Unicode is a single way of encoding characters. There are actually multiple encodings of Unicode, and, again in that quote, one of them (UTF-8) even uses 1 byte per character for ASCII text, just like what you are used to.

So your simple answer that you want is that it varies.

answered Mar 13, 2011 at 15:09

Loduwijk

Unicode is a standard which provides a unique number for every character. These unique numbers are called code points, and they are assigned to all characters existing in the world (some are still to be added).

For different purposes, you might need to represent these code points in bytes (most programming languages do so), and here’s where character encoding kicks in.

UTF-8, UTF-16, UTF-32 and so on are all character encodings, and Unicode’s code points are represented in these encodings in different ways.

UTF-8 is a variable-width encoding, and characters encoded in it can occupy 1 to 4 bytes inclusive;

UTF-16 is a variable-width encoding, and characters encoded in it take either 2 or 4 bytes (one or two 16-bit code units). One code unit covers the BMP (Basic Multilingual Plane), which is enough for almost all cases; every other character needs two. Java uses UTF-16 encoding for its strings and characters;

UTF-32 has a fixed length and each character takes exactly 4 bytes (32 bits).

answered Jun 17, 2020 at 14:15

Giorgi Tsiklauri

For UTF-16, the character needs four bytes (two code units) if its first code unit is in the range 0xD800 to 0xDBFF; such a pair of code units is called a «surrogate pair.» More specifically, a surrogate pair has the form:

[0xD800 - 0xDBFF]  [0xDC00 - 0xDFFF]

where […] indicates a two-byte code unit with the given range. Anything <= 0xD7FF is one code unit (two bytes). Anything >= 0xE000 is also one code unit (two bytes). A code unit from 0xDC00 to 0xDFFF that does not follow a high surrogate is invalid.

See http://unicodebook.readthedocs.io/unicode_encodings.html, section 7.5.
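
How a codepoint above U+FFFF is split into the two code units can be sketched as follows. The function name `to_surrogate_pair` is chosen for this example; the arithmetic is the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves.

```python
from typing import Tuple


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a codepoint above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only codepoints U+10000..U+10FFFF need a surrogate pair")
    offset = cp - 0x10000               # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


high, low = to_surrogate_pair(0x1F4A9)
print(f"{high:04X} {low:04X}")  # D83D DCA9
```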

answered Jul 12, 2016 at 20:45

prewett

Check out this Unicode code converter. For example, enter 0x2009, where 2009 is the Unicode number for thin space, in the «0x… notation» field, and click Convert. The hexadecimal number E2 80 89 (3 bytes) appears in the «UTF-8 code units» field.

answered Oct 9, 2013 at 16:14

ma11hew28

From Wiki:

UTF-8, an 8-bit variable-width encoding which maximizes compatibility with ASCII;

UTF-16, a 16-bit, variable-width encoding;

UTF-32, a 32-bit, fixed-width encoding.

These are the three most popular encodings.

  • In UTF-8 each character is encoded into 1 to 4 bytes (the dominant encoding),
  • in UTF-16 each character is encoded into one or two 16-bit words, and
  • in UTF-32 every character is encoded as a single 32-bit word.

answered Nov 24, 2019 at 21:20

chikitin
