UTF-8 Byte Sizes

Good-to-Know Posts/Tech 2008. 7. 13. 01:13

UTF-8, the encoding most commonly used to represent Unicode, is variable-length: a single character takes anywhere from one to four bytes.

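EUC-KR, the most widely used legacy encoding for Korean, stores each Hangul character in 2 bytes,
whereas UTF-8 stores a Hangul character in 3 bytes. (This is what you see in Oracle and other applications.)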
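
The difference is easy to check directly. The following is a minimal sketch, assuming Python 3 and its built-in "euc-kr" and "utf-8" codecs; the sample character is just an illustration.

```python
# Encode one Hangul syllable in both encodings and compare the byte counts.
ch = "한"  # U+D55C, chosen only as a sample character

euc_kr_bytes = ch.encode("euc-kr")
utf8_bytes = ch.encode("utf-8")

print(len(euc_kr_bytes), euc_kr_bytes.hex())  # 2 c7d1
print(len(utf8_bytes), utf8_bytes.hex())      # 3 ed959c
```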

The Korean Wikipedia does not spell out these size categories very clearly, but the English Wikipedia does. Its summary reads as follows.
(http://en.wikipedia.org/wiki/UTF-8)

1. One byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F).
2. Two bytes are needed for Latin letters with diacritics and for characters from Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF).
3. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use).
4. Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice.

In other words:

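(1) One byte covers the 128 characters defined by ASCII. (Roughly: the English alphabet, digits, and basic punctuation.)
(2) Two bytes cover Latin letters with diacritics, plus Greek, Cyrillic, Hebrew, Arabic, and a few other alphabets.
(3) Three bytes cover the rest of the Basic Multilingual Plane, which holds most characters in everyday use. (This is where Hangul falls!)
(4) Four bytes cover characters outside that plane, which are rarely used in practice.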
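
The four categories above map onto code point ranges, so the byte count can be computed from the code point alone. Here is a rough sketch; the function name `utf8_length` is made up for this post, and the upper bounds U+FFFF and U+10FFFF come from the UTF-8 definition rather than from the quoted list. Surrogate code points (U+D800 to U+DFFF) cannot be encoded on their own and are not handled specially.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one Unicode code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:       # (1) US-ASCII
        return 1
    if code_point <= 0x7FF:      # (2) Latin with diacritics, Greek, Cyrillic, ...
        return 2
    if code_point <= 0xFFFF:     # (3) rest of the Basic Multilingual Plane
        return 3
    return 4                     # (4) the other planes


# Cross-check the table against Python's own encoder.
for ch in ["A", "é", "Ω", "한", "😀"]:
    print(ch, utf8_length(ord(ch)), len(ch.encode("utf-8")))
```

Each line of output should show the same number twice, since the encoder follows the same ranges.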

To sum up: in UTF-8, each Hangul character takes three bytes.
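
This matters most when a field is limited in bytes rather than characters. The sketch below is only an illustration under that assumption: the helper names `utf8_byte_length` and `truncate_utf8` and the 10-byte limit are made up for this post and are not taken from Oracle or any other product.

```python
def utf8_byte_length(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    cut = text.encode("utf-8")[:max_bytes]
    # A partial multi-byte sequence left at the end is dropped by errors="ignore".
    return cut.decode("utf-8", errors="ignore")


sample = "UTF-8 한글"                  # 6 ASCII characters + 2 Hangul characters
print(len(sample))                     # 8 characters
print(utf8_byte_length(sample))        # 12 bytes: 6 * 1 + 2 * 3
print(len(sample.encode("euc-kr")))    # 10 bytes: 6 * 1 + 2 * 2
print(truncate_utf8(sample, 10))       # UTF-8 한
```

With a 10-byte limit the second Hangul character no longer fits, even though the whole string would have fit in EUC-KR.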