How do I get the code point number of a given character in a UTF-8 string?

3 Answers

慕雪6442864

Scott Reynen wrote a function that converts UTF-8 to Unicode. I found it in the PHP documentation. It returns the input with each 2- or 3-byte character replaced by a %uXXXX escape; 4-byte sequences are not handled.

    function utf8_to_unicode( $str ) {
        $unicode = array();
        $values = array();
        $lookingFor = 1;

        for ( $i = 0; $i < strlen( $str ); $i++ ) {
            $thisValue = ord( $str[ $i ] );

            if ( $thisValue < ord('A') ) {
                // ASCII below 'A': keep the digits 0-9, percent-encode the rest
                if ( $thisValue >= ord('0') && $thisValue <= ord('9') ) {
                    $unicode[] = chr( $thisValue );
                } else {
                    $unicode[] = '%' . str_pad( dechex( $thisValue ), 2, '0', STR_PAD_LEFT );
                }
            } else {
                if ( $thisValue < 128 ) {
                    // remaining ASCII is copied through unchanged
                    $unicode[] = $str[ $i ];
                } else {
                    // first byte of a sequence: below 224 means 2 bytes, otherwise 3
                    if ( count( $values ) == 0 ) $lookingFor = ( $thisValue < 224 ) ? 2 : 3;

                    $values[] = $thisValue;

                    if ( count( $values ) == $lookingFor ) {
                        // strip the prefix bits and combine the data bits
                        $number = ( $lookingFor == 3 ) ?
                            ( ( $values[0] % 16 ) * 4096 ) + ( ( $values[1] % 64 ) * 64 ) + ( $values[2] % 64 ) :
                            ( ( $values[0] % 32 ) * 64 ) + ( $values[1] % 64 );

                        // always emit four hex digits after %u
                        $unicode[] = '%u' . str_pad( dechex( $number ), 4, '0', STR_PAD_LEFT );
                        $values = array();
                        $lookingFor = 1;
                    } // if
                } // if
            }
        } // for

        return implode( "", $unicode );
    } // utf8_to_unicode


蝴蝶刀刀

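Use an existing utility such as iconv, or whatever library comes with the language you are using.

If you insist on rolling your own solution, read up on the UTF-8 format. Basically, each code point is stored as 1 to 4 bytes, depending on the value of the code point. The ranges are as follows:

    U+0000 to U+007F:     1 byte:  0xxxxxxx
    U+0080 to U+07FF:     2 bytes: 110xxxxx 10xxxxxx
    U+0800 to U+FFFF:     3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    U+10000 to U+10FFFF:  4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

where each x is a data bit. So you can tell how many bytes make up each code point by looking at the first byte:

    If it begins with 0, it is a 1-byte character.
    If it begins with 110, it is a 2-byte character.
    If it begins with 1110, it is a 3-byte character.
    If it begins with 11110, it is a 4-byte character.
    If it begins with 10, it is a non-initial byte of a multi-byte character.
    If it begins with 11111, it is invalid.

Once you have worked out how many bytes the character has, the rest is just bit twiddling: mask off the prefix bits of each byte and shift the remaining data bits into place. Also note that UCS-2 cannot represent characters above U+FFFF.

Since you did not specify a language, here is some sample C code (error checking omitted):

    #include <stddef.h>  /* wchar_t */

    /* ERROR is a placeholder sentinel (assumed here to be U+FFFD);
       substitute whatever fits your own error handling. */
    #define ERROR ((wchar_t)0xFFFD)

    wchar_t utf8_char_to_ucs2(const unsigned char *utf8)
    {
      if(!(utf8[0] & 0x80))              // 0xxxxxxx
        return (wchar_t)utf8[0];
      else if((utf8[0] & 0xE0) == 0xC0)  // 110xxxxx
        return (wchar_t)(((utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F));
      else if((utf8[0] & 0xF0) == 0xE0)  // 1110xxxx
        return (wchar_t)(((utf8[0] & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F));
      else
        return ERROR;  // uh-oh, UCS-2 can't handle code points this high
    }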
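If you need code points above U+FFFF as well, the same bit twiddling extends to the 4-byte form. The following is only a minimal sketch under stated assumptions: `utf8_decode` is a hypothetical helper name, the result is stored in a `uint32_t` rather than `wchar_t`, and the continuation bytes are assumed to be present and well-formed (only the lead byte is checked).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the UTF-8 sequence starting at s.
 * Stores the code point in *cp and returns the number of bytes used,
 * or 0 if s[0] cannot start a sequence.
 * Assumption: continuation bytes exist and are valid (not checked). */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);

    if (s[0] < 0x80) {                    /* 0xxxxxxx */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {          /* 110xxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6)
            |  (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {          /* 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            |  (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0) {          /* 11110xxx + three 10xxxxxx */
        *cp = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6)
            |  (uint32_t)(s[3] & 0x3F);
        return 4;
    }
    return 0;  /* 10xxxxxx (continuation byte) or 11111xxx (invalid) */
}
```

Returning the byte count lets a caller step through a string one character at a time by advancing the pointer by that amount after each call.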


守候你守候我

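PHP code (assumes valid UTF-8; does not check for invalid UTF-8):

    function ord_utf8($c) {
        $b0 = ord($c[0]);
        if ($b0 < 0x80) {
            // 0xxxxxxx: single-byte (ASCII) character
            return $b0;
        }
        $b1 = ord($c[1]);
        if ($b0 < 0xE0) {
            // 110xxxxx 10xxxxxx
            return (($b0 & 0x1F) << 6) + ($b1 & 0x3F);
        }
        $b2 = ord($c[2]);
        if ($b0 < 0xF0) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return (($b0 & 0x0F) << 12) + (($b1 & 0x3F) << 6) + ($b2 & 0x3F);
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return (($b0 & 0x07) << 18) + (($b1 & 0x3F) << 12) + (($b2 & 0x3F) << 6) + (ord($c[3]) & 0x3F);
    }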
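A decoder like this works on one character at a time, so to look up the character at a given position in a longer string you first need the byte offset where that character starts. A rough sketch in C, assuming a NUL-terminated, valid UTF-8 string and a 0-based character index (`utf8_char_offset` is a made-up name for illustration): it counts every byte that is not a 10xxxxxx continuation byte as the start of a character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte offset of the character with 0-based index n
 * in a NUL-terminated UTF-8 string, or (size_t)-1 if the string holds
 * fewer than n + 1 characters.
 * Assumption: the input is valid UTF-8 (no validation is done). */
static size_t utf8_char_offset(const char *str, size_t n)
{
    assert(str != NULL);

    const unsigned char *s = (const unsigned char *)str;
    size_t seen = 0;  /* characters whose first byte has been passed */

    for (size_t i = 0; s[i] != '\0'; i++) {
        if ((s[i] & 0xC0) == 0x80)
            continue;          /* continuation byte: not a character start */
        if (seen == n)
            return i;          /* first byte of the character we want */
        seen++;
    }
    return (size_t)-1;         /* ran out of characters */
}
```

The bytes from that offset onward can then be handed to whichever single-character decoder you are using.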
