如何获取utf-8字符串中给定字符的代码点编号?

我想获取给定UTF-8字符串的UCS-2代码点。例如,单词“ hello”应变为类似“ 0068 0065 006C 006C 006F”的名称。请注意,字符可以来自任何语言,包括诸如东亚语言之类的复杂文字。


因此,问题归结为“将给定字符转换为其UCS-2代码点”


但是如何?拜托,由于我非常着急,任何帮助都将不胜感激。


提问者的答覆转录为答案


感谢您的答复,但这需要在PHP v 4或5中完成,而不是6。


该字符串将是来自表单字段的用户输入。


我想实现utf8to16或utf8decode的PHP版本,例如


function get_ucs2_codepoint($char)

{

    // calculation of ucs2 codepoint value and assign it to $hex_codepoint

    return $hex_codepoint;

}

您可以为我提供PHP的帮助,还是可以通过上述版本的PHP来帮助我?


临摹微笑
浏览 444回答 3
3回答

慕雪6442864

Scott Reynen编写了一个将UTF-8转换为Unicode的函数。我在PHP文档中发现了它。function utf8_to_unicode( $str ) {&nbsp; &nbsp; $unicode = array();&nbsp; &nbsp; &nbsp; &nbsp;&nbsp;&nbsp; &nbsp; $values = array();&nbsp; &nbsp; $lookingFor = 1;&nbsp; &nbsp; for ($i = 0; $i < strlen( $str ); $i++ ) {&nbsp; &nbsp; &nbsp; &nbsp; $thisValue = ord( $str[ $i ] );&nbsp; &nbsp; if ( $thisValue < ord('A') ) {&nbsp; &nbsp; &nbsp; &nbsp; // exclude 0-9&nbsp; &nbsp; &nbsp; &nbsp; if ($thisValue >= ord('0') && $thisValue <= ord('9')) {&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// number&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;$unicode[] = chr($thisValue);&nbsp; &nbsp; &nbsp; &nbsp; }&nbsp; &nbsp; &nbsp; &nbsp; else {&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;$unicode[] = '%'.dechex($thisValue);&nbsp; &nbsp; &nbsp; &nbsp; }&nbsp; &nbsp; } else {&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if ( $thisValue < 128)&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; $unicode[] = $str[ $i ];&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; else {&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if ( count( $values ) == 0 ) $lookingFor = ( $thisValue < 224 ) ? 2 : 3;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $values[] = $thisValue;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if ( count( $values ) == $lookingFor ) {&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $number = ( $lookingFor == 3 ) ?&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ( ( $values[0] % 16 ) * 4096 ) + ( ( $values[1] % 64 ) * 64 ) + ( $values[2] % 64 ):&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ( ( $values[0] % 32 ) * 64 ) + ( $values[1] % 64 );&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $number = dechex($number);&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $unicode[] = (strlen($number)==3)?"%u0".$number:"%u".$number;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $values = array();&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; $lookingFor = 1;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; } // if&nbsp; &nbsp; &nbsp; &nbsp; } // if&nbsp; &nbsp; }&nbsp; &nbsp; } // for&nbsp; &nbsp; return implode("",$unicode);} // utf8_to_unicode

蝴蝶刀刀

使用现有的实用程序(例如iconv)或您使用的语言随附的任何库。如果您坚持使用自己的解决方案,请阅读UTF-8格式。基本上,每个代码点都存储为1-4个字节,具体取决于代码点的值。范围如下:U + 0000 — U + 007F:1个字节:0xxxxxxxU + 0080 — U + 07FF:2个字节:110xxxxx 10xxxxxxU + 0800 — U + FFFF:3个字节:1110xxxx 10xxxxxx 10xxxxxxU + 10000 — U + 10FFFF:4个字节:11110xxx 10xxxxxx 10xxxxxx 10xxxxxx其中每个x是一个数据位。因此,您可以通过查看第一个字节来判断每个代码点由多少字节组成:如果它以0开头,则为1字节字符。如果以110开头,则为2字节字符。如果以1110开头,则为3字节字符。如果以11110开头,则为4字节字符。如果以10开头,则为多字节字符的非初始字节。如果以11111开头,则为无效字符。一旦确定了字符中有多少个字节,就随便摆个位。另请注意,UCS-2不能表示U + FFFF以上的字符。由于您未指定语言,因此下面是一些示例C代码(省略了错误检查):wchar_t utf8_char_to_ucs2(const unsigned char *utf8){&nbsp; if(!(utf8[0] & 0x80))&nbsp; &nbsp; &nbsp; // 0xxxxxxx&nbsp; &nbsp; return (wchar_t)utf8[0];&nbsp; else if((utf8[0] & 0xE0) == 0xC0)&nbsp; // 110xxxxx&nbsp; &nbsp; return (wchar_t)(((utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F));&nbsp; else if((utf8[0] & 0xF0) == 0xE0)&nbsp; // 1110xxxx&nbsp; &nbsp; return (wchar_t)(((utf8[0] & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F));&nbsp; else&nbsp; &nbsp; return ERROR;&nbsp; // uh-oh, UCS-2 can't handle code points this high}

守候你守候我

PHP代码(假定有效的utf-8,不检查无效的utf-8):function ord_utf8($c) {&nbsp; &nbsp; $b0 = ord($c[0]);&nbsp; &nbsp; if ( $b0 < 0x10 ) {&nbsp; &nbsp; &nbsp; &nbsp; return $b0;&nbsp; &nbsp; &nbsp; &nbsp; }&nbsp; &nbsp; $b1 = ord($c[1]);&nbsp; &nbsp; if ( $b0 < 0xE0 ) {&nbsp; &nbsp; &nbsp; &nbsp; return (($b0 & 0x1F) << 6) + ($b1 & 0x3F);&nbsp; &nbsp; &nbsp; &nbsp; }&nbsp; &nbsp; return (($b0 & 0x0F) << 12) + (($b1 & 0x3F) << 6) + (ord($c[2]) & 0x3F);&nbsp; &nbsp; }
打开App,查看更多内容
随时随地看视频慕课网APP