
Truncate text containing HTML, ignoring tags

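I want to truncate some text (loaded from a database or a text file), but it contains HTML, so the tags are counted as well and less text is returned. The cut can also leave tags unclosed or only partly closed (so Tidy may not work properly, and there is still less content). How can I truncate based on the text alone (and probably stop on reaching a table, since that could cause more complex problems)?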


substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."

results in:


Hello, my <strong>name</st...

What I want is:


Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...

How can I do this?


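Although my question is about how to do this in PHP, it would be good to know how to do it in C# as well. Either would be fine, since I think I could port the method over (unless it is a built-in method).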


Also note that I have included an HTML entity, &acute;, which must be counted as a single character (rather than 7 characters, as in this example).


strip_tags is a fallback, but I would lose the formatting and links, and the HTML entity problem would remain.
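To make the entity problem concrete, here is a small sketch of the strip_tags fallback applied to the sample string. It assumes UTF-8 input and that the mbstring extension is available. The variable names are illustrative only.

```php
<?php
$html = 'Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.';

// Fallback from the question: drop the tags, then cut by bytes.
$plain = strip_tags($html);
echo substr($plain, 0, 26), "\n";               // Hello, my name is Sam. I&a

// Decoding entities first makes &acute; count as one character.
$decoded = html_entity_decode($plain, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo mb_substr($decoded, 0, 26, 'UTF-8'), "\n"; // Hello, my name is Sam. I´m

// The raw entity is 7 bytes long, which is why the byte-based cut lands inside it.
echo strlen('&acute;'), "\n";                   // 7
```

The second cut is correct in terms of visible text, but the strong and em markup is gone, which is the loss of formatting the question wants to avoid.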


白猪掌柜的
758 views, 3 answers
3 Answers

潇湘沐

I wrote a function that truncates HTML as you suggested, but instead of printing the result it stores it in a string variable. It also handles HTML entities.

    /**
     * Truncates HTML and then cleans up the end of it.
     * Truncates by counting characters outside of HTML tags.
     *
     * @author alex lockwood, alex dot lockwood at websightdesign
     *
     * @param string $str the string to truncate
     * @param int $len the number of characters
     * @param string $end the end string for truncation
     * @return string $truncated_html
     */
    // Written as a class method; drop "public static" to use it as a plain function.
    public static function truncateHTML($str, $len, $end = '&hellip;'){
        //find all tags
        $tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i';  //match html tags and entities
        preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);
        $i = 0;
        $openTags = array();   //stack of tags opened before the cut
        $warnings = array();
        //loop through each found tag that is within the $len, add those characters to the len,
        //also track open and closed tags
        // $matches[$i][0] = the whole tag string  --the only applicable field for html entities
        // IF it is not matching an &htmlentity; the following apply
        // $matches[$i][1] = the start of the tag, either '<' or '</'
        // $matches[$i][2] = the tag name
        // $matches[$i][3] = the end of the tag
        // $matches[$i][$j][0] = the string
        // $matches[$i][$j][1] = the string offset
        //check that the match exists before reading its offset
        while(!empty($matches[$i]) && $matches[$i][0][1] < $len){
            $len = $len + strlen($matches[$i][0][0]);
            //an entity still counts as one visible character
            if(substr($matches[$i][0][0],0,1) == '&')
                $len = $len-1;

            //if $matches[$i][2] is undefined then it is an html entity, want to ignore those for tag counting
            //ignore empty/singleton tags for tag counting
            if(!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0],array('br','img','hr', 'input', 'param', 'link'))){
                //double check
                if(substr($matches[$i][3][0],-1) !='/' && substr($matches[$i][1][0],-1) !='/')
                    $openTags[] = $matches[$i][2][0];
                elseif(end($openTags) == $matches[$i][2][0]){
                    array_pop($openTags);
                }else{
                    $warnings[] = "html has some tags mismatched in it:  $str";
                }
            }
            $i++;
        }

        //build the closing tags, innermost first
        $closeTagString = '';
        if (!empty($openTags)){
            $openTags = array_reverse($openTags);
            foreach ($openTags as $t){
                $closeTagString .= "</".$t . ">";
            }
        }

        //default: return the input unchanged when nothing can be cut
        $truncated_html = $str;
        if(strlen($str) > $len){
            //find the first space at or after the adjusted length
            $lastWord = strpos($str, ' ', $len);
            if ($lastWord) {
                //truncate at that space
                $str = substr($str, 0, $lastWord);
                //finds last character
                $last_character = (substr($str, -1, 1));
                //add the end text
                $truncated_html = ($last_character == '.' ? $str : ($last_character == ',' ? substr($str, 0, -1) : $str) . $end);
                //restore any open tags
                $truncated_html .= $closeTagString;
            }
        }
        return $truncated_html;
    }

30秒到达战场

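100% accurate, but pretty difficult approach:

1. Iterate over the characters using the DOM
2. Use DOM methods to remove the remaining elements
3. Serialize the DOM

Easy brute-force approach:

1. Split the string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_SPLIT_DELIM_CAPTURE.
2. Measure the text length you want (it will be every second element from the split; you might use html_entity_decode() to help measure accurately)
3. Cut off the string (trim &[^\s;]+$ at the end to get rid of a possibly chopped entity)
4. Fix it with HTML Tidy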
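The brute-force approach can be sketched roughly as follows. This is a minimal illustration, not the answerer's code. It assumes UTF-8 input. The function name truncateHtmlText and the list of void tags are hypothetical. To keep the sketch free of extensions, the final HTML Tidy step is replaced by a simple stack of open tags, and entities are counted as one character while scanning rather than trimmed afterwards.

```php
<?php
// Hypothetical helper: keep at most $limit visible characters of $html.
function truncateHtmlText(string $html, int $limit, string $end = '...'): string
{
    // With one capture group, odd indexes hold tags and even indexes hold text.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $void = ['br', 'img', 'hr', 'input', 'param', 'link', 'meta']; // assumed list
    $out = '';
    $open = [];        // stack of currently open tag names
    $count = 0;        // visible characters emitted so far
    $truncated = false;

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Tag: copy it through and update the stack.
            if (preg_match('/^<\s*(\/?)\s*([a-zA-Z][\w:-]*)/', $part, $m)) {
                $name = strtolower($m[2]);
                $selfClosing = substr($part, -2) === '/>' || in_array($name, $void, true);
                if ($m[1] === '/') {
                    if (end($open) === $name) {
                        array_pop($open);
                    }
                } elseif (!$selfClosing) {
                    $open[] = $name;
                }
            }
            $out .= $part;
            continue;
        }
        // Text: an entity such as &acute; counts as a single character.
        preg_match_all('/&[a-zA-Z0-9#]+;|./su', $part, $m);
        foreach ($m[0] as $char) {
            if ($count === $limit) {
                $truncated = true;
                break 2; // stop at the first character past the limit
            }
            $out .= $char;
            $count++;
        }
    }

    if ($truncated) {
        $out .= $end;
    }
    // Stand-in for the Tidy step: close whatever is still open, innermost first.
    foreach (array_reverse($open) as $name) {
        $out .= '</' . $name . '>';
    }
    return $out;
}

echo truncateHtmlText('Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.', 26), "\n";
// Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...
```

Because only text fragments are counted, tags never use up the character budget. Markup this sketch does not handle well, such as comments containing ">" or badly nested tags, is where a real parser or Tidy would still be needed.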