Removing HTML tags from a string in R


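I am trying to read web page source into R and process it as strings. I want to extract the paragraphs and remove the HTML tags from the paragraph text. I am running into the following problem:

I tried to implement a function that removes the HTML tags: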

library(stringr)  # provides str_locate_all() and str_replace_all()

cleanFun <- function(fullStr)
{
  # find the locations of the opening "<" and closing ">" of each tag
  tagLoc <- cbind(str_locate_all(fullStr, "<")[[1]][, 2],
                  str_locate_all(fullStr, ">")[[1]][, 1])

  # create storage for tag strings
  tagStrings <- list()

  # extract and store tag strings
  for (i in 1:dim(tagLoc)[1])
  {
    tagStrings[i] <- substr(fullStr, tagLoc[i, 1], tagLoc[i, 2])
  }

  # remove tag strings from paragraph
  newStr <- fullStr
  for (i in 1:length(tagStrings))
  {
    newStr <- str_replace_all(newStr, tagStrings[[i]][1], "")
  }

  return(newStr)
}

This works for some tags but not for all of them. One example where it fails is the following string:

test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"

The goal is to obtain:

cleanFun(test)="junk junk junk junk"

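However, this does not seem to work. I thought it might have something to do with string length or escape characters, but I could not find a solution involving those.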
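One likely explanation, offered as a sketch and not as a confirmed diagnosis: the extracted tag text is passed to the replacement function as a regular expression, so the parentheses in "(mathematics)" are read as a capture group and never match the literal "(" and ")" in the string. The snippet below reproduces this with base R's gsub(), which avoids any package dependency. The variable names tag, as_regex, and as_literal are made up for illustration.

```r
# The test string from the question and the tag text it contains.
test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
tag  <- "<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\">"

# Treated as a regular expression, "(mathematics)" is a group, so the
# pattern looks for "mathematics" without the surrounding parentheses
# and finds nothing to replace.
as_regex <- gsub(tag, "", test)

# With fixed = TRUE the pattern is matched literally, character by character.
as_literal <- gsub(tag, "", test, fixed = TRUE)

print(identical(as_regex, test))  # TRUE: the string came back unchanged
print(as_literal)
```

In stringr, wrapping the pattern in fixed() requests the same literal matching, so replacing the pattern argument with fixed(tagStrings[[i]][1]) in the original loop would be one way to test this hypothesis.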

慕码人2483693


HUWWW

This can be done simply with regular expressions and the grep family of functions:

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

This also works with multiple HTML tags in the same string!
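As a quick check, here is the regex-based function applied to the test string from the question and to a string with several tags. The second input, multi, is a made-up example and not part of the original post.

```r
# Regex-based tag stripper, as given in the answer.
cleanFun <- function(htmlString) {
  # "<.*?>" matches "<", then as few characters as possible, then ">"
  gsub("<.*?>", "", htmlString)
}

# The string that the loop-based version could not handle.
test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
print(cleanFun(test))

# A hypothetical string with several tags in a row.
multi <- "<p>Some <b>bold</b> text</p>"
print(cleanFun(multi))
```

The lazy quantifier `.*?` is what keeps each match confined to a single tag. A greedy `.*` would run from the first "<" to the last ">" in the string and delete the text in between. Note that a simple pattern like this assumes no literal ">" appears inside an attribute value.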

