WebClient.DownloadString() returns a string with peculiar characters

I have an issue with some content that I am downloading from the web for a screen-scraping tool I am building.


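In the code below, the string returned by the WebClient DownloadString method contains some odd characters in the downloaded source of a few (not all) web sites.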


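I recently added the HTTP headers shown below. Previously, the same code was called without the headers, with the same result. I have not tried variations on the 'Accept-Charset' header, and I don't know much about text encoding beyond the basics.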


The characters, or character sequences, that I am referring to are:


"ï»¿"


"Â"


These characters do not appear when you use "View Source" in a web browser. What could be causing this, and how can I fix it?


string urlData = String.Empty;
WebClient wc = new WebClient();

// Add headers to impersonate a web browser. Some web sites
// will not respond correctly without these headers
wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12");
wc.Headers.Add("Accept", "*/*");
wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

urlData = wc.DownloadString(uri);


MMMHUHU
3 Answers

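德玛西亚99

"ï»¿" is the Windows-1252 representation of the octets EF BB BF. That is the UTF-8 byte order mark, which implies that the remote web page is encoded in UTF-8 but you are reading it as if it were Windows-1252. According to the documentation, WebClient.DownloadString uses WebClient.Encoding as its encoding when it converts the remote resource into a string. Set it to System.Text.Encoding.UTF8 and things should, in theory, work.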
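
The effect described above can be reproduced offline. The following is a minimal sketch, not code from the question: the class and method names (MojibakeDemo, DecodeAsLatin1, DecodeAsUtf8, CreateUtf8Client) are made up for illustration, and ISO-8859-1 stands in for Windows-1252 because the two encodings agree on every byte used here. The sample bytes are the UTF-8 byte order mark followed by "a", a no-break space (C2 A0 in UTF-8, which is the usual source of a stray "Â"), and "b".

```csharp
using System;
using System.Net;
using System.Text;

public static class MojibakeDemo
{
    // Decodes the bytes with a single-byte Western encoding.
    // Applied to UTF-8 input, this produces the stray characters.
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // Decodes the same bytes as UTF-8, which is what the page actually uses.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Hypothetical helper: a WebClient whose DownloadString decodes as UTF-8.
    // This assumes the target pages really are UTF-8 encoded.
    public static WebClient CreateUtf8Client()
    {
        var client = new WebClient();
        client.Encoding = Encoding.UTF8;
        return client;
    }
}
```

With the bytes EF BB BF 61 C2 A0 62, DecodeAsLatin1 returns "ï»¿aÂ" followed by a no-break space and "b", while DecodeAsUtf8 returns the byte order mark character U+FEFF, then "a", a no-break space, and "b". Note that Encoding.GetString keeps the byte order mark as a character, so it may still need trimming from the start of the string.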

慕丝7291255

The way WebClient.DownloadString is implemented is very dumb. It should get the character encoding from the Content-Type header of the response, but instead it expects the developer to specify the expected encoding beforehand. I don't know what the developers of this class were thinking.

I created a helper class that retrieves the encoding name from the Content-Type header of the response:

using System;
using System.Collections.Specialized;
using System.Linq;
using System.Net;
using System.Text;

public static class WebUtils
{
    public static Encoding GetEncodingFrom(
        NameValueCollection responseHeaders,
        Encoding defaultEncoding = null)
    {
        if (responseHeaders == null)
            throw new ArgumentNullException("responseHeaders");

        // Note that key lookup is case-insensitive
        var contentType = responseHeaders["Content-Type"];
        if (contentType == null)
            return defaultEncoding;

        var contentTypeParts = contentType.Split(';');
        if (contentTypeParts.Length <= 1)
            return defaultEncoding;

        var charsetPart =
            contentTypeParts.Skip(1).FirstOrDefault(
                p => p.TrimStart().StartsWith("charset", StringComparison.InvariantCultureIgnoreCase));
        if (charsetPart == null)
            return defaultEncoding;

        var charsetPartParts = charsetPart.Split('=');
        if (charsetPartParts.Length != 2)
            return defaultEncoding;

        // Also strip optional quotes, as in charset="utf-8"
        var charsetName = charsetPartParts[1].Trim().Trim('"');
        if (charsetName == "")
            return defaultEncoding;

        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch (ArgumentException ex)
        {
            throw new UnknownEncodingException(
                charsetName,
                "The server returned data in an unknown encoding: " + charsetName,
                ex);
        }
    }
}

(UnknownEncodingException is a custom exception class; feel free to replace it with InvalidOperationException or whatever else you prefer.)

Then the following extension method for the WebClient class does the trick:

public static class WebClientExtensions
{
    public static string DownloadStringAwareOfEncoding(this WebClient webClient, Uri uri)
    {
        // Download the raw bytes first so the response headers are available
        var rawData = webClient.DownloadData(uri);
        var encoding = WebUtils.GetEncodingFrom(webClient.ResponseHeaders, Encoding.UTF8);
        return encoding.GetString(rawData);
    }
}

So in your example you would do:

urlData = wc.DownloadStringAwareOfEncoding(uri);

...and that's it.
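
As a variation on the manual splitting above, the charset can also be read with the framework's System.Net.Mime.ContentType class. This is only a sketch of an alternative, not the answer's own method: the names CharsetSketch and EncodingFromContentType are made up for illustration, and unlike the helper above it quietly returns the fallback for a malformed header or an unknown charset instead of throwing.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class CharsetSketch
{
    // Returns the encoding named by the charset parameter of a Content-Type
    // header value, or the fallback when no usable charset is present.
    public static Encoding EncodingFromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentType))
            return fallback;

        try
        {
            // ContentType parses values such as "text/html; charset=utf-8";
            // CharSet is null when the header carries no charset parameter.
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset)
                ? fallback
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            return fallback; // header value could not be parsed
        }
        catch (ArgumentException)
        {
            return fallback; // charset name is not a known encoding
        }
    }
}
```

It could be combined with wc.DownloadData(uri) and wc.ResponseHeaders["Content-Type"] in the same way as the extension method above, passing Encoding.UTF8 as the fallback.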