Text Mining on Huxiu (Python Crawler & LDA Topic Model) - Original Notes - imooc (慕课网)

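I. Background

1.1 Site Selection

Huxiu (虎嗅) is a relatively niche internet media outlet. It is not as widely known as Jinri Toutiao (今日头条), which excels at personalized recommendation, or The Paper (澎湃新闻), which focuses on current affairs. Still, its content is substantive and of good quality: it carries the latest news alongside in-depth analysis of particular viewpoints, and many readers also follow its WeChat official account.

Besides, this article is mainly about demonstrating and learning the approach and overall framework of text mining. Whether the data comes from Huxiu or from The Paper matters much less.

1.2 Purpose

1. Get familiar with the analysis workflow, and try out and learn some interesting things;
2. Show the beauty of data, its subtleties, and the strong visual impact it can deliver;
3. Use text mining to analyze Huxiu's editorial direction, its focus areas, the writing styles of different articles, and the characteristics that popular articles share.

1.3 Data Analysis Tools Used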

Python 3.5.0
PyCharm 2018.3
Pyspider 0.3.10
MongoDB 4.0.4
Studio-3T
jieba
WordCloud
R 3.5.1
RStudio 3.5.1
Jupyter 4.4.0

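II. Preparation

2.1 Pyspider

2.1.1 Introduction to Pyspider

Pyspider is an efficient, easy-to-use crawler framework that ships with a WebUI. In the WebUI you can write crawler code, manage crawler status, and inspect the tasks currently being scheduled.

PyQuery is built in for parsing, and you can also use any other HTML parsing package you like;
Supported databases include MongoDB, MySQL, Redis, and SQLite;
Pages rendered by JavaScript can be crawled;
Multi-process handling.

2.1.2 Installing and Configuring Pyspider

To save space, here is a link instead: [Pyspider installation and configuration] https://blog.csdn.net/qq_42336565/article/details/80697482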

2.2 MongoDB

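2.2.1 Introduction to MongoDB

MongoDB is a database based on distributed file storage, written in C++. It aims to provide a scalable, high-performance data storage solution for web applications.

MongoDB sits between relational and non-relational databases: among non-relational databases it is the most feature-rich and the closest to a relational database.

2.2.2 Installing and Configuring MongoDB

Install it by following this tutorial:
https://jingyan.baidu.com/article/a3f121e493e592fc9052bbfe.html One point worth stressing concerns step 10:

[Figure: MongoDB installation screenshot]

If at this step you willfully choose a custom installation path, as you might with other software, or if at this step:

[Figure: MongoDB installation screenshot]

you tick Install MongoDB Compass in the lower-left corner, you may be in trouble. During the installation that follows:

[Figure: MongoDB installation screenshot]

the progress bar may get stuck at around 70% and never move. This problem tormented me for a long time, and in the end I gave in and used the default installation path. MongoDB really is temperamental!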

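III. Data Acquisition and Preprocessing

3.1 Data Crawling

The articles on Huxiu's home page are hand-picked by the editors, so they are quite representative and reflect the site as a whole. This article therefore crawls the home-page articles from [Huxiu] https://www.huxiu.com/, eventually using Pyspider.

3.1.1 Using PyCharm

As usual, we start in PyCharm and inspect the source of the Huxiu page:

[Figure: HuXiu.png]

Set the request headers (a random User-Agent is chosen for each request; no actual proxy server is involved)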

import random
import urllib.request

# Pool of User-Agent strings; one is picked at random for each request
my_headers = [
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Win64; x64; Trident/6.0)"
]

def get_one_page(my_headers, url):
    # Send a GET request with a randomly chosen User-Agent
    # (the original also added a bogus "GET" header, which is dropped here)
    random_header = random.choice(my_headers)
    req = urllib.request.Request(url)
    req.add_header("User-Agent", random_header)
    response = urllib.request.urlopen(req)
    return response

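Extract the article links from the home page

import re

def get_href(html):
    # Capture the href of every article block on the home page
    pattern = re.compile('<div class="mod-b mod-art clearfix "'
                         '.*?"transition"  href="(.*?)"'
                         '.*?</div>', re.S)
    items = re.findall(pattern, html)
    return items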
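To make the pattern easier to follow, here is a minimal sketch that runs the same non-greedy, `re.S`-flagged pattern against a small hand-written HTML fragment. The fragment and the helper name `extract_hrefs` are made up for illustration and are not Huxiu's real markup.

```python
import re

# Hand-written fragment imitating the structure the pattern expects
# (an assumption for illustration, not real Huxiu markup).
SAMPLE_HTML = """
<div class="mod-b mod-art clearfix ">
  <a class="transition"  href="/article/1.html">first</a>
</div>
<div class="mod-b mod-art clearfix ">
  <a class="transition"  href="/article/2.html">second</a>
</div>
"""

def extract_hrefs(html):
    # re.S lets "." match newlines; ".*?" is non-greedy, so each match
    # stops at the nearest following token instead of the last one.
    pattern = re.compile('<div class="mod-b mod-art clearfix "'
                         '.*?"transition"  href="(.*?)"'
                         '.*?</div>', re.S)
    return pattern.findall(html)

hrefs = extract_hrefs(SAMPLE_HTML)
print(hrefs)
```

Without `re.S` the pattern would find nothing here, because the `div` and the link sit on different lines. Patterns like this are also brittle: a single changed class name or extra space in the page breaks them.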

Parse each article page with a regular expression

def parse_one_page(href):
    # href holds the HTML text of one article page
    pattern = re.compile('<div class="article-wrap">'
                         '.*?class="t-h1">(.*?)</h1>'
                         '.*?article-time pull-left">(.*?)</span>'
                         '.*?article-share pull-left">(.*?)</span>'
                         '.*?article-pl pull-left">(.*?)</span>'
                       #  '.*?text-remarks.*?</p><p><br/></p><p>(.*?)<!--.*?认证-->'
                         '.*?author-name.*?<a href=".*?" target="_blank">(.*?)</a>'
                         '.*?author-one">(.*?)</div>'
                         '.*?author-article-pl.*?target="_blank">(.*?)</a></li>'
                         '.*?</div>', re.S)

Convert the captured values into key-value pairs (these lines continue the body of parse_one_page)

    items = re.findall(pattern, href)
    for item in items:
        yield {
            'title': item[0].strip(),
            'time': item[1],
            'share': item[2][2:],      # drop the two-character label prefix
            'recoment': item[3][2:],   # drop the two-character label prefix
            # 'content': re.compile(r'<[^>]+>', re.S).sub('', item[4]).strip(),
            'anthor': item[4].strip(),
            'intro': item[5],
            'passNum': item[6]
        }

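Loop over the links and crawl every article on the first page

# url_html is presumably the list of links returned by get_href() for the home page
for i in range(len(url_html)):
    url_ord = "https://www.huxiu.com" + url_html[i]
    ord_text = get_one_page(my_headers, url_ord).read().decode('utf-8')
    for item in parse_one_page(ord_text):
        print(item)
        write_to_file(item)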

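Save the results to the file text.txt

import json

def write_to_file(content):
    # Append one JSON object per line; the with block closes the file itself
    with open('text.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')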
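Writing one JSON object per line makes the file easy to load again later. The sketch below shows the round trip. The `path` parameter and the `read_from_file` helper are additions for illustration, and the two records reuse only the title and share values from the crawl output.

```python
import json
import os
import tempfile

def write_to_file(content, path):
    # Append one JSON object per line; ensure_ascii=False keeps Chinese readable
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def read_from_file(path):
    # Parse each non-empty line back into a dict
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Use a temporary directory so the example does not touch a real text.txt
path = os.path.join(tempfile.mkdtemp(), 'text.txt')
write_to_file({'title': '裁员凶猛', 'share': '3'}, path)
write_to_file({'title': '罗玉凤不认命', 'share': '25'}, path)

records = read_from_file(path)
print(records)
```

Because the file is opened in append mode, rerunning the crawler adds to the existing file rather than replacing it, so duplicates need to be handled when reading.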

Crawl results for the first page:

{"title": "裁员凶猛", "time": "2018-12-10 20:35", "share": "3", "recoment": "2", "anthor": "华夏时报©", "intro": "", "passNum": "54篇文章"}
{"title": "新东方烹饪学校都要上市了，你竟然还看不起职校", "time": "2018-12-10 20:31", "share": "3", "recoment": "0", "anthor": "敲敲格", "intro": "空山无人", "passNum": "150篇文章"}
{"title": "互联网太可怕，我还是做回煤老板吧", "time": "2018-12-10 20:22", "share": "6", "recoment": "2", "anthor": "故事FM©", "intro": "", "passNum": "16篇文章"}
{"title": "《海王》，大型“海底捞”现场", "time": "2018-12-10 20:01", "share": "3", "recoment": "2", "anthor": "mrpuppybunny", "intro": "", "passNum": "316篇文章"}
{"title": "从成龙说起：这届中国男明星不行", "time": "2018-12-10 19:51", "share": "10", "recoment": "2", "anthor": "腾讯《大家》©", "intro": "精选大家文章，畅享阅读时光。", "passNum": "119篇文章"}
{"title": "互联网寒冬，不是深渊，而是阶梯", "time": "2018-12-10 19:36", "share": "14", "recoment": "2", "anthor": "瞎说职场©", "intro": "", "passNum": "8篇文章"}
{"title": "【虎嗅晚报】和山寨Supreme合作？三星：这是意大利Supreme", "time": "2018-12-10 19:29", "share": "1", "recoment": "0", "anthor": "敲敲格", "intro": "空山无人", "passNum": "150篇文章"}
{"title": "老干妈：不上市的底气与逻辑", "time": "2018-12-10 18:28", "share": "9", "recoment": "6", "anthor": "中国经济信息杂志©", "intro": "信息改变生存质量", "passNum": "5篇文章"}
{"title": "2018，一文看尽AI发展真相", "time": "2018-12-10 18:16", "share": "18", "recoment": "0", "anthor": "新智元", "intro": "人工智能全产业平台", "passNum": "52篇文章"}
{"title": "《任天堂明星大乱斗：特别版》：我们都爱“大杂烩”", "time": "2018-12-10 17:54", "share": "1", "recoment": "1", "anthor": "我不叫塞尔达", "intro": "", "passNum": "81篇文章"}
{"title": "罗玉凤不认命", "time": "2018-12-10 17:46", "share": "25", "recoment": "12", "anthor": "盖饭人物ThePeople©", "intro": "冷眼看人间，心如火焰。", "passNum": "3篇文章"}
{"title": "星巴克被骂上热搜，新会员体系把老用户都气哭了", "time": "2018-12-10 16:40", "share": "10", "recoment": "9", "anthor": "运营研究社", "intro": "", "passNum": "13篇文章"}
{"title": "雀巢紧急召回一批问题奶粉，可致婴儿恶心呕吐", "time": "2018-12-10 16:36", "share": "5", "recoment": "0", "anthor": "每日经济新闻©", "intro": "", "passNum": "78篇文章"}
{"title": "8012年了，我们为什么还沉迷于“捏脸”游戏？", "time": "2018-12-10 16:27", "share": "9", "recoment": "3", "anthor": "看理想©", "intro": "“看理想”诞生于知名出版品牌“...", "passNum": "57篇文章"}
{"title": "广州三重奏：认识中国“南方”的一个视角", "time": "2018-12-10 16:25", "share": "21", "recoment": "0", "anthor": "东方历史评论©", "intro": "《东方历史评论》杂志官方微信账...", "passNum": "3篇文章"}
{"title": "得了癌症，怎么告诉孩子", "time": "2018-12-10 16:00", "share": "19", "recoment": "1", "anthor": "谢熊猫君", "intro": "", "passNum": "7篇文章"}
{"title": "从校园到职场：什么是职场经验", "time": "2018-12-10 15:43", "share": "38", "recoment": "5", "anthor": "caoz的梦呓©", "intro": "", "passNum": "70篇文章"}
{"title": "我们不禁要问：这些怪物在科学的地图周围在做什么呢？", "time": "2018-12-10 15:10", "share": "15", "recoment": "1", "anthor": "一席©", "intro": "", "passNum": "37篇文章"}
{"title": "一边卖命，一边求生，300万中国底层现状", "time": "2018-12-10 14:36", "share": "31", "recoment": "6", "anthor": "一条©", "intro": "每天一条原创短视频，每天讲述一...", "passNum": "1篇文章"}
{"title": "像拍《海王》一样拍《西游记》，会是什么样？", "time": "2018-12-10 14:30", "share": "13", "recoment": "10", "anthor": "壹条电影©", "intro": "", "passNum": "4篇文章"}
{"title": "DC翻身作《海王》，其实是一部环保教育宣传片", "time": "2018-12-10 14:00", "share": "6", "recoment": "8", "anthor": "PingWest品玩©", "intro": "有品好玩的科技，一切与你有关", "passNum": "71篇文章"}
{"title": "平台&大媒体都在输血本地新闻，  这样的合作模式真的可持续吗？", "time": "2018-12-10 14:00", "share": "8", "recoment": "2", "anthor": "全媒派©", "intro": "", "passNum": "78篇文章"}

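So far, the articles on the first page have been handled. I assumed the rest would be smooth sailing: crawling more pages should just take a loop. It turned out I was being naive.
Clicking "Load more" at the bottom of the first page shows that the additional content is loaded dynamically by JavaScript.

[Figure: HuXiu.png]

Analyze the request's method, URL, and parameters, as shown in the figure:

[Figure: HuXiu.png]

This yields the following information:

Request URL: https://www.huxiu.com/v2_action/article_list
Request method: POST
The most important request parameter is one called page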
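As a rough sketch of what such a request looks like in code, the snippet below builds (but does not send) a POST request with the standard library. The helper name `build_page_request` is hypothetical, and it assumes `page` is the only parameter needed, which the text does not guarantee.

```python
import urllib.parse
import urllib.request

URL = 'https://www.huxiu.com/v2_action/article_list'

def build_page_request(page):
    # Form-encode the body; supplying data= makes urllib use POST instead of GET
    body = urllib.parse.urlencode({'page': page}).encode('utf-8')
    return urllib.request.Request(URL, data=body)

req = build_page_request(2)
print(req.get_method(), req.data)
```

Passing the request to `urllib.request.urlopen` would actually send it. Note that the URL stays the same for every page and only the body changes, which matters for how Pyspider deduplicates tasks.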

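3.1.2 Crawling Dynamically Loaded Pages with PySpider

Write the loop inside on_start; the plan for this run is to crawl 2000 pages:

    # Methods below belong to the Pyspider Handler(BaseHandler) class
    @every(minutes=24 * 60)
    def on_start(self):
        # range(1, 2000) yields pages 1 to 1999; use range(1, 2001) to include page 2000
        for page in range(1, 2000):
            print("Crawling page {}".format(page))
            self.crawl('https://www.huxiu.com/v2_action/article_list', method="POST",
                       data={"page": page}, callback=self.parse_page, validate_cert=False)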

Once a page has been fetched, parse_page is called to parse the Response that crawl() returns after successfully fetching the URL.

    def parse_page(self, response):
        # The JSON response carries an HTML fragment under "data"
        content = response.json["data"]
        doc = pq(content)  # pq: from pyquery import PyQuery as pq
        lis = doc('.mod-art').items()
        data = [{
            'title': item('.msubstr-row2').text(),
            'url': 'https://www.huxiu.com' + str(item('.msubstr-row2').attr('href')),
            'name': item('.author-name').text(),
            'write_time': item('.time').text(),
            'comment': item('.icon-cmt+ em').text(),
            'favorites': item('.icon-fvr+ em').text(),
            'abstract': item('.mob-sub').text()
        } for item in lis]
        return data

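Finally, define an on_result() method, which receives whatever a callback returns. Here it receives the data returned by parse_page() and saves it to MongoDB.

    def on_result(self, result):
        if result:
            self.save_to_mongo(result)

    def save_to_mongo(self, result):
        # mongo_collection is a pymongo collection created elsewhere in the script (not shown)
        df = pd.DataFrame(result)
        content = list(json.loads(df.T.to_json()).values())
        if mongo_collection.insert_many(content):
            print('Saved to MongoDB')
            # Pause 1 to 4 seconds to avoid hammering the site
            sleep = np.random.randint(1, 5)
            time.sleep(sleep)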

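Pyspider uses the MD5 of the URL as a task's unique ID. Tasks with the same ID are treated as the same task and are not crawled again.
With GET requests, the URLs of different pages usually differ, so the IDs differ and multiple pages can be crawled.
With POST requests the URL is identical for every page, so after the first page the remaining pages are skipped.
To crawl page 2 and beyond, override how the ID is generated by adding the following code before the on_start() method:

    # md5string: from pyspider.libs.utils import md5string
    def get_taskid(self, task):
        return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))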
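The effect of this override can be checked outside Pyspider. The sketch below uses `hashlib` as a stand-in for Pyspider's `md5string` helper (assumed here to be a plain MD5 hex digest) and two hand-built task dictionaries. The function names `taskid_url_only` and `taskid_with_data` are made up for the comparison.

```python
import hashlib
import json

def md5_hex(text):
    # Stand-in for pyspider's md5string (assumption: MD5 hex digest of the text)
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def taskid_url_only(task):
    # Default behaviour described above: the ID depends on the URL alone
    return md5_hex(task['url'])

def taskid_with_data(task):
    # Overridden behaviour: the POST body is mixed into the ID
    return md5_hex(task['url'] + json.dumps(task['fetch'].get('data', '')))

url = 'https://www.huxiu.com/v2_action/article_list'
page1 = {'url': url, 'fetch': {'data': {'page': 1}}}
page2 = {'url': url, 'fetch': {'data': {'page': 2}}}

print(taskid_url_only(page1) == taskid_url_only(page2))    # True: page 2 looks like a repeat
print(taskid_with_data(page1) == taskid_with_data(page2))  # False: each page is its own task
```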

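The data is now stored in MongoDB:

[Figure: MongoDB.png]

In total, 2000 pages and 28,222 articles were crawled. Seven fields were captured: article title, author, publication time, comment count, favorite count, abstract, and article URL.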

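3.2 Data Cleaning

First, we read the data from MongoDB and convert it into a DataFrame.

import pandas as pd
import pymongo

client = pymongo.MongoClient(host='localhost', port=27017)
db = client['Huxiu']
collection = db['News']
# Convert the collection into a DataFrame
data = pd.DataFrame(list(collection.find()))

Next, let's look at the number of rows and columns and the overall summary of the data.

# Number of rows and columns
print(data.shape)
# Overall summary
print(data.info())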

Result:

(28222, 8)
RangeIndex: 28222 entries, 0 to 28221
Data columns (total 8 columns):
_id           28222 non-null object
abstract      28222 non-null object
comment       28222 non-null object
favorites     28222 non-null object
name          28222 non-null object
title         28222 non-null object
url           28222 non-null object
write_time    28222 non-null object

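The data has 28,222 rows and 8 columns. There is an extra, unused _id column that should be dropped, and the name column contains special symbols such as © that should be removed. In addition, every column is stored as object (string), so comment and favorites need to be converted to numeric types and write_time to a datetime type.

# Drop the unused _id column
data.drop(['_id'], axis=1, inplace=True)
# Remove the special symbol © from author names (the original listing replaced '@' by mistake)
data['name'] = data['name'].str.replace('©', '', regex=False)
data_duplicated = data.duplicated().value_counts()
# Convert numeric-looking columns to numeric dtype
data = data.apply(pd.to_numeric, errors='ignore')
# Convert write_time to datetime
data['write_time'] = pd.to_datetime(data['write_time'])
data = data.reset_index(drop=True)

Next, we check whether the data contains duplicates and drop them if so.

# Drop duplicate rows, keeping the first occurrence
data = data.drop_duplicates(keep='first')

We also add two columns for later analysis: the title length and the year.

# Title length column
data['title_length'] = data['title'].apply(len)
# Year column
data['year'] = data['write_time'].dt.year

That completes the basic data cleaning. The analysis below works on these 9 columns.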
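For readers who want to see the cleaning steps without pandas, here is a plain-Python sketch of the same sequence: strip ©, convert types, drop duplicates, and add the two derived fields. The sample records, the `clean` function name, and the `%Y-%m-%d` date format are all assumptions made for illustration, since the stored format of write_time is not shown above.

```python
from datetime import datetime

# Invented sample records; the last one duplicates the first
raw = [
    {'name': '华夏时报©', 'title': '裁员凶猛', 'comment': '2',
     'favorites': '3', 'write_time': '2018-12-10'},
    {'name': '新智元', 'title': '2018，一文看尽AI发展真相', 'comment': '0',
     'favorites': '18', 'write_time': '2018-12-10'},
    {'name': '华夏时报©', 'title': '裁员凶猛', 'comment': '2',
     'favorites': '3', 'write_time': '2018-12-10'},
]

def clean(records):
    seen = set()
    cleaned = []
    for rec in records:
        row = {
            'name': rec['name'].replace('©', ''),      # remove the special symbol
            'title': rec['title'],
            'comment': int(rec['comment']),            # string -> number
            'favorites': int(rec['favorites']),
            'write_time': datetime.strptime(rec['write_time'], '%Y-%m-%d'),
        }
        key = tuple(row.values())
        if key in seen:                                # keep only the first occurrence
            continue
        seen.add(key)
        row['title_length'] = len(row['title'])        # derived columns
        row['year'] = row['write_time'].year
        cleaned.append(row)
    return cleaned

cleaned = clean(raw)
print(len(cleaned), cleaned[0]['name'], cleaned[0]['title_length'], cleaned[0]['year'])
```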

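IV. Statistical Analysis

4.1 Overview

Let's start with the overall picture:

print(data.describe())

Result:

            comment     favorites  title_length          year
count  27236.000000  27236.000000  27236.000000  27236.000000
mean       9.030988     40.480761     23.010501   2016.382288
std       14.912655     52.381115      8.376050      1.516007
min        0.000000      0.000000      3.000000   2012.000000
25%        3.000000     12.000000     17.000000   2016.000000
50%        6.000000     24.000000     23.000000   2017.000000
75%       11.000000     48.000000     28.000000   2017.000000
max      914.000000    787.000000    124.000000   2018.000000

data.describe() summarizes the numeric columns. A few conclusions follow directly:

Readers are not especially keen to comment or favorite. Three quarters of the articles have at most 11 comments and at most 48 favorites. Compared with big WeChat official accounts that routinely reach millions of reads and tens of thousands of comments and favorites, Huxiu is indeed relatively niche. That niche character is also part of why some readers like it so much.
The most-commented article has 914 comments and the most-favorited article has 787 favorites, so there are still some articles that are notably popular or of high quality.
The longest title runs to 124 characters, while most titles are around 20 characters, so a title should be neither too long nor too short.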

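print(data['name'].describe())
print(data['write_time'].describe())

Result:

count     27236
unique     3334
top          虎嗅
freq       2289

count                   27236
unique                   1390
top       2017-04-25 00:00:00
freq                       44
first     2012-06-27 00:00:00
last      2018-10-20 00:00:00

unique is the number of distinct values, top is the most frequent value, and freq is how many times that value occurs. So we can draw the following conclusions:

In terms of sources, 3334 authors contributed these 27,236 articles. The site's own account, 虎嗅 (Huxiu), wrote the most, with 2289 articles, which is to be expected.
In terms of publication time, the earliest article dates from June 27, 2012. Over more than six years, the busiest single day was April 25, 2017, with 44 articles.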
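The meaning of count, unique, top, and freq can be reproduced with `collections.Counter`. The helper `describe_categorical` and the short list of author names are illustrative only and are not the real data.

```python
from collections import Counter

def describe_categorical(values):
    counts = Counter(values)
    top, freq = counts.most_common(1)[0]   # most frequent value and its count
    return {
        'count': len(values),              # number of entries
        'unique': len(counts),             # number of distinct values
        'top': top,
        'freq': freq,
    }

# Tiny made-up sample of author names
names = ['虎嗅', '敲敲格', '虎嗅', '新智元', '虎嗅', '敲敲格']
summary = describe_categorical(names)
print(summary)
```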

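4.2 Change in the Number of Articles Published on Huxiu

def analysis1(data):
    # color_line, colors, fontsize_text and fontsize_title are style settings
    # defined elsewhere in the original script (not shown)
    data.set_index(data['write_time'], inplace=True)
    data = data.resample('Q').count()['name']  # aggregate by quarter
    data = data.to_period('Q')
    fig, axl = plt.subplots()  # axl was never created in the original listing
    # Build the x axis
    x = np.arange(0, len(data), 1)
    axl.plot(x, data.values,
        color = color_line,
        marker = 'o', markersize = 4
        )
    axl.set_xticks(x)  # place ticks at 0, 1, 2, ...
    axl.set_xticklabels(data.index)  # label each tick with its quarter
    plt.xticks(rotation=90)  # rotate 90 degrees to avoid crowding

    # Annotate each point with its count
    for xi, yi in zip(x, data.values):
        plt.text(xi, yi + 10, '%.0f' % yi, ha='center', color=colors, fontsize=fontsize_text)
    # Title and axis labels
    plt.title('Number of Articles Published on Huxiu (2012-2018)', color=colors, fontsize=fontsize_title)
    plt.xlabel('Period')
    plt.ylabel('Articles')
    plt.tight_layout()  # trim blank margins automatically
    plt.savefig('虎嗅网文章数量发布变化.png', dpi=200)  # the original had the typo dip=200
    plt.show()

Result:

[Figure: change in the number of articles published on Huxiu, by quarter]

Looking at the six years quarter by quarter, the number of articles was fairly stable from 2012 to 2015, at around 400 per quarter. After 2016 it rose to more than 2000 per quarter, which may be related to Huxiu's listing in February 2015. The first and last quarters are incomplete, so their counts are low.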
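The quarterly aggregation behind this chart can be sketched in plain Python as follows. The helper names `quarter_label` and `count_by_quarter` are hypothetical, and the four dates are illustrative: three appear in the statistics above and one is made up.

```python
from collections import Counter
from datetime import date

def quarter_label(d):
    # Months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4
    return '{}Q{}'.format(d.year, (d.month - 1) // 3 + 1)

def count_by_quarter(dates):
    # Count articles per quarter and return them in chronological order
    counts = Counter(quarter_label(d) for d in dates)
    return dict(sorted(counts.items()))

dates = [date(2012, 6, 27), date(2017, 4, 25), date(2017, 5, 2), date(2018, 10, 20)]
quarterly = count_by_quarter(dates)
print(quarterly)
```

Unlike `resample('Q')`, this sketch lists only quarters that contain at least one article. Empty quarters would need to be filled in separately before plotting a continuous time axis.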

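4.3 Top 10 Articles by Favorites

Out of tens of thousands of articles, which ones are the best written or the most popular?

top = data.sort_values('favorites', ascending=False)
# Re-index from 1 so the index doubles as the rank
top.index = range(1, len(top.index) + 1)
print(top[:10][['title', 'favorites', 'comment']])

Result: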

                             title  favorites  comment
1                        货币如水，覆水难收        787       39
2                            自杀经济学        781      119
3   2016年已经起飞的5只黑天鹅，都在罗振宇这份跨年演讲全文里        774       39
4               真正强大的商业分析能力是怎样炼成的？        747       18
5                        藏在县城的万亿生意        718       35
6                           腾讯没有梦想        707       32
7               段永平连答53问，核心是“不为清单”        706       27
8                          王健林的滑铁卢        703       92
9                           7-11不死        691       17
10            游戏策划人士：为什么我的儿子不沉迷游戏？        644       33

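Two things stand out. First, the titles are all fairly short and concise. Second, although the favorite counts are high, the comment counts are not. My guess is that most readers prefer to take what they need without joining the discussion.

4.4 Yearly Top 3 Articles by Favorites

Having looked at the overall ranking, let's see how the ranking looks year by year. Here we pick the 3 most-favorited articles of each year.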

def analysis2(data):
    def topn(data):
        top = data.sort_values('favorites', ascending=False)
        return top[:3]

    # group_keys=False keeps year as a column only; otherwise year becomes both
    # an index level and a column, which newer pandas rejects as ambiguous
    data = data.groupby(by=['year'], group_keys=False).apply(topn)
    print(data[['title', 'favorites']])
    # Add a top column holding the rank 1, 2, 3 within each year
    data['add'] = 1  # helper column
    data['top'] = data.groupby(by='year')['add'].cumsum()

    data_reshape = data.pivot_table(index='year', columns='top', values='favorites').reset_index()
    print(data_reshape)
    data_reshape.plot(
        y = [1,2,3],
        kind = 'bar',
        width = 0.3,
        color = ['#1362A3', '#3297EA', '#8EC6F5']
        )
    # x-axis tick labels (one per year rather than a hard-coded 7)
    years = data['year'].unique()
    plt.xticks(list(range(len(years))), years)
    plt.xlabel('Year')
    plt.ylabel('Favorites')
    plt.title('Yearly Top 3 Articles by Favorites', color = colors, fontsize = fontsize_title)
    plt.tight_layout()
    plt.savefig('历年TOP3文章收藏比较.png', dpi=200)
    plt.show()

Result:

[Figure: yearly top 3 articles by favorites]
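The "top 3 per year" selection can also be written without pandas. The sketch below groups records by year and uses `heapq.nlargest`. The function name `top_n_per_year` and the sample articles are invented for illustration.

```python
import heapq
from collections import defaultdict

def top_n_per_year(articles, n=3):
    # Group the articles by year
    by_year = defaultdict(list)
    for art in articles:
        by_year[art['year']].append(art)
    # Within each year keep the n articles with the most favorites, best first
    return {
        year: heapq.nlargest(n, group, key=lambda a: a['favorites'])
        for year, group in sorted(by_year.items())
    }

# Made-up sample data
articles = [
    {'year': 2017, 'title': 'A', 'favorites': 30},
    {'year': 2017, 'title': 'B', 'favorites': 50},
    {'year': 2017, 'title': 'C', 'favorites': 10},
    {'year': 2017, 'title': 'D', 'favorites': 40},
    {'year': 2018, 'title': 'E', 'favorites': 20},
    {'year': 2018, 'title': 'F', 'favorites': 70},
]

top3 = top_n_per_year(articles)
for year, group in top3.items():
    print(year, [(a['title'], a['favorites']) for a in group])
```

A year with fewer than three articles simply returns what it has, which is also how the pandas slice `top[:3]` behaves.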

Author: 伪文艺boy
Link: https://www.jianshu.com/p/fee0a0b15f91