Python Crawler | A Targeted Stock Data Crawler Based on requests-bs4-re

Python version: Python 3.5
Technical stack: requests-bs4-re

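1 Functional description

Stock data is the foundational data for quantitative analysis, and a crawler is one way to obtain that data for quantitative trading.
The requirements are:

Goal: obtain the names and trading information of all stocks listed on the Shanghai and Shenzhen stock exchanges
Output: save the results to a file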

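2 Choosing a candidate data site

Data source sites: to get stock data, we need to find sites that provide it. Here are two candidates:
- Sina Stocks: http://finance.sina.com.cn/stock/
- Baidu Stocks: https://gupiao.baidu.com/stock/
Selection principles: with candidates in hand, we need to judge which one is better suited to crawling. The principles are:
- The stock information exists statically in the HTML page and is not generated by JS code
- There are no Robots protocol restrictions
Selection methods: the browser's F12 tools, viewing the page source, and so on
Selection mindset: don't get stuck on one site; try several information sources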

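1. Viewing the page source
I opened the browser and looked at the source of both Sina Stocks and Baidu Stocks. On Sina Stocks the information for each stock is generated dynamically by JS code, while Baidu Stocks serves static HTML pages. So Baidu Stocks is the better source of stock data.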

(Screenshot: Baidu Stocks)

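2. About the Robots protocol
Besides the above, we also need to consider whether crawling the site is permitted. Opening https://gupiao.baidu.com/robots.txt returns a 404 page, which means Baidu Stocks publishes no Robots rules restricting crawlers. So our crawler is not violating any Robots restriction on this site.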
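
When a site does publish a robots.txt, the same check can be done in code with the standard library's urllib.robotparser. The snippet below is only a rough sketch: the helper name is_allowed and the sample rules are made up for illustration, and it parses robots.txt text passed in as a string instead of fetching anything from a real site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url.

    is_allowed is a hypothetical helper, not part of the original crawler.
    """
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration, not the real content of any site
sample_rules = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(sample_rules, "https://example.com/stock/sh600000.html"))
print(is_allowed(sample_rules, "https://example.com/private/data.html"))
# An empty robots.txt declares no restrictions at all
print(is_allowed("", "https://example.com/stock/sh600000.html"))
```

An empty rule set behaves like the 404 case described here: nothing is declared, so every path is allowed.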

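3. A supporting site
With the data source settled, we still need a list of all the stocks currently on the market. For this we use Eastmoney: http://quote.eastmoney.com/stocklist.html, which lists all stocks on the Shanghai and Shenzhen exchanges. So this Eastmoney page is our source for the full stock list.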

(Screenshot: Eastmoney)

3 程序的结构设计

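Now look at an individual stock page on Baidu Stocks and watch the browser's address bar.
Every stock has its own code, and that code is part of the page URL.
This code is exactly the stock's ticker code on the Shenzhen or Shanghai exchange.
(Screenshot: URL pattern)
From this we can lay out the program in roughly three steps:

Step 1: get the stock list from Eastmoney (build a list containing every stock code)
Step 2: using the stock list, fetch each stock's information from Baidu Stocks (take the codes one by one, append each to the Baidu Stocks URL, visit the resulting links one by one, and extract the information for that stock)
Step 3: store the results in a file (once the information for every stock on the list has been fetched from Baidu Stocks, write it to a file)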
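
To make step 2 concrete, here is a tiny sketch of how a stock code turns into a page URL. The helper name build_stock_url is my own and the two codes are just sample values; the base URL plus code plus ".html" pattern is the one the crawler uses.

```python
def build_stock_url(base_url, stock_code):
    """Join the base URL and one stock code into an individual stock page URL.

    build_stock_url is a hypothetical helper used only for illustration.
    """
    return base_url + stock_code + ".html"


base = "https://gupiao.baidu.com/stock/"
sample_codes = ["sh600000", "sz000001"]  # sample codes, "sh" = Shanghai, "sz" = Shenzhen

urls = [build_stock_url(base, code) for code in sample_codes]
for url in urls:
    print(url)
```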

4 代码编写

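1. First write the main function main() and define the other functions and their interfaces. Since we will use the requests library, the BeautifulSoup library, and the regular expression (re) library, import these first.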

import requests
from bs4 import BeautifulSoup
import re

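2. For the whole program I define four functions.

Fetch the page for a URL: getHTMLText()
Get the list of stocks: getStockList()
Get the information for each individual stock and store it in a data structure: getStockInfo()
Finally, the main function: main()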

import requests
from bs4 import BeautifulSoup
import re

def getHTMLText(url):
    return ""

def getStockList(lst, stockURL):
    # Get the stock list
    return ""

def getStockInfo(lst, stockURL, fpath):
    # Using the stock list, fetch each stock's information from the
    # target site and save it to a file
    return ""

def main():
    # URL of the page that lists all stocks
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    # Base part of the URL for individual stock pages
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    # Where the output file is saved
    output_file = 'G:/BaiduStockInfo.txt'
    # List that holds the stock codes
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()

3. The main skeleton of the program is now in place. Let's go through the functions one by one.

getHTMLText() uses the general-purpose request template

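def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        # Raise an exception if the request failed
        r.raise_for_status()
        # Use apparent_encoding to set the encoding
        r.encoding = r.apparent_encoding
        # Return the text of the page
        return r.text
    except requests.RequestException:
        # On any request error (bad status, timeout, connection failure),
        # return an empty string
        return ""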

getStockList(): because the stock list comes from Eastmoney, we need to inspect that page's source code before writing this function

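def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    # Parse the page with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    # Use find_all() to collect every <a> tag
    a = soup.find_all('a')
    for i in a:
        try:
            # Read the href attribute of each <a> tag
            href = i.attrs['href']
            # Use a regular expression to pull the stock code
            # ("sh" or "sz" plus six digits) out of the link and append it to lst
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):
            # No href attribute, or no stock code in the link: skip this tag
            continue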
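
To see what the regular expression actually keeps and what it throws away, here is a small standalone sketch. The href strings are made-up samples rather than links copied from the real page, and extract_code is a hypothetical helper name.

```python
import re

# "s", then "h" or "z", then exactly six digits
STOCK_CODE = re.compile(r"s[hz]\d{6}")


def extract_code(href):
    """Return the first stock code found in href, or None if there is none."""
    match = STOCK_CODE.search(href)
    return match.group(0) if match else None


# Made-up sample links for illustration
sample_hrefs = [
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://quote.eastmoney.com/center/",
]

codes = [c for c in map(extract_code, sample_hrefs) if c is not None]
print(codes)  # ['sh600000', 'sz000001']
```

Links without a code simply yield None here, which plays the same role as the try/except skip in getStockList().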

getStockInfo(): this one depends on the source of an individual stock page on Baidu Stocks, so open the browser and inspect it

def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            # Holds all the information recorded from one stock page
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            # Find the outer tag that wraps the stock information
            stockInfo = soup.find('div', attrs={'class':'stock-bets'})

            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            # The key '股票名称' means "stock name"
            infoDict.update({'股票名称': name.text.split()[0]})

            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            # Pair each <dt> key with its <dd> value and store the pairs in infoDict
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            # Append this stock's information to the output file
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
        except:
            continue
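
The key/value pairing loop can also be written with zip(). This is just an alternative sketch, not what the crawler above does: pair_fields is a hypothetical helper, and the labels and values are invented stand-ins for the text of the dt and dd tags.

```python
def pair_fields(keys, values):
    """Pair labels with values in order; extra items on either side are dropped."""
    return dict(zip(keys, values))


# Invented stand-ins for the <dt> and <dd> texts of one stock page
sample_keys = ["Open", "Volume", "High"]
sample_values = ["10.50", "1.2M"]  # one value short on purpose

info = pair_fields(sample_keys, sample_values)
print(info)  # {'Open': '10.50', 'Volume': '1.2M'}
```

One difference worth knowing: with index-based access, a page that has more dt tags than dd tags raises IndexError and the whole stock is skipped, whereas zip() quietly keeps the pairs that do line up.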

4. That's all the code written. Feeling pretty pleased with myself, so let's run it and see~

import requests
from bs4 import BeautifulSoup
import re


def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
 

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue
 

def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={'class':'stock-bets'})
 
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})
             
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
             
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
        except:
            continue


def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)
 
main()
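
Since each line of the output file is the str() of a dict, the records can be read back later with ast.literal_eval from the standard library. The following is a minimal sketch under that assumption; load_records is a hypothetical helper and the sample records are invented.

```python
import ast


def load_records(lines):
    """Turn lines written as str(dict) back into a list of dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            # literal_eval only evaluates Python literals, unlike eval()
            records.append(ast.literal_eval(line))
    return records


# Invented sample lines in the same shape the crawler writes
sample_lines = [
    str({"股票名称": "Example", "Open": "10.50"}) + "\n",
    "\n",
    str({"股票名称": "Sample", "Open": "20.10"}) + "\n",
]

records = load_records(sample_lines)
print(len(records))           # 2
print(records[0]["股票名称"])  # Example
```

With a real output file, the same function can be fed an open file object, for example: with open(path, encoding='utf-8') as f: records = load_records(f).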

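5 Code optimization

All the functionality is now implemented. So are we done? Of course not.

Programs are ultimately used by people, so improving the user experience really matters.
For a crawler, speed is an obvious way to improve that experience. Unfortunately, as long as we stay with the requests-bs4 stack, there is not much speed to be gained.
If speed is critical, the Scrapy library is an option; a later post will cover it.
So is there nothing we can do? Not quite. We can still optimize a few small things~
- Encoding detection: getHTMLText() contains the line r.encoding = r.apparent_encoding, which has the program analyze the text of the fetched HTML page to guess its likely encoding.
- Because r.apparent_encoding has to analyze nearly the whole text, it clearly takes some time. For a targeted crawler like this one, we can find out the encoding by hand beforehand and assign it to encoding directly.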

def getHTMLText(url, code="utf-8"):
# New code parameter, defaulting to utf-8
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, "GB2312")
    # The stock list page fetched here is GB2312-encoded, so pass that as the code argument
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue
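
Why the encoding matters can be shown without any network access. The sketch below does not use requests at all; it only illustrates that bytes decode correctly when the known encoding is given and come out garbled otherwise. decode_page is a hypothetical helper and the sample text is arbitrary.

```python
def decode_page(raw_bytes, encoding):
    """Decode raw page bytes with an encoding we already know, no detection step."""
    return raw_bytes.decode(encoding)


text = "股票代码"                # arbitrary sample text ("stock code")
raw = text.encode("gb2312")      # pretend these bytes came from a GB2312 page

right = decode_page(raw, "gb2312")
# Decoding with the wrong codec yields garbled text
wrong = raw.decode("utf-8", errors="replace")

print(right == text)  # True
print(wrong == text)  # False
```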

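Adding a live progress display is, of course, another way to improve the user experience. We can add a progress indicator that prints the percentage of stocks crawled so far.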

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class':'stock-bets'})
 
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})
             
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
             
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
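                count = count + 1
                # Show a live progress percentage; "\r" returns to the start of
                # the line so each update overwrites the previous one
                print("\r当前进度: {:.2f}%".format(count*100/len(lst)), end="")
        except:
            count = count + 1
            # Failed stocks also count toward the progress
            print("\r当前进度: {:.2f}%".format(count*100/len(lst)), end="")
            continue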
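
The progress line is printed in two places, so one possible tidy-up is to build the text in a small helper. This is only a sketch: format_progress is a hypothetical name, the English label is my own, and the zero-total guard is an extra that the crawler does not need.

```python
def format_progress(done, total):
    """Build the one-line progress text; the leading "\\r" rewinds to line start."""
    if total == 0:
        # Nothing to do counts as finished (guard added for this sketch only)
        return "\rProgress: 100.00%"
    return "\rProgress: {:.2f}%".format(done * 100 / total)


total = 4
for done in range(1, total + 1):
    # end="" keeps the cursor on the same line so the next update overwrites it
    print(format_progress(done, total), end="")
print()  # move to a fresh line once finished
```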

6 Final code

import requests
from bs4 import BeautifulSoup
import re
 

def getHTMLText(url, code="utf-8"):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""
 

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue
 

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class':'stock-bets'})
 
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})
             
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
             
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
                count = count + 1
                print("\r当前进度: {:.2f}%".format(count*100/len(lst)), end="")
        except:
            count = count + 1
            print("\r当前进度: {:.2f}%".format(count*100/len(lst)), end="")
            continue
 

def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'G:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()

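This post is just my study notes;
if you spot any mistakes, please point them out;
and everyone is welcome to learn and discuss together~(๑•̀㉨•́ฅ✧)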