
QQ Music Crawler: Downloading Chart Songs

Today we'll build a crawler for QQ Music that downloads the songs listed on a chart.

[Figure: home page]

[Figure: chart contents]

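    A quick look at the page shows that it is rendered dynamically, so we need to capture the network requests that carry the data we want. QQ Music updates its site from time to time, so the rules differ from one release to the next; the code here is written against the rules in effect at the time of writing. As a shortcut: the requests that carry song data almost all contain the keyword fcg, so you can filter on it directly, or inspect each request's Preview tab and judge for yourself.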
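
    As a rough illustration of that filtering step, here is a minimal sketch. It assumes the captured request URLs have been exported to a plain list; the helper name filter_fcg and the static-asset URL are made up for the example.

```python
from urllib.parse import urlparse


def filter_fcg(urls):
    """Keep only URLs whose path contains the keyword 'fcg'."""
    return [u for u in urls if "fcg" in urlparse(u).path]


# Sample capture: the two .fcg endpoints appear in this article,
# the last entry is a hypothetical static asset.
captured = [
    "https://y.qq.com/n/yqq/toplist/4.html",
    "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?topid=4",
    "https://u.y.qq.com/cgi-bin/musicu.fcg?format=json",
    "https://static.example.com/logo.png",
]

for url in filter_fcg(captured):
    print(url)
```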

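    This is the chart's song-info request, and it is the entry point of our crawler. From it we can get each song's basic information, including the song id and name. We will need these later, so keep them in mind while we work through the rest step by step.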
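
    To make the shape of that response concrete, the sketch below pulls the id and name out of a made-up payload. The keys songlist, data, songmid and songname are the ones the full script reads; the values and the helper name extract_songs are placeholders.

```python
def extract_songs(chart):
    """Return (songmid, songname) pairs from a chart response dict."""
    return [
        (item["data"]["songmid"], item["data"]["songname"])
        for item in chart.get("songlist", [])
    ]


# Invented payload in the shape the chart request returns (ids are placeholders).
sample_chart = {
    "update_time": "2019-01-08",
    "songlist": [
        {"data": {"songmid": "MID_AAA", "songname": "Song A"}},
        {"data": {"songmid": "MID_BBB", "songname": "Song B"}},
    ],
}

print(extract_songs(sample_chart))
```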

    Next, open the playback page and capture the song's media file; grabbing the media request is enough.

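    Comparing the download links of different songs shows what they share and where they differ. Capturing the requests for several songs shows that parameters such as guid and format are fixed values; the only parts that change are the string after C400 (which on closer inspection is the songmid) and the vkey value.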
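
    The comparison can also be done programmatically. The following is a small sketch, not part of the original script: diff_urls is a hypothetical helper, and the songmid and vkey values in the two links are invented placeholders.

```python
from urllib.parse import urlparse, parse_qs


def diff_urls(url_a, url_b):
    """Return (names of query parameters that differ, whether the paths differ)."""
    a, b = urlparse(url_a), urlparse(url_b)
    qa, qb = parse_qs(a.query), parse_qs(b.query)
    changed = sorted(k for k in set(qa) | set(qb) if qa.get(k) != qb.get(k))
    return changed, a.path != b.path


# Placeholder links: only the part after C400 and the vkey are different.
url_1 = "http://dl.stream.qqmusic.qq.com/C400MID_AAA.m4a?guid=953482270&vkey=KEY_1&format=json"
url_2 = "http://dl.stream.qqmusic.qq.com/C400MID_BBB.m4a?guid=953482270&vkey=KEY_2&format=json"

print(diff_urls(url_1, url_2))
```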

    Now we capture the request that deals with the vkey; its name alone makes it fairly clear that it is the vkey-related one.

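    This is the vkey request. Tidy up the response (paste it into an online JSON formatter) and compare it with the download link, and it is easy to see where the vkey is stored. The purl field is exactly the string that starts at C400 with the vkey parameters appended, which saves us a lot of work.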
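
    A minimal sketch of reading purl out of such a response follows. The key path req_0 / data / midurlinfo / purl is the one used by the script; the sample values and the helper name extract_purl are placeholders, and treating an empty purl as "not downloadable" is an assumption.

```python
def extract_purl(vkey_response):
    """Return the purl string from a vkey response, or None if missing or empty."""
    try:
        purl = vkey_response["req_0"]["data"]["midurlinfo"][0]["purl"]
    except (KeyError, IndexError, TypeError):
        return None
    # Assumption: an empty purl means no download link was issued for the song.
    return purl or None


# Invented response with only the fields this sketch reads.
sample = {
    "req_0": {"data": {"midurlinfo": [
        {"purl": "C400MID_AAA.m4a?guid=953482270&vkey=KEY_1"}
    ]}}
}

purl = extract_purl(sample)
print("http://dl.stream.qqmusic.qq.com/" + purl)
```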

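    Looking at the real request URL shown in the headers of the vkey request, its query string is essentially the content of the data parameter, and the only thing that differs between songs is the songmid inside data. So all we need to do is build the URL with the right songmid and fetch it.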

def getVkey(songmid):
    vkey_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(songmid)
    res = requests.get(url=vkey_url)
    time.sleep(0.5)
    res02 = json.loads(res.text)
    vkey = res02["req_0"]["data"]["midurlinfo"][0]["purl"]
    return vkey
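
    The percent-encoded data parameter in getVkey is hard to read. As an alternative sketch, the same URL can be assembled from a dict; the field values below are decoded from the URL above, while the helper name build_vkey_url and the placeholder songmid are assumptions of this example.

```python
import json
from urllib.parse import urlencode, urlparse, parse_qs

GUID = "953482270"  # fixed value taken from the captured request


def build_vkey_url(songmid):
    """Build the vkey request URL, encoding the data payload from a dict."""
    data = {
        "req": {
            "module": "CDN.SrfCdnDispatchServer",
            "method": "GetCdnDispatch",
            "param": {"guid": GUID, "calltype": 0, "userip": ""},
        },
        "req_0": {
            "module": "vkey.GetVkeyServer",
            "method": "CgiGetVkey",
            "param": {
                "guid": GUID,
                "songmid": [songmid],  # the only value that changes per song
                "songtype": [0],
                "uin": "0",
                "loginflag": 1,
                "platform": "20",
            },
        },
        "comm": {"uin": 0, "format": "json", "ct": 24, "cv": 0},
    }
    params = {
        "-": "getplaysongvkey05137740976859173",
        "g_tk": 5381,
        "loginUin": 0,
        "hostUin": 0,
        "format": "json",
        "inCharset": "utf8",
        "outCharset": "utf-8",
        "notice": 0,
        "platform": "yqq.json",
        "needNewCode": 0,
        # Compact separators keep the payload free of extra spaces
        "data": json.dumps(data, separators=(",", ":")),
    }
    return "https://u.y.qq.com/cgi-bin/musicu.fcg?" + urlencode(params)


url = build_vkey_url("MID_AAA")  # placeholder songmid
payload = json.loads(parse_qs(urlparse(url).query)["data"][0])
print(payload["req_0"]["param"]["songmid"])
```

    Decoding the URL again, as the last lines do, is a quick way to check that the payload survived the encoding step.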

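    Taking the songmid and vkey of an arbitrary song to verify, we find that the download works. That completes the whole flow, which boils down to:

  1. Get the song's songmid
  2. Use the songmid to get the vkey
  3. Fetch the song through the download link built from the vkey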
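
    The three steps can be sketched as a single loop. To keep the example runnable offline, the network calls are passed in as functions; download_chart, the stand-in data, and the rule that an empty purl is skipped are all assumptions of this sketch rather than part of the original script.

```python
def download_chart(chart, fetch_purl, save):
    """Run the three steps for every song in a chart dict.

    fetch_purl(songmid) returns the purl string (empty or None if unavailable).
    save(url, name) stores the file.
    """
    saved = []
    for item in chart.get("songlist", []):
        data = item["data"]                 # step 1: songmid and name
        purl = fetch_purl(data["songmid"])  # step 2: songmid -> purl (holds the vkey)
        if not purl:
            continue  # assumption: no purl means the song cannot be downloaded
        save("http://dl.stream.qqmusic.qq.com/" + purl, data["songname"])  # step 3
        saved.append(data["songname"])
    return saved


# Offline stand-ins so the sketch runs without touching the network.
chart = {"songlist": [
    {"data": {"songmid": "MID_AAA", "songname": "Song A"}},
    {"data": {"songmid": "MID_BBB", "songname": "Song B"}},
]}
fake_purls = {"MID_AAA": "C400MID_AAA.m4a?vkey=KEY_1", "MID_BBB": ""}
downloads = []

saved = download_chart(chart, fake_purls.get,
                       lambda url, name: downloads.append((url, name)))
print(saved)
```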

Implementation

#!/usr/bin/env python
# -*- coding:utf-8 -*-
'''
@author: maya
@contact: 1278077260@qq.com
@software: Pycharm
@file: music.py
@time: 2019/1/8 12:48
@desc:
'''
import json
import requests
import time
import os
import urllib.request  # saveMusic calls urllib.request.urlopen

headers = {
        "cookie": 'RK=51FHFw4aE8; pgv_pvi=8430643200; ptcz=83cfc479ce75c5a1416df7d87136166109888f38587d9944738abca7ab77d17c; tvfe_boss_uuid=e4ba183f02ae980f; pgv_pvid=3169027098; pgv_pvid_new=2426636288_14882e87533; mobileUV=1_15f666e2b04_e8a50; pac_uid=1_1278077260; eas_sid=l1C5q306s9W2d845F9u7f1K1U6; ptui_loginuin=40370953; o_cookie=1278077260; luin=o1278077260; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22%24device_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; lskey=00010000a5727043706a88a2aebf6044daf687035fcc0804760fd13cac0729275356f7aa88d5157b46210ea6; LW_sid=y1s5J425D4j7u9N1Q8Q0j2k383; LW_uid=p1q5u4d584A7f971l820z2k3M9; ts_uid=4705118039; yq_index=0; uin=o1278077260; skey=@mXN9mj3as; p_uin=o1278077260; pt4_token=cVwioR9KifEllUyD2CPEXz692iNhDH8JE-YwH*5TlRY_; p_skey=BE7HSxnTeFIPwrO6sJ*YXyA1xKGxT072f5YAo919LSY_; yqq_stat=0; pgv_si=s3828307968; pgv_info=ssid=s3773836208; ts_last=y.qq.com/n/yqq/toplist/4.html; ts_refer=link.zhihu.com/%3Ftarget%3Dhttps%253A//y.qq.com/n/yqq/toplist/4.html%2523stat%253Dy_new.toplist.menu.4',
        "user-agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36'

    }

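def getHtml(start_url):
    """Fetch a chart URL and return the parsed JSON, or an empty dict on failure."""
    try:
        r = requests.get(start_url, headers=headers)
        r.encoding = r.apparent_encoding
        return json.loads(r.text)
    except (requests.RequestException, ValueError):
        # Network error, or the body was not valid JSON
        return {}

def getSongMid(html):
    """Return [songmid, songname] pairs from the chart JSON (empty list if none)."""
    songmid = []
    for tid in html.get('songlist', []):
        songmid.append([tid['data']['songmid'], tid['data']['songname']])
    return songmid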

def getSong(html):
    # Some update_key values are unpadded (e.g. 2018-5) while the request needs 2018-05,
    # so zero-pad the "_"-separated fields when there are three of them
    update_key = html['update_time']
    temp_key = update_key.split("_")
    if len(temp_key) == 3:
        update_key = "_".join([temp_key[0], temp_key[1].zfill(2), temp_key[2].zfill(2)])
    start_index = 0
    while True:
        start_num = start_index * 30  # the chart is requested 30 songs at a time
        start_index += 1
        page_url = "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date={0}&topid=4&type=top&song_begin={1}&song_num=30&g_tk=1154346586&loginUin=1278077260&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0".format(
            update_key, start_num)
        json_text = getHtml(page_url)
        songinfo = getSongMid(json_text)
        if len(songinfo) == 0:
            break
        for sid in songinfo:
            vkey = getVkey(sid[0])  # get the vkey (purl) for each song
            saveMusic(sid[0], vkey, sid[1])  # save the song
            time.sleep(1)  # sleep for 1 second so the server does not filter us out

def getVkey(songmid):
    vkey_url = "https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(songmid)
    res = requests.get(url=vkey_url)
    time.sleep(0.5)
    res02 = json.loads(res.text)
    vkey = res02["req_0"]["data"]["midurlinfo"][0]["purl"]
    return vkey



def saveMusic(songmid, vkey, name):
    # Copy the shared headers so the Host override does not leak into later chart requests
    dl_headers = dict(headers, Host='dl.stream.qqmusic.qq.com')
    url = "http://dl.stream.qqmusic.qq.com/" + vkey
    res = requests.get(url, headers=dl_headers, stream=True)
    # Clean the song name once so it can be used as a file name
    safe_name = name.replace("?", "").replace("/", "_").replace("\\", "_").replace("\"", "")
    filename = 'song/{0}.m4a'.format(safe_name)

    print("*****    Downloading    *****")
    print(url)
    print("***** Song: {}".format(safe_name))

    with open(filename, 'wb') as f:
        f.write(res.raw.read())
    # Content-Length arrives as a string (or None), so convert it before comparing
    if int(urllib.request.urlopen(url).getheader('Content-Length') or 0) > 0:
        print("Downloaded song: {}".format(safe_name))
        # size = urllib.request.urlopen(url).getheader('Content-Length')
        # print(size)
    else:
        print("Download failed")
        os.remove(filename)

if __name__ == '__main__':
    start_url = "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date=2019-01-08&topid=4&type=top&song_begin=0&song_num=30&g_tk=1154346586&loginUin=1278077260&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0"
    text = getHtml(start_url)
    getSong(text)
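
    saveMusic cleans the song name with a chain of replace calls that covers four characters. As a possible variation, a regular expression can handle the other characters that Windows rejects in file names as well. This is a sketch: safe_filename is a hypothetical helper, and unlike the script it replaces every reserved character with an underscore instead of deleting some of them.

```python
import re


def safe_filename(name, replacement="_"):
    """Replace characters that are reserved in Windows file names (and '/' on POSIX)."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name).strip()
    # Fall back to a fixed name if nothing usable is left
    return cleaned or "untitled"


for title in ['What?/Why', 'a\\b"c', '   ']:
    print(repr(safe_filename(title)))
```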


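Multithreaded version (the threading code is commented out, so as posted the pages are fetched one after another):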

import requests
import json
import time
from datetime import datetime
import threading



date_time=datetime.now().date()
def func(num):
    starturl="https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date={0}&topid=4&type=top&song_begin={1}&song_num=30&g_tk=1285181755&loginUin=2521763805&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0".format(date_time,num*30)
    print(starturl)
    headers = {
    "cookie": 'RK=51FHFw4aE8; pgv_pvi=8430643200; ptcz=83cfc479ce75c5a1416df7d87136166109888f38587d9944738abca7ab77d17c; tvfe_boss_uuid=e4ba183f02ae980f; pgv_pvid=3169027098; pgv_pvid_new=2426636288_14882e87533; mobileUV=1_15f666e2b04_e8a50; pac_uid=1_1278077260; eas_sid=l1C5q306s9W2d845F9u7f1K1U6; ptui_loginuin=40370953; o_cookie=1278077260; luin=o1278077260; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22%24device_id%22%3A%221669eddcdc5156-0905303c6ff588-7d113749-1049088-1669eddcdc83f8%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; lskey=00010000a5727043706a88a2aebf6044daf687035fcc0804760fd13cac0729275356f7aa88d5157b46210ea6; LW_sid=y1s5J425D4j7u9N1Q8Q0j2k383; LW_uid=p1q5u4d584A7f971l820z2k3M9; ts_uid=4705118039; yq_index=0; uin=o1278077260; skey=@mXN9mj3as; p_uin=o1278077260; pt4_token=cVwioR9KifEllUyD2CPEXz692iNhDH8JE-YwH*5TlRY_; p_skey=BE7HSxnTeFIPwrO6sJ*YXyA1xKGxT072f5YAo919LSY_; yqq_stat=0; pgv_si=s3828307968; pgv_info=ssid=s3773836208; ts_last=y.qq.com/n/yqq/toplist/4.html; ts_refer=link.zhihu.com/%3Ftarget%3Dhttps%253A//y.qq.com/n/yqq/toplist/4.html%2523stat%253Dy_new.toplist.menu.4',
    "user-agent": 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36'
    }
    res=requests.get(url=starturl,headers=headers)
    res=res.text
    res=json.loads(res)
    songname=[]
    songmid=[]
    for i in res["songlist"]:
        songname.append(i["data"]["songname"])
        songmid.append(i["data"]["songmid"])
    mid_name=dict(zip(songmid,songname))

    for j in mid_name:
        vkey_url ="https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey05137740976859173&g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq.json&needNewCode=0&data=%7B%22req%22%3A%7B%22module%22%3A%22CDN.SrfCdnDispatchServer%22%2C%22method%22%3A%22GetCdnDispatch%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22calltype%22%3A0%2C%22userip%22%3A%22%22%7D%7D%2C%22req_0%22%3A%7B%22module%22%3A%22vkey.GetVkeyServer%22%2C%22method%22%3A%22CgiGetVkey%22%2C%22param%22%3A%7B%22guid%22%3A%22953482270%22%2C%22songmid%22%3A%5B%22{0}%22%5D%2C%22songtype%22%3A%5B0%5D%2C%22uin%22%3A%220%22%2C%22loginflag%22%3A1%2C%22platform%22%3A%2220%22%7D%7D%2C%22comm%22%3A%7B%22uin%22%3A0%2C%22format%22%3A%22json%22%2C%22ct%22%3A24%2C%22cv%22%3A0%7D%7D".format(j)
        res02=requests.get(url=vkey_url)
        time.sleep(0.5)
        res02 = res02.text
        res02 = json.loads(res02)
        vkey=res02["req_0"]["data"]["midurlinfo"][0]["purl"]
        url="http://dl.stream.qqmusic.qq.com/"+vkey
        try:
            filename="music/"+mid_name[j]+".m4a"
            print(filename)
            res03=requests.get(url=url,headers=headers)
            with open(filename,"wb") as f:
                f.write(res03.content)
        except:
            continue

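# Threaded variant (left disabled). Pass the function and its argument separately:
# target=func(the) would call func immediately in the main thread.
# threading_list=[]
# for the in range(4):
#     threadParse = threading.Thread(target=func, args=(the,))
#     threading_list.append(threadParse)
#
# for th in threading_list:
#     th.start()
# for th in threading_list:
#     th.join()  # wait for every page instead of relying on daemon threads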
for lon in range(4):
    func(lon)
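
    For readers who want the pages fetched in parallel, the general pattern looks roughly like this. It is a sketch with a stand-in worker rather than the real func, so it runs without network access; run_in_threads and fake_page are hypothetical names.

```python
import threading


def run_in_threads(worker, count):
    """Start `count` threads, each calling worker(index), and wait for all of them."""
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # without join, the main thread could exit before downloads finish


# Stand-in worker; in the script above this role is played by func.
results = []
lock = threading.Lock()


def fake_page(index):
    with lock:  # protect the shared list
        results.append(index * 30)  # same offset that func passes as song_begin


run_in_threads(fake_page, 4)
print(sorted(results))
```

    Passing the callable and its arguments separately (target=worker, args=(i,)) is what lets each call run inside its own thread.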

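  • The first script uses urllib to check the song data and drops songs that cannot be downloaded (because of permission issues and the like)
  • The code does not create the output folders; you can modify it yourself or simply create the folders by hand
  • For more crawler code, see GitHub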
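
    Creating the folders from code takes only a few lines. In this sketch the names song and music are the folders the two scripts write to; ensure_dirs is a hypothetical helper, and a temporary directory stands in for the working directory so the demo leaves nothing behind.

```python
import os
import tempfile


def ensure_dirs(base, names=("song", "music")):
    """Create the output folders under `base` if they do not exist yet."""
    paths = []
    for name in names:
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder is already there
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as base:
    created = ensure_dirs(base)
    print([os.path.basename(p) for p in created])
```

    In the scripts themselves, calling ensure_dirs(".") once before the first download would be enough.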