一步步编写自己的PHP爬取代理IP项目（三）-原创手记-慕课网

上一章节我们讲完了自动加载，现在我们正式进入爬虫核心代码的编写中，首先我们需要先看看整个目录

config.php 这个是我们的配置文件加载文件
ProxyPool.php 这个是爬虫的核心处理文件
Queue.php 这个是队列操作的处理文件
Requests.php 这个是发起请求的处理文件

然后我们在回忆一下入口文件的代码

<?php
require_once __DIR__ . '/autoloader.php';
require_once __DIR__ . '/vendor/autoload.php';

use ProxyPool\core\ProxyPool;

$proxy = new ProxyPool();
$proxy->run();

通过这里可以看到我们使用了core里面ProxyPool的run方法，先来看看ProxyPool的内容吧

<?php

use ProxyPool\core\Requests;  //HTTP请求文件
use ProxyPool\core\Queue;   //队列操作文件

class ProxyPool
{
    private $redis;
    private $httpClient;    
    private $queueObj;
        
    function __construct()
    {
        $redis = new \Redis();
        $redis->connect(config("database.redis_host"), config("database.redis_port"));
        $this->redis = $redis;$this->httpClient = new Requests(['timeout' => 10]);
        $this->queueObj = new Queue();
    }
        
    public function run()
    { 
                echo "start to spider ip...." . PHP_EOL;
                $ip_arr = $this->get_ip(); //获取IP的具体方法
                echo "select IP num: " . count($ip_arr) . PHP_EOL;
            
                echo "start to check ip...." . PHP_EOL;
                $this->check_ip($ip_arr);  //验证IP可用性的方法
                $ip_pool = $this->redis->smembers('ip_pool');  //读取redis中的ip
                echo "end check ip...." . PHP_EOL;  
               
                print_r($ip_pool);  //输出ip数组
                die;
    }
}

其中get_ip方法会爬取两个网站的IP

//获取各大网站代理IP
private function get_ip()
{
    $ip_arr = [];
    $ip_arr = $this->get_xici_ip($ip_arr);      //西刺代理
    $ip_arr = $this->get_kuaidaili_ip($ip_arr); //快代理
    return $ip_arr;
}

我们先来来看看西刺代理的爬取

private function get_xici_ip($ip_arr)
{
    for ($i = 1; $i <= config('spider.page_num'); $i++) 
    {
        list($infoRes, $msg) = $this->httpClient ->request('GET','http://www.xicidaili.com/nn/'.$i,[]);
        if (!$infoRes) 
        {
            print_r($msg);  //输出错误信息
            exit();
        }
        $infoContent = $infoRes->getBody();
        $this->convert_encoding($infoContent);
        preg_match_all('/<tr.*>[\s\S]*?<td class="country">[\s\S]*?<\/td>[\s\S]*?<td>(.*?)<\/td>[\s\S]*?<td>(.*?)<\/td>/', $infoContent, $match);
        
        $host_arr = $match[1];
        $port_arr = $match[2];
        foreach ($host_arr as $key => $value) 
        {
            $ip_arr[] = $host_arr[$key].":".$port_arr[$key];
        }
    }
    return $ip_arr;
}

这个方法里面，我们首先使用 config('spider.page_num') 这个方法读取了配置文件里面定义的爬取页数，我这里定义的是3页，然后我们打开西刺代理的网站，会发现域名是

http://www.xicidaili.com/nn/XX 这个XX是第几页，第一页就是1，第二页就是2，以此类推

所以我们在代码里面循环访问了三次网站，获取到网页的返回值，然后用正则匹配html去获取里面的地址和端口号（具体html元素可以在网站右键点击审查元素查看）

preg_match_all('/<tr.*>[\s\S]*?<td class="country">[\s\S]*?<\/td>[\s\S]*?<td>(.*?)<\/td>[\s\S]*?<td>(.*?)<\/td>/', $infoContent, $match);

然后经过一些处理，将获取到的IP返回。这就是get_xici_ip这个方法做的事情，它就是负责爬取IP。

然后我们来看看

//检测IP可用性
private function check_ip($ip_arr)
{
    $this->queueObj = $this->queueObj->arr2queue($ip_arr);
    $queue = $this->queueObj->getQueue();
    foreach ($queue as $key => $value) 
    {
        //用百度网和腾讯网测试IP地址的可用性
        for ($i=0; $i < config('spider.examine_round'); $i++) 
        {
            $response = $this->httpClient->test_request('GET','https://www.baidu.com', ['proxy' => 'https://'.$value]);
            if (!$response) 
            {
                $response = $this->httpClient->test_request('GET','http://www.qq.com', ['proxy' => 'http://'.$value]);
                if ($response && $response->getStatusCode() == 200) 
                {
                    break;
                }
            }
            else if($response->getStatusCode() == 200)
            {
                break;
            }
        }
        //将结果存入redis   
        if ($response && $response->getStatusCode() == 200) 
        {
            $this->set_ip2redis($value);   
        }    
        else{
            echo $value . " error...  ". PHP_EOL;    
        }
    }
}

这里我们使用了https的百度和http的qq来检测，如果成功访问就把这个IP插入redis中。

这样我们就能做到爬取IP并且校验可用性了。