1. Fetch the page
URL: http://www.meipai.com/medias/hot
public function getContentByFilegetcontents($url)
{
    // Download the raw HTML of the page as a string.
    $content = file_get_contents($url);
    return $content;
}
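This gives us the HTML source of the whole page. The next step is to extract the key information from it: the video URL, the title, and the thumbnail image.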
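The fetch step above returns false silently when the request fails. As a minimal sketch (not part of the original crawler), the same call can be wrapped with a timeout, a User-Agent header, and an explicit error. The function name `fetchPage`, the 10-second default, and the User-Agent string are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: fetch a URL with a timeout and a User-Agent header,
// and throw instead of returning false when the request fails.
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            // Assumed value; some sites reject requests without a User-Agent.
            'user_agent' => 'Mozilla/5.0 (compatible; CuteCrawler/1.0)',
        ],
    ]);

    // file_get_contents() returns false on failure. The @ operator hides the
    // warning so the failure is reported through the exception instead.
    $content = @file_get_contents($url, false, $context);
    if ($content === false) {
        throw new RuntimeException("Failed to fetch: $url");
    }
    return $content;
}
```

Calling `fetchPage("http://www.meipai.com/medias/hot")` would then either return the page source or throw, so the extraction step never receives false by accident.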
2. Extract
Looking at the page source, the markup for each video is concentrated in a block like the following:
<li class="pr no-select loading J_media_list_item" itemscope itemtype="http://schema.org/VideoObject">
  <img src="http://mvimg1.meitudata.com/5733fe57ce7aa8996.jpg!thumb320" width="300" height="300" class="db pa pai" alt="手撕包菜。包菜撕片装洗净备用。热锅入油五花肉下锅煸炒出油,多余的油盛出。放酱油肉上色,盛出。之前的油倒锅内,放蒜辣椒炒香,下包菜继续翻炒,倒适量酱油老抽五香粉。再下之前炒好的五花肉翻炒,放适量盐,出锅前放鸡精淋入适量香醋即可非常香啊,超级下饭。喜欢的点赞奥#美食##家常菜#" itemprop="thumbnail">
  <div id="w517161790" class="content-l-video content-l-media-wrap pr cp" data-id="517161790" data-video="http://mvvideo1.meitudata.com/5734040ae2dec950.mp4">
    <div class="layer-black pa"></div>
    <a hidefocus href="/media/517161790" target="_blank" class="content-l-p pa" title="手撕包菜。包菜撕片装洗净备用。热锅入油五花肉下锅煸炒出油,多余的油盛出。放酱油肉上色,盛出。之前的油倒锅内,放蒜辣椒炒香,下包菜继续翻炒,倒适量酱油老抽五香粉。再下之前炒好的五花肉翻炒,放适量盐,出锅前放鸡精淋入适量香醋即可非常香啊,超级下饭。喜欢的点赞奥#美食##家常菜#">
      <meta itemprop="url" content="/media/517161790">
      <i class="icon icon-item-play"></i>
      <strong class="js-convert-emoji" itemprop="description">哈喇嘎子流成河</strong>
    </a>
  </div>
  <div class="pr" itemscope itemtype="http://schema.org/AggregateRating">
    <a hidefocus href="/user/62299474" class="dbl h48">
      <img src="http://mvavatar2.meitudata.com/5731f6b7bbee979.jpg!thumb60" width="28" height="28" class="avatar m10" title="小优Lucky" alt="小优Lucky">
    </a>
    <p class="content-name pa">
      <a hidefocus href="/user/62299474" class="content-name-a js-convert-emoji" title="小优Lucky" itemprop="author">小优Lucky</a>
    </p>
    <div class="content-like pa" data-id="517161790">
      <i class="icon icon-like"></i>
      <span itemprop="ratingCount">3060</span>
    </div>
    <a hidefocus href="/media/517161790" data-sc="1" target="_blank" class="conten-command pa" data-id="517161790">
      <i class="icon icon-command"></i>
      <span itemprop="reviewCount">100</span>
    </a>
  </div>
</li>
We pull the fields out with regular expressions:
public function extracturl($page)
{
    $matches = array();
    $voide = array();
    $img = array();
    $title = array();
    $list = array();

    // Match every <li> block that wraps a single video entry.
    $pat = "/<li class=\"pr no-select loading J_media_list_item\".*?>.*?<\/li>/ism";
    preg_match_all($pat, $page, $matches, PREG_PATTERN_ORDER);

    for ($i = 0; $i < count($matches[0]); $i++) {
        // Video URL: the data-video attribute.
        $pat1 = "/data-video=\"(.*?)\"/ism";
        preg_match_all($pat1, $matches[0][$i], $voide, PREG_PATTERN_ORDER);
        $myvoide = $voide[1][0];

        // Thumbnail: the first src attribute inside the block.
        $pat2 = "/src=\"(.*?)\"/ism";
        preg_match_all($pat2, $matches[0][$i], $img, PREG_PATTERN_ORDER);
        $myimg = $img[1][0];

        // Title: the text of the <strong class="js-convert-emoji"> element.
        $pat3 = "/<strong class=\"js-convert-emoji\".*?>(.*?)<\/strong>/ism";
        preg_match_all($pat3, $matches[0][$i], $title, PREG_PATTERN_ORDER);
        $mytitle = $title[1][0];

        $list[] = array('voide' => $myvoide, 'title' => $mytitle, 'img' => $myimg);
    }
    return $list;
}
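Regular expressions like the ones above break as soon as the attribute order or class list changes. As an alternative sketch, and not the approach this post uses, the same three fields can be read with PHP's DOMDocument and DOMXPath. The function name `extractVideosWithDom` is hypothetical, the page is assumed to be UTF-8, and the result keeps the same `voide` / `title` / `img` keys so it can replace the regex version directly. Unlike the regex version, it trims the title and skips entries that are missing a field.

```php
<?php
// Sketch: extract video URL, title and thumbnail with DOMXPath instead of regex.
function extractVideosWithDom(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog tells libxml to treat the input as UTF-8 (an assumption).
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select <li> elements whose class list contains J_media_list_item.
    $items = $xpath->query(
        '//li[contains(concat(" ", normalize-space(@class), " "), " J_media_list_item ")]'
    );

    $list = [];
    foreach ($items as $li) {
        $video = $xpath->query('.//*[@data-video]', $li)->item(0);
        $img   = $xpath->query('.//img[@src]', $li)->item(0);
        $title = $xpath->query('.//strong[contains(@class, "js-convert-emoji")]', $li)->item(0);

        // Skip entries that lack any of the three fields.
        if ($video === null || $img === null || $title === null) {
            continue;
        }
        $list[] = [
            'voide' => $video->getAttribute('data-video'),
            'title' => trim($title->textContent),
            'img'   => $img->getAttribute('src'),
        ];
    }
    return $list;
}
```

The trade-off is a dependency on the DOM extension, in exchange for not depending on the exact text of the `class` attribute.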
Complete code:
<?php
class Cutecrawler
{
    public function getContentByFilegetcontents($url)
    {
        // Download the raw HTML of the page as a string.
        $content = file_get_contents($url);
        return $content;
    }

    public function extracturl($page)
    {
        $matches = array();
        $voide = array();
        $img = array();
        $title = array();
        $list = array();

        // Match every <li> block that wraps a single video entry.
        $pat = "/<li class=\"pr no-select loading J_media_list_item\".*?>.*?<\/li>/ism";
        preg_match_all($pat, $page, $matches, PREG_PATTERN_ORDER);

        for ($i = 0; $i < count($matches[0]); $i++) {
            // Video URL: the data-video attribute.
            $pat1 = "/data-video=\"(.*?)\"/ism";
            preg_match_all($pat1, $matches[0][$i], $voide, PREG_PATTERN_ORDER);
            $myvoide = $voide[1][0];

            // Thumbnail: the first src attribute inside the block.
            $pat2 = "/src=\"(.*?)\"/ism";
            preg_match_all($pat2, $matches[0][$i], $img, PREG_PATTERN_ORDER);
            $myimg = $img[1][0];

            // Title: the text of the <strong class="js-convert-emoji"> element.
            $pat3 = "/<strong class=\"js-convert-emoji\".*?>(.*?)<\/strong>/ism";
            preg_match_all($pat3, $matches[0][$i], $title, PREG_PATTERN_ORDER);
            $mytitle = $title[1][0];

            $list[] = array('voide' => $myvoide, 'title' => $mytitle, 'img' => $myimg);
        }
        return $list;
    }
}

$url = "http://www.meipai.com/medias/hot";
$crawler = new Cutecrawler();
$content = $crawler->getContentByFilegetcontents($url);
$c = $crawler->extracturl($content);
var_dump($c);
?>
Final result:
array(24) {
  [0]=>
  array(3) {
    ["voide"]=>
    string(51) "http://mvvideo2.meitudata.com/5737fd5caeb838981.mp4"
    ["title"]=>
    string(27) "老师那些年常说的话"
    ["img"]=>
    string(58) "http://mvimg4.meitudata.com/57384337bd8b59410.jpg!thumb320"
  }
  [1]=>
  array(3) {
    ["voide"]=>
    string(50) "http://mvvideo2.meitudata.com/5737fceabf873602.mp4"
    ["title"]=>
    string(21) "女友突然冷落你"
    ["img"]=>
    string(58) "http://mvimg2.meitudata.com/5736d25d0aa5d8991.jpg!thumb320"
  }
  [2]=>
  array(3) {
    ["voide"]=>
    string(51) "http://mvvideo2.meitudata.com/5737f300131e18596.mp4"
    ["title"]=>
    string(27) "女明星之间的内心戏"
    ["img"]=>
    string(58) "http://mvimg1.meitudata.com/5737059ad66e16795.jpg!thumb320"
  }
  [3]=>
  array(3) {
    ["voide"]=>
    string(51) "http://mvvideo2.meitudata.com/5737eb9d0bfc92046.mp4"
    ["title"]=>
    string(24) "真替老师感到悲剧"
    ["img"]=>
    string(57) "http://mvimg3.meitudata.com/5737ebf503431343.jpg!thumb320"
  }
From here, you can store the results in a database.
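One possible way to do that is sketched below with PDO. The post does not say which database or schema to use, so everything here is an assumption: the function name `saveVideos`, the `videos` table and its columns, and the in-memory SQLite DSN are hypothetical, and `INSERT OR IGNORE` is SQLite-specific syntax.

```php
<?php
// Sketch: persist the extracted list with PDO.
// Table and column names are hypothetical, not from the original post.
function saveVideos(PDO $pdo, array $list): void
{
    $pdo->exec(
        'CREATE TABLE IF NOT EXISTS videos (
            video_url TEXT PRIMARY KEY,
            title     TEXT NOT NULL,
            img_url   TEXT NOT NULL
        )'
    );

    // INSERT OR IGNORE (SQLite syntax) skips rows whose video_url already exists,
    // so re-running the crawler does not create duplicates.
    $stmt = $pdo->prepare(
        'INSERT OR IGNORE INTO videos (video_url, title, img_url)
         VALUES (:video, :title, :img)'
    );

    $pdo->beginTransaction();
    foreach ($list as $item) {
        $stmt->execute([
            ':video' => $item['voide'],
            ':title' => $item['title'],
            ':img'   => $item['img'],
        ]);
    }
    $pdo->commit();
}

// Usage with an in-memory SQLite database; swap the DSN for a real one.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
saveVideos($pdo, [[
    'voide' => 'http://mvvideo2.meitudata.com/5737fd5caeb838981.mp4',
    'title' => '老师那些年常说的话',
    'img'   => 'http://mvimg4.meitudata.com/57384337bd8b59410.jpg!thumb320',
]]);
```

Using the video URL as the primary key is a design choice for this sketch; it makes repeated crawls of the same hot list idempotent.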