網(wǎng)頁趴數(shù)據(jù)php,網(wǎng)站爬數(shù)據(jù)

php的curl怎么爬取網(wǎng)頁內(nèi)容

創(chuàng)建一個新cURL資源

成都創(chuàng)新互聯(lián)公司是一家專注于網(wǎng)站制作、成都網(wǎng)站設(shè)計與策劃設(shè)計,商丘網(wǎng)站建設(shè)哪家好?成都創(chuàng)新互聯(lián)公司做網(wǎng)站,專注于網(wǎng)站建設(shè)十載,網(wǎng)設(shè)計領(lǐng)域的專業(yè)建站公司;建站業(yè)務(wù)涵蓋:商丘等地區(qū)。商丘做網(wǎng)站價格咨詢:13518219792

設(shè)置URL和相應(yīng)的選項

抓取URL并把它傳遞給瀏覽器

關(guān)閉cURL資源，并且釋放系統(tǒng)資源

代碼案例：

PHP抓取網(wǎng)頁指定內(nèi)容

?php

* 如下：方法有點笨

* 抓取網(wǎng)頁內(nèi)容用 PHP 的正則

* 用JS每隔5分鐘刷新當(dāng)前頁面---即重新獲取網(wǎng)頁內(nèi)容

* 注： $mode中--title/title-更改為所需內(nèi)容（如 $mode = "#a(.*)/a#";獲取所有鏈接）

* window.location.href="";中的

* 更改為自己的URL----作用：即刷新當(dāng)前頁面

* setInterval("ref()",300000);是每隔300000毫秒（即 5 * 60 *1000 毫秒即5分鐘）執(zhí)行一次函數(shù) ref()

* print_r($arr);輸出獲得的所有內(nèi)容 $arr是一個數(shù)組可根據(jù)所需輸出一部分（如 echo $arr[1][0];）

* 若要獲得所有內(nèi)容可去掉

* $mode = "#title(.*)/title#";

if(preg_match_all($mode,$content,$arr)){

print_r($arr);

echo "br/";

echo $arr[1][0];

}

再加上 echo $content；

$url = ""; //目標(biāo)站

$fp = @fopen($url, "r") or die("超時");

$content=file_get_contents($url);

$mode = "#title(.*)/title#";

if(preg_match_all($mode,$content,$arr)){

//print_r($arr);

echo "br/";

echo $arr[1][0];

}

script language="JavaScript" type="text/javascript"

function ref(){

window.location.href="";

}

setInterval("ref()",300000);

//--

/script

如何用php 編寫網(wǎng)絡(luò)爬蟲?

pcntl_fork或者swoole_process實現(xiàn)多進程并發(fā)。按照每個網(wǎng)頁抓取耗時500ms，開200個進程，可以實現(xiàn)每秒400個頁面的抓取。

curl實現(xiàn)頁面抓取，設(shè)置cookie可以實現(xiàn)模擬登錄

simple_html_dom 實現(xiàn)頁面的解析和DOM處理

如果想要模擬瀏覽器，可以使用casperJS。用swoole擴展封裝一個服務(wù)接口給PHP層調(diào)用

在這里有一套爬蟲系統(tǒng)就是基于上述技術(shù)方案實現(xiàn)的，每天會抓取幾千萬個頁面。

php獲取網(wǎng)頁源碼內(nèi)容有哪些辦法

可以參考以下幾種方法：

方法一： file_get_contents獲取

span style="white-space:pre"?/span$url="";

span style="white-space:pre"?/span$fh= file_get_contents

('');span style="white-space:pre"?/spanecho $fh;

方法二：使用fopen獲取網(wǎng)頁源代碼

span style="white-space:pre"?/span$url="";

span style="white-space:pre"?/span$handle = fopen ($url, "rb");

span style="white-space:pre"?/span$contents = "";

span style="white-space:pre"?/spanwhile (!feof($handle)) {

span style="white-space:pre"??/span$contents .= fread($handle, 8192);

span style="white-space:pre"?/span}

span style="white-space:pre"?/spanfclose($handle);

span style="white-space:pre"?/spanecho $contents; //輸出獲取到得內(nèi)容。

方法三：使用CURL獲取網(wǎng)頁源代碼

$url="";

$UserAgent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.0.04506; .NET CLR 3.5.21022; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

$curl = curl_init();?//創(chuàng)建一個新的CURL資源

curl_setopt($curl, CURLOPT_URL, $url);?//設(shè)置URL和相應(yīng)的選項

curl_setopt($curl, CURLOPT_HEADER, 0);? //0表示不輸出Header，1表示輸出

curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);?//設(shè)定是否顯示頭信息,1顯示，0不顯示。//如果成功只將結(jié)果返回，不自動輸出任何內(nèi)容。如果失敗返回FALSE

curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);

curl_setopt($curl, CURLOPT_ENCODING, '');?//設(shè)置編碼格式，為空表示支持所有格式的編碼

//header中“Accept-Encoding: ”部分的內(nèi)容，支持的編碼格式為："identity"，"deflate"，"gzip"。

curl_setopt($curl, CURLOPT_USERAGENT, $UserAgent);

curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);

//設(shè)置這個選項為一個非零值(象 “Location: “)的頭，服務(wù)器會把它當(dāng)做HTTP頭的一部分發(fā)送(注意這是遞歸的，PHP將發(fā)送形如 “Location: “的頭)。

$data = curl_exec($curl);

echo $data;

//echo curl_errno($curl); //返回0時表示程序執(zhí)行成功

curl_close($curl);?//關(guān)閉cURL資源，并釋放系統(tǒng)資源

拓展資料

PHP（外文名:PHP: Hypertext Preprocessor，中文名：“超文本預(yù)處理器”）是一種通用開源腳本語言。語法吸收了C語言、Java和Perl的特點，利于學(xué)習(xí)，使用廣泛，主要適用于Web開發(fā)領(lǐng)域。PHP 獨特的語法混合了C、Java、Perl以及PHP自創(chuàng)的語法。它可以比CGI或者Perl更快速地執(zhí)行動態(tài)網(wǎng)頁。

用PHP做出的動態(tài)頁面與其他的編程語言相比，PHP是將程序嵌入到HTML（標(biāo)準(zhǔn)通用標(biāo)記語言下的一個應(yīng)用）文檔中去執(zhí)行，執(zhí)行效率比完全生成HTML標(biāo)記的CGI要高許多；PHP還可以執(zhí)行編譯后代碼，編譯可以達(dá)到加密和優(yōu)化代碼運行，使代碼運行更快。

參考資料：PHP（超文本預(yù)處理器)-百度百科

如何用PHP做網(wǎng)絡(luò)爬蟲

其實用PHP來爬會非常方便，主要是PHP的正則表達(dá)式功能在搜集頁面連接方面很方便，另外PHP的fopen、file_get_contents以及l(fā)ibcur的函數(shù)非常方便的下載網(wǎng)頁內(nèi)容。

具體處理方式就是建立就一個任務(wù)隊列，往隊列里面插入一些種子任務(wù)和可以開始爬行，爬行的過程就是循環(huán)的從隊列里面提取一個URL，打開后獲取連接插入隊列中，進行相關(guān)的保存。隊列可以使用數(shù)組實現(xiàn)。

當(dāng)然PHP作為但線程的東西，慢慢爬還是可以，怕的就是有的URL打不開，會死在那里。

抓取網(wǎng)頁數(shù)據(jù)怎么保存到數(shù)據(jù)庫 php

給一個例子你看看吧.

if($pro_list_contents=@file_get_contents(''))

{

preg_match_all("/td width=\"50%\" valign=\"top\"(.*)td width=\"10\"img src=\"images\/spacer.gif\"/isU", $pro_list_contents, $pro_list_contents_ary);

for($i=0; $icount($pro_list_contents_ary[1]); $i++)

{

preg_match_all("/a href=\"(.*)\"img src=\"(.*)\".*span(.*)\/span/isU", $pro_list_contents_ary[1][$i], $url_img_price);

$url=addslashes($url_img_price[1][0]);

$img=str_replace(' ', '20%', trim(''.$url_img_price[2][0]));

$price=(float)str_replace('$', '', $url_img_price[3][0]);

preg_match_all("/a class=\"ml1\" href=\".*\"(.*)\/a/isU", $pro_list_contents_ary[1][$i], $proname_ary);

$proname=addslashes($proname_ary[1][0]);

include("inc/db_connections.php");

$rs=mysql_query("select * from pro where Url='$url' and CateId='{$cate_row['CateId']}'"); //是否已經(jīng)采集了

if(mysql_num_rows($rs))

{

echo "跳過：{$url}br";

continue;

}

$basedir='/u_file/pro/img/'.date('H/');

$save_dir=Build_dir($basedir); //創(chuàng)建目錄函數(shù)

$ext_name = GetFileExtName( $img ); //取得圖片后輟名

$SaveName = date( 'mdHis' ) . rand( 10000, 99999 ) . '.' . $ext_name;

if( $get_file=@file_get_contents( $img ) )

{

$fp = @fopen( $save_dir . $SaveName, 'w' );

@fwrite( $fp, $get_file );

@fclose( $fp );

@chmod( $save_dir . $SaveName, 0777 );

@copy( $save_dir . $SaveName, $save_dir . 'small_'.$SaveName );

$imgpath=$basedir.'small_'.$SaveName;

}

else

{

$imgpath='';

}

if($pro_intro_contents=@file_get_contents($url))

{

preg_match_all("/\/h1(.*)\/td\/tr/isU", $pro_intro_contents, $pro_intro_contents_ary);

$p_contents=addslashes(str_replace('src="', 'src="', $pro_intro_contents_ary[1][0]));

$p_contents=SaveRemoteImg($p_contents, '/u_file/pro/intro/'.date('H/')); //把遠(yuǎn)程html代碼里的圖片保存到本地

}

$t=time();

mysql_query("insert into pro(CateId, ProName, PicPath_0, S_PicPath_0, Price_0, Contents, AddTime, Url) values('{$cate_row['CateId']}', '$proname', '$imgpath', '$img', '$price', '$p_contents', '$t', '$url')");

echo $url.$img.$cate."br\r\n";

}

新聞標(biāo)題：網(wǎng)頁趴數(shù)據(jù)php,網(wǎng)站爬數(shù)據(jù)
鏈接URL：http://chinadenli.net/article40/dsgcgho.html

成都網(wǎng)站建設(shè)公司_創(chuàng)新互聯(lián)，為您提供網(wǎng)站排名、品牌網(wǎng)站制作、全網(wǎng)營銷推廣、品牌網(wǎng)站建設(shè)、手機網(wǎng)站建設(shè)、網(wǎng)站制作

聲明：本網(wǎng)站發(fā)布的內(nèi)容（圖片、視頻和文字）以用戶投稿、用戶轉(zhuǎn)載內(nèi)容為主，如果涉及侵權(quán)請盡快告知，我們將會在第一時間刪除。文章觀點不代表本網(wǎng)站立場，如需處理請聯(lián)系客服。電話：028-86922220；郵箱：631063699@qq.com。內(nèi)容未經(jīng)允許不得轉(zhuǎn)載，或轉(zhuǎn)載時需注明來源：創(chuàng)新互聯(lián)

猜你還喜歡下面的內(nèi)容