Update: Python can fetch the page source with no CAPTCHA appearing (the Python code is at the very bottom); it's only PHP cURL that refuses to work no matter what I try.
When I scrape Google Scholar search results with PHP's cURL functions, Google's human-verification page keeps coming up. I first assumed Google had restricted my server's IP, so I set up an HTTP proxy on that same server and pointed my own browser at it. Browsing Google Scholar through that proxy, no matter how much I refreshed or searched, never produced a CAPTCHA, so the problem doesn't seem to be the IP. I then added header information to the cURL call, including a User-Agent, but the CAPTCHA still appears; maybe I'm not setting the headers correctly.
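What I haven't been able to confirm is whether cURL really sends the headers the way I set them. A rough way to check (just a debugging sketch, the log file name is arbitrary) is to have cURL keep a copy of the outgoing request and compare it with what the browser sends through the same proxy:

<?php
// Debugging sketch: record exactly what cURL sends, using the same
// proxy and header settings as the function below.
$ch = curl_init('https://scholar.google.com/scholar?hl=zh-CN&q=cell');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);              // keep a copy of the request headers
curl_setopt($ch, CURLOPT_VERBOSE, true);                  // full transfer log for debugging
curl_setopt($ch, CURLOPT_STDERR, fopen('curl_debug.log', 'w'));
// ... same CURLOPT_PROXY / CURLOPT_PROXYUSERPWD / CURLOPT_HTTPHEADER as below ...
curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_HEADER_OUT);              // the request actually sent
curl_close($ch);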
Below is the cURL function I wrote. With this same HTTP proxy configured in my own browser I can browse without any problem, but requesting through the function below throws the human-verification page at me.
I'd be very grateful if a PHP expert could help solve this; I'm willing to pay.
<?php
function get_html($url){
    $header = array();
    $header[] = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";
    $header[] = "Accept-Language:zh-CN,zh;q=0.9,en;q=0.8";
    $header[] = "Cache-Control:no-cache";
    $header[] = "Connection:keep-alive";
    $header[] = "DNT:1";
    $header[] = "Pragma:no-cache";
    $header[] = "Upgrade-Insecure-Requests:1";
    $header[] = "User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36";

    $proxy = "ip:port";                              // same HTTP proxy that works fine in the browser
    $proxyauth = 'user:pass';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxyauth);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);         // body only, no response headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$a = get_html('https://scholar.google.com/scholar?hl=zh-CN&as_sdt=0%2C5&q=cell&btnG=');
echo '<pre>';
print_r($a);
echo '</pre>';
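One difference I can see between the two listings: the Python headers include Accept-Encoding, and requests decompresses the response by itself, while the cURL header list above never mentions encoding. I don't know whether that is related to the CAPTCHA, but the rough cURL equivalent would be:

// Sketch only, not a confirmed fix: have cURL advertise Accept-Encoding and
// decode compressed responses itself (an empty string means every encoding
// this build of cURL supports), the way requests does automatically.
curl_setopt($ch, CURLOPT_ENCODING, '');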
As noted at the top, Python can also fetch the page source, and the robot verification never appears:
import requests

def requests_html(url):
    headers = {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding':'gzip, deflate, sdch',
        'Accept-Language':'zh-CN,zh;q=0.8',
        'Cache-Control':'no-cache',
        'Connection':'keep-alive',
        'DNT':'1',
        'Pragma':'no-cache',
        'Upgrade-Insecure-Requests':'1',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    }
    # proxies = {'http':'127.0.0.1:1080','https':'127.0.0.1:1080'}
    proxies = {"http": "http://user:pass@ip:port"}
    r = requests.get(url, headers=headers, timeout=5, proxies=proxies)
    status_code = r.status_code
    # print status_code
    if status_code != 200:
        pass
    else:
        coding = r.encoding.strip().lower()
        # print coding
        if coding == 'utf-8':
            html = r.content    # raw bytes when the page is already UTF-8
            # print html
            return html
        else:
            html = r.text       # .encode(coding).decode('utf8').encode('utf8')
            # print html
            return html