Update: Python can fetch the page source with no CAPTCHA appearing (the Python code is at the very bottom); it's only PHP cURL that refuses to work no matter what I try.
When I scrape Google Scholar search results with PHP's cURL functions, Google's human-verification page keeps coming up. I first assumed Google had restricted my server's IP, so I set up an HTTP proxy on that same server and pointed my own browser at it. Browsing Google Scholar through that proxy, no matter how much I refreshed or searched, never produced a CAPTCHA, so the problem doesn't seem to be the IP. I then added header information to the cURL call, including a User-Agent, but the CAPTCHA still appears; maybe I'm not setting the headers correctly.
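What I haven't been able to confirm is whether cURL really sends the headers the way I set them. A rough way to check (just a debugging sketch, the log file name is arbitrary) is to have cURL keep a copy of the outgoing request and compare it with what the browser sends through the same proxy:

<?php
// Debugging sketch: record exactly what cURL sends, using the same
// proxy and header settings as the function below.
$ch = curl_init('https://scholar.google.com/scholar?hl=zh-CN&q=cell');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);              // keep a copy of the request headers
curl_setopt($ch, CURLOPT_VERBOSE, true);                  // full transfer log for debugging
curl_setopt($ch, CURLOPT_STDERR, fopen('curl_debug.log', 'w'));
// ... same CURLOPT_PROXY / CURLOPT_PROXYUSERPWD / CURLOPT_HTTPHEADER as below ...
curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_HEADER_OUT);              // the request actually sent
curl_close($ch);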
Below is the cURL function I wrote. With this same HTTP proxy configured in my own browser I can browse without any problem, but requesting through the function below throws the human-verification page at me.
I'd be very grateful if a PHP expert could help solve this; I'm willing to pay.
<?php
function get_html($url){
    $header = array();
    $header[] = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";
    $header[] = "Accept-Language:zh-CN,zh;q=0.9,en;q=0.8";
    $header[] = "Cache-Control:no-cache";
    $header[] = "Connection:keep-alive";
    $header[] = "DNT:1";
    $header[] = "Pragma:no-cache";
    $header[] = "Upgrade-Insecure-Requests:1";
    $header[] = "User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36";

    $proxy = "ip:port";                              // same HTTP proxy that works fine in the browser
    $proxyauth = 'user:pass';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxyauth);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);         // body only, no response headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$a = get_html('https://scholar.google.com/scholar?hl=zh-CN&as_sdt=0%2C5&q=cell&btnG=');
echo '<pre>';
print_r($a);
echo '</pre>';
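One difference I can see between the two listings: the Python headers include Accept-Encoding, and requests decompresses the response by itself, while the cURL header list above never mentions encoding. I don't know whether that is related to the CAPTCHA, but the rough cURL equivalent would be:

// Sketch only, not a confirmed fix: have cURL advertise Accept-Encoding and
// decode compressed responses itself (an empty string means every encoding
// this build of cURL supports), the way requests does automatically.
curl_setopt($ch, CURLOPT_ENCODING, '');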
As noted at the top, Python can also fetch the page source, and the robot verification never appears:
import requests

def requests_html(url):
    headers = {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding':'gzip, deflate, sdch',
        'Accept-Language':'zh-CN,zh;q=0.8',
        'Cache-Control':'no-cache',
        'Connection':'keep-alive',
        'DNT':'1',
        'Pragma':'no-cache',
        'Upgrade-Insecure-Requests':'1',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    }
    # proxies = {'http':'127.0.0.1:1080','https':'127.0.0.1:1080'}
    proxies = {"http": "http://user:pass@ip:port"}
    r = requests.get(url, headers=headers, timeout=5, proxies=proxies)
    status_code = r.status_code
    # print status_code
    if status_code != 200:
        pass
    else:
        coding = r.encoding.strip().lower()
        # print coding
        if coding == 'utf-8':
            html = r.content    # raw bytes when the page is already UTF-8
            # print html
            return html
        else:
            html = r.text       # .encode(coding).decode('utf8').encode('utf8')
            # print html
            return html