Yesterday something happened that left me very frustrated. The database behind my personal site, 业余草 (www.xttblog.com), failed, and nearly 100 articles were lost.
The site's database is only backed up about once a month, so every article from last month, September, is currently gone.
From what I understand of how search engines work, I suspected Google's web cache might still hold copies, and indeed some of the pages survived there, so I crawled those cached snapshots over HTTPS to recover what I could. That brings us to the focus of this article: how to fetch HTTPS page content with an HttpsClient utility.
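For illustration, here is a minimal sketch of how a cache URL can be assembled for a lost article. It assumes the webcache.googleusercontent.com endpoint that Google exposed at the time, and the article path below is a made-up example:

public class CacheUrlBuilder {

    // Build the Google cache URL for an original article URL
    static String cacheUrlFor(String articleUrl) {
        return "https://webcache.googleusercontent.com/search?q=cache:" + articleUrl;
    }

    public static void main(String[] args) {
        // Hypothetical lost article on this site
        System.out.println(cacheUrlFor("www.xttblog.com/archives/123.html"));
    }
}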
Crawler frameworks such as jsoup generally don't handle HTTPS very gracefully, so here I rely on an HttpsClient utility class instead.
Note: if you run my example against an https:// URL and get an "unable to find valid certification path to requested target" or "peer not authenticated" exception, the likely cause is JDK 1.6; try JDK 1.7. If the error persists, re-wrap the HttpClient instance used for crawling, as shown below.
Now let's get into the code.
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

import org.apache.http.client.HttpClient;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;

// 业余草: www.xttblog.com
public class HttpsClient {

    /**
     * Wrap an existing HttpClient so that it trusts any HTTPS server.
     */
    public static DefaultHttpClient getNewHttpsClient(HttpClient httpClient) {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            // A trust manager that performs no certificate checks at all
            X509TrustManager tm = new X509TrustManager() {
                public X509Certificate[] getAcceptedIssuers() {
                    return null;
                }
                public void checkClientTrusted(X509Certificate[] chain, String authType)
                        throws CertificateException {
                }
                public void checkServerTrusted(X509Certificate[] chain, String authType)
                        throws CertificateException {
                }
            };
            ctx.init(null, new TrustManager[] { tm }, null);
            // Also disable hostname verification
            SSLSocketFactory ssf = new SSLSocketFactory(ctx, SSLSocketFactory.ALLOW_ALL_HOSTNAME_VERIFIER);
            SchemeRegistry registry = new SchemeRegistry();
            registry.register(new Scheme("https", 443, ssf));
            ThreadSafeClientConnManager mgr = new ThreadSafeClientConnManager(registry);
            return new DefaultHttpClient(mgr, httpClient.getParams());
        } catch (Exception ex) {
            ex.printStackTrace();
            return null;
        }
    }
}
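A word of caution about this class: the trust manager accepts every certificate and the socket factory skips hostname verification, so connections made this way are wide open to man-in-the-middle attacks. That trade-off is acceptable for a one-off recovery of my own cached pages, but don't let this class anywhere near production code.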
Before crawling, obtain a fresh client by re-wrapping the existing one: httpClient = HttpsClient.getNewHttpsClient(httpClient);
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// 业余草: www.xttblog.com
public class Test {

    public static void main(String[] args) {
        String url = "https://baidu.com";
        String html = getPageHtml(url);
        System.out.println(html);
    }

    /**
     * Fetch the HTML of a page.
     */
    public static String getPageHtml(String currentUrl) {
        HttpClient httpClient = new DefaultHttpClient();
        // Re-wrap the client so it can talk to HTTPS servers
        httpClient = HttpsClient.getNewHttpsClient(httpClient);
        String html = "";
        HttpGet request = new HttpGet(currentUrl);
        HttpResponse response = null;
        try {
            response = httpClient.execute(request);
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                HttpEntity mEntity = response.getEntity();
                html = EntityUtils.toString(mEntity);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return html;
    }
}
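To actually recover articles rather than just print them, the fetched HTML needs to be saved to disk. Here is a minimal sketch of such a recovery loop; it reuses getPageHtml() from the Test class above, and the cache URL in the array is a hypothetical placeholder:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Recover {

    public static void main(String[] args) throws IOException {
        // Cached snapshots of the lost articles (placeholder URL)
        String[] lostUrls = {
            "https://webcache.googleusercontent.com/search?q=cache:www.xttblog.com/archives/123.html"
        };
        for (int i = 0; i < lostUrls.length; i++) {
            String html = Test.getPageHtml(lostUrls[i]);
            if (html != null && !html.isEmpty()) {
                // Write each page to its own file for later cleanup
                Files.write(Paths.get("recovered-" + i + ".html"),
                        html.getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}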
JARs used:
- commons-logging.jar
- httpclient-4.2.5.jar
- httpcore-4.2.4.jar
The code above was tested successfully on JDK 1.7.
The source code and JARs can be downloaded here; import them into Eclipse and the project will run as-is.
Original source: 业余草 » Fetching HTTPS Page Content with HttpsClient (使用HttpsClient抓取https网页内容)