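Someone in a chat group recently built a crawler on HttpClient and found it slow: each request took a few hundred milliseconds. They asked whether it could be optimized, so here is an overview of how to make HttpClient handle high concurrency.
To raise HttpClient's request throughput, you need to reduce the time each request takes. In practice that comes down to two things:
use an HttpClient connection pool
use keep-alive so that connections are reused
With that in mind, let's write an HttpClient example built for high concurrency. We use HttpClient 4, specifically version 4.5.6. The steps are as follows: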
1) Create a connection pool
Creating the pool is simple: set the total connection limit and the per-route limit. Sample code:
    private static PoolingHttpClientConnectionManager connectionManager = null;

    static {
        connectionManager = new PoolingHttpClientConnectionManager();
        // Upper bound on connections across all routes
        connectionManager.setMaxTotal(500);
        // For example, at most 50 concurrent connections per route by default; tune this for your workload
        connectionManager.setDefaultMaxPerRoute(50);
    }
2) Create a keep-alive strategy
The keep-alive duration has to be defined. Setting it to 60 seconds is usually fine. Sample code:
    private static ConnectionKeepAliveStrategy keepAliveStrategy = new ConnectionKeepAliveStrategy() {
        @Override
        public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
            return 60 * 1000; // keep-alive duration of 60 s, expressed in milliseconds
        }
    };
The actual duration can be hard-coded, or it can be read dynamically from the response headers. In most cases it is simply hard-coded.
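For the "read it from the headers" option, the following is a minimal sketch of parsing a `Keep-Alive` header value such as `timeout=5, max=100`. The class `KeepAliveHeader` and its method `parseTimeoutMillis` are hypothetical names introduced here for illustration, not part of HttpClient, and the fallback of 60 seconds simply mirrors the hard-coded value above.

```java
import java.util.Locale;

/** Hypothetical helper (not part of HttpClient): reads the timeout from a Keep-Alive header value. */
final class KeepAliveHeader {

    private KeepAliveHeader() {}

    /**
     * Parses a value such as "timeout=5, max=100" and returns the timeout in
     * milliseconds. Returns defaultMillis when the value is null, has no
     * timeout parameter, or the parameter is not a positive integer.
     */
    static long parseTimeoutMillis(String headerValue, long defaultMillis) {
        if (headerValue == null) {
            return defaultMillis;
        }
        for (String part : headerValue.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length == 2 && kv[0].trim().toLowerCase(Locale.ROOT).equals("timeout")) {
                try {
                    long seconds = Long.parseLong(kv[1].trim());
                    if (seconds > 0) {
                        return seconds * 1000L;
                    }
                } catch (NumberFormatException ignored) {
                    // Malformed number: keep scanning, then fall back to the default.
                }
            }
        }
        return defaultMillis;
    }

    public static void main(String[] args) {
        System.out.println(parseTimeoutMillis("timeout=5, max=100", 60_000L));
        System.out.println(parseTimeoutMillis(null, 60_000L));
    }
}
```

Inside `getKeepAliveDuration`, one way to use it would be to fetch the header with `response.getFirstHeader("Keep-Alive")` and pass its `getValue()` (or `null` when the header is absent) to the helper, so that servers that do not send the header still get the 60-second default.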
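3) Initialize the HttpClient
With the pool and the keep-alive strategy in place, the next step is to initialize the HttpClient. Initialization usually covers settings such as:
timeouts
retries
compression
and so on
We build the HttpClient on top of the connectionManager above. Sample code: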
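    public static CloseableHttpClient getHttpClient() {
        CloseableHttpClient httpClient = HttpClients.custom()
                // Use the connection pool
                .setConnectionManager(connectionManager)
                // Set the keep-alive duration
                .setKeepAliveStrategy(keepAliveStrategy)
                // Retry up to 3 times; "true" also retries requests that were already sent,
                // so only keep it if your requests are idempotent
                .setRetryHandler(new DefaultHttpRequestRetryHandler(3, true))
                // Set the default request config
                .setDefaultRequestConfig(
                        RequestConfig.custom()
                                // Deprecated since 4.4, but still available in 4.5.6
                                .setStaleConnectionCheckEnabled(true)
                                .setContentCompressionEnabled(true)
                                // All three timeouts are in milliseconds; a bare 60 would mean 60 ms.
                                // 60 s is used here to match the keep-alive setting; adjust as needed.
                                .setSocketTimeout(60 * 1000)
                                .setConnectionRequestTimeout(60 * 1000)
                                .setConnectTimeout(60 * 1000)
                                .build())
                .build();
        return httpClient;
    }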
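One detail worth spelling out: the timeout setters on `RequestConfig` in HttpClient 4.x take `int` values in milliseconds, so it is easy to pass a number that was meant as seconds. The following is a small sketch of a conversion helper; the class name `Timeouts` and the method `millis` are hypothetical and exist only for this illustration.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of HttpClient) for building millisecond timeout values. */
final class Timeouts {

    private Timeouts() {}

    /**
     * Converts a duration in seconds to an int number of milliseconds.
     * Throws ArithmeticException if the result does not fit in an int.
     */
    static int millis(long seconds) {
        return Math.toIntExact(TimeUnit.SECONDS.toMillis(seconds));
    }

    public static void main(String[] args) {
        System.out.println(millis(60));
    }
}
```

With such a helper, a call like `.setSocketTimeout(Timeouts.millis(60))` states the intended unit at the call site.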
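4) Close connections that are no longer needed
Normally we close an HttpClient as soon as we are done with it. Here the connections live in a pool, so instead we run a separate thread that closes expired connections and any connection that has been idle for more than 30 seconds. Sample code: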
    package com.example.demo.httpclient;

    import java.util.concurrent.TimeUnit;

    import org.apache.http.conn.HttpClientConnectionManager;

    public class ExpireConnectionCloseThread extends Thread {

        private final HttpClientConnectionManager connMgr;
        private volatile boolean shutdown;

        public ExpireConnectionCloseThread(HttpClientConnectionManager connMgr) {
            super();
            this.connMgr = connMgr;
            // Daemon thread, so it does not keep the JVM alive on its own
            setDaemon(true);
        }

        @Override
        public void run() {
            try {
                while (!shutdown) {
                    synchronized (this) {
                        // Wake up every 5 seconds, or earlier when shutdown() is called
                        wait(5000);
                        // Close expired connections
                        connMgr.closeExpiredConnections();
                        // Optionally, close connections
                        // that have been idle longer than 30 sec
                        connMgr.closeIdleConnections(30, TimeUnit.SECONDS);
                    }
                }
            } catch (InterruptedException ex) {
                // Restore the interrupt flag and let the thread terminate
                Thread.currentThread().interrupt();
            }
        }

        public void shutdown() {
            shutdown = true;
            synchronized (this) {
                notifyAll();
            }
        }
    }
Then, when using the pool, start this thread separately so that it watches for idle connections. Sample code:
    new ExpireConnectionCloseThread(connectionManager).start();
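As an alternative to a hand-written monitor thread, the same periodic cleanup could be scheduled with the JDK's `ScheduledExecutorService`. This is only a sketch under stated assumptions: the class `IdleConnectionEvictor` is a hypothetical name, and the cleanup work is passed in as a plain `Runnable` so that the block stays independent of HttpClient.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical alternative to a hand-written monitor thread. */
final class IdleConnectionEvictor {

    // Single daemon thread, so the scheduler does not keep the JVM alive.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "idle-connection-evictor");
                t.setDaemon(true);
                return t;
            });

    /** Runs evictTask repeatedly, waiting periodMillis before the first run and between runs. */
    void start(Runnable evictTask, long periodMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                evictTask.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so log it and continue.
                e.printStackTrace();
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops the scheduler and interrupts a run that is in progress. */
    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

With the pool from step 1, the task passed to `start` would call `connectionManager.closeExpiredConnections()` followed by `connectionManager.closeIdleConnections(30, TimeUnit.SECONDS)`, with a period of 5000 ms to match the thread above. Since HttpClient 4.4, `HttpClientBuilder` also offers `evictExpiredConnections()` and `evictIdleConnections(long, TimeUnit)`, which start a similar background thread for you.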
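5) Avoid closing the stream twice
Since HttpClient 4.3, the official recommendation is to read the response body through a ResponseHandler, which saves one stream-closing step because the client releases the connection for you. We use a ResponseHandler here as well. Sample code: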
    /**
     * Process the result with a handler to avoid closing the stream twice. Usage:
     * String responseBody = httpclient.execute(httpget, responseHandler);
     */
    public static ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
        @Override
        public String handleResponse(final HttpResponse response)
                throws ClientProtocolException, IOException {
            int status = response.getStatusLine().getStatusCode();
            if (status >= 200 && status < 300) {
                HttpEntity entity = response.getEntity();
                return entity != null ? EntityUtils.toString(entity, "UTF-8") : null;
            } else {
                throw new ClientProtocolException("Unexpected response status: " + status);
            }
        }
    };
Finally, the complete HttpClientUtils code is as follows:
    package com.example.demo.httpclient;

    import java.io.IOException;

    import org.apache.http.HttpEntity;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.ResponseHandler;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.conn.ConnectionKeepAliveStrategy;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
    import org.apache.http.protocol.HttpContext;
    import org.apache.http.util.EntityUtils;

    public class HttpClientUtils {

        private static PoolingHttpClientConnectionManager connectionManager = null;

        static {
            connectionManager = new PoolingHttpClientConnectionManager();
            connectionManager.setMaxTotal(500);
            // For example, at most 50 concurrent connections per route by default; tune this for your workload
            connectionManager.setDefaultMaxPerRoute(50);
            // Start the background thread that closes expired and idle connections
            new ExpireConnectionCloseThread(connectionManager).start();
        }

        private static ConnectionKeepAliveStrategy keepAliveStrategy = new ConnectionKeepAliveStrategy() {
            @Override
            public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
                return 60 * 1000; // keep-alive duration of 60 s, expressed in milliseconds
            }
        };

        public static CloseableHttpClient getHttpClient() {
            CloseableHttpClient httpClient = HttpClients.custom()
                    // Use the connection pool
                    .setConnectionManager(connectionManager)
                    // Set the keep-alive duration
                    .setKeepAliveStrategy(keepAliveStrategy)
                    // Retry up to 3 times; "true" also retries requests that were already sent,
                    // so only keep it if your requests are idempotent
                    .setRetryHandler(new DefaultHttpRequestRetryHandler(3, true))
                    // Set the default request config
                    .setDefaultRequestConfig(
                            RequestConfig.custom()
                                    // Deprecated since 4.4, but still available in 4.5.6
                                    .setStaleConnectionCheckEnabled(true)
                                    .setContentCompressionEnabled(true)
                                    // All three timeouts are in milliseconds; a bare 60 would mean 60 ms.
                                    // 60 s is used here to match the keep-alive setting; adjust as needed.
                                    .setSocketTimeout(60 * 1000)
                                    .setConnectionRequestTimeout(60 * 1000)
                                    .setConnectTimeout(60 * 1000)
                                    .build())
                    .build();
            return httpClient;
        }

        /**
         * Process the result with a handler to avoid closing the stream twice. Usage:
         * String responseBody = httpclient.execute(httpget, responseHandler);
         */
        public static ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
            @Override
            public String handleResponse(final HttpResponse response)
                    throws ClientProtocolException, IOException {
                int status = response.getStatusLine().getStatusCode();
                if (status >= 200 && status < 300) {
                    HttpEntity entity = response.getEntity();
                    return entity != null ? EntityUtils.toString(entity, "UTF-8") : null;
                } else {
                    throw new ClientProtocolException("Unexpected response status: " + status);
                }
            }
        };
    }
Using it is straightforward. Here we test against Baidu. Sample code:
    // Requires: import org.apache.http.client.methods.HttpGet;
    private void test() {
        try {
            HttpGet httpGet = new HttpGet("https://www.baidu.com");
            CloseableHttpClient httpclient = HttpClientUtils.getHttpClient();
            // The handler reads the body and lets the client release the connection back to the pool
            String responseBody = httpclient.execute(httpGet, HttpClientUtils.responseHandler);
            System.out.println(Thread.currentThread().getName() + ":" + responseBody);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
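When you run it, the result is printed quickly, and connections are cleaned up automatically, which is quite convenient.
So far I have run this utility class through concurrency tests and soak tests offline without hitting any problems, so you can use it with confidence.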
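For readers who want to run a similar concurrency test themselves, here is one possible sketch of a driver that fires the same request many times from a fixed thread pool. It is not the test the author ran: the class `ConcurrentRequestDriver`, its `run` method, and the thread and request counts are all assumptions made for illustration, and the request is a stub so that the block runs without network access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical load driver: runs the same request many times on a fixed thread pool. */
final class ConcurrentRequestDriver {

    private ConcurrentRequestDriver() {}

    /** Submits totalRequests calls and returns how many completed without throwing. */
    static int run(Callable<String> request, int threads, int totalRequests)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < totalRequests; i++) {
                futures.add(pool.submit(request));
            }
            int succeeded = 0;
            for (Future<String> future : futures) {
                try {
                    future.get(); // blocks until this request has finished
                    succeeded++;
                } catch (ExecutionException e) {
                    // The request threw; count it as a failure and keep going.
                }
            }
            return succeeded;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        // Stub request; replace with a real HTTP call to exercise the pool.
        int ok = run(() -> "stub response", 8, 100);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(ok + " requests succeeded in " + elapsedMs + " ms");
    }
}
```

To exercise the utility class, the stub could be replaced with a lambda that calls `HttpClientUtils.getHttpClient().execute(new HttpGet("https://www.baidu.com"), HttpClientUtils.responseHandler)`. Keeping the thread count at or below the per-route limit of 50 avoids threads queuing for a pooled connection.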