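Someone in our chat group recently built a crawler on top of HttpClient and found it slow: each request took a few hundred milliseconds. They asked whether it could be optimized, so here is an overview of how to make HttpClient handle high concurrency.
To raise HttpClient's request throughput, we need to reduce the time each request takes. The main ways to do that are:
use an HttpClient connection pool; use keep-alive; reuse connections
Based on that, let's write a high-concurrency HttpClient example. We use HttpClient 4, specifically version 4.5.6. The steps are as follows: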
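1) Create a connection pool
Creating the pool is straightforward: the main settings are the total number of connections and the maximum number of concurrent connections per route. Sample code:
private static PoolingHttpClientConnectionManager connectionManager = null;
static {
    connectionManager = new PoolingHttpClientConnectionManager();
    connectionManager.setMaxTotal(500);
    // e.g. at most 50 concurrent connections per route by default; tune this to your workload
    connectionManager.setDefaultMaxPerRoute(50);
}
2) Create a keep-alive strategy
We need to define how long a connection is kept alive; 60 seconds is usually enough. Sample code: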
private static ConnectionKeepAliveStrategy keepAliveStrategy = new ConnectionKeepAliveStrategy() {
    @Override
    public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
        return 60 * 1000; // keep-alive duration: 60 s, expressed in milliseconds
    }
};
The duration can be hard-coded or read dynamically from the response headers; in practice it is usually hard-coded.
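For the dynamic option, the server's Keep-Alive response header (for example "timeout=5, max=100") can be parsed and used in place of a fixed value. The following is a minimal, library-free sketch of that parsing step. The class KeepAliveHeaderParser and the method keepAliveMillis are hypothetical names introduced here for illustration; they are not part of HttpClient, and the 60 s fallback simply mirrors the hard-coded value above.

```java
import java.util.Locale;

/** Hypothetical helper (not part of HttpClient): extracts timeout=N from a Keep-Alive header value. */
final class KeepAliveHeaderParser {

    private KeepAliveHeaderParser() {
    }

    /**
     * @param headerValue   value of the Keep-Alive header, e.g. "timeout=5, max=100"; may be null
     * @param defaultMillis fallback used when no usable timeout is present
     * @return keep-alive duration in milliseconds
     */
    static long keepAliveMillis(String headerValue, long defaultMillis) {
        if (headerValue == null) {
            return defaultMillis;
        }
        for (String part : headerValue.split(",")) {
            String[] pair = part.trim().split("=", 2);
            if (pair.length == 2 && pair[0].trim().toLowerCase(Locale.ROOT).equals("timeout")) {
                try {
                    long seconds = Long.parseLong(pair[1].trim());
                    if (seconds > 0) {
                        return seconds * 1000L; // the header carries seconds
                    }
                } catch (NumberFormatException ignored) {
                    // Malformed number: fall through to the default.
                }
            }
        }
        return defaultMillis;
    }

    public static void main(String[] args) {
        System.out.println(keepAliveMillis("timeout=5, max=100", 60_000L));
        System.out.println(keepAliveMillis(null, 60_000L));
    }
}
```

Inside getKeepAliveDuration, one way to use this would be to call response.getFirstHeader("Keep-Alive") and, when the header is present, pass its getValue() to the helper with 60 * 1000 as the default.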
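3) Initialize HttpClient
With the pool and the keep-alive strategy in place, the next step is to initialize HttpClient. Initialization usually sets a few properties, for example:
timeouts, retries, compression, and so on
We build the HttpClient on top of the connectionManager created above. Sample code: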
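public static CloseableHttpClient getHttpClient() {
    CloseableHttpClient httpClient =
        HttpClients.custom()
            // use the shared connection pool
            .setConnectionManager(connectionManager)
            // keep-alive duration
            .setKeepAliveStrategy(keepAliveStrategy)
            // retry up to 3 times; "true" also retries requests that were already sent,
            // so pass "false" if non-idempotent requests (e.g. POST) must not be repeated
            .setRetryHandler(new DefaultHttpRequestRetryHandler(3, true))
            // default request config
            .setDefaultRequestConfig(
                RequestConfig.custom()
                    .setStaleConnectionCheckEnabled(true) // deprecated since 4.4
                    .setContentCompressionEnabled(true)
                    // these timeouts are in milliseconds; 60 s is assumed here
                    // (a bare 60 would mean 60 ms)
                    .setSocketTimeout(60 * 1000)
                    .setConnectionRequestTimeout(60 * 1000)
                    .setConnectTimeout(60 * 1000)
                    .build()).build();
    return httpClient;
}
4) Close idle connections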
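In everyday use we normally close HttpClient as soon as we are done with it. Here, however, the connections are managed by the pool, so instead we run a separate thread that closes expired connections and connections that have been idle for more than 30 seconds. Sample code: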
package com.example.demo.httpclient;

import java.util.concurrent.TimeUnit;

import org.apache.http.conn.HttpClientConnectionManager;

public class ExpireConnectionCloseThread extends Thread {

    private final HttpClientConnectionManager connMgr;
    private volatile boolean shutdown;

    public ExpireConnectionCloseThread(HttpClientConnectionManager connMgr) {
        super();
        this.connMgr = connMgr;
    }

    @Override
    public void run() {
        try {
            while (!shutdown) {
                synchronized (this) {
                    wait(5000);
                    // Close expired connections
                    connMgr.closeExpiredConnections();
                    // Optionally, close connections
                    // that have been idle longer than 30 sec
                    connMgr.closeIdleConnections(30, TimeUnit.SECONDS);
                }
            }
        } catch (InterruptedException ex) {
            ex.printStackTrace();
        }
    }

    public void shutdown() {
        shutdown = true;
        synchronized (this) {
            notifyAll();
        }
    }
}
Then, when setting up the pool, we start this thread once to watch for idle connections. Sample code:
new ExpireConnectionCloseThread(connectionManager).start();
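As an alternative to a hand-written wait() loop, the same periodic cleanup can be driven by a ScheduledExecutorService running on a daemon thread, so the cleanup thread does not keep the JVM alive at exit. This is a sketch of a swapped-in technique, not the approach used in this article. The class name IdleConnectionEvictor is hypothetical, and the cleanup action is passed in as a plain Runnable so the sketch has no dependency on HttpClient.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a cleanup task periodically on a daemon thread. */
final class IdleConnectionEvictor {

    private final ScheduledExecutorService scheduler;

    IdleConnectionEvictor(Runnable cleanupTask, long period, TimeUnit unit) {
        scheduler = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "idle-connection-evictor");
            thread.setDaemon(true); // do not keep the JVM alive just for cleanup
            return thread;
        });
        // An uncaught exception would cancel all later runs, so catch and report it.
        Runnable safeTask = () -> {
            try {
                cleanupTask.run();
            } catch (RuntimeException e) {
                System.err.println("cleanup failed: " + e);
            }
        };
        scheduler.scheduleWithFixedDelay(safeTask, period, period, unit);
    }

    /** Stops the periodic cleanup. */
    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

With the pool from this article, the task passed in would call connectionManager.closeExpiredConnections() followed by connectionManager.closeIdleConnections(30, TimeUnit.SECONDS), with a period of 5 seconds to match the thread above.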
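5) Avoid closing the stream twice
Since HttpClient 4.3, the official recommendation is to read the response body through a ResponseHandler. This saves one explicit stream close, because HttpClient releases the connection itself once the handler returns. We use a ResponseHandler here as well. Sample code: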
/**
 * Handles the result through a handler to avoid closing the stream twice. Usage:
 * String responseBody = httpclient.execute(httpget, responseHandler);
 */
public static ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
    @Override
    public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
        int status = response.getStatusLine().getStatusCode();
        if (status >= 200 && status < 300) {
            HttpEntity entity = response.getEntity();
            return entity != null ? EntityUtils.toString(entity, "UTF-8") : null;
        } else {
            throw new ClientProtocolException("Unexpected response status: " + status);
        }
    }
};
Finally, the complete HttpClientUtils code is as follows:
package com.example.demo.httpclient;
import java.io.IOException;
import org.apache.http.HeaderElement;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.conn.ConnectionKeepAliveStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.protocol.HTTP;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;
public class HttpClientUtils {
    private static PoolingHttpClientConnectionManager connectionManager = null;
    static {
        connectionManager = new PoolingHttpClientConnectionManager();
        connectionManager.setMaxTotal(500);
        // e.g. at most 50 concurrent connections per route by default; tune this to your workload
        connectionManager.setDefaultMaxPerRoute(50);
        new ExpireConnectionCloseThread(connectionManager).start();
    }
    private static ConnectionKeepAliveStrategy keepAliveStrategy = new ConnectionKeepAliveStrategy() {
        @Override
        public long getKeepAliveDuration(HttpResponse response, HttpContext context) {
            return 60 * 1000; // keep-alive duration: 60 s, expressed in milliseconds
        }
    };
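    public static CloseableHttpClient getHttpClient() {
        CloseableHttpClient httpClient =
            HttpClients.custom()
                // use the shared connection pool
                .setConnectionManager(connectionManager)
                // keep-alive duration
                .setKeepAliveStrategy(keepAliveStrategy)
                // retry up to 3 times; "true" also retries requests that were already sent,
                // so pass "false" if non-idempotent requests (e.g. POST) must not be repeated
                .setRetryHandler(new DefaultHttpRequestRetryHandler(3, true))
                // default request config
                .setDefaultRequestConfig(
                    RequestConfig.custom()
                        .setStaleConnectionCheckEnabled(true) // deprecated since 4.4
                        .setContentCompressionEnabled(true)
                        // these timeouts are in milliseconds; 60 s is assumed here
                        // (a bare 60 would mean 60 ms)
                        .setSocketTimeout(60 * 1000)
                        .setConnectionRequestTimeout(60 * 1000)
                        .setConnectTimeout(60 * 1000)
                        .build()).build();
        return httpClient;
    }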
    /**
     * Handles the result through a handler to avoid closing the stream twice. Usage:
     * String responseBody = httpclient.execute(httpget, responseHandler);
     */
    public static ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
        @Override
        public String handleResponse(final HttpResponse response) throws ClientProtocolException, IOException {
            int status = response.getStatusLine().getStatusCode();
            if (status >= 200 && status < 300) {
                HttpEntity entity = response.getEntity();
                return entity != null ? EntityUtils.toString(entity, "UTF-8") : null;
            } else {
                throw new ClientProtocolException("Unexpected response status: " + status);
            }
        }
    };
}
Using it is simple. Here we test against Baidu. Sample code:
// requires: import org.apache.http.client.methods.HttpGet;
private void test() {
    try {
        HttpGet httpGet = new HttpGet("https://www.baidu.com");
        CloseableHttpClient httpclient = HttpClientUtils.getHttpClient();
        String responseBody = httpclient.execute(httpGet, HttpClientUtils.responseHandler);
        System.out.println(Thread.currentThread().getName() + ":" + responseBody);
    } catch (Exception e) {
        e.printStackTrace();
    }
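}
When this runs, the result is printed quickly, and connections are released and closed automatically without any explicit close call, which is quite convenient.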
So far this utility class has held up in my offline concurrency and soak tests without problems, so you can use it with confidence.
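A concurrency test of this kind can be driven by a fixed thread pool that fires many requests at once and counts how many succeed. The sketch below is only an illustration, not the test actually used for this article. ConcurrentLoadTest and run are hypothetical names, and the request is passed in as a Callable so the harness stays independent of HttpClient; the main method uses a stub in place of a real HTTP call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical harness: runs one task many times on a fixed number of threads. */
final class ConcurrentLoadTest {

    private ConcurrentLoadTest() {
    }

    /** Runs the task "requests" times on "threads" workers and returns how many calls succeeded. */
    static int run(Callable<String> task, int threads, int requests) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int succeeded = 0;
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                futures.add(pool.submit(task));
            }
            for (Future<String> future : futures) {
                try {
                    future.get(); // blocks until this call has finished
                    succeeded++;
                } catch (ExecutionException e) {
                    // The task threw an exception: count it as a failed request.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
        } finally {
            pool.shutdown();
        }
        return succeeded;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int ok = run(() -> "stub response", 8, 100); // stub stands in for a real request
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(ok + " of 100 calls succeeded in " + elapsedMs + " ms");
    }
}
```

To exercise the utility class from this article, the stub would be replaced by a lambda that calls HttpClientUtils.getHttpClient().execute(new HttpGet("https://www.baidu.com"), HttpClientUtils.responseHandler). Keeping the thread count at or below the per-route limit of 50 avoids requests queueing for a pooled connection.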

