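b******d Posts: 794 | 1 A few days ago, with help from the experts here, I built a web crawler.
It runs 12 threads. On my local desktop (i7-2600, 12 GB RAM, SSD) it does fine: one query takes a little over 2 minutes. After I deployed it to a VPS (CPU unknown, only 1 GB of RAM, since anything bigger is too expensive for me), it became very slow.
Could anyone advise how to optimize the program? |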
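g*****g Posts: 34805 | 2 Print some timestamps to see which part is slow.
【 b******d wrote 】 : A few days ago, with help from the experts here, I built a web crawler. : It runs 12 threads. On my local desktop (i7-2600, 12 GB RAM, SSD) it does fine: one query takes a little over 2 minutes. : After I deployed it to a VPS (CPU unknown, only 1 GB of RAM, since anything bigger is too expensive for me), it became very slow. : Could anyone advise how to optimize the program?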
|
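A minimal sketch of the timestamp suggestion in post 2, assuming Java (the thread later mentions the JVM and htmlunit). `StageTimer` and the stage names are hypothetical and not taken from the original program. It logs a timestamp, the thread name, and the elapsed time for each stage, and keeps per-stage totals so that a slow stage stands out even when 12 threads interleave their output:

```java
import java.time.LocalTime;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper: accumulates wall-clock time per named stage across threads.
final class StageTimer {
    private final Map<String, LongAdder> nanosByStage = new ConcurrentHashMap<>();

    // Runs the task, logs how long it took, and adds the time to the stage total.
    void time(String stage, Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            long elapsed = System.nanoTime() - start;
            nanosByStage.computeIfAbsent(stage, k -> new LongAdder()).add(elapsed);
            System.out.printf("%s [%s] %s took %d ms%n",
                    LocalTime.now(), Thread.currentThread().getName(),
                    stage, elapsed / 1_000_000);
        }
    }

    // Total time spent in a stage so far, summed over all threads.
    long totalMillis(String stage) {
        LongAdder adder = nanosByStage.get(stage);
        return adder == null ? 0 : adder.sum() / 1_000_000;
    }
}
```

Each worker thread would wrap its steps, for example `timer.time("fetch", () -> ...)` and `timer.time("parse", () -> ...)`, and the totals can be printed once the query finishes.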
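b******d Posts: 794 | 3 It's multi-threaded, so I doubt timestamps would show much. The program does log output everywhere, though, and the exceptions are mainly heap errors: out of memory. After I upgraded the VPS to 4 GB of RAM it became fast, but the monthly fee is more than ten times that of the 512 MB plan.
【 g*****g wrote 】 : Print some timestamps to see which part is slow.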
|
r*****l Posts: 2859 | 4 1. Check if the CPU is fully loaded.
2. Check if memory is used up.
From your symptoms, it looks like a memory issue. Use jmap to dump the heap and jhat to analyze memory usage.
【 b******d wrote 】 : It's multi-threaded, so I doubt timestamps would show much. The program does log output everywhere, though, and the exceptions are mainly heap errors: out of memory. : After I upgraded the VPS to 4 GB of RAM it became fast, but the monthly fee is more than ten times that of the 512 MB plan.
|
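The two checks in post 4 can also be approximated from inside the program. A rough sketch, assuming Java's standard `java.lang.management` API; the class name `HeapReport` is hypothetical. It reports the system load average next to the core count, and heap used versus the maximum heap. It does not replace a jmap heap dump, which is what shows which objects are holding the memory:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: one-line CPU and heap snapshot for periodic logging.
final class HeapReport {
    private static final long MB = 1024L * 1024L;

    // Heap currently in use by this JVM, in megabytes.
    static long usedHeapMb() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed() / MB;
    }

    static String snapshot() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() returns -1 when the maximum heap size is undefined.
        long max = heap.getMax();
        String maxText = max < 0 ? "undefined" : (max / MB) + " MB";
        // The load average is negative on platforms where it is unavailable.
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        int cores = Runtime.getRuntime().availableProcessors();
        return String.format("load=%.2f cores=%d heapUsed=%d MB heapMax=%s",
                load, cores, heap.getUsed() / MB, maxText);
    }
}
```

Logging `HeapReport.snapshot()` every few seconds during a query would show whether heap use climbs toward the maximum before the out-of-memory errors appear.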
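b******d Posts: 794 | 5 Thanks.
It should be a memory problem. With 512 MB, memory is eaten up right away; with 4 GB, typical usage is about 1.4 GB (including the operating system, as reported by free -m).
But the vast majority of the memory work is done by htmlunit; all I do is start 12 processes. If htmlunit itself is memory-hungry (it's like opening 12 browsers that don't share cookies), then is there little I can do to control this?
【 r*****l wrote 】 : 1. Check if the CPU is fully loaded. : 2. Check if memory is used up. : From your symptoms, it looks like a memory issue. Use jmap to dump the heap and jhat to analyze memory usage.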
|
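g*****g Posts: 34805 | 6 See if you can switch to Google App Engine.
【 b******d wrote 】 : It's multi-threaded, so I doubt timestamps would show much. The program does log output everywhere, though, and the exceptions are mainly heap errors: out of memory. : After I upgraded the VPS to 4 GB of RAM it became fast, but the monthly fee is more than ten times that of the 512 MB plan.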
|
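g*****g Posts: 34805 | 7 12 processes? The JVMs alone would eat up the memory at once. You should of course run a single JVM and start threads inside it.
【 b******d wrote 】 : Thanks. : It should be a memory problem. With 512 MB, memory is eaten up right away; with 4 GB, typical usage is about 1.4 GB (including the operating system, as reported by free -m). : But the vast majority of the memory work is done by htmlunit; all I do is start 12 processes. If htmlunit itself is memory-hungry (it's like opening 12 browsers that don't share cookies), then is there little I can do to control this?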
|
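A sketch of the one-JVM, many-threads layout from post 7, assuming a fixed-size pool from `java.util.concurrent`. `BoundedCrawler` and the `fetcher` function are hypothetical stand-ins for the real page-fetching code (for example, the htmlunit calls). The pool size caps how many pages are in flight at once, which is the main knob for trading speed against memory:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical wrapper: fetches all URLs inside one JVM with a bounded thread pool.
final class BoundedCrawler {
    static List<String> fetchAll(List<String> urls, int threads,
                                 Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            // Collect results in the same order as the input URLs.
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("fetch failed", e.getCause());
                }
            }
            return results;
        } finally {
            // Stops the worker threads; cancels leftover tasks if a fetch failed.
            pool.shutdownNow();
        }
    }
}
```

On a small VPS, lowering `threads` from 12 to a smaller number is a cheap experiment: the query takes longer, but fewer pages are held in memory at the same time.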
b******d Posts: 794 | 8 Is it like an auto-sized VPS? Then I'd still have to pay for more computing power and memory if my program is not efficient.
【 g*****g wrote 】 : See if you can switch to Google App Engine.
|
b******d Posts: 794 | 9 Sorry, I meant threads, but it's still very heavy, and my design target is to have 10 of those programs running together without hurting efficiency. However, the requests have to be processed in parallel so that the waiting time stays acceptable.
【 g*****g wrote 】 : 12 processes? The JVMs alone would eat up the memory at once. You should of course run a single JVM and start threads inside it.
|
g*****g Posts: 34805 | 10 I think you may want to consider AWS instead, and use it only when you need it. You can pick an instance big enough for your needs and shut it down once you're finished.
【 b******d wrote 】 : Sorry, I meant threads, but it's still very heavy, and my design target is to have 10 of those programs running together without hurting efficiency. : However, the requests have to be processed in parallel so that the waiting time stays acceptable.
|
b******d Posts: 794 | 11 Thanks, but unfortunately I can't shut it down, because it is supposed to run every few minutes to check the latest information from 8am to 9pm. I guess that could cost a lot more than running a 4 GB VPS all day :(
【 g*****g wrote 】 : I think you may want to consider AWS instead, and use it only when you need it. You can pick an instance big enough for your needs and shut it down once you're finished.
|
b******d Posts: 794 | 12 Brother Bug, one more thing: I'd like to implement a periodically executed task in a web app. The interval should be configurable, and it should be possible to start or stop the task from a web page. Is there a good framework for this?
【 g*****g wrote 】 : I think you may want to consider AWS instead, and use it only when you need it. You can pick an instance big enough for your needs and shut it down once you're finished.
|
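e*****t Posts: 1005 | 13 I would recommend using Spring. You can use Spring's own implementation or layer it on top of Quartz.
【 b******d wrote 】 : Brother Bug, one more thing: I'd like to implement a periodically executed task in a web app. The interval should be configurable, and it should be possible to start or stop the task from a web page. : Is there a good framework for this?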
|
g*****g Posts: 34805 | 14 For a simple job, TimerTask; for a more complicated one, Quartz. Both can be wired through Spring. You can expose the bean as a web service where you set the parameters you need. Add a boolean that is checked every time the job is triggered, and you have your stop/start.
【 b******d wrote 】 : Brother Bug, one more thing: I'd like to implement a periodically executed task in a web app. The interval should be configurable, and it should be possible to start or stop the task from a web page. : Is there a good framework for this?
|
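The simpler option in post 14 (a TimerTask plus a boolean checked on every trigger) might look like the sketch below. It uses plain `java.util.Timer` with no Spring wiring or web service layer, and `ToggleableJob` is a hypothetical name. The timer keeps firing at the configured interval, and the flag decides whether the work actually runs:

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: a periodic job that can be switched on and off at runtime.
final class ToggleableJob {
    private final AtomicBoolean enabled = new AtomicBoolean(false);
    private final Timer timer = new Timer("toggleable-job", true); // daemon thread

    ToggleableJob(Runnable work, long intervalMillis) {
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                if (!enabled.get()) {
                    return; // stopped: skip this trigger
                }
                try {
                    work.run();
                } catch (RuntimeException e) {
                    // An uncaught exception would kill the Timer thread for good.
                    System.err.println("job failed: " + e);
                }
            }
        }, 0, intervalMillis);
    }

    void start() { enabled.set(true); }

    void stop() { enabled.set(false); }

    // Cancels the timer permanently; the job cannot be restarted afterwards.
    void shutdown() { timer.cancel(); }
}
```

A web endpoint would then only need to call `start()` or `stop()` on a shared instance. Changing the interval at runtime would mean cancelling and rescheduling the task, which this sketch leaves out.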
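l*******s Posts: 1258 | 29 I'm using htmlunit for a project right now, and memory usage doesn't seem as high as you describe. Of course, what I'm building is a scraper, so its workload may be much smaller than your crawler's.
You might try upgrading to a newer version; htmlunit has quite a few bugs, and they're fixing them all the time. |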