I*****y 发帖数: 6402 | 1 I've followed this tutorial http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html to install Nutch on top of Solr.
The installation of Solr was successful, and the installation of Nutch was done exactly as described in the tutorial. Finally, when I run './bin/crawl.sh' in the Nutch installation directory, the message is: Usage is "crawl.sh basedir".
The question is: how do I know whether Nutch is crawling or not? I don't see any index data in the Solr installation directory |
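A quick way to check whether the crawl actually pushed anything into Solr is to query the core directly. A minimal sketch with the SolrJ client; the URL and core name "nutch" are assumptions (adjust to your installation), and this uses a modern SolrJ API rather than the Solr version from the 2007 tutorial:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SolrIndexCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical Solr core; change to wherever your Nutch crawl indexes into.
            String url = "http://localhost:8983/solr/nutch";
            try (SolrClient solr = new HttpSolrClient.Builder(url).build()) {
                SolrQuery q = new SolrQuery("*:*");
                q.setRows(0); // we only need the total hit count, not the documents
                QueryResponse rsp = solr.query(q);
                System.out.println("documents indexed: " + rsp.getResults().getNumFound());
            }
        }
    }

A count greater than zero means both the crawl and the Solr indexing step ran; zero usually means the crawl never got past the usage message above, since crawl.sh needs the basedir argument.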
|
b******y 发帖数: 9224 | 2 Just curious if anyone has used Nutch before? I've used it in the past and analyzed the code a lot, but that was v0.7.2. Much has changed since then, I guess, but the basics remain the same.
However, the Nutch tutorial is not up to date; that's the drawback of a fast-moving project, I guess. Anyway, if anyone has used Nutch, let me know, I'd like to ask some questions... orz |
|
g********g 发帖数: 2172 | 3 Lucene is an index engine only. Nutch is a web crawler; the crawled results are indexed with Lucene. So they are different products. Indeed used Lucene as the index engine but built their own crawler. Nutch is a general-purpose search engine crawler; it is too much work to modify it into a vertical-search crawler. |
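To make the division of labor concrete: Lucene on its own only builds and queries an index over documents you hand it; something else (Nutch, or your own crawler) has to fetch the pages. A minimal, self-contained sketch against a modern Lucene API (8.x/9.x, so newer than what this thread's Nutch versions bundled):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneOnly {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory(); // in-memory index, just for the demo
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // The "page" here is hard-coded; a crawler would supply this content.
                doc.add(new TextField("body", "Nutch fetches pages, Lucene indexes them", Field.Store.YES));
                writer.addDocument(doc);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(
                        new QueryParser("body", new StandardAnalyzer()).parse("lucene"), 10);
                System.out.println("matching docs: " + hits.totalHits);
            }
        }
    }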
|
|
b******y 发帖数: 9224 | 5 Thanks, but all I need is a reliable crawler, and looking around I didn't find a good one other than Nutch.
There is one called larbin, but it is in C++ and a one-man show. There is another one called Heritrix, but it is more for archival purposes.
Anyway, Nutch seems OK for now. |
|
b******y 发帖数: 9224 | 6 Good write-up.
Nutch is not good at all for a production environment; it is good for playing with.
To build a truly scalable crawler for a vertical market, you have to do it yourself. |
|
n********s 发帖数: 144 | 7 I get an "agent not configured" error.
Can the agent name be anything? What about the other fields? Are they set in nutch-site.xml, or in some other file?
Nutch 0.9 |
|
i***c 发帖数: 301 | 8 An index generated by Lucene.Net searches fine,
but the index produced by a Nutch crawl seems to have a different structure. How do I search it with Lucene.Net?
Or is it that Lucene.Net's index format is an older version? |
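One way to see why the two indexes behave differently is to dump the fields each index actually contains. A sketch using the Java Lucene API (a modern version; note that Lucene and Lucene.Net can only read index formats from sufficiently close versions, which may itself be the problem here):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.FieldInfo;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.store.FSDirectory;

    public class DumpIndexFields {
        public static void main(String[] args) throws Exception {
            // args[0]: path to an existing Lucene index directory
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
                System.out.println("documents: " + reader.numDocs());
                for (LeafReaderContext ctx : reader.leaves()) {      // one entry per index segment
                    for (FieldInfo fi : ctx.reader().getFieldInfos()) {
                        System.out.println("field: " + fi.name);
                    }
                }
            }
        }
    }

A Nutch-built index typically carries fields such as url, content, anchor and title, while a hand-built Lucene.Net index has whatever fields you defined, so the searches have to target the right field names.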
|
w******f 发帖数: 620 | 9 Yeah, maybe mobile users have a different user experience, which requires a customized server-side implementation.
BTW, I like your search site. Do you integrate Nutch and Solr together, letting Nutch do the crawling, fetching and indexing, and Solr do the search? |
|
w*****e 发帖数: 748 | 10 Heritrix and Nutch are both good and can crawl large amounts of content. Setup and use are fairly simple, and many small companies use these two.
There is also web-harvest, which supports more complex queries, such as scraping forums and blogs, and is quite convenient. But its configuration is practically a small language in itself; anyone with a bit of programming background is better off taking Jspider or Nutch and modifying it. |
|
p*****2 发帖数: 21240 | 11 Big data may be the buzzword that tech and VC circles care about most right now. It seems that any internet company, or even traditional-industry company, that can't claim some connection to big data has no future.
Where there is a buzzword, there is a bubble. And underneath the bubble, some of the companies that are genuinely creating and mining value from big data, startups in particular, are not necessarily well known.
Here are some companies that may still be in "stealth mode", currently busy with the final finishing touches before presenting game-changing technology to the world.
Many of these big-data startups have teams drawn from giants like Google and Facebook. Some of them focus on analytics, some on in-memory databases, and still others are working hard on NoSQL (non-relational) database technology.
The US tech blog Business Insider has listed 14 rising big-data startups. Their businesses and models may be worth studying for anyone in China following big-data entrepreneurship; behind each of these companies stands a deep-pocketed VC.
"Big data is interesting because it will be a major investment area for many years to come. The big-data wave will last a long time, not end after 18 or 24 months," Ping Li, a general partner at venture firm Accel Partners, has said.
The growth momentum of these already-funded big-data startups is... [post truncated] |
|
d********w 发帖数: 363 | 12 Which are the hottest tech startups in Silicon Valley?
In Silicon Valley people talk about startups and opportunities with great enthusiasm, and through my own observation and experience I have seen quite a few hot startups emerge in recent years. Here is a list for you: the Wall Street Journal's ranking of startup funding worldwide (http://graphics.wsj.com/billion-dollar-club/). Its original title was the "billion startup club"; I shared it in a talk in China last year, and in less than a year, as of January 17, 2015, the rankings and scale have changed a great deal. First, seven companies have reached a valuation of 10 billion, whereas a year earlier there were none. Second, number one is Xiaomi, a household name in China. Third, the vast majority of the top 20 (80% are in the US, in California, in Silicon Valley, in San Francisco!), for example Uber, Airbnb, Dropbox, Pinterest. Fourth, there are also quite a few successes with similar models; for example, Flipkart is the Taobao of the Indian market, and Uber and Airbnb both belong to the sharing economy. So you can still look for big opportunities in mobile (Uber), big data (Palantir), consumer internet, messaging (Snapchat), payments (Square), and O2O apps. Many of these companies I have personally... [post truncated] |
|
发帖数: 1 | 13 New opening
Lead Data Scientist
Job Description
We are developing a large data platform for mobile advertising. This is a
great opportunity for an outstanding candidate to build the core
intellectual property on our latest product from the ground up. The position
will focus on building predictive models with Hadoop. If you want to
work on bleeding-edge technology, handling hundreds of millions of
transactions a day, this may be the opportunity for you!
What You Need For This Position
PhD ... [post truncated] |
|
发帖数: 1 | 14 Another position opening!!!
Title :Lead Data Scientist, Ad Team
SAN MATEO, CA ENGINEERING FULL-TIME
We are developing a large data platform for mobile advertising. This is a
great opportunity for an outstanding candidate to build the core
intellectual property on our latest product from the ground up. The position
will focus on building predictive models with Hadoop. If you want to
work on bleeding-edge technology, handling hundreds of millions of
transactions a day, this may be the oppor... [post truncated] |
|
|
n**a 发帖数: 12 | 16 Hello,
Amazon.com is looking for experienced engineers with a MapReduce/Hadoop/Lucene and distributed, scalable systems background. Please send your resumes to
n******[email protected]
Many positions open, location- Seattle, WA
Job description: SDE
Software Dev Engineer, Product Ads
Product Ads is a high-profile, strategic business unit, with support and
interest from all parts of Amazon and top management. We are a highly
motivated, collaborative and fun-loving team building a high-growth business. ... [post truncated] |
|
I*****y 发帖数: 6402 | 17 If I want to build a site like indeed.com, or like iloveOPT's myvisajobs.com, which search engine is better? It seems indeed.com has said publicly that it uses Lucene. |
|
b******y 发帖数: 9224 | 18 ya, indeed.com uses lucene |
|
|
|
b******y 发帖数: 9224 | 21 good, thanks for the info |
|
w****n 发帖数: 48 | 22 Enterprise search engine: Solr, based on Lucene.
Good crawler: Heritrix.
These are so far the best tools for building a search engine. Many commercial sites, including some big companies, use the combination of the two. |
|
b******y 发帖数: 9224 | 23 Unfortunately, when building a search engine, the crawler is the hardest part; search is relatively easy.
You get all sorts of crappy HTML pages and all sorts of crappy websites to handle... |
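One thing that takes some of the pain out of the crappy-HTML problem is a lenient parser. As an illustration (jsoup is not mentioned in this thread, it is just one tolerant HTML parser for Java), a sketch that repairs deliberately broken markup and still extracts the links:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class MessyHtmlDemo {
        public static void main(String[] args) {
            // Deliberately broken markup: unclosed tags, bare ampersand, stray </div>.
            String crappy = "<html><body><p>Jobs & stuff<p>More"
                    + "<a href=/listing?id=1>listing</a></div>";
            // jsoup repairs this into a well-formed DOM instead of choking on it.
            Document doc = Jsoup.parse(crappy, "http://example.com/");
            System.out.println(doc.text());
            for (Element a : doc.select("a[href]")) {
                // absUrl resolves the relative href against the base URL given above
                System.out.println("link -> " + a.absUrl("href"));
            }
        }
    }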
|
g********g 发帖数: 2172 | 24 Another approach is to use data from Yahoo or Alexa. Otherwise, unless the field is a narrow niche, you won't even be able to afford the crawler's bandwidth costs. |
|
I*****y 发帖数: 6402 | 25 I would like to try building a search engine for a particular topic using Lucene and Nutch.
I've installed Java, Tomcat 6 and Ant on my testing server http://208.64.71.46:8080/
However, I have no idea how to install Lucene. Does anyone know? Please teach me a little bit, thanks.
PS: I use CentOS 5.2, by the way. |
|
K*Q 发帖数: 1001 | 26 Oh, Nutch itself includes Lucene;
you do not need to install Lucene separately. |
|
I*****y 发帖数: 6402 | 27 I think you also use Lucene/Nutch as your own search engine? |
|
b******y 发帖数: 9224 | 28 I think he is using some version of nutch maybe? |
|
I*****y 发帖数: 6402 | 29 On a public email list I asked for an expert to integrate Nutch and Solr for me, and said I was willing to pay. Very quickly I got a reply from a fellow countryman back in China, someone I had chatted with on G-Talk before, saying he could do it, quoting $50/hour and estimating 1 day of work. I don't know whether that is reasonable; does he just assume everyone in the US is rich?? |
|
h******s 发帖数: 100 | 30 Thanks.
Solr is the backend.
Nutch is not used; I wrote a dedicated crawler. |
|
I*****y 发帖数: 6402 | 31 I'm planning to build a search engine for a specialized field, like the myvisajobs, hanajobs, etc. built by the experts here.
I plan to use the open-source Solr as the indexing engine, and Nutch or Heritrix as the spider to do the crawling.
For a search engine like this, what server configuration and disk space are needed? Or do I need multiple machines linked together?
It will search things like protocols in the biology field; there should be quite a few sites to index. |
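For rough capacity planning, a back-of-envelope calculation is usually enough to decide whether one box will do. All the numbers below are assumptions for illustration; substitute your own once you know how many pages the protocol sites actually have:

    public class CrawlSizing {
        public static void main(String[] args) {
            // Assumed figures, not measurements: adjust for your vertical.
            long pages        = 5_000_000L; // pages across all target sites
            long avgPageBytes = 25_000L;    // average fetched page size
            double indexRatio = 0.4;        // index size as a fraction of raw content

            long rawBytes   = pages * avgPageBytes;
            long indexBytes = (long) (rawBytes * indexRatio);
            System.out.printf("raw crawl  : %.0f GB%n", rawBytes / 1e9);
            System.out.printf("solr index : %.0f GB%n", indexBytes / 1e9);
            // ~125 GB of fetched content and ~50 GB of index under these assumptions,
            // which a single machine with a few hundred GB of disk can hold; multiple
            // machines only become necessary at a much larger page count.
        }
    }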
|
|
i***0 发帖数: 8469 | 33 I wanted to connect on 2 opportunities I'm working on. One is a role in Menlo Park with Adsymptotic. I'm working with the founders from Google/Admob/Yahoo, backed by Sequoia/KP; well funded, and growing. I've helped them staff a number of roles and am looking for a Hadoop expert to join the team. Take a look, they're doing quite well.
I'm also working with the VPE of Mashlogic in Palo Alto, and they're looking for a Java/BigData engineer. They're growing, backed by NEA and Bessemer Vent... [post truncated] |
|
c***o 发帖数: 61 | 34 I only want to use it to index documents (.doc/.pdf/etc.) rather than htm/html, but skipping htm/html in crawl-urlfilter.txt doesn't work, because then the crawler simply can't get enough link information. Should I crawl/fetch everything first and then drop the htm/html pages at index time?
How should this be handled? Thanks! |
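This is not the actual Nutch plugin API, but the "fetch everything, filter at index time" idea can be sketched standalone: use a content-type detector (Apache Tika here) to decide whether a fetched file should go into the index. The shouldIndex helper is hypothetical, purely for illustration:

    import java.io.File;
    import org.apache.tika.Tika;

    public class IndexTimeFilter {
        // Hypothetical index-time filter: keep PDFs, Word docs, etc.,
        // and drop the (x)html pages that were fetched only for their links.
        static boolean shouldIndex(File fetched) throws Exception {
            String mime = new Tika().detect(fetched); // e.g. "application/pdf", "text/html"
            return !mime.contains("html");
        }

        public static void main(String[] args) throws Exception {
            for (String path : args) {
                File f = new File(path);
                System.out.println(path + " -> " + (shouldIndex(f) ? "index" : "skip (html)"));
            }
        }
    }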
|
|
i***c 发帖数: 301 | 36 Can you give me some info about the web crawler?
How do you integrate it with ASP.NET?
Using Nutch and Lucene doesn't seem easy to combine with ASP.NET. |
|
|
|
n********s 发帖数: 144 | 39 I just fiddled with it a bit and got it running.
No need for the gurus to step in. |
|
|
k***r 发帖数: 4260 | 41 Somehow I find the Hadoop FS hard to use ...
you can probably just use Lucene. |
|
k***r 发帖数: 4260 | 42 Heritrix is good. If you don't want to crawl
the whole web, you can roll your own crawler.
Otherwise, I'd say use Heritrix. |
|
b******y 发帖数: 9224 | 43 Thanks for the info.
I wrote my own crawler before, but since it is not my main focus, I am looking into open-source crawlers these days.
Definitely not wanting to crawl the whole web; thank god I don't need to do that ;-) |
|
k***r 发帖数: 4260 | 44 If you only need some domain data, say, shopping sites,
I'd rather write my own crawler. This way the parsing code
can be very close to the crawling code, which makes your
crawling smarter and more efficient. |
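A toy sketch of that idea, keeping the parsing right inside the crawl loop so the same page analysis both extracts data and decides which links are worth following. It uses jsoup for fetching and parsing; the seed URL, keyword, and page budget are made-up placeholders, and a real crawler would also need robots.txt handling, politeness delays, and persistence:

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Set;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class TinyFocusedCrawler {
        public static void main(String[] args) {
            ArrayDeque<String> frontier = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            frontier.add("http://example.com/"); // placeholder seed
            int budget = 50;                     // stop after this many fetches

            while (!frontier.isEmpty() && budget-- > 0) {
                String url = frontier.poll();
                if (!seen.add(url)) continue;    // skip URLs we've already fetched
                try {
                    Document doc = Jsoup.connect(url).userAgent("tiny-focused-crawler").get();
                    // Parsing happens here, next to the crawl logic, so relevance can
                    // steer the frontier instead of blindly following every link.
                    boolean relevant = doc.text().toLowerCase().contains("protocol");
                    if (relevant) System.out.println("HIT " + url + " :: " + doc.title());
                    for (Element a : doc.select("a[href]")) {
                        String next = a.absUrl("href");
                        if (relevant && next.startsWith("http") && !seen.contains(next)) {
                            frontier.add(next);
                        }
                    }
                } catch (Exception e) {
                    System.err.println("fetch failed: " + url);
                }
            }
        }
    }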
|
r***y 发帖数: 4379 | 45 -- For example, the projects I work on now are the hottest ones on Apache, and the developers are in fact full-time employees of Yahoo, FB, and the related commercial companies.
Hadoop?
Mahout?
Solr?
Nutch? |
|
|
w***g 发帖数: 5958 | 47 We use Nutch, and it's pretty bad. The main problem is that once the crawl scope widens to the whole internet, most of the time is spent dealing with all kinds of junk pages. The most important thing in a good crawler is the various ad hoc heuristic rules for avoiding useless pages, and as far as I know no open-source package ships with good rules of that kind. Quite a few packages let users plug in their own, but for someone without much experience, finding those rules is harder than implementing a crawler. |
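As a flavor of what those heuristic rules look like in practice, here is a tiny sketch of URL-level junk filters. The specific patterns are illustrative assumptions only; real deployments accumulate many more, tuned per site:

    import java.util.List;
    import java.util.regex.Pattern;

    public class JunkUrlHeuristics {
        // A handful of ad hoc rules of the kind described above (illustrative only).
        private static final List<Pattern> REJECT = List.of(
                Pattern.compile("[?&](sessionid|sid|phpsessid)=", Pattern.CASE_INSENSITIVE),
                Pattern.compile("/calendar/\\d{4}/\\d{2}"),   // endless auto-generated calendar pages
                Pattern.compile("[?&](sort|order)="),         // the same listing, just reshuffled
                Pattern.compile("\\.(jpg|png|gif|css|js)(\\?|$)", Pattern.CASE_INSENSITIVE));

        static boolean looksLikeJunk(String url) {
            return REJECT.stream().anyMatch(p -> p.matcher(url).find());
        }

        public static void main(String[] args) {
            System.out.println(looksLikeJunk("http://example.com/calendar/2031/07"));  // true
            System.out.println(looksLikeJunk("http://example.com/protocols/pcr.pdf")); // false
        }
    }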
|
|
c****e 发帖数: 1453 | 49 People have been working on vertical markets all along, and many of them do indeed use Hadoop + Lucene; plenty of e-commerce product search runs on exactly these. The combination meets a search engine's most basic needs, but it is no match for Google, Bing and the like. What matters most is whether you have a relevance infrastructure. Building the index can be scaled out with Hadoop, but Lucene's query serving is very slow, and relevance is the real hard part. As for the details there are even more: the speller, query understanding and user intent all need huge amounts of user data and clicks, which is why so many sites' on-site search is terrible, worse than just searching from Google.
Even a company as big as eBay finds its own product search hard to do; they hired people to build Cassini and the results were not great.
http://www.slideshare.net/fullscreen/cloudera/hadoop-world-2011
If you are looking for a crawler, check out Nutch; for parsing documents like PDFs you can use Tika. As for parsing dynamic pages, you can wrap webki... [post truncated] |
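For the Tika part, extracting the plain text of a PDF (or .doc) so it can be handed to Lucene/Solr is close to a one-liner; a minimal sketch, with the file path taken from the command line:

    import java.io.File;
    import org.apache.tika.Tika;

    public class DocToText {
        public static void main(String[] args) throws Exception {
            // args[0]: path to a .pdf/.doc/etc.; Tika detects the type and extracts plain text.
            String text = new Tika().parseToString(new File(args[0]));
            System.out.println(text.substring(0, Math.min(500, text.length())));
        }
    }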
|
|