Page 9 - Roundup of discussions about scraping - 话题女王

s*******w
Posts: 2257

From thread: TrustInJesus board - Book of Job

Now let's begin studying chapter 2:
================================
2:1 [hgb] 又有一天，神的众子来侍立在耶和华面前，撒但也来在其中。
[kjv] Again there was a day when the sons of God came to present
themselves before the LORD, and Satan came also among them to present
himself before the LORD.
[bbe] And there was a day when the sons of the gods came together
before the Lord, and the Satan came with them.
2:2 [hgb] 耶和华问撒但说，你从哪里来？撒但回答说，
我从地上走来走去，往返而来。
...

s******0
Posts: 13782

From thread: NJU board - Which one is your job? (ZT)

A Software Engineer, a Hardware Engineer and a Departmental Manager were on
their way to a meeting. They were driving down a steep mountain road when
suddenly the brakes on their car failed. The car careened almost out of
control down the road, bouncing off the crash barriers, until it
miraculously ground to a halt scraping along the mountainside. The car's
occupants, shaken but unhurt, now had a problem: they were stuck halfway
down a mountain in a car with no brakes. What were they to do?
"I k...

c*********t
Posts: 30088

From thread: PKU board - Petoskey, Michigan (repost)

[The following text is reposted from the Shaanxi board]
Posted by: luowei (燕草秦桑), Board: Shaanxi
Title: Petoskey, Michigan
Posted from: BBS 未名空间站 (Wed Jul 21 21:41:22 2010, US Eastern)
A family that lives on the outskirts of Petoskey, Michigan decided to
build a sturdy, colorful playground for their 3- and 4-year-old sons. They
lined the bottom with smooth-stone gravel all around to avoid knee scrapes
and other injuries. They finished building it one Friday evening and were
very pleased with the end product.
The following morning, the mom was about

f*******h
Posts: 10

From thread: SJTU board - Shanghai Jiao Tong University's most globally watched event of 2003

Full text as cited by the EU:
http://europa.eu.int/comm/research/headlines/news/article_03_12_31_en.html
Chinese study ranks world’s top 500 universities
European universities scrape by with a pass mark, according to a new academic
ranking of the world’s best schools of higher learning released by researchers
from Shanghai’s institute of higher education.
European universities ranked fifth and ninth on the ladder in terms of
research and academic performance.
Oxford and Cambridge Universities are conspicuous as the

y**********o
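Posts: 7947

From thread: Beijing board - Protein deficiency

Obvious symptoms of protein deficiency. Body weight in pounds x 0.37 = minimum daily protein requirement
No wonder, no wonder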
Ridges or deep lines in finger and toe nails
Slowness in healing wounds, cuts, scrapes, and bruises
Difficulty sleeping
Crankiness, moodiness
Severe depression
Anxiety
Lack of energy, no desire to do things

l****i
Posts: 4609

From thread: Shaanxi board - Petoskey, Michigan

A family that lives on the outskirts of Petoskey, Michigan decided to
build a sturdy, colorful playground for their 3- and 4-year-old sons. They
lined the bottom with smooth-stone gravel all around to avoid knee scrapes
and other injuries. They finished building it one Friday evening and were
very pleased with the end product.
The following morning, the mom was about to wake the boys up and have them
go out to play in their new play center. This is what she saw from the
upstairs window

P******e
Posts: 2223

From thread: Shanghai board - Bold prediction: 爷叔 ends his single life in 2012

yeah, i am scraping the bottom of the barrels, eating left-overs and licking
every last drop.
CENA.

z*****9
Posts: 256

From thread: Zhejiang board - Trip to Zhuji: buying the famed torreya nuts (香榧)

You do not peel it. You can scrape it with the hard outer shell.

N**t
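Posts: 1738

From thread: Apple board - mit百度不倦 can't be accessed again today

You can't put it that way. This app is presumably a scraper. Fetching the
http page is certainly no problem, but once it's fetched the page still has
to be carved up: to tell which piece is the Top Ten and which is the Apple
board, you need to know the format of the http page. 老邢 changes the format
every other day, so of course scraping it is hard.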

i****a
Posts: 36252

From thread: Apple board - Chrome > all (repost)

[The following text is reposted from the WaterWorld board]
Posted by: iMaJia (iMac,iPod,iPad,i馬甲), Board: WaterWorld
Title: Chrome > all
Posted from: BBS 未名空间站 (Thu Mar 10 15:27:02 2011, US Eastern)
http://www.engadget.com/2011/03/10/safari-and-ie8-get-shamed-at
chrome-still-safe-for-n/
http://www.blogcdn.com/www.engadget.com/media/2011/03/chrome-ha
10-600.jpg
Ahead of the most recent Pwn2Own, Google made a rather proud challenge:
it'd pay $20,000 to any team or individual who could successfully hack
Chrome. Two takers signed up for that challenge -- a...

x***q
Posts: 4953

From thread: Apple board - hulu plugin for xbmc really doesn't work anymore

I gave it a try, and none of the videos that used to play will play now.
Here is the author's explanation:
http://forum.xbmc.org/showthread.php?t=97144
I checked the Hulu, MTV Networks, South Park Studios, TBS, TNT and they all
started using a new handshake for rtmp. librtmp doesn't currently support it.
The plug-ins are playing the correct link but can't negotiate the
connection to the server. I will start marking them broken tomorrow till
librtmp supports the new handshake.
Spike TV scraping is broken and haven't got to it.
the PBS is working fine on my end. NOVA for t...

w********1
Posts: 3492

From thread: Apple board - Archiveteam Saves 272 Terabytes of MobileMe Websites From Deletion [Mac Blog]

Tue, 26 Jun 2012 11:18:14 PDT
A volunteer team has been downloading the entire publicly accessible
contents of MobileMe webpages, iDisk folders, and photo galleries, ahead of
the shutdown of the MobileMe service on June 30.
The team just finished the project, some four days ahead of their deadline.
The project began late last year and ramped up as the team moved closer to
the MobileMe shutdown date.
Archive Team has finished downloading MobileMe and .Mac before Apple deletes
it on June 30. 272 ...

c********g
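Posts: 1173

From thread: Apple board - Those who can't get an iPhone 6/6+, look here

I wrote an app that scrapes the inventory of nearby Apple Stores. As soon as
stock shows up, it can send you a realtime notification and an email. I used
this tool myself to grab two iPhone 6/6+ units.
I just open-sourced the app. You can download it here, then build and install
it yourself:
http://github.com/ychw/iPhone6Radar
Because this app scrapes Apple's web pages, it can't go on the app store. If
you can't build it yourself, you'll have to ask a friend for help. Or you can
use istock.us, but it can't filter results by distance.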

s****y
Posts: 983

From thread: BuildingWeb board - Can JQUERY+AJAX be used to fetch the contents of a remote file?

You're trying to do data scraping, aren't you?

g****o
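Posts: 1284

From thread: BuildingWeb board - Can JQUERY+AJAX be used to fetch the contents of a remote file?

Thanks to the two posters above for replying.
I think I don't actually need to fetch the files dynamically. What I'm really
interested in is all the record files that include a certain famous pair of
players, and I can simply download them one by one from the archive, which
isn't much trouble.
What I really want to implement is this: once all these record files are
stored on my server, when a user enters a specific bidding sequence (a string
in some fixed format) on the client, I need to traverse these record files,
return the deals in which this famous pair used that particular bidding
sequence, and display them on the client. Is this perhaps the data scraping
that sunrey mentioned?
The record files all have the following format: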
vg|41st WBTC-BBO1,BB-F1,I,1,16,MONACO,0,ITALY,6|
rs|2SN+2,3SN=,2SN=,1NEx-1,4SS=,4SS=,3NN-2,3NN=,4HN=,5DEx-3,6CE=,6CW=,3HE=,
4HE-1,2HS-1,3CW-1,6CN=,6DS+1,5HSx-1,4SE+1,6DN-1,4SS+2,4SW=,4SW=,PASS,2CN+3,
4CW+1,5CWx=,1NW-3,3CW=,4HS-1,3NS=|
pn|NUNES,VERSACE,FANTONI,LAURIA,...
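
The record format quoted above looks like a flat run of tag|value| pairs (vg, rs, pn), so a first step toward the search described in this post could be a small parser plus a filter on the player names. The following is only a minimal sketch in Python: the function names are made up for illustration, the sample string is a shortened stand-in for a real file, and the excerpt is cut off before any bidding field appears, so matching on the bidding sequence itself is left out rather than guessed.

```python
from typing import Iterable


def parse_record(text: str) -> list[tuple[str, str]]:
    """Split a pipe-delimited record into (tag, value) pairs."""
    # Values may wrap across lines in the file, so drop newlines first.
    fields = text.replace("\n", "").split("|")
    # Fields alternate tag, value, tag, value, ...; the empty string left
    # after the final "|" is ignored by the pairing below.
    return [(fields[i], fields[i + 1]) for i in range(0, len(fields) - 1, 2)]


def has_players(pairs: list[tuple[str, str]], wanted: Iterable[str]) -> bool:
    """Return True if every wanted name appears in some pn field."""
    names: set[str] = set()
    for tag, value in pairs:
        if tag == "pn":
            names.update(n.strip().upper() for n in value.split(","))
    return all(w.upper() in names for w in wanted)


# Hypothetical, shortened sample modelled on the excerpt above.
sample = (
    "vg|41st WBTC-BBO1,BB-F1,I,1,16,MONACO,0,ITALY,6|\n"
    "pn|NUNES,VERSACE,FANTONI,LAURIA|"
)
pairs = parse_record(sample)
```

On the server, the same two functions could be run over every stored file, keeping only the records where has_players(pairs, ["FANTONI", "NUNES"]) is true before looking at the bidding.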

r**********d
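Posts: 510

From thread: BuildingWeb board - Chipostle.com web scrape

Could an expert give me some pointers?
I want to go to https://order.chipotle.com,
then pick a store, choose a sides and drinks item, and add to bag.
I can see the order id in the network traffic, but I can't get the order id
from my python program. I'm using selenium.
Another approach is to go through its api calls, but its cookies are renewed
every day, so that can't be automated.
Is there a good way to do this?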

c********1
Posts: 5269

From thread: BuildingWeb board - Chipostle.com web scrape

Try using xpath?

K****n
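Posts: 5970

From thread: CS board - Urgent! Applying for a position and they sent a pile of questions

I really have no experience, ha. I was only a foot soldier at this kind of
site for a few months and read the boss's old code. Read this critically:
Questions:
1. What methods are there for accessing a supplier's (Amazon, Ebay, etc.) data source?
There are many. The most common approach is to reverse-engineer the
supplier's query html, line up a number of proxy servers, send the query,
download the html, and parse the html. If you have a contract with the
supplier, you may get access to some lower-level layer; if you get xml, that
is of course more convenient.
2. How do you ensure that @all@ items in a data source have been downloaded (e.g. all the books on Amazon)?
wokao, how could you possibly guarantee that? You have to understand the
other side's data structure; otherwise you have to carefully analyze every
case their pages can present, such as how to specify the number of items
listed per page in the query, how to page through results, and so on.
3. How do you identify the same product across different data sources, and what identification process would you use?
Really damn hard. Keyword matching, I suppose, and then you can group
synonymous keywords together... As for which keywords are synonymous, you
could go to amazon, google shopping, bing shopping, yahoo shopping and scrape
lots of ...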
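
The keyword-matching idea in answer 3 can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not what any real shopping site does: the SYNONYMS table is a hypothetical hand-written stand-in for synonym groups that the post suggests collecting by scraping, and token-set (Jaccard) overlap is just one simple way to score a match.

```python
import re

# Hypothetical synonym table: each key is rewritten to its canonical keyword.
SYNONYMS = {"laptop": "notebook", "tv": "television"}


def normalize(title: str) -> frozenset[str]:
    """Lowercase a product title, tokenize it, and fold synonyms together."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return frozenset(SYNONYMS.get(w, w) for w in words)


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the normalized keyword sets, between 0.0 and 1.0."""
    ta, tb = normalize(a), normalize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Two listings would then be treated as the same product when the score passes some threshold chosen by inspecting real data; a score alone will still confuse near-identical models, which is part of why the post calls the problem hard.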

c**t
Posts: 2744

From thread: DotNet board - How do I extract the elements of a table on a web page?

google web scraping, tons of examples, all languages...

c**t
Posts: 2744

From thread: DotNet board - A great tool for web scraping: HttpClient in WFC REST Starter Kit

You are right. But it's much better than WebClient or HttpWebRequest etc.

s*****f
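Posts: 75

From thread: DotNet board - A great tool for web scraping: HttpClient in WFC REST Starter Kit

I strongly recommend IRobotSoft visual web scraper. Web page scraping,
advanced database integration, automated clicking and browsing; I won't list
everything. Small web robots built with it can do very powerful things. Also,
its point-and-click automatic learning feature can spare you a lot of
programming headaches.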

c**t
Posts: 2744

From thread: DotNet board - A great tool for web scraping: HttpClient in WFC REST Starter Kit

it's not free

s*****f
Posts: 75

From thread: DotNet board - A great tool for web scraping: HttpClient in WFC REST Starter Kit

it is free

S*******r
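Posts: 44

From thread: Hardware board - How is this T60 that the company is retiring?

Our workplace replaced a batch of laptops, and the old ones are being sold off
to employees. Could everyone help take a look at this T60 ($235) and tell me
whether it's worth buying the 1-year warranty ($65)? Thanks in advance!
Black IBM ThinkPad T60 with huge 15" LCD
- Windows XP Pro Installed
- Integrated Wireless - connects to your at home wireless router or any wi-fi
hot spot. No cords necessary!
- DVD Player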
- CD Player and Burner
- Software included: iTunes, FireFox, Open Office (a Microsoft Office
compatible software with word processing, spreadsheet, presentation etc.)
Machine Specs:
Model IBM ThinkPad T60
Processor Type Int...

a**********g
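Posts: 20

From thread: Hardware board - Buy a computer or use the cloud?

I've written a program to collect data from the web. It runs once an hour,
goes to the specified websites to scrape data, and stores it in a database. I
now want to keep this program running continuously for half a year, and I'm
wondering what environment would be best.
1) I could buy a desktop, put it at home, and run the program on it. The
downside is that if the power or the network goes out at home, data
collection will be affected.
2) Could it run on a server? That would overcome the problem above. Would
something like amazon web service meet my needs? I know nothing about
cloud/web services, so any pointers are appreciated.
Many thanks!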
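
Whichever machine ends up hosting it, the hourly "scrape, then store" step described above can be kept small and restartable. The sketch below is a guess at the general shape in Python, not the poster's actual program: the samples table, the run_once name, and the scrape callable are all hypothetical, and sqlite3 is used only because it needs no server.

```python
import sqlite3
import time
from typing import Callable, Iterable


def run_once(
    db: sqlite3.Connection,
    scrape: Callable[[], Iterable[tuple[str, str]]],
) -> int:
    """Run one scrape pass and store its (key, value) rows with a timestamp."""
    now = time.time()
    rows = [(now, key, value) for key, value in scrape()]
    with db:  # commits on success, rolls back if an insert raises
        db.execute(
            "CREATE TABLE IF NOT EXISTS samples (ts REAL, key TEXT, value TEXT)"
        )
        db.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    return len(rows)
```

One possible design choice is to have a scheduler such as cron start a script that calls run_once every hour, instead of keeping one process alive for six months; a crash or reboot then costs at most one sample, on a home desktop or a cloud server alike.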

s*******a
Posts: 8827

From thread: Hardware board - Buy a computer or use the cloud?

i use the free amazon ec2 to run openvpn server 24/7, and its ok-ish...

b******y
Posts: 9224

From thread: Java board - good java xpath parser

Given an html page, I want to transform it into xml (proper html), then use
a good java xpath parser to retrieve/scrape relevant content.
Any suggestions for a good java xpath parser?
Thanks in advance,
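
The "clean up to well-formed XML, then query with XPath" workflow asked about here is language-neutral. As a rough illustration only, here it is in Python's standard library rather than Java; the XHTML snippet and its id and class names are invented, the input is assumed to be well-formed already (real pages usually need a tidy step first), and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed page fragment.
XHTML = """<html><body>
<table id="prices">
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">19.50</td></tr>
</table>
</body></html>"""

root = ET.fromstring(XHTML)

# Path expression with attribute predicates: every price cell in the
# table whose id is "prices".
prices = [
    td.text
    for td in root.findall(".//table[@id='prices']/tr/td[@class='price']")
]
```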

b******y
Posts: 9224

From thread: Java board - A question about scraping a flex website

There's no good way. flash is binary, unless you program machine recognition
of image text, hehe

s******e
Posts: 493

From thread: Java board - A question about scraping a flex website

as code runs inside as runtime just like java applet runs inside jvm.
there is no html for you to save and filter.
but if you are certain the data is from server, maybe you can intercept the
packets using jpcap and filter it. But that is still quite hard considering
a flex app can use different ways to communicate with the server (xml,
serialized binary object...).
i am curious why you want to do it?

n**a
Posts: 12

From thread: Java board - Amazon.com is Hiring- SDE with Machine Learning/Data mining/Hadoop background

Hello,
Amazon.com is looking for experienced engineers with Machine learning/Data
mining/Hadoop background. Please send your resumes to n******[email protected]
Many positions open, location- Seattle, WA
Job description: SDE
The product catalog is a key business asset and differentiator for Amazon.
The Catalog Quality organization is chartered with the goal of improving the
quality of this data by building systems and employing automated techniques
to identify and fix discrepancies and enrich the dat...

c******n
Posts: 4965

From thread: Java board - find all tables used in a hibernate/jdbc project?

I would need to replicate all the tables used in my project to another DB,
which has less background load,
the existing project uses a combination of JDBC and hibernate, with
hibernate being the majority.
I guess I COULD scrape hibernate query logs and find the tables accessed,
but that presents problems since in the test or production, the test data
may omit some access patterns.
is there some code analysis tool to generate all the tables accessed by a
hibernate project?

c**t
Posts: 2744

From thread: Linux board - Have you heard of Microsoft's new browser? pivot

It gives users power to manage their own collection: label, filter contents.
Great concept! For me it's a visualized scrape book.

c**t
Posts: 2744

From thread: Programming board - Screen scraping on pages with ajax

use fiddler to sniff what's sent; make another request..
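
Once a proxy such as fiddler has shown what the page's ajax call sends, "make another request" amounts to rebuilding that call yourself. A minimal sketch in Python's standard library follows; the URL, form field, and header values are placeholders to be replaced with whatever was actually captured, and the X-Requested-With header is only something many ajax endpoints expect, not a requirement of any particular site.

```python
import urllib.parse
import urllib.request


def build_replay(url: str, form: dict[str, str], referer: str) -> urllib.request.Request:
    """Rebuild a captured form-encoded ajax POST as a Request object."""
    data = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(
        url,
        data=data,  # supplying a body makes urllib send a POST
        headers={
            "X-Requested-With": "XMLHttpRequest",
            "Referer": referer,
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


# Placeholder values standing in for what the sniffer showed.
req = build_replay(
    "http://example.com/ajax/list", {"page": "2"}, "http://example.com/"
)
# urllib.request.urlopen(req) would actually send it.
```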

r****t
Posts: 10904

From thread: Programming board - Screen scraping on pages with ajax

Here we go again. Use selenium

l******t
Posts: 660

From thread: Programming board - Screen scraping on pages with ajax

Again? Has someone asked this before?

... this depends on the accuracy of the html page. Very likely it will be broken
on webpages you would like to scrape.

r****t
Posts: 10904

From thread: Programming board - How to download a web page, not including ,

k***r
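Posts: 4260

From thread: Programming board - Want to write a book-title lookup page suited to mobile displays, sending the title query to

If there's no API, I guess screen scraping is the only option.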

c****e
Posts: 1453

From thread: Programming board - If there's no api, is there any way to write a client for a website?

Search "web scraping". Essentially you just get the webpage and play with it.

d*********4
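Posts: 409

From thread: Programming board - If there's no api, is there any way to write a client for a website?

Thanks, everyone, for the replies. I've already emailed the programmers over
there to ask. I can imagine web scraping, but I'm not sure whether that
approach works for information that is only returned after logging in, or
for content with pagination and the like. I'll google more. Many thanks!!!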
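
Both worries raised here, pages behind a login and paginated listings, are commonly handled the same way: reuse one cookie-carrying session for every request, and keep fetching until a page signals that it is the last. The Python sketch below shows only that general shape; make_opener, iter_pages, and the fetch and is_last callables are hypothetical names, and how to log in or how to recognise the last page depends entirely on the site.

```python
import http.cookiejar
import urllib.request
from typing import Callable, Iterator


def make_opener() -> urllib.request.OpenerDirector:
    """Build an opener that stores cookies, so a login carries over to later requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def iter_pages(
    fetch: Callable[[int], str],
    is_last: Callable[[str], bool],
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield page bodies 1, 2, 3, ... until is_last says stop or max_pages is hit."""
    for page in range(1, max_pages + 1):
        body = fetch(page)
        yield body
        if is_last(body):
            break
```

In use, fetch would post the login form once through the opener and then request each listing page through the same opener; max_pages is a safety cap in case the stop condition never fires.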

d*l
Posts: 400

From thread: Programming board - Where can I find python code examples?

https://developers.google.com/edu/python/
For screen scraping of HTML site, use beautifulsoup. very easy.
Even ppmm's in my class can do it in homework. As a wsn, you should be able
to learn it in one day. Otherwise, you are not a true wsn.

d****n
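Posts: 12461

From thread: Programming board - A question about Regular Expression,

It works for web scraping, but text mining is hard. regex is based on pattern
matching, so it's only useful when you know the pattern.
regex can sometimes be very inefficient for searching large files. A file
over a gigabyte can bring it to its knees. So it's limited to basic jobs like
handling lots of small files on the file system. But it's very useful.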
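
One common way to soften the large-file problem mentioned above is to compile the pattern once and feed it one line at a time, so the whole file is never held in memory. The sketch below is a generic Python illustration with a made-up function name; it assumes matches never span line breaks, and it does nothing about patterns that are slow in themselves.

```python
import re
from typing import Iterable, Iterator


def grep_lines(lines: Iterable[str], pattern: str) -> Iterator[tuple[int, str]]:
    """Yield (line number, line) for each line that matches a known pattern."""
    rx = re.compile(pattern)  # compile once, reuse for every line
    for lineno, line in enumerate(lines, start=1):
        if rx.search(line):
            yield lineno, line.rstrip("\n")
```

Because an open file object is itself an iterable of lines, a call like grep_lines(open(path, encoding="utf-8", errors="replace"), r"\d+") streams through the file lazily.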

w****k
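Posts: 6244

From thread: Programming board - Can someone tell me about Selenium?

What you do with it is up to you.
It's just a tool for automating a browser.
I've used it to automatically snap up cheap plane tickets.
And to do some screen scraping.
I just have never used it to test web pages. haha
It requires coding.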

S**********e
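Posts: 503

From thread: Programming board - Are there any convenient APIs or frameworks for web scraping?

I just want to grab links from some websites, analyze them, and then download
a few things. So far I only know how to use java and apache's httpclient to
fetch the page and then parse the text. Today I googled something called
selenium, which seems to simplify development. Are there any other simple,
easy-to-use options?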
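
For the "grab the links, then decide what to download" part of this question, an HTML parser is usually more robust than treating the page as raw text. The poster works in Java; the sketch below shows the same idea with Python's standard library instead, as an illustration only, and the class and function names are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href attribute
                self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str) -> list[str]:
    """Return absolute link URLs in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The returned list can then be filtered, for example by file extension, before anything is downloaded.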

c********l
Posts: 8138

From thread: Programming board - Are there any convenient APIs or frameworks for web scraping?

http://blogread.cn/it/article/874?f=hot1
http://blogread.cn/it/article/3958?f=sa
http://blogread.cn/it/article/4086?f=sa
http://www.searchtb.com/2011/01/an-introduction-to-crawler.html

g*****g
Posts: 34805

From thread: Programming board - Are there any convenient APIs or frameworks for web scraping?

htmlunit.
