Page 6 - Summary of discussions about parsing - 话题女王

All topics - Topic: parsing

X****r
Posts: 3557

If a JS file contains even one parse error, the rest of the file is not parsed,
so any errors that show up later are to be expected.

s***r
Posts: 1121

From topic: Shandong board - The Graduate (ZT) Simon and Garfunkel - Are you going to Scarborough Fair

The Graduate
问尔所之，是否如适 Are you going to Scarborough Fair
蕙兰芫荽，郁郁香芷 Parsley sage rosemary and thyme
彼方淑女，凭君寄辞 Remember me to one who lives there
伊人曾在，与我相知 She once was a true love of mine
嘱彼佳人，备我衣缁 Tell her to make me a cambric shirt
蕙兰芫荽，郁郁香芷 Parsley sage rosemary and thyme
勿用针剪，无隙无疵 Without no seams nor needle work
伊人何在，慰我相思 Then she will be a true love of mine
彼山之阴，深林荒址 On the side of a hill in the deep forest green
冬寻毡毯，老雀燕子 Tracing of sparrow on snow crested brown
雪覆四野，高山迟滞 Blankets and bedclothes

b***y
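Posts: 405

From topic: Apple board - Is ChineseWeb secretly using a Proxy?

As I said earlier, MITBBS rarely makes major changes, and this is not a hard
technical problem anyway; there is already a solution on the AppStore. So please
don't steer the topic off course. What I care about most is why the app still
steals users' browsing data when there is no need to.
Second, are you sure ChineseWeb parses on the server side? If you are not sure,
let the author come and clarify. Otherwise the rest of the discussion is pointless.
Third, when the format of HTML/XML changes, its schema changes accordingly, and
apps generally parse according to the schema. A design like the one you describe
would wear the developers out.
Fourth, every software company racks its brains to mine users' browsing patterns
so it can insert ads, sell user information, and make more money. Calling
something this obvious "a spirit of public service" is far-fetched.
Fifth, there is no need for barbed remarks such as "half a bucket of water" or
"have you ever written real software". Of course, if you are one of the authors,
that is understandable on an emotional level.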


j***y
Posts: 2074

From topic: BuildingWeb board - SSI support of Apache Server under NT4.0?

how to configure Apache to support SSI under NT4.0?
i found in the documentation that i only add the following directives in the
file httpd.conf:
AddHandler server-parsed .shtml
AddType text/html shtml
but it seems that the SSI (Server Side Include) still can't be supported. the
server even can't parse my phrase like in my index.shtml.
any help?
thanks,

h***u
Posts: 214

From topic: BuildingWeb board - question about PHP! waiting online

Thank you!
but it still does not work, the message is:
Parse error: parse error, unexpected T_ENCAPSED_AND_WHITESPACE, expecting
T_STRING or T_VARIABLE or T_NUM_STRING in c:\apache\htdocs\testphp1.php on
line 8
my platform is win2000 + apache + php. when I first installed php, I tried to
install php as a module of apache, but it failed. so I installed it as cgi


s****y
Posts: 983

From topic: BuildingWeb board - JavaScript question (reposted)

Oh, google 'jquery parse xml' and you get a pile of tutorials,
for example this one:
http://www.switchonthecode.com/tutorials/xml-parsing-with-jquery

h******u
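Posts: 155

From topic: CS board - Any compiler experts here?

Existing parser generators are generally used for the recognition problem of
context-free languages. These languages usually have a matched-parentheses
property: for example, if there is an opening parenthesis, a closing one is
required. CFL recognition is not a trivial problem, which is why such parser
generators are needed. They can also do some attribute-grammar checks (which
actions should be produced when a given production is evaluated). For ordinary
formatted text parsing, don't go down this road. perl is made for exactly this
kind of job.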
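
The matched-parentheses property mentioned in this post is the usual reason a plain regular expression is not enough for such languages. A minimal sketch in Python, with a function name (`is_balanced`) made up for illustration, shows the stack that a recognizer needs:

```python
def is_balanced(text: str) -> bool:
    """Return True if every bracket in text is closed in the right order.

    Illustrative helper only; a real parser generator builds this kind of
    stack handling from a grammar instead of hand-written code.
    """
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != pairs[ch]:
                return False
    # Anything left on the stack is an opener that was never closed.
    return not stack
```

Nesting depth is unbounded, so the recognizer has to remember an arbitrary number of open brackets; that memory is what separates a context-free language from a regular one.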

d*****u
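Posts: 17243

From topic: CS board - Computational linguistics master's, Brandeis VS UW@Seattle, go or stay, please give some...

Computational linguistics really comprises two big parts: computational linguistics and natural language processing.
Nowadays the two are usually not distinguished, or the boundary is blurry, so look carefully at each program's requirements.
Strictly speaking, CL uses computational methods to study linguistic questions, and it splits further into two kinds.
The first studies human language processing.
Many people now work on (human) sentence parsing, using statistical methods and other rules to model how sentences are parsed,
but some also work on acquisition,
the mental lexicon, and so on.
My impression is that this part demands a strong linguistics background and also an understanding of artificial intelligence.
I think it is a very promising direction, but few people work on it,
mainly because few people understand both linguistics and artificial intelligence and machine learning,
and there are not that many advisors either.
The second is corpus work.
It is really a kind of database work, but it requires some knowledge of linguistics.
There is actually quite a lot of demand for it now, because corpora have all sorts of requirements for different kinds of research,
but the technical content is modest and the work is relatively dull.
NLP, by contrast, studies machine processing of natural language and does not necessarily borrow linguistic theory directly.
Tagging, parsing, machine translation, dialogue systems, and so on all count.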

n*p
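Posts: 298

From topic: CS board - A small programming question

I use fread to read a text file into a buffer,
then parse it with strtok(file_buffer,DELIMITERS).
The text file has only 10 characters, two words,
but after parsing there are always a few extra garbage characters. What should I do?
It seems strtok does not see the end of the buffer (that is, of the file).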

b****e
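Posts: 119

From topic: CS board - In NATURAL LANGUAGE PROCESSING, or TEXT MINING, which direction makes it easiest to find a job

For a job in industry, what matters most is not knowing the textbook algorithms well, but:
First, knowing the tools and systems. For NLP, from toolkits to each stage (shallow parsing, deep parsing,
classification (e.g., sentiment analysis), entity extraction, relation
extraction, Q/A), you should know which tools exist, their strengths and weaknesses, and which tools implement which algorithms,
well enough to rattle them off in an interview.
Second, do two or three concrete projects that make partial improvements on top of existing tools. Whether you have hands-on experience
also comes out immediately in an interview.
In short: get hands-on, and read books with questions in mind rather than reading the textbook cover to cover first.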

w***y
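Posts: 6251

From topic: Database board - Does anyone use postgres? A question about drop table

I am trying some SQL commands in psql. Why does DROP TABLE test CASCADE RESTRICT; give an error?
ERROR: parser: parse error at or near "CASCADE"
I read in the manual that keywords like CASCADE RESTRICT can be used, so why is there a parse error?
Thanks!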

w***y
Posts: 6251

From topic: Database board - Does anyone use postgres? A question about drop table

Using just one of them gives an error too :(
DROP TABLE test CASCADE;
ERROR: parser: parse error at or near "CASCADE"
DROP TABLE test RESTRICT;
ERROR: parser: parse error at or near "RESTRICT"
DROP TABLE test;
DROP

s******n
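Posts: 34

From topic: Database board - A sql batch that works for both sql server and oracle?

I want to create several sp first and then execute them.
These sp are very long and I don't want to hard code them; also, if an sp has a
problem, I don't want to have to recompile the program. So I put them in one
file, and I don't want to parse the file either; I want to read it straight into
a string and then execute it with ADO.
So I need to know: when executing a batch with ado, what separates the statements?
Have I made this clear?
It looks like I can only parse the file? Or use an ini file?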

h**o
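Posts: 548

From topic: Database board - A question about log analysis

About a few dozen servers, each analyzing tens of GB per day of web logs in the company's own format.
Currently the analysis is done in C and the results are stored as a daily xml.
Then a management server collects the xml results from these dozens of servers every day,
uses perl to parse these xml files, and merges them with the existing history file (also xml) to produce a new history
file.
The records include all kinds of daily, weekly, and monthly information.
The problem now is that this xml file is too big to parse easily. Could it be redesigned with sql?
statistics include:
userID_$attr1_$attr2_$attr3_$attr4,
url_$attr1_$attr2_$attr3
sessionID_$attr3_$attr4
...
where
$attrX is a variable with a value. e.g. $attr3 is the phone model, whose value can be

userID, url, sessionID are long lists of str...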

n****f
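Posts: 905

From topic: Database board - A question about log analysis

Don't get worked up, friend. May I ask how you would PARTITION this kind of LOG file?
Should the indexes be DROPped before parsing?
Should the indexes be rebuilt after parsing?
Heh, slowness is only a symptom; there can be many causes...
Is it against the rules for me to talk about hardware?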

c**t
Posts: 2744

From topic: DotNet board - Tip: use XPath (+namespace)

For work I need to parse very complex xml. Parsing XML with .NET is very convenient, especially selecting node(s)
directly with XPath. But when the xml is more complex, for example xml exported directly from crystal reports,
the usual approach:
XmlDocument xml = new XmlDocument();
xml.Load( PathToXmlFile );
XmlNodeList selection = xml.SelectNodes(strXPathExpression);
does not work: xml.InnerXml is clearly not empty, yet selection.Count is always 0. Removing
the namespace fixes it. After some Googling I finally found the answer: before SelectNodes, add
XmlNamespaceManager nsmgr = new XmlNamespaceManager(xml.NameTable);
nsmgr.AddNamespace("a", "http://....");
nsmgr.AddNamespace("b", "urn:....");
XmlNodeList selection =
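
The same pitfall shows up outside .NET. As a rough sketch in Python (not the poster's code; the document and the `urn:example:report` namespace URI are invented for this example), `xml.etree.ElementTree` also returns nothing until the path is qualified with a prefix bound to the document's namespace:

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the namespace URI is made up for this sketch.
DOC = """<Report xmlns="urn:example:report">
  <Group><Name>alpha</Name></Group>
  <Group><Name>beta</Name></Group>
</Report>"""

root = ET.fromstring(DOC)

# Unqualified path: the elements live in a namespace, so nothing matches.
plain = root.findall("./Group/Name")

# Bind a prefix of our own choosing to the document's default namespace.
ns = {"a": "urn:example:report"}
qualified = root.findall("./a:Group/a:Name", ns)
names = [el.text for el in qualified]
```

The prefix used in the query does not have to match any prefix in the file; only the namespace URI has to be the same.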

o**********a
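Posts: 330

From topic: DotNet board - Newbie question

I just started with xml. Why does the first piece of code work, while the second one fails to create the xml correctly?
Thanks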
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Serialization;
using System.IO;
namespace xmlapp
{
    public class Movie
    {
        public string Title { get; set; }
        public int Rating { get; set; }
        public DateTime ReleaseDate { get; set; }
    }

    class Program
    {
        static void Main(string[] args)
        {
            Movi...

o**********a
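Posts: 330

From topic: DotNet board - How to fetch a fairly large xml file from a webservice

How do I fetch a fairly large xml file from a webservice, then parse it and save it to a
database?
My initial idea is to download the xml locally, parse it with xmlreader, and then save it to the database.
How do people usually do this? Thanks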
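
One way to picture the "stream it, then store it" idea from this post is the following Python sketch. It is an analogue, not the .NET XmlReader API: the `<record>` element, its `id` attribute and `name` child, and the `records` table are all assumptions made up for illustration.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

def load_records(xml_stream, conn):
    """Stream <record> elements into a table without building the whole tree.

    The element and column names are hypothetical.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, name TEXT)")
    count = 0
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "record":
            conn.execute(
                "INSERT INTO records VALUES (?, ?)",
                (elem.get("id"), elem.findtext("name")),
            )
            count += 1
            elem.clear()  # drop the element's children to keep memory low
    conn.commit()
    return count

# Tiny in-memory stand-in for the downloaded file.
sample = io.BytesIO(
    b"<records><record id='1'><name>a</name></record>"
    b"<record id='2'><name>b</name></record></records>"
)
conn = sqlite3.connect(":memory:")
inserted = load_records(sample, conn)
```

Because each record is written and cleared as soon as its end tag is seen, memory use stays roughly flat as the file grows.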

v******n
Posts: 421

From topic: DotNet board - how to get the days difference between "3/1/2013" and "6/1/2013"

class Program
{
    static void Main() {
        var d1 = System.DateTime.Parse("1/31/2013");
        var d2 = System.DateTime.Parse("2/1/2013");
        var d3 = d2 - d1;
        System.Console.WriteLine(d3.Days);
    }
}
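
For comparison, the same subtraction can be sketched in Python. The helper name `days_between` is made up here, and the explicit month/day/year format is an assumption about how the strings in the thread title are meant to be read; spelling the format out avoids depending on the machine's locale.

```python
from datetime import datetime

def days_between(start: str, end: str) -> int:
    """Parse two month/day/year strings and return the whole days between them."""
    fmt = "%m/%d/%Y"  # assumed US-style order; parsing accepts unpadded 3/1/2013
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days
```

With the dates from the thread title, `days_between("3/1/2013", "6/1/2013")` covers March (31), April (30) and May (31), which is 92 days.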

w*r
Posts: 2421

From topic: Java board - [Reposted] a question on XML parser

okay, your xml file is not well formatted for parsing. My suggestion is
that you can write a class to get rid of all the document heads in the first
place and put all records, well formatted, in one file (or stream).
Then what you need to do is just write an xslt to transform the
xml to whatever format you want and parse it into your application.

x***n
Posts: 39

From topic: Java board - [Reposted] a question on XML parser

1. chop ur monolithic(?) file (collection of xmls) into a collection of
xml files, parse one by one
2. find a fast way to feed an xml document (part of the file) to a parser,
then the second parsing for the second xml DOCUMENT (unfortunately
it's the second part of ur physical file), and so on.
1 or 2.
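
Option 2 can be sketched roughly as follows in Python. This assumes, without confirmation from the thread, that every embedded document begins with its own `<?xml ...?>` declaration; the function names are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

def split_documents(blob: str):
    """Split text that holds several XML documents back to back.

    Assumption: each document starts with an <?xml ...?> declaration.
    """
    # Zero-width split just before every XML declaration.
    parts = re.split(r"(?=<\?xml\b)", blob)
    return [p.strip() for p in parts if p.strip()]

def parse_all(blob: str):
    """Parse each embedded document separately and return the root elements."""
    return [ET.fromstring(doc) for doc in split_documents(blob)]
```

If the documents do not carry declarations, some other boundary marker would be needed, for example the known name of the root element.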

t*******t
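Posts: 105

From topic: Java board - type conversions

Dataset ds = (Dataset) new GctParser().parse();
GctParser.parse() returns a List, while Dataset is an interface. Where are the details of this kind of
conversion written down?
Thanks! I googled type conversion, but the results are about int -> double.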

A**o
Posts: 1550

From topic: Java board - type conversions

read your source code,
it's not what you said. it's actually returned you an object.
Dataset ds = (Dataset) new GctParser().parse().get(0);
the parse() does give you a List
but the get gives you an Object before being casted...
No surprises. Over and out.


c*****t
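Posts: 1879

From topic: Java board - ETL process in JAVA. -- please reply to this post if you have suggestions.

I already described the simplest approach: just write a few classes for runlength scanning,
delimit scanning, and so on (you should know these two; if not, there is nothing to be done). Then use
Spring's xml as the record format to build a container holding these class objects
and their concrete settings (such as the length for runlength). That way, when a user gives you an xml,
you read it in through Spring and get a list of scan actions. Then you
loop over the actions in this list and keep parsing records.
Designing it through an Xml schema works on the same principle. Parsing the xml yourself is honestly easy too.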
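
The "list of scan actions" idea can be sketched in a few lines of Python. This is only an illustration: a plain list stands in for the Spring xml container, "runlength" is read here as fixed-length scanning, and the class names are invented.

```python
class FixedWidthScan:
    """Consume a fixed number of characters (one reading of 'runlength')."""
    def __init__(self, length):
        self.length = length

    def scan(self, record, pos):
        return record[pos:pos + self.length], pos + self.length


class DelimitedScan:
    """Consume characters up to a delimiter and skip the delimiter itself."""
    def __init__(self, delimiter):
        self.delimiter = delimiter

    def scan(self, record, pos):
        end = record.find(self.delimiter, pos)
        if end == -1:  # no delimiter left: take the rest of the record
            return record[pos:], len(record)
        return record[pos:end], end + len(self.delimiter)


def parse_record(record, actions):
    """Apply each scan action in order and collect the fields."""
    fields, pos = [], 0
    for action in actions:
        field, pos = action.scan(record, pos)
        fields.append(field)
    return fields


# Hypothetical record layout: 4-character year, then two '|'-terminated fields.
layout = [FixedWidthScan(4), DelimitedScan("|"), DelimitedScan("|")]
```

Because the layout is just data, a new record format means a new list of actions, not new parsing code, which is the point the post makes about loading the list from a configuration file.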

g*****g
Posts: 34805

From topic: Java board - memcached

Flexibility and portability, below was a real app I worked on at my ex-job.
Let's say you want to implement a web server
that can display emails, all emails are in MIME format so
you have to parse emails to get attachments in the first
place. And of course you want to cache them to avoid parsing
them again.
Now use a pure java solution, you have the control of the eviction
policy. You may give VIP members more cache space or longer
expiry time for example. Your caching is portable, config the
director

k****u
Posts: 133

From topic: Java board - A question for the experts

After a second thought, there are some special cases that are hard to deal
with, for example:
String trickyOne = "/* this is not a block comment */";
in this case, you have to make sure /* and */ are not enclosed in double
quotes. This alone is making your logic convoluted.
Your best bet is to get to the parse tree of the source code, then just
traverse the tree and look for block comments. This solution is easy, clean
and maintainable.
For getting the parse tree, JavaSE 6 introduced some
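
The general point, letting a real tokenizer or parser decide what is a comment and what is a string, can be illustrated with a small Python analogue. This is not the Java API the post begins to describe; it uses Python's own `tokenize` module on Python source, and `find_comments` is a made-up helper name.

```python
import io
import tokenize

def find_comments(source: str):
    """Return (line, text) for real comments, ignoring '#' inside strings."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tok.start[0], tok.string) for tok in tokens
            if tok.type == tokenize.COMMENT]

# The first line hides a comment marker inside a string literal.
SAMPLE = 'tricky = "# this is not a comment"\nx = 1  # a real comment\n'
```

Here `find_comments(SAMPLE)` reports only the comment on line 2, because the tokenizer already knows the first `#` belongs to a string.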

e*****t
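Posts: 1005

From topic: Java board - How should an inventory lookup be done?

If you want it simple and direct, just simulate the http request and parse the http response.
If you don't really understand the low-level details, parsing the dom works too. I don't understand
what you mean by onload or listener; are you talking about javascript? java doesn't handle it that way.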


s***o
Posts: 2191

From topic: Java board - JQuery

This is a cross domain request, and it returns csv instead of json. (I guess)
jsonp won't work with csv format out of the box.
You can find the corresponding API that returns data in jsonp format, or if
that's not available, you can get data on server side, parse it and put the
parsed result on your web page with whatever format you prefer.
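
The server-side route suggested here might look like the following Python sketch. It only covers the conversion step; fetching the remote csv is left out, and the function name and the assumption that the first row is a header are mine, not the poster's.

```python
import csv
import io
import json

def csv_to_json(csv_text: str) -> str:
    """Parse CSV text (first row assumed to be a header) into a JSON array."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)
```

The page can then request this JSON from its own server, which sidesteps the cross-domain restriction on the original csv endpoint.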


z****e
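Posts: 54598

From topic: Java board - I made up a Java interview question myself

Writing this one with Regular Expression is very tedious,
and Regular Expression is not much simpler than xslt anyway.
You might as well go straight to xslt, which is both standard and lets you reuse code.
But xslt also requires some background, and most people may not feel comfortable with it.
A better approach is freemarker + an xml parsing lib:
split reading and writing into two parts, read the input with dom/sax,
and once parsing is done, write the output with a template written in freemarker.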

d****g
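Posts: 7460

From topic: Java board - I made up a Java interview question myself

Come on, I don't need help.
In the end I decided to parse the string myself: go through it once from start to finish looking for the tags between < and >,
and strip any prefix. Writing the code cleanly tests the fundamentals. The fine points are about understanding
how strings allocate memory. Many details I figured out for the first time. For example:
String a = b.substring(0,256); how many new bytes does a take up?
StringBuffer sb;
sb.toString() versus new String(sb): is there any difference?
Normally I would consider this kind of question pointless, but this case had already hit OOM, so every byte counts.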

c*********e
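Posts: 16335

From topic: Java board - How do java web services parse data in xml or json format? Which plugins to use

Yeah, json is actually easier to parse than xml. With xml parsing, if you write it by hand, you have to deal with
childnodes and such, which is really annoying.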

c*********e
Posts: 16335

From topic: Java board - How do java web services parse data in xml or json format? Which plugins to use

parse xml with dom4j
parse json with jackson
How about that?

n*w
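Posts: 3393

From topic: Linux board - How to get the total size of all files created last month in a directory?

It is all objects being piped:
(gci -r director | ? {$_.CreationTime -gt (get-date).addmonths(-1)} |
Measure-Object -Sum Length).Sum
This is indeed a bit more advanced than bash and the like, which parse text back and forth.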

s**h
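Posts: 1889

From topic: Linux board - Can linux limit the maximum number of failed logins per day?

I skimmed it, and as I said before, my understanding of iptables ip blocking is limited or mistaken. Your post above:
http://mitbbs.com/article1/Linux/31244201_3_0.html
I haven't tried it. I may not have understood this sentence correctly:
...because fail2ban parses log files to detect brute force attacks at a
certain interval ...
I assume that parsing the log requires a certain amount of data to accumulate, and accumulating enough to reach
fail2ban's threshold may take time. This is only my guess and may be wrong.
At my previous workplace only a few ports were allowed. If the port I set happens to be blocked, and the allowed
ports are used by fixed programs, changing the port number is basically not feasible. I also said that I don't know
the public key method well, and that based on my limited understanding it may cause some inconvenience.
http://mitbbs.com/article1/Linux/31244175_3_0.html
I have never denied the effectiveness of any of these methods.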

r*******y
Posts: 1081

From topic: Linux board - ./test input and ./test < input

what is the difference between parsing the parameter here?
in ./test input, I can use $1 in the test script to denote the input
How to parse input in ./test < input ?
thanks
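
The difference asked about here is that with `./test input` the file name reaches the program as an argument, while with `./test < input` the shell opens the file and the program only sees its contents on standard input, so there is no `$1` at all. A small Python sketch of a program that accepts both forms (the helper name `read_input` is made up for this example):

```python
import io
import sys

def read_input(argv, stdin):
    """Return the input text whether it came as a file argument or via '<'."""
    if len(argv) > 1:
        # ./test input   -> the file name arrives as an argument
        with open(argv[1], encoding="utf-8") as handle:
            return handle.read()
    # ./test < input     -> no argument; the contents arrive on standard input
    return stdin.read()

# In a real script this would be called as: read_input(sys.argv, sys.stdin)
```

In a shell script the second case is handled the same way: read from standard input (for example with `cat` or `read`) when no argument is given.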

S*A
Posts: 7142

From topic: Linux board - internals of gcc

front end is just C parsing and turning it into internal IR.
The parsing itself is pretty complicated but the rules are actually
not very hard to understand.
The back end is where the interesting stuff happens.
If you just want to study how a real compiler works, LLVM
is a much better start. The project is much cleaner compared
to gcc.

i***r
Posts: 1035

From topic: Linux board - python code performance --- normal or too slow? (reposted)

[The following text is reposted from the Programming board]
Sender: iiiir (哎呀我最牛), Board: Programming
Title: python code performance --- normal or too slow?
Posted at: BBS 未名空间站 (Tue Jan 7 11:21:52 2014, US Eastern)
file is 2.5GB with 18,217,166 lines
my python script took about 20-30 minutes to finish
seems slow?
Thanks!!
input file data structure (showing first two lines, wrapped):
chromo pos ref alt dc1 dc2 dc3 dtm bas din
crw itb ptw spw isw irw inw ru1 ru2
ru3 im1 ...

c*****m
Posts: 1160

From topic: Linux board - Apache: isn't there a command that shows the current config variables?

Found this answer:
As noted by arco444, you can use apachectl -S to display an overview of the
VirtualHosts currently running from the configs, and apachectl -M to display
all currently loaded modules - I'm not aware of a tool to display the
verbose output of all configs parsed (and which order they were parsed in)
at launch of httpd, but I would recommend that you familiarise yourself with
the general structure of the httpd config files.
Or I could just build one myself; it's not too hard.

m***t
Posts: 254

From topic: Programming board - A very dumb c++ question

ok. testfunc actually parses a document. Parsing a document is expensive,
while checking for certain field is cheap. If parameter j is there, i need
check for j after I get the document object. If j is not there, i donot do
the checking on j field.

h**o
Posts: 548

From topic: Programming board - how to count the times a function is used

But I do not know what function appear in this files.
I only know there are some format which I can use to parse the names of
these function (see the attachment in the previous response)
I think I can first use grep to parse these function name, then count them
using for, sort,etc.
But I do not know how to grep them.

r*********r
Posts: 3195

From topic: Programming board - How to convert a str to float in c++?

well, you have to put the usage in the right context.
if you are parsing millions of strings into floats (maybe
in high throughput networking environment?), then yeah,
atof or sscanf run much faster.
if you are only parsing the user input in an interactive
environment, why do you even care the complexity of the function
call?

b***y
Posts: 2799

From topic: Programming board - [Collection] how to know the encoding of a file

☆─────────────────────────────────────☆
davidwang (dd) wrote at (Wed Feb 20 16:14:03 2008):
It's like this --
I have a C++ application which will read and parse an input file, then write
the result parsed to database. I need to know the encoding of the input
file to set the NLS_LANG variable. For example, if the file is encoded in
UTF8, NLS_LANG will be set '.UTF8'.
Question is, how I can detect this file is encoded in UTF8 etc. in C++.
Anybody ran into the same problem before?
Many thanks!
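
There is no fully reliable way to detect an encoding from bytes alone, but a common heuristic for UTF-8 is to check for a byte order mark and otherwise try a strict decode. The sketch below is in Python, not the C++ the post asks for, and the function name `looks_like_utf8` is invented; the same validation logic carries over to other languages.

```python
import codecs

def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if the bytes start with a UTF-8 BOM or decode cleanly.

    Pure ASCII also passes, since ASCII is a subset of UTF-8.
    """
    if data.startswith(codecs.BOM_UTF8):
        return True
    try:
        data.decode("utf-8")  # strict by default: raises on invalid sequences
        return True
    except UnicodeDecodeError:
        return False
```

Text in a legacy multi-byte encoding such as GBK almost always contains byte sequences that are invalid UTF-8, so a strict decode rejects it; a file that passes is very likely, though not provably, UTF-8.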
☆──

b***y
Posts: 2799

From topic: Programming board - [Collection] Stunned by perl: sed, awk, cygwin, native and others

☆─────────────────────────────────────☆
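nkw (just+it) wrote at (Sat Apr 5 20:21:14 2008):
I'm not sure whether "stunned" (雷) is used correctly here, though I really dislike this usage.
I need to parse many large text csv files. It's simple: each line has 50 columns, one of which is text without
double quotes that sometimes contains a comma. The goal is: if a line has one comma too many (it must be the
text column that contains it), wrap that column in double quotes.
A colleague wrote a parser in perl. I have always resisted perl: messy, ugly. And this problem can be done
with sed anyway.
I did it with sed in cygwin; it took 40-odd minutes.
That felt too slow. Would awk be faster? It turned out to be twice as slow.
perl has always been considered the best at this; that colleague even said perl is "designed for this".
I found that the colleague had run his perl, and checked the timestamps of the parsed files he saved: it took
6-7+ hours. He uses native Active perl, not cygwin.
Then I tried gnu sed for windows 32: half an hour. http://gnuwin32.so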
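
The repair rule described in this post (one field too many means the text column contained a comma) can be sketched in Python as below. The post does not say which of the 50 columns is the text one, so `text_col` is a placeholder assumption, and `fix_row` is a made-up name; this is not the sed or perl script from the thread.

```python
def fix_row(line: str, expected: int = 50, text_col: int = 10) -> str:
    """Quote the free-text column when a stray comma yields one field too many.

    text_col is a hypothetical zero-based index of the unquoted text column.
    """
    row = line.rstrip("\n")
    fields = row.split(",")
    if len(fields) != expected + 1:
        return row  # nothing to repair, or not the single-extra-comma case
    # Assume the extra comma sits inside the text column: merge the two pieces.
    merged = '"' + fields[text_col] + "," + fields[text_col + 1] + '"'
    return ",".join(fields[:text_col] + [merged] + fields[text_col + 2:])
```

For example, with three expected columns and the text in column 1, `fix_row("a,hello, world,c", expected=3, text_col=1)` yields `a,"hello, world",c`.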
