Page 3 - Collected discussions about udf - 话题女王

r*****d
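Posts: 346

From thread: DataSciences board - Pig UDF written in Python

The problem is finally solved.
The cause was that my original jython was an old version, 2.5.1+.
I downloaded the latest version, 2.5.3,
then switched both the original register jython.jar and PIG_CLASSPATH to the new one, and it worked.
The old version can be used to run a python script on its own,
but it cannot be used with register 'my_udf.py' using jython as myfuncs;
because it is incompatible with the Pig version in use.
That's it.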
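
For reference, a minimal sketch of what a file such as my_udf.py could contain. The function name to_upper and its schema string are hypothetical and do not come from the thread. The sketch assumes that Pig supplies the outputSchema decorator when the script is registered using jython, and it defines a stand-in so the same file can also be run under a plain interpreter for testing.

```python
# my_udf.py: hypothetical contents, for illustration only.

try:
    outputSchema  # assumed to be injected by Pig when loaded "using jython"
except NameError:
    # Stand-in used only when the file is run outside Pig.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("word:chararray")
def to_upper(word):
    """Return the upper-cased input, passing nulls through unchanged."""
    if word is None:
        return None
    return word.upper()
```

After a register statement like the one above, the function would be referenced from the Pig script as myfuncs.to_upper(some_field).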

r*****d
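Posts: 346

From thread: DataSciences board - Pig UDF written in Python

I have another question: what should I do when I need to import a python module?
For example, import json.
When I execute the pig script I get:
ImportError: no module named json
The json module is there.
Suppose the python script is named test.py.
Running python test.py works fine,
but running jython test.py also gives ImportError: no module named json
Thank you very much!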
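
One possible explanation, which this thread does not confirm: the json module joined the Python standard library in version 2.6, while jython 2.5.x follows Python 2.5, so the import can succeed under python and still fail under jython. The sketch below guards the import. It assumes a pure-Python simplejson package is available on the jython path as a fallback; the helper name extract_field is hypothetical.

```python
try:
    import json  # standard library from Python 2.6 onward
except ImportError:
    # Assumption: simplejson has been made available to jython.
    import simplejson as json


def extract_field(line, key):
    """Parse one JSON line and return the value stored under key, or None."""
    try:
        record = json.loads(line)
    except ValueError:
        return None  # malformed JSON: treat as a missing value
    if not isinstance(record, dict):
        return None
    return record.get(key)
```

Running the same file with both python and jython is a quick way to see which interpreter is missing the module.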

l*******m
Posts: 1096

From thread: DataSciences board - Pig UDF written in Python

You need to test it in jython.
BTW, using Java is much, much better with Hadoop. JSON parsing in Java is 10
times faster than in Python.

r*****d
Posts: 346

From thread: DataSciences board - Pig UDF written in Python

More details please? Thank you very much.

l*******m
Posts: 1096

From thread: DataSciences board - Pig UDF written in Python

If all you are doing is reading JSON in Pig, just use elephant bird's JSON loader; no Java coding is needed.
Download the following jars (google for their jar downloads; don't try to build the whole repository, it is not easy):
register /path/to/json-simple.jar;
register /path/to/elephant-bird-core.jar;
register /path/to/elephant-bird-pig.jar;
A complete example:
https://github.com/kevinweil/elephant-bird/blob/master/examples/src/main/pig/nested_json_get_distinct_items_from_nested_array.pig

j*******g
Posts: 331

From thread: DataSciences board - Pig UDF written in Python

Another thing I can think of: you should probably make sure all data nodes
have this package installed. Did you test on every data node you are going to use?

r*****d
Posts: 346

From thread: DataSciences board - Pig UDF written in Python

Hopefully it is not this complicated. I treat data nodes as an encapsulation.

j*******g
Posts: 331

From thread: DataSciences board - Pig UDF written in Python

"Data node" might be inaccurate, but look at this:
http://stackoverflow.com/questions/7831649/how-do-i-make-hadoop

t******g
Posts: 2253

From thread: DataSciences board - Pig UDF written in Python

Is the OP HH?

r*****d
Posts: 346

From thread: DataSciences board - Pig UDF written in Python

Thanks, everyone! I will hand out baozi later.

t*********u
Posts: 26311

From thread: DataSciences board - Does anyone write MapReduce programs entirely in Java?

Or do you use pig+UDF?

t*********u
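Posts: 26311

From thread: DataSciences board - Does anyone write MapReduce programs entirely in Java?

I just want to see whether there are any slightly more complex examples written entirely in Java,
to help with understanding and learning.
PS
Is the popular choice for UDFs Java or Python?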

w******k
Posts: 299

From thread: DataSciences board - Does anyone write MapReduce programs entirely in Java?

Hadoop for data storage. Hive as query interface with mapper/reducer written
in perl/python. Hive UDF was written in Java. Final classifier was written
in Java.

t*********u
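Posts: 26311

From thread: DataSciences board - Does anyone write MapReduce programs entirely in Java?

Then, at the level of the raw data,
can an Eval or Filter type UDF be used in FOREACH GENERATE?
That would amount to performing a map operation directly on the raw data.
My understanding is that this runs locally, is that right?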
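
To make the "map over raw records" idea concrete, here is a plain-Python simulation of per-record semantics only. It is not Pig code and says nothing about where Pig schedules the work. The function names and sample records are hypothetical. In Pig itself, an Eval function would normally appear in FOREACH ... GENERATE and a Filter function in FILTER ... BY.

```python
def trim_name(name):
    """Eval-style function: takes one field, returns one transformed value."""
    if name is None:
        return None
    return name.strip().title()


def is_adult(age):
    """Filter-style function: returns True when the record should be kept."""
    return age is not None and age >= 18


def process(records):
    """Apply the filter, then the eval function, one record at a time."""
    kept = [record for record in records if is_adult(record[1])]
    return [(trim_name(name), age) for name, age in kept]
```

Each record is handled independently of the others, which is what lets this kind of step run inside map tasks.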

s*******d
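Posts: 132

From thread: DataSciences board - Is anyone taking UW's data science course?

I feel this course does not teach much, yet the homework demands quite a lot.
The twitter python assignment is just an application of hash tables.
The sql assignment is not hard either, but things like conditional queries were not really covered, and there was not even a single udf example.
Doing the homework relies on google.
Is everyone choosing Amazon or kaggle?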

c***u
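Posts: 4107

From thread: DataSciences board - Can pig handle iterative problems?

I have been teaching myself recently. Can pig handle iterative problems, for example some matrix update problems?
Take nonnegative matrix factorization: there is a nonnegative matrix N to be factored into 2
matrices A and B such that |N-A*B| is as small as possible.
The standard algorithm is: first randomly generate 2 matrices A and B. Then fix A and use A and N to update B according to a rule;
then fix B and use B and N to update A; keep alternating like this until |N-A*B| is small enough.
Can pig or hive solve a problem like this?
(This is not a research problem; 5 or 6 years ago people were already publishing plenty of papers on it with mapreduce+java.)
Of course, I mean without writing additional UDFs.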
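
The alternating scheme described above can be sketched in plain Python. The post only says "a rule", so the update used here is an assumption: the widely used multiplicative update of Lee and Seung, with |N-A*B| taken as the Frobenius norm. All function and parameter names are hypothetical. This is a single-machine illustration of the loop, not Pig or Hive code.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_error(N, A, B):
    """Return |N - A*B| as a Frobenius norm."""
    AB = matmul(A, B)
    return sum((n - ab) ** 2
               for row_n, row_ab in zip(N, AB)
               for n, ab in zip(row_n, row_ab)) ** 0.5


def nmf(N, rank, max_iter=500, tol=1e-4, eps=1e-9, seed=0):
    """Factor nonnegative N into A (rows x rank) and B (rank x cols)."""
    rng = random.Random(seed)
    rows, cols = len(N), len(N[0])
    # Random positive starting matrices.
    A = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    B = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    for _ in range(max_iter):
        # Fix A, update B:  B <- B * (A^T N) / (A^T A B)
        At = transpose(A)
        num = matmul(At, N)
        den = matmul(matmul(At, A), B)
        B = [[b * n / (d + eps) for b, n, d in zip(rb, rn, rd)]
             for rb, rn, rd in zip(B, num, den)]
        # Fix B, update A:  A <- A * (N B^T) / (A B B^T)
        Bt = transpose(B)
        num = matmul(N, Bt)
        den = matmul(A, matmul(B, Bt))
        A = [[a * n / (d + eps) for a, n, d in zip(ra, rn, rd)]
             for ra, rn, rd in zip(A, num, den)]
        if frobenius_error(N, A, B) < tol:
            break
    return A, B
```

Every pass through the loop needs the full result of the previous pass, which is the part that a single pig or hive script does not express by itself.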

D**u
Posts: 288

From thread: DataSciences board - Can pig handle iterative problems?

Pig itself has no support for iteration, but if you really want to use Pig,
you can embed the pig script into a python (jython) program to do it
iteratively.
Check this for example:
http://thedatachef.blogspot.com/2013/11/linear-regression-with-
After all, this is not best practice, since every iteration spawns an M/R
job, and that is 2 sec wasted, and usually your algorithm runs for
hundreds of iterations. So, just use Spark. Spark now supports both Scala and
Python pretty much equally wel...
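
The control flow of such a driver program can be sketched as follows. This is a plain-Python stand-in, not real embedded Pig: in an actual jython driver the run_job step would launch a Pig script (Pig ships an org.apache.pig.scripting.Pig class for this), whereas here a hypothetical fake_job function replaces it so that the loop can be run anywhere.

```python
def iterate_until_converged(run_job, state, max_iter=100, tol=1e-6):
    """Call run_job repeatedly until the reported change drops below tol.

    run_job takes the current state and returns (new_state, change).
    Returns the final state and the number of iterations used.
    """
    for iteration in range(1, max_iter + 1):
        state, change = run_job(state)
        if change < tol:
            return state, iteration
    return state, max_iter


def fake_job(x):
    """Stand-in for one M/R job: a single Newton step toward sqrt(2)."""
    new_x = 0.5 * (x + 2.0 / x)
    return new_x, abs(new_x - x)


final_state, iterations = iterate_until_converged(fake_job, 1.0)
```

With a real job in place of fake_job, the fixed startup cost mentioned above is paid once per pass through the loop.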

a*****d
Posts: 18

From thread: DataSciences board - data scientist position

My current company wants to expand its data science team. It is a pre-ipo big data company (one of the more
promising ones). The team mainly
does consulting service, and the projects are fairly interesting. The boss is white and a very nice person.
Key responsibilities include:
Help customers understand and evaluate data science use-cases appropriate
for their business
Collaborate with customer teams to formulate the problem, recommend a
solution approach and design a data architecture
Create a prototype in R, Python, Java or similar stack to demonstrate the
results of various algorithmic approaches and evaluate their performance
Work ...

a*****d
Posts: 18

From thread: DataSciences board - data scientist position

t*********u
Posts: 26311

From thread: DataSciences board - Will a UDF in hive be run by several nodes at the same time?

j*****n
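Posts: 1545

From thread: DataSciences board - Will a UDF in hive be run by several nodes at the same time?

What does it have to do with nodes? Whatever should run in the mapper runs in the mapper, and whatever should run in the reducer runs in the reducer.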

j********p
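Posts: 9680

From thread: DataSciences board - Will a UDF in hive be run by several nodes at the same time?

It seems Hive was designed for distributed computing, so running on several nodes at the same time is to be expected,
but as a user you do not need to worry about which ones are running.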
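
As a loose analogy only, the sketch below applies the same function to several input splits at once. Threads stand in for mapper tasks and the names are hypothetical. It is not Hive code; it only illustrates that each task applies its own call of the function to its own slice of the data.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize(value):
    """The 'UDF': a stateless function of a single input value."""
    return value.strip().lower()


def run_task(split):
    """One task applies the function to every row of its own split."""
    return [normalize(value) for value in split]


def run_all(splits, workers=3):
    """Run one task per split concurrently, keeping the split order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, splits))
```

Because normalize keeps no shared state, it does not matter which task handles which split.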

l******n
Posts: 9344

From thread: DataSciences board - san bruno ds position

Minimum Qualifications:
- PhD in Computer Science, Statistics or related field; OR a Master’s
degree or equivalent in Computer Science, Statistics or related field and 2
years of related experience.
- Knowledge of machine learning, information retrieval, data mining,
statistics, NLP or related field.
- Programming skills in one of the following languages: Java, Scala, C/C++.
- Knowledge of one of the scripting languages such as Python or Perl.
- Experience analyzing and interpreting the resu...

G***n
Posts: 877

From thread: DataSciences board - Question about loading CSV in HIVE

Hive itself is not mature either; many things require a UDF.

w*******y
Posts: 60932

From thread: _DealGroup board - 【$】Free copy of Daemon Tools Pro 4/22 only

Chip.de is having an Easter promotion and giving away copies of Daemon Tools
Pro.
The installer comes with the serial number pre-loaded. To get a copy, go to
Link:
http://www.chip.de/downloads/DAEMON-Tools-Pro-Advanced-Vollvers
Click "Zum Download", then click "Download-Server CHIP Online" and wait a
few seconds; the download should start automatically.
Features include: Create .iso, .mds/.mdf and .mdx images of CD, DVD, Blu-ray
discs
Protect images with password
Make or edit images with...

q*z
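Posts: 13362

From thread: _Xiyu board - One more question.

Of course not. This issue is quite complicated,
and different bluray players also behave differently.
The most basic method is to use udf 2.5
and burn the files directly, but that does not guarantee
that every bluray player can play the disc, because
some players have restrictions on folders or file names.
To ensure compatibility, it is best to use software that supports avchd master,
such as nero, to burn.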
