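r*****d Posts: 346 | 1 The problem is finally solved.
The cause was that my original jython was an old version, 2.5.1+.
I downloaded the latest version, 2.5.3,
then pointed both the registered jython.jar and PIG_CLASSPATH at the new one, and it worked.
The old version could run a Python script on its own,
but it could not be used with register 'my_udf.py' using jython as myfuncs;
because it was incompatible with the Pig version I was using.
That's all. |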
|
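For readers who have not seen a file registered with `using jython`, here is a minimal sketch of what a `my_udf.py` might contain. The thread never shows the real file, so the function `to_upper` and its schema are hypothetical. The sketch assumes Pig supplies the `outputSchema` decorator when it loads the script, and defines a stand-in so the same file can also be run on its own with `python` or `jython`, as the poster did.

```python
# my_udf.py: hypothetical contents; the thread never shows the real file.

try:
    outputSchema  # assumed to be provided by Pig when the file is registered
except NameError:
    def outputSchema(schema):
        """Stand-in decorator so the file also runs outside Pig."""
        def decorate(func):
            func.outputSchema = schema  # keep the schema string for inspection
            return func
        return decorate


@outputSchema("word:chararray")
def to_upper(word):
    """Return the input upper-cased, passing nulls through unchanged."""
    if word is None:
        return None
    return word.upper()
```

After `register 'my_udf.py' using jython as myfuncs;`, such a function would be called in the Pig script as `myfuncs.to_upper(...)`.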
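r*****d Posts: 346 | 2 Another question: what should I do when the UDF needs to import a Python module?
For example, import json.
When I run the pig script I get:
ImportError: no module named json
The json module is there.
Suppose the Python script is named test.py.
Running python test.py works fine,
but running jython test.py also gives ImportError: no module named json
Many thanks! |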
|
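One likely explanation, which the thread does not confirm: `json` joined the Python standard library in version 2.6, while Jython 2.5.x implements Python 2.5, so `import json` succeeds under a newer `python` and fails under `jython`. A common workaround is a fallback import. The sketch below assumes a pure-Python copy of `simplejson` has been placed on Jython's module path; `get_field` is a hypothetical helper used only for illustration.

```python
try:
    import json  # standard library from Python 2.6 onward
except ImportError:
    # Assumption: a pure-Python simplejson is available on the module path.
    import simplejson as json


def get_field(line, key):
    """Parse one JSON-encoded line and return the value stored under key."""
    record = json.loads(line)
    return record.get(key)
```

Running the same file with both `python test.py` and `jython test.py` shows which branch each interpreter takes.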
l*******m Posts: 1096 | 3 You need to test it in jython.
BTW, using Java is much, much better with Hadoop. JSON parsing in Java is
10 times faster than in Python. |
|
r*****d Posts: 346 | 4 More details please? Thank you very much. |
|
l*******m Posts: 1096 | 5 If all you need is to read JSON in Pig, just use elephant-bird's JSON loader; no Java coding is required.
Download the following jars (google for the prebuilt jars; don't build the whole repository, it isn't easy):
register /path/to/json-simple.jar;
register /path/to/elephant-bird-core.jar;
register /path/to/elephant-bird-pig.jar;
A complete example:
https://github.com/kevinweil/elephant-bird/blob/master/examples/src/main/pig/nested_json_get_distinct_items_from_nested_array.pig |
|
j*******g Posts: 331 | 6 Another thing I can think of: you should probably make sure all data nodes
have this package installed. Did you test on every data node you are going to use? |
|
r*****d Posts: 346 | 7 Hopefully it is not this complicated. I treat the data nodes as encapsulated
nodes. |
|
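t*********u Posts: 26311 | 12 I'd just like to see a slightly more complex example written entirely in Java,
to make it easier to understand and learn from.
PS
Is the popular choice for UDFs Java or Python? |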
|
w******k Posts: 299 | 13 Hadoop for data storage. Hive as the query interface, with mappers/reducers written
in Perl/Python. The Hive UDFs were written in Java. The final classifier was written
in Java. |
|
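t*********u Posts: 26311 | 14 Then at the raw-data level,
can an Eval or Filter UDF be used in a FOREACH GENERATE?
That would amount to applying a map operation directly to the raw data.
My understanding is that this runs locally; is that right? |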
|
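s*******d Posts: 132 | 15 I feel this course doesn't teach much, yet the assignments ask for quite a lot.
The twitter python assignment is just an application of hash tables.
The sql assignment isn't hard either, but things like conditional queries weren't really covered, and there wasn't even one UDF example.
Doing the assignments comes down to googling.
Is everyone choosing the Amazon one, or kaggle? |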
|
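c***u Posts: 4107 | 16 I've been self-studying recently. Can Pig handle iterative problems, such as matrix-update problems?
Take nonnegative matrix factorization: given a nonnegative matrix N, factor it into two
matrices A and B such that |N-A*B| is as small as possible.
The standard algorithm: first randomly generate two matrices A and B. Then fix A and update B from A and N according to some rule;
then fix B and update A from B and N; keep alternating like this until |N-A*B| is small enough.
Can Pig or Hive solve a problem like this?
(This is not a research question; people were already publishing N papers on it with mapreduce+java 5 or 6 years ago.)
Of course, I mean without writing additional UDFs. |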
|
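The alternating scheme described above can be sketched on a single machine in plain Python. The post does not name the update rule, so this sketch assumes the widely used multiplicative updates (B ← B·(AᵀN)/(AᵀAB), then A ← A·(NBᵀ)/(ABBᵀ)); the function names, the `eps` guard, and the default parameters are all illustrative choices, not something taken from the thread.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_error(N, A, B):
    """Frobenius norm of N - A*B, used here as the |N-A*B| measure."""
    AB = matmul(A, B)
    return sum((n - p) ** 2 for rn, rp in zip(N, AB) for n, p in zip(rn, rp)) ** 0.5


def nmf(N, rank, max_iters=200, tol=1e-4, seed=0, eps=1e-9):
    """Alternate updates of B and A until the error is below tol."""
    rng = random.Random(seed)
    rows, cols = len(N), len(N[0])
    # Random positive starting matrices A (rows x rank) and B (rank x cols).
    A = [[rng.random() + eps for _ in range(rank)] for _ in range(rows)]
    B = [[rng.random() + eps for _ in range(cols)] for _ in range(rank)]
    errors = [frobenius_error(N, A, B)]
    for _ in range(max_iters):
        # Fix A, update B element-wise: B <- B * (A^T N) / (A^T A B).
        At = transpose(A)
        num = matmul(At, N)
        den = matmul(matmul(At, A), B)
        B = [[b * n / (d + eps) for b, n, d in zip(rb, rn, rd)]
             for rb, rn, rd in zip(B, num, den)]
        # Fix B, update A element-wise: A <- A * (N B^T) / (A B B^T).
        Bt = transpose(B)
        num = matmul(N, Bt)
        den = matmul(A, matmul(B, Bt))
        A = [[a * n / (d + eps) for a, n, d in zip(ra, rn, rd)]
             for ra, rn, rd in zip(A, num, den)]
        errors.append(frobenius_error(N, A, B))
        if errors[-1] < tol:
            break
    return A, B, errors
```

Each pass needs the result of the previous one, which is why a framework without loop support has to launch a fresh job per iteration.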
D**u Posts: 288 | 17 Pig itself has no support for iteration, but if you really want to use Pig,
you can embed the Pig script in a Python (jython) program to run it
iteratively.
Check this for an example:
http://thedatachef.blogspot.com/2013/11/linear-regression-with-
Still, this is not best practice, since every iteration spawns an M/R job,
and that is 2 sec wasted each time, and usually your algorithm runs for
hundreds of iterations. So, just use Spark. Spark now supports both Scala and
Python pretty much equally wel... |
|
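The control flow of such an embedding can be sketched without a cluster. In the sketch below, `run_until_converged` and `fake_job` are hypothetical names; under Jython, the callable passed in would presumably wrap one bound run of a script compiled through Pig's scripting class (`org.apache.pig.scripting.Pig`) and return an error value read back from that job's output. That wiring is an assumption and is not shown here.

```python
def run_until_converged(run_iteration, max_iters=100, tol=1e-3):
    """Call run_iteration(i) until the error it returns drops below tol.

    run_iteration is a hypothetical callable standing in for one Pig job.
    Returns the list of errors observed, one per iteration actually run.
    """
    history = []
    for i in range(max_iters):
        error = run_iteration(i)
        history.append(error)
        if error < tol:
            break
    return history


def fake_job(i):
    """Stand-in for a real job: the error halves on every pass."""
    return 1.0 / (2 ** (i + 1))


if __name__ == "__main__":
    errors = run_until_converged(fake_job, max_iters=50, tol=1e-3)
    print("iterations run: %d" % len(errors))
```

The length of the returned history is the number of jobs launched, which is the per-iteration overhead the post warns about.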
a*****d Posts: 18 | 18 My current company wants to expand its data science team. It is a pre-IPO big-data company (one of the more promising ones).
The team mainly does consulting services, and the projects are fairly interesting. The boss is white and a very nice person.
Key responsibilities include:
Help customers understand and evaluate data science use-cases appropriate for their business
Collaborate with customer teams to formulate the problem, recommend a solution approach and design a data architecture
Create a prototype in R, Python, Java or similar stack to demonstrate the results of various algorithmic approaches and evaluate their performance
Work ... |
|
j*****n Posts: 1545 | 21 What does this have to do with nodes? Whatever should run in a mapper runs in a mapper, and whatever should run in a reducer runs in a reducer. |
|
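j********p Posts: 9680 | 22 My feeling is that Hive was designed for distributed computing, so having multiple nodes run at the same time is expected,
but as a user you don't need to worry about which ones are running. |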
|
l******n Posts: 9344 | 23 Minimum Qualifications:
- PhD in Computer Science, Statistics or related field; OR a Master’s degree or equivalent in Computer Science, Statistics or related field and 2 years of related experience.
- Knowledge of machine learning, information retrieval, data mining, statistics, NLP or related field.
- Programming skills in one of the following languages: Java, Scala, C/C++.
- Knowledge of one of the scripting languages such as Python or Perl.
- Experience analyzing and interpreting the resu... |
|
G***n Posts: 877 | 24 Hive itself is not mature either; many things require a UDF
jar |
|
w*******y Posts: 60932 | 25 Chip.de is having an Easter promotion and giving away copies of Daemon Tools
Pro.
The installer comes with the serial number pre-loaded. To get a copy, go to
Link:
http://www.chip.de/downloads/DAEMON-Tools-Pro-Advanced-Vollvers
Click "Zum Download", then click "Download-Server CHIP Online" and wait a
few seconds, the download should start automatically.
Features include:
Create .iso, .mds/.mdf and .mdx images of CD, DVD, Blu-ray discs
Protect images with password
Make or edit images with... |
|
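q*z Posts: 13362 | 26 Of course not; this issue is quite complicated,
and it also differs between Blu-ray players.
The most basic method is to use UDF 2.5
and burn the files directly, but that does not guarantee
that every Blu-ray player can play the disc, because
some players have restrictions on folder or file names.
To ensure compatibility, it is best to use software that supports AVCHD mastering,
such as Nero, to burn the disc. |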
|