Processing massive SAS data - Statistics board

This page is an excerpt and archive of the corresponding thread on MITBBS (未名空间).

c****s
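Posts: 395

I have a SAS data file of more than 200 GB, and processing it in SAS is too slow.
To speed things up I am thinking of converting it to a CSV file and processing it in R. Would that be faster? The catch is that converting from SAS to CSV would itself take a long time.
Does anyone know a better approach? SQL is not an option; I can't use it right now.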

s*r
Posts: 2757

It won't be any faster.

T*******I
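Posts: 5138

That approach is about the clumsiest there is.
Statistics is the science of studying samples. The quickest way is to sample the 200 GB database, for example by systematic sampling, which is the sampling method with the smallest error. If you draw a 5% sample you only have to process 10 GB, and processing will be much faster. This is the only effective way to speed things up.
Some people seem to have gone back to the statistical thinking of the late 19th and early 20th centuries, believing that the bigger and more complete the data, the more precise and accurate the result. In fact, we can never obtain an absolutely true result about the population, and so we do not need one. 200 GB looks large, but compared with an infinite population the information it contains still tends to zero.

[In reply to c****s]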
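
For readers unfamiliar with the term, here is a minimal sketch of systematic sampling as proposed in this post: keep every k-th record after a random starting offset. The function name `systematic_sample` and its parameters are hypothetical, chosen for illustration; nothing in the thread specifies an implementation. Note that this scheme can give biased results when the record order has a periodic pattern, which is part of what the replies below argue about.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def systematic_sample(
    records: Iterable[T], step: int, start: Optional[int] = None
) -> Iterator[T]:
    """Return an iterator over every `step`-th record, beginning at `start`.

    `start` defaults to a random offset in [0, step). A 5% sample
    corresponds to step=20. Works on any iterable, so the full data
    never has to be held in memory.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    if start is None:
        start = random.randrange(step)
    if not 0 <= start < step:
        raise ValueError("start must satisfy 0 <= start < step")
    return islice(records, start, None, step)


# 200 records sampled at 5% give 10 records.
sample = list(systematic_sample(range(200), step=20, start=3))
print(len(sample), sample[:3])
```

Because the sampler only skips records, it still has to read the whole file once; it reduces the work done after reading, not the read itself.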

A*******s
Posts: 3942

u don't even know what the nature of the data is and u start bullshitting...
don't mislead anyone here please.

[In reply to T*******I]

s********0
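Posts: 51

R is actually quite slow, and just reading in 200 GB of data would probably take at least several days, if it can be read in at all. If you really are in this situation, I think you are better off using another language such as C or Python, which would certainly be much faster. If preprocessing can shrink the data, say to a few hundred MB, you could still consider SAS or R; otherwise you will have to see whether that language has a package for this.

[In reply to c****s]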
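
A minimal sketch of what processing a large file "in Python" could look like once the data is in CSV form, assuming the goal is a simple aggregate. The point is that rows are streamed one at a time instead of being loaded into memory. The function name `streaming_mean`, the column name `amount`, and the tiny in-memory demo file are all assumptions for illustration; the thread does not say what computation the original poster needs.

```python
import csv
import io
from typing import Iterable


def streaming_mean(lines: Iterable[str], column: str) -> float:
    """Mean of one numeric column, computed in a single pass.

    Only the current row is held in memory, so file size is limited by
    read time rather than RAM. Empty cells are treated as missing.
    """
    count = 0
    total = 0.0
    for row in csv.DictReader(lines):
        value = row[column]
        if value == "":
            continue  # skip missing values
        total += float(value)
        count += 1
    return total / count if count else float("nan")


# Stand-in for a real file opened with open(path, newline="").
demo = io.StringIO("id,amount\n1,10\n2,20\n3,\n4,30\n")
print(streaming_mean(demo, "amount"))
```

Whether this beats SAS depends mostly on disk throughput, since both approaches have to read every byte once.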

T*******I
Posts: 5138

Ridiculous! A statistics expert who objects to sampling.

[In reply to A*******s]

A*******s
Posts: 3942

funny...i am not a statistics expert, just someone who happens to know stat 101.
let alone interval estimates, which are highly affected by sample size.
just think about two cases for predictive modeling (point estimates only):
1. high dimensional data. a decrease in sample size would increase the ratio P/N drastically.
2. highly unbalanced data. a sub-sample may not have enough positive observations.

[In reply to T*******I]
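
The two cases in this post can be made concrete with a little arithmetic. The sketch below is illustrative only: the row count, feature count, and positive rate are made-up numbers, and `subsample_summary` is a hypothetical helper, not something from the thread.

```python
def subsample_summary(
    n_rows: int, n_features: int, positive_rate: float, fraction: float
) -> dict:
    """Show how a sub-sample changes P/N and the expected positive count."""
    n_sub = round(n_rows * fraction)
    return {
        "n": n_sub,
        # Ratio of features (P) to observations (N); grows as N shrinks.
        "p_over_n": n_features / n_sub,
        # Expected number of positive cases left in the sub-sample.
        "expected_positives": n_sub * positive_rate,
    }


# Assumed example: 1,000,000 rows, 5,000 features, 0.1% positives.
full = subsample_summary(1_000_000, 5_000, 0.001, 1.0)
five_percent = subsample_summary(1_000_000, 5_000, 0.001, 0.05)
print(full)
print(five_percent)
```

Under these assumed numbers a 5% sample multiplies P/N by 20 and leaves about 50 positive cases instead of about 1,000, which is the concern the post raises.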

p********a
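Posts: 5352

The problem is that this obviously is not something you can do by sampling. Someone is asked to process a big dataset and you say "just take a sample"; the boss would be livid.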

T*******I
Posts: 5138

You want to know everything and not miss any information; in other words, you want to be certain about everything. That is impossible.

[In reply to A*******s]

V**0
Posts: 889

I agree. Without a careful examination of the data and a rough idea about the data generating process, sampling could cause a lot of issues. However, sampling in the right way (in terms of many aspects) can indeed do a good job of scaling down the complexity, to make the problem practically solvable. Sometimes a trade-off has to be made. Thinking further about this problem may lead into the long-standing debate between statisticians and applied econometricians.

[In reply to A*******s]

V**0
Posts: 889

It depends on whether the boss cares more about how you deal with the data or more about the sense that you make out of the data.
In most cases, data is to be used, not to be believed.

[In reply to p********a]

p********a
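Posts: 5352

You are looking at this from a statistical analysis point of view, but the OP (LZ) is asking a data analysis question, not a statistical analysis one.
When the patterns in the data and the goal are both unclear, it is too early to talk about sampling.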

A*******s
Posts: 3942

please discuss problems in math/stat language.
"know everything" and "be certain about everything" do not make any sense here.
no prediction is correct, but the model with a 90% correct rate is better than the one with 87%.
For LZ's problem, i think lz needs to offer more details--what kind of data u have, what is the method u r using? the solution depends on the context--maybe a new optimized algorithm giving an approximate estimate, or maybe something like multi-threaded programming...

[In reply to T*******I]

A*******s
Posts: 3942

can u give some specific examples? i am weak in sampling techniques and would like to know more. thanks

[In reply to V**0]

T*******I
Posts: 5138

Suppose the OP's boss wants results within 10 days, and under current conditions running the full 200 GB takes 12 days. What should he or she do then?

[In reply to A*******s]

T*******I
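Posts: 5138

I think that reply of mine was already well in line with statistical thinking. If you cannot accept it and insist that I express it in mathematical terms, that is your problem. You can translate between English and Chinese, so you should also be able to translate between statistics and mathematics.

[In reply to A*******s]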

A*******s
Posts: 3942

i think that case is much less likely than the case that u don't know statistics.

[In reply to T*******I]

A*******s
Posts: 3942

sorry but u just can't read

[In reply to T*******I]

c****s
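Posts: 395

I am LZ (the original poster). Thanks, everyone, for the lively comments; this has been quite a debate.
If Python is fast I will use it, and if I really run out of time I will use a sample.
The real issue is that going from SAS to CSV also takes a long time. Why is SAS so slow?
yes, i am doing data processing, not statistics related.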

D******n
Posts: 2836

fts, what does this have to do with sampling?
Say I need to find the obs numbers of all obs with var1==1; can I really do that by sampling?

[In reply to A*******s]
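
The task named in this post is an exhaustive filter, so every record has to be inspected. A minimal sketch, assuming the data has been exported to CSV with a header row; `matching_obs_numbers` is a hypothetical helper and the demo data is invented:

```python
import csv
import io
from typing import Iterable, Iterator


def matching_obs_numbers(
    lines: Iterable[str], column: str, target: str
) -> Iterator[int]:
    """Yield the 1-based observation number of every row whose
    `column` equals `target`, reading the input once, row by row."""
    reader = csv.DictReader(lines)
    for obs_number, row in enumerate(reader, start=1):
        if row[column] == target:
            yield obs_number


demo = io.StringIO("var1,var2\n1,a\n0,b\n1,c\n")
print(list(matching_obs_numbers(demo, "var1", "1")))
```

A sample could only estimate how many such rows exist; it could not list them, which is the point of the post.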

T*******I
Posts: 5138

In statistics, sampling can be carried out unconditionally. You should go back to your alma mater and check this statement with your advisor.

[In reply to A*******s]

S******y
Posts: 1123

Hadoop?
MapReduce?

T*******I
Posts: 5138

Your problem is definitely related to Statistics.

[In reply to c****s]

A*******s
Posts: 3942

well, u can use ur IQ to estimate the mean IQ of the whole population. it's an unbiased estimate anyway.

[In reply to T*******I]

v*****a
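Posts: 1332

Even I, a computer science person, cannot stand watching this.
You say a 5% sample is fine without the most basic understanding of the data?
For example:
A matrix with 200M entries, each holding 1K of data: maybe you can sample.
A matrix with 200K entries, each holding 1M of data: maybe you can sample.
But if the whole dataset is just a 20x10 matrix with each entry 1G in size, can you still draw a 5% sample?

[In reply to T*******I]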

A*******s
Posts: 3942

lz already stated it's data processing, how could u be so stupid to say it's statistics? what if he's just doing data merging and table joins?

[In reply to T*******I]

T*******I
Posts: 5138

Why not? It depends on what you are doing and what your goal is.
If your goal is statistical analysis, and your data is a 20x10 matrix that needs to be merged, with each entry holding 1G of data, you can use sampling to reduce each 1G entry to 1M or 1K or whatever size you can obtain, and then merge.
Of course, if your goal is purely to merge data and build a huge database, that is another matter.

[In reply to v*****a]

s********p
Posts: 637

That situation does not happen.

[In reply to v*****a]

s********p
Posts: 637

If the data file is still very large after data cleaning and subsetting, the standard practice is to take a random sample and then analyze further.

[In reply to c****s]
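
One way to draw a simple random sample from a file too large to load is reservoir sampling (Algorithm R), which needs a single pass and memory for only k records. This is a sketch of one possible technique, not necessarily what the poster had in mind; the name `reservoir_sample` and its `seed` parameter are assumptions for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    records: Iterable[T], k: int, seed: Optional[int] = None
) -> List[T]:
    """Return k records chosen uniformly at random without replacement.

    Algorithm R: keep the first k records, then let record i (0-based)
    replace a random slot with probability k / (i + 1).
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


print(reservoir_sample(range(1000), k=5, seed=42))
```

Unlike systematic sampling, this does not depend on the order of the records, at the cost of one random draw per record.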

s********p
Posts: 637

I/O

[In reply to c****s]

T*******I
Posts: 5138

In this situation you should use systematic sampling rather than random sampling; the former has the smallest sampling error.

[In reply to s********p]

v*****a
Posts: 1332

....
Then let me simplify the problem further.
Take a vector of length 200 in which each element is a super large number, really super, super large, so that each element is 1G in size.
Now I want to compute the STD.
Can you sample this?
Can you really sample one of my 1G numbers down to "1M or 1K"??

[In reply to T*******I]

v*****a
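Posts: 1332

1. I am giving an extreme example.
2. The research I am doing now is similar to the OP's situation. I would not dare say that sampling absolutely cannot be used, but it is definitely not something you solve by drawing a sample on a whim.

[In reply to s********p]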

T*******I
Posts: 5138

If you are playing "number games", that has nothing to do with me.
If yours is a realistic problem from nature, then however large a number is, it can be reduced to a small one. For example, the distance from the sun to the earth is 149500000 km, or 149500000000 m, or 149500000000000 mm, and can be written as 1.495 (hundred million km), and so on. So a more compact notation can shrink that super super super ... super large data of yours.

[In reply to v*****a]

A*******s
Posts: 3942

u can't read.

[In reply to T*******I]

T*******I
Posts: 5138

Besides, a 5% systematic sample of 200 independent units gives you 10 units. You can still compute the STD you want from those.

[In reply to v*****a]

v*****a
Posts: 1332

There is no way to communicate with you, sir.
Go on clinging to your invincible "H0" and "fail to reject" for the rest of your life.
Just never go out and do engineering; you would do real damage.

[In reply to T*******I]

A*******s
Posts: 3942

u don't even know the patterns in the data before doing systematic sampling?

[In reply to T*******I]

T*******I
Posts: 5138

This is very easy for me as long as I have the whole dataset.

[In reply to A*******s]

s********p
Posts: 637

Thanks! But I said after data cleaning and subsetting, then random sampling.
Maybe you don't know what subset means!

[In reply to T*******I]

v*****a
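Posts: 1332

Are all math/stats people this argumentative??????
"Besides, a 5% systematic sample of 200 independent units gives you 10 units. You can still compute the STD you want from those."
What if I have 20 units????? Can you compute an STD from a single unit?!!!!!
Damn!!! From now on when I tell off liberal-arts girls I will never call myself a "science-and-engineering guy" again. I am an "engineering guy", and I am ashamed to be grouped with "science guys".

[In reply to T*******I]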
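
The arithmetic behind this objection: a 5% systematic sample keeps every 20th unit, so 200 units yield 10 values but 20 units yield only one, and a sample standard deviation needs at least two. A small sketch using Python's standard `statistics` module; `systematic_std` is a hypothetical helper and the input values are placeholders.

```python
import statistics
from typing import Optional, Sequence


def systematic_std(values: Sequence[float], step: int) -> Optional[float]:
    """Sample standard deviation of every `step`-th value.

    Returns None when fewer than two values are sampled, because
    statistics.stdev raises StatisticsError in that case.
    """
    sample = values[::step]
    if len(sample) < 2:
        return None
    return statistics.stdev(sample)


print(systematic_std(list(range(200)), step=20))  # 10 sampled values
print(systematic_std(list(range(20)), step=20))   # 1 sampled value: None
```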

A*******s
Posts: 3942

come on! u took the wrong sample of "li ke nan" (science guy)! he is a "min ke nan" (amateur crackpot scientist)...

[In reply to v*****a]

s********p
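Posts: 637

I think that in the situation you describe, sampling is not possible.
But in ordinary statistical analysis you do not get data like that; I have never come across it. If it really exists, either whoever designed the file structure made a mistake, or the problem itself gives rise to this unusual kind of data.
Careful data cleaning can remove a lot of useless material and shrink the file.

[In reply to v*****a]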

v*****a
Posts: 1332

Yes, absolutely right.
I am slogging through data cleaning right now.

[In reply to s********p]

T*******I
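Posts: 5138

You are the one being argumentative.
20 is already a very small sample, and I have never heard that 20 data points cannot be analyzed statistically, unless he says that all 20 of his observations tend to infinity.

[In reply to v*****a]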

v*****a
Posts: 1332

Science guy--OK!!!!

[In reply to T*******I]

T*******I
Posts: 5138

I am a medical guy.
You are the one being argumentative, the one flaring up, and the one who is exasperated. What more do you want?

[In reply to v*****a]

A*******s
Posts: 3942

do u know what an interval estimate is?

[In reply to T*******I]

T*******I
Posts: 5138

When I was learning interval estimation in 1986, what were you doing?

[In reply to A*******s]

A*******s
Posts: 3942

u learned it in a psychiatric hospital?

[In reply to T*******I]

T*******I
Posts: 5138

That does not answer the question. Could you really be ill?

[In reply to A*******s]

A*******s
Posts: 3942

Lol, feeding a troll really entertains me in the middle of work

[In reply to T*******I]

T*******I
Posts: 5138

enjoy yourself.

[In reply to A*******s]

o****o
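Posts: 8077

Wow, this has gone way off topic.
Back to the OP's original post.
OP, what processing do you want to do? Different operations call for different optimizations. If you will not even describe the specific processing steps, nobody can answer, right?

[In reply to c****s]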

d*******o
Posts: 493

Tell your boss to buy a mainframe; you can get one for 100k now.

s*********e
Posts: 1051

handling large data is a very hot topic, especially on the west coast.
first of all, you need to give background about your software, hardware, and business scenario. this is not a trivial subject at all.
from my very limited knowledge and experience, there are 2 approaches in production. 1) if in sas, you could use the in-database processing capability. Currently, this sas functionality is supported in teradata. 2) if out of sas, hadoop is another option.

d*******o
Posts: 493

A UNIX-Teradata-SAS system is going to cost the company a leg. A second-hand Mainframe SAS is so cheap now.

[In reply to s*********e]

a***g
Posts: 2761

If you take this seriously, you lose.

g*********d
Posts: 233

R's major weakness is that it can't handle large data sets the way SAS can. R has trouble dealing even w 1G of data

l*********s
Posts: 5409

dan ding (stay calm), teacher chen is an outlier.

[In reply to v*****a]

a***g
Posts: 2761

If the OP already has R code, it would actually be more reliable to adapt it and write the thing in C++ yourself.
Also, what structure is the data itself in?
Why does it have to go through SAS to become CSV?
Can't you find a dedicated, direct conversion tool?

[In reply to g*********d]

c****s
Posts: 395

Hey, Thank you guys for the replies.
my company obviously can't afford to buy more software unless it is free.
this data step is in the beginning period, and the data needs to be merged with other tables. so sampling obviously is not a good way.
oloolo: i don't know if there are many ways to do it in sas.
by using c++, do you mean designing a new function and interfacing it with sas? will it make the data processing faster?
right now, I just drop some redundant and large-sized variables

o****o
Posts: 8077

i was asking what kind of operation you want to do with the data
merge? concatenate? interleaving? or a series of manipulations
without this information, ppl can hardly help

[In reply to c****s]

d********r
Posts: 111

It depends on what operations you are going to apply to the data.
If you need to join the data with other tables and processing with SAS is too slow, you can try Perl (which is good at processing text files) or C (if you like to sharpen your programming skills) after you dump the data into a text file. Then you can use the built-in command "join" in UNIX to join the data set with other tables also dumped to text files.
I have recently processed a 300 GB text file to match with about 100M records. It doesn't take too long on a Linux box, about 2 to 3 days to process the big text file including sorting, which takes most of the time. Once both tables are sorted, joining them only takes a fraction of the time spent in sorting. The big file itself comes as a text file, so using a text processing tool like Perl is natural. I am not sure about your case, but it's worth a shot. Of course, dumping the data from SAS to a text file itself takes a long time, maybe a couple of days.

[In reply to c****s]
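
The sort-then-join approach described in this post can be sketched as a merge join over two inputs that are already sorted by the join key. The sketch is in Python rather than Perl or C, it assumes the keys in the right-hand table are unique, and the function name `merge_join` and the record layout are hypothetical. In practice the sorting itself would be done by an external tool, since that is the step the post says dominates the run time.

```python
from typing import Dict, Iterable, Iterator


def merge_join(
    left: Iterable[Dict], right: Iterable[Dict], key: str
) -> Iterator[Dict]:
    """Inner-join two streams of dicts that are both sorted by `key`.

    Assumes keys in `right` are unique (a lookup table); `left` may
    repeat keys. Each input is read once, front to back, so memory
    use stays constant regardless of table size.
    """
    right_iter = iter(right)
    right_row = next(right_iter, None)
    for left_row in left:
        # Advance the right side until its key catches up with the left key.
        while right_row is not None and right_row[key] < left_row[key]:
            right_row = next(right_iter, None)
        if right_row is None:
            return  # right side exhausted: no further matches possible
        if right_row[key] == left_row[key]:
            yield {**left_row, **right_row}


big = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"},
       {"id": 2, "x": "c"}, {"id": 4, "x": "d"}]
lookup = [{"id": 2, "y": "B"}, {"id": 3, "y": "C"}, {"id": 4, "y": "D"}]
for row in merge_join(big, lookup, "id"):
    print(row)
```

Because each side is consumed strictly in order, the join is a single linear pass, which matches the observation that joining takes only a fraction of the sorting time.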
