a******n 发帖数: 11246 | 1 Suppose I have a table with one column called acct_id.
This column is already sorted.
How do I efficiently check whether there are any duplicate values?
I will give a double-yolk baozi to the most efficient method. Thanks |
c*********r 发帖数: 1802 | 2 Try proc freq:
proc freq data=a;
    tables acct_id;
run; |
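A slightly fuller sketch of this suggestion, assuming the source table is named a (the output name id_counts is made up here). NOPRINT plus OUT= avoids printing millions of frequency rows and leaves a small dataset holding only the duplicated IDs:

proc freq data=a noprint;
    tables acct_id / out=id_counts(keep=acct_id count);
run;

data dup_ids;
    set id_counts;
    if count > 1;   /* keep only acct_id values that occur more than once */
run;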
x***x 发帖数: 3401 | 3 proc sort data=a nodupkey;
    by acct_id;
run;
This will eliminate any duplicated values of acct_id. |
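To detect duplicates rather than silently drop them, a variant of this call could write the de-duplicated rows to a separate dataset (the name a_nodup is an assumption) so the original table is left untouched; the SORT step's log note about observations deleted for duplicate key values then tells you whether any duplicates existed:

proc sort data=a out=a_nodup nodupkey;
    by acct_id;
run;
/* If the log reports 0 observations deleted for duplicate key values, acct_id is unique. */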
a******n 发帖数: 11246 | 4 This will give the frequency of every acct_id, right?
This table has millions of acct_id values.
I just want to find out whether, by any chance, one acct_id has more than one record...
【In reply to c*********r】
|
a******n 发帖数: 11246 | 5 Then compare the row counts?
But if the row counts are not the same, I still don't know which acct_id
has more than one record...
【In reply to x***x】
|
x***x 发帖数: 3401 | 6 So you want to print out the duplicated records?
Then try:
proc sql;
    select acct_id
    from a
    group by acct_id
    having count(*) > 1;
quit;
【In reply to a******n】
|
a******n 发帖数: 11246 | 7 Thanks. This is what I am using now.
Still waiting for the result to come out :(
It seems that this method does not take advantage of the fact
that the column is already sorted.
【In reply to x***x】
|
x***x 发帖数: 3401 | 8 If you want to take advantage of the fact that the column is sorted, you can use
a 'RETAIN' statement in a DATA step. But I don't think it is any more
efficient.
【In reply to a******n】
|
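A minimal sketch of the RETAIN idea, assuming the sorted table is named a: carry the previous row's acct_id across iterations and output any row that matches it.

data dups;
    set a(keep=acct_id);
    retain prev_id;                                  /* keep prev_id's value across iterations */
    if _n_ > 1 and acct_id = prev_id then output;    /* adjacent equal keys mean a duplicate */
    prev_id = acct_id;
    drop prev_id;
run;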
L****n 发帖数: 3545 | 9 You can try the COMPRESS=YES option.
Also, a sorted column doesn't speed up the process by itself, but an index does.
【In reply to a******n】
|
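A sketch of both suggestions, assuming the table lives in WORK and is named a; here COMPRESS=YES is set as a system option for newly created datasets, and the index is built with PROC DATASETS:

options compress=yes;              /* store datasets created from here on compressed */

proc datasets library=work nolist;
    modify a;
    index create acct_id;          /* simple index on acct_id */
quit;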
a******n 发帖数: 11246 | 10 i see...thanks.
will baozi you soon.
【In reply to x***x】
|
a******n 发帖数: 11246 | 11 good. thanks.
baozi soon.
【In reply to L****n】
|
x***x 发帖数: 3401 | 12 Another way to speed things up is to create a temporary data set that only has
ACCT_ID before going into the SQL step. This helps cut some of the overhead in
the SQL step.
data b;
    set a(keep=acct_id);
run;
proc sql;
.... |
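Presumably the elided SQL is the same GROUP BY query suggested earlier, now run against the slimmed-down table b; a sketch of the completed step:

proc sql;
    select acct_id
    from b
    group by acct_id
    having count(*) > 1;
quit;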
L****n 发帖数: 3545 | 13 no, I want it now.
Or do you consider me your nurse, so we can do business together?
lol
【In reply to a******n】
|
a******n 发帖数: 11246 | 14 I don't even do that myself anymore. Work keeps me too busy.
【In reply to L****n】
|
L****n 发帖数: 3545 | 15 Already working?
Awesome!!!!!!!!!!
Little miracle doctor, that's impressive!
【In reply to a******n】
|
a*****3 发帖数: 601 | 16 Since it's already sorted, how about using
if first.acct_id ne last.acct_id;
If that looks passable, toss me a couple of copper coins.
【In reply to a******n】
|
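A runnable sketch of this first./last. idea, assuming the sorted table is named a (note that inside a DATA step the not-equal operator is ne, since <> there means MAX). This variant keeps one row per duplicated acct_id:

data dup_ids;
    set a(keep=acct_id);
    by acct_id;                               /* no re-sort needed because a is already sorted */
    if first.acct_id and not last.acct_id;    /* true only for the first row of a multi-row group */
run;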
a******n 发帖数: 11246 | 17 OK. Baozi sent.
I will compare the speed of this method and the proc sql one :D
【In reply to a*****3】
|
a*****3 发帖数: 601 | 18 That goes without saying, mine will definitely run blazingly fast. Thanks for the angel baozi!
【In reply to a******n】
|
x**********0 发帖数: 2260 | 19 Nice. I thought of this but forgot the code. Heh.
【In reply to x***x】
|
g****8 发帖数: 2828 | 20 You don't need to compare them yourself.
The log will tell you how many duplicates there are.
【In reply to a******n】
|
g****8 发帖数: 2828 | 21 Create a table in that SQL step; it will be faster than just using SELECT. SAS
needs some time to print out a large dataset.
【In reply to a******n】
|
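A sketch of this CREATE TABLE variant of the earlier query (the output name dup_ids is made up here); the small result table can then be printed or counted separately:

proc sql;
    create table dup_ids as     /* write the result to a dataset instead of printing it */
    select acct_id
    from a
    group by acct_id
    having count(*) > 1;
quit;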
a******n 发帖数: 11246 | 22 Good point!
【In reply to g****8】
|
a******n 发帖数: 11246 | 23 Thanks... learned something new ~
【In reply to g****8】
|
g****8 发帖数: 2828 | 24 Haha, thanks for the double-yolk baozi.
【In reply to a******n】
|
p********r 发帖数: 1465 | 25 proc sort ... nodupkey out= dupout=;
    by acct_id;
run;
out= gets the de-duplicated records; dupout= gets the duplicate records that were removed.
Using compress was already mentioned by an earlier poster.
Even at the million-row level it doesn't take too long; the result comes out quickly. |
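A filled-in sketch of that call (the dataset names a, a_nodup, and a_dups are assumptions): a_nodup keeps one row per acct_id, and a non-empty a_dups means duplicates exist.

proc sort data=a out=a_nodup dupout=a_dups nodupkey;
    by acct_id;   /* a_dups receives every extra observation removed for a duplicate key */
run;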
m*****y 发帖数: 229 | 26 Don't know whether this works in PROC SQL: count(distinct acct_id) |
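COUNT(DISTINCT ...) is supported in PROC SQL, so a one-query yes/no check could look like the sketch below (table name a assumed); duplicates exist exactly when n_rows is greater than n_ids.

proc sql;
    select count(*) as n_rows,                 /* total rows */
           count(distinct acct_id) as n_ids    /* distinct acct_id values */
    from a;
quit;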