j******o 发帖数: 127 | 1 这个方法是不是特别笨? 欢迎大家测试一下。假设事先知道给定字符的长度n。
%let n=20;
data have;
input str : $&n..;
datalines;
abcsdabcedbcsdaedfjs
;
run;
data have1;
set have;
do i=1 to &n;
do j=1 to &n-i;
sub=substr(str, i, j);
len=lengthn(sub);
output;
end;
output;
end;
run;
proc sort data=have1 nodupkey;
by i sub;
run;
proc sort data=have1;
by len sub;
run;
data have1;
set have1;
sub1=lag(sub);
if sub not eq sub1 then delete;
run;
data have1 (keep=str len);
set have1 end=las... 阅读全帖 |
|
s******y 发帖数: 352 | 2 DATA HAVE;
INFILE CARDS TRUNCOVER;
INPUT A B C D;
ROWID=_N_;
CARDS;
1
7
3 3
2 5
. 5
. 5 3
. . 4 5
. . 5 8
. . 2 3
. . . 1
;
RUN;
PROC TRANSPOSE DATA=HAVE OUT=HAVE1(WHERE=(NOT MISSING(COL1)));
BY ROWID;
VAR A B C D;
RUN;
PROC SORT DATA=HAVE1;
BY _NAME_ ROWID;
RUN;
DATA HAVE2;
SET HAVE1;
BY _NAME_;
ID+1-(FIRST._NAME_)*ID;
RUN;
PROC SORT DATA=HAVE2;
BY ID _NAME_ ;
RUN;
PROC TRANSPOSE DATA=HAVE2 OUT=WANT(DROP=_NAME_ ID);
BY ID ;
VAR COL1;
ID _NAME_;
RUN; |
|
F****3 发帖数: 1504 | 3 我用SQL是这样搞的:
proc sql;
create table WANT as
select distinct a.*, b.VAR1 , b.VAR2
from HAVE1 as a left join HAVE2 (DROP=link_count) as b
on a.PATIENT_ID = b.PATIENT_ID and a.YEAR = b.YEAR;
quit;
但是应为硬盘容量有限,所以只能用DATA STEP了。请问出来的WANTtable是不是一模一
样的啊?
proc sort data=HAVE1;
by PATIENT_ID YEAR;
run;
proc sort data=HAVE2(keep=PATIENT_ID YEAR VAR1 VAR2);
by PATIENT_ID YEAR;
run;
data WANT;
merge HAVE1 (in=a) HAVE2(in=b);
by PATIENT_ID YEAR;
if a;
run; |
|
j******o 发帖数: 127 | 4 Please try and to see if the three groups of id are what you want.
proc sort data=have out=have1 nodupkey; by id typle; run;
data _equ _sup _both;
set have1;
by id typle;
if first.id=last.id and upcase(trim(typle))='EQUIPMENT' then output
_equ(keep=id);
else if first.id=last.id and upcase(trim(typle))='SUPPLY' then output
_sup(keep=id);
else output _both(keep=id);
run; |
|
F****3 发帖数: 1504 | 5 非常简单的程序,run了一个下午都没run出来。
Both have1 and have2 have few variables, but many observations.
proc sql;
create table want as
select *
from have1 as a left have2 as b
on b.serie_beg <= a.serie <= b.serie_end;
quit;
把上面left join换成inner join也一样。。。
serie_beg, serie, and serie_end都是上billion的数值(numeric)。请怎么会这么
久啊? |
|
F****3 发帖数: 1504 | 6 真不好意思啊。你说的有重名,请问可以举一个例子吗?
假设Have1 有下面几个variable (400 GB的那个)
hospital_visit_id, date, time, patient_id, document_id
Have2有下面几个variable (800 MB的那个)
patient_id, document_id, diag_date, diag_type
应该就没有问题吧?
如果另外我把前面条件改动,让HAVE1和HAVE2都有一个共同的variable叫comomon_
variable,这种情况就会出现你提到的改动问题吧?谢谢! |
|
|
F****3 发帖数: 1504 | 8 谢谢大家!
文件已经sort了但是不是按照series来,文件id其实就是series只是说数值必须在某一
段之内。。。
看来只能用series。是不是先要sort by series啊?
我把have1 subset成1000个obs,很快就出来了。 |
|
s******8 发帖数: 102 | 9 你这种情况是该用hash,假设have2是800M.
data merged;
length var1 var2 $8.;*这里要根据实际情况改,但必须定义var1 var2;
if _n_=1 then do;
dcl hash h(dataset:"have2");
h.definekey("PATIENT_ID","YEAR");
h.definedata("VAR1","VAR2");
h.definedone();
call missing(var1,var2);
end;
set have1;
if h.find() ne 0 then call missing(var1,var2);
run; |
|