o****o 发帖数: 8077 | 1 数据如下:
data a1;
input id1 id2 id3 id4;
cards;
1 11 21 31
1 12 22 31
2 11 22 32
2 13 21 32
3 14 23 32
4 11 22 33
;
run;
每个record有四个id,但凡只要任意两个相同,就视为同一个id。例如第一,二个
record,有相同的id1, 一,六又有相同的id2, 所以一,二,六为同一个id。以此推
下去,该数据集实质为一个id。 请教大侠们,怎么用这四个id创建一个能标志他们关
系的新的id呢?
注:慎用array,数据集有几十万条record。 |
o****o 发帖数: 8077 | 2 my general idea is :
this can be solved using hash list while your obs are index by the four IDs
SEPARATELY.
by OP's description, as searching goes on, any subsequent observation that
has 2 IDs in the current combination of id pool will be identified as
belonging to the same group, then any new ID will be added to the pool. The
main hash table contains the record number which is the unqiue ID for each
obs, and the other elements in the hash is an 5 auxilary hash lists, 4 for
id pool of each ID#
【在 o****o 的大作中提到】 : 数据如下: : data a1; : input id1 id2 id3 id4; : cards; : 1 11 21 31 : 1 12 22 31 : 2 11 22 32 : 2 13 21 32 : 3 14 23 32 : 4 11 22 33
|
l*********s 发帖数: 5409 | 3 if A is similar to B , B is similar to C, but A is not similar to C, do you
put A,B,C into one class? |
s*r 发帖数: 2757 | 4 data a1;
set a1;
uid=_n_;
run;
%macro paircomp(id1, id2);
proc sql;
create table t&id1.&id2. as
select min(b.uid,c.uid) as uid_1, max(b.uid,c.uid) as uid_2
from a1 as b, a1 as c
where b.id&id1=c.id&id2 and b.uid ^= c.uid;
quit;
proc sort data=t&id1.&id2. nodup;
by uid_1;
run;
proc append data=t&id1.&id2. base=allsame force; run;
%mend;
option mlogic mprint spool;
%paircomp(1,1);
%paircomp(1,2);
%paircomp(1,3);
%paircomp(1,4);
%paircomp(2,2);
%paircomp(2,3);
%paircomp(2,4);
%paircomp(3,3);
%pairc
【在 o****o 的大作中提到】 : 数据如下: : data a1; : input id1 id2 id3 id4; : cards; : 1 11 21 31 : 1 12 22 31 : 2 11 22 32 : 2 13 21 32 : 3 14 23 32 : 4 11 22 33
|
o****o 发帖数: 8077 | 5 man, the data has 700K records.....you won't want to do it like this
【在 s*r 的大作中提到】 : data a1; : set a1; : uid=_n_; : run; : %macro paircomp(id1, id2); : proc sql; : create table t&id1.&id2. as : select min(b.uid,c.uid) as uid_1, max(b.uid,c.uid) as uid_2 : from a1 as b, a1 as c : where b.id&id1=c.id&id2 and b.uid ^= c.uid;
|
s*r 发帖数: 2757 | 6 700k就不能运行sql inner join拉
【在 o****o 的大作中提到】 : man, the data has 700K records.....you won't want to do it like this
|
o****o 发帖数: 8077 | 7 when new records added, their IDx will be effective for subsequent compare, so that you can't do the pre-set pairwise comparison, the IDx set increases.
【在 s*r 的大作中提到】 : 700k就不能运行sql inner join拉
|
s*r 发帖数: 2757 | 8 4 id columns 的话,总共只要运行10便sql就可以了
【在 o****o 的大作中提到】 : when new records added, their IDx will be effective for subsequent compare, so that you can't do the pre-set pairwise comparison, the IDx set increases.
|
A*******s 发帖数: 3942 | 9 不知道我理解对了没,原帖好像有点歧义。
“每个record有四个id,但凡只要任意两个相同,就视为同一个id。”
我觉得按照贴出来的sample data,应该是
“每个record有四个id,任意两个record, 只要有一个相同的id,就视为同一个id。”
如果是这样的话,你的code里面的where b.id&id1=c.id&id2 就不是这个意思了。
【在 s*r 的大作中提到】 : data a1; : set a1; : uid=_n_; : run; : %macro paircomp(id1, id2); : proc sql; : create table t&id1.&id2. as : select min(b.uid,c.uid) as uid_1, max(b.uid,c.uid) as uid_2 : from a1 as b, a1 as c : where b.id&id1=c.id&id2 and b.uid ^= c.uid;
|
s*r 发帖数: 2757 | 10 那再改改,反正就是这个意思
【在 A*******s 的大作中提到】 : 不知道我理解对了没,原帖好像有点歧义。 : “每个record有四个id,但凡只要任意两个相同,就视为同一个id。” : 我觉得按照贴出来的sample data,应该是 : “每个record有四个id,任意两个record, 只要有一个相同的id,就视为同一个id。” : 如果是这样的话,你的code里面的where b.id&id1=c.id&id2 就不是这个意思了。
|
|
|
z**k 发帖数: 378 | 11 如果不考虑数据库的操作,用C,只需要做4次sort就可以了,忽略I/O复杂度是O(n*log
(n))。
如果是大数据,只考虑I/O的复杂度,SQL可以做Linear Sorting,所以复杂度近似于O(
n)。 |
A*******s 发帖数: 3942 | 12 i happen to remember you had once introduced me deep-first-search algorithm
a few days ago on this board. Can we change this to DFS problem?
Suppose we use sql inner join, as Sir's way
data a1;
set a1;
uid=_n_;
run;
proc sql;
create table link as
select min(b.uid,c.uid) as node1, max(b.uid,c.uid) as node2
from a1 as b, a1 as c
where (b.id1=c.id1 or b.id2=c.id2 or b.id3=c.id3 or b.id4=c.id4)
and b.uid ^= c.uid;
quit;
proc sort data=link nodup;
by node1;
run;
After that we have graph-structur
【在 o****o 的大作中提到】 : my general idea is : : this can be solved using hash list while your obs are index by the four IDs : SEPARATELY. : by OP's description, as searching goes on, any subsequent observation that : has 2 IDs in the current combination of id pool will be identified as : belonging to the same group, then any new ID will be added to the pool. The : main hash table contains the record number which is the unqiue ID for each : obs, and the other elements in the hash is an 5 auxilary hash lists, 4 for : id pool of each ID#
|
s******r 发帖数: 1524 | 13 it is not enough. You need to do recursion to handle multiple level link.
algorithm
【在 A*******s 的大作中提到】 : i happen to remember you had once introduced me deep-first-search algorithm : a few days ago on this board. Can we change this to DFS problem? : Suppose we use sql inner join, as Sir's way : data a1; : set a1; : uid=_n_; : run; : proc sql; : create table link as : select min(b.uid,c.uid) as node1, max(b.uid,c.uid) as node2
|
A*******s 发帖数: 3942 | 14 yes, that's why i said we have graph-structure data READY for DFS.
【在 s******r 的大作中提到】 : it is not enough. You need to do recursion to handle multiple level link. : : algorithm
|
s******r 发帖数: 1524 | 15 of course we can do it by sql & macro. However efficiency could be an issue.
So we need to do some optimization.
Also be careful of loop, like a-b-c-d-a.
【在 A*******s 的大作中提到】 : yes, that's why i said we have graph-structure data READY for DFS.
|
A*******s 发帖数: 3942 | 16 i didnt go through oloolo's paper about DFS since i dont know much about
hash tables, but i think standard DFS routine is able to deal with graph-
structure data, including the loop a-b-c-d-a as you said.
issue.
【在 s******r 的大作中提到】 : of course we can do it by sql & macro. However efficiency could be an issue. : So we need to do some optimization. : Also be careful of loop, like a-b-c-d-a.
|
s******r 发帖数: 1524 | 17 I do believe recursion is still necessary.
In oloolo's, it always begin with lowest level, go through one direction,
from children to parents. So one complete hash table could solve the problem
.
It is different in this case. For instance a-b-c-d-e, like you met id d
first, so cde is one group now. And then you find a, so ab is marked as one
group. One more step is necessary.
Too late, need to sleep now. Have phone interview tomorrow. :)
【在 A*******s 的大作中提到】 : i didnt go through oloolo's paper about DFS since i dont know much about : hash tables, but i think standard DFS routine is able to deal with graph- : structure data, including the loop a-b-c-d-a as you said. : : issue.
|
A*******s 发帖数: 3942 | 18 yep, that's why graphs are more challenging than trees.
good luck for your interview!
problem
one
【在 s******r 的大作中提到】 : I do believe recursion is still necessary. : In oloolo's, it always begin with lowest level, go through one direction, : from children to parents. So one complete hash table could solve the problem : . : It is different in this case. For instance a-b-c-d-e, like you met id d : first, so cde is one group now. And then you find a, so ab is marked as one : group. One more step is necessary. : Too late, need to sleep now. Have phone interview tomorrow. :)
|
S******y 发帖数: 1123 | 19 How about putting your data into a 'term-document' matrix --
ids are like terms
obs numbers are like documents
This can be easiy done in Python.
Then you design an algorithm traversing this matrix. Documents connecting
to each other by non-zero neighbors (row-wise and column-wise directions
only) are in one group.
Just my 2 cents.
==================================================== |
o****o 发帖数: 8077 | 20 听起来不错喃
这个问题仿佛就是寻找最大封闭集合,集合内的的任意一个元素都具备至少两个ID号属
于这个集合内所有元素的对应ID号的并集,而没有超过一个ID号属于集合外元素的ID号
并集。
【在 S******y 的大作中提到】 : How about putting your data into a 'term-document' matrix -- : ids are like terms : obs numbers are like documents : This can be easiy done in Python. : Then you design an algorithm traversing this matrix. Documents connecting : to each other by non-zero neighbors (row-wise and column-wise directions : only) are in one group. : Just my 2 cents. : ====================================================
|
|
|
s*****n 发帖数: 2174 | 21 这个本身就是个坐标联通的问题.
比如可以看作四维空间点阵中的某些点.
两点只要是share任何一维坐标, 就认为是联通的.
写个程序实现并不难, 用R就可以写.
不过恐怕要用递归来处理循环的情况, performance是主要的问题.
【在 o****o 的大作中提到】 : 听起来不错喃 : 这个问题仿佛就是寻找最大封闭集合,集合内的的任意一个元素都具备至少两个ID号属 : 于这个集合内所有元素的对应ID号的并集,而没有超过一个ID号属于集合外元素的ID号 : 并集。
|
S******y 发帖数: 1123 | 22 How about using Generators and Backtracking in Python to traverse 'term-
document' matrix? |
m**o 发帖数: 846 | 23 先按record生成key:
data a1;
set a1;
key=_n_;
run;
再把a1分成4个dataset
data a1_1 a1_2 a1_3 a1_4;
set a1;
output a1_1(keep=key id1);
output a1_2(keep=key id2);
output a1_3(keep=key id3);
output a1_4(keep=key id4);
run;
对a1_1生成最终的ID(unique across id1),然后应用到其他dataset上
proc sort data=a1_1; by id1 key; run;
data a1_1;
set a1_1;
by id1 key;
retain ID;
if first.id1=1 then ID=id1;
run;
把这个新ID merge到id2并使其unique across id2
proc sql;
create table a2 as
select a.*,b.ID
from a1_2 as a inner join a1_1 as b on a.key=b
【在 o****o 的大作中提到】 : 数据如下: : data a1; : input id1 id2 id3 id4; : cards; : 1 11 21 31 : 1 12 22 31 : 2 11 22 32 : 2 13 21 32 : 3 14 23 32 : 4 11 22 33
|
u******e 发帖数: 60 | 24 First check the frequency of id1, id2, id3, id4. The number of levels of the
new id has to be smaller than the one with smallest number of levels. In
the example data, it is id4. Use the following macro for all the other three
variables, from the one with less number of levels to the one with more
number of levels. In the exaple, the order of id1 and id2 does not matter
since they both have 4 levels.
The only issue is sorting may take time.
data a2;
set a1;
idnew =id4;
run;
%macro setID |
b******e 发帖数: 539 | 25 想法和我的差不多,不过不太对。
另外不明白为什么要从id4开始?id3和id4的frequencies不都是3吗?id1和id2的frequencies都是4。可是如果从id3开始,然后
%setID(id4);
%setID(id1);
%setID(id2);
结果就不对。
the
three
【在 u******e 的大作中提到】 : First check the frequency of id1, id2, id3, id4. The number of levels of the : new id has to be smaller than the one with smallest number of levels. In : the example data, it is id4. Use the following macro for all the other three : variables, from the one with less number of levels to the one with more : number of levels. In the exaple, the order of id1 and id2 does not matter : since they both have 4 levels. : The only issue is sorting may take time. : data a2; : set a1; : idnew =id4;
|
f*****a 发帖数: 693 | |
m**o 发帖数: 846 | 27 缺了一次sorting,所以丟了一层关系,需要重新用原来的ID来sort,才能把原来的关
系带上。
%macro setID(idvar);
proc sort data=a2; by &idvar idnew;
稍微修改一下,保留原来的ID和新的ID;
data a2 /*(drop=idT)*/;
set a2;
by &idvar idnew;
retain idT;
if first.&idvar then idT =idnew;
/* else if idT^=idnew then idnew =idT; */
run;
》》》》》》》》》》》》》》然后加一段应该就可以了:
proc sort data=a2; by idnew idT;
data a2(drop=idT idnew rename=(idT2=idnew));
set a2;
by idnew idT;
retain idT2;
if first.idnew then idT2=idT;
run;
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
【在 b******e 的大作中提到】 : 想法和我的差不多,不过不太对。 : 另外不明白为什么要从id4开始?id3和id4的frequencies不都是3吗?id1和id2的frequencies都是4。可是如果从id3开始,然后 : %setID(id4); : %setID(id1); : %setID(id2); : 结果就不对。 : : the : three
|
b******e 发帖数: 539 | 28 还是不对,试试这组简单的数据:
data a3;
input id1 id2;
datalines;
3 2
5 2
4 4
5 4
5 5
;
run;
data a2;
set a3;
idnew=id1;
run;
%setID(id2);
【在 m**o 的大作中提到】 : 缺了一次sorting,所以丟了一层关系,需要重新用原来的ID来sort,才能把原来的关 : 系带上。 : %macro setID(idvar); : proc sort data=a2; by &idvar idnew; : 稍微修改一下,保留原来的ID和新的ID; : data a2 /*(drop=idT)*/; : set a2; : by &idvar idnew; : retain idT; : if first.&idvar then idT =idnew;
|
u******e 发帖数: 60 | 29 Please try this one.
data a2;
set a1 ;
idnew =id4;
run;
%macro setID(idvar);
%let complete=no;
%let idnewFrom=;
%let idnewTo=;
%do %while (&complete=no);
proc sort data=a2; by &idvar idnew;
data a2(drop=idT change);
set a2 end=last;
by &idvar idnew;
retain idT change;
Preidnew =idnew;
if _N_=1 then change=0;
if first.&idvar and change=0 then idT =idnew;
else do;
if change=0 and idT ^= idnew then do;
call symput("idnewFrom", left(idnew)); |
m**o 发帖数: 846 | 30 你这个loop会陷入死循环的,不能始终对&idvar排序,需要对idnew and idT交替排序,
i.e.
sort by &idvar idnew -> 生成 idT
sort by idnew idT -> 生成的新ID replace "idnew"
sort by idT idnew -> 生成的新ID replace "idT"
sort by idnew idT -> idnew
sort by idT idnew -> idT
.......
每排一次,可以展开一层后代,i.e. 如果最多有n层后代,那么至多需要排序n次。
这个做法的问题是,因为数据量很大(10GB),不能无限制的排序下去,所以最好加一
个中断条件就是,至多排序m次,i.e. 至多展开到第m代后代,以后的后代就不算成一
家了。无限展开下去也没有多少实际意义。
【在 u******e 的大作中提到】 : Please try this one. : data a2; : set a1 ; : idnew =id4; : run; : %macro setID(idvar); : %let complete=no; : %let idnewFrom=; : %let idnewTo=; : %do %while (&complete=no);
|
|
|
u******e 发帖数: 60 | 31 The idea is as below. The loop is controlled by macro variable COMPLETE. By
the end of loop 4 in the data as below, COMPLETE was set to yes, so it won't
end in dead circle.
I agree with your point that it is not efficient for large data.
loop 1 loop 2 loop 3 loop 4
id3 idnew id3 idnew id3 idnew id3 idnew
1 1 1 1 1 1 1 1
1 2 ->1 1 1 1 1 1 1
1 3 1 3->1 1 1 1 1
2 1 2 1 2 1 2 1
2 |
m**o 发帖数: 846 | 32 You are right, the loop won't be dead, but it would not be able to catch all
descendants either.
To test this, you can just test your code on the simple example "babyface"
provided above.
BTW: in your example, in loop2, id3=3, idnew=3 can not turn to 1
By
't
【在 u******e 的大作中提到】 : The idea is as below. The loop is controlled by macro variable COMPLETE. By : the end of loop 4 in the data as below, COMPLETE was set to yes, so it won't : end in dead circle. : I agree with your point that it is not efficient for large data. : loop 1 loop 2 loop 3 loop 4 : id3 idnew id3 idnew id3 idnew id3 idnew : 1 1 1 1 1 1 1 1 : 1 2 ->1 1 1 1 1 1 1 : 1 3 1 3->1 1 1 1 1 : 2 1 2 1 2 1 2 1
|
u******e 发帖数: 60 | 33 I tried the example provided by "babyface". The idnew were set to 3 for all
five records in loop 3.
The change of idnew is controlled by the two macro variables "idnewFrom" and
"idnewTo". If you add PROC PRINT in the macro, you will see how each loop
works.
The statement 'Preidnew =idnew;" can be deleted. It was used when I tested
the macro. |
m**o 发帖数: 846 | 34 我没有SAS,你能不能贴一下你用 proc print 看到的每一个loop的结果(对于
babyface的例子)? 我看你的code,觉得不能全查出来的样子。
all
and
【在 u******e 的大作中提到】 : I tried the example provided by "babyface". The idnew were set to 3 for all : five records in loop 3. : The change of idnew is controlled by the two macro variables "idnewFrom" and : "idnewTo". If you add PROC PRINT in the macro, you will see how each loop : works. : The statement 'Preidnew =idnew;" can be deleted. It was used when I tested : the macro.
|
u******e 发帖数: 60 | 35 - loop 1 -
id1 id2 idnew
3 2 3
5 2 3
4 4 4
5 4 3
5 5 3
- loop 2 -
id1 id2 idnew
3 2 3
5 2 3
5 4 3
4 4 3
5 5 3
- loop 3 - COMPLETE was change from “no” to “yes” in this loop.
|
m**o 发帖数: 846 | 36 谢谢,你试试我下面这个例子呢
id1 id2
5 1
6 2
7 2
5 3
6 3
【在 u******e 的大作中提到】 : - loop 1 - : : id1 id2 idnew : 3 2 3 : 5 2 3 : 4 4 4 : 5 4 3 : 5 5 3 : - loop 2 - :
|
m**o 发帖数: 846 | 37 另外,你这个例子里面你用了%loop(id2),你如果调用 %loop(id1),也是不行的。
【在 u******e 的大作中提到】 : - loop 1 - : : id1 id2 idnew : 3 2 3 : 5 2 3 : 4 4 4 : 5 4 3 : 5 5 3 : - loop 2 - :
|
u******e 发帖数: 60 | 38 You are right. My code does not work for the example data you provided. The
way of linking two IDs by sorting seems not practical in programming. So I
tried another method.
The rule is
(1) Create a 2-dimention n X m grid with n values from id A, eg, (a1, ...
, an), m values from id B, eg, (b1, ..., bm) which should have the same
IDnewT value, n>=2, m>=2
(2) use a1 as the value of IDnewT
(3)for record falling on the knots of the grid, that is, a in (a1, ...,
an) and b in (b1, ..., bm),se |
u******e 发帖数: 60 | 39 I found a bug and fixed it.
%macro setID(idA, idB);
%let complete=no;
%let idnewVal=;
%let Ahook=;
%let Bhook=;
data a1new;
idnew=.;
run;
** copy idA and idB to A and B;
data a1;
set a1;
A = &idA;
B = &idB;
run;
%let i=0;
%do %while (&complete=no);
%let i=%eval(&i +1);
%let A1Val=;
%let A2Val=;
%let A3Val=;
%let B1Val=;
%let B2Val=;
%let B3Val=;
data a1;
set a1 end=last;
retain A1-A3 Anum B1-B3 Bnum;
if _N_=1 then do;
** if no hook links the current and previous |