[The following text is reposted from the Programming board]
From: iiiir (哎呀我最牛), Board: Programming
Subject: python code performance --- normal or too slow?
Posted at: BBS 未名空间站 (Tue Jan 7 11:21:52 2014, US Eastern)
The input file is 2.5 GB with 18,217,166 lines.
My Python script takes about 20-30 minutes to finish.
Does that seem slow?
Thanks!!
input file data structure (showing first two lines, wrapped):
chromo pos ref alt dc1 dc2 dc3 dtm bas din
crw itb ptw spw isw irw inw ru1 ru2
ru3 im1 im2 im3 im4 xj1 xj2 qh1 qh2
ti1 ti2 glw mxa rwa ysa ysb ysc cac jaa
jac
chr01 242806 G T 0/0 0/0 . 0/0 0/0 0/0
0/0 0/0 0/0 0/0 0/0 0/0 0/0 . 0/0
0/0 0/0 . 0/0 0/0 0/0 0/0 0/0 0/0 0/0
0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
0/0
My Python code does two things:
1. parse the header and produce the first file (out.tfam)
2. parse the body and translate the 0s and 1s to bases (A, T, G, C, etc.) to produce the second file (out.tped)
import sys


def geno_to_base(ref, alt, genotype):
    assert len(genotype) == 3, "genotype not in 0/1 format"
    # Compare against '0' explicitly: any non-empty string (including '0')
    # is truthy, so `alt if genotype[0] else ref` would always pick alt.
    allele1 = alt if genotype[0] != '0' else ref
    allele2 = alt if genotype[-1] != '0' else ref
    return '{} {} '.format(allele1, allele2)


def translate_geno(ref, alt, genotype):
    '''genotype needs to be either . or 0/0 format'''
    return '0 0 ' if genotype == '.' else geno_to_base(ref, alt, genotype)


def line_parse(line):
    chrs, pos, ref, alt, *geno = line.split()
    all_genotype = [translate_geno(ref, alt, g) for g in geno]
    return chrs, pos, ''.join(all_genotype)


if __name__ == "__main__":
    fn = sys.argv[1]  # required: path to the input file
    with open(fn) as fin, open('out.tfam', 'w') as tfam, open('out.tped', 'w') as tped:
        # write tfam: one row per sample column in the header
        header = next(fin)
        for i, h in enumerate(header.split()[4:]):
            tfam.write('{}\t{}\t0\t0\t0\t0\n'.format(i, h))
        # write tped: one row per variant line
        morgan = 0
        for i, l in enumerate(fin):
            rs_id = 'snp{}'.format(i + 1)
            chrs, pos, all_geno = line_parse(l)
            chrs = int(chrs[3:])  # only need the number
            tped.write('{} {} {} {} {}\n'.format(chrs, rs_id, morgan, pos, all_geno))