j*******s (posts: 81) | 1 A question: I have a large file, a txt table, that needs to be split into several files according to the key in its first column. For example:
g*****g (posts: 34805) | 2 Do something in between: keep a "file pool" of open handles, at most 5000, holding the 5000 most recently used files. Put the handles in a queue; when the queue grows past 5000, pop the head, close it, and append the new handle at the tail. When you write to a file that is already in the queue, remove it and re-append it at the tail. To speed up the lookup, use a hashmap to track which files are currently open.
【Quoting j*******s】
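The "file pool" described above can be sketched in Python (a scripting language is suggested elsewhere in the thread). This is a minimal illustration, not a definitive implementation; the class name and the append-mode reopen are assumptions:

```python
from collections import OrderedDict

class FilePool:
    """Keep at most max_open file handles open; evict the least
    recently used handle when the pool is full (a sketch of the
    'file pool' idea above; names are hypothetical)."""

    def __init__(self, max_open=5000):
        self.max_open = max_open
        self.pool = OrderedDict()  # filename -> handle, in LRU order

    def write(self, filename, line):
        handle = self.pool.get(filename)
        if handle is not None:
            # Already open: move it to the tail (most recently used).
            self.pool.move_to_end(filename)
        else:
            if len(self.pool) >= self.max_open:
                # Pool full: pop the head (least recently used) and close it.
                _, old = self.pool.popitem(last=False)
                old.close()
            # Reopen in append mode so earlier writes are preserved.
            handle = open(filename, "a")
            self.pool[filename] = handle
        handle.write(line)

    def close(self):
        for h in self.pool.values():
            h.close()
        self.pool.clear()
```

The `OrderedDict` plays both roles mentioned in the post: it is the queue (insertion order is eviction order) and the hashmap (O(1) membership check) at once.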
j*******s (posts: 81) | 3 Great approach, many thanks. This stack idea is excellent.
【Quoting g*****g】
j*******s (posts: 81) | 4 Is a queue or a stack better here? The keys in the first column are random, so there should be no difference between FIFO and LIFO, right?
【Quoting g*****g】
g*****g (posts: 34805) | 5 If the keys are truly random, either works. For most practical problems you want first in, first out; combined with moving a reused file back to the tail, that eviction policy is called least recently used (LRU).
【Quoting j*******s】
s***e (posts: 122) | 6 Since you're working on Mac OS, wouldn't it be much simpler to sort the big file first? I'd even go a step further and recommend writing it directly in a scripting language, any of shell/python/perl; that would surely be easier.
【Quoting j*******s】
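The sort-then-split idea can be sketched as follows: once the file is sorted by its first column, each key's rows are contiguous, so only one output file needs to be open at a time. This is a sketch under assumptions; the function name, the tab separator, and the `.txt` suffix are all hypothetical:

```python
import itertools
import os
import subprocess

def split_by_first_column(src, out_dir, sep="\t"):
    """Sort src by its first column with the system `sort`, then split
    it sequentially into one file per key (a sketch of the approach
    suggested above; names and separator are assumptions)."""
    sorted_path = os.path.join(out_dir, "sorted.tmp")
    # External sort keeps memory use low even for very large files.
    subprocess.run(["sort", "-t", sep, "-k1,1", src, "-o", sorted_path],
                   check=True)
    with open(sorted_path) as f:
        # groupby yields each key's rows as one contiguous run.
        for key, rows in itertools.groupby(
                f, key=lambda line: line.split(sep, 1)[0]):
            with open(os.path.join(out_dir, key + ".txt"), "w") as out:
                out.writelines(rows)
    os.remove(sorted_path)
```

Because the input is sorted, this sidesteps the open-file-handle limit entirely: at any moment exactly one output file is open.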
b******y (posts: 9224) | 7 Yeah, I like the sort-first approach better. It seems more methodical.
【Quoting s***e】
F****n (posts: 3271) | 8 If memory can hold it, sorting first is definitely best. Sorting is only O(N log N), far faster than all that repeated I/O.
【Quoting b******y】
A**o (posts: 1550) | 9 Or keep all the file names in memory, and on each pass through the raw file write to only 10k of the output files.
【Quoting g*****g】
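A minimal sketch of this multi-pass idea, assuming a tab-separated file and hypothetical names: collect the distinct keys first, then re-read the raw file once per batch, keeping only that batch's output files open at a time.

```python
import os

def split_in_passes(src, out_dir, batch=10000, sep="\t"):
    """Multi-pass split: re-read the raw file once per batch of keys,
    so at most `batch` output files are open at any time (a sketch of
    the idea above; names and separator are assumptions)."""
    # Pass 0: collect the distinct keys from the first column.
    with open(src) as f:
        keys = sorted({line.split(sep, 1)[0] for line in f})
    # One full pass over the raw file per batch of keys.
    for i in range(0, len(keys), batch):
        active = {k: open(os.path.join(out_dir, k + ".txt"), "w")
                  for k in keys[i:i + batch]}
        with open(src) as f:
            for line in f:
                out = active.get(line.split(sep, 1)[0])
                if out is not None:  # key belongs to this batch
                    out.write(line)
        for h in active.values():
            h.close()
```

The trade-off versus the file-pool approach: no eviction bookkeeping, but the raw file is read ceil(num_keys / batch) times, so it only pays off when the key count is not too far above the batch size.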