Page 7 - Roundup of discussions about batch - 话题女王

T***B
Posts: 137

smectite, you raised a very important point.
If we start with the requirements, what I need is a group of predictors that
can independently make predictions within a JVM. Some characteristics of
the system:
- Initializing a predictor is time-consuming.
- A predictor, once initialized, holds a non-trivial amount of memory.
- A predict call is CPU intensive, and the predict() method is not thread-safe.
- A client request triggers a batch of predict calls. Batch size can vary.
From a design perspective, what I'...
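
A minimal sketch of one way to satisfy the requirements listed above: build a fixed number of predictors once, lend them out through a blocking queue so that no predictor is ever used by two threads at the same time, and fan each client batch out over a thread pool. It is written in Python for brevity rather than for the JVM. `Predictor`, `PredictorPool`, and `predict_batch` are hypothetical names, and the predictor body is a placeholder, not the poster's code.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class Predictor:
    """Hypothetical stand-in: slow to build, heavy in memory, not thread-safe."""

    def __init__(self, predictor_id):
        self.predictor_id = predictor_id  # real code would load a model here

    def predict(self, x):
        return x * 2  # placeholder for the CPU-intensive call


class PredictorPool:
    """Initializes a fixed number of predictors once and lends them out one at a time."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for i in range(size):
            self._idle.put(Predictor(i))
        self._executor = ThreadPoolExecutor(max_workers=size)

    def _predict_one(self, x):
        predictor = self._idle.get()  # blocks until a predictor is free
        try:
            return predictor.predict(x)
        finally:
            self._idle.put(predictor)  # always hand it back to the pool

    def predict_batch(self, inputs):
        """Runs one client request (a batch of any size); results keep input order."""
        return list(self._executor.map(self._predict_one, inputs))

    def close(self):
        self._executor.shutdown()
```

On the JVM the same shape would typically be a `BlockingQueue` of predictors plus an `ExecutorService`. The pool size caps both memory use and the number of concurrent predict calls.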

D****6
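Posts: 278

From topic: Java board - Question about frameworks

Spring has many frameworks: MVC, Batch, Integration, data access... Are they
all interconnected, or can they be learned separately? I want to learn Spring
Batch. Is it enough to study just that, or do I need to look at the other
frameworks at the same time? Thanks!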

p*****3
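Posts: 488

From topic: Java board - Spent some time this week studying Nathan's blog post on CAP

Immutable, pre-compute for query; for some reason I also feel an urge to add
idempotent to the list.
http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
The approach seems good, and it looks like a great fit in many places.
Immutability seems to be the overarching principle, a bit like with concurrent
hashmap. Previously you had a vector clock and then merged; now you stamp a
timestamp on each entry, which does make it immutable, but you still have to
merge at the business logic level. Not everything is solved by adding a time
stamp: a conflict at the business level is still a conflict, and what needs
merging still needs merging. After a partition, different sites only get
different subsets of the immutable entries, and once those are merged the
result is no longer consistent.
Availability is only discussed for writes; reads are not mentioned.
The batch level is built on Hadoop, so whatever CAP limitations Hadoop has,
this system has too, and that layer is not mentioned at all.
A very good idea is real time ...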

l******0
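Posts: 244

From topic: Java board - Java back-end development

For a small website, I think you can start with core Java/servlet/HTML/JSP/CSS.
Later, if you feel the need because the complexity has become unmanageable,
look into a framework. These things all have pros and cons; use them only when
you really need them.
Here is a complaint quoted from one post; of course there are also plenty of
supporters.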
“
Used to be we wrote simple, efficient, fast applications and web services
using just core Java, Servlets and JSP, html and xml, JDBC API. It was good
enough; JUnit was a good tool to test. We rested easy that our code worked.
Hibernate came along to simplify SQL and enable true mapping of Database
tables with Java objects, allowing hierarchical relationships to b...

c*******h
Posts: 527

From topic: Linux board - First Qt Solutions available under LGPL

http://groups.google.com/group/qt-china/t/9ead270615c96779
http://www.qtsoftware.com/about/news/first-qt-solutions-available-under-lgpl
First Qt Solutions available under LGPL
First batch of Qt Solutions add-on components available for download under
LGPL.
Oslo, 12 March 2009 — Qt Software today released a first batch of Qt
Solutions — a catalogue of add-on components and tools for Qt — under the
Lesser GPL version 2.1 license. The complete catalogue of Qt Solutions will
be released under the

d**s
Posts: 920

From topic: Linux board - How to manually start VMware Server on Windows XP

(Sorry to ask a Windows question here, but I guess there are more VMware
experts here than anywhere else.)
I do not want VMware to start automatically when Windows boots.
I want to start and stop VMware Server manually (in a batch file) when
needed.
From the Internet, I got the following for a batch file:
net start "VMWare Authorization Service"
net start "VMWare DHCP Service"
net start "VMWare NAT Service"
net start "VMWare Registration Service"
net start "VMWare Virtual Mount Manager Extended"
"C:\Progr

x*z
Posts: 1010

From topic: Linux board - Is there any DVD->ISO software for Linux?

Heh, here's an idea for you
root@mee:~# apt-get install k9copy
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
docbook-xsl docbook-xsl-doc-html dvdauthor kdebase-runtime
kdebase-runtime-data kdelibs-bin kdelibs5-data kdelibs5-plugins kdoctools
kubuntu-debug-installer libattica0 libboost-filesystem1.42.0
libboost-regex1.42.0 libboost-system1.42.0 libclucene0ldbl libdbusmenu-qt2
libiodbc2 libkatepartinterfaces4 ...

B****y
Posts: 791

From topic: Programming board - How can I automatically add a file to a project in Visual Studio?

First write a batch file that adds the file to the project.
In the project properties there is a setting for Build Events. You can add a
command (your batch file) to run after compile or link.

R*******c
Posts: 249

From topic: Programming board - call matlab within R (using system())

I want to call MATLAB from within R,
using the following command:
system(' "C:/Program Files/MATLAB/R2009a/bin/matlab.exe" CMD BATCH "E:/
Dropbox/time warping/Codes/Matlab/test.m" "E:/Dropbox/time warping/Codes/
Matlab/test.txt" ')
This opens MATLAB automatically, but it does not run the test.m file I want.
What am I doing wrong?
I tried calling R from within MATLAB, and everything works:
system(' "C:/Program Files/R/R-2.9.2/bin/R.exe" CMD BATCH "C:\Users\
Documents\test.R" "C:\Users\Documents\test.txt" ')
Thanks a lot~

p*****3
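Posts: 488

From topic: Programming board - How many machines does 12306 need

Why use timestamps and a Cassandra cluster? Wouldn't going straight to SQS do
the job? Use multiple SQS queues and store and fetch messages in batches, which
roughly preserves order. The same message may be processed by more than one
machine, so make all message processing idempotent, make the servers that poll
and process messages stateless, and batch the DB updates. In the end the
bottleneck is still the DB.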
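
A rough sketch of the idempotent, batched consumer described above, with plain Python containers standing in for the queue and the database. `process_batch` and `flush` are hypothetical names; in a real stateless deployment the set of processed IDs would have to live in a shared store, and no actual SQS API is shown here.

```python
def process_batch(messages, processed_ids, pending_updates):
    """Handles one polled batch; redelivered messages are skipped.

    messages: list of (message_id, payload) tuples, possibly with duplicates.
    processed_ids: set of IDs already handled (stand-in for a shared store).
    pending_updates: list collecting the rows for one batched DB write.
    """
    for message_id, payload in messages:
        if message_id in processed_ids:
            continue  # duplicate delivery: handling it again must change nothing
        processed_ids.add(message_id)
        pending_updates.append(payload)
    return pending_updates


def flush(pending_updates, db):
    """Applies all collected updates in one batched write, then clears the buffer."""
    db.extend(pending_updates)  # stand-in for a single batched DB update
    pending_updates.clear()
```

Because duplicates are filtered by message ID, it does not matter if two pollers receive the same message, which is what lets the servers stay stateless.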

p*****3
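Posts: 488

From topic: Programming board - Queuing cannot solve the problem

I think it may still be worth considering if processing is fast enough. Here is
an example:
1. Booking comes first. Under heavy concurrency, the web service submits a
large number of booking messages (requests) into many "booking message
queues". All of this is stateless and can be scaled dynamically when capacity
runs short.
2. A service backed by a read-only cache provides only valid/invalid
information for each train, so updates are rare and strong consistency is not
needed. This service decides whether a seat on that train can be booked for a
user request; if so, it produces another booking message on the booking
message queue. The decision service is also stateless: it reads user requests
from the booking message queue, queries the cache, and then decides which train
to pick. Everything here scales easily too.
3. Because the previous service already knows the train number, this step
could, if it comes to that, even use one ticket-issuing message queue per
train. However many messages there are, how many can end up on a single train?
One machine fetches messages in batches, then processes them and updates the
database in batches. Since a train ticket corresponds to exactly one train,
this amounts to updating a set of [timespan, train_id,...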

q*c
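Posts: 9453

From topic: Programming board - As long as there is a waiting list, how can scalpers make money?

Sure. 100 tickets: are you going to pay the refund fees?
Besides, when Hotmail introduced image recognition back in the day, sign-ups
immediately dropped by 80%. American spammers could not beat it, so only the
Chinese can?
There are plenty of ways; you don't even need image recognition. Just:
1. Submit all the trains you might want in a single request; 10 or 100, it
doesn't matter.
2. The waiting list is matched sequentially in the back end. Control the batch
size so that each batch produces results within 10 minutes.
3. The passenger's name is on the ticket.
4. Each request yields at most one ticket; refunds incur a 30% restock fee.
Problem solved right away.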

g*****g
Posts: 34805

From topic: Programming board - How to take a snapshot of objects' status

Take a page from DB cluster design. You use a replica for long-running jobs.
Updates are logged, batched and asynchronous.
If your objects are in memory, logging does you no good, but you can still
batch and update async. And if memory is a concern, who says the replica can't
be on a different machine?

l*****i
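Posts: 13

From topic: Programming board - Spark is already out; jump ship quickly if you can

When I first heard of Spark in 2012, what impressed people was in-memory
computation and the REPL. When colleagues doing machine learning promoted it
inside the company, we engineers thought it was not much use, and we missed
out on a lot.
Later, reading Spark a bit more closely, I found that the headline features are
really just one manifestation of the core design. Through the
partition/compute/dependency of RDD, Spark's core data abstraction, it is
actually easy to wrap things up as independent application logic, such as
today's graph and dataframe, and then introduce new optimizations and
applications on top of the new RDDs. Also, because RDDs are lazy, a
transformation does not necessarily correspond to a real task, so declaration
and computation are separated, which leaves a lot of room for extension.
Spark's limits are far beyond what the original poster describes.
Also, Spark may indeed be unable to do the "true streaming" the original poster
talks about (I'm not sure about this; it is up to founders and major committers
like rxin in reply #12), but no system really does high performance,
reliability, and true streaming well all at once. To guarantee reliability and
performance, Storm too can only treat a batch as the unit of commit; otherwise
you get a huge number of commits, or no transactional guarantees.
Maybe Google's Sp...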

g*****g
Posts: 34805

From topic: Programming board - Java job scheduler

It's not clear what you are trying to achieve. Are you trying to finish a
batch of jobs as quickly as possible, or should they be triggered on a
schedule?
Quartz can do the latter if it's not too complicated, but I guess the other
tools recommended can do a better job.
http://www.quartz-scheduler.org/documentation/faq#FAQ-chain
For the former, you can check Spring Batch.

z****e
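Posts: 54598

From topic: Programming board - Looked at Flink; I have to say I'm a little excited

The batch processing part is not much different from Spark.
But the streaming part is the same as Storm, a bit better than Spark.
Spark is currently still only micro batch, yeah.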

w***g
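Posts: 5958

From topic: Programming board - Spark can't do true stream processing because it needlessly insisted on making RDDs immutable, right?

It can't be done without batching. If it could, why bother with batches? To get
the performance out of BLAS, the matrices have to be large enough.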

z****e
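Posts: 54598

From topic: Programming board - Spark can't do true stream processing because it needlessly insisted on making RDDs immutable, right?

That's about datasets, right?
Isn't the whole point of streaming to give up some performance in exchange for
real-time processing?
Hardware cost versus response time is a trade-off to begin with.
The biggest problem with micro batch is that it cannot react immediately.
The more ideal setup right now is Kafka + Storm.
Spark is tied to batch, so it doesn't look like it works too well here.
This area is so new; it feels like feeling our way across the river. Is anyone
here working on streaming?
Do 古德霸 and the others working on video use this a lot?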

z****e
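Posts: 54598

From topic: Programming board - Spark can't do true stream processing because it needlessly insisted on making RDDs immutable, right?

My feeling is that the RDD data structure is what holds them back.
DStream is ultimately still tied to RDD; that is, a DStream is a kind of RDD.
RDD suits datasets well but is not a great fit for data streams.
Spark's foundation is RDD: the algorithms are the ML ones, but the data
structures are basically all RDDs.
And RDD was designed for datasets, i.e. batch processing.
To accommodate datasets, they forced the data stream structure into RDDs,
which does not look like a very good choice.
Of course, micro batch is enough for most applications,
but it still feels odd. Anyway, if Flink fixes this
and can combine the strengths of Spark and Storm, I think that's pretty good.
It's worth a try, and beats wrestling with Storm + Spark yourself.
Getting just one of those two running is painful enough.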

J****R
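Posts: 373

From topic: Programming board - I don't get why the experts say HBase is worse than C*?

二爷 is right: HDFS really is a mess.
I used to think HBase was about on par with C*, but that was because I had
forgotten to put HBase on HDFS, so those results were actually from running on
a single node. With HDFS added, damn, it is more than 20 times slower...
6 nodes
hbase + hdfs
Using the Java connector to batch load 1.4M lines of data into HBase, with a
batch size of 1000, takes about 36 minutes...
It used to take much less time to load the same amount of data into a
single-node HBase on the local file system.
Something must be wrong...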

d******e
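Posts: 2265

From topic: Programming board - How to quickly process a large number of XML files from the web?

For the simple case, Python gevent.
For something more complex, scrapy (this software is actually pretty dumb, a
typical case of forcing OO onto everything).
Later I wrote my own batch frameworks in Python and Scala,
with some retry and throttling added.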
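
The poster's framework is not shown, so the following is only a guess at what "batch with retry and throttling" could look like in its simplest form. `run_batch`, its parameters, and the backoff rule are all assumptions made for illustration.

```python
import time


def run_batch(items, fetch, max_retries=3, min_interval=0.0, sleep=time.sleep):
    """Calls fetch(item) for every item, pausing between calls and retrying failures.

    Returns (results, failed): a dict of successes and a list of the items
    that still failed after max_retries attempts.
    """
    results, failed = {}, []
    for item in items:
        for attempt in range(1, max_retries + 1):
            try:
                results[item] = fetch(item)
                break
            except Exception:
                if attempt == max_retries:
                    failed.append(item)  # give up on this item, keep going
                else:
                    sleep(min_interval * 2 ** attempt)  # back off before retrying
        sleep(min_interval)  # throttle: space out consecutive items
    return results, failed
```

Passing `sleep` in as a parameter keeps the function easy to test; a caller fetching XML files would pass a real download function as `fetch` and a non-zero `min_interval`.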

z****e
Posts: 54598

From topic: Programming board - Java 8 is a pile of crap

compared to for loop
how to use for loop in the streaming?
u dont even know the border of the stream/loop
u need a listener rather than a loop
reactive rather than active
reactive is not for showing off
is used for solving problems
especially for those streaming industry like Netflix
if u can use for loop then plez use for loop
do not use forEach which should only be used in streaming api
and i personally believe for loop is enough for batch api
u dont need streaming for batch like db/file system c...

g****u
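Posts: 252

From topic: Programming board - Does every request use a separate TCP connection?

It's the latter. To reach the specified throughput over 1 Gb Ethernet, I'm
afraid batching is a must.
Without batching, it should be doable with $20,000 worth of hardware.
But 老魏 himself also said that opening a separate TCP connection for every
request is out of the question.
In a real project you would also go with the latter approach.
A 10x hardware cost is still a lot.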

a*f
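Posts: 1790

From topic: Programming board - Asking again about a Linux combination problem

It depends on your server environment and whether you can access the server.
Without server access, you have to manage all jobs inside the web application,
and your options are very limited.
We use Spring Batch, integrated into the web application, and manage jobs from
the web UI. It can call REST, run multi-threaded, send email, inspect logs, and
so on:
http://docs.spring.io/spring-batch/reference/html/scalability.h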

l*******s
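Posts: 1258

From topic: Programming board - Summing up Kaggle competitions

If you don't use minibatch here and choose batch instead, the synchronization
cost will be greater than with batch,
and if you use stochastic instead of minibatch, it runs faster and the
synchronization cost becomes relatively higher, so wouldn't it be better to
synchronize only after running a few iterations?
Isn't running several local steps in a row meant to avoid local optima in the
first place? SGD has this kind of method.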

w***g
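Posts: 5958

From topic: Programming board - Can a neural network with dozens of layers be trained on a small machine?

The datasets I run into are all very small, a few hundred images at most,
and I always train the FCN with batch = 1. If the images differ in size,
batch = 1 is the only option. With little training data, too many layers and
too many parameters don't work either.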

s*********y
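Posts: 6151

From topic: Programming board - How do you run async operations batch by batch in JS?

The out-of-memory issue is a separate matter. We wrote our own batch processing
with async-q: each round handles one batch of roughly a few thousand requests,
then waits a few seconds before the next. There is more than one way to do it.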
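
The pattern described above (run one batch of async operations, pause, run the next) can be sketched as follows. This uses Python's asyncio rather than JavaScript and does not reproduce the async-q API; `process_in_batches` and its parameters are hypothetical.

```python
import asyncio


async def process_in_batches(requests, handler, batch_size, pause_seconds):
    """Awaits handler(request) for one batch at a time, pausing between batches."""
    results = []
    for start in range(0, len(requests), batch_size):
        batch = requests[start:start + batch_size]
        # Only this batch is in flight, which caps the number of concurrent operations.
        results.extend(await asyncio.gather(*(handler(r) for r in batch)))
        if start + batch_size < len(requests):
            await asyncio.sleep(pause_seconds)  # wait before starting the next batch
    return results
```

A caller would run it with something like `asyncio.run(process_in_batches(urls, fetch, batch_size=2000, pause_seconds=3))`, where `urls` and `fetch` are the caller's own.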

x**********i
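Posts: 658

From topic: Programming board - The "sudden enlightenment" phenomenon in neural networks

May I ask, is there a difference between using batch and SGD? Or does using
both together work best? Right now, if I use batch I don't use SGD, and vice
versa.

: Reportedly, for a veteran, driving stick converges better in the end. There
: was a recent paper on Adam automatically switching to SGD after training for a while.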
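
One common reading of the question, assumed here, is that "batch" means full-batch gradient descent and "SGD" means one sample per update. Under that reading the two differ only in how many samples feed each gradient step, which a toy one-parameter fit can illustrate. `fit_slope` and its defaults are made up for this example.

```python
import random


def fit_slope(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fits y = w * x by gradient descent on mean squared error.

    batch_size == len(xs) is full-batch gradient descent, batch_size == 1 is
    stochastic gradient descent, and anything in between is mini-batch.
    """
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) over the current batch only.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w
```

Calling `fit_slope(xs, ys, batch_size=len(xs))` and `fit_slope(xs, ys, batch_size=1)` runs the same update rule with different batch sizes, so in this sketch the two are ends of one dial rather than things to combine.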

e*i
Posts: 10288

From topic: Software board - How can I automatically run ping every few minutes?

Write a batch file. Something like this.
ENDLESS.BAT
:STARTHERE
PING 192.168.1.1
SLEEP 3600
GOTO STARTHERE
RUN THE BATCH FILE.

d**s
Posts: 920

From topic: Software board - How to manually start VMware Server on Windows XP

I do not want VMware to start automatically when Windows boots.
I want to start and stop VMware Server manually (in a batch file) when
needed.
From the Internet, I got the following for a batch file:
net start "VMWare Authorization Service"
net start "VMWare DHCP Service"
net start "VMWare NAT Service"
net start "VMWare Registration Service"
net start "VMWare Virtual Mount Manager Extended"
"C:\Program Files\VMware\VMware Server\vmware.exe"
However, only the top 3 lines work; the following two lines do not w

r****o
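Posts: 105

From topic: Biology board - From purification to the stars --- notes from the CSH protein purification course (21)

This article is dedicated to YL
(21) Ion-exchange chromatography
After ammonium sulfate precipitation and isoelectric point precipitation, the
next step is ion-exchange chromatography. The principle of ion-exchange
chromatography is simple: proteins are purified through the binding between
charged groups on the protein surface and oppositely charged ion exchanger
groups on the resin. Like affinity chromatography, an ion-exchange column is
an adsorption/desorption type of chromatography, completely different from gel
filtration. One of its biggest advantages is that the sample volume can be
very large while the final eluted sample volume can be much smaller, which in
effect concentrates the sample. Another advantage is that ion-exchange
chromatography does not actually need a column; you can use the batch method
(mixing beads and sample in a centrifuge tube). Of course, with a large amount
of beads the batch method does not equilibrate very well, so the resolution is
not very good. We experienced this firsthand in our experiments, and I will
mention it in a later installment. We used a column to purify calmodulin,
because the beads used are equivalent to Sephadex G-50 plus anion-exchange
groups, so in effect they can be seen as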

a******y
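Posts: 44

From topic: Biology board - A primer on positions and job functions at pharmaceutical companies

I work in regulatory affairs at a pharmaceutical company. I have a PhD in
biology and have been doing this for three years. I have worked at both small
and large companies, so let me tell you about RA work.
RA belongs to R&D, but it is development-stage work. We are not involved in
early research: large-scale molecule screening, pharmacology studies, and
exploring models. Usually, by the time a project reaches us, clinical trials
are already being planned.
In general, for a new drug to reach the market, the FDA evaluates three areas:
pharmaceutics (i.e. manufacturing, or CMC), pharmacology (i.e. animal studies,
including toxicology), and clinical (i.e. testing in humans). The evaluation
focuses on efficacy and safety. Besides regulating new drug approvals, the FDA
also oversees the safety of clinical trials. During development a new drug must
be tested in humans to demonstrate its efficacy and safety, so the drug maker
has to apply to the FDA for permission to run clinical trials, which is what we
usually call an IND application. An IND likewise contains three parts: CMC,
non-clinical, and clinical.
Inside the company, RA coordinates development work across different areas and
stages, gives guidance based on FDA regulations, and raises problems in the
work promptly. Because new drug development is highly technical work, the FDA
will often, in response to technical innovation, put forward all kinds of new...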

s******y
Posts: 28562

From topic: Biology board - Help... a simple tissue culture problem

Your new serum is definitely suspicious then!
Call the company and ask them to send you an old batch, or ask around for
other labs to see if they have the old batch.
Or, order serum from a different company
By the way, don't take things for granted.
"Once cells truly enter differentiation, they should not increase in number,
right?" is not theoretically or practically correct.
For example, if your cells differentiated from stem cell into fibroblast, or
from stem cell into some kind of progenitor cells,
the differentiated cells surely will rep

b*****l
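Posts: 9499

From topic: Biology board - 颜宁: How much money does basic biology research really burn?

Speaking of biology, is there anything more exasperating than FBS? For my cell
line I only use FBS from ATCC and don't dare try anything else. Another group
found that only one particular batch of FBS from one particular company could
induce a certain key differentiation in their cells, so they bought up that
entire batch and filled several refrigerators with it. It became a running
joke around our place.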

m**i
Posts: 47

From topic: Biology board - Job posting on behalf of a friend: Scientist I, Biology

Scientist I, Biology
Location: Beijing
Job Responsibility:
Responsible for cell culture media and process development works in labs
using standard cell culture instruments and technologies. Performs duties
independently with only general direction given on new projects and methods.
Accuracy and dedication are required in performing all functions of this
position. Apply laboratory techniques and skills to complete experiments
designed to address a variety of specific problems. Make detailed
observ...

m***T
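Posts: 11058

From topic: Biology board - How to convert GO genes to Affymetrix IDs

Check Affy's NetAffx; it has a batch query feature. Save all the gene IDs you
need in a text file (following the format it requires), then try the batch
query feature.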

s*********e
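Posts: 399

From topic: Biology board - [Repost] 王俊, are you a scientist? -- 饶毅 vs. 王俊

This business about swapping reagents sounds a bit too far-fetched.
If we are talking about Illumina reagents, the contract states very clearly
that only the SBS kit may be used. At 华大 (BGI)'s current scale, if it used a
knock-off reagent and got sued by Illumina, paying upward of ten million US
dollars in damages would be par for the course.
As for cloning the reagents: the 3730 has been out for ten-odd years, and none
of the many domestic reagent companies has managed to clone its reagents. For
HiSeq, I can tell you that quite a few companies are trying to clone them too,
and none has succeeded. Enzymes are actually not hard to produce; what matters
is QC. The problem with every domestic company is that they cannot guarantee
that every batch works. If a single batch fails, 华大 is out of action. If you
were 华大's general manager, would you take that risk?
If it is about other reagents, such as library prep, there are tons of kits on
the market. Different libraries require different reagents, so that is not
surprising. Again: deliberately using low-quality off-brand reagents to save on
reagent costs cannot actually save money, especially on small projects. Botch a
run once, and the reagent cost of re-running the sequencing far exceeds what
was saved on library prep reagents.
Of course, on things like TAT, small-project customers do not get the attention
that big-project customers do. That certainly happens, and certainly not
rarely. All one can do is try to improve.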

g**********y
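Posts: 423

From topic: Biology board - Biostatistics is a really strange scientific field

Experiments are done by people, and obviously some people do them well and
some poorly.
I work on microarray/RNA-seq and pay close attention to batch effects.
Data from some labs has very small batch effects, or in other words very
little noise.
If you look at ordinary QC, you will find that they all pass.
Some noise-sensitive algorithms, fed noisy data, really will produce unreliable
conclusions.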

g**********y
Posts: 423

From topic: Biology board - Biostatistics is a really strange scientific field

s*******c
Posts: 179

From topic: Chemistry board - A question about GMP bioanalysis

I think bracket curve is not required by FDA, but some companies prefer to
have a bracket curve to look at the variance in the batch. Another common
practice is to use only one curve, but put each calibrator randomly among
the batch samples, and the first and last one is a calibrator.

S*****n
Posts: 6055

From topic: Chemistry board - Found in interviews that what companies do is all quite simple

I have a colleague who is very good at nano synthesis;
his batch-to-batch variation is almost zero. I wonder if anyone in academia
could do this.

M******n
Posts: 43051

From topic: Chemistry board - Pharma in China is going very strong right now

Basic research has long been the missing foundation of China’s aspiration
to develop a world-class drug industry. The country has become a global
powerhouse in the manufacture of bulk pharmaceutical chemicals. It is a
preferred destination for major drug companies that outsource research work
to reduce costs. To date, however, none of the world’s major drugs has been
either invented or developed in China.
The country wants to change that in a big way. The government is spending $7
billion over t...

q****e
Posts: 3660

From topic: Chemistry board - job post: Organic/process chemist in Vandalia, Ohio R&D department

Job Purpose
Research, design and develop new technologies that will lead to the
introduction of new products, improve upon existing products and promote the
economic growth of the company
Duties and Responsibilities
- Responsible for development of new products, processes, methods
and formulations
- Perform feasibility evaluations and prepare samples
- Maintain detailed, accurate and organized batch records
- Prepare standards and specifications for proce...

p*****e
Posts: 310

From topic: Computation board - The Enter key problem in Matlab

Wrap it in a batch file? Have a look at the MS batch file help.

j**u
Posts: 6059

From topic: Computation board - [Collection] How to parallelize a nested-loop program?

☆─────────────────────────────────────☆
cityhawk (呆鹰) wrote on (Mon May 23 20:38:14 2011, US Eastern):
The Matlab program is a set of nested for loops, for example:
a=0.1:0.5 with spacing 0.01; b=0.1:0.6 with spacing 0.01
c=0.1:0.8 with spacing 0.01; d=0.1:0.6 with spacing 0.01
e=0.1:0.9 with spacing 0.01; f=0.1:0.7 with spacing 0.01
g=0.1:0.6 with spacing 0.01; h=0.1:0.5 with spacing 0.01
(loop body)
end; end; end; end;end; end; end; end;
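On an ordinary PC (3.6 GHz, 2 GB RAM) this program takes more than 2 weeks to
run. I put it on the department server, and it turned out even slower than the
PC in our lab. The sysadmin told me that a single CPU on the department server
is only 1.8 GHz, even though we have nearly 30 CPUs in parallel and a total of 2...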

s*****5
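Posts: 52

From topic: Computation board - A question for folks who know Java

Total newbie who has never learned Java, so go easy on me~ A humble, basic
question about Java batch processing; thanks so much.
I am running a Java package,
and the script to run it looks like this:
java -Xmx3500M -jar MSGFPlus.jar -s "file name" -o "output"
But now I want to be able to input many file names at once, so as to get
automatic batch processing.
How can I do that? Thanks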
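
One possible approach, sketched here as a small wrapper script, is to leave the jar untouched and simply run the command from the post once per input file. The input pattern `*.mzML`, the `.mzid` output naming, and the function names are assumptions; only the `java ... -s ... -o ...` command line itself comes from the post.

```python
import subprocess
from pathlib import Path


def build_command(spectrum_file, output_dir):
    """Builds the command from the post for one input file.

    The output name (<input stem>.mzid inside output_dir) is an assumption.
    """
    output_file = Path(output_dir) / (Path(spectrum_file).stem + ".mzid")
    return ["java", "-Xmx3500M", "-jar", "MSGFPlus.jar",
            "-s", str(spectrum_file), "-o", str(output_file)]


def run_all(input_dir, output_dir, pattern="*.mzML"):
    """Runs the jar once per matching file, one after another."""
    for spectrum_file in sorted(Path(input_dir).glob(pattern)):
        # check=True stops the loop if one run exits with an error.
        subprocess.run(build_command(spectrum_file, output_dir), check=True)
```

Running the files sequentially keeps memory use at the `-Xmx3500M` of a single JVM; running several at once would multiply that.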
