Laotu on Biology: Biology must develop its own big-data systems - Biology board



m*********u
Posts: 1491

http://www.nature.com/news/biology-must-develop-its-own-big-dat
Biology must develop its own big-data systems
Too many data-management projects fail because they ignore the changing
nature of life-sciences data, argues John Boyle.
The last week of April was designated Big Data Week. But in modern biology,
every week is big-data week: life-sciences research now routinely churns out
more information than scientists can analyse without help. That help
increasingly comes in the form of expensive data-management systems, but
these are hard to design and most are even harder to use. As a result, a
long line of data-management projects in the life sciences — many of which
I have been involved with — have failed.
The size, complexity and heterogeneity of the data generated in labs across
the world can only increase, and the introduction of cloud computing will
encourage the same mistakes. Just a stone's throw from where I work, at
least three computer companies are already touting cloud-based
data-management systems for the life sciences. We need to find ways to manage and
integrate data to make discoveries in fields such as genomics, and we need
to do this quickly.
At their most basic, data-management systems allow people to organize and
share information. In the case of small amounts of uniform data from a
single experiment, this can be done with a spreadsheet. But with multiple
experiments that produce diverse data — on gene expression, metabolites and
protein abundance, for example — we need something more sophisticated.
An ideal data-management system would store data, provide common and secure
access methods, and allow for linking, annotation and a way to query and
retrieve information. It would be able to cope with data in different
locations (on remote servers, on desktops, in a database or spread across
different machines) and formats, including spreadsheets, badly named files,
blogs or even scanned-in notebooks.
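To make that wish list concrete, here is a minimal sketch of the idea, not anything taken from the article or from an existing product. It assumes a catalog that only indexes data where it already lives, instead of forcing a common format. Every name in it (Record, Catalog, register, annotate, link, query, the sample IDs and paths) is hypothetical, and secure access is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One data item, described where it already lives (hypothetical schema)."""
    record_id: str
    location: str   # e.g. a file path or server address; the data is not moved
    fmt: str        # free-text format label, not a fixed vocabulary
    annotations: dict = field(default_factory=dict)
    links: set = field(default_factory=set)


class Catalog:
    """In-memory index over records; the data itself stays external."""

    def __init__(self):
        self._records = {}

    def register(self, record_id, location, fmt, **annotations):
        record = Record(record_id, location, fmt, dict(annotations))
        self._records[record_id] = record
        return record

    def annotate(self, record_id, **annotations):
        self._records[record_id].annotations.update(annotations)

    def link(self, first_id, second_id):
        # Links are symmetric: each record remembers the other.
        self._records[first_id].links.add(second_id)
        self._records[second_id].links.add(first_id)

    def query(self, **criteria):
        """Return records whose annotations match every given key/value."""
        return [
            r for r in self._records.values()
            if all(r.annotations.get(k) == v for k, v in criteria.items())
        ]


# Assumed example data: two unrelated formats in two unrelated places.
catalog = Catalog()
catalog.register("expr-01", "server-a:/runs/expr01.csv", "spreadsheet",
                 assay="gene expression", sample="S1")
catalog.register("prot-07", "desktop:/scans/notebook_p12.png",
                 "scanned notebook", assay="protein abundance", sample="S1")
catalog.link("expr-01", "prot-07")

same_sample = sorted(r.record_id for r in catalog.query(sample="S1"))
print(same_sample)  # ['expr-01', 'prot-07']
```

The point of the sketch is that the spreadsheet and the scanned notebook page are both found by one query without either being converted; the format is just a label on the record.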
That ideal system does not exist. Most academic organizations have, through
trial and error, developed their own in-house systems that work — or just
about. The systems have limited functionality and cannot be connected, which
makes collaboration difficult. The situation is as unworkable as if every
lab in the country had decided to devise its own (poor) document-editing
software.
Efforts to introduce overarching data-management systems, to which any and
all scientists in a particular field could plug in, have failed for two main
reasons. Either they demand that scientists change the format of their data,
to allow information to be entered into the system, or they demand that
scientists change the way they work, to generate standardized sets of
results. The systems are thrust on scientists who are then expected to
change, rather than taking the work of scientists as a starting point. It
should not be scientists who are required to be flexible; it should be the
system that they are being asked to use.
These problems are exemplified by the expensive flop that was the US
National Cancer Institute's caBIG data-integration project, scrapped last
year after almost a decade and tens or even hundreds of millions of dollars.
It had admirable goals and seemed workable in theory, but in the end it was
too complicated to use. Crucially, caBIG relied on standardized data
formats, which called for standardized experiments. Its one-size-fits-all
approach fit nearly nobody.
There have been some successes. A widely used system called SRS allows the
linking of data held in separate well-structured repositories. And the
Biomart project joins up specially designed databases. But these were both
fairly bespoke research applications; computer giants Microsoft and IBM are
among the commercial firms that have introduced systems that aimed at a
wider reach but had little impact.
To be useful to the life-sciences community, a data-management system
probably needs to be devised and developed by the life-sciences community.
The US National Institutes of Health has a 'Big Data' initiative, and agency
head Francis Collins has spoken many times of the need to address the
problem. Now is the time for researchers to plan an open data-management
system that scientists will want to adopt. Many of the software pieces are
already available.
As a starting point, here are three lessons from the successes and failures
of the past.
First, the data are going to change. Biological information will always come
in varied formats, and these formats cannot be defined in advance. Software
engineers hate this. But a useful system must be flexible and updatable.
Second, people are not going to change. Busy scientists will adopt a new
system only if it offers substantial benefit and is painless. Many
commercial systems are unpopular because they make simple steps such as data
retrieval complicated, to stop scientists using several (rival) systems at
once.
Third, the problem is not technical. Although the latest kit is always
alluring to funders, today's cutting-edge devices will be blunt tomorrow.
Data-management systems must be driven by the need to find a workable
solution to the problem, not by a desire to make the problem fit the latest
fashionable technology.
Development of a biology-friendly system is possible, but it will require a
change in mentality. As a useful test, a good data-management system should
cost more to maintain, update and change with the times than it does to
develop. Otherwise the price is too high.
