How to process RNA-Seq data - Biology board

This page is an excerpt and archive of the corresponding thread on MITBBS (未名空间).

z********o
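Posts: 428

My boss wants me to work with RNA-Seq data. I moved into bioinformatics from
another field and know almost no biology. Could anyone recommend what to read
and which algorithms to use? How should I get started?
Thanks, everyone.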

c****y
Posts: 373

http://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_too
good luck!

t*d
Posts: 1290

The fastest way to learn it is to practice Trapnell's pipeline, and get
familiar with all the results.

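[Quoting z********o:]

: My boss wants me to work with RNA-Seq data. I moved into bioinformatics from
: another field and know almost no biology. Could anyone recommend what to read
: and which algorithms to use? How should I get started?
: Thanks, everyone.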

g******a
Posts: 51

Stanford's SFG:
The Simple Fool’s Guide to Population Genomics via RNA-Seq
http://sfg.stanford.edu/

z********o
Posts: 428

Thanks, everyone.
Happy New Year.

h****n
Posts: 2552

Which is actually better, Cufflinks or DESeq?

j****x
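Posts: 1704

It usually depends on the specific experimental setup (depth, replicates). In
general, limma is the first choice, followed by DESeq. There is also no harm
in trying several methods and comparing the results.

[Quoting h****n:]

: Which is actually better, Cufflinks or DESeq?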

z********o
Posts: 428

Could someone explain the concepts of depth, library, and library size? I
spent a whole day browsing online and still don't understand them.
Thanks.
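
The thread never answers this question directly, so here is a rough illustration of how the terms are commonly used. A library is the prepared sample that goes onto the sequencer, library size is usually taken to mean the total number of reads (or mapped reads) obtained for that sample, and depth is an informal word for how many reads a sample received. The sketch below assumes a tiny made-up count table (the sample names, gene names and numbers are all hypothetical) and shows why raw counts are typically scaled by library size before samples are compared.

```python
# Hypothetical read counts per gene for two samples (made-up numbers).
counts = {
    "sample_A": {"gene1": 100, "gene2": 300, "gene3": 600},
    "sample_B": {"gene1": 200, "gene2": 600, "gene3": 1200},
}


def library_size(sample_counts):
    """Library size here: total reads assigned to genes in one sample."""
    return sum(sample_counts.values())


def counts_per_million(sample_counts):
    """Scale raw counts so samples sequenced to different depths are comparable."""
    total = library_size(sample_counts)
    return {gene: n * 1_000_000 / total for gene, n in sample_counts.items()}


library_sizes = {name: library_size(c) for name, c in counts.items()}
cpm = {name: counts_per_million(c) for name, c in counts.items()}

for name in counts:
    print(name, library_sizes[name], cpm[name])
```

In this toy table sample_B was sequenced twice as deeply as sample_A, so every raw count is doubled, yet the scaled values agree. Real differential-expression packages use more careful normalization than this simple scaling.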

c********e
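Posts: 598

Why is limma the first choice?

[Quoting j****x:]

: It usually depends on the specific experimental setup (depth, replicates). In
: general, limma is the first choice, followed by DESeq. There is also no harm
: in trying several methods and comparing the results.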

a**r
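Posts: 352

limma has a long history and is well-established software. It is fast and performs well.
As for DESeq, according to this paper its performance is fairly average; baySeq looks somewhat better.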
http://genomebiology.com/2013/14/9/R95

a******k
Posts: 1190

I have to say, I have had a very bad experience with Trapnell's pipeline.
TopHat is often criticized for being very slow. If you have only tens of
millions of reads, you can go with TopHat, but with a huge dataset like mine
it runs forever (I admit that I have a billion reads). STAR is recommended by
a lot of people, and I also found it very good. It finishes the job in one
hour on our super cluster.
Cufflinks is also very slow. On my dataset it has been running for one week
(16 parallel jobs) and will likely need another several weeks, or forever. It
has some bugs, or, to be fairer, very poorly chosen default parameter
settings (you can tweak them once you are familiar with it). For example, it
uses an extrapolation algorithm that gives very bad abundance estimates for
short transcripts (people complain about this everywhere). As another
example, in a test run it built transcript models for only a tiny fraction of
my data. The reason is that it guesses there is a complex gene model in a
region that is longer than one of its default limits, and then it skips the
whole bundle, i.e., all the reads that fall into that region. So I get
warning messages like this:
Warning! Skip large bundles [chr1: 10000-20000000]. That's almost the entire
chromosome. WTF!
You can adjust the parameters, but that takes a lot of time to figure out,
and remember, it is f*** slow. For now I have to stick with Cufflinks, because
it is more or less the only software that claims to reconstruct transcripts
using a reference transcriptome.
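
The skipped-bundle behaviour described above can be pictured with a toy model. This is not Cufflinks' actual code. It is a sketch that assumes a bundle is a run of overlapping alignments and that any bundle spanning more than a length cap is dropped; the function names, coordinates and cap value are hypothetical. In Cufflinks itself the manual lists a `--max-bundle-length` option for this cap, but check the help output of your installed version before relying on it.

```python
def build_bundles(alignments):
    """Group (start, end) alignments on one chromosome into overlapping runs."""
    bundles = []
    for start, end in sorted(alignments):
        if bundles and start <= bundles[-1][1]:
            # Overlaps the current bundle: extend its right edge.
            bundles[-1][1] = max(bundles[-1][1], end)
        else:
            # No overlap: start a new bundle.
            bundles.append([start, end])
    return [tuple(b) for b in bundles]


def split_by_length(bundles, max_bundle_length):
    """Separate bundles that fit under the cap from those that would be skipped."""
    kept = [b for b in bundles if b[1] - b[0] <= max_bundle_length]
    skipped = [b for b in bundles if b[1] - b[0] > max_bundle_length]
    return kept, skipped


# Made-up alignments: one long spliced read chains three reads into one bundle.
alignments = [(100, 200), (150, 400), (390, 5000), (9000, 9100)]
bundles = build_bundles(alignments)
kept, skipped = split_by_length(bundles, max_bundle_length=1000)

print("bundles:", bundles)
print("kept:", kept)
print("skipped:", skipped)
```

The point of the toy is that a single long alignment can glue many reads into one oversized bundle, and a length cap then discards all of them together, which matches the symptom in the warning message.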

[Quoting t*d:]

: The fastest way to learn it is to practice Trapnell's pipeline, and get
: familiar with all the results.

h****n
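Posts: 2552

Is Cufflinks really so bad that nobody even wants to mention it?

[Quoting a**r:]

: limma has a long history and is well-established software. It is fast and performs well.
: As for DESeq, according to this paper its performance is fairly average; baySeq looks somewhat better.
: http://genomebiology.com/2013/14/9/R95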

l**********1
Posts: 5204

Please check:
i) Trapnell C et al., (2012).
Differential gene and transcript expression analysis of RNA-seq experiments
with TopHat and Cufflinks.
Nat Protoc 7: 562–578.
ii) Li H et al., (2009).
The sequence alignment/Map format and SAMtools.
Bioinformatics 25: 2078–2079.
plus
Weikard R et al., (2013).
Identification of novel transcripts and noncoding RNAs in bovine skin by
deep next generation sequencing.
BMC Genomics. 14: 789. [Epub ahead of print]
>http://www.ncbi.nlm.nih.gov/pubmed/24225384
cited,
>Reannotation, mapping and bioinformatic data analysis
>Read alignment to the reference genome was performed
using the Bowtie/ TopHat/ Cufflinks/ Cuffmerge pipeline
[44]. A filtering step using SAMtools and Linux commands
[45] was performed to eliminate those reads showing
more than two mismatches to the reference genome
and reads with multiple mapping hits. A guided transcript
assembly using the bovine reference genome assembly
UMD3.1 (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bos_taurus/,
downloaded 28/02/2012) on top of the Ensembl reference
annotation, release 66, (ftp://ftp.ensembl.org/pub/release-66/gtf/bos_taurus/,
downloaded 28/02/2012) was carried
out for each sample file separately. This strategy considered
the reference genome annotation and additionally,
allowed inclusion of sequence reads mapping to chromosome
regions or transcription units not yet annotated in
the underlying reference transcript assembly. The separate
analysis of the individual transcript assembly for each
sample enabled the identification of potential differently
spliced transcripts of pigmented and nonpigmented phenotypes.
Thus, the generated final transcriptome assembly
comprising transcripts from both phenotypes will provide
novel transcripts, genes and isoforms in addition to the
reannotated known reference loci.
Finally, the resulting individual transcript assemblies
were merged to form a single transcript assembly using
the Cuffmerge option. The merged transcript assembly
(final GTF file) was applied for locus and transcript quantification
using Cuffdiff v1.3. The final dataset represents
the joint transcriptome of pigmented and nonpigmented
skin samples including all transcripts (annotated and nonannotated)
that contain at least one exon and reveal expression
either in pigmented or nonpigmented skin
samples. A further filtering step was included to eliminate
transcripts having a very low expression level. All transcripts
which had a lower bound of zero for the 95% confidence
interval on the FPKM (fragments per kb for a
million reads) of the object were excluded from the dataset.
Transcript and locus assemblies were visualised by inspection
of the BAM files of the samples and the final
annotation with the IGV viewer [46].
Original post:
http://www.mitbbs.com/article_t/Biology/31863359.html
(2nd floor)
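
The read-filtering step in the quoted methods (drop reads with more than two mismatches or with multiple mapping hits) can be sketched as follows. The paper used SAMtools and Linux commands; this Python version is only a stand-in. It assumes SAM text records that carry the optional `NM` (edit distance) and `NH` (number of reported hits) tags, and it treats `NM` as a proxy for the mismatch count. The records and the helper names are made up. The `fpkm` function writes out the usual fragments-per-kilobase-per-million formula behind the FPKM values mentioned in the passage.

```python
def optional_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of a SAM record as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 SAM columns are mandatory
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def keep_read(sam_line, max_mismatches=2):
    """Keep uniquely mapped reads with at most `max_mismatches` differences."""
    tags = optional_tags(sam_line)
    # Assumption: NM (edit distance) stands in for the mismatch count,
    # and NH is the number of reported alignments for this read.
    edit_distance = int(tags.get("NM", 0))
    hits = int(tags.get("NH", 1))
    return edit_distance <= max_mismatches and hits == 1


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Made-up SAM records; sequence and quality are left as '*' for brevity.
records = [
    "r1\t0\tchr1\t100\t50\t50M\t*\t0\t0\t*\t*\tNM:i:1\tNH:i:1",
    "r2\t0\tchr1\t200\t50\t50M\t*\t0\t0\t*\t*\tNM:i:3\tNH:i:1",
    "r3\t0\tchr1\t300\t3\t50M\t*\t0\t0\t*\t*\tNM:i:0\tNH:i:4",
]
kept_names = [line.split("\t")[0] for line in records if keep_read(line)]
example_fpkm = fpkm(500, 2000, 10_000_000)

print(kept_names)
print(example_fpkm)
```

The final expression filter in the passage (discarding transcripts whose 95% confidence interval on FPKM has a lower bound of zero) would be one more comparison on the lower-bound value reported by the quantification tool.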

[Quoting z********o:]

: Could someone explain the concepts of depth, library, and library size? I
: spent a whole day browsing online and still don't understand them.
: Thanks.

t**********y
Posts: 374

limma was originally designed for arrays.

[Quoting a**r:]

B*********r
Posts: 19

What do you mean by "reconstruct transcripts with reference transcriptome"?

[Quoting a******k:]

: I have to say, I have very bad experience with Trapnell's pipeline.
: Tophat is complained to be very slow. If you have only tens of millions of
: reads, you can go with Tophat, but with a huge dataset like me, it runs
: forever (I admit that I have a billion reads). STAR is recommend by a lot of
: people and I also found it is very good. It finishes the job in one hour on
: our super cluster.
: Cufflink is also very slow. For my dataset, it has been running for one week
: (16 parallel jobs), and likely another several weeks or forever. It has
: some bugs or maybe more fair to say very bad taste on default parameter
: settings (you can tweak it if you are getting familiar with it). For example

x***u
Posts: 297

TopHat is slow but the results are OK. Cufflinks has been reported to have
issues.
Trinity + PASA can do transcript reconstruction:
http://pasa.sourceforge.net/

it is kind of the only software that declares to reconstruct transcripts
with reference transcriptome.

[Quoting a******k:]

a******k
Posts: 1190

My understanding is that it first tries to reconcile aligned reads/fragments
with known transcripts in a reference transcriptome (supplied by the user).
Then, for the reads that find no supporting annotation, it tries to construct
novel transcripts.
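
The two-stage idea described here can be sketched very roughly. This is a simplification under stated assumptions, not how Cufflinks is implemented: reads and annotated exons are plain (start, end) intervals on one chromosome, a read counts as "explained" only if it lies entirely inside one annotated exon, and everything else is handed to a novel-assembly step that is not shown. All names and coordinates are hypothetical.

```python
def contained_in_any(read, exons):
    """True if the read interval lies entirely inside some annotated exon."""
    start, end = read
    return any(ex_start <= start and end <= ex_end for ex_start, ex_end in exons)


def partition_reads(reads, annotated_exons):
    """Stage 1: split reads into those the annotation explains and the rest."""
    explained, unexplained = [], []
    for read in reads:
        if contained_in_any(read, annotated_exons):
            explained.append(read)
        else:
            unexplained.append(read)
    return explained, unexplained


# Made-up annotation and alignments.
annotated_exons = [(100, 300), (500, 700)]
reads = [(120, 170), (550, 600), (800, 850), (280, 330)]

explained, unexplained = partition_reads(reads, annotated_exons)

# Stage 2 (not shown): only `unexplained` would go to novel transcript assembly.
print("explained by annotation:", explained)
print("left for novel assembly:", unexplained)
```

A real tool also has to handle spliced reads, strand, and reads compatible with several isoforms, which this sketch ignores.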

[Quoting B*********r:]

: what do you mean by "reconstruct transcripts with reference transcriptome"?

a******k
Posts: 1190

That's approximately the same conclusion I reached.
Could you please be a little more specific about the problems Cufflinks has?
As far as I can tell, most of the problems come from its poor default
parameter settings. I am not sure whether it is fundamentally flawed.
Strangely, it tries to make decisions for biologists by algorithmically
adjusting the results it outputs, but I believe most researchers would
rather see results that come directly from the raw data.
There are many tools that can do transcript reconstruction, but I am
attracted by its idea of using knowledge of the existing transcriptome as
guidance during transcript reconstruction.

[Quoting x***u:]

: Tophat is slow but result is OK. Cufflinks have been reported to have issues
: Trinity + PASA can do transcript reconstruction:
: http://pasa.sourceforge.net/
:
: it is kind of the only software that declares to reconstruct transcripts
: with reference transcriptome.

j****x
Posts: 1704

Is this fundamentally any different from annotated-genome-guided transcript
reconstruction?

[Quoting a******k:]

: My understanding is that it will first try to reconcile aligned reads/
: fragments to known transcripts in a reference transcriptome (user input).
: Then for those could not find supporting annotations, it tries to construct
: novel transcripts.

l**********1
Posts: 5204

Now we have GIIRA.
Please check:
https://sourceforge.net/projects/giira/
GIIRA – RNA-Seq Driven Gene Finding Incorporating Ambiguous Reads
Posted on October 16, 2013 by RNA-Seq Blog Administrator
The reliable identification of genes is a major challenge in genome research
since further analysis depends on the correctness of this initial step.
With high-throughput RNA-Seq data reflecting currently expressed genes, a
particularly meaningful source of information has become commonly available
for gene finding. However, practical application in automated gene
identification is still not the standard case. A particular challenge in
including RNA-Seq data is the difficult handling of ambiguously mapped reads.
Researchers at the Robert Koch-Institute, Germany have developed GIIRA (Gene
Identification Incorporating RNA-Seq data and Ambiguous reads), a novel
prokaryotic and eukaryotic gene finder that is exclusively based on a
RNA-Seq mapping and inherently includes ambiguously mapped reads. GIIRA extracts
candidate regions supported by a sufficient number of mappings and
reassigns ambiguous reads to their most likely origin using a maximum-flow
approach. This avoids the exclusion of genes that are predominantly
supported by ambiguous mappings. Evaluation on simulated and real data and
comparison with existing methods incorporating RNA-Seq information highlight
the accuracy of GIIRA in identifying the expressed genes.
AVAILABILITY: GIIRA is implemented in Java and is available from the HTTPS
link above.
CONTACT: renardB'@'rki.de.
"GIIRA – RNA-Seq Driven Gene Finding Incorporating Ambiguous Reads" is a post
from: RNA-Seq Blog
http://www.informaticsblogs.com/author/rna-seq-blog-administrator/
or
Zickmann F et al., (2013).
GIIRA--RNA-Seq driven gene finding incorporating ambiguous reads.
Bioinformatics. Oct 27. [Epub ahead of print]
http://www.ncbi.nlm.nih.gov/pubmed/24123675
or the slide link (PPTX, now converted to PDF):
http://mendel.informatics.indiana.edu/~yye/lab/teaching/get.php?course=fall2013-I519&name=GIIRA_paper3.pdf
[attached figure not preserved]

[Quoting a******k:]

: That's approximately the same conclusion I got.
: Could you please be a little more specific about the problem Cufflinks has?
: As far as I can tell, most of the problems are because of its bad default
: parameter settings. I am not quite sure whether it is fundamentally flawed.
: Strange it tries to make decisions for biologists by doing algorithmic
: adjustments to the results outputted. But I believe most researchers would
: like to see direct results of raw data.
: There are many software/tools can do transcripts reconstruct. But I am
: tempted by its idea of using knowledge of existing transcriptome as guidance
: in the process of transcripts reconstruction.

l**********1
Posts: 5204

Sure,
plus Trans-ABySS, Oases, Scripture, etc.
Please refer to this review:
Martin JA et al., (2011).
Next-generation transcriptome assembly.
Nat Rev Genet. 12: 671-82.
and its Table 2.
'SOAPdenovo' is also noted in another review:
Yandell M et al., (2012).
A beginner's guide to eukaryotic genome annotation.
Nat Rev Genet. 13: 329-42.
http://www.ncbi.nlm.nih.gov/pubmed/22510764
or PDF link:
http://www.yandell-lab.org/publications/pdf/euk_genome_annotation_review.pdf

[Quoting x***u:]

t***q
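Posts: 65

Just use CLC! It only costs about $5,000 now, which is practically dirt cheap.
Talk your boss into buying it and life gets easy afterwards, haha. Of course,
don't let your boss find out how powerful CLC is.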

a******k
Posts: 1190

Sounds like the same idea. What other software can you recommend besides
Cufflinks?

[Quoting j****x:]

: Is this fundamentally any different from annotated-genome-guided transcript
: reconstruction?

a******k
Posts: 1190

Have you actually read the posts carefully?

[Quoting l**********1:]

: sure,
: plus Trans-ABySS, Oases or Scripture etc..
: pls refer one review,
: Martin JA et al., (2011).
: Next-generation transcriptome assembly.
: Nat Rev Genet. 12: 671-82.
: and its Table 2
: also 'SOAPdenovo'
: noted on another review,
: Yandell M et al., (2012).

j****x
Posts: 1704

http://www.ncbi.nlm.nih.gov/pubmed/24185837

[Quoting a******k:]

: Sounds the same idea. What else software can you recommend besides Cufflink?

a******k
Posts: 1190

I've seen that paper. My impression is that Cufflinks remains the only one
that can do reference-based transcript assembly. Please correct me if I am
wrong.
Some tools like Augustus were originally designed to identify transcripts
(gene models) without RNA-seq data. They can now use RNA-seq data, but
interestingly their performance does not improve much with the extra
information. Most of the other tools do de novo transcript reconstruction.
SLIDE seems to be another unusual tool: it does transcript abundance
estimation with a reference transcriptome (i.e., no de novo discovery). I only
tried it briefly, so I am not quite sure.

[Quoting j****x:]

: http://www.ncbi.nlm.nih.gov/pubmed/24185837

l**********1
Posts: 5204

RE:
>is that Cufflink remains the only one
Yes, among overlap-graph methods Cufflinks is the only one,
but now we have 'Traph', which works on a splicing graph
and uses the minimum-cost network flow principle.
For details please check:
Tomescu AI et al., (2013).
A Novel Combinatorial Method for Estimating
Transcript Expression with RNA-Seq:
Bounding the Number of Paths
Abstract. RNA-Seq technology offers new high-throughput ways for transcript
identification and quantification based on short reads, and has recently
attracted great interest. The problem is usually modeled by a weighted
splicing graph whose nodes stand for exons and whose edges
stand for split alignments to the exons.
[...]
In order to obtain a practical tool,
we implement three optimizations and heuristics, which achieve better
performance on real data, and similar or better performance on simulated
data, than state-of-the-art tools Cufflinks, IsoLasso and SLIDE. Our tool,
called Traph, is available at http://www.cs.helsinki.fi/gsa/traph/
PDF link:
http://arxiv.org/pdf/1307.7811
or from
http://www.cs.helsinki.fi/en/gsa/traph/
There is also a slide file (PPTX), posted on the floor above.
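
The splicing-graph model in the quoted abstract (nodes are exons, edges are split alignments between exons) can be sketched in a few lines. This is not Traph's minimum-cost flow method. It only builds a small made-up graph and enumerates source-to-sink paths, each of which is a candidate transcript, to show why the number of paths is the quantity the paper's title talks about bounding. The exon names and the graph are hypothetical.

```python
# Splicing graph: exon -> exons reached from it by a split (spliced) alignment.
splicing_graph = {
    "exon1": ["exon2", "exon3"],
    "exon2": ["exon4"],
    "exon3": ["exon4"],
    "exon4": [],
}


def enumerate_paths(graph, node, sink):
    """List every path from `node` to `sink` in an acyclic splicing graph."""
    if node == sink:
        return [[node]]
    paths = []
    for nxt in graph[node]:
        for tail in enumerate_paths(graph, nxt, sink):
            paths.append([node] + tail)
    return paths


candidate_transcripts = enumerate_paths(splicing_graph, "exon1", "exon4")

for path in candidate_transcripts:
    print(" -> ".join(path))
```

Each extra alternative exon can multiply the number of paths, so brute-force enumeration like this becomes impractical on complex genes, which is the motivation for flow-based formulations.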

l**********1
Posts: 5204

Sorry,
now please check this one:
a historical overview of almost all the software used for RNA-seq de novo
assembly.
Slide file link (PPTX, already converted to PDF):
http://www.cs.helsinki.fi/u/tomescu/traph/TKRM-HITSEQ.pdf
or the table attached here [attachment not preserved],
>
From: aablackk (black), Board: Biology
Subject: Re: How to process RNA-Seq data
Posted at: BBS MITBBS (未名空间站) (Sun Jan 5 02:03:36 2014, US Eastern)
[...]
Most of the other tools do de-novo transcript reconstruction. SLIDE seems
another unique tool that do transcript abundance estimation with
reference transcriptome (i.e., no de novo finding). I only tried it briefly
so not quite sure.
>>

[Quoting a******k:]

: have you ever read the posts carefully?

l**********1
Posts: 5204

Second slide attached here [attachment not preserved],
from the PPTX file link in the post one floor down.

[Quoting l**********1:]

: RE:
: >is that Cufflink remains the only one
: Yes, if by overlap graph mode, Cufflinks is the only one,
: but now we have 'Traph' which by splicing graph mode,
: it was used minimum-cost network flows princeple..
: details pls check,
: Tomescu AI et al., (2013).
: A Novel Combinatorial Method for Estimating
: Transcript Expression with RNA-Seq:
: Bounding the Number of Paths

l**********1
Posts: 5204

RNA-seq on SGS (Next Generation Sequencing 2.0) might already be an older
protocol (n.b. like cellular phone 3G service).
Please refer to this new paper on a hybrid protocol combining TGS (Next
Generation Sequencing 3.0) and SGS RNA-seq (n.b. like cell phone 3.5G service):
Au KF et al., (2013).
Characterization of the human ESC transcriptome by hybrid sequencing.
Proc Natl Acad Sci U S A. 110: E4821-30.
http://www.ncbi.nlm.nih.gov/pubmed/24282307

[Quoting z********o:]

: My boss wants me to work with RNA-Seq data. I moved into bioinformatics from
: another field and know almost no biology. Could anyone recommend what to read
: and which algorithms to use? How should I get started?
: Thanks, everyone.

j****x
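Posts: 1704

For genome-guided assembly similar to Cufflinks there is of course also
Scripture, which is faster but seems to do poorly on low-abundance
transcripts. A friend recently recommended RNA-eXpress, but I haven't tried it
yet. Also, an earlier post mentioned CLC Genomics; if money is no object, it
really is a good choice.
In addition, DRUT and RABT are auxiliary tools worth considering and may help
with what you need.
BTW, I don't know SLIDE, but its author should count as fairly "pp" in this
field, haha

[Quoting a******k:]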

: I've seen that paper. My impression is that Cufflink remains the only one
: that can do reference-based transcript assembly. Please correct me if I am
: wrong.
: Some tools like Augustus are originally designed to identify transcript(gene
: models) without RNA-seq data. They now can use RNA-seq data, but
: interesting is that the performance does not increase a lot with more
: information. Most of the other tools do de-novo transcript reconstruction.
: SLIDE seems another unique tool that do transcript abundance estimation with
: reference transcriptome (i.e., no de novo finding). I only tried it briefly
: so not quite sure.

a******k
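Posts: 1190

There are plenty of genome-guided assemblers; what I mean is
transcriptome-guided.
Haha, the SLIDE author seems to have become an AP (assistant professor).

[Quoting j****x:]

: For genome-guided assembly similar to Cufflinks there is of course also
: Scripture, which is faster but seems to do poorly on low-abundance
: transcripts. A friend recently recommended RNA-eXpress, but I haven't tried it
: yet. Also, an earlier post mentioned CLC Genomics; if money is no object, it
: really is a good choice.
: In addition, DRUT and RABT are auxiliary tools worth considering and may help
: with what you need.
: BTW, I don't know SLIDE, but its author should count as fairly "pp" in this
: field, haha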

D*a
Posts: 6830

The course starts tomorrow:
University of Toronto
Bioinformatic Methods I
https://www.coursera.org/course/bioinfomethods1

l**********1
Posts: 5204

If you can't get an answer on a Chinese forum,
how about asking on the English-language TopHat Google Group?
https://groups.google.com/forum/#!topic/tuxedo-tools-users/HQkjCNXx2-Y
https://groups.google.com/forum/#!forum/tuxedo-tools-users
from
http://tophat.cbcb.umd.edu/igenomes.shtml

[Quoting a******k:]

: There are plenty of genome-guided assemblers; what I mean is
: transcriptome-guided.
: Haha, the SLIDE author seems to have become an AP (assistant professor).
