x*****u posts: 3419 | 1 http://www.linux-mag.com/2004-03/extreme_01.html
Linux Magazine / March 2004 / EXTREME LINUX
Using OpenMP, Part 3
by Forrest Hoffman
This is the third and final column in a series on shared-memory parallelization using OpenMP. Often used to improve the performance of scientific models on symmetric multi-processor (SMP) machines or SMP nodes in a Linux cluster, OpenMP consists of a portable set of compiler directives, library calls, and environment variables. |
|
l******9 posts: 579 | 2 Hi,
I am trying to parallelize a compute-intensive problem.
I am working on a Linux cluster where each node is a multicore machine,
e.g. 2 or 4 quad-core processors per node.
I want to reduce latency and improve performance as much as possible.
I plan to use multiprocessing and multithreading at the same time:
each process runs on a distinct node and spawns many threads
on that node. This is a two-level parallelism.
For multiprocessing, I would like to choose MPI.
For multithre... [full post truncated] |
|
y**b posts: 10166 | 6 I parallelized a simulation program with OpenMP and found that once the run is long enough, OpenMP gives a different result on every run, while the serial version always produces the same answer. Is this normal?
My intuition is that because the threads finish in a different (effectively random) order on each run, the order of the floating-point operations changes and with it the way rounding error accumulates. How is this kind of problem usually handled? Thanks. |
|
x*****u posts: 3419 | 9 http://www.linux-mag.com/2004-02/extreme_01.html
Linux Magazine / February 2004 / EXTREME LINUX
OpenMP Multi-Processing, Part 2
by Forrest Hoffman
This month, we continue our focus on shared-memory parallelism using OpenMP. As a quick review, remember that OpenMP consists of a set of compiler directives, a handful of library calls, and a set of environment variables that can be used to specify run-time parameters. Available for both FORTRAN an... |
|
b*****l posts: 9499 | 10 [Forwarded from the Thoughts board]
From: bigsail (河马·旋木), Board: Thoughts
Subject: OpenMP, help...
Posted: BBS 未名空间站 (Sat Apr 30 02:18:47 2011, US Eastern)
I'm learning OpenMP and stuck at step one: setting the number of threads fails.
The code in TestOMP.cpp is trivial: start 5 threads, have each one introduce itself, done.
#include <omp.h>
#include <iostream>
using namespace std;
int main() {
    omp_set_num_threads(5);
    cout << "Fork! " << endl;
    #pragma omp parallel
    {
        // Obtain and print thread id
        cout << "Hello World from thread = " << omp_get_thread_num()
             << " of " << omp_get_num_threads() << endl;
    ... [full post truncated] |
|
O*******d posts: 20343 | 11 You have to activate OpenMP in your compiler. For Visual Studio 2008:
Project -> "your project" Properties -> C/C++ -> Language -> OpenMP Support |
|
O*******d posts: 20343 | 12 The default number of threads in OpenMP is the number of CPUs on your
computer if you do not call omp_set_num_threads(). Of course, you have to
activate OpenMP support in your compiler. |
|
O*******d 发帖数: 20343 | 13 比较新的compiler一般都支持OpenMP。 但是可能需要激活,至少Visual Studio 2008
是这样的。 激活就是把compiler
支持OpenMP的功能调用起来。 如果不激活,不管你有几个CPU,就只有一个thread。你
call omp_set_num_threads()
在没有激活的compiler下是无效的,但也不会给错。 这是为了backward
compatibility. |
|
y****n 发帖数: 15 | 14 有一个关于openmp的问题想请教各位大牛。原始程序(如A)需要分配一个临时数组再释
放。用OpenMP改成并行实现后(如B),不同线程不能共享这个数组,每个线程需要独立
分配这段内存。
如果在循环体内分配内存,那一共分配了nk=121次,效率很低。实际上如果存在4个线
程,只要在每个线程中分配一次就行了。不知道应该如何实现,请大牛们指点。
非常感谢。
-------------------------------------
Program A:
-------------------------------------
float* pfSdx = (float *) calloc( N );
for (int k = 0; k < nk; k++)
{
...
}
free( (float *) pfSdx );
-------------------------------------
Program B:
-------------------------------------
#pragma omp parallel for
for (int k = 0; k < ... 阅读全帖 |
|
x*****u posts: 3419 | 15 http://www.linux-mag.com/2004-01/extreme_01.html
Linux Magazine / January 2004 / EXTREME LINUX
Multi-Processing with OpenMP
by Forrest Hoffman
In this column's previous discussions of parallel programming, the focus has been on distributed memory parallelism, since most Linux clusters are best suited to this programming model. Nevertheless, today's clusters often contain two or four (or more) processors per node. While one could simply start mult... |
|
b*****l 发帖数: 9499 | 16 在学 OpenMP,第一步就不通:设多线程失败。。。
TestOMP.cpp 的 code 很简单:开 5 个线程,每个介绍一下自己,就完事了.
#include
#include
using namespace std;
main () {
omp_set_num_threads(5);
cout << "Fork! " << endl;
#pragma omp parallel
{
// Obtain and print thread id
cout<< "Hello World from thread = " << omp_get_thread_num()
<< " of " << omp_get_num_threads() << endl;
// Only master thread does this
if (omp_get_thread_num() == 0)
cout << "Master thread: number of threads = " <<
omp... 阅读全帖 |
|
|
x*z posts: 1010 | 18 Most MPI libraries have a shared-memory transport implemented, which actually has
less overhead than OpenMP or threading. |
|
l******9 posts: 579 | 19 In MPI libraries with shared memory implemented, do we have inter-process
communication or inter-thread communication?
If it is the former, why does a process have less overhead than a thread?
If it is the latter, why does it have less overhead than OpenMP and threading?
Does MPI have some built-in advantages over them?
Any help is really appreciated.
Thanks |
|
Q*T posts: 263 | 20 Enable OpenMP support when compiling and linking:
g++ -fopenmp -c -o TestOMP.o TestOMP.cpp
g++ -fopenmp -o TestOMP TestOMP.o |
|
|
|
y****e posts: 23939 | 23 Thanks for the reply, but I'm still a bit confused. I'm compiling with g++ on Linux, and the compile succeeds; what do you mean by activating OpenMP?
My system is an Intel dual core, which should count as two processors.
And I did call omp_set_num_threads(),
but only one thread came up. |
|
p******m posts: 353 | 24 I tried compiling OpenMP code with the Intel 9 compiler in the VC 6.0 environment, but one of the threads keeps getting executed repeatedly. I don't know why. Has anyone run into a similar problem? |
|
p******m posts: 353 | 25 Has anyone here used OpenMP?
Can it be compiled into a DLL? Does the DLL keep its parallelism when called? |
|
s*******e posts: 664 | 27 ☆─────────────────────────────────────☆
petersam (google) wrote on (Fri Oct 2 16:06:00 2009, US Eastern):
I tried compiling OpenMP code with the Intel 9 compiler in the VC 6.0 environment, but one of the threads keeps getting executed repeatedly. I don't know why. Has anyone run into a similar problem?
☆─────────────────────────────────────☆
petersam (google) wrote on (Fri Oct 2 16:36:24 2009, US Eastern):
Here is my test code:
#include <stdio.h>
#include <omp.h>
int main(){
    int i;
    omp_set_num_threads(2);
    #pragma omp parallel for
    for(i = 0; i < 6; i++ )
        printf("i = %d\n", i);
    return 0;
}
☆───────────────────────────────────── |
|
O*******d 发帖数: 20343 | 28 我个人比较喜欢OpenMP。 不需要加很多code,最简单的就只需要加一行, compiler就
可以自动把for loop平行。 线程的数目自动和你的CPU核的数目一致,每个核执行for
loop的不同index。 这些全都是自动的,不需要你操心。 你可以做data parallelism
和task parallelism. |
|
m***x posts: 492 | 29 For data parallelism, use OpenMP. |
|
y**b posts: 10166 | 30 An update: using GCC's quad-precision math library (libquadmath), preliminary results show that every OpenMP run now gives exactly the same result (at the original double output precision), whereas double or long double showed clear deviations under the same computation.
So the effort wasn't wasted. It's striking that before 64-bit computation has even become universal, there is already real demand for 128-bit; the many high-precision libraries are probably evidence of that. Unfortunately the quadmath library is slow for now; my runs show roughly a 30x slowdown, which is plenty slow. |
|
t****t 发帖数: 6806 | 31 我不懂fortran, 但是第一, 这种小事没必要搞什么openmp这么复杂, 你不就是要一次
开十七八个进程吗? shell就可以搞定了, 看你的程序本来就是shell的包装, 可是这包
装有什么用呢?
第二, 同时跑十七八个进程, 输入可以是同一个文件(但是注意不要exclusive open),
输出如果是同一个文件那就是自找麻烦. 看你的程序, 调用mymodel.exe的时候命令行
完全没有变化, 多半就是麻烦的根源了吧 |
|
O*******d posts: 20343 | 32 Why use OpenMP for reading input files? The bottleneck for input files is not the CPU but hardware I/O. |
|
y****n 发帖数: 15 | 33 下面这段程序使用openmp执行一个类似图像线性插值的算法。
输入为Z(图像),X(坐标),Y(坐标),输出为F(图像)
为了避免同时写入数组F的某个元素,使用了#pragma omp atomic
我遇到的问题是,当把线程数设为1和2时,运行程序会得到不同的结果。实在想不出问
题出在什么地方。肯请大牛们帮忙看一看。
#pragma omp parallel for
for (int n = 0; n < MN; n++)
{
double y = Y[n];
double x = X[n];
int fx = (int)floor(x);
int fy = (int)floor(y);
if (fx<1 || x>nw || fy<1 || y>nh) // image index is [1...nw]
{
for (int i = 0; i < ndim; i++)
{
#pragma omp atomic
F[n+i*MN] += Z... 阅读全帖 |
|
t****t 发帖数: 6806 | 34 不懂openmp, 但是浮点数支持atomic吗? I actually don't think so... |
|
p***o 发帖数: 1252 | 35 纠结这个不如上TBB。再说难道openmp会笨到每次都重新建立新线程而不用线程池? |
|
g****n 发帖数: 13 | 36 Hi
I am new to openMP. now I have some question about it.
I wrote a very simple program in C++.
#include
#include
main ()
{
int nthreads, tid;
int i;
omp_set_num_threads(2);
printf("Number of CPUS:%d\n",omp_get_num_procs());
/* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(tid)
{
tid = omp_get_thread_num();
if(tid==0)
{
printf("tid=%d thread = %d\n", 0,tid);
printf("there are %d threads\n",omp_get_num_threads |
|
t*******t 发帖数: 1067 | 37 请问这里有人在用openmp吗?我有个弱问题请教,在下面这行程序里,如果我有很多变
量是
private,至少超过一行,请问怎么换行,谢谢
!$OMP PARALLEL DO SHARED(n,a), PRIVATE(i,j,k,su,....) |
|
t******0 发帖数: 629 | 38 我在网上找到如下手册http://openmp.org/mp-documents/omp-hands-on-SC08.pdf
编写出如下Hello World程序,在VC2012下跑。
#include
#include
#include // system("pause")
int main()
{
omp_set_num_threads(4);
# pragma omp parallel
{
int ID=omp_get_thread_num();
printf("Hello(%d)",ID);
printf("World(%d)n",ID);
}
system("pause"); //课件里没有这句
return(0); //课件里没有这句
}
运行结果就是:
Hello(0)World(0)
Press any key to continue...
说好的1,2,3都没看见了。。。请问我是哪里编... 阅读全帖 |
|
|
y**b 发帖数: 10166 | 40 mpi一直可以做shared memory计算,在一台机器的内存里面通讯,性能能不好吗。
用mpi比mpi+openmp性能还好,很多情况是这样的,我做的情况也是如此。但是不能排
除有些情况不是如此。
关键是,mpi从设计到完成比openmp复杂太多。一个项目,时间上很可能不允许做mpi(
没个半年设计、开发、调试、大规模测试很难搞定),但是openmp很简单,几天几周基
本都能搞定。
mpi一旦做好了,就不是openmp能比的了。openmp只能运行在一个节点或一台工作站上
,mpi就没这个限制了,几百几千个节点并行的威力没法比。 |
|
s******u 发帖数: 501 | 41 烂。OpenMP的scaling明显有问题,72核心280线程但是scaling能到50-60x就很不错了
。总而言之,OpenMP对海量线程的优化还是不行,sweet spot停留在8-32线程并行。也
许是kernel thread的模型决定了OpenMP thread的overhead太高,不像GPU那么
lightweight。MPI倒是能做的不错,但是要这么多的进程内存又不够。最大的优点是可
以直接用现有的x86代码(绝大多数已经支持MPI+OpenMP了),不用像GPU需要重新
fork出来写CUDA,然后maintain两套codebase |
|
y**b 发帖数: 10166 | 42 有啥解释吗?
是总体上跟以下因素有关?
mpi靠手工分块(分区)决定计算粒度,这个常常就是一种优化;
而openmp靠机器决定计算粒度,通常太细而overhead太大。
还是跟编译器和底层硬件更有关系?
我做的一种密集颗粒碰撞模拟,也是mpi明显优于openmp,原计划在几千个
节点上采用hybrid mpi/openmp模式,最后发现还是pure mpi模式快得多,
跨五个数量级的模拟都给出同样结论。当然我这个模拟跟那些专门的测试
有所区别,毕竟有其它因素影响:比如有小量代码不适合openmp化,有些
地方加锁,算法还可进一步改进等等。 |
|
l******9 posts: 579 | 43 I am also thinking about OpenMP.
But how do I make sure that OpenMP makes full use of the available
cores?
Suppose I have 24 CPUs, each of them with 6 cores (each core
supporting hyperthreading).
I have 10,000 computing tasks, each of which needs 0.001 second.
Some of the tasks need to exchange data, which is very small.
Which task needs to send/receive data to/from which task is pre-defined; it
is known before the program is run.
But the exchange frequency may be very high.
I want to schedule task... [full post truncated] |
|
y**b 发帖数: 10166 | 44 【 以下文字转载自 Linux 讨论区 】
发信人: yanb (大象,多移动一点点), 信区: Linux
标 题: 如何查看一个程序/进程使用了哪些cpu?
发信站: BBS 未名空间站 (Tue Sep 25 01:10:18 2007), 站内
该程序使用了MPI或OpenMP, 在一个有8个Intel Quad-core(也就是32个core)的
linux服务器上运行.请问有什么命令能看出这个程序使用了哪些cpu及占用率?
目的主要是想直接看看该程序是否真正利用上了MPI或OpenMP。比如OpenMP,
设置OMP_NUM_THREADS=4或8或16...皆能运行,但从处理器结构来看应该是4
才有实际意义,8、16、32究竟是怎么回事? 还有MPI,用下面命令运行
mpirun -np 8或16或32...究竟是否分配到不同cpu上面了? |
|
c******n 发帖数: 16666 | 45 说来比较悲催 非cs专业,搞了个小程序跑模拟,数据量小的时候还好,数据量一大先
是内存挂了。后来跑去ec2租了个大内存服务器发现跑得还是很慢,仔细一看,有个
function算得特别慢,因为是n*n的复杂度,数据量上去了计算时间马上跳了等量级上
升。自己又是一知半解的,不知道哪位能帮着改进下算法然后提示下OpenMP该怎么做。
简而言之,是个关于水文的模拟,计算流域面积,所以数据的基本单位/对象就是node
。 有两个linked-list(求别吐槽用这个而不用vector,摊子摊太大了 改起来不容易
,或者如果我现在添加一个vector,复制现有list行不?)里面存的都是node之间的指
针。
第一个linked-list存的所有node的指针,按照node的ID存放,方便遍历所有node
第二个linked-list,其实不止一个,存的是所有在当前node的下游的node的指针,遍
历的话可以从当前node一直走到当前mesh的边界
流域面积的具体计算,就是当前node自己的面积加上其所有上有点面积的总和
比如在下图中,
a b c d e
... 阅读全帖 |
|
W***o posts: 6519 | 46 try:
gcc -fopenmp -lpthread xxx.cpp
OpenMP work is easier on Linux, though some Linux installs don't ship the OpenMP and MPI libraries. Last time I used OpenMP and MPI for multithreaded synchronization/barrier locks, I found Ubuntu was missing both. |
|
k**********g posts: 989 | 47
Step into the disassembly, or use a CPU instruction profiler like AMD CodeAnalyst or Intel VTune.
If this 0.5-second delay only occurs on the first call after application launch, I think it is an inevitable cost of using OpenMP. If it happens on every call, then it needs investigating.
With the debugger attached, check how many OpenMP threads are created. Also make sure the EXE and DLL are linking against the correct OpenMP library. |
|
w***g 发帖数: 5958 | 48 你有benchmark吗? 你这么说我很涨见识. 我见过的几个, openblas有openmp或者
thread版,
opencv用tbb, fftw用openmp, 还没见过哪个单机跑的轮子用MPI的. 你没有用32MPI我
觉得
就是一个证据, 就是MPI还做不到底. 但是即使是4x8或8x4能把OpenMP干掉我觉得也很
牛. |
|
t*****z 发帖数: 812 | 49 假设稀疏矩阵用CRS方式存储,为什么我的openmp并行不好?
#pragma omp parallel for private(i,j,t)
for(i=0; i
t = 0.0;
for(j=A.ptr[i];j
t += A.value[j] * x[A.index[j]];
y[i] = t;
}
n=400,000. 2,4,8threads 运行的时间差不多,比1thread w/ openmp快,根1thread w
/o openmp差不错
做iterative solver 大家出出点子? |
|
z*******h 发帖数: 346 | 50 也许是我孤陋寡闻了,我怎么没听说过在Hadoop cluster上用openMP or MPI的。MPI根
本就不可能用,openMP也没必要啊。 |
|