e***a Posts: 18 | 1 I understand your statement, but I want to get to the bottom of it.
I was asked about this during an interview.
They asked me why a mutex built on Peterson's algorithm fails on modern
processors
(see http://en.wikipedia.org/wiki/Peterson%27s_algorithm):
"Many modern CPUs reorder instruction execution and memory accesses to
improve execution efficiency. Such processors invariably give some way to
force ordering in a stream of memory accesses, typically through a memory
barrier instruction |
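The point in the quote can be illustrated with a minimal sketch, assuming C++11 `std::atomic`. The class name `PetersonLock` and its member names are hypothetical, not taken from the post or the article. Every access uses the default sequentially consistent ordering, which is what keeps the load of the other thread's flag from being moved ahead of the store to our own flag; with plain variables or relaxed ordering that reordering is allowed, and both threads can enter the critical section.

```cpp
#include <atomic>
#include <cassert>

// Two-thread Peterson lock (illustrative sketch, hypothetical names).
// All loads and stores use std::memory_order_seq_cst, the default.
class PetersonLock {
public:
    PetersonLock() {
        interested_[0].store(false);
        interested_[1].store(false);
        turn_.store(0);
    }

    // self must be 0 or 1, and each thread must use a distinct value.
    void lock(int self) {
        assert(self == 0 || self == 1);
        const int other = 1 - self;
        interested_[self].store(true);  // announce that we want to enter
        turn_.store(other);             // give priority to the other thread
        // Spin while the other thread wants in and it is its turn.
        while (interested_[other].load() && turn_.load() == other) {
        }
    }

    void unlock(int self) {
        assert(self == 0 || self == 1);
        interested_[self].store(false);
    }

private:
    std::atomic<bool> interested_[2];
    std::atomic<int> turn_;
};
```

Each of the two threads would call `lock(id)` before and `unlock(id)` after its critical section, with ids 0 and 1. On real code `std::mutex` is the better choice; the sketch only shows where the ordering guarantee is needed.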
|
r****t Posts: 10904 | 2 Use processes, up to the number of CPUs; beyond that there is no benefit.
Servers in Python are usually implemented with Twisted. Twisted + Perspective Broker + a small amount of threading is basically
Python's answer to this problem. |
|
w*****1 Posts: 473 | 3 On my own computer the result is 60, while on the computer in my
office the result is nan or -12345678998765443322.
Why do different CPUs give such different results? Thanks! |
|
x******c Posts: 13 | 4 What I said above applies to modern Intel/AMD CPUs.
No one can say for sure without profiling/benchmarking, though.
As a general rule: don't write code like that.
If your code is not fast enough, use a profiler to identify the most
performance-critical parts, then try these to see if they help. |
|
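O*******d Posts: 20343 | 5 When switching between two images I use color blending. Windows GDI has no particularly good way to blend colors,
and I did not plan to use OpenGL for this screensaver, so I wrote my own color-blending class. The main idea
is to trade space for time: blending is done by table lookup. The two color values, whichever channel (red, green, or blue), form the index
into a 2-D array, and the entry at that position is the precomputed blended color.
#include
class ColorBlender
{
public:
    ColorBlender(double alpha) { mAlpha = alpha; mLookupTable = NULL; CreateTable(); }
    ~ColorBlender() { delete [] mLookupTable; }
    unsigned char Blend(unsigned char firstColor, unsigned char secondColor);
    void Blend(const unsigned char * pFirstColor, const unsi... |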
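Since the posted class is cut off, here is a minimal sketch of the table-lookup idea it describes. The class name `BlendTable`, the flat 256 x 256 layout, and the convention that `alpha` weights the first color are all assumptions, not the author's actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Precomputes alpha * first + (1 - alpha) * second for every pair of
// 8-bit channel values, so blending a pixel channel is one table read.
class BlendTable {
public:
    explicit BlendTable(double alpha) : table_(256 * 256) {
        assert(alpha >= 0.0 && alpha <= 1.0);
        for (int first = 0; first < 256; ++first) {
            for (int second = 0; second < 256; ++second) {
                // + 0.5 rounds to nearest; the result always fits in 0..255.
                const double mixed = alpha * first + (1.0 - alpha) * second + 0.5;
                table_[first * 256 + second] = static_cast<std::uint8_t>(mixed);
            }
        }
    }

    // The same table serves the red, green, and blue channels.
    std::uint8_t blend(std::uint8_t first, std::uint8_t second) const {
        return table_[first * 256 + second];
    }

private:
    std::vector<std::uint8_t> table_;  // 64 KiB, indexed by first * 256 + second
};
```

The table costs 64 KiB per alpha value, which is the space being traded for time: a cross-fade with many alpha steps needs either one table per step or a rebuild for each frame.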
|
n**f Posts: 121 | 6 I am trying to install GCC on an AIX machine, which has 8 CPUs and
runs AIX 6.
I wonder which version of GCC I should install. A couple of years ago I used
gcc 3.4.x and later 4.4.1. Now it seems to be up to 4.6.1.
The main objective is to be able to do standard, non-fancy C++ programming.
Compile speed is not a major concern, as my projects are not big.
Many thanks! |
|
c*****e Posts: 737 | 7 BSR and BSF are available on every CPU from the 386 onward.
Moreover, the optimal solution depends on the architecture.
Using __builtin_popcount() is a good idea: if the CPU supports POPCNT it
becomes a single fast instruction; otherwise it is a bit slower. |
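A minimal sketch of the builtins mentioned above, assuming GCC or Clang. `__builtin_popcount`, `__builtin_ctz`, and `__builtin_clz` are real compiler builtins; the wrapper names and the portable fallbacks are hypothetical additions for illustration. Note that `__builtin_ctz` and `__builtin_clz` are undefined for an argument of 0, hence the assertions.

```cpp
#include <cassert>
#include <cstdint>

// Portable fallback: clears the lowest set bit once per iteration.
inline int popcount_portable(std::uint32_t x) {
    int count = 0;
    while (x != 0) {
        x &= x - 1;
        ++count;
    }
    return count;
}

// Number of set bits; uses the compiler builtin when available.
inline int popcount_fast(std::uint32_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcount(x);
#else
    return popcount_portable(x);
#endif
}

// Index of the lowest set bit, which is what BSF computes. x must be non-zero.
inline int lowest_set_bit(std::uint32_t x) {
    assert(x != 0);
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_ctz(x);
#else
    int index = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++index;
    }
    return index;
#endif
}

// Index of the highest set bit, which is what BSR computes. x must be non-zero.
inline int highest_set_bit(std::uint32_t x) {
    assert(x != 0);
#if defined(__GNUC__) || defined(__clang__)
    return 31 - __builtin_clz(x);
#else
    int index = 0;
    while (x >>= 1) {
        ++index;
    }
    return index;
#endif
}
```

With GCC, the builtin only turns into the hardware POPCNT instruction when the target is allowed to use it (for example with `-mpopcnt` or a suitable `-march`); otherwise the compiler emits a software routine.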
|
h*****f Posts: 248 | 9 Hmm... I think my code didn't reflect my point. One more try:
#include <pthread.h>
#include <vector>
class A {
    std::vector<int> m_x;
public:
    A() {
        // insert some integers into m_x
        pthread_t t;
        pthread_create(&t, NULL, doWork, &m_x);
    }
private:
    static void* doWork(void* p) {
        std::vector<int>* x = static_cast<std::vector<int>*>(p);
        // ... some computation that uses m_x;
        // at this point, m_x might not be available on ALL CPUs' cache as A
        // might not be completely... |
|
w****k Posts: 6244 | 10 Python threads are not truly parallel (in CPython the GIL lets only one thread run Python code at a time);
use multiprocessing if you want to use multiple CPUs. |
|
k**********g Posts: 989 | 11
(I'm a Microsoft fan.)
This post talks about integer widths from the language / compiler
perspective. It is strictly unrelated to the design or limitations of
the underlying CPU architecture. In all cases, the compiler has the
responsibility of enabling, and if necessary emulating, integer types of
various widths.
Turbo Pascal supports 32-bit integer types on 16-bit DOS processes and CPUs.
It provides the REAL floating point type, which is 6 bytes (48 bits) and is
strictly emulated in software. |
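As an illustration of how a compiler can provide a 32-bit integer on a 16-bit CPU, here is a minimal sketch of 32-bit addition built from 16-bit halves plus a carry, the same shape as an add followed by an add-with-carry. The type `U32Pair` and the function names are hypothetical; this is not Turbo Pascal's actual code generation.

```cpp
#include <cassert>
#include <cstdint>

// A 32-bit value held as two 16-bit machine words.
struct U32Pair {
    std::uint16_t lo;
    std::uint16_t hi;
};

inline U32Pair split(std::uint32_t value) {
    U32Pair r;
    r.lo = static_cast<std::uint16_t>(value & 0xFFFFu);
    r.hi = static_cast<std::uint16_t>(value >> 16);
    return r;
}

inline std::uint32_t join(U32Pair p) {
    return (static_cast<std::uint32_t>(p.hi) << 16) | p.lo;
}

// Adds the low words first, then the high words plus the carry out of
// the low addition. The result wraps modulo 2^32.
inline U32Pair add32(U32Pair a, U32Pair b) {
    // A wider temporary is used here only to observe the carry bit.
    const std::uint32_t lo_sum = static_cast<std::uint32_t>(a.lo) + b.lo;
    const std::uint32_t carry = lo_sum >> 16;
    assert(carry <= 1);  // a 16-bit add carries out at most one bit
    U32Pair r;
    r.lo = static_cast<std::uint16_t>(lo_sum & 0xFFFFu);
    r.hi = static_cast<std::uint16_t>((a.hi + b.hi + carry) & 0xFFFFu);
    return r;
}
```

For example, `join(add32(split(0x0001FFFF), split(1)))` yields `0x00020000`: the low words overflow and the carry is propagated into the high word.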
|
n***e Posts: 723 | 12 Huh, you are correct on this. My apologies. |
|
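f****4 Posts: 1359 | 13 First of all, any design involves trade-offs. If you won't admit that, I can't discuss this with you. All I can
say is that even if I believed your design was optimal in every respect, I think getting to that optimum adds system
complexity, and I would still consider that part risky to implement.
Both of them did, more or less, mention how to handle tickets. I can't be bothered to work through it.
Look at my analysis of Wei's design: for adding new train services, I explicitly said to sacrifice time and spend an
hour on the import. An import is simple: if it fails, run it again. It is an in-memory operation on a single machine. Even if copying from a USB
stick fails, you just do it again. I also stated an assumption there: no online compatibility with the existing ticketing system; otherwise
the discussion gets even more complicated.
For segment tickets, Wei's design runs on a single host machine. goodbug also agrees that a host can do it; he just thinks a host this
cheap cannot handle such high throughput. Look at what I cited: a server with 90+ CPUs and 36G, USD 50,000.
I think that if this is to be implemented, the budget has to go up a bit; otherwise the discussion cannot continue. After that it comes down to
the difference between a single-threaded and a multi-threaded implementation. Single-threaded: no extra budget, and the benefit is that memory needs no locking. Multi-threaded: extra budget,
and the drawback is a somewhat more complex implementation. But for a ticket-selling, C++, single-machine program, do you really think the implementation risk is that
high?
Reading goodbug's later replies, I can tell he still has not understood exactly which... |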
|
g*****g Posts: 34805 | 14 The performance gain of C++ over Java comes from startup time, JIT warm-up,
and JIT binary code compilation, but not from memory reclamation. As a matter of fact,
Java memory reclamation can be faster than C++ unless you work hard to
optimize memory reclamation on the C++ side. The reasons are:
1. Java runs garbage collection on a separate thread or threads, while C++ code
typically runs it in the main thread. In a multicore environment, which is
commonplace today, Java has the advantage.
2. When CPUs are loade... |
|
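T********i Posts: 2416 | 15 This is all the servers added together; they wrote it on their own blog.
The efficiency is pretty good. Anyone claiming to have written it in PHP in a week is talking nonsense.
But there is nothing special about it either.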
Standard user facing server:
Dual Westmere Hex-core (24 logical CPUs);
100GB RAM, SSD;
Dual NIC (public user-facing network, private back-end/distribution); |
|
g*********e Posts: 14401 | 16 Of course, if you have multi-core CPUs.
A swap is just a context switch; sync/lock can be expensive, depending on the
program. |
|
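S*A Posts: 7142 | 17 I misunderstood at the time; I was talking about a PC. On ARM the peripherals are of course much simpler.
dmesg actually shows timestamps, so you can see where the time is being
spent. A ramdisk is not necessarily fast, because zimage/initram is read through BIOS
int 13, one or two sectors at a time.
On my PC, looking at dmesg and only at the biggest items:
ACPI and PCI scanning take some time.
Booting the other CPUs takes some time (IPI calls, SMP).
Initializing SATA needs a port reset, and entering SATA 3G mode
needs another port reset; these all have fixed reset times.
USB is the same: port-resetting all devices takes time.
By the time it enters the init ram disk, 2 seconds have already passed.
Finding USB devices requires another port reset.
Then mounting file systems, and so on.
[ 0.156718] ACPI: All ACPI Tables successfully acquired
[ 0.183241] smpboot: CPU0: Intel(R) Xeon(R) CPU ... |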
|
z****e Posts: 54598 | 19 http://www.techempower.com/blog/2014/03/04/one-million-http-rps
Keyword: undertow
As we and our collaborators prepare Round 9 of our Framework Benchmarks
project, we had an epiphany:
With high-performance software, a single modern server processes over 1
million HTTP requests per second.
Five months ago, Google talked about load-balancing to achieve 1 million
requests per second. We understand their excitement is about the performance
of their load balancer. Part of what we do is performance consulti... |
|
g*****g Posts: 34805 | 20 http://en.wikipedia.org/wiki/Scalability
To scale vertically (or scale up) means to add resources to a single node in
a system, typically involving the addition of CPUs or memory to a single
computer
You really are a genius: you come here showing off without even understanding what scale up means. Did you see "single node" and
"add cpu/memory"? |
|
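g*****g Posts: 34805 | 21 The definition of scale up spells out "add CPU to a single node/computer" clearly. A node is a
computer; if a node were a CPU, how would you add resources to a CPU?
You have given a precise demonstration of shamelessness: you would rather be proven wrong in public than back down.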
To scale vertically (or scale up) means to add resources to a single node in
a system, typically involving the addition of CPUs or memory to a single
computer |
|
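i**i Posts: 1500 | 22 To scale vertically (or scale up) means to add resources to a single node in
a system, typically involving the addition of CPUs or memory to a single
computer.
"So I say adding CPUs or cores is essentially SO (scale out)" is your personal interpretation. Adding CPUs is typical SU (scale up). Multiple CPUs can
be scheduled effectively by the operating system and, unless you are after special effects (task affinity), can be treated as transparent.
So the effect is similar to other hardware acceleration. The database access contention you mention is basically arguing for the sake of arguing. When does a single
access ever lock the whole database, or lock a large number of records?
Then: "when I talk about SCALABILITY I specifically mean SO". At the same time, you treat SU as SO.
Bro, do you realize how muddled that is? |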
|
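z****e Posts: 54598 | 23 Also, only node needs IPC;
the other frameworks all use ITC (inter-thread communication).
That is why node is a poor design.
Programming as if everything were single-threaded is the goal,
but if you really make everything single-threaded,
what happens when that thread dies?
Even if some programmer just carelessly writes a piece of blocking code,
you are in just as much trouble.
Of course, this problem still exists with ITC,
but at the very least the other running threads offset it to some extent.
My guess is that this is why node shows dropped packets in most reports:
probably that one thread ran into trouble, with all the eggs in one basket,
so problems are inevitable. vert.x says: I create threads according to how many cores the CPU has.
For one dual-core CPU, vert.x starts two threads; with two such CPUs
it starts 2*2=4 threads, so if one of the threads gets blocked the server still does not go down, right?
Of course, blocking is a mistake in asynchronous programming, and handling communication inside one process
is far better than handling it across multiple processes. Communicating between different processes is painful:
you have to call third-party tools and hit all sorts of messy problems. So you are still using node's mindset to look at other multi-threaded... |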
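The sizing rule described in this post (one event-loop thread per core, so two dual-core CPUs give 2*2=4 threads) can be sketched in C++ as follows. This is only an illustration of the rule, not vert.x's own Java API; the function names are hypothetical, while `std::thread::hardware_concurrency()` is the standard call and may return 0 when the core count is unknown.

```cpp
#include <cassert>
#include <thread>

// One event-loop thread per detected core; fall back to a single
// thread when the core count could not be determined (reported as 0).
inline unsigned event_loop_threads(unsigned detected_cores) {
    return detected_cores == 0 ? 1u : detected_cores;
}

// Thread count for the machine this code is running on.
inline unsigned default_event_loop_threads() {
    return event_loop_threads(std::thread::hardware_concurrency());
}

// The post's example: sockets * cores per socket, e.g. 2 * 2 = 4.
inline unsigned threads_for(unsigned sockets, unsigned cores_per_socket) {
    assert(sockets > 0 && cores_per_socket > 0);
    return event_loop_threads(sockets * cores_per_socket);
}
```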
|
g*****g Posts: 34805 | 25 You don't; it's normal for a web server to have dozens of CPUs and hundreds
of threads. |
|
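w***g Posts: 5958 | 26 I looked into this just recently. mTCP is not good enough. The best is seastar, the one scylladb uses.
You even replied to my post.
seastar ships with its own HTTP client called seawreck, which can run on their DPDK stack.
Their page is here: https://github.com/scylladb/seastar/wiki/HTTPD-benchmark
After a lot of effort I just got it running too, with two machines connected back to back over 1G NICs.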
Server: 10.2.0.20:10000
Connections: 256
Requests/connection: dynamic (timer based)
Requests on cpu 0: 7714601
Requests on cpu 1: 7529698
Requests on cpu 3: 7504488
Requests on cpu 2: 7484931
Total cpus: 4
Total requests: 30233718
Total time: 60.002058
Requests/sec: 503878.016799
That is 504K req/s. What I posted yesterday... |
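The reported rate can be re-derived from the per-CPU counters and the total time in the output above. A small sketch, with hypothetical function names: sum the per-CPU request counts and divide by the elapsed seconds.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Sum of the "Requests on cpu N" counters.
inline std::uint64_t total_requests(const std::vector<std::uint64_t>& per_cpu) {
    return std::accumulate(per_cpu.begin(), per_cpu.end(), std::uint64_t{0});
}

// Requests per second over the whole run.
inline double requests_per_second(const std::vector<std::uint64_t>& per_cpu,
                                  double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(total_requests(per_cpu)) / seconds;
}
```

With the four counters from the run (7714601, 7529698, 7504488, 7484931) the total is 30233718 requests, and dividing by 60.002058 seconds gives roughly 503878 requests per second, matching the benchmark's own summary lines.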
|
f*******t Posts: 7549 | 27 It wasn't a single machine playing Lee Sedol, was it? 1920 CPUs and 280 GPUs |
|
n******7 Posts: 12463 | 28 Look at the Nature paper:
single-machine version: 48 CPUs, 8 GPUs max
distributed version: 1920 CPUs, 280 GPUs, roughly 40 machines |
|
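t**********y Posts: 374 | 30 For an ordinary, arbitrary Perl or Python script, can mpirun make it run faster?
Is there an easy-to-understand mpirun instruction/example you can recommend?
What is the difference between mpirun and requesting multiple threads/CPUs?
Thanks! |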
|