COMSOL Multiphysics performance on 4p Opteron system
Posted 14 Jul 2015, 10:39 GMT-4 Studies & Solvers, Structural Mechanics Version 5.0 11 Replies
I have a performance issue with COMSOL Multiphysics on a new system.
I apologize in advance: this will be a long post.
Recently, we have built a new rig in our institute:
Supermicro H8QGI-FN4L motherboard
4x Opteron 6328 (8 integer cores and 4 FPUs per CPU, i.e. 32 int. cores and 16 FPUs in total, 3.2GHz but with TurboCore 3.8GHz if max. 4 integer cores/CPU is used and 3.5GHz if more than 4 integer cores/CPU is used)
4x Noctua NH-U9DO A3 cooler
Kingston ValueRam ECC Registered 1600MHz CL11 DDR3 (4x KVR16R11S4K4/32i kit, one for each CPU, i.e. 4x8GB modules per CPU; 128GB in total)
FirePro W4100
Samsung 850 Pro 128GB SSD, WD Caviar Green 2TB (64MB cache), Asus Xonar DGX sound card, Asus DVD writer (SATA)
EVGA SuperNOVA 1200 P2 PSU
Memory and CPU were tested and seem fine. The RAM modules were placed in accordance with the motherboard manual, i.e. one module per memory channel, so every channel is used (16 memory channels in total).
numactl --hardware shows that RAM is distributed almost evenly among all 8 NUMA nodes. The first NUMA node is about 70MB smaller, which I think is reserved by the kernel or by IPMI (I haven't found out yet).
SSD has 26GB root partition and remaining is for /home.
HDD has 128GB for swap (just to be sure), remaining is for data.
OS is Debian 8 (Jessie) 64bit.
Comsol Multiphysics 5.0 has been installed (NSL licence), but seems very SLOW.
Wrench.mph model from Model Library was modified (max. element size 0.0005 and min. element size 0.00005) to set DoF higher (to 4,133,121), in order to have multithreaded calculations take longer.
On the new rig, it takes 4-5 minutes to solve this model (in BIOS, NUMA mode and memory bank interleaving and memory channel interleaving are enabled, memory node interleaving is disabled). If I explicitly set the number of NUMA nodes to 8 with the flag -numasets 8, then it takes about 3.5 minutes to solve. In that case, it seems from htop that Comsol uses just the first two CPUs (all cores of them). Without this flag, Comsol uses all CPUs, but takes longer to solve. Used physical memory is about 12GB, virtual memory is about 20GB (from COMSOL log), no swapping (as it can be seen in htop). TurboCore is working, monitored by cpufreq-aperf.
The old Intel rig we have:
P6X58D-E motherboard
1x Intel Core i7 950 CPU (4 physical cores, 3.07GHz) with stock fan
24GB DDR3 as 6x4GB modules, non-ECC, unregistered, 1066MHz, CL8
NVIDIA Quadro FX1700
no SSD just HDDs
Debian 7 Wheezy 64bit
Without any COMSOL flags, it takes 3.5 minutes to solve the previously mentioned model, too !!!
On the new machine, Comsol sees 16 cores because of the number of FPUs. If I force it to use 32 cores (with the -np 32 flag), it complains that only 16 physical CPUs are present, and the simulation takes a bit longer than with -np 16.
Apart from these facts, I think that the simulation SHOULD BE AT LEAST 4 TIMES FASTER on the new rig than on the old one (4x more memory channels, 4x more FPUs, higher frequency, newer architecture).
Is it possible that Comsol uses non-optimized code/BLAS for solving models on AMD CPUs?
By default, it uses MKL (as I can see in the Comsol 5.0 Release Notes), and if I set the BLAS library to acml (instead of mkl) with the -blas flag (i.e. using the ACML shipped with Comsol), it is a bit slower.
I think maybe Comsol's ACML library does not use FMA4/FMA3 and other new instruction sets on AMD Opteron.
I have downloaded the newest ACML (6.1) from AMD's web site, but don't know how to set it up properly for Comsol.
I have played with settings in BIOS, e.g. NUMA enabled/disabled; memory node interleaving enabled/disabled; CPU-specific options like HPC, CPB, etc.; IOMMU (if that matters at all).
My question is: what do you suggest to boost performance? If anybody has a system like ours, how did she/he configure it?
Can you suggest a benchmark to test whether this system is properly configured?
One last note: our active subscription ended at the end of 2014, so the last COMSOL version we can use is version 5.0.
Thank you for your help in advance.
On the new machine, I have run the Wrench model with the above-mentioned meshing modifications, starting comsol with different flags. The runs finished with the following running times:
Setting cores with -np flag:
comsol -np 1 -> 579 seconds
comsol -np 2 -> 466 seconds
comsol -np 4 -> 287 seconds
comsol -np 8 -> 292 seconds
comsol -np 16 -> 265 seconds
Setting numa nodes with -numasets flag:
comsol -numasets 1 -> 282 seconds
comsol -numasets 2 -> 200 seconds
comsol -numasets 3 -> 200 seconds
comsol -numasets 4 -> 226 seconds
comsol -numasets 5 -> 209 seconds
comsol -numasets 6 -> 255 seconds
comsol -numasets 7 -> 217 seconds
comsol -numasets 8 -> 205 seconds
In every case, DoF was 4,133,121.
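For reference, the -np timings above can be turned into speedup and parallel efficiency figures. This is only a small Python sketch, not anything COMSOL provides; the dictionary simply copies the times from the list above, and the variable and function names are made up for illustration.

```python
# Wall-clock times (seconds) copied from the -np runs listed above.
np_times = {1: 579, 2: 466, 4: 287, 8: 292, 16: 265}


def speedup_table(times):
    """Return {cores: (speedup, efficiency)} relative to the single-core run."""
    t1 = times[1]
    return {n: (t1 / t, t1 / t / n) for n, t in sorted(times.items())}


table = speedup_table(np_times)
for n, (speedup, efficiency) in table.items():
    print(f"-np {n:2d}: speedup {speedup:4.2f}, efficiency {efficiency:5.1%}")
```

Read this way, the speedup levels off at roughly 2x from -np 4 onward, which hints that a large share of this model's run time is not helped by extra cores.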
Any suggestions? Or is the problem too small to scale well?
If we had more nodes but fewer cores/CPUs per node (e.g. 2 compute nodes connected with a 1Gbit connection or InfiniBand, 2 CPUs/node), would it be faster, utilizing all cores?
Maybe distributed memory computation is better optimized than shared memory computation?
Thank you for your help.
We have a lot in common here. I used to use Debian, but have switched to Ubuntu-14.04 for my work machines, but still run Debian-8 at home. Our cluster at work is running RHEL-5.11 (very old OS I know). We use a mixture of Intel and Amd64, but our cluster is amd64 with the shared fpu as you describe.
I have not studied the bios settings to the extent you have, so I cannot comment on what is best. I did some benchmarks on our cluster with COMSOL, and published at the 2010 COMSOL conference. I have inserted the link below for you. We have improved our cluster since then, but still are very happy with the COMSOL performance in parallel.
www.comsol.com/paper/exploiting-new-features-of-comsol-version-4-on-conjugate-heat-transfer-problems-7970
In general the parallelism is always better with the shared-memory mode, and slows down with distributed parallel. You always want to optimize with the shared-memory first, then move on to distributed using the highest-speed to connect the nodes that you can get. The COMSOL floating network license will provide this without additional cost per node.
We have observed that the Intel chips, which have a 1-to-1 FPU with each core, will sometimes outperform an amd64 chip with shared FPU. For instance, a 12-core Intel is faster than a 16-core amd64 since, due to the shared FPU, it is effectively 8 cores, not 16. We also have observed that COMSOL does not support hyperthreading, which they state clearly. Indeed, COMSOL now ignores the hyperthreaded cores, and you do not need to turn hyperthreading off in the BIOS as with older versions. So, if the OS shows 24 cores, of which 12 are hyperthreaded and only 12 are physical cores, then you will only get the performance of 12 physical cores. It has been our observation that if a code does support hyperthreading, then it will only get the effect of about 1/3 of a core anyway from hyperthreading. But 1/3 is worth having if you can get it, and I wish COMSOL would support it.
I have not tried the numasets switch, that must be a relatively new switch on comsol.
I used to use specific values for the blas switch, but now I just use auto. You can get the latest and greatest as you did, but I do not think you will get much better performance. COMSOL will eventually include the updated math libraries from Intel anyway.
Another trick that the comsol folks told me about is the nh switch, or number of hosts. For our amd64 we have 2 physical processors that are connected on the motherboard. There is a slight slowdown when you go from one processor to another rather than using all the cores within a processor. So, for our case where we have 8 cores on each of two processors for a total of 16 cores on the motherboard, you can set the switches on comsol as
-nh 1 -np 16
or
-nh 2 -np 8
and get slightly greater speedup with this 2nd way of doing it.
Hope this helps. Maybe if you get back on the subscription service, you could get these issues resolved with COMSOL's help. They have a great technical staff of people who are firmly interested in getting COMSOL to perform as best it will for their customers.
thank you for your helpful answer.
It seems that using -nh flag (it should be -nn, right?) is not an option in the case of NSL.
Besides your useful suggestions, I have read some posts about task parallelism on the COMSOL Blog, and now I have a slightly better understanding of the topic (scaling of the problem, shared vs. distributed memory parallelism).
If I understand it well, in case of shared memory parallelism (in a single computer), every single core accesses the same part of the physical memory. Therefore, if the problem is small enough, e.g. it fits in the memory part belonging to a particular processor, then it will scale well until the number of cores involved reaches the number of cores present in that particular processor (in the case that cores from the same processor are utilized first, I do not know), and after that the scaling gets worse due to the lower interconnection speed of processors. In my case, this critical number is four because four FPUs are present per CPU.
Is it right that, with an FNL license, COMSOL can use distributed memory parallelism in a single computer too, by partitioning the problem between NUMA nodes (or processors), so it can solve faster than with shared memory parallelism? Or does it use distributed memory parallelism only among different computers (i.e., between motherboards)?
Given the existence of NUMA, it could make sense (at least to me) to use this feature between NUMA nodes, because accessing memory in the same NUMA node where the particular CPU core lies is faster than accessing other ones, but I have no such knowledge of programming. If that is the case, then FNL may be superior to NSL in this respect at least (and for the same price tag, as far as I know).
One question about FNL: how many concurrent sessions can be used on/from a single machine? With NSL, I can run up to 4 jobs in parallel utilizing batch sweep. Is that restriction removed in FNL?
Thank you for your help.
Hello James Freels,
thank you for your helpful answer.
It seems that using -nh flag (it should be -nn, right?) is not an option in the case of NSL.
Ah! It seems that the -nh flag has been replaced with the numasets flag. The nh flag is probably still there, but does not show up in the -help list anymore. I don't use it anymore, because I found it does not help enough to worry about. It will probably be the same for numasets. I guess it depends on your hardware. Perhaps my hardware is sharing all the memory by default. I will test when I get time.
Besides your useful suggestions, I have read some posts about task parallelism on the COMSOL Blog, and now I have a slightly better understanding of the topic (scaling of the problem, shared vs. distributed memory parallelism).
If I understand it well, in case of shared memory parallelism (in a single computer), every single core accesses the same part of the physical memory. Therefore, if the problem is small enough, e.g. it fits in the memory part belonging to a particular processor, then it will scale well until the number of cores involved reaches the number of cores present in that particular processor (in the case that cores from the same processor are utilized first, I do not know), and after that the scaling gets worse due to the lower interconnection speed of processors. In my case, this critical number is four because four FPUs are present per CPU.
That sounds correct to me. When you say "CPU" I take it that you mean a full motherboard or node on a cluster. You can have several processors on a CPU, node, or motherboard. And like you said, once you use up all the shared memory on the node, then you need to go to distributed parallel processing to run your simulation. If you start swapping to the hard drive, forget it. It will be too slow.
Then there is the game of solvers. You can use a direct solver or iterative solver; the iterative is more difficult to set up. However, if a problem gets too large, the direct solver can be slower than the iterative solver in some cases. It is a tradeoff game.
Is it right that, with an FNL license, COMSOL can use distributed memory parallelism in a single computer too, by partitioning the problem between NUMA nodes (or processors), so it can solve faster than with shared memory parallelism? Or does it use distributed memory parallelism only among different computers (i.e., between motherboards)?
Like I said before, I saw a small incremental performance gain using nh (now numasets), but nothing significant. So now I just let the code take care of it, and I set np to the total cores on the shared-memory motherboard. I will test numasets though.
Given the existence of NUMA, it could make sense (at least to me) to use this feature between NUMA nodes, because accessing memory in the same NUMA node where the particular CPU core lies is faster than accessing other ones, but I have no such knowledge of programming. If that is the case, then FNL may be superior to NSL in this respect at least (and for the same price tag, as far as I know).
I am not sure if FNL exclusively uses numasets. Are you saying that NSL does not seem to work with numasets? I think it should, because it is not distributed but on the motherboard.
One question about FNL: how many concurrent sessions can be used on/from a single machine? With NSL, I can run up to 4 jobs in parallel utilizing batch sweep. Is that restriction removed in FNL?
I don't know the correct answer to that. I think on Linux machines you may be limited to one COMSOL instance on the node per license. If you have another instance, I think it uses another license. On the other hand, the Windows desktops seem to be able to have many instances of COMSOL running without using more than one license, whereas the Linux desktops use up a license for each instance. I don't know why that is, and I have not really tested it, but my students pointed this out to me. We are a Linux shop, but some applications require Windows (SolidWorks, COMSOL Application Builder, etc.).
Thank you for your help.
thank you for your thoughts, again.
Interestingly, running the aforementioned wrench model with 'comsol -numasets 8' took about a minute less than with 'comsol -np 16'. Monitoring processes with htop during the FEM calculation, it seemed that in the first case (-numasets 8) mainly cores 1-16 were used (during intense computing steps, all of them were at almost constant 100% and total CPU usage was about 1600%) and cores 17-32 were hardly used; while in the second case (-np 16), every core was used to some extent (quickly changing usage for each, but 1600% CPU usage in total here, too).
As for the CPU scalability, I meant CPU, not compute node (since I can't test it because of our license type). To make it clearer what I meant:
In our new machine, NUMA is enabled in BIOS and memory node interleaving is disabled (i.e., a particular processor first uses its own memory modules, and if they fill up then it uses another processor's, too). Because AMD Opteron 6328 CPUs in fact have two computing nodes (let's call them 'small CPUs', each with 2 FPUs, 4 integer cores and 2 memory channels; in our case, 2x8GB modules per 'small CPU', so one memory module for each memory channel), Debian sees a 'small CPU' (in fact, half a CPU) with its 16GB memory (on 2 channels) as one NUMA node. So for our quad-CPU system, Debian sees 8 NUMA nodes in total. This can be checked with the following:
zoltan@IC-Workstation:~$ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3
node 0 size: 16084 MB
node 0 free: 15349 MB
node 1 cpus: 4 5 6 7
node 1 size: 16159 MB
node 1 free: 15519 MB
node 2 cpus: 8 9 10 11
node 2 size: 16159 MB
node 2 free: 15949 MB
node 3 cpus: 12 13 14 15
node 3 size: 16159 MB
node 3 free: 15424 MB
node 4 cpus: 16 17 18 19
node 4 size: 16159 MB
node 4 free: 16009 MB
node 5 cpus: 20 21 22 23
node 5 size: 16159 MB
node 5 free: 15916 MB
node 6 cpus: 24 25 26 27
node 6 size: 16159 MB
node 6 free: 15957 MB
node 7 cpus: 28 29 30 31
node 7 size: 16142 MB
node 7 free: 15937 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 22 16 22 16 22
1: 16 10 22 16 22 16 22 16
2: 16 22 10 16 16 22 16 22
3: 22 16 16 10 22 16 22 16
4: 16 22 16 22 10 16 16 22
5: 22 16 22 16 16 10 22 16
6: 16 22 16 22 16 22 10 16
7: 22 16 22 16 22 16 16 10
Because Linux sees every integer core as a CPU, we have 32 of them (yes, only 16 FPUs, I know), so in htop we see CPUs numbered 1-32.
While solving the wrench model with the '-numasets 8' flag, COMSOL mainly occupied memory belonging to NUMA nodes 0-3 (the first two physical CPUs; does that mean just 8 FPUs were in fact utilized???), but hardly used the other memory, and all of the compute processes ran on the first two CPUs and had relatively quick access to their own memory. In the second case (-np 16), all CPU cores (so all FPUs) were used and memory allocation was a bit more evenly distributed among the 16 memory modules in the system, but about 2/3 of it was located in the memory modules belonging to the first two CPUs. So in the latter case, remote memory accesses occurred more frequently, I guess. From the solution times, it seems that memory access latencies are more important than core (FPU?) count. Or am I totally wrong?
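To put a rough number on the locality argument, the node distance table printed by numactl can be averaged over the nodes a job is confined to. This is only a Python sketch under simplifying assumptions: the distances are the relative costs reported by the firmware (10 meaning local), not measured latencies, and every core is assumed to touch memory on every node in the set equally often, which a real solver will not do. The function and variable names are made up for illustration.

```python
# Node distance matrix copied from the `numactl --hardware` output above.
DIST = [
    [10, 16, 16, 22, 16, 22, 16, 22],
    [16, 10, 22, 16, 22, 16, 22, 16],
    [16, 22, 10, 16, 16, 22, 16, 22],
    [22, 16, 16, 10, 22, 16, 22, 16],
    [16, 22, 16, 22, 10, 16, 16, 22],
    [22, 16, 22, 16, 16, 10, 22, 16],
    [16, 22, 16, 22, 16, 22, 10, 16],
    [22, 16, 22, 16, 22, 16, 16, 10],
]


def mean_distance(nodes):
    """Average distance over all (CPU node, memory node) pairs within `nodes`."""
    nodes = list(nodes)
    pairs = [DIST[i][j] for i in nodes for j in nodes]
    return sum(pairs) / len(pairs)


first_two_sockets = mean_distance(range(4))  # NUMA nodes 0-3
whole_machine = mean_distance(range(8))      # NUMA nodes 0-7
print(f"nodes 0-3: {first_two_sockets:.1f}, nodes 0-7: {whole_machine:.1f}")
```

Under these assumptions the average relative distance is 16.0 when the job stays on nodes 0-3 and 17.5 when it is spread over all 8 nodes, which points in the same direction as the timings, although the difference is modest and memory bandwidth contention is not captured at all.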
Regarding swapping, it was not present in any case.
I am not stating that NSL does not use all NUMA nodes; I just do not know whether they are used through shared memory or distributed memory (if that makes any sense...).
And yes, I can confirm that on Linux only one COMSOL instance can run at a given time, and on Windows more than that (three or four when I used version 4.x on Windows ages ago, I do not remember exactly; not checked with version 5.0). But in that particular COMSOL session, I can add a Batch Sweep for no more than 4 different values of a particular parameter, and then it computes those 4 parameter values simultaneously, starting and running four batch jobs. I would like to know what the maximum number of such parallel batch jobs is with FNL (on a single machine or in a cluster).
I have taken a good look at the wrench.mph model that you have chosen for the test.
It looks like the -nh switch has been replaced with the numasets switch. I think both had the same type of effect whereby COMSOL will increase performance only slightly by adjusting numasets. In my case, about 10% gain.
Our cluster is using Quad-Core AMD Opteron(tm) Processor 2350 with a total of 16 cores (4 cores/processor). My optimum setting is then -np 16 -numasets 4. If I leave numasets out, it uses about 10% more time to solve this problem.
The issue is that the problem you have chosen does not spend enough time in parallel code to show the speedup. Not all parts of COMSOL run in parallel, as they clearly say in the manuals. One way to see the speedup is to pick a problem that spends most of the time running the PARDISO direct solver or the GMRES iterative solver. The default setting you are running uses the GMRES solver, but it only spends a small fraction of the total solution time doing that. The rest of the time goes to a lot of setup to get there (assembly, prolongation, etc.). Parts of these operations are parallel, but not all.
One suggestion to see the speedup for this problem is to simply switch from the iterative solver to the direct solver. This is a good example where the iterative solver runs faster than the direct solver.
Then simply change np and run the cases. You will be able to measure the speedup, since the direct solver takes so much longer than the iterative solver and spends a lot of time in PARDISO. Be prepared to wait a long time for the np=1 case. Also, the direct solver will use a lot more memory.
I do not understand why you cannot run np=32. That makes no sense to me, but if you can only use np=16, then that is it. I have an Intel server here that runs np=24 and has 24 more hyperthreaded cores that I cannot use with COMSOL.
I think you need numasets=8. This has been a worthwhile study for me, because I will be using numasets=4 from here on with the cluster, and will check on the other machines I use.
happy comsol-ing to you !
nn np numasets time(s) ram(GB)
1 16 1 340 15
1 16 2 319 15
1 16 4 304 15
1 8 4 325 15
2 16 4 285 12
3 16 4 240 10
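The table can be summarised as the relative time saved against its first row. This is a Python sketch only; the post does not define the columns, so reading nn as the number of compute nodes and np as cores per node is an assumption, and the names used here are made up for illustration.

```python
# Rows copied from the table above: (nn, np, numasets, seconds).
runs = [
    (1, 16, 1, 340),
    (1, 16, 2, 319),
    (1, 16, 4, 304),
    (1, 8, 4, 325),
    (2, 16, 4, 285),
    (3, 16, 4, 240),
]

baseline = runs[0][3]  # nn=1, np=16, numasets=1

# Fraction of the baseline run time saved by each configuration.
gains = {(nn, cores, sets): 1 - t / baseline for nn, cores, sets, t in runs}

for (nn, cores, sets), gain in gains.items():
    print(f"nn={nn} np={cores:2d} numasets={sets}: {gain:6.1%} faster than baseline")
```

On a single node, going from numasets 1 to 4 saves about 10.6% of the run time in this table, in line with the roughly 10% gain mentioned earlier in the thread.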
thank you again for your continuous and valuable help.
I will try a full scaling test with PARDISO next week (to be sure our system is working/scaling properly) and share the results.
I think the reason why COMSOL does not utilize 32 threads (but only 16) is that our computer has only 16 floating point units in total (there is only 1 FPU per 2 integer cores in Opteron 6300 series processors). When I explicitly set the flag -np 32, COMSOL warns me that only 16 physical CPUs are in the system, and with htop I can see that only 1600% CPU time is utilized instead of 3200%.
I have done some weak and strong scaling tests with the following results:
STRONG scaling:
number of DoFs: 19,465,109 (+434,166 internal)
-np 16: 2,353 s (solver: 2,289 s)
-np 8: 3,097 s (solver: 3,036 s)
-np 4: 4,276 s (solver: 4,213 s)
-np 2: 7,366 s (solver: 7,305 s)
-np 1: 12,096 s (solver: 12,034 s)
WEAK scaling:
-np 16: 2,353 s (solver: 2,289 s); number of DoFs: 19,465,109 (+434,166 internal)
-np 8: 1,021 s (solver: 987 s); number of DoFs: 9,663,597 (+272,214 internal)
-np 4: 462 s (solver: 444 s); number of DoFs: 4,826,809 (+171,366 internal)
-np 2: 264 s (solver: 254 s); number of DoFs: 2,460,375 (+109,350 internal)
-np 1: 152 s (solver: 146 s); number of DoFs: 1,225,043 (+68,694 internal)
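As a way of reading the two lists above, the total times can be converted into a strong-scaling speedup and a weak-scaling efficiency. This is a Python sketch only; it uses the usual textbook definitions (speedup T(1)/T(N) at fixed size, weak efficiency T(1)/T(N) at fixed size per core), and the dictionary and variable names are made up for illustration.

```python
# Total wall-clock times (s) copied from the STRONG scaling list above.
strong = {1: 12096, 2: 7366, 4: 4276, 8: 3097, 16: 2353}

# cores: (total seconds, DoFs), copied from the WEAK scaling list above.
weak = {
    1: (152, 1_225_043),
    2: (264, 2_460_375),
    4: (462, 4_826_809),
    8: (1021, 9_663_597),
    16: (2353, 19_465_109),
}

strong_speedup = {n: strong[1] / t for n, t in strong.items()}
weak_efficiency = {n: weak[1][0] / t for n, (t, _) in weak.items()}
dofs_per_core = {n: dofs / n for n, (_, dofs) in weak.items()}

for n in sorted(strong):
    print(
        f"-np {n:2d}: strong speedup {strong_speedup[n]:4.2f}, "
        f"weak efficiency {weak_efficiency[n]:5.1%}, "
        f"{dofs_per_core[n] / 1e6:.2f} M DoFs/core"
    )
```

The strong-scaling speedup comes out at about 5.1x on 16 cores. The weak-scaling efficiency drops to about 6.5% at -np 16 even though each run has roughly 1.2 million DoFs per core; part of that may simply be that the solver work grows faster than linearly with the number of DoFs, so it is not purely a measure of parallel overhead.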
The model was created using Heat Transfer in Solids, its geometry was a simple cube (edge length 1 m), one face was set to T=100degC and opposite face was set to T=0degC, remaining 4 faces were thermally insulated. Material properties were: k=1W/mK, rho=100kg/m3, Cp=100J/kgK. It was a stationary model, mesh consisted of hexahedral elements with quadratic shape functions.
Are these results reasonable?
Thank you for your help.
What solver did you use ? For a shared-memory situation, you want to use the PARDISO direct solver in most situations. However, in this case, since all the properties are constant, you have essentially a pure Laplacian problem, which may lend itself to even faster solvers such as conjugate gradient.
I might suggest to impose temperature-dependent properties which will provide nonlinear behavior and allow a Newton iteration to be fully utilized and perhaps demonstrate the speedup for a more realistic problem. The problem I used in the benchmark shown in the paper was completely nonlinear by solving Navier-Stokes and temperature-dependent properties in the fluid.
What is the definition of the strong and weak scaling ? Why does the weak scaling have different DOFs ? How does the weak scaling demonstrate speedup ?
I used an iterative method (GMRES) because it was way faster than PARDISO.
Now I have tested with a temperature-dependent k value and got the following results (strong scaling):
DoF: 8,120,601 (+242,406 inner)
np 16: 3,382 s
np 8: 4,107 s
np 4: 5,348 s
np 2: 8,367 s
np 1: 13,529 s
As I can see, solution speed is about 4 times faster for 16 cores than for 1 core.
Strong scaling: how solution time changes as number of cores increases while problem size stays the same in total.
Weak scaling: how solution time changes as number of cores AND problem size increase so that problem size per core remains (about) the same. I thought problem size meant DoF in FEM.
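To go one step beyond the raw speedup for the temperature-dependent run, one can estimate what fraction of the run behaves as if it were serial. The Python sketch below uses the Karp-Flatt metric, e = (1/S - 1/N) / (1 - 1/N), which is a standard diagnostic and not something used elsewhere in this thread; the names are made up for illustration, and the estimate lumps together truly serial code, memory bandwidth limits and any clock-speed differences between runs.

```python
# Strong-scaling times (s) for the temperature-dependent model, copied from above.
times = {1: 13529, 2: 8367, 4: 5348, 8: 4107, 16: 3382}


def karp_flatt(speedup, cores):
    """Experimentally determined serial fraction for a measured speedup."""
    return (1 / speedup - 1 / cores) / (1 - 1 / cores)


speedup = {n: times[1] / t for n, t in times.items()}
serial_fraction = {n: karp_flatt(speedup[n], n) for n in times if n > 1}

for n, frac in sorted(serial_fraction.items()):
    print(f"np {n:2d}: speedup {speedup[n]:4.2f}, serial fraction ~{frac:.2f}")
```

The estimate stays between roughly 0.19 and 0.24 for all core counts. If that fraction really is fixed, Amdahl's law would cap the speedup at about 5x regardless of core count, which is consistent with the 4x seen on 16 cores.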