COMSOL on cluster is limited to 2 nodes
Posted 24 feb 2012, 15:24 GMT-5 6 Replies
Hi,
I've been trying to set up a small cluster to run parametric sweeps, using 4 computers that all run Ubuntu 11.10 (kernel 3.0.0-16-generic). Three nodes are built from modified Xubuntu live CDs, and the "main" node runs Kubuntu.
I've set up the ssh/mpd communication and have verified, as far as I can tell, that the mpd ring is operating correctly: running "comsol mpd ringtest" gives a response in the millisecond range. I've also checked manually that each node can ssh into every other node without being prompted for authentication. The COMSOL install directory is NFS-mounted on each node.
If I run the model on only two nodes:
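For anyone wanting to repeat the all-pairs ssh check, here is a minimal sketch of how it could be scripted. It assumes a hosts file with one hostname per line (such as the "clusternodes" file mentioned further down); the function names node_pairs and check_pairs are made up for this illustration and are not part of COMSOL or mpd.

```shell
#!/bin/sh
# Hypothetical helper: print every ordered "source destination" pair of
# distinct hosts from a file that lists one hostname per line.
node_pairs() {
    hostfile=$1
    for src in $(cat "$hostfile"); do
        for dst in $(cat "$hostfile"); do
            if [ "$src" != "$dst" ]; then
                echo "$src $dst"
            fi
        done
    done
}

# Hypothetical helper: from this machine, log in to each source host and
# try to ssh from there to each destination host.
# BatchMode=yes makes ssh fail instead of prompting for a password, so a
# pair that still needs interactive authentication shows up as FAIL.
check_pairs() {
    node_pairs "$1" | while read -r src dst; do
        # -n keeps the outer ssh from swallowing the loop's stdin.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$src" \
               ssh -o BatchMode=yes -o ConnectTimeout=5 "$dst" true; then
            echo "ok   $src -> $dst"
        else
            echo "FAIL $src -> $dst"
        fi
    done
}
```

Running "check_pairs clusternodes" from the main node would then print one ok/FAIL line per pair. Note that this only shows that non-interactive ssh works; it says nothing about whether the MPI traffic itself gets through.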
comsol -nn 2 batch -inputfile <file_to_run>
The model runs as expected; two nodes are initialized and the parameters are distributed between the nodes and evaluated. If I run with -nn greater than 2, the cluster initializes, but I get no log output and nothing appears to be solved. Checking each active node, I find that a "comsollauncher" process has been spawned, and it either runs at 100% CPU indefinitely (I let it run for 30 minutes on 4 hardware nodes before quitting) or uses no CPU time at all.
Further, if I use only two hardware nodes and try -nn 4 (i.e., two software nodes on each hardware node), the same problem occurs, whereas -nn 2 works correctly. It does not matter which hardware nodes I use; I have checked each combination.
However, if I use just one computer and run with -nn 4, the model runs correctly.
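The per-node look at the "comsollauncher" processes described above could be scripted roughly as follows. This is only a sketch: the function names, the hosts-file argument, and the 90% and 1% thresholds are assumptions chosen for illustration, and it relies on the Linux procps "ps -C" option.

```shell
#!/bin/sh
# Hypothetical helper: label a %CPU value according to the two failure
# modes seen in this thread. The thresholds (90 and 1) are arbitrary.
classify_cpu() {
    awk -v cpu="$1" 'BEGIN {
        if (cpu + 0 >= 90)     print "spinning"
        else if (cpu + 0 < 1)  print "idle"
        else                   print "working"
    }'
}

# Hypothetical helper: for every host in the file (one hostname per line),
# list the %CPU of each comsollauncher process and its label.
# Note: the pcpu column of ps is CPU time divided by the process lifetime,
# not an instantaneous reading like the one top shows.
launcher_status() {
    hostfile=$1
    for host in $(cat "$hostfile"); do
        ssh -n -o BatchMode=yes "$host" ps -C comsollauncher -o pcpu= |
        while read -r cpu; do
            echo "$host $cpu $(classify_cpu "$cpu")"
        done
    done
}
```

Calling "launcher_status clusternodes" while a stuck -nn 4 job is running should print one line per launcher process; a host with no launcher process prints nothing.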
Has anyone encountered a problem like this? Any help would be appreciated.
Other information:
MPD is launched with:
comsol -nn 4 mpd boot -v -r ssh -f clusternodes
where "clusternodes" is the file containing the hostname of each node.
Running:
comsol -nn 4 server
reproduces the same problem as described above
Output of "comsol mpd trace":
$ comsol mpd trace
Precision390-Ubuntu
livecluster01
livecluster03
livecluster02
Output of "comsol mpd ringtest":
$ comsol mpd ringtest
time for 1 loops = 0.00281715393066 seconds
The COMSOL version is the latest: 4.2.1.166.
Each node is connected via a 100/1000 Ethernet switch, which allows connection to the license manager.
6 Replies Last Post 6 apr 2012, 08:14 GMT-4