problem with cluster parallel computing.

Discussion Closed This discussion was created more than 6 months ago and has been closed. To start a new discussion with a link back to this one, click here.

problem with cluster parallel computing.

Posted 24 mag 2017, 07:42 GMT-4 Version 5.2a 2 Replies

Ying Tang

Send Private Message Flag post as spam

Please login with a confirmed email address before reporting spam

I am using COMSOL on our university Linux cluster. for each node, we have 20 cores. and the job is submited by PBS scheduler.

If I run with one node, the simulation works out with NO problem. But if I run with more than one nodes, it doesn't work.

I tried many examples from the internet but none of them works for out cluster. I used both hydra and MPD launcher method, none of them are working. below is the sample result I got. As far as I know, comsol come with the intelMPI library so I dont have to load any MPI module separatly.

1. If I use hydra mode (cluster simple), the script I have is
#====================================================start of script==
#!/bin/bash
#PBS -N clustersimple_03
#PBS -lnodes=2:ppn=11
#PBS -lwalltime=00:05:00
#PBS -e ${PBS_JOBNAME}.err
#PBS -o ${PBS_JOBNAME}.out

# Load the COMSOL 5.2a module
module load comsol/52a

# Obtain the number of cores and number of nodes assigned to us
NCORES=`cat /proc/cpuinfo | grep processor | wc -l`
NNODES=`cat ${PBS_NODEFILE} | sort -u | wc -l`
echo display NCORES and NNODES
echo ${NCORES} and ${NNODES}
echo =============
echo Step into the PBS_O_DIR directory
cd ${PBS_O_WORKDIR}
echo
echo ================
#echo 'Start COMSOL with "mpdboot" (multi-core) and in "batch" mode'
echo 'use cluster simple mode'
echo "use ${PBS_NUM_NODES} nodes and ${PBS_NUM_PPN} cores per node"
comsol -clustersimple -f $PBS_NODEFILE batch -mpiarg -rmk -mpiarg pbs\
-nn ${PBS_NUM_NODES}\
-np ${PBS_NUM_PPN} \
-inputfile HPC05_test2_DistPara_clean_unimport_initial_2.mph \
-outputfile ${PBS_JOBNAME}.mph \
-batchlog ${PBS_JOBNAME}.log

cp batchtest_out.txt ${PBS_JOBNAME}.txt
echo =================
#====================================================end of script==

after I submit my job to PBS, it starts to run. but in the end, I will receive this error message.
I can confirm that the time 300 second is more than enough for my test simulation. Hence the program was frozen.

=>> PBS: job killed: walltime 328 exceeded limit 300
[mpiexec@n05-32] HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)
[mpiexec@n05-32] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@n05-32] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@n05-32] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@n05-32] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
[mpiexec@n05-32] main (../../ui/mpich/mpiexec.c:1130): process manager error waiting for completion

2. if I run with MPD mode, the script is as below
#====================================================start of script==
#!/bin/bash
#PBS -N mpd_parallel_06
#PBS -lnodes=2:ppn=8
#PBS -lwalltime=00:05:00
#PBS -e ${PBS_JOBNAME}.err
#PBS -o ${PBS_JOBNAME}.out

# Load the COMSOL 5.2a module
module load comsol/52a

# Obtain the number of cores and number of nodes assigned to us
export NCORES=`cat /proc/cpuinfo | grep processor | wc -l`
export NNODES=`cat ${PBS_NODEFILE} | sort -u | wc -l`
echo display NCORES and NNODES
echo ${NCORES} and ${NNODES}
echo =============
echo Step into the PBS_O_DIR directory
cd ${PBS_O_WORKDIR}
echo

# Setting up some files
cat $PBS_NODEFILE > nodes
cat $PBS_NODEFILE | uniq > mpd.conf
# Get number of nodes allocated
export NO_OF_NODES=`cat $PBS_NODEFILE | egrep -v '^#'\|'^$' | wc -l | awk '{print $1}'`
export NODE_LIST=`cat $PBS_NODEFILE `
# # Just for kicks, see which nodes we got.
echo ======NODE LIST
echo $NODE_LIST

echo ======uniq -c pbs nodefile=========
uniq -c $PBS_NODEFILE
echo ===============
echo $PBS_NODEFILE
echo ============setup
# Some setup stuff - you don't need to change this line
comsol -nn $PBS_NUM_NODES mpd boot -f mpd.conf -mpirsh ssh comsol mpd trace -l
echo ===============parallel run
echo "--- Parallel COMSOL RUN"
comsol -nn ${PBS_NUM_NODES} -np ${NO_OF_NODES} batch \
-inputfile HPC05_test2_DistPara_clean_unimport_initial_2.mph \
-outputfile ${PBS_JOBNAME}.mph -batchlog ${PBS_JOBNAME}.log

echo "--- mpd ALLEXIT"
comsol mpd allexit

echo
echo "--- Job finished at: `date`"
echo "------------------------------------------------------------------------------"

cp batchtest_out.txt ${PBS_JOBNAME}.txt
echo =================

#====================================================end of script==

then I will get this output message.
#====================================================start of output==
display NCORES and NNODES
40 and 2
=============
Step into the PBS_O_DIR directory

======NODE LIST
n05-06 n05-06 n05-06 n05-06 n05-06 n05-06 n05-06 n05-06 n05-11 n05-11 n05-11 n05-11 n05-11 n05-11 n05-11 n05-11
======uniq -c pbs nodefile=========
8 n05-06
8 n05-11
===============
/var/opt/ud/torque-4.2.10/aux//10862.hpc05.hpc
============setup

usage: mpdtrace [-l] [-V | --version]
Lists the (short) hostname of each of the mpds in the ring
The -l (long) option shows full hostnames and listening ports and ifhn

Copyright (C) 2003-2015 Intel Corporation. All rights reserved.

===============parallel run
--- Parallel COMSOL RUN
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 11312 n05-06 {0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29}
[0] MPI startup(): 1 11313 n05-06 {10,11,12,13,14,15,16,17,18,19,30,31,32,33,34,35,36,37,38,39}
Node 0 is running on host: n05-06
Node 0 has address: n05-06.hpc
Node 1 is running on host: n05-06
Node 1 has address: n05-06.hpc
Warning: The total number of allocated threads (32) on host: n05-06 exceeds the number of available physical cores (20)

#====================================================end of output==
the error message is

#====================================================start of error===
utility.c(2245):ERROR:50: Cannot open file '/opt/ud/LOCAL/etc/modulefiles/comsol/53' for 'reading'
=>> PBS: job killed: walltime 337 exceeded limit 300
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
#====================================================end of error===

does anyone know what happen here?

2 Replies Last Post 19 giu 2017, 09:55 GMT-4

Jim Freels mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist

Send Private Message Flag post as spam

Please login with a confirmed email address before reporting spam

Posted: 8 years ago

Let's start with a few questions:

1) do you have the comsol floating network license (FNL). In order to run distributed parallel, it is required. You are not required to purchase any other additional licenses than the FNL, but if you do not have that available, then it will not work.
2) Is the FNL accessible from all the nodes you need to access ?
3) have you tried to run the parallel job outside the PBS queue ?
4) there is some help from comsol knowledge base on using PBS in linux. Have you checked this ?
5) have you used your comsol tech support to address this problem ?
6) have you tried a much simpler script that perhaps hardwires your parameters to eliminate the script as part of the problem and not comsol ?

Nicholas Labarbera

Send Private Message Flag post as spam

Please login with a confirmed email address before reporting spam

Posted: 7 years ago

Did you ever solve this issue? I'm having a similar problem

Discussion Forum

problem with cluster parallel computing.

Suggested Content