Problem with cluster parallel computing
Posted 24 May 2017, 07:42 GMT-4 Version 5.2a 2 Replies
I am using COMSOL on our university Linux cluster. Each node has 20 cores, and jobs are submitted through the PBS scheduler.
If I run on one node, the simulation completes with no problem. But if I run on more than one node, it does not work.
I have tried many examples from the internet, but none of them works on our cluster. I used both the Hydra and the MPD launcher methods, and neither of them works. Below are the sample results I got. As far as I know, COMSOL ships with its own Intel MPI library, so I should not have to load a separate MPI module.
1. If I use the Hydra mode (cluster simple), the script I have is:
#====================================================start of script==
#!/bin/bash
#PBS -N clustersimple_03
#PBS -lnodes=2:ppn=11
#PBS -lwalltime=00:05:00
#PBS -e ${PBS_JOBNAME}.err
#PBS -o ${PBS_JOBNAME}.out
# Load the COMSOL 5.2a module
module load comsol/52a
# Core count reported by /proc/cpuinfo on this host (includes hyperthreads) and the number of distinct nodes assigned to us
NCORES=`cat /proc/cpuinfo | grep processor | wc -l`
NNODES=`cat ${PBS_NODEFILE} | sort -u | wc -l`
echo display NCORES and NNODES
echo ${NCORES} and ${NNODES}
echo =============
echo Step into the PBS_O_DIR directory
cd ${PBS_O_WORKDIR}
echo
echo ================
#echo 'Start COMSOL with "mpdboot" (multi-core) and in "batch" mode'
echo 'use cluster simple mode'
echo "use ${PBS_NUM_NODES} nodes and ${PBS_NUM_PPN} cores per node"
comsol -clustersimple -f $PBS_NODEFILE batch -mpiarg -rmk -mpiarg pbs \
-nn ${PBS_NUM_NODES} \
-np ${PBS_NUM_PPN} \
-inputfile HPC05_test2_DistPara_clean_unimport_initial_2.mph \
-outputfile ${PBS_JOBNAME}.mph \
-batchlog ${PBS_JOBNAME}.log
cp batchtest_out.txt ${PBS_JOBNAME}.txt
echo =================
#====================================================end of script==
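For comparison, here is a minimal sketch of what I understand a plain cluster-simple submission should look like, with the extra -mpiarg options left out so that COMSOL works out the process layout from the hostfile on its own. The file names are simply the ones from my own job, and the ppn/walltime values are placeholders:
#====================================================start of sketch==
#!/bin/bash
#PBS -N clustersimple_minimal
#PBS -lnodes=2:ppn=20
#PBS -lwalltime=00:30:00
#PBS -e ${PBS_JOBNAME}.err
#PBS -o ${PBS_JOBNAME}.out

# Load the COMSOL 5.2a module and step into the submission directory
module load comsol/52a
cd ${PBS_O_WORKDIR}

# Let COMSOL read the PBS hostfile and choose the process layout itself;
# no explicit -nn/-np and no extra -mpiarg options.
comsol -clustersimple -f ${PBS_NODEFILE} batch \
    -inputfile HPC05_test2_DistPara_clean_unimport_initial_2.mph \
    -outputfile ${PBS_JOBNAME}.mph \
    -batchlog ${PBS_JOBNAME}.log
#====================================================end of sketch==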
After I submit my job to PBS, it starts to run, but in the end I receive the error message below. I can confirm that 300 seconds of walltime is more than enough for this test simulation, so the program must have frozen.
=>> PBS: job killed: walltime 328 exceeded limit 300
[mpiexec@n05-32] HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)
[mpiexec@n05-32] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@n05-32] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@n05-32] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@n05-32] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:500): error waiting for event
[mpiexec@n05-32] main (../../ui/mpich/mpiexec.c:1130): process manager error waiting for completion
2. If I run in MPD mode, the script is as follows:
#====================================================start of script==
#!/bin/bash
#PBS -N mpd_parallel_06
#PBS -lnodes=2:ppn=8
#PBS -lwalltime=00:05:00
#PBS -e ${PBS_JOBNAME}.err
#PBS -o ${PBS_JOBNAME}.out
# Load the COMSOL 5.2a module
module load comsol/52a
# Core count reported by /proc/cpuinfo on this host (includes hyperthreads) and the number of distinct nodes assigned to us
export NCORES=`cat /proc/cpuinfo | grep processor | wc -l`
export NNODES=`cat ${PBS_NODEFILE} | sort -u | wc -l`
echo display NCORES and NNODES
echo ${NCORES} and ${NNODES}
echo =============
echo Step into the PBS_O_DIR directory
cd ${PBS_O_WORKDIR}
echo
# Setting up some files
cat $PBS_NODEFILE > nodes
cat $PBS_NODEFILE | uniq > mpd.conf
# Count all entries in the node file (one line per allocated core slot)
export NO_OF_NODES=`cat $PBS_NODEFILE | egrep -v '^#'\|'^$' | wc -l | awk '{print $1}'`
export NODE_LIST=`cat $PBS_NODEFILE `
# # Just for kicks, see which nodes we got.
echo ======NODE LIST
echo $NODE_LIST
echo ======uniq -c pbs nodefile=========
uniq -c $PBS_NODEFILE
echo ===============
echo $PBS_NODEFILE
echo ============setup
# Boot the MPD ring on the allocated nodes, then list the ring members
comsol -nn $PBS_NUM_NODES mpd boot -f mpd.conf -mpirsh ssh
comsol mpd trace -l
echo ===============parallel run
echo "--- Parallel COMSOL RUN"
comsol -nn ${PBS_NUM_NODES} -np ${NO_OF_NODES} batch \
-inputfile HPC05_test2_DistPara_clean_unimport_initial_2.mph \
-outputfile ${PBS_JOBNAME}.mph -batchlog ${PBS_JOBNAME}.log
echo "--- mpd ALLEXIT"
comsol mpd allexit
echo
echo "--- Job finished at: `date`"
echo "------------------------------------------------------------------------------"
cp batchtest_out.txt ${PBS_JOBNAME}.txt
echo =================
#====================================================end of script==
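One thing I am not sure about in this script is the counting: NO_OF_NODES actually counts every line of $PBS_NODEFILE (one line per allocated core slot), not the number of distinct hosts, and that value is what I pass to -np later. Just as a sketch of the different counts (the NSLOTS and PPN names are only for illustration):
#====================================================start of sketch==
# Distinct hosts in the allocation (one per host after sort -u)
NNODES=$(sort -u ${PBS_NODEFILE} | wc -l)

# Total core slots = number of lines in the node file
NSLOTS=$(wc -l < ${PBS_NODEFILE})

# Core slots per host, assuming PBS gave every host the same ppn
PPN=$(( NSLOTS / NNODES ))

echo "nodes=${NNODES} slots=${NSLOTS} ppn=${PPN}"
#====================================================end of sketch==
With -lnodes=2:ppn=8 this gives nodes=2, slots=16, ppn=8, so the 16 I end up passing to -np is the total slot count rather than a per-process core count.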
Then I get the following output:
#====================================================start of output==
display NCORES and NNODES
40 and 2
=============
Step into the PBS_O_DIR directory
======NODE LIST
n05-06 n05-06 n05-06 n05-06 n05-06 n05-06 n05-06 n05-06 n05-11 n05-11 n05-11 n05-11 n05-11 n05-11 n05-11 n05-11
======uniq -c pbs nodefile=========
8 n05-06
8 n05-11
===============
/var/opt/ud/torque-4.2.10/aux//10862.hpc05.hpc
============setup
usage: mpdtrace [-l] [-V | --version]
Lists the (short) hostname of each of the mpds in the ring
The -l (long) option shows full hostnames and listening ports and ifhn
Copyright (C) 2003-2015 Intel Corporation. All rights reserved.
===============parallel run
--- Parallel COMSOL RUN
[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 11312 n05-06 {0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29}
[0] MPI startup(): 1 11313 n05-06 {10,11,12,13,14,15,16,17,18,19,30,31,32,33,34,35,36,37,38,39}
Node 0 is running on host: n05-06
Node 0 has address: n05-06.hpc
Node 1 is running on host: n05-06
Node 1 has address: n05-06.hpc
Warning: The total number of allocated threads (32) on host: n05-06 exceeds the number of available physical cores (20)
#====================================================end of output==
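If I read that warning together with my command line, the 32 seems to come from the two ranks that were both placed on n05-06, each started with -np 16 (the NO_OF_NODES value), i.e. 2 x 16 = 32 threads on a node with 20 physical cores. A trivial check of that arithmetic, with the values taken from the output above:
#====================================================start of sketch==
RANKS_ON_HOST=2     # both MPI ranks were pinned to n05-06
NP=16               # the -np value I passed (NO_OF_NODES)
PHYS_CORES=20       # physical cores per node on this cluster

echo "$(( RANKS_ON_HOST * NP )) threads on a ${PHYS_CORES}-core host"   # 32 > 20
#====================================================end of sketch==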
The error message is:
#====================================================start of error===
utility.c(2245):ERROR:50: Cannot open file '/opt/ud/LOCAL/etc/modulefiles/comsol/53' for 'reading'
=>> PBS: job killed: walltime 337 exceeded limit 300
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
#====================================================end of error===
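The other line that stands out is the one about '/opt/ud/LOCAL/etc/modulefiles/comsol/53': something on the remote side is apparently trying to read a comsol/53 modulefile that does not exist, even though my script loads comsol/52a. I assume I could check what a non-interactive shell on a compute node actually sees with something like this (n05-11 is just one of the hosts from my allocation; on some clusters the module command is not even defined in non-interactive shells, which would itself show up here):
#====================================================start of sketch==
# What does a non-interactive shell on a compute node see?
ssh n05-11 'module avail comsol 2>&1; module list 2>&1'
#====================================================end of sketch==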
Does anyone know what is happening here?
2 Replies Last Post 19 Jun 2017, 09:55 GMT-4