[visit-users] need help launching parallel engine on linux cluster.

Brugger, Eric brugger1 at llnl.gov
Wed Jul 9 11:20:02 EDT 2014


Dave,

Thanks for sending your solution on to the list. It should prove useful to others in a similar situation.

Eric

From: Semeraro, B David [mailto:semeraro at illinois.edu]
Sent: Tuesday, July 08, 2014 9:37 AM
To: VisIt software users community
Subject: Re: [visit-users] need help launching parallel engine on linux cluster.


Just thought I would post the conclusion to this thread. I solved the problem by doing a complete rebuild and proper installation of VisIt, rebuilding IceT in the process. I believe the previous IceT build caused the OpenGL problem I was seeing. With a proper install, and once I had set up the environment in my .bashrc, the components worked correctly.

In summary, here is what I did to get VisIt working on a Rocks-based Linux cluster.

1) Modify the environment to use mpich2 rather than openmpi (see the .bashrc file below).

2) Build the third-party libraries separately (not really necessary, but this is how I did it).

3) Uncompress the VisIt tarball and build VisIt with the following command:

   a. src/svn_bin/build_visit --console --mesa --hdf5 --netcdf --silo --h5part --boxlib --no-pyside --thirdparty-path /home/semeraro/VISIT/thirdparty --parallel --prefix /home/semeraro/VISIT

   b. Substitute your own values for thirdparty-path and prefix.

4) Make changes in the .bashrc file to point to the new install.

5) Use and enjoy.
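
A minimal sketch of steps 3a and 3b, for anyone who wants to review the command before running it. The flags are exactly those in step 3a. The variable names THIRDPARTY, PREFIX and DRY_RUN and the helper functions are placeholders made up for this sketch, not part of build_visit. By default it only prints the command.

```shell
#!/bin/bash
# Sketch only. THIRDPARTY, PREFIX and DRY_RUN are hypothetical names used
# for illustration; build_visit itself knows nothing about them.

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Same flags as step 3a, with the two site-specific paths substituted (step 3b).
build_visit_cmd() {
    run src/svn_bin/build_visit --console --mesa --hdf5 --netcdf --silo \
        --h5part --boxlib --no-pyside \
        --thirdparty-path "${THIRDPARTY:-$HOME/VISIT/thirdparty}" \
        --parallel --prefix "${PREFIX:-$HOME/VISIT}"
}

build_visit_cmd
```

Run it from the top of the unpacked VisIt tarball. Setting DRY_RUN=0 executes the command instead of printing it.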

The bashrc file I used looks like this:

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# User specific aliases and functions
PATH=$PATH:$HOME/bin:/opt/mpich2/gnu/bin:$HOME/VISIT/bin
VISITVERSION=2.7.2
VISITHOME=$HOME/VISIT
VISITARCHHOME=$HOME/VISIT/2.7.2/linux-x86_64
LD_LIBRARY_PATH=/opt/mpich2/gnu/lib:$VISITARCHHOME/lib
LD_LIBRARY_PATH_64=$LD_LIBRARY_PATH


function module { eval `modulecmd bash $*`; }
typeset -xf module
module unload rocks-openmpi

export PATH
export LD_LIBRARY_PATH
export LD_LIBRARY_PATH_64
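
A quick way to confirm that a fresh login shell picked up these settings is sketched below. The directories are the ones from the .bashrc above; path_has is a helper name invented for this sketch, not something VisIt or Rocks provides.

```shell
# Sketch only: path_has is a hypothetical helper.
# path_has LIST DIR succeeds when DIR is an entry of the colon-separated LIST.
path_has() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Directories the .bashrc above is expected to add to PATH.
for dir in /opt/mpich2/gnu/bin "$HOME/VISIT/bin"; do
    if path_has "$PATH" "$dir"; then
        echo "PATH ok:      $dir"
    else
        echo "PATH missing: $dir"
    fi
done

# Show which mpirun the shell would actually run, if any.
command -v mpirun || echo "mpirun not found on PATH"
```

If mpirun still resolves to the openmpi install, the module unload line probably did not take effect in that shell.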


That is about it. Hope this helps someone down the road. It will probably be future Dave because present Dave is going to forget all this by tomorrow.

Thanks,
Present Dave....

From: Harrison, Cyrus D. [mailto:harrison37 at llnl.gov]
Sent: Wednesday, July 02, 2014 6:46 PM
To: VisIt software users community
Subject: Re: [visit-users] need help launching parallel engine on linux cluster.

Hi Dave,
Yes - If the system python is being used, that could derail things.

-Cyrus

On Jul 2, 2014, at 2:21 PM, Semeraro, B David <semeraro at illinois.edu> wrote:

Hi Cyrus,

I am passing the same machine file that I use in the cli version. Here is what the output window says is happening with the mpi launch:

MESSAGE: Running: mpirun -np 2 -machinefile /home/semeraro/hostfile /home/semeraro/VISIT272/visit2.7.2/src/exe/engine_par -plugindir /home/semeraro/.visit/2.7.2/linux-x86_64/plugins:/home/semeraro/VISIT272/visit2.7.2/src/plugins -visithome /home/semeraro/VISIT272/visit2.7.2/src -visitarchhome /home/semeraro/VISIT272/visit2.7.2/src/ -dir /home/semeraro/VISIT272/visit2.7.2/src -forcestatic -idle-timeout 480 -debug 5 -dv -noloopback -sshtunneling -host ncsa-rocks.ncsa.illinois.edu -port 12141

I don't see anything wrong with that. It is using the correct machine file. I uncommented some echo statements in the frontendlauncher script. It appears that the system python is being used to run frontendlauncher.py instead of the visit package python. Could that be part of the problem? Aside from the value of -host and -port in the gui launch I don't see any real difference between the mpirun command used in the gui and the cli. I will check again though.
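
One rough way to see which python a shell on the cluster resolves is sketched below. It only inspects PATH in the current shell; how frontendlauncher actually picks its interpreter is not covered here. VISIT_TREE and is_under are names made up for this sketch, and the default path is an assumption based on the paths in the mpirun line above.

```shell
# Sketch only: is_under and VISIT_TREE are hypothetical names.
# is_under PREFIX FILE succeeds when FILE lies somewhere inside PREFIX.
is_under() {
    case "$2" in
        "${1%/}"/*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Assumption: adjust to wherever your VisIt tree lives.
VISIT_TREE="${VISIT_TREE:-$HOME/VISIT272/visit2.7.2}"

py="$(command -v python || true)"
if [ -z "$py" ]; then
    echo "no python on PATH"
elif is_under "$VISIT_TREE" "$py"; then
    echo "python comes from the VisIt tree: $py"
else
    echo "python comes from outside the VisIt tree: $py"
fi
```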

Dave

From: Harrison, Cyrus D. [mailto:harrison37 at llnl.gov]
Sent: Wednesday, July 02, 2014 3:58 PM
To: VisIt software users community
Subject: Re: [visit-users] need help launching parallel engine on linux cluster.

Hi Dave,
If the engine_par logs did not have the MPI task # as part of the file name, it sounds like it isn't being launched properly. If you are using a machinefile from the command line, maybe that isn't getting propagated via the GUI?

-Cyrus

On Jul 2, 2014, at 1:06 PM, Semeraro, B David <semeraro at illinois.edu> wrote:


Hi there,

I am running VisIt 2.7.2 on a linux cluster. I can start and run a command line client from the head node with the command:

VISIT272/visit2.7.2/src/bin/visit -debug 5 -cli -nowin -np 2 -nn 2 -l mpirun -machinefile /home/semeraro/hostfile

After that I can issue the python commands and run visit just fine. However, when I try to start a parallel session on this cluster from a remote visit gui client the parallel engine fails to start. The relevant error message from the end of A.vcl.5.log appears to be:

Sending 395 bytes
Child 1 needs to be read (desc=12)
Done reading for child 1
CHILD OUTPUT[1]: Running: mpirun -np 2 -machinefile /home/semeraro/hostfile /home/semeraro/VISIT272/visit2.7.2/src/exe/engine_par -plugindir /home/semeraro/.visit/2.7.2/linux-x86_64/plugins:/home/semeraro/VISIT272/visit2.7.2/src/plugins -visithome /home/semeraro/VISIT272/visit2.7.2/src -visitarchhome /home/semeraro/VISIT272/visit2.7.2/src/ -dir /home/semeraro/VISIT272/visit2.7.2/src -forcestatic -idle-timeout 480 -debug 5 -dv -noloopback -sshtunneling -host ncsa-rocks.ncsa.illinois.edu -port 38630

Sending 486 bytes
Child 1 needs to be read (desc=12)
Done reading for child 1
CHILD OUTPUT[1]: terminate called after throwing an instance of 'LostConnectionException'

Sending 73 bytes
Child 1 needs to be read (desc=12)
Done reading for child 1
CHILD OUTPUT[1]: --------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 27280 on node compute-0-0 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Sending 250 bytes
Child 1 needs to be read (desc=12)
Lost connection to child 1
Exception: (LostConnectionException) /home/semeraro/VISIT272/visit2.7.2/src/common/comm/SocketConnection.C, line 253: <The reason for the exception was not described>
catch(LostConnectionException) /home/semeraro/VISIT272/visit2.7.2/src/launcher/main/LauncherApplication.C:526
VisIt component launcher exited.

I do not understand why the engine works when I use the cli and fails when I use the gui. I have turned off the firewalls on the remote gui machine and the cluster, to no avail. Another strange thing is that the names of the engine debug files differ between the two cases. When run from the cli, the files have the form A.engine_par.###.5.vlog, where I assume the ### corresponds to the MPI rank. When run from the gui there is no ### in the name: I get A.engine_par.5.vlog and B.engine_par.5.vlog.
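
The file-name difference described above can be checked with a few lines of shell. This is only a sketch: has_rank is a made-up helper, and it assumes the ### part is purely numeric and that the script runs in whatever directory holds the .vlog files.

```shell
# Sketch only: has_rank is a hypothetical helper.
# Succeeds for names like A.engine_par.000.5.vlog (rank field present),
# fails for names like A.engine_par.5.vlog (no rank field).
has_rank() {
    printf '%s\n' "$1" | grep -Eq '\.engine_par\.[0-9]+\.[0-9]+\.vlog$'
}

# Classify every engine_par debug log in the current directory.
for f in *.engine_par.*vlog; do
    [ -e "$f" ] || continue   # the glob matched nothing
    if has_rank "$f"; then
        echo "ranked:   $f"
    else
        echo "unranked: $f"
    fi
done
```

Only unranked names after a gui launch would be consistent with the engine not being started under MPI as intended.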

I have run out of ideas for things to try. Any help from the user community would be greatly appreciated. There must be something unique with this cluster as I have no problem compiling and running on our HPC equipment on Blue Waters. This cluster was built using a recent release of Rocks. One other thing. I solved a previous problem by compiling against mpich2 rather than openmpi. The problem I was getting before was a segfault. I couldn't even run the cli in that case. So I am making some progress.

Thanks,
Dave Semeraro
--
VisIt Users Wiki: http://visitusers.org/
Frequently Asked Questions for VisIt: http://visit.llnl.gov/FAQ.html
To Unsubscribe: send a blank email to visit-users-unsubscribe at elist.ornl.gov
More Options: https://elist.ornl.gov/mailman/listinfo/visit-users

