[visit-users] need help launching parallel engine on linux cluster.

Harrison, Cyrus D. harrison37 at llnl.gov
Wed Jul 2 16:57:47 EDT 2014


Hi Dave,
If the engine_par logs do not have the MPI task number as part of the file name, it sounds like the engine isn’t being launched properly. If you are using a machinefile from the command line, maybe that isn’t getting propagated via the GUI?
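
As a quick way to check this, here is a minimal Python sketch that reads the engine debug log names and reports whether they carry a rank field. The file-name pattern is an assumption based only on the two forms quoted in this thread (A.engine_par.###.5.vlog and A.engine_par.5.vlog), and the helper names mpi_rank and launched_with_ranks are made up for illustration; they are not part of VisIt.

```python
import re

# Assumed pattern, based only on the two file names quoted in this thread:
#   A.engine_par.000.5.vlog   (rank field present)
#   A.engine_par.5.vlog       (rank field missing)
ENGINE_LOG = re.compile(
    r"^(?P<session>[A-Z])\.engine_par\.(?:(?P<rank>\d+)\.)?(?P<level>\d)\.vlog$"
)


def mpi_rank(log_name):
    """Return the rank found in an engine_par log name, or None if absent."""
    match = ENGINE_LOG.match(log_name)
    if match is None:
        raise ValueError(f"not an engine_par debug log name: {log_name!r}")
    rank = match.group("rank")
    return None if rank is None else int(rank)


def launched_with_ranks(log_names):
    """True only if there is at least one log and every log carries a rank."""
    ranks = [mpi_rank(name) for name in log_names]
    return bool(ranks) and all(rank is not None for rank in ranks)
```

Run over the names from the CLI session, launched_with_ranks should return True; run over the names from the GUI session (A.engine_par.5.vlog, B.engine_par.5.vlog), it should return False.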

-Cyrus

On Jul 2, 2014, at 1:06 PM, Semeraro, B David <semeraro at illinois.edu> wrote:

> Hi there,
>  
> I am running VisIt 2.7.2 on a Linux cluster. I can start and run a command-line client from the head node with the command:
>  
> VISIT272/visit2.7.2/src/bin/visit -debug 5 -cli -nowin -np 2 -nn 2 -l mpirun -machinefile /home/semeraro/hostfile
>  
> After that I can issue Python commands and run VisIt just fine. However, when I try to start a parallel session on this cluster from a remote VisIt GUI client, the parallel engine fails to start. The relevant error message from the end of A.vcl.5.log appears to be:
>  
> Sending 395 bytes
> Child 1 needs to be read (desc=12)
> Done reading for child 1
> CHILD OUTPUT[1]: Running: mpirun -np 2 -machinefile /home/semeraro/hostfile /home/semeraro/VISIT272/visit2.7.2/src/exe/engine_par -plugindir /home/semeraro/.visit/2.7.2/linux-x86_64/plugins:/home/semeraro/VISIT272/visit2.7.2/src/plugins -visithome /home/semeraro/VISIT272/visit2.7.2/src -visitarchhome /home/semeraro/VISIT272/visit2.7.2/src/ -dir /home/semeraro/VISIT272/visit2.7.2/src -forcestatic -idle-timeout 480 -debug 5 -dv -noloopback -sshtunneling -host ncsa-rocks.ncsa.illinois.edu -port 38630
>  
> Sending 486 bytes
> Child 1 needs to be read (desc=12)
> Done reading for child 1
> CHILD OUTPUT[1]: terminate called after throwing an instance of 'LostConnectionException'
>  
> Sending 73 bytes
> Child 1 needs to be read (desc=12)
> Done reading for child 1
> CHILD OUTPUT[1]: --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 27280 on node compute-0-0 exited on signal 6 (Aborted).
> --------------------------------------------------------------------------
>  
> Sending 250 bytes
> Child 1 needs to be read (desc=12)
> Lost connection to child 1
> Exception: (LostConnectionException) /home/semeraro/VISIT272/visit2.7.2/src/common/comm/SocketConnection.C, line 253: <The reason for the exception was not described>
> catch(LostConnectionException) /home/semeraro/VISIT272/visit2.7.2/src/launcher/main/LauncherApplication.C:526
> VisIt component launcher exited.
>  
> I do not understand why the engine works fine when driven from the CLI but fails when the GUI is used. I have turned off the firewalls on the remote GUI machine and on the cluster, to no avail. Another strange thing is that the names of the engine debug files differ between the two cases. When run from the CLI, the engine debug files have the form A.engine_par.###.5.vlog, where I assume the ### corresponds to the MPI rank. When run from the GUI there is no ### in the name: I get A.engine_par.5.vlog and B.engine_par.5.vlog.
>  
> I have run out of ideas for things to try. Any help from the user community would be greatly appreciated. There must be something unique about this cluster, as I have no problem compiling and running on our HPC equipment on Blue Waters. This cluster was built using a recent release of Rocks. One other thing: I solved a previous problem by compiling against mpich2 rather than openmpi. The problem I was getting before was a segfault, and I couldn’t even run the CLI in that case. So I am making some progress.
>  
> Thanks,
> Dave Semeraro
