[visit-developers] parallel hangs
ahern at ornl.gov
Thu May 14 15:15:34 EDT 2009
Mark Miller wrote:
> We've got some weird behavior on the trunk in parallel at the moment.
> When I logged into the test machine at LLNL this morning, I found about 20
> engine_par processes but no viewer/cli. I started a script log and
> attached to 5 or 6 of these engine_par processes with gdb and dumped the
> stack. That file is attached. I left a couple of them running. So, if
> someone wants me to go in and get more detailed information, I can. It
> looks to me like the problem is somewhere in a pick/query operation. I am
> guessing there is a mismatched collective MPI call in there somewhere,
> maybe caused by an early return from some collective function?
> Now, we have timeout logic coded in C++ directly in our engine, and
> I also have shell logic that will kill jobs that run beyond a certain time
> limit. So, I am just darn surprised the jobs were still there when I
> logged in. Oh well, hopefully with the upcoming overhaul, I can improve
> our ability to catch this.
Note in the stack trace that the alarm handler was actually called. But
then it hung while trying to close down MPI. I think your assessment of
a possible MPI mismatch is a fruitful avenue to explore.
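
To illustrate the shutdown side of this, here is a minimal sketch, not the engine's actual timeout code. It assumes a watchdog that fires via SIGALRM while a process is stuck, and it uses plain POSIX calls so it runs without MPI. The names on_alarm, run_stuck_worker, and kWatchdogExitCode are hypothetical. The point is that the handler exits immediately with _exit() instead of attempting an orderly shutdown. MPI_Finalize is collective, so a rank calling it while its peers are stuck in a mismatched collective can block as well. In a real MPI program, MPI_Abort is the non-collective way to bring a job down, though calling it from a signal handler is not guaranteed to be safe.

```cpp
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical exit code marking "terminated by our own watchdog".
constexpr int kWatchdogExitCode = 42;

// Signal handler: only async-signal-safe calls belong here.
// _exit() terminates at once, without atexit handlers or any
// collective shutdown (the analogue of skipping MPI_Finalize).
extern "C" void on_alarm(int) {
    _exit(kWatchdogExitCode);
}

// Forks a child that arms the watchdog and then blocks forever,
// standing in for a rank stuck in a mismatched collective.
// Returns the child's exit code, or -1 if it did not exit normally.
int run_stuck_worker(unsigned timeout_seconds) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        std::signal(SIGALRM, on_alarm);
        alarm(timeout_seconds);
        for (;;) pause();  // simulated hang; never returns
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_stuck_worker(1) should return kWatchdogExitCode after about a second, showing that the stuck process goes away even though it never reaches a clean shutdown path.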
Oak Ridge National Laboratory