@devreal
Owner

@devreal devreal commented Dec 10, 2025

No description provided.

Signed-off-by: Matthew Whitlock <mwhitlo@sandia.gov>
Signed-off-by: Matthew Whitlock <mwhitlo@sandia.gov>
@github-actions

Hello! The Git Commit Checker CI bot found a few problems with this PR:

b8cba3b: Install ASAN through apt

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

nbellalou and others added 15 commits December 23, 2025 13:52
Signed-off-by: Nathan Bellalou <nbellalou@nvidia.com>
Create two bool variables, opal_single_threaded and
opal_common_ucx_single_threaded, that mimic the behavior of the variables
opal_uses_threads and opal_common_ucx_single_threaded, in order to
propagate the MPI thread level to OPAL while preserving abstraction.
opal_single_threaded is true if and only if the MPI thread level is
MPI_THREAD_SINGLE.

Signed-off-by: Nathan Bellalou <nbellalou@nvidia.com>
The requests being returned to the UCX PML's persistent request list
weren't being properly finalized.

The mpi4py unit tests exercise all kinds of edge cases, like getting
the Fortran handle for a persistent request, and thus triggered a bug
in the UCX PML when OMPI is configured with debug.

Characteristic traceback at finalize prior to this patch is:

python3: ../opal/mca/threads/pthreads/threads_pthreads_mutex.h:86: opal_thread_internal_mutex_lock: Assertion `0 == ret' failed.
[er-head:1179128] *** Process received signal ***
[er-head:1179128] Signal: Aborted (6)
[er-head:1179128] Signal code:  (-6)
[er-head:1179128] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7ffff71edcf0]
[er-head:1179128] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7ffff66daacf]
[er-head:1179128] [ 2] /lib64/libc.so.6(abort+0x127)[0x7ffff66adea5]
[er-head:1179128] [ 3] /lib64/libc.so.6(+0x21d79)[0x7ffff66add79]
[er-head:1179128] [ 4] /lib64/libc.so.6(+0x47426)[0x7ffff66d3426]
[er-head:1179128] [ 5] /home/foobar/ompi/install_it/lib/libopen-pal.so.0(+0x414a2)[0x7ffff1ccb4a2]
[er-head:1179128] [ 6] /home/foobar/ompi/install_it/lib/libopen-pal.so.0(+0x4150d)[0x7ffff1ccb50d]
[er-head:1179128] [ 7] /home/foobar/ompi/install_it/lib/libopen-pal.so.0(opal_pointer_array_set_item+0x7c)[0x7ffff1ccbd40]
[er-head:1179128] [ 8] /home/foobar/ompi/install_it/lib/libmpi.so.0(+0x3a5adb)[0x7ffff21c1adb]
[er-head:1179128] [ 9] /home/foobar/ompi/install_it/lib/libopen-pal.so.0(+0x3a7aa)[0x7ffff1cc47aa]
[er-head:1179128] [10] /home/foobar/ompi/install_it/lib/libopen-pal.so.0(+0x3b34d)[0x7ffff1cc534d]
[er-head:1179128] [11] /home/foobar/ompi/install_it/lib/libmpi.so.0(+0x39e934)[0x7ffff21ba934]
[er-head:1179128] [12] /home/foobar/ompi/install_it/lib/libmpi.so.0(mca_pml_ucx_cleanup+0x314)[0x7ffff21bc96d]
[er-head:1179128] [13] /home/foobar/ompi/install_it/lib/libmpi.so.0(+0x3a79ad)[0x7ffff21c39ad]
[er-head:1179128] [14] /home/foobar/ompi/install_it/lib/libmpi.so.0(+0x39c57e)[0x7ffff21b857e]
[er-head:1179128] [15] /home/foobar/ompi/install_it/lib/libopen-pal.so.0(opal_finalize_cleanup_domain+0x3e)[0x7ffff1cd32fa]
[er-head:1179128] [16] /home/foobar/ompi/install_it/lib/libopen-pal.so.0(opal_finalize+0x56)[0x7ffff1cc1ca0]
[er-head:1179128] [17] /home/foobar/ompi/install_it/lib/libmpi.so.0(ompi_rte_finalize+0x312)[0x7ffff1edaad5]
[er-head:1179128] [18] /home/foobar/ompi/install_it/lib/libmpi.so.0(+0xc4dd8)[0x7ffff1ee0dd8]
[er-head:1179128] [19] /home/foobar/ompi/install_it/lib/libmpi.so.0(ompi_mpi_instance_finalize+0x13a)[0x7ffff1ee1064]
[er-head:1179128] [20] /home/foobar/ompi/install_it/lib/libmpi.so.0(ompi_mpi_finalize+0x5f3)[0x7ffff1ed4c44]
[er-head:1179128] [21] /home/foobar/ompi/install_it/lib/libmpi.so.0(PMPI_Finalize+0x54)[0x7ffff1f29440]

related to open-mpi#13623

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
When we added the MCA_BASE_COMPONENT_INIT() macro to clean up LTO
build issues, we accidentally added a _component to the end of the
component name, breaking the build for any platform that uses the
bsdx_ipv4 component.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Don't search for a .git directory; it might not exist.

Also, remove unnecessary Mercurial and Subversion support; we haven't
used these for years.

Signed-off-by: Jeff Squyres <jeff@squyres.com>
Signed-off-by: Jeff Squyres <jeff@squyres.com>
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Use Intersphinx
(https://www.sphinx-doc.org/en/master/usage/extensions/intersphinx.html)
for making links out to PMIx and PRTE docs.

If we simply always linked against the https/internet PMIx and PRTE
docs, Intersphinx would make this very easy.  But that's not the Open
MPI way!  Instead, we want to support linking against the internal
(embedded) PMIx and PRTE docs when relevant and possible, mainly to
support fully-offline HTML docs (e.g., for those who operate in
not-connected-to-the-internet scenarios).  As such, there are several
cases that need to be handled properly:

1. When building the internal PMIx / PRTE, link to the local instances
   of those docs (vs. the https/internet instance).  Be sure to use
   relative paths (vs. absolute paths) so that the pre-built HTML docs
   that we include in OMPI distribution tarballs work, regardless of
   the --prefix/etc. used at configure time.

   NOTE: When the Open MPI Sphinx docs are built, we have not yet
   installed the PMIx / PRTE docs.  So create our own (fake)
   objects.inv inventory file for where the PMIx / PRTE docs *will* be
   installed so that Intersphinx can do its deep linking properly.  At
   least for now, we only care about deep links for pmix_info(1) and
   prte_info(1), so we can just hard-code those into those inventory
   files and that's good enough.  If the OMPI docs link more deeply
   into the PMIx / PRTE docs someday (i.e., link to a bunch more
   things than just pmix_info(1) / prte_info(1)), we might need to
   revisit this design decision.

2. When building against an external PMIx / PRTE, make a best guess as
   to where their local HTML doc instance may be (namely:
   $project_prefix/share/doc/PROJECT).  Don't try to handle all the
   possibilities -- it just gets even more complicated than this
   already is.  If we can't find it, just link out to the
   https/internet docs.
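
The two cases above, plus the internet fallback, can be sketched roughly as follows. This is only an illustration of the selection logic, not the code from docs/conf.py: the function name, the settings keys, the relative path, the inventory filename, and the placeholder URLs are all assumptions. The only Sphinx-specific fact relied on is that intersphinx_mapping accepts (target, inventory) tuples, where an inventory of None means "fetch objects.inv from the target".

```python
import os

# Placeholder URLs: the commit does not say which internet docs are used.
INTERNET_DOCS = {
    "pmix": "https://example.org/pmix-docs/",
    "prte": "https://example.org/prte-docs/",
}


def pick_docs_target(project, settings, exists=os.path.isdir):
    """Return a (target, inventory) tuple for intersphinx_mapping.

    `settings` is a hypothetical dict of configure-time results.
    """
    # Case 1: internal (embedded) build. Use a relative path plus our own
    # pre-generated inventory, since the real docs are not installed yet.
    if settings.get(f"{project}_internal"):
        return (f"../{project}", f"{project}-objects.inv")

    # Case 2: external build. Best guess: $project_prefix/share/doc/PROJECT.
    prefix = settings.get(f"{project}_prefix")
    if prefix:
        local = os.path.join(prefix, "share", "doc", project)
        if exists(local):
            return (local, None)

    # Fallback: link out to the https/internet docs.
    return (INTERNET_DOCS[project], None)


# Example: PMIx built internally, PRTE with no local docs found.
intersphinx_mapping = {
    name: pick_docs_target(name, {"pmix_internal": True})
    for name in ("pmix", "prte")
}
```

Passing the existence check in as a parameter keeps the sketch easy to exercise without touching the filesystem.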

Other miscellaneous small changes:

* Added another Python module in docs/requirements.txt (for building
  the Sphinx inventory file).
* Use slightly-more-pythonic dict.get() API calls in docs/conf.py for
  simplicity.
* Updated OMPI PRTE submodule pointer to get a prte_info.1.rst label
  update that works for both upstream PRTE and the OMPI PRTE fork.
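
For readers unfamiliar with the inventory file mentioned in the first bullet, here is a minimal sketch of what a hand-built objects.inv looks like. It uses only the Python standard library rather than whichever module was added to docs/requirements.txt (the commit does not name it), and the label name and URI in the usage example are hypothetical. It assumes the version 2 inventory layout: four plain-text header lines followed by zlib-compressed entries of the form `name domain:role priority uri dispname`.

```python
import zlib


def build_inventory(project, version, entries):
    """Return the bytes of a Sphinx version 2 objects.inv file.

    `entries` is an iterable of (name, domain_role, priority, uri, dispname).
    """
    header = (
        "# Sphinx inventory version 2\n"
        f"# Project: {project}\n"
        f"# Version: {version}\n"
        "# The remainder of this file is compressed using zlib.\n"
    ).encode("utf-8")
    body = "".join(
        f"{name} {role} {prio} {uri} {disp}\n"
        for name, role, prio, uri, disp in entries
    ).encode("utf-8")
    return header + zlib.compress(body)


# Hypothetical single entry for pmix_info(1); label and URI are made up.
data = build_inventory(
    "PMIx",
    "0.0",
    [("man1-pmix_info", "std:label", -1, "man/man1/pmix_info.1.html", "pmix_info(1)")],
)
```

Writing `data` to a file named objects.inv and pointing the inventory half of an intersphinx_mapping entry at it would be the remaining step.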

Signed-off-by: Jeff Squyres <jeff@squyres.com>
Per the prior commit, update all OMPI docs RST to properly link to
PMIx and PRTE documentation.

Also added a few mpirun(1) links because they were in the vicinity of
the pmix_info(1) and prte_info(1) links that were being updated.

Signed-off-by: Jeff Squyres <jeff@squyres.com>
…eq_fix

PML/UCX: properly handle persistent req free list items
The default algorithm selections were out of date and not performing well. After gathering data using the ompi-collectives-tuning package, new default algorithm decisions are selected for bcast.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Fixes the dead-code path issues reported by Coverity in bcast and reduce.

Signed-off-by: Nithya V S <Nithya.VS@amd.com>
Signed-off-by: Matthew Whitlock <mwhitlo@sandia.gov>
…ixes

coll/han: Fix null dereference in revoke_local
…odeFix

opal/mca/common/ucx : assert fix -  change thread mode sent to UCX api
docs: update TCP docs + support deep linking into PMIx and PRTE docs
@devreal devreal force-pushed the mpi4py-asan branch 7 times, most recently from c2eeee3 to 49ecbba Compare January 14, 2026 13:48
…y_fix

coll/acoll: Fixes for coverity deadcode issues
bosilca and others added 6 commits January 14, 2026 11:34
coll/tuned: Change the bcast default collective algorithm selection
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Run mpi4py with ASAN, with a separate step that aborts on errors.
The existing steps should run to completion even if an error is detected.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
30 minutes are not enough to run two extra tests so just enable ASAN
for the existing tests. Also test `ompi_info` and `mpicc`.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
This may reduce overhead, although according to
https://github.com/google/sanitizers/wiki/addresssanitizerflags it
should be disabled by default.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>