Change function calls to use vectorcall (#5948)

* Make argument_vector re-usable for other types.

* Attempt to collect args into array for vectorcall

* Revert "Attempt to collect args into array for vectorcall"

This reverts commit 418a034195c5c8b517e7404686555c097efe7a4b.

* Implement vectorcall args collector

* pre-commit fixes

* Checkpoint in moving to METH_FASTCALL

* pre-commit fixes

* Use the names tuple directly, cleaner code and less reference counting

* Fix unit test, the code now holds more references

The code cannot re-use the incoming tuple as before, because the arguments no longer arrive as a tuple at all. A new tuple must be created, which then holds a reference for each member.

* Make clang-tidy happy

* Oops, _v is C++14

* style: pre-commit fixes

* Minor code cleanup

* Fix signed conversions

* Fix args expansion

This would be easier with `if constexpr`

* style: pre-commit fixes

* Code cleanup

* fix(tests): Install multiple-interpreter test modules into wheel

The `mod_per_interpreter_gil`, `mod_shared_interpreter_gil`, and
`mod_per_interpreter_gil_with_singleton` modules were being built
but not installed into the wheel when using scikit-build-core
(SKBUILD=true). This caused iOS (and potentially Android) CIBW
tests to fail with ModuleNotFoundError.

Root cause analysis:
- The main test targets have install() commands (line 531)
- The PYBIND11_MULTIPLE_INTERPRETERS_TEST_MODULES were missing
  equivalent install() commands
- For regular CMake builds, this wasn't a problem because
  LIBRARY_OUTPUT_DIRECTORY places the modules next to pybind11_tests
- For wheel builds, only targets with explicit install() commands
  are included in the wheel

This issue was latent until commit fee2527d changed the test imports
from `pytest.importorskip()` (graceful skip) to direct `import`
statements (hard failure), which exposed the missing modules.

Failing tests:
- test_multiple_interpreters.py::test_independent_subinterpreters
- test_multiple_interpreters.py::test_dependent_subinterpreters

Error: ModuleNotFoundError: No module named 'mod_per_interpreter_gil'

* tests: Pin numpy 2.4.0 for Python 3.14 CI tests

Add numpy==2.4.0 requirement for Python 3.14 (both default and
free-threaded builds). NumPy 2.4.0 is the first version to provide
official PyPI wheels for Python 3.14:

- numpy-2.4.0-cp314-cp314-manylinux_2_27_x86_64...whl (default)
- numpy-2.4.0-cp314-cp314t-manylinux_2_27_x86_64...whl (free-threaded)

Previously, CI was skipping all numpy-dependent tests for Python 3.14
because PIP_ONLY_BINARY was set and no wheels were available:

  SKIPPED [...] test_numpy_array.py:8: could not import 'numpy':
  No module named 'numpy'

With this change, the full numpy test suite will run on Python 3.14,
providing better test coverage for the newest Python version.

Note: Using exact pin (==2.4.0) rather than compatible release (~=2.4.0)
to ensure reproducible CI results with the first known-working version.

* tests: Add verbose flag to CIBW pytest command

Add `-v` to the pytest command in tests/pyproject.toml to help
diagnose hanging tests in CIBW jobs (particularly iOS).

This will show each test name as it runs, making it easier to
identify which specific test is hanging.

* tests: Skip subinterpreter tests on iOS, add pytest timeout

- Add `IOS` platform constant to `tests/env.py` for consistency with
  existing `ANDROID`, `LINUX`, `MACOS`, `WIN`, `FREEBSD` constants.

- Skip `test_multiple_interpreters.py` module on iOS. Subinterpreters
  are not supported in the iOS simulator environment. These tests were
  previously skipped implicitly because the modules weren't installed
  in the wheel; now that they are (commit 6ed6d5a8), we need an
  explicit skip.

- Change pytest timeout from 0 (disabled) to 120 seconds. This provides
  a safety net to catch hanging tests before the CI job times out after
  hours. Normal test runs complete in 33-55 seconds total (~1100 tests),
  so 120 seconds per test is very generous.

- Add `-v` flag for verbose output to help diagnose any future issues.

* More cleanups in argument vector, per comments.

* Per Cursor, use vectorcall on all Python versions, since it has been supported since Python 3.8.

This means getting rid of simple_collector; we can do the same with a `constexpr if` in the unpacking_collector.
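As a minimal, self-contained sketch of that dispatch idea (not pybind11's actual collector — `kwarg` and `collect` are hypothetical stand-ins): a single template can branch at compile time on whether the argument pack contains keyword-style arguments, so a separate simple collector for the positional-only case is no longer needed.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for a keyword-argument marker (pybind11's is py::arg).
struct kwarg {
    std::string name;
    int value;
};

template <typename T>
constexpr bool is_kwarg_v = std::is_same_v<std::decay_t<T>, kwarg>;

// One collector for both cases: if the pack contains no kwargs, take the
// fast positional-only path; otherwise fall through to the general path.
template <typename... Ts>
std::vector<int> collect(Ts &&...args) {
    if constexpr ((is_kwarg_v<Ts> || ...)) {
        // general path: unwrap kwarg markers (toy behavior for the sketch)
        std::vector<int> out;
        auto add = [&out](auto &&a) {
            if constexpr (is_kwarg_v<decltype(a)>) {
                out.push_back(a.value);
            } else {
                out.push_back(a);
            }
        };
        (add(std::forward<Ts>(args)), ...);
        return out;
    } else {
        // fast path: positional arguments only
        return {static_cast<int>(args)...};
    }
}
```

Both branches are instantiated from the same template, so only the taken branch is compiled for each call site — the same effect the two separate collector types achieved, with less code.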

* Switch to a bool vec for the used_kwargs flag...

This makes more sense and saves a sort; with the small_vector implementation it also takes less space than a vector of size_t elements.

The most common case is that all kwargs are used.
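A toy model of the flag approach (hypothetical names, not the pybind11 implementation, and a plain std::vector in place of small_vector): each incoming kwarg gets a parallel bool that is set when the kwarg is consumed, so the unused ones can be read off in their original order without collecting and sorting indices.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using kwargs_t = std::vector<std::pair<std::string, int>>;

// Return the names of kwargs that no expected parameter consumed.
std::vector<std::string> unused_kwargs(const kwargs_t &kwargs,
                                       const std::vector<std::string> &expected) {
    // one flag per incoming kwarg, instead of a list of consumed indices
    std::vector<bool> used(kwargs.size(), false);
    for (const auto &name : expected) {
        for (std::size_t i = 0; i < kwargs.size(); ++i) {
            if (!used[i] && kwargs[i].first == name) {
                used[i] = true;
                break;
            }
        }
    }
    // leftovers fall out in original order; no sort needed
    std::vector<std::string> leftover;
    for (std::size_t i = 0; i < kwargs.size(); ++i) {
        if (!used[i]) {
            leftover.push_back(kwargs[i].first);
        }
    }
    return leftover;
}
```

In the common case where every kwarg is used, the flag vector is written once and the leftover loop finds nothing, which is about as cheap as this check can get.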

* Fix signedness for clang

* Another signedness issue

* tests: Disable pytest-timeout for Pyodide (no signal.setitimer)

Pyodide runs in a WebAssembly sandbox without POSIX signals, so
`signal.setitimer` is not available. This causes pytest-timeout to
crash with `AttributeError: module 'signal' has no attribute 'setitimer'`
when timeout > 0.

Override the test-command for Pyodide to keep timeout=0 (disabled).

* Combine temp storage and args into one vector

It's a good bit faster at the cost of this one scary reinterpret_cast.
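The cast works only because the wrapper type holds exactly one pointer. A mock illustrating that layout assumption (`obj_wrapper` is a hypothetical stand-in for py::object, and `int*` for PyObject*; this coupling is what the later refactor removes):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A wrapper holding exactly one raw pointer, so an array of wrappers can be
// viewed as an array of raw pointers.
struct obj_wrapper {
    int *ptr;
};
static_assert(sizeof(obj_wrapper) == sizeof(int *),
              "the cast relies on identical layout");

int sum_via_cast(const std::vector<obj_wrapper> &v) {
    // the "scary" part: reinterpret the wrapper array as a raw pointer array
    const int *const *raw = reinterpret_cast<const int *const *>(v.data());
    int total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        total += *raw[i];
    }
    return total;
}
```

The payoff is a single allocation serving as both temporary storage and the final argument array; the cost is a silent dependency on the wrapper's internal layout.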

* Phrasing

* Delete incorrect comment

With a capacity of 6, the struct is 144 bytes (not 128 bytes as the comment said).

* Fix push_back

* Update push_back in argument_vector.h

Co-authored-by: Aaron Gokaslan <[email protected]>

* style: pre-commit fixes

* Use real types for these instead of object

They can be null if you "steal" a null handle.

* refactor: Replace small_vector<object> with ref_small_vector for explicit ownership

Introduce `ref_small_vector` to manage PyObject* references in `unpacking_collector`,
replacing the previous `small_vector<object>` approach.

Primary goals:

1. **Maintainability**: The previous implementation relied on
   `sizeof(object) == sizeof(PyObject*)` and used a reinterpret_cast to
   pass the object array to vectorcall. This coupling to py::object's
   internal layout could break if someone adds a debug field or other
   member to py::handle/py::object in the future.

2. **Readability**: The new `push_back_steal()` vs `push_back_borrow()`
   API makes reference counting intent explicit at each call site,
   rather than relying on implicit py::object semantics.

3. **Intuitive code**: Storing `PyObject*` directly and passing it to
   `_PyObject_Vectorcall` without casts is straightforward and matches
   what the C API expects. No "scary" reinterpret_cast needed.

Additional benefits:
- `PyObject*` is trivially copyable, simplifying vector operations
- Batch decref in destructor (tight loop vs N individual object destructors)
- Self-documenting ownership semantics

Design consideration: We considered folding the ref-counting functionality
directly into `small_vector` via template specialization for `PyObject*`.
We decided against this because:
- It would give `small_vector<PyObject*, N>` a different interface than the
  generic `small_vector<T, N>` (steal/borrow vs push_back)
- Someone might want a non-ref-counting `small_vector<PyObject*, N>`
- The specialization behavior could surprise users expecting uniform semantics

A separate `ref_small_vector` type makes the ref-counting behavior explicit
and self-documenting, while keeping `small_vector` generic and predictable.
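A minimal sketch of the steal/borrow API described above, with a mocked refcounted object standing in for the real PyObject (names and the use of std::vector instead of a small-buffer container are illustrative, not the actual pybind11 code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mock PyObject with a visible refcount; the real type lives in Python.h.
struct MockPyObject {
    int refcnt = 1;
};
inline void XINCREF(MockPyObject *o) { if (o) { ++o->refcnt; } }
inline void XDECREF(MockPyObject *o) { if (o) { --o->refcnt; } }

// Stores raw pointers, makes ownership transfer explicit at each call site,
// and releases all held references in one tight loop on destruction.
class ref_small_vector {
public:
    ~ref_small_vector() {
        for (auto *p : items_) {
            XDECREF(p);  // batch decref, no per-element object destructors
        }
    }
    // take over an existing reference (no incref)
    void push_back_steal(MockPyObject *p) { items_.push_back(p); }
    // keep a borrowed reference alive by acquiring our own (incref)
    void push_back_borrow(MockPyObject *p) {
        XINCREF(p);
        items_.push_back(p);
    }
    // raw pointer array, ready to hand to a vectorcall-style API without casts
    MockPyObject *const *data() const { return items_.data(); }
    std::size_t size() const { return items_.size(); }

private:
    std::vector<MockPyObject *> items_;
};
```

The caller states at each push whether the vector is assuming an existing reference or acquiring a new one, which is exactly the intent the `small_vector<object>` version left implicit.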

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ralf W. Grosse-Kunstleve <[email protected]>
Co-authored-by: Aaron Gokaslan <[email protected]>
4 files changed