From 2d7f61af4cc2494a706313079a5fbcc82fea0a8c Mon Sep 17 00:00:00 2001
From: Ricardo Vieira
Date: Thu, 23 Jan 2025 12:42:36 +0100
Subject: [PATCH 1/2] Make BLAS flags check lazy and more actionable

It replaces the old warning, which no longer applies, with a more
informative and actionable one. The old warning was for Ops that might use
the alternative blas_headers, which rely on the NumPy C-API. However,
regular PyTensor users have not relied on these for a while. The only Op
whose C code would use these alternative headers is the GEMM Op, which is
not introduced by the current rewrites. Instead, Dot22 or Dot22Scalar are
introduced, and those refuse to generate C code altogether if the BLAS
flags are missing.
---
 doc/troubleshooting.rst         | 91 +++++++++++++++++++++++----------
 pytensor/link/c/cmodule.py      | 36 +++++++++++--
 pytensor/tensor/blas_headers.py |  9 ++--
 3 files changed, 101 insertions(+), 35 deletions(-)

diff --git a/doc/troubleshooting.rst b/doc/troubleshooting.rst
index 42f5e31e81..6c7ffd3451 100644
--- a/doc/troubleshooting.rst
+++ b/doc/troubleshooting.rst
@@ -145,44 +145,76 @@ How do I configure/test my BLAS library
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 There are many ways to configure BLAS for PyTensor. This is done with the PyTensor
-flags ``blas__ldflags`` (:ref:`libdoc_config`). The default is to use the BLAS
-installation information in NumPy, accessible via
-``numpy.__config__.show()``. You can tell pytensor to use a different
-version of BLAS, in case you did not compile NumPy with a fast BLAS or if NumPy
-was compiled with a static library of BLAS (the latter is not supported in
-PyTensor).
+flags ``blas__ldflags`` (:ref:`libdoc_config`). If not specified, PyTensor will
+attempt to find a local BLAS library to link against, prioritizing specialized implementations.
+The details can be found in :func:`pytensor.link.c.cmodule.default_blas_ldflags`.
 
-The short way to configure the PyTensor flags ``blas__ldflags`` is by setting the
-environment variable :envvar:`PYTENSOR_FLAGS` to ``blas__ldflags=XXX`` (in bash
-``export PYTENSOR_FLAGS=blas__ldflags=XXX``)
+Users can manually set the PyTensor flags ``blas__ldflags`` to link against a
+specific version. This is useful even if the default version is the desired one,
+as it avoids the costly search for the best BLAS library at runtime.
 
-The ``${HOME}/.pytensorrc`` file is the simplest way to set a relatively
-permanent option like this one. Add a ``[blas]`` section with an ``ldflags``
-entry like this:
+The PyTensor flags can be set in a few ways:
+
+1. In the ``${HOME}/.pytensorrc`` file:
 
 .. code-block:: cfg
 
     # other stuff can go here
     [blas]
-    ldflags = -lf77blas -latlas -lgfortran #put your flags here
+    ldflags = -llapack -lblas -lcblas # put your flags here
 
     # other stuff can go here
 
-For more information on the formatting of ``~/.pytensorrc`` and the
-configuration options that you can put there, see :ref:`libdoc_config`.
+2. In bash, before running your script:
+
+.. code-block:: bash
+
+    export PYTENSOR_FLAGS="blas__ldflags='-llapack -lblas -lcblas'"
+
+3. In an IPython/Jupyter notebook, before importing PyTensor:
+
+.. code-block:: python
+
+    %set_env PYTENSOR_FLAGS=blas__ldflags='-llapack -lblas -lcblas'
+
+
+4. In ``pytensor.config`` directly:
+
+.. code-block:: python
+
+    import pytensor
+    pytensor.config.blas__ldflags = '-llapack -lblas -lcblas'
+
+
+(For more information on the formatting of ``~/.pytensorrc`` and the
+configuration options that you can put there, see :ref:`libdoc_config`.)
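+
+Because the automatic search can be slow, you may want to run it once and
+persist the resulting flags. A rough sketch (the flags printed will depend on
+your system):
+
+.. code-block:: python
+
+    from pytensor.link.c.cmodule import default_blas_ldflags
+
+    # Run the (potentially slow) BLAS discovery once and print the flags,
+    # then copy them into ~/.pytensorrc under the [blas] ldflags entry.
+    print(default_blas_ldflags())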
+
+You can find out which BLAS library PyTensor is linking against by checking
+``pytensor.config.blas__ldflags`` or by running
+:func:`pytensor.link.c.cmodule.default_blas_ldflags`.
 
 Here are some different ways to configure BLAS:
 
-0) Do nothing and use the default config, which is to link against the same
-BLAS against which NumPy was built. This does not work in the case NumPy was
-compiled with a static library (e.g. ATLAS is compiled by default only as a
-static library).
+0) Do nothing and use the default config.
+This will usually work well for installations via conda/mamba/pixi (conda-forge channel).
+It will usually fail to link altogether for installations via pip.
 
 1) Disable the usage of BLAS and fall back on NumPy for dot products. To do
-this, set the value of ``blas__ldflags`` as the empty string (ex: ``export
-PYTENSOR_FLAGS=blas__ldflags=``). Depending on the kind of matrix operations your
-PyTensor code performs, this might slow some things down (vs. linking with BLAS
-directly).
+this, set the value of ``blas__ldflags`` to the empty string.
+Depending on the kind of matrix operations your PyTensor code performs,
+this might slow some things down (vs. linking with BLAS directly).
 
 2) You can install the default (reference) version of BLAS if the NumPy version
 (against which PyTensor links) does not work. If you have root or sudo access in
@@ -208,10 +228,29 @@ correctly (for example, for MKL this might be ``-lmkl -lguide -lpthread`` or
 ``-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lguide -liomp5 -lmkl_mc
 -lpthread``).
 
+5) Use another backend, such as Numba or JAX, that performs its own BLAS optimizations,
+by setting the configuration mode to ``"NUMBA"`` or ``"JAX"`` and making sure those packages are installed.
+The configuration mode can be set in any of the ways described above for setting the BLAS flags.
+
+Alternatively, you can pass ``mode='NUMBA'`` when compiling individual PyTensor functions
+without changing the default, or use the ``config.change_flags`` context manager:
+
+.. code-block:: python
+
+    from pytensor import function, config
+    from pytensor.tensor import matrix
+
+    x = matrix('x')
+    y = x @ x.T
+    f = function([x], y, mode='NUMBA')
+
+    with config.change_flags(mode='NUMBA'):
+        # Compile a function that benefits from BLAS, using the NUMBA mode
+        f = function([x], y)
+
 .. note::
 
-    Make sure your BLAS
-    libraries are available as dynamically-loadable libraries.
+    Make sure your BLAS libraries are available as dynamically-loadable libraries.
     ATLAS is often installed only as a static library.  PyTensor is not able to
     use this static library. Your ATLAS installation might need to be modified
     to provide dynamically loadable libraries.  (On Linux this
@@ -267,7 +306,20 @@ configuration information.  Then, it will print the running time of the same
 benchmarks for your installation. Try to find a CPU similar to yours in the
 table, and check that the single-threaded timings are roughly the same.
 
-PyTensor should link to a parallel version of Blas and use all cores
+PyTensor should link to a parallel version of BLAS and use all cores
 when possible. By default it should use all cores. Set the environment
 variable "OMP_NUM_THREADS=N" to specify to use N threads.
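+
+For example, you can limit the number of BLAS threads from Python, provided the
+environment variable is set before PyTensor (and its BLAS library) is first
+imported. A sketch (the exact variable honored depends on how your BLAS was built):
+
+.. code-block:: python
+
+    import os
+
+    # Must be set before the BLAS library is loaded
+    os.environ["OMP_NUM_THREADS"] = "4"
+
+    import pytensor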
diff --git a/pytensor/link/c/cmodule.py b/pytensor/link/c/cmodule.py
index f1f098edbf..ee8e997e96 100644
--- a/pytensor/link/c/cmodule.py
+++ b/pytensor/link/c/cmodule.py
@@ -1982,7 +1982,7 @@ def _try_flags(
     )
 
 
-def try_blas_flag(flags):
+def try_blas_flag(flags) -> str:
     test_code = textwrap.dedent(
         """\
         extern "C" double ddot_(int*, double*, int*, double*, int*);
@@ -2743,12 +2743,31 @@ def check_mkl_openmp():
     )
 
 
-def default_blas_ldflags():
-    """Read local NumPy and MKL build settings and construct `ld` flags from them.
+def default_blas_ldflags() -> str:
+    """Look for an available BLAS implementation in the system.
+
+    This function tries to compile a small C program that calls BLAS,
+    linking against the candidate libraries found in the system.
+    It sequentially tries to link to the following implementations, until one works:
+
+    1. Intel MKL with Intel OpenMP threading
+    2. Intel MKL with GNU OpenMP threading
+    3. LAPACK + BLAS
+    4. BLAS alone
+    5. OpenBLAS
 
     Returns
     -------
-    str
+    blas flags : str
+        BLAS flags needed to link to the BLAS implementation found in the system.
+        If no BLAS implementation is found, an empty string is returned.
+
+    Notes
+    -----
+    This function is triggered the first time `pytensor.config.blas__ldflags` is
+    accessed at runtime, if the user did not specify a value. It can be rather
+    slow, so it is advised to cache its result in the PyTensorRC configuration
+    file or in the PYTENSOR_FLAGS environment variable.
     """
 
 
@@ -2797,7 +2815,7 @@ def get_cxx_library_dirs():
 
     def check_libs(
         all_libs, required_libs, extra_compile_flags=None, cxx_library_dirs=None
-    ):
+    ) -> str:
        if cxx_library_dirs is None:
            cxx_library_dirs = []
        if extra_compile_flags is None:
@@ -2947,6 +2965,14 @@ def check_libs(
     except Exception as e:
         _logger.debug(e)
         _logger.debug("Failed to identify blas ldflags. Will leave them empty.")
+        warnings.warn(
+            "PyTensor could not link to a BLAS installation. Performance of operations that would benefit from BLAS will be severely degraded.\n"
+            "This usually happens when PyTensor is installed via pip. We recommend it be installed via conda/mamba/pixi instead.\n"
+            "Alternatively, you can use an experimental backend such as Numba or JAX that performs its own BLAS optimizations, "
+            "by setting `pytensor.config.mode = 'NUMBA'` or passing `mode='NUMBA'` when compiling a PyTensor function.\n"
+            "For more options and details see https://pytensor.readthedocs.io/en/latest/troubleshooting.html#how-do-i-configure-test-my-blas-library",
+            UserWarning,
+        )
     return ""
 
 
diff --git a/pytensor/tensor/blas_headers.py b/pytensor/tensor/blas_headers.py
index 2806bfc41d..645f04bfb3 100644
--- a/pytensor/tensor/blas_headers.py
+++ b/pytensor/tensor/blas_headers.py
@@ -742,6 +742,14 @@ def blas_header_text():
     blas_code = ""
 
     if not config.blas__ldflags:
+        # This code can only be reached by compiling a function with a manually specified GEMM Op.
+        # Normal PyTensor usage will end up with Dot22 or Dot22Scalar instead,
+        # which opt out of C code completely if the BLAS flags are missing.
+        _logger.warning("Using NumPy C-API based implementation for BLAS functions.")
+
         # Include the Numpy version implementation of [sd]gemm_.
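+        # The replacement [sd]gemm_ is implemented on top of the NumPy C-API
+        # (see c_code/alt_blas_common.h below); it is typically much slower
+        # than an optimized BLAS, hence the warning above.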
         current_filedir = Path(__file__).parent
         blas_common_filepath = current_filedir / "c_code/alt_blas_common.h"
 
@@ -1003,10 +1008,6 @@ def blas_header_text():
     return header + blas_code
 
 
-if not config.blas__ldflags:
-    _logger.warning("Using NumPy C-API based implementation for BLAS functions.")
-
-
 def mkl_threads_text():
     """C header for MKL threads interface"""
     header = """

From b9cc96109808095fcb0ae7944b5a3f6248097edd Mon Sep 17 00:00:00 2001
From: Ricardo Vieira
Date: Thu, 23 Jan 2025 16:04:17 +0100
Subject: [PATCH 2/2] Update default modes doc

---
 doc/library/compile/mode.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/doc/library/compile/mode.rst b/doc/library/compile/mode.rst
index 4a977b7b8c..21c4240f4f 100644
--- a/doc/library/compile/mode.rst
+++ b/doc/library/compile/mode.rst
@@ -20,6 +20,9 @@ PyTensor defines the following modes by name:
 
 - ``'FAST_COMPILE'``: Apply just a few graph rewrites and only use Python implementations.
 - ``'FAST_RUN'``: Apply all rewrites, and use C implementations where possible.
+- ``'NUMBA'``: Apply all relevant rewrites and compile the whole graph using Numba.
+- ``'JAX'``: Apply all relevant rewrites and compile the whole graph using JAX.
+- ``'PYTORCH'``: Apply all relevant rewrites and compile the whole graph using PyTorch compile.
 - ``'DebugMode'``: A mode for debugging. See :ref:`DebugMode ` for details.
 - ``'NanGuardMode'``: :ref:`Nan detector `
 - ``'DEBUG_MODE'``: Deprecated. Use the string DebugMode.
@@ -28,6 +31,12 @@ The default mode is typically ``FAST_RUN``, but it can be controlled via the
 configuration variable :attr:`config.mode`, which can be overridden by passing the
 keyword argument to :func:`pytensor.function`.
 
+For Numba, JAX, and PyTorch, we exclude rewrites that introduce C-only Ops,
+as well as BLAS optimizations, as those are handled automatically by the respective backends.
+
+For JAX we also exclude fusion and inplace rewrites, which JAX does not expose
+at the user level; JAX performs such optimizations automatically itself.
+
 .. TODO::
 
     For a finer level of control over which rewrites are applied, and whether