Python - high performance computing

Below is a list of Python packages to run computations faster than bare Python.

Packages Based on Python Array Types

NumPy: NumPy is the fundamental package for scientific computing with Python. It provides N-dimensional arrays with comprehensive vectorized operations (mathematical functions, random number generators, linear algebra routines, Fourier transforms) written in low-level C code.
Awkward Array: Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms. It can offload computation to GPUs.
Dask: Dask is a flexible library for parallel computing in Python. It provides dynamic task scheduling and big data collections that scale to datasets larger than memory.
JAX: JAX uses an improved Autograd implementation in combination with XLA to compile and run your Python/NumPy programs on CPUs, GPUs and TPUs. It enables composable function transformations and can differentiate through loops, branches, recursion, closures, and it can take derivatives of derivatives of derivatives.
PyTorch: PyTorch is a tensor computation (like NumPy) library with strong GPU acceleration that enables building deep neural networks on a tape-based autograd system. It includes data structures for multi-dimensional tensors and defines mathematical operations over these tensors, as well as utilities for efficient serializing of Tensors and arbitrary types, efficient compiling of ML models, and other useful utilities.
TensorFlow: TensorFlow is an end-to-end platform for machine learning based on GPU-accelerated tensor computations. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and developers easily build and deploy ML-powered applications. It supports multidimensional-array-based numeric computation (similar to NumPy), GPU and distributed processing, automatic differentiation, and more.
CuPy: CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. It acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms. It is essentially NumPy and SciPy for the GPU. If you want your code to run faster, you can also write a custom CUDA kernel, which requires only a small snippet of C++.
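
As a rough illustration of what "vectorized operations" means for the array packages above, the sketch below computes a root mean square twice: once with an explicit Python loop and once with whole-array NumPy operations. The function names and the sample data are hypothetical and chosen only for this example.

```python
import numpy as np


def rms_loop(values):
    """Root mean square with an explicit Python loop (slow for large inputs)."""
    total = 0.0
    for v in values:
        total += v * v
    return (total / len(values)) ** 0.5


def rms_vectorized(values):
    """Same computation expressed as whole-array NumPy operations."""
    arr = np.asarray(values, dtype=np.float64)
    return float(np.sqrt(np.mean(arr * arr)))


# Hypothetical sample data: the numbers 1..1000 as float64.
data = np.arange(1, 1001, dtype=np.float64)
loop_result = rms_loop(data)
vec_result = rms_vectorized(data)
```

Both functions return the same value up to floating-point rounding. The vectorized version moves the per-element work out of the Python interpreter and into NumPy's compiled routines, which is where the speed-up on large arrays usually comes from.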

Packages for Fast DataFrame-Type Computations

Polars: Polars is a lightweight, fast, multi-threaded, hybrid-streaming DataFrame library written in Rust using the Apache Arrow columnar format. It enables fast out-of-core (larger-than-memory) operations, lazy/eager execution, query optimization and more.
cuDF: cuDF is a Python GPU DataFrame library built on the Apache Arrow columnar memory format with a pandas-like API.
Vaex: Vaex is a high-performance library for lazy out-of-core DataFrames, to visualize and explore big tabular datasets. It can apply operations on an N-dimensional grid up to a billion ( $10^9$ ) objects/rows per second and provides a set of sub-packages for various applications (visualisation, Jupyter integration, data format support, machine learning, etc.).
PySpark: PySpark is an interface for Apache Spark in Python, with support for most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core.
Modin: Modin is a drop-in replacement for pandas to instantly speed up your workflows by scaling pandas so it uses all of your cores. It most likely has the lowest barrier to entry for performance improvements on DataFrame operations: changing the import line is enough.
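
The "changing the import line" claim in the Modin entry can be sketched as follows. This assumes Modin and one of its execution engines are installed. The fallback to plain pandas is an addition for this example only, so that the sketch still runs where Modin is missing, and the column names and data are hypothetical.

```python
# Assumption: Modin and one of its execution engines are installed.
# The fallback to plain pandas only keeps this sketch runnable without Modin.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Hypothetical sample data; everything below is ordinary pandas API.
df = pd.DataFrame({
    "sensor": ["a", "b", "a", "b", "a"],
    "reading": [1.0, 2.0, 3.0, 4.0, 8.0],
})

# Group-wise mean, written exactly as it would be for pandas.
mean_by_sensor = df.groupby("sensor")["reading"].mean()
result = {key: float(value) for key, value in mean_by_sensor.items()}
```

Nothing after the import refers to Modin by name, which is the point of a drop-in replacement. Whether a given workload actually gets faster depends on its size and on the operations used.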

Packages Based on Just-in-time Compilation of Python Code

Numba: Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code. It is designed to be used with NumPy arrays and functions and enables automatic threading, SIMD vectorization and GPU acceleration (CUDA only).
Taichi: Taichi is an open-source, imperative, parallel programming language for high-performance numerical computation embedded in Python. It uses just-in-time (JIT) compiler frameworks to compile compute-intensive Python code to native GPU or CPU instructions. Taichi can seamlessly interoperate with popular Python frameworks, such as NumPy, PyTorch, Matplotlib, and Pillow.
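
To give a feel for JIT compilation of Python code, here is a minimal sketch using Numba's njit decorator on a loop-heavy function over a NumPy array. The function name and data are hypothetical. The ImportError fallback is an assumption added so the example also runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (uncompiled) when Numba is absent.
    def njit(func):
        return func


@njit
def pairwise_min_distance(points):
    """Smallest Euclidean distance between any two rows of an (n, d) array."""
    n, d = points.shape
    best = np.inf
    for i in range(n):
        for j in range(i + 1, n):
            acc = 0.0
            for k in range(d):
                diff = points[i, k] - points[j, k]
                acc += diff * diff
            if acc < best:
                best = acc
    return np.sqrt(best)


# Hypothetical sample data: three points in the plane.
points = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
min_dist = pairwise_min_distance(points)
```

With Numba available, the first call triggers compilation for the argument types it sees, and later calls with the same types reuse the compiled machine code. Nested loops like these are the kind of code that tends to be slow in plain Python and to benefit from this approach.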

Packages Based on Just-in-Time Compilation of Low-Level Code

PyOpenCL: PyOpenCL gives you easy, Pythonic access to the OpenCL parallel computation API. It can build OpenCL kernels and buffers, with support for a NumPy-like array type.
PyCUDA: PyCUDA gives you easy, Pythonic access to NVIDIA’s CUDA parallel computation API. It can build CUDA kernels and buffers, with support for a NumPy-like array type.
cppyy: cppyy is an automatic, run-time, Python-C++ bindings generator, for calling C++ from Python and Python from C++. Run-time generation enables detailed specialization for higher performance, lazy loading for reduced memory use in large scale projects, Python-side cross-inheritance and callbacks for working with C++ frameworks, run-time template instantiation, automatic object downcasting, exception mapping, and interactive exploration of C++ libraries. cppyy delivers this without any language extensions, intermediate languages, or the need for boiler-plate hand-written code.
xobjects: xobjects provides in-memory serialization of structured types with C API generation, and compiles C code at run time using cffi, CuPy, or PyOpenCL under the same API.

Packages to Compile Python Modules Into Optimized Code

Cython: Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language. It makes writing C extensions for Python as easy as Python itself.
Pythran: Pythran is an ahead-of-time compiler for a subset of the Python language, with a focus on scientific computing. It takes a Python module annotated with a few interface descriptions and turns it into a native Python module with the same interface, but (hopefully) faster. It is meant to efficiently compile scientific programs, and takes advantage of multiple cores and SIMD instruction units.
Mypyc: Mypyc compiles Python modules to C extensions. It uses standard Python type hints to generate fast code.
Nuitka: Nuitka is an optimizing Python compiler written in Python that creates executables that run without the need for a separate installer. Data files can either be included or placed alongside.
Codon: Codon is a high-performance Python compiler that compiles Python code to native machine code without any runtime overhead.
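
As a small illustration of the type-hint-driven approach described in the Mypyc entry, the module below is ordinary annotated Python. The file name fib.py and the function are hypothetical and chosen only for this sketch.

```python
# fib.py (hypothetical file name)


def fib(n: int) -> int:
    """Return the n-th Fibonacci number (fib(0) == 0, fib(1) == 1)."""
    a: int = 0
    b: int = 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

The module runs unchanged under the regular interpreter. Assuming mypyc is installed, running "mypyc fib.py" in the same directory should build a C extension that a later "import fib" picks up in place of the .py file. The annotations are what allow the compiler to generate specialized code for the integer arithmetic.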

Packages to Bind Low-Level Code Modules to Python

cffi: C Foreign Function Interface for Python. Interact with almost any C code from Python, based on C-like declarations that you can often copy-paste from header files or documentation.
Pybind11: pybind11 is a lightweight header-only library that exposes C++ types in Python and vice versa, mainly to create Python bindings of existing C++ code.
Nanobind: nanobind is a small binding library that exposes C++ types in Python and vice versa. It is reminiscent of Boost.Python and pybind11 and uses near-identical syntax. In contrast to these existing tools, nanobind is more efficient: bindings compile in a shorter amount of time, producing smaller binaries with better runtime performance.
maturin: maturin is a tool for building and publishing Rust-based Python packages with minimal configuration. It uses PyO3 to create bindings from Python to Rust code.
Nimporter: Nimporter is a package to compile Nim extensions for Python on import automatically. It uses nimpy to compile Nim modules on the fly, so you can import Nim source files as if they were Python modules and use them seamlessly with Python code. Nim compiles fast and reaches C-level speeds.