This repository was archived by the owner on Jan 12, 2026. It is now read-only.
Merged
71 changes: 35 additions & 36 deletions docs/sources/heterogeneous_computing.rst
.. _heterogeneous_computing:
.. include:: ./ext_links.txt

Heterogeneous Computing
=======================

Device Offload
**************

Python is an interpreted language, which implies that most of a Python script runs on the CPU,
and only a few data-parallel regions execute on data-parallel devices.
That is why the concept of host and offload devices is helpful for conceptualizing
a heterogeneous programming model in Python.

.. image:: ./_images/hetero-devices.png

The above diagram illustrates the *host* (the CPU which runs Python interpreter) and three *devices*
(two GPU devices and one attached accelerator device). **Data Parallel Extensions for Python**
offer a programming model where a script executed by the Python interpreter on the host can *offload* data-parallel
kernels to a user-specified device. A *kernel* is a *data-parallel region* of a program submitted
for execution on the device. There can be multiple data-parallel regions, and hence multiple *offload kernels*.

Kernels can be pre-compiled into a library, such as ``dpnp``, or directly coded
in a programming language for heterogeneous computing, such as `OpenCl*`_ or `DPC++`_.
**Data Parallel Extensions for Python** offer a way to write kernels directly in Python
using the `Numba*`_ compiler along with ``numba-dpex``, the `Data Parallel Extension for Numba*`_.

One or more kernels are submitted for execution to a *queue* targeting an *offload device*.
For each device, you can create one or more queues. In most cases, you do not need to work
with device queues directly. Data Parallel Extensions for Python do the necessary underlying
work with queues for you through :ref:`Compute-Follows-Data`.
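
As a small illustration of offloading a pre-compiled kernel, ``dpnp`` lets a script allocate data on a device and run its functions there. A minimal sketch, assuming a system with a supported GPU (the ``device`` keyword accepts a SYCL filter string such as ``"gpu"``):

```python
import dpnp

# Allocate the data on a GPU device; the "gpu" filter string assumes one exists.
x = dpnp.arange(1_000_000, device="gpu")

# The sin kernel is a data-parallel region offloaded to the device where x lives.
y = dpnp.sin(x)

print(y.device)  # the device associated with the allocation queue of x
```

No explicit queue management is needed here; the allocation of ``x`` determines where the kernel runs.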

Unified Shared Memory
*********************

Each device has its own memory, which is not necessarily accessible from another device.

.. image:: ./_images/hetero-devices.png
:width: 600px
:align: center
:alt: SIMD

For example, **Device 1** memory may not be directly accessible from the host, and only accessible
via expensive copying by driver software. Similarly, depending on the architecture, direct data
exchange between **Device 2** and **Device 1** may be impossible, and only possible via expensive
copying through the host memory. These aspects must be taken into consideration when programming
data parallel devices.

In the illustration above, **Device 2** logically consists of two sub-devices: **Sub-Device 1**
and **Sub-Device 2**. The programming model allows accessing **Device 2** as a single logical device or
working with each sub-device individually. In the former case, a programmer creates a queue
for **Device 2**; in the latter case, two queues, one for each sub-device.

The `SYCL*`_ standard introduces the concept of *Unified Shared Memory* (USM). USM requires hardware support
for a unified virtual address space, which allows coherency between host and device
pointers. The host allocates all memory, but offers three distinct allocation types:

* **Host: located on the host, accessible by the host or device.** This type of memory is useful
when you need to stream read-only data from the host to the device once.

* **Device: located on the device, accessible by the device only.** The fastest type of memory,
useful when most of the data crunching happens on the device.

* **Shared: located on both the host and the device (copies are synchronized by underlying software),
accessible by the host and device.** Shared allocations are useful when both the host and device access the data,
since a user does not need to manage data migration explicitly.
However, it is much slower than the USM Device memory type.
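
The three USM allocation types can also be requested explicitly through ``dpctl.memory``. A minimal sketch, assuming a default SYCL device and queue are available on the system:

```python
import dpctl.memory as dpmem

# Allocate 1 KiB of each USM kind on the default queue.
host_buf = dpmem.MemoryUSMHost(1024)      # host-resident, visible to the device
device_buf = dpmem.MemoryUSMDevice(1024)  # device-resident, fastest on the device
shared_buf = dpmem.MemoryUSMShared(1024)  # migrated automatically by the runtime
```

In everyday code you rarely allocate USM buffers by hand; tensor libraries such as ``dpnp`` and ``dpctl.tensor`` do it for you (the ``usm_type`` keyword of their creation functions selects the kind).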

Compute-Follows-Data
********************
Since copying data between devices is typically very expensive, it is essential for performance
to process data close to where it is allocated. This is the premise of the *Compute-Follows-Data* programming model,
which states that the compute happens where the data resides. Tensors implemented in ``dpctl`` and ``dpnp``
carry information about allocation queues, and hence, about the device on which an array is allocated.
Based on the tensor input arguments of an offload kernel, the runtime deduces the queue on which the execution happens.

.. image:: ./_images/kernel-queue-device.png
:width: 600px
:align: center
:alt: SIMD

The picture above illustrates the *Compute-Follows-Data* concept. Arrays ``A`` and ``B`` are inputs to the
**Offload Kernel**. These arrays carry information about their *allocation queue* (**Device Queue**) and the
*device* (**Device 1**) where they were created. According to the Compute-Follows-Data paradigm,
the **Offload Kernel** is submitted to this **Device Queue**, and the resulting array ``C`` is
created on the **Device Queue** associated with **Device 1**.

**Data Parallel Extensions for Python** require all input tensor arguments to have the **same** allocation queue.
Otherwise, an exception is thrown. For example, the following usages result in an exception.

.. figure:: ./_images/queue-exception1.png
:width: 600px
:align: center
:alt: SIMD

.. figure:: ./_images/queue-exception2.png
:width: 600px
:align: center
:alt: SIMD

Input tensors are on the same device, but the queues are different. An exception is thrown.

.. figure:: ./_images/queue-exception3.png
:width: 600px
:align: center
:alt: SIMD

Data belongs to the same device, but queues are different and associated with different sub-devices.
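
The rule these figures illustrate can be sketched in plain Python. This is an illustrative model only, not the actual ``dpctl`` machinery (``dpctl`` exposes a comparable helper, ``dpctl.utils.get_execution_queue``); the ``Tensor``, ``deduce_queue``, and ``ExecutionPlacementError`` names here are hypothetical:

```python
class ExecutionPlacementError(Exception):
    """Raised when input tensors disagree on their allocation queue."""

def deduce_queue(*arrays):
    # Compute-Follows-Data: every input must carry the same allocation
    # queue; the kernel is then submitted to exactly that queue.
    queues = {a.queue for a in arrays}
    if len(queues) != 1:
        raise ExecutionPlacementError(
            "input tensors have different allocation queues"
        )
    return queues.pop()

class Tensor:
    """Stand-in for an array that remembers its allocation queue."""
    def __init__(self, queue):
        self.queue = queue

q1 = "canonical-queue-on-device-1"
print(deduce_queue(Tensor(q1), Tensor(q1)))  # both inputs agree: returns q1
```

Calling ``deduce_queue(Tensor("q1"), Tensor("q2"))`` raises the exception, mirroring the three failing cases pictured above.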

Copying Data Between Devices and Queues
***************************************

**Data Parallel Extensions for Python** create **one** *canonical queue* per device. Normally,
you do not need to manage queues directly. Having one canonical queue per device
allows you to copy data between devices using the ``to_device()`` method:

.. code-block:: python

a_new = a.to_device(b.device)

Array ``a`` is copied to the device associated with array ``b`` into the new array ``a_new``.
The same queue is associated with ``b`` and ``a_new``.

Alternatively, you can do this as follows:

.. code-block:: python

a_new = dpctl.tensor.asarray(a, device=b.device)
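
Putting this together, a short sketch of copying between devices. It assumes both a CPU and a GPU device are present (``dpctl.tensor`` creation functions accept SYCL filter strings for ``device``):

```python
import dpctl.tensor as dpt

a = dpt.arange(10, device="cpu")
b = dpt.ones(10, device="gpu")   # assumes a GPU device is present

# a and b live on different devices, so a + b would raise an exception.
a_new = a.to_device(b.device)    # copy a to the device of b

c = a_new + b                    # valid: both inputs share one allocation queue
```

Because each device has one canonical queue, copying to ``b.device`` is enough to make ``a_new`` and ``b`` compatible under Compute-Follows-Data.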

Creating Additional Queues
**************************

As previously noted, **Data Parallel Extensions for Python** automatically create one canonical queue per device,
and you normally work with this queue implicitly. However, you can always create as many additional queues per device
as needed and work with them explicitly, for example, for profiling purposes.

Read the `Data Parallel Control`_ documentation for more details about queues.
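
As a sketch of the profiling use case, ``dpctl`` can construct an extra queue with the SYCL ``enable_profiling`` property (this assumes a default SYCL device is available):

```python
import dpctl
import dpctl.tensor as dpt

# An additional queue on the default device, with event profiling enabled.
q = dpctl.SyclQueue(property="enable_profiling")

# Arrays allocated with this queue carry it, so, by Compute-Follows-Data,
# kernels operating on them are submitted to the profiling-enabled queue.
x = dpt.asarray([1.0, 2.0, 3.0], sycl_queue=q)
```

Arrays allocated on the canonical queue and on ``q`` are treated as having different allocation queues, even though they reside on the same device.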

6 changes: 3 additions & 3 deletions docs/sources/index.rst
Data Parallel Extensions for Python
===================================

Data Parallel Extensions for Python* extend numerical Python capabilities beyond the CPU and allow even higher performance
gains on data-parallel devices, such as GPUs. They consist of three foundational packages:

* **dpnp** - Data Parallel Extensions for `Numpy*`_ - a library that implements a subset of
Numpy that can be executed on any data-parallel device. The subset is a drop-in replacement
for core Numpy functions and numerical data types.
* **numba_dpex** - Data Parallel Extensions for `Numba*`_ - an extension for the Numba compiler
that lets you program data-parallel devices the same way you program the CPU with Numba.
* **dpctl** - **Data Parallel Control library** - provides utilities for device selection,
allocation of data on devices, a tensor data structure along with a `Python* Array API Standard`_ implementation,
and support for creating user-defined data-parallel extensions.

24 changes: 12 additions & 12 deletions docs/sources/parallelism.rst
.. _parallelism:
.. include:: ./ext_links.txt

Parallelism in Modern Data-Parallel Architectures
=================================================

Python is loved for its productivity and interactivity. But when it comes to dealing with
computationally heavy code, Python performance cannot be compromised. Intel and Python numerical
computing communities, such as `NumFOCUS <https://numfocus.org/>`_, dedicate attention to
optimizing core numerical and data-science packages to leverage the parallelism available in modern CPUs:

* **Multiple computational cores:** Several computational cores allow processing data concurrently.
Compared to a single-core CPU, *N* cores can process either *N* times bigger data in a fixed time or
reduce the computation time *N* times for a fixed amount of data.

.. image:: ./_images/dpep-cores.png
:width: 600px
:align: center
:alt: Multiple CPU Cores

* **SIMD parallelism:** SIMD (Single Instruction, Multiple Data) instructions are a special type of instruction
that performs an operation on vectors of data elements at the same time. The size of the vectors is called the SIMD width.
If the SIMD width is *K*, a SIMD instruction can process *K* data elements in parallel.

In the following diagram, the SIMD width is 2, which means that a single instruction processes two elements simultaneously.
Compared to regular instructions that process one element at a time, a 2-wide SIMD instruction processes
two times more data in a fixed time or, respectively, processes a fixed amount of data two times faster.

.. image:: ./_images/dpep-simd.png
:width: 150px
:align: center
:alt: SIMD

* **Instruction-Level Parallelism:** Modern CISC architectures, such as x86, allow performing data-independent
instructions in parallel. In the following example, we compute :math:`a * b + (c - d)`.
The operations :math:`*` and :math:`-` can be executed in parallel; the last instruction,
:math:`+`, depends on the availability of :math:`a * b` and :math:`c - d` and hence cannot be executed in parallel
with :math:`*` and :math:`-`.

.. image:: ./_images/dpep-ilp.png
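
In everyday Python, these forms of parallelism are mostly exploited for you by compiled libraries. For instance, NumPy's element-wise operations run a compiled loop that the compiler can vectorize with SIMD instructions (whether it actually does depends on the build and the CPU); a small sketch:

```python
import numpy as np

a = np.arange(8, dtype=np.float32)
b = np.arange(8, dtype=np.float32)

# One call processes all elements; the compiled loop underneath can use SIMD
# instructions to handle several float32 lanes per instruction. The two
# data-independent operations * and - can also overlap at the instruction level.
c = a * b + (a - b)

print(c)  # squares of 0..7, since a - b is all zeros here
```
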
30 changes: 15 additions & 15 deletions docs/sources/prerequisites_and_installation.rst

.. |trade| unicode:: U+2122

Prerequisites and Installation
==============================

1. Device Drivers
******************

To start programming data parallel devices beyond the CPU, you need appropriate hardware.
For example, Data Parallel Extensions for Python work fine on Intel |copy| laptops with integrated graphics.
In the majority of cases, your Windows*-based laptop already has all the necessary device drivers installed. But if you want the most
up-to-date driver, you can always
`update it to the latest one <https://www.intel.com/content/www/us/en/download-center/home.html>`_.
Follow device driver installation instructions
to complete this step.
All other necessary components for programming data parallel devices will be installed with
Data Parallel Extensions for Python.

2. Python Interpreter
**********************

You need Python 3.8, 3.9, or 3.10 installed on your system. If you do not have one yet, the easiest
way is to install the `Intel Distribution for Python*`_.
It installs all essential Python numerical and machine
learning packages optimized for Intel hardware, including Data Parallel Extensions for Python*.
If you have a Python installation from another vendor, that is fine too. All you need is to install Data Parallel
Extensions for Python manually, as shown in the next section.

3. Data Parallel Extensions for Python
***************************************

Skip this step if you already installed the Intel |copy| Distribution for Python.

The easiest way to install Data Parallel Extensions for Python is to install ``numba-dpex``:

* Conda: ``conda install numba-dpex``

* Pip: ``pip install numba-dpex``

These commands install ``numba-dpex`` along with its dependencies, including ``dpnp``, ``dpctl``,
and the required compiler runtimes.

.. WARNING::
Before installing with conda or pip, it is strongly advised to update ``conda`` and ``pip`` to their latest versions.