This repository was archived by the owner on Jan 12, 2026. It is now read-only.
Merged
71 changes: 35 additions & 36 deletions docs/sources/heterogeneous_computing.rst
.. _heterogeneous_computing:
.. include:: ./ext_links.txt

Heterogeneous Computing
=======================

Device Offload
**************

Python is an interpreted language, which implies that most of a Python script runs on the CPU,
and only a few data-parallel regions execute on data-parallel devices.
That is why the concept of host and offload devices is helpful for conceptualizing
a heterogeneous programming model in Python.

.. image:: ./_images/hetero-devices.png

The above diagram illustrates the *host* (the CPU which runs Python interpreter) and three *devices*
(two GPU devices and one attached accelerator device). **Data Parallel Extensions for Python**
offer a programming model where a script executed by the Python interpreter on the host can *offload* data-parallel
kernels to a user-specified device. A *kernel* is a *data-parallel region* of a program submitted
for execution on the device. There can be multiple data-parallel regions, and hence multiple *offload kernels*.

Kernels can be pre-compiled into a library, such as ``dpnp``, or directly coded
in a programming language for heterogeneous computing, such as `OpenCl*`_ or `DPC++`_.
**Data Parallel Extensions for Python** offer a way to write kernels directly in Python
using the `Numba*`_ compiler along with ``numba-dpex``, the `Data Parallel Extension for Numba*`_.

One or more kernels are submitted for execution to a *queue* targeting an *offload device*.
For each device, you can create one or more queues. In most cases, you do not need to work
with device queues directly. Data Parallel Extensions for Python do the necessary underlying
work with queues for you through :ref:`Compute-Follows-Data`.
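
As a small illustration of offloading a pre-compiled kernel, ``dpnp`` lets a script allocate data on a device and run its functions there. A minimal sketch, assuming a system with a supported GPU (the ``device`` keyword accepts a SYCL filter string such as ``"gpu"``):

```python
import dpnp

# Allocate the data on a GPU device; the "gpu" filter string assumes one exists.
x = dpnp.arange(1_000_000, device="gpu")

# The sin kernel is a data-parallel region offloaded to the device where x lives.
y = dpnp.sin(x)

print(y.device)  # the device associated with the allocation queue of x
```

No explicit queue management is needed here; the allocation of ``x`` determines where the kernel runs.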

Unified Shared Memory
*********************

Each device has its own memory, which is not necessarily accessible from another device.

.. image:: ./_images/hetero-devices.png
:width: 600px
:align: center
:alt: SIMD

For example, **Device 1** memory may not be directly accessible from the host, and only accessible
via expensive copying by driver software. Similarly, depending on the architecture, direct data
exchange between **Device 2** and **Device 1** may be impossible, and only possible via expensive
copying through the host memory. These aspects must be taken into consideration when programming
data parallel devices.

In the illustration above, **Device 2** logically consists of two sub-devices: **Sub-Device 1**
and **Sub-Device 2**. The programming model allows accessing **Device 2** as a single logical device or
working with each sub-device individually. In the former case, a programmer creates a queue
for **Device 2**; in the latter case, two queues, one for each sub-device.

The `SYCL*`_ standard introduces the concept of *Unified Shared Memory* (USM). USM requires hardware support
for a unified virtual address space, which allows coherency between host and device
pointers. The host allocates all memory, but offers three distinct allocation types:

* **Host: located on the host, accessible by the host or device.** This type of memory is useful
when you need to stream read-only data from the host to the device once.

* **Device: located on the device, accessible by the device only.** The fastest type of memory,
useful when most of the data crunching happens on the device.

* **Shared: located on both the host and the device (copies are synchronized by underlying software),
accessible by the host and device.** Shared allocations are useful when both the host and device access the data,
since a user does not need to manage data migration explicitly.
However, it is much slower than the USM Device memory type.
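
The three USM allocation types can also be requested explicitly through ``dpctl.memory``. A minimal sketch, assuming a default SYCL device and queue are available on the system:

```python
import dpctl.memory as dpmem

# Allocate 1 KiB of each USM kind on the default queue.
host_buf = dpmem.MemoryUSMHost(1024)      # host-resident, visible to the device
device_buf = dpmem.MemoryUSMDevice(1024)  # device-resident, fastest on the device
shared_buf = dpmem.MemoryUSMShared(1024)  # migrated automatically by the runtime
```

In everyday code you rarely allocate USM buffers by hand; tensor libraries such as ``dpnp`` and ``dpctl.tensor`` do it for you (the ``usm_type`` keyword of their creation functions selects the kind).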

Compute-Follows-Data
********************
Since copying data between devices is typically very expensive, it is essential for performance
to process data close to where it is allocated. This is the premise of the *Compute-Follows-Data* programming model,
which states that the compute happens where the data resides. Tensors implemented in ``dpctl`` and ``dpnp``
carry information about allocation queues, and hence, about the device on which an array is allocated.
Based on the tensor input arguments of an offload kernel, the runtime deduces the queue on which the execution happens.

.. image:: ./_images/kernel-queue-device.png
:width: 600px
:align: center
:alt: SIMD

The picture above illustrates the *Compute-Follows-Data* concept. Arrays ``A`` and ``B`` are inputs to the
**Offload Kernel**. These arrays carry information about their *allocation queue* (**Device Queue**) and the
*device* (**Device 1**) where they were created. According to the Compute-Follows-Data paradigm,
the **Offload Kernel** is submitted to this **Device Queue**, and the resulting array ``C`` is
created on the **Device Queue** associated with **Device 1**.

**Data Parallel Extensions for Python** require all input tensor arguments to have the **same** allocation queue.
Otherwise, an exception is thrown. For example, the following usages result in an exception.

.. figure:: ./_images/queue-exception1.png
:width: 600px
:align: center
:alt: SIMD

.. figure:: ./_images/queue-exception2.png
:width: 600px
:align: center
:alt: SIMD

Input tensors are on the same device, but the queues are different. An exception is thrown.

.. figure:: ./_images/queue-exception3.png
:width: 600px
:align: center
:alt: SIMD

Data belongs to the same device, but queues are different and associated with different sub-devices.
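
The rule these figures illustrate can be sketched in plain Python. This is an illustrative model only, not the actual ``dpctl`` machinery (``dpctl`` exposes a comparable helper, ``dpctl.utils.get_execution_queue``); the ``Tensor``, ``deduce_queue``, and ``ExecutionPlacementError`` names here are hypothetical:

```python
class ExecutionPlacementError(Exception):
    """Raised when input tensors disagree on their allocation queue."""

def deduce_queue(*arrays):
    # Compute-Follows-Data: every input must carry the same allocation
    # queue; the kernel is then submitted to exactly that queue.
    queues = {a.queue for a in arrays}
    if len(queues) != 1:
        raise ExecutionPlacementError(
            "input tensors have different allocation queues"
        )
    return queues.pop()

class Tensor:
    """Stand-in for an array that remembers its allocation queue."""
    def __init__(self, queue):
        self.queue = queue

q1 = "canonical-queue-on-device-1"
print(deduce_queue(Tensor(q1), Tensor(q1)))  # both inputs agree: returns q1
```

Calling ``deduce_queue(Tensor("q1"), Tensor("q2"))`` raises the exception, mirroring the three failing cases pictured above.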

Copying Data Between Devices and Queues
***************************************

**Data Parallel Extensions for Python** create **one** *canonical queue* per device. Normally,
you do not need to manage queues directly. Having one canonical queue per device
allows you to copy data between devices using the ``to_device()`` method:

.. code-block:: python

a_new = a.to_device(b.device)

Array ``a`` is copied to the device associated with array ``b`` into the new array ``a_new``.
The same queue is associated with ``b`` and ``a_new``.

Alternatively, you can do this as follows:

.. code-block:: python

a_new = dpctl.tensor.asarray(a, device=b.device)
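
Putting this together, a short sketch of copying between devices. It assumes both a CPU and a GPU device are present (``dpctl.tensor`` creation functions accept SYCL filter strings for ``device``):

```python
import dpctl.tensor as dpt

a = dpt.arange(10, device="cpu")
b = dpt.ones(10, device="gpu")   # assumes a GPU device is present

# a and b live on different devices, so a + b would raise an exception.
a_new = a.to_device(b.device)    # copy a to the device of b

c = a_new + b                    # valid: both inputs share one allocation queue
```

Because each device has one canonical queue, copying to ``b.device`` is enough to make ``a_new`` and ``b`` compatible under Compute-Follows-Data.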

Creating Additional Queues
**************************

As previously noted, **Data Parallel Extensions for Python** automatically create one canonical queue per device,
and you normally work with this queue implicitly. However, you can always create as many additional queues per device
as needed and work with them explicitly, for example, for profiling purposes.

Read the `Data Parallel Control`_ documentation for more details about queues.
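
As a sketch of the profiling use case, ``dpctl`` can construct an extra queue with the SYCL ``enable_profiling`` property (this assumes a default SYCL device is available):

```python
import dpctl
import dpctl.tensor as dpt

# An additional queue on the default device, with event profiling enabled.
q = dpctl.SyclQueue(property="enable_profiling")

# Arrays allocated with this queue carry it, so, by Compute-Follows-Data,
# kernels operating on them are submitted to the profiling-enabled queue.
x = dpt.asarray([1.0, 2.0, 3.0], sycl_queue=q)
```

Arrays allocated on the canonical queue and on ``q`` are treated as having different allocation queues, even though they reside on the same device.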

6 changes: 3 additions & 3 deletions docs/sources/index.rst
Data Parallel Extensions for Python
===================================

Data Parallel Extensions for Python* extend numerical Python capabilities beyond the CPU and allow even higher performance
gains on data-parallel devices, such as GPUs. They consist of three foundational packages:

* **dpnp** - Data Parallel Extensions for `Numpy*`_ - a library that implements a subset of
Numpy that can be executed on any data-parallel device. The subset is a drop-in replacement
for core Numpy functions and numerical data types.
* **numba_dpex** - Data Parallel Extensions for `Numba*`_ - an extension for the Numba compiler
that lets you program data-parallel devices the same way you program the CPU with Numba.
* **dpctl** - **Data Parallel Control library** - provides utilities for device selection,
allocation of data on devices, a tensor data structure along with a `Python* Array API Standard`_ implementation,
and support for creating user-defined data-parallel extensions.

24 changes: 12 additions & 12 deletions docs/sources/parallelism.rst
.. _parallelism:
.. include:: ./ext_links.txt

Parallelism in Modern Data-Parallel Architectures
=================================================

Python is loved for its productivity and interactivity. But when it comes to dealing with
computationally heavy code, Python performance cannot be compromised. Intel and Python numerical
computing communities, such as `NumFOCUS <https://numfocus.org/>`_, dedicate attention to
optimizing core numerical and data-science packages to leverage the parallelism available in modern CPUs:

* **Multiple computational cores:** Several computational cores allow processing data concurrently.
Compared to a single-core CPU, *N* cores can process either *N* times bigger data in a fixed time or
reduce the computation time *N* times for a fixed amount of data.

.. image:: ./_images/dpep-cores.png
:width: 600px
:align: center
:alt: Multiple CPU Cores

* **SIMD parallelism:** SIMD (Single Instruction, Multiple Data) instructions are a special type of instruction
that performs an operation on vectors of data elements at the same time. The size of the vectors is called the SIMD width.
If the SIMD width is *K*, a SIMD instruction can process *K* data elements in parallel.

In the following diagram, the SIMD width is 2, which means that a single instruction processes two elements simultaneously.
Compared to regular instructions that process one element at a time, a 2-wide SIMD instruction processes
two times more data in a fixed time or, respectively, processes a fixed amount of data two times faster.

.. image:: ./_images/dpep-simd.png
:width: 150px
:align: center
:alt: SIMD

* **Instruction-Level Parallelism:** Modern CISC architectures, such as x86, allow performing data-independent
instructions in parallel. In the following example, we compute :math:`a * b + (c - d)`.
The operations :math:`*` and :math:`-` can be executed in parallel; the last instruction,
:math:`+`, depends on the availability of :math:`a * b` and :math:`c - d` and hence cannot be executed in parallel
with :math:`*` and :math:`-`.

.. image:: ./_images/dpep-ilp.png
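
In everyday Python, these forms of parallelism are mostly exploited for you by compiled libraries. For instance, NumPy's element-wise operations run a compiled loop that the compiler can vectorize with SIMD instructions (whether it actually does depends on the build and the CPU); a small sketch:

```python
import numpy as np

a = np.arange(8, dtype=np.float32)
b = np.arange(8, dtype=np.float32)

# One call processes all elements; the compiled loop underneath can use SIMD
# instructions to handle several float32 lanes per instruction. The two
# data-independent operations * and - can also overlap at the instruction level.
c = a * b + (a - b)

print(c)  # squares of 0..7, since a - b is all zeros here
```
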
30 changes: 15 additions & 15 deletions docs/sources/prerequisites_and_installation.rst

.. |trade| unicode:: U+2122

Prerequisites and Installation
==============================

1. Device Drivers
******************

To start programming data parallel devices beyond the CPU, you need appropriate hardware.
For example, Data Parallel Extensions for Python work fine on Intel |copy| laptops with integrated graphics.
In the majority of cases, your Windows*-based laptop already has all the necessary device drivers installed. But if you want the most
up-to-date driver, you can always
`update it to the latest one <https://www.intel.com/content/www/us/en/download-center/home.html>`_.
Follow device driver installation instructions
to complete this step.
All other necessary components for programming data parallel devices will be installed with
Data Parallel Extensions for Python.

2. Python Interpreter
**********************

You need Python 3.8, 3.9, or 3.10 installed on your system. If you do not have one yet, the easiest
way is to install the `Intel Distribution for Python*`_.
It installs all essential Python numerical and machine
learning packages optimized for Intel hardware, including Data Parallel Extensions for Python*.
If you have a Python installation from another vendor, that is fine too. All you need is to install Data Parallel
Extensions for Python manually, as shown in the next section.

3. Data Parallel Extensions for Python
***************************************

Skip this step if you already installed the Intel |copy| Distribution for Python.

The easiest way to install Data Parallel Extensions for Python is to install ``numba-dpex``:

* Conda: ``conda install numba-dpex``

* Pip: ``pip install numba-dpex``

These commands install ``numba-dpex`` along with its dependencies, including ``dpnp``, ``dpctl``,
and the required compiler runtimes.

.. WARNING::
Before installing with conda or pip, it is strongly advised to update ``conda`` and ``pip`` to their latest versions.