diff --git a/docs/sources/01-get_started.ipynb b/docs/sources/01-get_started.ipynb new file mode 100644 index 0000000..7b4ec35 --- /dev/null +++ b/docs/sources/01-get_started.ipynb @@ -0,0 +1,581 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a002ea61", + "metadata": {}, + "source": [ + "# Getting Started" + ] + }, + { + "cell_type": "markdown", + "id": "905c742a", + "metadata": {}, + "source": [ + "# A Few Line Changes to Run on GPU" + ] + }, + { + "cell_type": "markdown", + "id": "41bf2099", + "metadata": {}, + "source": [ + "Let's look at an example of how you can switch computations from the CPU to a GPU device by changing just a few lines of code.\n", + "\n", + "First, look at the original example.\n", + "We allocate two matrices on the host (CPU) device using the NumPy array function; all subsequent calculations are also performed on the host (CPU) device." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d9e8711d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "res = [[2 2]\n", + " [2 2]]\n" + ] + } + ], + "source": [ + "# Original CPU script\n", + "\n", + "# Import the NumPy library\n", + "import numpy as np\n", + "\n", + "# Data allocated on the CPU device\n", + "x = np.array([[1, 1], [1, 1]])\n", + "y = np.array([[1, 1], [1, 1]])\n", + "\n", + "# Compute performed on the CPU device, where the data is allocated\n", + "res = np.matmul(x, y)\n", + "\n", + "print(\"res =\", res)" + ] + }, + { + "cell_type": "markdown", + "id": "402c3d61", + "metadata": {}, + "source": [ + "Now let's modify the code so that all calculations occur on the GPU device.\n", + "To do that, you only need to switch to the dpnp library and look at the result."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "c10b7d83", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Array x is located on the device: Device(level_zero:gpu:0)\n", + "Array y is located on the device: Device(level_zero:gpu:0)\n", + "res is located on the device: Device(level_zero:gpu:0)\n", + "res = [[2 2]\n", + " [2 2]]\n" + ] + } + ], + "source": [ + "# Modified XPU script\n", + "\n", + "# Drop-in replacement via a single-line change\n", + "import dpnp as np\n", + "\n", + "# Data allocated on the default SYCL device\n", + "x = np.array([[1, 1], [1, 1]])\n", + "y = np.array([[1, 1], [1, 1]])\n", + "\n", + "# Compute performed on the device where the data is allocated\n", + "res = np.matmul(x, y)\n", + "\n", + "print(\"Array x is located on the device:\", x.device)\n", + "print(\"Array y is located on the device:\", y.device)\n", + "print(\"res is located on the device:\", res.device)\n", + "print(\"res =\", res)" + ] + }, + { + "cell_type": "markdown", + "id": "d6a58f17", + "metadata": {}, + "source": [ + "As you can see, changing only one line of code lets us perform all calculations on the GPU device.\n", + "In this example, np.array() creates an array on the default SYCL* device, which is \"gpu\" on systems with an integrated or discrete GPU (and \"host\" on systems without a GPU). The queue associated with this array is carried with x and y, so np.matmul(x, y) computes the matrix product of the two arrays by submitting the respective pre-compiled kernel implementing np.matmul() to that queue. The result res is also allocated on a device array associated with that queue.\n", + "\n", + "Now let's make a few improvements to the code and see how we can control the exact device on which the calculations are performed and which USM memory type is used."
+ ] + }, + { + "cell_type": "markdown", + "id": "be340585", + "metadata": {}, + "source": [ + "# dpnp simple examples with popular functions" + ] + }, + { + "cell_type": "markdown", + "id": "26d8c3c6", + "metadata": {}, + "source": [ + "1. Example that returns an array with evenly spaced values within a given interval." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "0435d7f0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "282 µs ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n", + "Result a is located on the device: Device(level_zero:gpu:0)\n", + "a = [ 3 9 15 21 27]\n" + ] + } + ], + "source": [ + "import dpnp as np\n", + "\n", + "# Create an array of values from 3 up to 30 with step 6\n", + "a = np.arange(3, 30, step=6)\n", + "\n", + "print(\"Result a is located on the device:\", a.device)\n", + "print(\"a =\", a)" + ] + }, + { + "cell_type": "markdown", + "id": "a6765095", + "metadata": {}, + "source": [ + "In this example, np.arange() creates an array on the default SYCL* device, which is \"gpu\" on systems with an integrated or discrete GPU (and \"host\" on systems without a GPU)." + ] + }, + { + "cell_type": "markdown", + "id": "35081461", + "metadata": {}, + "source": [ + "2. 
Example that calculates the sum of the array elements on the GPU" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e613398f", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Result x is located on the device: Device(level_zero:gpu:0)\n", + "Result y is located on the device: Device(level_zero:gpu:0)\n", + "The sum of the array elements is: 6\n" + ] + } + ], + "source": [ + "import dpnp as np\n", + "\n", + "# Fallback allocation in case no GPU device is available\n", + "x = np.empty(3)\n", + "\n", + "try:\n", + " # Using filter selector strings to specify the root device for a new array\n", + " x = np.asarray([1, 2, 3], device=\"gpu\")\n", + " print(\"Result x is located on the device:\", x.device)\n", + "except Exception:\n", + " print(\"GPU device is not available\")\n", + "\n", + "# Return the sum of the array elements\n", + "y = np.sum(x) # Expect 6\n", + "\n", + "print(\"Result y is located on the device:\", y.device)\n", + "print(\"The sum of the array elements is:\", y)" + ] + }, + { + "cell_type": "markdown", + "id": "88b4915e", + "metadata": {}, + "source": [ + "In this example, np.asarray() creates an array on the default GPU device. The queue associated with this array is carried with x, so np.sum(x) derives the queue from x and submits the respective pre-compiled kernel implementing np.sum() to it. The result y, a 0-dimensional device array, is also allocated on the device associated with that queue." + ] + }, + { + "cell_type": "markdown", + "id": "b49f4c62", + "metadata": {}, + "source": [ + "3. 
Example of element-wise bitwise inversion of an array" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "4b53afed", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Array a is located on the device: Device(level_zero:gpu:0)\n", + "Result x is located on the device: Device(level_zero:gpu:0)\n", + "Array x is: [[-2 -2]\n", + " [-3 -2]\n", + " [-2 -1]\n", + " [ 0 -1]]\n" + ] + } + ], + "source": [ + "import dpnp as np\n", + "\n", + "try:\n", + " # Using filter selector strings to specify the root device for an array\n", + " a = np.array([[1, 1], [2, 1], [1, 0], [-1, 0]], device=\"gpu\")\n", + " print(\"Array a is located on the device:\", a.device)\n", + "\n", + " # Element-wise bitwise inversion of the array \"a\"\n", + " x = np.invert(a)\n", + "\n", + " print(\"Result x is located on the device:\", x.device)\n", + " print(\"Array x is:\", x)\n", + "\n", + "except Exception:\n", + " print(\"GPU device is not available\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "e10f226a", + "metadata": {}, + "source": [ + "In this example, np.array() creates an array on the default GPU device. The queue associated with this array is carried with a, so np.invert(a) derives the queue from a and submits the respective pre-compiled kernel implementing np.invert() to it. The result x is also allocated on a device array associated with that queue." + ] + }, + { + "cell_type": "markdown", + "id": "94cb3b9b", + "metadata": {}, + "source": [ + "# dpctl simple examples" + ] + }, + { + "cell_type": "markdown", + "id": "9fb4b1b5", + "metadata": {}, + "source": [ + "Below is a list of simple examples that show how to find out which devices are available in the system and how to operate with them.\n", + "Let's print the list of all available SYCL platforms."
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "b890ae71", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Intel(R) OpenCL HD Graphics OpenCL 3.0 \n", + "Intel(R) FPGA Emulation Platform for OpenCL(TM) OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3\n", + "Intel(R) OpenCL OpenCL 3.0 WINDOWS\n", + "Intel(R) Level-Zero 1.3\n" + ] + } + ], + "source": [ + "# See the list of available SYCL platforms and extra metadata about each platform.\n", + "import dpctl\n", + "\n", + "dpctl.lsplatform() # Print platform information" + ] + }, + { + "cell_type": "markdown", + "id": "db5e8db4", + "metadata": {}, + "source": [ + "Let's look at the output.\n", + "On my system, an OpenCL GPU driver, an Intel(R) FPGA Emulation Device, an OpenCL CPU driver, and a Level Zero GPU driver are available.\n", + "If I increase the verbosity parameter, I can get more information about the devices I have." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "ffcf0cfb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Platform 0 ::\n", + " Name Intel(R) OpenCL HD Graphics\n", + " Version OpenCL 3.0 \n", + " Vendor Intel(R) Corporation\n", + " Backend opencl\n", + " Num Devices 1\n", + " # 0\n", + " Name Intel(R) Iris(R) Xe Graphics\n", + " Version 31.0.101.3430\n", + " Filter string opencl:gpu:0\n", + "Platform 1 ::\n", + " Name Intel(R) FPGA Emulation Platform for OpenCL(TM)\n", + " Version OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3\n", + " Vendor Intel(R) Corporation\n", + " Backend opencl\n", + " Num Devices 1\n", + " # 0\n", + " Name Intel(R) FPGA Emulation Device\n", + " Version 2022.15.11.0.18_160000\n", + " Filter string opencl:accelerator:0\n", + "Platform 2 ::\n", + " Name Intel(R) OpenCL\n", + " Version OpenCL 3.0 WINDOWS\n", + " Vendor Intel(R) Corporation\n", + " Backend opencl\n", + " Num Devices 1\n", + " # 0\n", + " Name 11th Gen Intel(R) 
Core(TM) i7-1185G7 @ 3.00GHz\n", + " Version 2022.15.11.0.18_160000\n", + " Filter string opencl:cpu:0\n", + "Platform 3 ::\n", + " Name Intel(R) Level-Zero\n", + " Version 1.3\n", + " Vendor Intel(R) Corporation\n", + " Backend ext_oneapi_level_zero\n", + " Num Devices 1\n", + " # 0\n", + " Name Intel(R) Iris(R) Xe Graphics\n", + " Version 1.3.23904\n", + " Filter string level_zero:gpu:0\n" + ] + } + ], + "source": [ + "# See the list of available SYCL platforms and extra metadata about each platform.\n", + "import dpctl\n", + "\n", + "dpctl.lsplatform(2) # Print platform information with verbosity level 2 (the highest level)" + ] + }, + { + "cell_type": "markdown", + "id": "f63525d3", + "metadata": {}, + "source": [ + "Having information about the available SYCL platforms, you can specify which type of device you want to work with." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "84ff47e9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[, ]\n" + ] + } + ], + "source": [ + "# See the list of available GPU devices and their extra metadata.\n", + "import dpctl\n", + "\n", + "if dpctl.has_gpu_devices():\n", + " print(dpctl.get_devices(device_type='gpu'))\n", + "else:\n", + " print(\"GPU device is not available\")" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "a93e7cf8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[]\n" + ] + } + ], + "source": [ + "# See the list of available CPU devices and their extra metadata.\n", + "import dpctl\n", + "\n", + "if dpctl.has_cpu_devices():\n", + " print(dpctl.get_devices(device_type='cpu'))\n", + "else:\n", + " print(\"CPU device is not available\")" + ] + }, + { + "cell_type": "markdown", + "id": "efcc3f45", + "metadata": {}, + "source": [ + "And you can select a specific device in your system using the default selector." + ] + }, + { + "cell_type": "code", + "execution_count": 
9, + "id": "1c068447", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Name Intel(R) Iris(R) Xe Graphics\n", + " Driver version 1.3.23904\n", + " Vendor Intel(R) Corporation\n", + " Filter string level_zero:gpu:0\n", + "\n" + ] + } + ], + "source": [ + "import dpctl\n", + "\n", + "try:\n", + " # Create a SyclDevice of type GPU based on whatever is returned\n", + " # by the SYCL `gpu_selector` device selector class.\n", + " gpu = dpctl.select_gpu_device()\n", + " gpu.print_device_info() # Print GPU device information\n", + "\n", + "except Exception:\n", + " print(\"GPU device is not available\")" + ] + }, + { + "cell_type": "markdown", + "id": "c378b79d", + "metadata": {}, + "source": [ + "Or, using the information in the device's filter string, create an explicit SyclDevice." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "ad83abb5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Name Intel(R) Iris(R) Xe Graphics\n", + " Driver version 1.3.23904\n", + " Vendor Intel(R) Corporation\n", + " Profile FULL_PROFILE\n", + " Filter string level_zero:gpu:0\n", + "\n" + ] + } + ], + "source": [ + "import dpctl\n", + "\n", + "# Create a SyclDevice with an explicit filter string,\n", + "# in this case the first level_zero GPU device.\n", + "try:\n", + " level_zero_gpu = dpctl.SyclDevice(\"level_zero:gpu:0\")\n", + " level_zero_gpu.print_device_info()\n", + "except Exception:\n", + " print(\"The first level_zero GPU device is not available\")" + ] + }, + { + "cell_type": "markdown", + "id": "eadefe0b", + "metadata": {}, + "source": [ + "Let's check whether your GPU device supports double precision. 
To do this, we select the GPU device and check its has_aspect_fp64 attribute:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "a94756d7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Name Intel(R) Iris(R) Xe Graphics\n", + " Driver version 1.3.23904\n", + " Vendor Intel(R) Corporation\n", + " Filter string level_zero:gpu:0\n", + "\n", + "Double precision support is False\n" + ] + } + ], + "source": [ + "import dpctl\n", + "\n", + "# Select the GPU device and check double precision support\n", + "try:\n", + " gpu = dpctl.select_gpu_device()\n", + " gpu.print_device_info()\n", + " print(\"Double precision support is\", gpu.has_aspect_fp64)\n", + "except Exception:\n", + " print(\"The GPU device is not available\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/sources/02-dpnp_numpy_fallback.ipynb b/docs/sources/02-dpnp_numpy_fallback.ipynb new file mode 100644 index 0000000..ca6bcf5 --- /dev/null +++ b/docs/sources/02-dpnp_numpy_fallback.ipynb @@ -0,0 +1,140 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "be340585", + "metadata": {}, + "source": [ + "# Usage of NumPy functions in the dpnp library" + ] + }, + { + "cell_type": "markdown", + "id": "9008dbe5", + "metadata": {}, + "source": [ + "1. 
Example of using the `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK` environment variable with a dpnp function that is not yet implemented and would fall back to NumPy" + ] + }, + { + "cell_type": "markdown", + "id": "5d4300fe", + "metadata": {}, + "source": [ + "Not all functions have been implemented in the dpnp library yet; some of them require enabling a direct fallback to the NumPy library.\n", + "One example is the function `dpnp.full()` with the `like` parameter set to a non-default value.\n", + "Let's look at an example where we want to create a two-dimensional array filled with a single value, using the array-like option." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "c78511cd", + "metadata": {}, + "outputs": [ + { + "ename": "NotImplementedError", + "evalue": "Requested funtion=full with args=((2, 2), 3, None, 'C') and kwargs={'like': } isn't currently supported and would fall back on NumPy implementation. Define enviroment variable `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK` to `0` if the fall back is required to be supported without rasing an exception.", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mNotImplementedError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[1;32mIn[1], line 4\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mdpnp\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mnp\u001b[39;00m\n\u001b[0;32m 3\u001b[0m \u001b[38;5;66;03m# Create an two dimencial array with singular element and array like option\u001b[39;00m\n\u001b[1;32m----> 4\u001b[0m a \u001b[38;5;241m=\u001b[39m 
\u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfull\u001b[49m\u001b[43m(\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m,\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[38;5;241;43m3\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlike\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mzeros\u001b[49m\u001b[43m(\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 5\u001b[0m \u001b[38;5;28mprint\u001b[39m (\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mArray a is located on the device:\u001b[39m\u001b[38;5;124m\"\u001b[39m, a\u001b[38;5;241m.\u001b[39mdevice)\n\u001b[0;32m 6\u001b[0m \u001b[38;5;28mprint\u001b[39m (\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mArray a is\u001b[39m\u001b[38;5;124m\"\u001b[39m, a)\n", + "File \u001b[1;32m~\\Anaconda3\\envs\\my_env\\lib\\site-packages\\dpnp\\dpnp_iface_arraycreation.py:732\u001b[0m, in \u001b[0;36mfull\u001b[1;34m(shape, fill_value, dtype, order, like, device, usm_type, sycl_queue)\u001b[0m\n\u001b[0;32m 723\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 724\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m dpnp_container\u001b[38;5;241m.\u001b[39mfull(shape,\n\u001b[0;32m 725\u001b[0m fill_value,\n\u001b[0;32m 726\u001b[0m dtype\u001b[38;5;241m=\u001b[39mdtype,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 729\u001b[0m usm_type\u001b[38;5;241m=\u001b[39musm_type,\n\u001b[0;32m 730\u001b[0m sycl_queue\u001b[38;5;241m=\u001b[39msycl_queue)\n\u001b[1;32m--> 732\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m 
\u001b[43mcall_origin\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnumpy\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfull\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mshape\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mfill_value\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43morder\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlike\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlike\u001b[49m\u001b[43m)\u001b[49m\n", + "File \u001b[1;32mdpnp\\dpnp_utils\\dpnp_algo_utils.pyx:132\u001b[0m, in \u001b[0;36mdpnp.dpnp_utils.dpnp_algo_utils.call_origin\u001b[1;34m()\u001b[0m\n", + "\u001b[1;31mNotImplementedError\u001b[0m: Requested funtion=full with args=((2, 2), 3, None, 'C') and kwargs={'like': } isn't currently supported and would fall back on NumPy implementation. Define enviroment variable `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK` to `0` if the fall back is required to be supported without rasing an exception." + ] + } + ], + "source": [ + "import dpnp as np\n", + "\n", + "# Create a two-dimensional array with a single fill value and the array-like option\n", + "a = np.full((2, 2), 3, like=np.zeros((2, 0)))\n", + "print(\"Array a is located on the device:\", a.device)\n", + "print(\"Array a is\", a)" + ] + }, + { + "cell_type": "markdown", + "id": "ae026021", + "metadata": {}, + "source": [ + "As you can see, the function `dpnp.full()` with a non-default `like` parameter is not implemented in the dpnp library.\n", + "We got the following error message: \"Requested funtion=full with args=((2, 2), 3, None, 'C') and kwargs={'like': } isn't currently supported and would fall back on NumPy implementation. 
Define environment variable `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK` to `0` if the fall back is required to be supported without raising an exception.\"\n", + "By default, the `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK` environment variable is not null, which allows only the dpnp library to be used.\n", + "If we want to call the NumPy library for functions not supported in dpnp, we need to set this environment variable to `0`." + ] + }, + { + "cell_type": "markdown", + "id": "210440c3", + "metadata": {}, + "source": [ + "Let's rerun the same example with the `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK = 0` environment variable." + ] + }, + { + "cell_type": "markdown", + "id": "bca7490d", + "metadata": {}, + "source": [ + "`Note:` If you are working in a Jupyter Notebook, restart the kernel before running the example that sets `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK = 0`." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "05ad1565", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK = 0\n", + "Array a is located on the device: Device(level_zero:gpu:0)\n", + "Array a is [[3 3]\n", + " [3 3]]\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "# Set to 0 before importing dpnp to enable the fallback to NumPy;\n", + "# by default the variable is not null and only dpnp implementations are used\n", + "os.environ[\"DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK\"] = \"0\"\n", + "\n", + "import dpnp as np\n", + "\n", + "# Expect result 0\n", + "print(\"DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK =\", np.config.__DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK__)\n", + "\n", + "# Create a two-dimensional array with a single fill value and the array-like option\n", + "a = np.full((2, 2), 3, like=np.zeros((2, 0)))\n", + "print(\"Array a is located on the device:\", a.device)\n", + "print(\"Array a is\", a)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", 
"name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/sources/conf.py b/docs/sources/conf.py index c2255e0..ce78bee 100644 --- a/docs/sources/conf.py +++ b/docs/sources/conf.py @@ -1,118 +1,121 @@ -# ***************************************************************************** -# Copyright (c) 2022, Intel Corporation All rights reserved. -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# Redistributions in binary form must reproduce the above copyright notice, -# this list of conditions and the following disclaimer in the documentation -# and/or other materials provided with the distribution. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, -# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR -# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; -# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, -# WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR -# OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, -# EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-# ***************************************************************************** - -# coding: utf-8 -# Configuration file for the Sphinx documentation builder. - -# -- Project information ----------------------------------------------------- - -project = 'Data Parallel Extensions for Python*' -copyright = '2022, Intel Corporation' -author = 'Intel Corporation' - -# The full version, including alpha/beta/rc tags -release = '0.1' - -# -- General configuration ---------------------------------------------------- - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [ - 'sphinx.ext.todo', - 'sphinx.ext.intersphinx', - 'sphinx.ext.extlinks', - 'sphinx.ext.githubpages', - 'sphinx.ext.napoleon', - 'sphinx.ext.autosectionlabel', - 'sphinxcontrib.programoutput', -] - - -# Add any paths that contain templates here, relative to this directory. -#templates_path = ['_templates'] -templates_path = [] - -# List of patterns, relative to source directory, that match files and -# directories to ignore when looking for source files. -# This pattern also affects html_static_path and html_extra_path. -exclude_patterns = [] - - -# -- Options for HTML output ------------------------------------------------- - -# The theme to use for HTML and HTML Help pages. See the documentation for -# a list of builtin themes. -# -html_theme = 'sdc-sphinx-theme' - -html_theme_path = ['.'] - -html_theme_options = { -} - -# Add any paths that contain custom static files (such as style sheets) here, -# relative to this directory. They are copied after the builtin static files, -# so a file named "default.css" will overwrite the builtin "default.css". 
-html_static_path = [] - -html_sidebars = { - '**': ['globaltoc.html', 'sourcelink.html', 'searchbox.html', 'relations.html'], -} - -html_show_sourcelink = False - -# -- Todo extension configuration ---------------------------------------------- -todo_include_todos = True -todo_link_only = True - -# -- InterSphinx configuration: looks for objects in external projects ----- -# Add here external classes you want to link from Intel SDC documentation -# Each entry of the dictionary has the following format: -# 'class name': ('link to object.inv file for that class', None) -#intersphinx_mapping = { -# 'pandas': ('https://pandas.pydata.org/pandas-docs/stable/', None), -# 'python': ('http://docs.python.org/2', None), -# 'numpy': ('http://docs.scipy.org/doc/numpy', None) -#} -intersphinx_mapping = { -} - -# -- Napoleon extension configuration (Numpy and Google docstring options) ------- -#napoleon_google_docstring = True -#napoleon_numpy_docstring = True -#napoleon_include_init_with_doc = True -#napoleon_include_private_with_doc = True -#napoleon_include_special_with_doc = True -#napoleon_use_admonition_for_examples = False -#napoleon_use_admonition_for_notes = False -#napoleon_use_admonition_for_references = False -#napoleon_use_ivar = False -#napoleon_use_param = True -#napoleon_use_rtype = True - -# -- Prepend module name to an object name or not ----------------------------------- -add_module_names = False +# ***************************************************************************** +# Copyright (c) 2022, Intel Corporation All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# +# Redistributions of source code must retain the above copyright notice, +# this list of conditions and the following disclaimer. 
+# +# Redistributions in binary form must reproduce the above copyright notice, +# this list of conditions and the following disclaimer in the documentation +# and/or other materials provided with the distribution. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, +# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; +# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, +# WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR +# OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, +# EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +# ***************************************************************************** + +# coding: utf-8 +# Configuration file for the Sphinx documentation builder. + +# -- Project information ----------------------------------------------------- + +project = 'Data Parallel Extensions for Python*' +copyright = '2022, Intel Corporation' +author = 'Intel Corporation' + +# The full version, including alpha/beta/rc tags +release = '0.1' + +# -- General configuration ---------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. 
+extensions = [ + 'sphinx.ext.todo', + 'sphinx.ext.intersphinx', + 'sphinx.ext.extlinks', + 'sphinx.ext.githubpages', + 'sphinx.ext.napoleon', + 'sphinx.ext.autosectionlabel', + 'sphinxcontrib.programoutput', + 'nbsphinx', + 'sphinx_gallery.load_style', +] + + +# Add any paths that contain templates here, relative to this directory. +#templates_path = ['_templates'] +templates_path = [] + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path. +exclude_patterns = [] + + +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +html_theme = 'sdc-sphinx-theme' + +html_theme_path = ['.'] + +html_theme_options = { +} + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". 
+html_static_path = [] + +html_sidebars = { + '**': ['globaltoc.html', 'sourcelink.html', 'searchbox.html', 'relations.html'], +} + +html_show_sourcelink = False + +# -- Todo extension configuration ---------------------------------------------- +todo_include_todos = True +todo_link_only = True + +# -- InterSphinx configuration: looks for objects in external projects ----- +# Add here external classes you want to link from Intel SDC documentation +# Each entry of the dictionary has the following format: +# 'class name': ('link to object.inv file for that class', None) +#intersphinx_mapping = { +# 'pandas': ('https://pandas.pydata.org/pandas-docs/stable/', None), +# 'python': ('http://docs.python.org/2', None), +# 'numpy': ('http://docs.scipy.org/doc/numpy', None) +#} +intersphinx_mapping = { +} + +# -- Napoleon extension configuration (Numpy and Google docstring options) ------- +#napoleon_google_docstring = True +#napoleon_numpy_docstring = True +#napoleon_include_init_with_doc = True +#napoleon_include_private_with_doc = True +#napoleon_include_special_with_doc = True +#napoleon_use_admonition_for_examples = False +#napoleon_use_admonition_for_notes = False +#napoleon_use_admonition_for_references = False +#napoleon_use_ivar = False +#napoleon_use_param = True +#napoleon_use_rtype = True + +# -- Prepend module name to an object name or not ----------------------------------- +add_module_names = False + diff --git a/docs/sources/examples.rst b/docs/sources/examples.rst index ef070c4..cdd9274 100644 --- a/docs/sources/examples.rst +++ b/docs/sources/examples.rst @@ -1,55 +1,47 @@ -.. _examples: -.. include:: ./ext_links.txt - -List of examples -================ - -.. literalinclude:: ../../examples/01-hello_dpnp.py - :language: python - :lines: 27- - :caption: **EXAMPLE 01:** Your first NumPy code running on GPU - :name: examples_01_hello_dpnp - -.. 
literalinclude:: ../../examples/02-dpnp_device.py - :language: python - :lines: 27- - :caption: **EXAMPLE 02:** Select device type while creating array - :name: examples_02_dpnp_device - -.. literalinclude:: ../../examples/03-dpnp2numba-dpex.py - :language: python - :lines: 27- - :caption: **EXAMPLE 03:** Compile dpnp code with numba-dpex - :name: examples_03_dpnp2numba_dpex - -.. literalinclude:: ../../examples/04-dpctl_device_query.py - :language: python - :lines: 27- - :caption: **EXAMPLE 04:** Get information about devices - :name: examples_04_dpctl_device_query - -.. literalinclude:: ../../examples/05-dpctl_dpnp_test.py - :language: python - :lines: 46- - :caption: **EXAMPLE 05:** Test installation of ``dpctl_dpnp`` - :name: ex_05_dpnp_test - -.. literalinclude:: ../../examples/06-numba-dpex_test.py - :language: python - :lines: 45- - :caption: **EXAMPLE 06:** Test installation of ``numba-dpex`` - :name: 06-numba-dpex_test - -Benchmarks -********** - -.. todo:: - Provide instructions for dpbench - -Jupyter* Notebooks -****************** - -Instructions for Jupyter Notebook samples illustrating Data Parallel Extensions for Python - -.. literalinclude:: ../../examples/01-get_started.ipynb -.. literalinclude:: ../../examples/02-dpnp_numpy_fallback.ipynb \ No newline at end of file +.. _examples: +.. include:: ./ext_links.txt + +List of examples +================ + +.. literalinclude:: ../../examples/01-hello_dpnp.py + :language: python + :lines: 27- + :caption: **EXAMPLE 01:** Your first NumPy code running on GPU + :name: examples_01_hello_dpnp + +.. literalinclude:: ../../examples/02-dpnp_device.py + :language: python + :lines: 27- + :caption: **EXAMPLE 02:** Select device type while creating array + :name: examples_02_dpnp_device + +.. literalinclude:: ../../examples/03-dpnp2numba-dpex.py + :language: python + :lines: 27- + :caption: **EXAMPLE 03:** Compile dpnp code with numba-dpex + :name: examples_03_dpnp2numba_dpex + +.. 
literalinclude:: ../../examples/04-dpctl_device_query.py + :language: python + :lines: 27- + :caption: **EXAMPLE 04:** Get information about devices + :name: examples_04_dpctl_device_query + +.. literalinclude:: ../../examples/05-dpctl_dpnp_test.py + :language: python + :lines: 27- + :caption: **EXAMPLE 05:** Test installation of ``dpctl`` and ``dpnp`` + :name: ex_05_dpnp_test + +.. literalinclude:: ../../examples/06-numba-dpex_test.py + :language: python + :lines: 27- + :caption: **EXAMPLE 06:** Test installation of ``numba-dpex`` + :name: 06-numba-dpex_test + +Benchmarks +********** + +.. todo:: + Provide instructions for dpbench diff --git a/docs/sources/ext_links.txt b/docs/sources/ext_links.txt index ec86a87..eab6252 100644 --- a/docs/sources/ext_links.txt +++ b/docs/sources/ext_links.txt @@ -1,19 +1,19 @@ -.. - ********************************************************** - THESE ARE EXTERNAL PROJECT LINKS USED IN THE DOCUMENTATION - ********************************************************** -.. _NumPy*: https://numpy.org/ -.. _Numba*: https://numba.pydata.org/ -.. _Python* Array API Standard: https://data-apis.org/array-api/ -.. _Intel Distribution for Python*: https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html -.. _OpenCl*: https://www.khronos.org/opencl/ -.. _DPC++: https://www.apress.com/gp/book/9781484255735 -.. _Data Parallel Extension for Numba*: https://intelpython.github.io/numba-dpex/latest/index.html -.. _SYCL*: https://www.khronos.org/sycl/ -.. _Data Parallel Control: https://intelpython.github.io/dpctl/latest/index.html -.. _Data Parallel Extension for Numpy*: https://intelpython.github.io/dpnp/ -.. _IEEE 754-2019 Standard for Floating-Point Arithmetic: https://standards.ieee.org/ieee/754/6210/ -.. _David Goldberg, What every computer scientist should know about floating-point arithmetic: https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf> -.. 
_Intel oneAPI Base Toolkit: https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html -.. _Intel VTune Profiler: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html -.. _Intel Advisor: https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html +.. + ********************************************************** + THESE ARE EXTERNAL PROJECT LINKS USED IN THE DOCUMENTATION + ********************************************************** +.. _NumPy*: https://numpy.org/ +.. _Numba*: https://numba.pydata.org/ +.. _Python* Array API Standard: https://data-apis.org/array-api/ +.. _Intel Distribution for Python*: https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html +.. _OpenCl*: https://www.khronos.org/opencl/ +.. _DPC++: https://www.apress.com/gp/book/9781484255735 +.. _Data Parallel Extension for Numba*: https://intelpython.github.io/numba-dpex/latest/index.html +.. _SYCL*: https://www.khronos.org/sycl/ +.. _Data Parallel Control: https://intelpython.github.io/dpctl/latest/index.html +.. _Data Parallel Extension for Numpy*: https://intelpython.github.io/dpnp/ +.. _IEEE 754-2019 Standard for Floating-Point Arithmetic: https://standards.ieee.org/ieee/754/6210/ +.. _David Goldberg, What every computer scientist should know about floating-point arithmetic: https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf +.. _Intel oneAPI Base Toolkit: https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html +.. _Intel VTune Profiler: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html +.. _Intel Advisor: https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html diff --git a/docs/sources/heterogeneous_computing.rst b/docs/sources/heterogeneous_computing.rst index 6341f56..e1499c7 100644 --- a/docs/sources/heterogeneous_computing.rst +++ b/docs/sources/heterogeneous_computing.rst @@ -1,149 +1,149 @@ -.. 
_heterogeneous_computing: -.. include:: ./ext_links.txt - -Heterogeneous computing -======================= - -Device Offload -************** - -Python is an interpreted language, which implies that most of Python codes will run on CPU, -and only a few data parallel regions will execute on data parallel devices. -That is why the concept of host and offload devices is useful when it comes to conceptualizing -a heterogeneous programming model in Python. - -.. image:: ./_images/hetero-devices.png - :width: 600px - :align: center - :alt: SIMD - -The above diagram illustrates the *host* (the CPU which runs Python interpreter) and three *devices* -(two GPU devices and one attached accelerator device). **Data Parallel Extensions for Python** -offer a programming model where a script executed by Python interpreter on host can *offload* data -parallel kernels to user-specified device. A *kernel* is the *data parallel region* of a program submitted -for execution on the device. There can be multiple data parallel regions, and hence multiple *offload kernels*. - -Kernels can be pre-compiled into a library, such as ``dpnp``, or, alternatively, directly coded -in a programming language for heterogeneous computing, such as `OpenCl*`_ or `DPC++`_ . -**Data Parallel Extensions for Python** offer the way of writing kernels directly in Python -using `Numba*`_ compiler along with ``numba-dpex``, the `Data Parallel Extension for Numba*`_. - -One or more kernels are submitted for execution into a *queue* targeting an *offload device*. -For each device one or more queues can be created. In most cases you won’t need to work -with device queues directly. Data Parallel Extensions for Python will do necessary underlying -work with queues for you through the :ref:`Compute-Follows-Data`. - -Unified Shared Memory -********************* - -Each device has its own memory, not necessarily accessible from another device. - -.. 
image:: ./_images/hetero-devices.png - :width: 600px - :align: center - :alt: SIMD - -For example, **Device 1** memory may not be directly accessible from the host, but only accessible -via expensive copying by a driver software. Similarly, depending on the architecture, direct data -exchange between **Device 2** and **Device 1** may be impossible, and only possible via expensive -copying through the host memory. These aspects must be taken into consideration when programming -data parallel devices. - -In the above illustration the **Device 2** logically consists of two sub-devices, **Sub-Device 1** -and **Sub-Device 2**. The programming model allows accessing **Device 2** as a single logical device, or -by working with each individual sub-devices. For the former case a programmer needs to create -a queue for **Device 2**. For the latter case a programmer needs to create 2 queues, one for each sub-device. - -`SYCL*`_ standard introduces a concept of the *Unified Shared Memory* (USM). USM requires hardware support -for unified virtual address space, which allows coherency between the host and the device -pointers. All memory is allocated by the host, but it offers three distinct allocation types: - -* **Host: located on the host, accessible by the host or device.** This type of memory is useful in a situation - when you need to stream a read-only data from the host to the device once. - -* **Device: located on the device, accessibly only by the device.** This type of memory is the fastest one. - Useful in a situation when most of data crunching happens on the device. - -* **Shared: location is both host and device (copies are synchronized by underlying software), accessible by - the host or device.** Shared allocations are useful when data are accessed by both host and devices, - since a user does not need to explicitly manage data migration. However, it is much slower than USM Device memory type. 
- -Compute-Follows-Data -******************** -Since data copying between devices is typically very expensive, for performance reasons it is essential -to process data close to where it is allocated. This is the premise of the *Compute-Follows-Data* programming model, -which states that the compute will happen where the data resides. Tensors implemented in ``dpctl`` and ``dpnp`` -carry information about allocation queues, and hence, about the device on which an array is allocated. -Based on tensor input arguments of the offload kernel, it deduces the queue on which the execution takes place. - -.. image:: ./_images/kernel-queue-device.png - :width: 600px - :align: center - :alt: SIMD - -The above picture illustrates the *Compute-Follows-Data* concept. Arrays ``A`` and ``B`` are inputs to the -**Offload Kernel**. These arrays carry information about their *allocation queue* (**Device Queue**) and the -*device* (**Device 1**) where they were created. According to the Compute-Follows-Data paradigm -the **Offload Kernel** will be submitted to this **Device Queue**, and the resulting array ``C`` will -be created on the **Device Queue** associated with the **Device 1**. - -**Data Parallel Extensions for Python** require all input tensor arguments to have the **same** allocation queue, -otherwise an exception will be thrown. For example, the following usages will result in the exception. - -.. figure:: ./_images/queue-exception1.png - :width: 600px - :align: center - :alt: SIMD - - Input tensors are on different devices and different queues. Exception is thrown. - -.. figure:: ./_images/queue-exception2.png - :width: 600px - :align: center - :alt: SIMD - - Input tensors are on the same device but queues are different. Exception is thrown. - -.. figure:: ./_images/queue-exception3.png - :width: 600px - :align: center - :alt: SIMD - - Data belongs to the same device, but queues are different and associated with different sub-devices. 
- -Copying data between devices and queues -*************************************** - -**Data Parallel Extensions for Python** create **one** *canonical queue* per device so that in -normal circumstances you do not need to directly manage queues. Having one canonical queue per device -allows you to copy data between devices using to_device() method: - -.. code-block:: python - - a_new = a.to_device(b.device) - -Array ``a`` will be copied to the device associated with array ``b`` into the new array ``a_new``. -The same queue will be associated with ``b`` and ``a_new``. - -Alternatively, you can do this as follows: - -.. code-block:: python - :caption: DPNP array - - a_new = dpnp.asarray(a, device=b.device) - -.. code-block:: python - :caption: DPCtl array - - a_new = dpctl.tensor.asarray(a, device=b.device) - -Creating additional queues -************************** - -As previously indicated **Data Parallel Extensions for Python** automatically create one canonical queue per device, -and you normally work with this queue implicitly. However, you can always create as many additional queues per device -as needed, and work with them explicitly. - -A typical situation when you will want to create the queue explicitly is for profiling purposes. -Read `Data Parallel Control`_ documentation for more details about queues. - +.. _heterogeneous_computing: +.. include:: ./ext_links.txt + +Heterogeneous computing +======================= + +Device Offload +************** + +Python is an interpreted language, which implies that most Python code will run on the CPU, +and only a few data parallel regions will execute on data parallel devices. +That is why the concept of host and offload devices is useful when it comes to conceptualizing +a heterogeneous programming model in Python. + +.. 
image:: ./_images/hetero-devices.png + :width: 600px + :align: center + :alt: SIMD + +The above diagram illustrates the *host* (the CPU that runs the Python interpreter) and three *devices* +(two GPU devices and one attached accelerator device). **Data Parallel Extensions for Python** +offer a programming model where a script executed by the Python interpreter on the host can *offload* data +parallel kernels to a user-specified device. A *kernel* is the *data parallel region* of a program submitted +for execution on the device. There can be multiple data parallel regions, and hence multiple *offload kernels*. + +Kernels can be pre-compiled into a library, such as ``dpnp``, or, alternatively, directly coded +in a programming language for heterogeneous computing, such as `OpenCl*`_ or `DPC++`_. +**Data Parallel Extensions for Python** offer a way to write kernels directly in Python +using the `Numba*`_ compiler along with ``numba-dpex``, the `Data Parallel Extension for Numba*`_. + +One or more kernels are submitted for execution into a *queue* targeting an *offload device*. +For each device, one or more queues can be created. In most cases you won’t need to work +with device queues directly. Data Parallel Extensions for Python will do the necessary underlying +work with queues for you through the :ref:`Compute-Follows-Data`. + +Unified Shared Memory +********************* + +Each device has its own memory, not necessarily accessible from another device. + +.. image:: ./_images/hetero-devices.png + :width: 600px + :align: center + :alt: SIMD + +For example, **Device 1** memory may not be directly accessible from the host, but only accessible +via expensive copying by driver software. Similarly, depending on the architecture, direct data +exchange between **Device 2** and **Device 1** may be impossible, and only possible via expensive +copying through the host memory. These aspects must be taken into consideration when programming +data parallel devices.
+ +In the above illustration, **Device 2** logically consists of two sub-devices, **Sub-Device 1** +and **Sub-Device 2**. The programming model allows accessing **Device 2** as a single logical device, or +by working with each individual sub-device. In the former case, a programmer needs to create +a queue for **Device 2**. In the latter case, a programmer needs to create two queues, one for each sub-device. + +The `SYCL*`_ standard introduces the concept of *Unified Shared Memory* (USM). USM requires hardware support +for a unified virtual address space, which allows coherency between the host and the device +pointers. All memory is allocated by the host, but it offers three distinct allocation types: + +* **Host: located on the host, accessible by the host or device.** This type of memory is useful + when you need to stream read-only data from the host to the device once. + +* **Device: located on the device, accessible only by the device.** This type of memory is the fastest one. + Useful when most of the data crunching happens on the device. + +* **Shared: located on both the host and device (copies are synchronized by underlying software), accessible by + the host or device.** Shared allocations are useful when data are accessed by both the host and devices, + since a user does not need to explicitly manage data migration. However, it is much slower than the USM device memory type. + +Compute-Follows-Data +******************** +Since data copying between devices is typically very expensive, for performance reasons it is essential +to process data close to where it is allocated. This is the premise of the *Compute-Follows-Data* programming model, +which states that the compute will happen where the data resides. Tensors implemented in ``dpctl`` and ``dpnp`` +carry information about allocation queues, and hence, about the device on which an array is allocated.
+Based on the tensor input arguments of the offload kernel, the runtime deduces the queue on which the execution takes place. + +.. image:: ./_images/kernel-queue-device.png + :width: 600px + :align: center + :alt: SIMD + +The above picture illustrates the *Compute-Follows-Data* concept. Arrays ``A`` and ``B`` are inputs to the +**Offload Kernel**. These arrays carry information about their *allocation queue* (**Device Queue**) and the +*device* (**Device 1**) where they were created. According to the Compute-Follows-Data paradigm, +the **Offload Kernel** will be submitted to this **Device Queue**, and the resulting array ``C`` will +be created on the **Device Queue** associated with **Device 1**. + +**Data Parallel Extensions for Python** require all input tensor arguments to have the **same** allocation queue; +otherwise an exception is thrown. For example, the following usages will result in an exception. + +.. figure:: ./_images/queue-exception1.png + :width: 600px + :align: center + :alt: SIMD + + Input tensors are on different devices and different queues. Exception is thrown. + +.. figure:: ./_images/queue-exception2.png + :width: 600px + :align: center + :alt: SIMD + + Input tensors are on the same device but queues are different. Exception is thrown. + +.. figure:: ./_images/queue-exception3.png + :width: 600px + :align: center + :alt: SIMD + + Data belongs to the same device, but queues are different and associated with different sub-devices. + +Copying data between devices and queues +*************************************** + +**Data Parallel Extensions for Python** create **one** *canonical queue* per device so that in +normal circumstances you do not need to directly manage queues. Having one canonical queue per device +allows you to copy data between devices using the ``to_device()`` method: + +.. code-block:: python + + a_new = a.to_device(b.device) + +Array ``a`` will be copied to the device associated with array ``b`` into the new array ``a_new``.
+The same queue will be associated with ``b`` and ``a_new``. + +Alternatively, you can do this as follows: + +.. code-block:: python + :caption: DPNP array + + a_new = dpnp.asarray(a, device=b.device) + +.. code-block:: python + :caption: DPCtl array + + a_new = dpctl.tensor.asarray(a, device=b.device) + +Creating additional queues +************************** + +As previously indicated, **Data Parallel Extensions for Python** automatically create one canonical queue per device, +and you normally work with this queue implicitly. However, you can always create as many additional queues per device +as needed, and work with them explicitly. + +A typical situation in which you will want to create a queue explicitly is profiling. +Read the `Data Parallel Control`_ documentation for more details about queues. + diff --git a/docs/sources/index.rst b/docs/sources/index.rst index 0cca6fa..ab28e4a 100644 --- a/docs/sources/index.rst +++ b/docs/sources/index.rst @@ -1,33 +1,36 @@ -.. _index: -.. include:: ./ext_links.txt - -.. image:: ./_images/DPEP-large.png - :width: 400px - :align: center - :alt: Data Parallel Extensions for Python - -Data Parallel Extensions for Python -=================================== - -Data Parallel Extensions for Python* extend numerical Python capabilities beyond CPU and allow even higher performance -gains on data parallel devices such as GPUs. It consists of three foundational packages: - -* **dpnp** - Data Parallel Extensions for `Numpy*`_ - a library that implements a subset of - Numpy that can be executed on any data parallel device. The subset is a drop-in replacement - of core Numpy functions and numerical data types. -* **numba_dpex** - Data Parallel Extensions for `Numba*`_ - extension for Numba compiler - that enables programming data parallel devices the same way you program CPU with Numba.
-* **dpctl - Data Parallel Control library** that provides utilities for device selection, - allocation of data on devices, tensor data structure along with `Python* Array API Standard`_ implementation, and support for creation of user-defined data-parallel extensions. - -Table of Contents -***************** -.. toctree:: - :maxdepth: 2 - - prerequisites_and_installation - parallelism - heterogeneous_computing - programming_dpep - examples - useful_links +.. _index: +.. include:: ./ext_links.txt + +.. image:: ./_images/DPEP-large.png + :width: 400px + :align: center + :alt: Data Parallel Extensions for Python + +Data Parallel Extensions for Python +=================================== + +Data Parallel Extensions for Python* extend numerical Python capabilities beyond the CPU and allow even higher performance +gains on data parallel devices such as GPUs. They consist of three foundational packages: + +* **dpnp** - Data Parallel Extensions for `Numpy*`_ - a library that implements a subset of + Numpy that can be executed on any data parallel device. The subset is a drop-in replacement + of core Numpy functions and numerical data types. +* **numba_dpex** - Data Parallel Extensions for `Numba*`_ - an extension for the Numba compiler + that enables programming data parallel devices the same way you program the CPU with Numba. +* **dpctl** - Data Parallel Control - a library that provides utilities for device selection, + allocation of data on devices, a tensor data structure along with a `Python* Array API Standard`_ implementation, and support for creation of user-defined data-parallel extensions. + +Table of Contents +***************** +.. 
toctree:: + :maxdepth: 2 + + prerequisites_and_installation + parallelism + heterogeneous_computing + programming_dpep + examples + jupiter_notebook + useful_links + + \ No newline at end of file diff --git a/docs/sources/jupiter_notebook.rst b/docs/sources/jupiter_notebook.rst new file mode 100644 index 0000000..3883747 --- /dev/null +++ b/docs/sources/jupiter_notebook.rst @@ -0,0 +1,16 @@ +.. _jupiter_notebook: +.. include:: ./ext_links.txt + +Jupyter* Notebooks +****************** + +Instructions for Jupyter* Notebook samples illustrating Data Parallel Extensions for Python: + +.. toctree:: + Getting Started <01-get_started.ipynb> + DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK <02-dpnp_numpy_fallback.ipynb> + +To download all examples as Jupyter* notebooks, please use the following links: +* `Getting Started <01-get_started.ipynb>`_ +* `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK <02-dpnp_numpy_fallback.ipynb>`_ + diff --git a/docs/sources/parallelism.rst b/docs/sources/parallelism.rst index f9f2ad1..d238286 100644 --- a/docs/sources/parallelism.rst +++ b/docs/sources/parallelism.rst @@ -1,44 +1,44 @@ -.. _parallelism: -.. include:: ./ext_links.txt - -Parallelism in modern data parallel architectures -================================================= - -Python is loved for its productivity and interactivity. But when it comes to dealing with -computationally heavy codes Python performance cannot be compromised. Intel and Python numerical -computing communities, such as `NumFOCUS `_, dedicated attention to -optimizing core numerical and data science packages for leveraging parallelism available in modern CPUs: - -* **Multiple computational cores:** Several computational cores allow processing data concurrently. - Compared to a single core CPU, *N* cores can process either *N* times bigger data in a fixed time, or - reduce a computation time *N* times for a fixed amount of data. - -.. 
image:: ./_images/dpep-cores.png - :width: 600px - :align: center - :alt: Multiple CPU Cores - -* **SIMD parallelism:** SIMD (Single Instruction Multiple Data) is a special type of instructions - that perform operations on vectors of data elements at the same time. The size of vectors is called SIMD width. - If SIMD width is *K* then a SIMD instruction can process *K* data elements in parallel. - - In the following diagram the SIMD width is 2, which means that a single instruction processes two elements simultaneously. - Compared to regular instructions that process one element at a time, 2-wide SIMD instruction performs - 2 times more data in fixed time, or, respectively, process a fixed amount of data 2 times faster. - -.. image:: ./_images/dpep-simd.png - :width: 150px - :align: center - :alt: SIMD - -* **Instruction-Level Parallelism:** Modern CISC architectures, such as x86, allow performing data independent - instructions in parallel. In the following example, we compute :math:`a * b + (c - d)`. - Operations :math:`*` and :math:`-` can be executed in parallel, the last instruction - :math:`+` depends on availability of :math:`a * b` and :math:`c - d` and hence cannot be executed in parallel - with :math:`*` and :math:`-`. - -.. image:: ./_images/dpep-ilp.png - :width: 150px - :align: center - :alt: SIMD - +.. _parallelism: +.. include:: ./ext_links.txt + +Parallelism in modern data parallel architectures +================================================= + +Python is loved for its productivity and interactivity. But when it comes to dealing with +computationally heavy code, Python performance cannot be compromised. Intel and Python numerical +computing communities, such as `NumFOCUS <https://numfocus.org/>`_, have dedicated attention to +optimizing core numerical and data science packages for leveraging parallelism available in modern CPUs: + +* **Multiple computational cores:** Several computational cores allow processing data concurrently.
+ Compared to a single core CPU, *N* cores can process either *N* times more data in a fixed time, or + reduce the computation time *N* times for a fixed amount of data. + +.. image:: ./_images/dpep-cores.png + :width: 600px + :align: center + :alt: Multiple CPU Cores + +* **SIMD parallelism:** SIMD (Single Instruction Multiple Data) is a special type of instruction + that performs operations on vectors of data elements at the same time. The size of these vectors is called the SIMD width. + If the SIMD width is *K*, then a SIMD instruction can process *K* data elements in parallel. + + In the following diagram, the SIMD width is 2, which means that a single instruction processes two elements simultaneously. + Compared to regular instructions that process one element at a time, a 2-wide SIMD instruction processes + 2 times more data in a fixed time or, respectively, processes a fixed amount of data 2 times faster. + +.. image:: ./_images/dpep-simd.png + :width: 150px + :align: center + :alt: SIMD + +* **Instruction-Level Parallelism:** Modern CISC architectures, such as x86, allow performing data-independent + instructions in parallel. In the following example, we compute :math:`a * b + (c - d)`. + Operations :math:`*` and :math:`-` can be executed in parallel; the last instruction + :math:`+` depends on the availability of :math:`a * b` and :math:`c - d` and hence cannot be executed in parallel + with :math:`*` and :math:`-`. + +.. image:: ./_images/dpep-ilp.png + :width: 150px + :align: center + :alt: SIMD + diff --git a/docs/sources/programming_dpep.rst b/docs/sources/programming_dpep.rst index 68b67b8..db2af12 100644 --- a/docs/sources/programming_dpep.rst +++ b/docs/sources/programming_dpep.rst @@ -1,307 +1,307 @@ -.. _programming_dpep: -.. include:: ./ext_links.txt - -Programming with Data Parallel Extensions for Python -==================================================== - -.. 
image:: ./_images/dpep-all.png - :width: 600px - :align: center - :alt: Data Parallel Extensions for Python - -As we briefly outlined, **Data Parallel Extensions for Python** consist of three foundational packages: - -* the `Numpy*`_-like library, ``dpnp``; -* the compiler extension for `Numba*`_, ``numba-dpex`` -* the library for managing devices, queues, and heterogeneous data, ``dpctl``. - -Their underlying implementation is based on `SYCL*`_ standard, which is a cross-platform abstraction layer -for heterogeneous computing on data parallel devices, such as CPU, GPU, or domain specific accelerators. - -Scalars vs. 0-dimensional arrays -******************************** - -Primitive types such as Python’s and Numpy’s ``float``, ``int``, or ``complex``, used to represent scalars, -have the host storage. In contrast, ``dpctl.tensor.usm_ndarray`` and ``dpnp.ndarray`` have USM storage -and carry associated allocation queue. For the :ref:`Compute-Follows-Data` consistent behavior -all ``dpnp`` operations that produce scalars will instead produce respective 0-dimensional arrays. - -That implies, that some code changes may be needed to replace scalar math operations with respective -``dpnp`` array operations. See `Data Parallel Extension for Numpy*`_ - **API Reference** section for details. - -Data Parallel Extension for Numpy - dpnp -**************************************** - -The ``dpnp`` library is a bare minimum to start programming numerical codes for data parallel devices. -You may already have a Python script written in `Numpy*`_. Being a drop-in replacement of (a subset of) `Numpy*`_, -to execute your `Numpy*`_ script on GPU usually requires changing just a few lines of the code: - -.. 
literalinclude:: ../../examples/01-hello_dpnp.py - :language: python - :lines: 27- - :caption: **EXAMPLE 01:** Your first NumPy code running on GPU - :name: ex_01_hello_dpnp - - -In this example ``np.asarray()`` creates an array on the default `SYCL*`_ device, which is ``"gpu"`` on systems -with integrated or discrete GPU (it is ``"host"`` on systems that do not have GPU). -The queue associated with this array is now carried with ``x``, and ``np.sum(x)`` will derive it from ``x``, -and respective pre-compiled kernel implementing ``np.sum()`` will be submitted to that queue. -The result ``y`` will be allocated on the device 0-dimensional array associated with that queue too. - -All ``dpnp`` array creation routines as well as random number generators have additional optional keyword arguments -``device``, ``queue``, and ``usm_type``, using which you can explicitly specify on which device or queue you want -the tensor data to be created along with USM memory type to be used (``"host"``, ``"device"``, or ``"shared"``). -In the following example we create the array ``x`` on the GPU device, and perform a reduction sum on it: - -.. literalinclude:: ../../examples/02-dpnp_device.py - :language: python - :lines: 27- - :caption: **EXAMPLE 02:** Select device type while creating array - :name: ex_02_dpnp_device - - -Data Parallel Extension for Numba - numba-dpex -********************************************** - -`Numba*`_ is a powerful Just-In-Time (JIT) compiler that works best on `Numpy*`_ arrays, `Numpy*`_ functions, and loops. -Data parallel loops is where the data parallelism resides. It allows leveraging all available CPU cores, -SIMD instructions, and schedules those in a way that exploits maximum instruction-level parallelism. -The ``numba-dpex`` extension allows to compile and offload data parallel regions to any data parallel device. -It takes just a few lines to modify your CPU `Numba*`_ script to run on GPU. - -.. 
.. literalinclude:: ../../examples/03-dpnp2numba-dpex.py
   :language: python
   :lines: 27-
   :caption: **EXAMPLE 03:** Compile dpnp code with numba-dpex
   :name: ex_03_dpnp2numba_dpex

In this example we implement a custom function ``sum_it()`` that takes an array input, and we compile it with the
`Data Parallel Extension for Numba*`_. Being a just-in-time compiler, Numba derives the queue from the input argument ``x``,
which is associated with the default device (``"gpu"`` on systems with an integrated or discrete GPU), and
dynamically compiles the kernel submitted to that queue. The result will reside as a 0-dimensional array on the device
associated with the queue, and on exit from the offload kernel it will be assigned to the tensor ``y``.

The ``parallel=True`` setting in ``@njit`` is essential to enable the generation of data parallel kernels.
Please also note that we use ``fastmath=True`` in the ``@njit`` decorator. This is an important setting
that instructs the compiler that you are okay with NOT preserving the order of floating-point operations.
It enables the generation of faster instructions (such as SIMD) for greater performance.

Data Parallel Control - dpctl
*****************************

Both ``dpnp`` and ``numba-dpex`` provide enough API versatility for programming data parallel devices, but
there are situations when you will need ``dpctl``'s advanced capabilities:

1. **Advanced device management.** Both ``dpnp`` and ``numba-dpex`` support Numpy array creation routines
   with additional parameters that specify the device on which the data is allocated and the type of memory to be used
   (``"device"``, ``"host"``, or ``"shared"``). However, if you need more advanced device and data management
   capabilities, you will also need to import ``dpctl`` in addition to ``dpnp`` and/or ``numba-dpex``.

   One frequent usage of ``dpctl`` is to query the list of devices present on the system and the available driver
   backends (such as ``"opencl"``, ``"level_zero"``, ``"cuda"``, etc.).
   Another frequent usage is the creation of additional queues for profiling or for out-of-order
   execution of offload kernels.

.. literalinclude:: ../../examples/04-dpctl_device_query.py
   :language: python
   :lines: 27-
   :caption: **EXAMPLE 04:** Get information about devices
   :name: ex_04_dpctl_device_query

2. **Cross-platform development using the Python Array API standard.** If you are a Python developer
   writing Numpy-like code that targets different hardware vendors and different tensor implementations,
   then the `Python* Array API Standard`_ is a good choice for writing portable Numpy-like code.
   ``dpctl.tensor`` implements the `Python* Array API Standard`_ for `SYCL*`_ devices. Accompanied by the
   respective SYCL device drivers from different vendors, ``dpctl.tensor`` becomes a portable solution
   for writing numerical codes for any SYCL device.

   For example, some Python communities, such as the
   `Scikit-Learn* community `_, are already establishing
   a path for having algorithms (re-)implemented using the `Python* Array API Standard`_.
   This is a reliable path for extending their capabilities beyond CPU-only execution, or beyond a single GPU vendor.

3. **Zero-copy data exchange between tensor implementations.** Certain Python projects may have their own tensor
   implementations that do not rely on ``dpctl.tensor`` or ``dpnp.ndarray`` tensors. Can users still exchange data
   between these tensors without copying it back and forth through the host?
   The `Python* Array API Standard`_ specifies a data exchange protocol for zero-copy exchange
   between tensors through ``dlpack``. As an implementation of the `Python* Array API Standard`_,
   ``dpctl`` provides the ``dpctl.tensor.from_dlpack()`` function for creating a zero-copy view of another tensor input.
Debugging and profiling Data Parallel Extensions for Python
***********************************************************

`Intel oneAPI Base Toolkit`_ provides two tools that help programmers analyze performance issues in programs
that use **Data Parallel Extensions for Python**: `Intel VTune Profiler`_ and `Intel Advisor`_.

Intel VTune Profiler examines various performance aspects of a program, such as the most time-consuming parts,
the efficiency of offloaded code, and the impact of memory sub-systems.

Intel Advisor provides insights on the performance of offloaded code relative to the peak performance and
memory bandwidth.

Next, we detail the steps involved in using Intel VTune Profiler and Intel Advisor with
heterogeneous programs that use **Data Parallel Extensions for Python**.

Profiling with Intel VTune Profiler
-----------------------------------

.. |copy| unicode:: U+000A9

.. |trade| unicode:: U+2122

Intel |copy| VTune |trade| Profiler provides two analyses, called *GPU offload* and *GPU hotspots*, to profile
heterogeneous programs targeted to GPUs.

The *GPU offload* analysis profiles the entire application (both GPU and host code) and helps to identify
whether the application is CPU or GPU bound. It reports the proportion of the execution time spent
in GPU execution and points out the various hotspots in the program. The key goal of the *GPU offload*
analysis is to identify the parts of the program that can benefit from offloading to GPUs.

The *GPU hotspots* analysis focuses on the performance of GPU-offloaded code.
It provides insights into the parallelism in the GPU kernel, the efficiency of the kernel, SIMD utilization,
and memory latency. It also reports performance data on synchronization operations such as GPU barriers and
atomic operations.
The following instructions are used to execute the two Intel VTune Profiler analyses on programs written
using **Data Parallel Extensions for Python**.

.. code-block:: console
   :caption: **GPU Offload**

   > vtune -collect gpu-offload -r -- python
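The *GPU hotspots* analysis is collected and reported the same way. The result-directory and script names below are illustrative placeholders, not taken from the original examples:

```shell
# Collect the GPU hotspots analysis for a Python script (placeholder names)
> vtune -collect gpu-hotspots -r ./vtune_hotspots -- python my_dpnp_script.py

# Print a summary of the collected result
> vtune -report summary -r ./vtune_hotspots
```

Here ``-r`` is VTune's shorthand for ``-result-dir``; everything after ``--`` is the profiled command line.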