diff --git a/docs/sources/01-get_started.ipynb b/docs/sources/01-get_started.ipynb new file mode 100644 index 0000000..7b4ec35 --- /dev/null +++ b/docs/sources/01-get_started.ipynb @@ -0,0 +1,581 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a002ea61", + "metadata": {}, + "source": [ + "# Getting Started" + ] + }, + { + "cell_type": "markdown", + "id": "905c742a", + "metadata": {}, + "source": [ + "# A Few Line Changes to Run on GPU" + ] + }, + { + "cell_type": "markdown", + "id": "41bf2099", + "metadata": {}, + "source": [ + "Let's look at an example of how you can switch computations from the CPU to a GPU device by changing just a few lines of code.\n", + "\n", + "First, look at the original example.\n", + "We allocate two matrices on the host (CPU) device using the NumPy array function; all subsequent calculations are also performed on the host (CPU) device." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d9e8711d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "res = [[2 2]\n", + " [2 2]]\n" + ] + } + ], + "source": [ + "# Original CPU script\n", + "\n", + "# Import the NumPy library\n", + "import numpy as np\n", + "\n", + "# Data allocated on the CPU device\n", + "x = np.array([[1, 1], [1, 1]])\n", + "y = np.array([[1, 1], [1, 1]])\n", + "\n", + "# Compute performed on the CPU device, where the data is allocated\n", + "res = np.matmul(x, y)\n", + "\n", + "print(\"res =\", res)" + ] + }, + { + "cell_type": "markdown", + "id": "402c3d61", + "metadata": {}, + "source": [ + "Now let's modify the code so that all calculations occur on the GPU device.\n", + "To do that, you only need to switch to the dpnp library and look at the result."
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "c10b7d83", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Array x is located on the device: Device(level_zero:gpu:0)\n", + "Array y is located on the device: Device(level_zero:gpu:0)\n", + "res is located on the device: Device(level_zero:gpu:0)\n", + "res = [[2 2]\n", + " [2 2]]\n" + ] + } + ], + "source": [ + "# Modified XPU script\n", + "\n", + "# Drop-in replacement via a single-line change\n", + "import dpnp as np\n", + "\n", + "# Data allocated on the default SYCL device\n", + "x = np.array([[1, 1], [1, 1]])\n", + "y = np.array([[1, 1], [1, 1]])\n", + "\n", + "# Compute performed on the device where the data is allocated\n", + "res = np.matmul(x, y)\n", + "\n", + "print(\"Array x is located on the device:\", x.device)\n", + "print(\"Array y is located on the device:\", y.device)\n", + "print(\"res is located on the device:\", res.device)\n", + "print(\"res =\", res)" + ] + }, + { + "cell_type": "markdown", + "id": "d6a58f17", + "metadata": {}, + "source": [ + "As you can see, changing only one line of code lets us perform all calculations on the GPU device.\n", + "In this example, np.array() creates an array on the default SYCL* device, which is \"gpu\" on systems with an integrated or discrete GPU (and \"host\" on systems without a GPU). The queue associated with this array is carried with x and y, so np.matmul(x, y) computes the matrix product of the two arrays by submitting the respective pre-compiled kernel implementing np.matmul() to that queue. The result res is also allocated on a device array associated with that queue.\n", + "\n", + "Now let's make a few improvements to the code and see how we can control the exact device on which the calculations are performed and which USM memory type is used."
+ ] + }, + { + "cell_type": "markdown", + "id": "be340585", + "metadata": {}, + "source": [ + "# dpnp simple examples with popular functions" + ] + }, + { + "cell_type": "markdown", + "id": "26d8c3c6", + "metadata": {}, + "source": [ + "1. Example that returns an array with evenly spaced values within a given interval." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "0435d7f0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "282 µs ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n", + "Result a is located on the device: Device(level_zero:gpu:0)\n", + "a = [ 3 9 15 21 27]\n" + ] + } + ], + "source": [ + "import dpnp as np\n", + "\n", + "# Create an array of values from 3 up to 30 with step 6\n", + "a = np.arange(3, 30, step=6)\n", + "\n", + "print(\"Result a is located on the device:\", a.device)\n", + "print(\"a =\", a)" + ] + }, + { + "cell_type": "markdown", + "id": "a6765095", + "metadata": {}, + "source": [ + "In this example, np.arange() creates an array on the default SYCL* device, which is \"gpu\" on systems with an integrated or discrete GPU (and \"host\" on systems without a GPU)." + ] + }, + { + "cell_type": "markdown", + "id": "35081461", + "metadata": {}, + "source": [ + "2. 
Example that calculates the sum of the array elements on the GPU" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e613398f", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Result x is located on the device: Device(level_zero:gpu:0)\n", + "Result y is located on the device: Device(level_zero:gpu:0)\n", + "The sum of the array elements is: 6\n" + ] + } + ], + "source": [ + "import dpnp as np\n", + "\n", + "# Fallback allocation in case no GPU device is available\n", + "x = np.empty(3)\n", + "\n", + "try:\n", + " # Using filter selector strings to specify the root device for a new array\n", + " x = np.asarray([1, 2, 3], device=\"gpu\")\n", + " print(\"Result x is located on the device:\", x.device)\n", + "except Exception:\n", + " print(\"GPU device is not available\")\n", + "\n", + "# Return the sum of the array elements\n", + "y = np.sum(x) # Expect 6\n", + "\n", + "print(\"Result y is located on the device:\", y.device)\n", + "print(\"The sum of the array elements is:\", y)" + ] + }, + { + "cell_type": "markdown", + "id": "88b4915e", + "metadata": {}, + "source": [ + "In this example, np.asarray() creates an array on the default GPU device. The queue associated with this array is carried with x, so np.sum(x) derives the queue from x and submits the respective pre-compiled kernel implementing np.sum() to it. The result y, a 0-dimensional device array, is also allocated on the device associated with that queue." + ] + }, + { + "cell_type": "markdown", + "id": "b49f4c62", + "metadata": {}, + "source": [ + "3. 
Example of element-wise bitwise inversion of an array" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "4b53afed", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Array a is located on the device: Device(level_zero:gpu:0)\n", + "Result x is located on the device: Device(level_zero:gpu:0)\n", + "Array x is: [[-2 -2]\n", + " [-3 -2]\n", + " [-2 -1]\n", + " [ 0 -1]]\n" + ] + } + ], + "source": [ + "import dpnp as np\n", + "\n", + "try:\n", + " # Using filter selector strings to specify the root device for an array\n", + " a = np.array([[1, 1], [2, 1], [1, 0], [-1, 0]], device=\"gpu\")\n", + " print(\"Array a is located on the device:\", a.device)\n", + "\n", + " # Element-wise bitwise inversion of the array \"a\"\n", + " x = np.invert(a)\n", + "\n", + " print(\"Result x is located on the device:\", x.device)\n", + " print(\"Array x is:\", x)\n", + "\n", + "except Exception:\n", + " print(\"GPU device is not available\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "e10f226a", + "metadata": {}, + "source": [ + "In this example, np.array() creates an array on the default GPU device. The queue associated with this array is carried with a, so np.invert(a) derives the queue from a and submits the respective pre-compiled kernel implementing np.invert() to it. The result x is also allocated on a device array associated with that queue." + ] + }, + { + "cell_type": "markdown", + "id": "94cb3b9b", + "metadata": {}, + "source": [ + "# dpctl simple examples" + ] + }, + { + "cell_type": "markdown", + "id": "9fb4b1b5", + "metadata": {}, + "source": [ + "Below is a list of simple examples that show how to find out which devices are available in the system and how to operate with them.\n", + "Let's print the list of all available SYCL platforms."
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "b890ae71", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Intel(R) OpenCL HD Graphics OpenCL 3.0 \n", + "Intel(R) FPGA Emulation Platform for OpenCL(TM) OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3\n", + "Intel(R) OpenCL OpenCL 3.0 WINDOWS\n", + "Intel(R) Level-Zero 1.3\n" + ] + } + ], + "source": [ + "# See the list of available SYCL platforms and extra metadata about each platform.\n", + "import dpctl\n", + "\n", + "dpctl.lsplatform() # Print platform information" + ] + }, + { + "cell_type": "markdown", + "id": "db5e8db4", + "metadata": {}, + "source": [ + "Let's look at the output.\n", + "On my system, an OpenCL GPU driver, an Intel(R) FPGA Emulation Device, an OpenCL CPU driver, and a Level Zero GPU driver are available.\n", + "If I increase the verbosity parameter, I can get more information about the devices I have." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "ffcf0cfb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Platform 0 ::\n", + " Name Intel(R) OpenCL HD Graphics\n", + " Version OpenCL 3.0 \n", + " Vendor Intel(R) Corporation\n", + " Backend opencl\n", + " Num Devices 1\n", + " # 0\n", + " Name Intel(R) Iris(R) Xe Graphics\n", + " Version 31.0.101.3430\n", + " Filter string opencl:gpu:0\n", + "Platform 1 ::\n", + " Name Intel(R) FPGA Emulation Platform for OpenCL(TM)\n", + " Version OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3\n", + " Vendor Intel(R) Corporation\n", + " Backend opencl\n", + " Num Devices 1\n", + " # 0\n", + " Name Intel(R) FPGA Emulation Device\n", + " Version 2022.15.11.0.18_160000\n", + " Filter string opencl:accelerator:0\n", + "Platform 2 ::\n", + " Name Intel(R) OpenCL\n", + " Version OpenCL 3.0 WINDOWS\n", + " Vendor Intel(R) Corporation\n", + " Backend opencl\n", + " Num Devices 1\n", + " # 0\n", + " Name 11th Gen Intel(R) 
Core(TM) i7-1185G7 @ 3.00GHz\n", + " Version 2022.15.11.0.18_160000\n", + " Filter string opencl:cpu:0\n", + "Platform 3 ::\n", + " Name Intel(R) Level-Zero\n", + " Version 1.3\n", + " Vendor Intel(R) Corporation\n", + " Backend ext_oneapi_level_zero\n", + " Num Devices 1\n", + " # 0\n", + " Name Intel(R) Iris(R) Xe Graphics\n", + " Version 1.3.23904\n", + " Filter string level_zero:gpu:0\n" + ] + } + ], + "source": [ + "# See the list of available SYCL platforms and extra metadata about each platform.\n", + "import dpctl\n", + "\n", + "dpctl.lsplatform(2) # Print platform information with verbosity level 2 (the highest level)" + ] + }, + { + "cell_type": "markdown", + "id": "f63525d3", + "metadata": {}, + "source": [ + "Having information about the available SYCL platforms, you can specify which type of device you want to work with." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "84ff47e9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[, ]\n" + ] + } + ], + "source": [ + "# See the list of available GPU devices and their extra metadata.\n", + "import dpctl\n", + "\n", + "if dpctl.has_gpu_devices():\n", + " print(dpctl.get_devices(device_type='gpu'))\n", + "else:\n", + " print(\"GPU device is not available\")" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "a93e7cf8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[]\n" + ] + } + ], + "source": [ + "# See the list of available CPU devices and their extra metadata.\n", + "import dpctl\n", + "\n", + "if dpctl.has_cpu_devices():\n", + " print(dpctl.get_devices(device_type='cpu'))\n", + "else:\n", + " print(\"CPU device is not available\")" + ] + }, + { + "cell_type": "markdown", + "id": "efcc3f45", + "metadata": {}, + "source": [ + "And you can select a specific device in your system using the default selector." + ] + }, + { + "cell_type": "code", + "execution_count": 
9, + "id": "1c068447", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Name Intel(R) Iris(R) Xe Graphics\n", + " Driver version 1.3.23904\n", + " Vendor Intel(R) Corporation\n", + " Filter string level_zero:gpu:0\n", + "\n" + ] + } + ], + "source": [ + "import dpctl\n", + "\n", + "try:\n", + " # Create a SyclDevice of type GPU based on whatever is returned\n", + " # by the SYCL `gpu_selector` device selector class.\n", + " gpu = dpctl.select_gpu_device()\n", + " gpu.print_device_info() # Print GPU device information\n", + "\n", + "except Exception:\n", + " print(\"GPU device is not available\")" + ] + }, + { + "cell_type": "markdown", + "id": "c378b79d", + "metadata": {}, + "source": [ + "Or, using the information in the device's filter string, create an explicit SyclDevice." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "ad83abb5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Name Intel(R) Iris(R) Xe Graphics\n", + " Driver version 1.3.23904\n", + " Vendor Intel(R) Corporation\n", + " Profile FULL_PROFILE\n", + " Filter string level_zero:gpu:0\n", + "\n" + ] + } + ], + "source": [ + "import dpctl\n", + "\n", + "# Create a SyclDevice with an explicit filter string,\n", + "# in this case the first level_zero GPU device.\n", + "try:\n", + " level_zero_gpu = dpctl.SyclDevice(\"level_zero:gpu:0\")\n", + " level_zero_gpu.print_device_info()\n", + "except Exception:\n", + " print(\"The first level_zero GPU device is not available\")" + ] + }, + { + "cell_type": "markdown", + "id": "eadefe0b", + "metadata": {}, + "source": [ + "Let's check whether your GPU device supports double precision. 
To do this, we select the GPU device and check its has_aspect_fp64 attribute:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "a94756d7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Name Intel(R) Iris(R) Xe Graphics\n", + " Driver version 1.3.23904\n", + " Vendor Intel(R) Corporation\n", + " Filter string level_zero:gpu:0\n", + "\n", + "Double precision support is False\n" + ] + } + ], + "source": [ + "import dpctl\n", + "\n", + "# Select the GPU device and check double precision support\n", + "try:\n", + " gpu = dpctl.select_gpu_device()\n", + " gpu.print_device_info()\n", + " print(\"Double precision support is\", gpu.has_aspect_fp64)\n", + "except Exception:\n", + " print(\"The GPU device is not available\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/sources/02-dpnp_numpy_fallback.ipynb b/docs/sources/02-dpnp_numpy_fallback.ipynb new file mode 100644 index 0000000..ca6bcf5 --- /dev/null +++ b/docs/sources/02-dpnp_numpy_fallback.ipynb @@ -0,0 +1,140 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "be340585", + "metadata": {}, + "source": [ + "# Usage of NumPy functions in the dpnp library" + ] + }, + { + "cell_type": "markdown", + "id": "9008dbe5", + "metadata": {}, + "source": [ + "1. 
Example of using the `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK` environment variable with a dpnp function that is not yet implemented and would fall back to NumPy" + ] + }, + { + "cell_type": "markdown", + "id": "5d4300fe", + "metadata": {}, + "source": [ + "Not all functions have been implemented in the dpnp library yet; some of them require enabling a direct fallback to the NumPy library.\n", + "One example is the function `dpnp.full()` with the `like` parameter set to a non-default value.\n", + "Let's look at an example where we want to create a two-dimensional array filled with a single value, using the array-like option." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "c78511cd", + "metadata": {}, + "outputs": [ + { + "ename": "NotImplementedError", + "evalue": "Requested funtion=full with args=((2, 2), 3, None, 'C') and kwargs={'like': } isn't currently supported and would fall back on NumPy implementation. Define enviroment variable `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK` to `0` if the fall back is required to be supported without rasing an exception.", + "output_type": "error", + "traceback": [ + "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[1;31mNotImplementedError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[1;32mIn[1], line 4\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mdpnp\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mnp\u001b[39;00m\n\u001b[0;32m 3\u001b[0m \u001b[38;5;66;03m# Create an two dimencial array with singular element and array like option\u001b[39;00m\n\u001b[1;32m----> 4\u001b[0m a \u001b[38;5;241m=\u001b[39m 
\u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfull\u001b[49m\u001b[43m(\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m,\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[38;5;241;43m3\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlike\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mzeros\u001b[49m\u001b[43m(\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 5\u001b[0m \u001b[38;5;28mprint\u001b[39m (\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mArray a is located on the device:\u001b[39m\u001b[38;5;124m\"\u001b[39m, a\u001b[38;5;241m.\u001b[39mdevice)\n\u001b[0;32m 6\u001b[0m \u001b[38;5;28mprint\u001b[39m (\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mArray a is\u001b[39m\u001b[38;5;124m\"\u001b[39m, a)\n", + "File \u001b[1;32m~\\Anaconda3\\envs\\my_env\\lib\\site-packages\\dpnp\\dpnp_iface_arraycreation.py:732\u001b[0m, in \u001b[0;36mfull\u001b[1;34m(shape, fill_value, dtype, order, like, device, usm_type, sycl_queue)\u001b[0m\n\u001b[0;32m 723\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 724\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m dpnp_container\u001b[38;5;241m.\u001b[39mfull(shape,\n\u001b[0;32m 725\u001b[0m fill_value,\n\u001b[0;32m 726\u001b[0m dtype\u001b[38;5;241m=\u001b[39mdtype,\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 729\u001b[0m usm_type\u001b[38;5;241m=\u001b[39musm_type,\n\u001b[0;32m 730\u001b[0m sycl_queue\u001b[38;5;241m=\u001b[39msycl_queue)\n\u001b[1;32m--> 732\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m 
\u001b[43mcall_origin\u001b[49m\u001b[43m(\u001b[49m\u001b[43mnumpy\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfull\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mshape\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mfill_value\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43morder\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlike\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlike\u001b[49m\u001b[43m)\u001b[49m\n", + "File \u001b[1;32mdpnp\\dpnp_utils\\dpnp_algo_utils.pyx:132\u001b[0m, in \u001b[0;36mdpnp.dpnp_utils.dpnp_algo_utils.call_origin\u001b[1;34m()\u001b[0m\n", + "\u001b[1;31mNotImplementedError\u001b[0m: Requested funtion=full with args=((2, 2), 3, None, 'C') and kwargs={'like': } isn't currently supported and would fall back on NumPy implementation. Define enviroment variable `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK` to `0` if the fall back is required to be supported without rasing an exception." + ] + } + ], + "source": [ + "import dpnp as np\n", + "\n", + "# Create a two-dimensional array with a single fill value and the array-like option\n", + "a = np.full((2, 2), 3, like=np.zeros((2, 0)))\n", + "print(\"Array a is located on the device:\", a.device)\n", + "print(\"Array a is\", a)" + ] + }, + { + "cell_type": "markdown", + "id": "ae026021", + "metadata": {}, + "source": [ + "As you can see, the function `dpnp.full()` with a non-default `like` parameter is not implemented in the dpnp library.\n", + "We got the following error message: \"Requested funtion=full with args=((2, 2), 3, None, 'C') and kwargs={'like': } isn't currently supported and would fall back on NumPy implementation. 
Define environment variable `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK` to `0` if the fall back is required to be supported without raising an exception.\"\n", + "By default, the `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK` environment variable is not null, which allows only the dpnp library to be used.\n", + "If we want to call the NumPy library for functions not supported in dpnp, we need to set this environment variable to `0`." + ] + }, + { + "cell_type": "markdown", + "id": "210440c3", + "metadata": {}, + "source": [ + "Let's rerun the same example with the `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK = 0` environment variable." + ] + }, + { + "cell_type": "markdown", + "id": "bca7490d", + "metadata": {}, + "source": [ + "`Note:` If you are working in a Jupyter Notebook, restart the kernel before running the example that sets `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK = 0`." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "05ad1565", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK = 0\n", + "Array a is located on the device: Device(level_zero:gpu:0)\n", + "Array a is [[3 3]\n", + " [3 3]]\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "# Set to 0 before importing dpnp to enable the fallback to NumPy;\n", + "# by default the variable is not null and only dpnp implementations are used\n", + "os.environ[\"DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK\"] = \"0\"\n", + "\n", + "import dpnp as np\n", + "\n", + "# Expect result 0\n", + "print(\"DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK =\", np.config.__DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK__)\n", + "\n", + "# Create a two-dimensional array with a single fill value and the array-like option\n", + "a = np.full((2, 2), 3, like=np.zeros((2, 0)))\n", + "print(\"Array a is located on the device:\", a.device)\n", + "print(\"Array a is\", a)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", 
"name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/sources/conf.py b/docs/sources/conf.py index c2255e0..ce78bee 100644 --- a/docs/sources/conf.py +++ b/docs/sources/conf.py @@ -1,118 +1,121 @@ -# ***************************************************************************** -# Copyright (c) 2022, Intel Corporation All rights reserved. -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# -# Redistributions of source code must retain the above copyright notice, -# this list of conditions and the following disclaimer. -# -# Redistributions in binary form must reproduce the above copyright notice, -# this list of conditions and the following disclaimer in the documentation -# and/or other materials provided with the distribution. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" -# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, -# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR -# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; -# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, -# WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR -# OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, -# EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-# ***************************************************************************** - -# coding: utf-8 -# Configuration file for the Sphinx documentation builder. - -# -- Project information ----------------------------------------------------- - -project = 'Data Parallel Extensions for Python*' -copyright = '2022, Intel Corporation' -author = 'Intel Corporation' - -# The full version, including alpha/beta/rc tags -release = '0.1' - -# -- General configuration ---------------------------------------------------- - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [ - 'sphinx.ext.todo', - 'sphinx.ext.intersphinx', - 'sphinx.ext.extlinks', - 'sphinx.ext.githubpages', - 'sphinx.ext.napoleon', - 'sphinx.ext.autosectionlabel', - 'sphinxcontrib.programoutput', -] - - -# Add any paths that contain templates here, relative to this directory. -#templates_path = ['_templates'] -templates_path = [] - -# List of patterns, relative to source directory, that match files and -# directories to ignore when looking for source files. -# This pattern also affects html_static_path and html_extra_path. -exclude_patterns = [] - - -# -- Options for HTML output ------------------------------------------------- - -# The theme to use for HTML and HTML Help pages. See the documentation for -# a list of builtin themes. -# -html_theme = 'sdc-sphinx-theme' - -html_theme_path = ['.'] - -html_theme_options = { -} - -# Add any paths that contain custom static files (such as style sheets) here, -# relative to this directory. They are copied after the builtin static files, -# so a file named "default.css" will overwrite the builtin "default.css". 
-html_static_path = [] - -html_sidebars = { - '**': ['globaltoc.html', 'sourcelink.html', 'searchbox.html', 'relations.html'], -} - -html_show_sourcelink = False - -# -- Todo extension configuration ---------------------------------------------- -todo_include_todos = True -todo_link_only = True - -# -- InterSphinx configuration: looks for objects in external projects ----- -# Add here external classes you want to link from Intel SDC documentation -# Each entry of the dictionary has the following format: -# 'class name': ('link to object.inv file for that class', None) -#intersphinx_mapping = { -# 'pandas': ('https://pandas.pydata.org/pandas-docs/stable/', None), -# 'python': ('http://docs.python.org/2', None), -# 'numpy': ('http://docs.scipy.org/doc/numpy', None) -#} -intersphinx_mapping = { -} - -# -- Napoleon extension configuration (Numpy and Google docstring options) ------- -#napoleon_google_docstring = True -#napoleon_numpy_docstring = True -#napoleon_include_init_with_doc = True -#napoleon_include_private_with_doc = True -#napoleon_include_special_with_doc = True -#napoleon_use_admonition_for_examples = False -#napoleon_use_admonition_for_notes = False -#napoleon_use_admonition_for_references = False -#napoleon_use_ivar = False -#napoleon_use_param = True -#napoleon_use_rtype = True - -# -- Prepend module name to an object name or not ----------------------------------- -add_module_names = False +# ***************************************************************************** +# Copyright (c) 2022, Intel Corporation All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# +# Redistributions of source code must retain the above copyright notice, +# this list of conditions and the following disclaimer. 
+# +# Redistributions in binary form must reproduce the above copyright notice, +# this list of conditions and the following disclaimer in the documentation +# and/or other materials provided with the distribution. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, +# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; +# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, +# WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR +# OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, +# EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +# ***************************************************************************** + +# coding: utf-8 +# Configuration file for the Sphinx documentation builder. + +# -- Project information ----------------------------------------------------- + +project = 'Data Parallel Extensions for Python*' +copyright = '2022, Intel Corporation' +author = 'Intel Corporation' + +# The full version, including alpha/beta/rc tags +release = '0.1' + +# -- General configuration ---------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. 
+extensions = [ + 'sphinx.ext.todo', + 'sphinx.ext.intersphinx', + 'sphinx.ext.extlinks', + 'sphinx.ext.githubpages', + 'sphinx.ext.napoleon', + 'sphinx.ext.autosectionlabel', + 'sphinxcontrib.programoutput', + 'nbsphinx', + 'sphinx_gallery.load_style', +] + + +# Add any paths that contain templates here, relative to this directory. +#templates_path = ['_templates'] +templates_path = [] + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path. +exclude_patterns = [] + + +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +html_theme = 'sdc-sphinx-theme' + +html_theme_path = ['.'] + +html_theme_options = { +} + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". 
+html_static_path = [] + +html_sidebars = { + '**': ['globaltoc.html', 'sourcelink.html', 'searchbox.html', 'relations.html'], +} + +html_show_sourcelink = False + +# -- Todo extension configuration ---------------------------------------------- +todo_include_todos = True +todo_link_only = True + +# -- InterSphinx configuration: looks for objects in external projects ----- +# Add here external classes you want to link from Intel SDC documentation +# Each entry of the dictionary has the following format: +# 'class name': ('link to object.inv file for that class', None) +#intersphinx_mapping = { +# 'pandas': ('https://pandas.pydata.org/pandas-docs/stable/', None), +# 'python': ('http://docs.python.org/2', None), +# 'numpy': ('http://docs.scipy.org/doc/numpy', None) +#} +intersphinx_mapping = { +} + +# -- Napoleon extension configuration (Numpy and Google docstring options) ------- +#napoleon_google_docstring = True +#napoleon_numpy_docstring = True +#napoleon_include_init_with_doc = True +#napoleon_include_private_with_doc = True +#napoleon_include_special_with_doc = True +#napoleon_use_admonition_for_examples = False +#napoleon_use_admonition_for_notes = False +#napoleon_use_admonition_for_references = False +#napoleon_use_ivar = False +#napoleon_use_param = True +#napoleon_use_rtype = True + +# -- Prepend module name to an object name or not ----------------------------------- +add_module_names = False + diff --git a/docs/sources/examples.rst b/docs/sources/examples.rst index ef070c4..cdd9274 100644 --- a/docs/sources/examples.rst +++ b/docs/sources/examples.rst @@ -1,55 +1,47 @@ -.. _examples: -.. include:: ./ext_links.txt - -List of examples -================ - -.. literalinclude:: ../../examples/01-hello_dpnp.py - :language: python - :lines: 27- - :caption: **EXAMPLE 01:** Your first NumPy code running on GPU - :name: examples_01_hello_dpnp - -.. 
literalinclude:: ../../examples/02-dpnp_device.py - :language: python - :lines: 27- - :caption: **EXAMPLE 02:** Select device type while creating array - :name: examples_02_dpnp_device - -.. literalinclude:: ../../examples/03-dpnp2numba-dpex.py - :language: python - :lines: 27- - :caption: **EXAMPLE 03:** Compile dpnp code with numba-dpex - :name: examples_03_dpnp2numba_dpex - -.. literalinclude:: ../../examples/04-dpctl_device_query.py - :language: python - :lines: 27- - :caption: **EXAMPLE 04:** Get information about devices - :name: examples_04_dpctl_device_query - -.. literalinclude:: ../../examples/05-dpctl_dpnp_test.py - :language: python - :lines: 46- - :caption: **EXAMPLE 05:** Test installation of ``dpctl_dpnp`` - :name: ex_05_dpnp_test - -.. literalinclude:: ../../examples/06-numba-dpex_test.py - :language: python - :lines: 45- - :caption: **EXAMPLE 06:** Test installation of ``numba-dpex`` - :name: 06-numba-dpex_test - -Benchmarks -********** - -.. todo:: - Provide instructions for dpbench - -Jupyter* Notebooks -****************** - -Instructions for Jupyter Notebook samples illustrating Data Parallel Extensions for Python - -.. literalinclude:: ../../examples/01-get_started.ipynb -.. literalinclude:: ../../examples/02-dpnp_numpy_fallback.ipynb \ No newline at end of file +.. _examples: +.. include:: ./ext_links.txt + +List of examples +================ + +.. literalinclude:: ../../examples/01-hello_dpnp.py + :language: python + :lines: 27- + :caption: **EXAMPLE 01:** Your first NumPy code running on GPU + :name: examples_01_hello_dpnp + +.. literalinclude:: ../../examples/02-dpnp_device.py + :language: python + :lines: 27- + :caption: **EXAMPLE 02:** Select device type while creating array + :name: examples_02_dpnp_device + +.. literalinclude:: ../../examples/03-dpnp2numba-dpex.py + :language: python + :lines: 27- + :caption: **EXAMPLE 03:** Compile dpnp code with numba-dpex + :name: examples_03_dpnp2numba_dpex + +.. 
literalinclude:: ../../examples/04-dpctl_device_query.py + :language: python + :lines: 27- + :caption: **EXAMPLE 04:** Get information about devices + :name: examples_04_dpctl_device_query + +.. literalinclude:: ../../examples/05-dpctl_dpnp_test.py + :language: python + :lines: 27- + :caption: **EXAMPLE 05:** Test installation of ``dpctl`` and ``dpnp`` + :name: ex_05_dpnp_test + +.. literalinclude:: ../../examples/06-numba-dpex_test.py + :language: python + :lines: 27- + :caption: **EXAMPLE 06:** Test installation of ``numba-dpex`` + :name: 06-numba-dpex_test + +Benchmarks +********** + +.. todo:: + Provide instructions for dpbench diff --git a/docs/sources/ext_links.txt b/docs/sources/ext_links.txt index ec86a87..eab6252 100644 --- a/docs/sources/ext_links.txt +++ b/docs/sources/ext_links.txt @@ -1,19 +1,19 @@ -.. - ********************************************************** - THESE ARE EXTERNAL PROJECT LINKS USED IN THE DOCUMENTATION - ********************************************************** -.. _NumPy*: https://numpy.org/ -.. _Numba*: https://numba.pydata.org/ -.. _Python* Array API Standard: https://data-apis.org/array-api/ -.. _Intel Distribution for Python*: https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html -.. _OpenCl*: https://www.khronos.org/opencl/ -.. _DPC++: https://www.apress.com/gp/book/9781484255735 -.. _Data Parallel Extension for Numba*: https://intelpython.github.io/numba-dpex/latest/index.html -.. _SYCL*: https://www.khronos.org/sycl/ -.. _Data Parallel Control: https://intelpython.github.io/dpctl/latest/index.html -.. _Data Parallel Extension for Numpy*: https://intelpython.github.io/dpnp/ -.. _IEEE 754-2019 Standard for Floating-Point Arithmetic: https://standards.ieee.org/ieee/754/6210/ -.. _David Goldberg, What every computer scientist should know about floating-point arithmetic: https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf> -.. 
_Intel oneAPI Base Toolkit: https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html -.. _Intel VTune Profiler: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html -.. _Intel Advisor: https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html +.. + ********************************************************** + THESE ARE EXTERNAL PROJECT LINKS USED IN THE DOCUMENTATION + ********************************************************** +.. _NumPy*: https://numpy.org/ +.. _Numba*: https://numba.pydata.org/ +.. _Python* Array API Standard: https://data-apis.org/array-api/ +.. _Intel Distribution for Python*: https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html +.. _OpenCl*: https://www.khronos.org/opencl/ +.. _DPC++: https://www.apress.com/gp/book/9781484255735 +.. _Data Parallel Extension for Numba*: https://intelpython.github.io/numba-dpex/latest/index.html +.. _SYCL*: https://www.khronos.org/sycl/ +.. _Data Parallel Control: https://intelpython.github.io/dpctl/latest/index.html +.. _Data Parallel Extension for Numpy*: https://intelpython.github.io/dpnp/ +.. _IEEE 754-2019 Standard for Floating-Point Arithmetic: https://standards.ieee.org/ieee/754/6210/ +.. _David Goldberg, What every computer scientist should know about floating-point arithmetic: https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf +.. _Intel oneAPI Base Toolkit: https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html +.. _Intel VTune Profiler: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html +.. _Intel Advisor: https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html diff --git a/docs/sources/heterogeneous_computing.rst b/docs/sources/heterogeneous_computing.rst index 6341f56..e1499c7 100644 --- a/docs/sources/heterogeneous_computing.rst +++ b/docs/sources/heterogeneous_computing.rst @@ -1,149 +1,149 @@ -.. 
_heterogeneous_computing: -.. include:: ./ext_links.txt - -Heterogeneous computing -======================= - -Device Offload -************** - -Python is an interpreted language, which implies that most of Python codes will run on CPU, -and only a few data parallel regions will execute on data parallel devices. -That is why the concept of host and offload devices is useful when it comes to conceptualizing -a heterogeneous programming model in Python. - -.. image:: ./_images/hetero-devices.png - :width: 600px - :align: center - :alt: SIMD - -The above diagram illustrates the *host* (the CPU which runs Python interpreter) and three *devices* -(two GPU devices and one attached accelerator device). **Data Parallel Extensions for Python** -offer a programming model where a script executed by Python interpreter on host can *offload* data -parallel kernels to user-specified device. A *kernel* is the *data parallel region* of a program submitted -for execution on the device. There can be multiple data parallel regions, and hence multiple *offload kernels*. - -Kernels can be pre-compiled into a library, such as ``dpnp``, or, alternatively, directly coded -in a programming language for heterogeneous computing, such as `OpenCl*`_ or `DPC++`_ . -**Data Parallel Extensions for Python** offer the way of writing kernels directly in Python -using `Numba*`_ compiler along with ``numba-dpex``, the `Data Parallel Extension for Numba*`_. - -One or more kernels are submitted for execution into a *queue* targeting an *offload device*. -For each device one or more queues can be created. In most cases you won’t need to work -with device queues directly. Data Parallel Extensions for Python will do necessary underlying -work with queues for you through the :ref:`Compute-Follows-Data`. - -Unified Shared Memory -********************* - -Each device has its own memory, not necessarily accessible from another device. - -.. 
image:: ./_images/hetero-devices.png - :width: 600px - :align: center - :alt: SIMD - -For example, **Device 1** memory may not be directly accessible from the host, but only accessible -via expensive copying by a driver software. Similarly, depending on the architecture, direct data -exchange between **Device 2** and **Device 1** may be impossible, and only possible via expensive -copying through the host memory. These aspects must be taken into consideration when programming -data parallel devices. - -In the above illustration the **Device 2** logically consists of two sub-devices, **Sub-Device 1** -and **Sub-Device 2**. The programming model allows accessing **Device 2** as a single logical device, or -by working with each individual sub-devices. For the former case a programmer needs to create -a queue for **Device 2**. For the latter case a programmer needs to create 2 queues, one for each sub-device. - -`SYCL*`_ standard introduces a concept of the *Unified Shared Memory* (USM). USM requires hardware support -for unified virtual address space, which allows coherency between the host and the device -pointers. All memory is allocated by the host, but it offers three distinct allocation types: - -* **Host: located on the host, accessible by the host or device.** This type of memory is useful in a situation - when you need to stream a read-only data from the host to the device once. - -* **Device: located on the device, accessibly only by the device.** This type of memory is the fastest one. - Useful in a situation when most of data crunching happens on the device. - -* **Shared: location is both host and device (copies are synchronized by underlying software), accessible by - the host or device.** Shared allocations are useful when data are accessed by both host and devices, - since a user does not need to explicitly manage data migration. However, it is much slower than USM Device memory type. 
- -Compute-Follows-Data -******************** -Since data copying between devices is typically very expensive, for performance reasons it is essential -to process data close to where it is allocated. This is the premise of the *Compute-Follows-Data* programming model, -which states that the compute will happen where the data resides. Tensors implemented in ``dpctl`` and ``dpnp`` -carry information about allocation queues, and hence, about the device on which an array is allocated. -Based on tensor input arguments of the offload kernel, it deduces the queue on which the execution takes place. - -.. image:: ./_images/kernel-queue-device.png - :width: 600px - :align: center - :alt: SIMD - -The above picture illustrates the *Compute-Follows-Data* concept. Arrays ``A`` and ``B`` are inputs to the -**Offload Kernel**. These arrays carry information about their *allocation queue* (**Device Queue**) and the -*device* (**Device 1**) where they were created. According to the Compute-Follows-Data paradigm -the **Offload Kernel** will be submitted to this **Device Queue**, and the resulting array ``C`` will -be created on the **Device Queue** associated with the **Device 1**. - -**Data Parallel Extensions for Python** require all input tensor arguments to have the **same** allocation queue, -otherwise an exception will be thrown. For example, the following usages will result in the exception. - -.. figure:: ./_images/queue-exception1.png - :width: 600px - :align: center - :alt: SIMD - - Input tensors are on different devices and different queues. Exception is thrown. - -.. figure:: ./_images/queue-exception2.png - :width: 600px - :align: center - :alt: SIMD - - Input tensors are on the same device but queues are different. Exception is thrown. - -.. figure:: ./_images/queue-exception3.png - :width: 600px - :align: center - :alt: SIMD - - Data belongs to the same device, but queues are different and associated with different sub-devices. 
- -Copying data between devices and queues -*************************************** - -**Data Parallel Extensions for Python** create **one** *canonical queue* per device so that in -normal circumstances you do not need to directly manage queues. Having one canonical queue per device -allows you to copy data between devices using to_device() method: - -.. code-block:: python - - a_new = a.to_device(b.device) - -Array ``a`` will be copied to the device associated with array ``b`` into the new array ``a_new``. -The same queue will be associated with ``b`` and ``a_new``. - -Alternatively, you can do this as follows: - -.. code-block:: python - :caption: DPNP array - - a_new = dpnp.asarray(a, device=b.device) - -.. code-block:: python - :caption: DPCtl array - - a_new = dpctl.tensor.asarray(a, device=b.device) - -Creating additional queues -************************** - -As previously indicated **Data Parallel Extensions for Python** automatically create one canonical queue per device, -and you normally work with this queue implicitly. However, you can always create as many additional queues per device -as needed, and work with them explicitly. - -A typical situation when you will want to create the queue explicitly is for profiling purposes. -Read `Data Parallel Control`_ documentation for more details about queues. - +.. _heterogeneous_computing: +.. include:: ./ext_links.txt + +Heterogeneous computing +======================= + +Device Offload +************** + +Python is an interpreted language, which implies that most Python code will run on the CPU, +and only a few data parallel regions will execute on data parallel devices. +That is why the concept of host and offload devices is useful when it comes to conceptualizing +a heterogeneous programming model in Python. + +.. 
image:: ./_images/hetero-devices.png + :width: 600px + :align: center + :alt: SIMD + +The above diagram illustrates the *host* (the CPU that runs the Python interpreter) and three *devices* +(two GPU devices and one attached accelerator device). **Data Parallel Extensions for Python** +offer a programming model where a script executed by the Python interpreter on the host can *offload* data +parallel kernels to a user-specified device. A *kernel* is the *data parallel region* of a program submitted +for execution on the device. There can be multiple data parallel regions, and hence multiple *offload kernels*. + +Kernels can be pre-compiled into a library, such as ``dpnp``, or, alternatively, directly coded +in a programming language for heterogeneous computing, such as `OpenCl*`_ or `DPC++`_. +**Data Parallel Extensions for Python** offer a way to write kernels directly in Python +using the `Numba*`_ compiler along with ``numba-dpex``, the `Data Parallel Extension for Numba*`_. + +One or more kernels are submitted for execution into a *queue* targeting an *offload device*. +For each device, one or more queues can be created. In most cases you won’t need to work +with device queues directly. Data Parallel Extensions for Python will do the necessary underlying +work with queues for you through the :ref:`Compute-Follows-Data`. + +Unified Shared Memory +********************* + +Each device has its own memory, not necessarily accessible from another device. + +.. image:: ./_images/hetero-devices.png + :width: 600px + :align: center + :alt: SIMD + +For example, **Device 1** memory may not be directly accessible from the host, but only accessible +via expensive copying by driver software. Similarly, depending on the architecture, direct data +exchange between **Device 2** and **Device 1** may be impossible, and only possible via expensive +copying through the host memory. These aspects must be taken into consideration when programming +data parallel devices.
+ +In the above illustration, **Device 2** logically consists of two sub-devices, **Sub-Device 1** +and **Sub-Device 2**. The programming model allows accessing **Device 2** as a single logical device, or +by working with each individual sub-device. In the former case, a programmer needs to create +a queue for **Device 2**. In the latter case, a programmer needs to create two queues, one for each sub-device. + +The `SYCL*`_ standard introduces the concept of *Unified Shared Memory* (USM). USM requires hardware support +for a unified virtual address space, which allows coherency between the host and the device +pointers. All memory is allocated by the host, but it offers three distinct allocation types: + +* **Host: located on the host, accessible by the host or device.** This type of memory is useful + when you need to stream read-only data from the host to the device once. + +* **Device: located on the device, accessible only by the device.** This type of memory is the fastest one. + Useful when most of the data crunching happens on the device. + +* **Shared: located on both the host and device (copies are synchronized by underlying software), accessible by + the host or device.** Shared allocations are useful when data are accessed by both the host and devices, + since a user does not need to explicitly manage data migration. However, it is much slower than the USM device memory type. + +Compute-Follows-Data +******************** +Since data copying between devices is typically very expensive, for performance reasons it is essential +to process data close to where it is allocated. This is the premise of the *Compute-Follows-Data* programming model, +which states that the compute will happen where the data resides. Tensors implemented in ``dpctl`` and ``dpnp`` +carry information about allocation queues, and hence, about the device on which an array is allocated.
+Based on the tensor input arguments of the offload kernel, the runtime deduces the queue on which the execution takes place. + +.. image:: ./_images/kernel-queue-device.png + :width: 600px + :align: center + :alt: SIMD + +The above picture illustrates the *Compute-Follows-Data* concept. Arrays ``A`` and ``B`` are inputs to the +**Offload Kernel**. These arrays carry information about their *allocation queue* (**Device Queue**) and the +*device* (**Device 1**) where they were created. According to the Compute-Follows-Data paradigm, +the **Offload Kernel** will be submitted to this **Device Queue**, and the resulting array ``C`` will +be created on the **Device Queue** associated with **Device 1**. + +**Data Parallel Extensions for Python** require all input tensor arguments to have the **same** allocation queue; +otherwise an exception is thrown. For example, the following usages will result in an exception. + +.. figure:: ./_images/queue-exception1.png + :width: 600px + :align: center + :alt: SIMD + + Input tensors are on different devices and different queues. Exception is thrown. + +.. figure:: ./_images/queue-exception2.png + :width: 600px + :align: center + :alt: SIMD + + Input tensors are on the same device but queues are different. Exception is thrown. + +.. figure:: ./_images/queue-exception3.png + :width: 600px + :align: center + :alt: SIMD + + Data belongs to the same device, but queues are different and associated with different sub-devices. + +Copying data between devices and queues +*************************************** + +**Data Parallel Extensions for Python** create **one** *canonical queue* per device so that in +normal circumstances you do not need to directly manage queues. Having one canonical queue per device +allows you to copy data between devices using the ``to_device()`` method: + +.. code-block:: python + + a_new = a.to_device(b.device) + +Array ``a`` will be copied to the device associated with array ``b`` into the new array ``a_new``.
+The same queue will be associated with ``b`` and ``a_new``. + +Alternatively, you can do this as follows: + +.. code-block:: python + :caption: DPNP array + + a_new = dpnp.asarray(a, device=b.device) + +.. code-block:: python + :caption: DPCtl array + + a_new = dpctl.tensor.asarray(a, device=b.device) + +Creating additional queues +************************** + +As previously indicated, **Data Parallel Extensions for Python** automatically create one canonical queue per device, +and you normally work with this queue implicitly. However, you can always create as many additional queues per device +as needed, and work with them explicitly. + +A typical situation in which you will want to create a queue explicitly is profiling. +Read the `Data Parallel Control`_ documentation for more details about queues. + diff --git a/docs/sources/index.rst b/docs/sources/index.rst index 0cca6fa..ab28e4a 100644 --- a/docs/sources/index.rst +++ b/docs/sources/index.rst @@ -1,33 +1,36 @@ -.. _index: -.. include:: ./ext_links.txt - -.. image:: ./_images/DPEP-large.png - :width: 400px - :align: center - :alt: Data Parallel Extensions for Python - -Data Parallel Extensions for Python -=================================== - -Data Parallel Extensions for Python* extend numerical Python capabilities beyond CPU and allow even higher performance -gains on data parallel devices such as GPUs. It consists of three foundational packages: - -* **dpnp** - Data Parallel Extensions for `Numpy*`_ - a library that implements a subset of - Numpy that can be executed on any data parallel device. The subset is a drop-in replacement - of core Numpy functions and numerical data types. -* **numba_dpex** - Data Parallel Extensions for `Numba*`_ - extension for Numba compiler - that enables programming data parallel devices the same way you program CPU with Numba.
-* **dpctl - Data Parallel Control library** that provides utilities for device selection, - allocation of data on devices, tensor data structure along with `Python* Array API Standard`_ implementation, and support for creation of user-defined data-parallel extensions. - -Table of Contents -***************** -.. toctree:: - :maxdepth: 2 - - prerequisites_and_installation - parallelism - heterogeneous_computing - programming_dpep - examples - useful_links +.. _index: +.. include:: ./ext_links.txt + +.. image:: ./_images/DPEP-large.png + :width: 400px + :align: center + :alt: Data Parallel Extensions for Python + +Data Parallel Extensions for Python +=================================== + +Data Parallel Extensions for Python* extend numerical Python capabilities beyond the CPU and allow even higher performance +gains on data parallel devices such as GPUs. They consist of three foundational packages: + +* **dpnp** - Data Parallel Extensions for `Numpy*`_ - a library that implements a subset of + Numpy that can be executed on any data parallel device. The subset is a drop-in replacement + of core Numpy functions and numerical data types. +* **numba_dpex** - Data Parallel Extensions for `Numba*`_ - an extension for the Numba compiler + that enables programming data parallel devices the same way you program the CPU with Numba. +* **dpctl** - Data Parallel Control - a library that provides utilities for device selection, + allocation of data on devices, a tensor data structure along with a `Python* Array API Standard`_ implementation, and support for creation of user-defined data-parallel extensions. + +Table of Contents +***************** +.. 
toctree:: + :maxdepth: 2 + + prerequisites_and_installation + parallelism + heterogeneous_computing + programming_dpep + examples + jupiter_notebook + useful_links + + \ No newline at end of file diff --git a/docs/sources/jupiter_notebook.rst b/docs/sources/jupiter_notebook.rst new file mode 100644 index 0000000..3883747 --- /dev/null +++ b/docs/sources/jupiter_notebook.rst @@ -0,0 +1,16 @@ +.. _jupiter_notebook: +.. include:: ./ext_links.txt + +Jupyter* Notebooks +****************** + +Instructions for Jupyter* Notebook samples illustrating Data Parallel Extensions for Python: + +.. toctree:: + Getting Started <01-get_started.ipynb> + DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK <02-dpnp_numpy_fallback.ipynb> + +To download all examples as Jupyter* notebooks, please use the following links: +* `Getting Started <01-get_started.ipynb>`_ +* `DPNP_RAISE_EXCEPION_ON_NUMPY_FALLBACK <02-dpnp_numpy_fallback.ipynb>`_ + diff --git a/docs/sources/parallelism.rst b/docs/sources/parallelism.rst index f9f2ad1..d238286 100644 --- a/docs/sources/parallelism.rst +++ b/docs/sources/parallelism.rst @@ -1,44 +1,44 @@ -.. _parallelism: -.. include:: ./ext_links.txt - -Parallelism in modern data parallel architectures -================================================= - -Python is loved for its productivity and interactivity. But when it comes to dealing with -computationally heavy codes Python performance cannot be compromised. Intel and Python numerical -computing communities, such as `NumFOCUS `_, dedicated attention to -optimizing core numerical and data science packages for leveraging parallelism available in modern CPUs: - -* **Multiple computational cores:** Several computational cores allow processing data concurrently. - Compared to a single core CPU, *N* cores can process either *N* times bigger data in a fixed time, or - reduce a computation time *N* times for a fixed amount of data. - -.. 
image:: ./_images/dpep-cores.png - :width: 600px - :align: center - :alt: Multiple CPU Cores - -* **SIMD parallelism:** SIMD (Single Instruction Multiple Data) is a special type of instructions - that perform operations on vectors of data elements at the same time. The size of vectors is called SIMD width. - If SIMD width is *K* then a SIMD instruction can process *K* data elements in parallel. - - In the following diagram the SIMD width is 2, which means that a single instruction processes two elements simultaneously. - Compared to regular instructions that process one element at a time, 2-wide SIMD instruction performs - 2 times more data in fixed time, or, respectively, process a fixed amount of data 2 times faster. - -.. image:: ./_images/dpep-simd.png - :width: 150px - :align: center - :alt: SIMD - -* **Instruction-Level Parallelism:** Modern CISC architectures, such as x86, allow performing data independent - instructions in parallel. In the following example, we compute :math:`a * b + (c - d)`. - Operations :math:`*` and :math:`-` can be executed in parallel, the last instruction - :math:`+` depends on availability of :math:`a * b` and :math:`c - d` and hence cannot be executed in parallel - with :math:`*` and :math:`-`. - -.. image:: ./_images/dpep-ilp.png - :width: 150px - :align: center - :alt: SIMD - +.. _parallelism: +.. include:: ./ext_links.txt + +Parallelism in modern data parallel architectures +================================================= + +Python is loved for its productivity and interactivity. But when it comes to dealing with +computationally heavy code, Python performance cannot be compromised. Intel and Python numerical +computing communities, such as `NumFOCUS <https://numfocus.org/>`_, have dedicated attention to +optimizing core numerical and data science packages for leveraging parallelism available in modern CPUs: + +* **Multiple computational cores:** Several computational cores allow processing data concurrently.
+ Compared to a single core CPU, *N* cores can process either *N* times more data in a fixed time, or + reduce the computation time *N* times for a fixed amount of data. + +.. image:: ./_images/dpep-cores.png + :width: 600px + :align: center + :alt: Multiple CPU Cores + +* **SIMD parallelism:** SIMD (Single Instruction Multiple Data) is a special type of instruction + that performs operations on vectors of data elements at the same time. The size of these vectors is called the SIMD width. + If the SIMD width is *K*, then a SIMD instruction can process *K* data elements in parallel. + + In the following diagram, the SIMD width is 2, which means that a single instruction processes two elements simultaneously. + Compared to regular instructions that process one element at a time, a 2-wide SIMD instruction processes + 2 times more data in a fixed time or, respectively, processes a fixed amount of data 2 times faster. + +.. image:: ./_images/dpep-simd.png + :width: 150px + :align: center + :alt: SIMD + +* **Instruction-Level Parallelism:** Modern CISC architectures, such as x86, allow performing data-independent + instructions in parallel. In the following example, we compute :math:`a * b + (c - d)`. + Operations :math:`*` and :math:`-` can be executed in parallel; the last instruction + :math:`+` depends on the availability of :math:`a * b` and :math:`c - d` and hence cannot be executed in parallel + with :math:`*` and :math:`-`. + +.. image:: ./_images/dpep-ilp.png + :width: 150px + :align: center + :alt: SIMD + diff --git a/docs/sources/programming_dpep.rst b/docs/sources/programming_dpep.rst index 68b67b8..db2af12 100644 --- a/docs/sources/programming_dpep.rst +++ b/docs/sources/programming_dpep.rst @@ -1,307 +1,307 @@ -.. _programming_dpep: -.. include:: ./ext_links.txt - -Programming with Data Parallel Extensions for Python -==================================================== - -.. 
image:: ./_images/dpep-all.png - :width: 600px - :align: center - :alt: Data Parallel Extensions for Python - -As we briefly outlined, **Data Parallel Extensions for Python** consist of three foundational packages: - -* the `Numpy*`_-like library, ``dpnp``; -* the compiler extension for `Numba*`_, ``numba-dpex`` -* the library for managing devices, queues, and heterogeneous data, ``dpctl``. - -Their underlying implementation is based on `SYCL*`_ standard, which is a cross-platform abstraction layer -for heterogeneous computing on data parallel devices, such as CPU, GPU, or domain specific accelerators. - -Scalars vs. 0-dimensional arrays -******************************** - -Primitive types such as Python’s and Numpy’s ``float``, ``int``, or ``complex``, used to represent scalars, -have the host storage. In contrast, ``dpctl.tensor.usm_ndarray`` and ``dpnp.ndarray`` have USM storage -and carry associated allocation queue. For the :ref:`Compute-Follows-Data` consistent behavior -all ``dpnp`` operations that produce scalars will instead produce respective 0-dimensional arrays. - -That implies, that some code changes may be needed to replace scalar math operations with respective -``dpnp`` array operations. See `Data Parallel Extension for Numpy*`_ - **API Reference** section for details. - -Data Parallel Extension for Numpy - dpnp -**************************************** - -The ``dpnp`` library is a bare minimum to start programming numerical codes for data parallel devices. -You may already have a Python script written in `Numpy*`_. Being a drop-in replacement of (a subset of) `Numpy*`_, -to execute your `Numpy*`_ script on GPU usually requires changing just a few lines of the code: - -.. 
literalinclude:: ../../examples/01-hello_dpnp.py - :language: python - :lines: 27- - :caption: **EXAMPLE 01:** Your first NumPy code running on GPU - :name: ex_01_hello_dpnp - - -In this example ``np.asarray()`` creates an array on the default `SYCL*`_ device, which is ``"gpu"`` on systems -with integrated or discrete GPU (it is ``"host"`` on systems that do not have GPU). -The queue associated with this array is now carried with ``x``, and ``np.sum(x)`` will derive it from ``x``, -and respective pre-compiled kernel implementing ``np.sum()`` will be submitted to that queue. -The result ``y`` will be allocated on the device 0-dimensional array associated with that queue too. - -All ``dpnp`` array creation routines as well as random number generators have additional optional keyword arguments -``device``, ``queue``, and ``usm_type``, using which you can explicitly specify on which device or queue you want -the tensor data to be created along with USM memory type to be used (``"host"``, ``"device"``, or ``"shared"``). -In the following example we create the array ``x`` on the GPU device, and perform a reduction sum on it: - -.. literalinclude:: ../../examples/02-dpnp_device.py - :language: python - :lines: 27- - :caption: **EXAMPLE 02:** Select device type while creating array - :name: ex_02_dpnp_device - - -Data Parallel Extension for Numba - numba-dpex -********************************************** - -`Numba*`_ is a powerful Just-In-Time (JIT) compiler that works best on `Numpy*`_ arrays, `Numpy*`_ functions, and loops. -Data parallel loops is where the data parallelism resides. It allows leveraging all available CPU cores, -SIMD instructions, and schedules those in a way that exploits maximum instruction-level parallelism. -The ``numba-dpex`` extension allows to compile and offload data parallel regions to any data parallel device. -It takes just a few lines to modify your CPU `Numba*`_ script to run on GPU. - -.. 
.. literalinclude:: ../../examples/03-dpnp2numba-dpex.py
   :language: python
   :lines: 27-
   :caption: **EXAMPLE 03:** Compile dpnp code with numba-dpex
   :name: ex_03_dpnp2numba_dpex

In this example we implement a custom function ``sum_it()`` that takes an array input, and we compile it with the
`Data Parallel Extension for Numba*`_. Being a just-in-time compiler, Numba derives the queue from the input argument ``x``,
which is associated with the default device (``"gpu"`` on systems with an integrated or discrete GPU), and
dynamically compiles the kernel submitted to that queue. The result will reside as a 0-dimensional array on the device
associated with the queue, and on exit from the offload kernel it will be assigned to the tensor ``y``.

The ``parallel=True`` setting in ``@njit`` is essential to enable the generation of data parallel kernels.
Please also note that we use ``fastmath=True`` in the ``@njit`` decorator. This is an important setting
that instructs the compiler that you are okay with NOT preserving the order of floating-point operations.
It enables the generation of faster instructions (such as SIMD) for greater performance.

Data Parallel Control - dpctl
*****************************

Both ``dpnp`` and ``numba-dpex`` provide enough API versatility for programming data parallel devices, but
there are situations when you will need ``dpctl``'s advanced capabilities:

1. **Advanced device management.** Both ``dpnp`` and ``numba-dpex`` support Numpy array creation routines
   with additional parameters that specify the device on which the data is allocated and the type of memory to be used
   (``"device"``, ``"host"``, or ``"shared"``). However, if you need more advanced device and data management
   capabilities, you will also need to import ``dpctl`` in addition to ``dpnp`` and/or ``numba-dpex``.

   One frequent usage of ``dpctl`` is to query the list of devices present on the system and the available driver
   backends (such as ``"opencl"``, ``"level_zero"``, ``"cuda"``, etc.).
   Another frequent usage is the creation of additional queues for profiling or for out-of-order
   execution of offload kernels.

.. literalinclude:: ../../examples/04-dpctl_device_query.py
   :language: python
   :lines: 27-
   :caption: **EXAMPLE 04:** Get information about devices
   :name: ex_04_dpctl_device_query

2. **Cross-platform development using the Python Array API standard.** If you are a Python developer
   writing Numpy-like code that targets different hardware vendors and different tensor implementations,
   then the `Python* Array API Standard`_ is a good choice for writing portable Numpy-like code.
   ``dpctl.tensor`` implements the `Python* Array API Standard`_ for `SYCL*`_ devices. Accompanied by the
   respective SYCL device drivers from different vendors, ``dpctl.tensor`` becomes a portable solution
   for writing numerical codes for any SYCL device.

   For example, some Python communities, such as the
   `Scikit-Learn* community `_, are already establishing
   a path for having algorithms (re-)implemented using the `Python* Array API Standard`_.
   This is a reliable path for extending their capabilities beyond CPU-only execution, or beyond a single GPU vendor.

3. **Zero-copy data exchange between tensor implementations.** Certain Python projects may have their own tensor
   implementations that do not rely on ``dpctl.tensor`` or ``dpnp.ndarray`` tensors. Can users still exchange data
   between these tensors without copying it back and forth through the host?
   The `Python* Array API Standard`_ specifies a data exchange protocol for zero-copy exchange
   between tensors through ``dlpack``. As an implementation of the `Python* Array API Standard`_,
   ``dpctl`` provides the ``dpctl.tensor.from_dlpack()`` function for creating a zero-copy view of another tensor input.
Debugging and profiling Data Parallel Extensions for Python
***********************************************************

`Intel oneAPI Base Toolkit`_ provides two tools that help programmers analyze performance issues in programs
that use **Data Parallel Extensions for Python**: `Intel VTune Profiler`_ and `Intel Advisor`_.

Intel VTune Profiler examines various performance aspects of a program, such as the most time-consuming parts,
the efficiency of offloaded code, and the impact of memory sub-systems.

Intel Advisor provides insights on the performance of offloaded code relative to the peak performance and
memory bandwidth.

Next, we detail the steps involved in using Intel VTune Profiler and Intel Advisor with
heterogeneous programs that use **Data Parallel Extensions for Python**.

Profiling with Intel VTune Profiler
-----------------------------------

.. |copy| unicode:: U+000A9

.. |trade| unicode:: U+2122

Intel |copy| VTune |trade| Profiler provides two analyses, called *GPU offload* and *GPU hotspots*, to profile
heterogeneous programs targeted to GPUs.

The *GPU offload* analysis profiles the entire application (both GPU and host code) and helps to identify
whether the application is CPU or GPU bound. It reports the proportion of the execution time spent
in GPU execution and points out the various hotspots in the program. The key goal of the *GPU offload*
analysis is to identify the parts of the program that can benefit from offloading to GPUs.

The *GPU hotspots* analysis focuses on the performance of GPU-offloaded code.
It provides insights into the parallelism in the GPU kernel, the efficiency of the kernel, SIMD utilization,
and memory latency. It also reports performance data on synchronization operations such as GPU barriers and
atomic operations.
The following instructions are used to execute the two Intel VTune Profiler analyses on programs written
using **Data Parallel Extensions for Python**.

.. code-block:: console
   :caption: **GPU Offload**

   > vtune -collect gpu-offload -r -- python
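The *GPU hotspots* analysis is collected and reported the same way. The result-directory and script names below are illustrative placeholders, not taken from the original examples:

```shell
# Collect the GPU hotspots analysis for a Python script (placeholder names)
> vtune -collect gpu-hotspots -r ./vtune_hotspots -- python my_dpnp_script.py

# Print a summary of the collected result
> vtune -report summary -r ./vtune_hotspots
```

Here ``-r`` is VTune's shorthand for ``-result-dir``; everything after ``--`` is the profiled command line.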