@@ -427,7 +427,7 @@
"outputs": [],
"source": [
"import cupy as cp\n",
"from cuda.core.experimental import launch, LaunchConfig\n",
"from cuda.core.experimental import launch, LaunchConfig, ProgramOptions\n",
"\n",
"def execute_vector_add():\n",
" # Initialize device and create a stream\n",
@@ -548,7 +548,7 @@
" print(f\"Multiplying {N}x{N} matrices\")\n",
"\n",
" # Compile the templated matrix multiplication kernel with specific C++ compiler flags\n",
" program_options = ProgramOptions(std=\"c++17\", arch=f\"sm_{arch}\")\n",
" program_options = ProgramOptions(std=\"c++17\")\n",
" program = Program(matmul_source, code_type='c++', options=program_options)\n",
" compiled_program = program.compile(target_type='cubin', name_expressions=(\"matrix_multiply<float>\",))\n",
" kernel = compiled_program.get_kernel(\"matrix_multiply<float>\")\n",
@@ -889,4 +889,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
@@ -6,11 +6,22 @@
"id": "-JpGaP7-D_5W"
},
"source": [
"## Exercise - Kernel Authoring - Copy\n",
"## Kernel Authoring - Copy\n",
"\n",
"In this exercise, we'll learn how to analyze and reason about the performance of CUDA kernels using the NVIDIA Nsight Compute profiler.\n",
"### Table of Contents\n",
"1. [Environment Setup](#1-environment-setup)\n",
"2. [The Baseline Kernel: Blocked Copy](#2-the-baseline-kernel-blocked-copy)\n",
"3. [Profiling the Baseline](#3-profiling-the-baseline)\n",
"4. [Optimization Challenge: Improved Memory Access](#4-optimization-challenge-improved-memory-access)\n",
"5. [Verification & Benchmarking](#5-verification--benchmarking)\n",
"6. [Profiling the Optimized Kernel](#6-profiling-the-optimized-kernel)\n",
"7. [Further Exploration](#7-further-exploration)\n",
"\n",
"We'll look at a few different ways of writing a simple kernel that copies items from one array to another.\n",
"---\n",
"\n",
"## 1. Environment Setup\n",
"\n",
"In this exercise, we'll learn how to analyze and reason about the performance of CUDA kernels using the NVIDIA Nsight Compute profiler. We'll look at a few different ways of writing a simple kernel that copies items from one array to another.\n",
"\n",
"First, we need to make sure the Nsight Compute profiler, Nsightful, Numba CUDA, and CuPy are available in our notebook:"
]
@@ -44,6 +55,8 @@
"id": "A1SfTQk0EwUl"
},
"source": [
"## 2. The Baseline Kernel: Blocked Copy\n",
"\n",
"Now, we'll write our first kernel. Each thread will copy `items_per_thread` items from the `src` array to the `dst` array. We'll set the number of threads per block to a constant, `threads_per_block`. We'll calculate how many blocks to launch based on `items_per_thread` and `threads_per_block`. We use `cuda.grid(1)` to get the unique global 1D index of each thread.\n",
"\n",
"Each thread will copy a contiguous set of items, e.g. the items with indices `[base, base + items_per_thread)`:\n",
@@ -102,11 +115,11 @@
"id": "TuR4yDV4H6IB"
},
"source": [
"Next, we'll actually run the code, by invoking the Nsight Compute `ncu` command line tool. The basic syntax for this tool is `ncu <ncu flags> <your program> <your program args>`, which will run `<your program> <your program args>` while gathering a profile on how your kernels are performing. We're passing it some flags that describe what data it should collect and where it should save the results.\n",
"## 3. Profiling the Baseline\n",
"\n",
"There is an overhead to running code under the profiler. Your program may execute noticably slower.\n",
"Next, we'll actually run the code by invoking the Nsight Compute `ncu` command line tool. The basic syntax for this tool is `ncu <ncu flags> <your program> <your program args>`, which will run `<your program> <your program args>` while gathering a profile on how your kernels are performing. We're passing it some flags that describe what data it should collect and where it should save the results.\n",
"\n",
"When profiling and benchmarking, we need to run with a sufficient workload to get meaningful and representative results. If your runtime is too short, the profiler may not be able to report some metrics or the results may be inaccurate.\n",
"There is an overhead to running code under the profiler. Your program may execute noticeably slower.\n",
"\n",
"**NOTE: To modify and rerun the above code, you must execute the previous cell to write the file and this one to execute it.**"
]
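If you want to script the profiler outside the notebook, an invocation of roughly this shape works; `--set full` and `-o` are standard `ncu` flags, but the exercise's exact flags are not shown in this hunk, and `copy_blocked.py` is a hypothetical script name:

```python
import subprocess

# Profile `python copy_blocked.py` and write copy_blocked_report.ncu-rep.
subprocess.run(
    ["ncu", "--set", "full", "-o", "copy_blocked_report", "python", "copy_blocked.py"],
    check=True,
)
```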
@@ -131,9 +144,9 @@
"source": [
"Let's take a look at the profiling report on the kernel. When you run the next cell, a number of tabs will be displayed. The first tab will have a summary of all of the Nsight recommendations and advisories. Subsequent tabs will have more detailed information on a particular area.\n",
"\n",
"**TODO: Spend a few minutes reviewing the report. What stands out to you? Based on the information in the report, how can the kernel be improved?**\n",
"**TODO:** Spend a few minutes reviewing the report. What stands out to you? Based on the information in the report, how can the kernel be improved?\n",
"\n",
"**EXTRA CREDIT: Download the [Nsight Compute GUI](https://developer.nvidia.com/nsight-compute) and open the report in it to see even more information.**"
"**EXTRA CREDIT:** Download the [Nsight Compute GUI](https://developer.nvidia.com/nsight-compute) and open the report in it to see even more information."
]
},
{
@@ -155,11 +168,13 @@
"id": "mL_9xT44qbMA"
},
"source": [
"**TODO: Now try to write a better version of our copy kernel.**\n",
"## 4. Optimization Challenge: Improved Memory Access\n",
"\n",
"**TODO:** Now try to write a better version of our copy kernel.\n",
"\n",
"As a hint, given that this kernel does no compute and just moves data, our memory access patterns are probably important!\n",
"\n",
"Instead of using the `cuda.grid` utility, you may want to use the hierachical coordinates of our thread to calculate the index:\n",
"Instead of using the `cuda.grid` utility, you may want to use the hierarchical coordinates of our thread to calculate the index:\n",
"\n",
"- `cuda.blockDim.x`: The number of threads per block.\n",
"- `cuda.blockIdx.x`: The global index of the current thread block.\n",
@@ -213,6 +228,8 @@
"id": "qco9XOsTkPEJ"
},
"source": [
"## 5. Verification & Benchmarking\n",
"\n",
"Now, let's make sure our code works:"
]
},
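The verification cell is collapsed in this diff; a check of roughly this shape is all that's needed (array names assumed):

```python
import numpy as np

# `src` is the original host array; `d_dst` is the device output of the copy kernel.
np.testing.assert_array_equal(d_dst.copy_to_host(), src)
print("copy verified")
```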
@@ -259,6 +276,8 @@
"id": "mfrqUdzozGeU"
},
"source": [
"## 6. Profiling the Optimized Kernel\n",
"\n",
"Hopefully you see quite a speedup! Now let's profile the optimized variant:"
]
},
@@ -300,7 +319,9 @@
"id": "Xn6IPpuxD_kz"
},
"source": [
"**EXTRA CREDIT: Experiment with different problem sizes, threads per block, and items per thread. You can pass them as command line arguments to the Python scripts. If you're feeling really ambitious, do a parameter sweep to study the impact these knobs have on performance.**"
"## 7. Further Exploration\n",
"\n",
"**EXTRA CREDIT:** Experiment with different problem sizes, threads per block, and items per thread. You can pass them as command line arguments to the Python scripts. If you're feeling really ambitious, do a parameter sweep to study the impact these knobs have on performance."
]
}
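A sweep harness for that extra credit could look like the sketch below; `copy_kernel.py` and its argv order (size, threads per block, items per thread) are assumptions, since the scripts' CLIs are not shown in this hunk:

```python
import itertools
import subprocess

# Try every combination and let the script print its own timing.
for n, tpb, ipt in itertools.product([1 << 26], [128, 256, 512, 1024], [1, 2, 4, 8]):
    print(f"n={n} threads_per_block={tpb} items_per_thread={ipt}")
    subprocess.run(["python", "copy_kernel.py", str(n), str(tpb), str(ipt)], check=True)
```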
],
@@ -7,11 +7,22 @@
"id": "f1a8560a-c91b-48db-af1c-18fcd4892448"
},
"source": [
"## Exercise - Kernel Authoring - Book Histogram\n",
"# Kernel Authoring - Book Histogram\n",
"\n",
"## Table of Contents\n",
"\n",
"1. [Environment Setup & Data Download](#1.-Environment-Setup-&-Data-Download)\n",
"2. [First Attempt: Global Memory Histogram](#2.-First-Attempt:-Global-Memory-Histogram)\n",
"3. [Fixing Data Races with Atomics](#3.-Fixing-Data-Races-with-Atomics)\n",
"4. [Profiling the Naive Solution](#4.-Profiling-the-Naive-Solution)\n",
"5. [Optimization: Shared Memory & Cooperative Groups](#5.-Optimization:-Shared-Memory-&-Cooperative-Groups)\n",
"6. [Performance Comparison](#6.-Performance-Comparison)\n",
"\n",
"## 1. Environment Setup & Data Download\n",
"\n",
"Let's learn to use some advanced CUDA features like shared memory, atomics, and [cuda.cooperative](https://nvidia.github.io/cccl/python/cooperative.html) to write an efficient histogram kernel to determine the most frequent characters in a collection of books.\n",
"\n",
"First, let's download our dataset."
"First, let's download our dataset and install the necessary tools."
]
},
{
@@ -63,6 +74,8 @@
"id": "9109d3c0-e276-44cc-9f36-f8c79eb48b31"
},
"source": [
"## 2. First Attempt: Global Memory Histogram\n",
"\n",
"A histogram kernel counts the number of times a value occurs in a dataset. To implement this, we create an array that is large enough to store all possible values (in the case of counting 1-byte ASCII characters, 256 elements). Then for the value of each element in the dataset, we increment its location in the array.\n",
"\n",
"Let's try a simple way to implement this:"
@@ -168,11 +181,13 @@
"id": "b14fa522-b41b-4538-8c34-ecc355e55116"
},
"source": [
"## 3. Fixing Data Races with Atomics\n",
"\n",
"It looks like something is wrong - our counts are very low, and the most common characters don't make a lot of sense. Many of our increments seem to get lost!\n",
"\n",
"What's happening here is called a data race. Many different threads are trying to access the bins of the histogram at the same time.\n",
"\n",
"Imagine that two threads are trying to update the same bin.\n",
"Imagine that two threads are trying to update the same bin:\n",
"\n",
"- Thread 0 reads the count of the bin, which is 0, and stores it in its local variable `old_count`.\n",
"- Thread 0 adds 1 to its `old_count`, producing a `new_count` of 1.\n",
@@ -193,6 +208,8 @@
"id": "08f4dded-26a7-4ef8-b981-e00c569ca4d0"
},
"source": [
"## 4. Profiling the Naive Solution\n",
"\n",
"Now let's profile our code."
]
},
@@ -228,9 +245,11 @@
"id": "e1f72831-780f-4cf5-8ff1-2092ecb193d9"
},
"source": [
"## 5. Optimization: Shared Memory & Cooperative Groups\n",
"\n",
"Looking at the profile trace, it seems like our code is quite slow - look at the memory workload tab and see how low the throughput is!\n",
"\n",
"One improvement we should make is to separate loading from values from the histogram update and to perform striped loads. We'll use [cuda.cooperative](https://nvidia.github.io/cccl/python/cooperative.html)'s block load instead of writing this by hand.\n",
"One improvement we should make is to separate loading values from the histogram update and to perform striped loads (also known as coalesced access). We'll use [cuda.cooperative](https://nvidia.github.io/cccl/python/cooperative.html)'s block load instead of writing this by hand.\n",
"\n",
"**TODO: Rewrite the code below to use `cuda.cooperative` to load from `values` into local memory.**\n",
"- **Create a `coop.block.load(dtype, threads_per_block, items_per_thread, algorithm)` object outside of the kernel.**\n",
@@ -308,7 +327,7 @@
"id": "fd090ee6-a5d3-46f6-a58e-d34e077a99c0"
},
"source": [
"Now let's profile our code."
"Now let's run the code and profile it."
]
},
{
@@ -366,6 +385,16 @@
"nsightful.display_ncu_csv_in_notebook(histogram_localized_csv)"
]
},
{
"cell_type": "markdown",
"id": "23df6c7a",
"metadata": {},
"source": [
"## 6. Performance Comparison\n",
"\n",
"Let's compare the execution time of our naive global memory implementation against our optimized shared memory implementation."
]
},
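The comparison cell is collapsed in this diff; a harness of roughly this shape works (kernel and array names assumed from the sketches above). Numba launches are asynchronous, so the synchronize calls around the timed region matter:

```python
import time
from numba import cuda

def bench(kernel, d_values, d_counts, blocks, threads, iters=10):
    kernel[blocks, threads](d_values, d_counts)  # warm-up run triggers JIT compile
    cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        kernel[blocks, threads](d_values, d_counts)
    cuda.synchronize()
    return (time.perf_counter() - start) / iters
```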
{
"cell_type": "code",
"execution_count": null,