<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Academic | 卢子期</title><link>https://assassin-plus.github.io/portfolio/zh/tags/academic/</link><atom:link href="https://assassin-plus.github.io/portfolio/zh/tags/academic/index.xml" rel="self" type="application/rss+xml"/><description>Academic</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>zh-Hans</language><lastBuildDate>Tue, 12 Dec 2023 00:00:00 +0000</lastBuildDate><image><url>https://assassin-plus.github.io/portfolio/media/icon_hu_982c5d63a71b2961.png</url><title>Academic</title><link>https://assassin-plus.github.io/portfolio/zh/tags/academic/</link></image><item><title>Intro 2 CUDA</title><link>https://assassin-plus.github.io/portfolio/zh/post/intro2cuda/</link><pubDate>Tue, 12 Dec 2023 00:00:00 +0000</pubDate><guid>https://assassin-plus.github.io/portfolio/zh/post/intro2cuda/</guid><description>&lt;h1 id="intro-2-cuda"&gt;Intro 2 CUDA&lt;/h1&gt;
&lt;h2 id="streams"&gt;Streams&lt;/h2&gt;
&lt;h3 id="page-locked-host-memory"&gt;Page-Locked Host Memory&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaHostAlloc((void**)&amp;amp;a, N*sizeof(int), cudaHostAllocDefault);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaFreeHost(a);
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Page-locked (pinned) host memory:
the OS guarantees that the memory is resident in physical memory and won&amp;rsquo;t be paged out to disk.&lt;/p&gt;
&lt;p&gt;At the same time, pinned memory opts out of virtual memory: because it can never be swapped, allocating too much of it reduces the memory available to the rest of the system.&lt;/p&gt;
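&lt;p&gt;As a rough sketch of the difference, time a host-to-device copy with CUDA events, once from a malloc&amp;rsquo;d buffer and once from a cudaHostAlloc&amp;rsquo;d buffer (a, d_a, and N here are assumed from the snippet above); the pinned source is typically much faster:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaEvent_t start, stop;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;float elapsed;  // milliseconds
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaEventCreate(&amp;amp;start);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaEventCreate(&amp;amp;stop);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaEventRecord(start, 0);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaMemcpy(d_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaEventRecord(stop, 0);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaEventSynchronize(stop);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaEventElapsedTime(&amp;amp;elapsed, start, stop);
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;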
&lt;h3 id="multiple-streams"&gt;Multiple Streams&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaStream_t stream1, stream2;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaStreamCreate(&amp;amp;stream1);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaStreamCreate(&amp;amp;stream2);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaMemcpyAsync(d_a, a, N*sizeof(int), cudaMemcpyHostToDevice, stream1);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaMemcpyAsync(d_b, b, N*sizeof(int), cudaMemcpyHostToDevice, stream2);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;kernel&amp;lt;&amp;lt;&amp;lt;grid1, block1, 0, stream1&amp;gt;&amp;gt;&amp;gt;(d_a, N);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;kernel&amp;lt;&amp;lt;&amp;lt;grid2, block2, 0, stream2&amp;gt;&amp;gt;&amp;gt;(d_b, N);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaStreamSynchronize(stream1);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaStreamSynchronize(stream2);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaStreamDestroy(stream1);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaStreamDestroy(stream2);
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="gpu-work-schedule"&gt;GPU Work Schedule&lt;/h3&gt;
&lt;p&gt;Be aware of how the GPU schedules work.
The hardware has separate execution engines for different types of operations, such as a copy engine and a kernel-execution engine, and each engine runs the operations it receives in the order they were issued.
As a result, &lt;strong&gt;the dependency order the hardware sees is the order in which the calls appear in the code&lt;/strong&gt;, so the order in which work is issued across streams determines how much copying and computation can actually overlap.&lt;/p&gt;
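&lt;p&gt;For example, when a large array is processed in chunks, issuing the work breadth-first, alternating between the two streams at each step, tends to overlap copies and kernels better than queuing all of stream1&amp;rsquo;s work before stream2&amp;rsquo;s. A sketch, where FULL_SIZE, d_a0, and d_a1 are hypothetical names for the total array size and the two per-stream device buffers:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;for (int i = 0; i &amp;lt; FULL_SIZE; i += 2*N) {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    // enqueue one chunk into each stream in turn (breadth-first)
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    cudaMemcpyAsync(d_a0, a + i,     N*sizeof(int), cudaMemcpyHostToDevice, stream1);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    cudaMemcpyAsync(d_a1, a + i + N, N*sizeof(int), cudaMemcpyHostToDevice, stream2);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    kernel&amp;lt;&amp;lt;&amp;lt;grid1, block1, 0, stream1&amp;gt;&amp;gt;&amp;gt;(d_a0, N);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    kernel&amp;lt;&amp;lt;&amp;lt;grid2, block2, 0, stream2&amp;gt;&amp;gt;&amp;gt;(d_a1, N);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;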
&lt;h2 id="multi-gpu"&gt;Multi-GPU&lt;/h2&gt;
&lt;h3 id="zero-copy-host-memory"&gt;Zero-Copy Host Memory&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaHostAlloc((void**)&amp;amp;a, N*sizeof(int), cudaHostAllocWriteCombined | cudaHostAllocMapped);
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;cudaHostAllocWriteCombined: this flag tells the runtime to allocate the buffer as write-combined. This does not change the application&amp;rsquo;s functionality,
but it can be a significant performance win for buffers that will only be read by the GPU.&lt;/p&gt;
&lt;p&gt;Write-combined memory can be extremely inefficient in scenarios where the CPU also needs to read from the buffer.&lt;/p&gt;
&lt;p&gt;cudaHostAllocMapped: the buffer can be accessed from the GPU. However, since the CPU and the GPU have different virtual address spaces,
the call to cudaHostAlloc() returns a CPU pointer, which must then be mapped to a GPU pointer with cudaHostGetDevicePointer().&lt;/p&gt;
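&lt;p&gt;Not every device can map host memory, and the cudaDeviceMapHost flag must be set before the mapped buffer is allocated. A minimal sketch of the check (a, dev_a, and N as in the earlier snippets):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaDeviceProp prop;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaGetDeviceProperties(&amp;amp;prop, 0);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;if (prop.canMapHostMemory) {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapping before allocating
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    cudaHostAlloc((void**)&amp;amp;a, N*sizeof(int), cudaHostAllocMapped);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;    cudaHostGetDevicePointer(&amp;amp;dev_a, a, 0); // GPU-visible alias of the CPU pointer
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;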
&lt;h3 id="portable-pinned-memory"&gt;Portable Pinned Memory&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;This is necessary when you use multiple GPUs.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaHostAlloc((void**)&amp;amp;a, N*sizeof(int), cudaHostAllocPortable);
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;By default, a pinned buffer will only &lt;strong&gt;appear&lt;/strong&gt; page-locked to the thread that allocated it; any other thread that accesses the buffer sees it as standard pageable memory. The cudaHostAllocPortable flag makes the buffer pinned for all CUDA threads.&lt;/p&gt;
&lt;p&gt;To support portable pinned memory and zero-copy memory in multi-GPU systems, the code needs two notable changes:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-c" data-lang="c"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;deviceID&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;cudaSetDevice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;deviceID&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nf"&gt;cudaSetDeviceFlags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cudaDeviceMapHost&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;We need a call to cudaSetDevice() to enable every thread controls a different GPU.&lt;/p&gt;
&lt;p&gt;In addition, since we use zero-copy to access these buffers directly from the GPU, we call cudaHostGetDevicePointer() to obtain valid device pointers for the host memory.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;float *a, *b, *partial_c;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;float *dev_a, *dev_b, *dev_partial_c;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;//allocate memory on the CPU side
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;a = data-&amp;gt;a;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;b = data-&amp;gt;b;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;partial_c = (float *)malloc(blocksPerGrid * sizeof(float));
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaHostGetDevicePointer(&amp;amp;dev_a, a, 0);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaHostGetDevicePointer(&amp;amp;dev_b, b, 0);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;cudaMalloc((void**)&amp;amp;dev_partial_c, blocksPerGrid * sizeof(float));
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;dev_a += data-&amp;gt;offset;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;dev_b += data-&amp;gt;offset;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;kernel&amp;lt;&amp;lt;&amp;lt;blocksPerGrid, threadsPerBlock&amp;gt;&amp;gt;&amp;gt;(dev_a, dev_b, dev_partial_c);
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description></item></channel></rss>