André Rendeiro
2024-03-19T04:51:43+00:00
https://andre-rendeiro.com
Inspecting SLURM cluster usage
2023-12-13T00:00:00+00:00
https://andre-rendeiro.com/2023/12/13/inspecting-slurm-cluster-usage
<p>As in many academic institutions, we have an HPC cluster managed by the SLURM scheduler.
While we are fortunate not to be limited in the number of jobs we can submit (nor, usually, in the resources we can use), it is good to keep an eye on the resources we consume and on the efficiency of our jobs, both to leave capacity for others and to limit our environmental footprint.</p>
<p>The <code class="language-plaintext highlighter-rouge">sacct</code> command can be used to inspect the resources used by a set of jobs, but it is not a convenient way to summarize cluster usage over long periods of time.</p>
<p>I wanted a summary of cluster usage for the lab, broken down by user and partition. To do this, I wrote a simple bash script that runs <code class="language-plaintext highlighter-rouge">sshare</code> to get the usage for all users and partitions, and a Python script to process the output and plot the usage. The bash script can be run as a weekly cron job to build up timestamped snapshots of cluster usage over time.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/usr/bin/env bash</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> ~/cluster_usage/
<span class="nv">USERS</span><span class="o">=</span><span class="si">$(</span><span class="nb">ls</span> <span class="nt">-p</span> /home/ | <span class="nb">tr</span> <span class="s1">'\n'</span> <span class="s1">','</span> | <span class="nb">sed</span> <span class="s1">'s/\///g'</span><span class="si">)</span>
<span class="nv">DATE</span><span class="o">=</span><span class="si">$(</span><span class="nb">date</span> +%F<span class="si">)</span>
sshare <span class="nt">-u</span> <span class="nv">$USERS</span> <span class="nt">-m</span> <span class="nt">-p</span> <span class="o">></span> ~/cluster_usage/<span class="k">${</span><span class="nv">DATE</span><span class="k">}</span>.txt
</code></pre></div></div>
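<p>For orientation, here is a minimal sketch of what the parsable <code class="language-plaintext highlighter-rouge">sshare</code> output looks like and how it can be read with pandas. The sample text, account names and numbers are made up for illustration, and the exact set of columns may differ on your cluster; only the column names used by the plotting script in this post are relied upon. Each line ends with a trailing <code class="language-plaintext highlighter-rouge">|</code>, so pandas sees one extra, entirely empty column, and account-level rows carry no user, which is why empty columns and incomplete rows get dropped.</p>

```python
import io

import pandas as pd

# Hypothetical sample of `sshare -m -p` output (values invented for illustration).
SAMPLE = """Account|User|Partition|RawShares|NormShares|RawUsage|EffectvUsage|FairShare|
root|||||1000|1.000000||
 lab_a|||1|0.500000|800|0.800000||
  lab_a|alice|cpu|1|0.250000|600|0.600000|0.100000|
  lab_a|bob|gpu|1|0.250000|200|0.200000|0.500000|
"""


def parse_sshare(text: str) -> pd.DataFrame:
    """Parse pipe-delimited sshare output into one row per user and partition."""
    df = pd.read_csv(io.StringIO(text), sep="|")
    # The trailing "|" yields an unnamed column that is entirely empty: drop it.
    df = df.loc[:, ~df.isnull().all()]
    # Account-level rows have no User/Partition: drop rows with missing values.
    df = df.dropna()
    # Account names are indented to show the hierarchy: strip the whitespace.
    for col in ("Account", "User", "Partition"):
        df[col] = df[col].str.strip()
    return df


usage = parse_sshare(SAMPLE)
print(usage[["Account", "User", "Partition", "RawUsage", "EffectvUsage"]])
```

<p>Only the two user-level rows survive, which is the shape the plotting code expects.</p>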
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python
</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="n">data_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s">"~/cluster_usage"</span><span class="p">).</span><span class="n">expanduser</span><span class="p">()</span>
<span class="n">lab_name</span> <span class="o">=</span> <span class="s">"rendeiro_lab"</span>
<span class="n">_df</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">data_dir</span><span class="p">.</span><span class="n">glob</span><span class="p">(</span><span class="s">"*.txt"</span><span class="p">)):</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"|"</span><span class="p">)</span>
    <span class="n">n</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">isnull</span><span class="p">().</span><span class="nb">all</span><span class="p">()</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="o">~</span><span class="n">n</span><span class="p">].</span><span class="n">dropna</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="n">df</span><span class="p">.</span><span class="n">dtypes</span> <span class="o">==</span> <span class="s">"object"</span><span class="p">]:</span>
        <span class="n">df</span><span class="p">[</span><span class="n">col</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">col</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">strip</span><span class="p">()</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">date</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">f</span><span class="p">.</span><span class="n">stem</span><span class="p">))</span>
    <span class="n">_df</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">dfs</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">_df</span><span class="p">)</span>
<span class="n">dfs</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"summary.csv"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">dfs</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"summary.csv"</span><span class="p">)</span>
<span class="c1"># Plotting (only last date)
</span><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">"RawUsage"</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span>
<span class="c1"># Select accounts by label from the grouped means, so the mask cannot be misaligned with the account order</span>
<span class="n">mean_usage</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"Account"</span><span class="p">)[</span><span class="s">"EffectvUsage"</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span>
<span class="n">sel_labs</span> <span class="o">=</span> <span class="n">mean_usage</span><span class="p">[</span><span class="n">mean_usage</span> <span class="o">></span> <span class="mf">0.01</span><span class="p">].</span><span class="n">index</span>
<span class="n">colors</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">sel_labs</span><span class="p">,</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"tab20"</span><span class="p">)</span> <span class="o">+</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"Paired"</span><span class="p">)))</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">"Account"</span><span class="p">].</span><span class="n">isin</span><span class="p">(</span><span class="n">sel_labs</span><span class="p">)]</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">()</span>
<span class="k">for</span> <span class="n">lab</span> <span class="ow">in</span> <span class="n">sel_labs</span><span class="p">:</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">"Account"</span><span class="p">]</span> <span class="o">==</span> <span class="n">lab</span><span class="p">]</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="s">"RawUsage"</span><span class="p">],</span> <span class="n">p</span><span class="p">[</span><span class="s">"EffectvUsage"</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">lab</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">lab</span><span class="p">])</span>
<span class="n">ax</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">"RawUsage"</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"EffectvUsage"</span><span class="p">,</span> <span class="n">xscale</span><span class="o">=</span><span class="s">"log"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"summary.svg"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
<span class="n">queues</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">"Partition"</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">queues</span><span class="p">),</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">queues</span><span class="p">),</span> <span class="mi">5</span><span class="p">),</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">ax</span><span class="p">,</span> <span class="n">queue</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">fig</span><span class="p">.</span><span class="n">axes</span><span class="p">,</span> <span class="n">queues</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">lab</span> <span class="ow">in</span> <span class="n">sel_labs</span><span class="p">:</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="sa">f</span><span class="s">"Account == '</span><span class="si">{</span><span class="n">lab</span><span class="si">}</span><span class="s">' & Partition == '</span><span class="si">{</span><span class="n">queue</span><span class="si">}</span><span class="s">'"</span><span class="p">)</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="s">"RawUsage"</span><span class="p">],</span> <span class="n">p</span><span class="p">[</span><span class="s">"EffectvUsage"</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">lab</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">lab</span><span class="p">])</span>
    <span class="n">ax</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">"RawUsage"</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"EffectvUsage"</span><span class="p">,</span> <span class="n">xscale</span><span class="o">=</span><span class="s">"log"</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="n">queue</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">queue</span> <span class="o">==</span> <span class="n">queues</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]:</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mf">1.05</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">loc</span><span class="o">=</span><span class="s">"upper left"</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="s">"summary.per_queue.svg"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">"Account"</span><span class="p">].</span><span class="n">isin</span><span class="p">([</span><span class="n">lab_name</span><span class="p">])].</span><span class="n">copy</span><span class="p">()</span>
<span class="n">df</span><span class="p">[</span><span class="s">"id"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">"User"</span><span class="p">]</span> <span class="o">+</span> <span class="s">" "</span> <span class="o">+</span> <span class="n">df</span><span class="p">[</span><span class="s">"Partition"</span><span class="p">]</span>
<span class="n">colors</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span>
    <span class="nb">zip</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">"id"</span><span class="p">].</span><span class="n">unique</span><span class="p">(),</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"tab20"</span><span class="p">)</span> <span class="o">+</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"Paired"</span><span class="p">))</span>
<span class="p">)</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">()</span>
<span class="k">for</span> <span class="n">q</span> <span class="ow">in</span> <span class="n">df</span><span class="p">[</span><span class="s">"id"</span><span class="p">].</span><span class="n">unique</span><span class="p">():</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">"id"</span><span class="p">]</span> <span class="o">==</span> <span class="n">q</span><span class="p">]</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="s">"RawUsage"</span><span class="p">],</span> <span class="n">p</span><span class="p">[</span><span class="s">"EffectvUsage"</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">q</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">q</span><span class="p">])</span>
<span class="n">ax</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">xlabel</span><span class="o">=</span><span class="s">"RawUsage"</span><span class="p">,</span> <span class="n">ylabel</span><span class="o">=</span><span class="s">"EffectvUsage"</span><span class="p">,</span> <span class="n">xscale</span><span class="o">=</span><span class="s">"log"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="n">data_dir</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"summary.</span><span class="si">{</span><span class="n">lab_name</span><span class="si">}</span><span class="s">.svg"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
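<p>The plots above only use the most recent snapshot. As a sketch of how the accumulated snapshots could be turned into a per-user time series, the function below pivots a table shaped like <code class="language-plaintext highlighter-rouge">summary.csv</code>. The function name <code class="language-plaintext highlighter-rouge">usage_over_time</code> and the example rows are hypothetical; only the column names come from the script.</p>

```python
import pandas as pd


def usage_over_time(dfs: pd.DataFrame, account: str) -> pd.DataFrame:
    """Return a date x user table of RawUsage for one account, summed over partitions."""
    sub = dfs.loc[dfs["Account"] == account]
    sub = sub.assign(date=pd.to_datetime(sub["date"]))
    table = sub.pivot_table(index="date", columns="User", values="RawUsage", aggfunc="sum")
    return table.sort_index()


# Made-up rows standing in for pd.read_csv(data_dir / "summary.csv").
summary = pd.DataFrame(
    {
        "date": ["2023-12-01", "2023-12-01", "2023-12-08", "2023-12-08", "2023-12-08", "2023-12-08"],
        "Account": ["lab_a", "lab_a", "lab_a", "lab_a", "lab_a", "lab_b"],
        "User": ["alice", "bob", "alice", "alice", "bob", "carol"],
        "Partition": ["cpu", "cpu", "cpu", "gpu", "cpu", "cpu"],
        "RawUsage": [500, 100, 600, 100, 150, 900],
    }
)

table = usage_over_time(summary, "lab_a")
print(table)
```

<p>Calling <code class="language-plaintext highlighter-rouge">table.plot(marker="o")</code> on the result would give one line per user, with one point per weekly snapshot.</p>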
Testing full use of CPUs
2023-10-05T00:00:00+00:00
https://andre-rendeiro.com/2023/10/05/test-cpus
<p><a href="/2023/10/04/test-gpus">Since we’ve been testing GPUs</a>, for the sake of completeness, here’s how one can make all the CPU power in a machine go to work.</p>
<p>Brrrr…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python
</span>
<span class="s">"""
Test using all CPUs with parmap.
"""</span>
<span class="kn">import</span> <span class="nn">parmap</span>
<span class="k">def</span> <span class="nf">do</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
        <span class="n">n</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">parmap</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">do</span><span class="p">,</span> <span class="p">[</span><span class="mi">1000000000</span><span class="p">]</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">,</span> <span class="n">pm_processes</span><span class="o">=</span><span class="mi">28</span><span class="p">)</span>
</code></pre></div></div>
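<p>The same idea can be sketched with only the standard library, without hard-coding the number of processes. The names <code class="language-plaintext highlighter-rouge">burn</code>, <code class="language-plaintext highlighter-rouge">available_cpus</code> and <code class="language-plaintext highlighter-rouge">burn_all_cpus</code> are made up for this example. On Linux, <code class="language-plaintext highlighter-rouge">os.sched_getaffinity(0)</code> reports the CPUs the process is actually allowed to run on, which may be fewer than the machine has if a scheduler restricts the job.</p>

```python
import os
from multiprocessing import Pool


def burn(n: int) -> int:
    """Keep one core busy by counting to n in pure Python."""
    count = 0
    for _ in range(n):
        count += 1
    return count


def available_cpus() -> int:
    """Number of CPUs this process may use (falls back to the machine total)."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def burn_all_cpus(n_per_task: int, tasks_per_cpu: int = 4) -> list[int]:
    """Run several counting tasks per CPU in a pool with one worker per CPU."""
    processes = available_cpus()
    with Pool(processes) as pool:
        return pool.map(burn, [n_per_task] * (processes * tasks_per_cpu))


if __name__ == "__main__":
    results = burn_all_cpus(5_000_000)
    print(f"Finished {len(results)} tasks on {available_cpus()} CPUs")
```

<p>Increase <code class="language-plaintext highlighter-rouge">n_per_task</code> to keep the cores busy for longer while watching a tool such as <code class="language-plaintext highlighter-rouge">htop</code>.</p>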
Testing use of multiple GPUs
2023-10-04T00:00:00+00:00
https://andre-rendeiro.com/2023/10/04/test-gpus
<p>We’ve got a pretty powerful Lambda workstation with GPUs that has served us very well.</p>
<p>Let’s see how we can use them together with the Accelerator API from Hugging Face:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python
</span>
<span class="s">"""
A simple example of how to use the Accelerator API
to train a torchvision model (AlexNet by default) on a dummy dataset.
Accelerator enables training on a single GPU or on multiple GPUs.
Run `accelerate config` once to set up your configuration file.
Run with `accelerate launch test_gpus_accelerate.py` to run on all GPUs.
"""</span>
<span class="kn">import</span> <span class="nn">fire</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torchvision</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">Dataset</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span>
<span class="kn">from</span> <span class="nn">accelerate</span> <span class="kn">import</span> <span class="n">Accelerator</span>
<span class="k">class</span> <span class="nc">DummyDataset</span><span class="p">(</span><span class="n">Dataset</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
        <span class="k">return</span> <span class="mi">1_000_000</span>
    <span class="k">def</span> <span class="nf">__getitem__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">idx</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="n">image</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">)</span>
        <span class="n">label</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1000</span><span class="p">,</span> <span class="p">(</span><span class="mi">1</span><span class="p">,))[</span><span class="mi">0</span><span class="p">]</span>
        <span class="k">return</span> <span class="n">image</span><span class="p">,</span> <span class="n">label</span>
<span class="k">def</span> <span class="nf">test</span><span class="p">(</span>
    <span class="n">model_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"alexnet"</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2048</span><span class="p">,</span>
    <span class="n">epochs</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span>
    <span class="n">num_workers</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">8</span><span class="p">,</span>
<span class="p">):</span>
    <span class="n">accelerator</span> <span class="o">=</span> <span class="n">Accelerator</span><span class="p">()</span>
    <span class="n">device</span> <span class="o">=</span> <span class="n">accelerator</span><span class="p">.</span><span class="n">device</span>
    <span class="n">model</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">torchvision</span><span class="p">.</span><span class="n">models</span><span class="p">,</span> <span class="n">model_name</span><span class="p">)(</span><span class="n">weights</span><span class="o">=</span><span class="s">"DEFAULT"</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
    <span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>
    <span class="n">dataset</span> <span class="o">=</span> <span class="n">DummyDataset</span><span class="p">()</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span>
        <span class="n">dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">num_workers</span><span class="o">=</span><span class="n">num_workers</span>
    <span class="p">)</span>
    <span class="n">model</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">accelerator</span><span class="p">.</span><span class="n">prepare</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>
    <span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">accelerator</span><span class="p">.</span><span class="n">is_local_main_process</span><span class="p">:</span>
        <span class="n">tqdm0</span> <span class="o">=</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">),</span> <span class="n">position</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">leave</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"Epochs"</span><span class="p">)</span>
        <span class="n">tqdm1</span> <span class="o">=</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">position</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">leave</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"Batches"</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">tqdm0</span> <span class="o">=</span> <span class="nb">range</span><span class="p">(</span><span class="n">epochs</span><span class="p">)</span>
        <span class="n">tqdm1</span> <span class="o">=</span> <span class="n">data</span>
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="n">tqdm0</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">source</span><span class="p">,</span> <span class="n">targets</span> <span class="ow">in</span> <span class="n">tqdm1</span><span class="p">:</span>
            <span class="n">source</span> <span class="o">=</span> <span class="n">source</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
            <span class="n">targets</span> <span class="o">=</span> <span class="n">targets</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
            <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
            <span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">source</span><span class="p">)</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">cross_entropy</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">targets</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">accelerator</span><span class="p">.</span><span class="n">is_local_main_process</span><span class="p">:</span>
                <span class="n">tqdm1</span><span class="p">.</span><span class="n">set_postfix</span><span class="p">({</span><span class="s">"images"</span><span class="p">:</span> <span class="n">source</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="s">"loss"</span><span class="p">:</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()})</span>
            <span class="n">accelerator</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>
            <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
<span class="c1"># Valid evaluation
</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">fire</span><span class="p">.</span><span class="n">Fire</span><span class="p">(</span><span class="n">test</span><span class="p">)</span>
</code></pre></div></div>
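<p>When reading the progress bars, it helps to keep in mind how the work is split. The arithmetic below is a rough sketch under the assumption that Accelerate is used with its default settings, where the <code class="language-plaintext highlighter-rouge">batch_size</code> given to the <code class="language-plaintext highlighter-rouge">DataLoader</code> applies per process and the batches are dealt out evenly across processes. The helper names are made up for this example.</p>

```python
import math


def effective_batch_size(per_device_batch_size: int, n_processes: int) -> int:
    """Samples consumed per optimizer step across all processes."""
    return per_device_batch_size * n_processes


def batches_per_process(n_samples: int, per_device_batch_size: int, n_processes: int) -> int:
    """Approximate number of batches each process sees per epoch."""
    total_batches = math.ceil(n_samples / per_device_batch_size)
    return math.ceil(total_batches / n_processes)


# Defaults from the script above: 1,000,000 dummy samples, batch size 2048.
for n_gpus in (1, 2, 4):
    print(
        n_gpus,
        effective_batch_size(2048, n_gpus),
        batches_per_process(1_000_000, 2048, n_gpus),
    )
```

<p>With more GPUs the "Batches" bar gets shorter while each step covers more images, so the time per epoch is the number to compare.</p>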
Building Python manywheels
2023-06-13T00:00:00+00:00
https://andre-rendeiro.com/2023/06/13/building-python-manywheels
<h1 id="building-python-manywheels">Building Python manywheels</h1>
<p>Manywheels are a way to distribute pre-compiled Python wheels that are compatible with manylinux, meaning they can be installed on most Linux distributions. This is a great way to distribute packages that depend on C extensions, as it allows users to install the package without having to compile the extensions themselves.</p>
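<p>Whether a wheel counts as manylinux is visible in its filename, which follows the pattern <code class="language-plaintext highlighter-rouge">name-version-pythontag-abitag-platformtag.whl</code>. The sketch below splits a filename into those parts and checks the platform tag. The helper names are made up for this example, and the second filename is a hypothetical result of retagging a locally built wheel (for instance with a tool such as auditwheel).</p>

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    name: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"Not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag may sit between version and python tag.
        name, version, _build, python_tag, abi_tag, platform_tag = parts
    elif len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"Unexpected wheel filename: {filename}")
    return WheelTags(name, version, python_tag, abi_tag, platform_tag)


def is_manylinux(tags: WheelTags) -> bool:
    """A platform tag may hold several dot-separated tags; any manylinux one counts."""
    return any(tag.startswith("manylinux") for tag in tags.platform_tag.split("."))


local = parse_wheel_filename("fa2-0.3.5-cp310-cp310-linux_x86_64.whl")
portable = parse_wheel_filename(
    "fa2-0.3.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(local.platform_tag, is_manylinux(local))
print(portable.platform_tag, is_manylinux(portable))
```

<p>A wheel tagged plain <code class="language-plaintext highlighter-rouge">linux_x86_64</code> is only expected to work on systems similar to the build machine.</p>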
<p>Building them is a bit of a pain, but not as much as I thought.</p>
<p>Here’s an example of a script that I use to build manylinux wheels for the <code class="language-plaintext highlighter-rouge">forceatlas2</code> package.</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># Build manylinux wheel for a package</span>
<span class="nv">PACKAGE_NAME</span><span class="o">=</span>fa2
<span class="nv">GIT_URL</span><span class="o">=</span>git@github.com:bhargavchippada/forceatlas2.git
<span class="nv">GIT_NAME</span><span class="o">=</span>forceatlas2
<span class="nv">PACKAGE_VERSION</span><span class="o">=</span>0.3.5
<span class="nv">ARCH</span><span class="o">=</span>linux_x86_64
<span class="nv">PYTHON_VERSION</span><span class="o">=</span>cp310
<span class="nv">INTERPRETER</span><span class="o">=</span>python3.10
<span class="nv">WHEEL_PREFIX</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">PACKAGE_NAME</span><span class="k">}</span><span class="s2">-</span><span class="k">${</span><span class="nv">PACKAGE_VERSION</span><span class="k">}</span><span class="s2">-</span><span class="k">${</span><span class="nv">PYTHON_VERSION</span><span class="k">}</span><span class="s2">-</span><span class="k">${</span><span class="nv">PYTHON_VERSION</span><span class="k">}</span><span class="s2">"</span>
<span class="c"># Print commands and exit on error</span>
<span class="nb">set</span> <span class="nt">-ex</span>
clean <span class="o">()</span> <span class="o">{</span>
<span class="nb">rm</span> <span class="nt">-rf</span> build
<span class="nb">rm</span> <span class="nt">-rf</span> dist
<span class="nb">rm</span> <span class="nt">-rf</span> <span class="k">*</span>.whl
<span class="nb">rm</span> <span class="nt">-rf</span> <span class="k">*</span>.egg-info
<span class="nb">rm</span> <span class="nt">-rf</span> wheelhouse
<span class="nb">rm</span> <span class="nt">-rf</span> <span class="k">*</span>/__pycache__
<span class="nb">rm</span> <span class="nt">-rf</span> <span class="k">*</span>/<span class="k">*</span>.so
<span class="nb">rm</span> <span class="nt">-rf</span> forceatlas2
<span class="o">}</span>
<span class="nb">echo</span> <span class="s2">"Building wheel for </span><span class="k">${</span><span class="nv">PACKAGE_NAME</span><span class="k">}</span><span class="s2"> package"</span>
<span class="nb">echo</span> <span class="s2">"Cloning repository"</span>
<span class="nb">rm</span> <span class="nt">-rf</span> <span class="nv">$GIT_NAME</span>
<span class="c"># git clone --depth 1 --branch v$PACKAGE_VERSION $GIT_URL</span>
git clone <span class="nv">$GIT_URL</span>
<span class="nb">cd</span> <span class="nv">$GIT_NAME</span>
<span class="nb">echo</span> <span class="s2">"Making sure build tools are up to date"</span>
<span class="nv">$INTERPRETER</span> <span class="nt">-m</span> pip <span class="nb">install</span> <span class="nt">--upgrade</span> pip
<span class="nv">$INTERPRETER</span> <span class="nt">-m</span> pip <span class="nb">install</span> <span class="nt">--upgrade</span> setuptools wheel auditwheel
<span class="c"># Build wheel</span>
<span class="nb">echo</span> <span class="s2">"Building wheel"</span>
<span class="nv">$INTERPRETER</span> <span class="nt">-m</span> pip wheel <span class="nt">-w</span> wheelhouse <span class="nb">.</span>
<span class="c"># Make manylinux compliant</span>
<span class="nb">echo</span> <span class="s2">"Making manylinux compliant wheel"</span>
<span class="nv">$INTERPRETER</span> <span class="nt">-m</span> auditwheel repair wheelhouse/<span class="k">${</span><span class="nv">WHEEL_PREFIX</span><span class="k">}</span>-<span class="k">${</span><span class="nv">ARCH</span><span class="k">}</span>.whl
<span class="c"># Test installation</span>
<span class="nb">echo</span> <span class="s2">"Testing installation"</span>
<span class="nv">$INTERPRETER</span> <span class="nt">-m</span> pip <span class="nb">install </span>wheelhouse/<span class="k">${</span><span class="nv">WHEEL_PREFIX</span><span class="k">}</span><span class="nt">-manylinux</span><span class="k">*</span>.whl
<span class="nb">echo</span> <span class="s2">"Removing package"</span>
<span class="nv">$INTERPRETER</span> <span class="nt">-m</span> pip uninstall <span class="k">${</span><span class="nv">PACKAGE_NAME</span><span class="k">}</span> <span class="nt">-y</span>
<span class="c"># Cleanup</span>
<span class="nb">echo</span> <span class="s2">"Cleaning up"</span>
clean</code></pre></figure>
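<p>The script relies on the wheel file naming convention: distribution, version, Python tag, ABI tag and platform tag, separated by dashes. As a minimal sketch (plain Python, standard library only), the names can be taken apart as below. The helpers <code>parse_wheel_filename</code> and <code>is_manylinux</code> are illustrative and not part of any library, and the repaired file name is a hypothetical example, since the actual manylinux tag depends on the machine the wheel was built on.</p>

```python
from typing import NamedTuple


class WheelName(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between version and Python tag; drop it.
        del parts[2]
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel file name: {filename}")
    return WheelName(*parts)


def is_manylinux(filename):
    """True if any of the dot-separated platform tags is a manylinux tag."""
    tags = parse_wheel_filename(filename).platform_tag.split(".")
    return any(tag.startswith("manylinux") for tag in tags)


# Name that the script above expects `pip wheel` to produce.
built = "fa2-0.3.5-cp310-cp310-linux_x86_64.whl"
# Hypothetical name after `auditwheel repair`; the real tag depends on the build host.
repaired = "fa2-0.3.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"

print(parse_wheel_filename(built))
print(is_manylinux(built), is_manylinux(repaired))
```

<p>This is also why <code>WHEEL_PREFIX</code> repeats <code>PYTHON_VERSION</code>: for CPython builds the Python tag and the ABI tag are usually the same string.</p>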
Fast.ai and its tricks
2023-02-04T00:00:00+00:00
https://andre-rendeiro.com/2023/02/04/fastai-and-its-tricks
<p>Oh fast.ai, how I love thee.</p>
<p>How easy it is to get started and train a model.</p>
<p>But sometimes, you are a pain in the neck.</p>
<p>The documentation isn’t really there and the source code is not readable.</p>
<p>I hear the forums are very helpful but it’s not quite a format I have time for.</p>
<p>Ultimately fast.ai will only get you started, but I have immense respect and admiration for Howard, Sylvain, and the other people behind it.</p>
<p>Sometimes there is a true spark of genius in the way they do things, going above what many academics do and what other libraries provide (e.g. lr_find).</p>
<p>Let’s check out one of my favourite tricks: using a ResNet as an encoder in an autoencoder.</p>
<h3 id="sources">Sources:</h3>
<ul>
<li>https://colab.research.google.com/drive/1t9dn6qIdKc6rdF-A02KMdJ8UVGYPFh4v#scrollTo=e9dnMvm7q5q4</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">imageio</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torchvision</span>
<span class="kn">from</span> <span class="nn">fastai.vision.all</span> <span class="kn">import</span> <span class="p">(</span>
<span class="n">L</span><span class="p">,</span>
<span class="n">DataBlock</span><span class="p">,</span>
<span class="n">ImageBlock</span><span class="p">,</span>
<span class="n">CategoryBlock</span><span class="p">,</span>
<span class="n">aug_transforms</span><span class="p">,</span>
<span class="n">vision_learner</span><span class="p">,</span>
<span class="n">create_body</span><span class="p">,</span>
<span class="n">Resize</span><span class="p">,</span>
<span class="n">PixelShuffle_ICNR</span><span class="p">,</span>
<span class="n">ConvLayer</span><span class="p">,</span>
<span class="n">nn</span><span class="p">,</span>
<span class="n">Module</span><span class="p">,</span>
<span class="n">SigmoidRange</span><span class="p">,</span>
<span class="n">Tensor</span><span class="p">,</span>
<span class="n">xresnet18</span><span class="p">,</span>
<span class="n">Learner</span><span class="p">,</span>
<span class="n">MSELossFlat</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">get_class</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">.</span><span class="n">stem</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"-"</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">get_self</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">x</span>
<span class="c1"># Make dummy dataset
</span><span class="n">dataset_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s">"datasets"</span><span class="p">)</span> <span class="o">/</span> <span class="s">"dummy_resize"</span>
<span class="n">dataset_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">(</span><span class="n">dataset_dir</span> <span class="o">/</span> <span class="s">"train"</span><span class="p">).</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">(</span><span class="n">dataset_dir</span> <span class="o">/</span> <span class="s">"valid"</span><span class="p">).</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="p">[</span><span class="s">"a"</span><span class="p">,</span> <span class="s">"b"</span><span class="p">]:</span>
        <span class="n">img</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">3</span><span class="p">)).</span><span class="n">astype</span><span class="p">(</span><span class="s">"uint8"</span><span class="p">)</span>
        <span class="n">imageio</span><span class="p">.</span><span class="n">imwrite</span><span class="p">(</span><span class="n">dataset_dir</span> <span class="o">/</span> <span class="s">"train"</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">c</span><span class="si">}</span><span class="s">.jpg"</span><span class="p">,</span> <span class="n">img</span><span class="p">)</span>
        <span class="n">imageio</span><span class="p">.</span><span class="n">imwrite</span><span class="p">(</span><span class="n">dataset_dir</span> <span class="o">/</span> <span class="s">"valid"</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">c</span><span class="si">}</span><span class="s">.jpg"</span><span class="p">,</span> <span class="n">img</span><span class="p">)</span>
<span class="n">files</span> <span class="o">=</span> <span class="n">L</span><span class="p">(</span><span class="n">dataset_dir</span><span class="p">.</span><span class="n">glob</span><span class="p">(</span><span class="s">"*/*.jpg"</span><span class="p">))</span>
<span class="c1"># Define model architecture
</span><span class="k">class</span> <span class="nc">UpsampleBlock</span><span class="p">(</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">up_in_c</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">final_div</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span>
        <span class="n">blur</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
        <span class="n">leaky</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="o">**</span><span class="n">kwargs</span><span class="p">,</span>
    <span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">shuf</span> <span class="o">=</span> <span class="n">PixelShuffle_ICNR</span><span class="p">(</span><span class="n">up_in_c</span><span class="p">,</span> <span class="n">up_in_c</span> <span class="o">//</span> <span class="mi">2</span><span class="p">,</span> <span class="n">blur</span><span class="o">=</span><span class="n">blur</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="n">ni</span> <span class="o">=</span> <span class="n">up_in_c</span> <span class="o">//</span> <span class="mi">2</span>
        <span class="n">nf</span> <span class="o">=</span> <span class="n">ni</span> <span class="k">if</span> <span class="n">final_div</span> <span class="k">else</span> <span class="n">ni</span> <span class="o">//</span> <span class="mi">2</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conv1</span> <span class="o">=</span> <span class="n">ConvLayer</span><span class="p">(</span><span class="n">ni</span><span class="p">,</span> <span class="n">nf</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conv2</span> <span class="o">=</span> <span class="n">ConvLayer</span><span class="p">(</span><span class="n">nf</span><span class="p">,</span> <span class="n">nf</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">relu</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">ReLU</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">up_in</span><span class="p">:</span> <span class="n">Tensor</span><span class="p">)</span> <span class="o">-></span> <span class="n">Tensor</span><span class="p">:</span>
        <span class="n">up_out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">shuf</span><span class="p">(</span><span class="n">up_in</span><span class="p">)</span>
        <span class="n">cat_x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="n">up_out</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">conv2</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">conv1</span><span class="p">(</span><span class="n">cat_x</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">decoder_resnet</span><span class="p">(</span><span class="n">y_range</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">3</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
        <span class="n">UpsampleBlock</span><span class="p">(</span><span class="mi">512</span><span class="p">),</span>
        <span class="n">UpsampleBlock</span><span class="p">(</span><span class="mi">256</span><span class="p">),</span>
        <span class="n">UpsampleBlock</span><span class="p">(</span><span class="mi">128</span><span class="p">),</span>
        <span class="n">UpsampleBlock</span><span class="p">(</span><span class="mi">64</span><span class="p">),</span>
        <span class="n">UpsampleBlock</span><span class="p">(</span><span class="mi">32</span><span class="p">),</span>
        <span class="n">nn</span><span class="p">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="n">n_out</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
        <span class="n">SigmoidRange</span><span class="p">(</span><span class="o">*</span><span class="n">y_range</span><span class="p">),</span>
    <span class="p">)</span>
<span class="k">def</span> <span class="nf">autoencoder</span><span class="p">(</span><span class="n">encoder</span><span class="p">,</span> <span class="n">y_range</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="n">encoder</span><span class="p">,</span> <span class="n">decoder_resnet</span><span class="p">(</span><span class="n">y_range</span><span class="p">))</span>
<span class="c1"># Make dataloader
</span><span class="n">btfms</span> <span class="o">=</span> <span class="n">aug_transforms</span><span class="p">()</span>
<span class="n">block</span> <span class="o">=</span> <span class="n">DataBlock</span><span class="p">(</span>
<span class="n">blocks</span><span class="o">=</span><span class="p">[</span><span class="n">ImageBlock</span><span class="p">(),</span> <span class="n">ImageBlock</span><span class="p">()],</span>
<span class="n">get_y</span><span class="o">=</span><span class="n">get_self</span><span class="p">,</span>
<span class="n">batch_tfms</span><span class="o">=</span><span class="n">btfms</span><span class="p">,</span>
<span class="n">item_tfms</span><span class="o">=</span><span class="n">Resize</span><span class="p">(</span><span class="mi">32</span><span class="p">),</span>
<span class="p">)</span>
<span class="n">dls</span> <span class="o">=</span> <span class="n">block</span><span class="p">.</span><span class="n">dataloaders</span><span class="p">(</span><span class="n">files</span><span class="p">,</span> <span class="n">path</span><span class="o">=</span><span class="n">dataset_dir</span><span class="p">,</span> <span class="n">bs</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">dls</span><span class="p">.</span><span class="n">one_batch</span><span class="p">()</span>
<span class="c1"># Build model and check
</span><span class="n">encoder</span> <span class="o">=</span> <span class="n">create_body</span><span class="p">(</span><span class="n">xresnet18</span><span class="p">(),</span> <span class="n">n_in</span><span class="o">=</span><span class="mi">3</span><span class="p">).</span><span class="n">cuda</span><span class="p">()</span>
<span class="n">encoder</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">shape</span>
<span class="n">y_range</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">)</span>
<span class="n">ac_resnet</span> <span class="o">=</span> <span class="n">autoencoder</span><span class="p">(</span><span class="n">encoder</span><span class="p">,</span> <span class="n">y_range</span><span class="p">).</span><span class="n">cuda</span><span class="p">()</span>
<span class="k">assert</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="n">ac_resnet</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">shape</span>
<span class="n">decoder</span> <span class="o">=</span> <span class="n">decoder_resnet</span><span class="p">(</span><span class="n">y_range</span><span class="p">).</span><span class="n">cuda</span><span class="p">()</span>
<span class="k">assert</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="n">decoder</span><span class="p">(</span><span class="n">encoder</span><span class="p">(</span><span class="n">x</span><span class="p">)).</span><span class="n">shape</span>
<span class="c1"># Train
</span><span class="n">learn</span> <span class="o">=</span> <span class="n">Learner</span><span class="p">(</span><span class="n">dls</span><span class="p">,</span> <span class="n">ac_resnet</span><span class="p">,</span> <span class="n">loss_func</span><span class="o">=</span><span class="n">MSELossFlat</span><span class="p">())</span>
</code></pre></div></div>
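<p>To see why five <code>UpsampleBlock</code>s followed by a 16-channel convolution line up with the encoder, it helps to trace the shapes by hand. The sketch below is plain Python (no fastai needed) and only mirrors the channel arithmetic of the code above. It assumes the xresnet18 body turns a 32x32 input into 512 channels at 1x1, which is what the shape asserts above imply. The helper names are made up for illustration.</p>

```python
def upsample_block_shape(channels, size, final_div=True):
    """Shape after one UpsampleBlock: pixel shuffle halves channels and doubles size."""
    ni = channels // 2
    nf = ni if final_div else ni // 2
    return nf, size * 2


def decoder_shapes(channels=512, size=1, n_blocks=5, n_out=3):
    """List (channels, size) after each decoder stage, ending with the 1x1 conv."""
    shapes = [(channels, size)]
    for _ in range(n_blocks):
        channels, size = upsample_block_shape(channels, size)
        shapes.append((channels, size))
    shapes.append((n_out, size))  # the final 1x1 convolution keeps the spatial size
    return shapes


# Assumption: the encoder output for a 32x32 image is 512 channels at 1x1.
for channels, size in decoder_shapes():
    print(f"{channels:4d} channels, {size}x{size}")
```

<p>The last block leaves 16 channels at 32x32, which is why the final layer is <code>nn.Conv2d(16, n_out, 1)</code>. For larger inputs only the starting size changes, the channel counts stay the same.</p>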
Fine tuning vision models with different input shape
2023-01-30T00:00:00+00:00
https://andre-rendeiro.com/2023/01/30/fine-tuning-vision-models-with-different-input-shape
<p>Oh vision models, how I love thee.</p>
<p>But sometimes, you are a pain in the neck.</p>
<p>The beauty of pre-training on ImageNet is that we have weights for several architectures, which are somehow relatable across architectures and therefore also across datasets (with some tricks).</p>
<p>However, if our dataset does not consist of natural images (i.e. photography/simple microscopy encoded in 3 channels), we need to do some tricks to make it work.</p>
<p>Here are some tricks I’ve learned along the way.</p>
<h2 id="different-channel-number-or-different-xy-size">Different channel number or different XY size</h2>
<h3 id="sources">Sources</h3>
<ul>
<li>https://www.kaggle.com/code/iafoss/pretrained-resnet34-with-rgby-0-460-public-lb/notebook</li>
<li>https://forums.fast.ai/t/how-to-do-transfer-learning-with-different-inputs/28395/5</li>
<li>https://forums.fast.ai/t/feeding-different-sized-images-to-fine-tune-resnet34/58712/5</li>
</ul>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">argparse</span> <span class="kn">import</span> <span class="n">ArgumentParser</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">imageio</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torchvision</span>
<span class="kn">import</span> <span class="nn">fastai</span>
<span class="kn">from</span> <span class="nn">fastai.vision.all</span> <span class="kn">import</span> <span class="p">(</span>
<span class="n">L</span><span class="p">,</span>
<span class="n">DataBlock</span><span class="p">,</span>
<span class="n">ImageBlock</span><span class="p">,</span>
<span class="n">CategoryBlock</span><span class="p">,</span>
<span class="n">aug_transforms</span><span class="p">,</span>
<span class="n">vision_learner</span><span class="p">,</span>
<span class="n">error_rate</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">fastai.callback.tracker</span> <span class="kn">import</span> <span class="n">SaveModelCallback</span>
<span class="k">def</span> <span class="nf">get_class</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">.</span><span class="n">stem</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"-"</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="c1"># Make dummy dataset
</span><span class="n">dataset_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s">"datasets"</span><span class="p">)</span> <span class="o">/</span> <span class="s">"dummy_resize"</span>
<span class="n">dataset_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">(</span><span class="n">dataset_dir</span> <span class="o">/</span> <span class="s">"train"</span><span class="p">).</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">(</span><span class="n">dataset_dir</span> <span class="o">/</span> <span class="s">"valid"</span><span class="p">).</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">]:</span>
        <span class="n">img</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">3</span><span class="p">)).</span><span class="n">astype</span><span class="p">(</span><span class="s">'uint8'</span><span class="p">)</span>
        <span class="n">imageio</span><span class="p">.</span><span class="n">imwrite</span><span class="p">(</span><span class="n">dataset_dir</span> <span class="o">/</span> <span class="s">"train"</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">c</span><span class="si">}</span><span class="s">.jpg"</span><span class="p">,</span> <span class="n">img</span><span class="p">)</span>
        <span class="n">imageio</span><span class="p">.</span><span class="n">imwrite</span><span class="p">(</span><span class="n">dataset_dir</span> <span class="o">/</span> <span class="s">"valid"</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">c</span><span class="si">}</span><span class="s">.jpg"</span><span class="p">,</span> <span class="n">img</span><span class="p">)</span>
<span class="c1"># Make dataloader
</span><span class="n">block</span> <span class="o">=</span> <span class="n">DataBlock</span><span class="p">(</span><span class="n">blocks</span><span class="o">=</span><span class="p">[</span><span class="n">ImageBlock</span><span class="p">,</span> <span class="n">CategoryBlock</span><span class="p">],</span> <span class="n">get_y</span><span class="o">=</span><span class="n">get_class</span><span class="p">)</span>
<span class="n">dls</span> <span class="o">=</span> <span class="n">block</span><span class="p">.</span><span class="n">dataloaders</span><span class="p">(</span><span class="n">L</span><span class="p">(</span><span class="n">dataset_dir</span><span class="p">.</span><span class="n">glob</span><span class="p">(</span><span class="s">"*/*.jpg"</span><span class="p">)),</span> <span class="n">path</span><span class="o">=</span><span class="n">dataset_dir</span><span class="p">,</span> <span class="n">bs</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="c1"># Get model
</span><span class="n">model_name</span> <span class="o">=</span> <span class="s">'resnet50'</span>
<span class="n">fa</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">torchvision</span><span class="p">.</span><span class="n">models</span><span class="p">,</span> <span class="n">model_name</span><span class="p">)</span>
<span class="n">learn</span> <span class="o">=</span> <span class="n">vision_learner</span><span class="p">(</span><span class="n">dls</span><span class="p">,</span> <span class="n">fa</span><span class="p">)</span>
<span class="n">learn</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">learn</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
<span class="c1"># Compare architectures
</span><span class="n">rn</span> <span class="o">=</span> <span class="n">fa</span><span class="p">(</span><span class="n">weights</span><span class="o">=</span><span class="s">"DEFAULT"</span><span class="p">).</span><span class="nb">eval</span><span class="p">()</span>
<span class="n">learn</span><span class="p">.</span><span class="n">model</span>
<span class="c1"># Check size adjusts
</span><span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">from_numpy</span><span class="p">(</span><span class="n">img</span><span class="p">.</span><span class="n">transpose</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">))[</span><span class="n">np</span><span class="p">.</span><span class="n">newaxis</span><span class="p">,</span> <span class="p">...])</span> <span class="o">/</span> <span class="mi">255</span>
<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
    <span class="n">y0</span> <span class="o">=</span> <span class="n">rn</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
    <span class="n">y1</span> <span class="o">=</span> <span class="n">learn</span><span class="p">.</span><span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="c1"># both work, but eval() mode is required
</span>
<span class="c1"># Train
</span><span class="n">learn</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">learn</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
<span class="n">learn</span><span class="p">.</span><span class="n">fine_tune</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>
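<p>Why does a network pre-trained on 224x224 images happily take a 32x32 one? The convolutions do not care about the input size, and an adaptive pooling layer collapses whatever feature map is left before the head. A rough sketch of the arithmetic, in plain Python: the layer table below is my own summary of the layers that reduce resolution in a torchvision-style ResNet, not something read from the model object, so treat it as an assumption.</p>

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) of the layers that change resolution;
# an illustrative summary of a torchvision-style ResNet, written by hand.
RESNET_DOWNSAMPLING = [
    ("conv1", 7, 2, 3),
    ("maxpool", 3, 2, 1),
    ("layer2", 3, 2, 1),
    ("layer3", 3, 2, 1),
    ("layer4", 3, 2, 1),
]


def feature_map_size(size):
    """Spatial size of the last feature map for a square input of `size` pixels."""
    for _name, kernel, stride, padding in RESNET_DOWNSAMPLING:
        size = conv_out(size, kernel, stride, padding)
    return size


for size in (32, 224, 512):
    print(size, "pixels in,", feature_map_size(size), "pixels before pooling")
```

<p>Whatever that last size is, the adaptive pooling reduces it to 1x1 per channel, so the head always sees a vector of the same length. The number of input channels is a different story, since it is baked into the weights of the first convolution.</p>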
Timm - Image models
2022-11-25T00:00:00+00:00
https://andre-rendeiro.com/2022/11/25/timm---image-models
<p>Timm: Torch Image Models - an amazing library to interface with a wide range of image models.</p>
<p>Honestly, it is mind-blowing to me how many models are available in this library, many of them pre-trained on ImageNet.</p>
<p>Let’s dig in!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
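<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">timm</span>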
<span class="kn">from</span> <span class="nn">bii.datasets</span> <span class="kn">import</span> <span class="n">get_cycif_data</span>
<span class="n">model_names</span> <span class="o">=</span> <span class="n">timm</span><span class="p">.</span><span class="n">list_models</span><span class="p">(</span><span class="n">pretrained</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">model_names</span> <span class="o">=</span> <span class="n">timm</span><span class="p">.</span><span class="n">list_models</span><span class="p">(</span><span class="s">'*vit*'</span><span class="p">,</span> <span class="n">pretrained</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">model_name</span> <span class="o">=</span> <span class="s">'maxvit_rmlp_nano_rw_256'</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">timm</span><span class="p">.</span><span class="n">create_model</span><span class="p">(</span><span class="n">model_name</span><span class="p">,</span> <span class="n">pretrained</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="nb">eval</span><span class="p">()</span> <span class="c1"># .to('cuda')
</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">meta</span> <span class="o">=</span> <span class="n">get_cycif_data</span><span class="p">()</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">/</span> <span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="mi">16</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">([[</span><span class="n">ch</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)]</span> <span class="k">for</span> <span class="n">ch</span> <span class="ow">in</span> <span class="n">x</span><span class="p">.</span><span class="n">values</span><span class="p">])[:,</span> <span class="p">:,</span> <span class="p">:</span><span class="mi">256</span><span class="p">,</span> <span class="p">:</span><span class="mi">256</span><span class="p">]</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="c1"># .to('cuda')
</span><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
    <span class="n">o</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">i</span><span class="p">).</span><span class="n">numpy</span><span class="p">()</span>
<span class="n">corr</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">o</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">channel</span><span class="p">).</span><span class="n">T</span><span class="p">.</span><span class="n">corr</span><span class="p">()</span>
<span class="kn">from</span> <span class="nn">seaborn_extensions</span> <span class="kn">import</span> <span class="n">clustermap</span>
<span class="n">grid</span> <span class="o">=</span> <span class="n">clustermap</span><span class="p">(</span><span class="n">corr</span><span class="p">,</span> <span class="n">center</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="s">"RdBu_r"</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span>
<span class="n">pca</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">lat</span> <span class="o">=</span> <span class="n">pca</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">o</span><span class="p">)</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="o">*</span><span class="n">lat</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">name</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">channel</span><span class="p">.</span><span class="n">values</span><span class="p">):</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">text</span><span class="p">(</span><span class="o">*</span><span class="n">lat</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">s</span><span class="o">=</span><span class="n">name</span><span class="p">)</span>
</code></pre></div></div>
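<p>For intuition about the <code class="language-plaintext highlighter-rouge">.T.corr()</code> step above, here is a minimal pure-Python sketch of the pairwise Pearson correlation between per-channel output vectors. The channel names and numbers are made up for illustration and do not come from the CyCIF data.</p>

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Hypothetical model outputs: one short vector per channel (made-up values)
outputs = {
    "DNA": [0.1, 0.9, 0.3, 0.7],
    "CD45": [0.2, 1.8, 0.6, 1.4],     # exactly 2x the "DNA" vector
    "Keratin": [0.7, 0.3, 0.9, 0.1],  # mostly opposite to "DNA"
}
channels = list(outputs)

# Channel-by-channel correlation matrix, stored as nested dicts
corr = {a: {b: pearson(outputs[a], outputs[b]) for b in channels} for a in channels}

for a in channels:
    print(a, [round(corr[a][b], 2) for b in channels])
```

<p>Channels whose output vectors differ only by scale correlate perfectly, which is why the correlation matrix is a reasonable input for clustering channels by how the model "sees" them.</p>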
VM setup for deep learning
2022-10-02T00:00:00+00:00
https://andre-rendeiro.com/2022/10/02/vm_setup_for_deep_learning
<h1 id="gpu-machine-ubuntu-2204">GPU-machine, Ubuntu 22.04</h1>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Get latest version strings</span>
<span class="nv">UBUNTU_VERSION</span><span class="o">=</span>ubuntu2204
<span class="nv">ARCH</span><span class="o">=</span>amd64
<span class="nv">DRIVER_VERSION</span><span class="o">=</span><span class="si">$(</span>apt-cache search nvidia-driver- | <span class="nb">grep</span> <span class="s2">"^nvidia-driver"</span> | <span class="nb">grep</span> <span class="nt">-v</span> open | <span class="nb">sort</span> | <span class="nb">tail</span> <span class="nt">-n</span> 1 | <span class="nb">sed</span> <span class="s2">"s/ .*//g"</span> | <span class="nb">sed</span> <span class="nt">-r</span> <span class="s1">'s/.*-([0-9]+)/\1/g'</span><span class="si">)</span>
<span class="nv">CUDA_VERSION</span><span class="o">=</span><span class="si">$(</span>apt-cache search cuda- | <span class="nb">grep</span> <span class="s2">"^cuda-[0-9]"</span> | <span class="nb">sort</span> | <span class="nb">tail</span> <span class="nt">-n</span> 1 | <span class="nb">sed</span> <span class="s2">"s/ .*//g"</span> | <span class="nb">sed</span> <span class="nt">-r</span> <span class="s1">'s/.*-([0-9]+-[0-9])/\1/g'</span><span class="si">)</span>
<span class="nv">CUDDN_VERSION</span><span class="o">=</span>9.0.0
<span class="c"># Clean up old installs</span>
<span class="nb">sudo </span>apt-get <span class="nt">-y</span> purge nvidia<span class="k">*</span>
<span class="nb">sudo </span>apt-get <span class="nt">-y</span> remove nvidia-<span class="k">*</span>
<span class="nb">sudo rm</span> <span class="nt">-f</span> /etc/apt/sources.list.d/cuda<span class="k">*</span>
<span class="nb">sudo </span>apt-get <span class="nt">-y</span> autoremove <span class="o">&&</span> <span class="nb">sudo </span>apt-get <span class="nt">-y</span> autoclean
<span class="nb">sudo rm</span> <span class="nt">-rf</span> /usr/local/cuda<span class="k">*</span>
<span class="c"># Update</span>
<span class="nb">sudo </span>apt-get update
<span class="nb">sudo </span>apt-get upgrade <span class="nt">-y</span>
<span class="c"># Install base libs</span>
<span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> <span class="se">\</span>
build-essential <span class="se">\</span>
libatlas-base-dev <span class="se">\</span>
libopencv-dev <span class="se">\</span>
libprotoc-dev <span class="se">\</span>
make <span class="se">\</span>
unzip <span class="se">\</span>
git <span class="se">\</span>
gcc <span class="se">\</span>
g++ <span class="se">\</span>
libglu1-mesa libglu1-mesa-dev <span class="se">\</span>
libcurl4-openssl-dev <span class="se">\</span>
libssl-dev <span class="se">\</span>
freeglut3-dev <span class="se">\</span>
libx11-dev <span class="se">\</span>
libxmu-dev <span class="se">\</span>
libxi-dev
<span class="c"># Install NVIDIA drivers</span>
<span class="nb">sudo </span>add-apt-repository <span class="nt">-y</span> ppa:graphics-drivers/ppa
<span class="nb">sudo </span>apt-get update
<span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> libnvidia-common-<span class="k">${</span><span class="nv">DRIVER_VERSION</span><span class="k">}</span>
<span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> libnvidia-gl-<span class="k">${</span><span class="nv">DRIVER_VERSION</span><span class="k">}</span>
<span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> nvidia-driver-<span class="k">${</span><span class="nv">DRIVER_VERSION</span><span class="k">}</span>
<span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> nvidia-settings
<span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> nvidia-utils-<span class="k">${</span><span class="nv">DRIVER_VERSION</span><span class="k">}</span>
<span class="c"># Install CUDA</span>
wget https://developer.download.nvidia.com/compute/cuda/repos/<span class="k">${</span><span class="nv">UBUNTU_VERSION</span><span class="k">}</span>/x86_64/cuda-<span class="k">${</span><span class="nv">UBUNTU_VERSION</span><span class="k">}</span>.pin
<span class="nb">sudo mv </span>cuda-<span class="k">${</span><span class="nv">UBUNTU_VERSION</span><span class="k">}</span>.pin /etc/apt/preferences.d/cuda-repository-pin-600
<span class="nb">sudo </span>apt-key adv <span class="nt">--fetch-keys</span> https://developer.download.nvidia.com/compute/cuda/repos/<span class="k">${</span><span class="nv">UBUNTU_VERSION</span><span class="k">}</span>/x86_64/3bf863cc.pub
<span class="nb">sudo </span>add-apt-repository <span class="nt">-y</span> <span class="s2">"deb https://developer.download.nvidia.com/compute/cuda/repos/</span><span class="k">${</span><span class="nv">UBUNTU_VERSION</span><span class="k">}</span><span class="s2">/x86_64/ /"</span>
<span class="nb">sudo </span>apt-get update
<span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> cuda-<span class="k">${</span><span class="nv">CUDA_VERSION</span><span class="k">}</span>
<span class="c"># setup CUDA paths (TODO: test if this has already been added)</span>
<span class="nv">__export</span><span class="o">=</span><span class="s1">'
# CUDA
if [ -d "/usr/local/cuda/bin/" ]; then
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=/opt/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=/opt/cuda/include${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/targets/x86_64-linux/include${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda
fi
'</span>
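<span class="nb">grep</span> <span class="nt">-qF</span> <span class="s1">'export CUDA_HOME=/usr/local/cuda'</span> ~/.bashrc 2>/dev/null <span class="o">||</span> <span class="nb">echo</span> <span class="s2">"</span><span class="nv">$__export</span><span class="s2">"</span> <span class="o">>></span> ~/.bashrc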
<span class="nb">source</span> ~/.bashrc
<span class="nb">sudo </span>ldconfig
<span class="c"># Install CuDNN: https://developer.nvidia.com/rdp/cudnn-download</span>
wget https://developer.download.nvidia.com/compute/cudnn/<span class="k">${</span><span class="nv">CUDDN_VERSION</span><span class="k">}</span>/local_installers/cudnn-local-repo-<span class="k">${</span><span class="nv">UBUNTU_VERSION</span><span class="k">}</span>-<span class="k">${</span><span class="nv">CUDDN_VERSION</span><span class="k">}</span>_1.0-1_<span class="k">${</span><span class="nv">ARCH</span><span class="k">}</span>.deb
<span class="nb">sudo </span>dpkg <span class="nt">-i</span> cudnn-local-repo-<span class="k">${</span><span class="nv">UBUNTU_VERSION</span><span class="k">}</span>-<span class="k">${</span><span class="nv">CUDDN_VERSION</span><span class="k">}</span>_1.0-1_<span class="k">${</span><span class="nv">ARCH</span><span class="k">}</span>.deb
<span class="nb">sudo cp</span> /var/cudnn-local-repo-<span class="k">${</span><span class="nv">UBUNTU_VERSION</span><span class="k">}</span>-<span class="k">${</span><span class="nv">CUDDN_VERSION</span><span class="k">}</span>/cudnn-<span class="k">*</span><span class="nt">-keyring</span>.gpg /usr/share/keyrings/
<span class="nb">sudo </span>apt-get update
<span class="nb">sudo </span>apt-get <span class="nt">-y</span> <span class="nb">install </span>cudnn
<span class="c"># Reboot!</span>
<span class="nb">sudo </span>reboot
</code></pre></div></div>
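<p>After the reboot it is worth checking that the driver and the CUDA toolkit are actually reachable. The following is a minimal sketch using only the Python standard library; the helper name <code class="language-plaintext highlighter-rouge">check_tools</code> is made up for illustration, and it assumes <code class="language-plaintext highlighter-rouge">nvidia-smi</code> and <code class="language-plaintext highlighter-rouge">nvcc</code> are on the <code class="language-plaintext highlighter-rouge">PATH</code> once the steps above have succeeded.</p>

```python
import shutil
import subprocess

# Commands used to probe each tool (both are standard invocations)
CHECKS = {
    "nvidia-smi": ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
    "nvcc": ["nvcc", "--version"],
}

def check_tools(checks=CHECKS):
    """Return {tool name: command output, or None if the tool is missing or fails}."""
    report = {}
    for name, cmd in checks.items():
        if shutil.which(cmd[0]) is None:
            report[name] = None  # not installed or not on PATH
            continue
        result = subprocess.run(cmd, capture_output=True, text=True)
        report[name] = result.stdout.strip() if result.returncode == 0 else None
    return report

if __name__ == "__main__":
    for name, output in check_tools().items():
        print(f"{name}: {output or 'not available'}")
```

<p>If <code class="language-plaintext highlighter-rouge">nvcc</code> is reported as not available while <code class="language-plaintext highlighter-rouge">nvidia-smi</code> works, the most likely cause is that the CUDA paths from <code class="language-plaintext highlighter-rouge">~/.bashrc</code> are not loaded in the current shell.</p>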
<h1 id="non-gpu-machine-ubuntu-2204">Non-GPU machine, Ubuntu 22.04</h1>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Make sure VM is up-to-date</span>
<span class="nb">sudo </span>apt-get update
<span class="nb">sudo </span>apt-get upgrade <span class="nt">-y</span>
<span class="c"># Install system libraries (not strictly required, but often needed)</span>
<span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> <span class="se">\</span>
build-essential <span class="se">\</span>
libatlas-base-dev <span class="se">\</span>
libopencv-dev <span class="se">\</span>
libprotoc-dev <span class="se">\</span>
make <span class="se">\</span>
unzip <span class="se">\</span>
git <span class="se">\</span>
gcc <span class="se">\</span>
g++ <span class="se">\</span>
libglu1-mesa libglu1-mesa-dev <span class="se">\</span>
libcurl4-openssl-dev <span class="se">\</span>
libssl-dev <span class="se">\</span>
freeglut3-dev <span class="se">\</span>
libx11-dev <span class="se">\</span>
libxmu-dev <span class="se">\</span>
libxi-dev
<span class="c"># Make sure the `python` command is by default Python3 (it will be 3.10 for Ubuntu 22.04)</span>
<span class="nb">sudo </span>apt <span class="nb">install </span>python-is-python3
<span class="nv">VERSION</span><span class="o">=</span><span class="si">$(</span>python <span class="nt">--version</span><span class="si">)</span>
python <span class="nt">-c</span> <span class="s2">"assert '</span><span class="nv">$VERSION</span><span class="s2">'.startswith('Python 3.10')"</span>
<span class="c"># Install pip (better to install manually than to use system's)</span>
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py
<span class="c"># Install packaging tools</span>
python <span class="nt">-m</span> pip <span class="nb">install </span>wheel setuptools
<span class="c"># Install Python scientific stack</span>
python <span class="nt">-m</span> pip <span class="nb">install</span> <span class="se">\</span>
IPython <span class="se">\</span>
urlpath <span class="se">\</span>
tqdm <span class="se">\</span>
numpy <span class="se">\</span>
scipy <span class="se">\</span>
pandas <span class="se">\</span>
matplotlib <span class="se">\</span>
seaborn <span class="se">\</span>
anndata <span class="se">\</span>
scanpy <span class="se">\</span>
squidpy <span class="se">\</span>
statsmodels <span class="se">\</span>
scikit-learn <span class="se">\</span>
scikit-image <span class="se">\</span>
networkx <span class="se">\</span>
torch <span class="se">\</span>
torchvision <span class="se">\</span>
<span class="nt">--extra-index-url</span> https://download.pytorch.org/whl/cu113
</code></pre></div></div>
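<p>Once the installation finishes, a quick sanity check is to confirm that every package can be found by the interpreter. This is a small sketch using only the standard library; <code class="language-plaintext highlighter-rouge">missing_modules</code> is a hypothetical helper name, and note that some import names differ from the pip names (<code class="language-plaintext highlighter-rouge">scikit-learn</code> is imported as <code class="language-plaintext highlighter-rouge">sklearn</code>, <code class="language-plaintext highlighter-rouge">scikit-image</code> as <code class="language-plaintext highlighter-rouge">skimage</code>).</p>

```python
import sys
from importlib.util import find_spec

# Import names (not pip names) of the stack installed above
STACK = [
    "IPython", "urlpath", "tqdm", "numpy", "scipy", "pandas", "matplotlib",
    "seaborn", "anndata", "scanpy", "squidpy", "statsmodels",
    "sklearn", "skimage", "networkx", "torch", "torchvision",
]

def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]

print("Python", ".".join(map(str, sys.version_info[:3])))
print("Missing:", missing_modules(STACK) or "none")
```

<p>Using <code class="language-plaintext highlighter-rouge">find_spec</code> only locates the modules without importing them, so the check stays fast even for heavy packages.</p>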
Torch basics
2022-07-05T00:00:00+00:00
https://andre-rendeiro.com/2022/07/05/torch-basics
<h1 id="torch-basics">Torch basics</h1>
<p>PyTorch, the essential deep learning library. It is flexible and easy to use - what a breath of fresh air compared to TensorFlow.</p>
<p>Let’s dig in!</p>
<h2 id="linear-regression-from-scratch">Linear regression from scratch:</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span> <span class="k">as</span> <span class="n">t</span>
<span class="k">def</span> <span class="nf">get_data</span><span class="p">():</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">randn</span><span class="p">((</span><span class="mi">100</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="n">real_m</span> <span class="o">=</span> <span class="mf">0.5</span>
    <span class="n">real_b</span> <span class="o">=</span> <span class="mi">20</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">x</span> <span class="o">*</span> <span class="n">real_m</span> <span class="o">+</span> <span class="n">real_b</span> <span class="o">+</span> <span class="p">(</span><span class="n">t</span><span class="p">.</span><span class="n">rand_like</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="mi">5</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span>
<span class="k">def</span> <span class="nf">linear_regression</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="n">params</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">params</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">absolute_error</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">pred</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">pred</span><span class="p">).</span><span class="nb">abs</span><span class="p">().</span><span class="nb">sum</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">params</span><span class="p">,</span> <span class="n">loss</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"step</span><span class="se">\t</span><span class="s">m</span><span class="se">\t</span><span class="s">b</span><span class="se">\t</span><span class="s">loss"</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">50</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="se">\t</span><span class="si">{</span><span class="n">params</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">item</span><span class="p">()</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="se">\t</span><span class="si">{</span><span class="n">params</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">item</span><span class="p">()</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="se">\t</span><span class="si">{</span><span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="n">t</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">get_data</span><span class="p">()</span>
<span class="n">params</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">eps</span> <span class="o">=</span> <span class="mf">1e-5</span>
<span class="c1"># eps = 1e-8
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1_000</span><span class="p">):</span>
    <span class="n">pred</span> <span class="o">=</span> <span class="n">linear_regression</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">absolute_error</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">pred</span><span class="p">)</span>
    <span class="n">report</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">params</span><span class="p">,</span> <span class="n">loss</span><span class="p">)</span>
    <span class="c1"># Calculate gradients
</span>    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="c1"># Gradient descent step
</span>    <span class="k">with</span> <span class="n">t</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="n">params</span> <span class="o">=</span> <span class="p">(</span><span class="n">params</span> <span class="o">-</span> <span class="n">params</span><span class="p">.</span><span class="n">grad</span> <span class="o">*</span> <span class="n">eps</span><span class="p">).</span><span class="n">requires_grad_</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
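<p>To make explicit what <code class="language-plaintext highlighter-rouge">loss.backward()</code> computes for this loss, here is a sketch of the same loop in plain Python with the (sub)gradients of the summed absolute error written out by hand. It uses the standard library's random numbers rather than <code class="language-plaintext highlighter-rouge">torch</code>, so the data differ from the example above; the function names are illustrative only.</p>

```python
import random

def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)

def l1_loss_and_grads(xs, ys, m, b):
    """Summed absolute error of y = m*x + b and its (sub)gradients w.r.t. m and b."""
    loss, grad_m, grad_b = 0.0, 0.0, 0.0
    for x, y in zip(xs, ys):
        residual = y - (m * x + b)
        loss += abs(residual)
        # d|r|/dm = -sign(r) * x and d|r|/db = -sign(r)
        grad_m -= sign(residual) * x
        grad_b -= sign(residual)
    return loss, grad_m, grad_b

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100)]
ys = [0.5 * x + 20 + random.random() / 5 for x in xs]

m, b, eps = 0.0, 0.0, 1e-5
first_loss, _, _ = l1_loss_and_grads(xs, ys, m, b)
for step in range(1_000):
    loss, grad_m, grad_b = l1_loss_and_grads(xs, ys, m, b)
    # Plain gradient descent update
    m -= eps * grad_m
    b -= eps * grad_b
last_loss = loss

print(f"loss: {first_loss:.1f} -> {last_loss:.1f}, b = {b:.3f}")
```

<p>Because each of the 100 points contributes a gradient of at most 1 in magnitude with respect to <code class="language-plaintext highlighter-rouge">b</code>, a step size of <code class="language-plaintext highlighter-rouge">1e-5</code> moves <code class="language-plaintext highlighter-rouge">b</code> by at most <code class="language-plaintext highlighter-rouge">1e-3</code> per step. In this sketch <code class="language-plaintext highlighter-rouge">b</code> therefore only reaches about 1 after 1,000 steps, far from the true intercept of 20.</p>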
<h2 id="using-an-optimizer">Using an optimizer:</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">torch.optim</span> <span class="kn">import</span> <span class="n">SGD</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">get_data</span><span class="p">()</span>
<span class="n">params</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">eps</span> <span class="o">=</span> <span class="mf">1e-5</span>
<span class="n">optim</span> <span class="o">=</span> <span class="n">SGD</span><span class="p">([</span><span class="n">params</span><span class="p">],</span> <span class="n">eps</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1_000</span><span class="p">):</span>
    <span class="n">pred</span> <span class="o">=</span> <span class="n">linear_regression</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">absolute_error</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">pred</span><span class="p">)</span>
    <span class="n">report</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">params</span><span class="p">,</span> <span class="n">loss</span><span class="p">)</span>
    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="n">optim</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
    <span class="n">optim</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">torch.optim</span> <span class="kn">import</span> <span class="n">Adam</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">get_data</span><span class="p">()</span>
<span class="n">params</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">eps</span> <span class="o">=</span> <span class="mf">1e-2</span>
<span class="n">optim</span> <span class="o">=</span> <span class="n">Adam</span><span class="p">([</span><span class="n">params</span><span class="p">],</span> <span class="n">eps</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1_000</span><span class="p">):</span>
    <span class="n">pred</span> <span class="o">=</span> <span class="n">linear_regression</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">params</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">absolute_error</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">pred</span><span class="p">)</span>
    <span class="n">report</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">params</span><span class="p">,</span> <span class="n">loss</span><span class="p">)</span>
    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="n">optim</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
    <span class="n">optim</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
</code></pre></div></div>
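<p>To see why the two optimizers behave so differently on this problem, here is a sketch of both update rules in plain Python for a single parameter. The Adam rule follows the standard formulation with bias correction and the usual default betas; the helper names are made up, and the constant gradient of -100 is a simplifying assumption (it corresponds to the intercept while all 100 residuals are positive), not something taken from a real training run.</p>

```python
import math

def sgd_step(theta, grad, lr=1e-5):
    """Plain gradient descent: the step is proportional to the gradient."""
    return theta - lr * grad

def adam_step(theta, grad, state, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and running moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias-corrected first and second moment estimates
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

grad_b = -100.0  # assumed constant for illustration
b_sgd, b_adam = 0.0, 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(1_000):
    b_sgd = sgd_step(b_sgd, grad_b)
    b_adam = adam_step(b_adam, grad_b, state)

print(f"SGD:  b = {b_sgd:.3f}")
print(f"Adam: b = {b_adam:.3f}")
```

<p>Adam divides the gradient by a running estimate of its magnitude, so under a steady gradient each step has a size of roughly <code class="language-plaintext highlighter-rouge">lr</code> regardless of the gradient's scale. That is why the learning rates in the two examples above are not directly comparable.</p>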
<h2 id="pytorch-lightning">PyTorch Lightning:</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kn">import</span> <span class="nn">pytorch_lightning</span> <span class="k">as</span> <span class="n">pl</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">from</span> <span class="nn">pytorch_lightning.callbacks.early_stopping</span> <span class="kn">import</span> <span class="n">EarlyStopping</span>
<span class="k">class</span> <span class="nc">LinearRegressor</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="n">LightningModule</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">layer1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">layer1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">loss_fn</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pred</span><span class="p">):</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">pred</span><span class="p">).</span><span class="nb">abs</span><span class="p">().</span><span class="nb">sum</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">training_step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch</span><span class="p">,</span> <span class="n">batch_idx</span><span class="p">):</span>
        <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">batch</span>
        <span class="c1"># with autocast():
</span>        <span class="n">y_hat</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">loss_fn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y_hat</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="s">"train_loss"</span><span class="p">,</span> <span class="n">loss</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">loss</span>
    <span class="k">def</span> <span class="nf">validation_step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch</span><span class="p">,</span> <span class="n">batch_idx</span><span class="p">):</span>
        <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">batch</span>
        <span class="n">y_hat</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">loss_fn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y_hat</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">loss</span>
    <span class="k">def</span> <span class="nf">configure_optimizers</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># torch was imported as `t` above
</span>        <span class="n">optimizer</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-2</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">optimizer</span>
<span class="n">pl</span><span class="p">.</span><span class="n">seed_everything</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">LinearRegressor</span><span class="p">()</span>
<span class="n">trainer</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">Trainer</span><span class="p">()</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">get_data</span><span class="p">()</span>
<span class="c1"># Wrap the tensors in a DataLoader so that each batch is an (x, y) pair
</span><span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span><span class="p">,</span> <span class="n">TensorDataset</span>
<span class="n">train_loader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">TensorDataset</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">),</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">)</span>
<span class="n">trainer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">train_dataloaders</span><span class="o">=</span><span class="n">train_loader</span><span class="p">)</span>
<span class="c1"># Using callbacks
</span><span class="n">early_stop</span> <span class="o">=</span> <span class="n">EarlyStopping</span><span class="p">(</span><span class="n">monitor</span><span class="o">=</span><span class="s">"train_loss"</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s">"min"</span><span class="p">)</span>
<span class="n">trainer</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">Trainer</span><span class="p">(</span><span class="n">callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">early_stop</span><span class="p">])</span>
<span class="n">trainer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">train_dataloaders</span><span class="o">=</span><span class="n">train_loader</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="dataset-dataloader-and-vision-models">Dataset, Dataloader and vision models:</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">imageio</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">Dataset</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span>
<span class="kn">from</span> <span class="nn">torchvision.io</span> <span class="kn">import</span> <span class="n">read_image</span>
<span class="kn">import</span> <span class="nn">pytorch_lightning</span> <span class="k">as</span> <span class="n">pl</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">from</span> <span class="nn">pytorch_lightning.callbacks.early_stopping</span> <span class="kn">import</span> <span class="n">EarlyStopping</span>
<span class="k">def</span> <span class="nf">get_image_data</span><span class="p">()</span> <span class="o">-></span> <span class="nb">tuple</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">]:</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">255</span><span class="p">,</span> <span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">224</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="s">"uint8"</span><span class="p">)</span>
<span class="n">x</span><span class="p">[:</span><span class="mi">50</span><span class="p">,</span> <span class="p">...,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">([</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="mi">50</span> <span class="o">+</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mi">50</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span>
<span class="k">def</span> <span class="nf">write_dataset_to_disk</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">output_dir</span><span class="p">:</span> <span class="n">Path</span><span class="p">)</span> <span class="o">-></span> <span class="bp">None</span><span class="p">:</span>
<span class="n">output_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">x_</span><span class="p">,</span> <span class="n">y_</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)):</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">output_dir</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">).</span><span class="n">zfill</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span><span class="si">}</span><span class="s">.</span><span class="si">{</span><span class="n">y_</span><span class="si">}</span><span class="s">.jpg"</span>
<span class="n">imageio</span><span class="p">.</span><span class="n">imwrite</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">x_</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">ImageDataset</span><span class="p">(</span><span class="n">Dataset</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data_dir</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">data_dir</span> <span class="o">=</span> <span class="n">data_dir</span>
<span class="bp">self</span><span class="p">.</span><span class="n">filenames</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">data_dir</span><span class="p">.</span><span class="n">glob</span><span class="p">(</span><span class="s">"*.jpg"</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">filenames</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">__getitem__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">idx</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="nb">int</span><span class="p">]:</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">filenames</span><span class="p">[</span><span class="n">idx</span><span class="p">].</span><span class="n">as_posix</span><span class="p">())</span> <span class="o">/</span> <span class="mi">255</span>
<span class="n">label</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">filenames</span><span class="p">[</span><span class="n">idx</span><span class="p">].</span><span class="n">stem</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"."</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">return</span> <span class="n">image</span><span class="p">,</span> <span class="n">label</span>
<span class="k">class</span> <span class="nc">VisionModel</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="n">LightningModule</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="p">.</span><span class="n">layer1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">AvgPool2d</span><span class="p">(</span><span class="mi">224</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">layer2</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">layer3</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sigmoid</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">layer1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">layer2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">layer3</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
<span class="k">def</span> <span class="nf">loss_fn</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pred</span><span class="p">):</span>
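<span class="c1"># Flatten both tensors so (B,) labels and (B, 1, 1, 1) predictions do not broadcast to (B, 1, 1, B)</span>
<span class="k">return</span> <span class="p">(</span><span class="n">y</span><span class="p">.</span><span class="n">flatten</span><span class="p">()</span> <span class="o">-</span> <span class="n">pred</span><span class="p">.</span><span class="n">flatten</span><span class="p">()).</span><span class="nb">abs</span><span class="p">().</span><span class="nb">sum</span><span class="p">()</span>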
<span class="k">def</span> <span class="nf">training_step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch</span><span class="p">,</span> <span class="n">batch_idx</span><span class="p">):</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">batch</span>
<span class="c1"># with autocast():
</span> <span class="n">y_hat</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">loss_fn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y_hat</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="s">"train_loss"</span><span class="p">,</span> <span class="n">loss</span><span class="p">)</span>
<span class="k">return</span> <span class="n">loss</span>
<span class="k">def</span> <span class="nf">validation_step</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">batch</span><span class="p">,</span> <span class="n">batch_idx</span><span class="p">):</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">batch</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">loss_fn</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y_hat</span><span class="p">)</span>
<span class="k">return</span> <span class="n">loss</span>
<span class="k">def</span> <span class="nf">configure_optimizers</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">SGD</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">optimizer</span>
<span class="n">pl</span><span class="p">.</span><span class="n">seed_everything</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">get_image_data</span><span class="p">()</span>
<span class="n">output_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s">"test_dataset"</span><span class="p">)</span>
<span class="n">write_dataset_to_disk</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">output_dir</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">ImageDataset</span><span class="p">(</span><span class="n">output_dir</span><span class="p">)</span>
<span class="n">dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">VisionModel</span><span class="p">()</span>
<span class="n">trainer</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">Trainer</span><span class="p">()</span>
<span class="n">early_stop</span> <span class="o">=</span> <span class="n">EarlyStopping</span><span class="p">(</span><span class="n">monitor</span><span class="o">=</span><span class="s">"train_loss"</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s">"min"</span><span class="p">)</span>
<span class="n">trainer</span> <span class="o">=</span> <span class="n">pl</span><span class="p">.</span><span class="n">Trainer</span><span class="p">(</span><span class="n">callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">early_stop</span><span class="p">])</span>
<span class="n">trainer</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">train_dataloaders</span><span class="o">=</span><span class="n">dataloader</span><span class="p">)</span>
<span class="c1"># Inference
</span><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">([(</span><span class="n">y</span><span class="p">,</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">newaxis</span><span class="p">,</span> <span class="p">...]).</span><span class="n">item</span><span class="p">())</span> <span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">data</span><span class="p">])</span>
</code></pre></div></div>
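<p>The dataset above stores each label inside the file name and reads it back in <code class="language-plaintext highlighter-rouge">__getitem__</code>. As a minimal sketch of that round trip, using only the standard library (the helper names <code class="language-plaintext highlighter-rouge">encode_filename</code> and <code class="language-plaintext highlighter-rouge">decode_label</code> are hypothetical and not part of the code above):</p>

```python
from pathlib import Path


def encode_filename(index: int, label: int, output_dir: Path) -> Path:
    """Build a path such as test_dataset/007.1.jpg (zero-padded index, then label)."""
    return output_dir / f"{str(index).zfill(3)}.{label}.jpg"


def decode_label(path: Path) -> int:
    """Recover the label from the file name, mirroring ImageDataset.__getitem__."""
    # Path.stem drops only the final suffix: "007.1.jpg" -> "007.1"
    return int(path.stem.split(".")[1])


path = encode_filename(7, 1, Path("test_dataset"))
print(path.name)
print(decode_label(path))
```

<p>Note that padding the index to three digits keeps the sorted file list in index order only while there are at most 1000 files.</p>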
An update
2022-06-01T00:00:00+00:00
https://andre-rendeiro.com/2022/06/01/an_update
<h2 id="the-current-state-of-this-space">The current state of this space</h2>
<p>Since the end of my PhD in 2019 I have not really updated this space, primarily due to ‘lack of time’, but also because, with the rise of academic Twitter and the proliferation of preprint servers, I felt that the need for a personal blog was less pressing.</p>
<p>I also worked on a number of projects that I was not allowed to share publicly, which didn’t help in keeping to the initial idea of an open notebook of my research.</p>
<p>At the same time, doing nice, polished analyses and writing up results in a blog post takes considerable time and effort.</p>
<h2 id="what-will-this-space-be-going-forward">What will this space be going forward?</h2>
<p>Now (2022), as I start my group at CeMM, I don’t see myself having more time to write long blog posts, and again I think I won’t be able to share all the work I will be doing.</p>
<p>At the same time, I still think there is value in sharing ideas and small proof-of-concept tests that may be of broader relevance and use to others.</p>
<p>So going forward, I will try to use this space to share small snippets of code, ideas, and thoughts that I think may be of interest to others, but primarily to myself.</p>
Modeling the cell packaging process of the 10X Chromium device
2019-12-13T00:00:00+00:00
https://andre-rendeiro.com/2019/12/13/chromium_modeling
<p>Available as a <a href="/data/notebooks/chromium_modeling/chromium_modeling.ipynb">Jupyter notebook here</a>.</p>
<p><br /></p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul id="markdown-toc">
<li><a href="#table-of-contents" id="markdown-toc-table-of-contents">Table of Contents</a></li>
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
<li><a href="#the-data" id="markdown-toc-the-data">The data</a></li>
<li><a href="#modeling-the-observed-distributions" id="markdown-toc-modeling-the-observed-distributions">Modeling the observed distributions</a> <ul>
<li><a href="#zero-inflated-poisson-distribution" id="markdown-toc-zero-inflated-poisson-distribution">Zero-inflated Poisson distribution</a> <ul>
<li><a href="#using-the-zip-model-to-predict-collision-rates-in-scifi-rna-seq-data" id="markdown-toc-using-the-zip-model-to-predict-collision-rates-in-scifi-rna-seq-data">Using the ZIP model to predict collision rates in scifi-RNA-seq data</a></li>
</ul>
</li>
<li><a href="#zero-inflated-negative-binomial-distribution" id="markdown-toc-zero-inflated-negative-binomial-distribution">Zero-inflated Negative binomial distribution</a> <ul>
<li><a href="#using-the-zinb-model-to-predict-collision-rates-in-scifi-rna-seq-data" id="markdown-toc-using-the-zinb-model-to-predict-collision-rates-in-scifi-rna-seq-data">Using the ZINB model to predict collision rates in scifi-RNA-seq data</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#model-comparison" id="markdown-toc-model-comparison">Model comparison</a> <ul>
<li><a href="#waic" id="markdown-toc-waic">WAIC</a></li>
<li><a href="#leave-one-out-cross-validation" id="markdown-toc-leave-one-out-cross-validation">Leave one out cross-validation</a></li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
<li><a href="#appendix" id="markdown-toc-appendix">Appendix</a></li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>This notebook aims to explore the droplet generation and cell/nuclei loading processes which take place in a 10X Chromium device.</p>
<p>To that end, we ran the device channels with increasing numbers of nuclei and with a buffer that did not cause lysis. This way, after collecting the droplet emulsion, we were able to simply count optically the number of nuclei in each droplet. Here is what the droplet emulsion looks like for various concentrations:
<img src="http://www.medical-epigenomics.org/papers/datlinger2019/data/FigS1a.png" alt="FigS1a" width="100%" /></p>
<p>For more details on the experimental procedure, please refer to the <a href="https://www.biorxiv.org/content/10.1101/2019.12.17.879304v1">scifi-RNA-seq preprint</a>.</p>
<p>We hoped to gain a deeper understanding of the droplet generation, bead loading and nuclei loading procedures, in order to derive the statistical properties underlying them.</p>
<p>In this notebook we will focus on the nuclei loading procedure. We will use the counts of nuclei per droplet in the resulting emulsion to model these distributions and to make some predictions about the scalability of Chromium loading and its consequences in terms of nuclei collisions (the occurrence of more than one uniquely labeled nucleus within a droplet) for the 10X protocol as well as the scifi-RNA-seq protocol.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># We'll start by importing the required libraries
</span><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">matplotlib.ticker</span> <span class="kn">import</span> <span class="n">ScalarFormatter</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">import</span> <span class="nn">scipy</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="kn">import</span> <span class="nn">pymc3</span> <span class="k">as</span> <span class="n">pm</span>
<span class="n">sns</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">context</span><span class="o">=</span><span class="s">"paper"</span><span class="p">,</span> <span class="n">style</span><span class="o">=</span><span class="s">"ticks"</span><span class="p">,</span> <span class="n">palette</span><span class="o">=</span><span class="s">"colorblind"</span><span class="p">,</span> <span class="n">color_codes</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">matplotlib</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">"svg.fonttype"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"none"</span>
<span class="n">matplotlib</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">"text.usetex"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">False</span>
<span class="n">savefig</span> <span class="o">=</span> <span class="bp">False</span> <span class="c1"># Change to True to save figures as svg files
</span></code></pre></div></div>
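<p>To make the notion of a collision concrete before any model is fitted, here is a minimal sketch that assumes nuclei are loaded into droplets following a plain Poisson distribution. This is an illustrative assumption rather than the model fitted in this notebook, and the function names and rate values are hypothetical.</p>

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k nuclei in a droplet under Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def collision_rate(lam: float) -> float:
    """Fraction of non-empty droplets that hold more than one nucleus."""
    p_empty = poisson_pmf(0, lam)
    p_single = poisson_pmf(1, lam)
    return (1 - p_empty - p_single) / (1 - p_empty)


# Hypothetical mean numbers of nuclei per droplet
for lam in (0.1, 1.0, 5.0):
    print(f"lambda={lam}: collision rate = {collision_rate(lam):.3f}")
```

<p>Under this assumption the collision rate grows monotonically with the mean number of nuclei per droplet, which is why loading more nuclei trades throughput against collisions.</p>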
<h1 id="the-data">The data</h1>
<p>We’ll read a CSV file with counts of nuclei per droplet for different experiments where the Chromium device was loaded with different concentrations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Load observed counts of nuclei per droplet
</span><span class="n">url</span> <span class="o">=</span> <span class="s">"http://www.medical-epigenomics.org/papers/datlinger2019/data/droplet_counts.csv"</span>
<span class="n">droplet_counts</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">droplet_counts</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>loaded_nuclei</th>
<th>cells_per_droplet</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>15300</td>
<td>0</td>
<td>509</td>
</tr>
<tr>
<td>1</td>
<td>15300</td>
<td>1</td>
<td>89</td>
</tr>
<tr>
<td>2</td>
<td>15300</td>
<td>2</td>
<td>8</td>
</tr>
<tr>
<td>3</td>
<td>15300</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>4</td>
<td>15300</td>
<td>4</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">droplet_counts</span><span class="p">.</span><span class="n">tail</span><span class="p">()</span>
</code></pre></div></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>loaded_nuclei</th>
<th>cells_per_droplet</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<td>120</td>
<td>1530000</td>
<td>20</td>
<td>2</td>
</tr>
<tr>
<td>121</td>
<td>1530000</td>
<td>21</td>
<td>3</td>
</tr>
<tr>
<td>122</td>
<td>1530000</td>
<td>22</td>
<td>0</td>
</tr>
<tr>
<td>123</td>
<td>1530000</td>
<td>23</td>
<td>0</td>
</tr>
<tr>
<td>124</td>
<td>1530000</td>
<td>24</td>
<td>1</td>
</tr>
</tbody>
</table>
</div>
<p>Let’s calculate relative fractions within experiments for plotting later.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # split apply combine for fraction normalization
</span><span class="n">droplet_counts</span> <span class="o">=</span> <span class="n">droplet_counts</span><span class="p">.</span><span class="n">join</span><span class="p">(</span>
<span class="n">droplet_counts</span>
<span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'loaded_nuclei'</span><span class="p">)</span>
<span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">((</span><span class="n">x</span><span class="p">[</span><span class="s">'count'</span><span class="p">]</span> <span class="o">/</span> <span class="n">x</span><span class="p">[</span><span class="s">'count'</span><span class="p">].</span><span class="nb">sum</span><span class="p">()).</span><span class="n">tolist</span><span class="p">(),</span>
<span class="n">index</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"count (%)"</span><span class="p">))</span>
<span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="c1"># droplet_counts['norm_count'] += sys.float_info.epsilon
</span>
<span class="c1"># # split apply combine for % max normalization
</span><span class="n">droplet_counts</span> <span class="o">=</span> <span class="n">droplet_counts</span><span class="p">.</span><span class="n">join</span><span class="p">(</span>
<span class="n">droplet_counts</span>
<span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'loaded_nuclei'</span><span class="p">)</span>
<span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(((</span><span class="n">x</span><span class="p">[</span><span class="s">'count'</span><span class="p">]</span> <span class="o">/</span> <span class="n">x</span><span class="p">[</span><span class="s">'count'</span><span class="p">].</span><span class="nb">max</span><span class="p">())</span> <span class="o">*</span> <span class="mi">100</span><span class="p">).</span><span class="n">tolist</span><span class="p">(),</span>
<span class="n">index</span><span class="o">=</span><span class="n">x</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"count (% max)"</span><span class="p">))</span>
<span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="c1"># # split apply combine for number of droplets counted
</span><span class="n">droplet_counts</span> <span class="o">=</span> <span class="n">droplet_counts</span><span class="p">.</span><span class="n">reset_index</span><span class="p">().</span><span class="n">set_index</span><span class="p">(</span><span class="s">"loaded_nuclei"</span><span class="p">).</span><span class="n">join</span><span class="p">(</span>
<span class="n">droplet_counts</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'loaded_nuclei'</span><span class="p">)[</span><span class="s">'count'</span><span class="p">].</span><span class="nb">sum</span><span class="p">().</span><span class="n">rename</span><span class="p">(</span><span class="s">"droplets_analyzed"</span><span class="p">)</span>
<span class="p">).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">drop</span><span class="p">(</span><span class="s">"index"</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>The resulting dataframe looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">droplet_counts</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>loaded_nuclei</th>
<th>cells_per_droplet</th>
<th>count</th>
<th>count (%)</th>
<th>count (% max)</th>
<th>droplets_analyzed</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>15300</td>
<td>0</td>
<td>509</td>
<td>0.835796</td>
<td>100.000000</td>
<td>609</td>
</tr>
<tr>
<td>1</td>
<td>15300</td>
<td>1</td>
<td>89</td>
<td>0.146141</td>
<td>17.485265</td>
<td>609</td>
</tr>
<tr>
<td>2</td>
<td>15300</td>
<td>2</td>
<td>8</td>
<td>0.013136</td>
<td>1.571709</td>
<td>609</td>
</tr>
<tr>
<td>3</td>
<td>15300</td>
<td>3</td>
<td>2</td>
<td>0.003284</td>
<td>0.392927</td>
<td>609</td>
</tr>
<tr>
<td>4</td>
<td>15300</td>
<td>4</td>
<td>0</td>
<td>0.000000</td>
<td>0.000000</td>
<td>609</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">droplet_counts</span><span class="p">.</span><span class="n">tail</span><span class="p">()</span>
</code></pre></div></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>loaded_nuclei</th>
<th>cells_per_droplet</th>
<th>count</th>
<th>count (%)</th>
<th>count (% max)</th>
<th>droplets_analyzed</th>
</tr>
</thead>
<tbody>
<tr>
<td>120</td>
<td>1530000</td>
<td>20</td>
<td>2</td>
<td>0.003185</td>
<td>2.564103</td>
<td>628</td>
</tr>
<tr>
<td>121</td>
<td>1530000</td>
<td>21</td>
<td>3</td>
<td>0.004777</td>
<td>3.846154</td>
<td>628</td>
</tr>
<tr>
<td>122</td>
<td>1530000</td>
<td>22</td>
<td>0</td>
<td>0.000000</td>
<td>0.000000</td>
<td>628</td>
</tr>
<tr>
<td>123</td>
<td>1530000</td>
<td>23</td>
<td>0</td>
<td>0.000000</td>
<td>0.000000</td>
<td>628</td>
</tr>
<tr>
<td>124</td>
<td>1530000</td>
<td>24</td>
<td>1</td>
<td>0.001592</td>
<td>1.282051</td>
<td>628</td>
</tr>
</tbody>
</table>
</div>
<p>Let’s visualize the fraction of empty vs filled droplets per experiment:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Get droplet empty/fill rates
</span><span class="n">empty_rate</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">droplet_counts</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"cells_per_droplet"</span><span class="p">)</span>
<span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"loaded_nuclei"</span><span class="p">)</span>
<span class="p">[</span><span class="s">'count'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">/</span> <span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">())</span>
<span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="s">"empty_rate"</span><span class="p">))</span>
<span class="n">fill_rate</span> <span class="o">=</span> <span class="p">(</span>
<span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">empty_rate</span><span class="p">)</span>
<span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="s">"fill_rate"</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">colors</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"colorblind"</span><span class="p">)</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mf">6.1</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="k">for</span> <span class="n">axis</span><span class="p">,</span> <span class="n">scale</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">axes</span><span class="p">,</span> <span class="p">[</span><span class="s">'linear'</span><span class="p">,</span> <span class="s">'log'</span><span class="p">]):</span>
    <span class="n">axis2</span> <span class="o">=</span> <span class="n">axis</span><span class="p">.</span><span class="n">twinx</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">ax</span><span class="p">,</span> <span class="n">var_</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">([(</span><span class="n">axis</span><span class="p">,</span> <span class="n">fill_rate</span><span class="p">),</span> <span class="p">(</span><span class="n">axis2</span><span class="p">,</span> <span class="n">empty_rate</span><span class="p">)]):</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">fill_rate</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">var_</span><span class="p">,</span> <span class="s">".-"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Cells loaded"</span><span class="p">)</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">f</span><span class="s">"Droplet </span><span class="si">{</span><span class="n">var_</span><span class="p">.</span><span class="n">name</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">'_'</span><span class="p">,</span> <span class="s">' '</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">set_xscale</span><span class="p">(</span><span class="n">scale</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_13_0.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Fraction of filled droplets vs loading concentration
</span><span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">tight_layout</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">fill_rate</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">fill_rate</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axis</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">plot</span><span class="p">(</span><span class="n">fill_rate</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">fill_rate</span><span class="p">.</span><span class="n">values</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">scatter</span><span class="p">(</span><span class="n">fill_rate</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">fill_rate</span><span class="p">.</span><span class="n">values</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Fraction of filled droplets"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axis</span><span class="p">:</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Loaded nuclei"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
        <span class="s">"droplet_counts.fill_fraction.barplot.svg"</span><span class="p">,</span>
        <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_14_0.png" alt="png" /></p>
<p>Okay, we can see that the droplets get filled at an exponential rate.</p>
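<p>For reference, if droplet occupancy were Poisson with a mean of λ cells per droplet, the expected filled fraction would be 1 − exp(−λ), a curve that rises quickly and then saturates towards 1. The sketch below only illustrates that textbook relationship; the function name and the λ values are made up for the example and are not fitted to the data above.</p>

```python
import math


def poisson_fill_fraction(lam: float) -> float:
    """Expected fraction of droplets holding at least one cell under a Poisson model."""
    # P(k >= 1) = 1 - P(k = 0) = 1 - exp(-lam)
    return 1.0 - math.exp(-lam)


# Hypothetical mean occupancies (cells per droplet), for illustration only
lams = [0.0, 0.1, 0.5, 1.0, 2.0]
expected_fill = [poisson_fill_fraction(lam) for lam in lams]
for lam, frac in zip(lams, expected_fill):
    print(f"lambda={lam:.1f}: expected fill fraction {frac:.3f}")
```

<p>Comparing a curve like this with the observed fill rate is one quick way to see how far the data depart from a pure Poisson loading process.</p>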
<p>Let’s now plot the distributions of cells per droplet for each of the count transformations (raw count, fraction of droplets, and percentage of the maximum), on both linear and log scales.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # some preparations for plotting
</span><span class="n">experiments</span> <span class="o">=</span> <span class="n">droplet_counts</span><span class="p">[</span><span class="s">'loaded_nuclei'</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">colors</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"colorblind"</span><span class="p">)</span>
<span class="n">inc</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">variables</span> <span class="o">=</span> <span class="n">droplet_counts</span><span class="p">.</span><span class="n">columns</span><span class="p">[</span><span class="n">droplet_counts</span><span class="p">.</span><span class="n">columns</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">"count"</span><span class="p">)]</span>
<span class="n">nrows</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">variables</span><span class="p">)</span>
<span class="n">ncols</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">experiments</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # Plot distributions separately for each experiment
</span><span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">sharey</span><span class="p">,</span> <span class="n">label</span> <span class="ow">in</span> <span class="p">[</span>
            <span class="p">(</span><span class="bp">False</span><span class="p">,</span> <span class="s">"independent_yscale"</span><span class="p">),</span> <span class="p">(</span><span class="s">"row"</span><span class="p">,</span> <span class="s">"same_yscale"</span><span class="p">)]:</span>
        <span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span>
            <span class="n">nrows</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">ncols</span><span class="p">,</span>
            <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="n">nrows</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">inc</span><span class="p">,</span> <span class="n">ncols</span> <span class="o">*</span> <span class="n">inc</span><span class="p">),</span>
            <span class="n">tight_layout</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="n">sharey</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">variables</span><span class="p">):</span>
            <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">e</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">experiments</span><span class="p">):</span>
                <span class="n">dc</span> <span class="o">=</span> <span class="n">droplet_counts</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">droplet_counts</span><span class="p">[</span><span class="s">'loaded_nuclei'</span><span class="p">]</span> <span class="o">==</span> <span class="n">e</span><span class="p">,</span> <span class="p">:]</span>
                <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">ax</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">axis</span><span class="p">[[</span><span class="n">i</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">i</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">+</span> <span class="mi">1</span><span class="p">],</span> <span class="n">j</span><span class="p">]):</span>
                    <span class="n">sns</span><span class="p">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">dc</span><span class="p">[</span><span class="s">'cells_per_droplet'</span><span class="p">],</span> <span class="n">y</span><span class="o">=</span><span class="n">dc</span><span class="p">[</span><span class="n">l</span><span class="p">],</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">j</span><span class="p">])</span>
                <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span>
                        <span class="sa">f</span><span class="s">"Nuclei loaded: </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span>
                        <span class="sa">f</span><span class="s">"Droplets counted: </span><span class="si">{</span><span class="n">dc</span><span class="p">[</span><span class="s">'droplets_analyzed'</span><span class="p">].</span><span class="n">unique</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span>
                        <span class="sa">f</span><span class="s">"Fill fraction: </span><span class="si">{</span><span class="n">fill_rate</span><span class="p">[</span><span class="n">e</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axis</span><span class="p">[</span><span class="mi">1</span><span class="p">::</span><span class="mi">2</span><span class="p">,</span> <span class="p">:].</span><span class="n">flatten</span><span class="p">():</span>
            <span class="n">ax</span><span class="p">.</span><span class="n">set_yscale</span><span class="p">(</span><span class="s">"symlog"</span><span class="p">)</span>
            <span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="n">ax</span><span class="p">.</span><span class="n">get_ylabel</span><span class="p">()</span> <span class="o">+</span> <span class="s">" (log)"</span><span class="p">)</span>
            <span class="n">ax</span><span class="p">.</span><span class="n">yaxis</span><span class="p">.</span><span class="n">set_major_formatter</span><span class="p">(</span><span class="n">ScalarFormatter</span><span class="p">())</span>
        <span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
            <span class="sa">f</span><span class="s">"droplet_counts.barplot.</span><span class="si">{</span><span class="n">label</span><span class="si">}</span><span class="s">.svg"</span><span class="p">,</span>
            <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # Plot experiment distributions jointly across experiments
</span><span class="n">ncols</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">x</span> <span class="o">=</span> <span class="s">'cells_per_droplet'</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span>
    <span class="n">nrows</span><span class="p">,</span> <span class="n">ncols</span><span class="p">,</span>
    <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="n">nrows</span> <span class="o">*</span> <span class="n">inc</span><span class="p">,</span> <span class="n">ncols</span> <span class="o">*</span> <span class="n">inc</span><span class="p">),</span>
    <span class="n">tight_layout</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">variables</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">e</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">experiments</span><span class="p">):</span>
        <span class="n">dc</span> <span class="o">=</span> <span class="n">droplet_counts</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">droplet_counts</span><span class="p">[</span><span class="s">'loaded_nuclei'</span><span class="p">]</span> <span class="o">==</span> <span class="n">e</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:]:</span>
            <span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">dc</span><span class="p">[</span><span class="n">x</span><span class="p">],</span> <span class="n">dc</span><span class="p">[</span><span class="n">l</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">,</span> <span class="n">markersize</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="n">e</span><span class="p">)</span>
            <span class="n">ax</span><span class="p">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">dc</span><span class="p">[</span><span class="n">x</span><span class="p">],</span> <span class="p">(</span><span class="n">dc</span><span class="p">[</span><span class="n">l</span><span class="p">]).</span><span class="nb">min</span><span class="p">(),</span> <span class="n">dc</span><span class="p">[</span><span class="n">l</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
<span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axis</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]:</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_yscale</span><span class="p">(</span><span class="s">"symlog"</span><span class="p">)</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">yaxis</span><span class="p">.</span><span class="n">set_major_formatter</span><span class="p">(</span><span class="n">ScalarFormatter</span><span class="p">())</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">variables</span><span class="p">):</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">0</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">1</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="n">l</span> <span class="o">+</span> <span class="s">" (log)"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:]:</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Nuclei per droplet"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">].</span><span class="n">legend</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s">"Nuclei loaded"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
        <span class="s">"droplet_counts.overlayed_lineplot.svg"</span><span class="p">,</span>
        <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_19_0.png" alt="png" /></p>
<p>These look reasonably Poisson-like, but at the higher loading concentrations it is clear that there are more droplets without cells (0 nuclei per droplet) than a Poisson distribution would explain.</p>
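<p>One common way to describe this kind of excess of empty droplets is a zero-inflated Poisson, in which a fraction of droplets is empty regardless of the Poisson process. This is an assumption brought in here purely for illustration; the function names and parameter values below are hypothetical and not estimated from the data.</p>

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing k cells in a droplet under a Poisson(lam) model."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k: int, lam: float, pi: float) -> float:
    """Zero-inflated Poisson: with probability pi a droplet is empty regardless."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


# Hypothetical parameters: mean occupancy 2.0, 30% structurally empty droplets
lam, pi = 2.0, 0.3
p_zero_poisson = poisson_pmf(0, lam)
p_zero_zip = zip_pmf(0, lam, pi)
# The zero-inflated probabilities should still sum to (approximately) one
total = sum(zip_pmf(k, lam, pi) for k in range(60))
print(f"P(0) Poisson: {p_zero_poisson:.3f}, P(0) zero-inflated: {p_zero_zip:.3f}")
```

<p>With the same λ, the zero-inflated version puts more mass on empty droplets while the non-zero part keeps its Poisson shape, which is qualitatively the pattern described above.</p>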
<p>Regardless of this zero component, let’s dig a bit further and check what the mean/variance relationship of these distributions looks like.</p>
<p>To do that, we’ll have to expand the count data back into the original observations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">value_counts_to_observations</span><span class="p">(</span><span class="n">dc</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-></span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">:</span>
    <span class="s">"""
    Generate observations from counts.

    Parameters
    ----------
    dc : pd.Series
        Series with index as value and values as counts.
    """</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">dc</span><span class="p">.</span><span class="n">index</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">dc</span><span class="p">[</span><span class="n">i</span><span class="p">])])</span>
</code></pre></div></div>
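<p>For large counts, the same expansion can be done without a Python-level loop. The following is a sketch of an equivalent approach using <code class="language-plaintext highlighter-rouge">np.repeat</code>; the function name and the toy series are made up for the example.</p>

```python
import numpy as np
import pandas as pd


def value_counts_to_observations_vectorized(dc: pd.Series) -> np.ndarray:
    """Repeat each index value as many times as its count."""
    return np.repeat(dc.index.to_numpy(), dc.to_numpy().astype(int))


# Toy example: 3 empty droplets, 2 with one cell, none with two, 1 with three
toy = pd.Series([3, 2, 0, 1], index=[0, 1, 2, 3], name="count")
obs = value_counts_to_observations_vectorized(toy)
print(obs)
```

<p>Values with a count of zero simply do not appear in the result, as in the list-comprehension version.</p>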
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Check Poisson assumptions
# # Gather observed mean variance per loading concentration, with and without zeros
</span><span class="n">r_all</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="n">r_nonzero</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">droplet_counts</span><span class="p">[</span><span class="s">'loaded_nuclei'</span><span class="p">].</span><span class="n">unique</span><span class="p">():</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">value_counts_to_observations</span><span class="p">(</span>
        <span class="n">droplet_counts</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="sa">f</span><span class="s">"loaded_nuclei == </span><span class="si">{</span><span class="n">l</span><span class="si">}</span><span class="s">"</span><span class="p">).</span><span class="n">set_index</span><span class="p">(</span><span class="s">"cells_per_droplet"</span><span class="p">)[</span><span class="s">'count'</span><span class="p">])</span>
    <span class="n">r_all</span><span class="p">[</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span><span class="p">.</span><span class="n">mean</span><span class="p">(),</span> <span class="n">c</span><span class="p">.</span><span class="n">var</span><span class="p">()</span>
    <span class="n">r_nonzero</span><span class="p">[</span><span class="n">l</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span><span class="p">[</span><span class="n">c</span> <span class="o">></span> <span class="mi">0</span><span class="p">].</span><span class="n">mean</span><span class="p">(),</span> <span class="n">c</span><span class="p">[</span><span class="n">c</span> <span class="o">></span> <span class="mi">0</span><span class="p">].</span><span class="n">var</span><span class="p">()</span>
<span class="n">r_all</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">r_all</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'var'</span><span class="p">]).</span><span class="n">T</span>
<span class="n">r_all</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">'loaded_nuclei'</span>
<span class="n">r_nonzero</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">r_nonzero</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">'mean'</span><span class="p">,</span> <span class="s">'var'</span><span class="p">]).</span><span class="n">T</span>
<span class="n">r_nonzero</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">'loaded_nuclei'</span>
</code></pre></div></div>
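<p>The mean and variance can also be computed straight from a counts table, without expanding it, by weighting each value by its count. A minimal sketch, assuming the same layout as above (index holds cells per droplet, values hold counts); the function name and toy numbers are hypothetical. It uses the population variance, matching the default of <code class="language-plaintext highlighter-rouge">ndarray.var()</code>.</p>

```python
import numpy as np
import pandas as pd


def mean_var_from_counts(dc: pd.Series) -> tuple:
    """Count-weighted mean and population variance of the values in dc.index."""
    values = dc.index.to_numpy(dtype=float)
    weights = dc.to_numpy(dtype=float)
    mean = np.average(values, weights=weights)
    var = np.average((values - mean) ** 2, weights=weights)
    return float(mean), float(var)


# Toy counts: 5 empty droplets, 3 with one cell, 2 with two cells
toy = pd.Series([5, 3, 2], index=[0, 1, 2], name="count")
mean, var = mean_var_from_counts(toy)

# Cross-check against the explicitly expanded observations
expanded = np.repeat(toy.index.to_numpy(), toy.to_numpy())
print(mean, var, expanded.mean(), expanded.var())
```

<p>A variance-to-mean ratio close to 1 would be consistent with a Poisson distribution, so this pair of numbers is a quick dispersion check per loading concentration.</p>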
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Let's quickly fit a linear model on the mean/variance relationship
</span><span class="n">lm_all</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="n">lm_all</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">r_all</span><span class="p">[</span><span class="s">'mean'</span><span class="p">].</span><span class="n">values</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">r_all</span><span class="p">[</span><span class="s">'var'</span><span class="p">].</span><span class="n">values</span><span class="p">)</span>
<span class="n">lm_nonzero</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="n">lm_nonzero</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">r_nonzero</span><span class="p">[</span><span class="s">'mean'</span><span class="p">].</span><span class="n">values</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">r_nonzero</span><span class="p">[</span><span class="s">'var'</span><span class="p">].</span><span class="n">values</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">lm_all</span><span class="p">.</span><span class="n">coef_</span><span class="p">,</span> <span class="n">lm_all</span><span class="p">.</span><span class="n">intercept_</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">lm_nonzero</span><span class="p">.</span><span class="n">coef_</span><span class="p">,</span> <span class="n">lm_all</span><span class="p">.</span><span class="n">intercept_</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1.60711113] -0.232195442242209
[1.224458] -0.232195442242209
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span>
<span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">inc</span> <span class="o">*</span> <span class="mf">1.5</span><span class="p">,</span> <span class="n">inc</span> <span class="o">*</span> <span class="mf">1.5</span><span class="p">),</span>
<span class="n">sharey</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">"Poissonian properties of cell loading in droplets"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"All observations"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Only non-zero observations"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="k">for</span> <span class="n">ax</span><span class="p">,</span> <span class="n">r_</span><span class="p">,</span> <span class="n">lm_</span> <span class="ow">in</span> <span class="p">[(</span><span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">r_all</span><span class="p">,</span> <span class="n">lm_all</span><span class="p">),</span> <span class="p">(</span><span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">r_nonzero</span><span class="p">,</span> <span class="n">lm_nonzero</span><span class="p">)]:</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">r_</span><span class="p">[</span><span class="s">'mean'</span><span class="p">],</span> <span class="n">r_</span><span class="p">[</span><span class="s">'var'</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">"o"</span><span class="p">)</span>
<span class="n">vmin</span> <span class="o">=</span> <span class="n">r_</span><span class="p">.</span><span class="nb">min</span><span class="p">().</span><span class="nb">min</span><span class="p">()</span>
<span class="n">vmin</span> <span class="o">-=</span> <span class="n">vmin</span> <span class="o">*</span> <span class="mf">0.1</span>
<span class="n">vmax</span> <span class="o">=</span> <span class="n">r_</span><span class="p">.</span><span class="nb">max</span><span class="p">().</span><span class="nb">max</span><span class="p">()</span>
<span class="n">vmax</span> <span class="o">+=</span> <span class="n">vmax</span> <span class="o">*</span> <span class="mf">0.1</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
<span class="p">(</span><span class="n">vmin</span><span class="p">,</span> <span class="n">vmax</span><span class="p">),</span> <span class="p">(</span><span class="n">vmin</span><span class="p">,</span> <span class="n">vmax</span><span class="p">),</span>
<span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"grey"</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">r_</span><span class="p">[</span><span class="s">'mean'</span><span class="p">].</span><span class="nb">min</span><span class="p">(),</span> <span class="n">r_</span><span class="p">[</span><span class="s">'mean'</span><span class="p">].</span><span class="nb">max</span><span class="p">())</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
<span class="n">x</span><span class="p">,</span> <span class="n">lm_</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))),</span>
<span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"black"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axes</span><span class="p">:</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Mean"</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Variance"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
<span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
<span class="s">"droplet_counts.poisson_assumptions.mean_vs_var.svg"</span><span class="p">,</span>
<span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_24_0.png" alt="png" /></p>
<p>We can see that the observed relationship between mean and variance depends on whether we include the observations of empty droplets (the analysis restricted to non-empty droplets is more of a thought exercise).</p>
<p>However, in both cases the slope of the line relating variance to mean is larger than 1. This means that the variance grows faster than the mean as the loading concentration increases, so the data are overdispersed relative to a Poisson distribution (for which variance equals mean). We will probably have to deal with variance higher than the mean when modeling these data.</p>
<p>Just as a curiosity, it is interesting to see that the variance in the lower range (the one recommended by 10X) is lower than or equal to the mean. This is, however, not the sub-Poissonian loading property of the Chromium (read more here: https://liorpachter.wordpress.com/2019/02/19/introduction-to-single-cell-rna-seq-technologies/ and here https://doi.org/10.6084/m9.figshare.7704659.v1), as that property refers to the <em>bead loading</em> process (which achieves sub-Poissonian properties due to the deformable beads), and here we are dealing with <em>cell loading</em>.</p>
<h1 id="modeling-the-observed-distributions">Modeling the observed distributions</h1>
<p>We’ll try to model these data in order to understand the latent process underlying its generation and in order to estimate parameters that would allow us to extrapolate the results to any other device loading concentration. This will be useful later when we ask the question:</p>
<blockquote>
<p>“<em>if I load the device with X nuclei, how many collisions should I expect?</em>”</p>
</blockquote>
<h2 id="zero-inflated-poisson-distribution">Zero-inflated Poisson distribution</h2>
<p>Although from exploring the data a bit we already have reason to suspect these data might not exactly follow a Poisson distribution, for simplicity we will start by modeling them with a Zero-Inflated Poisson distribution (ZIP). The zero-inflated component is due to the fact that even at high loading concentrations we observe a considerable amount of droplets without cells. The ZIP has a λ (lambda) parameter to model both mean and variance but, unlike the usual Poisson, also has a Ψ (psi) parameter for modeling the zero-inflated component.</p>
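<p>The way I interpret the zero-inflated component is that it captures the fraction of droplets with zero cells which did not get filled due to some other factor which does not follow the Poisson process - in other words, the fraction of droplets which did not have the chance to be filled - a <em>“technical”</em> reason for being empty. Note that in the PyMC3 parameterization Ψ is the expected proportion of observations that do follow the Poisson process, so this <em>“technical”</em> fraction corresponds to 1 − Ψ.</p>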
<p>This could easily be by design, for example in an initial burn-in period in the device, where cells are not yet going into the microfluidic channel that produces droplets.</p>
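<p>To make the ZIP more concrete, here is a minimal sketch of its probability mass function and moments, checked against a simulation. It assumes the PyMC3 convention in which <code class="language-plaintext highlighter-rouge">psi</code> is the fraction of droplets that follow the Poisson process; the function names and the parameter values are hypothetical and only for illustration.</p>

```python
import math
import numpy as np


def zip_pmf(k, psi, lam):
    """P(X = k) when a fraction psi of droplets follows Poisson(lam)
    and the remaining 1 - psi are always empty."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    extra_zero = (1.0 - psi) if k == 0 else 0.0
    return extra_zero + psi * poisson


def zip_mean_var(psi, lam):
    """Closed-form mean and variance of the same distribution."""
    mean = psi * lam
    var = psi * lam * (1.0 + (1.0 - psi) * lam)
    return mean, var


# Hypothetical parameter values, chosen only for illustration
psi, lam, n = 0.9, 5.0, 200_000
rng = np.random.default_rng(0)
follows_poisson = rng.random(n) < psi
sample = np.where(follows_poisson, rng.poisson(lam, n), 0)

mean, var = zip_mean_var(psi, lam)
print(f"mean: theory {mean:.2f}, simulated {sample.mean():.2f}")
print(f"var:  theory {var:.2f}, simulated {sample.var():.2f}")
print(f"P(X = 0): theory {zip_pmf(0, psi, lam):.3f}, "
      f"simulated {(sample == 0).mean():.3f}")
```

<p>Under this parameterization the variance exceeds the mean whenever <code class="language-plaintext highlighter-rouge">psi</code> is below 1, so extra zeros alone already produce some overdispersion.</p>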
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Let's assemble all data in a pivot dataframe
</span><span class="n">droplet_counts_p</span> <span class="o">=</span> <span class="n">droplet_counts</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span>
<span class="n">index</span><span class="o">=</span><span class="s">"cells_per_droplet"</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">"loaded_nuclei"</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">"count"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">droplet_counts_p</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loaded_nuclei 15300 191250 382500 765000 1530000
cells_per_droplet
0 509 263 49 73 28
1 89 162 66 12 0
2 8 142 114 25 1
3 2 85 146 50 3
4 0 39 98 86 8
5 0 14 76 114 28
6 1 4 29 97 43
7 0 2 26 88 63
8 0 0 4 67 62
9 0 0 8 35 70
10 0 1 0 21 78
11 0 1 0 11 65
12 0 0 0 7 56
13 0 1 1 3 37
14 0 0 0 3 23
15 0 0 0 5 19
16 0 0 0 0 19
17 0 0 0 0 9
18 0 0 0 0 5
19 0 0 0 0 5
20 0 0 0 0 2
21 0 0 0 0 3
22 0 0 0 0 0
23 0 0 0 0 0
24 0 0 0 0 1
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Now we expand these distributions into the actual observed data
</span><span class="n">counts</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span>
<span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">droplet_counts_p</span><span class="p">.</span><span class="n">index</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">droplet_counts_p</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">])])</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">droplet_counts_p</span><span class="p">.</span><span class="n">columns</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # # # cap all to min(observations) ~= 609
</span><span class="n">n_exp</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">experiments</span><span class="p">)</span>
<span class="n">m</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">len</span><span class="p">,</span> <span class="n">counts</span><span class="p">))</span>
<span class="n">counts</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">m</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">counts</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>609
</code></pre></div></div>
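<p>As a side note, the same expansion and subsampling can be sketched with <code class="language-plaintext highlighter-rouge">np.repeat</code> and a seeded random generator. The counts below are the 15,300 loaded nuclei column of the table above (rows 0 to 6); the cap of 500 is a hypothetical value. Unlike the call above, which samples with replacement by default, this variant uses <code class="language-plaintext highlighter-rouge">replace=False</code> so that each droplet is used at most once.</p>

```python
import numpy as np

# Histogram for the 15,300 loaded nuclei experiment (rows 0..6 of the table)
cells_per_droplet = np.arange(7)
n_droplets = np.array([509, 89, 8, 2, 0, 0, 1])

# Expand the histogram into one observation per droplet
observations = np.repeat(cells_per_droplet, n_droplets)
print(len(observations))

# Subsample without replacement, with a fixed seed for reproducibility
rng = np.random.default_rng(0)
cap = 500  # hypothetical cap, only for illustration
subsample = rng.choice(observations, size=cap, replace=False)
print(subsample.mean(), observations.mean())
```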
<p>Here’s our model (easier to read the code bottom to top):</p>
<ul>
<li>the observed count data come from a Zero-Inflated Poisson distribution with parameters λ (lambda, called <code class="language-plaintext highlighter-rouge">theta</code> in the code) and Ψ (psi);</li>
<li>the λ (lambda) parameter comes from an Exponential distribution (λ > 0), whose rate we set from the mean number of cells per droplet in the observed data as prior knowledge;</li>
<li>the Ψ (psi) parameter comes from a Uniform distribution (0 < Ψ < 1), i.e. a flat prior that does not favor any value;</li>
<li>each of these parameters has shape <code class="language-plaintext highlighter-rouge">(1, n_exp)</code>, where <code class="language-plaintext highlighter-rouge">n_exp</code> is the number of experiments/loading concentrations (i.e. they will be estimated for each loading concentration separately).</li>
</ul>
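<p>One detail worth keeping in mind when reading the code below: the PyMC3 <code class="language-plaintext highlighter-rouge">Exponential</code> distribution is parameterized by its rate <code class="language-plaintext highlighter-rouge">lam</code>, so its mean is <code class="language-plaintext highlighter-rouge">1 / lam</code>. A quick numerical check of that relationship, using NumPy and a hypothetical rate value:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 4.0  # hypothetical rate, only for illustration

# NumPy parameterizes the Exponential by its scale, which is 1 / rate
draws = rng.exponential(scale=1.0 / rate, size=100_000)
prior_mean = draws.mean()
print(prior_mean, 1.0 / rate)
```

<p>A prior with rate equal to the observed mean is therefore centered at the reciprocal of that mean. With several hundred observations per experiment, the likelihood will likely dominate this prior anyway.</p>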
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # Fit Zero Inflated Poisson
</span><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">zip_model</span><span class="p">:</span>
<span class="n">psi</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Uniform</span><span class="p">(</span><span class="s">'psi'</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_exp</span><span class="p">))</span>
<span class="n">lam</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">(</span><span class="s">'lam'</span><span class="p">,</span> <span class="n">lam</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">counts</span><span class="p">),</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_exp</span><span class="p">))</span>
<span class="n">pois</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">ZeroInflatedPoisson</span><span class="p">(</span><span class="s">'pois'</span><span class="p">,</span> <span class="n">psi</span><span class="o">=</span><span class="n">psi</span><span class="p">,</span> <span class="n">theta</span><span class="o">=</span><span class="n">lam</span><span class="p">,</span>
<span class="n">observed</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">counts</span><span class="p">).</span><span class="n">T</span><span class="p">.</span><span class="n">values</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Parameters for the MCMC (NUTS) sampler
</span><span class="n">sampler_params</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span>
<span class="n">random_seed</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="c1"># random seed for reproducibility
</span> <span class="n">n_init</span><span class="o">=</span><span class="mi">200000</span><span class="p">,</span> <span class="c1"># number of iterations of initializer (this is actually the default)
</span> <span class="n">tune</span><span class="o">=</span><span class="mi">5000</span><span class="p">,</span> <span class="c1"># number of tuning iterations (this is probably the most critical)
</span> <span class="n">draws</span><span class="o">=</span><span class="mi">5000</span><span class="p">,</span> <span class="c1"># number of iterations to sample (these will be used for our parameter estimates)
</span><span class="p">)</span>
<span class="n">TAKE_AFTER</span> <span class="o">=</span> <span class="mi">1000</span> <span class="c1"># number of initial iterations to discard (just as precaution we'll exclude these)
</span>
<span class="n">vi_params</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span>
<span class="n">n</span><span class="o">=</span><span class="mi">50000</span> <span class="c1"># iterations
</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># To fit the model, we use MCMC sampling, because it is tractable
</span><span class="k">with</span> <span class="n">zip_model</span><span class="p">:</span>
<span class="n">zip_trace</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="o">**</span><span class="n">sampler_params</span><span class="p">)[</span><span class="n">TAKE_AFTER</span><span class="p">:]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [lam, psi]
Sampling 2 chains, 0 divergences: 100%|██████████| 20000/20000 [00:30<00:00, 657.28draws/s]
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># We can now sample from the posterior predictive distribution of the model
</span><span class="k">with</span> <span class="n">zip_model</span><span class="p">:</span>
<span class="n">zip_ppc_trace</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample_posterior_predictive</span><span class="p">(</span>
<span class="n">trace</span><span class="o">=</span><span class="n">zip_trace</span><span class="p">,</span> <span class="n">samples</span><span class="o">=</span><span class="n">sampler_params</span><span class="p">[</span><span class="s">'draws'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">sampler_params</span><span class="p">[</span><span class="s">'random_seed'</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>100%|██████████| 10000/10000 [00:08<00:00, 1180.97it/s]
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Just for fun, let's also do Variational Inference
</span><span class="kn">from</span> <span class="nn">pymc3.variational.callbacks</span> <span class="kn">import</span> <span class="n">CheckParametersConvergence</span>
<span class="k">with</span> <span class="n">zip_model</span><span class="p">:</span>
<span class="n">zip_advi</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">ADVI</span><span class="p">()</span>
<span class="n">zip_tracker</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">callbacks</span><span class="p">.</span><span class="n">Tracker</span><span class="p">(</span>
<span class="n">mean</span><span class="o">=</span><span class="n">zip_advi</span><span class="p">.</span><span class="n">approx</span><span class="p">.</span><span class="n">mean</span><span class="p">.</span><span class="nb">eval</span><span class="p">,</span>
<span class="n">std</span><span class="o">=</span><span class="n">zip_advi</span><span class="p">.</span><span class="n">approx</span><span class="p">.</span><span class="n">std</span><span class="p">.</span><span class="nb">eval</span>
<span class="p">)</span>
<span class="n">zip_mean_field</span> <span class="o">=</span> <span class="n">zip_advi</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span>
<span class="n">n</span><span class="o">=</span><span class="n">vi_params</span><span class="p">[</span><span class="s">'n'</span><span class="p">],</span> <span class="n">callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">CheckParametersConvergence</span><span class="p">(),</span> <span class="n">zip_tracker</span><span class="p">])</span>
<span class="n">zip_vi_trace</span> <span class="o">=</span> <span class="n">zip_mean_field</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="n">sampler_params</span><span class="p">[</span><span class="s">'draws'</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Average Loss = 5,780.7: 100%|██████████| 50000/50000 [00:30<00:00, 1623.27it/s]
Finished [100%]: Average Loss = 5,780.7
</code></pre></div></div>
<p>To check whether the MCMC sampler converged, we can look at the value of each parameter along the sequence of samples. A value that stays stable along the chain, with the chains overlapping each other, suggests that the sampler converged.</p>
<p>There are two lines per parameter because two chains were sampled in parallel (one per CPU).</p>
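<p>Besides visual inspection, agreement between chains can be summarized numerically. Below is a minimal sketch of the basic (non-split) Gelman-Rubin statistic, often called R-hat, written with NumPy on simulated chains. The function name and the simulated data are hypothetical, and dedicated libraries implement more refined variants of this diagnostic.</p>

```python
import numpy as np


def gelman_rubin(chains):
    """Basic (non-split) R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_draws = chains.shape[1]
    # Average variance within each chain
    within = chains.var(axis=1, ddof=1).mean()
    # Variance between the chain means, scaled by the chain length
    between = n_draws * chains.mean(axis=1).var(ddof=1)
    # Pooled estimate of the posterior variance
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)


rng = np.random.default_rng(0)
# Two simulated chains sampling the same distribution
mixed = rng.normal(0.0, 1.0, size=(2, 4000))
# The same chains, but with the second one shifted (stuck elsewhere)
stuck = mixed + np.array([[0.0], [3.0]])

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(rhat_mixed, rhat_stuck)
```

<p>Values close to 1 indicate that the chains agree with each other, while values clearly above 1 indicate that they are exploring different regions.</p>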
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">axes</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">traceplot</span><span class="p">(</span><span class="n">zip_trace</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">].</span><span class="n">figure</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"droplet_counts.ZIP_MCMC_performance.svg"</span><span class="p">,</span>
<span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_37_0.png" alt="png" /></p>
<p>We can see that the λ parameter increases with device loading concentration. Ψ, however, does not follow a monotonic trend across loading concentrations.</p>
<p>In addition, the estimate of Ψ for the lowest loading concentration is quite broad. This should come as no surprise, as this distribution is in fact mostly zeros - separating the ones expected from a Poisson process from the “technical” ones is hard.</p>
<p>Let’s now have a look at the Variational Inference performance.</p>
<p>The objective is to maximize the evidence lower bound (ELBO). This effectively makes Bayesian inference an optimization problem.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">mu_ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">221</span><span class="p">)</span>
<span class="n">std_ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">222</span><span class="p">)</span>
<span class="n">hist_ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">212</span><span class="p">)</span>
<span class="n">mu_ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">zip_tracker</span><span class="p">[</span><span class="s">'mean'</span><span class="p">])</span>
<span class="n">mu_ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Mean'</span><span class="p">)</span>
<span class="n">std_ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">zip_tracker</span><span class="p">[</span><span class="s">'std'</span><span class="p">])</span>
<span class="n">std_ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'SD'</span><span class="p">)</span>
<span class="n">hist_ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">zip_advi</span><span class="p">.</span><span class="n">hist</span><span class="p">)</span>
<span class="n">hist_ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Negative ELBO'</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
<span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"droplet_counts.ZIP_VI_performance.svg"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_40_0.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # Let's gather the model parameters in a dataframe
</span><span class="n">zip_res</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">param</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'lam'</span><span class="p">,</span> <span class="s">'psi'</span><span class="p">]:</span>
<span class="k">for</span> <span class="n">red_func</span><span class="p">,</span> <span class="n">metric</span> <span class="ow">in</span> <span class="p">[(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">,</span> <span class="s">'_mean'</span><span class="p">),</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">,</span> <span class="s">'_std'</span><span class="p">)]:</span>
<span class="k">for</span> <span class="n">sampler</span><span class="p">,</span> <span class="n">s_label</span> <span class="ow">in</span> <span class="p">[(</span><span class="n">zip_trace</span><span class="p">,</span> <span class="s">'_mcmc'</span><span class="p">),</span> <span class="p">(</span><span class="n">zip_vi_trace</span><span class="p">,</span> <span class="s">'_vi'</span><span class="p">)]:</span>
<span class="n">zip_res</span><span class="p">[</span><span class="n">param</span> <span class="o">+</span> <span class="n">metric</span> <span class="o">+</span> <span class="n">s_label</span><span class="p">]</span> <span class="o">=</span> <span class="n">red_func</span><span class="p">(</span><span class="n">sampler</span><span class="p">[</span><span class="n">param</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">).</span><span class="n">flatten</span><span class="p">()</span>
<span class="n">zip_res</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">zip_res</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">experiments</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"loaded_nuclei"</span><span class="p">))</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
<span class="n">zip_res</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"droplet_counts.joint_model.ZIP_params.csv"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">zip_res</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>               lam_mean_mcmc  lam_mean_vi  lam_std_mcmc  lam_std_vi  \
loaded_nuclei
15300               0.437405     0.439764      0.085570    0.048505
191250              1.999454     1.988304      0.081287    0.080989
382500              3.321722     3.326178      0.082102    0.085014
765000              5.766861     5.757179      0.103082    0.111554
1530000             9.992216    10.021525      0.130423    0.140611

               psi_mean_mcmc  psi_mean_vi  psi_std_mcmc  psi_std_vi
loaded_nuclei
15300               0.482498     0.463390      0.091216    0.046608
191250              0.717861     0.717384      0.024540    0.024842
382500              0.949200     0.948621      0.012091    0.012536
765000              0.902959     0.903910      0.012053    0.013317
1530000             0.950930     0.950629      0.008769    0.009652
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # # Plot estimated lambda
</span><span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">"Nuclei loading counts modeled as output of a ZeroInflatedPoisson function"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span>
<span class="n">axis2</span> <span class="o">=</span> <span class="p">[</span><span class="n">ax</span><span class="p">.</span><span class="n">twinx</span><span class="p">()</span> <span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axis</span><span class="p">]</span>
<span class="c1"># axis[0].get_shared_y_axes().join(axis[1])
</span><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">method</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">([</span><span class="s">"mcmc"</span><span class="p">,</span> <span class="s">"vi"</span><span class="p">]):</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="n">method</span><span class="p">.</span><span class="n">upper</span><span class="p">(),</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">plot</span><span class="p">(</span>
        <span class="n">zip_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span>
        <span class="n">zip_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'lam_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
        <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">)</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">fill_between</span><span class="p">(</span>
        <span class="n">zip_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span>
        <span class="n">zip_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'lam_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">-</span> <span class="n">zip_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'lam_std_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span>
        <span class="n">zip_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'lam_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">+</span> <span class="n">zip_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'lam_std_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span>
        <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelcolor</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">axis2</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">plot</span><span class="p">(</span>
        <span class="n">zip_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span>
        <span class="n">zip_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'psi_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
        <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">)</span>
    <span class="n">axis2</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">fill_between</span><span class="p">(</span>
        <span class="n">zip_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span>
        <span class="n">zip_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'psi_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">-</span> <span class="n">zip_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'psi_std_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span>
        <span class="n">zip_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'psi_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">+</span> <span class="n">zip_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'psi_std_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span>
        <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
    <span class="n">axis2</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelcolor</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Nuclei loaded"</span><span class="p">)</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\lambda$"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">axis2</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\psi$"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">axis2</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
        <span class="s">"droplet_counts.ZIP_params.svg"</span><span class="p">,</span>
        <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_43_0.png" alt="png" /></p>
<p>Alright!</p>
<p>The plot on the left shows the MCMC estimates for the two parameters, and the one on the right shows the estimates from Variational Inference. They are really similar. The λ parameter looks almost (but not quite) linear within the range of observed values. This is no surprise, as the observed count distributions clearly show a shift towards more filled droplets and droplets with more cells.</p>
<p>As MCMC, given enough sampling, converges to the true posterior whereas Variational Inference only approximates it (and MCMC has other advantages, such as a generative process), I will mostly use the MCMC estimates. They seem to support the observed data well. We can also see that we have less confidence in the estimate of Ψ at the lowest loading concentration, as expected.</p>
<p>So overall we probably have good estimates of the parameters of a ZIP model.</p>
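<p>To put a number on "really similar", here is a minimal sketch that compares the posterior means from the two samplers. The values are copied by hand from the table printed above, and the <code class="language-plaintext highlighter-rouge">max_abs</code> and <code class="language-plaintext highlighter-rouge">max_rel</code> names are only for illustration.</p>

```python
import pandas as pd

# Posterior means copied from the printed zip_res table above.
zip_means = pd.DataFrame(
    {
        "lam_mean_mcmc": [0.437405, 1.999454, 3.321722, 5.766861, 9.992216],
        "lam_mean_vi": [0.439764, 1.988304, 3.326178, 5.757179, 10.021525],
        "psi_mean_mcmc": [0.482498, 0.717861, 0.949200, 0.902959, 0.950930],
        "psi_mean_vi": [0.463390, 0.717384, 0.948621, 0.903910, 0.950629],
    },
    index=pd.Index([15300, 191250, 382500, 765000, 1530000], name="loaded_nuclei"),
)

max_abs, max_rel = {}, {}
for param in ["lam", "psi"]:
    mcmc = zip_means[f"{param}_mean_mcmc"]
    vi = zip_means[f"{param}_mean_vi"]
    diff = (mcmc - vi).abs()
    # Largest absolute and relative (to MCMC) disagreement across loadings
    max_abs[param] = diff.max()
    max_rel[param] = (diff / mcmc).max()
    print(
        f"{param}: max |MCMC - VI| = {max_abs[param]:.4f} "
        f"at {diff.idxmax()} nuclei, max relative = {max_rel[param]:.2%}"
    )
```

<p>The largest disagreement is in Ψ at the lowest loading, which is also where both samplers report the widest spread.</p>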
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # # Plot estimated zero inflated component (PSI) vs observed zero fraction
</span><span class="n">fill_rate2</span> <span class="o">=</span> <span class="n">fill_rate</span><span class="p">.</span><span class="n">to_frame</span><span class="p">().</span><span class="n">join</span><span class="p">(</span><span class="n">zip_res</span><span class="p">[</span><span class="s">'psi_mean_mcmc'</span><span class="p">])</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">axis</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">fill_rate2</span><span class="p">[</span><span class="s">'fill_rate'</span><span class="p">],</span> <span class="n">fill_rate2</span><span class="p">[</span><span class="s">'psi_mean_mcmc'</span><span class="p">])</span>
<span class="n">axis</span><span class="p">.</span><span class="n">plot</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"grey"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_xlim</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Fill fraction observed</span><span class="se">\n</span><span class="s">(Technical and experimental fractions)"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Zero inflated Poisson's $\psi$ </span><span class="se">\n</span><span class="s">(Technical only fill fraction predicted)"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
        <span class="s">"droplet_counts.fill_fraction.measured_vs_ZIP_psi_param.svg"</span><span class="p">,</span>
        <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_45_0.png" alt="png" /></p>
<p>This shows that at high loading concentrations, the empty droplets are almost fully explained by a technical component rather than by the “normal” underlying Poisson process.</p>
<h3 id="using-the-zip-model-to-predict-collision-rates-in-scifi-rna-seq-data">Using the ZIP model to predict collision rates in scifi-RNA-seq data</h3>
<p>Now let’s predict collision rates as a function of loading rate, given a fixed number of barcodes in round 1.</p>
<p>We simply have to divide the number of input cells by the number of round 1 barcodes, since round 1 barcoding effectively reduces the number of collisions we care about by the same factor.</p>
<p>Then, we simply need to look up the respective ZIP parameters for that input and get the part of the cells-per-droplet distribution that is bigger than 1.</p>
<p>In theory the second step could be done simply by computing <code class="language-plaintext highlighter-rouge">1 - poisson(λ).cdf(x)</code>, but scipy does not implement a Zero Inflated Poisson distribution (so there is no <code class="language-plaintext highlighter-rouge">cdf</code> method to call) and PyMC3 also does not have a <code class="language-plaintext highlighter-rouge">logpmf</code> yet :(</p>
<p>We will try to sample from the distribution (using scipy’s <code class="language-plaintext highlighter-rouge">rvs</code> method) and set a number of those draws to 0, conditioned on draws from a Bernoulli distribution of parameter <code class="language-plaintext highlighter-rouge">PSI</code>, to simulate the zero inflated component of the ZIP.</p>
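<p>As a cross-check on the sampling approach, the tail of a ZIP can also be written in closed form. The sketch below assumes the PyMC3 convention in which Ψ is the expected proportion of Poisson draws, so that P(X &gt; 1) = Ψ · P(Poisson(λ) &gt; 1). The two helper functions are hypothetical names, not part of scipy or PyMC3, and the λ and Ψ values are taken from one row of the table printed further down.</p>

```python
import numpy as np
from scipy import stats


def zip_collision_rate(lam, psi):
    """Closed-form P(X > 1) for a ZIP where psi is the Poisson proportion."""
    # poisson.sf(1, lam) is P(Poisson(lam) > 1) = 1 - cdf(1)
    return psi * stats.poisson.sf(1, lam)


def zip_collision_rate_sampled(lam, psi, n=100_000, seed=0):
    """Monte Carlo estimate: Poisson draws masked by Bernoulli(psi) draws."""
    rng = np.random.default_rng(seed)
    keep = rng.random(n) < psi            # Bernoulli(psi): 1 keeps the Poisson draw
    counts = rng.poisson(lam, size=n) * keep
    return (counts > 1).mean()


# lambda and psi for 1 barcode and ~5.27e5 loaded nuclei (from the results table)
lam, psi = 4.267001, 0.991129
exact = zip_collision_rate(lam, psi)
sampled = zip_collision_rate_sampled(lam, psi)
print(f"closed form: {exact:.4f}, sampled: {sampled:.4f}")
```

<p>Both numbers should agree with the sampled collision rate for that row up to Monte Carlo noise.</p>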
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # Okay, so we interpolate over loading concentrations and get the values of each parameter
</span>
<span class="n">zip_lambda_f</span> <span class="o">=</span> <span class="n">scipy</span><span class="p">.</span><span class="n">interpolate</span><span class="p">.</span><span class="n">interp1d</span><span class="p">(</span>
    <span class="n">zip_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">zip_res</span><span class="p">[</span><span class="s">'lam_mean_mcmc'</span><span class="p">],</span>
    <span class="n">kind</span><span class="o">=</span><span class="s">'quadratic'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="s">"extrapolate"</span><span class="p">)</span>
<span class="n">zip_psi_f</span> <span class="o">=</span> <span class="n">scipy</span><span class="p">.</span><span class="n">interpolate</span><span class="p">.</span><span class="n">interp1d</span><span class="p">(</span>
    <span class="n">zip_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">zip_res</span><span class="p">[</span><span class="s">'psi_mean_mcmc'</span><span class="p">],</span>
    <span class="n">kind</span><span class="o">=</span><span class="s">'quadratic'</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="s">"extrapolate"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # Now we use those functions to predict across a range
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="n">zip_res</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span>
<span class="n">zip_lambda_y</span> <span class="o">=</span> <span class="n">zip_lambda_f</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">zip_psi_y</span> <span class="o">=</span> <span class="n">zip_psi_f</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # # Plot estimated and interpolated lambda and psi
</span><span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">axis</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
    <span class="n">zip_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">zip_res</span><span class="p">[</span><span class="s">'lam_mean_mcmc'</span><span class="p">],</span>
    <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">"o"</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">zip_lambda_y</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"-"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelcolor</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axis2</span> <span class="o">=</span> <span class="n">axis</span><span class="p">.</span><span class="n">twinx</span><span class="p">()</span>
<span class="n">axis2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
    <span class="n">zip_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">zip_res</span><span class="p">[</span><span class="s">'psi_mean_mcmc'</span><span class="p">],</span>
    <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">"o"</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">)</span>
<span class="n">axis2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">zip_psi_y</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"-"</span><span class="p">)</span>
<span class="n">axis2</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelcolor</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Nuclei loaded"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\lambda$"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axis2</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\psi$"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">axis2</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
        <span class="s">"droplet_counts.ZIP_params.interpolated.svg"</span><span class="p">,</span>
        <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_50_0.png" alt="png" /></p>
<p>So the quadratic interpolation over Ψ does not look great, but it is fine in the lower range and probably good enough for the rest.</p>
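<p>If the overshoot in Ψ becomes a problem, one alternative (not what is used in the rest of this post) is a shape-preserving interpolator such as scipy's <code class="language-plaintext highlighter-rouge">PchipInterpolator</code>, which by construction does not overshoot the data between observed points. A minimal sketch, with the Ψ means copied by hand from the table above:</p>

```python
import numpy as np
from scipy.interpolate import PchipInterpolator, interp1d

# MCMC posterior means of psi, copied from the printed zip_res table.
loaded = np.array([15300, 191250, 382500, 765000, 1530000], dtype=float)
psi_mean = np.array([0.482498, 0.717861, 0.949200, 0.902959, 0.950930])

# Evaluate both interpolators only within the observed loading range.
x_grid = np.linspace(loaded.min(), loaded.max(), 200)
psi_quadratic = interp1d(loaded, psi_mean, kind="quadratic")(x_grid)
psi_pchip = PchipInterpolator(loaded, psi_mean)(x_grid)

print("observed range: ", psi_mean.min(), psi_mean.max())
print("quadratic range:", psi_quadratic.min(), psi_quadratic.max())
print("PCHIP range:    ", psi_pchip.min(), psi_pchip.max())
```

<p>Note that this only addresses behaviour inside the observed range. Extrapolating below the lowest loading is still a guess with either method, so bounding Ψ to [0, 1] remains necessary.</p>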
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # Now we sample across the X
# x = np.linspace(1000, zip_res.index.max(), 10)
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">logspace</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mf">6.5</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">base</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mf">1e5</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="n">barcode_combos</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">96</span><span class="p">,</span> <span class="mi">96</span> <span class="o">*</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">96</span> <span class="o">*</span> <span class="mi">16</span><span class="p">]</span>
<span class="k">for</span> <span class="n">barcodes</span> <span class="ow">in</span> <span class="n">barcode_combos</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">input_nuclei</span> <span class="ow">in</span> <span class="n">x</span><span class="p">:</span>
        <span class="n">psi</span> <span class="o">=</span> <span class="n">zip_psi_f</span><span class="p">([</span><span class="n">input_nuclei</span> <span class="o">/</span> <span class="n">barcodes</span><span class="p">])[</span><span class="mi">0</span><span class="p">]</span>
        <span class="c1"># bounding PSI to [0, 1] since the extrapolation is unbounded
</span>        <span class="n">psi</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">psi</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
        <span class="n">psi</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">psi</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
        <span class="n">lamb</span> <span class="o">=</span> <span class="n">zip_lambda_f</span><span class="p">([</span><span class="n">input_nuclei</span> <span class="o">/</span> <span class="n">barcodes</span><span class="p">])[</span><span class="mi">0</span><span class="p">]</span>
        <span class="c1"># s = scipy.stats.bernoulli(psi).rvs(n) * scipy.stats.poisson(lamb).rvs(n)
</span>        <span class="n">s</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">ZeroInflatedPoisson</span><span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">psi</span><span class="o">=</span><span class="n">psi</span><span class="p">,</span> <span class="n">theta</span><span class="o">=</span><span class="n">lamb</span><span class="p">).</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
        <span class="c1"># Now we simply count the fraction of collisions (droplets with more than one nucleus)
</span>        <span class="n">f</span><span class="p">[(</span><span class="n">barcodes</span><span class="p">,</span> <span class="n">input_nuclei</span><span class="p">)]</span> <span class="o">=</span> <span class="p">[</span><span class="n">lamb</span><span class="p">,</span> <span class="n">psi</span><span class="p">,</span> <span class="p">(</span><span class="n">s</span> <span class="o">></span> <span class="mi">1</span><span class="p">).</span><span class="nb">sum</span><span class="p">()</span> <span class="o">/</span> <span class="n">n</span><span class="p">]</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"lambda"</span><span class="p">,</span> <span class="s">"psi"</span><span class="p">,</span> <span class="s">"collision_rate"</span><span class="p">]).</span><span class="n">T</span>
<span class="n">f</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">names</span> <span class="o">=</span> <span class="p">[</span><span class="s">'barcodes'</span><span class="p">,</span> <span class="s">'loaded_nuclei'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                           lambda       psi  collision_rate
barcodes loaded_nuclei
1        1.000000e+03    0.293869  0.463584         0.01689
         2.448437e+03    0.308521  0.465498         0.01798
         5.994843e+03    0.344288  0.470187         0.02115
         1.467799e+04    0.431214  0.481675         0.03351
         3.593814e+04    0.640165  0.509853         0.06886
         8.799225e+04    1.128507  0.579146         0.17910
         2.154435e+05    2.184723  0.750606         0.48059
         5.274997e+05    4.267001  0.991129         0.91848
         1.291550e+06    8.774839  0.877488         0.87683
         3.162278e+06   15.901714  1.000000         1.00000
96       1.000000e+03    0.283844  0.462276         0.01548
         2.448437e+03    0.283997  0.462296         0.01524
         5.994843e+03    0.284371  0.462345         0.01469
         1.467799e+04    0.285288  0.462464         0.01520
         3.593814e+04    0.287532  0.462757         0.01688
         8.799225e+04    0.293024  0.463473         0.01612
         2.154435e+05    0.306457  0.465228         0.01714
         5.274997e+05    0.339254  0.469525         0.02150
         1.291550e+06    0.419013  0.480054         0.03157
         3.162278e+06    0.611036  0.505875         0.06396
384      1.000000e+03    0.283765  0.462266         0.01506
         2.448437e+03    0.283803  0.462271         0.01581
         5.994843e+03    0.283897  0.462283         0.01554
         1.467799e+04    0.284126  0.462313         0.01492
         3.593814e+04    0.284687  0.462386         0.01578
         8.799225e+04    0.286061  0.462565         0.01562
         2.154435e+05    0.289424  0.463004         0.01621
         5.274997e+05    0.297652  0.464077         0.01689
         1.291550e+06    0.317764  0.466707         0.01923
         3.162278e+06    0.366803  0.473149         0.02625
1536     1.000000e+03    0.283745  0.462263         0.01516
         2.448437e+03    0.283754  0.462264         0.01518
         5.994843e+03    0.283778  0.462267         0.01534
         1.467799e+04    0.283835  0.462275         0.01573
         3.593814e+04    0.283975  0.462293         0.01540
         8.799225e+04    0.284319  0.462338         0.01550
         2.154435e+05    0.285160  0.462447         0.01570
         5.274997e+05    0.287219  0.462716         0.01567
         1.291550e+06    0.292258  0.463373         0.01668
         3.162278e+06    0.304582  0.464983         0.01789
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span> <span class="o">*</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">fig</span><span class="p">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">"Prediction with Zero Inflated Poisson"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span>
<span class="k">for</span> <span class="n">barcodes</span> <span class="ow">in</span> <span class="n">barcode_combos</span><span class="p">:</span>
    <span class="n">axis</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
        <span class="n">f</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">barcodes</span><span class="p">].</span><span class="n">index</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">barcodes</span><span class="p">,</span> <span class="s">'collision_rate'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">100</span><span class="p">,</span>
        <span class="n">linestyle</span><span class="o">=</span><span class="s">"-"</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">"o"</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">barcodes</span><span class="si">}</span><span class="s"> round1 barcodes"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_xscale</span><span class="p">(</span><span class="s">"log"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_yscale</span><span class="p">(</span><span class="s">"log"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Nuclei loaded"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"% collisions"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="k">for</span> <span class="n">x_</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">]:</span>
    <span class="n">axis</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="n">x_</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"grey"</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">y_</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">10500</span><span class="p">,</span> <span class="mi">191000</span><span class="p">,</span> <span class="mi">383000</span><span class="p">,</span> <span class="mi">765000</span><span class="p">,</span> <span class="mi">1530000</span><span class="p">]:</span>
    <span class="n">axis</span><span class="p">.</span><span class="n">axvline</span><span class="p">(</span><span class="n">y_</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"grey"</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
        <span class="s">"droplet_counts.ZIP_params.prediction_of_collision_rate.svg"</span><span class="p">,</span>
        <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_53_0.png" alt="png" /></p>
<h2 id="zero-inflated-negative-binomial-distribution">Zero-inflated Negative binomial distribution</h2>
<p>We saw previously that the Poisson distribution (or the Zero-inflated Poisson) might not be the best for these data, since the ratio between variance and mean is not 1.</p>
<p>A related distribution that models mean and variance with two parameters is the Negative Binomial distribution. Just as we added a zero-inflated component to get the ZIP, we can also have a Zero-Inflated Negative Binomial (ZINB) distribution. The ZINB has a μ (mu) parameter to model the mean, an α (alpha) parameter for the dispersion, and a Ψ (psi) parameter for the zero-inflated component.</p>
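<p>As a quick illustration of that check, here is a minimal sketch that computes the variance-to-mean ratio of a set of counts. The function name <code class="language-plaintext highlighter-rouge">dispersion_index</code> and the toy data are made up for this example and are not part of the analysis above; a value close to 1 is what Poisson data would give, and values well above 1 indicate overdispersion.</p>

```python
import random
import statistics


def dispersion_index(counts):
    """Variance-to-mean ratio of a sample of counts (close to 1 for Poisson data)."""
    mean = statistics.fmean(counts)
    if mean == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return statistics.pvariance(counts) / mean


# Hypothetical data: mostly empty droplets plus a few heavily loaded ones.
rng = random.Random(0)
toy_counts = [0] * 900 + [rng.randint(1, 12) for _ in range(100)]
print(f"variance / mean = {dispersion_index(toy_counts):.2f}")
```

<p>The same function could be applied to the observed counts of each loading concentration separately.</p>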
<p>My interpretation of Ψ is the same as in ZIP - the main difference between the two distributions is in how variation is handled.</p>
<p>Here’s this model (again from bottom to top):</p>
<ul>
<li>the observed count data comes from a Zero-Inflated Negative Binomial distribution, of parameters mu, alpha and psi;</li>
<li>the μ (mu) parameter, just like in the ZIP distribution, comes from an Exponential distribution (λ > 0) for which we impose as prior knowledge the mean number of cells per droplet within a given experiment;</li>
<li>the α (alpha) parameter is Gamma distributed, which is in turn parametrized with α (shape; α > 0) and β (rate; β > 0). For the priors, here I’m a bit at a loss, so for α I use the standard deviation of the observed counts and an uninformative value of 5 for β;</li>
<li>the Ψ (psi) parameter comes from a Uniform distribution (0 < Ψ < 1) and we don’t impose any further prior knowledge on it;</li>
<li>each of these parameters has a shape of <code class="language-plaintext highlighter-rouge">n_exp</code>, which is the number of experiments/loading concentrations (i.e. they will be estimated for each loading concentration separately).</li>
</ul>
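<p>To make the role of the three parameters more concrete, here is a small standard-library sketch of the ZINB probability mass function. It assumes the convention in which Ψ is the weight of the Negative Binomial component (which, as far as I understand, is the one PyMC3 uses) and a Negative Binomial with mean μ and variance μ + μ²/α. The function names and the parameter values are hypothetical and only serve the illustration.</p>

```python
import math


def nb_logpmf(k, mu, alpha):
    """Log-pmf of a Negative Binomial with mean mu and dispersion alpha
    (variance = mu + mu**2 / alpha)."""
    return (
        math.lgamma(k + alpha) - math.lgamma(alpha) - math.lgamma(k + 1)
        + alpha * math.log(alpha / (alpha + mu))
        + k * math.log(mu / (alpha + mu))
    )


def zinb_pmf(k, psi, mu, alpha):
    """Mixture: with probability psi draw from the NB, otherwise emit an extra zero."""
    p = psi * math.exp(nb_logpmf(k, mu, alpha))
    return p + (1.0 - psi) if k == 0 else p


# Hypothetical parameter values, not estimates from the data.
psi, mu, alpha = 0.9, 3.0, 5.0
probs = [zinb_pmf(k, psi, mu, alpha) for k in range(200)]
zinb_mean = sum(k * p for k, p in enumerate(probs))
print(f"P(0) = {probs[0]:.4f}")
print(f"total mass = {sum(probs):.6f}")
print(f"mean = {zinb_mean:.4f} (psi * mu = {psi * mu:.4f})")
```

<p>Under this convention the mean of the mixture is Ψ·μ, so a smaller Ψ pulls the expected count towards zero without changing the shape of the non-zero part.</p>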
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">pm</span><span class="p">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">zinb_model</span><span class="p">:</span>
    <span class="n">psi</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Uniform</span><span class="p">(</span><span class="s">'psi'</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_exp</span><span class="p">))</span>
    <span class="n">mu</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Exponential</span><span class="p">(</span><span class="s">'mu'</span><span class="p">,</span> <span class="n">lam</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">counts</span><span class="p">),</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_exp</span><span class="p">))</span>
    <span class="n">alpha</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">Gamma</span><span class="p">(</span><span class="s">'alpha'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">counts</span><span class="p">),</span> <span class="n">beta</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_exp</span><span class="p">))</span>
    <span class="n">nb</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">ZeroInflatedNegativeBinomial</span><span class="p">(</span><span class="s">'nb'</span><span class="p">,</span> <span class="n">psi</span><span class="o">=</span><span class="n">psi</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="n">mu</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">,</span>
        <span class="n">observed</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">counts</span><span class="p">).</span><span class="n">T</span><span class="p">.</span><span class="n">values</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># To fit the model, let's use MCMC to sample, because it is tractable
</span><span class="k">with</span> <span class="n">zinb_model</span><span class="p">:</span>
    <span class="n">zinb_trace</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="o">**</span><span class="n">sampler_params</span><span class="p">)[</span><span class="n">TAKE_AFTER</span><span class="p">:]</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [alpha, mu, psi]
Sampling 2 chains, 52 divergences: 100%|██████████| 20000/20000 [01:45<00:00, 188.68draws/s]
There were 49 divergences after tuning. Increase `target_accept` or reparameterize.
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">zinb_model</span><span class="p">:</span>
    <span class="n">zinb_ppc_trace</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">sample_posterior_predictive</span><span class="p">(</span>
        <span class="n">zinb_trace</span><span class="p">,</span> <span class="n">sampler_params</span><span class="p">[</span><span class="s">'draws'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">random_seed</span><span class="o">=</span><span class="n">sampler_params</span><span class="p">[</span><span class="s">'random_seed'</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>100%|██████████| 10000/10000 [00:11<00:00, 850.79it/s]
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Just for fun, let's do Variational Inference too
</span><span class="kn">from</span> <span class="nn">pymc3.variational.callbacks</span> <span class="kn">import</span> <span class="n">CheckParametersConvergence</span>
<span class="k">with</span> <span class="n">zinb_model</span><span class="p">:</span>
    <span class="n">zinb_advi</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">ADVI</span><span class="p">()</span>
    <span class="n">zinb_tracker</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">callbacks</span><span class="p">.</span><span class="n">Tracker</span><span class="p">(</span>
        <span class="n">mean</span><span class="o">=</span><span class="n">zinb_advi</span><span class="p">.</span><span class="n">approx</span><span class="p">.</span><span class="n">mean</span><span class="p">.</span><span class="nb">eval</span><span class="p">,</span>
        <span class="n">std</span><span class="o">=</span><span class="n">zinb_advi</span><span class="p">.</span><span class="n">approx</span><span class="p">.</span><span class="n">std</span><span class="p">.</span><span class="nb">eval</span>
    <span class="p">)</span>
    <span class="n">zinb_mean_field</span> <span class="o">=</span> <span class="n">zinb_advi</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span>
        <span class="n">n</span><span class="o">=</span><span class="n">vi_params</span><span class="p">[</span><span class="s">'n'</span><span class="p">],</span> <span class="n">callbacks</span><span class="o">=</span><span class="p">[</span><span class="n">CheckParametersConvergence</span><span class="p">(),</span> <span class="n">zinb_tracker</span><span class="p">])</span>
<span class="n">zinb_vi_trace</span> <span class="o">=</span> <span class="n">zinb_mean_field</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">draws</span><span class="o">=</span><span class="n">sampler_params</span><span class="p">[</span><span class="s">'draws'</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Average Loss = 5,924.6: 100%|██████████| 50000/50000 [00:53<00:00, 935.31it/s]
Finished [100%]: Average Loss = 5,924.6
</code></pre></div></div>
<p>Let’s inspect the model fits.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">axes</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">traceplot</span><span class="p">(</span><span class="n">zinb_trace</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">].</span><span class="n">figure</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"droplet_counts.joint_model.ZINB_sampling.svg"</span><span class="p">,</span>
        <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_60_0.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">mu_ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">221</span><span class="p">)</span>
<span class="n">std_ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">222</span><span class="p">)</span>
<span class="n">hist_ax</span> <span class="o">=</span> <span class="n">fig</span><span class="p">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">212</span><span class="p">)</span>
<span class="n">mu_ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">zinb_tracker</span><span class="p">[</span><span class="s">'mean'</span><span class="p">])</span>
<span class="n">mu_ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Mean'</span><span class="p">)</span>
<span class="n">std_ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">zinb_tracker</span><span class="p">[</span><span class="s">'std'</span><span class="p">])</span>
<span class="n">std_ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'SD'</span><span class="p">)</span>
<span class="n">hist_ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">zinb_advi</span><span class="p">.</span><span class="n">hist</span><span class="p">)</span>
<span class="n">hist_ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Negative ELBO'</span><span class="p">);</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"droplet_counts.ZINB_VI_performance.svg"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_61_0.png" alt="png" /></p>
<p>Okay, it didn’t converge fully but it’s probably good enough.</p>
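<p>One rough way to put a number on “good enough” is to compare the average loss over the last two windows of iterations. The sketch below is only an illustration under that assumption: the helper <code class="language-plaintext highlighter-rouge">relative_change</code>, the window size, and the simulated loss curve are all hypothetical and are not part of PyMC3.</p>

```python
import random
import statistics


def relative_change(loss_history, window=1000):
    """Relative difference between the mean loss of the last two windows."""
    if len(loss_history) < 2 * window:
        raise ValueError("need at least two full windows of loss values")
    previous = statistics.fmean(loss_history[-2 * window:-window])
    latest = statistics.fmean(loss_history[-window:])
    return abs(latest - previous) / abs(previous)


# Hypothetical loss curve: exponential decay towards a plateau, plus noise.
rng = random.Random(0)
toy_hist = [5900 + 4000 * 0.999 ** i + rng.gauss(0, 5) for i in range(10000)]
print(f"relative change between the last two windows: {relative_change(toy_hist):.2e}")
```

<p>In principle the same helper could be called on the loss history stored in <code class="language-plaintext highlighter-rouge">zinb_advi.hist</code>; what counts as a small enough change is a judgment call.</p>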
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Let's collect the model posteriors
</span><span class="n">zinb_res</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">param</span> <span class="ow">in</span> <span class="p">[</span><span class="s">'mu'</span><span class="p">,</span> <span class="s">'alpha'</span><span class="p">,</span> <span class="s">'psi'</span><span class="p">]:</span>
    <span class="k">for</span> <span class="n">red_func</span><span class="p">,</span> <span class="n">metric</span> <span class="ow">in</span> <span class="p">[(</span><span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">,</span> <span class="s">'_mean'</span><span class="p">),</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">,</span> <span class="s">'_std'</span><span class="p">)]:</span>
        <span class="k">for</span> <span class="n">sampler</span><span class="p">,</span> <span class="n">s_label</span> <span class="ow">in</span> <span class="p">[(</span><span class="n">zinb_trace</span><span class="p">,</span> <span class="s">'_mcmc'</span><span class="p">),</span> <span class="p">(</span><span class="n">zinb_vi_trace</span><span class="p">,</span> <span class="s">'_vi'</span><span class="p">)]:</span>
            <span class="n">zinb_res</span><span class="p">[</span><span class="n">param</span> <span class="o">+</span> <span class="n">metric</span> <span class="o">+</span> <span class="n">s_label</span><span class="p">]</span> <span class="o">=</span> <span class="n">red_func</span><span class="p">(</span><span class="n">sampler</span><span class="p">[</span><span class="n">param</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">).</span><span class="n">flatten</span><span class="p">()</span>
<span class="n">zinb_res</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">zinb_res</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">experiments</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"loaded_nuclei"</span><span class="p">))</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="n">zinb_res</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"droplet_counts.joint_model.ZINB_params.csv"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">zinb_res</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>               mu_mean_mcmc  mu_mean_vi  mu_std_mcmc  mu_std_vi  \
loaded_nuclei
15300              0.252255    0.240411     0.047009   0.027336
191250             1.552145    1.539827     0.105615   0.077658
382500             3.181334    3.171134     0.093812   0.097421
765000             5.687228    5.676871     0.135440   0.144371
1530000            9.929232    9.973743     0.173669   0.190277

               alpha_mean_mcmc  alpha_mean_vi  alpha_std_mcmc  alpha_std_vi  \
loaded_nuclei
15300                 0.885246       0.846011        0.299376      0.268840
191250                1.785842       1.743211        0.354235      0.264529
382500                5.559653       5.543956        0.709964      0.734716
765000                8.286683       8.303245        0.914013      0.970718
1530000              11.836782      11.840635        1.064922      1.160125

               psi_mean_mcmc  psi_mean_vi  psi_std_mcmc  psi_std_vi
loaded_nuclei
15300               0.830984     0.857926      0.115343    0.079732
191250              0.923597     0.925170      0.045748    0.035024
382500              0.987763     0.985989      0.009495    0.015404
765000              0.912406     0.912225      0.012326    0.013725
1530000             0.951503     0.952088      0.008837    0.009638
</code></pre></div></div>
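<p>To see how closely the two inference methods agree, here is a small sketch that computes the relative difference between the MCMC and VI posterior means of μ. The numbers are copied from the table above; the helper <code class="language-plaintext highlighter-rouge">relative_difference</code> is introduced just for this example.</p>

```python
# Posterior means of mu copied from the table above: loaded nuclei -> (MCMC, VI).
mu_means = {
    15300: (0.252255, 0.240411),
    191250: (1.552145, 1.539827),
    382500: (3.181334, 3.171134),
    765000: (5.687228, 5.676871),
    1530000: (9.929232, 9.973743),
}


def relative_difference(a, b):
    """Absolute difference between a and b, relative to a."""
    return abs(a - b) / abs(a)


rel_diff = {n: relative_difference(mcmc, vi) for n, (mcmc, vi) in mu_means.items()}
for loaded, diff in rel_diff.items():
    print(f"{loaded:>8d} nuclei: MCMC and VI differ by {diff:.1%}")
```

<p>The same comparison could be repeated for α and Ψ by swapping in the corresponding columns.</p>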
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # # Plot posterior
</span><span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">suptitle</span><span class="p">(</span>
<span class="s">"Nuclei loading counts modeled as output of a ZeroInflatedNegativeBinomial function"</span><span class="p">,</span>
<span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span>
<span class="n">axis2</span> <span class="o">=</span> <span class="p">[</span><span class="n">ax</span><span class="p">.</span><span class="n">twinx</span><span class="p">()</span> <span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axis</span><span class="p">]</span>
<span class="n">axis3</span> <span class="o">=</span> <span class="p">[</span><span class="n">ax</span><span class="p">.</span><span class="n">twinx</span><span class="p">()</span> <span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axis2</span><span class="p">]</span>
<span class="c1"># axis[0].get_shared_y_axes().join(axis[1])
</span><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">method</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">([</span><span class="s">"mcmc"</span><span class="p">,</span> <span class="s">"vi"</span><span class="p">]):</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="n">method</span><span class="p">.</span><span class="n">upper</span><span class="p">(),</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">plot</span><span class="p">(</span>
        <span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span>
        <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'mu_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
        <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">)</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">fill_between</span><span class="p">(</span>
        <span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span>
        <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'mu_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">-</span> <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'mu_std_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
        <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'mu_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">+</span> <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'mu_std_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
        <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelcolor</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">axis2</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">plot</span><span class="p">(</span>
        <span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span>
        <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'alpha_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
        <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">)</span>
    <span class="n">axis2</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">fill_between</span><span class="p">(</span>
        <span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span>
        <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'alpha_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">-</span> <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'alpha_std_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
        <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'alpha_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">+</span> <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'alpha_std_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
        <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
    <span class="n">axis2</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelcolor</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">axis3</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">plot</span><span class="p">(</span>
        <span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span>
        <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'psi_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
        <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'o'</span><span class="p">)</span>
    <span class="n">axis3</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">fill_between</span><span class="p">(</span>
        <span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span>
        <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'psi_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">-</span> <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'psi_std_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
        <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'psi_mean_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">]</span> <span class="o">+</span> <span class="n">zinb_res</span><span class="p">[</span><span class="sa">f</span><span class="s">'psi_std_</span><span class="si">{</span><span class="n">method</span><span class="si">}</span><span class="s">'</span><span class="p">],</span>
        <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
    <span class="n">axis3</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelcolor</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Nuclei loaded"</span><span class="p">)</span>
    <span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\mu$"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">axis2</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\alpha$"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">axis3</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\psi$"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
    <span class="n">axis3</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
        <span class="s">"droplet_counts.ZINB_params.svg"</span><span class="p">,</span>
        <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_64_0.png" alt="png" /></p>
<p>Okay!</p>
<p>VI and MCMC again roughly agree. The estimate of μ continues to increase, similar to the λ parameter in the ZIP model.</p>
<p>The dispersion parameter α is, however, much harder to estimate. Even so, it is often both larger than μ and increasing with nuclei loading, which are the properties we expect from it.</p>
<p>I will again use MCMC over VI for the ZINB model posterior parameter estimates.</p>
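<p>To make the role of α concrete, here is a minimal sketch of the mean/variance relationship of a negative binomial in the mean/dispersion parameterization (variance = μ + μ²/α), which to my knowledge is the one PyMC uses. The values of <code class="language-plaintext highlighter-rouge">mu</code> and <code class="language-plaintext highlighter-rouge">alpha</code> and the helper <code class="language-plaintext highlighter-rouge">nb_variance</code> are illustrative assumptions, not posterior estimates from the notebook.</p>

```python
import numpy as np


def nb_variance(mu, alpha):
    """Variance of a negative binomial with mean `mu` and dispersion `alpha`
    in the mean/dispersion parameterization: var = mu + mu**2 / alpha."""
    return mu + mu ** 2 / alpha


rng = np.random.default_rng(0)
mu, alpha = 2.0, 1.5  # illustrative values, not estimates from the data

# NumPy uses the (n, p) parameterization; n = alpha and p = alpha / (alpha + mu)
# describe the same distribution as the mean/dispersion form.
draws = rng.negative_binomial(n=alpha, p=alpha / (alpha + mu), size=200_000)

empirical_mean = draws.mean()
empirical_var = draws.var()
print(empirical_mean, empirical_var, nb_variance(mu, alpha))
```

<p>The variance always exceeds the mean, and the excess μ²/α shrinks as α grows, so a larger α means counts that are closer to Poisson.</p>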
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Analogously, we can again see the mean/variance relationship on the posterior
</span><span class="n">lm_zinb</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="n">lm_zinb</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span>
<span class="n">zinb_res</span><span class="p">[</span><span class="s">'mu_mean_mcmc'</span><span class="p">].</span><span class="n">values</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)),</span>
<span class="n">zinb_res</span><span class="p">[</span><span class="s">'alpha_mean_mcmc'</span><span class="p">].</span><span class="n">values</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">lm_zinb</span><span class="p">.</span><span class="n">coef_</span><span class="p">,</span> <span class="n">lm_zinb</span><span class="p">.</span><span class="n">intercept_</span><span class="p">)</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">axis</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">zinb_res</span><span class="p">[</span><span class="s">'mu_mean_mcmc'</span><span class="p">],</span> <span class="n">zinb_res</span><span class="p">[</span><span class="s">'alpha_mean_mcmc'</span><span class="p">])</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">zinb_res</span><span class="p">[</span><span class="s">'mu_mean_mcmc'</span><span class="p">].</span><span class="nb">min</span><span class="p">(),</span> <span class="n">zinb_res</span><span class="p">[</span><span class="s">'mu_mean_mcmc'</span><span class="p">].</span><span class="nb">max</span><span class="p">())</span>
<span class="n">axis</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">lm_zinb</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))),</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"grey"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"ZINB posterior"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\mu$"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\alpha$"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1.16834092] 0.8567637198540723
Text(0, 0.5, '$\\alpha$')
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_66_2.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # So first we interpolate over a range of loading concentrations and get the parameter values
# interpolation_type = "slinear"
</span><span class="n">interpolation_type</span> <span class="o">=</span> <span class="s">"quadratic"</span>
<span class="n">zinb_mu_f</span> <span class="o">=</span> <span class="n">scipy</span><span class="p">.</span><span class="n">interpolate</span><span class="p">.</span><span class="n">interp1d</span><span class="p">(</span>
<span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">zinb_res</span><span class="p">[</span><span class="s">'mu_mean_mcmc'</span><span class="p">],</span>
<span class="n">kind</span><span class="o">=</span><span class="n">interpolation_type</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="s">"extrapolate"</span><span class="p">)</span>
<span class="n">zinb_alpha_f</span> <span class="o">=</span> <span class="n">scipy</span><span class="p">.</span><span class="n">interpolate</span><span class="p">.</span><span class="n">interp1d</span><span class="p">(</span>
<span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">zinb_res</span><span class="p">[</span><span class="s">'alpha_mean_mcmc'</span><span class="p">],</span>
<span class="n">kind</span><span class="o">=</span><span class="n">interpolation_type</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="s">"extrapolate"</span><span class="p">)</span>
<span class="n">zinb_psi_f</span> <span class="o">=</span> <span class="n">scipy</span><span class="p">.</span><span class="n">interpolate</span><span class="p">.</span><span class="n">interp1d</span><span class="p">(</span>
<span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">zinb_res</span><span class="p">[</span><span class="s">'psi_mean_mcmc'</span><span class="p">],</span>
<span class="n">kind</span><span class="o">=</span><span class="n">interpolation_type</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="s">"extrapolate"</span><span class="p">)</span>
</code></pre></div></div>
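<p>One caveat with <code class="language-plaintext highlighter-rouge">fill_value="extrapolate"</code> is that a quadratic interpolant keeps following its parabola outside the observed range, so a quantity like ψ can leave [0, 1]. The sketch below uses three made-up points (not the notebook's estimates) to show the effect and one way to clip the result.</p>

```python
import numpy as np
from scipy.interpolate import interp1d

# Toy psi-like values at three loadings (made up for illustration)
loading = np.array([1.0, 2.0, 3.0])
psi_obs = np.array([0.80, 0.85, 0.95])

psi_f = interp1d(loading, psi_obs, kind="quadratic", fill_value="extrapolate")

inside = float(psi_f(2.5))    # within the observed range
outside = float(psi_f(10.0))  # far outside: the parabola keeps growing

# Clip to the open interval (0, 1) before using the value as a probability
eps = np.finfo(float).eps
outside_clipped = float(np.clip(outside, eps, 1 - eps))
print(inside, outside, outside_clipped)
```

<p>With these toy points the extrapolated value at 10 is well above 1, which is why any extrapolated ψ has to be bounded before it is passed to a distribution.</p>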
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Now generate across range
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="nb">max</span><span class="p">())</span>
<span class="n">zinb_mu_y</span> <span class="o">=</span> <span class="n">zinb_mu_f</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">zinb_alpha_y</span> <span class="o">=</span> <span class="n">zinb_alpha_f</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">zinb_psi_y</span> <span class="o">=</span> <span class="n">zinb_psi_f</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># # With linear regression
# zinb_mu_f = LinearRegression()
# zinb_mu_f.fit(zinb_res.index.values.reshape((-1, 1)), zinb_res['mu_mean_mcmc'].values)
# zinb_alpha_f = LinearRegression()
# zinb_alpha_f.fit(zinb_res.index.values.reshape((-1, 1)), zinb_res['alpha_mean_mcmc'].values)
# zinb_psi_f = LinearRegression()
# zinb_psi_f.fit(zinb_res.index.values.reshape((-1, 1)), zinb_res['psi_mean_mcmc'].values)
# x = np.linspace(1000, zinb_res.index.max()).reshape((-1, 1))
# zinb_mu_y = zinb_mu_f.predict(x)
# zinb_alpha_y = zinb_alpha_f.predict(x)
# zinb_psi_y = zinb_psi_f.predict(x)
</span></code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Plot estimated and interpolated ZINB parameters (mu, alpha, psi)
</span><span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">axis</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
<span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">zinb_res</span><span class="p">[</span><span class="s">'mu_mean_mcmc'</span><span class="p">],</span>
<span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">"o"</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">zinb_mu_y</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"-"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelcolor</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axis2</span> <span class="o">=</span> <span class="n">axis</span><span class="p">.</span><span class="n">twinx</span><span class="p">()</span>
<span class="n">axis2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
<span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">zinb_res</span><span class="p">[</span><span class="s">'alpha_mean_mcmc'</span><span class="p">],</span>
<span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">"o"</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">)</span>
<span class="n">axis2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">zinb_alpha_y</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"-"</span><span class="p">)</span>
<span class="n">axis2</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelcolor</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">axis3</span> <span class="o">=</span> <span class="n">axis</span><span class="p">.</span><span class="n">twinx</span><span class="p">()</span>
<span class="n">axis3</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
<span class="n">zinb_res</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">zinb_res</span><span class="p">[</span><span class="s">'psi_mean_mcmc'</span><span class="p">],</span>
<span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">"o"</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">)</span>
<span class="n">axis3</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">zinb_psi_y</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"-"</span><span class="p">)</span>
<span class="n">axis3</span><span class="p">.</span><span class="n">tick_params</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">labelcolor</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Nuclei loaded"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\mu$"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axis2</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\alpha$"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="n">axis3</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="sa">r</span><span class="s">"$\psi$"</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span>
<span class="c1"># axis2.set_ylim(0, 1)
</span><span class="n">axis3</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
    <span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
        <span class="s">"droplet_counts.ZINB_params.interpolated.svg"</span><span class="p">,</span>
        <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_70_0.png" alt="png" /></p>
<h3 id="using-the-zinb-model-to-predict-collision-rates-in-scifi-rna-seq-data">Using the ZINB model to predict collision rates in scifi-RNA-seq data</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sys</span>
<span class="c1"># # Now we sample across the X
# x = np.linspace(zinb_res.index.min(), zinb_res.index.max(), 10)
# x = np.linspace(zinb_res.index.min(), zinb_res.index.max() * 4, 10)
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">logspace</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mf">6.5</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">base</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mf">1e5</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="n">barcode_combos</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">96</span><span class="p">,</span> <span class="mi">96</span> <span class="o">*</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">96</span> <span class="o">*</span> <span class="mi">16</span><span class="p">]</span>
<span class="k">for</span> <span class="n">barcodes</span> <span class="ow">in</span> <span class="n">barcode_combos</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">input_nuclei</span> <span class="ow">in</span> <span class="n">x</span><span class="p">:</span>
        <span class="n">psi</span> <span class="o">=</span> <span class="n">zinb_psi_f</span><span class="p">([</span><span class="n">input_nuclei</span> <span class="o">/</span> <span class="n">barcodes</span><span class="p">])[</span><span class="mi">0</span><span class="p">]</span>
        <span class="c1"># bounding PSI to [0, 1] since the extrapolation is unbounded
</span>        <span class="n">psi</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">psi</span><span class="p">,</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">sys</span><span class="p">.</span><span class="n">float_info</span><span class="p">.</span><span class="n">epsilon</span><span class="p">)</span>
        <span class="n">psi</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">psi</span><span class="p">,</span> <span class="mi">0</span> <span class="o">+</span> <span class="n">sys</span><span class="p">.</span><span class="n">float_info</span><span class="p">.</span><span class="n">epsilon</span><span class="p">)</span>
        <span class="c1"># alpha = zinb_alpha_f([input_nuclei / barcodes])[0]
</span>        <span class="c1"># This should also be bound to 0 < ALPHA < +inf since the extrapolation is unbounded.
</span>        <span class="c1"># However, it's a bit more tricky than simply setting it to an arbitrary low value
</span>        <span class="c1"># (e.g. sys.float_info.epsilon). Instead I regularize it with a softplus function (a smooth ReLU).
</span>        <span class="n">alpha</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">e</span> <span class="o">**</span> <span class="n">zinb_alpha_f</span><span class="p">([</span><span class="n">input_nuclei</span> <span class="o">/</span> <span class="n">barcodes</span><span class="p">])[</span><span class="mi">0</span><span class="p">])</span>
        <span class="n">mu</span> <span class="o">=</span> <span class="n">zinb_mu_f</span><span class="p">([</span><span class="n">input_nuclei</span> <span class="o">/</span> <span class="n">barcodes</span><span class="p">])[</span><span class="mi">0</span><span class="p">]</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">s</span> <span class="o">=</span> <span class="p">(</span>
                <span class="n">pm</span><span class="p">.</span><span class="n">distributions</span>
                <span class="p">.</span><span class="n">ZeroInflatedNegativeBinomial</span>
                <span class="p">.</span><span class="n">dist</span><span class="p">(</span><span class="n">psi</span><span class="o">=</span><span class="n">psi</span><span class="p">,</span> <span class="n">mu</span><span class="o">=</span><span class="n">mu</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">).</span><span class="n">random</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">))</span>
        <span class="k">except</span> <span class="nb">ValueError</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Failed at </span><span class="si">{</span><span class="n">barcodes</span><span class="si">}</span><span class="s">, </span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="n">input_nuclei</span><span class="p">)</span><span class="si">}</span><span class="s"> "</span>
                  <span class="sa">f</span><span class="s">"with parameters: </span><span class="si">{</span><span class="n">mu</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">, </span><span class="si">{</span><span class="n">alpha</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">, </span><span class="si">{</span><span class="n">psi</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="k">continue</span>
        <span class="k">if</span> <span class="n">s</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="c1"># Now we simply count fraction of collisions
</span>        <span class="n">f</span><span class="p">[(</span><span class="n">barcodes</span><span class="p">,</span> <span class="n">input_nuclei</span><span class="p">)]</span> <span class="o">=</span> <span class="p">[</span><span class="n">mu</span><span class="p">,</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">psi</span><span class="p">,</span> <span class="p">(</span><span class="n">s</span> <span class="o">></span> <span class="mi">1</span><span class="p">).</span><span class="nb">sum</span><span class="p">()</span> <span class="o">/</span> <span class="n">n</span><span class="p">]</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"mu"</span><span class="p">,</span> <span class="s">"alpha"</span><span class="p">,</span> <span class="s">"psi"</span><span class="p">,</span> <span class="s">"collision_rate"</span><span class="p">]).</span><span class="n">T</span>
<span class="n">f</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">names</span> <span class="o">=</span> <span class="p">[</span><span class="s">'barcodes'</span><span class="p">,</span> <span class="s">'loaded_nuclei'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mu alpha psi collision_rate
barcodes loaded_nuclei
1 1.000000e+03 0.158251 1.277566 0.822348 0.01443
2.448437e+03 0.167693 1.272095 0.823231 0.01562
5.994843e+03 0.190887 1.259394 0.825384 0.01966
1.467799e+04 0.248130 1.232400 0.830612 0.03082
3.593814e+04 0.391010 1.190056 0.843153 0.06518
8.799225e+04 0.757174 1.223013 0.872304 0.16042
2.154435e+05 1.751603 2.266193 0.934359 0.42690
5.274997e+05 4.239832 7.305513 0.980458 0.85256
1.291550e+06 8.670790 10.626222 0.895200 0.88619
3.162278e+06 16.992353 22.653441 1.000000 0.99999
96 1.000000e+03 0.151810 1.281400 0.821745 0.01301
2.448437e+03 0.151908 1.281341 0.821754 0.01298
5.994843e+03 0.152149 1.281197 0.821776 0.01380
1.467799e+04 0.152737 1.280844 0.821832 0.01381
3.593814e+04 0.154178 1.279981 0.821967 0.01309
8.799225e+04 0.157708 1.277887 0.822298 0.01408
2.154435e+05 0.166360 1.272856 0.823106 0.01590
5.274997e+05 0.187610 1.261125 0.825081 0.01962
1.291550e+06 0.240019 1.235858 0.829878 0.03051
3.162278e+06 0.370629 1.194033 0.841407 0.05950
384 1.000000e+03 0.151759 1.281431 0.821740 0.01371
2.448437e+03 0.151784 1.281416 0.821742 0.01361
5.994843e+03 0.151844 1.281380 0.821748 0.01289
1.467799e+04 0.151991 1.281292 0.821762 0.01295
3.593814e+04 0.152351 1.281075 0.821795 0.01351
8.799225e+04 0.153233 1.280546 0.821878 0.01335
2.154435e+05 0.155393 1.279257 0.822081 0.01346
5.274997e+05 0.160685 1.276139 0.822576 0.01425
1.291550e+06 0.173666 1.268723 0.823787 0.01643
3.162278e+06 0.205594 1.251875 0.826738 0.02256
1536 1.000000e+03 0.151747 1.281439 0.821739 0.01375
2.448437e+03 0.151753 1.281435 0.821739 0.01329
5.994843e+03 0.151768 1.281426 0.821741 0.01269
1.467799e+04 0.151805 1.281404 0.821744 0.01262
3.593814e+04 0.151895 1.281350 0.821753 0.01341
8.799225e+04 0.152115 1.281217 0.821773 0.01273
2.154435e+05 0.152655 1.280893 0.821824 0.01339
5.274997e+05 0.153977 1.280101 0.821948 0.01384
1.291550e+06 0.157214 1.278178 0.822251 0.01409
3.162278e+06 0.165151 1.273550 0.822993 0.01509
</code></pre></div></div>
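<p>The transformation applied to α in the loop above, log(1 + e<sup>x</sup>), is usually called softplus, a smooth approximation of ReLU. Below is a minimal sketch of an overflow-safe version using <code class="language-plaintext highlighter-rouge">np.logaddexp</code>; the function name <code class="language-plaintext highlighter-rouge">softplus</code> is my own label and is not part of the original notebook.</p>

```python
import numpy as np


def softplus(x):
    """log(1 + exp(x)), computed without overflow for large x."""
    return np.logaddexp(0.0, x)


# Strictly positive everywhere, close to the identity for large inputs
for value in (-5.0, 0.0, 1.0, 30.0, 1000.0):
    print(value, softplus(value))
```

<p>Note that softplus is only close to the identity for large inputs. Around 1 it shifts values upward (softplus(1) is about 1.31), so the <code class="language-plaintext highlighter-rouge">alpha</code> column in the table holds the transformed values, not the raw interpolated ones.</p>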
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span> <span class="o">*</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">fig</span><span class="p">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">"Prediction with Zero Inflated Negative Binomial"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span>
<span class="k">for</span> <span class="n">barcodes</span> <span class="ow">in</span> <span class="n">f</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">levels</span><span class="p">[</span><span class="mi">0</span><span class="p">]:</span>
    <span class="n">axis</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span>
        <span class="n">f</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">barcodes</span><span class="p">].</span><span class="n">index</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">barcodes</span><span class="p">,</span> <span class="s">'collision_rate'</span><span class="p">]</span> <span class="o">*</span> <span class="mi">100</span><span class="p">,</span>
        <span class="n">linestyle</span><span class="o">=</span><span class="s">"-"</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s">"o"</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">barcodes</span><span class="si">}</span><span class="s"> round1 barcodes"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_xscale</span><span class="p">(</span><span class="s">"log"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_yscale</span><span class="p">(</span><span class="s">"log"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Nuclei loaded"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"% collisions"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="k">for</span> <span class="n">x_</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">100</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">]:</span>
<span class="n">axis</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="n">x_</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"grey"</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">y_</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">10500</span><span class="p">,</span> <span class="mi">191000</span><span class="p">,</span> <span class="mi">383000</span><span class="p">,</span> <span class="mi">765000</span><span class="p">,</span> <span class="mi">1530000</span><span class="p">]:</span>
<span class="n">axis</span><span class="p">.</span><span class="n">axvline</span><span class="p">(</span><span class="n">y_</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">"grey"</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">"--"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
<span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
<span class="s">"droplet_counts.ZINB_params.prediction_of_collision_rate.svg"</span><span class="p">,</span>
<span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_72_0.png" alt="png" /></p>
<p>The ZINB model also seems to show that the μ parameter is more or less linear in the observed range (this is important for scalability) and that it would be theoretically possible to load up to ~1 million pre-labeled nuclei with the scifi method into the Chromium device.</p>
<h1 id="model-comparison">Model comparison</h1>
<p>It seems the ZINB model largely agrees with the ZIP. Can we quantify which one “performs better”, ideally taking into account the difference in model complexity?</p>
<h2 id="waic">WAIC</h2>
<p>There are various ways to do this. One option is the Watanabe–Akaike information criterion (WAIC) and associated metrics. For more information on the practical interpretation of these metrics, see for example the PyMC3 tutorial pages on the subject here: https://docs.pymc.io/notebooks/model_comparison.html#Widely-applicable-Information-Criterion-(WAIC)</p>
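<p>To make the metric less of a black box, here is a minimal sketch of how WAIC can be computed from pointwise log-likelihoods, following the textbook definition in the arXiv paper linked in the warning further down (log pointwise predictive density minus a variance-based penalty). It is not ArviZ's exact implementation, and the helper names and toy numbers are made up for illustration.</p>

```python
import math
import statistics


def log_mean_exp(values):
    """Numerically stable log(mean(exp(values)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def waic_deviance(log_lik):
    """Return (WAIC on the deviance scale, effective number of parameters).

    log_lik[s][i] is the log-likelihood of observation i under posterior draw s.
    """
    n_obs = len(log_lik[0])
    # Regroup the draws so that each column holds all draws for one observation
    columns = [[draw[i] for draw in log_lik] for i in range(n_obs)]
    # Log pointwise predictive density: how well the posterior predicts each point
    lppd = sum(log_mean_exp(col) for col in columns)
    # Penalty: posterior variance of the log-likelihood, summed over observations
    p_waic = sum(statistics.variance(col) for col in columns)
    return -2 * (lppd - p_waic), p_waic


# Toy input: 3 posterior draws, 2 observations (made-up numbers)
toy = [[-1.0, -2.0], [-1.2, -1.9], [-0.9, -2.2]]
waic, p_waic = waic_deviance(toy)
print(f"WAIC: {waic:.3f}, effective parameters: {p_waic:.3f}")
```

<p>On the deviance scale, lower values are better. The per-observation variance terms that make up the penalty are also what the warning about values exceeding 0.4 refers to.</p>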
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">zip_waic</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">waic</span><span class="p">(</span><span class="n">zip_trace</span><span class="p">,</span> <span class="n">zip_model</span><span class="p">)</span>
<span class="n">zinb_waic</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">waic</span><span class="p">(</span><span class="n">zinb_trace</span><span class="p">,</span> <span class="n">zinb_model</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"ZIP model WAIC: </span><span class="si">{</span><span class="n">zip_waic</span><span class="p">.</span><span class="n">waic</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"ZINB model WAIC: </span><span class="si">{</span><span class="n">zinb_waic</span><span class="p">.</span><span class="n">waic</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/afr/.local/lib/python3.7/site-packages/arviz/stats/stats.py:1126: UserWarning: For one or more samples the posterior variance of the log predictive densities exceeds 0.4. This could be indication of WAIC starting to fail.
See http://arxiv.org/abs/1507.04544 for details
"For one or more samples the posterior variance of the log predictive "
ZIP model WAIC: 11379.860
ZINB model WAIC: 11483.723
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">pm</span><span class="p">.</span><span class="n">__version__</span> <span class="o">==</span> <span class="s">"3.8"</span><span class="p">:</span>
<span class="n">df_comp_WAIC</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">compare</span><span class="p">({</span><span class="s">"ZIP"</span><span class="p">:</span> <span class="n">zip_trace</span><span class="p">,</span> <span class="s">"ZINB"</span><span class="p">:</span> <span class="n">zinb_trace</span><span class="p">},</span> <span class="n">ic</span><span class="o">=</span><span class="s">"WAIC"</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">pm</span><span class="p">.</span><span class="n">__version__</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'.'</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="o"><</span> <span class="s">'8'</span><span class="p">:</span>
<span class="n">df_comp_WAIC</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">compare</span><span class="p">({</span><span class="n">zip_model</span><span class="p">:</span> <span class="n">zip_trace</span><span class="p">,</span> <span class="n">zinb_model</span><span class="p">:</span> <span class="n">zinb_trace</span><span class="p">},</span> <span class="n">ic</span><span class="o">=</span><span class="s">"WAIC"</span><span class="p">)</span>
<span class="n">df_comp_WAIC</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">df_comp_WAIC</span><span class="p">.</span><span class="n">index</span>
<span class="p">.</span><span class="n">to_series</span><span class="p">()</span>
<span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="s">'ZIP'</span><span class="p">)</span>
<span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'ZINB'</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"Please install PYMC3 version 3.6 or 3.8."</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
<span class="n">df_comp_WAIC</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"droplet_counts.WAIC_comparison.csv"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">df_comp_WAIC</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> rank waic p_waic d_waic weight se dse warning \
ZIP 0 11379.9 11.8109 0 0.997478 131.318 0 True
ZINB 1 11483.7 9.6716 103.863 0.00252247 116.791 35.0087 True
waic_scale
ZIP deviance
ZINB deviance
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pm</span><span class="p">.</span><span class="n">compareplot</span><span class="p">(</span><span class="n">df_comp_WAIC</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.axes._subplots.AxesSubplot at 0x7ffb1e138a50>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_77_1.png" alt="png" /></p>
<h2 id="leave-one-out-cross-validation">Leave one out cross-validation</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">zip_loo</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">loo</span><span class="p">(</span><span class="n">zip_trace</span><span class="p">,</span> <span class="n">zip_model</span><span class="p">)</span>
<span class="n">zinb_loo</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">loo</span><span class="p">(</span><span class="n">zinb_trace</span><span class="p">,</span> <span class="n">zinb_model</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"ZIP model LOO: </span><span class="si">{</span><span class="n">zip_loo</span><span class="p">.</span><span class="n">loo</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"ZINB model LOO: </span><span class="si">{</span><span class="n">zinb_loo</span><span class="p">.</span><span class="n">loo</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ZIP model LOO: 11379.856
ZINB model LOO: 11483.727
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">pm</span><span class="p">.</span><span class="n">__version__</span> <span class="o">==</span> <span class="s">"3.8"</span><span class="p">:</span>
<span class="n">df_comp_LOO</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">compare</span><span class="p">({</span><span class="s">"ZIP"</span><span class="p">:</span> <span class="n">zip_trace</span><span class="p">,</span> <span class="s">"ZINB"</span><span class="p">:</span> <span class="n">zinb_trace</span><span class="p">},</span> <span class="n">ic</span><span class="o">=</span><span class="s">"LOO"</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">pm</span><span class="p">.</span><span class="n">__version__</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">'.'</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span> <span class="o"><</span> <span class="s">'8'</span><span class="p">:</span>
<span class="n">df_comp_LOO</span> <span class="o">=</span> <span class="n">pm</span><span class="p">.</span><span class="n">compare</span><span class="p">({</span><span class="n">zip_model</span><span class="p">:</span> <span class="n">zip_trace</span><span class="p">,</span> <span class="n">zinb_model</span><span class="p">:</span> <span class="n">zinb_trace</span><span class="p">},</span> <span class="n">ic</span><span class="o">=</span><span class="s">"LOO"</span><span class="p">)</span>
<span class="n">df_comp_LOO</span><span class="p">.</span><span class="n">index</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">df_comp_LOO</span><span class="p">.</span><span class="n">index</span>
<span class="p">.</span><span class="n">to_series</span><span class="p">()</span>
<span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="s">'ZIP'</span><span class="p">)</span>
<span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s">'ZINB'</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"Please install PYMC3 version 3.6 or 3.8."</span><span class="p">)</span>
<span class="k">if</span> <span class="n">savefig</span><span class="p">:</span>
<span class="n">df_comp_LOO</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"droplet_counts.LOO_comparison.csv"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">df_comp_LOO</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> rank loo p_loo d_loo weight se dse warning \
ZIP 0 11379.9 11.8087 0 0.996112 126.292 0 False
ZINB 1 11483.7 9.67338 103.871 0.00388844 112.788 35.007 False
loo_scale
ZIP deviance
ZINB deviance
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pm</span><span class="p">.</span><span class="n">compareplot</span><span class="p">(</span><span class="n">df_comp_LOO</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.axes._subplots.AxesSubplot at 0x7ffb41e79a90>
</code></pre></div></div>
<p><img src="/data/notebooks/chromium_modeling/output_81_1.png" alt="png" /></p>
<p>The tables and plots above show that the ZIP model is a substantially better choice than the ZINB model when balancing model complexity (overfitting) against model performance (underfitting).</p>
<p>The outcome of this model helps to demonstrate how the Chromium device can be loaded beyond what is recommended by 10X, as long as one can pre-label the nuclei with enough complexity.</p>
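<p>One quick way to read the comparison tables is to express the difference between the models in units of its standard error. The snippet below does this with the values copied from the LOO table; treating this ratio as a rough guide (rather than a formal test) is my own simplification.</p>

```python
# Values copied from the LOO comparison table above
d_loo = 103.871  # difference in LOO between ZINB and ZIP (deviance scale)
dse = 35.007     # standard error of that difference

# How many standard errors separate the two models
z = d_loo / dse
print(f"The ZINB model is worse by about {z:.1f} standard errors of the difference")
```

<p>A difference of roughly three standard errors is hard to attribute to noise alone, which agrees with the weights in the tables being almost entirely on the ZIP model.</p>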
<h1 id="conclusion">Conclusion</h1>
<p>Both models show that the mean number of nuclei per droplet (λ or μ parameters in ZIP or ZINB models respectively) increases mostly linearly within the observed range. This means that there is great room for “overloading” the Chromium device if collision events are handled.</p>
<p>That’s exactly what the <em>scifi</em> concept introduces: by labeling all molecules of the transcriptome of each cell, one is able to reduce the number of collisions linearly by the number of unique round1 combinations. Modeling the nuclei loading procedure on the Chromium device shows that it is theoretically possible to load up to ~1 million pre-labeled nuclei with the scifi method into the Chromium device with an acceptable collision rate.</p>
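<p>To illustrate why pre-labeling reduces collisions roughly linearly, here is a small sketch under simplifying assumptions that are mine and not part of the models above: nuclei per droplet follow a plain Poisson distribution (no zero inflation), round1 barcodes are assigned uniformly at random, and a collision is a droplet in which at least two nuclei share a round1 barcode. The barcode counts (96, 384) and the loading rate are purely illustrative.</p>

```python
import math


def poisson_pmf(n, lam):
    """Probability of exactly n nuclei in a droplet under a Poisson(lam) model."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


def p_shared_barcode(n, n_barcodes):
    """Probability that at least two of n nuclei share a round1 barcode."""
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th nucleus must avoid the k barcodes already present
        p_all_distinct *= max(0.0, 1 - k / n_barcodes)
    return 1 - p_all_distinct


def collision_rate(lam, n_barcodes, max_n=60):
    """Fraction of non-empty droplets holding two or more nuclei with the same barcode."""
    non_empty = 1 - poisson_pmf(0, lam)
    collided = sum(
        poisson_pmf(n, lam) * p_shared_barcode(n, n_barcodes)
        for n in range(2, max_n)
    )
    return collided / non_empty


for n_barcodes in (1, 96, 384):
    rate = collision_rate(0.5, n_barcodes)
    print(f"{n_barcodes:4d} round1 barcodes: {100 * rate:.3f}% collisions")
```

<p>With a single barcode this reduces to the usual fraction of droplets with more than one nucleus. With many barcodes, quadrupling the number of barcodes cuts the collision rate by close to a factor of four.</p>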
<p><em>Thank you to Nikolaus Fortelny for feedback on this post.</em></p>
<h1 id="appendix">Appendix</h1>
<p>Some technical considerations:</p>
<ul>
<li>On the zero inflation:
<ul>
<li>Please don’t confuse what I model here as a zero inflation component with the zero inflation in transcript detection in single cell RNA-seq. Various people in the field have recently argued that there is no zero inflation at the transcriptome level, but simply low overall sensitivity in transcript capture; this is still well captured by a Negative Binomial model. The zero inflation observed by optically counting the cells in the droplet emulsion has nothing to do with that. It is also distinct in that, while it is hard to prove at lower nuclei loading concentrations, it is clearly visible and distinct from the Poisson-like distribution of cells per droplet. Having said that, the zero inflation in the observed data might not even be a problem at all, as it could easily be a design choice by 10X in order to ensure that most cells/nuclei are used at a point where the device is ready to make use of mostly useful droplets.</li>
</ul>
</li>
<li>On the inter/extrapolation across the model’s posterior:
<ul>
<li>This obviously feels wrong and not just because the quadratic fit is obviously… overfitting. Using a probabilistic model and then using only point estimates to get estimates for the remaining input space is not what I would ideally like to do. If someone has better suggestions, feel free to comment.</li>
</ul>
</li>
</ul>
ATAC-seq power analysis
2018-08-10T00:00:00+00:00
https://andre-rendeiro.com/2018/08/10/atacseq_power_analysis
<p>Available as a <a href="/data/notebooks/atacseq_power_analysis/atacseq-power-analysis.ipynb">Jupyter notebook here</a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
</code></pre></div></div>
<h1 id="atac-seq-study-power-analysis">ATAC-seq study power analysis</h1>
<p>We’ll review the fundamental properties of ATAC-seq depth and variation distributions for each regulatory element quantified in several projects on primary PBMC samples in the Bock lab.</p>
<p>The concept and statistical metrics are based on the report for RNA-seq data by Hart et al., “Calculating Sample Size Estimates for RNA Sequencing Data”, Journal of Computational Biology, 2013.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Set settings
</span><span class="n">sns</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">context</span><span class="o">=</span><span class="s">"paper"</span><span class="p">,</span> <span class="n">style</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">palette</span><span class="o">=</span><span class="s">"colorblind"</span><span class="p">,</span> <span class="n">color_codes</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">matplotlib</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">"svg.fonttype"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"none"</span>
<span class="n">matplotlib</span><span class="p">.</span><span class="n">rc</span><span class="p">(</span><span class="s">'text'</span><span class="p">,</span> <span class="n">usetex</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s start by reading in the matrices for 4 projects and recover two vectors for each: average raw read coverage per regulatory element and the coefficient of variation (std / mean).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">projects</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="s">"cll-chromatin"</span><span class="p">,</span> <span class="s">"~/projects/cll-chromatin/data_submission/cll_peaks.raw_coverage.tsv"</span><span class="p">),</span>
<span class="p">(</span><span class="s">"cll-ibrutinib"</span><span class="p">,</span> <span class="s">"~/projects/cll-ibrutinib/results/cll-ibrutinib_CLL_peaks.raw_coverage.csv"</span><span class="p">),</span>
<span class="p">(</span><span class="s">"cll-time_course"</span><span class="p">,</span> <span class="s">"~/projects/cll-time_course/results/cll-time_course_peaks.raw_coverage.csv"</span><span class="p">),</span>
<span class="p">(</span><span class="s">"cll-timetotreat"</span><span class="p">,</span> <span class="s">"~/projects/cll-timetotreat/results/cll-timetotreat_peaks.raw_coverage.csv"</span><span class="p">)]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">covs</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="n">cvs</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="p">(</span><span class="n">project</span><span class="p">,</span> <span class="n">coverage</span><span class="p">)</span> <span class="ow">in</span> <span class="n">projects</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">project</span><span class="p">)</span>
<span class="n">cov</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_table</span><span class="p">(</span><span class="n">coverage</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">","</span> <span class="k">if</span> <span class="n">coverage</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">"csv"</span><span class="p">)</span> <span class="k">else</span> <span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
<span class="k">if</span> <span class="s">'chrom'</span> <span class="ow">in</span> <span class="n">cov</span><span class="p">.</span><span class="n">columns</span><span class="p">:</span>
<span class="n">cov</span> <span class="o">=</span> <span class="n">cov</span><span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">cov</span><span class="p">[</span><span class="s">'index'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cov</span><span class="p">[</span><span class="s">'chrom'</span><span class="p">]</span> <span class="o">+</span> <span class="s">":"</span> <span class="o">+</span> <span class="n">cov</span><span class="p">[</span><span class="s">'start'</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span> <span class="o">+</span> <span class="s">"-"</span> <span class="o">+</span> <span class="n">cov</span><span class="p">[</span><span class="s">'end'</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>
<span class="n">cov</span> <span class="o">=</span> <span class="n">cov</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"index"</span><span class="p">)</span>
<span class="n">cov</span> <span class="o">=</span> <span class="n">cov</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="n">cov</span><span class="p">.</span><span class="n">columns</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">"ATAC"</span><span class="p">)]</span>
<span class="c1"># Get average coverage in raw read counts per regulatory element
</span> <span class="n">covs</span><span class="p">[</span><span class="n">project</span><span class="p">]</span> <span class="o">=</span> <span class="n">cov</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># Get coefficient of variation per regulatory element
</span> <span class="n">cvs</span><span class="p">[</span><span class="n">project</span><span class="p">]</span> <span class="o">=</span> <span class="n">cov</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="n">cov</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cll-chromatin
cll-ibrutinib
cll-time_course
cll-timetotreat
</code></pre></div></div>
<p>Let’s now plot the distribution of raw read coverages for each project.
We will use various scales due to the exponential nature of NGS data but also because the emphasis is on comparing various projects.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Plot coverage for each project
</span><span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span> <span class="n">gridspec_kw</span><span class="o">=</span><span class="p">{</span><span class="s">"width_ratios"</span><span class="p">:</span> <span class="p">(</span><span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">)})</span>
<span class="k">for</span> <span class="n">project</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">projects</span><span class="p">:</span>
<span class="n">cov</span> <span class="o">=</span> <span class="n">covs</span><span class="p">[</span><span class="n">project</span><span class="p">].</span><span class="n">sort_values</span><span class="p">()</span>
<span class="n">rpm</span> <span class="o">=</span> <span class="p">(</span><span class="n">cov</span> <span class="o">/</span> <span class="n">cov</span><span class="p">.</span><span class="nb">sum</span><span class="p">())</span> <span class="o">*</span> <span class="mf">1e6</span>
<span class="n">sns</span><span class="p">.</span><span class="n">distplot</span><span class="p">(</span><span class="n">rpm</span><span class="p">,</span> <span class="n">kde</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axis</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">project</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">distplot</span><span class="p">(</span><span class="n">rpm</span><span class="p">[</span><span class="n">rpm</span> <span class="o"><</span> <span class="mi">50</span><span class="p">],</span> <span class="n">kde</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axis</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">project</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">distplot</span><span class="p">(</span><span class="n">rpm</span><span class="p">[</span><span class="n">rpm</span> <span class="o"><</span> <span class="mi">50</span><span class="p">],</span> <span class="n">kde</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axis</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">project</span><span class="p">)</span>
<span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axis</span><span class="p">:</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Coverage (reads per million)"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Reg. elements"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_xlim</span><span class="p">((</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="mi">50</span><span class="p">))</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">set_xlim</span><span class="p">((</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">30</span><span class="p">))</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">set_yscale</span><span class="p">(</span><span class="s">"log"</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"atac-seq_power_analysis.coverage_distribution.svg"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/atacseq_power_analysis/output_8_0.png" alt="png" /></p>
<p>We see that across projects the type and shape of the distribution associated with read counts are extremely similar.</p>
<p>Now we will assess the level of (mostly) biological variation in the datasets. Since there is a relationship between technical variation and the mean depth of coverage in a certain regulatory element, the coefficient of variation allows us to factor out this effect by dividing the standard deviation by the mean coverage of each regulatory element.</p>
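<p>As a toy illustration of why dividing by the mean helps, consider two made-up regulatory elements whose counts differ tenfold in depth but have the same relative spread. The numbers below are invented, and the standard library is used instead of pandas to keep the snippet self-contained.</p>

```python
import statistics

# Made-up read counts for two regulatory elements across three samples
counts = {
    "low_coverage_element": [10, 20, 30],
    "high_coverage_element": [100, 200, 300],
}

for name, values in counts.items():
    mean = statistics.mean(values)
    # Sample standard deviation (ddof=1), the same default pandas uses for .std()
    std = statistics.stdev(values)
    print(f"{name}: mean={mean:.0f}, std={std:.0f}, CV={std / mean:.2f}")
```

<p>The standard deviations differ by a factor of ten, but the coefficient of variation is identical for both elements, so elements of different depth become comparable.</p>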
<p>Let’s observe these distributions for the various projects:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Plot CVs for each project
</span><span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="k">for</span> <span class="n">project</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">projects</span><span class="p">:</span>
<span class="n">cv</span> <span class="o">=</span> <span class="n">cvs</span><span class="p">[</span><span class="n">project</span><span class="p">].</span><span class="n">sort_values</span><span class="p">()</span>
<span class="n">sns</span><span class="p">.</span><span class="n">distplot</span><span class="p">(</span><span class="n">cv</span><span class="p">,</span> <span class="n">kde</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axis</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">project</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">distplot</span><span class="p">(</span><span class="n">cv</span><span class="p">,</span> <span class="n">kde</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axis</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="n">project</span><span class="p">)</span>
<span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axis</span><span class="p">:</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Biological CV"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Reg. elements"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_xlim</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"atac-seq_power_analysis.CV_distribution.svg"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/atacseq_power_analysis/output_10_0.png" alt="png" /></p>
<p>While the distributions for all projects resemble a gamma distribution, the shape and scale parameters vary considerably.</p>
<p>The “cll-time_course” project shows markedly more variation than the others, presumably because it is the only one that includes 6 pure (FACS-sorted) cell types. At the other end, the “cll-chromatin” project shows the least global variation, perhaps because it focused on a single cell type with generally high purity. The other two projects contain sorted CLL cells together with mixtures of PBMCs that are generally enriched for CLL cells, and so fall between the two extremes in terms of cell type purity.</p>
<p>Let’s also plot this in a cumulative manner for easier comparability, as in Hart et al., figure 2:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Plot cumulative CVs for each project
</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="k">for</span> <span class="n">project</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">projects</span><span class="p">:</span>
<span class="n">cv</span> <span class="o">=</span> <span class="n">cvs</span><span class="p">[</span><span class="n">project</span><span class="p">].</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">plot</span><span class="p">(</span><span class="n">cv</span><span class="p">,</span> <span class="n">cv</span><span class="p">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">cv</span><span class="p">.</span><span class="nb">sum</span><span class="p">(),</span> <span class="n">label</span><span class="o">=</span><span class="n">project</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">plot</span><span class="p">(</span><span class="n">cv</span> <span class="o">/</span> <span class="n">cv</span><span class="p">.</span><span class="nb">max</span><span class="p">(),</span> <span class="n">cv</span><span class="p">.</span><span class="n">cumsum</span><span class="p">()</span> <span class="o">/</span> <span class="n">cv</span><span class="p">.</span><span class="nb">sum</span><span class="p">(),</span> <span class="n">label</span><span class="o">=</span><span class="n">project</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Biological CV"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Normalized CV"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="k">for</span> <span class="n">ax</span> <span class="ow">in</span> <span class="n">axis</span><span class="p">:</span>
<span class="n">ax</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Cumulative distribution"</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="c1"># axis.set_title("Coeficient of variation for CLL ATAC-seq projects")
</span><span class="n">axis</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">legend</span><span class="p">(</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">legend</span><span class="p">(</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"atac-seq_power_analysis.cumulative_CV.svg"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/atacseq_power_analysis/output_12_0.png" alt="png" /></p>
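<p>A note on what the curve above measures: <code class="language-plaintext highlighter-rouge">cv.cumsum() / cv.sum()</code> on sorted values gives the cumulative share of the summed CV, which is related to, but not the same as, an empirical cumulative distribution (the fraction of elements at or below a given CV). The sketch below contrasts the two on made-up numbers; the helper names <code class="language-plaintext highlighter-rouge">cumulative_share</code> and <code class="language-plaintext highlighter-rouge">empirical_cdf</code> are hypothetical and not part of the notebook.</p>

```python
def cumulative_share(values):
    """Sorted values and the running fraction of their total sum."""
    ordered = sorted(values)
    total = sum(ordered)
    running = 0.0
    shares = []
    for v in ordered:
        running += v
        shares.append(running / total)
    return ordered, shares


def empirical_cdf(values):
    """Sorted values and the fraction of observations at or below each one."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]


# Made-up CV values, for illustration only
example_cvs = [0.5, 1.0, 0.5, 2.0]

x_share, y_share = cumulative_share(example_cvs)
x_ecdf, y_ecdf = empirical_cdf(example_cvs)
print(list(zip(x_share, y_share)))
print(list(zip(x_ecdf, y_ecdf)))
```

<p>Either curve can be used to compare projects, as long as the y-axis is read accordingly.</p>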
<p>Having measured and investigated the technical and biological parameters of ATAC-seq experiments, we can now start to infer the sample size needed to discover differential regulatory elements between sample groups given some level of confidence, effect size and power.</p>
<p>We will use the RNASeqPower R library by the same authors which implements a formula for the number of samples given the above parameters plus the technical ones such as the coefficient of variation and depth of coverage for each regulatory element.</p>
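<p>To make the formula concrete, here is a pure-Python sketch of the sample size calculation from Hart et al. as I understand it, assuming two groups of equal size, the same CV in both groups and a two-sided test. It is not the RNASeqPower source code, and the function name <code class="language-plaintext highlighter-rouge">samples_per_group</code> is hypothetical.</p>

```python
from math import log
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.8):
    """Approximate samples per group for a two-group comparison.

    Assumes equal group sizes, equal CV in both groups and a two-sided test:
    n = 2 * (z_{1 - alpha/2} + z_{power})^2 * (1/depth + cv^2) / ln(effect)^2
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    variance_term = 1 / depth + cv ** 2
    return 2 * (z_alpha + z_power) ** 2 * variance_term / log(effect) ** 2


# Depth and CV taken from the first row of the results table shown later in this post
n = samples_per_group(depth=1.965909, cv=0.493028, effect=1.25, alpha=0.05, power=0.8)
print(round(n, 2))  # roughly 237, in line with that row
```

<p>The formula shows why shallow, noisy elements are expensive: the required sample number grows with <code class="language-plaintext highlighter-rouge">1/depth + cv^2</code> and shrinks with the square of the log fold-change.</p>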
<p>For simplicity we will focus on a particular project which had the highest sample purity.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Import the RNASeqPower R library
</span><span class="kn">import</span> <span class="nn">rpy2</span>
<span class="kn">import</span> <span class="nn">rpy2.robjects</span> <span class="k">as</span> <span class="n">robjects</span>
<span class="kn">import</span> <span class="nn">rpy2.robjects.numpy2ri</span>
<span class="n">rpy2</span><span class="p">.</span><span class="n">robjects</span><span class="p">.</span><span class="n">numpy2ri</span><span class="p">.</span><span class="n">activate</span><span class="p">()</span>
<span class="n">robjects</span><span class="p">.</span><span class="n">r</span><span class="p">(</span><span class="s">'require("RNASeqPower")'</span><span class="p">)</span>
<span class="n">rnapower</span> <span class="o">=</span> <span class="n">robjects</span><span class="p">.</span><span class="n">r</span><span class="p">(</span><span class="s">'rnapower'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Let's get the range of coverage and CV of a specific project
</span>
<span class="n">project</span> <span class="o">=</span> <span class="s">"cll-chromatin"</span>
<span class="k">print</span><span class="p">(</span><span class="n">covs</span><span class="p">[</span><span class="n">project</span><span class="p">].</span><span class="n">describe</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">cvs</span><span class="p">[</span><span class="n">project</span><span class="p">].</span><span class="n">describe</span><span class="p">())</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>count 111971.000000
mean 59.708063
std 154.245453
min 0.215909
25% 3.965909
50% 11.272727
75% 37.227273
max 3390.352273
dtype: float64
count 111971.000000
mean 0.829191
std 0.377015
min 0.369232
25% 0.567725
50% 0.732309
75% 0.981739
max 8.895140
dtype: float64
</code></pre></div></div>
<p>We will iterate through a linear range of the depth and variation values observed in the study to estimate the number of samples, while at the same time getting predictions for various effect sizes and levels of power. The significance level will always be 0.05.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">effects</span> <span class="o">=</span> <span class="p">(</span><span class="mf">1.25</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">)</span>
<span class="n">powers</span> <span class="o">=</span> <span class="p">(</span><span class="mf">0.8</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">)</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.05</span>
<span class="n">quantiles</span> <span class="o">=</span> <span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">)</span>
<span class="n">bins</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">depth_range</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">covs</span><span class="p">[</span><span class="n">project</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="n">quantiles</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">covs</span><span class="p">[</span><span class="n">project</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="n">quantiles</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="n">bins</span><span class="p">)</span>
<span class="n">cv_range</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">cvs</span><span class="p">[</span><span class="n">project</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="n">quantiles</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">cvs</span><span class="p">[</span><span class="n">project</span><span class="p">].</span><span class="n">quantile</span><span class="p">(</span><span class="n">quantiles</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="n">bins</span><span class="p">)</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="k">for</span> <span class="n">depth</span> <span class="ow">in</span> <span class="n">depth_range</span><span class="p">:</span>
<span class="k">for</span> <span class="n">cv</span> <span class="ow">in</span> <span class="n">cv_range</span><span class="p">:</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
<span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">rnapower</span><span class="p">(</span><span class="n">depth</span><span class="o">=</span><span class="n">depth</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">cv</span><span class="p">,</span>
<span class="n">effect</span><span class="o">=</span><span class="n">robjects</span><span class="p">.</span><span class="n">vectors</span><span class="p">.</span><span class="n">FloatVector</span><span class="p">(</span><span class="n">effects</span><span class="p">),</span>
<span class="n">alpha</span><span class="o">=</span><span class="n">alpha</span><span class="p">,</span>
<span class="n">power</span><span class="o">=</span><span class="n">robjects</span><span class="p">.</span><span class="n">vectors</span><span class="p">.</span><span class="n">FloatVector</span><span class="p">(</span><span class="n">powers</span><span class="p">))),</span>
<span class="n">index</span><span class="o">=</span><span class="n">effects</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">powers</span><span class="p">)</span>
<span class="n">p</span><span class="p">[</span><span class="s">'depth'</span><span class="p">]</span> <span class="o">=</span> <span class="n">depth</span>
<span class="n">p</span><span class="p">[</span><span class="s">'cv'</span><span class="p">]</span> <span class="o">=</span> <span class="n">cv</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">res</span><span class="p">,</span> <span class="n">p</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">res</span><span class="p">.</span><span class="n">index</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"effect"</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">melt</span><span class="p">(</span><span class="n">res</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(),</span> <span class="n">id_vars</span><span class="o">=</span><span class="p">[</span><span class="s">'effect'</span><span class="p">,</span> <span class="s">'depth'</span><span class="p">,</span> <span class="s">'cv'</span><span class="p">],</span> <span class="n">var_name</span><span class="o">=</span><span class="s">"power"</span><span class="p">)</span>
<span class="n">res</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"atac-seq_power_analysis.sample_estimation.csv"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">res</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>effect</th>
<th>depth</th>
<th>cv</th>
<th>power</th>
<th>value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1.25</td>
<td>1.965909</td>
<td>0.493028</td>
<td>0.8</td>
<td>236.995786</td>
</tr>
<tr>
<th>1</th>
<td>1.50</td>
<td>1.965909</td>
<td>0.493028</td>
<td>0.8</td>
<td>71.779814</td>
</tr>
<tr>
<th>2</th>
<td>2.00</td>
<td>1.965909</td>
<td>0.493028</td>
<td>0.8</td>
<td>24.561698</td>
</tr>
<tr>
<th>3</th>
<td>1.25</td>
<td>1.965909</td>
<td>0.500749</td>
<td>0.8</td>
<td>239.414878</td>
</tr>
<tr>
<th>4</th>
<td>1.50</td>
<td>1.965909</td>
<td>0.500749</td>
<td>0.8</td>
<td>72.512494</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">effects</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">powers</span><span class="p">),</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">powers</span><span class="p">)</span> <span class="o">*</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">8</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">effects</span><span class="p">)))</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">effect</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">effects</span><span class="p">):</span>
<span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">power</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">powers</span><span class="p">):</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">res</span><span class="p">[(</span><span class="n">res</span><span class="p">[</span><span class="s">'power'</span><span class="p">]</span> <span class="o">==</span> <span class="n">power</span><span class="p">)</span> <span class="o">&</span> <span class="p">(</span><span class="n">res</span><span class="p">[</span><span class="s">'effect'</span><span class="p">]</span> <span class="o">==</span> <span class="n">effect</span><span class="p">)],</span> <span class="n">index</span><span class="o">=</span><span class="s">'cv'</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">'depth'</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">"value"</span><span class="p">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">sort_index</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span>
<span class="n">p</span><span class="p">,</span> <span class="n">xticklabels</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">yticklabels</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">cmap</span><span class="o">=</span><span class="s">"inferno"</span><span class="p">,</span> <span class="n">vmin</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">vmax</span><span class="o">=</span><span class="mi">3000</span> <span class="o">**</span> <span class="p">(</span><span class="mf">1.</span> <span class="o">/</span> <span class="n">effect</span><span class="p">),</span> <span class="n">robust</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">square</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">rasterized</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">cbar_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"label"</span><span class="p">:</span> <span class="s">"Sample number"</span><span class="p">},</span>
<span class="n">ax</span><span class="o">=</span><span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">])</span>
<span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Fold-change {}; {} % power"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">effect</span><span class="p">,</span> <span class="n">power</span> <span class="o">*</span> <span class="mi">100</span><span class="p">),</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"CV</span><span class="se">\n</span><span class="s">(10th to 90th percentile)"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Read depth</span><span class="se">\n</span><span class="s">(10th to 90th percentile)"</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">].</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">].</span><span class="n">get_xticklabels</span><span class="p">(),</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">90</span><span class="p">)</span>
<span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">].</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="n">axis</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">].</span><span class="n">get_yticklabels</span><span class="p">(),</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">fig</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"atac-seq_power_analysis.sample_size.heatmaps.svg"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/atacseq_power_analysis/output_19_0.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># export CV and depth for project
</span><span class="n">project</span> <span class="o">=</span> <span class="s">"cll-chromatin"</span>
<span class="n">covs</span><span class="p">[</span><span class="n">project</span><span class="p">].</span><span class="n">to_frame</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"mean_depth"</span><span class="p">).</span><span class="n">join</span><span class="p">(</span><span class="n">cvs</span><span class="p">[</span><span class="n">project</span><span class="p">].</span><span class="n">to_frame</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"cv"</span><span class="p">)).</span><span class="n">to_csv</span><span class="p">(</span>
<span class="s">"atac-seq_power_analysis.metrics.csv"</span><span class="p">)</span>
</code></pre></div></div>
Human receptor-ligand interaction repertoire
2017-09-06T00:00:00+00:00
https://andre-rendeiro.com/2017/09/06/human_receptor-ligand-expression
<p>Available as a <a href="https://github.com/afrendeiro/afrendeiro.github.io/blob/master/data/notebooks/human_ligand_receptor_expression/receptor-ligand-expression.ipynb">Jupyter notebook here</a></p>
<h2 id="receptor-ligand-interactions-in-human-primary-cells">Receptor-ligand interactions in human primary cells</h2>
<p>Ramilowski et al. (doi: 10.1038/ncomms8866) provide a very nice resource on receptor-ligand interactions in human primary cells.
Let’s explore…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">pylab</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span><span class="s">"white"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">rcParams</span><span class="p">[</span><span class="s">'svg.fonttype'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'none'</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Populating the interactive namespace from numpy and matplotlib
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Read in the supplementary material
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_excel</span><span class="p">(</span>
<span class="s">"https://images.nature.com/original/nature-assets/ncomms/2015/150722/ncomms8866/extref/ncomms8866-s3.xlsx"</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Pair.Name</th>
<th>Ligand.ApprovedSymbol</th>
<th>Ligand.Name</th>
<th>Receptor.ApprovedSymbol</th>
<th>Receptor.Name</th>
<th>DLRP</th>
<th>HPMR</th>
<th>IUPHAR</th>
<th>HPRD</th>
<th>STRING.binding</th>
<th>STRING.experiment</th>
<th>HPMR.Ligand</th>
<th>HPMR.Receptor</th>
<th>PMID.Manual</th>
<th>Pair.Source</th>
<th>Pair.Evidence</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>A2M_LRP1</td>
<td>A2M</td>
<td>alpha-2-macroglobulin</td>
<td>LRP1</td>
<td>low density lipoprotein receptor-related prote...</td>
<td>NaN</td>
<td>HPMR</td>
<td>NaN</td>
<td>HPRD</td>
<td>STRING.binding</td>
<td>STRING.experiment</td>
<td>A2M</td>
<td>LRP1</td>
<td>NaN</td>
<td>known</td>
<td>literature supported</td>
</tr>
<tr>
<th>1</th>
<td>AANAT_MTNR1A</td>
<td>AANAT</td>
<td>aralkylamine N-acetyltransferase</td>
<td>MTNR1A</td>
<td>melatonin receptor 1A</td>
<td>NaN</td>
<td>HPMR</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>AANAT</td>
<td>MTNR1A</td>
<td>NaN</td>
<td>known</td>
<td>literature supported</td>
</tr>
<tr>
<th>2</th>
<td>AANAT_MTNR1B</td>
<td>AANAT</td>
<td>aralkylamine N-acetyltransferase</td>
<td>MTNR1B</td>
<td>melatonin receptor 1B</td>
<td>NaN</td>
<td>HPMR</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>AANAT</td>
<td>MTNR1B</td>
<td>NaN</td>
<td>known</td>
<td>literature supported</td>
</tr>
<tr>
<th>3</th>
<td>ACE_AGTR2</td>
<td>ACE</td>
<td>angiotensin I converting enzyme</td>
<td>AGTR2</td>
<td>angiotensin II receptor, type 2</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>HPRD</td>
<td>NaN</td>
<td>NaN</td>
<td>ACE</td>
<td>AGTR2</td>
<td>NaN</td>
<td>novel</td>
<td>literature supported</td>
</tr>
<tr>
<th>4</th>
<td>ACE_BDKRB2</td>
<td>ACE</td>
<td>angiotensin I converting enzyme</td>
<td>BDKRB2</td>
<td>bradykinin receptor B2</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>HPRD</td>
<td>NaN</td>
<td>NaN</td>
<td>ACE</td>
<td>BDKRB2</td>
<td>NaN</td>
<td>novel</td>
<td>literature supported</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Let's explore the existing known interactions
</span>
<span class="c1"># filter out interactions labeled as "EXCLUDED"
</span><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span>
<span class="o">~</span><span class="n">df</span><span class="p">[</span><span class="s">'Pair.Evidence'</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="n">contains</span><span class="p">(</span><span class="s">"EXCLUDED"</span><span class="p">),</span>
<span class="p">[</span><span class="s">'Pair.Name'</span><span class="p">,</span> <span class="s">'Ligand.ApprovedSymbol'</span><span class="p">,</span> <span class="s">'Receptor.ApprovedSymbol'</span><span class="p">]]</span>
<span class="n">df</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'pair'</span><span class="p">,</span> <span class="s">'ligand'</span><span class="p">,</span> <span class="s">'receptor'</span><span class="p">]</span>
<span class="n">df</span><span class="p">[</span><span class="s">'interaction'</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Let's make a pivot table of receptor vs ligands
</span><span class="n">df_pivot</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="s">"ligand"</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">'receptor'</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">"interaction"</span><span class="p">,</span> <span class="n">aggfunc</span><span class="o">=</span><span class="nb">sum</span><span class="p">,</span> <span class="n">fill_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c1"># Heatmap
</span><span class="n">g</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">clustermap</span><span class="p">(</span>
<span class="n">df_pivot</span><span class="p">,</span> <span class="n">xticklabels</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">yticklabels</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">metric</span><span class="o">=</span><span class="s">"jaccard"</span><span class="p">,</span>
<span class="n">rasterized</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">cbar_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"label"</span><span class="p">:</span> <span class="s">"Jaccard distance"</span><span class="p">})</span>
<span class="n">g</span><span class="p">.</span><span class="n">ax_row_dendrogram</span><span class="p">.</span><span class="n">set_visible</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="n">ax_col_dendrogram</span><span class="p">.</span><span class="n">set_visible</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"receptor"</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">"human_receptor_ligand.interaction_map.clustermap.svg"</span><span class="p">),</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/afr/.local/lib/python3.5/site-packages/matplotlib/cbook.py:136: MatplotlibDeprecationWarning: The axisbg attribute was deprecated in version 2.0. Use facecolor instead.
warnings.warn(message, mplDeprecation, stacklevel=1)
</code></pre></div></div>
<p><img src="/data/notebooks/human_ligand_receptor_expression/output_4_1.png" alt="png" /></p>
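<p>For reference, <code class="language-plaintext highlighter-rouge">metric="jaccard"</code> in the heatmap above groups ligands by how much their receptor sets overlap. Here is a minimal plain-Python sketch of that distance. The ligand and receptor names (<code class="language-plaintext highlighter-rouge">L1</code>, <code class="language-plaintext highlighter-rouge">R1</code>, ...) are hypothetical and not taken from the dataset, and returning 0 for two empty sets is a convention chosen for this sketch.</p>

```python
from collections import defaultdict

# Hypothetical (ligand, receptor) pairs, for illustration only
pairs = [("L1", "R1"), ("L1", "R2"), ("L2", "R2"), ("L2", "R3"), ("L3", "R4")]

# Collect the set of receptors each ligand interacts with
receptors_by_ligand = defaultdict(set)
for ligand, receptor in pairs:
    receptors_by_ligand[ligand].add(receptor)


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# L1 and L2 share one of three receptors; L1 and L3 share none
d_12 = jaccard_distance(receptors_by_ligand["L1"], receptors_by_ligand["L2"])
d_13 = jaccard_distance(receptors_by_ligand["L1"], receptors_by_ligand["L3"])
print(round(d_12, 3), d_13)
```

<p>Ligands with no receptor in common sit at the maximum distance of 1, which is why a sparse interaction table like this one tends to cluster into many small, isolated groups.</p>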
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Let's now play with the expression of these genes in tissues
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_excel</span><span class="p">(</span>
<span class="s">"https://images.nature.com/original/nature-assets/ncomms/2015/150722/ncomms8866/extref/ncomms8866-s5.xlsx"</span><span class="p">,</span>
<span class="mi">1</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">])</span>
<span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
<span class="c1"># remove genes not expressed in any tissue
</span><span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Let's correlate tissues based on their expression of these genes
</span><span class="n">w</span><span class="p">,</span> <span class="n">h</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.12</span><span class="p">,</span> <span class="n">df</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.12</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">clustermap</span><span class="p">(</span>
<span class="n">df</span><span class="p">.</span><span class="n">corr</span><span class="p">(),</span> <span class="n">xticklabels</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">),</span> <span class="n">cbar_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"label"</span><span class="p">:</span> <span class="s">"Pearson correlation"</span><span class="p">})</span>
<span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">get_yticklabels</span><span class="p">(),</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="s">"xx-small"</span><span class="p">,</span> <span class="n">rasterized</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="n">ax_row_dendrogram</span><span class="p">.</span><span class="n">set_rasterized</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="n">ax_col_dendrogram</span><span class="p">.</span><span class="n">set_rasterized</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span>
<span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">"human_receptor_ligand.expression.tissue_correlation.clustermap.svg"</span><span class="p">),</span>
<span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/human_ligand_receptor_expression/output_6_1.png" alt="png" /></p>
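<p>The tissue-by-tissue matrix above comes from <code class="language-plaintext highlighter-rouge">df.corr()</code>, which by default computes the Pearson correlation between every pair of columns. A small sketch of what a single entry of that matrix represents, written in plain Python with made-up expression values (the tissue names and numbers are hypothetical):</p>

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression of four genes in three tissues
tissue_a = [0.0, 2.0, 4.0, 6.0]
tissue_b = [1.0, 2.0, 3.0, 4.0]  # rises together with tissue_a
tissue_c = [6.0, 4.0, 2.0, 0.0]  # falls as tissue_a rises

r_ab = pearson(tissue_a, tissue_b)
r_ac = pearson(tissue_a, tissue_c)
print(r_ab, r_ac)
```

<p>Because Pearson correlation is sensitive to a few very highly expressed genes, it may be worth comparing the result with a rank-based alternative such as <code class="language-plaintext highlighter-rouge">df.corr(method="spearman")</code>.</p>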
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Let's visualize the expression of ligands and receptors in the different tissues
</span><span class="n">df2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log2</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">df</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">"F5.PrimaryCells.Expression_Max"</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">dropna</span><span class="p">())</span>
<span class="n">w</span><span class="p">,</span> <span class="n">h</span> <span class="o">=</span> <span class="n">df2</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.12</span><span class="p">,</span> <span class="n">df2</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="mf">0.01</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">clustermap</span><span class="p">(</span><span class="n">df2</span><span class="p">.</span><span class="n">T</span><span class="p">,</span> <span class="n">metric</span><span class="o">=</span><span class="s">"correlation"</span><span class="p">,</span> <span class="n">z_score</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">vmin</span><span class="o">=-</span><span class="mi">3</span><span class="p">,</span> <span class="n">vmax</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
<span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">w</span><span class="p">),</span> <span class="n">cbar_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"label"</span><span class="p">:</span> <span class="s">"log(expression) Z-score"</span><span class="p">},</span> <span class="n">xticklabels</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">rasterized</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">set_yticklabels</span><span class="p">(</span><span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">get_yticklabels</span><span class="p">(),</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="s">"x-small"</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="n">ax_row_dendrogram</span><span class="p">.</span><span class="n">set_rasterized</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="n">ax_col_dendrogram</span><span class="p">.</span><span class="n">set_rasterized</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="n">ax_heatmap</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Gene (ligands, receptors)"</span><span class="p">)</span>
<span class="n">g</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">"human_receptor_ligand.expression.clustermap.svg"</span><span class="p">),</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">"tight"</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">300</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/data/notebooks/human_ligand_receptor_expression/output_7_1.png" alt="png" /></p>
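<p>The colors in this last heatmap are Z-scores of <code class="language-plaintext highlighter-rouge">log2(1 + expression)</code>, computed per gene across tissues (<code class="language-plaintext highlighter-rouge">z_score=1</code> standardizes the columns of <code class="language-plaintext highlighter-rouge">df2.T</code>, which are genes). A rough sketch of that transformation for one gene follows. The input values are hypothetical, and the use of the sample standard deviation is an assumption about the exact estimator.</p>

```python
import math
import statistics


def log2_zscore(values):
    """log2(1 + x) transform followed by standardization to mean 0, sd 1."""
    logged = [math.log2(1 + v) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in logged]


# Hypothetical expression of one gene in three tissues
expression = [0, 1, 3]
z = log2_zscore(expression)
print(z)
```

<p>Note that <code class="language-plaintext highlighter-rouge">vmin=-3</code> and <code class="language-plaintext highlighter-rouge">vmax=3</code> only cap the color scale; they do not change the underlying values.</p>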
Demo: analysis of flow cytometry data at single-cell resolution
2016-11-23T00:00:00+00:00
https://andre-rendeiro.com/2016/11/23/flow_cytometry_demo
<p>I’m interested in exploring flow cytometry data from a single-cell perspective. After all, it is a very high-throughput method that can measure a modest number of variables in every single cell.</p>
<p>I decided to use the <code class="language-plaintext highlighter-rouge">FlowCytometryTools</code> Python library to see what I could extract from those <em>magic</em> FCS files and how feature-complete the library is.</p>
<p>I made the following Jupyter notebook:</p>
<p><br /></p>
<h4 id="demo-of-exploring-flow-cytometry-data-with-the-flowcytometrytools-library">Demo of exploring flow cytometry data with the FlowCytometryTools library</h4>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Import library
</span><span class="kn">from</span> <span class="nn">FlowCytometryTools</span> <span class="kn">import</span> <span class="n">FCMeasurement</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># load fcs file (version 3)
</span><span class="n">sample</span> <span class="o">=</span> <span class="n">FCMeasurement</span><span class="p">(</span><span class="n">ID</span><span class="o">=</span><span class="s">'Example sample'</span><span class="p">,</span> <span class="n">datafile</span><span class="o">=</span><span class="s">"data/example.fcs"</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Let's see the number of cells measured
</span><span class="n">sample</span><span class="p">.</span><span class="n">counts</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>156157
</code></pre></div></div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># All metadata
</span><span class="nb">list</span><span class="p">(</span><span class="n">sample</span><span class="p">.</span><span class="n">meta</span><span class="p">.</span><span class="n">items</span><span class="p">())[:</span><span class="mi">10</span><span class="p">]</span> <span class="c1"># see only first 10 entries</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[(u'P13MS', u'350'),
(u'$ETIM', u'13:21:42'),
(u'P8DISPLAY', u'LOG'),
(u'FSC ASF', u'0.74'),
(u'CYTNUM', u'1'),
(u'$ENDDATA', u'9998405 '),
(u'P2DISPLAY', u'LIN'),
(u'$ENDSTEXT', u'0'),
(u'LASER2NAME', u'Red')]
</code></pre></div></div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">sample</span><span class="p">.</span><span class="n">channel_names</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(u'FSC-A',
u'FSC-H',
u'FSC-W',
u'SSC-A',
u'SSC-H',
u'SSC-W',
u'B/E Alexa Fluor 488-A',
u'B/C PE-TexasRed-A',
u'B/B PerCP-Cy5-5-A',
u'YG/A PE-Cy7-A',
u'R/C APC-A',
u'R/B Alexa Fluor 700-A',
u'R/A APC-Cy7-A',
u'V/C Pacific Blue-A',
u'YG/E PE-A',
u'Time')
</code></pre></div></div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">sample</span><span class="p">.</span><span class="n">channels</span></code></pre></figure>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>$PnN</th>
<th>$PnB</th>
<th>$PnG</th>
<th>$PnE</th>
<th>$PnR</th>
<th>$PnV</th>
</tr>
<tr>
<th>Channel Number</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>FSC-A</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>244</td>
</tr>
<tr>
<th>2</th>
<td>FSC-H</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>244</td>
</tr>
<tr>
<th>3</th>
<td>FSC-W</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>244</td>
</tr>
<tr>
<th>4</th>
<td>SSC-A</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>276</td>
</tr>
<tr>
<th>5</th>
<td>SSC-H</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>276</td>
</tr>
<tr>
<th>6</th>
<td>SSC-W</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>276</td>
</tr>
<tr>
<th>7</th>
<td>B/E Alexa Fluor 488-A</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>469</td>
</tr>
<tr>
<th>8</th>
<td>B/C PE-TexasRed-A</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>450</td>
</tr>
<tr>
<th>9</th>
<td>B/B PerCP-Cy5-5-A</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>462</td>
</tr>
<tr>
<th>10</th>
<td>YG/A PE-Cy7-A</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>458</td>
</tr>
<tr>
<th>11</th>
<td>R/C APC-A</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>586</td>
</tr>
<tr>
<th>12</th>
<td>R/B Alexa Fluor 700-A</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>560</td>
</tr>
<tr>
<th>13</th>
<td>R/A APC-Cy7-A</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>618</td>
</tr>
<tr>
<th>14</th>
<td>V/C Pacific Blue-A</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>385</td>
</tr>
<tr>
<th>15</th>
<td>YG/E PE-A</td>
<td>32</td>
<td>1.0</td>
<td>[0, 0]</td>
<td>262144</td>
<td>443</td>
</tr>
<tr>
<th>16</th>
<td>Time</td>
<td>32</td>
<td>0.01</td>
<td>[0, 0]</td>
<td>262144</td>
<td>None</td>
</tr>
</tbody>
</table>
</div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># get forward vs side scatter of first 10 cells
</span><span class="n">sample</span><span class="p">.</span><span class="n">data</span><span class="p">[[</span><span class="s">'FSC-A'</span><span class="p">,</span> <span class="s">'SSC-A'</span><span class="p">]][:</span><span class="mi">10</span><span class="p">]</span></code></pre></figure>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>FSC-A</th>
<th>SSC-A</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>53618.921875</td>
<td>42435.988281</td>
</tr>
<tr>
<th>1</th>
<td>100054.664062</td>
<td>33288.808594</td>
</tr>
<tr>
<th>2</th>
<td>60825.039062</td>
<td>35168.011719</td>
</tr>
<tr>
<th>3</th>
<td>58227.640625</td>
<td>37189.890625</td>
</tr>
<tr>
<th>4</th>
<td>67312.617188</td>
<td>39781.621094</td>
</tr>
<tr>
<th>5</th>
<td>92615.437500</td>
<td>73334.914062</td>
</tr>
<tr>
<th>6</th>
<td>51280.519531</td>
<td>33090.449219</td>
</tr>
<tr>
<th>7</th>
<td>43725.859375</td>
<td>36683.550781</td>
</tr>
<tr>
<th>8</th>
<td>62111.902344</td>
<td>22713.960938</td>
</tr>
<tr>
<th>9</th>
<td>84667.843750</td>
<td>33091.320312</td>
</tr>
</tbody>
</table>
</div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Get all channels in one table
# this would be the main table to work on from here
</span><span class="n">sample</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="s">"Time"</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="c1"># show only first 10 cells</span></code></pre></figure>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>FSC-A</th>
<th>FSC-H</th>
<th>FSC-W</th>
<th>SSC-A</th>
<th>SSC-H</th>
<th>SSC-W</th>
<th>B/E Alexa Fluor 488-A</th>
<th>B/C PE-TexasRed-A</th>
<th>B/B PerCP-Cy5-5-A</th>
<th>YG/A PE-Cy7-A</th>
<th>R/C APC-A</th>
<th>R/B Alexa Fluor 700-A</th>
<th>R/A APC-Cy7-A</th>
<th>V/C Pacific Blue-A</th>
<th>YG/E PE-A</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>53618.921875</td>
<td>46566.0</td>
<td>75462.132812</td>
<td>42435.988281</td>
<td>40018.0</td>
<td>69495.859375</td>
<td>66.989998</td>
<td>59.160000</td>
<td>36.540001</td>
<td>5453.760254</td>
<td>2175.400146</td>
<td>624.150024</td>
<td>329.960022</td>
<td>9.130000</td>
<td>156.400009</td>
</tr>
<tr>
<th>1</th>
<td>100054.664062</td>
<td>90369.0</td>
<td>72560.093750</td>
<td>33288.808594</td>
<td>32241.0</td>
<td>67665.867188</td>
<td>62.639999</td>
<td>215.759995</td>
<td>97.440002</td>
<td>2553.920166</td>
<td>2539.670166</td>
<td>579.619995</td>
<td>252.580002</td>
<td>14.940000</td>
<td>720.359985</td>
</tr>
<tr>
<th>2</th>
<td>60825.039062</td>
<td>53044.0</td>
<td>75149.492188</td>
<td>35168.011719</td>
<td>33952.0</td>
<td>67883.210938</td>
<td>90.480003</td>
<td>125.279999</td>
<td>28.710001</td>
<td>6724.280273</td>
<td>1590.670044</td>
<td>669.410034</td>
<td>292.000000</td>
<td>6.640000</td>
<td>291.640015</td>
</tr>
<tr>
<th>3</th>
<td>58227.640625</td>
<td>50084.0</td>
<td>76192.140625</td>
<td>37189.890625</td>
<td>35568.0</td>
<td>68524.421875</td>
<td>69.599998</td>
<td>334.950012</td>
<td>139.199997</td>
<td>2679.040039</td>
<td>1762.950073</td>
<td>458.440002</td>
<td>202.210007</td>
<td>9.960000</td>
<td>917.239990</td>
</tr>
<tr>
<th>4</th>
<td>67312.617188</td>
<td>57849.0</td>
<td>76257.140625</td>
<td>39781.621094</td>
<td>38333.0</td>
<td>68012.632812</td>
<td>77.430000</td>
<td>192.270004</td>
<td>100.919998</td>
<td>4035.120117</td>
<td>1524.970093</td>
<td>475.230011</td>
<td>239.440002</td>
<td>9.960000</td>
<td>608.119995</td>
</tr>
<tr>
<th>5</th>
<td>92615.437500</td>
<td>72473.0</td>
<td>83750.437500</td>
<td>73334.914062</td>
<td>66461.0</td>
<td>72314.242188</td>
<td>120.059998</td>
<td>2424.689941</td>
<td>2306.370117</td>
<td>253.919998</td>
<td>237.980011</td>
<td>241.630005</td>
<td>66.430000</td>
<td>34.029999</td>
<td>6612.040039</td>
</tr>
<tr>
<th>6</th>
<td>51280.519531</td>
<td>44760.0</td>
<td>75083.117188</td>
<td>33090.449219</td>
<td>31656.0</td>
<td>68505.679688</td>
<td>86.129997</td>
<td>57.420002</td>
<td>32.189999</td>
<td>2461.920166</td>
<td>1025.650024</td>
<td>365.000000</td>
<td>151.839996</td>
<td>43.989998</td>
<td>80.040001</td>
</tr>
<tr>
<th>7</th>
<td>43725.859375</td>
<td>38597.0</td>
<td>74244.570312</td>
<td>36683.550781</td>
<td>35153.0</td>
<td>68389.421875</td>
<td>153.990005</td>
<td>202.710007</td>
<td>163.559998</td>
<td>2104.040039</td>
<td>1400.140015</td>
<td>401.500000</td>
<td>234.330002</td>
<td>191.729996</td>
<td>734.160034</td>
</tr>
<tr>
<th>8</th>
<td>62111.902344</td>
<td>55068.0</td>
<td>73918.890625</td>
<td>22713.960938</td>
<td>21820.0</td>
<td>68221.000000</td>
<td>73.080002</td>
<td>46.110001</td>
<td>81.779999</td>
<td>4808.839844</td>
<td>1565.119995</td>
<td>646.049988</td>
<td>348.940002</td>
<td>12.450000</td>
<td>123.279999</td>
</tr>
<tr>
<th>9</th>
<td>84667.843750</td>
<td>75601.0</td>
<td>73395.750000</td>
<td>33091.320312</td>
<td>32645.0</td>
<td>66432.007812</td>
<td>31.320000</td>
<td>15.660000</td>
<td>13.050000</td>
<td>3248.520020</td>
<td>1015.430054</td>
<td>360.619995</td>
<td>141.620010</td>
<td>6.640000</td>
<td>75.440002</td>
</tr>
</tbody>
</table>
</div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Let's plot some things:
</span>
<span class="c1"># plot with provided library wrappers
</span><span class="n">_</span> <span class="o">=</span> <span class="n">sample</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="s">'FSC-A'</span><span class="p">,</span> <span class="s">'SSC-A'</span><span class="p">],</span> <span class="n">bins</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
<span class="c1"># we can obviously also plot with base matplotlib + seaborn
</span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">pylab</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span><span class="s">"whitegrid"</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">jointplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">sample</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="s">'FSC-A'</span><span class="p">],</span> <span class="n">y</span><span class="o">=</span><span class="n">sample</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="s">'SSC-A'</span><span class="p">],</span>
<span class="n">xlim</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">250000</span><span class="p">),</span> <span class="n">ylim</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">250000</span><span class="p">),</span>
<span class="n">alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><seaborn.axisgrid.JointGrid at 0x7f334e8f0b50>
</code></pre></div></div>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/flow/flow_10_1.png" align="middle" style="width: 500px;" />
</div>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/flow/flow_10_2.png" align="middle" style="width: 500px;" />
</div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Now let's make a gate using the interactive interface
# sample.view_interactively()</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">FlowCytometryTools</span> <span class="kn">import</span> <span class="n">ThresholdGate</span><span class="p">,</span> <span class="n">PolyGate</span>
<span class="c1"># Four threshold gates
</span><span class="n">gate1</span> <span class="o">=</span> <span class="n">ThresholdGate</span><span class="p">(</span><span class="mi">17500</span><span class="p">,</span> <span class="s">'FSC-A'</span><span class="p">,</span> <span class="n">region</span><span class="o">=</span><span class="s">'above'</span><span class="p">)</span>
<span class="n">gate2</span> <span class="o">=</span> <span class="n">ThresholdGate</span><span class="p">(</span><span class="mi">60000</span><span class="p">,</span> <span class="s">'SSC-A'</span><span class="p">,</span> <span class="n">region</span><span class="o">=</span><span class="s">'below'</span><span class="p">)</span>
<span class="n">gate3</span> <span class="o">=</span> <span class="n">ThresholdGate</span><span class="p">(</span><span class="mi">110000</span><span class="p">,</span> <span class="s">'FSC-A'</span><span class="p">,</span> <span class="n">region</span><span class="o">=</span><span class="s">'below'</span><span class="p">)</span>
<span class="n">gate4</span> <span class="o">=</span> <span class="n">ThresholdGate</span><span class="p">(</span><span class="mi">10000</span><span class="p">,</span> <span class="s">'SSC-A'</span><span class="p">,</span> <span class="n">region</span><span class="o">=</span><span class="s">'above'</span><span class="p">)</span>
<span class="c1"># Similar thing with a polygon
# drawn interactively
</span><span class="n">gate5</span> <span class="o">=</span> <span class="n">PolyGate</span><span class="p">(</span>
<span class="p">[(</span><span class="mf">3.140e+04</span><span class="p">,</span> <span class="mf">9.951e+04</span><span class="p">),</span> <span class="p">(</span><span class="mf">1.108e+04</span><span class="p">,</span> <span class="mf">5.092e+04</span><span class="p">),</span> <span class="p">(</span><span class="mf">1.421e+04</span><span class="p">,</span> <span class="mf">3.304e+04</span><span class="p">),</span>
<span class="p">(</span><span class="mf">2.385e+04</span><span class="p">,</span> <span class="mf">2.426e+04</span><span class="p">),</span> <span class="p">(</span><span class="mf">3.219e+04</span><span class="p">,</span> <span class="mf">1.583e+04</span><span class="p">),</span> <span class="p">(</span><span class="mf">3.818e+04</span><span class="p">,</span> <span class="mf">5.706e+03</span><span class="p">),</span>
<span class="p">(</span><span class="mf">5.251e+04</span><span class="p">,</span> <span class="mf">4.019e+03</span><span class="p">),</span> <span class="p">(</span><span class="mf">1.208e+05</span><span class="p">,</span> <span class="mf">1.178e+04</span><span class="p">),</span> <span class="p">(</span><span class="mf">1.408e+05</span><span class="p">,</span> <span class="mf">5.429e+04</span><span class="p">),</span>
<span class="p">(</span><span class="mf">3.505e+04</span><span class="p">,</span> <span class="mf">9.681e+04</span><span class="p">),</span> <span class="p">(</span><span class="mf">3.114e+04</span><span class="p">,</span> <span class="mf">9.951e+04</span><span class="p">),</span> <span class="p">(</span><span class="mf">3.219e+04</span><span class="p">,</span> <span class="mf">9.613e+04</span><span class="p">)],</span>
<span class="p">(</span><span class="s">'FSC-A'</span><span class="p">,</span> <span class="s">'SSC-A'</span><span class="p">),</span> <span class="n">region</span><span class="o">=</span><span class="s">'in'</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'gate4'</span><span class="p">)</span>
<span class="n">_</span> <span class="o">=</span> <span class="n">sample</span><span class="p">.</span><span class="n">plot</span><span class="p">([</span><span class="s">'FSC-A'</span><span class="p">,</span> <span class="s">'SSC-A'</span><span class="p">],</span> <span class="n">bins</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">gates</span><span class="o">=</span><span class="p">[</span><span class="n">gate1</span><span class="p">,</span> <span class="n">gate2</span><span class="p">,</span> <span class="n">gate3</span><span class="p">,</span> <span class="n">gate4</span><span class="p">,</span> <span class="n">gate5</span><span class="p">])</span></code></pre></figure>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/flow/flow_12_0.png" align="middle" style="width: 500px;" />
</div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Plot channels individually
</span><span class="n">_</span> <span class="o">=</span> <span class="n">sample</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="s">'FSC-A'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span></code></pre></figure>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/flow/flow_13_0.png" align="middle" style="width: 500px;" />
</div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Log transform and plot again
</span><span class="n">_</span> <span class="o">=</span> <span class="n">sample</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="s">'hlog'</span><span class="p">,</span> <span class="n">channels</span><span class="o">=</span><span class="p">[</span><span class="s">'FSC-A'</span><span class="p">]).</span><span class="n">plot</span><span class="p">(</span><span class="s">'FSC-A'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span></code></pre></figure>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/flow/flow_14_0.png" align="middle" style="width: 500px;" />
</div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Log transform and plot different channel
</span><span class="n">_</span> <span class="o">=</span> <span class="n">sample</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="s">'hlog'</span><span class="p">,</span> <span class="n">channels</span><span class="o">=</span><span class="p">[</span><span class="s">'R/B Alexa Fluor 700-A'</span><span class="p">]).</span><span class="n">plot</span><span class="p">(</span><span class="s">'R/B Alexa Fluor 700-A'</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span></code></pre></figure>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/flow/flow_15_0.png" align="middle" style="width: 500px;" />
</div>
Quantile tracks
2016-10-08T00:00:00+00:00
https://andre-rendeiro.com/2016/10/08/quantile_track
<p>The code we shared on our latest work (see <a href="http://dx.doi.org/10.1038/ncomms11938">doi:10.1038/ncomms11938</a>) contained all parts necessary to reproduce the figures in the paper, but there was one part that I didn’t share. In <a href="http://www.nature.com/articles/ncomms11938/figures/2">Figure 2</a>, you can see percentiles of normalized ATAC-seq signal for the 88 samples used in the study - the code in question produces bigWig files used in this visualization.</p>
<p>I hadn’t shared it because it was a bit challenging and I didn’t manage to make it as system-independent as I would have liked. Most of the code for the paper is in Python, but for this I used a combination of Python, Bash and GNU programs to handle the amount of signal genome-wide. In addition, I made extensive use of an HPC cluster with SLURM as the scheduler to speed things up.</p>
<p>Since two people have now asked me to share the code, I am doing so here.</p>
<p><br /></p>
<p>This is a brief explanation of the steps:</p>
<ul>
<li>
<p>The entry point is “quantile_tracks.py”. It is meant to produce two files with information on the BAM file paths and sizes.</p>
</li>
<li>
<p>The next script is “quantile_tracks.sh”, which does the real work. It:</p>
<ol>
<li>
<p>makes windows across the genome (here I chose to have 1bp windows, which was a bit overkill) for each chromosome;</p>
</li>
<li>
<p>computes the read coverage in each of these windows for each chromosome, for each bam file;</p>
</li>
<li>
<p>pastes together the coverage of all samples per chromosome;</p>
</li>
<li>
<p>splits these into chunks;</p>
</li>
<li>
<p>computes quantiles across samples for each chunk (uses the <code class="language-plaintext highlighter-rouge">quantilize.py</code> script);</p>
</li>
<li>
<p>concatenates the quantiles of the chunks back together and,</p>
</li>
<li>
<p>makes a bigWig file for each quantile. In pretty much every step, a swarm of jobs is launched to the cluster, so if you have a different HPC configuration (which is almost certain) you’ll have to adapt that.</p>
</li>
</ol>
</li>
</ul>
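<p>As a rough illustration of step 5 (this is not the actual <code class="language-plaintext highlighter-rouge">quantilize.py</code> from the gist), the per-chunk computation could be sketched in plain Python as below. The function names, the toy input and the choice of normalizing to reads per million are assumptions made for this example only:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in 0-100) of an already sorted list."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def quantiles_across_samples(coverage_rows, library_sizes, quantiles=(5, 25, 50, 75, 95)):
    """For each window (one row, one count per sample), return the requested percentiles.

    coverage_rows: list of rows, each holding the raw coverage of one window in every sample.
    library_sizes: total number of reads per sample (assumed normalization factor).
    """
    result = []
    for row in coverage_rows:
        # reads per million, so that samples of different depth are comparable
        normalized = sorted(count / size * 1e6 for count, size in zip(row, library_sizes))
        result.append([percentile(normalized, q) for q in quantiles])
    return result


# toy chunk: 2 windows (rows) x 4 samples (columns), all with one million reads
chunk = [[0, 1, 2, 3],
         [10, 10, 10, 10]]
print(quantiles_across_samples(chunk, [1e6] * 4, quantiles=(0, 50, 100)))</code></pre></figure>
<p>Each column of the output then corresponds to one quantile track, which is what ends up in a separate bigWig file.</p>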
<p>If it sounds complicated and unoptimized, it’s because it is. I didn’t bother optimizing this since I only needed to run it once and I had no resource limitations. I did explore other alternatives, for example using the pyBigWig library to handle everything from within Python, but at the time it seemed to be in early development and had a lot of problems. Other solutions failed for other reasons or did not give me as much control as I needed (e.g. to normalize samples by coverage).</p>
<p>Note that you’ll need some tools for this, such as gzip, zcat, split, awk, bedtools, bgzip (from htslib) and the UCSC tools (for bedGraphToBigWig), which are all quite common, plus a few static files like the chromosome sizes of your genome assembly. Also, depending on the resolution you choose, this might take a lot of disk space (on the order of terabytes).</p>
<p>In the end it does produce some nice visualizations, but I’m not sure all this trouble was really worth it - up to you to decide.</p>
<p><br /></p>
<p>If anyone can make this (the concept, not really my quickly-hacked code) into a nice tool, please feel free!</p>
<script src="https://gist.github.com/f91ac2c554557eb2f1e4fbe8f234e14e.js"> </script>
Indel Detection From Amplicons
2016-06-21T00:00:00+00:00
https://andre-rendeiro.com/2016/06/21/indel_detection_from_amplicons
<style>
.centerImages {
line-height:200px;
text-align:center;
margin-left: auto;
margin-right: auto;
width: 90%;
vertical-align:middle;
}
.ulpost {list-style-type: none; margin: 0; padding: 0;}
.lipost {display: inline; margin-right: 20px;}
.lipost>a {width: 120px;}
</style>
<p>There is plenty of good software out there for indel detection, but obviously I had to get some data which didn’t fit any of them (AFAIK).
In my case I had some amplicon libraries of loci targeted with CRISPR (two cell lines, three guide RNAs each) from a MiSeq run.
What was particular about these was that the amplicon library was digested after amplification to provide more fragments and increase the likelihood that the reads would overlap the indel event, so in practice my reads didn’t start and end with the ends of the amplicon sequence - so <strong>I quickly hacked something together for this purpose</strong> (<strong>sharing for preservation, use at own risk</strong>).</p>
<p><a href="https://github.com/afrendeiro/amplicon_indel_detection">All code and outputs can be seen here</a>.</p>
<p><br /></p>
<p>My first approach was to simply trim the reads of any primers/adapters, map them to the amplicons and count indels based on the CIGAR string of the reads.</p>
<p>The second one (at the request of a colleague) was to observe the distance between the extremities of the amplicon in the reads without alignment.</p>
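<p>To give an idea of what the two approaches boil down to, here is a minimal sketch (not the code in the repository). The function names and the flanking sequences are hypothetical; the first function sums inserted and deleted bases from a CIGAR string, the second measures the distance between two flanking sequences directly in the read:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import re

# one CIGAR operation: a length followed by an operation code
CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(cigar):
    """Return (inserted_bases, deleted_bases) encoded in a CIGAR string."""
    inserted = deleted = 0
    for length, operation in CIGAR_OPERATION.findall(cigar):
        if operation == "I":
            inserted += int(length)
        elif operation == "D":
            deleted += int(length)
    return inserted, deleted


def span_between_flanks(read, left_flank, right_flank):
    """Alignment-free alternative: distance between two flanking sequences.

    Returns the number of bases from the start of left_flank to the end of
    right_flank, or None if either flank is missing from the read.
    """
    start = read.find(left_flank)
    if start == -1:
        return None
    end = read.find(right_flank, start + len(left_flank))
    if end == -1:
        return None
    return end + len(right_flank) - start


print(indels_from_cigar("50M2D30M1I20M"))  # (1, 2)
print(span_between_flanks("TTACGTGGGGCCATAA", "ACGT", "CCAT"))  # 12</code></pre></figure>
<p>With the second function, comparing the observed span to the span expected in the unedited amplicon gives the net size of the indel without any alignment.</p>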
<p><br /></p>
<p>Here’s what it looks like for the first method:</p>
<p><img src="https://rawgithub.com/afrendeiro/amplicon_indel_detection/master/results/editing_efficiency.indels.svg" width="90%" /></p>
<div class="centerImages">
<img src="https://rawgithub.com/afrendeiro/amplicon_indel_detection/master/results/editing_efficiency.indels_percentage.svg" align="middle" style="width: 40%;" />
</div>
<p><br /></p>
<p>And the same with the “grep method”:</p>
<p><img src="https://rawgithub.com/afrendeiro/amplicon_indel_detection/master/results/editing_efficiency.read_sizes.svg" width="90%" /></p>
<div class="centerImages">
<img src="https://rawgithub.com/afrendeiro/amplicon_indel_detection/master/results/editing_efficiency.sizes_percentage.svg" align="middle" style="width: 40%;" />
</div>
<p>Although the methods differ in sensitivity, both give very similar estimates of indel percentages.
Unfortunately, the editing efficiency in these experiments was not very high due to a problem in the lab, which has since been solved.</p>
<p><br /></p>
<p>The only thing missing is the rate of in-frame indels. For that I’d need to look up the coordinates of the transcript in relation to the amplicon, since I aligned the reads to the “amplicon library” rather than to the genome, but that was too much work considering that people generally just consider every indel whose length is a multiple of 3 to be in frame.</p>
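<p>That common shortcut is easy to sketch. The function name and the toy indel lengths below are made up for illustration, and the rule only makes sense if the indel falls within coding sequence:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from collections import Counter


def frame_summary(indel_lengths):
    """Count indels whose length is a multiple of 3 (in frame) versus the rest.

    indel_lengths: net length of each indel, positive for insertions and
    negative for deletions.
    """
    counts = Counter(
        "in_frame" if length % 3 == 0 else "frameshift"
        for length in indel_lengths
    )
    return dict(counts)


print(frame_summary([1, 3, -3, -2, 6]))</code></pre></figure>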
Enrichr
2016-04-16T00:00:00+00:00
https://andre-rendeiro.com/2016/04/16/enrichr
<p>A great tool I found recently is <a href="http://amp.pharm.mssm.edu/Enrichr/">Enrichr</a> by the Ma’ayan lab.</p>
<p>Using its API is straightforward, and I must say that it is really fast and they do support a high number of queries (don’t abuse it, obviously).</p>
<p>Here’s a Python program to use the API easily:</p>
<script src="https://gist.github.com/474cb1d9be3176ae9ac20b55c6369ac9.js"> </script>
SNPs from the GWAS catalog as a LOLA region set
2016-01-28T00:00:00+00:00
https://andre-rendeiro.com/2016/01/28/gwas_catalog
<p>Ever wondered if some genomic regions of interest overlap significantly with known (or your own) sets of regions?</p>
<p><a href="http://databio.org/lola/">LOLA</a> is an R package that handles that for you. It includes a “core” set of regions from public databases and lets you extend them with your own regions of interest.</p>
<p>I wanted to include the position of every known SNP associated with a trait (especially clinical ones) in the database, preferably grouped by the broad type of trait. Here’s what I came up with using <a href="https://www.ebi.ac.uk/gwas/">EBI’s GWAS catalog</a>.</p>
<h3 id="creating-a-bed-file-with-snps-for-each-disease-group">Creating a bed file with SNPs for each disease group</h3>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="c1"># get GWAS catalog
</span><span class="n">os</span><span class="p">.</span><span class="n">system</span><span class="p">(</span><span class="s">"wget -O gwas_catalog.tsv http://www.ebi.ac.uk/gwas/api/search/downloads/alternative"</span><span class="p">)</span> <span class="c1"># gwas db dump
</span><span class="n">os</span><span class="p">.</span><span class="n">system</span><span class="p">(</span><span class="s">"wget http://www.ebi.ac.uk/fgpt/gwas/ontology/GWAS-EFO-Mappings201405.xlsx"</span><span class="p">)</span> <span class="c1"># gwas mapping/ontology
</span>
<span class="c1"># read in catalog and mappings
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"gwas_catalog.tsv"</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
<span class="n">mapping</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s">"GWAS-EFO-Mappings201405.xlsx"</span><span class="p">)</span>
<span class="c1"># merge both
</span><span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">mapping</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="p">[</span><span class="s">'PUBMEDID'</span><span class="p">])</span>
<span class="c1"># subset columns
</span><span class="n">df3</span> <span class="o">=</span> <span class="n">df2</span><span class="p">[[</span><span class="s">'CHR_ID'</span><span class="p">,</span> <span class="s">'CHR_POS'</span><span class="p">,</span> <span class="s">'PUBMEDID'</span><span class="p">,</span> <span class="s">'DISEASE/TRAIT'</span><span class="p">,</span> <span class="s">'PARENT'</span><span class="p">,</span> <span class="s">'SNPS'</span><span class="p">,</span> <span class="s">'STRONGEST SNP-RISK ALLELE'</span><span class="p">,</span> <span class="s">'P-VALUE'</span><span class="p">,</span> <span class="s">'OR or BETA'</span><span class="p">]]</span>
<span class="n">df3</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'chr'</span><span class="p">,</span> <span class="s">'pos'</span><span class="p">,</span> <span class="s">'pubmed_id'</span><span class="p">,</span> <span class="s">'trait'</span><span class="p">,</span> <span class="s">'ontology_group'</span><span class="p">,</span> <span class="s">'snp'</span><span class="p">,</span> <span class="s">'snp_strongest_allele'</span><span class="p">,</span> <span class="s">'p_value'</span><span class="p">,</span> <span class="s">'beta'</span><span class="p">]</span>
<span class="n">df3</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"gwas_catalog.csv"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="c1"># filter on p-value
</span><span class="n">df4</span> <span class="o">=</span> <span class="n">df3</span><span class="p">[</span><span class="n">df3</span><span class="p">[</span><span class="s">'p_value'</span><span class="p">]</span> <span class="o"><</span> <span class="mf">5e-8</span><span class="p">]</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="s">"regions"</span><span class="p">):</span>
    <span class="n">os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="s">"regions"</span><span class="p">)</span>
<span class="c1"># export bed file for each ontology group
</span><span class="n">regionset_index</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="k">for</span> <span class="n">group</span> <span class="ow">in</span> <span class="n">df4</span><span class="p">[</span><span class="s">'ontology_group'</span><span class="p">].</span><span class="n">unique</span><span class="p">():</span>
    <span class="n">df5</span> <span class="o">=</span> <span class="n">df4</span><span class="p">[</span><span class="n">df4</span><span class="p">[</span><span class="s">'ontology_group'</span><span class="p">]</span> <span class="o">==</span> <span class="n">group</span><span class="p">]</span>
    <span class="n">df5</span> <span class="o">=</span> <span class="n">df5</span><span class="p">[[</span><span class="s">'chr'</span><span class="p">,</span> <span class="s">'pos'</span><span class="p">]]</span>
    <span class="n">df5</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'chr'</span><span class="p">,</span> <span class="s">'start'</span><span class="p">]</span>
    <span class="c1"># drop entries without a position
</span>    <span class="n">df5</span><span class="p">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">how</span><span class="o">=</span><span class="s">'any'</span><span class="p">,</span> <span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">'chr'</span><span class="p">,</span> <span class="s">'start'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">df5</span><span class="p">[</span><span class="s">'chr'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="s">'chr'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">i</span><span class="p">))</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">df5</span><span class="p">[</span><span class="s">'chr'</span><span class="p">]]</span>
    <span class="n">df5</span><span class="p">[</span><span class="s">'end'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df5</span><span class="p">[</span><span class="s">'start'</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="n">df5</span><span class="p">[</span><span class="s">'start'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df5</span><span class="p">[</span><span class="s">'start'</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
    <span class="n">df5</span><span class="p">[</span><span class="s">'end'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df5</span><span class="p">[</span><span class="s">'end'</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
    <span class="c1"># write bed file
</span>    <span class="n">group</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="s">" "</span><span class="p">,</span> <span class="s">"_"</span><span class="p">,</span> <span class="n">group</span><span class="p">).</span><span class="n">lower</span><span class="p">()</span>
    <span class="n">df5</span><span class="p">.</span><span class="n">drop_duplicates</span><span class="p">().</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"regions/gwas_catalog.%s.bed"</span> <span class="o">%</span> <span class="n">group</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="c1"># save in regionset index
</span>    <span class="n">regionset_index</span> <span class="o">=</span> <span class="n">regionset_index</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">([</span><span class="s">"Human"</span><span class="p">,</span> <span class="s">"SNPs in gwas catalog - %s"</span> <span class="o">%</span> <span class="n">group</span><span class="p">,</span> <span class="s">"GWAS catalog"</span><span class="p">,</span> <span class="s">"gwas_catalog.%s.bed"</span> <span class="o">%</span> <span class="n">group</span><span class="p">]),</span> <span class="n">ignore_index</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># save regionset index
</span><span class="n">regionset_index</span><span class="p">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">"species"</span><span class="p">,</span> <span class="s">"description"</span><span class="p">,</span> <span class="s">"dataSource"</span><span class="p">,</span> <span class="s">"filename"</span><span class="p">]</span>
<span class="n">regionset_index</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"index.txt"</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span></code></pre></figure>
<p>You will have a file named <code class="language-plaintext highlighter-rouge">index.txt</code> enumerating the various region sets, which are inside a folder named <code class="language-plaintext highlighter-rouge">regions</code>.</p>
<h3 id="documenting-your-region-sets">Documenting your region sets</h3>
<p>Simply create a tab-delimited file in the same folder named <code class="language-plaintext highlighter-rouge">collection.txt</code> with the following information:</p>
<table>
<thead>
<tr>
<th>collector</th>
<th>date</th>
<th>source</th>
<th>description</th>
</tr>
</thead>
<tbody>
<tr>
<td>arendeiro</td>
<td>2016-01-28</td>
<td>customRegionDB/hg38/gwas</td>
<td>GWAS from EBI’s GWAS catalog (http://www.ebi.ac.uk/gwas/)</td>
</tr>
</tbody>
</table>
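<p>If you prefer to generate that file from Python as well, a minimal sketch could look like the one below. The function name is hypothetical, and it assumes the file should be plain tab-delimited text with a header row, as in the table above:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">import csv
import os


def write_collection_file(folder, collector, date, source, description):
    """Write a one-row, tab-delimited collection.txt into the given folder."""
    path = os.path.join(folder, "collection.txt")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # header row followed by the single row describing the collection
        writer.writerow(["collector", "date", "source", "description"])
        writer.writerow([collector, date, source, description])
    return path</code></pre></figure>
<p>Calling it with the folder that holds <code class="language-plaintext highlighter-rouge">index.txt</code> and the four values from the table would produce the file described above.</p>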
Survival analysis with lifelines - part 1
2016-01-04T00:00:00+00:00
https://andre-rendeiro.com/2016/01/04/survival_analysis_with_lifelines_part1
<blockquote>
<p>This post is available as a <a href="http://jupyter.readthedocs.org/">Jupyter notebook</a> <a href="http://nbviewer.jupyter.org/github/afrendeiro/afrendeiro.github.io/blob/master/data/notebooks/lifelines_survival_part1.ipynb">here</a>.</p>
</blockquote>
<p>The <a href="http://lifelines.readthedocs.org/">lifelines package</a> is a well documented, easy-to-use Python package for survival analysis.</p>
<p>I had never done any survival analysis, but the fact that the package has great documentation encouraged me to venture into the field. From the documentation I was able to understand the key concepts of survival analysis and run a few simple analyses on clinical data gathered by our collaborators from a cohort of cancer patients. This obviously does not make it a replacement for proper study of the field, but I nonetheless highly recommend reading the whole documentation to beginners on the topic, and the package itself to anyone working in the field.</p>
<h3 id="getting-our-hands-dirty">Getting our hands dirty</h3>
<p><small><strong>Note:</strong> Although these data were already anonymized, I have added some jitter so that the values differ from the real ones.</small></p>
<p>Although all one needs for survival analysis is two arrays, one with the length of time each patient was observed and one stating whether death occurred during that time, in reality you’re more likely to get from clinicians an Excel file with dates of birth, diagnosis, and death along with other relevant information on the clinical cohort.</p>
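<p>To make the “two arrays” point concrete, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit estimate. It is only meant to show what gets computed from those two arrays (the toy values are made up, and in practice you would use lifelines, which also provides confidence intervals and plotting):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">from itertools import groupby


def kaplan_meier(durations, event_observed):
    """Product-limit estimate of survival from two parallel lists.

    durations: time each patient was observed.
    event_observed: True if death was observed at that time, False if censored.
    Returns a list of (time, survival_probability), one entry per time with a death.
    """
    records = sorted(zip(durations, event_observed))
    at_risk = len(records)
    survival = 1.0
    curve = []
    # group patients that share the same observation time
    for time, group in groupby(records, key=lambda record: record[0]):
        group = list(group)
        deaths = sum(1 for _, observed in group if observed)
        if deaths:
            # multiply by the fraction of at-risk patients surviving this time
            survival *= 1 - deaths / at_risk
            curve.append((time, survival))
        # everyone observed up to this time leaves the risk set
        at_risk -= len(group)
    return curve


# five toy patients: observation time in months, and whether death was observed
print(kaplan_meier([5, 8, 8, 12, 20], [True, False, True, True, False]))</code></pre></figure>
<p>For these toy values the estimated survival drops to roughly 0.8, 0.6 and 0.3 at months 5, 8 and 12; censored patients never lower the curve, they only shrink the number of patients at risk.</p>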
<p>Let’s read some data in and transform those fields into the time we have been observing the patient (from diagnosis to the last checkup):</p>
<p><small><strong>Hint:</strong> make sure you tell pandas which columns hold dates and the format they are in for correct date parsing.</small></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">clinical</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span>
<span class="s">"clinical_data.csv"</span><span class="p">,</span>
<span class="n">parse_dates</span><span class="o">=</span><span class="p">[</span><span class="s">"patient_death_date"</span><span class="p">,</span> <span class="s">"diagnosis_date"</span><span class="p">,</span> <span class="s">"patient_last_checkup_date"</span><span class="p">],</span>
<span class="n">dayfirst</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># get duration of patient observation
</span><span class="n">clinical</span><span class="p">[</span><span class="s">"duration"</span><span class="p">]</span> <span class="o">=</span> <span class="n">clinical</span><span class="p">[</span><span class="s">"patient_last_checkup_date"</span><span class="p">]</span> <span class="o">-</span> <span class="n">clinical</span><span class="p">[</span><span class="s">"diagnosis_date"</span><span class="p">]</span>
<span class="n">clinical</span><span class="p">.</span><span class="n">head</span><span class="p">()</span></code></pre></figure>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>patient_last_checkup_date</th>
<th>diagnosis_date</th>
<th>patient_death_date</th>
<th>t1</th>
<th>t2</th>
<th>t3</th>
<th>t4</th>
<th>t5</th>
<th>t6</th>
<th>t7</th>
<th>t8</th>
<th>t9</th>
<th>t10</th>
<th>t11</th>
<th>t12</th>
<th>duration</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2011-12-05</td>
<td>1977-08-23</td>
<td>2011-12-19</td>
<td>F</td>
<td>A0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>12522 days</td>
</tr>
<tr>
<th>1</th>
<td>2015-01-15</td>
<td>1997-08-06</td>
<td>NaT</td>
<td>M</td>
<td>A0</td>
<td>1</td>
<td>NaN</td>
<td>NaN</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>6371 days</td>
</tr>
<tr>
<th>2</th>
<td>2011-11-14</td>
<td>1987-03-11</td>
<td>NaT</td>
<td>F</td>
<td>A0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>9014 days</td>
</tr>
<tr>
<th>3</th>
<td>2008-11-15</td>
<td>1992-04-27</td>
<td>2008-12-07</td>
<td>F</td>
<td>A0</td>
<td>1</td>
<td>2</td>
<td>1</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>6046 days</td>
</tr>
<tr>
<th>4</th>
<td>2008-10-09</td>
<td>1994-07-19</td>
<td>2009-12-22</td>
<td>M</td>
<td>A0</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>5196 days</td>
</tr>
</tbody>
</table>
</div>
<p>Let’s check globally how our patients are doing:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">lifelines</span> <span class="kn">import</span> <span class="n">KaplanMeierFitter</span>
<span class="c1"># Duration of patient following in months
</span><span class="n">T</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span><span class="p">.</span><span class="n">days</span> <span class="o">/</span> <span class="mf">30.</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">clinical</span><span class="p">[</span><span class="s">"duration"</span><span class="p">]]</span>
<span class="c1"># Observation of death in boolean
# True for observed event (death);
# else False (this includes death not observed; death by other causes)
</span><span class="n">C</span> <span class="o">=</span> <span class="p">[</span><span class="bp">True</span> <span class="k">if</span> <span class="n">i</span> <span class="ow">is</span> <span class="ow">not</span> <span class="n">pd</span><span class="p">.</span><span class="n">NaT</span> <span class="k">else</span> <span class="bp">False</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">clinical</span><span class="p">[</span><span class="s">"patient_death_date"</span><span class="p">]]</span>
<span class="n">fitter</span> <span class="o">=</span> <span class="n">KaplanMeierFitter</span><span class="p">()</span>
<span class="n">fitter</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">T</span><span class="p">,</span> <span class="n">event_observed</span><span class="o">=</span><span class="n">C</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"all patients"</span><span class="p">)</span>
<span class="n">fitter</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">show_censors</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.axes._subplots.AxesSubplot at 0x7f3812afaf10>
</code></pre></div></div>
<p><img src="/data/figures/survival_part1/output_6_1.png" alt="png" /></p>
<p>Now we want to split our cohort according to the values of several variables (<em>e.g.</em> gender, age, presence/absence of a clinical marker), check how survival progresses in each group, and test whether the differences between groups are significant.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">lifelines.statistics</span> <span class="kn">import</span> <span class="n">logrank_test</span>
<span class="kn">from</span> <span class="nn">matplotlib.offsetbox</span> <span class="kn">import</span> <span class="n">AnchoredText</span>
<span class="n">trait</span> <span class="o">=</span> <span class="s">"t1"</span> <span class="c1"># we pick one trait, gender in this case
</span>
<span class="n">label</span> <span class="o">=</span> <span class="n">clinical</span><span class="p">[</span><span class="n">trait</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
<span class="n">label</span> <span class="o">=</span> <span class="n">label</span><span class="p">[</span><span class="o">~</span><span class="n">pd</span><span class="p">.</span><span class="n">isnull</span><span class="p">(</span><span class="n">label</span><span class="p">)]</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># Separately for each class
# get index of patients from class
</span><span class="n">f</span> <span class="o">=</span> <span class="n">clinical</span><span class="p">[</span><span class="n">clinical</span><span class="p">[</span><span class="n">trait</span><span class="p">]</span> <span class="o">==</span> <span class="s">"F"</span><span class="p">].</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
<span class="c1"># fit the Kaplan-Meier estimator with the subset of data from the respective class
</span><span class="n">fitter</span><span class="p">.</span><span class="n">fit</span><span class="p">([</span><span class="n">T</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">f</span><span class="p">],</span> <span class="n">event_observed</span><span class="o">=</span><span class="p">[</span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">f</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">"F"</span><span class="p">)</span>
<span class="n">fitter</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">show_censors</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># get index of patients from class
</span><span class="n">m</span> <span class="o">=</span> <span class="n">clinical</span><span class="p">[</span><span class="n">clinical</span><span class="p">[</span><span class="n">trait</span><span class="p">]</span> <span class="o">==</span> <span class="s">"M"</span><span class="p">].</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
<span class="c1"># fit the Kaplan-Meier estimator with the subset of data from the respective class
</span><span class="n">fitter</span><span class="p">.</span><span class="n">fit</span><span class="p">([</span><span class="n">T</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">m</span><span class="p">],</span> <span class="n">event_observed</span><span class="o">=</span><span class="p">[</span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">m</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="s">"M"</span><span class="p">)</span>
<span class="n">fitter</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">show_censors</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># test difference between curves
</span><span class="n">p</span> <span class="o">=</span> <span class="n">logrank_test</span><span class="p">(</span>
<span class="p">[</span><span class="n">T</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">f</span><span class="p">],</span> <span class="p">[</span><span class="n">T</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">m</span><span class="p">],</span>
<span class="n">event_observed_A</span><span class="o">=</span><span class="p">[</span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">f</span><span class="p">],</span>
<span class="n">event_observed_B</span><span class="o">=</span><span class="p">[</span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">m</span><span class="p">]).</span><span class="n">p_value</span>
<span class="c1"># add p-value to plot
</span><span class="n">ax</span><span class="p">.</span><span class="n">add_artist</span><span class="p">(</span><span class="n">AnchoredText</span><span class="p">(</span><span class="s">"p = %f"</span> <span class="o">%</span> <span class="nb">round</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span> <span class="n">loc</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">frameon</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span></code></pre></figure>
<p><img src="/data/figures/survival_part1/output_8_1.png" alt="png" /></p>
<p>We can also investigate hazard over time instead of survival:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">lifelines</span> <span class="kn">import</span> <span class="n">NelsonAalenFitter</span>
<span class="n">fitter</span> <span class="o">=</span> <span class="n">NelsonAalenFitter</span><span class="p">()</span>
<span class="n">fitter</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">T</span><span class="p">,</span> <span class="n">event_observed</span><span class="o">=</span><span class="n">C</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"all patients"</span><span class="p">)</span>
<span class="n">fitter</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">show_censors</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span></code></pre></figure>
<p><img src="/data/figures/survival_part1/output_10_1.png" alt="png" /></p>
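<p>For intuition, the cumulative hazard that the Nelson-Aalen estimator reports can be sketched by hand: at each time with an observed event, add the number of events divided by the number of patients still at risk. The helper <code class="language-plaintext highlighter-rouge">nelson_aalen</code> and the toy durations below are hypothetical and not part of <code class="language-plaintext highlighter-rouge">lifelines</code>; this is only a minimal sketch, with none of the confidence intervals the library adds.</p>

```python
from collections import Counter


def nelson_aalen(durations, observed):
    """Return [(time, cumulative_hazard)] for each time with an observed event."""
    # number of observed events (deaths) at each time point
    events = Counter(t for t, e in zip(durations, observed) if e)
    # every patient leaves the risk set at their own duration (event or censoring)
    leaving = Counter(durations)
    at_risk = len(durations)
    cumulative = 0.0
    curve = []
    for t in sorted(leaving):
        if events[t]:
            # hazard increment: events at time t / patients at risk just before t
            cumulative += events[t] / at_risk
            curve.append((t, cumulative))
        at_risk -= leaving[t]
    return curve


# toy data (made up): durations in months and whether death was observed
toy_T = [2, 3, 3, 5, 8]
toy_C = [True, True, False, True, False]
for time, hazard in nelson_aalen(toy_T, toy_C):
    print(time, round(hazard, 3))
```

<p>Censored patients contribute to the risk set up to their last follow-up but never add to the numerator, which is the same role <code class="language-plaintext highlighter-rouge">event_observed</code> plays in the fitter calls above.</p>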
<p>Great, so if we make the code more general and wrap it into a function, we can see how the survival or hazard of patients with certain traits differs.</p>
<p>We can also investigate variables with more than two classes and compare them in a pairwise fashion.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">lifelines</span> <span class="kn">import</span> <span class="n">NelsonAalenFitter</span>
<span class="kn">import</span> <span class="nn">itertools</span>
<span class="k">def</span> <span class="nf">survival_plot</span><span class="p">(</span><span class="n">clinical</span><span class="p">,</span> <span class="n">fitter</span><span class="p">,</span> <span class="n">fitter_name</span><span class="p">,</span> <span class="n">feature</span><span class="p">,</span> <span class="n">time</span><span class="p">):</span>
    <span class="n">T</span> <span class="o">=</span> <span class="p">[</span><span class="n">i</span><span class="p">.</span><span class="n">days</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="mi">30</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">clinical</span><span class="p">[</span><span class="n">time</span><span class="p">]]</span>  <span class="c1"># duration of patient follow-up, in 30-day months
</span>    <span class="c1"># events:
</span>    <span class="c1"># True for observed event (death);
</span>    <span class="c1"># else False (this includes death not observed; death by other causes)
</span>    <span class="n">C</span> <span class="o">=</span> <span class="p">[</span><span class="bp">True</span> <span class="k">if</span> <span class="n">i</span> <span class="ow">is</span> <span class="ow">not</span> <span class="n">pd</span><span class="p">.</span><span class="n">NaT</span> <span class="k">else</span> <span class="bp">False</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">clinical</span><span class="p">[</span><span class="s">"patient_death_date"</span><span class="p">]]</span>
    <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="c1"># All patients together
</span>    <span class="n">fitter</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">T</span><span class="p">,</span> <span class="n">event_observed</span><span class="o">=</span><span class="n">C</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"all patients"</span><span class="p">)</span>
    <span class="n">fitter</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">show_censors</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="c1"># Filter feature types which are nan
</span>    <span class="n">label</span> <span class="o">=</span> <span class="n">clinical</span><span class="p">[</span><span class="n">feature</span><span class="p">].</span><span class="n">unique</span><span class="p">()</span>
    <span class="n">label</span> <span class="o">=</span> <span class="n">label</span><span class="p">[</span><span class="o">~</span><span class="n">pd</span><span class="p">.</span><span class="n">isnull</span><span class="p">(</span><span class="n">label</span><span class="p">)]</span>
    <span class="c1"># Separately for each class
</span>    <span class="k">for</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">label</span><span class="p">:</span>
        <span class="c1"># get patients from class
</span>        <span class="n">s</span> <span class="o">=</span> <span class="n">clinical</span><span class="p">[</span><span class="n">clinical</span><span class="p">[</span><span class="n">feature</span><span class="p">]</span> <span class="o">==</span> <span class="n">value</span><span class="p">].</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
        <span class="n">fitter</span><span class="p">.</span><span class="n">fit</span><span class="p">([</span><span class="n">T</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">s</span><span class="p">],</span> <span class="n">event_observed</span><span class="o">=</span><span class="p">[</span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">s</span><span class="p">],</span> <span class="n">label</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">value</span><span class="p">))</span>
        <span class="n">fitter</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">show_censors</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">fitter_name</span> <span class="o">==</span> <span class="s">"survival"</span><span class="p">:</span>
        <span class="n">ax</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">1.05</span><span class="p">)</span>
    <span class="c1"># Test pairwise differences between all classes
</span>    <span class="n">p_values</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">itertools</span><span class="p">.</span><span class="n">combinations</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="mi">2</span><span class="p">):</span>
        <span class="n">a_</span> <span class="o">=</span> <span class="n">clinical</span><span class="p">[</span><span class="n">clinical</span><span class="p">[</span><span class="n">feature</span><span class="p">]</span> <span class="o">==</span> <span class="n">a</span><span class="p">].</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
        <span class="n">b_</span> <span class="o">=</span> <span class="n">clinical</span><span class="p">[</span><span class="n">clinical</span><span class="p">[</span><span class="n">feature</span><span class="p">]</span> <span class="o">==</span> <span class="n">b</span><span class="p">].</span><span class="n">index</span><span class="p">.</span><span class="n">tolist</span><span class="p">()</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">logrank_test</span><span class="p">(</span>
            <span class="p">[</span><span class="n">T</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">a_</span><span class="p">],</span> <span class="p">[</span><span class="n">T</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">b_</span><span class="p">],</span>
            <span class="n">event_observed_A</span><span class="o">=</span><span class="p">[</span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">a_</span><span class="p">],</span>
            <span class="n">event_observed_B</span><span class="o">=</span><span class="p">[</span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">b_</span><span class="p">]).</span><span class="n">p_value</span>
        <span class="c1"># the full test result (before taking .p_value) can be inspected with print_summary()
</span>        <span class="n">p_values</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"p-value '"</span> <span class="o">+</span> <span class="s">" vs "</span><span class="p">.</span><span class="n">join</span><span class="p">([</span><span class="nb">str</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="nb">str</span><span class="p">(</span><span class="n">b</span><span class="p">)])</span> <span class="o">+</span> <span class="s">"': %f"</span> <span class="o">%</span> <span class="n">p</span><span class="p">)</span>
    <span class="c1"># Add p-values as anchored text
</span>    <span class="n">ax</span><span class="p">.</span><span class="n">add_artist</span><span class="p">(</span><span class="n">AnchoredText</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">p_values</span><span class="p">),</span> <span class="n">loc</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">frameon</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
    <span class="n">ax</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"%s - %s since diagnosis"</span> <span class="o">%</span> <span class="p">(</span><span class="n">feature</span><span class="p">,</span> <span class="n">fitter_name</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">features</span> <span class="o">=</span> <span class="p">[</span><span class="s">"t%i"</span> <span class="o">%</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">11</span><span class="p">)]</span>
<span class="c1"># For each clinical feature
</span><span class="k">for</span> <span class="n">feature</span> <span class="ow">in</span> <span class="n">features</span><span class="p">:</span>
    <span class="n">survival_plot</span><span class="p">(</span><span class="n">clinical</span><span class="p">,</span> <span class="n">KaplanMeierFitter</span><span class="p">(),</span> <span class="s">"survival"</span><span class="p">,</span> <span class="n">feature</span><span class="p">,</span> <span class="s">"duration"</span><span class="p">)</span>
    <span class="n">survival_plot</span><span class="p">(</span><span class="n">clinical</span><span class="p">,</span> <span class="n">NelsonAalenFitter</span><span class="p">(),</span> <span class="s">"hazard"</span><span class="p">,</span> <span class="n">feature</span><span class="p">,</span> <span class="s">"duration"</span><span class="p">)</span></code></pre></figure>
<p><img src="/data/figures/survival_part1/output_13_0.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_1.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_2.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_3.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_4.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_5.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_6.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_7.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_8.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_9.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_10.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_11.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_12.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_13.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_14.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_15.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_16.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_17.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_18.png" alt="png" /></p>
<p><img src="/data/figures/survival_part1/output_13_19.png" alt="png" /></p>
Pubmed wordcloud
2015-10-28T00:00:00+00:00
https://andre-rendeiro.com/2015/10/28/pubmed_wordcloud
<p>A friend asked me recently to make a wordcloud with publications that arise when searching Pubmed for a particular term.</p>
<p>My implementation uses NCBI’s <code class="language-plaintext highlighter-rouge">eutils</code> to search for the term and retrieve the matching Pubmed IDs, which I then use in a second step to query NCBI for the publication records. I chose to use the publication titles to count word frequencies and build the wordcloud after removing common words (<em>e.g.</em> articles).</p>
<script src="https://gist.github.com/6ec23ce2d0317a160e8f.js"> </script>
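<p>The counting step on its own can be sketched in a few lines. This is not the code from the gist: the stop-word list, the helper <code class="language-plaintext highlighter-rouge">word_frequencies</code> and the example titles are all made up for illustration, and the titles would in practice come from the NCBI query.</p>

```python
import re
from collections import Counter

# Hypothetical, deliberately short list of common words to drop
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "for", "by", "with", "on"}


def word_frequencies(titles, stop_words=STOP_WORDS):
    """Count how often each non-common word occurs across a list of titles."""
    counts = Counter()
    for title in titles:
        # lowercase and keep alphanumeric words, allowing internal hyphens
        words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
        counts.update(word for word in words if word not in stop_words)
    return counts


# Made-up titles standing in for the retrieved publication titles
titles = [
    "Persistence of the virus in the host",
    "Host response to persistent infection",
    "A model of infection and persistence",
]
freqs = word_frequencies(titles)
for word, count in freqs.most_common(3):
    print(word, count)
```

<p>The resulting word counts are what a wordcloud drawing tool takes as input, either as a frequency table or as the words repeated according to their counts.</p>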
<p>Since there are a few free online tools to actually draw the cloud (<em>e.g.</em> wordle.net), I didn’t bother implementing that, but I did search and there are a few interesting Python modules to do that as well (<a href="https://github.com/amueller/word_cloud">amueller/word_cloud</a> seemed quite feature-complete, for example).</p>
<h3 id="lcmv-wordcloud">LCMV wordcloud</h3>
<p>(Some common words removed on request (virus, etc…))</p>
<p><img src="/data/figures/lcmv_cloud.svg" alt="lcmv wordcloud" /></p>
NGS analysis objects with Python decorators
2015-08-25T00:00:00+00:00
https://andre-rendeiro.com/2015/08/25/analysis_object_with_decorators
<p>Recently I decided to start encapsulating my analyses in an object which has methods but also holds the data from the analysis.</p>
<p>This is convenient since I’ve <a href="2015-06-07-python_objects_for_ngs_projects_samples.md">implemented classes to handle pipeline objects</a> (projects, samples, annotation sheets).</p>
<p>Still, nothing particularly exciting:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pipelines</span> <span class="kn">import</span> <span class="n">Project</span>
<span class="k">class</span> <span class="nc">Analysis</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    Class to hold functions and data from analysis.
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data_dir</span><span class="p">,</span> <span class="n">plots_dir</span><span class="p">,</span> <span class="n">samples</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">data_dir</span> <span class="o">=</span> <span class="n">data_dir</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">plots_dir</span> <span class="o">=</span> <span class="n">plots_dir</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">samples</span> <span class="o">=</span> <span class="n">samples</span>
    <span class="k">def</span> <span class="nf">do_some_work</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="p">...</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">results</span> <span class="o">=</span> <span class="n">results</span>
<span class="n">prj</span> <span class="o">=</span> <span class="n">Project</span><span class="p">(</span><span class="s">"testprj"</span><span class="p">)</span>
<span class="n">prj</span><span class="p">.</span><span class="n">addSampleSheet</span><span class="p">(</span><span class="s">"metadata/sample_annotation.csv"</span><span class="p">)</span>
<span class="n">analysis</span> <span class="o">=</span> <span class="n">Analysis</span><span class="p">(</span><span class="s">"data"</span><span class="p">,</span> <span class="s">"results/plots"</span><span class="p">,</span> <span class="n">prj</span><span class="p">.</span><span class="n">samples</span><span class="p">)</span>
<span class="n">analysis</span><span class="p">.</span><span class="n">do_some_work</span><span class="p">()</span></code></pre></figure>
<h2 id="python-decorators">Python decorators</h2>
<p>It would be nice if each time I run certain methods of the <code class="language-plaintext highlighter-rouge">Analysis</code> class, the object itself were saved as a Python <code class="language-plaintext highlighter-rouge">pickle</code>.</p>
<p><strong>Enter <a href="https://en.wikipedia.org/wiki/Python_syntax_and_semantics#Decorators">Python decorators</a></strong>.</p>
<p>If you’re new to Python or Python decorators (I’ve known about them for a while but seldom use them), <a href="http://simeonfranklin.com/blog/2012/jul/1/python-decorators-in-12-steps/">here’s a really nice introduction</a> to <em>nested functions</em>, <em>closures</em> and <em>decorators</em>.</p>
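<p>As a quick reminder of those three ideas, here is a minimal sketch. The names <code class="language-plaintext highlighter-rouge">make_adder</code>, <code class="language-plaintext highlighter-rouge">shout</code> and <code class="language-plaintext highlighter-rouge">greet</code> are made up for illustration and have nothing to do with the analysis code.</p>

```python
def make_adder(n):
    # `add` is a nested function; it closes over `n` from the enclosing scope
    def add(x):
        return x + n
    return add


def shout(function):
    # a decorator takes a function and returns a replacement for it
    def wrapper(*args, **kwargs):
        return function(*args, **kwargs).upper()
    return wrapper


@shout
def greet(name):
    return "hello, " + name

# the @shout line above is equivalent to: greet = shout(greet)

add_two = make_adder(2)
print(add_two(3))      # 5
print(greet("world"))  # HELLO, WORLD
```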
<p>Here I write a decorator which calls the function and then performs its own action, which in this case is pickling the <code class="language-plaintext highlighter-rouge">Analysis</code> object:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pickle</span>

<span class="c1"># decorator for some methods of Analysis class
</span><span class="k">def</span> <span class="nf">pickle_me</span><span class="p">(</span><span class="n">function</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">wrapper</span><span class="p">(</span><span class="n">obj</span><span class="p">):</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">function</span><span class="p">(</span><span class="n">obj</span><span class="p">)</span>
        <span class="c1"># pickle the object once the method has run
</span>        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"analysis.pickle"</span><span class="p">,</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">handle</span><span class="p">:</span>
            <span class="n">pickle</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">handle</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">result</span>
    <span class="k">return</span> <span class="n">wrapper</span></code></pre></figure>
<p>To apply the decorator to some methods of the class, just add <code class="language-plaintext highlighter-rouge">@pickle_me</code> above the method definition:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Analysis</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="p">...</span>
    <span class="o">@</span><span class="n">pickle_me</span>
    <span class="k">def</span> <span class="nf">do_some_work</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="p">...</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">results</span> <span class="o">=</span> <span class="n">results</span></code></pre></figure>
<p>This works, however, only for methods accepting <code class="language-plaintext highlighter-rouge">self</code> as their sole argument, as in the example. To allow an arbitrary number of arguments to be passed to each method, we make the wrapper function accept one argument, which will be the <code class="language-plaintext highlighter-rouge">Analysis</code> object (I keep calling it <code class="language-plaintext highlighter-rouge">obj</code>), plus any number of further arguments (by using <code class="language-plaintext highlighter-rouge">*args</code>), which are passed on to the decorated function.</p>
<p>If I give the <code class="language-plaintext highlighter-rouge">Analysis</code> class an attribute holding the path where it should be pickled (<code class="language-plaintext highlighter-rouge">pickle_file</code>), then I can tell the decorator function to get it from the object itself:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pickle</span>

<span class="c1"># decorator for some methods of Analysis class
</span><span class="k">def</span> <span class="nf">pickle_me</span><span class="p">(</span><span class="n">function</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">wrapper</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">):</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">function</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span>
        <span class="c1"># pickle to the path held by the object itself
</span>        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">obj</span><span class="p">.</span><span class="n">pickle_file</span><span class="p">,</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">handle</span><span class="p">:</span>
            <span class="n">pickle</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">handle</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">result</span>
    <span class="k">return</span> <span class="n">wrapper</span>

<span class="k">class</span> <span class="nc">Analysis</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    Class to hold functions and data from analysis.
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pickle_file</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">pickle_file</span> <span class="o">=</span> <span class="n">pickle_file</span>
    <span class="o">@</span><span class="n">pickle_me</span>
    <span class="k">def</span> <span class="nf">do_some_work</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">pass</span>
    <span class="p">...</span></code></pre></figure>
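<p>To check that the round trip works, here is a self-contained sketch that decorates a method taking extra arguments, runs it, and loads the pickle back. It goes slightly beyond the version above, and these additions are my assumptions rather than part of the original code: the wrapper also forwards keyword arguments (<code class="language-plaintext highlighter-rouge">**kwargs</code>), it uses <code class="language-plaintext highlighter-rouge">functools.wraps</code> so the decorated method keeps its name, and the body of <code class="language-plaintext highlighter-rouge">do_some_work</code> is a placeholder.</p>

```python
import functools
import os
import pickle
import tempfile


def pickle_me(function):
    @functools.wraps(function)  # keep the decorated method's name and docstring
    def wrapper(obj, *args, **kwargs):
        result = function(obj, *args, **kwargs)
        # save the object to the path it carries itself
        with open(obj.pickle_file, "wb") as handle:
            pickle.dump(obj, handle)
        return result
    return wrapper


class Analysis(object):
    """Class to hold functions and data from analysis."""
    def __init__(self, pickle_file):
        self.pickle_file = pickle_file
        self.results = None

    @pickle_me
    def do_some_work(self, values, scale=1):
        # placeholder computation, for illustration only
        self.results = [v * scale for v in values]
        return self.results


# write the pickle to a temporary directory
pickle_file = os.path.join(tempfile.mkdtemp(), "analysis.pickle")
analysis = Analysis(pickle_file)
analysis.do_some_work([1, 2, 3], scale=10)

# load the object back and check that the results survived
with open(pickle_file, "rb") as handle:
    restored = pickle.load(handle)
print(restored.results)  # [10, 20, 30]
```

<p>Note that unpickling needs the <code class="language-plaintext highlighter-rouge">Analysis</code> class to be importable under the same module name it had when the object was saved.</p>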
ATAC-seq library nucleosome fitting
2015-07-31T00:00:00+00:00
https://andre-rendeiro.com/2015/07/31/nucleosome_fit
<p><a href="https://github.com/dbrg77/ATAC/blob/master/ATAC_seq_read_length_curve_fitting.ipynb">This great iPython notebook</a> on ATAC-seq nucleosome fitting gave me exactly what I needed for ATAC-seq data analysis.</p>
<p>I’ve added a few more simple features, adapted the fitting parameters to data produced in our lab and wrapped it all in a single function, which I will call during ATAC-seq pipeline runs.</p>
<p>Some metrics will be valuable to assess sample quality as well.</p>
<script src="https://gist.github.com/c37c112a6b4e58eb75b0.js"> </script>
Python objects for NGS projects and samples
2015-06-07T00:00:00+00:00
https://andre-rendeiro.com/2015/06/07/python_objects_for_ngs_projects_samples
<p>I’ve been using <a href="https://github.com/afrendeiro/pipelines">Python programs to manage NGS sample pipelines</a> for a while, and although progress was slow at first, they’re now in a state where the code is much more reliable and I can work much faster.</p>
<p>A big part of it was due to some simple object modeling of projects and samples.
Here are the basics of what I implemented. See the full code on <a href="https://github.com/afrendeiro/pipelines/blob/master/pipelines/models.py">GitHub</a>.</p>
<h2 id="a-project-object">A <em>Project</em> object</h2>
<p>In its simplest form, a project object holds attributes and defines and creates (if necessary) a directory structure.
Here’s how I chose to structure my projects:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">parent</span>
<span class="o">|</span><span class="n">___name</span>
<span class="o">|</span><span class="n">___data</span>
<span class="o">|</span><span class="n">___results</span>
<span class="o">|</span> <span class="o">|</span><span class="n">___plots</span>
<span class="o">|</span><span class="n">___runs</span>
<span class="o">|</span><span class="n">___executables</span>
<span class="o">|</span><span class="n">___pickles</span>
<span class="o">|</span><span class="n">___logs</span></code></pre></figure>
<p>So, all the <code class="language-plaintext highlighter-rouge">Project</code> object takes as arguments is <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">parent</code>. The structure is then created by <code class="language-plaintext highlighter-rouge">__init__</code> (which is called automatically upon creation of the object), which in turn calls <code class="language-plaintext highlighter-rouge">setProjectDirs</code> and <code class="language-plaintext highlighter-rouge">makeProjectDirs</code>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span> <span class="k">as</span> <span class="n">_os</span>


<span class="k">class</span> <span class="nc">Paths</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""A class to hold paths as attributes."""</span>
    <span class="k">pass</span>


<span class="k">class</span> <span class="nc">Project</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""A class to model a project.

    :param name: Project name.
    :type name: str
    :param parent: Path to where the project structure will be created.
    :type parent: str
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">parent</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">Project</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span> <span class="o">=</span> <span class="n">Paths</span><span class="p">()</span>
        <span class="c1"># Store the parent path; setProjectDirs builds every other path from it</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">parent</span> <span class="o">=</span> <span class="n">parent</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">setProjectDirs</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">makeProjectDirs</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">setProjectDirs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Sets the directory attributes for the project."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">root</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">parent</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">runs</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">root</span><span class="p">,</span> <span class="s">"runs"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">pickles</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">runs</span><span class="p">,</span> <span class="s">"pickles"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">executables</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">runs</span><span class="p">,</span> <span class="s">"executables"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">logs</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">runs</span><span class="p">,</span> <span class="s">"logs"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">root</span><span class="p">,</span> <span class="s">"data"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">results</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">root</span><span class="p">,</span> <span class="s">"results"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">plots</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">results</span><span class="p">,</span> <span class="s">"plots"</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">makeProjectDirs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Creates project directory structure if it doesn't exist."""</span>
        <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">path</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">__dict__</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
                <span class="n">_os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">path</span><span class="p">)</span></code></pre></figure>
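<p>To see the layout this produces without touching a real project folder, here is a minimal sketch that builds the same tree inside a temporary directory. The helpers <code class="language-plaintext highlighter-rouge">build_layout</code> and <code class="language-plaintext highlighter-rouge">list_created</code> are hypothetical names used only for this illustration; they are not part of the pipelines code.</p>

```python
import os
import tempfile


def build_layout(name, parent):
    """Create the project directories and return their paths (hypothetical helper)."""
    root = os.path.join(parent, name)
    runs = os.path.join(root, "runs")
    results = os.path.join(root, "results")
    layout = {
        "root": root,
        "data": os.path.join(root, "data"),
        "results": results,
        "plots": os.path.join(results, "plots"),
        "runs": runs,
        "executables": os.path.join(runs, "executables"),
        "pickles": os.path.join(runs, "pickles"),
        "logs": os.path.join(runs, "logs"),
    }
    for path in layout.values():
        # exist_ok makes repeated calls harmless
        os.makedirs(path, exist_ok=True)
    return layout


def list_created(parent):
    """Return every directory found under parent, relative to it and sorted."""
    found = []
    for current, subdirs, _files in os.walk(parent):
        for sub in subdirs:
            found.append(os.path.relpath(os.path.join(current, sub), parent))
    return sorted(found)


with tempfile.TemporaryDirectory() as parent:
    layout = build_layout("ngs", parent)
    created = list_created(parent)
    for rel in created:
        print(rel)
```

<p>Using a temporary directory keeps the experiment disposable: everything is removed again when the <code class="language-plaintext highlighter-rouge">with</code> block exits.</p>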
<h2 id="a-sample-object">A <em>Sample</em> object</h2>
<p>I decided to have my <code class="language-plaintext highlighter-rouge">Sample</code> objects created from a Pandas <code class="language-plaintext highlighter-rouge">Series</code>, since sample annotation sheets are often in tabular form and can easily be read with Pandas.</p>
<p>I wanted something like:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">series</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span>
<span class="p">[</span><span class="s">"ChIP-seq"</span><span class="p">,</span> <span class="s">"hg19"</span><span class="p">,</span> <span class="s">"/data/samples/test.bam"</span><span class="p">],</span>
<span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s">"technique"</span><span class="p">,</span> <span class="s">"genome"</span><span class="p">,</span> <span class="s">"unmappedBam"</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">sample</span> <span class="o">=</span> <span class="n">Sample</span><span class="p">(</span><span class="n">series</span><span class="p">)</span></code></pre></figure>
<p>I first considered creating <code class="language-plaintext highlighter-rouge">Sample</code> inheriting from <code class="language-plaintext highlighter-rouge">pandas.Series</code> to take advantage of its already implemented methods, but in the end it was lacking some features (tab-completion in IPython wasn’t showing the methods I defined). Also, compatibility with new Pandas versions was not guaranteed. Therefore, I simply assign the pandas <code class="language-plaintext highlighter-rouge">Series</code> attributes to a new <code class="language-plaintext highlighter-rouge">Sample</code> object.</p>
<p>The directory structure is sample-centric: all files from a sample are under a sample-specific directory, and other sub-directories then hold more specific files:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">_pd</span>


<span class="k">class</span> <span class="nc">Sample</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    Class to model NGS samples.

    :param series: Pandas `Series` object.
    :type series: pandas.Series
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">series</span><span class="p">):</span>
        <span class="c1"># Passed series must either be a pd.Series or a daughter class</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">series</span><span class="p">,</span> <span class="n">_pd</span><span class="p">.</span><span class="n">Series</span><span class="p">):</span>
            <span class="k">raise</span> <span class="nb">TypeError</span><span class="p">(</span><span class="s">"Provided object is not a pandas Series."</span><span class="p">)</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">Sample</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>

        <span class="c1"># Set series attributes on self</span>
        <span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">series</span><span class="p">.</span><span class="n">to_dict</span><span class="p">().</span><span class="n">items</span><span class="p">():</span>
            <span class="nb">setattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span> <span class="o">=</span> <span class="n">Paths</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">setFilePaths</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Sets the paths of all files for this sample."""</span>
        <span class="c1"># Assumes self.project and self.name have already been set</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">sampleRoot</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">project</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>

        <span class="c1"># Files in the root of the sample dir</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fastqc</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">sampleRoot</span>

        <span class="c1"># Unmapped: merged bam, fastq, trimmed fastq</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">unmapped</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">sampleRoot</span><span class="p">,</span> <span class="s">"unmapped"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">unmapped</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">unmapped</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">+</span> <span class="s">".bam"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fastq</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">unmapped</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">+</span> <span class="s">".fastq"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">trimmed</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">unmapped</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">+</span> <span class="s">".trimmed.fastq"</span><span class="p">)</span>

        <span class="c1"># Mapped: mapped, duplicates marked, removed, reads shifted</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">mapped</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">sampleRoot</span><span class="p">,</span> <span class="s">"mapped"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">mapped</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">mapped</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">+</span> <span class="s">".trimmed.bowtie2.bam"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">filtered</span> <span class="o">=</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">mapped</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">+</span> <span class="s">".trimmed.bowtie2.filtered.bam"</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">makeSampleDirs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Creates sample directory structure if it doesn't exist."""</span>
        <span class="k">for</span> <span class="n">path</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">__dict__</span><span class="p">.</span><span class="n">values</span><span class="p">():</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">_os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">exists</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
                <span class="n">_os</span><span class="p">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">path</span><span class="p">)</span></code></pre></figure>
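<p>The attribute-copying step in <code class="language-plaintext highlighter-rouge">__init__</code> can be tried in isolation. The sketch below uses a cut-down class, <code class="language-plaintext highlighter-rouge">MiniSample</code> (a hypothetical name for this illustration), which keeps only the type check and the <code class="language-plaintext highlighter-rouge">setattr</code> loop and leaves out all path handling.</p>

```python
import pandas as pd


class MiniSample(object):
    """Hypothetical cut-down Sample: only copies Series entries to attributes."""

    def __init__(self, series):
        if not isinstance(series, pd.Series):
            raise TypeError("Provided object is not a pandas Series.")
        # Every index label becomes an attribute name on the object
        for key, value in series.to_dict().items():
            setattr(self, key, value)


series = pd.Series(
    ["ChIP-seq", "hg19", "/data/samples/test.bam"],
    index=["technique", "genome", "unmappedBam"],
)
sample = MiniSample(series)
print(sample.technique, sample.genome)

# Anything that is not a Series (here a plain dict) is rejected
try:
    MiniSample({"technique": "ChIP-seq"})
except TypeError as error:
    rejected = str(error)
    print(rejected)
```

<p>One consequence of this design is that column names in the annotation sheet have to be valid Python identifiers if they are to be reached with dot notation.</p>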
<h4 id="sample-methods"><em>Sample</em> methods</h4>
<p>I create some useful methods for the samples.</p>
<p>I check if it contains required attributes and if these aren’t <code class="language-plaintext highlighter-rouge">nan</code>:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">checkValid</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="s">"""Check that all required attributes exist and that none of them is nan."""</span>
    <span class="n">req</span> <span class="o">=</span> <span class="p">[</span><span class="s">"technique"</span><span class="p">,</span> <span class="s">"genome"</span><span class="p">,</span> <span class="s">"unmappedBam"</span><span class="p">]</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="nb">all</span><span class="p">([</span><span class="nb">hasattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">attr</span><span class="p">)</span> <span class="k">for</span> <span class="n">attr</span> <span class="ow">in</span> <span class="n">req</span><span class="p">]):</span>
        <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"Required values for sample do not exist."</span><span class="p">)</span>

    <span class="c1"># Compare the attribute values (not their names) against "nan"</span>
    <span class="k">if</span> <span class="nb">any</span><span class="p">([</span><span class="nb">str</span><span class="p">(</span><span class="nb">getattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">attr</span><span class="p">))</span> <span class="o">==</span> <span class="s">"nan"</span> <span class="k">for</span> <span class="n">attr</span> <span class="ow">in</span> <span class="n">req</span><span class="p">]):</span>
        <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"Required values for sample are empty."</span><span class="p">)</span></code></pre></figure>
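<p>Empty cells in a sheet read by pandas arrive as floating-point <code class="language-plaintext highlighter-rouge">NaN</code>, so the same check can also be written with <code class="language-plaintext highlighter-rouge">pd.isnull</code> instead of a string comparison. This is only a sketch of that alternative: <code class="language-plaintext highlighter-rouge">missing_required</code> is a hypothetical helper working directly on a <code class="language-plaintext highlighter-rouge">Series</code>, and the file path is made up.</p>

```python
import pandas as pd

REQUIRED = ["technique", "genome", "unmappedBam"]


def missing_required(series, required=REQUIRED):
    """Return the required fields that are absent or null (hypothetical helper)."""
    return [
        field for field in required
        if field not in series.index or pd.isnull(series[field])
    ]


complete = pd.Series(["ATAC-seq", "hg19", "/data/a.bam"], index=REQUIRED)
# "genome" is present but empty, "unmappedBam" is absent altogether
incomplete = pd.Series(["ATAC-seq", float("nan")], index=["technique", "genome"])

complete_result = missing_required(complete)
incomplete_result = missing_required(incomplete)
print(complete_result)
print(incomplete_result)
```

<p>Returning the list of offending fields, rather than only raising, makes it easier to report exactly what has to be fixed in the sheet.</p>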
<p>I create a name for a sample from every non-<code class="language-plaintext highlighter-rouge">nan</code> attribute it might contain from a specific list:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">generateName</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="s">"""Generates a name for the sample by joining some of its possible attribute strings."""</span>
    <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"_"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span>
        <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">__getattribute__</span><span class="p">(</span><span class="n">attr</span><span class="p">))</span> <span class="k">for</span> <span class="n">attr</span> <span class="ow">in</span> <span class="p">[</span>
            <span class="s">"cellLine"</span><span class="p">,</span> <span class="s">"numberCells"</span><span class="p">,</span> <span class="s">"technique"</span><span class="p">,</span> <span class="s">"ip"</span><span class="p">,</span>
            <span class="s">"patient"</span><span class="p">,</span> <span class="s">"patientID"</span><span class="p">,</span> <span class="s">"sampleID"</span><span class="p">,</span> <span class="s">"treatment"</span><span class="p">,</span> <span class="s">"condition"</span><span class="p">,</span>
            <span class="s">"biologicalReplicate"</span><span class="p">,</span> <span class="s">"technicalReplicate"</span><span class="p">,</span>
            <span class="s">"experimentName"</span><span class="p">,</span> <span class="s">"genome"</span><span class="p">]</span> <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">attr</span><span class="p">)</span> <span class="ow">and</span> <span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">__getattribute__</span><span class="p">(</span><span class="n">attr</span><span class="p">))</span> <span class="o">!=</span> <span class="s">"nan"</span><span class="p">]</span>
    <span class="p">)</span></code></pre></figure>
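<p>As a quick illustration of the naming rule, here is a sketch that applies it to a single row. <code class="language-plaintext highlighter-rouge">build_name</code> is a hypothetical stand-alone version that uses a shortened field list and made-up values; fields that are missing or <code class="language-plaintext highlighter-rouge">NaN</code> are simply skipped.</p>

```python
import pandas as pd

# Shortened field list, for illustration only
NAME_FIELDS = ["cellLine", "numberCells", "technique", "ip", "genome"]


def build_name(series, fields=NAME_FIELDS):
    """Join the non-null values of the given fields with underscores (hypothetical helper)."""
    parts = [
        str(series[field]) for field in fields
        if field in series.index and not pd.isnull(series[field])
    ]
    return "_".join(parts)


row = pd.Series({
    "cellLine": "K562",
    "numberCells": 500,
    "technique": "ChIP-seq",
    "ip": float("nan"),  # empty cell: left out of the name
    "genome": "hg19",
})
name = build_name(row)
print(name)
```

<p>Because the order of the field list fixes the order of the name parts, names stay comparable across samples even when some annotations are missing.</p>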
<h2 id="a-samplesheet-object">A <em>SampleSheet</em> object</h2>
<p>Obviously, always creating a new Pandas <code class="language-plaintext highlighter-rouge">Series</code>, just to pass it to <code class="language-plaintext highlighter-rouge">Sample</code> does not make much sense.</p>
<p>I created a new class which loads a sample annotation sheet from a CSV file
and creates samples from it.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">SampleSheet</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    Class to model a sample annotation sheet.

    :param csv: Path to csv file.
    :type csv: str
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">csv</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">SampleSheet</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
        <span class="c1"># TODO: checks on given args</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">csv</span> <span class="o">=</span> <span class="n">csv</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">samples</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">checkSheet</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">checkSheet</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""
        Check if csv file exists and has all required columns.
        """</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">df</span> <span class="o">=</span> <span class="n">_pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">csv</span><span class="p">)</span>
        <span class="k">except</span> <span class="nb">IOError</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">IOError</span><span class="p">(</span><span class="s">"Given csv file couldn't be read."</span><span class="p">)</span>

        <span class="n">req</span> <span class="o">=</span> <span class="p">[</span><span class="s">"technique"</span><span class="p">,</span> <span class="s">"genome"</span><span class="p">,</span> <span class="s">"unmappedBam"</span><span class="p">]</span>
        <span class="n">missing</span> <span class="o">=</span> <span class="p">[</span><span class="n">col</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">req</span> <span class="k">if</span> <span class="n">col</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">]</span>

        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">missing</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">TypeError</span><span class="p">(</span><span class="s">"Annotation sheet is missing columns: %s"</span> <span class="o">%</span> <span class="s">" "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">missing</span><span class="p">))</span></code></pre></figure>
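<p>The column check is easy to try without writing files, since <code class="language-plaintext highlighter-rouge">pandas.read_csv</code> also accepts file-like objects. The sketch below feeds it two in-memory sheets; <code class="language-plaintext highlighter-rouge">missing_columns</code> is a hypothetical helper and the sheet contents are made up.</p>

```python
import io

import pandas as pd

REQUIRED = ["technique", "genome", "unmappedBam"]


def missing_columns(csv_source, required=REQUIRED):
    """Return the required columns absent from a CSV source (hypothetical helper)."""
    df = pd.read_csv(csv_source)
    return [col for col in required if col not in df.columns]


# In-memory stand-ins for annotation sheets on disk
good_sheet = io.StringIO("technique,genome,unmappedBam\nATAC-seq,hg19,/data/a.bam\n")
bad_sheet = io.StringIO("technique,genome\nATAC-seq,hg19\n")

good_missing = missing_columns(good_sheet)
bad_missing = missing_columns(bad_sheet)
print(good_missing)
print(bad_missing)
```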
<h4 id="samplesheet-methods"><em>SampleSheet</em> methods</h4>
<p>The first methods create samples from the <code class="language-plaintext highlighter-rouge">SampleSheet</code>, either from a single pandas <code class="language-plaintext highlighter-rouge">Series</code> or from the whole sheet:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">    <span class="k">def</span> <span class="nf">makeSample</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">series</span><span class="p">):</span>
        <span class="s">"""
        Make a child of class Sample dependent on its "technique" attribute.

        :param series: Pandas `Series` object.
        :type series: pandas.Series
        :return: An object of class `Sample` or a child of that class.
        :rtype: pipelines.Sample
        """</span>
        <span class="c1"># The technique is read from the row itself</span>
        <span class="k">if</span> <span class="n">series</span><span class="p">[</span><span class="s">"technique"</span><span class="p">]</span> <span class="ow">in</span> <span class="p">[</span><span class="s">"chipseq"</span><span class="p">,</span> <span class="s">"atac-seq"</span><span class="p">]:</span>
            <span class="k">return</span> <span class="n">Sample</span><span class="p">(</span><span class="n">series</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">TypeError</span><span class="p">(</span><span class="s">"Sample is not in known technique."</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">makeSamples</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""
        Creates samples from annotation sheet dependent on technique and adds them to the project.
        """</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">df</span><span class="p">)):</span>
            <span class="c1"># .iloc selects rows by position; the older .ix indexer was removed in pandas 1.0</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">samples</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">makeSample</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span></code></pre></figure>
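<p>The loop in <code class="language-plaintext highlighter-rouge">makeSamples</code> relies on each row of a data frame being a <code class="language-plaintext highlighter-rouge">Series</code>. A small sketch with a made-up two-row sheet shows this, using <code class="language-plaintext highlighter-rouge">DataFrame.iterrows</code> as an alternative to indexing by position:</p>

```python
import pandas as pd

# Made-up annotation sheet with two samples
df = pd.DataFrame({
    "technique": ["chipseq", "atac-seq"],
    "genome": ["hg19", "hg19"],
})

# iterrows yields (index label, row) pairs; each row is a pandas Series
rows = [row for _, row in df.iterrows()]

for row in rows:
    print(type(row).__name__, row["technique"], row["genome"])
```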
<p>Two methods to revert to a csv file (<code class="language-plaintext highlighter-rouge">to_csv</code> like in a <code class="language-plaintext highlighter-rouge">pandas.DataFrame</code>) and to get a new data frame from the already created samples (<code class="language-plaintext highlighter-rouge">asDataFrame</code>):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">    <span class="k">def</span> <span class="nf">asDataFrame</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""
        Returns a `pandas.DataFrame` representation of self.
        """</span>
        <span class="k">return</span> <span class="n">_pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">([</span><span class="n">s</span><span class="p">.</span><span class="n">asSeries</span><span class="p">()</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">samples</span><span class="p">])</span>

    <span class="k">def</span> <span class="nf">to_csv</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="nb">all</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="s">"""
        Saves a csv annotation sheet from the samples.

        :param path: Path to csv file to be written.
        :type path: str
        :param all: If all sample attributes should be kept in the annotation sheet.
        :type all: bool
        """</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">asDataFrame</span><span class="p">().</span><span class="n">to_csv</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span></code></pre></figure>
<h2 id="binding-them-all">Binding them all</h2>
<p>Ideally one would:</p>
<ol>
<li>create a <code class="language-plaintext highlighter-rouge">Project</code>;</li>
<li>add a csv file to it in a new method which would create a <code class="language-plaintext highlighter-rouge">SampleSheet</code> object. This would:
<ol>
<li>Make new <code class="language-plaintext highlighter-rouge">Sample</code> objects for each sample, creating its attributes and directory structure;</li>
<li>Add the <code class="language-plaintext highlighter-rouge">Sample</code> objects to a container in <code class="language-plaintext highlighter-rouge">Project</code>.</li>
</ol>
</li>
</ol>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Project</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="p">...</span>
    <span class="k">def</span> <span class="nf">addSampleSheet</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">csv</span><span class="p">):</span>
        <span class="s">"""
        Build a `SampleSheet` object from a csv file and
        add it and its samples to the project.

        :param csv: Path to csv file.
        :type csv: str
        """</span>
        <span class="c1"># Make SampleSheet object</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">sheet</span> <span class="o">=</span> <span class="n">SampleSheet</span><span class="p">(</span><span class="n">csv</span><span class="p">)</span>

        <span class="c1"># pair project and sheet</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">sheet</span><span class="p">.</span><span class="n">project</span> <span class="o">=</span> <span class="bp">self</span>

        <span class="c1"># Generate sample objects from annotation sheet</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">sheet</span><span class="p">.</span><span class="n">makeSamples</span><span class="p">()</span>

        <span class="c1"># Add samples to Project</span>
        <span class="k">for</span> <span class="n">sample</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">sheet</span><span class="p">.</span><span class="n">samples</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">addSample</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
            <span class="n">sample</span><span class="p">.</span><span class="n">setFilePaths</span><span class="p">()</span>
            <span class="n">sample</span><span class="p">.</span><span class="n">makeSampleDirs</span><span class="p">()</span></code></pre></figure>
<h1 id="practical-examples">Practical examples</h1>
<p>Here’s a step in an example pipeline which runs FastQC on (unmapped) bam files from all samples:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>

<span class="kn">from</span> <span class="nn">pipelines</span> <span class="kn">import</span> <span class="n">Project</span><span class="p">,</span> <span class="n">SampleSheet</span>

<span class="k">def</span> <span class="nf">fastqc</span><span class="p">(</span><span class="n">inputBam</span><span class="p">,</span> <span class="n">outputDir</span><span class="p">,</span> <span class="n">sampleName</span><span class="p">):</span>
    <span class="k">return</span> <span class="s">"fastqc --noextract --outdir {0} {1}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">outputDir</span><span class="p">,</span> <span class="n">inputBam</span><span class="p">)</span>

<span class="n">prj</span> <span class="o">=</span> <span class="n">Project</span><span class="p">(</span><span class="s">"ngs"</span><span class="p">)</span>
<span class="n">prj</span><span class="p">.</span><span class="n">addSampleSheet</span><span class="p">(</span><span class="s">"/projects/example/sheet.csv"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">sample</span> <span class="ow">in</span> <span class="n">prj</span><span class="p">.</span><span class="n">samples</span><span class="p">:</span>
    <span class="n">cmd</span> <span class="o">=</span> <span class="n">fastqc</span><span class="p">(</span><span class="n">sample</span><span class="p">.</span><span class="n">unmappedBam</span><span class="p">,</span> <span class="n">sample</span><span class="p">.</span><span class="n">dirs</span><span class="p">.</span><span class="n">sampleRoot</span><span class="p">,</span> <span class="n">sample</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
    <span class="n">os</span><span class="p">.</span><span class="n">system</span><span class="p">(</span><span class="n">cmd</span><span class="p">)</span> <span class="c1"># in real-life one wouldn't use `os.system`
</span>    <span class="p">...</span></code></pre></figure>
<p>Notice the absence of file paths in the pipeline. Although this example is still pretty simple, it is now much easier to handle every file created by the pipeline for each sample.</p>
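<p>To make the idea concrete, here is a minimal sketch of how a sample object could derive all of its paths from just a project root and a sample name. It is a hypothetical, stripped-down stand-in, not the actual <code class="language-plaintext highlighter-rouge">Sample</code> class: the constructor arguments, the directory layout and the example sample name are assumptions made for illustration; only the attribute names <code class="language-plaintext highlighter-rouge">unmappedBam</code> and <code class="language-plaintext highlighter-rouge">dirs.sampleRoot</code> come from the pipeline step above.</p>

```python
import os
from types import SimpleNamespace


class Sample(object):
    """Hypothetical, stripped-down stand-in for the `Sample` class."""

    def __init__(self, name, projectRoot):
        self.name = name
        self.projectRoot = projectRoot
        self.setFilePaths()

    def setFilePaths(self):
        # Every path is derived from the project root and the sample name.
        # The directory layout below is an assumption made for illustration.
        self.dirs = SimpleNamespace()
        self.dirs.sampleRoot = os.path.join(self.projectRoot, "data", self.name)
        self.dirs.unmapped = os.path.join(self.dirs.sampleRoot, "unmapped")
        self.unmappedBam = os.path.join(self.dirs.unmapped, self.name + ".bam")


# Made-up sample name and project root
sample = Sample("K562_H3K27ac", "/projects/example")
print(sample.unmappedBam)
```

<p>Because the layout lives in one method, changing where files go means editing a single place rather than every pipeline step.</p>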
<p>These objects are also useful during analysis steps to quickly grab files produced by the pipeline and start an analysis right away.</p>
<p>Here I grab all ChIP-seq peak files from all samples and create a peak set by concatenating them all and merging:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">pipelines</span> <span class="kn">import</span> <span class="n">Project</span><span class="p">,</span> <span class="n">SampleSheet</span>
<span class="kn">import</span> <span class="nn">pybedtools</span>
<span class="n">prj</span> <span class="o">=</span> <span class="n">Project</span><span class="p">(</span><span class="s">"ngs"</span><span class="p">)</span>
<span class="n">prj</span><span class="p">.</span><span class="n">addSampleSheet</span><span class="p">(</span><span class="s">"/projects/example/sheet.csv"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">sample</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">prj</span><span class="p">.</span><span class="n">samples</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">sample</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
    <span class="c1"># Get peaks
</span>    <span class="n">peaks</span> <span class="o">=</span> <span class="n">pybedtools</span><span class="p">.</span><span class="n">BedTool</span><span class="p">(</span><span class="n">sample</span><span class="p">.</span><span class="n">peaks</span><span class="p">)</span>
    <span class="c1"># Merge overlapping peaks within a sample if existing
</span>    <span class="n">peaks</span> <span class="o">=</span> <span class="n">peaks</span><span class="p">.</span><span class="n">merge</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">sites</span> <span class="o">=</span> <span class="n">peaks</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Concatenate all peaks
</span>        <span class="n">sites</span> <span class="o">=</span> <span class="n">sites</span><span class="p">.</span><span class="n">cat</span><span class="p">(</span><span class="n">peaks</span><span class="p">)</span>
<span class="c1"># Merge overlapping peaks across samples
</span><span class="n">sites</span> <span class="o">=</span> <span class="n">sites</span><span class="p">.</span><span class="n">merge</span><span class="p">()</span></code></pre></figure>
Predicting dyads from MNase-seq data
2015-05-12T00:00:00+00:00
https://andre-rendeiro.com/2015/05/12/predicting_dyads_from_mnase
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<style>
.centerImages {
line-height:200px;
text-align:center;
margin-left: auto;
margin-right: auto;
width: 90%;
vertical-align:middle;
}
.ulpost {list-style-type: none; margin: 0; padding: 0;}
.lipost {display: inline; margin-right: 20px;}
.lipost>a {width: 120px;}
</style>
<p><br /></p>
<p>I needed the location of nucleosomal dyads in the K562 cell line (ENCODE tier 1 line). Surprisingly, although plenty of MNase-seq data for that cell line is available, no nucleosome and dyad prediction exists.</p>
<h3 id="numap">NuMap</h3>
<p>I found the <a href="http://www-hsc.usc.edu/~valouev/NuMap/NuMap.html">NuMap</a> software by Anton Valouev to do exactly what I intended.</p>
<p>Since it is in a somewhat obscure page and this seemed to be the only place where this software was, I have <a href="https://github.com/afrendeiro/NuMap">uploaded it into a Github repository</a> for the sake of preservation.</p>
<p>Predicting dyads from MNase-seq data with NuMap seemed trivial: I downloaded the <a href="http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhNsome/">K562 MNase-seq data set</a> (11 replicates, ~85 Gb!!), combined all replicates and ran NuMap on the data (instructions in the GitHub README).</p>
<p>From NuMap output there are <a href="https://www.dropbox.com/s/asmp7bi40lrvtjb/K562_dyads.bed?dl=0">dyad positions in bed format</a> and you can also produce several metrics to evaluate how good the prediction was.</p>
<h3 id="distograms--phasograms">Distograms & phasograms</h3>
<p>Valouev describes two measurements of the frequencies of distances between MNase-seq reads. The frequency of distances between reads mapping to opposite strands can be used to build a “distogram”, which illustrates the expected nucleosome length (147 bp); this is consistent across most eukaryotic cells. The frequency of distances between reads mapping to the same strand gives a measurement of the distance between nucleosomes, as they’re separated by some linker DNA (Valouev calls this plot a “phasogram”). This measurement, on the other hand, tends to be species- and cell-type-specific.</p>
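<p>The two measurements can be sketched in a few lines of Python. This is only an illustration of the idea, not NuMap’s implementation: the function name, the <code class="language-plaintext highlighter-rouge">max_dist</code> cutoff and the toy coordinates are all assumptions, and the inputs are taken to be sorted 5' end positions of reads on a single chromosome.</p>

```python
from bisect import bisect_right
from collections import Counter


def distance_histogram(starts_a, starts_b, max_dist=1000):
    """Count distances d = b - a (0 < d <= max_dist) for every a in
    starts_a and every b in starts_b. Both inputs are sorted lists of
    read 5' end positions on one chromosome."""
    counts = Counter()
    for a in starts_a:
        j = bisect_right(starts_b, a)  # first position strictly beyond a
        while j < len(starts_b) and starts_b[j] - a <= max_dist:
            counts[starts_b[j] - a] += 1
            j += 1
    return counts


# Toy data: three nucleosomes spaced 185 bp apart,
# each contributing one read per strand.
plus = [1000, 1185, 1370]   # 5' ends of plus-strand reads
minus = [1147, 1332, 1517]  # 5' ends of minus-strand reads

distogram = distance_histogram(plus, minus)  # opposite strands
phasogram = distance_histogram(plus, plus)   # same strand

print(distogram.most_common(1))
print(phasogram.most_common(1))
```

<p>In this toy example the most frequent opposite-strand distance is the nucleosome length and the most frequent same-strand distance is the spacing between nucleosomes, which is what the peaks of the two plots represent.</p>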
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/dist-phasogram.png" align="middle" />
</div>
<p align="center">Valouev, A., Johnson, S. M., Boyd, S. D., Smith, C. L., Fire, A. Z., Sidow, A. (2011). Determinants of nucleosome organization in primary human cells. Nature, 474(7352), 516–520. <a href="http://doi.org/10.1038/nature10002">http://doi.org/10.1038/nature10002</a></p>
<h4 id="k562-predictions">K562 predictions:</h4>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/mnase-distogram.png" align="middle" />
</div>
<p align="center">The expected 147 bp nucleosome length in K562 cells.</p>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/mnase-phasogram.png" align="middle" />
</div>
<p align="center">The average distance between dyads in K562 cells seems to be 185 bp.</p>
Resources for CRISPR/Cas9 genome editing of zebrafish
2015-04-14T00:00:00+00:00
https://andre-rendeiro.com/2015/04/14/zebrafish_crispr_resources
<h1 id="zebrafish-grna-design-for-crispr">Zebrafish gRNA design for CRISPR</h1>
<p>I chose to use the <a href="http://dx.doi.org/10.1073%2Fpnas.1308335110">Wente-Chen design</a> which showed excellent results even with multiplexed targeting:</p>
<p><img src="https://www.addgene.org/static/data/easy-thumbnails/filer_public/cms/filer_public/5d/cc/5dccae2d-9f92-4547-a03d-9c32a785b401/chen-lab-plasmid-cloning-figure.png__700x505_q85_crop_subsampling-2_upscale.png" alt="Design" /></p>
<p>You can <a href="https://www.addgene.org/crispr/chen/">order the plasmids from Addgene</a>.</p>
<h3 id="grna-design">gRNA design</h3>
<p>Target region should obey the following pattern: <code class="language-plaintext highlighter-rouge">GG-N(19)-GG</code>.</p>
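<p>As a quick way of scanning a sequence for candidates before turning to the tools below, the pattern can be written as a regular expression. This is a sketch under assumptions: it takes the pattern literally as written above, only searches the strand it is given (the opposite strand would need the reverse complement), and the function name and example sequence are made up.</p>

```python
import re

# Pattern from the text: GG, then 19 bases of any kind, then GG.
# The lookahead lets overlapping candidates be reported as well.
TARGET = re.compile(r"(?=(GG[ACGT]{19}GG))")


def find_targets(sequence):
    """Return (start, site) tuples for every match on the given strand."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in TARGET.finditer(sequence)]


# Made-up sequence, for illustration only
spacer = "ACGTACGTACGTACGTACT"  # 19 nt
seq = "AT" + "GG" + spacer + "GG" + "CC"
print(find_targets(seq))
```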
<h4 id="tools">Tools</h4>
<ul>
<li><a href="http://www.e-crisp.org/E-CRISP">http://www.e-crisp.org/E-CRISP</a></li>
<li><a href="https://chopchop.rc.fas.harvard.edu">https://chopchop.rc.fas.harvard.edu</a></li>
<li><a href="http://crispr.dbcls.jp">http://crispr.dbcls.jp</a></li>
</ul>
<h4 id="dos--donts-probably-incomplete">Dos & don’ts (probably incomplete)</h4>
<ul>
<li>Pick 2-3 distinct targets;</li>
<li>Select exons upstream rather than downstream on the gene;</li>
<li>Have these targets preferentially in several exons;</li>
<li>Avoid repetitive sequences;</li>
<li>Design several target sequences at once, even if not ordering all oligos initially.</li>
</ul>
<h4 id="using-chopchop">Using ChopChop</h4>
<p>Get a matrix with the target sequences and their annotations.</p>
<p>While at it, take note of one primer pair to amplify the targeted region.</p>
<h3 id="design-oligos">Design oligos</h3>
<p>To design oligos based on the target sequence, one must remove the PAM sequence (this will be in the genome) and add sequences to the ends of the primers so that, once annealed, they complement the ends of the digested plasmid.</p>
<p>For the <a href="http://dx.doi.org/10.1073%2Fpnas.1308335110">Wente-Chen design</a>, these are “TA” to the left primer and “AAAC” to the right one.</p>
<p>This small Python script does the job:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">Bio</span> <span class="kn">import</span> <span class="n">Seq</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"gRNA_design.csv"</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">"left_oligo"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
<span class="n">df</span><span class="p">[</span><span class="s">"right_oligo"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)):</span>
    <span class="n">target</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="s">"target_sequence"</span><span class="p">])</span>
    <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="s">"left_oligo"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"TA"</span> <span class="o">+</span> <span class="n">target</span><span class="p">[:</span><span class="mi">20</span><span class="p">]</span> <span class="c1"># add 5' seq to 20 nt-long RNA
</span>    <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="s">"right_oligo"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"AAAC"</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">Seq</span><span class="p">.</span><span class="n">Seq</span><span class="p">(</span><span class="n">target</span><span class="p">[:</span><span class="mi">20</span><span class="p">]).</span><span class="n">reverse_complement</span><span class="p">()[:</span><span class="mi">18</span><span class="p">])</span> <span class="c1"># add restriction site and take PAM seq out
</span>
<span class="n">df</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"gRNA_design.primers.csv"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span></code></pre></figure>
<h3 id="cloning">Cloning</h3>
<p>As <a href="https://www.addgene.org/static/cms/filer_public/02/12/0212c99c-6937-4884-8fb0-a097b965f1c3/sgrna-plasmid-construction-protocol.pdf">described here</a>:</p>
<h4 id="plasmids">Plasmids</h4>
<ul>
<li><a href="https://www.addgene.org/47929/">pCS2-nCas9n</a> - zf codon-optimized Cas9 with nuclear localization signals</li>
<li><a href="https://www.addgene.org/46759/">pT7-gRNA</a> - gRNA backbone</li>
</ul>
<h4 id="reagents-materials">Reagents, materials</h4>
<ul>
<li>BsmBI, BglII, SalI, BamHI and NotI or XbaI restriction enzymes</li>
<li>T4 ligase</li>
<li>Proteinase K</li>
<li>MEGAshortscript T7 kit (Ambion/Invitrogen)</li>
<li>mMESSAGE mMACHINE SP6 or T3 kit (Invitrogen)</li>
<li>RNeasy Mini kit (Qiagen)</li>
<li>mirVana miRNA Isolation Kit (Ambion/Invitrogen)</li>
<li>NEB Buffer 1, 3</li>
<li>T4 ligase buffer</li>
<li>LB/amp plates</li>
</ul>
<h1 id="literature">Literature</h1>
<ol>
<li>Ablain, J., Durand, E. M., Yang, S., Zhou, Y. & Zon, L. I. A CRISPR/Cas9 Vector System for Tissue-Specific Gene Disruption in Zebrafish. Dev. Cell 32, 756–764 (2015).</li>
<li>Auer, T. O., Duroure, K., De Cian, A., Concordet, J.-P. & Del Bene, F. Highly efficient CRISPR/Cas9-mediated knock-in in zebrafish by homology-independent DNA repair. Genome Res. 24, 142–153 (2014). doi:10.1101/gr.161638.113</li>
<li>Gagnon, J. A. et al. Efficient mutagenesis by Cas9 protein-mediated oligonucleotide insertion and large-scale assessment of single-guide RNAs. PLoS One 9, 5–12 (2014).</li>
<li>Hruscha, A. et al. Efficient CRISPR/Cas9 genome editing with low off-target effects in zebrafish. Development 140, 4982–7 (2013).</li>
<li>Hwang, W. Y. et al. Efficient genome editing in zebrafish using a CRISPR-Cas system. Nat. Biotechnol. 31, 227–9 (2013).</li>
<li>Hwang, W. Y. et al. Heritable and Precise Zebrafish Genome Editing Using a CRISPR-Cas System. PLoS One 8, 1–9 (2013).</li>
<li>Jao, L.-E., Wente, S. R. & Chen, W. Efficient multiplex biallelic zebrafish genome editing using a CRISPR nuclease system. Proc. Natl. Acad. Sci. U. S. A. 110, 13904–9 (2013).</li>
<li>Kimura, Y., Hisano, Y., Kawahara, A. & Higashijima, S. Efficient generation of knock-in transgenic zebrafish carrying reporter/driver genes by CRISPR/Cas9-mediated genome engineering. Sci. Rep. 4, 6545 (2014).</li>
<li>Shen, B. et al. Generation of gene-modified mice via Cas9/RNA-mediated gene targeting. Cell Res. 23, 720–3 (2013).</li>
</ol>
Using Google cloud computing
2015-04-14T00:00:00+00:00
https://andre-rendeiro.com/2015/04/14/using-google-cloud-computing
<p>I recently started using cloud computing services.</p>
<p>Amazon seems to be the preferred provider of cloud services, and rightly so: the breadth of their services and their customization options is currently unparalleled. Although I had experimented with some Amazon Web Services (AWS) before (<em>e.g.</em> S3 storage), I had never used them for computing.</p>
<p>Unfortunately, Amazon has quite restrictive limits for new users (for as long as you are stuck in the Free Tier):</p>
<ul>
<li>limit of two usage zones (this wouldn’t be a problem, were it not for:)</li>
<li>all possible zones to choose from are in the US</li>
<li>really weak VMs available</li>
</ul>
<p>I contacted support to change my zones and have my limits raised so I could start real work, willing to pay the costs, but it took 3 days to get a response, which basically amounted to “tough luck”, and <a href="https://forums.aws.amazon.com/thread.jspa?threadID=175448">this seems like a general pattern</a>.</p>
<p>So I figured other providers might offer better conditions to new users due to their smaller market share. And so it was with <a href="https://cloud.google.com/compute/">Google Cloud Compute</a>.</p>
<p>Features:</p>
<ul>
<li>credit of $300 to spend over 60 days - <em>very</em> attractive;</li>
<li>unrestricted choice of zones;</li>
<li>more choice of VMs for starting users;</li>
<li>simpler interface (also fewer features);</li>
<li>competitive prices per hour and per GB of storage compared with AWS.</li>
</ul>
<p>Computing and storage aren’t as separated as in AWS. The computing service is called Google Compute Engine (GCE) and is similar to AWS’ EC2. Long-term storage is called Google Cloud Storage (GCS) and is equivalent to AWS’ S3. Disks can be mounted on instances in a way equivalent to AWS’ EBS storage.</p>
<p>Following is a series of notes on how to interface with GCE and GCS, written mostly for the future me.</p>
<h2 id="instances">Instances</h2>
<h4 id="mounting-new-disks-in-instances">Mounting new disks in instances:</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>df -h # see mounted volumes
sudo mkdir /projects
sudo chown user:user /projects
sudo /usr/share/google/safe_format_and_mount -m "mkfs.ext4 -F" /dev/sdb /projects
</code></pre></div></div>
<p>Set to mount at startup - add:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/dev/sdaX /media/mydata ext4 defaults 0 0
</code></pre></div></div>
<p>to /etc/fstab</p>
<h4 id="sftp">Sftp</h4>
<p>You can give an external IP to your instances and transfer files easily.</p>
<p>You can use Filezilla by adding your instance key (<code class="language-plaintext highlighter-rouge">Edit -> Preferences -> SFTP -> Add key...</code>) and using <code class="language-plaintext highlighter-rouge">sftp://<user>@<externalIP></code>.</p>
<h2 id="images">Images</h2>
<p>Pretty much the same as in AWS EC2: create a new instance, <a href="http://andre-rendeiro.me/2015/04/08/bioinfo_fresh_install_ubuntu/">install all your software</a> and save an image of the instance. Next time, start a new instance with this image and <em>voilà</em>, all your software is there.</p>
<p>Unfortunately, I haven’t found a way of sharing images :disappointed:.</p>
<h2 id="tools">Tools</h2>
<ul>
<li><code class="language-plaintext highlighter-rouge">gcloud</code> : manage services, instances, configurations, permissions</li>
<li><code class="language-plaintext highlighter-rouge">gsutil</code> : manage cloud storage (upload, download to and from local)</li>
</ul>
<h3 id="uploading-to-gcs">Uploading to gcs</h3>
<p>Upload in parallel to Google cloud storage:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install crcmod
# configure ~/.boto
# uncomment parallel_process_count line
# or use this: https://github.com/afrendeiro/dotfiles/blob/master/.boto
# with Rsync
gsutil -m rsync -r . gs://storage/data/
# selectively using grep
ls /localdir/data/mapped | grep .dups.bam |  # grep samples
    grep -v _string_ |  # exclude some samples based on some string
    gsutil -m cp -I gs://storagedir/data/mapped/  # upload
</code></pre></div></div>
<h4 id="change-permissions">Change permissions</h4>
<p><em>e.g.</em> upload bigwig tracks and hub, make them publicly accessible</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gsutil -m rsync data/bigWig gs://storage/bigWig/
gsutil cp trackHub_hg19.txt gs://storage/bigWig/
gsutil -m acl ch -g All:R gs://storage/bigWig/*
</code></pre></div></div>
<p>Auto-resumable uploads, pretty fast.</p>
<p>Uploaded ~250 bam files (1-5 Gb each) overnight!</p>
Bioinformatics software install from fresh Ubuntu image on AWS EC2
2015-04-08T00:00:00+00:00
https://andre-rendeiro.com/2015/04/08/bioinfo_fresh_install_ubuntu
<p>I’ve recently started using Amazon’s web services (AWS) for cloud computing (EC2 service). It takes a while to grasp the concepts and get familiar with the services offered. I won’t write more on this since I found an <a href="https://github.com/griffithlab/rnaseq_tutorial/wiki/Intro-to-AWS-Cloud-Computing"><strong>excellent guide</strong> to AWS and EC2</a>, which I recommend going through if you’re new to cloud computing in general or to AWS.</p>
<p>After you start your virtual machine (VM) instances you’ll want to start working as fast as possible, but you’ll need the software you’re used to for that. Fortunately, you only have to install all of that once. Once everything is installed, you can save an image of the instance’s boot disk and reuse it as many times as you want when you create new VM instances.</p>
<p>Here’s a guide to install common bioinformatics software in a fresh Ubuntu 14.04 LTS image on an AWS EC2 instance.</p>
<h3 id="software">Software</h3>
<p>This will get updated, so <a href="https://gist.github.com/afrendeiro/a3718d50cdb370a83f88/revisions">refer to the gist’s versions</a> to see changes.</p>
<script src="https://gist.github.com/a3718d50cdb370a83f88.js"> </script>
<h3 id="static-files-used-in-bioinformatics-genomes-indexes-annotations">Static files used in bioinformatics (genomes, indexes, annotations)</h3>
<script src="https://gist.github.com/2ba0d2bb3a5710c51c0f.js"> </script>
Tools for ChIP-seq differential binding analysis
2015-04-03T00:00:00+00:00
https://andre-rendeiro.com/2015/04/03/chipseq_diffbind_analysis
<p>Detection of differential binding events in ChIP-seq data is still a tricky business. For a new collaboration, the whole project is going to depend on it, so I went out there and tried to collect existing tools, work with them and see their pros and cons.</p>
<p>I was looking specifically for tools that work well without replicates or input controls, since we already have some data lying around from a pilot at the beginning of the project, but the other tools might be useful as well once more data comes along.</p>
<p>In no particular order:</p>
<h2 id="diffbind"><a href="http://www.bioconductor.org/packages/release/bioc/html/DiffBind.html">DiffBind</a></h2>
<ul>
<li>Robust tutorial.</li>
<li>Requires a strict table with sample annotation (not necessarily bad though).</li>
<li>Requires peak files.</li>
<li>Uses multiple replicates in analysis.</li>
<li>Always requires input files to perform analysis.</li>
<li>Close <code class="language-plaintext highlighter-rouge">R</code> integration provides many useful methods to explore output by plotting.</li>
</ul>
<h2 id="manorm"><a href="http://bcb.dfci.harvard.edu/~gcyuan/MAnorm/MAnorm.htm">MAnorm</a></h2>
<ul>
<li>Requires peak files.</li>
<li>Does not require input files.</li>
<li>Terrible code packaging and usage practices.</li>
</ul>
<p>The commands to install dependencies are outdated. If anyone else is struggling with it, here’s what worked for me:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>source("http://bioconductor.org/biocLite.R")
biocLite("aroma.light")
install.packages(c("R.oo","R.utils","MASS"))
</code></pre></div></div>
<h2 id="diffreps"><a href="https://github.com/shenlab-sinai/diffreps">Diffreps</a></h2>
<ul>
<li>Installation is not straightforward (dependency hell).</li>
<li>Requires peak files.</li>
<li>Does not require input files (but can be used for fold enrichment filtering).</li>
<li>Some nice tools downstream of differential calling (mostly region annotation).</li>
</ul>
<h2 id="odin"><a href="http://www.regulatory-genomics.org/odin-2/basic-introduction/">Odin</a></h2>
<ul>
<li>Can use inputs for analysis.</li>
<li>Limited description of output.</li>
<li>Output in “a proprietary BED format” (do I need to say anything?)</li>
</ul>
<h2 id="macs2"><a href="https://github.com/taoliu/MACS/wiki/Call-differential-binding-events">MACS2</a></h2>
<ul>
<li>Does not require peak files.</li>
<li>Poor documentation on the differential binding functions.</li>
<li>Very immature code (“prepare a pen to write down the number of non-redundant reads” - seriously?)</li>
</ul>
<h2 id="multigps"><a href="http://mahonylab.org/software/multigps/">MultiGPS</a></h2>
<h2 id="pepr"><a href="https://github.com/troublezhang/PePr">PePr</a></h2>
<ul>
<li>Requires more than one replicate per condition for analysis.</li>
<li>Does not require peak files.</li>
<li>Does not require input files.</li>
<li>Supports several input file formats.</li>
</ul>
<h2 id="chipdiff"><a href="http://cmb.gis.a-star.edu.sg/ChIPSeq/paperChIPDiff.htm">ChIPDiff</a></h2>
<ul>
<li>No documentation besides a readme in a zip file.</li>
<li>Not really packaged.</li>
</ul>
<h2 id="dime"><a href="http://www.stat.osu.edu/~statgen/SOFTWARE/DIME/">DIME</a></h2>
<ul>
<li>Poor documentation (only function description in the <a href="http://cran.r-project.org/web/packages/DIME/">R package</a>).</li>
</ul>
<h2 id="dbchip"><a href="http://pages.cs.wisc.edu/~kliang/DBChIP/">DBChIP</a></h2>
<ul>
<li>Useful tutorial.</li>
<li>Only recommended for point-source factors.</li>
<li>Requires peak files.</li>
</ul>
MapReduce-like operations across jobs in cluster - part II
2015-02-14T00:00:00+00:00
https://andre-rendeiro.com/2015/02/14/mapreduce_slurm_II
<p>In <a href="http://andre-rendeiro.me/2015/01/31/mapreduce_slurm/">my previous post</a> concerning interactive (<em>e.g.</em> during an <code class="language-plaintext highlighter-rouge">IPython</code> session) parallelization of tasks with a MapReduce-like approach across nodes in a cluster, I created an object which interfaces Slurm and the interactive session I’m working on, by splitting an input in pools and submitting each pool as a job that would be further processed in parallel.</p>
<p>Since the class was performing two distinct functions (handling jobs, splitting input in a task-dependent manner), I split it into two classes: <code class="language-plaintext highlighter-rouge">DivideAndSlurm</code>, which takes care of job processing, and <code class="language-plaintext highlighter-rouge">Task</code>, which is a meta-class for the different tasks that can be parallelized this way.</p>
<p><code class="language-plaintext highlighter-rouge">Task</code> subclasses should be created for each particular task, inheriting from the meta one. This takes little effort when writing new classes, since basically what changes between tasks is (1) the location of the script which will actually compute the output, and (2) how the output is reduced when collected, since different output objects should be reduced differently (<em>e.g.</em> <code class="language-plaintext highlighter-rouge">collections.Counter</code> objects can be reduced by summation, but <code class="language-plaintext highlighter-rouge">list</code> or <code class="language-plaintext highlighter-rouge">dict</code> ones cannot, although here I might get smarter and write general collection methods for different output types).</p>
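<p>As a rough illustration of that division of labour, here is a minimal sketch of a base class and one subclass. It is not the code from the library: the method names <code class="language-plaintext highlighter-rouge">split</code> and <code class="language-plaintext highlighter-rouge">reduce</code>, the <code class="language-plaintext highlighter-rouge">script</code> attribute, the <code class="language-plaintext highlighter-rouge">CountReads</code> subclass and the script name are all hypothetical, and the jobs are faked locally instead of being submitted to Slurm.</p>

```python
from collections import Counter


class Task(object):
    """Hypothetical base class: holds what all tasks share."""

    script = None  # path to the script that computes each job's output

    def __init__(self, data, fractions, *args, **kwargs):
        self.data = data
        self.fractions = fractions
        self.args = args
        self.kwargs = kwargs

    def split(self):
        """Split the input into at most `fractions` pools, one per job."""
        size = max(1, -(-len(self.data) // self.fractions))  # ceiling division
        return [self.data[i:i + size] for i in range(0, len(self.data), size)]

    def reduce(self, outputs):
        raise NotImplementedError("each subclass decides how to reduce")


class CountReads(Task):
    """Hypothetical task whose jobs each return a `collections.Counter`."""

    script = "count_reads.py"  # made-up script name

    def reduce(self, outputs):
        # Counters support `+`, so reducing is just summation.
        return sum(outputs, Counter())


task = CountReads(list(range(10)), 3)
pools = task.split()
# Stand-in for the jobs: each pool is "processed" locally into a Counter
outputs = [Counter(x % 2 for x in pool) for pool in pools]
total = task.reduce(outputs)
print(total)
```

<p>A new kind of task then only needs to point to its own script and override the reducing method.</p>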
<p>The basic usage now looks like this:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">divideAndSlurm</span> <span class="kn">import</span> <span class="n">DivideAndSlurm</span><span class="p">,</span> <span class="n">Task</span>
<span class="n">slurm</span> <span class="o">=</span> <span class="n">DivideAndSlurm</span><span class="p">()</span> <span class="c1"># create instance of object
</span><span class="n">regions</span> <span class="o">=</span> <span class="p">[</span><span class="n">promoters</span><span class="p">,</span> <span class="n">genes</span><span class="p">]</span> <span class="c1"># data is iterable with iterables - each is a separate task with multiple regions
</span>
<span class="k">for</span> <span class="n">region</span> <span class="ow">in</span> <span class="n">regions</span><span class="p">:</span> <span class="c1"># Add several tasks:
</span>    <span class="n">task</span> <span class="o">=</span> <span class="n">Task</span><span class="p">(</span><span class="n">region</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="n">bamFile</span><span class="p">)</span> <span class="c1"># Add new task - syntax: data, fractions, *additional arguments
</span>    <span class="n">slurm</span><span class="p">.</span><span class="n">add_task</span><span class="p">(</span><span class="n">task</span><span class="p">)</span> <span class="c1"># Adding the task to slurm triggers the splitting of the data and links the two objects
</span>    <span class="n">slurm</span><span class="p">.</span><span class="n">submit</span><span class="p">(</span><span class="n">task</span><span class="p">)</span> <span class="c1"># Submit the task
</span>
<span class="n">regionsOutput</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
<span class="k">for</span> <span class="n">task</span> <span class="ow">in</span> <span class="n">slurm</span><span class="p">.</span><span class="n">tasks</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">task</span><span class="p">.</span><span class="n">is_ready</span><span class="p">():</span>
        <span class="n">regionsOutput</span><span class="p">[</span><span class="n">task</span><span class="p">]</span> <span class="o">=</span> <span class="n">task</span><span class="p">.</span><span class="n">collect</span><span class="p">()</span></code></pre></figure>
<p>The meta class <code class="language-plaintext highlighter-rouge">Task</code> accepts args and kwargs, so inheriting sub-classes can use task-specific options.</p>
<p>I further included many more functions and attributes to keep track of tasks (<code class="language-plaintext highlighter-rouge">slurm.tasks</code> or <code class="language-plaintext highlighter-rouge">task.log</code> attributes), to deal with faulty job executions (<em>e.g.</em> allowing collection of output even if some jobs fail, through the <code class="language-plaintext highlighter-rouge">task.permissive</code> attribute, which is off by default), to check status (<code class="language-plaintext highlighter-rouge">task.is_running()</code>, <code class="language-plaintext highlighter-rouge">task.is_ready()</code>, <code class="language-plaintext highlighter-rouge">task.has_output()</code>, <code class="language-plaintext highlighter-rouge">task.failed()</code>) and to manage tasks (<code class="language-plaintext highlighter-rouge">slurm.cancel_task(task)</code>, <code class="language-plaintext highlighter-rouge">slurm.cancel_all_tasks()</code>, <code class="language-plaintext highlighter-rouge">slurm.remove_task(task)</code>).</p>
<h3 id="repository">Repository</h3>
<p>The small library is called <a href="https://github.com/afrendeiro/divideAndSlurm">divideAndSlurm</a> and includes a setup.py to install.</p>
Taxonomic distribution of interpro domains
2015-02-12T00:00:00+00:00
https://andre-rendeiro.com/2015/02/12/taxonomic-distribution-of-interpro-domains
<p>I recently faced a problem that seemed trivial to solve at first but (as always) was not: <em>How to get the taxonomic distribution of proteins with a given interpro domain?</em></p>
<p>I started without even knowing exactly which protein domains to look at (I wanted all the domains present in certain proteins). To find them, I query Uniprot through biomart to get all the annotated interpro domains present in a group of proteins, along with some other information (other IDs, names).</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">biomart</span> <span class="kn">import</span> <span class="n">BiomartServer</span><span class="p">,</span> <span class="n">BiomartDataset</span>
<span class="n">ids</span> <span class="o">=</span> <span class="p">[</span><span class="s">"P01106"</span><span class="p">,</span> <span class="s">"P17947"</span><span class="p">]</span> <span class="c1"># some examples
</span>
<span class="c1"># connect to biomart
</span><span class="n">server</span> <span class="o">=</span> <span class="n">BiomartServer</span><span class="p">(</span><span class="s">"http://www.biomart.org/biomart"</span><span class="p">)</span>
<span class="n">uniprot</span> <span class="o">=</span> <span class="n">server</span><span class="p">.</span><span class="n">datasets</span><span class="p">[</span><span class="s">'uniprot'</span><span class="p">]</span>
<span class="c1"># query interpro domains for the prots
</span><span class="n">attributes</span> <span class="o">=</span> <span class="p">[</span><span class="s">'accession'</span><span class="p">,</span> <span class="s">'ensembl_id'</span><span class="p">,</span> <span class="s">'entry_type'</span><span class="p">,</span> <span class="s">'gene_name'</span><span class="p">,</span> <span class="s">'name'</span><span class="p">,</span> <span class="s">'interpro_id'</span><span class="p">]</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">uniprot</span><span class="p">.</span><span class="n">search</span><span class="p">({</span>
    <span class="s">'filters'</span><span class="p">:</span> <span class="p">{</span><span class="s">'accession'</span><span class="p">:</span> <span class="n">ids</span><span class="p">},</span>
    <span class="s">'attributes'</span><span class="p">:</span> <span class="n">attributes</span>
<span class="p">})</span>
<span class="c1"># put in dataframe (response.text is the decoded, tab-separated body)
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
    <span class="p">[</span><span class="n">line</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">strip</span><span class="p">().</span><span class="n">split</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)],</span>
    <span class="n">columns</span><span class="o">=</span><span class="n">attributes</span>
<span class="p">)</span></code></pre></figure>
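<p>To make the parsing step concrete without contacting the server, here is a minimal sketch in plain Python that turns a tab-separated response body into rows and pulls out the distinct domain IDs. The <code class="language-plaintext highlighter-rouge">example_body</code> string and the helpers <code class="language-plaintext highlighter-rouge">parse_rows</code> and <code class="language-plaintext highlighter-rouge">unique_in_order</code> are made up for illustration (the values are placeholders, not real query results); only the splitting logic mirrors the code above.</p>

```python
import csv
import io

# Hypothetical response body: one tab-separated row per protein/domain pair.
# The values are placeholders, not real Biomart output.
example_body = (
    "ACC1\tGENE_A\tIPR000001\n"
    "ACC1\tGENE_A\tIPR000002\n"
    "ACC2\tGENE_B\tIPR000001\n"
)


def parse_rows(body, columns):
    """Turn a header-less, tab-separated body into a list of dicts keyed by column name."""
    reader = csv.reader(io.StringIO(body.strip()), delimiter="\t")
    return [dict(zip(columns, row)) for row in reader]


def unique_in_order(values):
    """Drop repeated values while keeping the order of first appearance."""
    return list(dict.fromkeys(values))


rows = parse_rows(example_body, ["accession", "gene_name", "interpro_id"])
domains = unique_in_order(row["interpro_id"] for row in rows)
print(domains)
```

<p>Deduplicating the domain IDs before the next step is worth considering, since two proteins often share a domain and each repeated ID would otherwise trigger a repeated query.</p>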
<p>We now switch to the Interpro database and get the scientific name (and other stuff) of species with proteins containing these interpro domains. Again Biomart is our friend.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">domains</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'interpro_id'</span><span class="p">]</span>
<span class="n">attributes</span> <span class="o">=</span> <span class="p">[</span><span class="s">'entry_id'</span><span class="p">,</span> <span class="s">'entry_type'</span><span class="p">,</span> <span class="s">'entry_name'</span><span class="p">,</span> <span class="s">'taxonomy_scientific_name'</span><span class="p">]</span>
<span class="n">interpro</span> <span class="o">=</span> <span class="n">BiomartDataset</span><span class="p">(</span><span class="s">"http://www.biomart.org/biomart"</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'entry'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">domain</span> <span class="ow">in</span> <span class="n">domains</span><span class="p">:</span>
    <span class="c1"># Query taxonomies with domains
</span>    <span class="n">response</span> <span class="o">=</span> <span class="n">interpro</span><span class="p">.</span><span class="n">search</span><span class="p">({</span>
        <span class="s">'filters'</span><span class="p">:</span> <span class="p">{</span><span class="s">'entry_id'</span><span class="p">:</span> <span class="n">domain</span><span class="p">},</span>
        <span class="s">'attributes'</span><span class="p">:</span> <span class="n">attributes</span>
    <span class="p">})</span>
    <span class="n">taxons</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span>
        <span class="p">[</span><span class="n">line</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">response</span><span class="p">.</span><span class="n">text</span><span class="p">.</span><span class="n">strip</span><span class="p">().</span><span class="n">split</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)],</span>
        <span class="n">columns</span><span class="o">=</span><span class="n">attributes</span>
    <span class="p">)</span>
    <span class="k">if</span> <span class="n">domain</span> <span class="o">==</span> <span class="n">domains</span><span class="p">[</span><span class="mi">0</span><span class="p">]:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">taxons</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df</span><span class="p">,</span> <span class="n">taxons</span><span class="p">],</span> <span class="n">ignore_index</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span></code></pre></figure>
<p>To assess the distribution of these proteins across clades of species, one needs more information about these species. Getting that information was the part that was not obvious to me.</p>
<p><a href="http://www.ncbi.nlm.nih.gov/taxonomy">NCBI taxonomy</a> has this type of information, but I wasn’t familiar with its APIs, the so-called <a href="http://www.ncbi.nlm.nih.gov/books/NBK25501/">NCBI eutils</a>. Luckily, there’s a Python solution for <a href="http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc108">accessing NCBI’s Entrez databases through BioPython</a>.</p>
<p>We define two functions: one to get the ID of a taxon from its name (this may fail if the species name has extra text appended, such as “strain”) and one to get the record for that taxon from its ID (the record contains a full lineage description).</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">Bio</span> <span class="kn">import</span> <span class="n">Entrez</span>
<span class="k">def</span> <span class="nf">get_tax_id</span><span class="p">(</span><span class="n">specie</span><span class="p">):</span>
    <span class="s">"""Get taxon ID for specie."""</span>
    <span class="n">specie</span> <span class="o">=</span> <span class="n">specie</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">" "</span><span class="p">,</span> <span class="s">"+"</span><span class="p">).</span><span class="n">strip</span><span class="p">()</span>
    <span class="n">search</span> <span class="o">=</span> <span class="n">Entrez</span><span class="p">.</span><span class="n">esearch</span><span class="p">(</span><span class="n">term</span><span class="o">=</span><span class="n">specie</span><span class="p">,</span> <span class="n">db</span><span class="o">=</span><span class="s">"taxonomy"</span><span class="p">,</span> <span class="n">retmode</span><span class="o">=</span><span class="s">"xml"</span><span class="p">)</span>
    <span class="n">record</span> <span class="o">=</span> <span class="n">Entrez</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="n">search</span><span class="p">)</span>
    <span class="k">if</span> <span class="nb">int</span><span class="p">(</span><span class="n">record</span><span class="p">[</span><span class="s">"Count"</span><span class="p">])</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">None</span>
    <span class="k">if</span> <span class="s">"IdList"</span> <span class="ow">in</span> <span class="n">record</span><span class="p">.</span><span class="n">keys</span><span class="p">():</span>
        <span class="k">return</span> <span class="n">record</span><span class="p">[</span><span class="s">'IdList'</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">get_tax_data</span><span class="p">(</span><span class="n">taxid</span><span class="p">):</span>
    <span class="s">"""Fetch the record of a taxon ID."""</span>
    <span class="n">search</span> <span class="o">=</span> <span class="n">Entrez</span><span class="p">.</span><span class="n">efetch</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="n">taxid</span><span class="p">,</span> <span class="n">db</span><span class="o">=</span><span class="s">"taxonomy"</span><span class="p">,</span> <span class="n">retmode</span><span class="o">=</span><span class="s">"xml"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">Entrez</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="n">search</span><span class="p">)</span>
<span class="n">Entrez</span><span class="p">.</span><span class="n">email</span> <span class="o">=</span> <span class="s">""</span> <span class="c1"># enter your email here
</span>
<span class="c1"># initialize empty column for taxonomy
</span><span class="n">df</span><span class="p">[</span><span class="s">'taxonomy'</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">for</span> <span class="n">specie</span> <span class="ow">in</span> <span class="n">df</span><span class="p">[</span><span class="s">"taxonomy_scientific_name"</span><span class="p">].</span><span class="n">unique</span><span class="p">():</span>
    <span class="n">taxid</span> <span class="o">=</span> <span class="n">get_tax_id</span><span class="p">(</span><span class="n">specie</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">taxid</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">continue</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">get_tax_data</span><span class="p">(</span><span class="n">taxid</span><span class="p">)</span>
    <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="o">>=</span> <span class="mi">1</span><span class="p">:</span>
        <span class="k">if</span> <span class="s">"Lineage"</span> <span class="ow">in</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">keys</span><span class="p">():</span>
            <span class="c1"># add taxonomy to all rows with this specie name
</span>            <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">'taxonomy_scientific_name'</span><span class="p">]</span> <span class="o">==</span> <span class="n">specie</span><span class="p">,</span> <span class="s">'taxonomy'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">"Lineage"</span><span class="p">]</span></code></pre></figure>
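<p>With the lineage stored for each species, the distribution across clades is a matter of counting. Below is a minimal sketch of that last step, assuming the lineage is a string of clade names separated by semicolons, as in the NCBI taxonomy records. The example lineages are shortened placeholders and the helpers <code class="language-plaintext highlighter-rouge">clade_at_depth</code> and <code class="language-plaintext highlighter-rouge">clade_distribution</code> are hypothetical names, not part of BioPython.</p>

```python
from collections import Counter

# Hypothetical lineage strings in the "; "-separated form stored in the 'taxonomy' column.
# Shortened placeholders, not real NCBI records.
lineages = [
    "cellular organisms; Eukaryota; Metazoa; Chordata",
    "cellular organisms; Eukaryota; Metazoa; Arthropoda",
    "cellular organisms; Eukaryota; Fungi",
    "cellular organisms; Bacteria; Proteobacteria",
    None,  # a species whose taxon ID could not be resolved
]


def clade_at_depth(lineage, depth):
    """Return the clade at a given depth of a lineage, or None if it is missing or too short."""
    if not lineage:
        return None
    clades = [clade.strip() for clade in lineage.split(";")]
    return clades[depth] if depth < len(clades) else None


def clade_distribution(lineages, depth):
    """Count how many lineages fall in each clade at the chosen depth."""
    clades = (clade_at_depth(lineage, depth) for lineage in lineages)
    return Counter(clade for clade in clades if clade is not None)


print(clade_distribution(lineages, depth=1))
print(clade_distribution(lineages, depth=2))
```

<p>Choosing a deeper level gives a finer breakdown, at the cost of dropping species whose lineage is shorter than that depth.</p>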
<h3 id="full-example">Full example:</h3>
<script src="https://gist.github.com/edc5af41628c886abe0a.js"> </script>
MapReduce-like operations across jobs in cluster
2015-01-31T00:00:00+00:00
https://andre-rendeiro.com/2015/01/31/mapreduce_slurm
<p>MapReduce operations allow parallelization of tasks, taking advantage of additional available CPUs. However, one might want to use processors across several nodes in a computing cluster, and while several options exist for this (with very different aims and scalability), I didn’t feel there was one that allowed doing it interactively (for example during an <code class="language-plaintext highlighter-rouge">IPython</code> session) on a Slurm cluster without diving into lots of documentation. So obviously, here’s my custom solution.</p>
<p>The strategy I followed splits the input into pools, which are submitted in parallel as jobs to the cluster; each pool is then processed in parallel using the <code class="language-plaintext highlighter-rouge">multiprocessing</code> library. By controlling the number of jobs submitted and the size of each pool, this is a middle ground between mapping every input to its own job (clogging the cluster) and using only the CPUs available in one machine/node. This approach was inspired by conversations with Michael Schuster and Nathan Sheffield in my lab.</p>
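<p>The two levels of parallelism can be sketched locally, without a cluster. In this minimal example each pool stands for one cluster job, and the items inside a pool are processed in parallel. The names <code class="language-plaintext highlighter-rouge">chunkify</code>, <code class="language-plaintext highlighter-rouge">process_pool</code> and <code class="language-plaintext highlighter-rouge">square</code> are illustrative, and I use <code class="language-plaintext highlighter-rouge">multiprocessing.pool.ThreadPool</code> only so the sketch runs anywhere; for CPU-bound work one would likely use <code class="language-plaintext highlighter-rouge">multiprocessing.Pool</code>, which has the same <code class="language-plaintext highlighter-rouge">map</code> interface.</p>

```python
from multiprocessing.pool import ThreadPool


def chunkify(items, n_pools):
    """Split items into n_pools interleaved, roughly equal-sized pools."""
    return [items[i::n_pools] for i in range(n_pools)]


def process_pool(pool_items, worker, n_workers=4):
    """Apply worker to every item of one pool in parallel (what a single job would do)."""
    with ThreadPool(n_workers) as workers:
        return workers.map(worker, pool_items)


def square(x):
    """Stand-in for the real per-item computation."""
    return x * x


items = list(range(10))
pools = chunkify(items, 3)  # three pools would become three cluster jobs
# On the cluster the pools run as separate jobs; here they run one after another
results = [process_pool(pool, square) for pool in pools]
flat = sorted(value for pool_result in results for value in pool_result)
print(pools)
print(flat)
```

<p>The number of pools sets how many jobs hit the scheduler, while the number of workers per pool sets how many CPUs each job tries to use.</p>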
<p>I create an <code class="language-plaintext highlighter-rouge">object</code> to manage tasks, each of which can include huge amounts of independent data to be processed the same way. Each task’s input is split into equal(ish)-sized pools, which are submitted to Slurm as jobs on demand. For now I keep track of tasks in a dict, but I will expand this into a <code class="language-plaintext highlighter-rouge">Task(object)</code> class that takes care of them.</p>
<p>I use <code class="language-plaintext highlighter-rouge">subprocess</code> to keep track of the job IDs Slurm gives to the jobs, so I can check whether they have finished or are still running.
For now, the task to be run is written in a separate script that is called by the Slurm job.</p>
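<p>As a rough sketch of the job tracking: <code class="language-plaintext highlighter-rouge">sbatch</code> reports the ID of a submitted job in its output, and <code class="language-plaintext highlighter-rouge">squeue</code> lists a job while it is pending or running. The helpers <code class="language-plaintext highlighter-rouge">parse_job_id</code> and <code class="language-plaintext highlighter-rouge">is_running</code> below are hypothetical names of mine, not the methods of the class, and <code class="language-plaintext highlighter-rouge">is_running</code> assumes Slurm is installed on the machine.</p>

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return match.group(1) if match else None


def is_running(job_id):
    """Return True while squeue still lists the job (needs a Slurm installation)."""
    # -h drops the header line, -j restricts the listing to one job
    result = subprocess.run(
        ["squeue", "-h", "-j", str(job_id)],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return result.returncode == 0 and result.stdout.strip() != ""


print(parse_job_id("Submitted batch job 123456\n"))
```

<p>Matching the whole message rather than stripping every non-digit character avoids picking up stray numbers if the output ever contains more than the job ID.</p>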
<p>The basic usage would be something like this:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">slurm</span> <span class="o">=</span> <span class="n">DivideAndSlurm</span><span class="p">()</span> <span class="c1"># create instance of object
</span><span class="n">regions</span> <span class="o">=</span> <span class="p">[</span><span class="n">promoters</span><span class="p">,</span> <span class="n">genes</span><span class="p">]</span> <span class="c1"># data is iterable with iterables - each is a separate task with multiple regions
</span>
<span class="k">for</span> <span class="n">region</span> <span class="ow">in</span> <span class="n">regions</span><span class="p">:</span> <span class="c1"># Add several tasks:
</span>    <span class="n">taskNumber</span> <span class="o">=</span> <span class="n">slurm</span><span class="p">.</span><span class="n">task</span><span class="p">(</span><span class="n">region</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="n">bamFile</span><span class="p">)</span> <span class="c1"># Add new task - syntax: data, fractions, *additional arguments
</span>    <span class="n">slurm</span><span class="p">.</span><span class="n">submit</span><span class="p">(</span><span class="n">taskNumber</span><span class="p">)</span> <span class="c1"># Submit new task
</span>
<span class="n">slurm</span><span class="p">.</span><span class="n">is_ready</span><span class="p">(</span><span class="n">taskNumber</span><span class="p">)</span> <span class="c1"># check if task is done
</span><span class="n">output</span> <span class="o">=</span> <span class="n">slurm</span><span class="p">.</span><span class="n">collect_output</span><span class="p">(</span><span class="n">taskNumber</span><span class="p">)</span> <span class="c1"># collect output</span></code></pre></figure>
<p>This would submit 20 jobs per task, which would each take further advantage of parallel processing.</p>
<p>The essential code for the class is here:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">textwrap</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">import</span> <span class="nn">pickle</span>
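<span class="k">class</span> <span class="nc">DivideAndSlurm</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Class to handle a map-reduce style submission of jobs to a Slurm cluster."""</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tmpDir</span><span class="o">=</span><span class="s">"."</span><span class="p">,</span> <span class="n">queue</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">userMail</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">tasks</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
        <span class="c1"># attributes read by the methods below; the default values are placeholders
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">tmpDir</span> <span class="o">=</span> <span class="n">tmpDir</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">queue</span> <span class="o">=</span> <span class="n">queue</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">userMail</span> <span class="o">=</span> <span class="n">userMail</span>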
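    <span class="k">def</span> <span class="nf">_slurmHeader</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">jobName</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">queue</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">userMail</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="s">"""Job file preamble; lines are flush left so that the shebang opens the file."""</span>
        <span class="c1"># the #SBATCH options are an assumption inferred from the arguments passed in task()
</span>        <span class="n">lines</span> <span class="o">=</span> <span class="p">[</span><span class="s">"#!/bin/bash"</span><span class="p">,</span>
                 <span class="s">"#SBATCH --job-name={0}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">jobName</span><span class="p">),</span>
                 <span class="s">"#SBATCH --output={0}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">output</span><span class="p">)]</span>
        <span class="k">if</span> <span class="n">queue</span><span class="p">:</span>
            <span class="n">lines</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"#SBATCH --partition={0}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">queue</span><span class="p">))</span>
        <span class="k">if</span> <span class="n">userMail</span><span class="p">:</span>
            <span class="n">lines</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"#SBATCH --mail-user={0}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">userMail</span><span class="p">))</span>
        <span class="n">lines</span> <span class="o">+=</span> <span class="p">[</span><span class="s">"# Start running the job"</span><span class="p">,</span> <span class="s">"hostname"</span><span class="p">,</span> <span class="s">"date"</span><span class="p">,</span> <span class="s">""</span><span class="p">]</span>
        <span class="k">return</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">_slurmFooter</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="s">"</span><span class="se">\n</span><span class="s">date # Job end</span><span class="se">\n</span><span class="s">"</span>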
    <span class="k">def</span> <span class="nf">_slurmSubmitJob</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">jobFile</span><span class="p">):</span>
        <span class="s">"""Submit command to shell."""</span>
        <span class="n">command</span> <span class="o">=</span> <span class="p">[</span><span class="s">"sbatch"</span><span class="p">,</span> <span class="n">jobFile</span><span class="p">]</span>
        <span class="c1"># universal_newlines=True makes communicate() return text rather than bytes
</span>        <span class="n">p</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">Popen</span><span class="p">(</span><span class="n">command</span><span class="p">,</span> <span class="n">stdout</span><span class="o">=</span><span class="n">subprocess</span><span class="p">.</span><span class="n">PIPE</span><span class="p">,</span> <span class="n">universal_newlines</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">p</span><span class="p">.</span><span class="n">communicate</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">_split_data</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">taskName</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">fractions</span><span class="p">):</span>
        <span class="s">"""Split data in fractions and create pickle objects with them."""</span>
        <span class="n">chunkify</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">lst</span><span class="p">,</span> <span class="n">n</span><span class="p">:</span> <span class="p">[</span><span class="n">lst</span><span class="p">[</span><span class="n">i</span><span class="p">::</span><span class="n">n</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">)]</span>
        <span class="n">groups</span> <span class="o">=</span> <span class="n">chunkify</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">fractions</span><span class="p">)</span>
        <span class="n">ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">taskName</span> <span class="o">+</span> <span class="s">"_"</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">groups</span><span class="p">))]</span>
        <span class="n">files</span> <span class="o">=</span> <span class="p">[</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tmpDir</span><span class="p">,</span> <span class="n">ID</span><span class="p">)</span> <span class="k">for</span> <span class="n">ID</span> <span class="ow">in</span> <span class="n">ids</span><span class="p">]</span>
        <span class="n">groups</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">ids</span><span class="p">,</span> <span class="n">groups</span><span class="p">,</span> <span class="n">files</span><span class="p">))</span> <span class="c1"># keep track of groups in self
</span>
        <span class="c1"># serialize groups
</span>        <span class="k">for</span> <span class="n">ID</span><span class="p">,</span> <span class="n">group</span><span class="p">,</span> <span class="n">path</span> <span class="ow">in</span> <span class="n">groups</span><span class="p">:</span>
            <span class="c1"># write each group of objects to its own pickle file
</span>            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">path</span> <span class="o">+</span> <span class="s">".pickle"</span><span class="p">,</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">handle</span><span class="p">:</span>
                <span class="n">pickle</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">group</span><span class="p">,</span> <span class="n">handle</span><span class="p">,</span> <span class="n">protocol</span><span class="o">=</span><span class="n">pickle</span><span class="p">.</span><span class="n">HIGHEST_PROTOCOL</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">groups</span>
    <span class="k">def</span> <span class="nf">task</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">fractions</span><span class="p">,</span> <span class="n">bam_file</span><span class="p">,</span> <span class="n">strand_wise</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">fragment_size</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
        <span class="s">"""Add task to be performed with data."""</span>
        <span class="n">now</span> <span class="o">=</span> <span class="s">"_"</span><span class="p">.</span><span class="n">join</span><span class="p">([</span><span class="n">time</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="s">"%Y%m%d%H%M%S"</span><span class="p">,</span> <span class="n">time</span><span class="p">.</span><span class="n">localtime</span><span class="p">()),</span> <span class="nb">str</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">))])</span>
        <span class="n">taskName</span> <span class="o">=</span> <span class="s">"task_name_{0}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">now</span><span class="p">)</span>
        <span class="n">log</span> <span class="o">=</span> <span class="n">taskName</span> <span class="o">+</span> <span class="s">".log"</span>
        <span class="c1"># dicts (including OrderedDict) are turned into a list of (key, value) pairs so they can be sliced
</span>        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="nb">dict</span><span class="p">):</span>
            <span class="n">data</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">items</span><span class="p">())</span>
        <span class="c1"># split data in fractions
</span>        <span class="n">groups</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_split_data</span><span class="p">(</span><span class="n">taskName</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">fractions</span><span class="p">)</span>
        <span class="c1"># make jobs with groups of data
</span>        <span class="n">jobs</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
        <span class="n">jobFiles</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">groups</span><span class="p">)):</span>
            <span class="n">jobFile</span> <span class="o">=</span> <span class="n">groups</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="s">"_task_name.sh"</span>
            <span class="n">input_pickle</span> <span class="o">=</span> <span class="n">groups</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="s">".pickle"</span>
            <span class="n">output_pickle</span> <span class="o">=</span> <span class="n">groups</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="s">".output.pickle"</span>
            <span class="c1"># assemble command for job
</span>            <span class="n">task</span> <span class="o">=</span> <span class="s">" python perform_task_parallel.py {0} {1} {2} "</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">input_pickle</span><span class="p">,</span> <span class="n">output_pickle</span><span class="p">,</span> <span class="n">bam_file</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">strand_wise</span><span class="p">:</span>
                <span class="n">task</span> <span class="o">+=</span> <span class="s">"--strand-wise "</span>
            <span class="n">task</span> <span class="o">+=</span> <span class="s">"--fragment-size {0}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">fragment_size</span><span class="p">)</span>
            <span class="c1"># assemble job file
</span>            <span class="n">job</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_slurmHeader</span><span class="p">(</span><span class="n">groups</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">log</span><span class="p">,</span> <span class="n">queue</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">queue</span><span class="p">,</span> <span class="n">userMail</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">userMail</span><span class="p">)</span> <span class="o">+</span> <span class="n">task</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">_slurmFooter</span><span class="p">()</span>
            <span class="c1"># keep track of jobs and their files
</span>            <span class="n">jobs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">job</span><span class="p">)</span>
            <span class="n">jobFiles</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">jobFile</span><span class="p">)</span>
            <span class="c1"># write job file to disk
</span>            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">jobFile</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">handle</span><span class="p">:</span>
                <span class="n">handle</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">textwrap</span><span class="p">.</span><span class="n">dedent</span><span class="p">(</span><span class="n">job</span><span class="p">))</span>
        <span class="c1"># save task in object
</span>        <span class="n">taskNumber</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">[</span><span class="n">taskNumber</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="c1"># don't keep track of data
</span>            <span class="s">"name"</span> <span class="p">:</span> <span class="n">taskName</span><span class="p">,</span>
            <span class="s">"groups"</span> <span class="p">:</span> <span class="n">groups</span><span class="p">,</span>
            <span class="s">"jobs"</span> <span class="p">:</span> <span class="n">jobs</span><span class="p">,</span>
            <span class="s">"jobFiles"</span> <span class="p">:</span> <span class="n">jobFiles</span><span class="p">,</span>
            <span class="s">"log"</span> <span class="p">:</span> <span class="n">log</span>
        <span class="p">}</span>
        <span class="c1"># return taskNumber so that it can be used later
</span>        <span class="k">return</span> <span class="n">taskNumber</span>
<span class="k">def</span> <span class="nf">submit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">taskNumber</span><span class="p">):</span>
<span class="s">"""Submit slurm jobs with each fraction of data."""</span>
<span class="n">jobIDs</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">[</span><span class="n">taskNumber</span><span class="p">][</span><span class="s">"jobs"</span><span class="p">])):</span>
<span class="n">output</span><span class="p">,</span> <span class="n">err</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_slurmSubmitJob</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">[</span><span class="n">taskNumber</span><span class="p">][</span><span class="s">"jobFiles"</span><span class="p">][</span><span class="n">i</span><span class="p">])</span>
<span class="n">jobIDs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="s">r"\D"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">output</span><span class="p">))</span>
<span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">[</span><span class="n">taskNumber</span><span class="p">][</span><span class="s">"submission_time"</span><span class="p">]</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">[</span><span class="n">taskNumber</span><span class="p">][</span><span class="s">"jobIDs"</span><span class="p">]</span> <span class="o">=</span> <span class="n">jobIDs</span>
<span class="k">def</span> <span class="nf">collect_output</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">taskNumber</span><span class="p">):</span>
<span class="s">"""If self.is_ready(taskNumber), return joined data."""</span>
<span class="k">if</span> <span class="n">taskNumber</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">:</span>
<span class="k">raise</span> <span class="nb">KeyError</span><span class="p">(</span><span class="s">"Task number not in object's tasks."</span><span class="p">)</span>
<span class="k">if</span> <span class="s">"output"</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">[</span><span class="n">taskNumber</span><span class="p">]:</span> <span class="c1"># if output is already stored, just return it
</span> <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">[</span><span class="n">taskNumber</span><span class="p">][</span><span class="s">"output"</span><span class="p">]</span>
<span class="c1"># load all pickles into list
</span> <span class="n">groups</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">[</span><span class="n">taskNumber</span><span class="p">][</span><span class="s">"groups"</span><span class="p">]</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">pickle</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">groups</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="s">".output.pickle"</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">))</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">groups</span><span class="p">))]</span>
<span class="c1"># if all are counters, and their elements are counters, sum them
</span> <span class="k">if</span> <span class="nb">all</span><span class="p">([</span><span class="nb">type</span><span class="p">(</span><span class="n">outputs</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">==</span> <span class="n">Counter</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">outputs</span><span class="p">))]):</span>
<span class="n">output</span> <span class="o">=</span> <span class="nb">reduce</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="p">,</span> <span class="n">outputs</span><span class="p">)</span> <span class="c1"># reduce
</span> <span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">output</span><span class="p">)</span> <span class="o">==</span> <span class="n">Counter</span><span class="p">:</span>
<span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">[</span><span class="n">taskNumber</span><span class="p">][</span><span class="s">"output"</span><span class="p">]</span> <span class="o">=</span> <span class="n">output</span> <span class="c1"># store output in object
</span> <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">tasks</span><span class="p">[</span><span class="n">taskNumber</span><span class="p">][</span><span class="s">"output"</span><span class="p">]</span>
</code></pre></figure>
<p>In a second level of parallelization, a regular map-reduce operation is also employed. Here I use the <code class="language-plaintext highlighter-rouge">parmap</code> module (a wrapper around <code class="language-plaintext highlighter-rouge">multiprocessing</code>), since <code class="language-plaintext highlighter-rouge">multiprocessing.Pool.map()</code> passes only a single argument to the mapped function:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">multiprocessing</span>
<span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="nb">reduce</span>  <span class="c1"># builtin on Python 2, must be imported on Python 3</span>
<span class="kn">import</span> <span class="nn">parmap</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>

<span class="k">def</span> <span class="nf">task</span><span class="p">(</span><span class="n">singleFeature</span><span class="p">,</span> <span class="n">bamFile</span><span class="p">):</span>
    <span class="s">"""Computes something with reads present in a single, specific interval.
    Returns Counter."""</span>
    <span class="n">counts</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">()</span>
    <span class="c1"># ...</span>
    <span class="k">return</span> <span class="n">counts</span>

<span class="c1"># features (list of intervals) and bamFile are assumed to be defined earlier</span>
<span class="n">output</span> <span class="o">=</span> <span class="nb">reduce</span><span class="p">(</span>
    <span class="k">lambda</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="p">,</span>
    <span class="n">parmap</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">features</span><span class="p">,</span> <span class="n">bamFile</span><span class="p">)</span>
<span class="p">)</span></code></pre></figure>
<p>Also, <code class="language-plaintext highlighter-rouge">collections.Counter</code> objects are really useful, and one can reduce them by summation.</p>
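<p>As a minimal sketch of that reduction step (the interval names and counts below are made up for illustration, and <code>merge_counters</code> is a hypothetical helper, not part of the class above), summing a list of Counters could look like this. Note that adding Counters drops keys whose total is zero or negative.</p>

```python
from collections import Counter
from functools import reduce
import operator


def merge_counters(counters):
    """Sum an iterable of Counters into one; an empty input gives an empty Counter."""
    return reduce(operator.add, counters, Counter())


# Hypothetical per-interval outputs, standing in for what each task() call returns.
partial_counts = [
    Counter({"chr1": 2, "chr2": 1}),
    Counter({"chr1": 3}),
    Counter({"chr3": 4}),
]
total = merge_counters(partial_counts)
print(total)
```

<p>Passing an empty <code>Counter()</code> as the initial value keeps the reduction from failing when no outputs were collected.</p>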
<h1 id="complete-example">Complete example:</h1>
<p>I illustrate the complete implementation of the class with an example which takes several genomic regions (combinations of H3K4me3 or H3K27me3 peaks) and computes an output (coverage, density, etc.) under those peaks.</p>
<p>I add more functions to the main object to perform tasks such as removing temporary files (pickles, sh files, logs…) and checking whether the jobs are finished and the output is of the right form.</p>
<script src="https://gist.github.com/b5e97b429ff7363f5574.js"> </script>
Mining CosmicDB
2014-12-29T00:00:00+00:00
https://andre-rendeiro.com/2014/12/29/mining_cosmicdb
<p>I’m interested in proteins that have an “epigenetic” function and want to get a feeling for how their mutations are distributed across cancers.</p>
<p>COSMIC (the Catalogue Of Somatic Mutations In Cancer database) has information on commonly mutated genes in various cancers.
Since it requires a login to access the data, I downloaded the data manually from the website. These data are (or were at some point) available through BioMart, but I haven’t been able to access them programmatically and apparently <a href="https://www.biostars.org/p/102234/">I’m not the only one</a>.</p>
<p>Data will be explored further in the future.</p>
<script src="https://gist.github.com/6ef84ee6940b4f28d85c.js"> </script>
Playing with CellProfiler
2014-11-23T00:00:00+00:00
https://andre-rendeiro.com/2014/11/23/playing_with_cellprofiler
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<style>
.centerImages {
line-height:200px;
text-align:center;
margin-left: auto;
margin-right: auto;
width: 90%;
vertical-align:middle;
}
.ulpost {list-style-type: none; margin: 0; padding: 0;}
.lipost {display: inline; margin-right: 20px;}
.lipost>a {width: 120px;}
</style>
<h3>Contents</h3>
<ul class="ulpost">
<li class="lipost"><a href="#background">Background</a></li>
<li class="lipost"><a href="#results">Results</a></li>
<li class="lipost"><a href="#discussion">Discussion</a></li>
<li class="lipost"><a href="#methods">Methods</a></li>
<li class="lipost"><a href="#code">Code</a></li>
</ul>
<p><br /></p>
<p>A combination of increasing demand for new drugs and advances in automation has made high-throughput screening of chemical compounds a reality. Image-based screens, in particular, can measure hundreds of cellular features at the single-cell level and are therefore of great interest.</p>
<p>In this small rotation project, I optimized the usage of existing computational tools for the analysis of image-based chemical screens at the single-cell level.</p>
<h2><a href="/2014/11/23/playing_with_cellprofiler#background" name="background">Background</a></h2>
<p>While the need for new drugs is increasing, the number of newly approved drugs per decade is in fast decline. In response, high-throughput screening of chemical compounds is increasingly used and in high demand, since it can investigate hundreds of thousands of compounds in a relatively short time.</p>
<p>Image-based screens are able to measure hundreds of cellular and sub-cellular features quantitatively and are therefore extremely powerful. Although such screens are generally slower than other screening methods, the amount and variety of data acquired in high-throughput image-based screening is unsurpassed. Their ability to take single-cell measurements, a quality that other fields of molecular biology have only very recently started to acquire, makes them a data-rich method par excellence.</p>
<p>Of particular interest is the ability of these screens to probe into the cell-to-cell variability of responses to perturbations, where the population context of cells can be a determinant condition to respond in variable ways when perturbed.</p>
<p align="center"><small> I know citations are due here, I'm looking into https://github.com/inukshuk/jekyll-scholar to do that in the best way.</small></p>
<h2><a href="/2014/11/23/playing_with_cellprofiler#results" name="results">Results</a></h2>
<p>To obtain objects from the images captured in the screen, object detection using Cell Profiler was performed. This was performed by manually adjusting detection parameters (see Methods section). Based on previously identified objects from the raw signal (Figure 1A-D), other cellular features can be inferred (Figure 1E) and several metrics dependent on the detected objects can be computed (Figure 1F).</p>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/cellProfiler.png" align="middle" style="width: 700px;" />
</div>
<p align="center">Figure 1 - <small>A) Raw signal from the DAPI channel; B) Nuclei object detection; C) Raw signal from the Alexa488 channel; D) Detection of objects representing $\beta$-tubulin staining; E) Inferred cytoplasm objects based on the detection of B) and D); F) Color-coded cell objects based on the percentage of surface in contact with other cell objects (neighbours).</small></p>
<p style="clear: both;"> </p>
<p>Objects representing cells were detected and their properties were quantified in over 200 variables. To reduce complexity, mean measurements per well were first used to describe inter-well variability.</p>
<p>Due to position-dependent differences in the conditions for cell growth, plate-dependent effects are noticeable on virtually every cellular measurement taken during the screen. To neutralize this, measurements were normalized by taking into account the cell’s position on the plate (see Methods). Normalization of the bias is noticeable (Figure 2) and, in the case of measurements capable of distinguishing cell viability (<em>e.g.</em> number of cells, cell area), led to generally higher Z-factor scores for that measurement. Normalization not only corrects experimental biases but also improves the ability to compare wells with different treatments and therefore to distinguish between negative and positive controls - <em>e.g.</em> notice the difference between wells in diagonals (positive controls) in raw versus normalized data in Figure 2B.</p>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/concatenated_plots.jpg" align="middle" style="width: 700px;" />
</div>
<p align="center">Figure 2 - Comparison between raw and normalized data of four cellular measurements:
<small>A) Cell numbers per well; B) Average area per well; C) Average perimeter per well; D) Average cellular form factor per well. For each measurement, the left panels use raw data, whereas the right ones use data normalized as described in the Methods section. The upper panels represent data preserving the positional information of the 384-well plate used in the screen.
</small>
</p>
<p style="clear: both;"> </p>
<p>Since no single measurement is able to perfectly discriminate the cellular response to the chemical treatment, relationships between different measurements were observed.</p>
<p>In accordance with expectations, some measurements were highly correlated due to their intrinsic physical relationships. Mean cellular area and perimeter are naturally correlated (Figure 3A) and this is not particularly informative since these cellular properties are maintained during cellular growth or death. Cell number and measurements of cellular form factor are anti-correlated (Figure 3B), showing that the second measurement detects changes in cellular shape as the number of cells in a well diminishes. Measurements of overall cellular shape therefore seem better suited than measurements of cellular dimension to discriminate between positive and negative controls, either in combination (Figure 3C) or when compared with other cellular measurements (Figure 3D).</p>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/correlation_plots.jpg" align="middle" style="width: 500px;" />
</div>
<p align="center">Figure 3 - Relationship between some of the cellular measurements.
<small>A) Area and perimeter; B) Cell number and cellular form factor; C) Eccentricity and form factor; D) Form factor and area. All measurements (except cell number) are averages from each well. Pearson correlation coefficients are shown for values within groups of wells under the same category (Negative control: DMSO; Positive controls: PosCon; and treatments with several compounds: Compound).
</small>
</p>
<p style="clear: both;"> </p>
<p>To explore the variability of cellular response to the same chemical perturbation under the same conditions, I took advantage of the power of this screen to measure single cells. Variability can be observed as a density function produced for each well (Figure 4). The general shape of the distribution in most measurements resembles a Gaussian distribution, showing that the majority of cells exposed to the same stimuli respond similarly. Nevertheless, for some compounds it is possible to notice a variable response within the exposed cellular population, forming sub-populations. The existence of sub-populations can be explained by technical artefacts such as cell death and fragmentation during the immunostaining procedure or incorrect detection of cellular objects, but a true differential response of cells to a compound can also be due to differential population contexts.</p>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/single_cell_density.jpg" align="middle" style="width: 500px;" />
</div>
<p align="center">Figure 4 - Density plots of <small>A) Cell shape eccentricity and B) Cell perimeter per well.</small></p>
<p style="clear: both;"> </p>
<p>Due to the high-information content nature of the screen, a very interesting approach to detect compounds with a relevant effect on overall cellular biology would be to use machine learning to distinguish between normal cellular states and states revealing an effect of the compound. I trained a Support Vector Machine (SVM) using a linear model in a classification task to distinguish between cells treated with DMSO or with a compound used as a positive control.</p>
<p>As a first step, I used subsets (12 to 48 measurements) or the whole group of cellular or nuclear measurements of shape and intensity, each independently, in a Principal Component Analysis (PCA) to provide the SVM with a two-dimensional input (Figure 5). The models were used to classify all cells in wells treated with chemical compounds. Unexpectedly, none of the training attempts, with any combination of starting cellular or nuclear measurements, yielded a model that classified any of the cells in compound-treated wells as affected in the way cells in the positive control wells are. Nevertheless, such cells must exist, as shown by the shift in density towards the location where most cells in positive control wells lie when some variables are measured independently (Figure 4).</p>
<div class="centerImages">
<img src="https://andre-rendeiro.com/data/figures/pcas.jpg" align="middle" style="width: 500px;" />
</div>
<p align="center">Figure 5 - Principal component analysis <small>of A) 12 cellular measurements; B) 48 measurements of nuclei features.</small></p>
<p style="clear: both;"> </p>
<h2><a href="/2014/11/23/playing_with_cellprofiler#discussion" name="discussion">Discussion</a></h2>
<p>While many measurements of cell dimensions, shape and signal intensity are correlated with the cellular responses to the chemical treatment, it is clear that no single measure is fit to evaluate them to a sufficient degree - therefore, methods that integrate several measurements are of great interest.</p>
<p>All of our attempts to use an SVM classifier to detect compounds with effects similar to the ones used as positive controls failed. The main drawback probably lies in the requirement of SVM models to be fed with low-dimensional data. The use of a PCA approach to reduce dimensionality is counterproductive since it promotes severe loss of information and does not necessarily guarantee dispersion of classes, as shown in Figure 5. Many of the variables might not be informative at all, adding noise to the data and clouding the training step of machine learning. It is also possible that the distribution of cellular responses to treatments with a vast array of compounds is not linear and therefore a linear classifier would also not be appropriate.</p>
<h2><a href="/2014/11/23/playing_with_cellprofiler#methods" name="methods">Methods</a></h2>
<p><a href="http://cellprofiler.org/">Cell Profiler</a> was used to automate image analysis on the images obtained from four fields in each well of the 384-well plate. This software allows detection of objects based on the signal of the three channels used in this assay - DAPI for nuclei, Alexa488 for $\beta$-tubulin and MitoTrackerOrange staining for mitochondria.</p>
<p>An approach to identify cellular features based on the fluorescent signal from the images was designed. It allows relationships to be established between features and can build on other classes of objects already identified. Two primary objects (nuclei and mitochondria) were identified based on the source of fluorescent signal from the DAPI and MitoTrackerOrange stainings, respectively. Secondary objects representing whole cells were collected by propagation from the nuclei objects until the border of signal from the Alexa 488 channel using a global thresholding strategy with the Otsu algorithm. While identifying both primary and secondary objects, objects that touched the image border were discarded. If this occurred to the secondary object, the primary object used for the propagation was discarded as well to avoid the existence of unpaired primary-secondary objects. Cytoplasm was identified as a tertiary object through the subtraction of the nuclei area from the area of the cell objects, therefore establishing a relation between a cell’s nucleus and cytoplasm. Relationships between cell and mitochondria objects were established by intersecting the location of mitochondria with the cells, and each overlapping mitochondria object was annotated with its respective parent cell.</p>
<p>For each object identified, measurements of signal intensity of all channels, area and shape were taken. Measurements of Euclidean distances between all objects within and between classes were taken, including pairs of parent-child objects.</p>
<p>Since Cell Profiler’s capabilities lie mainly in object discovery and measurement, data was exported to an HDF5 file and analysis proceeded in a custom-built script heavily based on Python and the Pandas library, allowing great flexibility and control over operations. Normalization of measurements was performed to compensate for biases in cellular growth conditions due to the cell’s position in the plate wells. This was performed for each cell by subtracting the median of all cell measurements in its row and column. For measurements which naturally possess high global variance, values were transformed with a base-2 logarithm.</p>
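<p>A minimal sketch of this position-based normalization, using only the standard library rather than the Pandas script mentioned above. It assumes one reading of the description: for each cell, the median is taken over all cells that share its plate row or its plate column, and is then subtracted. The function name, the record layout (<code>row</code>, <code>col</code>) and the toy values are hypothetical.</p>

```python
from statistics import median


def normalize_by_plate_position(cells, key):
    """Subtract from each cell the median of `key` over all cells in its row or column.

    `cells` is a list of dicts with 'row', 'col' and a numeric entry under `key`.
    Returns the normalized values in the same order as `cells`.
    """
    pooled_median = {}
    for row, col in {(c["row"], c["col"]) for c in cells}:
        # Pool every cell that shares the plate row or the plate column of this well.
        pooled = [c[key] for c in cells if c["row"] == row or c["col"] == col]
        pooled_median[(row, col)] = median(pooled)
    return [c[key] - pooled_median[(c["row"], c["col"])] for c in cells]


# Toy 2x2 plate with a single cell per well (made-up areas).
cells = [
    {"row": "A", "col": 1, "area": 10},
    {"row": "A", "col": 2, "area": 12},
    {"row": "B", "col": 1, "area": 20},
    {"row": "B", "col": 2, "area": 22},
]
normalized = normalize_by_plate_position(cells, "area")
print(normalized)
```

<p>For measurements with high global variance, the values could be passed through <code>math.log2</code> before this step, as described above.</p>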
<p>For all measurements, the ability to distinguish between negative and positive controls (Z-factor) was calculated as described by Zhang <em>et al.</em> (1999) and shown in equation 1:</p>
\[Z = 1 - \frac{3 \left( \mathrm{SD}\left(Pos\right) + \mathrm{SD}\left(Neg\right) \right)}{\left| \mathrm{mean}\left(Pos\right) - \mathrm{mean}\left(Neg\right) \right|}\]
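<p>As a small illustration of the equation, here is a sketch of the Z-factor for one measurement. The function name and the example values are hypothetical, and the use of the sample standard deviation (<code>statistics.stdev</code>) is an assumption, since the text does not say which estimator was used.</p>

```python
from statistics import mean, stdev


def z_factor(pos, neg):
    """Z-factor of one measurement given positive- and negative-control values."""
    separation = abs(mean(pos) - mean(neg))
    return 1 - 3 * (stdev(pos) + stdev(neg)) / separation


# Made-up control values for a single measurement.
positive_controls = [1, 2, 3]
negative_controls = [10, 11, 12]
z = z_factor(positive_controls, negative_controls)
print(round(z, 3))
```

<p>Values close to 1 indicate well-separated controls, while values at or below 0 indicate overlapping control distributions.</p>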
<p>In the analysis of single-cell variation, I chose 12, 20, 31 and 48 different measurements of nuclei shape, intensity and area, and performed Principal Component Analysis (PCA) on those variables to reduce dimensionality to two variables. A linear Support Vector Machine (SVM) classifier was trained on the two-variable data from cells in negative and positive control wells of two plates, and the resulting model was then run on all cells treated with the chemical compounds.</p>
<p>High-throughput image-based chemical screens yield a large amount of data in the form of images. The analysis of this kind of data can be particularly challenging due to its subjective nature. With the development of feature detection from images and machine learning algorithms, the task of detecting, quantifying and classifying objects is becoming increasingly easy. Nonetheless, detection of cellular features in images still requires optimization due to the variability between cellular phenotypes, which arises naturally between different cell lines but is exacerbated when these are perturbed.</p>
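<p>The PCA-then-SVM procedure can be sketched as follows. This is an illustration only: it assumes scikit-learn (the text does not name the library used), and all data here are synthetic random values standing in for the single-cell measurements.</p>

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for single-cell data: 12 measurements per cell.
negative = rng.normal(loc=0.0, scale=1.0, size=(200, 12))  # DMSO-like cells
positive = rng.normal(loc=3.0, scale=1.0, size=(200, 12))  # positive-control-like cells
X_train = np.vstack([negative, positive])
y_train = np.array([0] * 200 + [1] * 200)

# Reduce to two principal components, then fit a linear SVM on them.
model = make_pipeline(PCA(n_components=2), SVC(kernel="linear"))
model.fit(X_train, y_train)

# Classify cells from a hypothetical compound-treated well.
compound_cells = rng.normal(loc=3.0, scale=1.0, size=(50, 12))
predicted = model.predict(compound_cells)
fraction_affected = predicted.mean()
print(fraction_affected)
```

<p>Wrapping both steps in one pipeline ensures the compound-treated cells are projected with the same principal components that were learned from the control wells.</p>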
<h2><a href="/2014/11/23/playing_with_cellprofiler#code" name="code">Code</a></h2>
<p>This is basically just a dump of the code I used; I need to revisit it.</p>
<script src="https://gist.github.com/7425a73e808dc90449ed.js"> </script>
Mining drug functions into an ontology
2014-11-18T00:00:00+00:00
https://andre-rendeiro.com/2014/11/18/mining_drug_functions
<style>
.ulpost {list-style-type: none; margin: 0; padding: 0;}
.lipost {display: inline; margin-right: 20px;}
.lipost>a {width: 120px;}
</style>
<h3>Contents</h3>
<ul class="ulpost">
<li class="lipost"><a href="#background">Background</a></li>
<li class="lipost"><a href="#results">Results</a></li>
<li class="lipost"><a href="#discussion">Discussion</a></li>
<li class="lipost"><a href="#methods">Methods</a></li>
<li class="lipost"><a href="#code">Code</a></li>
</ul>
<p><br /></p>
<p>In this short rotation, I mined several databases for known biological functions of chemical compounds and built an ontology which can be used to test enrichment of functional classes in interesting compounds discovered by chemical screening.</p>
<h1><a href="/2014/11/18/mining_drug_functions#background" name="background">Background</a></h1>
<p>With the exponential rise of information on chemical compounds coming from high-throughput screens, there is also increasing demand for integrating the available data on the biological roles of biologically relevant chemical compounds. By relating chemical compounds with their previously known functions in similar or distinct cellular environments, one could better understand the underlying modes of action of drugs, or use this information to develop better drugs for new screens.</p>
<p>I implemented a small tool to mine several databases with relevant information on the bioactivity of chemical compounds. This creates a functional ontology of compounds representative of the whole library assayed which can be used to test the enrichment of subsets of relevant compounds (hits) with high-order biological functions of the compounds. I demonstrate the usefulness of the tool by annotating all compounds used in an image-based chemical screen and explore the ontology of the compounds with interesting effects by testing the significant enrichment of functional classes within them.</p>
<h1><a href="/2014/11/18/mining_drug_functions#results" name="results">Results</a></h1>
<p>As an example, all 1448 compounds from a chemical screen were used. Since a particular source of noise in databases stems from ambiguously annotated compounds, we used annotations only from databases which allowed search using <em>SMILES</em>.</p>
<p>Having all compounds annotated with their general biological functions, we created an ontology of terms that establishes relations between compounds sharing the same function. This ontology reflects the usage of the compounds - widely used compounds are usually annotated with more functions, while many of the most obscure compounds are often annotated with a smaller number of terms (see Figure 1A). Since the specificity of the terms with which a compound is annotated varies widely, the most abundant terms are generally less specific, with many drugs having one term that is highly specific to them and therefore unshared (Figure 1B).</p>
<div>
<img src="https://andre-rendeiro.com/data/figures/terms_and_drugs.png"
align="middle" style="width: 700px;"/>
</div>
<p align="center">Figure 1 - Relationship between compounds and their functional ontology terms.
<small>Top panel: Ontology terms per compound; Bottom panel: Compounds per ontology term. </small>
</p>
<p style="clear: both;"> </p>
<p>Using a subset of compounds which showed a relevant impact on cell viability in the chemical screen, all terms from compounds in this subset were tested for their over-representation in the subset considering the whole library of compounds used in the screen. A list of significantly over-represented terms can be seen in Table 1.
Since the readout for this screen was a measure of viability, it is no surprise that the term “antineoplastic agent” has the lowest p-value of the set, since most antineoplastic drugs focus on the neutralization of cell viability. Of particular relevance though are the terms “anti-inflammatory drug”, “glucocorticoid”, “immunosuppressive agent”, “anti-inflammatory agent” and others related to steroid drugs. These are the majority of the significant ones, and over-representation of this class indicates that the overall biological effect of inflammatory suppressive drugs is of great interest.</p>
<p align="center">Table 1 - Results from test of enrichment of functional terms.</p>
<table align="center">
<tr>
<th>term</th>
<th>odds_ratio</th>
<th>p-value</th>
</tr>
<tr>
<td>antineoplastic agent</td>
<td>3.98</td>
<td>0.000388</td>
</tr>
<tr>
<td>hydroxamic acid</td>
<td>20.89</td>
<td>0.001495</td>
</tr>
<tr>
<td>immunosuppressive agent</td>
<td>7.22</td>
<td>0.001685</td>
</tr>
<tr>
<td>11beta-hydroxy steroid</td>
<td>6.87</td>
<td>0.002029</td>
</tr>
<tr>
<td>20-oxo steroid</td>
<td>6.55</td>
<td>0.002422</td>
</tr>
<tr>
<td>glucocorticoid</td>
<td>8.72</td>
<td>0.002736</td>
</tr>
<tr>
<td>anti-inflammatory drug</td>
<td>6.26</td>
<td>0.002868</td>
</tr>
<tr>
<td>secondary alcohol</td>
<td>5.93</td>
<td>0.008652</td>
</tr>
<tr>
<td>3-oxo-Delta(1),Delta(4)-steroid</td>
<td>5.63</td>
<td>0.010105</td>
</tr>
<tr>
<td>antifungal agent</td>
<td>5.63</td>
<td>0.010105</td>
</tr>
<tr>
<td>21-hydroxy steroid</td>
<td>8.31</td>
<td>0.010449</td>
</tr>
<tr>
<td>anti-inflammatory agent</td>
<td>7.55</td>
<td>0.012958</td>
</tr>
<tr>
<td>fluorinated steroid</td>
<td>7.55</td>
<td>0.012959</td>
</tr>
<tr>
<td>spiroketal</td>
<td>13.59</td>
<td>0.018290</td>
</tr>
<tr>
<td>antiparasitic agent</td>
<td>5.52</td>
<td>0.026196</td>
</tr>
</table>
<p style="clear: both;"> </p>
<h1><a href="/2014/11/18/mining_drug_functions#discussion" name="discussion">Discussion</a></h1>
<p>Further development could focus on the optimization of the data mining process, where more sources could be integrated and more information could be extracted. A critical improvement would be a better integration of information from the several sources used, merging them on common terms, which could reduce the number of contradicting annotations. Implementing user-oriented annotation would also be of interest, since users could select sources to annotate compounds focused on a particular interest. Adding support for alternative testing options, such as the hypergeometric test, would also be something to develop further.</p>
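<p>As an illustration of the hypergeometric option mentioned above, here is a minimal sketch of a one-sided over-representation test for a single term, together with the odds ratio of the corresponding 2x2 table. This is not necessarily the test used to produce Table 1; the function name, argument names and example counts are hypothetical.</p>

```python
from math import comb


def enrichment(hits_with_term, hits_total, library_with_term, library_total):
    """Return (odds_ratio, p_value) for over-representation of a term among hits.

    The p-value is the hypergeometric probability of seeing at least
    `hits_with_term` annotated compounds when drawing `hits_total` compounds
    from a library in which `library_with_term` carry the term.
    """
    a = hits_with_term                          # hits with the term
    b = hits_total - a                          # hits without the term
    c = library_with_term - a                   # non-hits with the term
    d = library_total - hits_total - c          # non-hits without the term
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    upper = min(hits_total, library_with_term)
    p_value = sum(
        comb(library_with_term, k) * comb(library_total - library_with_term, hits_total - k)
        for k in range(a, upper + 1)
    ) / comb(library_total, hits_total)
    return odds_ratio, p_value


# Made-up counts: 2 of 4 hits carry a term that 5 of 20 library compounds carry.
odds_ratio, p_value = enrichment(2, 4, 5, 20)
print(odds_ratio, p_value)
```

<p>In practice the resulting p-values would also need correction for the number of terms tested.</p>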
<h1><a href="/2014/11/18/mining_drug_functions#methods" name="methods">Methods</a></h1>
<h4><a name="database mining">Database mining</a></h4>
<p>Sources for the annotation of chemical compounds were diverse databases with relevance to Chemical Biology or specialized in the use of compounds in a biological context. Matches to most databases were performed using the simplified molecular-input line-entry system (<em>SMILES</em>) - a one-line notation commonly used to describe the structure of a chemical compound. While still human-readable, this system provides an unequivocal description of a given chemical compound (one <em>SMILES</em> - one compound), but the reverse relationship can be redundant (one compound - many <em>SMILES</em>). This property makes database searches by <em>SMILES</em> suboptimal if the same compound is described with different <em>SMILES</em> in various entries.</p>
<p><a href="http://www.chemspider.com/">ChemSpider</a>, maintained by the Royal Society of Chemistry, contains chemical structures of over 32 million compounds and provides text search through an API. Matches to the database were performed based on <em>SMILES</em>.</p>
<p>ChEMBL is the EMBL-EBI database for bioactive small molecules. It contains information on structural and chemical properties as well as on ‘bioactivities’ attributed to compounds. To circumvent the redundancy of the database, reduce the rate of compound misannotation and obtain an annotation as complete as possible, we developed a very simple algorithm to match a <em>SMILES</em> query to its ChEMBL entry when more than one match was retrieved: for each of several selected properties, results were compared with the query and differing results were discarded until only one result remained. This proceeded from broader to more specific characteristics, in this order: molecular formula, molecular weight, <em>SMILES</em>, ChEMBL known drug. If more than one result remained after all iterations, the one at the top of the results was selected. For each selected compound, bioactivities were also extracted if the compound was marked as active in that particular assay. This information comes from assays in which the compound was used and includes the activity status (active/inactive), the target (ChEMBL id and name were extracted) and the organism in which the assay was performed.</p>
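<p>The filtering procedure described above can be sketched roughly as follows. This is a minimal illustration rather than the code in <code class="language-plaintext highlighter-rouge">drugAnnotation.py</code>: the function name, the dictionary keys (<code class="language-plaintext highlighter-rouge">formula</code>, <code class="language-plaintext highlighter-rouge">weight</code>, <code class="language-plaintext highlighter-rouge">smiles</code>, <code class="language-plaintext highlighter-rouge">known_drug</code>) and the rule of skipping a property that no candidate matches are all assumptions made for the example.</p>

```python
def pick_chembl_entry(query, candidates,
                      keys=("formula", "weight", "smiles", "known_drug")):
    """Narrow several candidate entries down to one.

    `query` and each candidate are plain dicts; the key names are
    hypothetical and do not mirror the real ChEMBL response fields.
    """
    remaining = list(candidates)
    for key in keys:  # from broad to specific properties
        if len(remaining) <= 1:
            break
        # Keep only candidates that agree with the query on this property
        matching = [c for c in remaining if c.get(key) == query.get(key)]
        # Assumption: if no candidate agrees, skip this property instead
        # of discarding every result
        if matching:
            remaining = matching
    # Fall back to the top result if several candidates survive
    return remaining[0] if remaining else None


query = {"formula": "C2H6O", "weight": 46.07, "smiles": "CCO", "known_drug": True}
candidates = [
    {"id": "entry_1", "formula": "C2H6O", "weight": 46.07, "smiles": "COC", "known_drug": False},
    {"id": "entry_2", "formula": "C2H6O", "weight": 46.07, "smiles": "CCO", "known_drug": True},
    {"id": "entry_3", "formula": "CH4O", "weight": 32.04, "smiles": "CO", "known_drug": False},
]
best = pick_chembl_entry(query, candidates)
```

<p>In this toy example the formula check removes <code class="language-plaintext highlighter-rouge">entry_3</code> and the <em>SMILES</em> check removes <code class="language-plaintext highlighter-rouge">entry_1</code>, leaving <code class="language-plaintext highlighter-rouge">entry_2</code>. Real molecular weights would probably need a tolerance rather than exact equality.</p>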
<p>Chemical Entities of Biological Interest (ChEBI) is another EMBL-EBI database for chemical compounds, which focuses on the ontology of these molecular entities. It is therefore extremely useful for the purpose of this project. Queries to the database are made by <em>SMILES</em> and a positive match requires at least 70% identity. Extracted information consists of a ChEBI id, compound name and functional terms associated with the compound in the ontology. For the purpose of annotation, the ChEBI ontology tree was flattened and the compound was annotated with all parent terms.</p>
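<p>Flattening the ontology amounts to collecting every ancestor of the terms directly attached to a compound. A minimal sketch, assuming the ontology is available as a mapping from each term to its direct parents (the function name and the toy hierarchy below are made up for illustration and are not actual ChEBI relations):</p>

```python
def all_parent_terms(term, parents):
    """Return every ancestor of `term`, given a dict mapping each term
    to a list of its direct parent terms."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:  # also guards against cycles
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


# Toy hierarchy for illustration only
toy_parents = {
    "21-hydroxy steroid": ["hydroxy steroid"],
    "hydroxy steroid": ["steroid"],
    "fluorinated steroid": ["steroid"],
    "steroid": ["lipid"],
}
direct_terms = ["21-hydroxy steroid", "fluorinated steroid"]

# Annotate the compound with its direct terms plus all of their parents
annotation = set(direct_terms)
for term in direct_terms:
    annotation |= all_parent_terms(term, toy_parents)
```

<p>Annotating with all parents means that a compound tagged with a very specific term also counts towards the broader terms when enrichment is tested.</p>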
<p>The Kyoto Encyclopedia of Genes and Genomes (KEGG) also houses a database focusing on compounds with biological relevance (KEGG DRUG). While the information for approved compounds is comprehensive, including chemical structures, associated targets, metabolizing enzymes and interaction network information, search is only possible through the name of the compound, which is highly error-prone.</p>
<p>Since the ultimate goal of chemical screening is to identify molecules that can be used in human treatments, it is useful to connect each compound to previously executed clinical trials. The <a href="http://clinicaltrials.gov">http://clinicaltrials.gov</a> database is a repository maintained by the U.S. National Institutes of Health containing worldwide clinical trial data. Search is only possible by compound name. Only completed studies employing the compounds were inspected, and the study title, URL and outcome were extracted.</p>
<h4 id="implementation"><a name="implementation">Implementation</a></h4>
<p>Python is the sole language used for the implementation. Two separate scripts were written to keep the compound annotation step (<code class="language-plaintext highlighter-rouge">drugAnnotation.py</code>) independent from the testing for enriched ontology terms in a subset of the compounds (<code class="language-plaintext highlighter-rouge">drugEnrichment.py</code>). This also provides some modularization, allowing these scripts to be used in contexts other than those originally conceived.</p>
<p>The APIs of the databases mentioned in the previous section were accessed using the Python <em>bioservices</em> package (ChEMBL, ChEBI and KEGG), while the <em>chemspipy</em> package was used for ChemSpider. Since <code class="language-plaintext highlighter-rouge">clinicaltrials.gov</code> does not offer an API, queries were done by requesting results in XML. The Python <em>urllib2</em> and <em>BeautifulSoup</em> packages were used to manage connections and extract content.</p>
<p>By default, database searches by name are not performed, but the user can supply a flag which overrides the default behaviour and allows annotation of compounds from databases such as KEGG and <code class="language-plaintext highlighter-rouge">clinicaltrials.gov</code>.</p>
<p>The <em>Numpy</em> and <em>Pandas</em> packages were used throughout to provide data frames in Python and to handle numeric data. This approach was very convenient because the input data is in a spreadsheet format. Output data was kept in a comma-separated value (CSV) format, which is easily loaded into most GUI spreadsheet editors.</p>
<p>The script testing over-representation requires the ontology annotation of the whole universe of compounds assayed (the output from <code class="language-plaintext highlighter-rouge">drugAnnotation.py</code>) and a list of compounds of interest which is a subset of the universe. The two are merged based on <em>SMILES</em>, and Fisher’s exact test (implemented in the <em>Scipy</em> package) is used to test each ontology term in the list of compounds of interest for over- or under-representation. The full list of tested terms, along with the number of compounds in each term, the odds ratio and the p-value, is output to give the user an overview of the whole data and control over the significance threshold.</p>
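<p>To make the test concrete, the sketch below builds the 2x2 contingency table for each term and computes a two-sided Fisher’s exact test. It is an illustration under assumptions, not the code in <code class="language-plaintext highlighter-rouge">drugEnrichment.py</code>: the function names and the input layout (a dict from <em>SMILES</em> to a set of terms) are hypothetical, and the test is written with the standard library only as a stand-in for <code class="language-plaintext highlighter-rouge">scipy.stats.fisher_exact</code>.</p>

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one
    p_value = sum(p for p in map(prob, range(lowest, highest + 1))
                  if p <= p_obs * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


def term_enrichment(universe, interest):
    """universe: dict SMILES -> set of terms; interest: set of SMILES
    that is a subset of the universe. Returns one row per term:
    (term, number of compounds with the term, odds ratio, p-value)."""
    results = []
    for term in sorted(set().union(*universe.values())):
        with_term = {s for s, terms in universe.items() if term in terms}
        a = len(with_term & interest)           # of interest, with term
        b = len(interest) - a                   # of interest, without term
        c = len(with_term) - a                  # rest, with term
        d = len(universe) - len(interest) - c   # rest, without term
        odds_ratio, p_value = fisher_exact_two_sided(a, b, c, d)
        results.append((term, len(with_term), odds_ratio, p_value))
    return results


# Toy data: placeholder identifiers stand in for SMILES strings
universe = {
    "C1": {"steroid"}, "C2": {"steroid"}, "C3": {"steroid"}, "C4": {"other"},
    "C5": {"steroid"}, "C6": {"other"}, "C7": {"other"}, "C8": {"other"},
}
interest = {"C1", "C2", "C3", "C4"}
table = term_enrichment(universe, interest)
```

<p>An odds ratio above 1 points to over-representation of the term among the compounds of interest and one below 1 to under-representation. Reporting every term leaves the choice of significance threshold, and of any multiple-testing correction, to the user.</p>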
<h1 id="code"><a href="/2014/11/18/mining_drug_functions#code" name="code">Code</a></h1>
<p>Code is licensed under the GNU General Public Licence (Version 3) and freely available on GitHub: <a href="https://github.com/afrendeiro/drugAnnotation">https://github.com/afrendeiro/drugAnnotation</a>.</p>
An open science notebook
2014-10-29T00:00:00+00:00
https://andre-rendeiro.com/2014/10/29/notebook
<p>This notebook is primarily a tool for me to do science and (why not?) to communicate it. Nevertheless, it is not written for a general audience, but for myself and perhaps any collaborators. I write in the hope that my entries will be intelligible to my future self.</p>
<p>Depending on the evolution of my usage of this notebook, I will also post ideas, literature references, code and graphs pertaining to the research projects I am working on.</p>
<h2 id="why-an-electronic-notebook">Why an electronic notebook?</h2>
<p>Electronic lab notebooks have been around for a while in many different forms. They provide numerous advantages over conventional ones, such as easier and faster search, copying and long-lasting backup, and direct incorporation of electronic data. Most importantly, they support collaborative work and allow sharing of data with a much wider audience, to the immediate benefit of collaborators.</p>
<h2 id="why-an-open-notebook">Why an open notebook?</h2>
<h3 id="open-science">Open science</h3>
<p><a href="http://en.wikipedia.org/wiki/Open_science">“Science is broadly understood as collecting, analysing, publishing, reanalyzing, critiquing, and reusing data”</a>. The Open Science movement proposes that barriers hindering the broad dissemination of scientific data should be abolished in order to improve the reproducibility of scientific results and the use of information by anyone who wishes to use it. These barriers include, for example, paywalls and restrictions on use imposed by research publishers, or proprietary closed-source software.</p>
<h3 id="open-notebook">Open notebook</h3>
<p>An open lab notebook allows raw and processed data to be available as it is produced, allowing transparency and access to failed experiments and negative or less significant results that would be otherwise unpublished.</p>
<p>In fact, this notebook is open in two senses:</p>
<ul>
<li>The content is accessible and can be remixed and shared by anyone (upon attribution);</li>
<li>Its code is open source and is available to anyone, ready to use or to continue developing it as their own.</li>
</ul>
<p>Public exposure can nonetheless be a hurdle, but I hope that with time, generalization of the concept will start removing the barriers.</p>