<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Esmaeil Alizadeh</title>
<link>https://ealizadeh.com/blog.html</link>
<atom:link href="https://ealizadeh.com/blog.xml" rel="self" type="application/rss+xml"/>
<description>Personal website</description>
<generator>quarto-1.9.16</generator>
<lastBuildDate>Sun, 30 Jul 2023 00:00:00 GMT</lastBuildDate>
<item>
  <title>The 2 Metrics That Reveal True Data Dispersion Beyond Standard Deviation</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/dispersion-cv-qcd/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/dispersion-cv-qcd/img/_featured_image.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="An illustration of a person looking at graphs"></p>
</figure>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<strong><a href="https://towardsdatascience.com/dispersion-cv-qcd-32849f828434">Towards Data Science blog</a></strong>.</p>
</div>
</div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>We’ve all heard the saying, “variety is the spice of life,” and in data, that variety or diversity often takes the form of dispersion.</p>
<p>Data dispersion makes data fascinating by highlighting patterns and insights we wouldn’t have found otherwise. Typically, we use the following measures of dispersion: variance, standard deviation, range, and interquartile range (IQR). However, in some cases we need to examine the dispersion of a dataset beyond these typical measures.</p>
<p>This is where the Coefficient of Variation (CV) and Quartile Coefficient of Dispersion (QCD) provide insights when comparing datasets.</p>
<p>In this tutorial, we will explore the two concepts of CV and QCD and answer the following questions for each of them:</p>
<ul>
<li>What are they and how are they defined?</li>
<li>How can they be computed?</li>
<li>How should the results be interpreted?</li>
</ul>
<p>All of the above questions will be answered through two examples.</p>
</section>
<section id="understanding-variability-and-dispersion" class="level2">
<h2 class="anchored" data-anchor-id="understanding-variability-and-dispersion">Understanding Variability and Dispersion</h2>
<p>Whether we’re measuring people’s heights or housing prices, we seldom find all data points to be the same. We won’t expect everyone to be the same. Some people are tall, average, or short. Data generally varies. In order to study this data variability or dispersion, we usually quantify that using measures like range, variance, standard deviation, etc. The measures of dispersion quantify how spread out our data points are.</p>
<p>However, what if we wish to compare variability across datasets? For example, what if we want to compare the sales prices of a jewelry shop and a bookstore? Standard deviation alone won’t work here, because it is expressed in the units of the data and the scales of the two datasets are likely very different.</p>
<p>Coefficient of Variation (CV) and Quartile Coefficient of Dispersion (QCD) are useful indicators of dispersion in this context.</p>
<section id="deep-dive-coefficient-of-variation" class="level3">
<h3 class="anchored" data-anchor-id="deep-dive-coefficient-of-variation">Deep Dive: Coefficient of Variation</h3>
<p>The <a href="https://en.wikipedia.org/wiki/Coefficient_of_variation">Coefficient of Variation (CV)</a>, also known as <em>relative standard deviation</em>, is a standardized measure of dispersion. It’s often expressed as a percentage and has no units. As a result, CV is an excellent measure of variability for comparing data on different scales.</p>
<p>Mathematically, CV is computed as the ratio of the standard deviation to the mean, often multiplied by 100 to get a percentage. The formula is as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%20%20%5Ctext%7BCoefficient%20of%20Variation%20(CV)%7D%20=%20%5Cfrac%7B%5Ctext%7BStandard%20Deviation%7D%7D%7B%5Ctext%7BMean%7D%7D%0A"></p>
<p>Let’s use NumPy’s <code>mean</code> and <code>std</code> functions (with NumPy imported as <code>np</code>) to compute CV in Python.</p>
<div id="67f410b7" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> calc_cv(data_array) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>:</span>
<span id="cb1-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Calculate coefficient of variation."""</span></span>
<span id="cb1-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> np.std(data_array) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> np.mean(data_array)</span></code></pre></div></div>
</div>
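<p>As a rough sketch (not part of the original notebook), the percentage form mentioned above can be written as a small helper. The name <code>cv_percent</code> and its <code>ddof</code> parameter are assumptions made for illustration; <code>ddof=0</code> matches the default of <code>np.std</code> (population standard deviation), while <code>ddof=1</code> gives the sample standard deviation. The made-up prices below also show that changing units (dollars to cents) leaves the CV unchanged.</p>

```python
import numpy as np


def cv_percent(data_array, ddof: int = 0) -> float:
    """Return the coefficient of variation as a percentage (hypothetical helper)."""
    data = np.asarray(data_array, dtype=float)
    mean = np.mean(data)
    if mean == 0:
        # The ratio is undefined when the mean is zero.
        raise ValueError("CV is undefined for data with a zero mean")
    return float(100 * np.std(data, ddof=ddof) / mean)


prices_usd = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # mean = 5, population std = 2
prices_cents = prices_usd * 100  # the same prices expressed in cents

cv_usd = cv_percent(prices_usd)
cv_cents = cv_percent(prices_cents)
print(f"CV in dollars: {cv_usd:.1f}%")
print(f"CV in cents:   {cv_cents:.1f}%")
```

<p>The standard deviation itself grows by a factor of 100 when the prices are converted to cents, but so does the mean, so their ratio stays the same.</p>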
<p>Next, let’s consider another dimensionless measure of dispersion: the QCD.</p>
</section>
</section>
<section id="deep-dive-quartile-coefficient-of-dispersion" class="level2">
<h2 class="anchored" data-anchor-id="deep-dive-quartile-coefficient-of-dispersion">Deep Dive: Quartile Coefficient of Dispersion</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Quartile_coefficient_of_dispersion">Quartile Coefficient of Dispersion (QCD)</a> is another measure of <em>relative</em> dispersion, especially useful when dealing with skewed data or data with outliers. The QCD is based on the interquartile range (IQR), i.e., the spread of the middle 50% of a dataset, so extreme values have little influence on it. That’s why the QCD is a robust measure.</p>
<p>The QCD is calculated as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BQCD%7D%20=%20%5Cfrac%7BQ3%20-%20Q1%7D%7BQ3%20+%20Q1%7D%0A"></p>
<p>Where <img src="https://latex.codecogs.com/png.latex?Q1"> is the first quartile (the 25th percentile), and <img src="https://latex.codecogs.com/png.latex?Q3"> is the third quartile (the 75th percentile).</p>
<div id="b31447e0" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> calc_qcd(data_array) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>: </span>
<span id="cb2-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Calculate the quartile coefficient of dispersion (QCD)."""</span></span>
<span id="cb2-3">  q1, q3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.percentile(data_array, [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">75</span>])</span>
<span id="cb2-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> (q3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> q1) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (q3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> q1)  </span></code></pre></div></div>
</div>
<p>Similarly to the CV, the QCD is a unitless metric that may be very helpful for comparing the dispersion of skewed datasets.</p>
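<p>To illustrate the robustness claim, here is a minimal sketch on made-up numbers (the array <code>base</code> and the outlier value 200 are hypothetical, and the two helpers repeat the definitions above so that the snippet runs on its own). Adding one extreme value changes the CV a lot, while the QCD barely moves because the quartiles shift only slightly.</p>

```python
import numpy as np


def calc_cv(data_array) -> float:
    """Calculate coefficient of variation (population standard deviation / mean)."""
    return float(np.std(data_array) / np.mean(data_array))


def calc_qcd(data_array) -> float:
    """Calculate the quartile coefficient of dispersion."""
    q1, q3 = np.percentile(data_array, [25, 75])
    return float((q3 - q1) / (q3 + q1))


base = np.arange(20, 31)  # hypothetical values 20, 21, ..., 30
with_outlier = np.append(base, 200)  # the same values plus one extreme point

for name, data in [("base", base), ("with outlier", with_outlier)]:
    print(f"{name:>13}: CV = {calc_cv(data):.3f}, QCD = {calc_qcd(data):.3f}")
```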
<p>The following examples will better demonstrate the idea behind CV and QCD.</p>
</section>
<section id="examples" class="level2">
<h2 class="anchored" data-anchor-id="examples">Examples</h2>
<section id="scenario-1" class="level3">
<h3 class="anchored" data-anchor-id="scenario-1">Scenario 1</h3>
<p>Consider the following two datasets showing the monthly sales of a jewelry shop and a bookstore.</p>
<ul>
<li>Jewelry shop: The average monthly sales are $10,000 with a standard deviation of $2,000.</li>
<li>Bookstore: The average monthly sales are $1,000 with a standard deviation of $200.</li>
</ul>
<p>Let’s generate sample data for both stores using NumPy.</p>
<div id="data-ex1-cv" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb3-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> seaborn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sns</span>
<span id="cb3-4"></span>
<span id="cb3-5">sns.set_theme(context<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"notebook"</span>, style<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"whitegrid"</span>, palette<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"deep"</span>)</span>
<span id="cb3-6"></span>
<span id="cb3-7">np.random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Setting a seed for reproducibility</span></span>
<span id="cb3-8"></span>
<span id="cb3-9">jewelry_sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.normal(loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>, scale<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span>, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># mean=10000  std=2000</span></span>
<span id="cb3-10">bookstore_sales <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.normal(loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, scale<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># mean=1000  std=200</span></span>
<span id="cb3-11"></span>
<span id="cb3-12">mean_jewelry, std_jewelry <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.mean(jewelry_sales), np.std(jewelry_sales)</span>
<span id="cb3-13">mean_bookstore, std_bookstore <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.mean(bookstore_sales), np.std(bookstore_sales)</span>
<span id="cb3-14"></span>
<span id="cb3-15">cv_jewelry, cv_bookstore <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> calc_cv(jewelry_sales), calc_cv(bookstore_sales)</span>
<span id="cb3-16"></span>
<span id="cb3-17"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(</span>
<span id="cb3-18">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Jewelry Shop: </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- Mean = $</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mean_jewelry<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb3-19">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- Standard Deviation = $</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>std_jewelry<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb3-20">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- CV = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>cv_jewelry<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (dimensionless)"</span></span>
<span id="cb3-21">)</span>
<span id="cb3-22"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(</span>
<span id="cb3-23">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Bookstore: </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- Mean = $</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mean_bookstore<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb3-24">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- Standard Deviation = $</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>std_bookstore<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb3-25">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- CV = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>cv_bookstore<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (dimensionless)"</span></span>
<span id="cb3-26">)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Jewelry Shop: 
    - Mean = $10119.616
    - Standard Deviation = $2015.764
    - CV = 0.199 (dimensionless)
Bookstore: 
    - Mean = $1016.403
    - Standard Deviation = $206.933
    - CV = 0.204 (dimensionless)</code></pre>
</div>
</div>
<p>Let’s see the distribution of both datasets and compare their CVs.</p>
<div class="cell" data-layout-nrow="2" data-layout="[[60,-10,30],[20,60,20]]" data-execution_count="4">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots()</span>
<span id="cb5-2">sns.histplot(jewelry_sales, kde<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>)</span>
<span id="cb5-3">ax.axvline(mean_jewelry, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>, linestyle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--"</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Mean"</span>)</span>
<span id="cb5-4">ax.axvline(mean_jewelry <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> std_jewelry, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"green"</span>, linestyle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--"</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1 StdDev"</span>)</span>
<span id="cb5-5">ax.axvline(mean_jewelry <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> std_jewelry, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"green"</span>, linestyle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--"</span>)</span>
<span id="cb5-6">ax.set_title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Histogram of Jewelry Sales </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">(mean=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mean_jewelry<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, std=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>std_jewelry<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span>
<span id="cb5-7">ax.legend()</span>
<span id="cb5-8"></span>
<span id="cb5-9">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots()</span>
<span id="cb5-10">sns.histplot(bookstore_sales, kde<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>)</span>
<span id="cb5-11">ax.axvline(mean_bookstore, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>, linestyle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--"</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Mean"</span>)</span>
<span id="cb5-12">ax.axvline(mean_bookstore <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> std_bookstore, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"green"</span>, linestyle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--"</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"1 StdDev"</span>)</span>
<span id="cb5-13">ax.axvline(mean_bookstore <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> std_bookstore, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"green"</span>, linestyle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--"</span>)</span>
<span id="cb5-14">ax.set_title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Histogram of Bookstore Sales </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">(mean=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>mean_bookstore<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, std=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>std_bookstore<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span>
<span id="cb5-15">ax.legend()</span>
<span id="cb5-16"></span>
<span id="cb5-17">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots()</span>
<span id="cb5-18">sns.barplot(x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Jewelry'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Bookstore'</span>], y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[cv_jewelry, cv_bookstore], ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax)</span>
<span id="cb5-19">ax.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Coefficient of Variation between Jewelry and Bookstore Monthly Sales"</span>)</span>
<span id="cb5-20">ax.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV"</span>)</span>
<span id="cb5-21"></span>
<span id="cb5-22">plt.show()</span></code></pre></div></div>
</details>
<div id="fig-ex1" class="quarto-layout-panel">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-ex1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-layout-row">
<div class="cell-output cell-output-display quarto-layout-cell-subref quarto-layout-cell" data-ref-parent="fig-ex1" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-ex1-1" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-subfloat-fig figure">
<div aria-describedby="fig-ex1-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ealizadeh.com/blog/dispersion-cv-qcd/index_files/figure-html/fig-ex1-output-1.png" data-ref-parent="fig-ex1" width="589" height="452" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-subfloat-caption quarto-subfloat-fig" id="fig-ex1-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
(a) Distribution of average monthly sales of the jewelry shop
</figcaption>
</figure>
</div>
</div>
<div class="cell-output cell-output-display quarto-layout-cell-subref quarto-layout-cell" data-ref-parent="fig-ex1" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-ex1-2" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-subfloat-fig figure">
<div aria-describedby="fig-ex1-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ealizadeh.com/blog/dispersion-cv-qcd/index_files/figure-html/fig-ex1-output-2.png" data-ref-parent="fig-ex1" width="601" height="452" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-subfloat-caption quarto-subfloat-fig" id="fig-ex1-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
(b) Distribution of average monthly sales of the bookstore
</figcaption>
</figure>
</div>
</div>
</div>
<div class="quarto-layout-row">
<div class="cell-output cell-output-display quarto-layout-cell-subref quarto-layout-cell" data-ref-parent="fig-ex1" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-ex1-3" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-subfloat-fig figure">
<div aria-describedby="fig-ex1-3-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ealizadeh.com/blog/dispersion-cv-qcd/index_files/figure-html/fig-ex1-output-3.png" data-ref-parent="fig-ex1" width="609" height="435" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-subfloat-caption quarto-subfloat-fig" id="fig-ex1-3-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
(c) Comparing Coefficient of Variation between two stores
</figcaption>
</figure>
</div>
</div>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ex1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Studying the dispersion between average monthly sales of jewelry shop and bookstore based on standard deviation.
</figcaption>
</figure>
</div>
</div>
<p>The jewelry shop’s average sales and standard deviation are substantially larger than the bookstore’s (a mean of about $10,120 and a standard deviation of about $2,016, compared to a mean of about $1,016 and a standard deviation of about $207), yet their CVs are nearly the same (about 20%).</p>
<p>This means that, relative to their respective average sales, the jewelry shop and the bookstore have nearly the same variability despite the huge differences in their sales volumes (and standard deviations).</p>
<p>This exemplifies the idea of CV as a relative measure of variability and shows how it can be applied to make comparisons between datasets of different scales.</p>
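<p>The small gap between the two sample CVs (0.199 and 0.204) comes from sampling noise: both datasets are drawn from distributions whose CV is exactly 0.2. The sketch below uses <code>np.random.default_rng</code> rather than the <code>np.random.seed</code> call above, together with an arbitrary set of seeds (both are choices made here for illustration), to show how the sample CV of 100 draws fluctuates around that value.</p>

```python
import numpy as np

# Parameters stated in Scenario 1: (mean, standard deviation) of monthly sales.
stores = {"jewelry": (10_000, 2_000), "bookstore": (1_000, 200)}

# CV implied by the parameters themselves, before any sampling.
theoretical_cv = {name: std / mean for name, (mean, std) in stores.items()}
print(theoretical_cv)

sample_cvs = []
for seed in range(5):  # arbitrary seeds, chosen only for illustration
    rng = np.random.default_rng(seed)
    for name, (mean, std) in stores.items():
        sales = rng.normal(loc=mean, scale=std, size=100)
        cv = float(np.std(sales) / np.mean(sales))
        sample_cvs.append(cv)
        print(f"seed={seed} {name:>9}: CV = {cv:.3f}")
```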
</section>
<section id="scenario-2" class="level3">
<h3 class="anchored" data-anchor-id="scenario-2">Scenario 2</h3>
<p>Consider two datasets of employee ages from two firms.</p>
<p>Let’s say:</p>
<ul>
<li>Company A (a startup): Younger workers, some elderly.</li>
<li>Company B (a well-established firm): Older workers, some younger.</li>
</ul>
<p>Let’s generate sample data for both companies using NumPy.</p>
<div id="data-ex2-qcd" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">np.random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Setting the seed for reproducibility</span></span>
<span id="cb6-2"></span>
<span id="cb6-3">ages_company_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.normal(loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>, scale<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># mean = 25 yrs, std = 3 yrs</span></span>
<span id="cb6-4">ages_company_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.normal(loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">45</span>, scale<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># mean = 45 yrs, std = 5 yrs</span></span>
<span id="cb6-5"></span>
<span id="cb6-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add a few outliers</span></span>
<span id="cb6-7">ages_company_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.append(ages_company_A, [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">62</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span>])</span>
<span id="cb6-8">ages_company_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.append(ages_company_B, [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span>])</span>
<span id="cb6-9"></span>
<span id="cb6-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute Q1, Q3, and IQR, and QCD for both datasets</span></span>
<span id="cb6-11">ages_company_A_q1, ages_company_A_q3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.percentile(ages_company_A, [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">75</span>])</span>
<span id="cb6-12">ages_company_A_iqr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ages_company_A_q3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> ages_company_A_q1  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># IQR = Q3 - Q1</span></span>
<span id="cb6-13">ages_company_B_q1, ages_company_B_q3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.percentile(ages_company_B, [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">75</span>])</span>
<span id="cb6-14">ages_company_B_iqr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ages_company_B_q3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> ages_company_B_q1</span>
<span id="cb6-15"></span>
<span id="cb6-16">ages_company_A_qcd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> calc_qcd(ages_company_A)</span>
<span id="cb6-17">ages_company_B_qcd <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> calc_qcd(ages_company_B)</span>
<span id="cb6-18"></span>
<span id="cb6-19"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(</span>
<span id="cb6-20">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Company A: </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- Q1 = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ages_company_A_q1<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> years"</span></span>
<span id="cb6-21">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- Q3 = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ages_company_A_q3<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> years"</span></span>
<span id="cb6-22">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- IQR = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ages_company_A_iqr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> years"</span></span>
<span id="cb6-23">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- QCD = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ages_company_A_qcd<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (dimensionless)"</span></span>
<span id="cb6-24">)</span>
<span id="cb6-25"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(</span>
<span id="cb6-26">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Company B: </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- Q1 = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ages_company_B_q1<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> years"</span></span>
<span id="cb6-27">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- Q3 = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ages_company_B_q3<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> years"</span></span>
<span id="cb6-28">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- IQR = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ages_company_B_iqr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> years"</span></span>
<span id="cb6-29">  <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\t</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">- QCD = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ages_company_B_qcd<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (dimensionless)"</span></span>
<span id="cb6-30">)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Company A: 
    - Q1 = 22.840 years
    - Q3 = 26.490 years
    - IQR = 3.650 years
    - QCD = 0.074 (dimensionless)
Company B: 
    - Q1 = 42.351 years
    - Q3 = 47.566 years
    - IQR = 5.215 years
    - QCD = 0.058 (dimensionless)</code></pre>
</div>
</div>
<p>Now, let’s plot the distribution of the data along with the boxplot and QCD to visualise the information above.</p>
<div class="cell" data-layout-nrow="2" data-layout="[[45,-10,45],[45,-10,45]]" data-execution_count="6">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots()</span>
<span id="cb8-2">sns.histplot(ages_company_A, ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>)</span>
<span id="cb8-3">ax.set_title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Histogram of the Age of Company A Employees </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">(mean=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean(ages_company_A)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, std=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>std(ages_company_A)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span>
<span id="cb8-4"></span>
<span id="cb8-5">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots()</span>
<span id="cb8-6">sns.histplot(ages_company_B, ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>)</span>
<span id="cb8-7">ax.set_title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Histogram of the Age of Company B Employees </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">(mean=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean(ages_company_B)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, std=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>std(ages_company_B)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span>)</span>
<span id="cb8-8"></span>
<span id="cb8-9"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb8-10">df1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(ages_company_A, columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age"</span>])</span>
<span id="cb8-11">df1[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"company"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Company A"</span></span>
<span id="cb8-12">df2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(ages_company_B, columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age"</span>])</span>
<span id="cb8-13">df2[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"company"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Company B"</span></span>
<span id="cb8-14"></span>
<span id="cb8-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create box plots</span></span>
<span id="cb8-16">plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb8-17">sns.boxplot(x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"company"</span>, y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age"</span>, data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>pd.concat([df1, df2]))</span>
<span id="cb8-18">plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Boxplots of Employee Ages for Company A and Company B"</span>)</span>
<span id="cb8-19"></span>
<span id="cb8-20">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots()</span>
<span id="cb8-21">sns.barplot(x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Company A"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Company B"</span>], y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[ages_company_A_qcd, ages_company_B_qcd], ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax)</span>
<span id="cb8-22">ax.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Quartile Coefficient of Dispersion of Employee Ages in Companies A and B"</span>)</span>
<span id="cb8-23">ax.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"QCD"</span>)</span>
<span id="cb8-24"></span>
<span id="cb8-25">plt.show()</span></code></pre></div></div>
</details>
<div id="fig-ex2" class="quarto-layout-panel">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-ex2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-layout-row">
<div class="cell-output cell-output-display quarto-layout-cell-subref quarto-layout-cell" data-ref-parent="fig-ex2" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-ex2-1" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-subfloat-fig figure">
<div aria-describedby="fig-ex2-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ealizadeh.com/blog/dispersion-cv-qcd/index_files/figure-html/fig-ex2-output-1.png" data-ref-parent="fig-ex2" width="589" height="452" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-subfloat-caption quarto-subfloat-fig" id="fig-ex2-1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
(a) Distribution of age of employees in Company A
</figcaption>
</figure>
</div>
</div>
<div class="cell-output cell-output-display quarto-layout-cell-subref quarto-layout-cell" data-ref-parent="fig-ex2" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-ex2-2" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-subfloat-fig figure">
<div aria-describedby="fig-ex2-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ealizadeh.com/blog/dispersion-cv-qcd/index_files/figure-html/fig-ex2-output-2.png" data-ref-parent="fig-ex2" width="589" height="452" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-subfloat-caption quarto-subfloat-fig" id="fig-ex2-2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
(b) Distribution of age of employees in Company B
</figcaption>
</figure>
</div>
</div>
</div>
<div class="quarto-layout-row">
<div class="cell-output cell-output-display quarto-layout-cell-subref quarto-layout-cell" data-ref-parent="fig-ex2" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-ex2-3" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-subfloat-fig figure">
<div aria-describedby="fig-ex2-3-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ealizadeh.com/blog/dispersion-cv-qcd/index_files/figure-html/fig-ex2-output-3.png" data-ref-parent="fig-ex2" width="812" height="529" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-subfloat-caption quarto-subfloat-fig" id="fig-ex2-3-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
(c) Boxplots of age of employees in both companies
</figcaption>
</figure>
</div>
</div>
<div class="cell-output cell-output-display quarto-layout-cell-subref quarto-layout-cell" data-ref-parent="fig-ex2" style="flex-basis: 50.0%;justify-content: flex-start;">
<div id="fig-ex2-4" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-subfloat-fig figure">
<div aria-describedby="fig-ex2-4-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ealizadeh.com/blog/dispersion-cv-qcd/index_files/figure-html/fig-ex2-output-4.png" data-ref-parent="fig-ex2" width="601" height="435" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-subfloat-caption quarto-subfloat-fig" id="fig-ex2-4-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
(d) Comparing the Quartile Coefficient of Dispersion between the two companies
</figcaption>
</figure>
</div>
</div>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ex2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Studying the dispersion of employee ages in Companies A and B based on robust measures (IQR and QCD).
</figcaption>
</figure>
</div>
</div>
<p>Company B’s IQR is larger (5.215 years vs.&nbsp;3.65 years), which suggests a wider age dispersion. However, this is partly a matter of scale: Company B’s employees are older overall, so its quartiles, and the gap between them, are larger in absolute terms.</p>
<p>On the other hand, Company A has a larger QCD (0.074 vs.&nbsp;0.058) than Company B, which indicates greater age variation relative to its typical age. The IQR alone doesn’t reveal this.</p>
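<p>The QCD values can be reproduced by hand from the printed quartiles. The sketch below assumes the common definition QCD = (Q3 - Q1) / (Q3 + Q1); the helper name <code>qcd_from_quartiles</code> is hypothetical and is only used here to check the arithmetic.</p>

```python
def qcd_from_quartiles(q1, q3):
    """QCD under the assumed definition (Q3 - Q1) / (Q3 + Q1)."""
    return (q3 - q1) / (q3 + q1)


# Quartiles copied from the printed output above
qcd_a = qcd_from_quartiles(q1=22.840, q3=26.490)
qcd_b = qcd_from_quartiles(q1=42.351, q3=47.566)

print(f"Company A: QCD = {qcd_a:.3f}")
print(f"Company B: QCD = {qcd_b:.3f}")
```

<p>Company B has the larger numerator (its IQR), but its denominator is larger still, which is why its QCD ends up smaller.</p>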
<p>In the next section, we’ll discuss when to rely on the Coefficient of Variation and when to rely on the Quartile Coefficient of Dispersion.</p>
</section>
</section>
<section id="discussion" class="level2">
<h2 class="anchored" data-anchor-id="discussion">Discussion</h2>
<p>Let’s answer a few questions that you may have.</p>
<section id="why-not-focus-on-measures-like-standard-deviation-or-iqr" class="level3">
<h3 class="anchored" data-anchor-id="why-not-focus-on-measures-like-standard-deviation-or-iqr">Why not focus on measures like standard deviation or IQR?</h3>
<p>We use standard deviation and IQR to quantify dispersion in datasets. The standard deviation shows how far, on average, data points lie from the mean. The IQR shows the spread of the middle 50% of our data.</p>
<p>However, these measures may be deceptive when comparing the dispersion of two or more datasets with different units or scales, skewed distributions, or in the presence of outliers.</p>
<p>While standard deviation and IQR are useful statistical tools, we occasionally require CV and QCD to conduct fair comparisons.</p>
<p>The CV and QCD both measure and compare variability, although they do it in somewhat different ways. Your data and the kind of variability you want to capture determine which one to use.</p>
</section>
<section id="when-to-use-cv" class="level3">
<h3 class="anchored" data-anchor-id="when-to-use-cv">When to use CV?</h3>
<p>CV is a good way to compare the amount of variation in datasets that have different scales, units, or average values. Because the CV is a relative measure of spread, it shows how large the variation is relative to the mean.</p>
<p>The CV is built from the mean and the standard deviation, two measures that are greatly affected by outliers. So, the CV can give a distorted view of spread in datasets that are heavily skewed or contain outliers. Thus, CV works best with roughly symmetric data that doesn’t have any extreme values.</p>
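<p>To make the outlier sensitivity concrete, here is a minimal sketch on made-up numbers. The helpers <code>cv</code> and <code>qcd</code> are illustrative stand-ins that assume the definitions std / mean and (Q3 - Q1) / (Q3 + Q1); they are not necessarily identical to the functions used elsewhere in this post.</p>

```python
import numpy as np


def cv(x):
    """Coefficient of variation: std / mean (population std)."""
    return np.std(x) / np.mean(x)


def qcd(x):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, q3 = np.percentile(x, [25, 75])
    return (q3 - q1) / (q3 + q1)


clean = np.arange(10, 20)             # 10, 11, ..., 19
with_outlier = np.append(clean, 100)  # same data plus one extreme value

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name:>12}: CV = {cv(data):.3f}, QCD = {qcd(data):.3f}")
```

<p>A single extreme value multiplies the CV several times over, while the QCD barely moves.</p>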
<p>In the sales case, the price ranges for these two groups are very different, so the scales used to measure their sales are also very different. The jewelry store is likely to have much higher average sales and much more variation. If we used the standard deviation to measure how variable these two groups are, we might come to the wrong conclusion that the jewelry shop’s sales are more variable.</p>
<p>The CV allowed us to compare the variability of sales between the two datasets, regardless of their different scales. If the CV is higher for one category, it means that the sales are more variable relative to the average sales for that category.</p>
</section>
<section id="when-to-use-qcd" class="level3">
<h3 class="anchored" data-anchor-id="when-to-use-qcd">When to use QCD?</h3>
<p>The QCD uses dataset quartiles, which are less outlier-sensitive. QCD is a robust dispersion measure for skewed distributions or datasets containing outliers. The QCD concentrates on the center 50% of the data, which may better capture dispersion in such datasets.</p>
<p>In our example, we examined the age differences between two companies: a startup (A) with mostly younger employees, and a well-established company (B) with mostly older employees. Given their distinct age ranges, the median age and variability would be higher for the older company. Using the Interquartile Range (IQR) to compare dispersion might inaccurately suggest higher age variability in the established company, as IQR measures absolute variability and tends to be higher for larger values.</p>
<p>The QCD is more effective as it standardizes variability against the median, enabling us to compare age variability between companies on different scales. A higher QCD indicates greater age variance relative to the median for that company. Therefore, the QCD was chosen for this comparison as it accounts for different scales and potential data skew or outliers.</p>
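<p>The difference between an absolute and a relative measure can be illustrated by expressing the same ages in two units. The sketch below uses synthetic data and hypothetical helpers <code>iqr</code> and <code>qcd</code>, assuming QCD = (Q3 - Q1) / (Q3 + Q1).</p>

```python
import numpy as np


def iqr(x):
    """Interquartile range: Q3 - Q1."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1


def qcd(x):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, q3 = np.percentile(x, [25, 75])
    return (q3 - q1) / (q3 + q1)


rng = np.random.default_rng(0)  # synthetic ages, for illustration only
ages_years = rng.normal(loc=35, scale=5, size=200)
ages_months = ages_years * 12   # same people, different unit

print(f"IQR: {iqr(ages_years):.2f} years vs. {iqr(ages_months):.2f} months")
print(f"QCD: {qcd(ages_years):.4f} vs. {qcd(ages_months):.4f}")
```

<p>The IQR grows by a factor of 12 with the change of unit, whereas the QCD stays the same, which is what makes it suitable for comparing groups measured on different scales.</p>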
</section>
<section id="takeaways" class="level3">
<h3 class="anchored" data-anchor-id="takeaways">Takeaways</h3>
<p>Choosing between CV and QCD depends on the nature of your dataset and analysis goals. Below are key points about both measures:</p>
<ul>
<li><strong>Coefficient of Variation (CV)</strong>
<ul>
<li>CV is calculated as the ratio of the standard deviation to the mean.</li>
<li>CV is dimensionless.</li>
<li>Higher CV indicates greater variability relative to the mean.</li>
<li>CV can give misleading results if the mean is near zero (dividing by a value close to zero!).</li>
</ul></li>
<li><strong>Quartile Coefficient of Dispersion (QCD)</strong>
<ul>
<li>QCD is based on quartiles.</li>
<li>QCD is a robust measure (less sensitive to extreme values).</li>
<li>QCD is dimensionless.</li>
<li>Higher QCD indicates higher variability of values relative to the median.</li>
<li>QCD does not fully capture the spread if the distribution’s tails are important.</li>
</ul></li>
</ul>
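<p>The near-zero-mean caveat for the CV is easy to see with temperatures, where the zero point depends on the unit. The readings below are invented for illustration, and <code>cv</code> is a hypothetical helper assuming CV = std / mean.</p>

```python
import numpy as np


def cv(x):
    """Coefficient of variation: std / mean (population std)."""
    return np.std(x) / np.mean(x)


# Made-up temperature readings around the freezing point
temps_celsius = np.array([-2.0, -1.0, 0.0, 1.0, 2.5])
temps_kelvin = temps_celsius + 273.15  # same readings, shifted zero point

print(f"CV in Celsius: {cv(temps_celsius):.2f}")
print(f"CV in Kelvin:  {cv(temps_kelvin):.4f}")
```

<p>The spread of the readings is identical in both units, yet the CV explodes in Celsius because the mean is close to zero. This is why the CV is usually reserved for ratio-scale data with a meaningful zero.</p>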
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>To sum up, the Coefficient of Variation (CV) and the Quartile Coefficient of Dispersion (QCD) are crucial statistics for examining dispersion in numerical data. CV excels at comparing data on different scales, while QCD helps with skewed datasets or datasets that contain outliers. We looked at two cases (with Python code and analysis) to see how this works in practice. By using them wisely, we may get useful information for making decisions.</p>
<div class="callout callout-style-default callout-tip no-icon callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>AI Sprout Newsletter
</div>
</div>
<div class="callout-body-container callout-body">
<p>I curate a weekly newsletter called 🌱 <a href="https://aisprout.xyz">AI Sprout</a> where I provide hands-on reviews and analysis of the latest AI tools and innovations. Subscribe to explore emerging AI with me!</p>
<p>Let’s connect on <a href="https://www.linkedin.com/in/alizadehesmaeil/">LinkedIn</a> and <a href="https://twitter.com/es_alizadeh">Twitter</a> 🤝</p>
</div>
</div>
<div class="callout callout-style-simple callout-note no-icon callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-3-contents" aria-controls="callout-3" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Update History
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-3" class="callout-3-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 10%">
<col style="width: 70%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Date</th>
<th style="text-align: left;">Sections</th>
<th style="text-align: left;">Changes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">August 21, 2023</td>
<td style="text-align: left;">-</td>
<td style="text-align: left;">Fixed duplicate subplots in Figure&nbsp;2.</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2023,
  author = {Alizadeh, Esmaeil},
  title = {The 2 {Metrics} {That} {Reveal} {True} {Data} {Dispersion}
    {Beyond} {Standard} {Deviation}},
  date = {2023-07-30},
  url = {https://ealizadeh.com/blog/dispersion-cv-qcd/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2023" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“The 2 Metrics That Reveal True Data Dispersion Beyond
Standard Deviation,”</span> Jul. 30, 2023. <a href="https://ealizadeh.com/blog/dispersion-cv-qcd/">https://ealizadeh.com/blog/dispersion-cv-qcd/</a></div>
</div></div></section></div> ]]></description>
  <category>Data Science</category>
  <category>Statistics</category>
  <category>Data Dispersion</category>
  <category>Standard Deviation</category>
  <guid>https://ealizadeh.com/blog/dispersion-cv-qcd/</guid>
  <pubDate>Sun, 30 Jul 2023 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/dispersion-cv-qcd/img/_featured_image.png" medium="image" type="image/png" height="117" width="144"/>
</item>
<item>
  <title>Data Science Accelerated: ChatGPT Code Interpreter as Your AI Assistant</title>
  <link>https://ealizadeh.com/</link>
  <description>Published in Towards AI</description>
  <category>ChatGPT</category>
  <category>Data Science</category>
  <category>Python</category>
  <category>AI</category>
  <guid>https://ealizadeh.com/</guid>
  <pubDate>Wed, 19 Jul 2023 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/assets/DS-accelerated__chatgpt-code-interpreter-as-your-AI-assistant.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Scikit-LLM: Bridging the Gap Between LLM and Scikit-learn</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/tutorial-scikit-llm/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/tutorial-scikit-llm/img/_featured_image.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A magnifier with a list of text analysis topics covered in the post."></p>
</figure>
</div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p><a href="https://github.com/iryna-kondr/scikit-llm">Scikit-LLM</a> is a Python package that integrates large language models (LLMs) like OpenAI’s GPT-3 into the <a href="https://scikit-learn.org/">scikit-learn</a> framework for text analysis tasks.</p>
<p>Because Scikit-LLM follows the scikit-learn estimator conventions, you’ll feel right at home if you’re already familiar with scikit-learn. The library offers a range of features, of which we will cover the following <span class="citation" data-cites="online:python_skllm_github">[1]</span>:</p>
<ul>
<li>Zero-shot text classification</li>
<li>Multi-label zero-shot text classification</li>
<li>Text vectorization</li>
<li>Text translation</li>
<li>Text summarization</li>
</ul>
<section id="installation" class="level3">
<h3 class="anchored" data-anchor-id="installation">Installation</h3>
<p>You can install the library via pip:</p>
<pre><code>pip install scikit-llm</code></pre>
</section>
<section id="configuration" class="level3">
<h3 class="anchored" data-anchor-id="configuration">Configuration</h3>
<p>Before you start using Scikit-LLM, you need to provide your OpenAI API key. This <a href="https://www.howtogeek.com/885918/how-to-get-an-openai-api-key/">post</a> explains how to set one up.</p>
<div id="b74e2e8a" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> skllm.config <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SKLLMConfig</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> dotenv <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> load_dotenv</span>
<span id="cb2-5">load_dotenv()</span>
<span id="cb2-6"></span>
<span id="cb2-7">OPENAI_SECRET_KEY <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"OPENAI_SECRET_KEY"</span>]</span>
<span id="cb2-8">OPENAI_ORG_ID <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> os.environ[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"OPENAI_ORG_ID"</span>]</span>
<span id="cb2-9"></span>
<span id="cb2-10">SKLLMConfig.set_openai_key(OPENAI_SECRET_KEY)</span>
<span id="cb2-11">SKLLMConfig.set_openai_org(OPENAI_ORG_ID)</span></code></pre></div></div>
</div>
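<p>Indexing <code>os.environ</code> directly raises a bare <code>KeyError</code> when a variable is missing. The sketch below is one way to fail with a clearer message; <code>require_env</code> is a hypothetical helper written for this post, not part of Scikit-LLM or python-dotenv.</p>

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of Scikit-LLM.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value
```

<p>With this helper, the key lookup above could be written as <code>require_env("OPENAI_SECRET_KEY")</code>, assuming <code>load_dotenv()</code> has already been called.</p>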
<div class="callout callout-style-default callout-caution callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Caution
</div>
</div>
<div class="callout-body-container callout-body">
<p>Please note that Scikit-LLM provides a convenient interface to OpenAI’s GPT-3 models. Using these models is not free and requires an API key. Although each API call is relatively cheap, costs can add up depending on the volume of your data and the frequency of calls, so plan and monitor your usage carefully. Always review OpenAI’s <a href="https://openai.com/pricing/">pricing details</a> and terms of use before getting started with Scikit-LLM.</p>
<p>To give you a rough idea, I ran this notebook at least five times to make this tutorial, and the total cost was US $0.02. I have to say, I thought this would be higher!</p>
</div>
</div>
</section>
</section>
<section id="zero-shot-text-classification" class="level2">
<h2 class="anchored" data-anchor-id="zero-shot-text-classification">Zero-Shot Text Classification</h2>
<p>One of the features of Scikit-LLM is the ability to perform zero-shot text classification. Scikit-LLM provides two classes for this purpose:</p>
<ul>
<li><code>ZeroShotGPTClassifier</code>: used for single label classification (e.g.&nbsp;sentiment analysis),</li>
<li><code>MultiLabelZeroShotGPTClassifier</code>: used for a multi-label classification task.</li>
</ul>
<section id="single-label-zeroshotgptclassifier" class="level3">
<h3 class="anchored" data-anchor-id="single-label-zeroshotgptclassifier">Single label ZeroShotGPTClassifier</h3>
<p>Let’s do a sentiment analysis of a few movie reviews. We define the sentiment of each review in the variable <code>movie_review_labels</code> and fit the classifier on these reviews and labels so that it can predict the sentiment of new movie reviews. Because the classifier is zero-shot, fitting mainly tells it which candidate labels to choose from; the underlying GPT model is not retrained.</p>
<p>The sample dataset for the movie reviews is given below:</p>
<div id="zeroshotgptclassifier-dataset" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">movie_reviews <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb3-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"This movie was absolutely wonderful. The storyline was compelling and the characters were very realistic."</span>,</span>
<span id="cb3-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I really loved the film! The plot had a few unexpected twists which kept me engaged till the end."</span>,</span>
<span id="cb3-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The movie was alright. Not great, but not bad either. A decent one-time watch."</span>,</span>
<span id="cb3-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I didn't enjoy the film that much. The plot was quite predictable and the characters lacked depth."</span>,</span>
<span id="cb3-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"This movie was not to my taste. It felt too slow and the storyline wasn't engaging enough."</span>,</span>
<span id="cb3-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The film was okay. It was neither impressive nor disappointing. It was just fine."</span>,</span>
<span id="cb3-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I was blown away by the movie! The cinematography was excellent and the performances were top-notch."</span>,</span>
<span id="cb3-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I didn't like the movie at all. The story was uninteresting and the acting was mediocre at best."</span>,</span>
<span id="cb3-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The movie was decent. It had its moments but was not consistently engaging."</span></span>
<span id="cb3-11">]</span>
<span id="cb3-12"></span>
<span id="cb3-13">movie_review_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb3-14">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"positive"</span>, </span>
<span id="cb3-15">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"positive"</span>, </span>
<span id="cb3-16">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"neutral"</span>, </span>
<span id="cb3-17">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"negative"</span>, </span>
<span id="cb3-18">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"negative"</span>, </span>
<span id="cb3-19">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"neutral"</span>, </span>
<span id="cb3-20">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"positive"</span>, </span>
<span id="cb3-21">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"negative"</span>, </span>
<span id="cb3-22">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"neutral"</span></span>
<span id="cb3-23">]</span>
<span id="cb3-24"></span>
<span id="cb3-25">new_movie_reviews <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb3-26">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A positive review</span></span>
<span id="cb3-27">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The movie was fantastic! I was captivated by the storyline from beginning to end."</span>,</span>
<span id="cb3-28"></span>
<span id="cb3-29">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A negative review</span></span>
<span id="cb3-30">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I found the film to be quite boring. The plot moved too slowly and the acting was subpar."</span>,</span>
<span id="cb3-31"></span>
<span id="cb3-32">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A neutral review</span></span>
<span id="cb3-33">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The movie was okay. Not the best I've seen, but certainly not the worst."</span></span>
<span id="cb3-34">]</span></code></pre></div></div>
</div>
<p>Let’s train the model and then check what the model predicts for each new review.</p>
<div id="zeroshotgptclassifier-train-predict" class="cell" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> skllm <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ZeroShotGPTClassifier</span>
<span id="cb4-2"></span>
<span id="cb4-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Initialize the classifier with the OpenAI model</span></span>
<span id="cb4-4">clf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ZeroShotGPTClassifier(openai_model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span>)</span>
<span id="cb4-5"></span>
<span id="cb4-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train the model </span></span>
<span id="cb4-7">clf.fit(X<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>movie_reviews, y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>movie_review_labels)  </span>
<span id="cb4-8"></span>
<span id="cb4-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use the trained classifier to predict the sentiment of the new reviews</span></span>
<span id="cb4-10">predicted_movie_review_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> clf.predict(X<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>new_movie_reviews)  </span>
<span id="cb4-11"></span>
<span id="cb4-12"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> review, sentiment <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(new_movie_reviews, predicted_movie_review_labels):</span>
<span id="cb4-13">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Review: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>review<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Predicted Sentiment: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>sentiment<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stderr">
<pre><code>  0%|          | 0/3 [00:00&lt;?, ?it/s] 33%|███▎      | 1/3 [00:01&lt;00:02,  1.21s/it] 67%|██████▋   | 2/3 [00:02&lt;00:01,  1.28s/it]100%|██████████| 3/3 [00:03&lt;00:00,  1.23s/it]100%|██████████| 3/3 [00:03&lt;00:00,  1.24s/it]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>Review: The movie was fantastic! I was captivated by the storyline from beginning to end.
Predicted Sentiment: positive


Review: I found the film to be quite boring. The plot moved too slowly and the acting was subpar.
Predicted Sentiment: negative


Review: The movie was okay. Not the best I've seen, but certainly not the worst.
Predicted Sentiment: neutral

</code></pre>
</div>
</div>
<p>As can be seen above, the model predicted the sentiment of each movie review correctly.</p>
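<p>With more than a handful of reviews, checking predictions by eye stops being practical. As a minimal sketch, the comparison can be automated with a small accuracy function. The expected labels below are taken from the comments in <code>new_movie_reviews</code>, and the predictions are copied from the output above rather than from a live API call; <code>accuracy</code> is a helper defined here for illustration.</p>

```python
def accuracy(expected, predicted):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)


# Intended sentiments, taken from the comments in new_movie_reviews
expected_labels = ["positive", "negative", "neutral"]
# Predictions copied from the printed output, not from a live API call
predicted_labels = ["positive", "negative", "neutral"]

print(accuracy(expected_labels, predicted_labels))  # 1.0
```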
</section>
<section id="multi-labels-zeroshotgptclassifier" class="level3">
<h3 class="anchored" data-anchor-id="multi-labels-zeroshotgptclassifier">Multi-Label ZeroShotGPTClassifier</h3>
<p>In the previous section, each review received exactly one label (<code>"positive"</code>, <code>"negative"</code>, or <code>"neutral"</code>). Here, we are going to use the <code>MultiLabelZeroShotGPTClassifier</code> estimator to assign multiple labels to each review in a list of restaurant reviews.</p>
<div id="multilabelzeroshotgptclassifier-dataset" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">restaurant_reviews <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb7-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The food was delicious and the service was excellent. A wonderful dining experience!"</span>,</span>
<span id="cb7-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The restaurant was in a great location, but the food was just average."</span>,</span>
<span id="cb7-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The service was very slow and the food was cold when it arrived. Not a good experience."</span>,</span>
<span id="cb7-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The restaurant has a beautiful ambiance, and the food was superb."</span>,</span>
<span id="cb7-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The food was great, but I found it to be a bit overpriced."</span>,</span>
<span id="cb7-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The restaurant was conveniently located, but the service was poor."</span>,</span>
<span id="cb7-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The food was not as expected, but the restaurant ambiance was really nice."</span>,</span>
<span id="cb7-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Great food and quick service. The location was also very convenient."</span>,</span>
<span id="cb7-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The prices were a bit high, but the food quality and the service were excellent."</span>,</span>
<span id="cb7-11">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The restaurant offered a wide variety of dishes. The service was also very quick."</span></span>
<span id="cb7-12">]</span>
<span id="cb7-13"></span>
<span id="cb7-14">restaurant_review_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb7-15">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Food"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Service"</span>],</span>
<span id="cb7-16">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Location"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Food"</span>],</span>
<span id="cb7-17">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Service"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Food"</span>],</span>
<span id="cb7-18">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Atmosphere"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Food"</span>],</span>
<span id="cb7-19">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Food"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Price"</span>],</span>
<span id="cb7-20">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Location"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Service"</span>],</span>
<span id="cb7-21">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Food"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Atmosphere"</span>],</span>
<span id="cb7-22">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Food"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Service"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Location"</span>],</span>
<span id="cb7-23">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Price"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Food"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Service"</span>],</span>
<span id="cb7-24">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Food Variety"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Service"</span>]</span>
<span id="cb7-25">]</span>
<span id="cb7-26"></span>
<span id="cb7-27">new_restaurant_reviews <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb7-28">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The food was excellent and the restaurant was located in the heart of the city."</span>,</span>
<span id="cb7-29">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The service was slow and the food was not worth the price."</span>,</span>
<span id="cb7-30">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The restaurant had a wonderful ambiance, but the variety of dishes was limited."</span></span>
<span id="cb7-31">]</span></code></pre></div></div>
</div>
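<p>Before fitting, it can help to see which labels the classifier will be able to choose from. The sketch below assumes, as the zero-shot setup suggests, that the candidate labels are simply the distinct values appearing in <code>restaurant_review_labels</code>; it uses plain Python and makes no API call.</p>

```python
restaurant_review_labels = [
    ["Food", "Service"],
    ["Location", "Food"],
    ["Service", "Food"],
    ["Atmosphere", "Food"],
    ["Food", "Price"],
    ["Location", "Service"],
    ["Food", "Atmosphere"],
    ["Food", "Service", "Location"],
    ["Price", "Food", "Service"],
    ["Food Variety", "Service"],
]

# Distinct labels across all reviews, sorted for a stable display order
candidate_labels = sorted(
    {label for labels in restaurant_review_labels for label in labels}
)
# Largest number of labels attached to any single review
most_labels_per_review = max(len(labels) for labels in restaurant_review_labels)

print(candidate_labels)
# ['Atmosphere', 'Food', 'Food Variety', 'Location', 'Price', 'Service']
print(most_labels_per_review)  # 3
```

<p>No review in this dataset carries more than three labels, so a cap of three labels per prediction is a natural choice.</p>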
<p>Let’s train the model and then predict the labels for new reviews.</p>
<div id="multilabelzeroshotgptclassifier-train-predict" class="cell" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> skllm <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> MultiLabelZeroShotGPTClassifier</span>
<span id="cb8-2"></span>
<span id="cb8-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Initialize the classifier, allowing at most three labels per review</span></span>
<span id="cb8-4">clf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MultiLabelZeroShotGPTClassifier(max_labels<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb8-5"></span>
<span id="cb8-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train the model </span></span>
<span id="cb8-7">clf.fit(X<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>restaurant_reviews, y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>restaurant_review_labels)</span>
<span id="cb8-8"></span>
<span id="cb8-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use the trained classifier to predict the labels of the new reviews</span></span>
<span id="cb8-10">predicted_restaurant_review_labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> clf.predict(X<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>new_restaurant_reviews)</span>
<span id="cb8-11"></span>
<span id="cb8-12"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> review, labels <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(new_restaurant_reviews, predicted_restaurant_review_labels):</span>
<span id="cb8-13">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Review: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>review<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Predicted Labels: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>labels<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stderr">
<pre><code>  0%|          | 0/3 [00:00&lt;?, ?it/s] 33%|███▎      | 1/3 [00:01&lt;00:02,  1.44s/it] 67%|██████▋   | 2/3 [00:02&lt;00:01,  1.44s/it]100%|██████████| 3/3 [00:05&lt;00:00,  1.87s/it]100%|██████████| 3/3 [00:05&lt;00:00,  1.76s/it]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>Review: The food was excellent and the restaurant was located in the heart of the city.
Predicted Labels: ['Food', 'Location']


Review: The service was slow and the food was not worth the price.
Predicted Labels: ['Service', 'Price']


Review: The restaurant had a wonderful ambiance, but the variety of dishes was limited.
Predicted Labels: ['Atmosphere', 'Food Variety']

</code></pre>
</div>
</div>
<p>The predicted labels for each review are spot-on.</p>
</section>
</section>
<section id="text-vectorization" class="level2">
<h2 class="anchored" data-anchor-id="text-vectorization">Text Vectorization</h2>
<p>Scikit-LLM provides the <code>GPTVectorizer</code> class to convert input text into a fixed-dimensional vector representation. Each resulting vector is an array of floating-point numbers that represents the corresponding sentence.</p>
<p>Let’s get a vectorized representation of the following sentences.</p>
<div id="gptvectorizer-dataset" class="cell" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb11-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"AI can revolutionize industries."</span>,</span>
<span id="cb11-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Robotics creates automated solutions."</span>,</span>
<span id="cb11-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"IoT connects devices for data exchange."</span></span>
<span id="cb11-5">]</span></code></pre></div></div>
</div>
<div id="gptvectorizer-fit-transform" class="cell" data-execution_count="8">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> skllm.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GPTVectorizer</span>
<span id="cb12-2"></span>
<span id="cb12-3">vectorizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GPTVectorizer()</span>
<span id="cb12-4"></span>
<span id="cb12-5">vectors <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> vectorizer.fit_transform(X)</span>
<span id="cb12-6"></span>
<span id="cb12-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(vectors)</span></code></pre></div></div>
<div class="cell-output cell-output-stderr">
<pre><code>  0%|          | 0/3 [00:00&lt;?, ?it/s] 33%|███▎      | 1/3 [00:00&lt;00:00,  4.04it/s] 67%|██████▋   | 2/3 [00:00&lt;00:00,  5.81it/s]100%|██████████| 3/3 [00:00&lt;00:00,  6.90it/s]100%|██████████| 3/3 [00:00&lt;00:00,  6.24it/s]</code></pre>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>[[-0.00818074 -0.02555227 -0.00994665 ... -0.00266894 -0.02135153
   0.00325925]
 [-0.00944166 -0.00884305 -0.01260475 ... -0.00351341 -0.01211498
  -0.00738735]
 [-0.01084771 -0.00133671  0.01582962 ...  0.01247486 -0.00829649
  -0.01012453]]</code></pre>
</div>
</div>
<p>In practice, you rarely examine these vectors directly. Instead, they serve as inputs to other machine learning models for tasks like classification, clustering, or regression.</p>
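<p>One common way to use such vectors is to compare them with cosine similarity, where values closer to 1 indicate vectors pointing in more similar directions. The sketch below uses made-up three-dimensional vectors purely for illustration; they are not real embeddings, and <code>cosine_similarity</code> is a helper defined here rather than a Scikit-LLM function.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration (real embeddings are much longer)
v1 = [1.0, 0.0, 1.0]
v2 = [2.0, 0.0, 2.0]  # same direction as v1, different length
v3 = [0.0, 1.0, 0.0]  # orthogonal to v1

similar = cosine_similarity(v1, v2)    # close to 1: same direction
unrelated = cosine_similarity(v1, v3)  # 0: no overlap
print(round(similar, 6), round(unrelated, 6))
```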
</section>
<section id="text-translation" class="level2">
<h2 class="anchored" data-anchor-id="text-translation">Text Translation</h2>
<p>GPT models can also translate text from one language to another. We can translate a text into a language of interest using the <code>GPTTranslator</code> module.</p>
<div id="gpttranslator-dataset" class="cell" data-execution_count="9">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> skllm.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GPTTranslator</span>
<span id="cb15-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> skllm.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> get_translation_dataset</span>
<span id="cb15-3"></span>
<span id="cb15-4">translator <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GPTTranslator(openai_model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span>, output_language<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"English"</span>)</span>
<span id="cb15-5"></span>
<span id="cb15-6">text_to_translate <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Je suis content que vous lisiez ce post."</span>]</span>
<span id="cb15-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># "I am happy that you are reading this post."</span></span>
<span id="cb15-8"></span>
<span id="cb15-9">translated_text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> translator.fit_transform(text_to_translate)</span>
<span id="cb15-10"></span>
<span id="cb15-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(</span>
<span id="cb15-12">    <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Text in French: </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>text_to_translate[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Translated text in English: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>translated_text[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb15-13">)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Text in French: 
Je suis content que vous lisiez ce post.

Translated text in English: I am glad that you are reading this post.</code></pre>
</div>
</div>
</section>
<section id="text-summarization" class="level2">
<h2 class="anchored" data-anchor-id="text-summarization">Text Summarization</h2>
<p>GPT models are very useful for summarizing texts. The Scikit-LLM library provides the <code>GPTSummarizer</code> estimator for text summarization. Let’s see that in action by summarizing the long reviews given below.</p>
<div id="gptsummarizer-dataset" class="cell" data-execution_count="10">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1">reviews <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb18-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I dined at The Gourmet Kitchen last night and had a wonderful experience. The service was impeccable, the food was exquisite, and the ambiance was delightful. I had the seafood pasta, which was cooked to perfection. The wine list was also quite impressive. I would highly recommend this restaurant to anyone looking for a fine dining experience."</span>,</span>
<span id="cb18-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I visited The Burger Spot for lunch today and was pleasantly surprised. Despite being a fast food joint, the quality of the food was excellent. I ordered the classic cheeseburger and it was juicy and flavorful. The fries were crispy and well-seasoned. The service was quick and the staff was friendly. It's a great place for a quick and satisfying meal."</span>,</span>
<span id="cb18-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The Coffee Corner is my favorite spot to work and enjoy a good cup of coffee. The atmosphere is relaxed and the coffee is always top-notch. They also offer a variety of pastries and sandwiches. The staff is always welcoming and the service is fast. I enjoy their latte and the blueberry muffin is a must-try."</span></span>
<span id="cb18-5">]</span></code></pre></div></div>
</div>
<div id="gptsummarizer-fit-transform" class="cell" data-execution_count="11">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> skllm.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GPTSummarizer</span>
<span id="cb19-2"></span>
<span id="cb19-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Initialize the GPT summarizer model</span></span>
<span id="cb19-4">gpt_summarizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GPTSummarizer(openai_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"gpt-3.5-turbo"</span>, max_words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>)</span>
<span id="cb19-5"></span>
<span id="cb19-6">summaries <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gpt_summarizer.fit_transform(reviews)</span>
<span id="cb19-7"></span>
<span id="cb19-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(summaries)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>['The Gourmet Kitchen offers impeccable service, exquisite food, delightful ambiance, and impressive wine list. Highly recommended.'
 'The Burger Spot offers excellent quality fast food with friendly service.'
 'The Coffee Corner is a great place to work with good coffee and food.']</code></pre>
</div>
</div>
<p>A short summary of each review is generated. The <code>max_words</code> parameter sets a rough upper bound on the summary’s length; in practice, the summary may be slightly longer.</p>
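<p>As a quick check of how soft this limit is, the sketch below counts the whitespace-separated words in the three summaries printed above. The summaries are copied from that output; the counting approach is only an illustration and is not part of Scikit-LLM.</p>

```python
# Summaries copied from the output shown above
summaries = [
    "The Gourmet Kitchen offers impeccable service, exquisite food, delightful ambiance, and impressive wine list. Highly recommended.",
    "The Burger Spot offers excellent quality fast food with friendly service.",
    "The Coffee Corner is a great place to work with good coffee and food.",
]
max_words = 15  # the value passed to GPTSummarizer above

# Whitespace-separated word count for each summary
word_counts = [len(summary.split()) for summary in summaries]
print(word_counts)  # [16, 11, 14]

# Word counts that exceed the requested limit
over_limit = [count for count in word_counts if count > max_words]
print(over_limit)  # [16]
```

<p>The first summary runs one word over the requested 15, which is consistent with treating <code>max_words</code> as a guideline rather than a hard cap.</p>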
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Jupyter Notebook
</div>
</div>
<div class="callout-body-container callout-body">
<p>📓 You can find the Jupyter notebook for this tutorial on <a href="https://github.com/e-alizadeh/data-science-blog/blob/master/notebooks/Scikit-LLM-tutorial.ipynb">GitHub</a>.</p>
</div>
</div>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Scikit-LLM brings advanced language models such as OpenAI’s GPT models into the well-known scikit-learn framework. In this tutorial, we looked at some of Scikit-LLM’s most important features, such as zero-shot text classification, multi-label zero-shot text classification, text vectorization, text summarization, and language translation.</p>
<p>Scikit-LLM is a promising tool that opens up new possibilities in the realm of text analysis using large language models. This library might be a useful addition to your toolbox; therefore, I suggest giving it a try.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-online:python_skllm_github" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">Iryna Kondrashchenko, <span>“<span>Scikit-LLM: Sklearn Meets Large Language Models</span>,”</span> 2023. <a href="https://github.com/iryna-kondr/scikit-llm">https://github.com/iryna-kondr/scikit-llm</a></div>
</div>
</div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2023,
  author = {Alizadeh, Esmaeil},
  title = {Scikit-LLM: {Bridging} the {Gap} {Between} {LLM} and
    {Scikit-learn}},
  date = {2023-06-06},
  url = {https://ealizadeh.com/blog/tutorial-scikit-llm/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2023" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“Scikit-LLM: Bridging the Gap Between LLM and
Scikit-learn,”</span> Jun. 06, 2023. <a href="https://ealizadeh.com/blog/tutorial-scikit-llm/">https://ealizadeh.com/blog/tutorial-scikit-llm/</a></div>
</div></div></section></div> ]]></description>
  <category>GPT</category>
  <category>scikit-learn</category>
  <category>Data Science</category>
  <category>Machine Learning</category>
  <category>Natural Language Processing</category>
  <category>Python Library</category>
  <guid>https://ealizadeh.com/blog/tutorial-scikit-llm/</guid>
  <pubDate>Tue, 06 Jun 2023 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/tutorial-scikit-llm/img/_featured_image.png" medium="image" type="image/png" height="107" width="144"/>
</item>
<item>
  <title>Python’s itertools: A Hidden Gem for Efficient Looping</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/itertools/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/itertools/img/_featured_image.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Featured image of itertools functions covered in the post."></p>
</figure>
</div>
<section id="introduction" class="level1">
<h1>Introduction</h1>
<p>The itertools <span class="citation" data-cites="online:python_itertools">[1]</span> module in Python is a powerful tool that provides a set of functions for creating iterators to support efficient looping and handling of sequences. It’s part of Python’s standard library, meaning it’s available in every Python installation.</p>
<p>Let’s first talk about what a Python iterator is before diving into the itertools functions.</p>
</section>
<section id="what-is-an-iterator-in-python" class="level1">
<h1>What is an iterator in Python?</h1>
<p>An iterator is a Python object that can be looped over, one item at a time. It gives access to the contents of a data container without revealing the container’s internal representation.</p>
<p>Python has several built-in types and functions that support iteration, either as iterables or by returning iterators. Some of the more frequent ones are as follows:</p>
<ul>
<li>Basic data types: lists, tuples, strings, and dictionaries</li>
<li>Built-in functions: <code>range()</code>, <code>enumerate()</code>, <code>zip()</code></li>
</ul>
<section id="how-is-an-iterator-defined-in-python" class="level2">
<h2 class="anchored" data-anchor-id="how-is-an-iterator-defined-in-python">How is an iterator defined in Python?</h2>
<p>An iterator object must implement two special methods: <code>__iter__()</code> and <code>__next__()</code>, collectively known as the iterator protocol <span class="citation" data-cites="online:python_builtin_types">[2]</span>.</p>
<p>The <code>__iter__()</code> method returns the iterator object itself, and is required for your object to be used in any iteration context, such as a for loop. The <code>__next__()</code> method returns the next value from the iterator. If there are no more items to return, it should raise <code>StopIteration</code>.</p>
<div id="cb747bfb" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> CountUpToThree:</span>
<span id="cb1-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb1-3">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb1-4"></span>
<span id="cb1-5">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__iter__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb1-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span></span>
<span id="cb1-7"></span>
<span id="cb1-8">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__next__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb1-9">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>:</span>
<span id="cb1-10">            value <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.count</span>
<span id="cb1-11">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb1-12">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> value</span>
<span id="cb1-13">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb1-14">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">raise</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">StopIteration</span></span>
<span id="cb1-15"></span>
<span id="cb1-16">counter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> CountUpToThree()</span>
<span id="cb1-17"></span>
<span id="cb1-18"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> c <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> counter:</span>
<span id="cb1-19">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(c)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>0
1
2</code></pre>
</div>
</div>
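<p>A <code>for</code> loop drives this protocol for us. To see the individual steps, the following minimal sketch calls the built-in <code>iter()</code> and <code>next()</code> by hand on a small list (the <code>letters</code> list is just an illustrative example).</p>

```python
letters = ["a", "b"]          # a list is an iterable, not an iterator
letters_iter = iter(letters)  # iter() calls letters.__iter__() and returns an iterator

first = next(letters_iter)    # next() calls letters_iter.__next__()
second = next(letters_iter)
print(first, second)  # a b

# The iterator is now exhausted, so another next() raises StopIteration
try:
    next(letters_iter)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)  # True

# An iterator returns itself from __iter__()
print(iter(letters_iter) is letters_iter)  # True
```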
</section>
</section>
<section id="a-deep-dive-into-itertools-library" class="level1">
<h1>A deep dive into itertools library</h1>
<p>At its core, itertools offers a suite of building block functions that allow you to iterate over data in a fast, memory-efficient, and developer-friendly way. These functions can be categorized into three broad types:</p>
<ol type="1">
<li><strong>Infinite Iterators</strong>: These generate an infinite sequence of values.</li>
<li><strong>Combinatoric Generators</strong>: These iterators generate outputs by combining inputs in different ways. They are extremely useful when you want to produce complex combinations or permutations of data.</li>
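<li><strong>Iterators Terminating on the Shortest Input Sequence</strong>: These, like <code>itertools.chain()</code>, <code>itertools.takewhile()</code>, and <code>itertools.zip_longest()</code>, produce values from finite input sequences and stop on their own once the input is exhausted. Note that <code>zip_longest()</code>, although the Python documentation lists it in this category, runs until the longest input is exhausted.</li>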
</ol>
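<p>To make the stopping behaviour concrete, the sketch below compares the built-in <code>zip()</code>, which stops at the shortest input, with <code>itertools.zip_longest()</code>, which continues to the longest one. The <code>names</code> and <code>scores</code> lists are made up for illustration.</p>

```python
from itertools import zip_longest

names = ["Ann", "Bob", "Cid"]
scores = [90, 85]

# Built-in zip() stops as soon as the shortest input is exhausted
short_pairs = list(zip(names, scores))
print(short_pairs)  # [('Ann', 90), ('Bob', 85)]

# zip_longest() keeps going until the longest input is exhausted,
# filling the gaps with fillvalue (None by default)
long_pairs = list(zip_longest(names, scores, fillvalue=0))
print(long_pairs)  # [('Ann', 90), ('Bob', 85), ('Cid', 0)]
```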
<p>All iterators in Python output values sequentially, but itertools’ operations may be chained together to construct more complicated iterators that can process big data sets without using a lot of memory. Additionally, because itertools is implemented in C in CPython, its functions are typically faster than equivalent pure-Python loops.</p>
<p>Itertools is a useful tool for Python programmers because it makes loops more efficient and the code easier to read. Itertools gives us a better way to run through lists, strings, dictionaries, files, and even our own custom data structures.</p>
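<p>As a small illustration of chaining lazy steps, the sketch below builds a pipeline on top of an infinite counter and only materializes five results. The pipeline itself (even squares) is an arbitrary example chosen for this sketch.</p>

```python
from itertools import count, islice

# count(1) is infinite, but nothing is computed until a value is requested
squares = (n * n for n in count(1))

# Keep only the even squares, still lazily
even_squares = (sq for sq in squares if sq % 2 == 0)

# islice() takes the first five results and then stops pulling values
first_five = list(islice(even_squares, 5))
print(first_five)  # [4, 16, 36, 64, 100]
```

<p>No intermediate list is ever built here; each value flows through the whole pipeline before the next one is generated.</p>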
<section id="infinite-iterators" class="level2">
<h2 class="anchored" data-anchor-id="infinite-iterators">Infinite Iterators</h2>
<p>Infinite iterators are a unique feature in the itertools module. They produce an endless sequence of items, only stopping when we explicitly break the loop. This can be particularly useful in scenarios where we have a repeating pattern or want to generate a continuous sequence. However, you must be careful when using these to avoid creating an infinite loop in your program. Let’s look at the three main infinite iterator functions: <code>count()</code>, <code>cycle()</code>, and <code>repeat()</code>.</p>
<section id="countstart-step" class="level3">
<h3 class="anchored" data-anchor-id="countstart-step"><code>count(start, step)</code></h3>
<p>The <code>count()</code> function works similarly to the built-in <code>range()</code> function but, instead of stopping at a certain point, it continues indefinitely. It takes two optional arguments: <code>start</code> is the number at which the count begins (0 by default), and <code>step</code> is the increment (1 by default).</p>
<div id="ac77ad6d" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> count</span>
<span id="cb3-2"></span>
<span id="cb3-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> idx <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> count(start<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, step<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>):</span>
<span id="cb3-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(idx)</span>
<span id="cb3-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">110</span>:  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Break the loop to prevent an infinite loop</span></span>
<span id="cb3-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>100
105
110
115</code></pre>
</div>
</div>
<p>In this example, we start counting from 100 and increase by 5 each time. The loop will continue indefinitely unless we stop it. Here, we stop it when <code>idx</code> gets larger than 110, which is why 115 is still printed before the loop breaks.</p>
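<p>A manual <code>break</code> is not the only way to bound <code>count()</code>. The sketch below shows two alternatives, assuming we know in advance how many values we need: <code>itertools.islice()</code> to take a fixed number of values, and <code>zip()</code> to pair the counter with a finite sequence.</p>

```python
from itertools import count, islice

# Take exactly four values from the infinite counter, no manual break needed
first_four = list(islice(count(start=100, step=5), 4))
print(first_four)  # [100, 105, 110, 115]

# zip() stops at the shortest input, so pairing with count() is safe
labelled = list(zip(count(start=1), ["a", "b", "c"]))
print(labelled)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```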
</section>
<section id="sec-cycle" class="level3">
<h3 class="anchored" data-anchor-id="sec-cycle"><code>cycle(iterable)</code></h3>
<p>The <code>cycle()</code> function cycles through an iterable indefinitely. This can be useful when you have a repeating pattern.</p>
<div id="733e4b94" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> cycle</span>
<span id="cb5-2"></span>
<span id="cb5-3">count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb5-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> item <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> cycle(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ABC"</span>):</span>
<span id="cb5-5">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(item)</span>
<span id="cb5-6">    count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb5-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>:  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Break the loop to prevent infinite loop</span></span>
<span id="cb5-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">break</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>A
B
C
A
B</code></pre>
</div>
</div>
<p>In this example, we’re cycling through the string ‘ABC’. Once we reach ‘C’, it starts over with ‘A’ again. We stop the loop after 5 iterations.</p>
<section id="more-advanced-example-cycle-through-a-list" class="level4">
<h4 class="anchored" data-anchor-id="more-advanced-example-cycle-through-a-list">More advanced example: Cycle through a list</h4>
<p>Suppose we want to cycle through a list indefinitely and print out the current item and the next item.</p>
<div id="b909f0b3" class="cell" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> cycle</span>
<span id="cb7-2"></span>
<span id="cb7-3">items <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span>]</span>
<span id="cb7-4">cycled_items <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cycle(items) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># an iterator that returns elements from the iterable indefinitely</span></span>
<span id="cb7-5"></span>
<span id="cb7-6">current_item <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(cycled_items)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># to advance through the iterator</span></span>
<span id="cb7-7"></span>
<span id="cb7-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>):</span>
<span id="cb7-9">    next_item <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">next</span>(cycled_items)</span>
<span id="cb7-10">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Current item: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>current_item<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">Next item: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>next_item<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb7-11">    current_item <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> next_item</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Current item: A
Next item: B

Current item: B
Next item: C

Current item: C
Next item: A

Current item: A
Next item: B

Current item: B
Next item: C
</code></pre>
</div>
</div>
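<p>Another common pattern is round-robin assignment. In the sketch below, the task and worker names are hypothetical; the point is that <code>zip()</code> stops when the finite list runs out, so the infinite cycle is consumed safely.</p>

```python
from itertools import cycle

tasks = ["t1", "t2", "t3", "t4", "t5"]  # hypothetical task names
workers = ["alice", "bob"]              # hypothetical worker names

# zip() stops when tasks run out, so cycle(workers) never loops forever
assignments = list(zip(tasks, cycle(workers)))
print(assignments)
# [('t1', 'alice'), ('t2', 'bob'), ('t3', 'alice'), ('t4', 'bob'), ('t5', 'alice')]
```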
</section>
</section>
<section id="repeatobject-times" class="level3">
<h3 class="anchored" data-anchor-id="repeatobject-times"><code>repeat(object, times)</code></h3>
<p>The <code>repeat()</code> function simply repeats an object over and over again. By default, it does this indefinitely, but you can also specify the number of times you want the object to be repeated.</p>
<div id="28fd8f50" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> repeat</span>
<span id="cb9-2"></span>
<span id="cb9-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> repeat([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B"</span>], times<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>):</span>
<span id="cb9-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(i)</span>
<span id="cb9-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb9-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> repeat(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"AB"</span>, times<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>):</span>
<span id="cb9-7">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(i)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>['A', 'B']
['A', 'B']
['A', 'B']


AB
AB
AB</code></pre>
</div>
</div>
<p>Here, we repeat the list <code>["A", "B"]</code> and then the string ‘AB’, three times each. Unlike the previous functions, <code>repeat()</code> can terminate on its own if we provide the <code>times</code> argument.</p>
<p>These functions can be very handy in various scenarios. They allow us to generate data on the fly without having to pre-generate large lists or sequences, making our code more memory efficient.</p>
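<p>One typical use of <code>repeat()</code> is supplying a constant argument to <code>map()</code>. The following sketch squares a few numbers this way and also shows that <code>repeat()</code> hands back the same object each time rather than copies.</p>

```python
from itertools import repeat

# repeat(2) supplies the exponent for every call to pow();
# map() stops when range(5) is exhausted, so no times argument is needed
squares = list(map(pow, range(5), repeat(2)))
print(squares)  # [0, 1, 4, 9, 16]

# repeat() yields the same object each time, not copies of it
row = ["A", "B"]
rows = list(repeat(row, times=2))
print(rows[0] is rows[1])  # True
```

<p>Because the repeated object is shared, mutating one of the yielded lists would affect all of them.</p>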
</section>
</section>
<section id="combinatoric-iterators" class="level2">
<h2 class="anchored" data-anchor-id="combinatoric-iterators">Combinatoric Iterators</h2>
<p>Combinatoric iterators are used to create different types of iterators that generate all possible combinations, permutations, or Cartesian products (a set of all ordered pairs) of an iterable<sup>1</sup>. They are powerful tools when we need to consider all possible combinations of elements. Here we’ll focus on three functions: <code>product()</code>, <code>permutations()</code>, and <code>combinations()</code>.</p>
<section id="productiterable-repeat" class="level3">
<h3 class="anchored" data-anchor-id="productiterable-repeat"><code>product(iterable, repeat)</code></h3>
<p>The <code>product()</code> function computes the Cartesian product of one or more input iterables, which is equivalent to nested for-loops. The <code>repeat</code> argument specifies how many times the input is repeated, so <code>product(x, repeat=2)</code> gives the same result as <code>product(x, x)</code>.</p>
<div id="ff67f98c" class="cell" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> product</span>
<span id="cb11-2"></span>
<span id="cb11-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> item <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> product([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B"</span>], repeat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>):</span>
<span id="cb11-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(item)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>('A', 'A')
('A', 'B')
('B', 'A')
('B', 'B')</code></pre>
</div>
</div>
<p>In this example, we’re generating the Cartesian product of the list <code>["A", "B"]</code> with itself. This gives us all possible ordered pairs of ‘A’ and ‘B’, each as a tuple.</p>
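<p>The equivalence with nested for-loops can be checked directly. The sketch below is illustrative only; the names <code>letters</code>, <code>nested</code>, and <code>mixed</code> are made up for this example.</p>

```python
from itertools import product

letters = ["A", "B"]

# Build every ordered pair by hand with two nested for-loops
nested = []
for first in letters:
    for second in letters:
        nested.append((first, second))

# product() with repeat=2 yields the same tuples in the same order
assert nested == list(product(letters, repeat=2))

# product() also accepts several different iterables
mixed = list(product(letters, [1, 2]))
print(mixed)  # [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
```

<p>The last element varies fastest, just like the innermost loop in the hand-written version.</p>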
</section>
<section id="permutationsiterable-r" class="level3">
<h3 class="anchored" data-anchor-id="permutationsiterable-r"><code>permutations(iterable, r)</code></h3>
<p>The <code>permutations()</code> function generates all possible permutations (ordered arrangements) of the elements of the input iterable. You can specify the length of the permutations using the <code>r</code> argument. If <code>r</code> is not specified, it defaults to the length of the iterable.</p>
<div id="5a923511" class="cell" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> permutations</span>
<span id="cb13-2"></span>
<span id="cb13-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> item <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> permutations(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ABC"</span>, r<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>):  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># equivalent to permutations(["A", "B", "C"], 2)</span></span>
<span id="cb13-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(item)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>('A', 'B')
('A', 'C')
('B', 'A')
('B', 'C')
('C', 'A')
('C', 'B')</code></pre>
</div>
</div>
<p>Here, we’re generating all possible 2-element permutations of the string ‘ABC’. Each permutation is a tuple of two characters.</p>
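<p>As a quick check of the default <code>r</code>, the sketch below leaves it out and compares the number of results with <code>math.perm()</code> (available from Python 3.8). The variable names are illustrative.</p>

```python
from itertools import permutations
from math import perm

# Without r, every permutation uses all three characters
full = list(permutations("ABC"))
print(len(full))  # 6
print(full[0])    # ('A', 'B', 'C')

# With r=2, the count matches the formula n! / (n - r)!
pairs = list(permutations("ABC", 2))
assert len(pairs) == perm(3, 2)
```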
</section>
<section id="combinationsiterable-r" class="level3">
<h3 class="anchored" data-anchor-id="combinationsiterable-r"><code>combinations(iterable, r)</code></h3>
<p>The <code>combinations()</code> function generates all possible combinations of the elements of the input iterable. The <code>r</code> argument specifies the length of the combinations. Unlike permutations, combinations don’t consider the order of elements, so <code>('A', 'B')</code> and <code>('B', 'A')</code> count as the same combination.</p>
<div id="e04ecb30" class="cell" data-execution_count="8">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> combinations</span>
<span id="cb15-2"></span>
<span id="cb15-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> item <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> combinations([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span>], r<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>):</span>
<span id="cb15-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(item)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>('A', 'B')
('A', 'C')
('B', 'C')</code></pre>
</div>
</div>
<p>Here, we generate every 2-element combination of the items in the list [“A”, “B”, “C”].</p>
<p>These operations come in handy when trying to solve a problem that requires us to think about every conceivable combination or subset of the given items.</p>
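<p>The difference between the two functions can be made explicit: ignoring order in the permutations leaves exactly the combinations. This is a small illustrative sketch; the names <code>items</code>, <code>combos</code>, <code>perms</code>, and <code>unordered</code> are chosen for the example.</p>

```python
from itertools import combinations, permutations

items = ["A", "B", "C"]
combos = list(combinations(items, 2))
perms = list(permutations(items, 2))

# Collapsing each permutation into an unordered set leaves the combinations
unordered = {frozenset(p) for p in perms}
assert unordered == {frozenset(c) for c in combos}

print(len(perms), len(combos))  # 6 3
```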
</section>
</section>
<section id="terminating-iterators" class="level2">
<h2 class="anchored" data-anchor-id="terminating-iterators">Terminating Iterators</h2>
<p><em>Terminating iterators</em> consume one or more finite input iterables and produce an iterator that ends once the input is exhausted. They are used to transform, combine, or summarize the input in some way. For this section, we’ll focus on <code>accumulate()</code>, <code>groupby()</code>, and <code>chain()</code>.</p>
<section id="accumulateiterable-func" class="level3">
<h3 class="anchored" data-anchor-id="accumulateiterable-func"><code>accumulate(iterable, func)</code></h3>
<p>The <code>accumulate()</code> function returns running totals, or the accumulated results of another binary function passed as <code>func</code>. If no function is specified, addition is used.</p>
<div id="c7ab9451" class="cell" data-execution_count="9">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> accumulate</span>
<span id="cb17-2"></span>
<span id="cb17-3">list_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>]</span>
<span id="cb17-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> item <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> accumulate(list_, func<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>):</span>
<span id="cb17-5">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(item)</span>
<span id="cb17-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># accumulate([3], func=max) -&gt; 3</span></span>
<span id="cb17-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># accumulate([3, 4], func=max) -&gt; 4</span></span>
<span id="cb17-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># accumulate([3, 4, 6], func=max) -&gt; 6</span></span>
<span id="cb17-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># accumulate([3, 4, 6, 2], func=max) -&gt; 6</span></span>
<span id="cb17-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># accumulate([3, 4, 6, 2, 1], func=max) -&gt; 6</span></span>
<span id="cb17-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># accumulate([3, 4, 6, 2, 1, 9], func=max) -&gt; 9</span></span>
<span id="cb17-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># accumulate([3, 4, 6, 2, 1, 9, 8], func=max) -&gt; 9</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>3
4
6
6
6
9
9</code></pre>
</div>
</div>
<p>In this example, we’re using <code>accumulate()</code> with the max function to print the maximum value encountered at each step in the list.</p>
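<p>For comparison, here is a short sketch of the default behaviour (addition) and of another binary function, <code>operator.mul</code>, on the same list. The names <code>running_sum</code> and <code>running_product</code> are illustrative.</p>

```python
from itertools import accumulate
import operator

list_ = [3, 4, 6, 2, 1, 9, 8]

# With no func argument, accumulate() yields running sums
running_sum = list(accumulate(list_))
print(running_sum)      # [3, 7, 13, 15, 16, 25, 33]

# Any two-argument function works, for example multiplication
running_product = list(accumulate(list_, operator.mul))
print(running_product)  # [3, 12, 72, 144, 144, 1296, 10368]
```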
</section>
<section id="groupbyiterable-key" class="level3">
<h3 class="anchored" data-anchor-id="groupbyiterable-key"><code>groupby(iterable, key)</code></h3>
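<p>The <code>groupby()</code> function makes an iterator that returns consecutive keys and groups from the iterable. The <code>key</code> argument is a function that computes a key value for each element. Because a new group starts every time the key value changes, the input generally needs to be sorted by the same key beforehand.</p>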
<div id="921200a2" class="cell" data-execution_count="10">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> groupby</span>
<span id="cb19-2"></span>
<span id="cb19-3">list_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb19-4">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"apple"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fruit"</span>), </span>
<span id="cb19-5">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"orange"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fruit"</span>), </span>
<span id="cb19-6">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lettuce"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"vegetable"</span>), </span>
<span id="cb19-7">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"spinach"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"vegetable"</span>)</span>
<span id="cb19-8">]</span>
<span id="cb19-9"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> key, group <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> groupby(list_, key<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]):</span>
<span id="cb19-10">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>key<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">" group: '</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(group))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>"fruit" group:  [('apple', 'fruit'), ('orange', 'fruit')]
"vegetable" group:  [('lettuce', 'vegetable'), ('spinach', 'vegetable')]</code></pre>
</div>
</div>
<p>In this case, we’re grouping a list of tuples according to their second element (thus, <code>x[1]</code>), which makes them either <em>fruit</em> or <em>vegetable</em>.</p>
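<p>One detail worth seeing in code: <code>groupby()</code> only groups <em>consecutive</em> elements with equal keys. The sketch below uses a deliberately unsorted list; the helper <code>second()</code> and the variable names are made up for this illustration.</p>

```python
from itertools import groupby

items = [("apple", "fruit"), ("lettuce", "vegetable"), ("orange", "fruit")]

def second(pair):
    """Return the category stored in the second position of the tuple."""
    return pair[1]

# On unsorted input, the same key shows up in more than one group
unsorted_keys = [key for key, _ in groupby(items, key=second)]
print(unsorted_keys)  # ['fruit', 'vegetable', 'fruit']

# Sorting by the same key first gives one group per category
sorted_items = sorted(items, key=second)
grouped = {
    key: [name for name, _ in group]
    for key, group in groupby(sorted_items, key=second)
}
print(grouped)  # {'fruit': ['apple', 'orange'], 'vegetable': ['lettuce']}
```

<p>The list in the original example happens to be ordered by category already, which is why it works without sorting.</p>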
</section>
<section id="chainiterables" class="level3">
<h3 class="anchored" data-anchor-id="chainiterables"><code>chain(*iterables)</code></h3>
<p>The <code>chain()</code> function is used to treat multiple sequences as one continuous sequence.</p>
<div id="39b411c4" class="cell" data-execution_count="11">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> chain</span>
<span id="cb21-2"></span>
<span id="cb21-3">list_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B"</span>]</span>
<span id="cb21-4">list_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>]</span>
<span id="cb21-5">s <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cd"</span></span>
<span id="cb21-6"></span>
<span id="cb21-7"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> each <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> chain(list_1, list_2, s):</span>
<span id="cb21-8">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(each)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>A
B
1
2
3
c
d</code></pre>
</div>
</div>
<p>In this example, we’re using <code>chain()</code> to treat two lists and a string as if they were one long sequence and iterating over their contents.</p>
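<p>When the iterables are already collected in a single container, <code>chain.from_iterable()</code> does the same job without unpacking. A minimal sketch, with illustrative variable names:</p>

```python
from itertools import chain

nested = [["A", "B"], [1, 2, 3], "cd"]

# from_iterable() takes one iterable of iterables and flattens one level
flat = list(chain.from_iterable(nested))
print(flat)  # ['A', 'B', 1, 2, 3, 'c', 'd']

# It is equivalent to unpacking the container into chain()
assert flat == list(chain(*nested))
```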
</section>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>The itertools module is a hidden gem in Python that lets us write simpler, more efficient code when dealing with iteration. It provides a set of tools for building and manipulating iterators that can handle complicated iteration patterns. As we deal with bigger datasets, memory efficiency also becomes more crucial. In this post, we covered three main classes of itertools functions: 1. <em>infinite iterators</em>, 2. <em>combinatoric iterators</em>, and 3. <em>terminating iterators</em>.</p>
<p>Despite its benefits, itertools is still one of the lesser-known modules in Python’s standard library. It belongs in every Python programmer’s toolkit because of the variety of powerful capabilities it offers for looping, iterating, and producing combinations or permutations. Learning itertools is a good investment of time, whether you’re an experienced Pythonista wanting to hone your coding skills or a beginner trying to get a feel for Python’s potential.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>📓 You can find the Jupyter notebook for this blog post on <a href="https://github.com/e-alizadeh/data-science-blog/blob/master/notebooks/itertools.ipynb">GitHub</a>.</p>
</div>
</div>
<div class="callout callout-style-simple callout-note no-icon callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-2-contents" aria-controls="callout-2" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Update History
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-2" class="callout-2-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 10%">
<col style="width: 70%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Date</th>
<th style="text-align: left;">Sections</th>
<th style="text-align: left;">Changes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">June 7, 2023</td>
<td style="text-align: left;">-</td>
<td style="text-align: left;">Removed section “Practical Example: Solving a Problem with itertools” and the example 1.</td>
</tr>
<tr class="even">
<td style="text-align: left;">June 7, 2023</td>
<td style="text-align: left;">3.1.2</td>
<td style="text-align: left;">Moved the example “Cycle through a list” to a different section.</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-online:python_itertools" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">Python Software Foundation, <span>“<span class="nocase">itertools — Functions creating iterators for efficient looping</span>,”</span> May 23, 2023. <a href="https://docs.python.org/3/library/itertools.html">https://docs.python.org/3/library/itertools.html</a></div>
</div>
<div id="ref-online:python_builtin_types" class="csl-entry">
<div class="csl-left-margin">[2] </div><div class="csl-right-inline">Python Software Foundation, <span>“<span class="nocase">The Python Standard Library » Built-in Types</span>,”</span> May 25, 2023. <a href="https://docs.python.org/3/library/stdtypes.html#iterator-types">https://docs.python.org/3/library/stdtypes.html#iterator-types</a></div>
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>For more information about the Cartesian product, see <a href="https://en.wikipedia.org/wiki/Cartesian_product">Wikipedia</a>.</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2023,
  author = {Alizadeh, Esmaeil},
  title = {Python’s Itertools: {A} {Hidden} {Gem} for {Efficient}
    {Looping}},
  date = {2023-05-25},
  url = {https://ealizadeh.com/blog/itertools/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2023" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“Python’s itertools: A Hidden Gem for Efficient
Looping,”</span> May 25, 2023. <a href="https://ealizadeh.com/blog/itertools/">https://ealizadeh.com/blog/itertools/</a></div>
</div></div></section></div> ]]></description>
  <category>Python</category>
  <category>Python Library</category>
  <category>Looping</category>
  <category>Software Development</category>
  <guid>https://ealizadeh.com/blog/itertools/</guid>
  <pubDate>Thu, 25 May 2023 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/itertools/img/_featured_image.png" medium="image" type="image/png" height="90" width="144"/>
</item>
<item>
  <title>Taming Text with string2string: A Powerful Python Library for String-to-String Algorithms</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/tutorial-string2string/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/tutorial-string2string/img/_featured_image.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Featured image of wordcloud of concepts covered in string2string library."></p>
</figure>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<strong><a href="https://towardsdatascience.com/tutorial-string2string-python-pkg-f9126b8474c5">Towards Data Science blog</a></strong>.</p>
</div>
</div>
<section id="introduction" class="level1">
<h1>Introduction</h1>
<p>The <code>string2string</code> library is an open-source tool that offers a comprehensive set of efficient methods for string-to-string problems.<sup>1</sup> It covers pairwise string alignment, distance measurement, lexical and semantic search, and similarity analysis. It also includes a variety of visualization tools and metrics that make it simpler to understand and evaluate the results of these methods.</p>
<p>The library includes well-known algorithms such as Smith-Waterman, Hirschberg, Wagner-Fischer, BARTScore, BERTScore, Knuth-Morris-Pratt, and Faiss search. It can be used for many tasks and problems in natural-language processing, bioinformatics, and computational social science <span class="citation" data-cites="suzgun2023string2string">[1]</span>.</p>
<p>The <a href="https://nlp.stanford.edu/">Stanford NLP group</a>, which is part of the Stanford AI Lab, has developed the library and introduced it in <span class="citation" data-cites="suzgun2023string2string">[1]</span>. The library’s GitHub repository has several <a href="https://github.com/stanfordnlp/string2string/tree/main#tutorials">tutorials</a> that you may find useful.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>A <em>string</em> is a sequence of characters (letters, numbers, and symbols) that stands for a piece of data or text. From everyday phrases to DNA sequences, and even computer programs, strings may be used to represent just about everything <span class="citation" data-cites="suzgun2023string2string">[1]</span>.</p>
</div>
</div>
<!-- ![The `string2string` library provides algorithms and techniques to solve string-to-string mappings [@suzgun2023string2string].](img/string2sring_library_featured.png){#fig-s2s-library-algorithms fig-align="center" fig-alt="Algorithms and techniques in `string2string` library."} -->
<section id="installation" class="level2">
<h2 class="anchored" data-anchor-id="installation">Installation</h2>
<p>You can install the library via pip by running <code>pip install string2string</code>.</p>
</section>
</section>
<section id="pairwise-alignment" class="level1">
<h1>Pairwise Alignment</h1>
<p>String pairwise alignment is a method used in NLP and other disciplines to compare two strings, or sequences of characters, by highlighting their shared and unique characteristics. The two strings are aligned, and a similarity score is calculated from the number of matching characters and the number of gaps and mismatches. This procedure is useful for locating similar sequences of characters and for calculating the “distance” between two strings. Spell checking, text analysis, and bioinformatics sequence comparison (e.g., DNA sequence alignment) are just some of its many uses.</p>
<p>Currently, the <code>string2string</code> package provides the following alignment techniques:</p>
<ul>
<li>Needleman-Wunsch for global alignment</li>
<li>Smith-Waterman for local alignment</li>
<li>Hirschberg’s algorithm for linear space global alignment</li>
<li>Longest common subsequence</li>
<li>Longest common substring</li>
<li>Dynamic time warping (DTW) for time series alignment</li>
</ul>
<p>In this post, we’ll look at two examples: one for global alignment and one for time series alignment.</p>
<section id="needleman-wunsch-algorithm-for-global-alignment" class="level2">
<h2 class="anchored" data-anchor-id="needleman-wunsch-algorithm-for-global-alignment">Needleman-Wunsch Algorithm for Global Alignment</h2>
<p>The Needleman-Wunsch algorithm is a dynamic programming algorithm that is often used in bioinformatics to align two DNA or protein sequences globally, that is, over their entire length.</p>
<div id="stdout-alignment-nw" class="cell" data-tags="[]" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> string2string.alignment <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> NeedlemanWunsch</span>
<span id="cb1-2"></span>
<span id="cb1-3">nw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> NeedlemanWunsch()</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define two sequences</span></span>
<span id="cb1-6">s1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'ACGTGGA'</span></span>
<span id="cb1-7">s2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'AGCTCGC'</span></span>
<span id="cb1-8"></span>
<span id="cb1-9">aligned_s1, aligned_s2, score_matrix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nw.get_alignment(s1, s2, return_score_matrix<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb1-10"></span>
<span id="cb1-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'The alignment between "</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>s1<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">" and "</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>s2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">":'</span>)</span>
<span id="cb1-12">nw.print_alignment(aligned_s1, aligned_s2)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>The alignment between "ACGTGGA" and "AGCTCGC":
A | C | G | - | T | G | G | A
A | - | G | C | T | C | G | C</code></pre>
</div>
</div>
<p>For a more informative comparison, we can use the <code>plot_pairwise_alignment()</code> function in the library.</p>
<div id="cell-fig-alignment-nw-plot" class="cell" data-tags="[]" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> string2string.misc.plotting_functions <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> plot_pairwise_alignment</span>
<span id="cb3-2"></span>
<span id="cb3-3">path, s1_pieces, s2_pieces <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> nw.get_alignment_strings_and_indices(aligned_s1, aligned_s2)</span>
<span id="cb3-4"></span>
<span id="cb3-5">plot_pairwise_alignment(</span>
<span id="cb3-6">    seq1_pieces<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>s1_pieces,</span>
<span id="cb3-7">    seq2_pieces<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>s2_pieces,</span>
<span id="cb3-8">    alignment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>path,</span>
<span id="cb3-9">    str2colordict<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{</span>
<span id="cb3-10">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pink"</span>, </span>
<span id="cb3-11">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"G"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lightblue"</span>, </span>
<span id="cb3-12">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lightgreen"</span>, </span>
<span id="cb3-13">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"T"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"yellow"</span>, </span>
<span id="cb3-14">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"-"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lightgray"</span></span>
<span id="cb3-15">    },</span>
<span id="cb3-16">    title<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>,</span>
<span id="cb3-17">    seq1_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sequence 1"</span>,</span>
<span id="cb3-18">    seq2_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sequence 2"</span>,</span>
<span id="cb3-19">)</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div id="fig-alignment-nw-plot" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-alignment-nw-plot-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ealizadeh.com/blog/tutorial-string2string/index_files/figure-html/fig-alignment-nw-plot-output-1.png" width="758" height="374" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-alignment-nw-plot-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Global alignment between ACGTGGA and AGCTCGC
</figcaption>
</figure>
</div>
</div>
</div>
</section>
<section id="dynamic-time-warping" class="level2">
<h2 class="anchored" data-anchor-id="dynamic-time-warping">Dynamic Time Warping</h2>
<p>DTW is a useful tool to compare two time series that might differ in speed, duration, or both. It first calculates the “distance” between each pair of points in the two sequences and then finds the path through these pairwise distances that minimizes the total difference between the sequences.</p>
<p>Let’s go over an example using the <code>alignment</code> module in the <code>string2string</code> library.</p>
<div id="stdout-alignment-dtw-path" class="cell" data-tags="[]" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> string2string.alignment <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DTW</span>
<span id="cb4-2"></span>
<span id="cb4-3">dtw <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DTW()</span>
<span id="cb4-4"></span>
<span id="cb4-5">x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb4-6">y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb4-7"></span>
<span id="cb4-8">dtw_path <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> dtw.get_alignment_path(x, y, distance<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"square_difference"</span>)</span>
<span id="cb4-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"DTW path: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>dtw_path<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>DTW path: [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6)]</code></pre>
</div>
</div>
<p>The example above is borrowed from my previous post, <em><a href="https://ealizadeh.com/blog/introduction-to-dynamic-time-warping/#example-1">An Illustrative Introduction to Dynamic Time Warping</a></em>. For those looking to delve deeper into the topic, I explain the core concepts of DTW in a visual and accessible way in <span class="citation" data-cites="online:Essi2020dtw">[2]</span>.</p>
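<p>To make the idea concrete, here is a minimal pure-Python sketch of the classic dynamic-programming formulation of DTW with a squared-difference cost. It is an illustration only and not the <code>string2string</code> implementation; the function name <code>dtw_path</code> and the way ties are broken during backtracking are assumptions made for this sketch.</p>

```python
def dtw_path(x, y):
    """Return (total_cost, path) for a basic DTW with a squared-difference cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Backtrack from the last cell to the first one,
    # always stepping to the cheapest of the three predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (acc[i - 1][j], i - 1, j),          # advance in x only
            (acc[i][j - 1], i, j - 1),          # advance in y only
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path


x = [3, 1, 2, 2, 1]
y = [2, 0, 0, 3, 3, 1, 0]
total_cost, path = dtw_path(x, y)
print(f"Total cost: {total_cost}")
print(f"DTW path: {path}")
```

<p>For the two sequences used in the example above, this sketch gives a total cost of 6 and the same path that the library prints. With other inputs, a different tie-breaking rule could produce a different path with the same total cost.</p>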
</section>
</section>
<section id="search-problems" class="level1">
<h1>Search Problems</h1>
<p>String search is the task of finding a pattern substring within another string. The library offers two types of search algorithms: lexical search and semantic search.</p>
<section id="lexical-search-exact-match-search" class="level2">
<h2 class="anchored" data-anchor-id="lexical-search-exact-match-search">Lexical Search (exact-match search)</h2>
<p>Lexical search, in layman’s terms, is the act of searching for certain words or phrases inside a text, analogous to searching for a word or phrase in a dictionary or a book.</p>
<p>Instead of trying to figure out what a string of letters or words means, it just tries to match them exactly. When it comes to search engines and information retrieval, lexical search is a basic strategy for finding relevant resources based on the keywords or phrases users enter, without any attempt at comprehending the linguistic context of the words or phrases in question.</p>
<p>Currently, the <code>string2string</code> library provides the following lexical search algorithms:</p>
<ul>
<li>Naive (brute-force) search algorithm</li>
<li>Rabin-Karp search algorithm</li>
<li>Knuth-Morris-Pratt (KMP) search algorithm (see the example below)</li>
<li>Boyer-Moore search algorithm</li>
</ul>
<div id="stdout-search-kmp" class="cell" data-tags="[]" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> string2string.search <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> KMPSearch</span>
<span id="cb6-2"></span>
<span id="cb6-3">kmp_search <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KMPSearch()</span>
<span id="cb6-4"></span>
<span id="cb6-5">pattern <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Redwood tree"</span></span>
<span id="cb6-6">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The gentle fluttering of a Monarch butterfly, the towering majesty of a Redwood tree, and the crashing of ocean waves are all wonders of nature."</span></span>
<span id="cb6-7"></span>
<span id="cb6-8">idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> kmp_search.search(pattern<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>pattern, text<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>text)</span>
<span id="cb6-9"></span>
<span id="cb6-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"The starting index of pattern: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>idx<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb6-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'The pattern (± characters) inside the text: "</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>text[idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>: idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(pattern)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"'</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>The starting index of pattern: 72
The pattern (± characters) inside the text: "of a Redwood tree, and"</code></pre>
</div>
</div>
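<p>For readers curious about what happens under the hood, below is a minimal sketch of the Knuth-Morris-Pratt idea in plain Python. It is not the library's code; the helper names <code>build_failure_table</code> and <code>kmp_find</code> are made up for this illustration. The key point is that the failure table lets the search reuse already matched characters instead of restarting from scratch after a mismatch.</p>

```python
def build_failure_table(pattern):
    # table[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = table[k - 1]  # fall back without moving backwards in text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


print(kmp_find("tree", "a Redwood tree"))
```

<p>Because the text index never moves backwards, the search runs in time proportional to the combined length of the pattern and the text.</p>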
</section>
<section id="semantic-search" class="level2">
<h2 class="anchored" data-anchor-id="semantic-search">Semantic Search</h2>
<p>Semantic search is a more sophisticated method of information retrieval that goes beyond simple word or phrase searches. It employs NLP (natural language processing) to decipher a user’s intent and return accurate results.</p>
<p>To put it another way, let’s say you’re interested in “how to grow apples.” While a lexical search may produce results including the terms “grow” and “apples,” a semantic search will recognize that you are interested in the cultivation of apple trees and deliver results accordingly. The search engine would then prioritize results that not only included the phrases it was looking for but also gave relevant information about planting, trimming, and harvesting apple trees.</p>
<section id="semantic-search-via-faiss" class="level3">
<h3 class="anchored" data-anchor-id="semantic-search-via-faiss">Semantic Search via Faiss</h3>
<p>Faiss (Facebook AI Similarity Search) is an efficient similarity search library that is useful for dealing with high-dimensional data with numerical representations <span class="citation" data-cites="johnson2019faiss">[3]</span>. The <code>string2string</code> library has a wrapper for the Faiss library developed by Facebook (see the <a href="https://github.com/facebookresearch/faiss">GitHub repository</a>).</p>
<p>In short, Faiss search ranks its results based on a “score,” representing the degree to which two objects are similar to one another. The score makes it possible to interpret and prioritize search results based on how close/relevant they are to the desired target.</p>
<p>Let’s see how the Faiss search is used in the <code>string2string</code> library. Here, we have a corpus<sup>2</sup> of 11 sentences, and we will do a semantic search by querying a target sentence to see how close/relevant it is to these sentences.</p>
<div id="fba6fe02" class="cell" data-tags="[]" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">corpus <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>: [</span>
<span id="cb8-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A warm cup of tea in the morning helps me start the day right."</span>,</span>
<span id="cb8-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Staying active is important for maintaining a healthy lifestyle."</span>,</span>
<span id="cb8-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I find inspiration in trying out new activities or hobbies."</span>,</span>
<span id="cb8-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The view from my window is always a source of inspiration."</span>,</span>
<span id="cb8-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The encouragement from my loved ones keeps me going."</span>,</span>
<span id="cb8-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The novel I've picked up recently has been a page-turner."</span>,</span>
<span id="cb8-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Listening to podcasts helps me stay focused during work."</span>,</span>
<span id="cb8-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I can't wait to explore the new art gallery downtown."</span>,</span>
<span id="cb8-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Meditating in a peaceful environment brings clarity to my thoughts."</span>,</span>
<span id="cb8-11">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I believe empathy is a crucial quality to possess."</span>,</span>
<span id="cb8-12">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I like to exercise a few times a week."</span></span>
<span id="cb8-13">    ]</span>
<span id="cb8-14">}</span>
<span id="cb8-15"></span>
<span id="cb8-16">query <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I enjoy walking early morning before I start my work."</span></span></code></pre></div></div>
</div>
<p>Let’s initialize the <code>FaissSearch</code> object. Facebook’s BART Large model is the default model and tokenizer for the <code>FaissSearch</code> object.</p>
<div id="d4ec93b9" class="cell" data-tags="[]" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> string2string.search <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> FaissSearch</span>
<span id="cb9-2"></span>
<span id="cb9-3">faiss_search <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> FaissSearch(</span>
<span id="cb9-4">    model_name_or_path <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"facebook/bart-large"</span>,</span>
<span id="cb9-5">    tokenizer_name_or_path <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"facebook/bart-large"</span>,</span>
<span id="cb9-6">)</span>
<span id="cb9-7"></span>
<span id="cb9-8">faiss_search.initialize_corpus(</span>
<span id="cb9-9">    corpus<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>corpus,</span>
<span id="cb9-10">    section<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>, </span>
<span id="cb9-11">    embedding_type<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean_pooling"</span>,</span>
<span id="cb9-12">)</span></code></pre></div></div>
</div>
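<p>Let’s find the top 3 sentences in the corpus that are most similar to the query and print them along with their scores. In the output below, the results are listed in order of increasing score, so a lower score appears to indicate a closer match.</p>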
<div id="stdout-search-faiss-top-results" class="cell" data-tags="[]" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1">top_k_similar_answers <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb10-2">most_similar_results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> faiss_search.search(</span>
<span id="cb10-3">    query<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>query,</span>
<span id="cb10-4">    k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>top_k_similar_answers,</span>
<span id="cb10-5">)</span>
<span id="cb10-6">    </span>
<span id="cb10-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Query: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>query<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb10-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(top_k_similar_answers):</span>
<span id="cb10-9">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Result </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (score=</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>most_similar_results[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"score"</span>][i]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">): "</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>most_similar_results[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>][i]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"'</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Query: I enjoy walking early morning before I start my work.

Result 1 (score=208.49): "I find inspiration in trying out new activities or hobbies."
Result 2 (score=218.21): "I like to exercise a few times a week."
Result 3 (score=225.96): "I can't wait to explore the new art gallery downtown."</code></pre>
</div>
</div>
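<p>The ranking step itself can be sketched without any model. The snippet below assumes that the score behaves like a squared Euclidean distance between embedding vectors, so that lower means closer; that is an assumption, since the exact metric depends on how the index is configured. The three-dimensional vectors and the helper names <code>squared_l2</code> and <code>top_k</code> are invented purely for illustration; real embeddings would come from a model such as BART.</p>

```python
def squared_l2(u, v):
    # Squared Euclidean distance between two equally sized vectors
    return sum((a - b) ** 2 for a, b in zip(u, v))


def top_k(query_vec, corpus_vecs, k):
    # Score every corpus vector and keep the k smallest distances
    scored = sorted(
        (squared_l2(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs)
    )
    return scored[:k]


# Toy "embeddings" (hypothetical values, not produced by any model)
corpus_vecs = [
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [3.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
]
query_vec = [2.0, 1.0, 0.0]

for score, idx in top_k(query_vec, corpus_vecs, k=2):
    print(f"corpus item {idx}: score={score:.2f}")
```

<p>A dedicated library such as Faiss becomes worthwhile when the corpus is too large for this kind of exhaustive comparison.</p>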
</section>
</section>
</section>
<section id="distance" class="level1">
<h1>Distance</h1>
<p>String distance is the task of quantifying the degree to which two supplied strings differ using a distance function. Currently, the <code>string2string</code> library offers the following distance functions:</p>
<ul>
<li>Levenshtein edit distance</li>
<li>Damerau-Levenshtein edit distance</li>
<li>Hamming distance</li>
<li>Jaccard distance<sup>3</sup></li>
</ul>
<section id="levenshtein-edit-distance" class="level2">
<h2 class="anchored" data-anchor-id="levenshtein-edit-distance">Levenshtein edit distance</h2>
<p>Levenshtein edit distance, or simply the edit distance, is the minimum number of single-character insertions, deletions, or substitutions needed to convert one string into another.</p>
<div id="stdout-distance-edit" class="cell" data-tags="[]" data-execution_count="8">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> string2string.distance <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> LevenshteinEditDistance</span>
<span id="cb12-2"></span>
<span id="cb12-3">edit_dist <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LevenshteinEditDistance()</span>
<span id="cb12-4"></span>
<span id="cb12-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create two strings</span></span>
<span id="cb12-6">s1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The beautiful cherry blossoms bloom in the spring time."</span></span>
<span id="cb12-7">s2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The beutiful cherry blosoms bloom in the spring time."</span></span>
<span id="cb12-8"></span>
<span id="cb12-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute the edit distance between the two strings</span></span>
<span id="cb12-10">edit_dist_score  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> edit_dist.compute(s1, s2, method<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dynamic-programming"</span>)</span>
<span id="cb12-11"></span>
<span id="cb12-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'The distance between the following two sentences is </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>edit_dist_score<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">:'</span>)</span>
<span id="cb12-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>s1<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"'</span>)</span>
<span id="cb12-14"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>s2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"'</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>The distance between the following two sentences is 2.0:
"The beautiful cherry blossoms bloom in the spring time."
"The beutiful cherry blosoms bloom in the spring time."</code></pre>
</div>
</div>
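<p>The dynamic-programming method mentioned above can be sketched in a few lines. This is a generic textbook version that keeps only two rows of the table, written for illustration; it is not the library's implementation, and the function name <code>levenshtein</code> is an assumption of this sketch.</p>

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, and substitution."""
    # prev[j] = distance between the prefix of a processed so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (ca != cb)  # free if the characters match
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


s1 = "The beautiful cherry blossoms bloom in the spring time."
s2 = "The beutiful cherry blosoms bloom in the spring time."
print(levenshtein(s1, s2))
```

<p>The two sentences differ by one missing “a” and one missing “s”, which corresponds to the two single-character edits reported above.</p>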
</section>
<section id="jaccard-index" class="level2">
<h2 class="anchored" data-anchor-id="jaccard-index">Jaccard Index</h2>
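<p>The Jaccard index can be used to quantify the similarity between sets of words or tokens and is commonly used in tasks such as document similarity or topic modeling. It is the size of the intersection of two sets divided by the size of their union, and the Jaccard distance is one minus this index. For example, the Jaccard index can be used to measure the overlap between the sets of words in two different documents or to identify the most similar topics across a collection of documents. Note that the example below prints the Jaccard <em>distance</em>.</p>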
<div id="stdout-distance-jaccard" class="cell" data-tags="[]" data-execution_count="9">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> string2string.distance <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> JaccardIndex </span>
<span id="cb14-2"></span>
<span id="cb14-3">jaccard_dist <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> JaccardIndex()</span>
<span id="cb14-4"></span>
<span id="cb14-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Suppose we have two documents</span></span>
<span id="cb14-6">doc1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"green"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"yellow"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"purple"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pink"</span>]</span>
<span id="cb14-7">doc2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"green"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"orange"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cyan"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>]</span>
<span id="cb14-8"></span>
<span id="cb14-9">jaccard_dist_score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> jaccard_dist.compute(doc1, doc2)</span>
<span id="cb14-10"></span>
<span id="cb14-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Jaccard distance between doc1 and doc2: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>jaccard_dist_score<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Jaccard distance between doc1 and doc2: 0.75</code></pre>
</div>
</div>
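<p>The value above can be checked by hand with Python sets. The following sketch uses hypothetical helper names (<code>jaccard_index</code> and <code>jaccard_distance</code>) and is independent of the library.</p>

```python
def jaccard_index(tokens_a, tokens_b):
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a.union(set_b)
    if not union:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(set_a.intersection(set_b)) / len(union)


def jaccard_distance(tokens_a, tokens_b):
    return 1.0 - jaccard_index(tokens_a, tokens_b)


doc1 = ["red", "green", "blue", "yellow", "purple", "pink"]
doc2 = ["green", "orange", "cyan", "red"]

print(f"Jaccard index: {jaccard_index(doc1, doc2):.2f}")
print(f"Jaccard distance: {jaccard_distance(doc1, doc2):.2f}")
```

<p>The two documents share 2 words (“red” and “green”) out of 8 distinct words in total, so the index is 2/8 = 0.25 and the distance is 0.75.</p>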
</section>
</section>
<section id="sec-similarity" class="level1">
<h1>Similarity</h1>
<p>To put it simply, string similarity determines the degree to which two strings of text (or sequences of characters) are linked or similar to one another. Take, as an example, the following pair of sentences:</p>
<ul>
<li>“The cat sat on the mat.”</li>
<li>“The cat was sitting on the rug.”</li>
</ul>
<p>Although not identical, these sentences share vocabulary and convey a related meaning. Methods based on string similarity analysis reveal and quantify the degree of similarity between such pairs of texts.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Duality
</div>
</div>
<div class="callout-body-container callout-body">
<p>There is a <em>duality</em> between string <em>similarity</em> and <em>distance</em> measures, meaning that they can be used interchangeably <span class="citation" data-cites="suzgun2023string2string">see [1]</span>.</p>
</div>
</div>
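<p>One simple way to see this duality is to rescale a bounded distance into a similarity between 0 and 1. The sketch below does this for the Hamming distance. The helper names and the normalization <code>1 - distance / length</code> are illustrative assumptions, not the library's API or the exact construction used in [1].</p>

```python
def hamming_distance(a, b):
    # Number of positions at which two equal-length strings differ
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def hamming_similarity(a, b):
    # Map the distance into [0, 1]: 1.0 means identical, 0.0 means all positions differ
    if not a:
        return 1.0
    return 1.0 - hamming_distance(a, b) / len(a)


print(hamming_distance("karolin", "kathrin"))
print(f"{hamming_similarity('karolin', 'kathrin'):.2f}")
```

<p>A small distance maps to a high similarity and vice versa, so either quantity can be used to rank candidate strings.</p>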
<p>The <code>similarity</code> module of the <code>string2string</code> library currently offers the following algorithms:</p>
<ul>
<li>Cosine similarity</li>
<li>BERTScore</li>
<li>BARTScore</li>
<li>Jaro similarity</li>
<li>LCSubsequence similarity</li>
</ul>
<p>Let’s go over an example of the BERTScore similarity algorithm with the following four sentences:</p>
<ol type="1">
<li>The bakery sells a variety of delicious pastries and bread.</li>
<li>The park features a playground, walking trails, and picnic areas.</li>
<li>The festival showcases independent movies from around the world.</li>
<li>A range of tasty bread and pastries are available at the bakery.</li>
</ol>
<p>Sentences 1 and 4 are semantically similar, as both are about a bakery and its pastries. Hence, we should expect a high similarity score between the two.</p>
<p>Let’s implement the above example using the library.</p>
<div id="code-similarity-bertscore-computation" class="cell" data-tags="[]" data-execution_count="10">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> string2string.similarity <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> BERTScore</span>
<span id="cb16-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> string2string.misc <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ModelEmbeddings</span>
<span id="cb16-3"></span>
<span id="cb16-4">bert_score <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> BERTScore(model_name_or_path<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bert-base-uncased"</span>)</span>
<span id="cb16-5">bart_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ModelEmbeddings(model_name_or_path<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"facebook/bart-large"</span>)</span>
<span id="cb16-6"></span>
<span id="cb16-7">sentences <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb16-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The bakery sells a variety of delicious pastries and bread."</span>, </span>
<span id="cb16-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The park features a playground, walking trails, and picnic areas."</span>, </span>
<span id="cb16-10">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The festival showcases independent movies from around the world."</span>, </span>
<span id="cb16-11">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A range of tasty bread and pastries are available at the bakery."</span>, </span>
<span id="cb16-12">]</span>
<span id="cb16-13"></span>
<span id="cb16-14">embeds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb16-15">    bart_model.get_embeddings(</span>
<span id="cb16-16">        sentence, embedding_type<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mean_pooling'</span></span>
<span id="cb16-17">    ) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> sentence <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> sentences</span>
<span id="cb16-18">]</span>
<span id="cb16-19"></span>
<span id="cb16-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define source and target sentences (to compute BERTScore for each pair)</span></span>
<span id="cb16-21">source_sentences, target_sentences <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [], []</span>
<span id="cb16-22"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(sentences)):</span>
<span id="cb16-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> j <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(sentences)):</span>
<span id="cb16-24">        source_sentences.append(sentences[i])</span>
<span id="cb16-25">        target_sentences.append(sentences[j])</span>
<span id="cb16-26"></span>
<span id="cb16-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># You can rewrite above in a more concise way using itertools.product</span></span>
<span id="cb16-28"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># from itertools import product</span></span>
<span id="cb16-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># source_sentences, target_sentences = map(</span></span>
<span id="cb16-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#   list, zip(*product(sentences, repeat=2))</span></span>
<span id="cb16-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># )</span></span>
<span id="cb16-32"></span>
<span id="cb16-33">bertscore_similarity_scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bert_score.compute(</span>
<span id="cb16-34">    source_sentences,</span>
<span id="cb16-35">    target_sentences,</span>
<span id="cb16-36">)</span>
<span id="cb16-37">bertscore_precision_scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bertscore_similarity_scores[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'precision'</span>].reshape(</span>
<span id="cb16-38">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(sentences), <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(sentences)</span>
<span id="cb16-39">)</span></code></pre></div></div>
</div>
<p>We can visualize the similarity between every pair of sentences using the <code>plot_heatmap()</code> function provided in the library.</p>
<div id="cell-fig-similarity-bertscore-heatmap" class="cell" data-tags="[]" data-execution_count="11">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb17-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> string2string.misc.plotting_functions <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> plot_heatmap</span>
<span id="cb17-2"></span>
<span id="cb17-3">plot_ticks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Sentence </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(sentences))]</span>
<span id="cb17-4"></span>
<span id="cb17-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We can also visualize the BERTScore similarity scores using a heatmap</span></span>
<span id="cb17-6">plot_heatmap(</span>
<span id="cb17-7">    bertscore_precision_scores,</span>
<span id="cb17-8">    title<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>,</span>
<span id="cb17-9">    x_ticks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plot_ticks,</span>
<span id="cb17-10">    y_ticks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plot_ticks,</span>
<span id="cb17-11">    x_label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>,</span>
<span id="cb17-12">    y_label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>,</span>
<span id="cb17-13">    valfmt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{x:.2f}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>,</span>
<span id="cb17-14">    cmap<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Blues"</span>,</span>
<span id="cb17-15">)</span></code></pre></div></div>
<div class="cell-output cell-output-display">
<div id="fig-similarity-bertscore-heatmap" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Semantic similarity (BERTScore) between sentences">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-similarity-bertscore-heatmap-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ealizadeh.com/blog/tutorial-string2string/index_files/figure-html/fig-similarity-bertscore-heatmap-output-1.png" alt="Semantic similarity (BERTScore) between sentences" width="518" height="456" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-similarity-bertscore-heatmap-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Semantic similarity (BERTScore) between sentences
</figcaption>
</figure>
</div>
</div>
</div>
<p>As the heatmap shows, sentences 1 and 4 score as much more similar to each other under BERTScore than to the other sentences, as we expected.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>📓 You can find the Jupyter notebook for this blog post on <a href="https://github.com/e-alizadeh/data-science-blog/blob/master/notebooks/string2string-tutorial.ipynb">GitHub</a>.</p>
</div>
</div>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>The <code>string2string</code> Python library is an open-source tool that provides a full set of efficient methods for string-to-string problems. In particular, the library has four main modules that address the following tasks: 1. <em>pairwise alignment</em> including both global and local alignments; 2. <em>distance measurement</em>; 3. <em>lexical and semantic search</em>; and 4. <em>similarity analysis</em>. The library offers various algorithms in each category and provides helpful visualization tools.</p>
<div class="callout callout-style-simple callout-note no-icon callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-5-contents" aria-controls="callout-5" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Update History
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-5" class="callout-5-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 10%">
<col style="width: 70%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Date</th>
<th style="text-align: left;">Sections</th>
<th style="text-align: left;">Changes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">June 7, 2023</td>
<td style="text-align: left;">-</td>
<td style="text-align: left;">Fixed a few spelling errors.</td>
</tr>
<tr class="even">
<td style="text-align: left;">May 15, 2023</td>
<td style="text-align: left;">5</td>
<td style="text-align: left;">Added commented-out code showing how to use the <code>product()</code> function from the <code>itertools</code> library.</td>
</tr>
<tr class="odd">
<td style="text-align: left;">May 15, 2023</td>
<td style="text-align: left;">-</td>
<td style="text-align: left;">Added a link to the published version of this post on Towards Data Science blog.</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-suzgun2023string2string" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">M. Suzgun, S. M. Shieber, and D. Jurafsky, <span>“string2string: A modern python library for string-to-string algorithms,”</span> 2023, Available: <a href="http://arxiv.org/abs/2304.14395">http://arxiv.org/abs/2304.14395</a></div>
</div>
<div id="ref-online:Essi2020dtw" class="csl-entry">
<div class="csl-left-margin">[2] </div><div class="csl-right-inline">E. Alizadeh, <span>“<span class="nocase">An Illustrative Introduction to Dynamic Time Warping</span>,”</span> 2020. <a href="https://ealizadeh.com/blog/introduction-to-dynamic-time-warping/">https://ealizadeh.com/blog/introduction-to-dynamic-time-warping/</a></div>
</div>
<div id="ref-johnson2019faiss" class="csl-entry">
<div class="csl-left-margin">[3] </div><div class="csl-right-inline">J. Johnson, M. Douze, and H. Jégou, <span>“Billion-scale similarity search with <span>GPUs</span>,”</span> <em>IEEE Transactions on Big Data</em>, vol. 7, no. 3, pp. 535–547, 2019.</div>
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Visit the library’s <a href="https://github.com/stanfordnlp/string2string">GitHub Page</a>.↩︎</p></li>
<li id="fn2"><p>A corpus (plural: corpora) is a large, structured collection of texts used for linguistic research, NLP, and ML applications.↩︎</p></li>
<li id="fn3"><p>Not to be confused with the Jaccard similarity coefficient: Jaccard distance = 1 - Jaccard coefficient.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2023,
  author = {Alizadeh, Esmaeil},
  title = {Taming {Text} with String2string: {A} {Powerful} {Python}
    {Library} for {String-to-String} {Algorithms}},
  date = {2023-05-11},
  url = {https://ealizadeh.com/blog/tutorial-string2string/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2023" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“Taming Text with string2string: A Powerful Python
Library for String-to-String Algorithms,”</span> May 11, 2023. <a href="https://ealizadeh.com/blog/tutorial-string2string/">https://ealizadeh.com/blog/tutorial-string2string/</a></div>
</div></div></section></div> ]]></description>
  <category>Data Science</category>
  <category>Machine Learning</category>
  <category>Natural Language Processing</category>
  <category>Python Library</category>
  <category>Similarity Analysis</category>
  <category>Semantic Search</category>
  <category>Lexical Search</category>
  <category>Text Mining</category>
  <category>Open-Source</category>
  <guid>https://ealizadeh.com/blog/tutorial-string2string/</guid>
  <pubDate>Thu, 11 May 2023 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/tutorial-string2string/img/_featured_image.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>The ABCs of Differential Privacy</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/abc-of-differential-privacy/</link>
  <description><![CDATA[ 






<p><img src="https://ealizadeh.com/blog/abc-of-differential-privacy/img/_featured_image.png" class="img-fluid"></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<strong><a href="https://medium.com/towards-data-science/abcs-of-differential-privacy-8dc709a3a6b3">Towards Data Science blog</a></strong>.</p>
</div>
</div>
<section id="introduction" class="level1">
<h1>Introduction</h1>
<p>Differential privacy (DP) is a rigorous mathematical framework that permits the analysis and manipulation of sensitive data while providing robust privacy guarantees.</p>
<p>DP is based on the premise that the inclusion or exclusion of a single individual should not significantly change the results of any analysis or query carried out on the dataset as a whole. In other words, an algorithm run on two datasets that differ only in that individual’s data should produce comparable results, making it difficult to infer anything distinctive about that individual. This guarantee keeps private information from leaking while still allowing useful insights to be drawn from the data.</p>
<p>Differential privacy was first introduced in the&nbsp;paper&nbsp;“Differential Privacy” by Cynthia Dwork <span class="citation" data-cites="dwork2006dp">[1]</span>, written while she was working at Microsoft Research.</p>
</section>
<section id="sec-dp-examples" class="level1">
<h1>Examples of How Differential Privacy Safeguards Data</h1>
<section id="example-1" class="level2">
<h2 class="anchored" data-anchor-id="example-1">Example 1</h2>
<p>In this part, I’ll use an example from <span class="citation" data-cites="wood2018dp_primer_nontechnical">[2]</span> to help you understand why differential privacy is important. In a study that looks at the link between social class and health results, researchers ask subjects for private information like where they live, how much money they have, and their medical background.</p>
<p>John, one of the participants, is worried that his personal information could leak and hurt his applications for life insurance or a mortgage. To address John’s concerns, the researchers can use differential privacy, which ensures that any shared data won’t reveal specific information about him. Differential privacy can be understood through John’s “opt-out” scenario, in which his data is left out of the study. This protects his privacy because the analysis’ results are not tied to any of his personal details.</p>
<p>Differential privacy seeks to protect privacy in the real world as if the data were being looked at in an opt-out situation. Since John’s data is not part of the computation, the results regarding him can only be as accurate as the data available to everyone else.</p>
<p>A precise description of differential privacy requires formal mathematical language and technical concepts, but the basic concept is to protect the privacy of individuals by limiting the information that can be obtained about them from the released data, thereby ensuring that their sensitive information remains private.</p>
</section>
<section id="example-2" class="level2">
<h2 class="anchored" data-anchor-id="example-2">Example 2</h2>
<p>The U.S. Census Bureau used a differential privacy framework as a part of its disclosure avoidance strategy to strike a compromise between the data collection and reporting needs and the privacy concerns of the respondents. Check <span class="citation" data-cites="online:USbureau2023dp">[3]</span> to find more information about the confidentiality protection provided by the U.S. Census Bureau. Moreover, <span class="citation" data-cites="online:garfinkel2022dp_and_2020_us_census">[4]</span> explains how DP was applied to the <a href="https://mit-serc.pubpub.org/pub/differential-privacy-2020-us-census">2020 US Census data</a>.</p>
</section>
</section>
<section id="definition-and-key-concepts" class="level1">
<h1>Definition and key concepts</h1>
<section id="the-meanining-of-differential-within-the-realm-of-dp" class="level2">
<h2 class="anchored" data-anchor-id="the-meanining-of-differential-within-the-realm-of-dp">The meaning of “Differential” within the realm of DP</h2>
<p>The term “differential” privacy refers to its emphasis on the dissimilarity between the results produced by a privacy-preserving algorithm on two datasets that differ by just one individual’s data.</p>
</section>
<section id="mechanism-m" class="level2">
<h2 class="anchored" data-anchor-id="mechanism-m">Mechanism <img src="https://latex.codecogs.com/png.latex?M"></h2>
<p>A <em>mechanism</em> <img src="https://latex.codecogs.com/png.latex?M"> is a mathematical method or process that is used on the data to make sure privacy is maintained while still giving useful information.</p>
</section>
<section id="epsilon-epsilon" class="level2">
<h2 class="anchored" data-anchor-id="epsilon-epsilon">Epsilon (<img src="https://latex.codecogs.com/png.latex?%5Cepsilon">)</h2>
<p><img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> is a privacy parameter that controls the level of privacy given by a differentially private mechanism. In other words, <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> regulates how much the output of the mechanism can vary between two neighboring databases and measures how much privacy is lost when the mechanism is run on the database <span class="citation" data-cites="online:brubaker2021dp_tutorial">[5]</span>.</p>
<p>Stronger privacy guarantees are provided by a smaller <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">, but the output may be less useful as a result <span class="citation" data-cites="dwork2014dp_algorithmic_foundations">[6]</span>. <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> controls the amount of noise added to the data and shows how much the output probability distribution can change when the data of a single person is altered.</p>
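<p>The trade-off can be seen numerically with the Laplace mechanism, a standard way to achieve ε-DP for numeric queries by adding noise with scale <code>sensitivity / epsilon</code>. The sketch below is illustrative only: the function names, the true count of 100, and the number of trials are assumptions of mine, and a production system should rely on a vetted DP library rather than hand-rolled sampling.</p>

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise; a counting query has sensitivity 1."""
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 5000, seed: int = 0) -> float:
    """Average distance between the noisy release and a true count of 100."""
    rng = random.Random(seed)
    total = sum(abs(private_count(100, epsilon, rng) - 100) for _ in range(trials))
    return total / trials


for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: mean absolute error ~ {mean_abs_error(eps):.2f}")
```

<p>The error shrinks as ε grows: a smaller ε buys stronger privacy at the price of a noisier, less useful answer.</p>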
</section>
<section id="delta-delta" class="level2">
<h2 class="anchored" data-anchor-id="delta-delta">Delta (<img src="https://latex.codecogs.com/png.latex?%5Cdelta">)</h2>
<p><img src="https://latex.codecogs.com/png.latex?%5Cdelta"> is an additional privacy parameter that bounds how likely it is that the <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> guarantee fails. In other words, <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> controls the probability of an extreme privacy breach, where the added noise (controlled by <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">) does not provide sufficient protection.</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cdelta"> is a non-negative number that bounds the chance of such a breach. It is usually very small and close to zero. This relaxation makes it easier to run more complex analyses and train machine learning models while still protecting privacy <span class="citation" data-cites="dwork2014dp_algorithmic_foundations">see [6]</span>.</p>
<p>If <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> is low, there is less of a chance that someone’s privacy will be compromised. But this comes at a cost. If <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> is too small, more noise might be introduced into the data, diminishing the quality of the end result. <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> is one parameter to consider, but it must be balanced against <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> and the data’s utility.</p>
</section>
<section id="sec-dp-math-definition" class="level2">
<h2 class="anchored" data-anchor-id="sec-dp-math-definition">Unveiling the Mathematics behind Differential Privacy</h2>
<p>Consider two databases, <img src="https://latex.codecogs.com/png.latex?D"> and <img src="https://latex.codecogs.com/png.latex?D'">, that differ by only one record.</p>
<p>Formally, a mechanism <img src="https://latex.codecogs.com/png.latex?M"> is <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-differentially private if, for any two adjacent datasets <img src="https://latex.codecogs.com/png.latex?D"> and <img src="https://latex.codecogs.com/png.latex?D'">, and for any set of possible outputs <img src="https://latex.codecogs.com/png.latex?O">, the following holds:</p>
<p><span id="eq-dp-classic-def"><img src="https://latex.codecogs.com/png.latex?%0A%20%20%20%20%5Ctext%7BPr%7D%5B%5Ctext%7BM%7D(D)%20%5Cin%20O%5D%20%5Cleq%20%5Cexp(%5Cepsilon)%20%5Ctimes%20%5Ctext%7BPr%7D%5B%5Ctext%7BM%7D(D')%20%5Cin%20O%5D%0A%5Ctag%7B1%7D"></span></p>
<p>However, we can reframe Equation&nbsp;1 in terms of divergences, resulting in Equation&nbsp;2.</p>
<p><span id="eq-dp-divergence"><img src="https://latex.codecogs.com/png.latex?%0A%20%20%20%20%5Ctext%7Bdiv%7D%5BM(D)%20%5C,%20%7C%7C%20%5C,%20M(D')%5D%20%5Cleq%20%5Cepsilon%0A%5Ctag%7B2%7D"></span></p>
<div id="fig-dp-divergence" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" alt="Differential privacy and divergence.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-dp-divergence-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://ealizadeh.com/blog/abc-of-differential-privacy/img/divergence_and_dp.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Differential privacy and divergence.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-dp-divergence-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Differential privacy in the context of divergences
</figcaption>
</figure>
</div>
<p>Here <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bdiv%7D%5B%5Ccdot%20%5C,%20%7C%7C%20%5C,%20%5Ccdot%5D"> denotes the Rényi divergence.<sup>1</sup></p>
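<p>A small worked example may help with Equation&nbsp;1. The sketch below uses randomized response (report the true yes/no answer with probability 0.75, otherwise the opposite), which is my choice of toy mechanism and not one taken from the text. For a discrete mechanism, the worst case over output sets is reached on a single output, so the smallest ε that satisfies Equation&nbsp;1 is the largest absolute log-ratio of output probabilities. If the divergence in Equation&nbsp;2 is read as the max-divergence (Rényi divergence of order infinity), that is the quantity it bounds.</p>

```python
import math


def randomized_response(truth: bool, p_truth: float = 0.75) -> dict:
    """Output distribution when the true answer is reported with prob p_truth."""
    p_yes = p_truth if truth else 1 - p_truth
    return {"yes": p_yes, "no": 1 - p_yes}


def max_log_ratio(p: dict, q: dict) -> float:
    """Smallest epsilon for which Equation 1 holds in both directions."""
    return max(abs(math.log(p[o] / q[o])) for o in p)


dist_d = randomized_response(True)         # M(D): the individual's answer is "yes"
dist_d_prime = randomized_response(False)  # M(D'): the individual's answer is "no"
epsilon = max_log_ratio(dist_d, dist_d_prime)
print(epsilon, math.log(3))
```

<p>Both printed values agree: with a 0.75 truth probability, the outputs on neighboring inputs differ by at most a factor of 3, so this mechanism is ln(3)-differentially private.</p>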
<section id="epsilon-delta-dp-definition" class="level3">
<h3 class="anchored" data-anchor-id="epsilon-delta-dp-definition"><img src="https://latex.codecogs.com/png.latex?(%5Cepsilon,%20%5Cdelta)">-DP Definition</h3>
<p>A randomized mechanism <img src="https://latex.codecogs.com/png.latex?M"> is considered <img src="https://latex.codecogs.com/png.latex?(%5Cepsilon,%20%5Cdelta)">-differentially private if the probability of a significant privacy breach (i.e., a breach that would not occur under <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-differential privacy) is no more than <img src="https://latex.codecogs.com/png.latex?%5Cdelta">. More formally, a mechanism <img src="https://latex.codecogs.com/png.latex?M"> is <img src="https://latex.codecogs.com/png.latex?(%5Cepsilon,%20%5Cdelta)">-differentially private if</p>
<p><span id="eq-dp-epsilon-delta-definition"><img src="https://latex.codecogs.com/png.latex?%0A%20%20%20%20%5Ctext%7BPr%7D%5B%5Ctext%7BM%7D(D)%20%5Cin%20O%5D%20%5Cleq%20%5Cexp(%5Cepsilon)%20%5Ctimes%20%5Ctext%7BPr%7D%5B%5Ctext%7BM%7D(D')%20%5Cin%20O%5D%20+%20%5Cdelta%0A%5Ctag%7B3%7D"></span></p>
<p>If <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20=%200">, then <img src="https://latex.codecogs.com/png.latex?(%5Cepsilon,%20%5Cdelta)">-DP reduces to <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-DP.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>An <img src="https://latex.codecogs.com/png.latex?(%5Cepsilon,%20%5Cdelta)">-DP mechanism may be thought of informally as satisfying <img src="https://latex.codecogs.com/png.latex?%5Cepsilon">-DP with probability <img src="https://latex.codecogs.com/png.latex?1%20-%20%5Cdelta">.</p>
</div>
</div>
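<p>The informal reading in the tip can be checked on a toy mechanism. In the sketch below, which is a hypothetical construction of mine, the mechanism publishes the raw yes/no answer with probability δ and otherwise falls back to randomized response with a 0.75 truth probability. The helper <code>satisfies_dp</code> brute-forces Equation&nbsp;3 over every set of outputs, which is only feasible because there are four possible outputs.</p>

```python
import math
from itertools import chain, combinations


def satisfies_dp(p: dict, q: dict, epsilon: float, delta: float = 0.0) -> bool:
    """Check Equation 3 for every non-empty set of outputs, in both directions."""
    outputs = list(p)
    subsets = chain.from_iterable(
        combinations(outputs, size) for size in range(1, len(outputs) + 1)
    )
    tolerance = 1e-12  # absorb floating-point rounding
    for subset in subsets:
        p_s = sum(p[o] for o in subset)
        q_s = sum(q[o] for o in subset)
        if p_s > math.exp(epsilon) * q_s + delta + tolerance:
            return False
        if q_s > math.exp(epsilon) * p_s + delta + tolerance:
            return False
    return True


def leaky_mechanism(truth: bool, delta: float) -> dict:
    """With prob delta publish the raw answer; otherwise use randomized response."""
    p_yes = 0.75 if truth else 0.25
    return {
        "leak:yes": delta if truth else 0.0,
        "leak:no": 0.0 if truth else delta,
        "yes": (1 - delta) * p_yes,
        "no": (1 - delta) * (1 - p_yes),
    }


delta = 0.01
dist_d, dist_d_prime = leaky_mechanism(True, delta), leaky_mechanism(False, delta)
print(satisfies_dp(dist_d, dist_d_prime, math.log(3)))         # pure epsilon-DP
print(satisfies_dp(dist_d, dist_d_prime, math.log(3), delta))  # (epsilon, delta)-DP
```

<p>The first check fails because the leaked outputs can never occur on the neighboring dataset, so no finite ε covers them. The second check passes once δ absorbs that rare failure.</p>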
<hr>
</section>
</section>
</section>
<section id="properties-of-differential-privacy" class="level1">
<h1>Properties of Differential Privacy</h1>
<section id="post-processing-immunity" class="level2">
<h2 class="anchored" data-anchor-id="post-processing-immunity">Post-processing immunity</h2>
<p>The differentially private output can be subjected to any function or analysis, and the outcome will continue to uphold the original privacy assurances. For instance, if you apply a differentially private mechanism to a dataset and then take the average age of the individuals in the dataset, the resulting average age will still be differentially private and will provide the same level of privacy assurances as the output it was originally designed to provide.</p>
<p>Thanks to the post-processing feature, we can use differentially private mechanisms in the same way as generic ones. Hence, it is possible to combine several differentially private mechanisms without sacrificing the integrity of differential privacy.</p>
</section>
<section id="composition" class="level2">
<h2 class="anchored" data-anchor-id="composition">Composition</h2>
<p>When multiple differentially private techniques are used on the same data or when queries are combined, composition is the property that ensures the privacy guarantees of differential privacy still apply. Composition can be either sequential or parallel. If you apply two mechanisms, <img src="https://latex.codecogs.com/png.latex?M_1"> satisfying <img src="https://latex.codecogs.com/png.latex?%5Cepsilon_1">-DP and <img src="https://latex.codecogs.com/png.latex?M_2"> satisfying <img src="https://latex.codecogs.com/png.latex?%5Cepsilon_2">-DP, to the same dataset, then the composition of <img src="https://latex.codecogs.com/png.latex?M_1"> and <img src="https://latex.codecogs.com/png.latex?M_2"> satisfies <img src="https://latex.codecogs.com/png.latex?(%5Cepsilon_1%20+%20%5Cepsilon_2)">-DP.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>Although composition preserves differential privacy, the composition theorem makes clear that there is a cost: the total <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> grows, and with it the amount of privacy lost, every time a new mechanism is applied. If <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> becomes too large, then differential privacy guarantees are mostly meaningless <span class="citation" data-cites="online:brubaker2021dp_tutorial">[5]</span>.</p>
</div>
</div>
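<p>Sequential composition is often tracked as a privacy budget. The sketch below is one simple way to do that bookkeeping, assuming basic sequential composition where the epsilons simply add up; the <code>PrivacyBudget</code> class and its method names are hypothetical, not part of any particular library.</p>

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Each mechanism run on the same data adds its epsilon to the total.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise ValueError("privacy budget exceeded")
        self.spent += epsilon
        return self.total_epsilon - self.spent  # remaining budget


budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)  # first query, 0.4-DP
budget.charge(0.5)  # second query, 0.5-DP; together they are 0.9-DP
print(round(budget.spent, 2))
```

<p>Refusing further queries once the budget is spent is one way to keep the overall epsilon from growing to the point where the guarantee stops being meaningful.</p>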
</section>
<section id="robustness-to-auxiliary-information" class="level2">
<h2 class="anchored" data-anchor-id="robustness-to-auxiliary-information">Robustness to auxiliary information</h2>
<p>Differential privacy is robust to attackers with auxiliary information, which means that even if an attacker has access to other relevant data, a DP output will not let them learn significantly more about any one person. For instance, if a hospital were to share differentially private information regarding individuals’ medical situations, an attacker with access to other medical records would not be able to greatly increase their knowledge of a given patient from the published numbers.</p>
<hr>
</section>
</section>
<section id="common-misunderstandings" class="level1">
<h1>Common Misunderstandings</h1>
<p>The notion of differential privacy has been misunderstood in several publications, especially during its early days. Dwork et al. <span class="citation" data-cites="dwork2011dp_primer_for_perplexed">[7]</span> wrote a short paper to correct some widespread misunderstandings. Here are a few examples:</p>
<ol type="1">
<li>DP is not an algorithm but rather a definition. DP is a mathematical guarantee that an algorithm must meet in order to disclose statistics about a dataset. Several distinct algorithms meet the criteria.</li>
<li>Various algorithms can be differentially private while still meeting various requirements. If someone claims that differential privacy, a specific requirement on ratios of probability distributions, is incompatible with any accuracy target, they must provide evidence for that claim. This means proving that there is no way a DP algorithm can perform to some specified standard. It’s challenging to come up with that proof, and our first guesses about what is and isn’t feasible are often off.</li>
<li>There are no “good” or “bad” results for any given database. Generating the outputs in a way that preserves privacy (perfect or differential) is the key.</li>
</ol>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>DP has proven to be a viable paradigm for protecting data privacy, which is particularly important now that machine learning and big data are becoming more widespread. Several key concepts were covered in this essay, including the DP privacy parameters <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> and <img src="https://latex.codecogs.com/png.latex?%5Cdelta">. In addition, we provided several mathematical definitions of DP. We also explained key properties of DP and addressed some of the most common misconceptions.</p>
<div class="callout callout-style-simple callout-note no-icon callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-4-contents" aria-controls="callout-4" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Update History
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-4" class="callout-4-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 10%">
<col style="width: 70%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Date</th>
<th style="text-align: left;">Sections</th>
<th style="text-align: left;">Changes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">April 29, 2023</td>
<td style="text-align: left;">3.5</td>
<td style="text-align: left;">Created a new Figure&nbsp;1 (replacing an image previously adapted from <span class="citation" data-cites="online:brubaker2021dp_tutorial">[5]</span>)</td>
</tr>
<tr class="even">
<td style="text-align: left;">April 29, 2023</td>
<td style="text-align: left;">2</td>
<td style="text-align: left;">Rearranged examples (added more on Example 2)</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-dwork2006dp" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">C. Dwork, <span>“Differential privacy,”</span> in <em>Proceedings of the 33rd international colloquium on automata, languages and programming</em>, Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 1–12. doi: <a href="https://doi.org/10.1007/11787006_1">10.1007/11787006_1</a>.</div>
</div>
<div id="ref-wood2018dp_primer_nontechnical" class="csl-entry">
<div class="csl-left-margin">[2] </div><div class="csl-right-inline">A. Wood <em>et al.</em>, <span>“Differential <span>Privacy</span>: <span>A Primer</span> for a <span>Non-Technical Audience</span>,”</span> <em>Vand. J. Ent. &amp; Tech. L.</em>, vol. 21, no. 1, pp. 209–276, 2018, doi: <a href="https://doi.org/10.2139/ssrn.3338027">10.2139/ssrn.3338027</a>.</div>
</div>
<div id="ref-online:USbureau2023dp" class="csl-entry">
<div class="csl-left-margin">[3] </div><div class="csl-right-inline">U.S. Census Bureau, <span>“Confidentiality <span>Protection</span> in the 2020 <span>U</span>.<span>S</span>. <span>Census</span> of <span>Population</span> and <span>Housing</span>.”</span> Jun. 09, 2022. Available: <a href="https://www.census.gov/library/working-papers/2022/adrm/CED-WP-2022-003.html">https://www.census.gov/library/working-papers/2022/adrm/CED-WP-2022-003.html</a></div>
</div>
<div id="ref-online:garfinkel2022dp_and_2020_us_census" class="csl-entry">
<div class="csl-left-margin">[4] </div><div class="csl-right-inline">S. Garfinkel, <span>“Differential <span>Privacy</span> and the 2020 <span>US</span> <span>Census</span>,”</span> <em>MIT Case Studies in Social and Ethical Responsibilities of Computing</em>. MIT Schwarzman College of Computing, 2022. Available: <a href="https://mit-serc.pubpub.org/pub/differential-privacy-2020-us-census">https://mit-serc.pubpub.org/pub/differential-privacy-2020-us-census</a></div>
</div>
<div id="ref-online:brubaker2021dp_tutorial" class="csl-entry">
<div class="csl-left-margin">[5] </div><div class="csl-right-inline">M. Brubaker and S. Prince, Borealis AI, Feb. 10, 2021. Available: <a href="https://www.borealisai.com/research-blogs/tutorial-12-differential-privacy-i-introduction/">https://www.borealisai.com/research-blogs/tutorial-12-differential-privacy-i-introduction/</a></div>
</div>
<div id="ref-dwork2014dp_algorithmic_foundations" class="csl-entry">
<div class="csl-left-margin">[6] </div><div class="csl-right-inline">C. Dwork and A. Roth, <span>“The algorithmic foundations of differential privacy,”</span> <em>Foundations and Trends in Theoretical Computer Science</em>, vol. 9, no. 3–4, pp. 211–407, 2014.</div>
</div>
<div id="ref-dwork2011dp_primer_for_perplexed" class="csl-entry">
<div class="csl-left-margin">[7] </div><div class="csl-right-inline">C. Dwork, F. McSherry, K. Nissim, and A. Smith, <span>“Differential <span>P</span>rivacy — <span>A</span> <span>P</span>rimer for the <span>P</span>erplexed,”</span> <em>Joint UNECE/Eurostat work session on statistical data confidentiality</em>, vol. 11, 2011.</div>
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>See the paper <a href="https://arxiv.org/abs/1702.07476">Renyi Differential Privacy</a> by Ilya Mironov for more information.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2023,
  author = {Alizadeh, Esmaeil},
  title = {The {ABCs} of {Differential} {Privacy}},
  date = {2023-04-27},
  url = {https://ealizadeh.com/blog/abc-of-differential-privacy/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2023" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“The ABCs of Differential Privacy,”</span> Apr. 27,
2023. <a href="https://ealizadeh.com/blog/abc-of-differential-privacy/">https://ealizadeh.com/blog/abc-of-differential-privacy/</a></div>
</div></div></section></div> ]]></description>
  <category>Differential Privacy</category>
  <category>Machine Learning</category>
  <category>Data Privacy</category>
  <category>Privacy-Preserving Algorithm</category>
  <guid>https://ealizadeh.com/blog/abc-of-differential-privacy/</guid>
  <pubDate>Thu, 27 Apr 2023 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/abc-of-differential-privacy/img/_featured_image.png" medium="image" type="image/png" height="82" width="144"/>
</item>
<item>
  <title>What K is in KNN and K-Means</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/knn-and-kmeans/</link>
  <description><![CDATA[ 






<section id="introduction" class="level1">
<h1>Introduction</h1>
<p>In this post, we will go over two popular machine learning algorithms: <em>K</em>-Nearest Neighbors (aka <em>K</em>NN) and <em>K</em>-Means, and what <em>K</em> stands for in each algorithm. An overview of both popular ML techniques (including a visual illustration) will be provided.</p>
<p>By the end of this post, we will be able to answer the following questions:</p>
<ul>
<li>What’s the difference between <em>K</em>NN and <em>K</em>-Means?</li>
<li>What does <em>K</em> mean in <em>K</em>NN and K-Means?</li>
<li>What is a <em>nonparametric</em> model?</li>
<li>What is a <em>lazy learner</em> model?</li>
<li>What is <em>within-cluster sum of squares</em>, WCSS (aka intracluster inertia/distance, within-cluster variance)?</li>
<li>How to determine the best value of <em>K</em> in <em>K</em>-Means?</li>
<li>What are the pros and cons of <em>K</em>NN?</li>
<li>What are the pros and cons of <em>K</em>-Means?</li>
</ul>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 The goal of this post is not to compare <em>K</em>NN and <em>K</em>-Means as each one addresses a different problem. Hence, comparing them is like comparing apples to oranges.</p>
</div>
</div>
</section>
<section id="k-nearest-neighbor-knn" class="level1">
<h1>K-Nearest Neighbor (KNN)</h1>
<p>KNN is a <em>nonparametric lazy supervised learning algorithm</em> mostly used for classification problems. There is a lot to unpack there, but the two main properties of KNN that you need to know are:</p>
<ul>
<li>KNN is a nonparametric algorithm, meaning that the model does not make any assumption regarding the distribution of the underlying data <span class="citation" data-cites="online_javatpoint_knn">see [1]</span>.</li>
<li>KNN is a lazy learner technique, meaning that the algorithm does not learn a discriminative function from the training dataset. Instead, it stores (memorizes) the training dataset, so, technically, a lazy learner algorithm doesn’t have a training step, and it delays the data abstraction until it’s asked to make a prediction <span class="citation" data-cites="online_rasbt_lazy_knn">see [2]</span>.</li>
</ul>
<section id="what-k-in-k-nn-stands-for" class="level3">
<h3 class="anchored" data-anchor-id="what-k-in-k-nn-stands-for">What does <em>K</em> in <em>K</em>-NN stand for?</h3>
<p><em>K</em> in <em>K</em>-Nearest Neighbors refers to the number of neighbors that are taken into consideration when predicting (voting for) the class of a new point. This becomes clearer in the example below.</p>
</section>
<section id="an-illustration-of-k-nn" class="level2">
<h2 class="anchored" data-anchor-id="an-illustration-of-k-nn">An Illustration of <em>K</em>-NN</h2>
<p>As I mentioned earlier, <em>K</em>NN is a supervised learning technique, so we should have a labeled dataset. Let’s say we have two classes, as can be seen in the image below: Class A (blue points) and Class B (green points). A new data point (red) is given to us and we want to predict whether the new point belongs to Class A or Class B.</p>
<p>Let’s first try <em>K</em> = 3. In this case, we have to find the three closest data points (aka three nearest neighbors) to the new (red) data point. As can be seen on the left side, two of the three closest neighbors belong to Class B (green) and one belongs to Class A (blue). So, we should assign the new point to Class B.</p>
<p><img src="https://ealizadeh.com/blog/knn-and-kmeans/img/202203082151_KNN.png" class="img-fluid"></p>
<p>Now let’s set <em>K</em> = 5 (right side of the image above). In this case, three out of the closest five points belong to Class A, so the new point should be classified as Class A. Unfortunately, there is no specific way of determining <em>K</em>, so we have to try a few values. Very low values of <em>K</em>, like 1 or 2, may make the model very complex and sensitive to outliers. A common value for <em>K</em> is 5 <span class="citation" data-cites="online_javatpoint_knn">see [1]</span>.</p>
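<p>The voting procedure can be sketched in a few lines of Python. The coordinates below are made up to mimic the illustration (they are not read off the figure), and <code>knn_predict</code> is a hypothetical helper using Euclidean distance and a simple majority vote.</p>

```python
import math
from collections import Counter


def knn_predict(points, labels, query, k):
    # Sort the labeled points by Euclidean distance to the query point.
    by_distance = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up training data: two Class B points close by, Class A points a bit further out.
points = [(1.0, 0.0), (0.0, 1.5), (-1.2, 0.0), (0.0, -2.0), (2.5, 0.0), (5.0, 5.0)]
labels = ["B", "B", "A", "A", "A", "B"]
new_point = (0.0, 0.0)

print(knn_predict(points, labels, new_point, k=3))
print(knn_predict(points, labels, new_point, k=5))
```

<p>With this toy data, the prediction changes between <em>K</em> = 3 and <em>K</em> = 5, which is why several values of <em>K</em> are usually compared.</p>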
</section>
<section id="pros-and-cons" class="level2">
<h2 class="anchored" data-anchor-id="pros-and-cons">Pros and Cons</h2>
<p>Following are the advantages and drawbacks of KNN <span class="citation" data-cites="online_tutorialspoint_knn">see [3]</span>:</p>
<p><strong>Pros</strong></p>
<ul>
<li>Useful for nonlinear data because KNN is a&nbsp;nonparametric&nbsp;algorithm.</li>
<li>Can be used for both&nbsp;classification&nbsp;and&nbsp;regression&nbsp;problems, even though mostly used for classification.</li>
</ul>
<p><strong>Cons</strong></p>
<ul>
<li>Difficult to choose <em>K</em> since there is no statistical way to determine it.</li>
<li>Slow prediction for large datasets.</li>
<li>Memory-intensive since it has to store all the training data (lazy learner).</li>
<li>Sensitive to non-normalized data (features on very different scales).</li>
<li>Sensitive to presence of irrelevant features.</li>
</ul>
<hr>
</section>
</section>
<section id="k-means" class="level1">
<h1><em>K</em>-Means</h1>
<p><em>K</em>-Means (aka <em>K</em>-Means clustering) is an unsupervised learning algorithm&nbsp;that divides unlabeled data into different groups (or clusters). <em>K</em> in <em>K</em>-means refers to the number of clusters/groups (a cluster is a group of similar observations/records). For instance, in the following example, the unlabeled dataset is grouped into a different number of clusters depending on the value of <em>K</em>.</p>
<p><img src="https://ealizadeh.com/blog/knn-and-kmeans/img/202203092227_K-Means.png" class="img-fluid"></p>
<p>K-Means minimizes the <em>within-cluster</em> <em>sum of squares</em>, <strong>WCSS</strong> (aka intracluster inertia/distance, within-cluster variance). To put it simply, K-Means minimizes the sum of squared differences between data points and the mean of the assigned cluster <span class="citation" data-cites="online_tds_helm_kmeans">see [4]</span>.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BWCSS%7D_%7Bk%7D%20=%20%5Csum%5Climits_%7Bx%20%5Cin%20k%7D%7C%7Cx%20-%20%5Coverline%7Bx%7D%7C%7C%5E%7B2%7D%0A"></p>
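<p>As a small worked example of the formula above, the following Python sketch computes the WCSS of a single cluster. The function name <code>wcss</code> and the sample points are hypothetical and only meant to illustrate the calculation.</p>

```python
def wcss(cluster):
    # Centroid: the per-dimension mean of the points assigned to the cluster.
    dims = len(cluster[0])
    centroid = [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]
    # Sum of squared Euclidean distances from each point to the centroid.
    return sum(
        sum((p[d] - centroid[d]) ** 2 for d in range(dims))
        for p in cluster
    )


cluster = [(1.0, 1.0), (3.0, 1.0)]  # centroid is (2.0, 1.0)
print(wcss(cluster))
```

<p>Each of the two points lies one unit from the centroid, so the cluster contributes 1 + 1 = 2 to the total that <em>K</em>-Means tries to minimize.</p>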
<section id="how-to-find-the-best-k" class="level2">
<h2 class="anchored" data-anchor-id="how-to-find-the-best-k">How to find the best <strong><em>K</em></strong>?</h2>
<p>There are several ways to determine <em>K</em> in the <em>K</em>-Means clustering algorithm:</p>
<ul>
<li><strong>Elbow Method</strong>: A common way to determine the ideal number of clusters (<em>K</em>) in <em>K</em>-means. In this approach, we run <em>K</em>-means with several candidate values of <em>K</em> and calculate the WCSS for each. The best <em>K</em> is selected based on a trade-off between model complexity (overfitting) and the WCSS: it is typically the point (the “elbow”) after which adding more clusters reduces the WCSS only marginally.</li>
<li><strong>Silhouette Score</strong>: A score between -1 and 1 measuring the similarity among points of a cluster and comparing that with other clusters. A score close to -1 indicates that a point is in the wrong cluster, whereas a score close to 1 indicates that the point is in the right cluster <span class="citation" data-cites="online_tds_helm_kmeans">see [4]</span>.</li>
<li><strong>Gap Statistic</strong>: A method to estimate the number of clusters in a dataset. The gap statistic compares the change in within-cluster variation in the output of any clustering technique with that expected under a reference null distribution <span class="citation" data-cites="tibshirani2001gap_statistics">see [5]</span>.</li>
</ul>
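<p>To illustrate the elbow method, here is a rough sketch in Python. It uses a tiny one-dimensional <em>K</em>-Means with a deterministic start, written only for this example (in practice you would use a library implementation); <code>kmeans_1d</code>, <code>total_wcss</code>, and the toy data are all assumptions.</p>

```python
def kmeans_1d(points, k, iterations=20):
    """Tiny 1-D K-Means with a deterministic start (illustration only)."""
    pts = sorted(points)
    # Spread the initial centroids evenly across the sorted data.
    centroids = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
    return clusters


def total_wcss(clusters):
    total = 0.0
    for c in clusters:
        if c:
            mean = sum(c) / len(c)
            total += sum((p - mean) ** 2 for p in c)
    return total


# Toy data with three visible groups around 1, 5, and 9.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
wcss_by_k = {k: total_wcss(kmeans_1d(data, k)) for k in range(1, 6)}
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

<p>On this toy data, the WCSS falls sharply up to <em>K</em> = 3 and only marginally afterwards, so the elbow suggests three clusters.</p>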
<p>We usually normalize/standardize continuous variables in the data preprocessing stage in order to avoid variables with much larger values dominating any modeling or analysis process <span class="citation" data-cites="bruce2017practical">see [6]</span>.</p>
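<p>One common way to do this is z-score standardization. The sketch below uses Python’s standard <code>statistics</code> module; the <code>standardize</code> helper and the income and age columns are made up for illustration.</p>

```python
import statistics


def standardize(values):
    # z-score: subtract the mean and divide by the (population) standard deviation.
    # Assumes the values are not all identical, otherwise the deviation is zero.
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


incomes = [30_000, 45_000, 60_000]  # large raw values
ages = [25, 35, 45]                 # small raw values

print([round(z, 3) for z in standardize(incomes)])
print([round(z, 3) for z in standardize(ages)])
```

<p>After standardization, both variables are on the same scale, so neither dominates the distance calculations in <em>K</em>-Means.</p>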
</section>
<section id="pros-and-cons-1" class="level2">
<h2 class="anchored" data-anchor-id="pros-and-cons-1">Pros and Cons</h2>
<p>Some pros and cons of <em>K</em>-Means are given below.</p>
<p><strong>Pros</strong></p>
<ul>
<li>High scalability since most of the calculations can be run in parallel.</li>
</ul>
<p><strong>Cons</strong></p>
<ul>
<li>The outliers can skew the centroids of clusters.</li>
<li>Poor performance on high-dimensional data.</li>
</ul>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>A few takeaways from this post:</p>
<ul>
<li><em>K</em>NN is a supervised learning algorithm mainly used for classification problems, whereas <em>K</em>-Means (aka <em>K</em>-means clustering) is an unsupervised learning algorithm.</li>
<li><em>K</em> in <em>K</em>-Means refers to the number of clusters, whereas <em>K</em> in <em>K</em>NN is the number of nearest neighbors (based on the chosen distance metric).</li>
<li><em>K</em> in <em>K</em>NN is determined by comparing the performance of algorithm using different values for <em>K</em>.</li>
<li>There are a few ways to determine the number of groups/clusters in a dataset for <em>K</em>-means clustering:
<ul>
<li>Elbow Method,</li>
<li>Silhouette Score,</li>
<li>gap statistic.</li>
</ul></li>
</ul>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-online_javatpoint_knn" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">JavaTpoint, <span>“K-nearest neighbor(KNN) algorithm for machine learning.”</span> N/A. Accessed: Apr. 10, 2022. [Online]. Available: <a href="https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning">https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning</a></div>
</div>
<div id="ref-online_rasbt_lazy_knn" class="csl-entry">
<div class="csl-left-margin">[2] </div><div class="csl-right-inline">S. Raschka, <span>“Why is nearest neighbor a lazy algorithm?”</span> N/A. Accessed: Apr. 10, 2022. [Online]. Available: <a href="https://sebastianraschka.com/faq/docs/lazy-knn.html">https://sebastianraschka.com/faq/docs/lazy-knn.html</a></div>
</div>
<div id="ref-online_tutorialspoint_knn" class="csl-entry">
<div class="csl-left-margin">[3] </div><div class="csl-right-inline">Tutorials Point, <span>“KNN algorithm - finding nearest neighbors.”</span> N/A. Accessed: Apr. 10, 2022. [Online]. Available: <a href="https://tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_knn_algorithm_finding_nearest_neighbors.htm">https://tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_knn_algorithm_finding_nearest_neighbors.htm</a></div>
</div>
<div id="ref-online_tds_helm_kmeans" class="csl-entry">
<div class="csl-left-margin">[4] </div><div class="csl-right-inline">M. Helm, <span>“A deep dive into k-means.”</span> 2021-06-01. Accessed: Jun. 01, 2021. [Online]. Available: <a href="https://towardsdatascience.com/a-deep-dive-into-k-means-f9a1ef2490f8">https://towardsdatascience.com/a-deep-dive-into-k-means-f9a1ef2490f8</a></div>
</div>
<div id="ref-tibshirani2001gap_statistics" class="csl-entry">
<div class="csl-left-margin">[5] </div><div class="csl-right-inline">R. Tibshirani, G. Walther, and T. Hastie, <span>“Estimating the number of clusters in a data set via the gap statistic,”</span> <em>Journal of the Royal Statistical Society: Series B (Statistical Methodology)</em>, vol. 63, no. 2, pp. 411–423, 2001, Available: <a href="https://hastie.su.domains/Papers/gap.pdf">https://hastie.su.domains/Papers/gap.pdf</a></div>
</div>
<div id="ref-bruce2017practical" class="csl-entry">
<div class="csl-left-margin">[6] </div><div class="csl-right-inline">P. Bruce and A. Bruce, <em>Practical statistics for data scientists: 50 essential concepts</em>. O’Reilly Media, 2017.</div>
</div>
</div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2022,
  author = {Alizadeh, Esmaeil},
  title = {What {K} Is in {KNN} and {K-Means}},
  date = {2022-03-21},
  url = {https://ealizadeh.com/blog/knn-and-kmeans/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2022" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“What K is in KNN and K-Means,”</span> Mar. 21, 2022. <a href="https://ealizadeh.com/blog/knn-and-kmeans/">https://ealizadeh.com/blog/knn-and-kmeans/</a></div>
</div></div></section></div> ]]></description>
  <category>K-Means</category>
  <category>KNN</category>
  <category>Machine Learning</category>
  <guid>https://ealizadeh.com/blog/knn-and-kmeans/</guid>
  <pubDate>Mon, 21 Mar 2022 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/knn-and-kmeans/img/_featured_image.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>Automate Your Workflow with GitHub Actions and Cron</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/automate-workflow-github-cron/</link>
  <description><![CDATA[ 






<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<strong><a href="https://towardsdatascience.com/automate-workflow-github-actions-cron-130a8bf68ca6">Towards Data Science blog</a></strong>.</p>
</div>
</div>
<section id="introduction" class="level1">
<h1>Introduction</h1>
<p>In this post, we will go over a simple yet powerful tool to run your script (or any task that you can program) on a time-based schedule.</p>
<p>If you run a script manually from time to time, then there is a good chance that you may benefit from automating the process and setting up a schedule to run automatically without you worrying about it. Just set it up once, and forget about it.</p>
<p>Some examples of such automation are:</p>
<ul>
<li>Parsing an RSS feed and sending an email automatically,</li>
<li>Integrating two services you’re using that have no native integration (an example of this is covered in this post),</li>
<li>Pulling data from a source and manipulating the data,</li>
<li><em>etc</em>.</li>
</ul>
<p><em>In this post, I use the word task and script interchangeably. Moreover, the YAML file GitHub Actions uses to create the automation of the task/script is called the workflow file.</em></p>
</section>
<section id="requirement-a-script-you-want-to-run-on-a-time-based-schedule" class="level1">
<h1>Requirement: A script you want to run on a time-based schedule</h1>
<p>The most important part is to have a script that we want to run. This depends on what your task is. The example I will walk you through is the integration of my <a href="https://www.zotero.org/">Zotero</a> annotations to my <a href="https://readwise.io/">Readwise</a> (a paid service that integrates highlights from almost everywhere, like Twitter, Apple Books, <em>etc</em>) account using the&nbsp;<a href="https://github.com/e-alizadeh/Zotero2Readwise">Zotero2Readwise</a> Python library (<em>Disclaimer: I developed this library!</em>).</p>
<p>The script I will be running in this post is <a href="https://github.com/e-alizadeh/Zotero2Readwise/blob/master/zotero2readwise/run.py">here</a>. Since the script is written in Python, I run it as follows: <code>python run.py &lt;app1_token&gt; &lt;app2_password&gt; &lt;app2_user&gt;</code>. Running this on my personal laptop works fine since I have everything set up. But how can we run the above in GitHub Actions on a predefined schedule?</p>
<section id="considerations-before-automation" class="level2">
<h2 class="anchored" data-anchor-id="considerations-before-automation">Considerations before automation</h2>
<p>Your script will most likely be different. However, the following tips may help you get started:</p>
<p>First, run the script on your system. Once you’re happy with the result and you want to automate the workflow, use the instructions below to set up a scheduled workflow using GitHub Actions.</p>
<p>When developing an automation task, it is always good to think about <em>how to run it starting from a fresh OS</em>! It is as if you have a new system and you are trying to run your script there. A few questions to ask yourself:</p>
<ul>
<li>Where should I start?</li>
<li>What software/libraries do I need to install before running the script?</li>
<li>Where can I find the script I want to run?</li>
<li>Do I need to pass some environment variables or sensitive information like passwords?
<ul>
<li>How should I pass sensitive information like passwords or tokens?</li>
</ul></li>
</ul>
<p>I will answer the above questions for my workflow. Hopefully this will give you enough information to automate your task!</p>
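<p>On the last question, one common pattern is to read secrets from environment variables rather than hard-coding them in the script. The sketch below is a minimal illustration and assumes the workflow exposes each secret as an environment variable; the helper <code>read_secret</code> and the variable name <code>DEMO_API_TOKEN</code> are hypothetical.</p>

```python
import os


def read_secret(name):
    """Return a required secret from the environment, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Hypothetical variable name, set here only so that the example runs on its own.
os.environ["DEMO_API_TOKEN"] = "not-a-real-token"
token = read_secret("DEMO_API_TOKEN")
print(len(token))
```

<p>Failing early with a clear message makes a misconfigured scheduled run much easier to diagnose than an authentication error deep inside the script.</p>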
</section>
</section>
<section id="github-actions-set-up" class="level1">
<h1>GitHub Actions Setup</h1>
<p>First, we need to create the <code>.github/workflows</code> directory in our repository. This is where our <a href="https://github.com/e-alizadeh/Zotero2Readwise-Sync/blob/master/.github/workflows/automation.yml">automation file</a> (should be in <code>YAML</code> format) lives. We will go over each part of the file.</p>
<p>You can learn the basics of GitHub Actions <a href="https://docs.github.com/en/actions/learn-github-actions/understanding-github-actions">here</a>.</p>
<p>The first section of any GitHub Actions workflow specifies when the workflow should be triggered. This can be achieved using the <code>on</code> keyword. Since we want to have a scheduled automation, we can run the workflow on a <code>schedule</code> that uses <code>cron</code> notation (discussed in the next section). In addition to running on a schedule, I also want to run the workflow when any change is pushed to the <code>master</code> branch.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">on</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb1-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">push</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb1-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">branches</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb1-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> master</span></span>
<span id="cb1-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">schedule</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb1-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cron</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"0 3 * * 1,3,5"</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> # Runs at 03:00 AM (UTC) every Monday, Wednesday, and Friday (Check https://crontab.guru/)</span></span></code></pre></div></div>
<p>You can also trigger your workflow on a pull request (use the <code>pull_request</code> keyword).</p>
</section>
<section id="cron-cron-job" class="level1">
<h1>Cron (Cron job)</h1>
<p>The cron tool, also known as <a href="https://en.wikipedia.org/wiki/Cron">cronjob</a>, is basically a job scheduler. It is used for <em>scheduling repetitive tasks</em>. Its syntax consists of 5 fields as follows:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb2-1"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">* * * * *</span></span></code></pre></div></div>
<p>The cron syntax above means to run a task <em>every minute</em>. Its 5 fields are described in the table below (<em>note the order from left to right</em>):</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>minute</th>
<th>hour</th>
<th>day of the month</th>
<th>month</th>
<th>day of the week</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>0 - 59</td>
<td>0 - 23</td>
<td>1 - 31</td>
<td>1 - 12</td>
<td>0 - 6 (Sun - Sat)</td>
</tr>
</tbody>
</table>
<section id="cron-job-examples" class="level2">
<h2 class="anchored" data-anchor-id="cron-job-examples">Cron job examples</h2>
<p>The examples below cover different aspects of the cron syntax and all valid characters (<code>* , - /</code>).</p>
<ul>
<li>You can specify your schedule by choosing a valid number for each field. <code>*</code> means “<em>every</em>” (<code>* * * * *</code> means <em>at every minute of every hour on every day of the month, in every month, on every day of the week</em> 🙂). Another example is <code>30 13 1 * *</code>, meaning <em>at 13:30 on day 1 of every month</em>.</li>
<li>You can have multiple parameters for a given section by using the value list separator <code>,</code>. For instance, <code>* * * * 0,3</code> means <em>every minute only on Sunday and Wednesday.</em></li>
<li>You can have step values by using <code>/</code>. For instance, <code>*/10 * * * *</code> means <em>every 10 minutes.</em></li>
<li>You can have a range of values by using a dash <code>-</code>. For instance, <code>* 4-5 1-10 1 *</code> means <em>every minute from 04:00 to 05:59 AM on days 1 through 10 of January.</em></li>
</ul>
<p>And of course, you can combine the options above. For example, <code>*/30 1-5 * 1,6 0,1</code> means <em>every 30 minutes between 01:00 and 05:59 AM, only on Sundays and Mondays in January and June.</em></p>
<p><em>Check&nbsp;<a href="https://crontab.tech/">crontab</a> or <a href="https://crontab.guru/">crontab guru</a>&nbsp;to come up with the cron syntax for your schedule.</em></p>
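<p>To make the rules above concrete, here is a minimal Python sketch that expands each of the 5 fields into the values it matches. The helper names (<code>expand_field</code>, <code>expand_cron</code>) are made up for this illustration and are not part of cron or GitHub Actions. The sketch only handles the characters covered in this post (<code>* , - /</code>) and ignores extensions such as month or weekday names.</p>

```python
def expand_field(field, low, high):
    """Expand one cron field (e.g. "*/30", "1-5", "0,3") into a sorted list of ints."""
    values = set()
    for part in field.split(","):
        # An optional step follows "/"; without it the step is 1.
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step))
    return sorted(values)


# (name, lowest value, highest value) for the five fields, left to right
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 6),
]


def expand_cron(expression):
    """Map each field name to the list of values the expression matches."""
    parts = expression.split()
    if len(parts) != 5:
        raise ValueError("a cron expression needs exactly 5 fields")
    return {
        name: expand_field(part, low, high)
        for part, (name, low, high) in zip(parts, FIELDS)
    }


schedule = expand_cron("*/30 1-5 * 1,6 0,1")
print(schedule["minute"])       # [0, 30]
print(schedule["hour"])         # [1, 2, 3, 4, 5]
print(schedule["month"])        # [1, 6]
print(schedule["day_of_week"])  # [0, 1]
```

<p>Passing an expression with fewer than 5 fields raises a <code>ValueError</code>, which is a quick way to catch a missing field before you commit a schedule.</p>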
</section>
</section>
<section id="use-case-1" class="level1">
<h1>Use Case 1</h1>
<p>As I mentioned earlier, I want to automate my <a href="https://www.zotero.org/">Zotero</a> to <a href="https://readwise.io/">Readwise</a> integration using the <a href="https://github.com/e-alizadeh/Zotero2Readwise">Zotero2Readwise</a>&nbsp;Python library. Let’s answer the questions we asked earlier:</p>
<section id="where-should-we-start-from" class="level2">
<h2 class="anchored" data-anchor-id="where-should-we-start-from">Where should we start from?</h2>
<p>We can start from a fresh Ubuntu system. Under the job, we specify <code>runs-on: ubuntu-latest</code>, which configures the job to run on a fresh virtual machine with the latest version of Ubuntu Linux.</p>
<p>The next step is to clone the current repo. You can achieve this with the <code>uses</code> keyword, which lets us use any action from the <a href="https://github.com/marketplace?type=actions">GitHub Actions Marketplace</a>. We can use the <code>master</code> branch of <code>actions/checkout</code> here (you can also specify a version, like <code>actions/checkout@v2</code>).</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> 🍽️ Checkout the repo</span></span>
<span id="cb3-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">uses</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> actions/checkout@master</span></span>
<span id="cb3-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">with</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb3-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fetch-depth</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span></code></pre></div></div>
</section>
<section id="which-softwarelibraries-we-must-install" class="level2">
<h2 class="anchored" data-anchor-id="which-softwarelibraries-we-must-install">Which software/libraries we must install?</h2>
<p>This step is only necessary if you have to install a library. In my case, I first have to install Python 3.8, which can be done with the <code>actions/setup-python@v2</code> GitHub Action. Afterwards, we want to install the Python package. We can install the <a href="https://github.com/e-alizadeh/Zotero2Readwise">Zotero2Readwise</a>&nbsp;package by running <code>pip install zotero2readwise</code>. To execute a command on the runner, we have to use the <code>run</code> keyword.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> 🐍 Set up Python 3.8</span></span>
<span id="cb4-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">uses</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> actions/setup-python@v2</span></span>
<span id="cb4-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">with</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb4-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">python-version</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'3.8'</span></span>
<span id="cb4-5"></span>
<span id="cb4-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> 💿 Install Zotero2Readwise Python package</span></span>
<span id="cb4-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> pip install zotero2readwise</span></span></code></pre></div></div>
</section>
<section id="where-can-i-find-the-script-i-want-to-run" class="level2">
<h2 class="anchored" data-anchor-id="where-can-i-find-the-script-i-want-to-run">Where can I find the script I want to run?</h2>
<p>If the script you are trying to run lives in the same repository, you can just skip this step. But here, since the Python script I want to run lives in another GitHub repository, I have to download the script using the <code>curl</code> Linux command.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> 📥 Download the Python script needed for automation</span></span>
<span id="cb5-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  curl https://raw.githubusercontent.com/e-alizadeh/Zotero2Readwise/master/zotero2readwise/run.py -o run.py</span></span></code></pre></div></div>
</section>
<section id="run-the-script" class="level2">
<h2 class="anchored" data-anchor-id="run-the-script">Run the script</h2>
<p>Now that we have set up our environment, we can run the script as mentioned earlier in the Requirements section.</p>
<p>One last point: since we need to pass some sensitive information (like tokens), we store it as repository secrets under <strong><em>Settings → Secrets → New repository secret</em></strong>.</p>
<div id="fig-surus" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-surus-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/automate-workflow-github-cron/img/github_action_secrets.gif" class="img-fluid figure-img" alt="Pass secrets to the GitHub environment."></p>
<figcaption>How to pass secrets to the environment of a GitHub repository</figcaption>
</figure>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-surus-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
<p>These secrets will then be available using the following syntax: <code>${{ secrets.YOUR_SECRET_NAME }}</code> in your YAML file.</p>
<p>For more information about handling variables and secrets, you can check the following two pages on the GitHub Docs about <a href="https://docs.github.com/en/actions/learn-github-actions/environment-variables">Environment variables</a> and <a href="https://docs.github.com/en/actions/security-guides/encrypted-secrets">Encrypted secrets</a>.</p>
<p>Now that we have added our secrets, we can run the script as follows:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb6-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> 🚀 Run Automation</span></span>
<span id="cb6-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> python run.py ${{ secrets.READWISE_TOKEN }} ${{ secrets.ZOTERO_KEY }} ${{ secrets.ZOTERO_ID }}</span></span></code></pre></div></div>
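<p>The three secrets are passed to the script as positional command-line arguments, so their order matters. As an illustration only (this is a hypothetical sketch, not necessarily how the actual <code>run.py</code> is written), a script could read them with <code>argparse</code> like this. The argument names below are assumptions chosen to mirror the secret names.</p>

```python
import argparse


def parse_args(argv=None):
    """Read three positional arguments in the same order the workflow passes them."""
    parser = argparse.ArgumentParser(description="Hypothetical automation entry point")
    parser.add_argument("readwise_token", help="value of secrets.READWISE_TOKEN")
    parser.add_argument("zotero_key", help="value of secrets.ZOTERO_KEY")
    parser.add_argument("zotero_id", help="value of secrets.ZOTERO_ID")
    return parser.parse_args(argv)


# Placeholder values stand in for the real secrets.
args = parse_args(["my-readwise-token", "my-zotero-key", "1234567"])
print(args.zotero_id)  # 1234567
```

<p>Because the values arrive by position, swapping two secrets in the workflow file would silently hand the wrong token to the wrong service, so it is worth double-checking the order.</p>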
</section>
<section id="putting-everything-together" class="level2">
<h2 class="anchored" data-anchor-id="putting-everything-together">Putting everything together</h2>
<p>The file containing all steps above is shown below. The file lives on <a href="https://github.com/e-alizadeh/Zotero2Readwise-Sync/blob/50ca5d8475dec360538770bbbbaefa3067eaab5a/.github/workflows/automation.yml">GitHub</a>.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> Zotero to Readwise Automation</span></span>
<span id="cb7-2"></span>
<span id="cb7-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">on</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb7-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">push</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb7-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">branches</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb7-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> master</span></span>
<span id="cb7-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">schedule</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb7-8"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cron</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"0 3 * * 1,3,5"</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"> # Runs at 03:00 AM (UTC) every Monday, Wednesday, and Friday (Check https://crontab.guru/)</span></span>
<span id="cb7-9"></span>
<span id="cb7-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">jobs</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb7-11"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">zotero-to-readwise-automation</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb7-12"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runs-on</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> ubuntu-latest</span></span>
<span id="cb7-13"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">steps</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb7-14"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> 🍽️ Checkout the repo</span></span>
<span id="cb7-15"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">uses</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> actions/checkout@master</span></span>
<span id="cb7-16"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">with</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb7-17"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">          </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fetch-depth</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb7-18"></span>
<span id="cb7-19"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> 🐍 Set up Python 3.8</span></span>
<span id="cb7-20"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">uses</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> actions/setup-python@v2</span></span>
<span id="cb7-21"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">with</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb7-22"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">          </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">python-version</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'3.8'</span></span>
<span id="cb7-23"></span>
<span id="cb7-24"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> 💿 Install Zotero2Readwise Python package</span></span>
<span id="cb7-25"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> pip install zotero2readwise</span></span>
<span id="cb7-26"></span>
<span id="cb7-27"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> 📥 Download the Python script needed for automation</span></span>
<span id="cb7-28"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  curl https://raw.githubusercontent.com/e-alizadeh/Zotero2Readwise/master/zotero2readwise/run.py -o run.py</span></span>
<span id="cb7-29"></span>
<span id="cb7-30"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> 🚀 Run Automation</span></span>
<span id="cb7-31"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">        </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">run</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> python run.py ${{ secrets.READWISE_TOKEN }} ${{ secrets.ZOTERO_KEY }} ${{ secrets.ZOTERO_ID }}</span></span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/automate-workflow-github-cron/img/scheduled_automation.jpg" class="img-fluid figure-img"></p>
<figcaption>A screenshot of the GitHub Actions showing how the workflow is run on a schedule or via a push to the master branch.</figcaption>
</figure>
</div>
</section>
</section>
<section id="use-case-2" class="level1">
<h1>Use Case 2</h1>
<p>Following the process above, I set up a scheduled update of my GitHub page that runs twice a week. It has been working fine so far (for over a year at the time of this post). This automation does the following:</p>
<ol type="1">
<li>Parse my <a href="https://ealizadeh.com/blog">personal blog</a> and retrieve the title and the link to my latest posts.</li>
<li>Add the blog posts to a section of my <a href="https://github.com/e-alizadeh">GitHub README page</a>.</li>
<li>Add a note about the date and time of the auto-generated README page to the page footer.</li>
</ol>
<p>You can find the file that runs the above tasks on <a href="https://github.com/e-alizadeh/e-alizadeh/blob/main/.github/workflows/medium-blog-posts-update.yml">GitHub</a>. This automation also runs a <a href="https://github.com/e-alizadeh/e-alizadeh/blob/main/src/update_latest_blog_posts.py">Python script</a>.</p>
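<p>The three tasks listed above can be sketched in a few lines of Python using only the standard library. This is a simplified illustration and not the linked script: the function names are made up, and a tiny feed built in memory stands in for the real blog feed so that the example runs without network access.</p>

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def build_sample_feed():
    """Build a tiny RSS document in memory (a stand-in for a real blog feed)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    for title, link in [
        ("First post", "https://example.com/first/"),
        ("Second post", "https://example.com/second/"),
    ]:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")


def latest_posts(feed_xml, limit=5):
    """Return (title, link) pairs for the first `limit` items of an RSS feed."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append((item.findtext("title"), item.findtext("link")))
    return posts[:limit]


def render_readme_section(posts, generated_at):
    """Format the posts as a Markdown list followed by a timestamp footer."""
    lines = [f"- [{title}]({link})" for title, link in posts]
    lines.append("")
    lines.append(f"_Auto-generated on {generated_at:%Y-%m-%d %H:%M} UTC_")
    return "\n".join(lines)


posts = latest_posts(build_sample_feed())
print(render_readme_section(posts, datetime.now(timezone.utc)))
```

<p>In a real workflow, the generated text would be written into the README file and committed by a later step.</p>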
</section>
<section id="github-actions-cost" class="level1">
<h1>GitHub Actions Cost</h1>
<p><em>GitHub Actions usage is free for public repositories.</em> However, the free plan for private repositories has some limitations: at the time of publishing this post, it offers 2,000 minutes per month and 500 MB of storage, which should be enough for most use cases [1].</p>
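<p>As a rough back-of-the-envelope check, the snippet below estimates how much of that monthly allowance the Monday/Wednesday/Friday schedule from Use Case 1 would consume in a private repository. The 2 minutes per run is an assumed figure for illustration, not a measurement, and runs triggered by pushes are not counted.</p>

```python
import math

FREE_MINUTES_PER_MONTH = 2000  # free-plan figure quoted in this post
RUNS_PER_WEEK = 3              # Monday, Wednesday, and Friday
DAYS_IN_LONGEST_MONTH = 31     # worst case for the number of scheduled runs
ASSUMED_MINUTES_PER_RUN = 2    # hypothetical duration of one run

# Round up so that a partial week still counts as a run.
runs_per_month = math.ceil(RUNS_PER_WEEK * DAYS_IN_LONGEST_MONTH / 7)
minutes_used = runs_per_month * ASSUMED_MINUTES_PER_RUN

print(runs_per_month)                                   # 14
print(minutes_used)                                     # 28
print(f"{minutes_used / FREE_MINUTES_PER_MONTH:.1%}")   # 1.4%
```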
<hr>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>In this post, we saw how GitHub Actions can be used to run a task on a time-based schedule. The sky is the limit here. Think about a workflow or a task you are currently running every so often. If you can put the task in a script that can be run on your computer (it doesn’t matter if it is in Python, Bash or any other script), then you can actually set up the automation.</p>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2022,
  author = {Alizadeh, Esmaeil},
  title = {Automate {Your} {Workflow} with {GitHub} {Actions} and
    {Cron}},
  date = {2022-01-20},
  url = {https://ealizadeh.com/blog/automate-workflow-github-cron/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2022" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“Automate Your Workflow with GitHub Actions and
Cron,”</span> Jan. 20, 2022. <a href="https://ealizadeh.com/blog/automate-workflow-github-cron/">https://ealizadeh.com/blog/automate-workflow-github-cron/</a></div>
</div></div></section></div> ]]></description>
  <category>Automation</category>
  <category>GitHub Actions</category>
  <category>Tutorial</category>
  <guid>https://ealizadeh.com/blog/automate-workflow-github-cron/</guid>
  <pubDate>Thu, 20 Jan 2022 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/automate-workflow-github-cron/img/feature_image.png" medium="image" type="image/png" height="102" width="144"/>
</item>
<item>
  <title>Visualize your Pandas Data Transformation using PandasTutor</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/pandas-tutor-tool/</link>
  <description><![CDATA[ 






<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<strong><a href="https://towardsdatascience.com/visualize-data-transformation-using-pandastutor-6126627dd225">Towards Data Science blog</a></strong>.</p>
</div>
</div>
<section id="introduction" class="level1">
<h1>Introduction</h1>
<p>Pandas is a powerful Python library for any exploratory data analysis. Sometimes, you may have difficulties in visualizing data transformations. Here comes <a href="https://pandastutor.com/">PandasTutor</a>— a web app that allows you to see how your pandas code transforms the data step-by-step.</p>
<p>This may come in handy, particularly if you have complicated transformations and want to visualize your steps or explain them to others.</p>
<p>PandasTutor lets you visualize different pandas transformations, from <a href="https://pandastutor.com/vis.html#trace=example-code/py_sort_values.json">sorting</a> to <a href="https://pandastutor.com/vis.html#trace=example-code/py_groupby_multi.json">grouping by multiple columns</a>, and even grouping by a column and <a href="https://pandastutor.com/vis.html#trace=example-code/py_sort_groupby_agg.json">performing multiple aggregations</a>.</p>
<section id="pandastutor-creators" class="level2">
<h2 class="anchored" data-anchor-id="pandastutor-creators">PandasTutor Creators</h2>
<p>Pandas Tutor was created by <a href="https://www.samlau.me/">Sam Lau</a>&nbsp;and&nbsp;<a href="https://pg.ucsd.edu/">Philip Guo</a> at UC San Diego. The tool was developed mainly for teaching purposes, as its creators state <a href="https://docs.google.com/document/d/1kvY8baGjaMbg8ucMTjXlmLeYJXVQKQr09AttwUu3F_k/edit#heading=h.3xhjglvrau6z">here</a>. This explains some of the limitations this tool has (I will cover some of those limitations later in the post).</p>
<p>A similar tool for R users, called <a href="https://tidydatatutor.com/">Tidy Data Tutor</a>, was created by <a href="https://seankross.com/">Sean Kross</a>&nbsp;and&nbsp;<a href="https://pg.ucsd.edu/">Philip Guo</a>.</p>
</section>
</section>
<section id="case-study" class="level1">
<h1>Case Study</h1>
<p>In this article, I will walk through an example that sorts a DataFrame, groups it by multiple columns, and then performs different aggregations on multiple columns.</p>
<section id="dataset" class="level2">
<h2 class="anchored" data-anchor-id="dataset">Dataset</h2>
<p>Let’s use the Heart Failure Prediction Dataset from Kaggle (available <a href="https://www.kaggle.com/fedesoriano/heart-failure-prediction">here</a>). The data is available under the&nbsp;<a href="https://opendatacommons.org/licenses/odbl/1-0/">Open Database License (ODbL)</a>,&nbsp;allowing&nbsp;<em>“users to freely share, modify, and use this Database while maintaining this same freedom for others.”</em> Since Pandas Tutor only works with small data, I will take only the first 50 rows of the heart data.</p>
</section>
<section id="code" class="level2">
<h2 class="anchored" data-anchor-id="code">Code</h2>
<p>Below is the code used for the visualization in this post. You may notice that the CSV data is embedded directly in the code as a string, which is a current limitation of this tool.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> io</span>
<span id="cb1-3"></span>
<span id="cb1-4">csv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'''</span></span>
<span id="cb1-5"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease</span></span>
<span id="cb1-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">40,M,ATA,140,289,0,Normal,172,N,0,Up,0</span></span>
<span id="cb1-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">49,F,NAP,160,180,0,Normal,156,N,1,Flat,1</span></span>
<span id="cb1-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">37,M,ATA,130,283,0,ST,98,N,0,Up,0</span></span>
<span id="cb1-9"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1</span></span>
<span id="cb1-10"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">54,M,NAP,150,195,0,Normal,122,N,0,Up,0</span></span>
<span id="cb1-11"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">39,M,NAP,120,339,0,Normal,170,N,0,Up,0</span></span>
<span id="cb1-12"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">45,F,ATA,130,237,0,Normal,170,N,0,Up,0</span></span>
<span id="cb1-13"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">54,M,ATA,110,208,0,Normal,142,N,0,Up,0</span></span>
<span id="cb1-14"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">37,M,ASY,140,207,0,Normal,130,Y,1.5,Flat,1</span></span>
<span id="cb1-15"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">48,F,ATA,120,284,0,Normal,120,N,0,Up,0</span></span>
<span id="cb1-16"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">37,F,NAP,130,211,0,Normal,142,N,0,Up,0</span></span>
<span id="cb1-17"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">58,M,ATA,136,164,0,ST,99,Y,2,Flat,1</span></span>
<span id="cb1-18"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">39,M,ATA,120,204,0,Normal,145,N,0,Up,0</span></span>
<span id="cb1-19"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">49,M,ASY,140,234,0,Normal,140,Y,1,Flat,1</span></span>
<span id="cb1-20"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">42,F,NAP,115,211,0,ST,137,N,0,Up,0</span></span>
<span id="cb1-21"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">54,F,ATA,120,273,0,Normal,150,N,1.5,Flat,0</span></span>
<span id="cb1-22"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">38,M,ASY,110,196,0,Normal,166,N,0,Flat,1</span></span>
<span id="cb1-23"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">43,F,ATA,120,201,0,Normal,165,N,0,Up,0</span></span>
<span id="cb1-24"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">60,M,ASY,100,248,0,Normal,125,N,1,Flat,1</span></span>
<span id="cb1-25"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">36,M,ATA,120,267,0,Normal,160,N,3,Flat,1</span></span>
<span id="cb1-26"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">43,F,TA,100,223,0,Normal,142,N,0,Up,0</span></span>
<span id="cb1-27"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">44,M,ATA,120,184,0,Normal,142,N,1,Flat,0</span></span>
<span id="cb1-28"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">49,F,ATA,124,201,0,Normal,164,N,0,Up,0</span></span>
<span id="cb1-29"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">44,M,ATA,150,288,0,Normal,150,Y,3,Flat,1</span></span>
<span id="cb1-30"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">40,M,NAP,130,215,0,Normal,138,N,0,Up,0</span></span>
<span id="cb1-31"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">36,M,NAP,130,209,0,Normal,178,N,0,Up,0</span></span>
<span id="cb1-32"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">53,M,ASY,124,260,0,ST,112,Y,3,Flat,0</span></span>
<span id="cb1-33"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">52,M,ATA,120,284,0,Normal,118,N,0,Up,0</span></span>
<span id="cb1-34"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">53,F,ATA,113,468,0,Normal,127,N,0,Up,0</span></span>
<span id="cb1-35"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">51,M,ATA,125,188,0,Normal,145,N,0,Up,0</span></span>
<span id="cb1-36"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">53,M,NAP,145,518,0,Normal,130,N,0,Flat,1</span></span>
<span id="cb1-37"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">56,M,NAP,130,167,0,Normal,114,N,0,Up,0</span></span>
<span id="cb1-38"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">54,M,ASY,125,224,0,Normal,122,N,2,Flat,1</span></span>
<span id="cb1-39"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">41,M,ASY,130,172,0,ST,130,N,2,Flat,1</span></span>
<span id="cb1-40"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">43,F,ATA,150,186,0,Normal,154,N,0,Up,0</span></span>
<span id="cb1-41"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">32,M,ATA,125,254,0,Normal,155,N,0,Up,0</span></span>
<span id="cb1-42"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">65,M,ASY,140,306,1,Normal,87,Y,1.5,Flat,1</span></span>
<span id="cb1-43"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">41,F,ATA,110,250,0,ST,142,N,0,Up,0</span></span>
<span id="cb1-44"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">48,F,ATA,120,177,1,ST,148,N,0,Up,0</span></span>
<span id="cb1-45"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">48,F,ASY,150,227,0,Normal,130,Y,1,Flat,0</span></span>
<span id="cb1-46"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">54,F,ATA,150,230,0,Normal,130,N,0,Up,0</span></span>
<span id="cb1-47"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">54,F,NAP,130,294,0,ST,100,Y,0,Flat,1</span></span>
<span id="cb1-48"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">35,M,ATA,150,264,0,Normal,168,N,0,Up,0</span></span>
<span id="cb1-49"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">52,M,NAP,140,259,0,ST,170,N,0,Up,0</span></span>
<span id="cb1-50"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">43,M,ASY,120,175,0,Normal,120,Y,1,Flat,1</span></span>
<span id="cb1-51"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">59,M,NAP,130,318,0,Normal,120,Y,1,Flat,0</span></span>
<span id="cb1-52"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">37,M,ASY,120,223,0,Normal,168,N,0,Up,0</span></span>
<span id="cb1-53"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">50,M,ATA,140,216,0,Normal,170,N,0,Up,0</span></span>
<span id="cb1-54"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">36,M,NAP,112,340,0,Normal,184,N,1,Flat,0</span></span>
<span id="cb1-55"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">41,M,ASY,110,289,0,Normal,170,N,0,Flat,1</span></span>
<span id="cb1-56"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'''</span></span>
<span id="cb1-57"></span>
<span id="cb1-58">df_hearts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(io.StringIO(csv))</span>
<span id="cb1-59">df_hearts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df_hearts[</span>
<span id="cb1-60">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Age"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sex"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RestingBP"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ChestPainType"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cholesterol"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"HeartDisease"</span>]</span>
<span id="cb1-61">]</span>
<span id="cb1-62"></span>
<span id="cb1-63">(df_hearts.sort_values(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Age"</span>)</span>
<span id="cb1-64">.groupby([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sex"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"HeartDisease"</span>])</span>
<span id="cb1-65">.agg({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RestingBP"</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"std"</span>], </span>
<span id="cb1-66">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cholesterol"</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"std"</span>],</span>
<span id="cb1-67">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sex"</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"count"</span>]</span>
<span id="cb1-68">      })</span>
<span id="cb1-69">)</span></code></pre></div></div>
<p>So our transformation is only the last few lines:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">(df_hearts.sort_values(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Age"</span>)</span>
<span id="cb2-2">.groupby([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sex"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"HeartDisease"</span>])</span>
<span id="cb2-3">.agg({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RestingBP"</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"std"</span>], </span>
<span id="cb2-4">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cholesterol"</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"std"</span>],</span>
<span id="cb2-5">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sex"</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"count"</span>]</span>
<span id="cb2-6">      })</span>
<span id="cb2-7">)</span></code></pre></div></div>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">Results</h2>
<section id="step-1-sorting-the-dataframe" class="level3">
<h3 class="anchored" data-anchor-id="step-1-sorting-the-dataframe">Step 1: Sorting the DataFrame</h3>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/pandas-tutor-tool/img/ex01_step1_sort.gif" class="img-fluid figure-img"></p>
<figcaption>Visualization of the <code>sort_values()</code> result (step 1) (generated using <a href="https://pandastutor.com/vis.html">PandasTutor</a>)</figcaption>
</figure>
</div>
</section>
<section id="step-2-visualize-pandas-groupby-operation" class="level3">
<h3 class="anchored" data-anchor-id="step-2-visualize-pandas-groupby-operation">Step 2: Visualize Pandas Groupby operation</h3>
<p>After sorting the DataFrame in Step 1, we can visualize the <code>groupby()</code> operation:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/pandas-tutor-tool/img/ex01_step2_groupby.png" class="img-fluid figure-img"></p>
<figcaption>Visualization of the <code>groupby()</code> result (steps 1 and 2) (generated using <a href="https://pandastutor.com/vis.html">PandasTutor</a>)</figcaption>
</figure>
</div>
</section>
<section id="step-3-calculate-different-aggregations-on-multiple-columns" class="level3">
<h3 class="anchored" data-anchor-id="step-3-calculate-different-aggregations-on-multiple-columns">Step 3: Calculate different aggregations on multiple columns</h3>
<p>Here, I calculate the mean and standard deviation of two columns, “RestingBP” and “Cholesterol”, and also provide a count for each group (I use the “Sex” column to get that count).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/pandas-tutor-tool/img/ex01_step3_aggregations.png" class="img-fluid figure-img"></p>
<figcaption>Visualization of the final result that is the aggregation (steps 1 - 3) (generated using <a href="https://pandastutor.com/vis.html">PandasTutor</a>)</figcaption>
</figure>
</div>
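<p>As a side note, the same kind of summary can also be written with pandas named aggregation, which yields flat column names instead of a two-level header. The snippet below is a minimal sketch in plain pandas and is not part of the original visualization: the toy values and the output names <code>bp_mean</code>, <code>chol_mean</code>, and <code>n_rows</code> are made up for illustration, and I have not checked whether Pandas Tutor can visualize this form.</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import pandas as pd

# Toy data reusing the column names from this post (the values are made up).
df = pd.DataFrame({
    "Sex": ["M", "M", "F", "F", "M"],
    "HeartDisease": [0, 1, 0, 0, 1],
    "RestingBP": [140, 130, 120, 130, 110],
    "Cholesterol": [289, 283, 237, 211, 196],
})

# Named aggregation: output_name=(input_column, aggregation_function).
summary = df.groupby(["Sex", "HeartDisease"]).agg(
    bp_mean=("RestingBP", "mean"),
    chol_mean=("Cholesterol", "mean"),
    n_rows=("RestingBP", "count"),
)
print(summary)
</code></pre></div>
<p>Note that <code>"count"</code> on a value column counts its non-missing values per group; if rows with missing values should be counted too, <code>groupby(...).size()</code> is the usual alternative.</p>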
</section>
<section id="interesting-sharing-feature" class="level3">
<h3 class="anchored" data-anchor-id="interesting-sharing-feature"><strong>Interesting sharing feature</strong></h3>
<p>Pandas Tutor also provides you with a <strong>shareable URL</strong> that even includes the CSV data used in the transformation. For instance, you can check my transformation code and results <a href="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0A%0Acsv%20%3D%20'''%0AAge,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease%0A40,M,ATA,140,289,0,Normal,172,N,0,Up,0%0A49,F,NAP,160,180,0,Normal,156,N,1,Flat,1%0A37,M,ATA,130,283,0,ST,98,N,0,Up,0%0A48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1%0A54,M,NAP,150,195,0,Normal,122,N,0,Up,0%0A39,M,NAP,120,339,0,Normal,170,N,0,Up,0%0A45,F,ATA,130,237,0,Normal,170,N,0,Up,0%0A54,M,ATA,110,208,0,Normal,142,N,0,Up,0%0A37,M,ASY,140,207,0,Normal,130,Y,1.5,Flat,1%0A48,F,ATA,120,284,0,Normal,120,N,0,Up,0%0A37,F,NAP,130,211,0,Normal,142,N,0,Up,0%0A58,M,ATA,136,164,0,ST,99,Y,2,Flat,1%0A39,M,ATA,120,204,0,Normal,145,N,0,Up,0%0A49,M,ASY,140,234,0,Normal,140,Y,1,Flat,1%0A42,F,NAP,115,211,0,ST,137,N,0,Up,0%0A54,F,ATA,120,273,0,Normal,150,N,1.5,Flat,0%0A38,M,ASY,110,196,0,Normal,166,N,0,Flat,1%0A43,F,ATA,120,201,0,Normal,165,N,0,Up,0%0A60,M,ASY,100,248,0,Normal,125,N,1,Flat,1%0A36,M,ATA,120,267,0,Normal,160,N,3,Flat,1%0A43,F,TA,100,223,0,Normal,142,N,0,Up,0%0A44,M,ATA,120,184,0,Normal,142,N,1,Flat,0%0A49,F,ATA,124,201,0,Normal,164,N,0,Up,0%0A44,M,ATA,150,288,0,Normal,150,Y,3,Flat,1%0A40,M,NAP,130,215,0,Normal,138,N,0,Up,0%0A36,M,NAP,130,209,0,Normal,178,N,0,Up,0%0A53,M,ASY,124,260,0,ST,112,Y,3,Flat,0%0A52,M,ATA,120,284,0,Normal,118,N,0,Up,0%0A53,F,ATA,113,468,0,Normal,127,N,0,Up,0%0A51,M,ATA,125,188,0,Normal,145,N,0,Up,0%0A53,M,NAP,145,518,0,Normal,130,N,0,Flat,1%0A56,M,NAP,130,167,0,Normal,114,N,0,Up,0%0A54,M,ASY,125,224,0,Normal,122,N,2,Flat,1%0A41,M,ASY,130,172,0,ST,130,N,2,Flat,1%0A43,F,ATA,150,186,0,Normal,154,N,0,Up,0%0A32,M,ATA,125,254,0,Normal,155,N,0,Up,0%0A65,M,ASY,140,306,1,Normal,87,Y,1.5,Flat,1%0A41,F,ATA,110,250,0,ST,142,N,0,Up,0%0A48,F,AT
A,120,177,1,ST,148,N,0,Up,0%0A48,F,ASY,150,227,0,Normal,130,Y,1,Flat,0%0A54,F,ATA,150,230,0,Normal,130,N,0,Up,0%0A54,F,NAP,130,294,0,ST,100,Y,0,Flat,1%0A35,M,ATA,150,264,0,Normal,168,N,0,Up,0%0A52,M,NAP,140,259,0,ST,170,N,0,Up,0%0A43,M,ASY,120,175,0,Normal,120,Y,1,Flat,1%0A59,M,NAP,130,318,0,Normal,120,Y,1,Flat,0%0A37,M,ASY,120,223,0,Normal,168,N,0,Up,0%0A50,M,ATA,140,216,0,Normal,170,N,0,Up,0%0A36,M,NAP,112,340,0,Normal,184,N,1,Flat,0%0A41,M,ASY,110,289,0,Normal,170,N,0,Flat,1%0A'''%0A%0Adf_hearts%20%3D%20pd.read_csv%28io.StringIO%28csv%29%29%0Adf_hearts%20%3D%20df_hearts%5B%0A%20%20%20%20%5B%22Age%22,%20%22Sex%22,%20%22RestingBP%22,%20%22ChestPainType%22,%20%22Cholesterol%22,%20%22HeartDisease%22%5D%0A%5D%0A%0A%28df_hearts.sort_values%28%22Age%22%29%0A.groupby%28%5B%22Sex%22,%20%22HeartDisease%22%5D%29%0A.agg%28%7B%22RestingBP%22%3A%20%5B%22mean%22,%20%22std%22%5D,%20%0A%20%20%20%20%20%20%22Cholesterol%22%3A%20%5B%22mean%22,%20%22std%22%5D,%0A%20%20%20%20%20%20%22Sex%22%3A%20%5B%22count%22%5D%0A%20%20%20%20%20%20%7D%29%0A%29&amp;d=2021-12-08&amp;lang=py&amp;v=v1">here</a> or via the link below.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">https:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">//</span>pandastutor.com<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>vis.html<span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#code=import%20pandas%20as%20pd%0Aimport%20io%0A%0Acsv%20%3D%20'''%0AAge,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease%0A40,M,ATA,140,289,0,Normal,172,N,0,Up,0%0A49,F,NAP,160,180,0,Normal,156,N,1,Flat,1%0A37,M,ATA,130,283,0,ST,98,N,0,Up,0%0A48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1%0A54,M,NAP,150,195,0,Normal,122,N,0,Up,0%0A39,M,NAP,120,339,0,Normal,170,N,0,Up,0%0A45,F,ATA,130,237,0,Normal,170,N,0,Up,0%0A54,M,ATA,110,208,0,Normal,142,N,0,Up,0%0A37,M,ASY,140,207,0,Normal,130,Y,1.5,Flat,1%0A48,F,ATA,120,284,0,Normal,120,N,0,Up,0%0A37,F,NAP,130,211,0,Normal,142,N,0,Up,0%0A58,M,ATA,136,164,0,ST,99,Y,2,Flat,1%0A39,M,ATA,120,204,0,Normal,145,N,0,Up,0%0A49,M,ASY,140,234,0,Normal,140,Y,1,Flat,1%0A42,F,NAP,115,211,0,ST,137,N,0,Up,0%0A54,F,ATA,120,273,0,Normal,150,N,1.5,Flat,0%0A38,M,ASY,110,196,0,Normal,166,N,0,Flat,1%0A43,F,ATA,120,201,0,Normal,165,N,0,Up,0%0A60,M,ASY,100,248,0,Normal,125,N,1,Flat,1%0A36,M,ATA,120,267,0,Normal,160,N,3,Flat,1%0A43,F,TA,100,223,0,Normal,142,N,0,Up,0%0A44,M,ATA,120,184,0,Normal,142,N,1,Flat,0%0A49,F,ATA,124,201,0,Normal,164,N,0,Up,0%0A44,M,ATA,150,288,0,Normal,150,Y,3,Flat,1%0A40,M,NAP,130,215,0,Normal,138,N,0,Up,0%0A36,M,NAP,130,209,0,Normal,178,N,0,Up,0%0A53,M,ASY,124,260,0,ST,112,Y,3,Flat,0%0A52,M,ATA,120,284,0,Normal,118,N,0,Up,0%0A53,F,ATA,113,468,0,Normal,127,N,0,Up,0%0A51,M,ATA,125,188,0,Normal,145,N,0,Up,0%0A53,M,NAP,145,518,0,Normal,130,N,0,Flat,1%0A56,M,NAP,130,167,0,Normal,114,N,0,Up,0%0A54,M,ASY,125,224,0,Normal,122,N,2,Flat,1%0A41,M,ASY,130,172,0,ST,130,N,2,Flat,1%0A43,F,ATA,150,186,0,Normal,154,N,0,Up,0%0A32,M,ATA,125,254,0,Normal,155,N,0,Up,0%0A65,M,ASY,140,306,1,Normal,87,Y,1.5,Flat,1%0A41,F,ATA,110,250,0,ST,142,N,0,Up,0%0A48,F,ATA,120,177,1,ST,148,N,0,Up,0%0A48,F,ASY,150,227,0,Normal,130,Y,1,Flat,0%0A54,F,ATA,150,230,0,Normal,130,N,0,Up,0%0A54,F,NAP,130,294,0,ST,100,Y,0,Flat,1%0A35,M,ATA,150,264,0,Normal,168,N,0,Up,0%0A52,M,NAP,140,259,0,S
T,170,N,0,Up,0%0A43,M,ASY,120,175,0,Normal,120,Y,1,Flat,1%0A59,M,NAP,130,318,0,Normal,120,Y,1,Flat,0%0A37,M,ASY,120,223,0,Normal,168,N,0,Up,0%0A50,M,ATA,140,216,0,Normal,170,N,0,Up,0%0A36,M,NAP,112,340,0,Normal,184,N,1,Flat,0%0A41,M,ASY,110,289,0,Normal,170,N,0,Flat,1%0A'''%0A%0Adf_hearts%20%3D%20pd.read_csv%28io.StringIO%28csv%29%29%0Adf_hearts%20%3D%20df_hearts%5B%0A%20%20%20%20%5B%22Age%22,%20%22Sex%22,%20%22RestingBP%22,%20%22ChestPainType%22,%20%22Cholesterol%22,%20%22HeartDisease%22%5D%0A%5D%0A%0A%28df_hearts.sort_values%28%22Age%22%29%0A.groupby%28%5B%22Sex%22,%20%22HeartDisease%22%5D%29%0A.agg%28%7B%22RestingBP%22%3A%20%5B%22mean%22,%20%22std%22%5D,%20%0A%20%20%20%20%20%20%22Cholesterol%22%3A%20%5B%22mean%22,%20%22std%22%5D,%0A%20%20%20%20%20%20%22Sex%22%3A%20%5B%22count%22%5D%0A%20%20%20%20%20%20%7D%29%0A%29&amp;d=2021-12-08&amp;lang=py&amp;v=v1</span></span></code></pre></div></div>
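<p>The shareable link above carries the whole script, percent-encoded, in the part of the URL after <code>#code=</code>. The sketch below only illustrates that idea with Python’s standard <code>urllib.parse</code> module on a tiny made-up script. It leaves out the extra parameters (date, language, version) that appear at the end of the real link, the choice of characters to leave unescaped is mine, and I have not verified that Pandas Tutor accepts a link built this way.</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">from urllib.parse import quote, unquote

# A tiny stand-in script (hypothetical, much shorter than the one in this post).
code = 'import pandas as pd\ndf = pd.DataFrame({"a": [2, 1]})\ndf.sort_values("a")'

# Percent-encode the script; newlines become %0A and spaces become %20.
# Leaving "'" and "," unescaped is a choice made for this sketch.
encoded = quote(code, safe="',")
share_url = "https://pandastutor.com/vis.html#code=" + encoded
print(share_url)

# Decoding the part after "#code=" recovers the original script.
fragment = share_url.split("#code=", 1)[1]
recovered = unquote(fragment)
print(recovered == code)
</code></pre></div>
<p>Decoding a shared link this way is also a quick way to read the code behind it without opening the page.</p>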
<hr>
</section>
</section>
<section id="pros" class="level2">
<h2 class="anchored" data-anchor-id="pros">Pros:</h2>
<ul>
<li>Step-by-step visualization</li>
<li>Interactive plots (you can track the data rows before and after the transformation)</li>
<li>Shareable URL</li>
</ul>
</section>
<section id="cons-current-limitations" class="level2">
<h2 class="anchored" data-anchor-id="cons-current-limitations">Cons (current limitations):</h2>
<ul>
<li>It only works for small code snippets (the code is limited to 5000 bytes). Because the data is embedded in the code rather than read from a file, you can only visualize small datasets.</li>
<li>As noted above, you have to embed the data along with the code, because reading from external resources (files or links) is not supported.</li>
<li>Limited support for pandas methods.</li>
<li>Only the pandas expression on the last line is visualized. You may have to chain multiple steps together or run the visualizations separately.</li>
</ul>
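<p>Given the size limit, it can help to check how large a script is before pasting it into the tool. Below is a minimal sketch, assuming the 5000-byte figure quoted above still holds (check the FAQ for the current value). The stand-in data and the names <code>LIMIT_BYTES</code>, <code>snippet</code>, and <code>bytes_left</code> are my own and not part of Pandas Tutor.</p>
<div class="sourceCode"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import pandas as pd

LIMIT_BYTES = 5000  # figure quoted in this post; an assumption, not read from the tool

# Stand-in data; in practice this would be your own DataFrame.
df = pd.DataFrame({"Age": [40, 49, 37], "Sex": ["M", "F", "M"]})

# Keep only the first rows and turn them into inline CSV text.
csv_text = df.head(50).to_csv(index=False)

# Assemble the script the same way as in this post: data first, expression last.
snippet = "".join([
    "import pandas as pd\nimport io\n\n",
    "csv = '''\n", csv_text, "'''\n\n",
    "df = pd.read_csv(io.StringIO(csv))\n",
    'df.sort_values("Age")',
])

size = len(snippet.encode("utf-8"))
bytes_left = LIMIT_BYTES - size  # a negative value means the script is too large
print(size, "bytes used,", bytes_left, "bytes left")
</code></pre></div>
<p>If the script is too large, reducing the number of rows passed to <code>head()</code> or dropping unused columns before calling <code>to_csv()</code> are the simplest ways to shrink it.</p>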
<p><em>For a complete list of unsupported features and other FAQs, you can check <a href="https://docs.google.com/document/d/1kvY8baGjaMbg8ucMTjXlmLeYJXVQKQr09AttwUu3F_k/edit#heading=h.3xhjglvrau6z">here</a>.</em></p>
<hr>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>In this post, we looked at a nice tool for step-by-step visualization of pandas data transformations. It generates interactive plots that let you compare the data before and after each transformation. This is very useful for those who want to solidify their understanding of pandas transformations or who want to share those transformations with others (Pandas Tutor even provides a shareable URL).</p>
<hr>
</section>
<section id="references" class="level1">
<h1>References</h1>
<p><a href="https://pandastutor.com/">Pandas Tutor - Visualize Python Pandas code</a></p>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2021,
  author = {Alizadeh, Esmaeil},
  title = {Visualize Your {Pandas} {Data} {Transformation} Using
    {PandasTutor}},
  date = {2021-12-08},
  url = {https://ealizadeh.com/blog/pandas-tutor-tool/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2021" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“Visualize your Pandas Data Transformation using
PandasTutor,”</span> Dec. 08, 2021. <a href="https://ealizadeh.com/blog/pandas-tutor-tool/">https://ealizadeh.com/blog/pandas-tutor-tool/</a></div>
</div></div></section></div> ]]></description>
  <category>Python</category>
  <category>Data Science</category>
  <category>Visualization</category>
  <category>Pandas</category>
  <guid>https://ealizadeh.com/blog/pandas-tutor-tool/</guid>
  <pubDate>Wed, 08 Dec 2021 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/pandas-tutor-tool/img/_featured_image.png" medium="image" type="image/png" height="69" width="144"/>
</item>
<item>
  <title>MLxtend: A Python Library with Interesting Tools for Data Science Tasks</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/mlxtend-library-for-data-science/</link>
  <description><![CDATA[ 






<p><img src="https://ealizadeh.com/blog/mlxtend-library-for-data-science/img/_featured_image.jpg" class="img-fluid" alt="Featured image of the post"></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<a href="https://towardsdatascience.com/mlxtend-a-python-library-with-interesting-tools-for-data-science-tasks-d54c723f89cd"><strong>Towards Data Science blog</strong></a>.</p>
</div>
</div>
<p>The <a href="https://rasbt.github.io/mlxtend/">MLxtend</a>&nbsp;library <sup>1</sup> (Machine Learning extensions) has many interesting functions for everyday data analysis and machine learning tasks. Although there are many machine learning libraries available for Python, such as&nbsp;<a href="https://scikit-learn.org/">scikit-learn</a>,&nbsp;<a href="https://www.tensorflow.org/">TensorFlow</a>,&nbsp;<a href="https://keras.io/">Keras</a>, and&nbsp;<a href="https://pytorch.org/">PyTorch</a>, MLxtend offers additional functionality and can be a valuable addition to your data science toolbox.</p>
<p>In this post, I will go over several tools of the library. In particular, I will cover:</p>
<ul>
<li>Create counterfactual (for model interpretability)</li>
<li>PCA correlation circle</li>
<li>Bias-variance decomposition</li>
<li>Decision regions of classification models</li>
<li>Matrix of scatter plots</li>
<li>Bootstrapping</li>
</ul>
<p>For a list of all functionalities this library offers, you can visit MLxtend’s&nbsp;<a href="https://rasbt.github.io/mlxtend/">documentation</a> (<span class="citation" data-cites="raschka2018mlxtend">see [1]</span>).</p>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>👉 A link to a free one-page summary of this post is available at the end of the post.</p>
</div>
</div>
</div>
<hr>
<section id="mlxtend-library" class="level2">
<h2 class="anchored" data-anchor-id="mlxtend-library">MLxtend Library</h2>
<p>The MLxtend library is developed by&nbsp;<a href="https://sebastianraschka.com/">Sebastian Raschka</a>&nbsp;(a professor of statistics at the University of Wisconsin-Madison). The library has nice API documentation as well as many examples.</p>
<p>You can install the MLxtend package from the Python Package Index (PyPI) by running <code>pip install mlxtend</code>.</p>
</section>
<section id="dataset" class="level2">
<h2 class="anchored" data-anchor-id="dataset">Dataset</h2>
<p>In this post, I’m using the wine data set obtained from&nbsp;<a href="https://www.kaggle.com/tug004/3wine-classification-dataset">Kaggle</a>. The data contains 13 attributes for three types of wine. This is a multiclass classification dataset, and you can find the description of the dataset&nbsp;<a href="https://archive.ics.uci.edu/ml/datasets/wine">here</a>.</p>
<p>First, let’s import the data and prepare the input variables&nbsp;<img src="https://latex.codecogs.com/png.latex?X">&nbsp;(feature set) and the output variable&nbsp;<img src="https://latex.codecogs.com/png.latex?y">&nbsp;(target).</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load wine data set (available at https://www.kaggle.com/tug004/3wine-classification-dataset)</span></span>
<span id="cb1-4">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"./data/wine.csv"</span>)</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Target values (wine classes) in y</span></span>
<span id="cb1-7">y_s <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Wine"</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">map</span>({<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>})  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Convert classes 1, 2, 3 to 0, 1, 2 to avoid strange behavior </span></span>
<span id="cb1-8">y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y_s.values</span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Feature columns </span></span>
<span id="cb1-11">X_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.drop(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Wine"</span>])</span>
<span id="cb1-12">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_df.values</span>
<span id="cb1-13">attribute_names <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_df.columns</span></code></pre></div></div>
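<p>The examples later in this post fit models on a two-feature array named <code>X_2d</code>, whose definition is not shown in this section. Here is a minimal sketch of how such an array could be built, assuming it holds the <code>Color.int</code> and <code>Phenols</code> columns (the axis labels used in the plots). The small DataFrame below is a made-up stand-in for the wine data, so the values are purely illustrative.</p>

```python
import pandas as pd

# Tiny stand-in for the wine feature DataFrame; the values are made up.
X_df = pd.DataFrame({
    "Alcohol": [14.2, 13.2, 12.4, 13.7],
    "Color.int": [5.6, 4.4, 2.1, 9.9],
    "Phenols": [2.8, 2.6, 2.0, 1.5],
})

# Assumption: X_2d keeps only the two columns used as axis labels in the plots.
X_2d = X_df[["Color.int", "Phenols"]].values
print(X_2d.shape)
```

<p>Restricting the data to two features is what makes it possible to draw decision regions on a flat plot.</p>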
</section>
<section id="mlxtend-functionalities" class="level2">
<h2 class="anchored" data-anchor-id="mlxtend-functionalities">MLxtend Functionalities</h2>
<section id="create-counterfactual-for-model-interpretability" class="level3">
<h3 class="anchored" data-anchor-id="create-counterfactual-for-model-interpretability">Create Counterfactual (for model interpretability)</h3>
<p>To create counterfactual records (in the context of machine learning), we modify the features of some records from the training set so that the model prediction changes <span class="citation" data-cites="online_rasbt_mlxtend_counterfactual">see [2]</span>. This can help explain the behavior of a trained model. The algorithm the library uses to create counterfactual records was developed by Wachter&nbsp;<em>et al.</em> <span class="citation" data-cites="wachter2017counterfactual">see [3]</span>.</p>
<p>You can create counterfactual records using&nbsp;<em><a href="https://rasbt.github.io/mlxtend/user_guide/evaluate/create_counterfactual/">create_counterfactual()</a></em>&nbsp;from the library. Note that this implementation works with any scikit-learn estimator that supports the&nbsp;<code>predict()</code>&nbsp;method. Below is an example of creating a counterfactual record for an ML model. The counterfactual record is highlighted as a red dot within the classifier’s decision regions (we will go over how to draw decision regions of classifiers later in the post).</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.linear_model <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> LogisticRegression</span>
<span id="cb2-2">clf_logistic_regression <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LogisticRegression(random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb2-3">clf_logistic_regression.fit(X_2d, y)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> mlxtend.evaluate <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> create_counterfactual</span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> mlxtend.plotting <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> plot_decision_regions</span>
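<span id="cb3-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>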
<span id="cb3-4">counterfact <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> create_counterfactual(</span>
<span id="cb3-5">    x_reference<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>X_2d[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>], </span>
<span id="cb3-6">    y_desired<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Desired class</span></span>
<span id="cb3-7">    model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>clf_logistic_regression, </span>
<span id="cb3-8">    X_dataset<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>X_2d,</span>
<span id="cb3-9">    y_desired_proba<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.95</span>,</span>
<span id="cb3-10">    lammbda<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, </span>
<span id="cb3-11">    random_seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span></span>
<span id="cb3-12">)</span>
<span id="cb3-13"></span>
<span id="cb3-14">scatter_highlight_defaults <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb3-15">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'c'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'red'</span>,</span>
<span id="cb3-16">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'edgecolor'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'yellow'</span>,</span>
<span id="cb3-17">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'alpha'</span>: <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>,</span>
<span id="cb3-18">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'linewidths'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb3-19">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'marker'</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'o'</span>,</span>
<span id="cb3-20">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'s'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">120</span></span>
<span id="cb3-21">}</span>
<span id="cb3-22"></span>
<span id="cb3-23">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb3-24">plot_decision_regions(X_2d, y, clf<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>clf_logistic_regression, legend<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax)</span>
<span id="cb3-25"></span>
<span id="cb3-26">ax.tick_params(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'both'</span>, which<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'major'</span>, labelsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span>)</span>
<span id="cb3-27">ax.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Create a Counterfactual Record"</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>)</span>
<span id="cb3-28">ax.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Color.int"</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>)</span>
<span id="cb3-29">ax.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Phenols"</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>)</span>
<span id="cb3-30"></span>
<span id="cb3-31">ax.scatter(</span>
<span id="cb3-32">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>counterfact,</span>
<span id="cb3-33">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>scatter_highlight_defaults</span>
<span id="cb3-34">)</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/mlxtend-library-for-data-science/img/counterfactual_instance.png" class="img-fluid figure-img" alt="A counterfactual record is highlighted within a classifier's decision region"></p>
<figcaption>A counterfactual record is highlighted within a classifier’s decision region</figcaption>
</figure>
</div>
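<p>To make the idea behind the search more concrete, here is a rough sketch of the kind of objective described by Wachter <em>et al.</em>: a weighted squared gap between the predicted and desired probability, plus a distance to the reference record scaled by each feature's median absolute deviation. This is not MLxtend's code; the toy data, the helper name <code>counterfactual_loss</code>, and the parameter name <code>lam</code> are assumptions made for illustration.</p>

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy three-class data standing in for the two wine features
X_toy, y_toy = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)
model = LogisticRegression(random_state=0).fit(X_toy, y_toy)

# Median absolute deviation per feature, used to scale the distance term
mad = np.median(np.abs(X_toy - np.median(X_toy, axis=0)), axis=0)


def counterfactual_loss(x_candidate, x_reference, y_desired, y_desired_proba, lam):
    """Hypothetical helper: prediction gap plus MAD-scaled Manhattan distance."""
    proba = model.predict_proba(x_candidate.reshape(1, -1))[0, y_desired]
    prediction_term = lam * (proba - y_desired_proba) ** 2
    distance_term = np.sum(np.abs(x_candidate - x_reference) / mad)
    return prediction_term + distance_term


x_ref = X_toy[0]
class_2_mean = X_toy[y_toy == 2].mean(axis=0)

# Loss at the reference record itself (distance term is zero)
print(counterfactual_loss(x_ref, x_ref, 2, 0.95, lam=1.0))
# Loss at the mean of class 2: likely a smaller gap, but a larger distance
print(counterfactual_loss(class_2_mean, x_ref, 2, 0.95, lam=1.0))
```

<p>A counterfactual search minimizes a loss like this over candidate records, so a larger <code>lam</code> pushes harder toward the desired prediction while a smaller one keeps the result closer to the reference record.</p>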
<p><strong>PCA Correlation Circle</strong></p>
<p>An interesting and different way to look at PCA results is through a correlation circle, which can be plotted using&nbsp;<em><a href="https://rasbt.github.io/mlxtend/user_guide/plotting/plot_pca_correlation_graph/">plot_pca_correlation_graph()</a></em>. We compute the correlation between the original dataset columns and the PCs (principal components). These correlations are then plotted as vectors on a unit circle. The axes of the circle are the selected dimensions (<em>a.k.a.</em>&nbsp;PCs). You can specify the PCs you’re interested in by passing them as a tuple to the&nbsp;<code>dimensions</code> argument. The correlation circle axes labels show the percentage of the&nbsp;<a href="https://en.wikipedia.org/wiki/Explained_variation">explained variance</a>&nbsp;for the corresponding PC <span class="citation" data-cites="raschka2018mlxtend">see [1]</span>.</p>
<p>Remember that normalization is important in PCA because PCA projects the original data onto the directions that maximize the variance, so features measured on larger scales would otherwise dominate the components.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> mlxtend.plotting <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> plot_pca_correlation_graph</span>
<span id="cb4-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> StandardScaler</span>
<span id="cb4-3"></span>
<span id="cb4-4">X_norm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> StandardScaler().fit_transform(X) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Normalizing the feature columns is recommended (X - mean) / std</span></span>
<span id="cb4-5"></span>
<span id="cb4-6">fig, correlation_matrix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plot_pca_correlation_graph(</span>
<span id="cb4-7">    X_norm, </span>
<span id="cb4-8">    attribute_names,</span>
<span id="cb4-9">    dimensions<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb4-10">    figure_axis_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span></span>
<span id="cb4-11">)</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/mlxtend-library-for-data-science/img/correlation_circle_dim1_vs_dim2.png" class="img-fluid figure-img" alt="PCA correlation circle diagram."></p>
<figcaption>PCA correlation circle diagram between the first two principal components and all data attributes</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/mlxtend-library-for-data-science/img/pca_correlation_matrix.png" class="img-fluid figure-img"></p>
<figcaption>Correlation matrix between wine features and the first two PCs</figcaption>
</figure>
</div>
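<p>The correlations behind such a circle can also be computed by hand. Below is a minimal sketch, assuming standardized features and scikit-learn's <code>PCA</code>; it is not how MLxtend computes them internally, and the toy data is made up for illustration.</p>

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))

# Four made-up features: two driven by each latent factor, plus noise
X_toy = np.column_stack([
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 0] + 0.5 * rng.normal(size=200),
    base[:, 1] + 0.1 * rng.normal(size=200),
    -base[:, 1] + 0.5 * rng.normal(size=200),
])

X_std = StandardScaler().fit_transform(X_toy)
scores = PCA(n_components=2).fit_transform(X_std)  # PC1 and PC2 scores

# Correlation of every original feature (rows) with each PC (columns)
corr = np.array([
    [np.corrcoef(X_std[:, j], scores[:, k])[0, 1] for k in range(scores.shape[1])]
    for j in range(X_std.shape[1])
])
print(corr.round(2))
```

<p>Each row of <code>corr</code> gives the coordinates of one feature's arrow on the circle; arrows that reach close to the circle's edge belong to features that are well represented by the two selected PCs.</p>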
</section>
<section id="bias-variance-decomposition" class="level3">
<h3 class="anchored" data-anchor-id="bias-variance-decomposition">Bias-Variance Decomposition</h3>
<p>You often hear about the bias-variance tradeoff when model performance is discussed. In supervised learning, the goal is often to minimize both the bias error (to prevent underfitting) and the variance (to prevent overfitting) so that the model can generalize beyond the training set <span class="citation" data-cites="wiki:bias_variance">see [4]</span>. Balancing the two is known as the bias-variance tradeoff.</p>
<p>Note that we cannot calculate the actual bias and variance for a predictive model; the bias-variance tradeoff is a concept that an ML engineer should always keep in mind while trying to find a sweet spot between the two. Having said that, we can still study the model’s expected generalization error for certain problems. In particular, we can use the bias-variance decomposition to decompose the generalization error into a sum of 1) bias, 2) variance, and 3)&nbsp;<em>irreducible error</em> [4,5].</p>
<p>The bias-variance decomposition can be implemented through&nbsp;<em><a href="https://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/">bias_variance_decomp()</a></em>&nbsp;in the library. An example of such an implementation for a decision tree classifier is given below.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.tree <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DecisionTreeClassifier</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> train_test_split</span>
<span id="cb5-3"></span>
<span id="cb5-4">X_train, X_test, y_train, y_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_test_split(X_df.values, y,</span>
<span id="cb5-5">                                                    test_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>,</span>
<span id="cb5-6">                                                    random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>,</span>
<span id="cb5-7">                                                    shuffle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb5-8">                                                    stratify<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>y)</span>
<span id="cb5-9">tree <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DecisionTreeClassifier(random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> mlxtend.evaluate <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> bias_variance_decomp</span>
<span id="cb6-2"></span>
<span id="cb6-3">avg_expected_loss, avg_bias, avg_var <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bias_variance_decomp(</span>
<span id="cb6-4">        tree, X_train, y_train, X_test, y_test, </span>
<span id="cb6-5">        loss<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mse'</span>,</span>
<span id="cb6-6">        num_rounds<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Number of bootstrap rounds for implementing the decomposition</span></span>
<span id="cb6-7">        random_seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span></span>
<span id="cb6-8">)</span>
<span id="cb6-9"></span>
<span id="cb6-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Average expected loss: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>avg_expected_loss<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb6-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Average bias: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>avg_bias<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb6-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Average variance: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>avg_var<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Average expected loss: <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.108</span> </span>
<span id="cb7-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Average bias: <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.032</span> </span>
<span id="cb7-3"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Average variance: <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.076</span></span></code></pre></div></div>
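<p>Notice that the reported bias and variance add up to the expected loss (0.032 + 0.076 = 0.108). The sketch below shows why that happens for squared error, using bootstrap rounds on a made-up regression problem. It is a simplified illustration under my own assumptions (toy data, a decision tree regressor, hypothetical variable names), not MLxtend's implementation.</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(123)

# Made-up regression data: a noisy sine curve
X_tr = rng.uniform(-3, 3, size=(200, 1))
y_tr = np.sin(X_tr[:, 0]) + rng.normal(scale=0.3, size=200)
X_te = rng.uniform(-3, 3, size=(100, 1))
y_te = np.sin(X_te[:, 0]) + rng.normal(scale=0.3, size=100)

num_rounds = 50
all_pred = np.zeros((num_rounds, y_te.size))
for r in range(num_rounds):
    # Bootstrap: resample the training set with replacement, refit, predict
    idx = rng.integers(0, X_tr.shape[0], size=X_tr.shape[0])
    model = DecisionTreeRegressor(random_state=0).fit(X_tr[idx], y_tr[idx])
    all_pred[r] = model.predict(X_te)

main_pred = all_pred.mean(axis=0)                 # average prediction per test point
avg_expected_loss = np.mean((all_pred - y_te) ** 2)
avg_bias = np.mean((main_pred - y_te) ** 2)       # squared bias, measured against noisy labels
avg_var = np.mean((all_pred - main_pred) ** 2)

print(f"Average expected loss: {avg_expected_loss:.3f}")
print(f"Average bias: {avg_bias:.3f}")
print(f"Average variance: {avg_var:.3f}")
```

<p>Because the bias term here is measured against the noisy test labels, it also absorbs the irreducible error, which is why only two terms appear in the sum.</p>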
</section>
<section id="plotting-decision-regions-of-classifiers" class="level3">
<h3 class="anchored" data-anchor-id="plotting-decision-regions-of-classifiers">Plotting Decision Regions of Classifiers</h3>
<p>The MLxtend library has an out-of-the-box function,&nbsp;<em><a href="https://rasbt.github.io/mlxtend/user_guide/plotting/plot_decision_regions/">plot_decision_regions()</a></em>,&nbsp;to draw a classifier’s decision regions in 1 or 2 dimensions.</p>
<p>Here, I will draw decision regions for several scikit-learn and MLxtend models. Let’s first import the models and initialize them.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Models</span></span>
<span id="cb8-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.linear_model <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> LogisticRegression</span>
<span id="cb8-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.ensemble <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> RandomForestClassifier</span>
<span id="cb8-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.naive_bayes <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GaussianNB </span>
<span id="cb8-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> mlxtend.classifier <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> EnsembleVoteClassifier </span>
<span id="cb8-6"></span>
<span id="cb8-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Initializing Classifiers</span></span>
<span id="cb8-8">clf_logistic_regression <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LogisticRegression(random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb8-9">clf_nb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GaussianNB()</span>
<span id="cb8-10">clf_random_forest <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomForestClassifier(random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb8-11">clf_ensemble <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> EnsembleVoteClassifier(</span>
<span id="cb8-12">    clfs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[clf_logistic_regression, clf_nb, clf_random_forest], </span>
<span id="cb8-13">    weights<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], </span>
<span id="cb8-14">    voting<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'soft'</span></span>
<span id="cb8-15">)</span>
<span id="cb8-16"></span>
<span id="cb8-17">all_classifiers <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb8-18">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Logistic Regression"</span>, clf_logistic_regression),</span>
<span id="cb8-19">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Naive Bayes"</span>, clf_nb),</span>
<span id="cb8-20">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Random Forest"</span>, clf_random_forest),</span>
<span id="cb8-21">    (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Ensemble"</span>, clf_ensemble),</span>
<span id="cb8-22">]</span></code></pre></div></div>
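<p>To see what <code>voting='soft'</code> with <code>weights=[2, 1, 1]</code> means in practice, here is a small sketch of weighted soft voting: the class probabilities of the individual models are averaged using the weights, and the class with the highest average wins. The probabilities below are made up for a single sample, and the snippet only illustrates the idea rather than reproducing the library's code.</p>

```python
import numpy as np

# Made-up class probabilities for one sample from three hypothetical classifiers
probas = np.array([
    [0.70, 0.20, 0.10],  # logistic regression
    [0.30, 0.60, 0.10],  # naive Bayes
    [0.25, 0.55, 0.20],  # random forest
])
weights = [2, 1, 1]

# Weighted average of the probabilities across classifiers, then pick the top class
weighted_avg = np.average(probas, axis=0, weights=weights)
unweighted_avg = probas.mean(axis=0)

print(weighted_avg, weighted_avg.argmax())
print(unweighted_avg, unweighted_avg.argmax())
```

<p>With these numbers, doubling the weight of the first classifier flips the predicted class from 1 to 0, which shows how much the weights can shape the ensemble's decision regions.</p>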
<p>Now that we have initialized all the classifiers, let’s train the models and draw decision boundaries using <a href="https://rasbt.github.io/mlxtend/user_guide/plotting/plot_decision_regions/"><em>plot_decision_regions()</em></a> from the MLxtend library.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> mlxtend.plotting <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> plot_decision_regions</span>
<span id="cb9-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> itertools <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> product  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Used to generate indices for figure subplots!</span></span>
<span id="cb9-3"></span>
<span id="cb9-4">fig, axs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span>), sharey<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb9-5"></span>
<span id="cb9-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> classifier, grid <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(</span>
<span id="cb9-7">    all_classifiers,</span>
<span id="cb9-8">    product([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># generate [(0, 0), (0, 1), (1, 0), (1, 1)]</span></span>
<span id="cb9-9">):</span>
<span id="cb9-10">    clf_name, clf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> classifier[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], classifier[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb9-11">    ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> axs[grid[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], grid[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span>
<span id="cb9-12"></span>
<span id="cb9-13">    clf.fit(X_2d, y)</span>
<span id="cb9-14">    </span>
<span id="cb9-15">    plot_decision_regions(</span>
<span id="cb9-16">        X<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>X_2d, </span>
<span id="cb9-17">        y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>y, </span>
<span id="cb9-18">        clf<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>clf, </span>
<span id="cb9-19">        legend<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, </span>
<span id="cb9-20">        ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax</span>
<span id="cb9-21">    )</span>
<span id="cb9-22"></span>
<span id="cb9-23">    ax.set_title(clf_name, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>)</span>
<span id="cb9-24">    ax.tick_params(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'both'</span>, which<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'major'</span>, labelsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>)</span>
<span id="cb9-25">    ax.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Color.int"</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>)</span>
<span id="cb9-26">    ax.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Phenols"</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>)</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/mlxtend-library-for-data-science/img/all_classifiers_decision_regions.png" class="img-fluid figure-img" alt="Decision regions of all classifiers"></p>
<figcaption>Decision regions of all classifiers</figcaption>
</figure>
</div>
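<p>The loop above pairs each classifier with one cell of the 2×2 subplot grid. The sketch below isolates that pairing using only the standard library; the classifier names are hypothetical placeholders, not the ones used in this post.</p>

```python
from itertools import product

# Hypothetical placeholder names standing in for the (name, estimator) pairs.
classifier_names = ["clf_a", "clf_b", "clf_c", "clf_d"]

# product([0, 1], [0, 1]) yields the (row, column) index of every cell in a 2x2 grid.
grid_positions = list(product([0, 1], [0, 1]))

# zip pairs each classifier with one subplot position, in row-major order.
placement = dict(zip(classifier_names, grid_positions))

for name, (row, col) in placement.items():
    print(f"{name} -> axs[{row}, {col}]")
```

<p>Because <code>zip</code> stops at the shorter input, a fifth classifier would be silently skipped with a 2×2 grid, so the grid size should match the number of classifiers.</p>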
</section>
<section id="matrix-of-scatter-plots" class="level3">
<h3 class="anchored" data-anchor-id="matrix-of-scatter-plots">Matrix of Scatter Plots</h3>
<p>Another useful tool from MLxtend is the ability to draw a matrix of scatter plots for features (using&nbsp;<em><a href="https://rasbt.github.io/mlxtend/user_guide/plotting/scatterplotmatrix/">scatterplotmatrix()</a></em>). In order to add another dimension to the scatter plots, we can also assign different colors for different target classes.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> mlxtend.plotting <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> scatterplotmatrix</span>
<span id="cb10-2"></span>
<span id="cb10-3">fig, axes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> scatterplotmatrix(X[y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>], figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">34</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>), alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb10-4">fig, axes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> scatterplotmatrix(X[y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], fig_axes<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(fig, axes), alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb10-5">fig, axes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> scatterplotmatrix(X[y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>], fig_axes<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(fig, axes), alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, names<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>attribute_names)</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/mlxtend-library-for-data-science/img/matrix_scatter_plots.png" class="img-fluid figure-img" alt="Scatter plots of all wine attributes."></p>
<figcaption>A matrix of scatter plots of all wine attributes, with different colors for wine types</figcaption>
</figure>
</div>
<p>You can also draw similar scatter plots with pandas’ <em><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.scatter_matrix.html">scatter_matrix()</a></em> or seaborn’s <em><a href="https://seaborn.pydata.org/generated/seaborn.pairplot.html">pairplot()</a></em> function.</p>
</section>
<section id="bootstrapping" class="level3">
<h3 class="anchored" data-anchor-id="bootstrapping">Bootstrapping</h3>
<p>The&nbsp;<a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrap</a>&nbsp;is an easy way to estimate a sample statistic and generate the corresponding confidence interval by drawing&nbsp;<a href="https://en.wikipedia.org/wiki/Sampling_(statistics)#Replacement_of_selected_units">random samples with replacement</a>. For this, you can use the&nbsp;<em><a href="https://rasbt.github.io/mlxtend/user_guide/evaluate/bootstrap/">bootstrap()</a></em>&nbsp;function from the library. Note that you can pass a custom statistic to the bootstrap function through argument&nbsp;<code>func</code>. The custom function must return a scalar value.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> mlxtend.evaluate <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> bootstrap</span>
<span id="cb11-2"></span>
<span id="cb11-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate 100 random data points with a mean of 5</span></span>
<span id="cb11-4">random_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.RandomState(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>).normal(loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.</span>, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb11-5"></span>
<span id="cb11-6">avg, std_err, ci_bounds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bootstrap(</span>
<span id="cb11-7">    random_data, </span>
<span id="cb11-8">    num_rounds<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, </span>
<span id="cb11-9">    func<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>np.mean,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A function to compute a sample statistic can be passed here</span></span>
<span id="cb11-10">    ci<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.95</span>, </span>
<span id="cb11-11">    seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span></span>
<span id="cb11-12">)</span>
<span id="cb11-13"></span>
<span id="cb11-14"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(</span>
<span id="cb11-15">    <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Mean: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>avg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb11-16">    <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Standard Error: +/- </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>std_err<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb11-17">    <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"CI95: [</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ci_bounds[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ci_bounds[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">]"</span></span>
<span id="cb11-18">)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Mean: <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.03</span></span>
<span id="cb12-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Standard Error: <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+/-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.11</span></span>
<span id="cb12-3"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> CI95: [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.8</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.26</span>]</span></code></pre></div></div>
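<p>To make the idea concrete, here is a minimal standard-library sketch of a percentile bootstrap. It is an illustration of the general technique under simple assumptions, not necessarily how MLxtend computes its interval; the helper name <code>percentile_bootstrap</code> and its parameters are made up for this example.</p>

```python
import random
import statistics


def percentile_bootstrap(data, func=statistics.mean, num_rounds=1000, ci=0.95, seed=None):
    """Return (estimate, standard error, (lower, upper)) for func over data."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on num_rounds resamples drawn with replacement.
    replicates = sorted(func(rng.choices(data, k=n)) for _ in range(num_rounds))
    estimate = statistics.mean(replicates)
    std_err = statistics.stdev(replicates)
    # Percentile interval: cut (1 - ci) / 2 of the replicates from each tail.
    tail = (1.0 - ci) / 2.0
    lower = replicates[int(tail * num_rounds)]
    upper = replicates[min(int((1.0 - tail) * num_rounds), num_rounds - 1)]
    return estimate, std_err, (lower, upper)


# 100 normally distributed points with a mean of 5 (illustrative data).
data_rng = random.Random(123)
sample = [data_rng.gauss(5.0, 1.0) for _ in range(100)]

estimate, std_err, (lower, upper) = percentile_bootstrap(sample, seed=123)
print(f"Mean: {estimate:.2f}, SE: {std_err:.2f}, CI95: [{lower:.2f}, {upper:.2f}]")
```

<p>Any function that maps a list of numbers to a scalar can be passed as <code>func</code>, for example <code>statistics.median</code>.</p>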
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 You can download a one-page summary of this post <a href="./mlxtend-one-page-summary.pdf">here</a>.</p>
</div>
</div>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>In this post, we went over several MLxtend functionalities: creating counterfactual instances for better model interpretability, plotting decision regions for classifiers, drawing a PCA correlation circle, analyzing the bias-variance tradeoff through decomposition, drawing a matrix of scatter plots of features with colored targets, and bootstrapping. The library is a nice addition to your data science toolbox, and I recommend giving it a try.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>📓 You can find the Jupyter notebook for this blog post on <a href="https://github.com/e-alizadeh/medium/blob/master/notebooks/MLxtend.ipynb">GitHub</a>.</p>
</div>
</div>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-raschka2018mlxtend" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">S. Raschka, <span>“MLxtend: Providing machine learning and data science utilities and extensions to python’s scientific computing stack,”</span> <em>The Journal of Open Source Software</em>, vol. 3, no. 24, Apr. 2018, doi: <a href="https://doi.org/10.21105/joss.00638">10.21105/joss.00638</a>.</div>
</div>
<div id="ref-online_rasbt_mlxtend_counterfactual" class="csl-entry">
<div class="csl-left-margin">[2] </div><div class="csl-right-inline">S. Raschka, <span>“Create_counterfactual: Interpreting models via counterfactuals.”</span> N/A. Accessed: Jul. 17, 2021. [Online]. Available: <a href="https://rasbt.github.io/mlxtend/user_guide/evaluate/create_counterfactual/">https://rasbt.github.io/mlxtend/user_guide/evaluate/create_counterfactual/</a></div>
</div>
<div id="ref-wachter2017counterfactual" class="csl-entry">
<div class="csl-left-margin">[3] </div><div class="csl-right-inline">S. Wachter, B. Mittelstadt, and C. Russell, <span>“Counterfactual explanations without opening the black box: Automated decisions and the GDPR,”</span> <em>Harvard Journal of Law &amp; Technology</em>, vol. 31, p. 841, 2017, Available: <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3063289">https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3063289</a></div>
</div>
<div id="ref-wiki:bias_variance" class="csl-entry">
<div class="csl-left-margin">[4] </div><div class="csl-right-inline">Wikipedia, <span>“Bias–variance tradeoff.”</span> <a href="https://en.wikipedia.org/wiki/Bias-variance_tradeoff" class="uri">https://en.wikipedia.org/wiki/Bias-variance_tradeoff</a>, 2021-07-17.</div>
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p><a href="https://rasbt.github.io/mlxtend/">MLxtend Documentation</a>↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2021,
  author = {Alizadeh, Esmaeil},
  title = {MLxtend: {A} {Python} {Library} with {Interesting} {Tools}
    for {Data} {Science} {Tasks}},
  date = {2021-07-17},
  url = {https://ealizadeh.com/blog/mlxtend-library-for-data-science/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2021" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“MLxtend: A Python Library with Interesting Tools for
Data Science Tasks,”</span> Jul. 17, 2021. <a href="https://ealizadeh.com/blog/mlxtend-library-for-data-science/">https://ealizadeh.com/blog/mlxtend-library-for-data-science/</a></div>
</div></div></section></div> ]]></description>
  <category>Data Science</category>
  <category>Exploratory Data Analysis</category>
  <category>Machine Learning</category>
  <category>Python Library</category>
  <guid>https://ealizadeh.com/blog/mlxtend-library-for-data-science/</guid>
  <pubDate>Sat, 17 Jul 2021 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/mlxtend-library-for-data-science/img/_featured_image.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Step-by-Step Deployment of a Free PostgreSQL Database And Data Ingestion</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/deploy-postgresql-db-heroku/</link>
  <description><![CDATA[ 






<p><img src="https://ealizadeh.com/blog/deploy-postgresql-db-heroku/img/_featured_image.png" class="img-fluid"></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<strong><a href="https://towardsdatascience.com/deploy-free-postgresql-database-in-heroku-and-ingest-data-8002c574a57d">Towards Data Science blog</a></strong>.</p>
</div>
</div>
<section id="in-this-post-you-will-learn-how-to" class="level3">
<h3 class="anchored" data-anchor-id="in-this-post-you-will-learn-how-to">In this post, you will learn how to …</h3>
<ul>
<li><em>Deploy a free PostgreSQL database in Heroku</em></li>
<li><em>Generate a Heroku API token (in two ways)</em></li>
<li><em>Dynamically retrieve Heroku database URL (useful to overcome a shortcoming of the free plan)</em></li>
<li><em>Ingest data into a table in the database using Pandas and SQLAlchemy</em></li>
<li><em>Specify data type of columns in the table using SQLAlchemy datatypes</em></li>
</ul>
</section>
<section id="one-line-summary-of-related-technologies" class="level2">
<h2 class="anchored" data-anchor-id="one-line-summary-of-related-technologies">One-line summary of related technologies</h2>
<p><a href="https://www.postgresql.org/">PostgreSQL</a>: a free and open-source object-relational database management system that emphasizes extensibility and SQL compliance.</p>
<p><a href="https://www.heroku.com/">Heroku</a>: a platform as a service (PaaS) suitable for quick deployments with minimal needed DevOps experience.</p>
<p><a href="https://www.sqlalchemy.org/">SQLAlchemy</a>: a Python SQL library and Object Relational Mapper (ORM) to interact with databases.</p>
<p><a href="https://pandas.pydata.org/">Pandas</a>: An open-source Python library for data analysis and manipulation.</p>
</section>
<section id="prerequisite" class="level1">
<h1>Prerequisite</h1>
<p>You will need the following Python libraries:</p>
<ul>
<li><a href="https://pandas.pydata.org/">pandas</a></li>
<li><a href="https://www.psycopg.org/">psycopg2</a></li>
<li><a href="https://www.sqlalchemy.org/">SQLAlchemy</a></li>
</ul>
<p>And also</p>
<ul>
<li><a href="https://devcenter.heroku.com/articles/heroku-cli">Heroku CLI</a> (verify your installation by entering <code>heroku --version</code> in the terminal)</li>
</ul>
<hr>
</section>
<section id="heroku-signup-and-deployment" class="level1">
<h1>Heroku Signup and Deployment</h1>
<p>Heroku is a platform as a service (PaaS) that enables developers to build and run applications entirely in the cloud. Heroku offers a ready-to-use environment that makes it very simple to deploy your code as quickly as possible with little development experience. This is an excellent choice for beginners and small to medium-sized companies, unlike AWS, which usually requires experienced developers and has complicated deployment processes.</p>
<section id="sign-up-to-heroku-deploy-your-first-postgresql-database" class="level2">
<h2 class="anchored" data-anchor-id="sign-up-to-heroku-deploy-your-first-postgresql-database">1. Sign up to Heroku &amp; Deploy Your First PostgreSQL Database</h2>
<p>You can <a href="https://signup.heroku.com/">sign up</a> for Heroku for free. After signing up and logging in to your account, you will be directed to the Heroku Dashboard. Then, you can follow the instructions in the following clip to create a new app and add a PostgreSQL database.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://player.vimeo.com/video/754955742" frameborder="0" title="" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen=""></iframe></div>
<p>The free plan allows you to have a maximum of 20,000 rows of data and up to 20 connections to the database. This plan is usually enough for a small personal project.</p>
<div class="callout callout-style-default callout-caution callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Caution
</div>
</div>
<div class="callout-body-container callout-body">
<p>In the free plan, the database credentials will occasionally change, since Heroku rotates credentials periodically and sometimes performs maintenance.</p>
</div>
</div>
<p>To address the issue of occasional changes in the database credentials, we can use Heroku CLI to retrieve the database URL dynamically. But first, let’s go over the procedure for logging in to your account via Heroku CLI.</p>
</section>
<section id="access-your-heroku-account-using-token" class="level2">
<h2 class="anchored" data-anchor-id="access-your-heroku-account-using-token">2. Access your Heroku account using Token</h2>
<p>What’s covered in this section applies to working with any Heroku application through the <strong>Heroku CLI</strong>.</p>
<section id="generate-a-heroku-api-token" class="level3">
<h3 class="anchored" data-anchor-id="generate-a-heroku-api-token">2.1. Generate a Heroku API Token</h3>
<p>You can generate the token in the following two ways:</p>
<p><strong>2.1.1. Heroku account (browser)</strong></p>
<p>Go to <strong>Account settings → Applications</strong>. Under the <strong>Authorizations</strong> section, click on <strong>Create authorization</strong>. In the window that opens, enter a description and either set an expiry time or leave the box blank so that the token never expires.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/deploy-postgresql-db-heroku/img/heroku_token_generation_web-app.png" class="img-fluid figure-img" alt="Create a Heroku API token from Heroku dashboard"></p>
<figcaption>Create a Heroku API token from Heroku dashboard</figcaption>
</figure>
</div>
<p><strong>2.1.2. Heroku CLI (terminal)</strong></p>
<p>After installing the Heroku CLI, the first time you try to use a command that requires access to your account, you will be prompted to log in to your Heroku account on your browser. Once you’re logged in, you can do almost anything through the Heroku API. For example, we can create a token by running the following:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1">heroku authorizations:create</span></code></pre></div></div>
<p>The above command will generate a long-lived token for you. If you are not logged in yet, you will be prompted to log in to your account in a browser; once you have logged in, return to the terminal to see the generated token, as shown below.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/deploy-postgresql-db-heroku/img/heroku_token_generation_cli.png" class="img-fluid figure-img" alt="Generate a Heroku API token via Heroku CLI"></p>
<figcaption>Generate a Heroku API token via Heroku CLI</figcaption>
</figure>
</div>
</section>
<section id="store-your-heroku-token-in-your-environment" class="level3">
<h3 class="anchored" data-anchor-id="store-your-heroku-token-in-your-environment">2.2. Store your Heroku token in your environment</h3>
<p>Now that you have your Heroku API token, you need to set it in your terminal/environment as <code>HEROKU_API_KEY</code>. You can achieve this by running the following in your terminal:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">export</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">HEROKU_API_KEY</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=&lt;</span>your_token<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div></div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>A variable set in a shell terminal is only available in the terminal where you set it and is lost when you close that terminal. Instead, you can put the above command in your <code>~/.bash_profile</code> or <code>~/.bashrc</code> file so that the variable is available in any new terminal you open. This way, you don’t need to worry about setting this variable again!</p>
</div>
</div>
<p>Once you have the <code>HEROKU_API_KEY</code> variable set in your terminal, you no longer need to use the web-based authentication or a username and password to log in. This is particularly important if you want to use the Heroku CLI as part of an automation process or CI/CD. This way, you don’t need to log in each time, and the same token works in any terminal where the variable is set.</p>
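<p>In an automated script, it can help to fail early with a clear message when the token is missing. The following is a minimal sketch using only the standard library; the helper name <code>get_heroku_token</code> is hypothetical and not part of the Heroku CLI or any library.</p>

```python
import os


def get_heroku_token(env=None):
    """Return the Heroku API token from the environment or raise a clear error."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    token = env.get("HEROKU_API_KEY", "").strip()
    if not token:
        raise RuntimeError(
            "HEROKU_API_KEY is not set; export it before calling the Heroku CLI."
        )
    return token


# Usage with an explicit mapping (a made-up value, not a real token).
token = get_heroku_token({"HEROKU_API_KEY": "example-token"})
print(f"Token found ({len(token)} characters)")
```

<p>Calling <code>get_heroku_token()</code> without arguments reads the variable exported in the current shell.</p>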
</section>
<section id="retrieve-heroku-postgresql-database-url" class="level3">
<h3 class="anchored" data-anchor-id="retrieve-heroku-postgresql-database-url">2.3. Retrieve Heroku PostgreSQL Database URL</h3>
<p>You can get the database URL by running the following command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">heroku</span> config:get DATABASE_URL <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--app</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>your-app-name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div></div>
<p>This will output the database URL in the format of</p>
<p><code>postgres://&lt;db_user&gt;:&lt;db_password&gt;@&lt;db_host&gt;/&lt;db_name&gt;</code></p>
<p>We can use <a href="https://docs.python.org/3/library/subprocess.html">subprocess</a> from Python’s standard library to run the above command and retrieve the database credentials. This way, all of our code stays in Python!</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> subprocess</span>
<span id="cb4-2">heroku_app_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"your-app-name"</span></span>
<span id="cb4-3">raw_db_url <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> subprocess.run(</span>
<span id="cb4-4">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"heroku"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"config:get"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"DATABASE_URL"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--app"</span>, heroku_app_name],</span>
<span id="cb4-5">    capture_output<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># capture_output arg is added in Python 3.7</span></span>
<span id="cb4-6">).stdout </span></code></pre></div></div>
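<p>As a hedged sketch, the call above can be wrapped in a small helper that returns the output as a regular string and fails loudly if the CLI exits with a non-zero status. The names <code>run_and_capture</code> and <code>heroku_database_url</code> are hypothetical (they are not part of the original code), and the demo line uses the Python interpreter as a stand-in for the Heroku CLI so that the snippet runs without a Heroku account:</p>

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run a command and return its standard output as a stripped string.

    check=True raises CalledProcessError on a non-zero exit code, and
    text=True decodes stdout so no manual .decode() is needed.
    """
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout.strip()


def heroku_database_url(app_name):
    """Hypothetical wrapper around: heroku config:get DATABASE_URL --app NAME."""
    return run_and_capture(["heroku", "config:get", "DATABASE_URL", "--app", app_name])


# Stand-in for the Heroku CLI: a Python one-liner that prints a made-up URL
demo = run_and_capture([sys.executable, "-c", "print('postgres://user:secret@host/db')"])
print(demo)
```

<p>With the Heroku CLI installed and <code>HEROKU_API_KEY</code> set, <code>heroku_database_url("your-app-name")</code> would return the same kind of string as the demo.</p>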
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>Your Python (IPython) terminal/environment should have <code>HEROKU_API_KEY</code> set. You can verify that by running <code>os.environ["HEROKU_API_KEY"]</code> (after <code>import os</code>) and checking that the output is your token.</p>
</div>
</div>
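<p>A minimal sketch of such a check that avoids echoing the token to the screen. <code>has_heroku_token</code> is a hypothetical helper name, and it only tests that the variable exists and is non-empty, not that the token is valid:</p>

```python
import os


def has_heroku_token(env=None):
    """Return True if HEROKU_API_KEY is present and non-empty in the mapping."""
    env = os.environ if env is None else env
    return bool(env.get("HEROKU_API_KEY", "").strip())


# Reports only whether the token is set, never its value
print("HEROKU_API_KEY set:", has_heroku_token())
```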
<hr>
</section>
</section>
</section>
<section id="data-ingestion-to-a-table" class="level1">
<h1>Data Ingestion to a&nbsp;Table</h1>
<section id="create-sqlalchemy-engine" class="level2">
<h2 class="anchored" data-anchor-id="create-sqlalchemy-engine">Create SQLAlchemy Engine</h2>
<p>Before we ingest data to a table in the deployed PostgreSQL database using Pandas, we have to create an SQLAlchemy engine that will be passed to the Pandas method. The SQLAlchemy engine/connection can be created using the following code snippet:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> subprocess</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sqlalchemy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> create_engine</span>
<span id="cb5-3"></span>
<span id="cb5-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get the Database URL using Heroku CLI</span></span>
<span id="cb5-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># -------------------------------------</span></span>
<span id="cb5-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Runs the following from Python: heroku config:get DATABASE_URL --app your-app-name</span></span>
<span id="cb5-7">heroku_app_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"your-app-name"</span></span>
<span id="cb5-8"></span>
<span id="cb5-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Assumption: HEROKU_API_KEY is set in your terminal</span></span>
<span id="cb5-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># You can confirm that it's set by running the following python command os.environ["HEROKU_API_KEY"]</span></span>
<span id="cb5-11">raw_db_url <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> subprocess.run(</span>
<span id="cb5-12">    [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"heroku"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"config:get"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"DATABASE_URL"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"--app"</span>, heroku_app_name],</span>
<span id="cb5-13">    capture_output<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># capture_output arg is added in Python 3.7</span></span>
<span id="cb5-14">).stdout </span>
<span id="cb5-15"></span>
<span id="cb5-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Convert binary string to a regular string &amp; remove the newline character</span></span>
<span id="cb5-17">db_url <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> raw_db_url.decode(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ascii"</span>).strip()</span>
<span id="cb5-18"></span>
<span id="cb5-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Convert "postgres://&lt;db_address&gt;"  --&gt; "postgresql+psycopg2://&lt;db_address&gt;" needed for SQLAlchemy</span></span>
<span id="cb5-20">final_db_url <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> db_url.replace(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"postgres://"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"postgresql+psycopg2://"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># count=1 rewrites only the leading scheme; lstrip() would strip a set of characters, not a prefix</span></span>
<span id="cb5-21"></span>
<span id="cb5-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create SQLAlchemy engine</span></span>
<span id="cb5-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ------------------------</span></span>
<span id="cb5-24">engine <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> create_engine(final_db_url)</span></code></pre></div></div>
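<p>As you may note in the above, some string manipulation is required before creating the SQLAlchemy engine: SQLAlchemy 1.4 and later no longer accept the <code>postgres://</code> scheme that Heroku returns, so it has to be rewritten as <code>postgresql://</code> (here with the <code>psycopg2</code> driver spelled out).</p>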
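<p>One pitfall worth knowing for this kind of rewrite: <code>str.lstrip()</code> removes a <em>set of characters</em> rather than a prefix, so it can also eat the beginning of the user name. Below is a minimal sketch of a prefix-safe helper. The name <code>to_sqlalchemy_url</code> and the example URL are made up for illustration:</p>

```python
def to_sqlalchemy_url(db_url, driver="postgresql+psycopg2"):
    """Swap a leading 'postgres://' scheme for the given SQLAlchemy dialect+driver."""
    prefix = "postgres://"
    if not db_url.startswith(prefix):
        raise ValueError("expected a URL starting with postgres://")
    return driver + "://" + db_url[len(prefix):]


# Made-up URL whose user name starts with characters found in "postgres://"
example = "postgres://postgres:pw@host:5432/db"

safe = to_sqlalchemy_url(example)
stripped = example.lstrip("postgres://")  # strips characters, not a prefix

print(safe)      # scheme replaced, credentials intact
print(stripped)  # the user name and part of the password are gone
```

<p>The slice-based version touches only the scheme, and it raises an error if the URL does not have the expected form instead of silently producing a broken connection string.</p>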
</section>
<section id="ingest-data-using-pandas-sqlalchemy" class="level2">
<h2 class="anchored" data-anchor-id="ingest-data-using-pandas-sqlalchemy">Ingest Data using Pandas &amp; SQLAlchemy</h2>
<p>We can ingest the data into a table by simply using the pandas <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html"><code>to_sql()</code></a> method and passing the SQLAlchemy engine/connection object to it.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb6-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sqlalchemy.types <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Integer, DateTime</span>
<span id="cb6-3"></span>
<span id="cb6-4">DATA_URL <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv"</span></span>
<span id="cb6-5">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(DATA_URL)</span>
<span id="cb6-6"></span>
<span id="cb6-7">df.to_sql(</span>
<span id="cb6-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"covid19"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># table name</span></span>
<span id="cb6-9">    con<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>engine,</span>
<span id="cb6-10">    if_exists<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'replace'</span>,</span>
<span id="cb6-11">    index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># In order to avoid writing DataFrame index as a column</span></span>
<span id="cb6-12">    dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{</span>
<span id="cb6-13">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"last_updated_date"</span>: DateTime(),</span>
<span id="cb6-14">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"total_cases"</span>: Integer(),</span>
<span id="cb6-15">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"new_cases"</span>: Integer()</span>
<span id="cb6-16">    }</span>
<span id="cb6-17">)</span></code></pre></div></div>
<p>In the above example, the data types of a few columns are specified. You can set the dtype of columns by passing a dictionary whose keys are the column names and whose values are SQLAlchemy types. For all available types, you can check the SQLAlchemy <a href="https://docs.sqlalchemy.org/en/14/core/type_basics.html">documentation</a>.</p>
</section>
</section>
<section id="additional-tips" class="level1">
<h1>Additional Tips</h1>
<p>You can read a table into a pandas DataFrame using pandas <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_table.html"><code>read_sql_table()</code></a>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_sql_table(</span>
<span id="cb7-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"covid19"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># table name</span></span>
<span id="cb7-3">    con<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>engine</span>
<span id="cb7-4">)</span></code></pre></div></div>
<p>You can run raw SQL queries using the SQLAlchemy engine’s <a href="https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine"><code>.execute("&lt;your SQL query&gt;")</code></a> method (available in SQLAlchemy 1.4 and earlier; it was removed in 2.0). For instance, to drop the above table with a SQL query, you can run the following:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">engine.execute(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"DROP TABLE covid19"</span>) </span></code></pre></div></div>
<p>The call above returns an SQLAlchemy result object, which wraps the underlying database cursor.</p>
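<p>On SQLAlchemy 2.0, where <code>Engine.execute()</code> no longer exists, the same statement can be run through a connection and the <code>text()</code> construct. The sketch below assumes SQLAlchemy 1.4 or later is installed and uses an in-memory SQLite database as a stand-in for the Heroku PostgreSQL engine, so the table it creates and drops is purely illustrative:</p>

```python
from sqlalchemy import create_engine, text

# Stand-in engine; in the post this would be create_engine(final_db_url)
engine = create_engine("sqlite:///:memory:")

# engine.begin() opens a connection and commits when the block exits cleanly
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE covid19 (id INTEGER)"))
    conn.execute(text("DROP TABLE covid19"))
    # List the remaining tables using SQLite's catalog (SQLite-specific query)
    remaining = conn.execute(
        text("SELECT name FROM sqlite_master WHERE type = 'table'")
    ).fetchall()

print(remaining)
```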
<hr>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>In this post, we deployed a free PostgreSQL database using Heroku’s free plan. We also addressed the issue of Heroku changing the database credentials on the free plan by retrieving the credentials dynamically via the Heroku CLI. Using pandas’ <code>to_sql()</code> method, we quickly created a table and even specified the data types of columns via SQLAlchemy types.</p>
<hr>
</section>
<section id="useful-links" class="level1">
<h1>Useful Links</h1>
<p><a href="https://devcenter.heroku.com/articles/authentication">Heroku CLI Authentication</a></p>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html">pandas.DataFrame.to_sql - pandas 1.2.5 documentation</a></p>
<p><a href="https://docs.sqlalchemy.org/en/14/core/type_basics.html">SQLAlchemy 1.4 Documentation</a></p>
</section>
<section id="related-posts" class="level1">
<h1>Related posts</h1>
<p><a href="https://towardsdatascience.com/how-to-deploy-a-postgres-database-for-free-95cf1d8387bf">How To Deploy A Postgres Database For Free</a></p>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2021,
  author = {Alizadeh, Esmaeil},
  title = {Step-by-Step {Deployment} of a {Free} {PostgreSQL} {Database}
    {And} {Data} {Ingestion}},
  date = {2021-06-26},
  url = {https://ealizadeh.com/blog/deploy-postgresql-db-heroku/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2021" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“Step-by-Step Deployment of a Free PostgreSQL Database
And Data Ingestion,”</span> Jun. 26, 2021. <a href="https://ealizadeh.com/blog/deploy-postgresql-db-heroku/">https://ealizadeh.com/blog/deploy-postgresql-db-heroku/</a></div>
</div></div></section></div> ]]></description>
  <category>Database</category>
  <category>Guide to</category>
  <category>Python</category>
  <guid>https://ealizadeh.com/blog/deploy-postgresql-db-heroku/</guid>
  <pubDate>Sat, 26 Jun 2021 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/deploy-postgresql-db-heroku/img/_featured_image.png" medium="image" type="image/png" height="108" width="144"/>
</item>
<item>
  <title>dbt for Data Transformation - A Hands-on Tutorial</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/dbt-tutorial/</link>
  <description><![CDATA[ 






<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<strong><a href="https://www.kdnuggets.com/2021/07/dbt-data-transformation-tutorial.html">KDnuggets</a></strong>.</p>
</div>
</div>
<section id="introduction" class="level1">
<h1>Introduction</h1>
<p>dbt (data build tool) is a data transformation tool that uses SQL SELECT statements. It allows you to create complex models, use variables and macros (aka functions), run tests, generate documentation, and much more.</p>
<p>dbt does not extract or load data, but it’s powerful at transforming data that’s already available in the database: dbt does the <strong>T</strong> in ELT (Extract, Load, Transform) processes.</p>
<p>In this post, you will learn how to:</p>
<ul>
<li>Configure a dbt project</li>
<li>Create dbt models (SELECT statements)</li>
<li>Build complex dbt models using global variables and macros</li>
<li>Build complex models by referring to other dbt models</li>
<li>Run tests</li>
<li>Generate documentation</li>
</ul>
</section>
<section id="pre-requisite" class="level1">
<h1>Prerequisites</h1>
<section id="signup" class="level2">
<h2 class="anchored" data-anchor-id="signup">Signup</h2>
<p>You can sign up at <a href="https://cloud.getdbt.com/">getdbt.com</a>. The free plan works well for small projects and testing.</p>
</section>
<section id="database-with-populated-data" class="level2">
<h2 class="anchored" data-anchor-id="database-with-populated-data">Database with populated data</h2>
<p>You can check my post on <a href="https://ealizadeh.com/blog/deploy-postgresql-db-heroku">how to deploy a <em>free</em> PostgreSQL database on Heroku</a>. The post provides step-by-step instructions on how to do it.</p>
<p>You can also check the <a href="https://github.com/e-alizadeh/sample_dbt_project/blob/master/data/data_ingestion.py">data ingestion script</a> in the GitHub repo accompanying this article.</p>
<p><a href="https://github.com/e-alizadeh/sample_dbt_project">e-alizadeh/sample_dbt_project</a></p>
<p>Following the above, we have two tables in a PostgreSQL database, <code>covid_latest</code> and <code>population_prosperity</code>, that we are going to use in this post.</p>
</section>
<section id="dbt-cli-installation" class="level2">
<h2 class="anchored" data-anchor-id="dbt-cli-installation">dbt CLI Installation</h2>
<p>You can install the dbt command-line interface (CLI) by following the instructions on the following <a href="https://docs.getdbt.com/dbt-cli/installation/">dbt documentation page</a>.</p>
<hr>
</section>
</section>
<section id="basics-of-a-dbt-project" class="level1">
<h1>Basics of a dbt project</h1>
<p>There are three main things to know in order to use dbt:</p>
<ul>
<li>dbt project</li>
<li>database connection</li>
<li>dbt commands</li>
</ul>
<section id="how-to-use-dbt" class="level2">
<h2 class="anchored" data-anchor-id="how-to-use-dbt">How to use dbt?</h2>
<p>A dbt project is a directory containing <code>.sql</code> and <code>.yml</code> files. The minimum required files are:</p>
<ul>
<li>A project file named <code>dbt_project.yml</code>: This file contains configurations of a dbt project.</li>
<li>Model(s) (<code>.sql</code> files): A model in dbt is simply a single <code>.sql</code> file containing a <strong>single <code>select</code> statement</strong>.</li>
</ul>
<p><strong>Every dbt project needs a dbt_project.yml file — this is how dbt knows a directory is a dbt project. It also contains important information that tells dbt how to operate on your project.</strong></p>
<p>You can find more information about dbt projects <a href="https://docs.getdbt.com/docs/introduction#dbt-projects">here</a>.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>💡 A <strong>dbt model</strong> is basically a <code>.sql</code> file with a <strong>SELECT</strong> statement.</p>
</div>
</div>
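<p>To make the minimum requirements concrete, here is a small sketch that checks whether a directory has a <code>dbt_project.yml</code> file and at least one <code>.sql</code> model. The helper name <code>looks_like_dbt_project</code> is hypothetical, and the sketch assumes models live in the default <code>models</code> folder; it is not how dbt itself validates a project:</p>

```python
import tempfile
from pathlib import Path


def looks_like_dbt_project(project_dir, models_dir="models"):
    """Check for dbt_project.yml plus at least one .sql file under models_dir."""
    root = Path(project_dir)
    has_project_file = (root / "dbt_project.yml").is_file()
    models = root / models_dir
    has_model = models.is_dir() and any(models.rglob("*.sql"))
    return has_project_file and has_model


# Build a throwaway project in a temporary directory to exercise the check
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = looks_like_dbt_project(root)  # empty directory
    (root / "dbt_project.yml").write_text("name: 'demo'\n")
    (root / "models").mkdir()
    (root / "models" / "my_first_dbt_model.sql").write_text("select 1 as id\n")
    after = looks_like_dbt_project(root)  # project file and one model present

print(before, after)
```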
</section>
<section id="dbt-commands" class="level2">
<h2 class="anchored" data-anchor-id="dbt-commands">dbt Commands</h2>
<p>dbt commands start with <code>dbt</code> and can be executed in one of the following ways:</p>
<ul>
<li>dbt Cloud (the command section at the bottom of the dbt Cloud dashboard)</li>
<li>dbt CLI</li>
</ul>
<p>Some commands, such as <code>dbt init</code>, can only be used in the dbt CLI. The dbt commands we will use in this post are:</p>
<ul>
<li><code>dbt init</code> (only in dbt CLI)</li>
<li><code>dbt run</code></li>
<li><code>dbt test</code></li>
<li><code>dbt docs generate</code></li>
</ul>
</section>
</section>
<section id="dbt-project-setup" class="level1">
<h1>dbt Project Setup</h1>
<section id="step-1-initialize-a-dbt-project-sample-files-using-dbt-cli" class="level2">
<h2 class="anchored" data-anchor-id="step-1-initialize-a-dbt-project-sample-files-using-dbt-cli">Step 1: Initialize a dbt project (sample files) using dbt CLI</h2>
<p>You can use <a href="https://docs.getdbt.com/reference/commands/init"><code>dbt init</code></a> to generate sample files/folders. In particular, <code>dbt init project_name</code> will create the following:</p>
<ul>
<li>a&nbsp;<code>~/.dbt/profiles.yml</code>&nbsp;file if one does not already exist</li>
<li>a new folder called&nbsp;<code>[project_name]</code></li>
<li>directories and sample files necessary to get started with dbt</li>
</ul>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>Since <code>dbt init</code> generates a directory named <code>project_name</code>, you should <em>not have any existing folder with an identical name</em>, in order to avoid any conflict.</p>
</div>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/dbt-tutorial/img/dbt_init.png" class="img-fluid figure-img"></p>
<figcaption>dbt init &lt;project_name&gt;</figcaption>
</figure>
</div>
<p>The result is a directory with the following sample files.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">sample_dbt_project</span></span>
<span id="cb1-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">├──</span> README.md</span>
<span id="cb1-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">├──</span> analysis</span>
<span id="cb1-4"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">├──</span> data</span>
<span id="cb1-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">├──</span> dbt_project.yml</span>
<span id="cb1-6"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">├──</span> macros</span>
<span id="cb1-7"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">├──</span> models</span>
<span id="cb1-8"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">│&nbsp;&nbsp;</span> └── example</span>
<span id="cb1-9"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">│&nbsp;&nbsp;</span>     ├── my_first_dbt_model.sql</span>
<span id="cb1-10"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">│&nbsp;&nbsp;</span>     ├── my_second_dbt_model.sql</span>
<span id="cb1-11"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">│&nbsp;&nbsp;</span>     └── schema.yml</span>
<span id="cb1-12"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">├──</span> snapshots</span>
<span id="cb1-13"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">└──</span> tests</span></code></pre></div></div>
<p>For this post, we will keep only the minimum required files and remove the rest.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">sample_dbt_project</span></span>
<span id="cb2-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">├──</span> README.md</span>
<span id="cb2-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">├──</span> dbt_project.yml</span>
<span id="cb2-4"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">└──</span> models</span>
<span id="cb2-5"> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">&nbsp;&nbsp;</span> ├── my_first_dbt_model.sql</span>
<span id="cb2-6">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">├──</span> my_second_dbt_model.sql</span>
<span id="cb2-7"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">&nbsp;</span>   └── schema.yml</span></code></pre></div></div>
</section>
<section id="step-2-set-up-a-git-repository" class="level2">
<h2 class="anchored" data-anchor-id="step-2-set-up-a-git-repository">Step 2: Set Up a Git Repository</h2>
<p>You can use an existing repo, as specified during the setup. You can configure the repositories by following the dbt documentation <a href="https://docs.getdbt.com/docs/dbt-cloud/cloud-configuring-dbt-cloud/cloud-configuring-repositories">here</a>.</p>
<section id="or-if-you-want-to-create-a-new-repo" class="level3">
<h3 class="anchored" data-anchor-id="or-if-you-want-to-create-a-new-repo"><strong>Or, if you want to create a new repo…</strong></h3>
<p>you can create a new repository from inside the created directory, as shown below:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> init</span>
<span id="cb3-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> add .</span>
<span id="cb3-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> commit <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-m</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"first commit"</span></span>
<span id="cb3-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> remote add origin <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>repo_url<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb3-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> push <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-u</span> origin master</span></code></pre></div></div>
</section>
</section>
<section id="step-3-set-up-a-new-project-on-dbt-cloud-dashboard" class="level2">
<h2 class="anchored" data-anchor-id="step-3-set-up-a-new-project-on-dbt-cloud-dashboard">Step 3: Set Up a New Project on dbt Cloud Dashboard</h2>
<p>In the previous step, we created a sample dbt project containing sample models and configurations. Now, we want to create a new project and connect our database and repository on the dbt Cloud dashboard.</p>
<p>Before we continue, you should have</p>
<ul>
<li>some data already available in a database,</li>
<li>a repository with the files generated at the previous step</li>
</ul>
<p>You can follow the steps below to set up a new project in dbt Cloud (keep in mind that this step differs from the previous one, in which we only generated some sample files).</p>
<p></p><div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://player.vimeo.com/video/576196451" frameborder="0" title="" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen=""></iframe></div> Set up a new dbt project on dbt Cloud<p></p>
<p>The <code>dbt_project.yml</code> file for our project is shown below (you can find the complete version in the <a href="https://github.com/e-alizadeh/sample_dbt_project.git">GitHub repo</a> for this post).</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'my_new_project'</span></span>
<span id="cb4-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">version</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'1.0.0'</span></span>
<span id="cb4-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">config-version</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb4-4"></span>
<span id="cb4-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vars</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb4-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">selected_country</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> USA</span></span>
<span id="cb4-7"><span class="at" style="color: #657422;
background-color: null;
">
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">selected_year</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2019</span></span>
<span id="cb4-8"></span>
<span id="cb4-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># This setting configures which "profile" dbt uses for this project.</span></span>
<span id="cb4-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">profile</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'default'</span></span>
<span id="cb4-11"></span>
<span id="cb4-12"><span class="co" style="color: #5E5E5E;
background-color: null;
# There are other stuff">
font-style: inherit;"># Other settings are generated automatically when you run `dbt init`</span></span></code></pre></div></div>
</section>
</section>
<section id="dbt-models-and-features" class="level1">
<h1>dbt Models and Features</h1>
<section id="dbt-models" class="level2">
<h2 class="anchored" data-anchor-id="dbt-models">dbt models</h2>
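<p>Let’s create two simple dbt models that each retrieve a few columns from a table: <code>covid19_latest_stats</code> (first query) and <code>population</code> (second query).</p>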
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode sql code-with-copy"><code class="sourceCode sql"><span id="cb5-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">select</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"iso_code"</span>, <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"total_cases"</span>, <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"new_cases"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">from</span> covid_latest</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode sql code-with-copy"><code class="sourceCode sql"><span id="cb6-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">select</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"code"</span>, <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"year"</span>, <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"continent"</span>, <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"total_population"</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">from</span> population_prosperity</span></code></pre></div></div>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>The dbt model name is the filename of the SQL file in the <code>models</code> directory. The model name may differ from the table name in the database. For instance, the dbt model <code>population</code> above is the result of a <code>SELECT</code> statement on the <code>population_prosperity</code> table in the database.</p>
</div>
</div>
<section id="run-models" class="level3">
<h3 class="anchored" data-anchor-id="run-models">Run models</h3>
<p>You can run all models in your dbt project by executing <code>dbt run</code>. A sample dbt run output is shown below. You can see a summary or a detailed log of running all dbt models, which is very helpful for debugging any issues in your queries. For instance, the log below shows a failed model that throws a Postgres error.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/dbt-tutorial/img/dbr_run.png" class="img-fluid figure-img"></p>
<figcaption>Detailed log of failed <strong>jinja_and_variable_usage</strong> dbt model</figcaption>
</figure>
</div>
</section>
</section>
<section id="jinja-macros" class="level2">
<h2 class="anchored" data-anchor-id="jinja-macros">Jinja &amp; Macros</h2>
<p>dbt uses the <a href="https://jinja.palletsprojects.com/">Jinja</a> templating language, making a dbt project an ideal programming environment for SQL. With Jinja, you can do transformations that are not normally possible in SQL, such as using environment variables or macros (reusable snippets of SQL that are analogous to functions in most programming languages). Whenever you see <code>{{ ... }}</code>, you’re already using Jinja. For more information about Jinja and the additional Jinja-style functions that dbt defines, you can check the <a href="https://docs.getdbt.com/docs/building-a-dbt-project/jinja-macros/">dbt documentation</a>.</p>
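<p>As a rough sketch of the idea, a macro is written between <code>{% macro %}</code> and <code>{% endmacro %}</code> tags in a SQL file under the project’s <code>macros</code> directory and is then called from a model like a function. The macro name <code>cases_per_100k</code>, its arguments, the file name, and the model <code>some_joined_model</code> below are all hypothetical and are not part of the sample project:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode sql code-with-copy"><code class="sourceCode sql">-- macros/cases_per_100k.sql (hypothetical file and macro name)
{% macro cases_per_100k(cases_column, population_column) %}
    -- nullif() avoids a division-by-zero error when the population is 0
    ({{ cases_column }} * 100000.0 / nullif({{ population_column }}, 0))
{% endmacro %}

-- In a model: assumes a hypothetical model that exposes both columns
select
    iso_code,
    {{ cases_per_100k('total_cases', 'total_population') }} as total_cases_per_100k
from {{ ref('some_joined_model') }}</code></pre></div></div>
<p>When the model is compiled, the macro call is replaced by the SQL expression in the macro body, so the database only ever sees plain SQL.</p>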
<p>Later in this post, we will cover custom macros defined by dbt.</p>
</section>
<section id="using-variables" class="level2">
<h2 class="anchored" data-anchor-id="using-variables">Using Variables</h2>
<section id="define-a-variable" class="level3">
<h3 class="anchored" data-anchor-id="define-a-variable">Define a variable</h3>
<p>You can define your variables under the <code>vars</code> section in your <code>dbt_project.yml</code>. For instance, let’s define a variable called <code>selected_country</code> whose default value is <code>USA</code> and another one called <code>selected_year</code> whose default value is <code>2019</code>.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'my_new_project'</span></span>
<span id="cb7-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">version</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'1.0.0'</span></span>
<span id="cb7-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">config-version</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb7-4"></span>
<span id="cb7-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vars</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb7-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">selected_country</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> USA</span></span>
<span id="cb7-7"><span class="at" style="color: #657422;
background-color: null;
">
font-style: inherit;">  </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">selected_year</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2019</span></span></code></pre></div></div>
</section>
<section id="use-a-variable" class="level3">
<h3 class="anchored" data-anchor-id="use-a-variable">Use a Variable</h3>
<p>You can use variables in your dbt models via the <a href="https://docs.getdbt.com/reference/dbt-jinja-functions/var"><code>var()</code></a> Jinja function (<code>{{ var("var_key_name") }}</code>).</p>
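<p>As a minimal sketch (this model is illustrative and not taken from the sample project), a model can filter on the <code>selected_year</code> variable defined above. <code>var()</code> also accepts an optional second argument, which is used as the default when the variable is not defined anywhere:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode sql code-with-copy"><code class="sourceCode sql">-- Illustrative model: keep only the rows for the selected year.
-- The second argument to var() is a fallback default (assumed here to be 2019).
select "code", "continent", "total_population"
from {{ ref('population') }}
where "year" = '{{ var("selected_year", 2019) }}'</code></pre></div></div>
<p>The value set in <code>dbt_project.yml</code> can be overridden for a single invocation from the command line, for example <code>dbt run --vars '{"selected_year": 2018}'</code>.</p>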
</section>
</section>
<section id="macros" class="level2">
<h2 class="anchored" data-anchor-id="macros">Macros</h2>
<p>The <code>dbt_utils</code> package provides many useful transformations and macros that you can use in your project. For a list of all available macros, you can check the <a href="https://hub.getdbt.com/dbt-labs/dbt_utils/latest/">package’s page on dbt Hub</a>.</p>
<p>Now, let’s add dbt_utils to our project and install it by following the steps below:</p>
<ol type="1">
<li>Add the dbt_utils package to your&nbsp;<code>packages.yml</code>&nbsp;file, as follows:</li>
</ol>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">packages</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb8-2"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">package</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> dbt-labs/dbt_utils</span></span>
<span id="cb8-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">version</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6.6</span></span></code></pre></div></div>
<ol start="2" type="1">
<li>Run&nbsp;<code>dbt deps</code>&nbsp;to install the package.</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/dbt-tutorial/img/dbt_deps.png" class="img-fluid figure-img"></p>
<figcaption>Install packages using <code>dbt deps</code></figcaption>
</figure>
</div>
</section>
<section id="complex-dbt-models" class="level2">
<h2 class="anchored" data-anchor-id="complex-dbt-models">Complex dbt models</h2>
<p>Models (SELECT statements) are usually stacked on top of one another. To build more complex models, you will need the <a href="https://docs.getdbt.com/reference/dbt-jinja-functions/ref"><code>ref()</code></a> function. <code>ref()</code> is the most important function in dbt as it allows you to refer to other models. For instance, you may have a model (i.e., a SELECT query) that does several things, and you don’t want to repeat its logic in other models. It would be difficult to build a complex model without the functions and macros introduced earlier.</p>
<section id="dbt-model-using-ref-and-global-variables" class="level3">
<h3 class="anchored" data-anchor-id="dbt-model-using-ref-and-global-variables">dbt model using <code>ref()</code> and global variables</h3>
<p>We can build more complex models using the two dbt models defined earlier in the post. For instance, let’s create a new dbt model that joins the two models above on the country code and then filters on the selected country and year.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode sql code-with-copy"><code class="sourceCode sql"><span id="cb9-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">select</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span></span>
<span id="cb9-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">from</span> {{<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ref</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'population'</span>)}} </span>
<span id="cb9-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">inner</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">join</span> {{<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ref</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'covid19_latest_stats'</span>)}} </span>
<span id="cb9-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">on</span> {{<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ref</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'population'</span>)}}.code <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {{<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ref</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'covid19_latest_stats'</span>)}}.iso_code </span>
<span id="cb9-5"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">where</span> code<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'{{ var("selected_country") }}'</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">AND</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">year</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'{{ var("selected_year") }}'</span></span></code></pre></div></div>
<p>A few points about the query above:</p>
<ul>
<li><code>{{ ref('dbt_model_name') }}</code> is used to refer to dbt models available in the project.</li>
<li>You can get a column from the model like <code>{{ ref('dbt_model_name') }}.column_name</code>.</li>
<li>You can use variables defined in the <code>dbt_project.yml</code> file via <code>{{ var("variable_name") }}</code>.</li>
</ul>
<p>The above code snippet joins the data from the <code>population</code> and <code>covid19_latest_stats</code> models on the country code and filters the result using <code>selected_country=USA</code> and <code>selected_year=2019</code>. The output of the model is shown below.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/dbt-tutorial/img/jinja_and_variable_usage_output.png" class="img-fluid figure-img"></p>
<figcaption>The output of the <strong>jinja_and_variable_usage</strong> dbt model</figcaption>
</figure>
</div>
<p>You can also see the compiled SQL code by clicking on the <strong>compile sql</strong> button. This is particularly useful if you want to run the query outside of dbt.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/dbt-tutorial/img/jinja_and_variable_usage_compiled_sql.png" class="img-fluid figure-img"></p>
<figcaption>Compiled SQL code for <strong>jinja_and_variable_usage</strong> dbt model</figcaption>
</figure>
</div>
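<p>Because each <code>{{ ref(...) }}</code> call compiles to the full relation name, the same join can also be written with table aliases so that each model is referenced only once. The following is a sketch of an equivalent formulation, not the query used in the screenshots above; the aliases <code>p</code> and <code>c</code> are arbitrary:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode sql code-with-copy"><code class="sourceCode sql">-- Same join and filters as above, using aliases instead of repeated ref() calls
select *
from {{ ref('population') }} as p
inner join {{ ref('covid19_latest_stats') }} as c
  on p.code = c.iso_code
where p.code = '{{ var("selected_country") }}'
  and p.year = '{{ var("selected_year") }}'</code></pre></div></div>
<p>Qualifying the columns with an alias also keeps the <code>where</code> clause unambiguous if both models ever expose a column with the same name.</p>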
</section>
<section id="dbt-model-using-dbt_utils-package-and-macros" class="level3">
<h3 class="anchored" data-anchor-id="dbt-model-using-dbt_utils-package-and-macros">dbt model using dbt_utils package and macros</h3>
<p>The <code>dbt_utils</code> package contains macros (aka functions) you can use in your dbt projects. A list of all macros is available on <a href="https://github.com/dbt-labs/dbt-utils/">dbt_utils’ GitHub page</a>.</p>
<p>Let’s use the dbt_utils <a href="https://github.com/dbt-labs/dbt-utils/#pivot-source"><code>pivot()</code></a> and <a href="https://github.com/dbt-labs/dbt-utils/#get_column_values-source"><code>get_column_values()</code></a> macros in a dbt model as below:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode sql code-with-copy"><code class="sourceCode sql"><span id="cb10-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">select</span></span>
<span id="cb10-2">  continent,</span>
<span id="cb10-3">  {{ dbt_utils.pivot(</span>
<span id="cb10-4">      <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"population.year"</span>,</span>
<span id="cb10-5">      dbt_utils.get_column_values(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ref</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'population'</span>), <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"year"</span>)</span>
<span id="cb10-6">  ) }}</span>
<span id="cb10-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">from</span> {{ <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ref</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'population'</span>) }}</span>
<span id="cb10-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">group</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">by</span> continent</span></code></pre></div></div>
<p>The above dbt model will compile to the following SQL query in dbt.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode sql code-with-copy"><code class="sourceCode sql"><span id="cb11-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">select</span></span>
<span id="cb11-2">  continent,</span>
<span id="cb11-3">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">case</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">when</span> population.<span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">year</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'2015'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">then</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">as</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"2015"</span>,</span>
<span id="cb11-4">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">case</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">when</span> population.<span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">year</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'2017'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">then</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">as</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"2017"</span>,</span>
<span id="cb11-5">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">case</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">when</span> population.<span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">year</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
'2017'</span>">
font-style: inherit;">'2016'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">then</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">as</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"2016"</span>,</span>
<span id="cb11-6">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">case</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">when</span> population.<span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">year</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
'2017'</span>">
font-style: inherit;">'2018'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">then</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">as</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"2018"</span>,</span>
<span id="cb11-7">        <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">case</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">when</span> population.<span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">year</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
'2017'</span>">
font-style: inherit;">'2019'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">then</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">end</span>) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">as</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"2019"</span></span>
<span id="cb11-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">from</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"d15em1n30ihttu"</span>.<span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"dbt_ealizadeh"</span>.<span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">"population"</span></span>
<span id="cb11-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">group</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">by</span> continent</span>
<span id="cb11-10"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">limit</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span></span>
<span id="cb11-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/* limit added automatically by dbt cloud */</span></span></code></pre></div></div>
<hr>
</section>
</section>
</section>
<section id="run-tests-in-dbt" class="level1">
<h1>Run Tests in dbt</h1>
<p>Another benefit of using dbt is the ability to test your data. Out of the box, dbt has the following generic tests: <code>unique</code>, <code>not_null</code>, <code>accepted_values</code> and <code>relationships</code>. An example that applies two of these tests to the <code>covid19_latest_stats</code> model is shown below:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb12-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">version</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb12-2"></span>
<span id="cb12-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">models</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb12-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> covid19_latest_stats</span></span>
<span id="cb12-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">description</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A model of latest stats for covid19"</span></span>
<span id="cb12-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">      </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">columns</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb12-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">          </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> iso_code</span></span>
<span id="cb12-8"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">            </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">description</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"The country code"</span></span>
<span id="cb12-9"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">            </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tests</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb12-10"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">                </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> unique</span></span>
<span id="cb12-11"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">                </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> not_null</span></span></code></pre></div></div>
<p>You can run the tests via <code>dbt test</code>. The output is shown below:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/dbt-tutorial/img/dbt_test.png" class="img-fluid figure-img"></p>
<figcaption>Results of running dbt test on the dbt Cloud dashboard</figcaption>
</figure>
</div>
<p>For more information on testing in dbt, you can visit <a href="https://docs.getdbt.com/docs/building-a-dbt-project/tests">dbt documentation</a>.</p>
<hr>
</section>
<section id="generate-documentation-in-dbt" class="level1">
<h1>Generate Documentation in dbt</h1>
<p>You can generate documentation for your dbt project by simply running <code>dbt docs generate</code> in the command section as shown below:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/dbt-tutorial/img/dbt_docs_generate.png" class="img-fluid figure-img"></p>
<figcaption>Generate documentation for a dbt project</figcaption>
</figure>
</div>
<p>You can browse through the generated documentation by clicking on <strong>view docs</strong>. You can see an overview of the generated docs below.</p>
<div class="quarto-video ratio ratio-16x9"><iframe data-external="1" src="https://player.vimeo.com/video/576196029" frameborder="0" title="" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen=""></iframe></div>
<p>In addition to <code>dbt docs generate</code>, dbt can serve the generated documentation from a local web server. To do so, run <code>dbt docs serve</code>. More information about generating docs for your dbt project is available <a href="https://docs.getdbt.com/docs/building-a-dbt-project/documentation">here</a>.</p>
<hr>
</section>
<section id="other-features" class="level1">
<h1>Other Features</h1>
<section id="database-administration-using-hooks-operations" class="level2">
<h2 class="anchored" data-anchor-id="database-administration-using-hooks-operations">Database administration using Hooks &amp; Operations</h2>
<p>There are database management tasks that require running additional SQL queries, such as:</p>
<ul>
<li>Create user-defined functions</li>
<li>Grant privileges on a table</li>
<li>and many more</li>
</ul>
<p>dbt has two interfaces (hooks and operations) for executing these tasks and, importantly, for keeping them under version control. Hooks and operations are briefly introduced here. For more info, you can check the <a href="https://docs.getdbt.com/docs/building-a-dbt-project/hooks-operations">dbt documentation</a>.</p>
<section id="hooks" class="level3">
<h3 class="anchored" data-anchor-id="hooks">Hooks</h3>
<p>Hooks are SQL snippets that are executed at different times. Hooks can be defined in the <code>dbt_project.yml</code> file. The available hooks are:</p>
<ul>
<li><code>pre-hook</code>: executed before a model is built</li>
<li><code>post-hook</code>: executed after a model is built</li>
<li><code>on-run-start</code>: executed at the start of <code>dbt run</code></li>
<li><code>on-run-end</code>: executed at the end of <code>dbt run</code></li>
</ul>
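<p>As a rough illustration of the hooks listed above, the fragment below sketches how two of them might look in <code>dbt_project.yml</code>. The project name <code>my_dbt_project</code>, the <code>reporter</code> role, and the SQL statements are assumptions made for this example; they are not taken from the sample project.</p>

```yaml
# dbt_project.yml (fragment) -- hypothetical names, for illustration only

# Runs once, after all models in a `dbt run` have been built.
on-run-end:
  - "grant usage on schema {{ target.schema }} to reporter"

models:
  my_dbt_project:        # assumed project name
    # Runs after each model in this project is built;
    # {{ this }} refers to the model that was just built.
    +post-hook:
      - "grant select on {{ this }} to reporter"
```

<p><code>pre-hook</code> and <code>on-run-start</code> follow the same pattern; only the moment at which the SQL is executed differs.</p>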
</section>
<section id="operations" class="level3">
<h3 class="anchored" data-anchor-id="operations">Operations</h3>
<p>Operations are a convenient way to invoke a macro without running a model. Operations are triggered using the <a href="https://docs.getdbt.com/reference/commands/run-operation"><code>dbt run-operation</code></a> command.</p>
<p>Note that, unlike hooks, you need to explicitly execute the SQL in a <a href="https://docs.getdbt.com/docs/building-a-dbt-project/hooks-operations#operations">dbt operation</a>.</p>
<hr>
</section>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>dbt is a tool that is well worth trying, as it may simplify your data ELT (or ETL) pipeline. In this post, we learned how to set up and use dbt for data transformation. I walked you through the different features of this tool. In particular, I provided a step-by-step guide on:</p>
<ul>
<li>Configuring a dbt project</li>
<li>Creating dbt models (SELECT statements)</li>
<li>Building complex dbt models using global variables and macros</li>
<li>Building complex models by referring to other dbt models</li>
<li>Running tests</li>
<li>Generating documentation</li>
</ul>
<p>You can find the GitHub repo containing all scripts (including the data ingestion script) below. <em>Feel free to fork the source code of this article.</em></p>
<p><a href="https://github.com/e-alizadeh/sample_dbt_project">e-alizadeh/sample_dbt_project</a></p>
</section>
<section id="useful-links" class="level1">
<h1>Useful Links</h1>
<p><a href="https://ealizadeh.com/blog/deploy-postgresql-db-heroku">Step-by-Step Deployment of a Free PostgreSQL Database And Data Ingestion</a></p>
</section>
<section id="references" class="level1">
<h1>References</h1>
<p><a href="https://docs.getdbt.com/docs/introduction">What is dbt? | dbt Docs</a></p>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2021,
  author = {Alizadeh, Esmaeil},
  title = {Dbt for {Data} {Transformation} - {A} {Hands-on} {Tutorial}},
  date = {2021-06-18},
  url = {https://ealizadeh.com/blog/dbt-tutorial/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2021" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“dbt for Data Transformation - A Hands-on
Tutorial,”</span> Jun. 18, 2021. <a href="https://ealizadeh.com/blog/dbt-tutorial/">https://ealizadeh.com/blog/dbt-tutorial/</a></div>
</div></div></section></div> ]]></description>
  <category>SQL</category>
  <category>Data Science</category>
  <category>Database</category>
  <category>ETL</category>
  <category>Tutorial</category>
  <guid>https://ealizadeh.com/blog/dbt-tutorial/</guid>
  <pubDate>Fri, 18 Jun 2021 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/dbt-tutorial/img/_featured_image.png" medium="image" type="image/png" height="65" width="144"/>
</item>
<item>
  <title>How to Publish Your Python Package with just 2 commands</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/how-to-publish-your-python-package-with-just-2-commands/</link>
  <description><![CDATA[ 






<p><img src="https://ealizadeh.com/blog/how-to-publish-your-python-package-with-just-2-commands/img/bp12_featured_image.png" class="img-fluid" alt="Featured image of the post"></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<a href="https://towardsdatascience.com/how-to-publish-your-python-package-with-just-2-commands-39ea6a400285"><strong>Towards Data Science blog</strong></a>.</p>
</div>
</div>
<p>Packaging your Python library has never been easier thanks to <a href="https://python-poetry.org/">Poetry</a>. You may have a side project in Python that could benefit others, and you can publish it using Poetry. This post will show you how to build your own Python library and publish it on the most popular Python package repository, <a href="https://pypi.org/">PyPI</a>.</p>
<p>As an example, I will use one of my recent Python projects, <a href="https://github.com/e-alizadeh/PyPocket/">PyPocket</a>: a Python library (wrapper) for <a href="https://getpocket.com/">Pocket</a> (previously known as Read It Later).</p>
<section id="prerequisite" class="level1">
<h1>Prerequisite</h1>
<section id="project-environment" class="level3">
<h3 class="anchored" data-anchor-id="project-environment">1. Project environment</h3>
<p>You need to have your project environment managed in Poetry since we will be using the <code>pyproject.toml</code> file to build our package and publish it.</p>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p><em>You can check my post on <a href="https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/">how to set up your Python environment using Conda and Poetry</a>.</em> If you are not using Conda, you can follow the steps I provided in the post but instead use other environment management systems like <a href="https://github.com/pypa/pipenv">Pipenv</a> or <a href="https://virtualenv.pypa.io/en/latest/">Virtualenv</a>.</p>
</div>
</div>
</div>
</section>
<section id="package-repository" class="level3">
<h3 class="anchored" data-anchor-id="package-repository">2. Package repository</h3>
<p>We will need a package repository to host the Python package; the most popular one is <a href="https://pypi.org/">PyPI</a>. So, if you want to publish your library on PyPI, you need to first create an account on PyPI.</p>
</section>
</section>
<section id="packaging-instructions" class="level1">
<h1>Packaging Instructions</h1>
<section id="step-1-build-your-package" class="level2">
<h2 class="anchored" data-anchor-id="step-1-build-your-package">Step 1: Build your package</h2>
<p>Once you have your Python package ready to be published, you first need to build your package using the following command from the directory that contains the <code>pyproject.toml</code> file:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">poetry</span> build</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/how-to-publish-your-python-package-with-just-2-commands/img/poetry_build.png" class="img-fluid figure-img" alt="Screenshot of running poetry build"></p>
<figcaption>Screenshot of running <code>poetry build</code></figcaption>
</figure>
</div>
<p>The above command will create two files in the&nbsp;<code>dist</code>&nbsp;(distribution) directory. The&nbsp;<code>dist</code>&nbsp;folder is created if it does not already exist.</p>
<p>First, a source distribution (often known as&nbsp;<strong>sdist</strong>) is created, which is an archive of your package’s source code (Poetry produces a&nbsp;<code>.tar.gz</code>&nbsp;file)<sup>1</sup>.</p>
<p>In addition to sdist,&nbsp;<code>poetry build</code>&nbsp;creates a Python wheel (<code>.whl</code>) file. In a nutshell, a Python wheel is a ready-to-install format allowing you to skip the build stage, unlike the source distribution. A wheel filename is usually in the following format<sup>2</sup>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">{pkg-name}-{pkg-version}</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">-{build}?</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">-{python-implementation}-{application</span> binary interface}-{platform}.whl</span></code></pre></div></div>
<p>From the above figure, the package I built is called <strong>pypocket</strong>, version <strong>0.2.0</strong>, built for <strong>Python 3</strong>. It does not depend on a specific ABI (<strong>none</strong>) and can run on <strong>any</strong> platform.</p>
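<p>To make the filename format above concrete, here is a small shell sketch that splits a wheel filename into its tags. The filename is an assumption inferred from the description of the screenshot, and the sketch assumes there is no optional build tag (exactly five fields).</p>

```shell
# Assumed wheel filename, inferred from the description above.
wheel="pypocket-0.2.0-py3-none-any.whl"

# Drop the extension, then split the remaining stem on "-".
stem="${wheel%.whl}"
old_ifs=$IFS
IFS='-'
set -- $stem
IFS=$old_ifs

# With no build tag, the five fields map directly onto the format.
pkg_name=$1
pkg_version=$2
python_tag=$3
abi_tag=$4
platform_tag=$5

echo "package=$pkg_name version=$pkg_version python=$python_tag abi=$abi_tag platform=$platform_tag"
```

<p>For the assumed filename, this prints <code>package=pypocket version=0.2.0 python=py3 abi=none platform=any</code>. Splitting on <code>-</code> is generally safe here because hyphens in a project name are replaced with underscores in the wheel filename.</p>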
</section>
<section id="step-2-publish-your-package" class="level2">
<h2 class="anchored" data-anchor-id="step-2-publish-your-package">Step 2: Publish your package</h2>
<p>Once the package is built, you can publish it on PyPI (or other package repositories).</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>Once you publish your package to PyPI, you will not be able to release an identical version (you can delete the package, but when trying to republish with the same version, you will get an error! I’ve been there). Hence, it’s recommended to test any package before pushing it to PyPI.</p>
</div>
</div>
<section id="test-your-package-on-testpypi" class="level3">
<h3 class="anchored" data-anchor-id="test-your-package-on-testpypi">Test Your Package on TestPyPI</h3>
<p>It’s a good idea to first publish your package using the <a href="https://test.pypi.org/">TestPyPI</a> framework. This way, if there is an issue with the published package, you can fix it and then publish it on PyPI. TestPyPI has the same setup and user interface as PyPI, but it’s a <em>separate framework</em>. So, you need to create an account on TestPyPI too.</p>
<p>Now, let’s publish our package on TestPyPI. First, add TestPyPI as an alternative package repository using the following command.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">poetry</span> config repositories.testpypi https://test.pypi.org/legacy/</span></code></pre></div></div>
<p>You can publish your package to TestPyPI as follows:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">poetry</span> publish <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-r</span> testpypi</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/how-to-publish-your-python-package-with-just-2-commands/img/poetry_publish_testpypi.png" class="img-fluid figure-img" alt="Screenshot of running `poetry publish -r testpypi`"></p>
<figcaption>Poetry publish to TestPyPI</figcaption>
</figure>
</div>
<p><code>poetry publish</code> will ask for your username and password (you can also use a token instead, more on this later). Notice that both the source distribution (<code>.tar.gz</code>) and the Python wheel are uploaded. Once the package is published, you should see something like the following on TestPyPI.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/how-to-publish-your-python-package-with-just-2-commands/img/pypocket_testpypi.png" class="img-fluid figure-img" alt="Published package on TestPyPI framework."></p>
<figcaption>You can check that here <a href="https://test.pypi.org/project/pypocket/">https://test.pypi.org/project/pypocket/</a></figcaption>
</figure>
</div>
<p>As can be seen from the above screenshot, you can install the package with <code>pip install -i https://test.pypi.org/simple/ pypocket</code> and test it.</p>
</section>
<section id="publish-package-on-pypi" class="level3">
<h3 class="anchored" data-anchor-id="publish-package-on-pypi">Publish Package on PyPI</h3>
<p>Once you’re happy with your Python library, you can publish it on PyPI using the following command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">poetry</span> publish</span></code></pre></div></div>
<p>Note that by default, Poetry publishes a package to PyPI. Therefore, you do not need to run <code>poetry config</code> or pass any argument to <code>poetry publish</code>.</p>
</section>
<section id="a-point-on-using-api-token-instead-of-username-and-password" class="level3">
<h3 class="anchored" data-anchor-id="a-point-on-using-api-token-instead-of-username-and-password">A point on using API Token instead of username and password</h3>
<p>You may notice that I used my username and password when publishing the package. I would recommend using a token instead. You may have multiple projects in your PyPI account, and you can generate an API token for each project (package). This is particularly important if you want to automate your Python packaging, so that your username and password are not used during automated deployments. Another advantage of API tokens is that you can easily revoke a token and even generate multiple tokens for a project.</p>
<p>You can generate an API token by going to the <strong>Account settings</strong> of your PyPI (or TestPyPI) account and adding an API token under the API tokens section. You will then be prompted to select a scope for your token (a particular project or all your PyPI projects). Instructions for using the token are also provided at this stage.</p>
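<p>As a minimal sketch of how a token could be used in an automated deployment, the helper below stores the token with <code>poetry config</code> and then publishes. The function name <code>publish_with_token</code> and the <code>PYPI_TOKEN</code> environment variable are assumptions made for this example (for instance, a secret set in your CI system); they are not defined by Poetry.</p>

```shell
# Hypothetical helper for a CI job; PYPI_TOKEN is an assumed secret name.
publish_with_token() {
    # Stop early if the token is missing or empty.
    if [ -z "${PYPI_TOKEN:-}" ]; then
        echo "PYPI_TOKEN must be set" >&2
        return 1
    fi

    # Store the token for the default repository (PyPI), then upload
    # the files previously created by `poetry build`.
    poetry config pypi-token.pypi "$PYPI_TOKEN"
    poetry publish
}
```

<p>You would call <code>publish_with_token</code> after <code>poetry build</code>. For the TestPyPI repository configured earlier, the corresponding setting would be <code>pypi-token.testpypi</code>, combined with <code>poetry publish -r testpypi</code>.</p>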
</section>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>In this post we saw how we can build and publish a Python package using Poetry in two simple commands: <code>poetry build</code> and then <code>poetry publish</code>. We also went through the TestPyPI framework in order to test the Python package before publishing it on PyPI.</p>
</section>
<section id="useful-links" class="level1">
<h1>Useful Links</h1>
<p><a href="https://python-poetry.org/docs/">Introduction | Documentation | Poetry - Python dependency management and packaging made easy.</a></p>
<section id="related-posts" class="level2">
<h2 class="anchored" data-anchor-id="related-posts">Related posts</h2>
<p><a href="https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/">A Guide to Python Environment, Dependency and Package Management: Conda + Poetry - Personal Website &amp; Blog</a></p>


</section>
</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Learn more about the Source Distribution: <a href="https://docs.python.org/3/distutils/sourcedist.html">Python documentation: Creating a Source Distribution</a>↩︎</p></li>
<li id="fn2"><p>Brad Solomon (2020), <a href="https://realpython.com/python-wheels/">What Are Python Wheels and Why Should You Care?</a>, Real Python↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2021,
  author = {Alizadeh, Esmaeil},
  title = {How to {Publish} {Your} {Python} {Package} with Just 2
    Commands},
  date = {2021-02-08},
  url = {https://ealizadeh.com/blog/how-to-publish-your-python-package-with-just-2-commands/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2021" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“How to Publish Your Python Package with just 2
commands,”</span> Feb. 08, 2021. <a href="https://ealizadeh.com/blog/how-to-publish-your-python-package-with-just-2-commands/">https://ealizadeh.com/blog/how-to-publish-your-python-package-with-just-2-commands/</a></div>
</div></div></section></div> ]]></description>
  <category>Poetry</category>
  <category>Python</category>
  <category>Software Development</category>
  <guid>https://ealizadeh.com/blog/how-to-publish-your-python-package-with-just-2-commands/</guid>
  <pubDate>Mon, 08 Feb 2021 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/how-to-publish-your-python-package-with-just-2-commands/img/bp12_featured_image.png" medium="image" type="image/png" height="83" width="144"/>
</item>
<item>
  <title>A Guide to Python Environment, Dependency and Package Management: Conda + Poetry</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/</link>
  <description><![CDATA[ 






<p><img src="https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/img/_featured_image.png" class="img-fluid" alt="A wordcloud of concepts used in the post."></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<a href="https://towardsdatascience.com/a-guide-to-python-environment-dependency-and-package-management-conda-poetry-f5a6c48d795"><strong>Towards Data Science blog</strong></a>.</p>
</div>
</div>
<p>If you work on multiple Python projects at different development stages, you probably have different environments on your system. There are various tools for creating an isolated environment and installing the libraries you need for your project. This post discusses the different technologies available for Python packaging, environment, and dependency management. Then, we will go over an ideal setup (of course, in my opinion 🙂) suitable for most Python projects using <a href="https://conda.io/">conda</a> and <a href="https://python-poetry.org/">Poetry</a>.</p>
<p>In this post, <em>library</em> and <em>package</em> are used interchangeably, and they both refer to the Python package.</p>
<section id="introduction" class="level1">
<h1>Introduction</h1>
<p>Let’s first list the different groups of technologies and highlight a few tools:</p>
<ol type="1">
<li>An environment management system
<ul>
<li><a href="https://virtualenv.pypa.io/en/latest/">Virtualenv</a></li>
<li>Conda environment</li>
<li><a href="https://github.com/pypa/pipenv">Pipenv</a></li>
</ul></li>
<li>Package dependency resolver
<ul>
<li>Conda</li>
<li>Pipenv</li>
<li>Poetry</li>
</ul></li>
<li>Package repository
<ul>
<li><a href="https://pypi.org/">PyPI</a></li>
<li><a href="https://www.anaconda.com/">Anaconda</a></li>
<li><em>etc</em></li>
</ul></li>
</ol>
<section id="a-quick-note-on-package-repositories" class="level2">
<h2 class="anchored" data-anchor-id="a-quick-note-on-package-repositories">A Quick Note on Package Repositories</h2>
<p>The most popular Python package repository is the Python Package Index (PyPI), a public repository for many Python libraries. You can install packages from PyPI by running <code>pip install package_name</code>. Python libraries can also be packaged using conda, and a popular host for conda packages is Anaconda. You can install conda packages by running <code>conda install package_name</code> in your conda environment.</p>
</section>
</section>
<section id="conda-jack-of-all-trades" class="level1">
<h1>Conda: Jack of all trades?</h1>
<p>Pipenv was created to address many shortcomings of virtualenv. However, the main reasons I will not consider virtualenv or Pipenv as environment managers are:</p>
<ul>
<li>I want to have the flexibility to install conda packages.</li>
<li>Unlike conda, both virtualenv and Pipenv are Python environments only.</li>
</ul>
<p>As you may note from the introduction, conda manages the environment, the packages, and the dependencies. Not only that, but it is language-agnostic too. Besides, you can install PyPI packages by using pip in an active conda environment. You can create a fresh conda environment by running the following command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">conda</span> create <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-n</span> env_name python=3.7</span></code></pre></div></div>
<p>It’s always recommended to have an environment file that contains your libraries and their specific versions. This is important for portability, maintainability, and reproducibility. You can create a conda environment from a file (e.g., the environment.yaml file below) using the following command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">conda</span> env create <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-f</span> environment.yaml</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> post</span></span>
<span id="cb3-2"></span>
<span id="cb3-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">channels</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb3-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> defaults</span></span>
<span id="cb3-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> conda-forge</span></span>
<span id="cb3-6"></span>
<span id="cb3-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dependencies</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb3-8"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> python=3.8</span></span>
<span id="cb3-9"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> pandas=1.1.0</span></span>
<span id="cb3-10"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> pip=20.3.3</span></span>
<span id="cb3-11"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> </span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pip</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb3-12"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">    </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> requests==2.25.0</span></span></code></pre></div></div>
<p>By now, you may say, great, conda does everything, so, let’s use conda packages in conda environments and let conda resolve any dependency issues.</p>
<section id="issues-with-conda" class="level2">
<h2 class="anchored" data-anchor-id="issues-with-conda">Issues with conda</h2>
<p>I think conda tries to do too much. After several years of using conda, here are a few of my observations on conda as a package and dependency manager:</p>
<section id="performance-issues" class="level3">
<h3 class="anchored" data-anchor-id="performance-issues">Performance issues</h3>
<p>My main problem with conda is its performance. Creating a new environment, or even updating an existing one, can take a long time, especially if it has many packages. This is probably because of the time conda spends resolving dependencies. A few times, it took more than 30 minutes (yes, 30 minutes, not 30 seconds!) to create an environment. I initially thought there was a problem with my connection to the package repositories.</p>
</section>
<section id="dependency-resolver-issues" class="level3">
<h3 class="anchored" data-anchor-id="dependency-resolver-issues">Dependency resolver issues</h3>
<p>Conda may also fail to resolve dependency conflicts. Since the dependencies of specific conda packages are not as easy to inspect as they are with Poetry, fixing those conflicts yourself may not be easy.</p>
</section>
<section id="python-packaging" class="level3">
<h3 class="anchored" data-anchor-id="python-packaging">Python packaging</h3>
<p>Another issue with conda arises when you want to build a conda package for your library and publish it. It’s not trivial (at least for me) since you need several configuration files (like meta.yaml, setup.py, <em>etc</em>.). You may run into dependency issues too. You can find more information on how to build a conda package <a href="https://docs.conda.io/projects/conda-build/en/latest/user-guide/tutorials/build-pkgs.html">here</a>.</p>
</section>
</section>
</section>
<section id="poetry" class="level1">
<h1>Poetry</h1>
<p><a href="https://python-poetry.org/">Poetry</a> is a Python packaging and dependency management system initially released in 2018. It handles dependencies smoothly, especially if you start with a fresh environment and then add your Python packages through Poetry. It can also hold the configuration of other tools in your project in a deterministic way, since it uses the <a href="https://toml.io/en/">TOML</a> format for its configuration file. In a nutshell, TOML aims to be a minimal, easy-to-read configuration file format. Poetry uses the <code>pyproject.toml</code> configuration file to install Python packages and set up the configurations.</p>
<section id="pyproject.toml-python-configuration-file" class="level2">
<h2 class="anchored" data-anchor-id="pyproject.toml-python-configuration-file">pyproject.toml: Python Configuration file</h2>
<p>The <code>pyproject.toml</code> file is a Python configuration file defined in <a href="https://www.python.org/dev/peps/pep-0518/">PEP518</a> to store build system requirements, dependencies, and many other configurations. It can even replace the <code>setup.cfg</code> and <code>setup.py</code> files in most scenarios. You can save most configurations related to specific Python packages like pytest, coverage, bumpversion, Black code styling, and many more in a single <code>pyproject.toml</code> file. Previously, you had to write those configurations either in individual files or in other configuration files like <code>setup.cfg</code>. Now, <code>pyproject.toml</code> can hold all of them together with the project’s package requirements.</p>
</section>
</section>
<section id="the-proposed-setup" class="level1">
<h1>The Proposed Setup</h1>
<p>I would recommend using conda as an environment manager, pip as the package installer, and Poetry as the dependency manager. In this case, you get all PyPI packages within the conda environment, and in the rare cases where you need a conda package, you can still install it. Here are a few benefits of using Poetry and the proposed setup:</p>
<ul>
<li>Better dependency management (often faster than conda dependency resolver)</li>
<li>Having most package configurations (e.g., pytest, coverage, bump2version, <em>etc.</em>) in a single file.</li>
<li>The ability to install a conda package if you have to (this should be your last resort!)</li>
<li>Poetry can automatically add new packages to the <code>pyproject.toml</code> file.</li>
<li>Poetry can show the list of library dependencies of individual packages.</li>
<li>Building a Python package and publishing it to PyPI is as easy as running two commands!</li>
<li>No need to have separate environment files for your production and development environments.</li>
</ul>
<section id="step-1-create-a-minimal-conda-environment" class="level2">
<h2 class="anchored" data-anchor-id="step-1-create-a-minimal-conda-environment">Step 1: Create a minimal conda environment</h2>
<p>You can create a conda environment from the following YAML file by running <code>conda env create -f environment.yaml</code>. This creates a fresh conda environment with Python 3.8. In the environment file, you can list the channels (the order is important) from which you want to install your packages. In addition to the <em>defaults</em> channel on Anaconda Cloud that is curated by <a href="https://www.anaconda.com/">Anaconda</a> Inc., there are other channels from which you can install packages. A popular one is <a href="https://conda-forge.org/">conda-forge</a>, a community-led collection of packages. If you have a private conda channel, you can add it to the channels section.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">name</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> post</span></span>
<span id="cb4-2"></span>
<span id="cb4-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">channels</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb4-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> defaults</span></span>
<span id="cb4-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> conda-forge</span></span>
<span id="cb4-6"></span>
<span id="cb4-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dependencies</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">:</span></span>
<span id="cb4-8"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">  </span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">-</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;"> python=3.8</span></span></code></pre></div></div>
</section>
<section id="step-2-install-poetry-tool" class="level2">
<h2 class="anchored" data-anchor-id="step-2-install-poetry-tool">Step 2: Install Poetry tool</h2>
<p>You can install Poetry by following the instructions <a href="https://python-poetry.org/docs/#installation">here</a>. The recommended way is to install Poetry using the following command on macOS, Linux, or WSL (Windows Subsystem for Linux).</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb5-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">curl</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-sSL</span> https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-</span></span></code></pre></div></div>
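<p>Note: The preferred approach, the custom installer that downloads the get-poetry.py script, installs Poetry isolated from the rest of the system. More recent Poetry releases have replaced this script with the installer at https://install.python-poetry.org, so check the installation instructions linked above for the current command.</p>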
<p>⚠️ Although not recommended, there is also a pip version of Poetry that you can install (<code>pip install poetry</code>). The developers warn against using the pip version in the documentation since it might conflict with other packages in the environment. However, if the environment is basically empty (apart from base packages like pip that are installed when creating a conda environment), then it is probably fine to install it through <code>pip</code>!</p>
</section>
<section id="step-3-configure-your-poetry" class="level2">
<h2 class="anchored" data-anchor-id="step-3-configure-your-poetry">Step 3: Configure your Poetry</h2>
<p>Poetry makes it very easy to create a configuration file with all your desired settings for a new project. You can interactively create a <code>pyproject.toml</code> file by simply running <code>poetry init</code>. This will ask a few questions about the project and the Python packages you want to install. You can press Enter to proceed with the default options.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/img/poetry_init.png" class="img-fluid figure-img" alt="Screenshot of the result of running command: poetry init"></p>
<figcaption>Interactive configuration by running poetry init</figcaption>
</figure>
</div>
<p>As you can see in the screenshot above, you can add some packages as development-only dependencies. Initializing Poetry for your project creates the <code>pyproject.toml</code> file, which includes all the configurations we defined during the setup. We have one main section for all dependencies (used in both production and development environments), and another section for packages used mainly for development purposes like pytest, sphinx, <em>etc</em>. This is another advantage over other dependency management tools: you only need one configuration file for both your production and development environments.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode toml code-with-copy"><code class="sourceCode toml"><span id="cb6-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">[tool.poetry]</span></span>
<span id="cb6-2"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">name</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"my_package"</span></span>
<span id="cb6-3"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">version</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"0.0.1"</span></span>
<span id="cb6-4"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">description</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span></span>
<span id="cb6-5"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">authors</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ealizadeh &lt;abc@edf.com&gt;"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb6-6"></span>
<span id="cb6-7"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">[tool.poetry.dependencies]</span></span>
<span id="cb6-8"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">python</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"^3.8"</span></span>
<span id="cb6-9"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">requests</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"^2.25.1"</span></span>
<span id="cb6-10"></span>
<span id="cb6-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">[tool.poetry.dev-dependencies]</span></span>
<span id="cb6-12"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">pytest</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"^6.2.1"</span></span>
<span id="cb6-13"></span>
<span id="cb6-14"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">[build-system]</span></span>
<span id="cb6-15"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">requires</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"poetry-core&gt;=1.0.0"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span></span>
<span id="cb6-16"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">build-backend</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"poetry.core.masonry.api"</span></span>
<span id="cb6-17"></span>
<span id="cb6-18"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">[tool.pytest.ini_options]</span></span>
<span id="cb6-19"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">minversion</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"6.0"</span></span>
<span id="cb6-20"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">addopts</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"-ra -q"</span></span>
<span id="cb6-21"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">testpaths</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span></span>
<span id="cb6-22">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tests"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># You should have a "tests" directory</span></span>
<span id="cb6-23"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span></span></code></pre></div></div>
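<p>The <code>^</code> in constraints such as <code>"^2.25.1"</code> is Poetry’s caret requirement: it allows updates that do not change the left-most non-zero component of the version. The sketch below illustrates that rule in plain Python. It is a simplified illustration, not Poetry’s own implementation; the helper names <code>caret_bounds</code> and <code>allows</code> are hypothetical, and only purely numeric versions are handled.</p>

```python
def caret_bounds(constraint):
    """Return (lower, upper) version tuples for a caret constraint like "^2.25.1".

    The lower bound is inclusive and the upper bound is exclusive.
    """
    parts = [int(p) for p in constraint.lstrip("^").split(".")]
    lower = tuple(parts)
    # Find the left-most non-zero component; it is the one that gets bumped.
    for i, value in enumerate(parts):
        if value != 0:
            break
    else:
        i = len(parts) - 1  # all zeros (e.g. "^0.0"): bump the last component
    upper = tuple(parts[:i] + [parts[i] + 1] + [0] * (len(parts) - i - 1))
    return lower, upper


def allows(constraint, version):
    """Check whether a plain numeric version string satisfies a caret constraint."""
    lower, upper = caret_bounds(constraint)
    candidate = tuple(int(p) for p in version.split("."))
    return lower <= candidate < upper


print(caret_bounds("^2.25.1"))      # ((2, 25, 1), (3, 0, 0))
print(caret_bounds("^0.2.3"))       # ((0, 2, 3), (0, 3, 0))
print(allows("^2.25.1", "2.31.0"))  # True
print(allows("^2.25.1", "3.0.0"))   # False
```

<p>So <code>requests = "^2.25.1"</code> accepts any 2.x release from 2.25.1 onwards, but not 3.0.0.</p>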
</section>
<section id="step-4-installing-dependencies" class="level2">
<h2 class="anchored" data-anchor-id="step-4-installing-dependencies">Step 4: Installing dependencies</h2>
<p>Once you have your dependencies and other configurations in a <code>pyproject.toml</code> file, you can install the dependencies by simply running</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb7-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">poetry</span> install</span></code></pre></div></div>
<p>This will create a <code>poetry.lock</code> file. This file contains the exact versions of all installed packages, locking the project to those specific versions. You need to commit both the <code>pyproject.toml</code> file and the <code>poetry.lock</code> file. I strongly recommend not editing the <code>poetry.lock</code> file manually. Let Poetry do its magic!</p>
</section>
</section>
<section id="poetry-tips" class="level1">
<h1>Poetry tips</h1>
<section id="add-new-packages" class="level2">
<h2 class="anchored" data-anchor-id="add-new-packages">Add new packages</h2>
<p>If you want to add a package to your environment, I highly recommend doing so with the following command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb8-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">poetry</span> add package_name</span></code></pre></div></div>
<p>This will <em>automatically</em> add the package name and version to your <code>pyproject.toml</code> file and update the <code>poetry.lock</code> file accordingly. <code>poetry add</code> takes care of all dependencies, and adds the package to the <code>[tool.poetry.dependencies]</code> section.</p>
<p>If you want to add a package to your development environment, you can simply pass the <code>--dev</code> option as below:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb9-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">poetry</span> add package_name <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--dev</span></span></code></pre></div></div>
<p>You can request a specific version of a package, or even add a package through git+https or git+ssh (see <a href="https://python-poetry.org/docs/cli/#add">here</a> for more details).</p>
</section>
<section id="remove-packages" class="level2">
<h2 class="anchored" data-anchor-id="remove-packages">Remove packages</h2>
<p>You can remove a package as follows:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb10-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">poetry</span> remove package_to_remove</span></code></pre></div></div>
</section>
<section id="show-package-dependencies" class="level2">
<h2 class="anchored" data-anchor-id="show-package-dependencies">Show package dependencies</h2>
<p>If you want to see a list of all installed packages in your environment, you can run the following command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb11-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">poetry</span> show</span></code></pre></div></div>
<p>Note that this list includes the dependencies of your packages too. It is sometimes helpful to see the dependencies of a single Python package, and <code>poetry show</code> can do that as well. For instance, we can see the list of dependencies of the <code>requests</code> package in our environment using the following command:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb12-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">poetry</span> show requests</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/img/poetry_show_requests_dependencies.png" class="img-fluid figure-img" alt="Screenshot of running command: poetry show requests"></p>
<figcaption>All dependencies of the requests package in the project</figcaption>
</figure>
</div>
<p>Even better, you can see all your project’s dependencies by just running:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb13-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">poetry</span> show <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--tree</span></span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/img/poetry_show_tree.png" class="img-fluid figure-img" alt="Screenshot of running command: poetry show --tree"></p>
<figcaption>A tree of all your project dependencies</figcaption>
</figure>
</div>
<p>In the figure above, the package names in blue (requests and pytest) are the ones explicitly added to the&nbsp;<code>pyproject.toml</code>&nbsp;file. The other libraries, in yellow, are their dependencies and do not need to be in your toml file.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>You may use&nbsp;<code>pip freeze</code>&nbsp;(<code>pip freeze &gt; requirements.txt</code>&nbsp;if you want to output the result into a file) to output all installed packages in your environment, but that will be quite messy.</p>
</div>
</div>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>In this post, we talked about different Python environment, package management, and dependency resolver tools. Then, we went over a setup for how to use conda as the environment manager and Poetry as the package manager and dependency resolver, and the benefits of using this combination in your Python projects.</p>
<p>I hope you find this article useful.</p>
<hr>
</section>
<section id="useful-links" class="level1">
<h1>Useful Links</h1>
<p><a href="https://python-poetry.org/docs/">Introduction | Documentation | Poetry - Python dependency management and packaging made easy.</a></p>
<p><a href="https://github.com/carlosperate/awesome-pyproject">carlosperate/awesome-pyproject</a></p>
<p><a href="https://towardsdatascience.com/a-guide-to-conda-environments-bc6180fc533">The Definitive Guide to Conda Environments</a></p>
<p><a href="https://ahmed-nafies.medium.com/pip-pipenv-poetry-or-conda-7d2398adbac9">Pip, Pipenv, Poetry or Conda</a></p>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2021,
  author = {Alizadeh, Esmaeil},
  title = {A {Guide} to {Python} {Environment,} {Dependency} and
    {Package} {Management:} {Conda} + {Poetry}},
  date = {2021-01-29},
  url = {https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2021" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“A Guide to Python Environment, Dependency and Package
Management: Conda + Poetry,”</span> Jan. 29, 2021. <a href="https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/">https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/</a></div>
</div></div></section></div> ]]></description>
  <category>Conda</category>
  <category>Package Management</category>
  <category>Poetry</category>
  <category>Programming</category>
  <category>Python</category>
  <guid>https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/</guid>
  <pubDate>Fri, 29 Jan 2021 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/guide-to-python-env-pkg-dependency-using-conda-poetry/img/_featured_image.png" medium="image" type="image/png" height="115" width="144"/>
</item>
<item>
  <title>Data Distribution vs. Sampling Distribution: What You Need to Know</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/statistics-data-vs-sampling-distribution/</link>
  <description><![CDATA[ 






<p><img src="https://ealizadeh.com/blog/statistics-data-vs-sampling-distribution/img/_featured_image.png" class="img-fluid" alt="Featured image of the post"></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<a href="https://towardsdatascience.com/data-distribution-vs-sampling-distribution-what-you-need-to-know-294819109796"><strong>Towards Data Science blog</strong></a>.</p>
</div>
</div>
<section id="introduction" class="level1">
<h1>Introduction</h1>
<p>It is important to distinguish between the data distribution (aka population distribution) and the sampling distribution. The distinction is critical when working with the central limit theorem or other concepts like the standard deviation and standard error.</p>
<p>In this post, we will go over the above concepts, as well as bootstrapping to estimate the sampling distribution. In particular, we will cover the following:</p>
<ul>
<li>Data distribution (aka population distribution)</li>
<li>Sampling distribution</li>
<li>Central limit theorem (CLT)</li>
<li>Standard error and its relation with the standard deviation</li>
<li>Bootstrapping</li>
</ul>
<hr>
</section>
<section id="data-distribution" class="level1">
<h1>Data Distribution</h1>
<p>Much of statistics deals with drawing inferences from samples taken from a larger population. Hence, we need to distinguish between analyzing the original data and analyzing its samples. First, let’s go over the definition of the data distribution:</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>💡 <strong>Data distribution:</strong>&nbsp;The frequency distribution of individual data points in the original dataset.</p>
</div>
</div>
<p>Let’s first generate random skewed data that will result in a non-normal (non-Gaussian) data distribution. The reason behind generating non-normal data is to better illustrate the relation between data distribution and the sampling distribution.</p>
<p>So, let’s import the Python plotting packages and generate right-skewed data.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plotting packages and initial setup</span></span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> seaborn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sns</span>
<span id="cb1-3">sns.set_theme(palette<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pastel"</span>)</span>
<span id="cb1-4">sns.set_style(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>)</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb1-7"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> mpl</span>
<span id="cb1-8">mpl.rcParams[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"figure.dpi"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">150</span></span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate Right-Skewed data set</span></span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy.stats <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> skewnorm</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> MinMaxScaler</span>
<span id="cb2-4"></span>
<span id="cb2-5">num_data_points <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span></span>
<span id="cb2-6">max_value <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb2-7">skewness <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Positive values are right-skewed</span></span>
<span id="cb2-8"></span>
<span id="cb2-9">skewed_random_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> skewnorm.rvs(a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> skewness,loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>max_value, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>num_data_points, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)  </span>
<span id="cb2-10">skewed_data_scaled <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MinMaxScaler().fit_transform(skewed_random_data.reshape(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb2-11"></span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot the data (population) distribution</span></span>
<span id="cb2-13">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb2-14">ax.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Data Distribution"</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>)</span>
<span id="cb2-15"></span>
<span id="cb2-16">sns.histplot(skewed_data_scaled, bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, stat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"density"</span>, kde<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, legend<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>, ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax)</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/statistics-data-vs-sampling-distribution/img/data_distribution.png" class="img-fluid figure-img" alt="The histogram of generated right-skewed data"></p>
<figcaption>The histogram of generated right-skewed data</figcaption>
</figure>
</div>
</section>
<section id="sampling-distribution" class="level1">
<h1>Sampling Distribution</h1>
<p>To build a sampling distribution, you repeatedly draw samples from the dataset and compute a statistic, such as the mean, for each one. It’s important to differentiate between the data distribution and the sampling distribution, because most confusion comes from not tracking whether an operation is applied to the original dataset or to its (re)samples.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>💡 <strong>Sampling distribution</strong>: The frequency distribution of a sample statistic (aka metric) over many samples drawn from the dataset <span class="citation" data-cites="bruce2017practical">see [1]</span>. Or to put it simply, the distribution of sample statistics is called the sampling distribution.</p>
</div>
</div>
<p>The algorithm to obtain the sampling distribution is as follows:</p>
<ol type="1">
<li>Draw a sample from the dataset.</li>
<li>Compute a statistic/metric of the drawn sample in Step 1 and save it.</li>
<li>Repeat Steps 1 and 2 many times.</li>
<li>Plot the distribution (histogram) of the computed statistic.</li>
</ol>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> random</span>
<span id="cb3-3"></span>
<span id="cb3-4">sample_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb3-5">sample_mean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb3-6"></span>
<span id="cb3-7">random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Setting the seed for reproducibility of the result</span></span>
<span id="cb3-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> _ <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span>):</span>
<span id="cb3-9">    sample <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> random.sample(skewed_data_scaled.tolist(), k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>sample_size)</span>
<span id="cb3-10">    sample_mean.append(np.mean(sample))</span>
<span id="cb3-11">                    </span>
<span id="cb3-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Mean: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>np<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean(sample_mean)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-13"></span>
<span id="cb3-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot the sampling distribution</span></span>
<span id="cb3-15">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>))</span>
<span id="cb3-16">ax.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sampling Distribution"</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>)</span>
<span id="cb3-17"></span>
<span id="cb3-18">sns.histplot(sample_mean, bins<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, stat<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"density"</span>, kde<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, legend<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Mean: <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.23269</span></span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/statistics-data-vs-sampling-distribution/img/sampling_distribution.png" class="img-fluid figure-img" alt="Sampling distribution"></p>
<figcaption>Sampling distribution</figcaption>
</figure>
</div>
<p>The sampling distribution above is the histogram of the means of the drawn samples (here, 2000 samples of 50 elements each). Its mean is around 0.23, as can be seen by computing the mean of all the sample means.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>⚠️ Do not confuse the sampling distribution with the sample distribution. The sampling distribution considers the distribution of sample statistics (e.g. mean), whereas the sample distribution is basically the distribution of the sample taken from the population.</p>
</div>
</div>
</section>
<section id="central-limit-theorem-clt" class="level1">
<h1>Central Limit Theorem (CLT)</h1>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>💡 <strong>Central Limit Theorem</strong>: As the sample size gets larger, the <strong>sampling distribution</strong> of the mean tends toward a normal distribution (bell-curve shape), even when the data itself is not normally distributed.</p>
</div>
</div>
<p><em>The CLT describes the sampling distribution, not the data distribution, which is an important distinction.</em> The CLT underpins hypothesis testing and confidence interval analysis, so it is worth knowing, even though the widespread use of the bootstrap means the theorem is less central to the practice of data science <span class="citation" data-cites="bruce2017practical">see [1]</span>. More on bootstrapping is provided later in the post.</p>
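<p>As a rough illustration of the theorem, the sketch below measures how skewed the sampling distribution of the mean is for a few sample sizes. The exponential population, the helper name <code>sample_mean_skewness</code>, and the chosen sample sizes are assumptions made for this example and are not part of the original notebook.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible


def sample_mean_skewness(sample_size, num_samples=5000):
    """Skewness of the sampling distribution of the mean for one sample size."""
    # Each row is one sample drawn from a right-skewed (exponential) population
    samples = rng.exponential(scale=1.0, size=(num_samples, sample_size))
    means = samples.mean(axis=1)
    centered = means - means.mean()
    # Third standardized moment: 0 for a perfectly symmetric distribution
    return float(np.mean(centered**3) / np.std(means) ** 3)


skewness_by_size = {n: sample_mean_skewness(n) for n in (2, 30, 500)}
for n, skew in skewness_by_size.items():
    print(f"sample size {n}: skewness of sample means = {skew:.3f}")
</code></pre></div>
<p>The skewness should shrink toward zero as the sample size grows, which is the bell-curve behaviour the CLT describes.</p>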
</section>
<section id="standard-error-se" class="level1">
<h1>Standard Error (SE)</h1>
<p>The <a href="https://en.wikipedia.org/wiki/Standard_error"><strong>standard error</strong></a> is a metric to describe <em>the variability of a statistic in the sampling distribution</em>. We can compute the standard error as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BStandard~Error%7D%20=%20SE%20=%20%5Cfrac%7Bs%7D%7B%5Csqrt%7Bn%7D%7D%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?s"> denotes the standard deviation of the sample values and <img src="https://latex.codecogs.com/png.latex?n"> denotes the sample size. It can be seen from the formula that <em>as the sample size increases, the SE decreases</em>.</p>
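<p>A minimal sketch of the formula, assuming a made-up exponential dataset as a stand-in for real data; the helper name <code>standard_error</code> and the sample sizes are illustrative choices.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=1.0, size=10_000)  # hypothetical skewed data


def standard_error(sample):
    """SE of the mean: sample standard deviation divided by sqrt(n)."""
    sample = np.asarray(sample, dtype=float)
    return float(np.std(sample, ddof=1) / np.sqrt(sample.size))


se_by_n = {}
for n in (25, 100, 400):
    sample = rng.choice(population, size=n, replace=False)
    se_by_n[n] = standard_error(sample)
    print(f"n = {n}: SE = {se_by_n[n]:.4f}")
</code></pre></div>
<p>Because of the square root in the denominator, quadrupling the sample size only roughly halves the standard error.</p>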
<p>We can estimate the standard error using the following approach [1]:</p>
<ol type="1">
<li>Draw a new sample from a dataset.</li>
<li>Compute a statistic/metric (e.g., mean) of the drawn sample in Step 1 and save it.</li>
<li>Repeat Steps 1 and 2 several times.</li>
<li>An estimate of the standard error is obtained by computing the standard deviation of the previous steps’ statistics.</li>
</ol>
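<p>The steps above can be sketched as follows and compared against the formula. The exponential <code>dataset</code> is a hypothetical stand-in, and the variable names are assumptions made for this example.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np

rng = np.random.default_rng(42)
dataset = rng.exponential(scale=1.0, size=10_000)  # hypothetical stand-in dataset
sample_size = 50

# Steps 1-3: draw many samples and store the mean of each one
sample_means = [
    rng.choice(dataset, size=sample_size, replace=False).mean()
    for _ in range(2000)
]

# Step 4: the standard deviation of the stored statistics estimates the SE
se_resampled = float(np.std(sample_means, ddof=1))

# For comparison: the formula, using the standard deviation of the data
se_formula = float(np.std(dataset, ddof=1) / np.sqrt(sample_size))

print(f"resampled SE: {se_resampled:.4f}, formula SE: {se_formula:.4f}")
</code></pre></div>
<p>The two numbers should be close but not identical, since the resampled value is itself an estimate.</p>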
<p>While the above approach can be used to estimate the standard error, we can use bootstrapping instead, which is preferable. I will go over that in the next section.</p>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Warning
</div>
</div>
<div class="callout-body-container callout-body">
<p>⚠️ Do not confuse the standard error with the standard deviation. The standard deviation captures the variability of the individual data points (how spread the data is), unlike the standard error that captures a sample statistic’s variability.</p>
</div>
</div>
</section>
<section id="bootstrapping" class="level1">
<h1>Bootstrapping</h1>
<p>Bootstrapping is an easy way of estimating the sampling distribution: randomly draw samples <em>with replacement</em> from the observed data and compute the statistic of each resample. Bootstrapping does not depend on the CLT or other assumptions on the distribution, and it is the standard way of estimating the SE [1].</p>
<p>Luckily, we can use the <a href="https://rasbt.github.io/mlxtend/user_guide/evaluate/bootstrap/"><code>bootstrap()</code></a> function from the <a href="https://rasbt.github.io/mlxtend/">MLxtend library</a> (<em>you can read my <a href="https://ealizadeh.com/blog/mlxtend-library-for-data-science/">post</a> on the MLxtend library, which covers other interesting functionality</em>). This function also provides the flexibility to pass a custom sample statistic.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> mlxtend.evaluate <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> bootstrap</span>
<span id="cb5-2"></span>
<span id="cb5-3">avg, std_err, ci_bounds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bootstrap(</span>
<span id="cb5-4">    skewed_data_scaled,</span>
<span id="cb5-5">    num_rounds<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>,</span>
<span id="cb5-6">    func<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>np.mean,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A function to compute a sample statistic can be passed here</span></span>
<span id="cb5-7">    ci<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.95</span>,</span>
<span id="cb5-8">    seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Setting the seed for reproducibility of the result</span></span>
<span id="cb5-9">)</span>
<span id="cb5-10"></span>
<span id="cb5-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(</span>
<span id="cb5-12">    <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Mean: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>avg<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb5-13">    <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Standard Error: +/- </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>std_err<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> </span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span></span>
<span id="cb5-14">    <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"CI95: [</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ci_bounds[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ci_bounds[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">]"</span></span>
<span id="cb5-15">)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Mean: <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.23293</span></span>
<span id="cb6-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Standard Error: <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+/-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.00144</span></span>
<span id="cb6-3"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> CI95: [<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.23023</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.23601</span>]</span></code></pre></div></div>
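<p>For readers who want to see what happens under the hood, here is a minimal NumPy-only sketch of the same idea. The function name <code>bootstrap_mean</code>, the percentile-based interval, and the exponential <code>demo_data</code> are assumptions made for illustration; this is not MLxtend’s implementation, and its numbers will not match the output above.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np


def bootstrap_mean(data, num_rounds=1000, ci=0.95, seed=123):
    """Bootstrap the mean: resample with replacement, then summarize."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float).ravel()
    # Each round draws len(data) points with replacement and records the mean
    stats = np.array([
        rng.choice(data, size=data.size, replace=True).mean()
        for _ in range(num_rounds)
    ])
    # Percentile interval: cut (1 - ci) / 2 from each tail of the statistics
    tail = (1.0 - ci) / 2.0
    lower, upper = np.quantile(stats, [tail, 1.0 - tail])
    return float(stats.mean()), float(stats.std(ddof=1)), (float(lower), float(upper))


demo_data = np.random.default_rng(0).exponential(scale=1.0, size=2000)
avg, std_err, ci_bounds = bootstrap_mean(demo_data)
print(f"Mean: {avg:.5f}")
print(f"Standard Error: +/- {std_err:.5f}")
print(f"CI95: [{ci_bounds[0]:.5f}, {ci_bounds[1]:.5f}]")
</code></pre></div>
<p>The standard error here is simply the standard deviation of the resampled means, which is the bootstrap estimate of the sampling distribution’s spread.</p>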
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>The main takeaway is to keep track of whether a computation is done on the original dataset or on samples of it. Plotting a histogram of the data gives the data distribution, whereas plotting a sample statistic computed over many samples gives a sampling distribution. Similarly, the standard deviation tells us how spread out the data is, whereas the standard error tells us how spread out a sample statistic is.</p>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>👉 You can find the Jupyter notebook for this blog post on <a href="https://github.com/e-alizadeh/medium/blob/master/notebooks/data_vs_sampling_distributions.ipynb">GitHub</a>.</p>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-bruce2017practical" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">P. Bruce and A. Bruce, <em>Practical statistics for data scientists: 50 essential concepts</em>. O’Reilly Media, 2017.</div>
</div>
</div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2021,
  author = {Alizadeh, Esmaeil},
  title = {Data {Distribution} Vs. {Sampling} {Distribution:} {What}
    {You} {Need} to {Know}},
  date = {2021-01-11},
  url = {https://ealizadeh.com/blog/statistics-data-vs-sampling-distribution/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2021" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“Data Distribution vs. Sampling Distribution: What You
Need to Know,”</span> Jan. 11, 2021. <a href="https://ealizadeh.com/blog/statistics-data-vs-sampling-distribution/">https://ealizadeh.com/blog/statistics-data-vs-sampling-distribution/</a></div>
</div></div></section></div> ]]></description>
  <category>Bootstrapping</category>
  <category>Data Science</category>
  <category>Distribution</category>
  <category>Statistics</category>
  <guid>https://ealizadeh.com/blog/statistics-data-vs-sampling-distribution/</guid>
  <pubDate>Mon, 11 Jan 2021 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/statistics-data-vs-sampling-distribution/img/_featured_image.png" medium="image" type="image/png" height="44" width="144"/>
</item>
<item>
  <title>A Guide to Metrics (Estimates) in Exploratory Data Analysis</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/guide-to-estimates-in-exploratory-data-analysis/</link>
  <description><![CDATA[ 






<p><img src="https://ealizadeh.com/blog/guide-to-estimates-in-exploratory-data-analysis/img/_featured_image.png" class="img-fluid" alt="A wordcloud of different metrics used in Exploratory Data Analysis"></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<strong><a href="https://towardsdatascience.com/a-guide-to-metrics-in-exploratory-data-analysis-250b33f72297">Towards Data Science blog</a></strong>.</p>
</div>
</div>
<p>Exploratory data analysis (EDA) is an important step in any data science project. We usually start by getting a first look at our data through descriptive statistics. If you are like me, the first function you call might be Pandas <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html"><code>dataframe.describe()</code></a> to obtain descriptive statistics. While such analysis is important, <em>we often underestimate the importance of choosing the correct sample statistics/metrics/estimates</em>.</p>
<p>In this post, we will go over several metrics that you can use in your data science projects. In particular, we are going to cover several estimates of location and variability and their robustness (sensitivity to outliers).</p>
<p>The following common metrics/estimates are covered in this article:</p>
<ul>
<li>Estimates of location (first moment of the distribution)
<ul>
<li>mean, trimmed/truncated mean, weighted mean</li>
<li>median, weighted median</li>
</ul></li>
<li>Estimates of variability (second moment of the distribution)
<ul>
<li>range</li>
<li>variance and standard deviation</li>
<li>mean absolute deviation, median absolute deviation</li>
<li>percentiles (quantiles)</li>
</ul></li>
</ul>
<p>For each metric, we will cover:</p>
<ul>
<li>The definition and mathematical formulation along with some insights.</li>
<li>Whether the metric is robust (its sensitivity to extreme values)</li>
<li>Python implementation and an example</li>
</ul>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>The focus of this article is on the metrics and estimates used in the univariate analysis of numeric data.</p>
</div>
</div>
<p>A note before we start: data scientists and business analysts usually refer to values calculated from the data as a <em>metric</em>, whereas statisticians use the term <em>estimates</em> for such values <span class="citation" data-cites="bruce2017practical">see [1]</span>.</p>
<section id="estimates-of-location" class="level1">
<h1>Estimates of Location</h1>
<p>Estimates of location are measures of the central tendency of the data (where most of the data is located). In statistics, this is usually referred to as the first moment of a distribution.</p>
<section id="mean" class="level2">
<h2 class="anchored" data-anchor-id="mean">Mean</h2>
<p>The <em>arithmetic mean</em>, or simply <em>mean</em> or <em>average</em>, is probably the most popular estimate of location. There are different variants of the mean, such as the <em>weighted mean</em> and the <em>trimmed/truncated mean</em>. You can see how they are computed below.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Bmatrix%7D%0A%20%20%20%20%5Ctext%7BMean%7D%20&amp;%20%7B=%20%5Cbar%7Bx%7D%20=%20%5Cfrac%7B%5Csum%5Climits_%7Bi%20=%201%7D%5E%7Bn%7Dx_%7Bi%7D%7D%7Bn%7D%5Cquad%5Cquad%5Cquad%5Cquad%7D%20&amp;%20%7B(1.1)%7D%20%5C%5C%0A%20%20%20%20%5Ctext%7BWeighted%20Mean%7D%20&amp;%20%7B=%20%7B%5Cbar%7Bx%7D%7D_%7Bw%7D%20=%20%5Cfrac%7B%5Csum%5Climits_%7Bi%20=%201%7D%5E%7Bn%7Dw_%7Bi%7Dx_%7Bi%7D%7D%7B%5Csum%5Climits_%7Bi%20=%201%7D%5E%7Bn%7Dw_%7Bi%7D%7D%7D%20&amp;%20%7B(1.2)%7D%20%5C%5C%0A%20%20%20%20%5Ctext%7BTruncated%20Mean%7D%20&amp;%20%7B=%20%7B%5Cbar%7Bx%7D%7D_%7B%5Ctext%7Btr%7D%7D%20=%20%5Cfrac%7B%5Csum%5Climits_%7Bi%20=%20p%20+%201%7D%5E%7Bn%20-%20p%7Dx_%7Bi%7D%7D%7Bn%20-%202p%7D%7D%20&amp;%20%7B(1.3)%7D%0A%5Cend%7Bmatrix%7D%0A"></p>
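<p><img src="https://latex.codecogs.com/png.latex?n"> denotes the total number of observations (rows). In equation 1.3, the values are sorted in ascending order and <img src="https://latex.codecogs.com/png.latex?p"> is the number of smallest and largest values that are dropped.</p>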
<p>The weighted mean (equation 1.2) is a variant of the mean that can be used when the sample data does not represent the different groups in a dataset proportionally. By assigning a larger weight to under-represented groups, the computed weighted mean will more accurately represent all groups in our dataset.</p>
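<p>As a small illustration of equation 1.2, the sketch below up-weights a single observation from an under-represented group. The values, the weights, and the group labels are made up for this example.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np

# Hypothetical survey: three observations from group A, only one from group B
values = np.array([10.0, 12.0, 11.0, 20.0])
weights = np.array([1.0, 1.0, 1.0, 3.0])  # up-weight the single group B value

plain_mean = float(np.mean(values))
weighted_mean = float(np.average(values, weights=weights))  # equation 1.2

print(f"Mean: {plain_mean}")               # 53 / 4 = 13.25
print(f"Weighted mean: {weighted_mean}")   # 93 / 6 = 15.5
</code></pre></div>
<p>The weighted mean moves toward the under-represented group’s value because that observation now counts three times.</p>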
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>Extreme values can easily influence both the <strong>mean</strong> and <strong>weighted mean</strong> since neither one is a robust metric!</p>
</div>
</div>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>💡 <strong>Robust estimate</strong>: A metric that is not sensitive to extreme values (outliers).</p>
</div>
</div>
</div>
<!-- Equation numbering -->
<p>Another variant of the mean is the <em>trimmed mean</em> (eq. 1.3), which is a robust estimate. This metric is used to calculate the final score in many sports where each judge on a panel gives a score: the lowest and the highest scores are dropped, and the mean of the remaining scores is computed as a part of the final score<sup>1</sup>. One such example is the international diving scoring system.</p>
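<p>A minimal sketch of equation 1.3 applied to a judging panel. The helper name <code>trimmed_mean</code> and the scores are hypothetical and chosen only to show the effect of one outlier.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import numpy as np


def trimmed_mean(values, p):
    """Mean after dropping the p smallest and p largest values (eq. 1.3)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    return float(ordered[p:ordered.size - p].mean())


judge_scores = [7.5, 8.0, 8.0, 8.5, 3.0]  # hypothetical scores; 3.0 is an outlier

plain = float(np.mean(judge_scores))       # pulled down by the outlier
trimmed = trimmed_mean(judge_scores, p=1)  # drops 3.0 and 8.5 before averaging

print(f"Mean: {plain:.2f}")            # 7.00
print(f"Trimmed mean: {trimmed:.2f}")  # 7.83
</code></pre></div>
<p>Dropping one score from each end keeps the single low score from dominating the result, which is what makes the trimmed mean robust.</p>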
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>💡 In statistics, <img src="https://latex.codecogs.com/png.latex?%5Cbar%7B%5Cmathbf%7Bx%7D%7D"> refers to a <strong>sample</strong> mean, whereas <img src="https://latex.codecogs.com/png.latex?%5Cmu"> refers to the <strong>population</strong> mean.</p>
</div>
</div>
</div>
<section id="a-use-case-for-the-weighted-mean" class="level3">
<h3 class="anchored" data-anchor-id="a-use-case-for-the-weighted-mean">A Use Case for the Weighted Mean</h3>
<p>If you want to buy a smartphone, a smartwatch, or any other gadget with many options, you can use the following method to choose among them.</p>
<p>Let’s assume you want to buy a smartphone, and the following features are important to you: 1) battery life, 2) camera quality, 3) price, and 4) the phone design. Then, you give the following weights to each one:</p>
<table class="caption-top table">
<caption>List of features and their corresponding weights</caption>
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: center;">FEATURE</th>
<th style="text-align: center;">WEIGHT</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">Battery life</td>
<td style="text-align: center;">0.15</td>
</tr>
<tr class="even">
<td style="text-align: center;">Camera quality</td>
<td style="text-align: center;">0.30</td>
</tr>
<tr class="odd">
<td style="text-align: center;">Price</td>
<td style="text-align: center;">0.25</td>
</tr>
<tr class="even">
<td style="text-align: center;">Phone design</td>
<td style="text-align: center;">0.30</td>
</tr>
</tbody>
</table>
<p>Let’s say you have two options: an iPhone and Google’s Pixel. You can give each feature a score between 1 and 10 (1 being the worst and 10 being the best). After going over some reviews, you may give the following scores to the features of each phone.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/guide-to-estimates-in-exploratory-data-analysis/img/weighted_mean_scores.png" class="img-fluid figure-img"></p>
<figcaption>Table 2: Scores given to iPhone and Pixel for each feature</figcaption>
</figure>
</div>
<p>So, which phone is better for you?</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Bmatrix%7D%0A%20%20%20%20%5Ctext%7BiPhone%20score%7D%20&amp;%20%7B=%200.15%20%5Ctimes%206%20+%200.3%20%5Ctimes%209%20+%200.25%20%5Ctimes%201%20+%200.3%20%5Ctimes%209%20=%206.55%7D%20%5C%5C%0A%20%20%20%20%5Ctext%7BGoogle%20Pixel%20score%7D%20&amp;%20%7B=%200.15%20%5Ctimes%205%20+%200.3%20%5Ctimes%209.5%20+%200.25%20%5Ctimes%208%20+%200.3%20%5Ctimes%205%20=%207.1%7D%20%5C%5C%0A%5Cend%7Bmatrix%7D%0A"></p>
<p>And based on your feature preferences, the Google Pixel might be the better option for you!</p>
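<p>The same comparison can be reproduced with NumPy. This is a small sketch using the weights from the table and the scores that appear in the calculation above; the variable names are illustrative.</p>

```python
import numpy as np

# Feature order: battery life, camera quality, price, phone design
weights = np.array([0.15, 0.30, 0.25, 0.30])

scores = {
    "iPhone": np.array([6, 9, 1, 9]),
    "Google Pixel": np.array([5, 9.5, 8, 5]),
}

# np.average divides by the sum of the weights, so they need not add up to 1
weighted_scores = {
    phone: float(np.average(values, weights=weights))
    for phone, values in scores.items()
}
best_phone = max(weighted_scores, key=weighted_scores.get)

for phone, score in weighted_scores.items():
    print(phone, round(score, 2))
print("Best option:", best_phone)
```

<p>Changing the weights to reflect different priorities (for example, a higher weight on price) can change which phone comes out ahead.</p>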
</section>
</section>
<section id="median" class="level2">
<h2 class="anchored" data-anchor-id="median">Median</h2>
<p>Median is the middle of a sorted list, and it’s a robust estimate. For an ordered sequence <img src="https://latex.codecogs.com/png.latex?x_1,%E2%80%86x_2,%E2%80%86...,%E2%80%86x_n">, the median is computed as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Bmatrix%7D%0A%20%20%20%20%7B%5Ctext%7Bif~%7Dn%5Ctext%7B~is~odd%7D%5Cquad%7D%20&amp;%20%5Cleft.%20%7B%7D%5Clongrightarrow%5Cquad%5Ctext%7BMedian%7D%20=%20x_%7B%5Cfrac%7Bn%20+%201%7D%7B2%7D%7D%20%5Cright.%20%5C%5C%0A%20%20%20%20%7B%5Ctext%7Bif~%7Dn%5Ctext%7B~is~even%7D%5Cquad%7D%20&amp;%20%5Cleft.%20%7B%7D%5Clongrightarrow%5Cquad%5Ctext%7BMedian%7D%20=%20%5Cfrac%7B1%7D%7B2%7D(x_%7B%5Cfrac%7Bn%7D%7B2%7D%7D%20+%20x_%7B%5Cfrac%7Bn%7D%7B2%7D%20+%201%7D)%20%5Cright.%20%5C%5C%0A%5Cend%7Bmatrix%7D%0A"></p>
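<p>As a rough sketch, the odd and even cases can be written directly in Python. The function name <code>median</code> is just for illustration; in practice you would normally call a library routine instead.</p>

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2  # 0-based index of the upper middle element
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 1, 2]))                   # odd number of observations
print(median([2, 1, 2, 3, 2, 2, 3, 20]))   # even number of observations
```

<p>Note that Python lists are 0-indexed, while the formula counts observations from 1.</p>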
<p>Analogous to the weighted mean, we can also have the <em>weighted median</em> that can be computed as follows for an ordered sequence <img src="https://latex.codecogs.com/png.latex?x_1,%E2%80%86x_2,%E2%80%86...,%E2%80%86x_n"> with weights <img src="https://latex.codecogs.com/png.latex?w_1,%E2%80%86w_2,%E2%80%86%E2%80%A6,%E2%80%86w_n"> where <img src="https://latex.codecogs.com/png.latex?w_i%3E0">.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Bmatrix%7D%0A%20%20%20%20&amp;%20%7B%5Ctext%7BWeighted~Median%7D%20=%20x_%7Bk%7D%7D%20%5C%5C%0A%20%20%20%20&amp;%20%7B%5Ctext%7Bwhere%7D%5Cquad%5Csum%5Climits_%7Bi%20=%201%7D%5E%7Bn%7Dw_%7Bi%7D%20=%201%5Cquad%5Ctext%7Band%7D%5Cquad%5Csum%5Climits_%7Bi%20=%20k%20+%201%7D%5E%7Bn%7Dw_%7Bi%7D%20%5Cleq%20%5Cfrac%7B1%7D%7B2%7D%5Cquad%5Ctext%7Band%7D%5Cquad%5Csum%5Climits_%7Bi%20=%201%7D%5E%7Bk%20-%201%7Dw_%7Bi%7D%20%5Cleq%20%5Cfrac%7B1%7D%7B2%7D%7D%20%5C%5C%0A%5Cend%7Bmatrix%7D%0A"></p>
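<p>One way to apply this definition is to sort the data, normalize the weights so they sum to 1, and pick the first observation at which the cumulative weight reaches one half. The sketch below does this with NumPy; the function name <code>weighted_median</code> is illustrative, and when two observations both satisfy the condition it returns the lower one, which may differ from the convention a given library uses.</p>

```python
import numpy as np


def weighted_median(values, weights):
    """Lower weighted median: first sorted value whose cumulative weight share reaches 0.5."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    order = np.argsort(values)
    values, weights = values[order], weights[order]

    # Normalizing makes the weights sum to 1, as the definition requires
    cumulative_share = np.cumsum(weights) / weights.sum()
    k = np.searchsorted(cumulative_share, 0.5)
    return values[k]


data = [2, 1, 2, 3, 2, 2, 3, 20]
weights = [1, 0.5, 1, 1, 1, 1, 1, 0.5]
print("Weighted median:", weighted_median(data, weights))
```

<p>With equal weights this reduces to the ordinary median for an odd number of observations.</p>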
</section>
<section id="mode" class="level2">
<h2 class="anchored" data-anchor-id="mode">Mode</h2>
<p>The mode is the value that appears most often in the data and is typically used for categorical data, and less for numeric data <span class="citation" data-cites="bruce2017practical">see [1]</span>.</p>
</section>
<section id="python-implementation" class="level2">
<h2 class="anchored" data-anchor-id="python-implementation">Python Implementation</h2>
<p>Let’s first import all necessary Python libraries and generate our dataset.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> stats</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> robustats</span>
<span id="cb1-5"></span>
<span id="cb1-6">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame({</span>
<span id="cb1-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data"</span>: [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>],</span>
<span id="cb1-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"weights"</span>: [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>] <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Weights do not need to add up to 1</span></span>
<span id="cb1-9">})</span>
<span id="cb1-10">data, weights <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data"</span>], df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"weights"</span>]</span></code></pre></div></div>
<!-- Equation numbering -->
<p>You can use NumPy’s <a href="https://numpy.org/doc/stable/reference/generated/numpy.average.html"><code>average()</code></a> function to calculate the mean and weighted mean (equations 1.1 &amp; 1.2). For computing the truncated mean, you can use <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.trim_mean.html"><code>trim_mean()</code></a> from the SciPy stats module. A common choice for truncating the top and bottom of the data is 10% <span class="citation" data-cites="bruce2017practical">see [1]</span>. Note that <code>trim_mean()</code> removes only a whole number of observations from each end, so with just eight data points a 10% cut removes nothing and the truncated mean in the output below equals the plain mean.</p>
<p>You can use NumPy’s <a href="https://numpy.org/doc/stable/reference/generated/numpy.median.html"><code>median()</code></a> function to calculate the median. For computing the weighted median, you can use <code>weighted_median()</code> from the <a href="https://github.com/FilippoBovo/robustats">robustats</a> Python library (you can install it using <code>pip install robustats</code>)<sup>2</sup>. Robustats is a high-performance Python library, implemented in C, for computing robust statistical estimators.</p>
<p>For computing the mode, you can use the <code>mode()</code> function either from the robustats library, which is particularly useful on large datasets, or from the <code>scipy.stats</code> module.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">mean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.average(data) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># You can use Pandas dataframe.mean()</span></span>
<span id="cb2-2">weighted_mean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.average(data, weights<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>weights)</span>
<span id="cb2-3">truncated_mean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stats.trim_mean(data, proportiontocut<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>)</span>
<span id="cb2-4">median <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.median(data) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># You can use Pandas dataframe.median()</span></span>
<span id="cb2-5">weighted_median <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> robustats.weighted_median(x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>data, weights<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>weights)</span>
<span id="cb2-6">mode <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stats.mode(data)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># You can also use robustats.mode() on larger datasets</span></span>
<span id="cb2-7"></span>
<span id="cb2-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean: "</span>, mean.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb2-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Weighted Mean: "</span>, weighted_mean.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb2-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Truncated Mean: "</span>, truncated_mean.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb2-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Median: "</span>, median)</span>
<span id="cb2-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Weighted Median: "</span>, weighted_median)</span>
<span id="cb2-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mode: "</span>, mode)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Mean:  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.375</span></span>
<span id="cb3-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Weighted Mean:  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.5</span></span>
<span id="cb3-3"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Truncated Mean:  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">4.375</span></span>
<span id="cb3-4"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Median:  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span></span>
<span id="cb3-5"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Weighted Median:  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span></span>
<span id="cb3-6"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Mode:  ModeResult(mode<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>array([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]), count<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>array([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>]))</span></code></pre></div></div>
<p>Now, let’s remove 20 from our data and see how that impacts the mean.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">mean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.average(data[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Remove the last data point (20)</span></span>
<span id="cb4-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean: "</span>, mean.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb4-3"></span>
<span id="cb4-4"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Mean:  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.143</span></span></code></pre></div></div>
<p>You can see how much the last data point (20) affected the mean (4.375 vs 2.143). In many situations we end up with outliers that should be cleaned from the dataset, such as faulty measurements that are orders of magnitude away from the other data points.</p>
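<p>Instead of deleting the outlier by hand, a robust estimate can absorb it. The sketch below, which assumes the same eight values as above, compares the plain mean with two truncated means and the median. SciPy’s <code>trim_mean()</code> drops <code>int(proportiontocut * n)</code> observations from each end, so the proportion has to be large enough to remove at least one value.</p>

```python
import numpy as np
from scipy import stats

data = np.array([2, 1, 2, 3, 2, 2, 3, 20])

# int(0.1 * 8) == 0, so nothing is trimmed and the result equals the plain mean
trimmed_10 = stats.trim_mean(data, proportiontocut=0.1)

# int(0.125 * 8) == 1, so the smallest (1) and largest (20) values are dropped
trimmed_12_5 = stats.trim_mean(data, proportiontocut=0.125)

print("Mean:", data.mean())
print("Truncated mean (10%):", trimmed_10)
print("Truncated mean (12.5%):", round(trimmed_12_5, 3))
print("Median:", np.median(data))
```

<p>Once the value 20 is actually trimmed, the truncated mean lands close to the median, while the plain mean stays pulled toward the outlier.</p>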
</section>
</section>
<section id="estimates-of-variability" class="level1">
<h1>Estimates of Variability</h1>
<p>The second dimension (or moment) addresses how the data is spread out (the variability or dispersion of the data). For this, we measure the difference (aka residual) between an estimate of location and an observed value <span class="citation" data-cites="bruce2017practical">see [1]</span>.</p>
<section id="mean-absolute-deviation" class="level2">
<h2 class="anchored" data-anchor-id="mean-absolute-deviation">Mean Absolute Deviation</h2>
<p>One way to get this estimate is to calculate the difference between the largest and the smallest value to get the <em>range</em>. However, the range is, by definition, very sensitive to the two extreme values. Another option is the mean absolute deviation, which is <em>the average of the absolute deviations from the mean</em>, as shown in the formula below:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BMean~absolute~deviation%7D%20=%20%5Cfrac%7B%5Csum%5Climits_%7Bi%20=%201%7D%5E%7Bn%7D%5Cmid%20x_%7Bi%7D%20-%20%5Cbar%7Bx%7D%5Cmid%7D%7Bn%7D%0A"></p>
<p>One reason the mean absolute deviation receives less attention is that, mathematically, it’s preferable not to work with absolute values when other desirable options, such as squared values, are available (for instance, <img src="https://latex.codecogs.com/png.latex?x%5E2"> is differentiable everywhere, while the derivative of <img src="https://latex.codecogs.com/png.latex?%7Cx%7C"> is not defined at <img src="https://latex.codecogs.com/png.latex?x=0">).</p>
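<p>Both the range and the mean absolute deviation are easy to compute by hand. The following is a minimal NumPy sketch of the formula above, using the small example dataset from the earlier Python section; the variable names are illustrative.</p>

```python
import numpy as np

data = np.array([2, 1, 2, 3, 2, 2, 3, 20])

# Range: driven entirely by the two extreme values
value_range = data.max() - data.min()

# Mean absolute deviation: average distance of each point from the mean
deviations = data - data.mean()
mean_abs_dev = np.abs(deviations).mean()

print("Range:", value_range)
print("Mean absolute deviation:", round(mean_abs_dev, 3))
```

<p>Taking the absolute value matters: the raw deviations from the mean always sum to zero, so averaging them without it would tell us nothing about spread.</p>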
</section>
<section id="variance-standard-deviation" class="level2">
<h2 class="anchored" data-anchor-id="variance-standard-deviation">Variance &amp; Standard Deviation</h2>
<p>The variance and standard deviation are much more popular statistics than the mean absolute deviation to estimate the data dispersion.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Bmatrix%7D%0A%5Ctext%7BVariance%7D%20&amp;%20%7B=%20s%5E%7B2%7D%20=%20%5Cfrac%7B%5Csum%5Climits_%7Bi%20=%201%7D%5E%7Bn%7D(x_%7Bi%7D%20-%20%5Cbar%7Bx%7D)%5E%7B2%7D%7D%7Bn%20-%201%7D%7D%20%5C%5C%0A&amp;%20%5C%5C%0A%5Ctext%7BStandard%20Deviation%7D%20&amp;%20%7B=%20s%20=%20%5Csqrt%7B%5Ctext%7BVariance%7D%7D%7D%20%5C%5C%0A%5Cend%7Bmatrix%7D%0A"></p>
<div class="callout callout-style-default callout-important callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Important
</div>
</div>
<div class="callout-body-container callout-body">
<p>In statistics, <img src="https://latex.codecogs.com/png.latex?s"> is used to refer to a <strong>sample</strong> standard deviation, whereas <img src="https://latex.codecogs.com/png.latex?%5Csigma"> refers to the <strong>population</strong> standard deviation.</p>
</div>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>The variance is essentially the average of the squared deviations from the mean (the sample formula above divides by <em>n</em> − 1 rather than <em>n</em>).</p>
</div>
</div>
<p>As can be noted from the formula, the standard deviation is on the same scale as the original data, making it an easier metric to interpret than the variance. Analogous to the trimmed mean, we can also compute the <em>trimmed/truncated standard deviation</em>, which is less sensitive to outliers.</p>
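<p>To make the formulas concrete, here is a short sketch that computes the sample variance and standard deviation step by step and checks them against NumPy. It assumes the example dataset from the earlier Python section. Note that NumPy divides by <em>n</em> by default (<code>ddof=0</code>); passing <code>ddof=1</code> gives the <em>n</em> − 1 denominator used in the formula above.</p>

```python
import numpy as np

data = np.array([2, 1, 2, 3, 2, 2, 3, 20])
n = data.size

squared_deviations = (data - data.mean()) ** 2

# Sample versions (n - 1 in the denominator), matching the formula above
sample_variance = squared_deviations.sum() / (n - 1)
sample_std = np.sqrt(sample_variance)

# Population version (n in the denominator), which is NumPy's default
population_variance = squared_deviations.mean()

print("Sample variance:", round(sample_variance, 3), "vs", round(np.var(data, ddof=1), 3))
print("Sample standard deviation:", round(sample_std, 3), "vs", round(np.std(data, ddof=1), 3))
print("Population variance:", round(population_variance, 3), "vs", round(np.var(data), 3))
```

<p>The difference between the two denominators is noticeable on small samples like this one and shrinks as <em>n</em> grows.</p>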
<p>A good way of remembering some of the above estimates of variability is to link them to other metrics or distances that share a similar formulation <span class="citation" data-cites="bruce2017practical">see [1]</span>. For instance,</p>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>💡 Variance <img src="https://latex.codecogs.com/png.latex?%5Cequiv"> Mean Squared Error (MSE) (aka Mean Squared Deviation MSD)</p>
</div>
</div>
</div>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>💡 Standard deviation <img src="https://latex.codecogs.com/png.latex?%5Cequiv"> L2-norm, Euclidean norm</p>
</div>
</div>
</div>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>💡 Mean absolute deviation <img src="https://latex.codecogs.com/png.latex?%5Cequiv"> L1-norm, Manhattan norm, Taxicab norm</p>
</div>
</div>
</div>
</section>
<section id="median-absolute-deviation-mad" class="level2">
<h2 class="anchored" data-anchor-id="median-absolute-deviation-mad">Median Absolute Deviation (MAD)</h2>
<p>Like the arithmetic mean, none of the estimates of variability (variance, standard deviation, mean absolute deviation) is robust to outliers. Instead, we can use the <em>median absolute deviation from the median</em> to check how our data is spread out in the presence of outliers. The median absolute deviation is a robust estimator, just like the median.</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbegin%7Bmatrix%7D%0A&amp;%20%7B%5Ctext%7BMedian%20absolute%20deviation%7D%20=%20%5Ctext%7BMedian%7D(%5Cmid%20x_%7B1%7D%20-%20m%5Cmid,%5Cmid%20x_%7B2%7D%20-%20m%5Cmid,...,%5Cmid%20x_%7Bn%7D%20-%20m%5Cmid)%7D%20%5C%5C%0A&amp;%20%7B%5Cquad%5Cquad%5Ctext%7Bwhere%20%7Dm%5Ctext%7B%20is%20the%20median%7D%7D%20%5C%5C%0A%5Cend%7Bmatrix%7D%0A"></p>
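<p>The definition translates almost word for word into NumPy. The sketch below assumes the example dataset from the earlier Python section and computes the unscaled median absolute deviation.</p>

```python
import numpy as np

data = np.array([2, 1, 2, 3, 2, 2, 3, 20])

m = np.median(data)                      # median of the data
absolute_deviations = np.abs(data - m)   # distance of each point from the median
median_abs_dev = np.median(absolute_deviations)

print("Median:", m)
print("Median absolute deviation:", median_abs_dev)
```

<p>The outlier (20) produces one very large absolute deviation, but because only the middle of the sorted deviations is used, it has no effect on the result. Library functions sometimes multiply this value by a constant (about 1.4826) so that it is comparable to the standard deviation for normally distributed data.</p>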
</section>
<section id="percentiles" class="level2">
<h2 class="anchored" data-anchor-id="percentiles">Percentiles</h2>
<p>Percentiles (or quantiles) are another measure of data dispersion based on <a href="https://en.wikipedia.org/wiki/Order_statistic">order statistics</a> (statistics based on sorted data). The <img src="https://latex.codecogs.com/png.latex?P">-th percentile is a value such that at least <img src="https://latex.codecogs.com/png.latex?P"> percent of the values are less than or equal to it.</p>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>💡 The median is the 50th percentile (0.5 quantile, or Q2).</p>
</div>
</div>
</div>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>💡 The percentile is technically a weighted average<span class="citation" data-cites="bruce2017practical">see [1]</span>.</p>
</div>
</div>
</div>
<p>The 25th (Q1) and 75th (Q3) percentiles are particularly interesting since their difference (Q3 – Q1) is the width of the middle 50% of the data. This difference is known as the <a href="https://en.wikipedia.org/wiki/Interquartile_range"><strong>interquartile range (IQR)</strong></a>. Percentiles are also used to visualize the data distribution in boxplots.</p>
<p>A nice article about boxplots is available on <a href="https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51">Towards Data Science blog</a>.</p>
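<p>As a quick illustration, the quartiles and the IQR of the example dataset from the earlier Python section can be computed with NumPy’s <code>percentile()</code>. This sketch relies on NumPy’s default linear interpolation between neighbouring sorted values; other interpolation rules can give slightly different quartiles.</p>

```python
import numpy as np

data = np.array([2, 1, 2, 3, 2, 2, 3, 20])

# 25th, 50th and 75th percentiles in a single call
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

print("Q1:", q1)
print("Q2 (median):", q2)
print("Q3:", q3)
print("IQR:", iqr)
```

<p>Like the median, the IQR ignores how far away the extreme values are, which is why the outlier (20) does not inflate it.</p>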
</section>
<section id="python-implementation-1" class="level2">
<h2 class="anchored" data-anchor-id="python-implementation-1">Python Implementation</h2>
<p>You can use NumPy’s <a href="https://numpy.org/doc/stable/reference/generated/numpy.var.html"><code>var()</code></a> and <a href="https://numpy.org/doc/stable/reference/generated/numpy.std.html"><code>std()</code></a> functions to calculate the variance and standard deviation, respectively. Both default to <code>ddof=0</code> (dividing by <em>n</em>); pass <code>ddof=1</code> to get the sample versions defined above. To calculate the mean absolute deviation, older versions of Pandas provided a <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mad.html"><code>mad()</code></a> method; it was deprecated in Pandas 1.5 and removed in 2.0, so on recent versions you can compute it directly as <code>(data - data.mean()).abs().mean()</code>. For computing the trimmed standard deviation, you can use SciPy’s <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tstd.html"><code>tstd()</code></a> from the stats module; without its <code>limits</code> argument nothing is trimmed and it returns the ordinary sample standard deviation. You can use Pandas <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html"><code>boxplot()</code></a> to quickly visualize a boxplot of the data.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb5-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> stats</span>
<span id="cb5-4"></span>
<span id="cb5-5">variance <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.var(data)</span>
<span id="cb5-6">standard_deviation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.std(data)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Or df["data"].std(ddof=0)</span></span>
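<span id="cb5-7">mean_absolute_deviation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> data.mean()).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>().mean()  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Series.mad() was removed in pandas 2.0</span></span>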
<span id="cb5-8">trimmed_standard_deviation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stats.tstd(data)</span>
<span id="cb5-9">median_absolute_deviation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stats.median_abs_deviation(data, scale<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"normal"</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># stats.median_absolute_deviation() is deprecated</span></span>
<span id="cb5-10"></span>
<span id="cb5-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Percentile</span></span>
<span id="cb5-12">Q1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.quantile(data, q<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Can also use dataframe.quantile(0.25)</span></span>
<span id="cb5-13">Q3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.quantile(data, q<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Can also use dataframe.quantile(0.75)</span></span>
<span id="cb5-14">IQR <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Q3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> Q1</span>
<span id="cb5-15"></span>
<span id="cb5-16"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Variance: "</span>, variance.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb5-17"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Standard Deviation: "</span>, standard_deviation.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb5-18"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean Absolute Deviation: "</span>, mean_absolute_deviation.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb5-19"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Trimmed Standard Deviation: "</span>, trimmed_standard_deviation.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb5-20"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Median Absolute Deviation: "</span>, median_absolute_deviation.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb5-21"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Interquantile Range (IQR): "</span>, IQR)</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Variance:  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">35.234</span></span>
<span id="cb6-2"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Standard Deviation:  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.936</span></span>
<span id="cb6-3"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Mean Absolute Deviation:  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.906</span></span>
<span id="cb6-4"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Trimmed Standard Deviation:  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">6.346</span></span>
<span id="cb6-5"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Median Absolute Deviation:  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.741</span></span>
<span id="cb6-6"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> Interquantile Range (IQR):  <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span></span></code></pre></div></div>
<hr>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/guide-to-estimates-in-exploratory-data-analysis/img/list_of_all_metrics.png" class="img-fluid figure-img"></p>
<figcaption>Table 3: A list of all metrics/estimates</figcaption>
</figure>
</div>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>In this post, I talked about various estimates of location and variability. In particular, I covered more than 10 different sample statistics and whether they are robust metrics or not. All the metrics, along with their corresponding Python and R functions, are summarized in Table 3. We also saw how the presence of an outlier may impact non-robust metrics like the mean. In this case, we may want to use a robust estimate. However, in some problems, such as anomaly detection, we are specifically interested in studying extreme cases and outliers.</p>
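<p>To make the effect of an outlier concrete, here is a minimal sketch using only Python’s standard <code>statistics</code> module. The sample values are made up for illustration and are not taken from the dataset used in this post.</p>

```python
import statistics

# Hypothetical sample; the value 100 is an artificial outlier.
data = [2, 3, 3, 4, 5]
data_with_outlier = data + [100]

# The mean is pulled strongly towards the outlier ...
mean_before = statistics.mean(data)
mean_after = statistics.mean(data_with_outlier)

# ... while the median barely moves.
median_before = statistics.median(data)
median_after = statistics.median(data_with_outlier)

print("Mean:  ", mean_before, "->", mean_after)
print("Median:", median_before, "->", median_after)
```

<p>With these numbers, the mean jumps from 3.4 to 19.5, whereas the median only shifts from 3 to 3.5, which is why the median is considered the more robust of the two.</p>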
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>📓 You can find the Jupyter notebook for this blog post on <a href="https://github.com/e-alizadeh/medium/blob/master/notebooks/A_Guide_to_Metrics_Estimates_in_EDA.ipynb">GitHub</a>.</p>
</div>
</div>
</div>
</section>
<section id="useful-links" class="level1">
<h1>Useful Links</h1>
<p><a href="https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51">Understanding Boxplots</a></p>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-bruce2017practical" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">P. Bruce and A. Bruce, <em>Practical statistics for data scientists: 50 essential concepts</em>. O’Reilly Media, 2017.</div>
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Wikipedia, <a href="https://en.wikipedia.org/wiki/Truncated_mean">Truncated mean</a>↩︎</p></li>
<li id="fn2"><p>https://github.com/FilippoBovo/robustats↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2020,
  author = {Alizadeh, Esmaeil},
  title = {A {Guide} to {Metrics} {(Estimates)} in {Exploratory} {Data}
    {Analysis}},
  date = {2020-12-14},
  url = {https://ealizadeh.com/blog/guide-to-estimates-in-exploratory-data-analysis/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2020" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“A Guide to Metrics (Estimates) in Exploratory Data
Analysis,”</span> Dec. 14, 2020. <a href="https://ealizadeh.com/blog/guide-to-estimates-in-exploratory-data-analysis/">https://ealizadeh.com/blog/guide-to-estimates-in-exploratory-data-analysis/</a></div>
</div></div></section></div> ]]></description>
  <category>Data Science</category>
  <category>Exploratory Data Analysis</category>
  <category>Guide to</category>
  <category>Python</category>
  <category>Statistics</category>
  <guid>https://ealizadeh.com/blog/guide-to-estimates-in-exploratory-data-analysis/</guid>
  <pubDate>Mon, 14 Dec 2020 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/guide-to-estimates-in-exploratory-data-analysis/img/_featured_image.png" medium="image" type="image/png" height="85" width="144"/>
</item>
<item>
  <title>NeuralProphet: A Time-Series Modeling Python Library based on Neural-Networks</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/neural-prophet-library/</link>
  <description><![CDATA[ 






<p><img src="https://ealizadeh.com/blog/neural-prophet-library/img/_featured_image.png" class="img-fluid" alt="Featured image of the post"></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<a href="https://towardsdatascience.com/neural-prophet-a-time-series-modeling-library-based-on-neural-networks-dd02dc8d868d"><strong>Towards Data Science blog</strong></a>.</p>
</div>
</div>
<p><a href="https://github.com/ourownstory/neural_prophet">NeuralProphet</a> <sup>1</sup>&nbsp;is a Python library for modeling time-series data based on neural networks. It’s built on top of&nbsp;<a href="https://pytorch.org/">PyTorch</a>&nbsp;and is heavily inspired by the&nbsp;<a href="https://github.com/facebook/prophet">Facebook Prophet</a>&nbsp;and&nbsp;<a href="https://github.com/ourownstory/AR-Net">AR-Net</a>&nbsp;libraries.</p>
<section id="neuralprophet-library" class="level1">
<h1>NeuralProphet Library</h1>
<section id="neuralprophet-vs.-prophet" class="level2">
<h2 class="anchored" data-anchor-id="neuralprophet-vs.-prophet">NeuralProphet vs.&nbsp;Prophet</h2>
<p>Given the library’s name, you may wonder what the main difference is between Facebook’s Prophet library and NeuralProphet. According to NeuralProphet’s&nbsp;<a href="https://ourownstory.github.io/neural_prophet/changes-from-prophet/">documentation</a>, the added features are:</p>
<ul>
<li>Using PyTorch’s gradient descent optimization engine, which makes the modeling process much faster than Prophet</li>
<li>Using AR-Net for modeling time-series autocorrelation (aka serial correlation)</li>
<li>Custom losses and metrics</li>
<li>Having configurable non-linear layers of feed-forward neural networks</li>
<li><em>etc</em>.</li>
</ul>
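<p>Since autocorrelation is central to the list above, here is a minimal plain-Python sketch of the lag-<em>k</em> sample autocorrelation. This is not AR-Net or NeuralProphet code; the function name <code>autocorrelation</code> and the toy series are assumptions made purely for illustration.</p>

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    # Numerator: products of deviations that are `lag` steps apart.
    num = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return num / denom


# Hypothetical series that repeats every 4 steps.
series = [1, 2, 3, 2] * 6
r4 = autocorrelation(series, 4)  # lag equal to the period
r2 = autocorrelation(series, 2)  # lag equal to half the period

print("lag 4:", round(r4, 3))
print("lag 2:", round(r2, 3))
```

<p>For a series with a period of 4, the value at lag 4 is strongly positive and the value at lag 2 is strongly negative. An autoregressive model exploits exactly this kind of dependence on past values.</p>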
</section>
<section id="project-maintainers" class="level2">
<h2 class="anchored" data-anchor-id="project-maintainers">Project Maintainers</h2>
<p>Based on the project’s GitHub page, the main maintainer of this project is&nbsp;<a href="https://github.com/ourownstory">Oskar Triebe</a>&nbsp;from Stanford University, in collaboration with Facebook and Monash University.</p>
</section>
<section id="installation" class="level2">
<h2 class="anchored" data-anchor-id="installation">Installation</h2>
<p>The project is in the beta phase, so I would advise you to be cautious if you want to use this library in a production environment.</p>
<p>You can install the package using&nbsp;<code>pip install neuralprophet</code>. However, if you are going to use the package in a Jupyter Notebook environment, you should install the live version with&nbsp;<code>pip install neuralprophet[live]</code>. This provides additional features such as a live plot of the training and validation loss using&nbsp;<code>plot_live_loss()</code>. Alternatively, you can install the package from source:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> clone https://github.com/ourownstory/neural_prophet</span>
<span id="cb1-2"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">cd</span> neural_prophet</span>
<span id="cb1-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">pip</span> install .<span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">[</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">live</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">]</span></span></code></pre></div></div>
<p>I recommend creating a fresh environment (conda or venv) and installing NeuralProphet there, letting the installer take care of all dependencies (these include Pandas, Jupyter Notebook, and PyTorch).</p>
<p>Now that we have the package installed, let’s play!</p>
</section>
<section id="implementation-with-a-case-study" class="level2">
<h2 class="anchored" data-anchor-id="implementation-with-a-case-study"><strong>Implementation with a Case Study</strong></h2>
<p>Here, I’m using the daily climate data in Delhi from 2013 to 2017 that I found on&nbsp;<a href="https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data">Kaggle</a>. First, let’s import the main packages.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> neuralprophet <span class="im" style="color: #00769E;
background-color: null;
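font-style: inherit;">import</span> NeuralProphet</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Needed for the plots later in this post</span></span></code></pre></div></div>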
<p>Then, we can read the data into a pandas DataFrame. The NeuralProphet object expects the time-series data to have a date column named <code>ds</code> and the time-series values we want to predict in a column named <code>y</code>.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Data is from https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data</span></span>
<span id="cb3-2">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"./DailyDelhiClimateTrain.csv"</span>, parse_dates<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"date"</span>])</span>
<span id="cb3-3">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"date"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"meantemp"</span>]]</span>
<span id="cb3-4">df.rename(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"date"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ds"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"meantemp"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"y"</span>}, inplace<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span></code></pre></div></div>
<p>Now let’s initialize the model. Below, I’ve listed all the default arguments defined for the NeuralProphet object, with additional information about some of them. These are the hyperparameters you can configure in the model. Of course, if you are planning to use the default values, you can just do <code>model = NeuralProphet()</code>.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># model = NeuralProphet() if you're using default variables below.</span></span>
<span id="cb4-2">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> NeuralProphet(</span>
<span id="cb4-3">    growth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"linear"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Determine trend types: 'linear', 'discontinuous', 'off'</span></span>
<span id="cb4-4">    changepoints<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># list of dates that may include change points (None -&gt; automatic )</span></span>
<span id="cb4-5">    n_changepoints<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,</span>
<span id="cb4-6">    changepoints_range<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>,</span>
<span id="cb4-7">    trend_reg<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,</span>
<span id="cb4-8">    trend_reg_threshold<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb4-9">    yearly_seasonality<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"auto"</span>,</span>
<span id="cb4-10">    weekly_seasonality<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"auto"</span>,</span>
<span id="cb4-11">    daily_seasonality<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"auto"</span>,</span>
<span id="cb4-12">    seasonality_mode<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"additive"</span>,</span>
<span id="cb4-13">    seasonality_reg<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,</span>
<span id="cb4-14">    n_forecasts<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb4-15">    n_lags<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,</span>
<span id="cb4-16">    num_hidden_layers<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,</span>
<span id="cb4-17">    d_hidden<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>,     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Dimension of hidden layers of AR-Net</span></span>
<span id="cb4-18">    ar_sparsity<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Sparsity in the AR coefficients</span></span>
<span id="cb4-19">    learning_rate<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>,</span>
<span id="cb4-20">    epochs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>,</span>
<span id="cb4-21">    loss_func<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Huber"</span>,</span>
<span id="cb4-22">    normalize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"auto"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Type of normalization ('minmax', 'standardize', 'soft', 'off')</span></span>
<span id="cb4-23">    impute_missing<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb4-24">    log_level<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Determines the logging level of the logger object</span></span>
<span id="cb4-25">)</span></code></pre></div></div>
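<p>The comment on <code>normalize</code> above names <code>'minmax'</code> and <code>'standardize'</code>. As a rough sketch of what these two transformations usually mean, here is a plain-Python version. It is not NeuralProphet’s internal code: the helper names and sample values are hypothetical, and using the population standard deviation is an assumption.</p>

```python
import statistics


def minmax(values):
    """Rescale values linearly so they span the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical temperature readings.
temps = [10.0, 20.0, 30.0]
scaled = minmax(temps)
z_scores = standardize(temps)

print(scaled)
print([round(z, 3) for z in z_scores])
```

<p>Min-max scaling keeps the shape of the data but is sensitive to extreme values, while standardization expresses each value as a distance from the mean in units of standard deviation.</p>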
<p>After configuring the model and its hyperparameters, we need to train the model and make predictions. Let’s forecast the temperature up to one year ahead.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">metrics <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.fit(df, validate_each_epoch<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, freq<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D"</span>)</span>
<span id="cb5-2">future <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.make_future_dataframe(df, periods<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">365</span>, n_historic_predictions<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(df))</span>
<span id="cb5-3">forecast <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.predict(future)</span></code></pre></div></div>
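<p>Conceptually, asking for <code>periods=365</code> with a daily frequency extends the time axis by 365 consecutive days beyond the last observation. The sketch below illustrates only that idea with the standard library. It does not reproduce what <code>make_future_dataframe</code> does internally, and the helper <code>future_dates</code> and the last observed date are assumptions.</p>

```python
from datetime import date, timedelta


def future_dates(last_date, periods):
    """Return `periods` consecutive daily dates following `last_date`."""
    return [last_date + timedelta(days=i) for i in range(1, periods + 1)]


# Hypothetical last observed day of the training data.
last_observed = date(2017, 1, 1)
horizon = future_dates(last_observed, 365)

print(horizon[0], "to", horizon[-1], "covering", len(horizon), "days")
```

<p>Starting from 2017-01-01, the horizon runs from 2017-01-02 to 2018-01-01.</p>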
<p>You can simply plot the forecast by calling <code>model.plot(forecast)</code> as follows:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>))</span>
<span id="cb6-2">model.plot(forecast, xlabel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Date"</span>, ylabel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Temp"</span>, ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax)</span>
<span id="cb6-3">ax.set_title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean Temperature in Delhi"</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>, fontweight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>)</span></code></pre></div></div>
<p>The one-year forecast plot is shown below, where the time period between 2017-01-01 and 2018-01-01 is the prediction. As can be seen, the forecast plot resembles the historical time-series. It captured both the seasonality and the slow-growing linear trend.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/neural-prophet-library/img/mean_temp_1yr_prediction.png" class="img-fluid figure-img" alt="The mean temperature in Delhi and the one-year prediction."></p>
<figcaption>The mean temperature in Delhi and the one-year prediction</figcaption>
</figure>
</div>
<p>You can plot the parameters by calling <code>model.plot_parameters()</code>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/neural-prophet-library/img/model_parameters.png" class="img-fluid figure-img" alt="Neural-Prophet model parameters"></p>
<figcaption>Model parameters</figcaption>
</figure>
</div>
<p>The model loss using Mean Absolute Error (MAE) is plotted below. You can also use the Smoothed L1-Loss function.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>))</span>
<span id="cb7-2">ax.plot(metrics[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"MAE"</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'ob'</span>, linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Training Loss"</span>)  </span>
<span id="cb7-3">ax.plot(metrics[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"MAE_val"</span>], <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'-r'</span>, linewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Validation Loss"</span>)</span>
<span id="cb7-4"></span>
<span id="cb7-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># You can use metrics["SmoothL1Loss"] and metrics["SmoothL1Loss_val"] too.</span></span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/neural-prophet-library/img/model_loss_mae.png" class="img-fluid figure-img" alt="Model Loss using MAE"></p>
<figcaption>Model Loss using MAE</figcaption>
</figure>
</div>
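<p>For reference, the two losses mentioned above can be written out in a few lines of plain Python. This is a minimal sketch of the standard formulas, not the code NeuralProphet or PyTorch runs. The function names, the threshold <code>beta=1.0</code>, and the sample values are assumptions.</p>

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def smooth_l1_loss(y_true, y_pred, beta=1.0):
    """Quadratic for errors smaller than `beta`, linear otherwise."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err >= beta:
            total += err - 0.5 * beta
        else:
            total += 0.5 * err ** 2 / beta
    return total / len(y_true)


# Hypothetical temperatures and predictions.
y_true = [20.0, 22.0, 25.0]
y_pred = [20.5, 22.0, 28.0]

mae = mean_absolute_error(y_true, y_pred)
sl1 = smooth_l1_loss(y_true, y_pred)

print("MAE:", round(mae, 3))
print("Smooth L1:", round(sl1, 3))
```

<p>The smooth L1 loss behaves like a squared error for small residuals and like an absolute error for large ones, so a single large miss (3 degrees in this example) influences it less than it would influence a squared-error loss.</p>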
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>In this post, we talked about NeuralProphet, a Python library that models time-series based on neural networks. The library uses PyTorch as a backend. As a case study, we created a prediction model for daily Delhi climate time-series data and made a one-year prediction. An advantage of using this library is that its syntax is similar to that of Facebook’s Prophet library.</p>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>📓 You can find the Jupyter notebook for this blog post on&nbsp;<a href="https://github.com/e-alizadeh/medium/blob/master/notebooks/NeuralProphet/neural_prophet.ipynb">GitHub</a>.</p>
</div>
</div>
</div>
</section>
<section id="useful-links" class="level1">
<h1>Useful Links</h1>
<p><a href="https://facebook.github.io/prophet/">Facebook’s Prophet Library</a></p>
<p><a href="https://github.com/ourownstory/AR-Net">AR-Net GitHub</a></p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p><a href="https://ourownstory.github.io/neural_prophet/">NeuralProphet Documentation</a>↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2020,
  author = {Alizadeh, Esmaeil},
  title = {NeuralProphet: {A} {Time-Series} {Modeling} {Python}
    {Library} Based on {Neural-Networks}},
  date = {2020-12-03},
  url = {https://ealizadeh.com/blog/neural-prophet-library/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2020" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“NeuralProphet: A Time-Series Modeling Python Library
based on Neural-Networks,”</span> Dec. 03, 2020. <a href="https://ealizadeh.com/blog/neural-prophet-library/">https://ealizadeh.com/blog/neural-prophet-library/</a></div>
</div></div></section></div> ]]></description>
  <category>Machine Learning</category>
  <category>Neural-Network</category>
  <category>Python Library</category>
  <category>Time Series Analysis</category>
  <category>Tutorial</category>
  <guid>https://ealizadeh.com/blog/neural-prophet-library/</guid>
  <pubDate>Thu, 03 Dec 2020 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/neural-prophet-library/img/_featured_image.png" medium="image" type="image/png" height="107" width="144"/>
</item>
<item>
  <title>15 Cognitive Errors Every Analyst Must Know (+ Network Graph View)</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/cognitive-errors-art-of-thinking-clearly/</link>
  <description><![CDATA[ 






<p><img src="https://ealizadeh.com/blog/cognitive-errors-art-of-thinking-clearly/img/_featured_image.png" class="img-fluid" alt="A connected graph of various cognitive biases."></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<strong><a href="https://towardsdatascience.com/15-cognitive-errors-every-analyst-must-know-268540e34ade">Towards Data Science blog</a></strong>.</p>
</div>
</div>
<p>In this article, I will go over 15 cognitive errors from the book “The Art of Thinking Clearly” by Rolf Dobelli. Any scientist, analyst, or basically anyone who works with data needs to be familiar with these fallacies. My goal is to present these cognitive errors so that you can think about them in your daily life and avoid falling into them!</p>
<p>I have created a network graph showing the connections among these cognitive errors. An interactive network graph of all 98 cognitive errors mentioned in the book can be found <a href="https://ealizadeh.com/assets/98-cognitive-errors/">here</a>. However, since that’s overwhelming and many of them are not directly related to this article, I have also created a network graph of the biases explained in this article, plus a few more, below.</p>
<!-- [https://codepen.io/e-alizadeh/pen/dyOjJYG](https://codepen.io/e-alizadeh/pen/dyOjJYG) -->
<iframe height="910" style="width: 100%;" scrolling="no" title="34 Cognitive Errors" src="https://codepen.io/e-alizadeh/embed/dyOjJYG?default-tab=result" frameborder="no" loading="lazy" allowtransparency="true" allowfullscreen="true">
See the Pen <a href="https://codepen.io/e-alizadeh/pen/dyOjJYG"> 34 Cognitive Errors</a> by Essi Alizadeh (<a href="https://codepen.io/e-alizadeh">@e-alizadeh</a>) on <a href="https://codepen.io">CodePen</a>.
</iframe>
<section id="cognitive-biases" class="level1">
<h1>Cognitive Biases</h1>
<p>A cognitive error is any systematic (not occasional) deviation from logic <span class="citation" data-cites="dobelli2013art">see [1]</span>. Let’s go over the interesting ones (in no particular order) below.</p>
<section id="sec-base-rate-neglect" class="level2">
<h2 class="anchored" data-anchor-id="sec-base-rate-neglect">1. Base-Rate Neglect</h2>
<p>This fallacy is a common reasoning error where people neglect the distribution of the data in favor of specific individual information. Here is an example of this bias from the book.</p>
<p>Mark is a man from Germany who wears glasses and listens to Mozart. Which one is more likely?</p>
<p>He is 1) a truck driver or 2) a literature professor in Frankfurt?</p>
<p>Most people will bet on Option 2 (the wrong option). Germany has 10,000 times more truck drivers than Frankfurt has literature professors. Hence, it’s more likely that Mark is a truck driver <span class="citation" data-cites="dobelli2013art">see [1]</span>.</p>
<section id="sec-false-positive-paradox" class="level3">
<h3 class="anchored" data-anchor-id="sec-false-positive-paradox">1.1 False Positive Paradox</h3>
<p>The <strong>false positive paradox</strong> is an example of base-rate bias in which the number of false positives exceeds the number of true positives<sup>1</sup>.</p>
<p><strong>Example:</strong> Imagine that 1% of a population is actually infected with a disease, and there is a test with a 5% <strong><a href="https://en.wikipedia.org/wiki/False_positive_rate">false-positive rate</a></strong> and no false negatives, <em>i.e.</em>, <em>FN</em> = 0. The expected outcome of 10,000 tests would be:</p>
<ul>
<li>Infected and the test correctly indicates the disease (True Positive): <img src="https://latex.codecogs.com/png.latex?10000%20%5Ctimes%20%5Cfrac%7B1%7D%7B100%7D%20=%20100%20%5C;%20(TP%20=%20100)"></li>
<li>Uninfected and the test incorrectly indicates the person has the disease (False Positive): <img src="https://latex.codecogs.com/png.latex?10000%20%5Ctimes%20%5Cfrac%7B100%20-%201%7D%7B100%7D%20%5Ctimes%200.05%20=%20495%20%5C;%20(FP=495)"></li>
</ul>
<p>So, a total of <img src="https://latex.codecogs.com/png.latex?100+495=595"> people tested positive. And the remaining <img src="https://latex.codecogs.com/png.latex?10000%E2%88%92595=9405%20%5C;%20(TN=9405)"> tests have correct negative results (True Negative).</p>
<p>Overall, only 100 of the 595 positive results are actually correct. The probability of actually being infected when the test results are positive is <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B100%7D%7B100%20+%20495%7D%20=%200.168"> or <img src="https://latex.codecogs.com/png.latex?16.8%5C%25">, for a test with an accuracy of <img src="https://latex.codecogs.com/png.latex?95%5C%25"> <img src="https://latex.codecogs.com/png.latex?(%5Cfrac%7BTP%20+%20TN%7D%7BTP%20+%20FP%20+%20FN%20+%20TN%7D%20=%20%5Cfrac%7B100%20+%209405%7D%7B10,000%7D=0.9505)"></p>
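<p>The arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the worked example: the variable names are my own, and the inputs (1% prevalence, 5% false-positive rate, no false negatives, 10,000 tests) are the ones assumed in the example.</p>

```python
# Inputs taken from the example above (variable names are illustrative).
n_tests = 10_000
prevalence = 0.01           # 1% of the population is infected
false_positive_rate = 0.05  # 5% of uninfected people test positive
false_negative_rate = 0.0   # the test never misses an infected person

infected = n_tests * prevalence
uninfected = n_tests - infected

tp = infected * (1 - false_negative_rate)  # true positives
fn = infected * false_negative_rate        # false negatives
fp = uninfected * false_positive_rate      # false positives
tn = uninfected - fp                       # true negatives

# Probability of actually being infected given a positive result.
precision = tp / (tp + fp)
# Share of all tests that give the correct answer.
accuracy = (tp + tn) / n_tests

print(f"TP={tp:.0f}, FP={fp:.0f}, TN={tn:.0f}, FN={fn:.0f}")
print(f"P(infected | positive) = {precision:.3f}")
print(f"Accuracy = {accuracy:.4f}")
```

<p>Changing <code>prevalence</code> shows how strongly the base rate drives the result: the rarer the disease, the smaller the share of positive results that are true positives.</p>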
</section>
</section>
<section id="clustering-illusion" class="level2">
<h2 class="anchored" data-anchor-id="clustering-illusion">2. Clustering Illusion</h2>
<blockquote class="blockquote">
<p>The brain seeks patterns.</p>
</blockquote>
<p>The brain seeks patterns, coherence, and order where none really exists. As per this cognitive bias, we are oversensitive to finding a structure or rule. This false pattern recognition is also known as <a href="https://en.wikipedia.org/wiki/Apophenia">apophenia</a>: the tendency to make meaningful connections between unrelated things. Examples of such phenomena are the gambler’s fallacy, figures in clouds, and patterns with no deliberate design. A popular example of this is the “Face on Mars” as shown below:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/cognitive-errors-art-of-thinking-clearly/img/wikipedia_Martian_face_viking_cropped.png" class="img-fluid figure-img" alt="Face on Mars (clustering illusion)" width="400"></p>
<figcaption>A cropped version of the small part of the Cydonia region, taken by the Viking 1 orbiter and released by NASA/JPL on July 25, 1976. (Source: <a href="https://en.wikipedia.org/wiki/Cydonia_(Mars)#%22Face_on_Mars%22">Wikipedia</a>)</figcaption>
</figure>
</div>
<p>This is prominent in gambling, the misinterpretation of statistics, conspiracy theories, <em>etc</em>. This cognitive error is also linked to two other errors: coincidence and false causality. The way to overcome the clustering illusion, particularly for anyone working with data, is to assume that any pattern found in the data is random until a statistical test shows otherwise <span class="citation" data-cites="beitman2009brains">see [2]</span>.</p>
</section>
<section id="confirmation-bias" class="level2">
<h2 class="anchored" data-anchor-id="confirmation-bias">3. Confirmation Bias</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/cognitive-errors-art-of-thinking-clearly/img/ConfirmationBias.png" class="img-fluid figure-img" alt="A discussion between two people illustrating the confirmation bias."></p>
<figcaption>Confirmation bias</figcaption>
</figure>
</div>
<p>Confirmation bias is the only cognitive error that the book’s author wrote two parts about. The following quote from the book says it all:</p>
<blockquote class="blockquote">
<p>The confirmation bias is the mother of all misconceptions. —Rolf Dobelli</p>
</blockquote>
<p>Confirmation bias is the tendency to interpret new information based on our existing beliefs and discard any opposing evidence (disconfirming evidence). The internet is the main breeding ground for this bias. Think about the YouTube video suggestions made to us based on our watch history. It recommends videos that align with our current views, beliefs, and theories. We are actually prone to confirmation bias on many platforms that use <a href="https://en.wikipedia.org/wiki/Recommender_system">recommender systems</a>, like Google, Facebook, Twitter, <em>etc</em>.</p>
</section>
<section id="overconfidence-effect" class="level2">
<h2 class="anchored" data-anchor-id="overconfidence-effect">4. Overconfidence Effect</h2>
<p>This cognitive bias tells us that we systematically overestimate our ability to predict. In other words, we subjectively believe that our judgment is better than it objectively is <span class="citation" data-cites="dobelli2013art">see [1]</span>. No wonder most major projects take longer and cost more than initially predicted. The overconfidence bias is more common among experts. The reason is that experts obviously know more about their own field, but they overestimate exactly how much more.</p>
</section>
<section id="regression-to-mean" class="level2">
<h2 class="anchored" data-anchor-id="regression-to-mean">5. Regression to Mean</h2>
<p>Imagine your city hits a record high temperature. The temperature will most likely drop in the next few days, back towards the monthly average. Regression to the mean arises from the random variance that affects any measurement and makes some measurements extreme. Ignoring it leads to overestimating the correlation between two measurements. For instance, if an athlete performs extremely well one year, we expect an even better performance the year after. If that doesn’t happen, we may invent causal explanations instead of considering that we probably overestimated the next year’s performance!</p>
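<p>The athlete example can be sketched with a small simulation. The model used here (observed performance = stable skill + independent luck, both drawn from a standard normal distribution) is my own assumption for illustration and does not come from the book.</p>

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical model: observed performance = stable skill + random luck.
n_athletes = 10_000
skill = [random.gauss(0, 1) for _ in range(n_athletes)]
year1 = [s + random.gauss(0, 1) for s in skill]
year2 = [s + random.gauss(0, 1) for s in skill]

# Pick the top 10% of performers in year 1.
ranked = sorted(range(n_athletes), key=lambda i: year1[i], reverse=True)
top = ranked[: n_athletes // 10]

top_year1_mean = statistics.mean(year1[i] for i in top)
top_year2_mean = statistics.mean(year2[i] for i in top)
overall_mean = statistics.mean(year2)

print(f"Top group, year 1: {top_year1_mean:.2f}")
print(f"Same group, year 2: {top_year2_mean:.2f}")
print(f"Everyone, year 2:   {overall_mean:.2f}")
```

<p>Under these assumptions, the year-1 stars are still above average in year 2, but noticeably less so, even though nothing about their skill has changed.</p>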
</section>
<section id="induction" class="level2">
<h2 class="anchored" data-anchor-id="induction">6. Induction</h2>
<p>The inductive bias occurs when we draw universal conclusions from individual observations. For example, everyone we have observed in a city wears glasses, so we conclude that everyone in that city wears glasses. This can lead to false causality (another bias I will talk about later). For example, every time Bob drinks milk, he gets cramps, and therefore he concludes that he gets cramps because he drinks milk!</p>
</section>
<section id="intention-to-treat-error" class="level2">
<h2 class="anchored" data-anchor-id="intention-to-treat-error">7. Intention-To-Treat Error</h2>
<p>Mainly used in medical research studies, the <a href="https://www.verywellhealth.com/understanding-intent-to-treat-3132805"><strong>intention-to-treat (ITT) principle</strong></a> is crucial in interpreting the results of randomized clinical trials. It helps researchers assess the true effect of choosing a medical treatment. Let’s look at an example to understand this fallacy better.</p>
<p>Suppose a pharmaceutical company developed a new drug for heart disease. A study “proves” that the drug improves patients’ health and reduces the chance of dying from heart disease. The five-year mortality rate of patients who take the drug <em>regularly</em> is 15%. The company may not tell you that the mortality rate of patients who took the drug irregularly was 30%. So, is the drug a complete success?</p>
<p>The point is that the decisive factor may be the patient’s behavior rather than the drug. Patients who didn’t take the drug on schedule may have had side effects that caused them to stop taking it. Hence, these patients will not be in the “regular” category of the study, for which the 15% rate was reported. So, because of the ITT error, the drug looks much more effective than it actually is <span class="citation" data-cites="dobelli2013art">see [1]</span>.</p>
</section>
<section id="false-causality" class="level2">
<h2 class="anchored" data-anchor-id="false-causality">8. False Causality</h2>
<blockquote class="blockquote">
<p>Correlation does not imply causation.</p>
</blockquote>
<p>This fallacy occurs when we wrongly infer a cause-and-effect relationship between two measurements solely based on their correlation. For instance, the per capita consumption of mozzarella cheese is highly correlated with civil engineering PhDs awarded (r=0.9586). Does this mean that the consumption of mozzarella cheese leads to more civil engineering doctorates awarded?</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/cognitive-errors-art-of-thinking-clearly/img/FalseCausality_Correlation.png" class="img-fluid figure-img" alt="A time-series plot showing the correlation between mozzarella cheese consumption and awarded civil engineering doctorates (false causality)."></p>
<figcaption>Does the consumption of mozzarella cheese actually lead to more civil engineering PhDs awarded? (Graph: <a href="http://tylervigen.com/">TylerVigen.com</a>, Data sources: <a href="https://www.census.gov/compendia/statab/2012/tables/12s0217.xls">U.S. Department of Agriculture</a> and <a href="https://www.nsf.gov/statistics/infbrief/nsf11305/">National Science Foundation</a>)</figcaption>
</figure>
</div>
</section>
<section id="the-problem-with-averages" class="level2">
<h2 class="anchored" data-anchor-id="the-problem-with-averages">9. The Problem with Averages</h2>
<blockquote class="blockquote">
<p>Don’t cross a river if it is (on average) four feet deep. —Nassim Taleb</p>
</blockquote>
<p>Working with averages may mask the underlying distribution. This is important for anyone who works with data: a few extreme outliers may dominate the average and significantly change the picture.</p>
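<p>A tiny example makes the point concrete. The numbers below are made up purely for illustration; they compare the mean with the median, which is far less sensitive to a single extreme value.</p>

```python
import statistics

# Made-up annual incomes (in thousands); the last value is an extreme outlier.
incomes = [30, 32, 35, 38, 40, 1000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"Mean:   {mean_income:.1f}")
print(f"Median: {median_income:.1f}")
```

<p>Here, the mean is several times larger than what any typical person in the list earns, while the median stays close to the bulk of the data.</p>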
</section>
<section id="information-bias" class="level2">
<h2 class="anchored" data-anchor-id="information-bias">10. Information Bias</h2>
<p>This is when additional information does not add any value to the action or decision you are taking. Extra information does not guarantee better decisions, and at times it may actually put you at a disadvantage in addition to wasting time and money. Observational studies may be more prone to this type of bias, mainly because they rely on self-reporting data collection, although introducing randomness in interventional studies may reduce that. Missing data can be another reason for introducing information bias<sup>2</sup>.</p>
</section>
<section id="the-law-of-small-numbers" class="level2">
<h2 class="anchored" data-anchor-id="the-law-of-small-numbers">11. The Law of Small Numbers</h2>
<p>This law refers to the incorrect reasoning that a small sample drawn from a population resembles the overall population. This is particularly important in statistical analysis. In his book “Thinking, Fast and Slow” <span class="citation" data-cites="kahneman2011thinking">see [3]</span>, the Nobel prize winner Daniel Kahneman points out that even professionals and experts sometimes fall into this fallacy.</p>
<p>Suppose you have a bag of 1000 marbles, of which 500 are red, and the other 500 marbles are black. Without looking, you draw three marbles, and all turn out to be black. You may infer that all marbles in the bag are black. In this scenario, you’ve fallen into the fallacy of the law of small numbers.</p>
<p>The best way to counter this fallacy, as you may have guessed by now, is to rely on the <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers"><strong>Law of Large Numbers</strong></a>: draw a sample large enough to resemble the overall population.</p>
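<p>To see how unsurprising the three-black-marbles outcome really is, we can compute its probability for the bag described above (500 red and 500 black marbles, drawn without replacement). This is a minimal sketch with variable names of my own choosing.</p>

```python
from math import comb

red, black = 500, 500
draws = 3

# Chance that all three marbles drawn without replacement are black.
p_all_black = comb(black, draws) / comb(red + black, draws)

print(f"P(3 black out of 3 draws) = {p_all_black:.4f}")
```

<p>The result is roughly one in eight, so an all-black sample of three says very little about the contents of the bag.</p>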
</section>
<section id="ambiguity-aversion-risk-vs-uncertainty" class="level2">
<h2 class="anchored" data-anchor-id="ambiguity-aversion-risk-vs-uncertainty">12. Ambiguity Aversion (Risk vs Uncertainty)</h2>
<p>Suppose we have two bags. Bag A has 50 black and 50 red marbles. Bag B also has 100 marbles (red and black), but we don’t know how many there are of each color. Now, if you pull a red marble from a bag (without looking, of course!), you will win $100. So, do you choose bag A or bag B?</p>
<p>The majority will go with bag A. Now, let’s repeat the experiment, and this time if you pull a black marble from a bag, you will win $100. In this case, the majority will pick bag A too. This is illogical: choosing bag A for red implies a belief that bag B holds fewer red marbles than black ones, in which case bag B should be the better bet for black. This result is known as the <a href="https://en.wikipedia.org/wiki/Ellsberg_paradox"><strong>Ellsberg Paradox</strong></a>, which states that people prefer known probabilities over unknown (ambiguous) ones <span class="citation" data-cites="dobelli2013art">see [1]</span>. We should be aware of the difference between risk and uncertainty. The two terms are often used interchangeably, but there is a significant difference between them.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 Risks are known probabilities, whereas uncertainties are unknown probabilities (ambiguous).</p>
</div>
</div>
<p>Although difficult, tolerating ambiguity/uncertainty may help us to handle this fallacy better.</p>
</section>
<section id="planning-fallacy" class="level2">
<h2 class="anchored" data-anchor-id="planning-fallacy">13. Planning Fallacy</h2>
<p>Have you wondered why you always underestimate the amount of time it takes you to finish a task or a project? You may have fallen into the planning fallacy. This cognitive error says that our planning is usually too ambitious. The two main reasons behind this fallacy are 1) wishful thinking and 2) neglecting external influences. One way to plan better is to hold a premortem session in which you go over similar projects and consider outside influences beforehand.</p>
</section>
<section id="déformation-professionnelle" class="level2">
<h2 class="anchored" data-anchor-id="déformation-professionnelle">14. Déformation Professionnelle</h2>
<blockquote class="blockquote">
<p>If your only tool is a hammer, all your problems will be nails. —Mark Twain</p>
</blockquote>
<p>This fallacy occurs when you view things only through the lens of your own profession instead of from a broader perspective. It may affect anyone who learns something new and then uses it in every situation. Let’s say you just learned about a new machine learning model. You may use it in cases where better models are available. I actually did the same. During my studies, I learned about the <a href="https://en.wikipedia.org/wiki/Support_vector_machine">support vector machine (SVM)</a> (a binary classifier). Whenever I had a classification problem, I would initially think of using SVM, even in multiclass classification, for which SVM may not be the best option to start with.</p>
</section>
<section id="exponential-growth-bias" class="level2">
<h2 class="anchored" data-anchor-id="exponential-growth-bias">15. Exponential Growth Bias</h2>
<p>Unlike linear growth, exponential growth is not intuitive to understand. Consider the following two options:</p>
<ol type="1">
<li>You will get $1,000 every day for the next 30 days.</li>
<li>You will get a cent on the first day, two cents on the second day, and the amount gets doubled every day for the next 30 days.</li>
</ol>
<p>Which one do you choose? Option 1 will earn you $30,000 at the end, whereas Option 2 will get you over $10.7 million (you can calculate it yourself using <img src="https://latex.codecogs.com/png.latex?x(t)%20=%20x_%7B0%7D(1%20+%20%5Cfrac%7Br%7D%7B100%7D)%5E%7Bt%7D"> where <img src="https://latex.codecogs.com/png.latex?x_0=0.01"> (1 cent), <img src="https://latex.codecogs.com/png.latex?r=100%5C%25"> (doubling), <img src="https://latex.codecogs.com/png.latex?t=30">).</p>
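<p>If you would rather let the computer do the calculation, here is a minimal sketch that adds up the daily payments of both options. It works in whole cents to avoid rounding issues, and the variable names are my own.</p>

```python
days = 30

# Option 1: a flat $1,000 per day.
option1_total = 1_000 * days

# Option 2: 1 cent on day 1, doubling every day (amounts kept in cents).
daily_cents = [2**day for day in range(days)]
option2_total = sum(daily_cents) / 100

print(f"Option 1 total: ${option1_total:,.2f}")
print(f"Option 2 total: ${option2_total:,.2f}")
print(f"Option 2, last day alone: ${daily_cents[-1] / 100:,.2f}")
```

<p>The running total of Option 2 is one cent short of the value the growth formula gives for <em>t</em> = 30, and about half of it arrives on the final day alone.</p>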
<p>We can use a logarithmic scale to transform the original data for a better understanding.</p>
<hr>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>In this article, I went over 15 cognitive biases that everyone who works with data should be aware of. For some, I’ve also provided quantitative examples to illustrate how we may fall into these fallacies. These cognitive errors are from the bestselling book “The Art of Thinking Clearly” by Rolf Dobelli. I recommend this book to everyone, as the author covers 98 cognitive errors that touch on different aspects of life.</p>
</section>
<section id="useful-links" class="level1">
<h1>Useful Links</h1>
<p><a href="https://towardsdatascience.com/why-most-of-you-made-an-irrational-decision-aeeac532ef92">Why most of you made an irrational decision</a></p>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-dobelli2013art" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">R. Dobelli, <em>The art of thinking clearly: Better thinking, better decisions</em>. Hachette UK, 2013.</div>
</div>
<div id="ref-beitman2009brains" class="csl-entry">
<div class="csl-left-margin">[2] </div><div class="csl-right-inline">B. D. Beitman, <span>“Brains seek patterns in coincidences,”</span> <em>Psychiatric Annals</em>, vol. 39, no. 5, pp. 255–264, 2009.</div>
</div>
<div id="ref-kahneman2011thinking" class="csl-entry">
<div class="csl-left-margin">[3] </div><div class="csl-right-inline">D. Kahneman, <em>Thinking, fast and slow</em>. Macmillan, 2011.</div>
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Wikipedia. <a href="https://en.wikipedia.org/wiki/Base_rate_fallacy#False_positive_paradox">Base rate fallacy</a>↩︎</p></li>
<li id="fn2"><p>Catalogue of bias collaboration, <a href="https://catalogofbias.org/biases/information-bias/">Information Bias</a> (2019), Sackett Catalogue Of Biases↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2020,
  author = {Alizadeh, Esmaeil},
  title = {15 {Cognitive} {Errors} {Every} {Analyst} {Must} {Know} (+
    {Network} {Graph} {View)}},
  date = {2020-12-01},
  url = {https://ealizadeh.com/blog/cognitive-errors-art-of-thinking-clearly/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2020" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“15 Cognitive Errors Every Analyst Must Know (+ Network
Graph View),”</span> Dec. 01, 2020. <a href="https://ealizadeh.com/blog/cognitive-errors-art-of-thinking-clearly/">https://ealizadeh.com/blog/cognitive-errors-art-of-thinking-clearly/</a></div>
</div></div></section></div> ]]></description>
  <category>Psychology</category>
  <category>Decision-Making</category>
  <category>Cognitive Bias</category>
  <category>Book</category>
  <category>Self-Improvement</category>
  <guid>https://ealizadeh.com/blog/cognitive-errors-art-of-thinking-clearly/</guid>
  <pubDate>Tue, 01 Dec 2020 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/cognitive-errors-art-of-thinking-clearly/img/_featured_image.png" medium="image" type="image/png" height="91" width="144"/>
</item>
<item>
  <title>Synthetic Data Vault (SDV): A Python Library for Dataset Modeling</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/sdv-library-for-modeling-datasets/</link>
  <description><![CDATA[ 






<p><img src="https://ealizadeh.com/blog/sdv-library-for-modeling-datasets/img/_featured_image.png" class="img-fluid" alt="Featured image of the post"></p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<a href="https://towardsdatascience.com/synthetic-data-vault-sdv-a-python-library-for-dataset-modeling-b48c406e7398"><strong>Towards Data Science blog</strong></a>.</p>
</div>
</div>
<p>In data science, you usually need a realistic dataset to test your proof of concept. Creating fake data that captures the behavior of the actual data can be a rather tricky task. Several Python packages try to achieve this. A few popular ones are&nbsp;<a href="https://github.com/joke2k/faker/">Faker</a> and&nbsp;<a href="https://github.com/lk-geimfari/mimesis">Mimesis</a>. However, they mostly generate simple data such as names, addresses, emails,&nbsp;<em>etc</em>.</p>
<p>To create data that captures the attributes of a complex dataset, like time-series that somehow capture the actual data’s statistical properties, we need a tool that generates data using different approaches. The <a href="https://sdv.dev/">Synthetic Data Vault (SDV)</a> Python library is a tool that models complex datasets using statistical and machine learning models. It can be a great new tool in the toolbox of anyone who works with data and modeling.</p>
<section id="why-this-library" class="level2">
<h2 class="anchored" data-anchor-id="why-this-library">Why this library?</h2>
<p>The main reason I’m interested in this tool is for&nbsp;<em>system testing</em>: It’s much better to have datasets that are generated from the same actual underlying process. This way we can test our work/model in a realistic scenario rather than having unrealistic cases. There are other reasons why we need synthetic data such as <em>data understanding, data compression, data augmentation</em>, and <em>data privacy</em> <span class="citation" data-cites="thesis:xu2020synthesizing">see [1]</span>.</p>
<p>The&nbsp;<a href="https://sdv.dev/">Synthetic Data Vault (SDV)</a>&nbsp;was first introduced in the paper&nbsp;<a href="https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf">“The Synthetic data vault”</a>, then used in the context of generative modeling in the master’s thesis <a href="https://dspace.mit.edu/handle/1721.1/109616">“The Synthetic Data Vault: Generative Modeling for Relational Databases”</a>&nbsp;by Neha Patki. Finally, the SDV library was developed as a part of Andrew Montanez’s master’s thesis <a href="https://dai.lids.mit.edu/wp-content/uploads/2018/12/Andrew_MEng.pdf">“SDV: An Open Source Library for Synthetic Data Generation”</a>. Another master’s thesis, by Lei Xu, added new features to SDV:&nbsp;<a href="https://dai.lids.mit.edu/wp-content/uploads/2020/02/Lei_SMThesis_neo.pdf">“Synthesizing Tabular Data using conditional GAN”</a>.</p>
<p>All of this work and research was done in the MIT Data-to-AI laboratory under the supervision of Kalyan Veeramachaneni, a principal research scientist at the MIT Laboratory for Information and Decision Systems (LIDS).</p>
<p>The reason I’m bringing up the history of SDV is to appreciate the amount of work and research that has gone into this library. An interesting article about the potential of this tool, particularly in data privacy, is available <a href="https://news.mit.edu/2020/real-promise-synthetic-data-1016">here</a>.</p>
</section>
<section id="sdv-library" class="level1">
<h1>SDV Library</h1>
<p>The workflow of this library is shown below. A user provides the data and the schema and then fits a model to the data. At last, new synthetic data is obtained from the fitted model <span class="citation" data-cites="patki2016sdv">see [2]</span>. Moreover, the SDV library allows the user to save a fitted model (<code>model.save("model.pkl")</code>) for any future use.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/sdv-library-for-modeling-datasets/img/sdv_workflow_reprinted.png" class="img-fluid figure-img" alt="The SDV workflow"></p>
<figcaption>The SDV workflow (reprinted from <span class="citation" data-cites="patki2016sdv">see [2]</span>)</figcaption>
</figure>
</div>
<section id="time-series-data-modeling-using-par" class="level2">
<h2 class="anchored" data-anchor-id="time-series-data-modeling-using-par">Time-Series Data Modeling using PAR</h2>
<p>A probabilistic autoregressive (PAR) model is used to model multi-type multivariate time-series data. The SDV library has this model implemented in&nbsp;the <code>PAR</code>&nbsp;class&nbsp;(from time-series module).</p>
<p>Let’s work through an example to explain the different arguments of the&nbsp;<code>PAR</code>&nbsp;class. We are going to work with a time-series of temperatures in multiple cities. The dataset will have the following columns:&nbsp;<em>Date</em>,&nbsp;<em>City</em>,&nbsp;<em>Measuring Device</em>,&nbsp;<em>Where</em>, <em>Noise</em>.</p>
<p>In&nbsp;<code>PAR</code>, there are four types of columns considered in a dataset.</p>
<ol type="1">
<li><p>Sequence Index: This is the data column with the row dependencies (should be sorted like datetime or numeric values). In time-series, this is usually the time axis. In our example, the sequence index will be the&nbsp;Date&nbsp;column.</p></li>
<li><p>Entity Columns: These columns are the abstract entities that form the group of measurements, where each group is a time-series (hence the rows within each group should be sorted). However, rows of different entities are independent of each other. In our example, the entity column(s) will be only the&nbsp;City&nbsp;column. By the way, we can have more columns as the argument type should be a list.</p></li>
<li><p>Context Columns: These columns provide information about the time-series’ entities and will not change over time. In other words, the&nbsp;context columns should be constant within groups. In our example,&nbsp;Measuring Device&nbsp;and&nbsp;Where&nbsp;are the context columns.</p></li>
<li><p>Data Columns: Any other columns that do not belong to the above categories will be considered data columns. The&nbsp;PAR&nbsp;class does not have an argument for assigning data columns. So, the remaining columns that are not listed in any of the previous three categories will automatically be considered data columns. In our example, the&nbsp;<em>Noise</em>&nbsp;column is the data column.</p></li>
</ol>
<section id="sample-code" class="level3">
<h3 class="anchored" data-anchor-id="sample-code">Sample code</h3>
</section>
<section id="example-1-single-time-series-one-entity" class="level3">
<h3 class="anchored" data-anchor-id="example-1-single-time-series-one-entity">Example 1: Single Time-Series (one entity)</h3>
<p>The PAR model for time series is implemented in the&nbsp;<code>PAR()</code>&nbsp;class of the&nbsp;<code>sdv.timeseries</code>&nbsp;module. To model a single time series, we only need to set the&nbsp;<code>sequence_index</code>&nbsp;argument of the&nbsp;<code>PAR()</code>&nbsp;class to the datetime column (the column that defines the order of the time-series sequence).</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sdv.timeseries <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> PAR</span>
<span id="cb1-3"></span>
<span id="cb1-4">actual_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"./daily_min_temperature_data_melbourne.csv"</span>)</span>
<span id="cb1-5">actual_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Date"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.to_datetime(actual_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Date"</span>])</span>
<span id="cb1-6"></span>
<span id="cb1-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define -&gt; Fit -&gt; Save a PAR model</span></span>
<span id="cb1-8">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PAR(sequence_index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Date"</span>)</span>
<span id="cb1-9">model.fit(actual_data)</span>
<span id="cb1-10">model.save(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data_generation_model_single_city.pkl"</span>)</span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Once the model is trained and pickled, we can comment out</span></span>
<span id="cb1-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the code above and just load the model next time.</span></span>
<span id="cb1-14"></span>
<span id="cb1-15">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PAR.load(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data_generation_model_single_city.pkl"</span>)</span>
<span id="cb1-16"></span>
<span id="cb1-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate new data</span></span>
<span id="cb1-18">new_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.sample(num_sequences<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># may take a few seconds to generate</span></span>
<span id="cb1-19">new_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Date"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.to_datetime(new_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Date"</span>])</span>
<span id="cb1-20"></span>
<span id="cb1-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compare descriptive statistics</span></span>
<span id="cb1-22">stat_info1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> actual_data.describe().rename(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Temp"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Real Data"</span>})</span>
<span id="cb1-23">stat_info2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> new_data.describe().rename(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Temp"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"PAR Generated Data"</span>})</span>
<span id="cb1-24"></span>
<span id="cb1-25"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(stat_info1.join(stat_info2))</span>
<span id="cb1-26"></span>
<span id="cb1-27">stat_info1.join(stat_info2)</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/sdv-library-for-modeling-datasets/img/ex1_summary_stats_real_vs_generated_data.png" class="img-fluid figure-img" alt="Comparison of summary statistics between real data and the PAR generated data."></p>
<figcaption>Comparison of summary statistics between real data and the PAR generated data.</figcaption>
</figure>
</div>
</section>
<section id="example-2-time-series-with-multiple-entities" class="level3">
<h3 class="anchored" data-anchor-id="example-2-time-series-with-multiple-entities">Example 2: Time-series with multiple entities</h3>
<p>SDV can also model multiple entities, that is, multiple time series at once. In our example, we have temperature measurements for several cities, and each city’s measurements form a group that is treated independently.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sdv.timeseries <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> PAR</span>
<span id="cb2-3"></span>
<span id="cb2-4">all_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"./fake_time_series_data_multiple_cities.csv"</span>)</span>
<span id="cb2-5">all_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Date"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.to_datetime(all_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Date"</span>])</span>
<span id="cb2-6"></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define -&gt; Fit -&gt; Save a PAR model</span></span>
<span id="cb2-8">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PAR(</span>
<span id="cb2-9">    entity_columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"City"</span>],</span>
<span id="cb2-10">    context_columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Measuring Device"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Location"</span>],</span>
<span id="cb2-11">    sequence_index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Date"</span>,</span>
<span id="cb2-12">)</span>
<span id="cb2-13">model.fit(all_data)</span>
<span id="cb2-14">model.save(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data_generation_model_multiple_city.pkl"</span>)</span>
<span id="cb2-15"></span>
<span id="cb2-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Once the model is trained and pickled, we can comment out the code above and just load the model next time.</span></span>
<span id="cb2-17"></span>
<span id="cb2-18">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PAR.load(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data_generation_model_multiple_city.pkl"</span>)</span>
<span id="cb2-19"></span>
<span id="cb2-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate new data for two fake cities</span></span>
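<span id="cb2-21">new_cities <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.sample(num_sequences<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># may take a few seconds to generate</span></span>
<span id="cb2-22">cities <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> new_cities[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"City"</span>].unique()   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># generated city names (assumed to come from the sampled data)</span></span>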
<span id="cb2-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compare descriptive statistics</span></span>
<span id="cb2-24">stat_info1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> all_data.describe().rename(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Temp"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Original Data"</span>})</span>
<span id="cb2-25">stat_info2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> new_cities[new_cities[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"City"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> cities[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]].describe().rename(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Temp"</span>: <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"PAR Generated Data </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>cities[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>})</span>
<span id="cb2-26">stat_info3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> new_cities[new_cities[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"City"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> cities[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]].describe().rename(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Temp"</span>: <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"PAR Generated Data </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>cities[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>})</span>
<span id="cb2-27">stat_info4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> all_data[all_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"City"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"City A"</span>].describe().rename(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Temp"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Original Data (City A)"</span>})</span>
<span id="cb2-28">stat_info5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> all_data[all_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"City"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"City B"</span>].describe().rename(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Temp"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Original Data (City B)"</span>})</span>
<span id="cb2-29">stat_info6 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> all_data[all_data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"City"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"City C"</span>].describe().rename(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Temp"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Original Data (City C)"</span>})</span>
<span id="cb2-30"></span>
<span id="cb2-31">stat_comparison <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stat_info1.join(stat_info2).join(stat_info3).join(stat_info4).join(stat_info5).join(stat_info6)</span>
<span id="cb2-32">stat_comparison</span></code></pre></div></div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/sdv-library-for-modeling-datasets/img/ex2_summary_stats_real_vs_generated_data.png" class="img-fluid figure-img" alt="Comparison of summary statistics between real data and the PAR generated data."></p>
<figcaption>Comparison of summary statistics between real data and the PAR generated data.</figcaption>
</figure>
</div>
<p>A detailed example of time-series modeling using the PAR model can be found <a href="https://sdv.dev/SDV/user_guides/timeseries/par.html">here</a>.</p>
</section>
</section>
<section id="relational-data" class="level2">
<h2 class="anchored" data-anchor-id="relational-data">Relational Data</h2>
<p>SDV can model relational datasets and generate new data once you specify the data schema using&nbsp;<code>sdv.Metadata()</code>. You can also plot the <a href="https://www.guru99.com/er-diagram-tutorial-dbms.html"><strong>entity-relationship (ER) diagram</strong></a> using the library’s built-in function. After the metadata is ready, new data can be generated using the Hierarchical Modeling Algorithm. You can find more information&nbsp;<a href="https://sdv.dev/SDV/user_guides/relational/index.html"><strong>here</strong></a>.</p>
</section>
<section id="single-table-data" class="level2">
<h2 class="anchored" data-anchor-id="single-table-data">Single Table Data</h2>
<p>SDV can also model a single-table dataset. It offers both a statistical model and a deep learning model:</p>
<ul>
<li>A <a href="https://sdv.dev/SDV/user_guides/single_table/gaussian_copula.html#gaussian-copula"><strong>Gaussian Copula</strong></a> to model the multivariate distribution, and</li>
<li>A Generative Adversarial Network (GAN) to model tabular data (based on the paper&nbsp;“<a href="https://arxiv.org/abs/1907.00503"><strong>Modeling Tabular data using Conditional GAN</strong></a>”).</li>
</ul>
<p>More information about modeling single table datasets is available&nbsp;<a href="https://sdv.dev/SDV/user_guides/single_table/index.html"><strong>here</strong></a>.</p>
</section>
<section id="benchmarking-data" class="level2">
<h2 class="anchored" data-anchor-id="benchmarking-data">Benchmarking Data</h2>
<p>The SDV library lets you benchmark synthetic data generators using&nbsp;the <a href="https://github.com/sdv-dev/SDGym"><strong>SDGym library</strong></a>&nbsp;to evaluate the performance of a synthesizer. You can find more information&nbsp;<a href="https://sdv.dev/SDV/user_guides/benchmarking/single_table.html"><strong>here</strong></a>.</p>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>In this post, we went over the main features of the SDV library and how useful it is for generating anonymized datasets based on real data. The main features are modeling single-table data, time series, and relational datasets, as well as benchmarking. One point worth mentioning is that SDV models need a large dataset to train on. This way, the model can generate a meaningful dataset that truly captures the real process. Give this library a try and let me know what you think.</p>
<div class="callout callout-style-simple callout-none no-icon">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-body-container">
<p>📓 You can find the notebook for this post on <a href="https://github.com/e-alizadeh/medium/blob/master/notebooks/SDV/SDV.ipynb">GitHub</a>.</p>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-thesis:xu2020synthesizing" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">L. Xu, <span>“Synthesizing tabular data using conditional GAN,”</span> Master’s thesis, Massachusetts Institute of Technology, 2020. Available: <a href="https://dai.lids.mit.edu/wp-content/uploads/2020/02/Lei_SMThesis_neo.pdf">https://dai.lids.mit.edu/wp-content/uploads/2020/02/Lei_SMThesis_neo.pdf</a></div>
</div>
<div id="ref-patki2016sdv" class="csl-entry">
<div class="csl-left-margin">[2] </div><div class="csl-right-inline">N. Patki, R. Wedge, and K. Veeramachaneni, <span>“The synthetic data vault,”</span> in <em>2016 IEEE international conference on data science and advanced analytics (DSAA)</em>, IEEE, 2016, pp. 399–410. Available: <a href="https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf">https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf</a></div>
</div>
</div></section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2020,
  author = {Alizadeh, Esmaeil},
  title = {Synthetic {Data} {Vault} {(SDV):} {A} {Python} {Library} for
    {Dataset} {Modeling}},
  date = {2020-11-09},
  url = {https://ealizadeh.com/blog/sdv-library-for-modeling-datasets/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2020" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“Synthetic Data Vault (SDV): A Python Library for
Dataset Modeling,”</span> Nov. 09, 2020. <a href="https://ealizadeh.com/blog/sdv-library-for-modeling-datasets/">https://ealizadeh.com/blog/sdv-library-for-modeling-datasets/</a></div>
</div></div></section></div> ]]></description>
  <category>Data Modeling</category>
  <category>Data Science</category>
  <category>Machine Learning</category>
  <category>Python Library</category>
  <category>Time Series Analysis</category>
  <guid>https://ealizadeh.com/blog/sdv-library-for-modeling-datasets/</guid>
  <pubDate>Mon, 09 Nov 2020 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/sdv-library-for-modeling-datasets/img/_featured_image.png" medium="image" type="image/png" height="51" width="144"/>
</item>
<item>
  <title>3 Ways to Add Images to Your Jupyter Notebook</title>
  <dc:creator>Esmaeil Alizadeh</dc:creator>
  <link>https://ealizadeh.com/blog/3-ways-to-add-images-to-your-jupyter-notebook/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/3-ways-to-add-images-to-your-jupyter-notebook/img/_featured_image.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>👉 This article is also published on&nbsp;<strong><a href="https://medium.com/better-programming/3-ways-to-add-images-to-your-jupyter-notebook-61ddfa27e565">Better Programming blog</a></strong>.</p>
</div>
</div>
<section id="introduction" class="level1">
<h1>Introduction</h1>
<p>The Jupyter Notebook (formerly IPython Notebook) is a popular web-based interactive environment that grew out of the IPython project and is currently maintained by the nonprofit organization&nbsp;<a href="https://jupyter.org/">Project Jupyter</a>. It’s a convenient tool for creating and sharing documents that contain code, equations, text, and visualizations. A Jupyter Notebook can be easily converted to HTML, LaTeX, PDF, Markdown, Python, and other open standard formats<sup>1</sup>.</p>
<p>In this post, I will present three ways to add images to your notebook. The first two approaches are pretty standard: load an image from a local file or use the image URL. However, both of these methods rely on external resources. To keep all images inside the notebook itself, without relying on any external source, we can encode our images with the Base64 algorithm and display them from the encoded data. So, we will briefly talk about the Base64 algorithm too.</p>
<p>Here, I will be using the&nbsp;<code>Image</code>&nbsp;class from IPython’s&nbsp;<code>display</code>&nbsp;module to show all images.</p>
</section>
<section id="sec-approach1_local_file" class="level1">
<h1>Approach 1: Add an image from a local file</h1>
<p>You can add an image from your local drive by providing the path to the file.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> IPython <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> display</span>
<span id="cb1-2">display.Image(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"./image.png"</span>)</span></code></pre></div></div>
<p>There are two downsides to this approach:</p>
<ol type="1">
<li><p>A relative or absolute path may not work on another system.</p></li>
<li><p>You have to make sure to share every image used in the notebook with anyone you share the notebook with. For convenience, you may end up compressing all the files into a single zip file.</p></li>
</ol>
</section>
<section id="sec-approach2_url" class="level1">
<h1>Approach 2: Add an image from a URL</h1>
<p>You can also add an image to your notebook using the URL link to the image, as shown below.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> IPython <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> display</span>
<span id="cb2-2">display.Image(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"URL of the image"</span>)</span></code></pre></div></div>
<p>In this case, the image provider may remove the image or change its properties without your knowledge. So, let’s say you have an old notebook with a broken image link; it might be difficult to retrieve the original image. Even if you have taken the image from your own website, you should be careful not to change the image link or properties!</p>
</section>
<section id="sec-approach3_base64embed" class="level1">
<h1>Approach 3: Embed an image by Base64 Encode-Decode</h1>
<p>The first two approaches rely on external resources. In Section&nbsp;2, we used the path to a file that is saved locally. Any change in the filename or path may impact the image in the notebook. In Section&nbsp;3, we rely on a URL, and any change in the original link will impact the image in the notebook. Unlike the previous methods, Approach 3 embeds the image as text using the&nbsp;<em><a href="https://en.wikipedia.org/wiki/Base64">Base64 encoding algorithm</a></em>. This way, we will not be relying on any external resources for the embedded image. Hence, we can have all images embedded in the same notebook file.</p>
<p>Base64 is a binary-to-text encoding algorithm that converts data (including but not limited to images) to plain text. It is one of the most popular binary-to-text encoding schemes (if not the most popular). It’s widely used in text documents such as HTML, JavaScript, CSS, or XML scripts<sup>2</sup>. Technically speaking, you can even encode and decode audio or video files!</p>
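<p>As a minimal sketch of the idea, using only Python’s standard <code>base64</code> module: arbitrary bytes are encoded to ASCII text and decoded back without loss. The sample bytes are made up for illustration.</p>

```python
from base64 import b64decode, b64encode

raw = bytes([0, 255, 16, 128, 65, 66])  # arbitrary binary data (made up)

encoded = b64encode(raw).decode("ascii")  # plain ASCII text
decoded = b64decode(encoded)              # back to the original bytes

print(encoded)
print(decoded == raw)
```

<p>Every 3 bytes become 4 characters, so the encoded text is roughly a third larger than the original data.</p>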
<p>First, you need to encode your image. For this, you can use the online tool&nbsp;<a href="https://www.base64-image.de/">Base64-Image</a>. After you upload your image, click&nbsp;“copy image”, as shown below.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/3-ways-to-add-images-to-your-jupyter-notebook/img/base64encoding.png" class="img-fluid figure-img" alt="Screenshot"></p>
<figcaption>Screenshot of the uploaded image at <a href="https://www.base64-image.de/">base64-image</a></figcaption>
</figure>
</div>
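<p>If you would rather not upload an image to an online tool, the encoding can also be done locally with Python’s standard library. The sketch below is one possible way to do it; <code>encode_image_file</code> is a hypothetical helper, and the demo file contains only the 8-byte PNG signature rather than a real image.</p>

```python
import tempfile
from base64 import b64encode
from pathlib import Path


def encode_image_file(path):
    """Read a file and return its contents as a Base64 string (hypothetical helper)."""
    return b64encode(Path(path).read_bytes()).decode("ascii")


# Demo with a throwaway file holding only the 8-byte PNG signature,
# so the snippet runs without a real image on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.png"
    demo_path.write_bytes(b"\x89PNG\r\n\x1a\n")
    base64_text = encode_image_file(demo_path)

print(base64_text)
```

<p>A string produced this way has no <code>data:image/png;base64,</code> prefix, so it can be passed to a Base64 decoder as is.</p>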
<p>Now you can paste the encoded image code into your notebook, but first, you should remove&nbsp;<em><code>data:image/png;base64,</code></em>&nbsp;at the beginning. Don’t forget to also remove the comma after base64!</p>
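<p>If you prefer to strip the prefix in code rather than by hand, one possible approach is to split the copied string at the first comma. <code>strip_data_uri_prefix</code> is a hypothetical helper, and the short sample string encodes only the 8-byte PNG signature, not a real image.</p>

```python
from base64 import b64decode


def strip_data_uri_prefix(data_uri):
    """Return only the Base64 payload of a data URI (hypothetical helper)."""
    if data_uri.startswith("data:") and "," in data_uri:
        # Everything up to and including the first comma is the prefix.
        return data_uri.split(",", 1)[1]
    return data_uri  # already a bare Base64 string


sample = "data:image/png;base64,iVBORw0KGgo="
payload = strip_data_uri_prefix(sample)
signature = b64decode(payload)

print(payload)
print(signature == b"\x89PNG\r\n\x1a\n")
```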
<p>Now that we have the encoded image code, we can use the Python standard&nbsp;<a href="https://docs.python.org/3/library/base64.html">base64 library</a>&nbsp;to decode the base64 data, as shown below.</p>
<div class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> IPython <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> display</span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> base64 <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> b64decode</span>
<span id="cb3-3">base64_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"iVBORw0KGgoAAAANSUhEUgAAA8oAAACVCAYAAACAXwOLAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAC9TSURBVHhe7d0PeFT1ne/xL4JBCgFK4h+wQOpi/HNJ710D1w1t+VOfJt6t9FqoexuoCvtU2OfR6C500cb2iblt458lPNsN3EfQ5wJWSPdaUq+xrYlrQXol2wfivdthoQaqATTxT5ICASER9f7+nZkzk5nJJMwkk+T90kPOnDNz5syZ8+/z+/3OmVEffnjm03HjxgsAAAAAACPVuXNn5e233zT9lxCSAQAAAAAjnc7GEyd+1vRfYv4FAAAAAGCEy8ycbP4SlAEAAAAAULwW1wRlAAAAAACUUaNGmb8EZQAAAAAAfAjKAAAAAAD4EJQBAAAAAPAhKAMAAAAA4ENQBgAAAADAh6AMAAAAAIAPQRkAAAAAAB+CMgAAAAAAPgRlAAAAAAB8CMoAAAAAAPgQlAEAAAAA8CEoAwAAAADgQ1AGAAAAAMCHoAwAAAAAgA9BGQAAAAAAH4IyAAAAAAA+BGUAAAAAAHwIygAAAAAA+BCUAQAAAADwISgDAAAAAOBDUAYAAAAAwIegHE1brZTMmSNzTFcpATe4K7BNViycI/OKSqWuxQ0EAAAAAAwrSQ/KbbUlLmBG7yq91DnktMmeLRvl4BmR7vZ62Vj/hhsOAAAAABhOqFFOWLZcVzDT9WdIfs401w8AAAAAGE5SG5QnzJTc3NywbqobNRTlLNslu3++VXa8sFseWZjphgIAAAAAhpPUBuXFj8jOnTvDumV5btwQlZmTJ9dNG+seAQAAAABS5tw5Od1xXj5yDwfK4DW9DlS665aXya5m9bitUbaVFMk8PWzhUlm7rVHa7DPDdL5RJ5Vrl8pC89p5UrRsrVTWNUuXG2+oae0sXSFF8+x10QuX6ue8IZ1udLg2adxWYp87r0hK9Pt2u1ERApV2erorqfXmrk1qS9zwin1qPrqkubbC3PTLzN+KCqltDps7K2Iee3QltVE/PwAAAAAMZx+9967Uv9Qkf7/jsNxX0yylv35L/k71/+3/Oio/+78n5aOP3RNTKA2uUW6SxkCtVHx9tWxsaBeTUc8ck1c3rpbV28JvmNWlwvWK5Q9L9avH5IwZ0i3tTa9K9YuBYAjuemOnrFDT2lB/UNpd4D1zTD3n4eWy4pE9PcLnG9vU+2xssM/tbpcG/b717f1rIt7QJPW7Vss3y2vMTb/M/B2skfJl5bLHn9JbaqUkYh4jZWRkuD4AAAAAGAnOyVuvqoD8L3+SF9o/Vo/CXfjoI/k/h1rl73a9JYdOpTYtp8XNvOrLy6VGsqRgSbEU5oYC4rGNu2RfsDL2DakurZZj7pFMmC2FSwpldlaGLLhtoWSbgfo5G+SgC58ZWQWyZEmBqKcYx14slQ3+xNpZJ1s2BqeoJlkohQVZ0rpxo5qffmjdKOWPHjTTKV6SLxPcYOmuly31utpc65LGbY9Kg5vHgu/9XF577efyvQL7WLvtn16TfZVF7jMBAAAAwHB3Tv7tpWNS+fbHcsENGXPppXLTzEmy8sZJ8vUrLpXgXaI+Oi//41fH5N9s7WlKpDYoV6+MaFIc+k3icFOluOoFqSpdKxWby2SBGyrSIE1evnyjQXa1uv6MJfJPddukorRCttXtk8oiu8i6GnfJFi/3zv6e7KyrktLSKnlh/RKxWblb6rfVi/cTyJ2Nr8qrrl8/f9u2CqmoqpOt913ELccKymSnms7a0s2y2TedpsZmV+vdJPte9KqRC2VpYY6MHZujwn4oKTcGPzQAAAAADH8f/nuLPNv+qXskcvXnrpSKv5olf/2laZL/59Ok8Kuz5NHbsuTmS90TPumSp/7lHTntHiZbWtQoS0ahFOa7G2Rl5snCfNsr0hpsmtzW1KgeObctFO/pfk37XrRNt5XcxfmS4
/rHziuSpa5fDu6RgGt/3dxYb3sU//Ovy1/o+vquQAVe74ejrsvzTaezM3gddbc3k5IlWa5YJHua9+7qU8dqjw0AAAAAw86f5Ff/3h1qaj1xkjywYIp8Rvd3n5fTp1ySmnSF3PmfPyNT7CORs53yv4+kpgl2aoNyj5+HilFTm5/juyY4w/wfqb09VMuakZkpPXNyp7S3hgJmTlaW69OmSm4wfDfIQZO41fPbzQAjKyPYUFo9PVd8LaH7JGear8F0lM+hhS4/bpd21xK8rSX0+XJz+OkpAAAAACNE82lp9N/W+vRpearhT/JRywl59Lm3pPTFN+UnB87acTlj5XO2T/lUfvfHD1x/cqU2KPf4eahl0t9fh+puD9Ynq1wdLXB36UrboCyvqtaIFr7Dnx8WcFMqV+bd5s3MHtm1p1nNR7O8+mKDGzZVCmeHapcBAAAAYDg7/d5HEb9Q9KkcefNd+bvdZ+Qd9Shz/Gfkq9ePt6POfRL+U1Fnu+Vd15tM6dH0OgEZWaFw3OyvCg7KFN9TpLnFf3/rbvN/uLGS6cvS4c9PpbGSv6pMCkxW7paG8m/KokXflEddTp5ZXCHF19l+AAAAABjuPozze09Xz7xK/vvtM+VG3QC4+5T86+7TctiOss5fkA7Xm0xDJihnZfmu4W1pD17vGzJWsqaFknL7Gf8t0FqlqdH1SoHMNk9TwdrXOru92/f81mYJPj0VshfKfatmuwdWRtZsWVK2Q7atzYvSrBwAAAAAhqdLR49yfZHGyl9+6bNi7t/V8o78ZFeLPPun0A2/jMvGhK5ZTqIhE5Szc/ND1zG/uEcafUm5s63NBGd9Ey6vUXNTbaMEf5BpX53scv2Su1Cuc62sc/ILbY/if35zoKFnBXQSde4pl5UbD4rMXCM7DhyQA6rbV7dNShdfF7rlOQAAAACMAFnZl8o41x/uI3n55Tflf6puU0OnHPnEDfYbnyFXud5kSm1Qrn1Eli1bFtbtjP77UL27rlBWeJWw3TVyf9EKKS1dK8uWLpRF5Q22TXveUlnrPefgo+r9SqWyokS+/t0aF3wzpODuBcG7W2fmLwj9FJV6/orVFVJZukJWbGhyA1OjpTlg5+fYRrnfv3xWq/ndtU9aelaXAwAAAMDwdM14+ULUZDpGvpR/pXxTdYXZ0Wudb7jys64vuVIblM8ck6amprAudEuuvpomt60tlpnukZw5KPX1r0rTMX8T6xxZWuFd/6vydFO9VNc0BH9iasKCMvleke+mXZlFsuq+4BTlTGONVNcflO7CNbIm9GPOSXdd8XpZk28a2Uu7f/k0qvl99H75+rIqCRCWAQAAAIwEo6+Q/3qN9wPJfqNk3JTxMlF1maPdIL/LxsvSP0/NhatDpum1NjZvrex84Z/kvgUzxf6YU4Zk5S6Q+4rzQ02Wpy2Wqhe2yveWzJYsF5gnzFwgxT/eIbWVRcHfOPZct2KbbFWpeKaZ4ASZvaRMdpYtk8KF/f2BqN61NFTLiwF/wI9wbLtUN4Tf9w0AAAAAhquJN18t92RF1hp/Ih3vnZXTHWelozvi2mQZLV8vuDolza61UZ8qrh8DoCtQKd9cWS2tuhl42c+lanEourfsKpGve7e/Lt4qB9b298e0AAAAAGCI+bhT/rXunZ437Ip0yaWy9Es5smj6GDcguY4cCQytGuWhr0saa3VI1m6T4sLw+u1pCwoldfXYAAAAAJDGRmfKX/zltVLxF+PlhmgtsVV8zZk6WcqXzkpZSPZQozyg2qSu5FZ52FQaL5Af766UIt9trjvr1sqih181/QU/fkmq/NdTAwAAAMBIcu6cnD7n3ep6tIybcpn9qagUo0Z5wGVLTr73I1evypbKXdLY3CZtzQHZt+sRWV1uQ7JkFMrSeYRkAAAAACPYuHHmRl62G5iQ7KFGeaB17pFHFn9XXox1L6+MXLl781YpyUvN3dsAAAAAALHpGmWC8mDofEPqtmyR7fUN0mR+u0rfvTtP8guWyN0ri
uQ6X3NsAAAAAMDAISgDAAAAAODDNcoAAAAAAEQgKAMAAAAA4ENQBgAAAADAh6AMAAAAAIAPQRkAAAAAAB+CMgAAAAAAPgRlAAAAAAB8CMoAAAAAAPgQlAEAAAAA8CEoAwAAAADgQ1AGAAAAAMBn1KeK60+qkydPyunTp+XChQtuCAAAAAAA/TdmzBiZOHGiTJ482Q1JviNHAqkJyjoknzlzRsaNG2c+CAAAAAAAF0tXxJ47d04mTJiQsrCsg3JKml7rmmRCMgAAAAAgmXTG1FlTZ85USklQ1imfkAwAAAAASDadNVN9iS838wIAAAAAwIegDAAAAACAD0EZAAAAAAAfgjIAAAAAAD4DHpQPbbpFbrnFdpv+tU4eUn8fqutQYzqk7iE1/KE61QcAfXBok92nHHKPAWBI0Oc+m6TPuy72eQCQcgMclA/J7hr1Z+46ee6VV+TevyiSx9Tfx4qm2NEA0B833iuv6H3Kje4xAAwBhzbdIU/sdw8AAGllAIOyLjUtEZ2TZf8Tcsctm+RQh79GuaeOuoeCtc+hmuZQzXNdsHY6vDTWX2sdGndINunHm9Tr9Ou9ca5UVndh8+EbHjl9AIMjbJ/g3y7d9mpqV8K2XdsFt+2wcWzX6cfbv2+STcH99EMS+vrs97bJ7eO92rSwfX5YqyS33zfD3TS98d7xZ9Mm89cOd+/vXhN2THDP98b5a/JirpdID3G+u5jjgucndaF1KLTCxXhuL/sZ73VqnTPj1t0nJeakqEZKguu5b53Vz/XNq38927TbDQQApMwABuUpUvRYlSzRvaZG+V6JW/mjDih3PLFfllS9Iq+8ol6nw7X/iLFfHSWK1bjn1slcdZB5xneAKjlha6y9cSX+19UclemPqXFVek7UuGdmmefqh/ufeMIeqPRBTx295q57Tr33c7Jurnpe2MkXgAGntssn1D7Bbpd6XxKxbXtc7bJ9jjZXFs2dwnY9lOw/IbPWed/hfvW9+78nlSwW6XG2BYEOySU1S6TK+859x4pDm3Th7FxZ95wat26WnIhSc7f/xCxZp1/7WJG8a2r33LTUQWH/E3e4oKICtDo+7HetofTxoqbEBaBE10sMEhU873hC1Epg1pnI7+4h37jn1qkzhpJQwYxmTzXsOKl5xo67cZHZt5w4YZ/YoZ6k1oCE9zM1ssi83ytPbDTzI2pqVa88JkVTXIWCW8/C5setZ+qkyLx2kd4OAAAplbY38zpk2mgvkUUmTd8oi+xRyXewmS7TdYvtKeqv+rP/6LtmqDlJNic8t8gt6gDY47xoySIb0K+apQ5r6hR60VwV4fVD/cgKO+ipsfo56g3EvQOAQaTDy0N1V8m9+kQzTlvrjrpn7KnkkrvUCSjb9ZAyd5GYr0ntrYt1QNHfXXDn7x0XNHc5j7df955vAo13qY+b1hT93ZsnhfGOAT2mFRGGDB3CVei56l617kUU9ia6XmKAHdptCkvsdq9PEULfXfg+Qa0iRXep73y/PFHtK+iYPt2sH1PUX/Utiz3VsOck+3fvV+ckHeavt54lsp9ZElqBw3XsFzspu05O0dN07+lNd12xfe2N5qQIAJBKaRqUO3QmVmqkxDUzMs2T/AebubPkKtfr5zVNKhFd6urVKPXNu0d1vFYHyzvse+uabXW6JP7zJQADbEqR3OU2aB1K9LYZ67KNYO2L78SS7XoY6lDfn/ozd1a0o4Hjgk6v3LSkpsSsH7fcYi8VsoWwU6QotPLJHXq8V0vYl/USA67DnkxEZfcJrtA9hljrlgmqOrweCg+3F7WfefeoemVoPfIK+8MKawAAAyZNg/IUfW6jeM3pvK6X5trBkt118txFlOjb2mXXXC/43rpZlB0PYHDY2iDV2faK6oTSXS4R4VC1PcGcu25dcLtlux6GIlsURRPWEikONy2vaWuw844lkU36VWB+wq18ia6XGHjhNcHh7D6hn4VlpsWBCsnPhNdKX9R+xmvp5pqCe13ohqfRPwcAIDXSt
um1bVZUI7tNCyjvRly9XfflArareQ42vewjr7nTbtPWL3TzMM57gEF0yHcDHRVabCaJUhvUUSfPmA1/idzlOztlux5CVAi1rV8PSbWukfOaT/fgLsup2W2vOfWeb5rbu3Fes+1D1b3cXThiWv71LewYdKPc6wLxdL3yJbpeYnC4JvQ19mTCtTqz1/2G7xO8c4ZQK5T4XPPr/eHr50XtZ9zlAbZJd7R5DX0Oe3kaACCV0jYo6xMOeyMLdZDRTeASrCW+sdjdwEuduOh7dMw1BcYJ1ih4phTJY+psxzZ/cjd3eawoseZ7AFIjbJ9gL8dYUtWzlYm9lk8LXbphTlTZrocOtb+ftdvt+3XLojjfk67NrVrifdfq+bpG2B0rbrzX3QxMN4N9Rq0v+ngQx433upsv6Wm5GzLZSalwrG8O6TXLtiufHZfgeonBEv7d2ZuEuhpevU9Q49QK4sbp+3ol3srEu044dJ270sf9jFcpoNe5TYfcTU9d8/4e86pLYdzn2G3iPwAglUZ9qrj+pHnzzTclKyvLPQIAIBG6Bk6FC1knzyWhECN0R2wdXHWtcHiQBgAAQ1d7e7tcc8017lFyHTkSSOMaZQAALoK/hVGwdpqQDAAAEkCNMgAAAABgSEl1jTJBGQAAAMCg+Xj2HNeXWqMPHnB9GA4IygAAAACGvIEKxH1FgB6auEYZAAAAwJCkw7HXpauhMI8YeNQoAwAAAEiK4RY2qW1OX0Oy6fXx48dl/PjxMmbMGDcEAAAAwHA0UmpiCc3p48KFC3L27FmZMWOGG5JcKWt6PXHiRPnwww/NBwAAAAAw/Iy05soj7fOmK50xddbUmTOVUlKjrJ08eVJOnz5NWAYAAACGkckLilzfyHby1TrXh4GkWy3rkDx58mQ3JPlS1vQaAAAAwPDSMf1a1we/KSeOuD4MFwRlAAAAAHENREBOddgcDp8BA4egDAAAACCmZAfMdAuTqQjQBOahj6AMAAAAoIdkBsihEhxH4mdGdARlAAAAAEHJCIvDJSSyLEaulAbl/3f6Gfn96Z/J6Qsn3BAAAAAA6WrZl8+7vv7Z+dvLXN/wwnJJLxPHTJcvTPyW/KeJd7khyZeyoKxD8hunX5LPZxTJ+FFXuqEAAAAA0tHUm0pcX9+1vl7l+oY3llF6OPvpe9LcXS/XTiyUP594txuaXCkLys+c+Eu5fux/IyQDAAAAaa6/AXAkh7/+LDPCcvLosPyHrn+Wu6b/yg1JLh2UL3H9SXX64xOEZAAAACDN9TfwjfTQ15/PfzE10gins6bOnKmUkqAMAAAAIL31NbgRkMP1Z3kQloeOlDS93ticJ4vG/4N7BAAAAPTfJ2fGm+5CNz/W0l9jMkbJJRM+VN0ZN6Rvoe2t53/m+hDN52//lutLjD9gs35HF22d9dt99u/lvpyAe5RcKbtGmaAMYLh66+SfZF/r7+WDrjY5/8nF3QUTQOIuu+QyuXxstsyb+gX5/OTPuqEYCXSAuOzjK+VzV0+XcePGuaHoq3Pnzsnb75yQ86PfN8Ej0ZDMzxv1TaI/KeUFZdbv2CLX2UipDso0vQaABL35p3b55zd/JSfOvU1IBgaY3ub0tqe3Qb0tYuTQQYIQcfH08tPL8ZMznyEkp1Ciy8z7Dli/Y/Ovs4OBoAwACdr3bmpKLQH0DdviyKKboxIikkMvx0Sb9xKS+68vy471O76+rLPJRlAGgAS1dVOLBaQDtkWg/xK5ljZe0Ovq7JTOYNflhiJSImE58WvEu3zLXHUs9gFBUAaABJ3/mObWQDpgWwQGQ4vUrV0oXyx5VLZt26a6KildtlFo35FiLXWyduEXpeRRvcxVV1Uqyzay1AdC2gblky9vl7v/y3b5bYcb0B+Hf6mm8X3bbb6I39lS01n/cqd7kConZEdC89kpv/2+et73X5eTbgiQLo5ujlyH3Xptut63Z7PdR2wDZppuGjG3w47XZX3k9M0wNz9ef1j3SznqnoqBc88Xfyo/LbLd4
7PdQKNISr/ijXtSSme6wYr/NT8telzuccP7ZWapPNnLtIrmPik//aJ/zD3yePA1ugufPwx1ve2n9Ph4+y//63tOI7QPiz2NnvtO73X2Nf79oNel/rxkKOuSljda1L/91FYrJXMq0z8ABiplTh/nM5EbTUWvCdUhuVQa5y2V/LxiKSkpUd0qKcxxo9NKm9SWzJHKNPgCE6lVjlvDr0NyaaPMW5ovecV6matuVaGk5WIfhoZ1jfLRvQ3q32vlOzt+JNtXT7cD+0qfYK/R00kXmfLlH6nP86ObZLIbAqSFw7+UHz7v+g1dqLNZWtc8KNt/rdbZDdny9IbYBTw6JD+wIeKAoqcpq+3rf71apm54XHYcduPi0dvt8hr1fP+27/YFZlo/kh/c3iA/7LVgCsmkA+j8sYfl2bo75c4/HJZJV4cC5z1f/Lbc0LVX7tTj3jklN1zvQuzsx2W+uOGq23tmmswPC7F9ocL4n82Q43+w03q2Y5LM/0qpGhqi5/HbU8a7R87My2WSOknc6+bhzrq/kYpjbhyGuN72UzoEb5Z69yiqjg+kVQrkB27fsv3Xd8uXp9hRer/2wxNL5CfetJcnVkCng7F9XWhacru3L1TdjiUiG2rihPcRrrNBNm5vkm73EBfLheSl66V0QY5kuKFWs9RWrpVly5bZbnWF7GlzowZNtiyuOiBr89zDocqF5KXrS2VBTvhSl+ZaqVzrlrnqVlfskUFf7MPQgAdlW1PslYiGHzD8pa6/fNMNVOzw0HMTqW02Bydz0n5EHZi+706uw0t9/Sfc4fPlTVs9X51s6wKpgDpBN6W9rpbae23YvLhaq/Wbf2lrr1ytb9i0/TXB/lquzb93A3sTXqMcfP+XQ7Xn618+YZ9jHvuWcUStmr802j+PO9T8m7/B5ePeM8rr/LzvaYev5Ds4DW+5vRyah8TfH+lPbStrRApvdw+1jiPyO1ki3/lqpn18w9diFvDodeeBV/PkB2siSrr1a4JBd7rcrKbf+nZvtSh2u9UhefkNblAUs+YXqKd+EDO4I/lmZKgA2vWB1OkHxz6QUzJeJpnVo0guHytytvu4fqB2OafkrIqml+sQffBBufO1p+xw5alTLSITZvWoCda1zk/O9SKvrZ0Or7HW6qTiN6GQW/f+cTl76SSZYR+aaXw787js7TjrhjiZk2T8R6fEzV1Mdh5KQ7XPXqBXYf+nvkDes8YagybefsocMzerQLpaCs3IGN5rlUD+VHVq3lPbm0ckb8G1dno3fEFNp0F+18uxTR8PTUiOVyA+5Vq5Of+IHH/PPR7hugJVsmLFLhXZrM7GeuksUDuAfRVSVFKrYl5sgco5MmeO6yKqIJtrS9y4Eqn1pRD/a0rciDb93ODrA1KpxoUeVsocNR/Rgox/WqG3tzWiPYcrprbbDY8owYk2X30VrQa0ZVepPNzQKgc33i/L7t8o7VPd9tJ1TJpaGuXFpnlStnmzbFZdWeExaWy1o/3M8okyb9Hn2dUI14Y+qzdOP9//ej3dnp/VV6PsWgfUhn2X9vsxj/0L19TQu+Gq848KzX+JVFb6v+s435WTSK1yTy2yq/RhaWg9KBvvXyb3b2yX0GJvkpbGF6VpXplZ5ps3l0nhsUaJstj7tn5FXVZmRGh5Onq60T6rGhNatr7WDv758C/zWMPTxcAGZXXQeXqDOmiYklt94PHV6LjaKDtuiczwrVTmhFZ9Sa0mvHZK4FU1Lj9P8rxS1igmf/Vu+YE5abe1SMtvsKXG9fm2ZPcn6oS8fk0o4D69IduVBuv5UuHalChPl+U7logukDLzlWCtdODEVPmOnpY+yKnPpWvJCnXNlp52Y408YD6zmp8NOoS7Uuj5Er/EOi59oJ+vpv+gfCdfvf+GzXJ8uS1xzgsuY/t+poZNvZ9eNgGvNNp9L15p9c3qNX5HNz8uTze6+dxQYAoNYodY9dr5+j3svNSvCS8MqX/zcvlu8P332nG9vD/S31G1o25dM199dz765
FFtMoFgIUvswq1Zq9U6o7aXaCeaISfkd2ofMfVz7kgRla390bVD8UKyplucBE9gMSDCQq6ppT0rp0y5R500dp6V8Zn5NkzqYKpi9AdRam2Lxk0SiRJan3rtWTme+Q1bQz37KzKj81l58KAdF0vRFTPCAvBTr90pd/6mose0zXteeoN82wXgUCDvafyUGXLK1FjvlZYJ821YV2H/2c4Z8g3zunvkKyqMP+sL/xhE8fZTU24yx6ve9iUn31ZnkvrY7gp7Q4XAndKqDr+hfdZkmaGOi/EK+3RIfkCfj8QLyZoO+Oq4fHMv8zYS6JC8ulJk7ealrjlqlwT2dUphQbZkziuVrbftkZWxwrIKRiuby+SlAwfkwIGXpKx5iy8QV6vzsjI1/IBsLW6Q8nIbdHVgWlldLFv1a14qEym/1QSG7IJCKWhuNs+RtmZpLlDnrs12Ym1qeEFhQY9jnJmWbDXvoafVvNIGi0DlrVKu3tvM19ZiqV7pBRYVQm4tFyl7ybymUM2jx0zL+yy++UqG9uZMKXuhTnbu3Kk69XeZWtJdAalavUWyKtbL0owMycrMlEzVZan+HtRyvrW+0C3nrZJTXm4+T6xl6akub5JVwXH2NXmFxdJQ32CXs/q3QZ086+86vojv8tYtkvuSnW5BtfvOdUhc2awWrZ7HA/JSWYFUb3GFG2qcens3bpWaXOg81XxXOe47DPuuLla7NGeWyQt1epmrTv21i12t71uypGL9UsnIyDLLPDMzS/W7l/n0ff3SqqW8aZVbBnqx62WQbdbf6nrvywlIvfreCnvU2OtAvVLUF+pe3ywr9RcaazuLu/2lh0Fpeq2D1vqXJ8tyHbxc+LTNpAvkdlOqm2lOYINMKaw+KKiDizk4qA2lrye4Ea+brIO2mqYpjTUHw69JtqnV7KWJVQL88+Z9LnswszViphbLmx8VLmbpUeYz9pcuWbbLbapZnO79plwuU/VDwzbZXn6lrdH1N5E92RgwG07hfPtd2IIJjw0ncvsXwuYz9oHe+6zq/Zbr6YSXnuddY5dM9jX6+7WFH/HfH2mv43V5/oSvRsbv+RpbaKO29Z+sEVcA1Xe2xYEt6PpazBND3Xok3vZrW5fYk2HbrDHqPCN1dO3wOyLzdeC8fpI0+pow1+3/GxMmTRi94pQ8W/eg9IiSM0vlG1NEDv+xwtZKh6mTij8elxl/9qQ8ebXKLft7PiPEXnOsm1i3vB9tWuFMTfgZ1/z7D2qHNuXbUWqrnTON7jM9Jb/pOCvTJtma47r9vzBB/smvzBdJ4D0xgC5yP6VrjYPNok2TaK8w+aQcV8f5hD2/ueflJ35qvLf/utu0mvmaPS6PYKGQXCJ5Y91AaZKGpoWS57LTtKLK+GE5SDfXrZLFwcxVEAxgOpx5WpsapKDsblOJItmLZZUaZQJEdo7kqNTWoE/0W5skZ5UKVCbQxQpzeniDFHtpQ02r6sBaNd02UblailcttsE6724pK2hQk1IT1gFcimWVm8m8u1XQM312WsEw7uar2QX1pFMhedvKR6VzVZksnaaLJuIL1Ff7CgryZO0Bu5xjLksnfJxbBnmFUtzQZGtP29QwdVbaa072fZdTc9USK15lv2f9nZmhiln+oe8/Oyd0FXCb/l6916g5uluFaEsHRjU57zt031VTtKrdJOgKbJOVj3bKqrKlMq3Xpd6P9csokLK77WuyF69Sy9qu02EFQYF6qS4utN+Nn/4+GkIBOntxlRzo0f49cjvzxBo+uAY2KKtAerup5bVhObzkNR4bMHVN8lETqrxg2Ae61Fj98d5XH2T0YxP4XLNg3fTzJ64mNDlsabIOiz90BzcTUBtb7Yo2oFyz8+UBuXmHrVEPd63MuNL1+plrr5TgAdoGkcCb/WuwGrs2MMb7I83plgpqnVoTo/bDF2xN4ZTefmPUKsejW4iYk9gFAXkgzo3sTMuPsBNVv/BrlLcvb5UHuKHXgDI35Zp01F3ne1RmqbBqA6dtKv0N+
YUd90eRb0TeMEvfhOv6G+TUO3GuDz5WIY1dKtR2/KZnyA7zlDxo5mGvyNXRmmiHMzXNXg2wfo8z6uTbBeBIwebjPagg/75ubH5YftNLTTcG2EXup0yLGK/FmTvPqd+rD/62BjlxtuWWuX9CtP2c/xplda4yY0ei51DDVZc0BxqlO2+e5ARDsvJGowQWzg672dG0vIWS294oTZEnX3lr5aXCernVNf0Mb8KrQlSPk3YbMvxM8DLypFCFOR2SAvXNkjt1quSq0N6q/mtqiDYtPdz1hokcni3BzKYCeNSXOA3ltwabsa5UAa4hVYmtqV52Zc2WzMZdsm1bo+QszZPYZ+U9l5kVb1laOT0XmqKXc7XoPK0DrErALoDHE2359xTWDFgvQEcH+niqV3qvu1XK1VNTVUDRVL9LsmZnSuOubbKtMUeW5sXLQv1Yv4wYyyq7QApVAtChWRd8BAO4X6z1M9Z2Fnf7Sw8DXqNsDih6J7/BbgzB5r+G17y6J1PLqA5ezyfQ7DqqK6eakg/btNs70PxIvvvVzGCtb69NnfosVMMbusmH7kKlwP0NnH12+Pcm4BZu8N0YJEyMa528WumwA7TqEmyGnjiutRqSTMuIUE2tKQjShSq6ub/e5pJcKDT5c2rvHXOargBNtxBR+5fIZv89mNYRsfc5SLZ7ZNYEkZZTXoR9So56gXNmvsy49Kwcf9/Vsx5rlOMfjZcZV7gmzvoa3+vtTbjiNqc2N/46HGqC3avjcuojkUnjYjeljiVWIB6f4V3x7Gqig+6Rx69Wu+JgE2ykhRTspzTbesqeA4RaYNka5pgFxq7l1qzVD8p3pEaejhuCbcu7ATuHSEtj5bplm6UsS9cqB4L1a80H90he/nXukdKim9TukdvWPyILowQAU+vlmn4W1vfWXDkyVISHKF3z3NwcUJ0OG/q5zdJcG6P2TZ1dReRCJ3K4L1BOzVVnlLEVuyavwS6Fd7PKKVzl7nxdIssW5qhvI5aey8yKvyy1UOAMD9V6OVfX1ybY7DpBuhmw1wxcd1tDrQgiA3y4gmBzba+rSlm1aI4UrnJ3vi5ZJgvDSogi9WP9MtQ6G1zsugWDRze/1o0kamM0u1birJ+xtrO+bX8Db2CDsv9GWDd8zV1DnC1TVXCzzW1d82px1yH7uebXgf40u9bMjS/UdqCmqw8rtimnvRbJ3wxYDu+Vp+M1lXKB2x74osxnBPu5vObHrlZXhwg3P/L87+3JvAuyKRM23yfkl77mXbYZulcC3im/3eHfUbnm4sH59H2HHbYZd3iJdoM8bx570+n9Gqr474+05q7h8wpQzDatC1V0QYpZx0NN748+VyOBPhZyme3Uu4+BYgq1vMsA4jH7lxi1Mh6zzdn9DwaCLxgbNjibwBkZjE1wFjl1TgVnXZN89SQ5/Ife7jRtg+je1ypcE+zwu1lbusm1r6Y6MqBHFXFjMDU/+RPivCZ4o7HwgoF7vjhf5J0HpcI1wU4syCPlLno/pY5Z3/fdfFJfivK8K7RT9PmFd95h9zmJXFecKV9eE6tljMeef3iXM41cKiyv0GFZhRwTlpulsTZXFua60V5I3lopRdPcMB99DWf4DYQKJDd0zVpUOjQ1lG83rRL1tatb/E1vVVCQ+i3qe841lQx5hTlSX98cvfbNBcVgU2NzIyV9ragb7l0fG9gu5Q2u6bCu1Suoli2u5i2wvdzV4LlrSL3XqLnTN1RKWg1dRsDcUCp4Z+tHat2IxJhlFryuOHSjs7jLUgm+xjTpDTWfNs2vq8ulPKFm1/3RJrV6ZhzT7Ni7llnN7XZdbWzYVgTl2/3fYaybXPVdRmCj3O8tc9X1bbH3Y/0yQs2wTZNz/dndKL0c9MXa0Qt+FNOU3db2a2b7KqmVphjbWX+2v4E2sEFZnbzam2iFap8KvWts3DjbNLpGjvf4nTcX2Lxaoz7T1+i6m2mp97Y32LK1q5O/+hUTwk2t2Jo2ydMB1ith9gK2ni990u2aVcWez
whhn9ndTMzUxrr5USukaZa9Vy0L+4rUCJvvzdKar+c7dI32d9Q82ubVj6vPFF4eZEq31YmEmc81DaZWPvbNTdRJgNSY6TzdqJu6JnANVS/vj6HKruMS3N59rTZMIUvvzZ7NTfkkdF2euRNsgq0ZZq1223swLIdfo6y39YTWTySNvuHW4bHz3W8Rz5dpZ/bK35hrifXdqPfKqSnftuOuVzuYDnszLnPDLfXfDdfr13hd5O8f6zBrg6iJpccq5Bf6euced5Z+Sh78gwrR3rR6a8pt2HnTTbQTec3ZMyL5Zh7nyyTvM+ifxZK9rjbcXUvt/fwVBlmc/VQc+k79tpA4/PX22uFQyy2zD5vubvTVl32OOy7qc4dgWPZfo6yPsdNXm1ZxsGF5/dKpMrYtIHtyF8psr6ItM0/W7ogekrXsxWVS1rwy2Gy2vrCs12skdQ3Y1uJqWalf426sFay4Nc1TVYjymgPr4NyQE732Tclbq29g5N5fTStnq71G0wxXMdA0SV1ZLcVuuHoDWVxmb3qlX7Mlt0y8ek8zXznuNXNWqjCzNWk1m3kldfJzd1dr063t2xlr5Lw1u2UWd1kqxTlN9jW+ZWPpgKrOOBNqdp0gc62um5c5al5WlUmBdy109mKxi12Pq5fc4DXKPb/DyM/Qf3lSUvfz0DJXXR8Xez/WL61Ycprs+nVreY5srXLXMmumoCa8MCOc2t7MTcP0cgq9PjfGdtaf7W+gjfpUcf1Js7E5TxaN/wf3CGlPhxZ1cBcVgPVB1951U9QBPVYz7ej0iYM5yfA1LU9Ikt4fSLXHXt/h+oCe9DXY+d3PuvCPVHvopuWuD8Pd+eOXS35+Ahd8t9TKtuaFsmIeBQixdPRSwZPQTxkFKmV1022yvrBnCURL/XflxVwV6i4qLOq7J98qTati/RayrpXWd672B7yBYuetvvClPhVEJGO5BypXS9Nt66XnYm+R+u++KLmb9c26LoKudb5V32k8xnR6G59CjY2NctmMD9yjkN1n/17uy0lSFX6EI0cCQzwo62bAa2I00/Wafw5Fg/C5bMh1DxRda9zXkup+B2UlGe8PpBpBGfEQlAcWQXnkSDgoo1e9BTat19DWtkcqHt4iB/1X3nkyZ8uqH5dGvR48cXGCsglrtuY2ddcCRwhUht3cS4q39un676Qsc6VtT4U8vOWgRF/sq+THpQsvroY9XhB2y0BfB5+cGvO+ISgDQJr7x3/bJec/Pu8eARgsl42+TP72Py51jzDcEZSTJ1mhDYljmV+8wQrKg/I7ygAwFF1x2YC38QIQBdsi0D+JBLJEgh0Sk8iyfOv5n7k+pBuCMgAkaN5VX3B9AAYT2+LIMiZjlJw7d849wsVgOaYf1u/49LLRy2gw0PQaAPqg6f33ZU/TAXmv8305d4EDGzBQxo0ZJ1dmXiELc+dI7hVXuKEYCT45kynjPr5Srr76ahk3bpwbir7SgeOdd96RrtEfyNT5d7mhsdEc+OIkUpvc+noV63cc3jp7fvT7MmrCaTc0hGuUAQAAMKJ9emaifHzmMrnQnfTT1hFD18qNnnDeBI6pN5W4ofHRLLh/Pn/7t1xffDooa6zf0fnX2WgIygAAAACSKtGw7IU5JIblOnC4mRcAAACApEo0qOngl2j4G8lYTsNPSoLyxNHT5eyn77lHAAAAAIYyQmBsfV021CZfPJ01deZMpZQE5byJfyVvdb9EWAYAAADSVF8DG7Wm4fqzPAjJF09nzLe66+Q/TEzt7+mn5Bpl7fVTW+Vg53Ny+sIJNwQAAABAuln25fOuL3E7f3uZ6xuZWGaDZ+KY6TI785ty06S/dkOSL2U38wIAAAAwdCTyc0bRjLSfkWI5jQwEZQAAAABGf0OgZ7iGQZbLyENQBgAAABB0saFQGy7BkGUxchGUAQAAAPSQjJDoGUphMVmfm4A8tBGUAQAAAESVzLDsSbcAORI+I/qOoAwAAAAgrlSESb+BCJap/gwaAXn4ICgDAAAA6
NVABM1EeYE0HecJwwNBGQAAAECfpFNAHWwE5OGJoAwAAACgX0ZqYCYcD38EZQAAAAAXZaQEZgLyyEFQBgAAAJA0wy00E45HJoIyAAAAgJQYqqGZcAyCMgAAAIABka7BmWCMSARlAAAAAINqoAI0gRiJIigDAAAAAOCjg/Ilrh8AAAAAACgEZQAAAAAAfAjKAAAAAAD4EJQBAAAAAPAhKAMAAAAAoHz88QXzl6AMAAAAAIDS1XXe/CUoAwAAAACgnDzZbv5e8uGHZ0wPAAAAAAAj1dmzp00nIvL/ATr9BRu5TEnOAAAAAElFTkSuQmCC"</span></span>
<span id="cb3-4"></span>
<span id="cb3-5">display.Image(b64decode(base64_data))</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="7">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ealizadeh.com/blog/3-ways-to-add-images-to-your-jupyter-notebook/index_files/figure-html/cell-2-output-1.png" class="img-fluid figure-img"></p>
<figcaption>An image decoded from Base64</figcaption>
</figure>
</div>
</div>
</div>
<p>As you may have noticed by now, the main advantage of using Base64 to add images to your notebook is that you no longer need to worry about external resources: the images are all self-contained in the notebook. The trade-off is that embedding images increases the notebook’s file size, by an amount that depends on the image resolution.</p>
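<p>To get a rough feel for the file-size cost, here is a minimal sketch that is not part of the original notebook. It assumes a stand-in byte string instead of a real image file (the <code>image.png</code> path mentioned in the comment is hypothetical) and uses only the standard-library <code>base64</code> module. Base64 represents every 3 bytes as 4 text characters, so the encoded string should come out about one third larger than the raw data.</p>

```python
from base64 import b64decode, b64encode

# Stand-in for the raw bytes of an image file. In practice these would
# come from something like open("image.png", "rb").read(), where the
# path is a hypothetical example and not a file from this post.
raw_bytes = bytes(range(256)) * 12  # 3072 bytes of sample data

# Encode to a Base64 text string that could be pasted into a notebook cell.
base64_data = b64encode(raw_bytes).decode("ascii")

# Decoding restores the original bytes exactly.
restored = b64decode(base64_data)

# Ratio of encoded length to raw length (roughly 4/3).
overhead = len(base64_data) / len(raw_bytes)
print(len(raw_bytes), len(base64_data), round(overhead, 2))
```

<p>The same ratio applies to real images, so a large, high-resolution image embedded this way will noticeably inflate the notebook file.</p>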
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>In this post, we went over three ways to add an image to a Jupyter Notebook: 1) through a URL, 2) from a local file, or 3) by Base64-encoding the image data. I also provided a resource link that you can use to Base64-encode your image. The main benefit of the Base64 encoding scheme is that it reduces (or even removes) the notebook’s dependence on external image files.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p>📓 You can find the notebook for this post on <a href="https://github.com/e-alizadeh/data-science-blog/blob/master/notebooks/Add_Images_to_Jupyter_Notebook/AddImage2Notebook.ipynb">GitHub</a>.</p>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Wikipedia,&nbsp;<a href="https://en.wikipedia.org/wiki/Project_Jupyter">Project Jupyter</a>&nbsp;(Accessed on November 16, 2020)↩︎</p></li>
<li id="fn2"><p>Wikipedia,&nbsp;<a href="https://en.wikipedia.org/wiki/Base64">Base64</a>&nbsp;(Accessed on November 16, 2020)↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{alizadeh2020,
  author = {Alizadeh, Esmaeil},
  title = {3 {Ways} to {Add} {Images} to {Your} {Jupyter} {Notebook}},
  date = {2020-11-06},
  url = {https://ealizadeh.com/blog/3-ways-to-add-images-to-your-jupyter-notebook/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-alizadeh2020" class="csl-entry quarto-appendix-citeas">
<div class="">E.
Alizadeh, <span>“3 Ways to Add Images to Your Jupyter Notebook,”</span>
Nov. 06, 2020. <a href="https://ealizadeh.com/blog/3-ways-to-add-images-to-your-jupyter-notebook/">https://ealizadeh.com/blog/3-ways-to-add-images-to-your-jupyter-notebook/</a></div>
</div></div></section></div> ]]></description>
  <category>Python</category>
  <category>Data Science</category>
  <category>Visualization</category>
  <category>Jupyter</category>
  <guid>https://ealizadeh.com/blog/3-ways-to-add-images-to-your-jupyter-notebook/</guid>
  <pubDate>Fri, 06 Nov 2020 00:00:00 GMT</pubDate>
  <media:content url="https://ealizadeh.com/blog/3-ways-to-add-images-to-your-jupyter-notebook/img/_featured_image.png" medium="image" type="image/png" height="80" width="144"/>
</item>
</channel>
</rss>
