A Bit Less Wrong
http://jxieeducation.com/
Wed, 19 Apr 2017 15:08:43 +0000 | Jekyll v3.4.3

Nonparametric Density Estimation Parzen Windows And Beyond<h3 id="smoothed-histograms">Smoothed Histograms</h3>
<p>I recently picked up Seaborn for data visualization. It’s very user friendly.</p>
<p>One of my favorite plots is the Kernel Density Estimation (KDE) plot. And this post is about how KDE is generated.</p>
<center><img src="http://pandas.pydata.org/pandas-docs/version/0.17.0/_images/rplot-seaborn-example2.png" /></center>
<h3 id="probability-density-estimation">Probability Density Estimation</h3>
<p>There are 2 general approaches to estimating the PDF of a distribution.</p>
<p>The first group is parametric: we assume a priori that the data follows a certain distribution (e.g. Gaussian, Student’s t, Gamma) and fit the parameters of that distribution.</p>
<p>The second is nonparametric. Nonparametric methods make no prior assumptions about the distribution of the data, modeling the data itself rather than a fixed set of parameters. I will go into more detail on the Parzen window approach.</p>
<p style="color:#8B0000">(Edited December 15, 2016 - There are also the mixture model that combines parametric and nonparametric approaches, e.g. fitting Mixture of Gaussians with EM)</p>
<h3 id="the-general-approach">The General Approach</h3>
<p>The PDF of x can be modeled as the following:</p>
<script type="math/tex; mode=display">p(x) = \frac{ \frac{k}{n} }{v} = \frac{k}{nv}</script>
<p>Here v stands for the volume. First, we specify a bin size (or volume), like we do with histograms.</p>
<p>Then we get k (the number of samples in the bin) and n (the total number of samples) to calculate the density of the bin.</p>
<center><img src="http://matplotlib.org/1.3.0/_images/histogram_demo_extended_01.png" /></center>
<p>Overall, the formula works exactly like a histogram, but is normalized (via division by the volume / bin size) so that it behaves like a density.</p>
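<p>To make the formula concrete, here is a minimal numpy sketch (the data and bin choice are my own invented example): we estimate the density of a standard normal over the bin [-0.5, 0.5) as k/(nv).</p>

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)   # n = 1000 draws from a standard normal

v = 1.0                                          # bin width ("volume" in 1D)
k = np.sum((samples >= -0.5) & (samples < 0.5))  # k = samples falling in the bin
p_hat = k / (len(samples) * v)                   # p(x) = k / (n * v)
```

The true standard-normal density averaged over that bin is about 0.38, so <code>p_hat</code> should land in that neighborhood.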
<h3 id="parzen-window">Parzen Window</h3>
<p>The Parzen window is a very simple kernel that is best explained pictorially.</p>
<center><img style="width:500px;" src="http://jxieeducation.com/static/img/kde_parzen_window.png" /></center>
<p><a href="http://www.csd.uwo.ca/~olga/Courses/CS434a_541a/Lecture6.pdf">Source</a></p>
<p>Here the 7 points are colored. The density estimate is:</p>
<script type="math/tex; mode=display">% <![CDATA[
p(x) = \frac{ \sum_i^n (\| x - p_i \| <= \frac{3}{2})}{21} %]]></script>
<p>Here the denominator is $ 21 = nv = 7 \times 3 $: $ n = 7 $ samples and a window width (volume) of 3.</p>
<p>This way of calculating density is nonparametric because we need to remember every single sample in $ D $.</p>
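<p>A minimal sketch of the estimator above (the sample values are my own; only the 7-point count and the window width of 3 mirror the figure):</p>

```python
import numpy as np

def parzen_density(x, data, h):
    # k / (n * v): count samples within h/2 of x, divide by n times the window width
    k = np.sum(np.abs(x - data) <= h / 2)
    return k / (len(data) * h)

data = np.array([2.0, 3.0, 4.0, 8.0, 10.0, 11.0, 12.0])  # 7 made-up samples
p = parzen_density(3.0, data, h=3.0)  # 3 samples fall in [1.5, 4.5], so p = 3/21
```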
<h3 id="smoothed-parzen-window">Smoothed Parzen Window</h3>
<p>Instead of a kernel that returns 0 or 1, we can use a Gaussian:</p>
<script type="math/tex; mode=display">\frac{ 1 }{ \sqrt{2 \pi } } e^{ \frac{-d^2}{2} }</script>
<p>As a result, the generated density estimation is smoothed.</p>
<center><img src="https://upload.wikimedia.org/wikipedia/en/thumb/4/41/Comparison_of_1D_histogram_and_KDE.png/500px-Comparison_of_1D_histogram_and_KDE.png" /></center>
<h3 id="choosing-the-right-window-size">Choosing the Right Window Size</h3>
<p>The window size h is the only parameter we need to choose for the KDE.</p>
<center><img style="width:500px" src="http://jxieeducation.com/static/img/kde_h_window_values.png" /></center>
<p>Generally, if h is too big (left) or h is too small (right), then the quality of the density estimation will suffer.</p>
<p>In a classification context, we can choose h systematically by fitting a KDE to each category, classifying test points under the maximum posterior, and varying h to get the best accuracy.</p>
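<p>Outside the classification setting, scikit-learn makes it easy to pick h by cross-validated log-likelihood instead; this is a sketch of that alternative (the data and bandwidth grid are arbitrary), not the procedure described above:</p>

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 1))

# score each candidate bandwidth h by held-out log-likelihood
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.linspace(0.1, 1.0, 10)}, cv=5)
grid.fit(data)
best_h = grid.best_params_["bandwidth"]
```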
<h3 id="conclusion">Conclusion</h3>
<p>Check out the kdeplot function in Seaborn. Hopefully this post explained the basics of how the density estimation really works.</p>
Thu, 06 Oct 2016 00:00:00 +0000
http://jxieeducation.com/2016-10-06/Nonparametric-Density-Estimation-Parzen-Windows-And-Beyond/
probability_theory, ml

Attractive Mathematical Properties Of The ROC Curve<h3 id="roc-curve">ROC Curve</h3>
<p>Most of us use the ROC curve to assess our binary classifiers every day. Sometimes we take its theoretical properties for granted. In this post, I will take some time to analyze why the properties are what they are.</p>
<center><img src="https://www.unc.edu/courses/2006spring/ecol/145/001/images/lectures/lecture37/fig4.png" /></center>
<h3 id="2-properties-will-be-covered">2 properties will be covered:</h3>
<ol>
<li>The baseline is the diagonal line</li>
<li>The area can be interpreted as how strong the classifier is</li>
</ol>
<h3 id="1-the-baseline-is-the-diagonal-line">1. The baseline is the diagonal line</h3>
<p>As we know, the x-axis of the ROC curve is the False Positive Rate (FPR) and the y-axis is the True Positive Rate (TPR).</p>
<script type="math/tex; mode=display">TPR = \frac{ \text{True Positive (TP)} }{ \text{TP + False Negative (FN)} } = \frac{TP}{ \text{Positive (P)} } = \text{ % of positives captured }</script>
<script type="math/tex; mode=display">FPR = \frac{ \text{False Positive (FP)} }{ \text{FP + True Negative (TN)} } = \frac{FP}{ \text{Negative (N)} } = \text{ % of negatives misclassified }</script>
<p>If we have a dataset with a fraction $\pi \in [0,1] $ of positives and $ 1 - \pi $ of negatives, and we predict randomly, labeling positive with probability $ p $ and negative with probability $ 1 - p $, then we can calculate the TPR and FPR as ratios.</p>
<script type="math/tex; mode=display">TPR = \frac{TP}{ \text{Positive (P)} } = \frac{ P(\text{positive}) P(\text{predicted positive}) }{\pi} = \frac{\pi p}{\pi} = p</script>
<script type="math/tex; mode=display">FPR = \frac{FP}{ \text{Negative (N)} } = \frac{ P(\text{negative}) P(\text{predicted positive}) }{1 - \pi} = \frac{(1 - \pi) p}{1 - \pi} = p</script>
<p>Since TPR and FPR are both p, a random classifier (baseline) will have a ROC curve of slope 1 (the diagonal) and an AUC of 0.5.</p>
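<p>A quick simulation confirms this (the values $\pi = 0.3$ and $ p = 0.2 $ are arbitrary choices of mine):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
labels = rng.random(n) < 0.3   # pi = 0.3 of the data is positive
preds = rng.random(n) < 0.2    # random classifier predicting positive at rate p = 0.2

tpr = np.sum(preds & labels) / np.sum(labels)
fpr = np.sum(preds & ~labels) / np.sum(~labels)
# both rates come out near p = 0.2, regardless of pi
```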
<h3 id="2-the-area-can-be-interpreted-as-how-strong-the-classifier-is">2. The area can be interpreted as how strong the classifier is</h3>
<p>Technically, the area is a bit different from what I described.</p>
<script type="math/tex; mode=display">A_{ROC} = P(\text{score of a positive > score of a negative})</script>
<p>Let’s go into why.</p>
<script type="math/tex; mode=display">A_{ROC} = \int_0^1 \frac{TP}{P} d\frac{FP}{N} = \frac{1}{PN} \int_0^{\text{negative samples} } TP \, dFP</script>
<p>The integral means that for each negative example, count the number of positive examples with a higher score than this negative example.</p>
<p>If we have a perfect classifier, then all P positives will be scored higher than each negative example, so the integral attains its maximum value of $ P \times N $.</p>
<center><img src="http://taint.org/x/2008/roc_zoomed.png" /></center>
<p><em>Note: Intuitively, the perfect classifier has {TPR = 1 and FPR = 0}, which is the upper left point</em></p>
<p>Combined, the two terms give us a nice interpretation of $ A_{ROC} \in [0,1] $ as the classifier’s ability to distinguish positive from negative data.</p>
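<p>We can check this interpretation numerically: directly counting correctly ordered positive-negative pairs matches the area under the ROC curve (a sketch with made-up Gaussian scores):</p>

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 500)    # scores of positive examples (made up)
neg = rng.normal(0.0, 1.0, 500)    # scores of negative examples

# P(score of a random positive > score of a random negative)
pairwise = np.mean(pos[:, None] > neg[None, :])

# area under the ROC curve, computed the usual way
y = np.concatenate([np.ones(500), np.zeros(500)])
area = roc_auc_score(y, np.concatenate([pos, neg]))
```

With continuous scores (no ties), the two quantities agree exactly.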
<h3 id="conclusion">Conclusion</h3>
<p>Unfortunately, I did not go over other properties such as linear correlation with <strong>accuracy</strong>, <strong>pareto optimality</strong> and relationships with the <strong>calibration curve</strong>. That’s for another day.</p>
<p>The 2 main properties outlined in this post make the ROC curve a fairly good way to compare binary classifiers. These are great theoretical advantages that other popular metrics (such as the <strong>precision-recall</strong> or the <strong>calibration</strong> curves) don’t have.</p>
Tue, 27 Sep 2016 00:00:00 +0000
http://jxieeducation.com/2016-09-27/Attractive-Mathematical-Properties-Of-The-ROC-Curve/
optimization, pareto, ml

Probability Calibration And Isotonic Regression<h3 id="what-is-this-isotonic-regression">What is this Isotonic Regression</h3>
<p>Isotonic Regression (IR) is a special case of a degree-0 spline (i.e. a piecewise-constant function), constrained to be monotonically increasing.</p>
<p>Mathematically, it fits a weighted least-squares problem via quadratic programming:</p>
<script type="math/tex; mode=display">min \sum_{i=0}^{n samples} w_i * (y_i - \hat{y_i})^2</script>
<script type="math/tex; mode=display">% <![CDATA[
s.t. \hat{y_i} <= \hat{y}_{i+1} %]]></script>
<p>Basically, the degree 0 spline needs to be increasing, as illustrated via the green line below.</p>
<center><img style="width:700px;" src="http://jxieeducation.com/static/img/ir_fit1.png" /></center>
<h3 id="application-1-probability-calibration">Application 1: Probability Calibration</h3>
<p>In the classification setting, we’d often like to obtain an accurate probability of a class. A well-calibrated probability score is the foundation for correctly assessing a situation. For instance, in advertising, we want to know the exact probability that a user clicks on an ad, in order to figure out how much an impression is worth.</p>
<p>However, many classifiers, such as SVMs, output a score instead of a probability. Unlike logistic regression, the output of the classifier can’t be directly interpreted as a probability. For instance, an SVM score indicates how far the sample is from the hyperplane; a random forest classifier rarely gives extreme scores because bagging averages the predictions.</p>
<p>Isotonic Regression comes in as a function that maps the scores to probabilities. In the diagram below, we can see that IR transforms the green curve into the red curve, generalizing the scores into probabilities.</p>
<p>Note: We chose Isotonic Regression instead of a normal spline because we want the probability function of the score to be strictly increasing.</p>
<center><img style="width:500px;" src="http://jxieeducation.com/static/img/ir_calibration.png" /></center>
<h3 id="application-2-non-metric-multi-dimensional-scaling">Application 2: Non-metric Multi Dimensional Scaling</h3>
<p>First, let’s go over what non-metric and Multi-Dimensional Scaling (MDS) mean respectively.</p>
<p><strong>MDS</strong> is a powerful visualization technique that maps high dimensional data into the 2D space. In modern MDS, we typically construct an item-item similarity matrix from high dimensions.</p>
<table>
<thead>
<tr>
<th> </th>
<th>site 1</th>
<th>site 2</th>
<th>site 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>site 1</td>
<td>1</td>
<td>0.8</td>
<td>0.4</td>
</tr>
<tr>
<td>site 2</td>
<td>0.8</td>
<td>1</td>
<td>0.6</td>
</tr>
<tr>
<td>site 3</td>
<td>0.4</td>
<td>0.6</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>Then we learn, via SGD, low-dimensional points such that the similarity between items in low dimensions approximates the similarity in high dimensions.</p>
<p><strong>Non-metric</strong>, on the other hand, means that the quantitative variable at hand (e.g. spiciness: 0, 1, 2, 3, 4) is not scaled properly. The difference between spiciness level 1 and level 2 might be 20 times the difference between level 2 and level 3. Since the data is ordinal, we care about ranks rather than distance metrics, so the similarity of items in high dimensions loses its meaning.</p>
<p>Isotonic Regression comes in handy here by mapping these ordinal data through a monotonically increasing function, which preserves order. Essentially, we construct the item-item similarity matrix from the Isotonic Regression function instead of metrics like Euclidean distance.</p>
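<p>scikit-learn’s MDS can run the non-metric variant directly on the similarity table above; a sketch (the <code>1 - similarity</code> conversion is my own simple choice of dissimilarity):</p>

```python
import numpy as np
from sklearn.manifold import MDS

# item-item similarity matrix from the table above
sim = np.array([[1.0, 0.8, 0.4],
                [0.8, 1.0, 0.6],
                [0.4, 0.6, 1.0]])
dissim = 1.0 - sim   # simple similarity-to-dissimilarity conversion

# metric=False: preserve only the rank order of the dissimilarities
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)   # one 2-D point per site
```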
<h3 id="conclusion">Conclusion</h3>
<p>Isotonic Regression is a special case of splines. It is typically solved via quadratic programming and has some niche applications because of the monotonically increasing constraint.</p>
Sun, 18 Sep 2016 00:00:00 +0000
http://jxieeducation.com/2016-09-18/Probability-Calibration-And-Isotonic-Regression/
isotonic, ml

First Order Optimization Methods<h2 id="optimization-methods">Optimization methods</h2>
<p>One of the biggest mysteries for beginners in neural networks is figuring out when to use which optimization method. This post focuses on describing popular first-order methods: SGD, Momentum, Nesterov Momentum, Adagrad, RMSProp and Adam.</p>
<p><em>Second order methods are not included largely because inverting the Hessian takes too much computation power, and most deep learning researchers find it impractical to use second order methods.</em></p>
<h3 id="stochastic-gradient-descent">Stochastic Gradient Descent</h3>
<p>Using stochastic gradient descent, we simply follow the gradient, scaled by a learning rate, which is typically decayed linearly over time.</p>
<script type="math/tex; mode=display">\theta = \theta - \alpha \cdot \nabla J(\theta)</script>
<p>The problem with SGD is that choosing a proper learning rate is difficult, and it’s easy to get trapped in a saddle point when optimizing a highly non-convex loss function.</p>
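<p>A minimal numpy sketch of the update rule on a toy least-squares problem (the data and fixed learning rate are my own choices):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                           # noiseless targets

theta = np.zeros(3)
alpha = 0.05                                 # fixed learning rate for simplicity
for _ in range(5000):
    i = rng.integers(len(X))                 # pick one random sample
    grad = (X[i] @ theta - y[i]) * X[i]      # gradient of the squared loss at sample i
    theta -= alpha * grad                    # theta = theta - alpha * grad J(theta)
```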
<h3 id="momentum">Momentum</h3>
<p>Stochastic Gradient Descent can oscillate and have trouble converging. Momentum tries to improve on SGD by dampening the oscillation and emphasizing the optimal direction.</p>
<p>If we call the update <script type="math/tex">v</script>, the SGD equation becomes <script type="math/tex">\theta = \theta - v</script>. Momentum combines the past update with the current update, in order to stabilize the updates.</p>
<script type="math/tex; mode=display">v_t = \mu \cdot v_{t-1} + \alpha \cdot \nabla J(\theta)</script>
<p>The value of <script type="math/tex">\mu</script> is usually close to 1 e.g. 0.9, 0.95.</p>
<h3 id="nesterov-momentum">Nesterov Momentum</h3>
<p>Momentum is an improvement on SGD because <script type="math/tex">v_t</script> is combined with <script type="math/tex">v_{t-1}</script>. However, because <script type="math/tex">v_t</script> is very dependent on <script type="math/tex">v_{t-1}</script>, momentum by itself is slow to adapt and change directions.</p>
<p>Nesterov momentum builds on raw momentum. Instead of calculating <script type="math/tex">v_t</script> via <script type="math/tex">\nabla J(\theta_t)</script>, Nesterov tries to calculate <script type="math/tex">\nabla J(\theta_{t + 1})</script>. But how can we calculate the gradient of the parameters in the future? We can’t. However, we can approximate the future parameters by assuming that <script type="math/tex">v_t = \mu \cdot v_{t-1}</script>, which is approximately true.</p>
<p><img src="http://cs231n.github.io/assets/nn3/nesterov.jpeg" alt="nesterov pic" /></p>
<script type="math/tex; mode=display">v_t = \mu \cdot v_{t-1} + \alpha \cdot \nabla J(\theta - \mu \cdot v_{t-1})</script>
<h3 id="adagrad">Adagrad</h3>
<p>Adagrad, RMSProp and Adam take a different approach on how to improve SGD. SGD, Momentum and Nesterov Momentum all have a single learning rate for all parameters. The following 3 methods instead adaptively tune the learning rates for each parameter.</p>
<p>Adagrad normalizes the update for each parameter. Parameters that had large gradients will have smaller updates; small or infrequent gradients will be bumped to take larger steps.</p>
<script type="math/tex; mode=display">v = \dfrac{ \alpha \cdot \nabla J(\theta) }{ \sqrt{ cache } }</script>
<script type="math/tex; mode=display">cache = \sum_{t=1}^n { \nabla J(\theta_{t}) }^2</script>
<p>Adagrad will eventually be unable to make any updates, because the cache grows without bound.</p>
<h3 id="rmsprop">RMSProp</h3>
<p>RMSProp improves on adagrad by decaying the size of the cache. The cache is “leaky” and prevents the updates from becoming 0. Hinton recommends setting <script type="math/tex">\gamma</script> to 0.99 and <script type="math/tex">\alpha</script> to 0.01.</p>
<script type="math/tex; mode=display">cache_t = \gamma \cdot cache_{t-1} + (1 - \gamma) \cdot {\nabla J(\theta_{t})}^2</script>
<h3 id="adam">Adam</h3>
<p>Adam is a variation of RMSProp that also keeps a momentum-like running average of the gradient.</p>
<script type="math/tex; mode=display">m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot \nabla J(\theta_{t})</script>
<script type="math/tex; mode=display">v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot {\nabla J(\theta_{t})}^2</script>
<script type="math/tex; mode=display">\theta_t = \theta_{t-1} - \dfrac{ \alpha \cdot m_t }{ \sqrt( v_t ) }</script>
<h3 id="when-to-use-what">When to use what?</h3>
<p>There is no consensus on which optimization method is the best. In my experience, Adagrad-based methods are a lot safer but slower than momentum-based methods. I usually stick to RMSProp for the first cut of the model and test other methods afterwards.</p>
Sat, 02 Jul 2016 00:00:00 +0000
http://jxieeducation.com/2016-07-02/First-Order-Optimization-Methods/
optimization, ml

Factorization Machines A Theoretical Introduction<h2 id="factorization-machines">Factorization Machines</h2>
<p>This post is going to focus on explaining what factorization machines are and why they are important. The future posts will provide practical modeling examples and a numpy clone implementation of factorization machines.</p>
<h3 id="what-problem-do-factorization-machines-solve">What problem do Factorization Machines solve?</h3>
<p>TL;DR: FMs are a combination of linear regression and matrix factorization that models sparse feature interactions but in linear time.</p>
<p>Normally when we think of linear regression, we think of this formula.</p>
<script type="math/tex; mode=display">y = w_0 + \sum_{i=1}^n w_i x_i</script>
<p>In the formula above, the run time is <script type="math/tex">O(n)</script> where <script type="math/tex">n</script> is the number of features. When we consider quadratic feature interactions, the complexity increases to <script type="math/tex">O(n^2)</script>, in the formula below.</p>
<script type="math/tex; mode=display">y = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n w_{ij} x_i x_j</script>
<p>Now consider a very sparse set of features: the runtime blows up. In most cases, we instead model a very limited set of feature interactions to manage the complexity.</p>
<h3 id="how-do-factorization-machines-solve-the-problem">How do Factorization Machines solve the problem?</h3>
<p>In the recommendation problem space, we have historically dealt with the sparsity problem with a well documented technique called (non-negative) matrix factorization.</p>
<p><img src="http://data-artisans.com/img/blog/factorization.svg" alt="matrix factorization diagram" /></p>
<p>We factorize the sparse user-item matrix (r) <script type="math/tex">\in \mathbb{R}^{U \times I}</script> into a user matrix (u) <script type="math/tex">\in \mathbb{R}^{U \times K}</script> and an item matrix (i) <script type="math/tex">\in \mathbb{R}^{I \times K}</script>, where <script type="math/tex">K \ll U</script> and <script type="math/tex">K \ll I</script>.</p>
<p>User (<script type="math/tex">u_i</script>)’s preference for item <script type="math/tex">i_j</script> can be approximated by <script type="math/tex">u_i \cdot i_j</script>.</p>
<p>Factorization Machines take inspiration from matrix factorization, and model the feature interactions using latent vectors of size <script type="math/tex">K</script>. As a result, every sparse feature <script type="math/tex">f_i</script> has a corresponding latent vector <script type="math/tex">v_i</script>, and the interaction between two features is modeled as <script type="math/tex">v_i \cdot v_j</script>.</p>
<h3 id="factorization-machines-math">Factorization Machines Math</h3>
<script type="math/tex; mode=display">% <![CDATA[
y = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n <v_i, v_j> x_i x_j %]]></script>
<p>Instead of modeling feature interactions explicitly, factorization machines use the dot product of the features’ latent vectors. The model learns <script type="math/tex">v_i</script> implicitly during training via techniques like gradient descent.</p>
<p>The intuition is that each feature will learn a dense encoding, with the property that two features with high positive correlation have a high dot product value and vice versa.</p>
<p>Of course, the latent vectors can only encode so much. There is an expected hit on accuracy compared to using conventional quadratic interactions. The benefit is that this model will run in linear time.</p>
<h3 id="fm-complexity">FM complexity</h3>
<p>For the full proof, see lemma 3.1 in Rendle’s <a href="http://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf">paper</a>.</p>
<script type="math/tex; mode=display">% <![CDATA[
\sum_{i=1}^n \sum_{j=i+1}^n <v_i, v_j> x_i x_j %]]></script>
<script type="math/tex; mode=display">% <![CDATA[
= \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n <v_i, v_j> x_i x_j
- \frac{1}{2} \sum_{i=1}^n <v_i, v_i> x_i x_i %]]></script>
<script type="math/tex; mode=display">= \frac{1}{2} \sum_{f=1}^k ((\sum_{i=1}^n v_{i,f} x_i )^2 - \sum_{i=1}^n v_{i,f}^2 x_i^2 )</script>
<p>Since <script type="math/tex">\sum_{i=1}^n v_{i,f} x_i</script> can be precomputed, the complexity is reduced to <script type="math/tex">O(KN)</script>.</p>
<p><em>Now that we have a theoretical understanding, in the next post, we will discuss practical use cases for factorization machines.</em></p>
<h3 id="benchmark-added-sep-24-2016">Benchmark (added Sep 24, 2016)</h3>
<p>On the standard recommendation dataset, MovieLens-100k, using Vowpal Wabbit, normal linear regression achieved an MSE of 0.9652, while an FM achieved an MSE of 0.9140.</p>
<p>For a guide on how to do Factorization Machines via Vowpal Wabbit, <a href="https://github.com/JohnLangford/vowpal_wabbit/wiki/Matrix-factorization-example">see here</a>.</p>
Sun, 26 Jun 2016 00:00:00 +0000
http://jxieeducation.com/2016-06-26/Factorization-Machines-A-Theoretical-Introduction/
factorizationmachines, ml

Document Similarity With Word Movers Distance<h3 id="document-similarity-with-word-movers-distance">Document Similarity with Word Mover’s Distance</h3>
<p>While Word2Vec is universally adopted in most modern NLP projects, document embedding and similarity have seen less success. There are many different approaches out there.</p>
<p><img src="http://cdn-ak.f.st-hatena.com/images/fotolife/T/TJO/20140619/20140619150536.png" alt="w2v" /></p>
<h4 id="popular-document-embedding-and-similarity-measures">Popular document embedding and similarity measures:</h4>
<ul>
<li>Doc2Vec</li>
<li>Average w2v vectors</li>
<li>Weighted average w2v vectors (e.g. tf-idf)</li>
<li>RNN-based embeddings (e.g. deep LSTM networks)</li>
</ul>
<p>I am going to try to explain another approach called Word Mover’s Distance (WMD) by <a href="http://mkusner.github.io/">Matt Kusner</a>.</p>
<h3 id="high-level-intuition">High Level Intuition</h3>
<h4 id="i-have-2-sentences">I have 2 sentences:</h4>
<ul>
<li>Obama speaks to the media in Illinois</li>
<li>The president greets the press in Chicago</li>
</ul>
<h4 id="removing-stop-words">Removing stop words:</h4>
<ul>
<li>Obama speaks media Illinois</li>
<li>president greets press Chicago</li>
</ul>
<p>My sparse vectors for the 2 sentences have no common words and will have a cosine distance of 1. This is a terrible distance score because the 2 sentences have very similar meanings. Word Mover’s Distance solves this problem by taking into account the words’ similarities in word embedding space.</p>
<p><img src="https://vene.ro/images/wmd-obama.png" alt="WMD" /></p>
<p>In this particular two-sentence problem, each word is mapped to its closest counterpart, and the distance is calculated as the sum of these Euclidean distances.</p>
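<p>This nearest-counterpart matching is actually the “relaxed” lower bound of WMD (the full version solves a flow problem, described later); a sketch with made-up 2-D embeddings standing in for real word vectors:</p>

```python
import numpy as np

# toy 2-D embeddings (hypothetical values, just for illustration)
emb = {"obama": [0.9, 0.1], "president": [0.8, 0.2],
       "speaks": [0.1, 0.9], "greets": [0.2, 0.8],
       "media": [0.5, 0.5], "press": [0.55, 0.45],
       "illinois": [0.3, 0.3], "chicago": [0.35, 0.25]}

def relaxed_wmd(doc1, doc2):
    # each word in doc1 moves all of its mass (1 / len(doc1))
    # to its nearest word in doc2
    total = 0.0
    for w1 in doc1:
        dists = [np.linalg.norm(np.array(emb[w1]) - np.array(emb[w2]))
                 for w2 in doc2]
        total += min(dists) / len(doc1)
    return total

d_close = relaxed_wmd(["obama", "speaks", "media", "illinois"],
                      ["president", "greets", "press", "chicago"])
d_far = relaxed_wmd(["obama", "speaks", "media", "illinois"],
                    ["press", "press", "press", "press"])  # unrelated target
```

The semantically similar pair should come out with a smaller distance than the unrelated one.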
<p>In practice, WMD is very good at finding short documents (10 words or less) that are similar to each other.</p>
<p>Here are some of my results from training on IMDB.</p>
<p>Original sentence:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>this show was incredible i ve seen all three and this is the best...
</code></pre>
</div>
<p>Results for <code class="highlighter-rouge">three</code>:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>this is the first michael vartan movie i ve seen i haven t seen alias
sure titanic ... but you really should see it a second time
</code></pre>
</div>
<p>Results for <code class="highlighter-rouge">show</code>:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>this is one of the best movies i ve ever seen it has very good acting by ha
i d have to say this is one of the best animated films i ve ever seen
</code></pre>
</div>
<p>Results for <code class="highlighter-rouge">best</code>:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>this show is wonderful it has some of the best writing i ever seen it has b
this is absolutely one of the best movies i ve ever seen
</code></pre>
</div>
<h3 id="flow">Flow</h3>
<p>It’s very intuitive when all the words line up with each other, but what happens when the numbers of words are different?</p>
<h4 id="example">Example:</h4>
<ol>
<li>Obama speaks to the media in Illinois –> Obama speaks media Illinois –> 4 words</li>
<li>The president greets the press –> president greets press –> 3 words</li>
</ol>
<p>WMD stems from an optimization problem called the Earth Mover’s Distance, which has been applied to tasks like image search. EMD is an optimization problem that tries to solve for <strong>flow</strong>.</p>
<p>Every unique word (out of N total) is given a flow of <code class="highlighter-rouge">1 / N</code>. Each word in sentence 1 has a flow of <code class="highlighter-rouge">0.25</code>, while each in sentence 2 has a flow of <code class="highlighter-rouge">0.33</code>. Like with liquid, what goes out must sum to what went in.</p>
<p><img src="http://jxieeducation.com/static/img/wmd_flow.png" alt="flow" /></p>
<p>Since <code class="highlighter-rouge">1/3 > 1/4</code>, excess flow from words in the bottom also flows towards the other words. At the end of the day, this is an optimization problem to minimize the distance between the words.</p>
<p><img src="http://jxieeducation.com/static/img/wmd_optimization_equation.png" alt="eqn" /></p>
<p><code class="highlighter-rouge">T</code> is the flow and <code class="highlighter-rouge">c(i,j)</code> is the Euclidean distance between words i and j.</p>
<h3 id="the-tradeoff-between-granularity--complexity">The Tradeoff between Granularity & Complexity</h3>
<p>All of my code for this experiment is <a href="https://github.com/PragmaticLab/Word_Mover_Distance">here</a>. Feel free to try it out yourself. If I had more time, I’d definitely experiment more by trying out tf-idf weighted flow and visualizing documents in t-SNE.</p>
<h4 id="pros-of-wmd">Pros of WMD:</h4>
<ul>
<li>Very accurate for small documents, KNN methods can beat supervised embedding based ones</li>
<li>Interpretable for small documents unlike embeddings</li>
</ul>
<h4 id="cons-of-wmd">Cons of WMD:</h4>
<ul>
<li>No embeddings for a document, meaning a lot of computation to get pairwise distance</li>
<li>Slow unless you relax the lower bound</li>
<li>Does not scale to large documents because the flow becomes very convoluted between similar words</li>
</ul>
<p>My personal thoughts are that while WMD has some very nice theoretical properties, currently it doesn’t scale well enough to be adopted in production.</p>
Mon, 13 Jun 2016 00:00:00 +0000
http://jxieeducation.com/2016-06-13/Document-Similarity-With-Word-Movers-Distance/
word2vec, wordembedding, ml

Translating W2V Embeddings From One Space To Another<h3 id="the-problem-with-word-embeddings">The Problem With Word Embeddings</h3>
<p>Refreshing Word2Vec embeddings is a pain. Say that you’d like to change the dimensions of your embeddings or use a better trained model: you’d need to recalculate almost everything you have used embeddings for, because the new embeddings are totally different. This applies to any modeling, unsupervised learning or visualization.</p>
<h3 id="the-solution">The Solution</h3>
<p>It turns out that maybe you don’t need to recalculate everything. Word2Vec embeddings can be translated from one space to another as long as the relationships between words stay the same.</p>
<p>I trained 2 different w2v models on the IMDB reviews. The vocabulary and corpus stayed the same. However, the embeddings for the words are totally different.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>model1 - "one" : 3.52521874e-02, 4.00067866e-03, 2.30610892e-02
</code></pre>
</div>
<div class="highlighter-rouge"><pre class="highlight"><code>model2 - "one" : -5.90761974e-02, -3.17945816e-02, 1.26407698e-01
</code></pre>
</div>
<h3 id="geometric-similarity">Geometric Similarity</h3>
<p>Fortunately, since the words still have the same relationships, they share a similar geometric structure. I applied PCA on the words “zero”, “one”, “two”, “three”, “four” from both models and visualized the relationships between the words.</p>
<center><img src="https://raw.githubusercontent.com/PragmaticLab/EmbeddingMapper/master/pca_visualization/left.png" alt="Drawing" style="width: 400px;" /><img src="https://raw.githubusercontent.com/PragmaticLab/EmbeddingMapper/master/pca_visualization/right.png" alt="Drawing" style="width: 400px;" /></center>
<p>We can see that the geometry is very similar, but not quite the same. Hint: The angle between 0 and 1 is different between the first and second model.</p>
<p>Here is Mikolov’s PCA visualization from his <a href="http://arxiv.org/pdf/1309.4168.pdf">paper</a>, where the numbers’ embeddings came from an English model and a Spanish model.</p>
<center><img src="http://jxieeducation.com/static/img/word_translation_visualization.png" alt="Drawing" style="width: 1000px;" /></center>
<p>This geometric similarity in the PCA visualization suggests that there is a fairly decent linear mapping between models of the same word relationships.</p>
<h3 id="training-the-translation-mechanism">Training the Translation Mechanism</h3>
<p>The next step is to learn a translation matrix to connect the two models.</p>
<p>We can extract the training set by taking a small number of embeddings from the source model and the target model. For my experiment, I used 2000 words. Mikolov trained on 5000 English words and their Spanish counterparts via Google Translate. As a general rule, it’s better to use the most frequent vocabulary to establish the mapping, because their embeddings are the best defined.</p>
<p>The actual training can be done easily through <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">Scikit-Learn’s Linear Regression</a>.</p>
<p>The mapping is quite simple.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>[Original embeddings] dot [translation matrix] = [New embeddings]
</code></pre>
</div>
<p>If I am translating 2000 500-dimensional embeddings to 200 dimensions:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>[2000 x 500] dot [500 x 200] = [2000 x 200]
</code></pre>
</div>
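<p>A sketch of the whole procedure on synthetic embeddings (the hidden linear map and noise level are my own stand-ins for two real w2v models; smaller dimensions keep the example quick):</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_words, d_src, d_tgt = 2000, 50, 20

# synthetic stand-ins for the two models' embeddings: the target space is
# a hidden linear transform of the source space, plus a little noise
src = rng.normal(size=(n_words, d_src))
hidden_map = rng.normal(size=(d_src, d_tgt))
tgt = src @ hidden_map + 0.01 * rng.normal(size=(n_words, d_tgt))

reg = LinearRegression(fit_intercept=False).fit(src, tgt)
translation = reg.coef_.T              # the [d_src x d_tgt] translation matrix

new_embeddings = src @ translation     # [2000 x 50] dot [50 x 20] = [2000 x 20]
```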
<h3 id="evaluating-the-translation">Evaluating the Translation</h3>
<p>To evaluate the quality of the translation, we can use words that we didn’t hold out for training. I recommend measuring how often the translated embedding of a word maps back to that same word in the target space.</p>
<p>For instance, if “one” is mapped to “one”, “a” and “an” in the new space, it is a hit. However, if “one” is mapped to “time”, “star” and “tree”, it is a miss.</p>
<h3 id="applications">Applications</h3>
<h4 id="the-method-can-be-applied-in-many-ways">The method can be applied in many ways:</h4>
<ul>
<li>Introducing new vocabulary to your model</li>
<li>Translating between languages (e.g. English - French, Android - iOS)</li>
<li>Increasing or decreasing the dimensionality of your embeddings</li>
</ul>
Mon, 06 Jun 2016 00:00:00 +0000
http://jxieeducation.com/2016-06-06/Translating-W2V-Embedding-From-One-Space-To-Another/
http://jxieeducation.com/2016-06-06/Translating-W2V-Embedding-From-One-Space-To-Another/word2vec,wordembeddingmlHogwild Stochastic Gradient Descent<h2 id="parallelizing-sgd">Parallelizing SGD</h2>
<p>Stochastic gradient descent differs from batch gradient descent in that the weights are updated using a single sample rather than the whole dataset, so the weight vector is updated very frequently.</p>
<hr />
<h3 id="how-sgd-works">How SGD works</h3>
<p>Repeat until convergence:</p>
<ol>
<li>pick a random element in the training set</li>
<li>update the weights with the example
<script type="math/tex">\theta \leftarrow (\theta - \alpha \nabla L(f(x_i), y_i))</script></li>
</ol>
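<p>The two steps above can be sketched in a few lines. This toy version fits a least-squares model (my choice of loss for illustration; the update rule is the same \(\theta \leftarrow \theta - \alpha \nabla L\) from above):</p>

```python
import numpy as np

def sgd(X, y, alpha=0.01, epochs=500, seed=0):
    """Plain SGD for least squares: repeatedly pick a random sample i and
    apply theta <- theta - alpha * grad of 0.5 * (x_i . theta - y_i)^2."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs * len(y)):
        i = rng.integers(len(y))             # 1. pick a random element
        grad = (X[i] @ theta - y[i]) * X[i]  # gradient at that single sample
        theta -= alpha * grad                # 2. update the weights with it
    return theta

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -3.0])  # noiseless targets from true weights [2, -3]
theta_hat = sgd(X, y)
print(theta_hat)  # close to [2, -3]
```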
<p>Because of the fast updates, SGD has the following properties:</p>
<ol>
<li>Unlike GD, an individual SGD update can either decrease or increase the loss</li>
<li>SGD converges faster than GD because the weights are updated much more frequently</li>
<li>SGD takes much longer to settle exactly at the optimum, which also makes it less likely to overfit</li>
</ol>
<p><img src="http://jxieeducation.com/static/img/sgdvsgd.png" alt="SGD vs GD" /></p>
<hr />
<h3 id="parallelizing-sgd-1">Parallelizing SGD</h3>
<p>Looking at the SGD update rule, it appears hard to parallelize, because the weight updates are sequential: each small, quick update must finish before the next can begin.</p>
<p>When multithreading, a single thread generally takes only a few microseconds to compute a new weight, but may wait several milliseconds to obtain permission (the lock) to apply it. It’s like spending a few seconds filling out a form, then waiting months for the document to be processed.</p>
<p>Fortunately, scientists at the University of Wisconsin-Madison discovered that <a href="https://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf">SGD doesn’t need to be synchronous</a>.</p>
<p>Although weight updates inevitably overwrite each other, the absence of locks lets SGD perform many times more updates overall. Going back to the three properties of SGD listed earlier, (2) and (3) allow asynchronous SGD to yield good results in a fraction of the time.</p>
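<p>A minimal lock-free sketch of the idea, with several threads hammering a shared weight vector and tolerating the races. (Caveat on my part: CPython’s GIL means this doesn’t show real parallel speedup; true Hogwild implementations use shared memory across processes or native threads. The point here is only that correctness survives unsynchronized updates.)</p>

```python
import numpy as np
import threading

# Shared model state, updated by every worker WITHOUT a lock (Hogwild-style).
theta = np.zeros(2)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -3.0])  # noiseless targets from true weights [2, -3]

def worker(n_steps, alpha=0.01, seed=0):
    global theta
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = rng.integers(len(y))
        grad = (X[i] @ theta - y[i]) * X[i]
        theta -= alpha * grad  # no lock: occasional overwrites are tolerated

threads = [threading.Thread(target=worker, args=(2000, 0.01, s)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(theta)  # still close to [2, -3] despite the unsynchronized updates
```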
<hr />
<h3 id="experimental-results-on-the-connect-4-dataset">Experimental Results on the <a href="https://archive.ics.uci.edu/ml/datasets/Connect-4">Connect 4 Dataset</a></h3>
<p>The dataset’s features are the states of each square on the board. Each square can be B(lank), X or O, and the target variable is (w)in, (d)raw or (l)oss. The features and target are normalized and put through a logistic regression classifier trained via SGD.</p>
<p>I was able to consistently achieve 90-92% accuracy with the <a href="https://github.com/jxieeducation/HogwildSGD/blob/master/connect4/sync.py">raw</a>, <a href="https://github.com/jxieeducation/HogwildSGD/blob/master/connect4/async_lock.py">lock-threaded</a> and <a href="https://github.com/jxieeducation/HogwildSGD/blob/master/connect4/async_nolock.py">nolock-threaded</a> implementations.</p>
<p>Even though accuracy did not improve, the nolock-threaded implementation proved to be much faster than the lock-threaded one.</p>
<p><img src="http://jxieeducation.com/static/img/sgd-no-lock.png" alt="SGD lock vs nolock" /></p>
<h3 id="caveat">Caveat</h3>
<p>I expected the nolock-threaded speed to plateau while the number of threads is small, because given the asynchronous nature of the weight updates, runtime should not increase linearly. However, this hypothesis needs to be tested on a much more powerful machine before anything is conclusive.</p>
Tue, 27 Oct 2015 00:00:00 +0000
http://jxieeducation.com/2015-10-27/Hogwild-Stochastic-Gradient-Descent/
http://jxieeducation.com/2015-10-27/Hogwild-Stochastic-Gradient-Descent/SGDmlDropout Ensembling In Neural Nets<h2 id="dropout-in-neuro-nets">Dropout in Neural Nets</h2>
<p>As neural nets get larger, it becomes especially easy to overfit. Suppose we have 20,000 neurons across 20 layers but only 500 training examples: it’s very easy for the net to memorize the shape of the data and come up with some crazy, very high-dimensional model just to fit it.</p>
<p>Another way to think about it: pretend we are studying for a big physics exam, but the professor provided only one practice problem. Because there is only one, we spend a very long time making sure we can solve it perfectly. Yet we do poorly on the exam, because we tunneled on that specific problem.</p>
<hr />
<h3 id="how-does-dropout-work">How does dropout work?</h3>
<p>Normally, the input for a neuron is this:</p>
<script type="math/tex; mode=display">y = W^T X</script>
<p>However, the dropout says that we should randomly shut off some weights, effectively turning off some neurons. The dropout equation works like this:</p>
<script type="math/tex; mode=display">y = W^T (R \odot X)</script>
<script type="math/tex; mode=display">% <![CDATA[
y =
\begin{bmatrix}w_1 & w_2 & w_3 \end{bmatrix}
\left( \begin{bmatrix}1 \\ 0 \\ 1 \end{bmatrix}
\odot
\begin{bmatrix}x_1 \\ x_2 \\ x_3 \end{bmatrix} \right) %]]></script>
<p>Here R is a vector of <code class="highlighter-rouge">0</code>s and <code class="highlighter-rouge">1</code>s. With 30% dropout, 30% of its entries are <code class="highlighter-rouge">0</code>s and 70% are <code class="highlighter-rouge">1</code>s, so 30% of the input neurons are turned off.</p>
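<p>In code, the mask R is just a Bernoulli sample. A small sketch of the masked forward pass for a single linear unit (at test time the mask is dropped; common practice is to rescale activations by the keep probability, which I omit here for brevity):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, w, p=0.3):
    """Zero out a random fraction p of the inputs, mimicking the mask R."""
    r = (rng.random(x.shape) >= p).astype(x.dtype)  # 1 w.p. 1-p, 0 w.p. p
    return w @ (r * x)                              # y = W^T (r * x)

# With all-ones inputs and weights, the output counts surviving inputs.
x = np.ones(1000)
w = np.ones((1, 1000))
y = dropout_forward(x, w, p=0.3)
print(y)  # roughly 700: about 70% of the 1000 inputs survive the mask
```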
<hr />
<h3 id="experiment-on-the-titanic-dataset">Experiment on the Titanic dataset</h3>
<p>Here is the topology of my net.
<img src="http://jxieeducation.com/static/img/dropout-net-diagram.png" alt="dropout diagram" /></p>
<p>Then, to understand dropout, I did a grid search over the dropout percentage and recorded the loss on the training and test sets.</p>
<p><img src="http://jxieeducation.com/static/img/dropout-grid.png" alt="dropout res" /></p>
<p>Here we see that we are overfitting like crazy when the dropout is small, because the training loss is much smaller than the test loss. However, as the dropout percentage increases, the training loss and test loss converge.</p>
<hr />
<h3 id="intuition">Intuition</h3>
<p>When a neural net is large, it’s easy for neurons to co-adapt, meaning they work together to fit the training data as closely as possible. This leads to overfitting, and the net loses much of its predictive power.</p>
<p>When we applied dropout, the training loss increased. This is because we are preventing groups of neurons from overfitting with convoluted joint features; instead, the network is forced to learn multiple independent representations of the data. (A neuron can’t co-adapt with partners that are randomly absent.)</p>
<p>We can think of the dropout layer as a regularization mechanism that penalizes overfitting.</p>
Wed, 21 Oct 2015 00:00:00 +0000
http://jxieeducation.com/2015-10-21/Dropout-Ensembling-In-Neural-Nets/
http://jxieeducation.com/2015-10-21/Dropout-Ensembling-In-Neural-Nets/neuronetworkoverfittingml