<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Sihan's Blog</title><link>https://blog.sihanwei.org/</link><description>Recent content on Sihan's Blog</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Sihan Wei</copyright><atom:link href="https://blog.sihanwei.org/index.xml" rel="self" type="application/rss+xml"/><item><title>Teaching Reflection (I)</title><link>https://blog.sihanwei.org/p/teaching-reflection-i/</link><pubDate>Sat, 19 Jul 2025 22:54:24 -0400</pubDate><guid>https://blog.sihanwei.org/p/teaching-reflection-i/</guid><description>&lt;p>I have been a teaching assistant for many years, but I did not realize how difficult teaching is until I started teaching my own course.&lt;/p>
&lt;p>It is often said that being good at a subject does not necessarily make someone a good teacher. That point is well understood. What I find to be an even more subtle and exhausting challenge is this: as a non-native speaker, I have to explain complex concepts in a second language to native speakers while simultaneously paying attention to tone and word choice. I need to make sure my expression is not interpreted as condescending, and that I do not discourage students, even unintentionally.&lt;/p>
&lt;p>When a student gives an incorrect answer, even one that is far off, I cannot simply say “no.” I have to acknowledge their effort, validate any part of their reasoning that makes sense, and gently redirect them. This kind of emotional labor is already demanding for any instructor. For someone teaching in a second language, it requires an added layer of constant mental effort.&lt;/p>
&lt;p>It takes time, self-awareness, and a great deal of patience — not only with students, but also with myself.&lt;/p></description></item><item><title>Changelog</title><link>https://blog.sihanwei.org/changelog/</link><pubDate>Sat, 05 Apr 2025 22:19:18 -0400</pubDate><guid>https://blog.sihanwei.org/changelog/</guid><description>&lt;p>This is where I keep track of changes to the blog — new posts, series, structure tweaks, and tiny milestones.&lt;br>
A little timeline for this little internet corner.&lt;/p>
&lt;hr>
&lt;h2 id="2025">2025
&lt;/h2>&lt;ul>
&lt;li>&lt;strong>2025.04.03&lt;/strong> — Started planning a new series, &lt;em>Unified View&lt;/em>, inspired by the FTRL vs OMD post. It will explore connections between seemingly different ML concepts.&lt;/li>
&lt;li>&lt;strong>2025.03.02&lt;/strong> — Decided to document my learning process in long-form series. First up: &lt;em>Math Tricks for ML&lt;/em> and &lt;em>SVM&lt;/em>.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="2024">2024
&lt;/h2>&lt;ul>
&lt;li>&lt;strong>2024.10.18&lt;/strong> — Picked the blog back up, after digging into the connection between FTRL and OMD.&lt;br>
That rabbit hole reminded me why I started writing in the first place.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="2020">2020
&lt;/h2>&lt;ul>
&lt;li>&lt;strong>2020.07.24&lt;/strong> — Got my domain &lt;code>sihanwei.org&lt;/code> on Netlify for &lt;span>$&lt;/span>10.99/year (now &lt;span>$&lt;/span>14.99/year thanks to inflation!)&lt;/li>
&lt;li>&lt;strong>2020.06.27&lt;/strong> — Moved my blog from GitHub Pages to Netlify. &lt;em>(too lazy to build/deploy manually, to be honest)&lt;/em>&lt;/li>
&lt;li>&lt;strong>2017–2018&lt;/strong> — Found GitHub Pages &amp;amp; Hexo during senior year. Tried building an academic homepage to help with grad school apps. Eventually switched to Hugo — it’s fast, simple, and, let’s face it, pretty cool.&lt;/li>
&lt;/ul></description></item><item><title>Support Vector Machines (Part III): What Is a Support Vector?</title><link>https://blog.sihanwei.org/p/support-vector-machines-part-iii-what-is-a-support-vector/</link><pubDate>Wed, 26 Mar 2025 21:29:58 -0400</pubDate><guid>https://blog.sihanwei.org/p/support-vector-machines-part-iii-what-is-a-support-vector/</guid><description>&lt;img src="https://blog.sihanwei.org/p/support-vector-machines-part-iii-what-is-a-support-vector/sv-cover.jpg" alt="Featured image of post Support Vector Machines (Part III): What Is a Support Vector?" />&lt;h2 id="motivation">Motivation
&lt;/h2>&lt;p>As a former TA for machine learning courses—and a learner myself—I’ve noticed that many beginners encounter support vectors as abstract definitions in lecture slides or textbooks. While technically correct, these explanations often lack a visual or intuitive component, making it difficult to see which data points actually matter in practice.&lt;/p>
&lt;p>This post is my attempt to bridge that gap. We’ll revisit what support vectors are, why they matter, and—most importantly—how to recognize them visually. By the end, you should be able to look at the plot of a trained SVM and confidently identify the support vectors: the handful of data points that directly define the decision boundary.&lt;/p>
&lt;hr>
&lt;h2 id="the-role-of-lagrange-multipliers">The Role of Lagrange Multipliers
&lt;/h2>&lt;p>In the SVM framework, each training data point is associated with a Lagrange multiplier, denoted \(\alpha_i\). The optimal weight vector, which determines the decision boundary, is computed as:&lt;/p>
&lt;p>\[
\mathbf{w}^* = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i
\]&lt;/p>
&lt;p>Only those data points for which \(\alpha_i &amp;gt; 0\) contribute to this sum. In other words, if a data point’s corresponding multiplier is zero, it has no impact on the decision boundary. These influential points are what we call &lt;strong>support vectors&lt;/strong>.&lt;/p>
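&lt;p>If you want to inspect these multipliers yourself, here is a minimal sketch (the dataset and variable names are illustrative): scikit-learn’s &lt;code>SVC&lt;/code> stores \(y_i \alpha_i\) for the support vectors in &lt;code>dual_coef_&lt;/code>, and the corresponding training indices in &lt;code>support_&lt;/code>.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-python" data-lang="python">import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative separable dataset
X, y = make_blobs(n_samples=20, centers=2, random_state=6, cluster_std=1.2)
svm = SVC(C=1.0, kernel=&amp;#39;linear&amp;#39;).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors only;
# alpha_i is implicitly zero for every other training point.
alpha = np.zeros(len(X))
alpha[svm.support_] = np.abs(svm.dual_coef_[0])
print(alpha)  # the non-zero entries are exactly the support vectors
&lt;/code>&lt;/pre>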
&lt;hr>
&lt;h2 id="support-vectors-in-hard-margin-svm">Support Vectors in Hard-Margin SVM
&lt;/h2>&lt;p>For datasets that are perfectly separable, we use a hard-margin SVM. Here, the complementary slackness condition tells us:&lt;/p>
&lt;p>\[
\alpha_i(1 - y_i\mathbf{w}^\top\mathbf{x}_i) = 0
\]&lt;/p>
&lt;p>This equation means that for points &lt;em>not&lt;/em> on the margin (where \(y_i\mathbf{w}^\top\mathbf{x}_i &amp;gt; 1\)), the multiplier \(\alpha_i\) must be 0. Hence, only those points that lie exactly &lt;em>on&lt;/em> the margin (where \(y_i\mathbf{w}^\top\mathbf{x}_i = 1\)) have non-zero multipliers and are thus support vectors.&lt;/p>
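&lt;p>As a quick numerical check (a minimal sketch under the same illustrative setup, with a large \(C\) so the soft-margin solver approximates the hard margin), the decision function should evaluate to \(\pm 1\) at the support vectors:&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-python" data-lang="python">from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Separable toy data; a very large C approximates the hard-margin SVM
X, y = make_blobs(n_samples=20, centers=2, random_state=6, cluster_std=1.2)
svm = SVC(C=1e6, kernel=&amp;#39;linear&amp;#39;).fit(X, y)

# By complementary slackness, support vectors lie exactly on the margin,
# so the decision function is +1 or -1 there (up to numerical tolerance).
print(svm.decision_function(svm.support_vectors_))
&lt;/code>&lt;/pre>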
&lt;h3 id="try-yourself">Try yourself!
&lt;/h3>&lt;p>Can you identify the support vectors in the plot below?&lt;/p>
&lt;figure>
&lt;img src="hard_margin.png" class="clickable-image w-60">
&lt;figcaption>Hard-Margin SVM&lt;/figcaption>
&lt;/figure>
&lt;details>
&lt;summary>✅ Click to reveal the answer&lt;/summary>
&lt;p>There are &lt;strong>3&lt;/strong> support vectors in the plot above. These are the data points that lie &lt;em>exactly&lt;/em> on the margin boundaries. They are the only points with non-zero Lagrange multipliers $\alpha_i > 0$ and directly influence the position of the decision boundary.&lt;/p>
&lt;figure>
&lt;img src="hard_margin_marked.png" class="clickable-image w-60">
&lt;figcaption>Support Vectors in Hard-Margin SVM&lt;/figcaption>
&lt;/figure>
&lt;/details>
&lt;h3 id="code-for-generating-the-plots">Code for generating the plots
&lt;/h3>&lt;p>Want to experiment yourself? Below is the full code used to generate the plots. Try adjusting the &lt;code>random_state&lt;/code> in &lt;code>make_blobs&lt;/code> to generate different datasets and see how the support vectors change!&lt;/p>
&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;span class="lnt">13
&lt;/span>&lt;span class="lnt">14
&lt;/span>&lt;span class="lnt">15
&lt;/span>&lt;span class="lnt">16
&lt;/span>&lt;span class="lnt">17
&lt;/span>&lt;span class="lnt">18
&lt;/span>&lt;span class="lnt">19
&lt;/span>&lt;span class="lnt">20
&lt;/span>&lt;span class="lnt">21
&lt;/span>&lt;span class="lnt">22
&lt;/span>&lt;span class="lnt">23
&lt;/span>&lt;span class="lnt">24
&lt;/span>&lt;span class="lnt">25
&lt;/span>&lt;span class="lnt">26
&lt;/span>&lt;span class="lnt">27
&lt;/span>&lt;span class="lnt">28
&lt;/span>&lt;span class="lnt">29
&lt;/span>&lt;span class="lnt">30
&lt;/span>&lt;span class="lnt">31
&lt;/span>&lt;span class="lnt">32
&lt;/span>&lt;span class="lnt">33
&lt;/span>&lt;span class="lnt">34
&lt;/span>&lt;span class="lnt">35
&lt;/span>&lt;span class="lnt">36
&lt;/span>&lt;span class="lnt">37
&lt;/span>&lt;span class="lnt">38
&lt;/span>&lt;span class="lnt">39
&lt;/span>&lt;span class="lnt">40
&lt;/span>&lt;span class="lnt">41
&lt;/span>&lt;span class="lnt">42
&lt;/span>&lt;span class="lnt">43
&lt;/span>&lt;span class="lnt">44
&lt;/span>&lt;span class="lnt">45
&lt;/span>&lt;span class="lnt">46
&lt;/span>&lt;span class="lnt">47
&lt;/span>&lt;span class="lnt">48
&lt;/span>&lt;span class="lnt">49
&lt;/span>&lt;span class="lnt">50
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">numpy&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">np&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">matplotlib.pyplot&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">plt&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">sklearn&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">datasets&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">sklearn.svm&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">SVC&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Generate a dataset&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">X&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">make_blobs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">n_samples&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">20&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">centers&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">random_state&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">6&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">cluster_std&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">1.2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Create a soft margin SVM classifier&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">svm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">SVC&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">C&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">1.0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">kernel&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;linear&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">svm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">X&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Plot the data&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">figure&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">figsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">8&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">6&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">markers&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;o&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;x&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, marker=&amp;#39;o&amp;#39;, edgecolors=&amp;#39;k&amp;#39;)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">class_value&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">marker&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">markers&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">scatter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">X&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">y&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">class_value&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">X&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">y&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">class_value&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">s&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">80&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">marker&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">marker&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Class &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">class_value&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Plot the decision boundary&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gca&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">xlim&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">ax&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_xlim&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ylim&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">ax&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_ylim&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Create grid to evaluate model&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">xx&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">linspace&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xlim&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">xlim&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="mi">30&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">yy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">linspace&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ylim&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">ylim&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="mi">30&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">YY&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">XX&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">meshgrid&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yy&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">xx&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">xy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">vstack&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="n">XX&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ravel&lt;/span>&lt;span class="p">(),&lt;/span> &lt;span class="n">YY&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ravel&lt;/span>&lt;span class="p">()])&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">T&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">Z&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">svm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">decision_function&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xy&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">reshape&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">XX&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Plot decision boundary and margins&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contour&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">XX&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">YY&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Z&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">colors&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;k&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">levels&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">alpha&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">0.5&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">linestyles&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;--&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;-&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;--&amp;#39;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">num_support_vectors&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">svm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">support_vectors_&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;The number of suppost vectors is &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">num_support_vectors&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">.&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Highlight the support vectors&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">sv&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">svm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">support_vectors_&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">scatter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">sv&lt;/span>&lt;span class="p">[:,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">sv&lt;/span>&lt;span class="p">[:,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">s&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">150&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">linewidth&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">facecolors&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;none&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">edgecolors&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;red&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tick_params&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;both&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labelsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xlabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Feature 1&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">fontsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Feature 2&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">fontsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># plt.savefig(&amp;#39;hard_margin_marked.png&amp;#39;, dpi=300)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">legend&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fontsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">12&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;hr>
&lt;h2 id="support-vectors-in-soft-margin-svm">Support Vectors in Soft-Margin SVM
&lt;/h2>&lt;p>When data is not perfectly separable, SVMs use a soft-margin approach with slack variables \(\xi_i\). The complementary slackness and KKT conditions become:&lt;/p>
\[
\alpha_i(1 - y_i\mathbf{w}^\top\mathbf{x}_i - \xi_i) = 0 \\
(C - \alpha_i)\xi_i = 0
\]&lt;p>For points with \(\alpha_i &amp;gt; 0\), we then encounter two cases:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>\(\alpha_i &amp;gt; 0\), \(\xi_i = 0\):&lt;/strong>&lt;br>
The point lies exactly on the margin border. It is a &lt;strong>support vector&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>\(\alpha_i &amp;gt; 0\), \(\xi_i &amp;gt; 0\):&lt;/strong>&lt;br>
The point is either inside the margin or misclassified. Here, \(\alpha_i = C\). These points also influence the decision boundary and are &lt;strong>support vectors&lt;/strong>.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>In contrast, points with \(\alpha_i = 0\) lie far from the margin and do &lt;strong>not&lt;/strong> affect the model.&lt;/p>
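&lt;p>The two kinds of support vectors can be told apart numerically. Here is a minimal sketch (illustrative dataset and names), again using the fact that &lt;code>dual_coef_&lt;/code> stores \(y_i \alpha_i\): multipliers strictly between \(0\) and \(C\) correspond to points exactly on the margin, while multipliers equal to \(C\) correspond to points inside the margin or misclassified.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-python" data-lang="python">import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs so that some slack variables are active
X, y = make_blobs(n_samples=20, centers=2, random_state=0, cluster_std=1.2)
C = 1.0
svm = SVC(C=C, kernel=&amp;#39;linear&amp;#39;).fit(X, y)

alpha = np.abs(svm.dual_coef_[0])   # alpha_i for each support vector
at_bound = np.isclose(alpha, C)     # alpha_i = C: inside the margin or misclassified
print(svm.support_[~at_bound])      # indices of points exactly on the margin
print(svm.support_[at_bound])       # indices of margin violators
&lt;/code>&lt;/pre>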
&lt;h3 id="try-yourself-1">Try yourself!
&lt;/h3>&lt;p>Can you identify the support vectors in the plot below?&lt;/p>
&lt;figure>
&lt;img src="soft_margin.png" class="clickable-image w-60">
&lt;figcaption>Soft-Margin SVM&lt;/figcaption>
&lt;/figure>
&lt;details>
&lt;summary>✅ Click to reveal the answer&lt;/summary>
&lt;p>There are &lt;strong>6&lt;/strong> support vectors in the plot above. In the soft-margin setting, support vectors are the data points with non-zero Lagrange multipliers $\alpha_i > 0$. These include:&lt;/p>
&lt;ul>
&lt;li>Points lying exactly on the margin boundaries&lt;/li>
&lt;li>Points that are within the margin&lt;/li>
&lt;li>Points that are misclassified (on the wrong side of the decision boundary)&lt;/li>
&lt;/ul>
&lt;p>Only these points influence the position of the decision boundary. Points farther away from the margin have $\alpha_i = 0$ and do not contribute.&lt;/p>
&lt;figure>
&lt;img src="soft_margin_marked.png" class="clickable-image w-60">
&lt;figcaption>Support Vectors in Soft-Margin SVM&lt;/figcaption>
&lt;/figure>
&lt;/details>
&lt;h3 id="code-for-generating-the-plots-1">Code for generating the plots
&lt;/h3>&lt;div class="highlight">&lt;div class="chroma">
&lt;table class="lntable">&lt;tr>&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code>&lt;span class="lnt"> 1
&lt;/span>&lt;span class="lnt"> 2
&lt;/span>&lt;span class="lnt"> 3
&lt;/span>&lt;span class="lnt"> 4
&lt;/span>&lt;span class="lnt"> 5
&lt;/span>&lt;span class="lnt"> 6
&lt;/span>&lt;span class="lnt"> 7
&lt;/span>&lt;span class="lnt"> 8
&lt;/span>&lt;span class="lnt"> 9
&lt;/span>&lt;span class="lnt">10
&lt;/span>&lt;span class="lnt">11
&lt;/span>&lt;span class="lnt">12
&lt;/span>&lt;span class="lnt">13
&lt;/span>&lt;span class="lnt">14
&lt;/span>&lt;span class="lnt">15
&lt;/span>&lt;span class="lnt">16
&lt;/span>&lt;span class="lnt">17
&lt;/span>&lt;span class="lnt">18
&lt;/span>&lt;span class="lnt">19
&lt;/span>&lt;span class="lnt">20
&lt;/span>&lt;span class="lnt">21
&lt;/span>&lt;span class="lnt">22
&lt;/span>&lt;span class="lnt">23
&lt;/span>&lt;span class="lnt">24
&lt;/span>&lt;span class="lnt">25
&lt;/span>&lt;span class="lnt">26
&lt;/span>&lt;span class="lnt">27
&lt;/span>&lt;span class="lnt">28
&lt;/span>&lt;span class="lnt">29
&lt;/span>&lt;span class="lnt">30
&lt;/span>&lt;span class="lnt">31
&lt;/span>&lt;span class="lnt">32
&lt;/span>&lt;span class="lnt">33
&lt;/span>&lt;span class="lnt">34
&lt;/span>&lt;span class="lnt">35
&lt;/span>&lt;span class="lnt">36
&lt;/span>&lt;span class="lnt">37
&lt;/span>&lt;span class="lnt">38
&lt;/span>&lt;span class="lnt">39
&lt;/span>&lt;span class="lnt">40
&lt;/span>&lt;span class="lnt">41
&lt;/span>&lt;span class="lnt">42
&lt;/span>&lt;span class="lnt">43
&lt;/span>&lt;span class="lnt">44
&lt;/span>&lt;span class="lnt">45
&lt;/span>&lt;span class="lnt">46
&lt;/span>&lt;span class="lnt">47
&lt;/span>&lt;span class="lnt">48
&lt;/span>&lt;span class="lnt">49
&lt;/span>&lt;span class="lnt">50
&lt;/span>&lt;span class="lnt">51
&lt;/span>&lt;span class="lnt">52
&lt;/span>&lt;span class="lnt">53
&lt;/span>&lt;span class="lnt">54
&lt;/span>&lt;span class="lnt">55
&lt;/span>&lt;/code>&lt;/pre>&lt;/td>
&lt;td class="lntd">
&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">numpy&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">np&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">import&lt;/span> &lt;span class="nn">matplotlib.pyplot&lt;/span> &lt;span class="k">as&lt;/span> &lt;span class="nn">plt&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">sklearn&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">datasets&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="kn">from&lt;/span> &lt;span class="nn">sklearn.svm&lt;/span> &lt;span class="kn">import&lt;/span> &lt;span class="n">SVC&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Adjusting the dataset to be linearly nonseparable for a soft-margin linear SVM&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">X&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">datasets&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">make_blobs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">n_samples&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">20&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">centers&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">random_state&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">cluster_std&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">1.2&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Create a soft margin SVM classifier with a linear kernel for the adjusted dataset&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">svm_linear_soft&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">SVC&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">C&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">kernel&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;linear&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Adjusting C for a softer margin&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">svm_linear_soft&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">fit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">X&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Plot the data&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">figure&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">figsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">8&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">6&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">markers&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;o&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;x&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, marker=&amp;#39;o&amp;#39;, edgecolors=&amp;#39;k&amp;#39;)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">for&lt;/span> &lt;span class="n">class_value&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">marker&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">zip&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">),&lt;/span> &lt;span class="n">markers&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">scatter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">X&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">y&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">class_value&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">X&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">y&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">class_value&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">s&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">80&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">marker&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">marker&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s2">&amp;#34;Class &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">class_value&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Plot the decision boundary&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">gca&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">xlim&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">ax&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_xlim&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ylim&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">ax&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">get_ylim&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Create grid to evaluate model&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">xx&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">linspace&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xlim&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">xlim&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="mi">30&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">yy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">linspace&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ylim&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">ylim&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="mi">30&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">YY&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">XX&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">meshgrid&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">yy&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">xx&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">xy&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">vstack&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="n">XX&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ravel&lt;/span>&lt;span class="p">(),&lt;/span> &lt;span class="n">YY&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ravel&lt;/span>&lt;span class="p">()])&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">T&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">Z&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">svm_linear_soft&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">decision_function&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">xy&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">reshape&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">XX&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># print the number of support vectors&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">num_support_vectors&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">svm_linear_soft&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">support_vectors_&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="sa">f&lt;/span>&lt;span class="s1">&amp;#39;The number of support vectors is &lt;/span>&lt;span class="si">{&lt;/span>&lt;span class="n">num_support_vectors&lt;/span>&lt;span class="si">}&lt;/span>&lt;span class="s1">.&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Plot decision boundary and margins&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">contour&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">XX&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">YY&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Z&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">colors&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;k&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">levels&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">alpha&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mf">.5&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">linestyles&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;--&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;-&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;--&amp;#39;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Highlight the support vectors&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">sv&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">svm_linear_soft&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">support_vectors_&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">ax&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">scatter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">sv&lt;/span>&lt;span class="p">[:,&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">sv&lt;/span>&lt;span class="p">[:,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">s&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">150&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">linewidth&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">facecolors&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;none&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">edgecolors&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;red&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tick_params&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">axis&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;both&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">labelsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xlabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Feature 1&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">fontsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Feature 2&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">fontsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">legend&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">fontsize&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="mi">12&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># plt.savefig(&amp;#39;soft_margin_marked.png&amp;#39;, dpi=300)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">show&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/td>&lt;/tr>&lt;/table>
&lt;/div>
&lt;/div>&lt;hr>
&lt;h2 id="conclusion">Conclusion
&lt;/h2>&lt;p>Support vectors are not just a technical detail in SVMs—they are the essential data points that shape the decision boundary. Whether you’re working with a hard-margin or soft-margin SVM, the concept remains the same:
only the data points with non-zero Lagrange multipliers ($\alpha_i > 0$) influence the final classifier.&lt;/p>
&lt;p>By now, you should be able to look at an SVM plot and confidently pick out the support vectors, understanding exactly why they matter.&lt;/p></description></item><item><title>Support Vector Machines (Part I): What Is a Margin, Really?</title><link>https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/</link><pubDate>Wed, 12 Mar 2025 00:08:20 -0400</pubDate><guid>https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/</guid><description>&lt;img src="https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/margin-cover.jpg" alt="Featured image of post Support Vector Machines (Part I): What Is a Margin, Really?" />&lt;h2 id="before-we-get-started">Before we get started
&lt;/h2>&lt;p>SVM is easily &lt;em>my favorite&lt;/em> machine learning algorithm—no &amp;ldquo;one of&amp;rdquo; needed. In fact, I started the whole &lt;strong>Wandering ML&lt;/strong> series because one day it struck me that I had to write something about SVM. It&amp;rsquo;s simple, elegant, and powerful.&lt;/p>
&lt;p>I first heard about SVM during my early college years. At the time, I knew nothing about machine learning and was obsessively into signal processing (you wouldn’t believe how crazy I was about the Fourier transform). Later, after moving to the U.S., I took a machine learning course. The moment I saw the instructor derive the Lagrange dual problem (which I later learned is a common technique in convex optimization), I thought: “Wow, this is so cool.”&lt;/p>
&lt;p>I still remember how fascinated I was when I finally understood the core idea of SVM. That fascination only deepened when I encountered learning theory—thanks to Dr. Vapnik. Hopefully, after reading this SVM series, you’ll share a bit of that excitement too.&lt;/p>
&lt;h2 id="motivation">Motivation
&lt;/h2>&lt;p>Suppose our data is linearly separable. We can draw a line that perfectly separates the two classes. Then we’re done, right?&lt;/p>
&lt;figure>
&lt;img src="many_lines.svg" class="clickable-image w-60">
&lt;figcaption>Fig. 1. Infinite decision boundaries for linearly separable data&lt;/figcaption>
&lt;/figure>
&lt;p>Well, not quite. There are infinitely many lines that can separate the classes without error. As shown in the figures below, all three lines separate the data perfectly. But which one is the best?&lt;/p>
&lt;p>&lt;img src="https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/decision_boundary_bad1.png"
width="2700"
height="2100"
srcset="https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/decision_boundary_bad1_hu4737866079674126212.png 480w, https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/decision_boundary_bad1_hu1987813729283393096.png 1024w"
loading="lazy"
alt="Decision Boundary (a)"
class="gallery-image"
data-flex-grow="128"
data-flex-basis="308px"
>&lt;img src="https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/decision_boundary_bad2.png"
width="2700"
height="2100"
srcset="https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/decision_boundary_bad2_hu12563777562775526781.png 480w, https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/decision_boundary_bad2_hu10806942748683563302.png 1024w"
loading="lazy"
alt="Decision Boundary (b)"
class="gallery-image"
data-flex-grow="128"
data-flex-basis="308px"
>&lt;img src="https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/decision_boundary.png"
width="2700"
height="2100"
srcset="https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/decision_boundary_hu3911894267793471473.png 480w, https://blog.sihanwei.org/p/support-vector-machines-part-i-what-is-a-margin-really/decision_boundary_hu4226989026953172112.png 1024w"
loading="lazy"
alt="Decision Boundary (c)"
class="gallery-image"
data-flex-grow="128"
data-flex-basis="308px"
>&lt;/p>
&lt;figure>
&lt;figcaption>Fig. 2. Three different decision boundaries for the same data&lt;/figcaption>
&lt;/figure>
&lt;p>Let’s introduce a new data point—a &amp;ldquo;dangerous point&amp;rdquo; with label -1.&lt;/p>
&lt;figure>
&lt;img src="dangerous_point.svg" class="clickable-image w-60">
&lt;figcaption>Fig. 3. A "dangerous point"&lt;/figcaption>
&lt;/figure>
&lt;p>How do the classifiers behave with this new point?&lt;/p>
&lt;ul>
&lt;li>Classifier (a)? It misclassifies the point.&lt;/li>
&lt;li>Classifier (c)? It gets it right.&lt;/li>
&lt;/ul>
&lt;p>Classifier (c) is more robust to this tricky example. In practice, we want our classifier to behave like (c), avoiding such misclassifications whenever possible. But why is the third classifier more robust? What makes it safer?&lt;/p>
&lt;p>Here’s the key: &lt;strong>the margin&lt;/strong>. The third classifier has a wider margin than the other two. The wider the margin, the more robust the classifier is to noise and outliers.&lt;/p>
&lt;h2 id="what-is-a-margin">What is a margin?
&lt;/h2>&lt;p>Let’s start with a simple analogy. Imagine driving on a two-lane road, one lane per direction. If the road is wide, you feel safe driving in your lane without worrying about the car coming from the opposite direction. But if the road is narrow, you must drive more carefully, keeping a safe distance.&lt;/p>
&lt;figure>
&lt;img src="road.png" class="clickable-image w-60">
&lt;figcaption>Fig. 4. A two-lane road&lt;/figcaption>
&lt;/figure>
&lt;p>We feel safer on wider roads, and classifiers feel the same. A margin acts like a &lt;em>buffer zone&lt;/em>. The wider the margin, the more robust the classifier becomes to noise and outliers.&lt;/p>
&lt;p>Let’s formalize this idea.&lt;/p>
&lt;p>In SVMs, the margin is defined as the &lt;strong>minimum distance&lt;/strong> from the decision boundary to all the training data. In other words, it quantifies the “space” that separates the two classes.&lt;/p>
&lt;figure>
&lt;img src="svm_margin.svg" class="clickable-image w-60">
&lt;figcaption>Fig. 5. Max-margin classifier&lt;/figcaption>
&lt;/figure>
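&lt;p>To make this definition concrete, here is a minimal numerical sketch (the dataset and variable names are illustrative). It computes the margin of a trained linear SVM as the smallest distance from any training point to the boundary, using the point-to-hyperplane distance formula derived in the next section.&lt;/p>
&lt;pre tabindex="0">&lt;code class="language-python" data-lang="python">import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=20, centers=2, random_state=6, cluster_std=1.2)
y = 2 * y01 - 1                      # relabel the classes as -1 / +1

svm = SVC(C=1e6, kernel=&amp;#39;linear&amp;#39;).fit(X, y)  # large C: effectively hard margin
w, b = svm.coef_[0], svm.intercept_[0]

# Margin = minimum signed distance from the training points to the boundary
distances = y * (X @ w + b) / np.linalg.norm(w)
print(distances.min())
&lt;/code>&lt;/pre>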
&lt;h2 id="size-of-the-margin">Size of the margin
&lt;/h2>&lt;p>Now that we know what a margin is, let’s see how to compute its size mathematically.&lt;/p>
&lt;p>By definition, it is the distance from the decision boundary to the closest data point. Sound familiar? It’s the classic high-school formula for the distance from a point to a hyperplane.&lt;/p>
&lt;p>The decision boundary is defined as&lt;/p>
$$
\mathbf{w}^\top \mathbf{x} + b = 0
$$&lt;p>What is the distance from a point to a line? Let&amp;rsquo;s say we have an arbitrary data point $(\mathbf{x}, y)$ and a line defined by the equation $\mathbf{w}^\top \mathbf{x} + b = 0$. $\mathbf{x_p}$ is the projection of $\mathbf{x}$ on the line. $\mathbf{w}$ is the normal vector of the line.&lt;/p>
&lt;figure>
&lt;img src="distance.svg" style="width: 40%;">
&lt;figcaption>Fig. 6. Distance of a point to a line.&lt;/figcaption>
&lt;/figure>
&lt;p>We consider the vector between $\mathbf{x}$ and its projection $\mathbf{x_p}$. Then the distance $d$ from the point to the line is given by:&lt;/p>
$$
d\cdot \frac{\mathbf{w}}{\|\mathbf{w}\|} = y\cdot(\mathbf{x} - \mathbf{x_p})
$$&lt;p>Think about it: why is the label $y$ here?&lt;/p>
&lt;details>
&lt;summary>✅ Click to reveal the answer&lt;/summary>
$y$ is the label of the point $\mathbf{x}$. If $\mathbf{x}$ is a positive point, then $y=1$. If $\mathbf{x}$ is a negative point, then $y=-1$. Since the normal vector $\mathbf{w}$ is pointing in the direction of the positive class, we multiply $(\mathbf{x} - \mathbf{x_p})$ by $y$ to ensure that the distance is positive. This way, we can always get a positive distance regardless of the class of the point $\mathbf{x}$.
&lt;/details>
&lt;p>Multiplying both sides by $\mathbf{w}^\top$ (and adding and subtracting $b$ on the right-hand side), we have:
&lt;/p>
$$
d\cdot \frac{\mathbf{w}^\top\mathbf{w}}{\|\mathbf{w}\|} = y\cdot((\mathbf{w}^\top\mathbf{x}+b) - (\mathbf{w}^\top\mathbf{x_p}+b))
$$&lt;p>Since $\mathbf{x_p}$ is on the line, we have:
&lt;/p>
$$
\mathbf{w}^\top \mathbf{x_p} + b = 0
$$&lt;p>
Hence, we can rewrite the above equation as:
&lt;/p>
$$
y\cdot(\mathbf{w}^\top \mathbf{x} + b) = d\cdot \|\mathbf{w}\|
$$&lt;p>
Then we have:
&lt;/p>
$$
d = y\cdot\frac{\mathbf{w}^\top \mathbf{x} + b}{\|\mathbf{w}\|}
$$&lt;p>Since the margin is defined as &lt;strong>the minimum distance from all training data to the decision boundary&lt;/strong>, we can write:&lt;/p>
$$
\text{margin} = \min_{i} \left( y_i\cdot\frac{\mathbf{w}^\top \mathbf{x_i} + b}{\|\mathbf{w}\|} \right)
$$&lt;p>Then the problem of maximizing the margin or the distance from all training data to the decision boundary can be formulated as:
&lt;/p>
$$
\max_{\mathbf{w}, b} \min_{i} \left( y_i\cdot\frac{\mathbf{w}^\top \mathbf{x_i} + b}{\|\mathbf{w}\|} \right)
$$&lt;p>Notice that we can always rescale $\mathbf{w}$ and the bias $b$ by the same positive constant without changing the decision boundary. We can therefore fix the scale so that $\min_i y_i\cdot(\mathbf{w}^\top \mathbf{x_i} + b)=1$. With this normalization, the margin is exactly $\frac{1}{\|\mathbf{w}\|}$.&lt;/p>
&lt;p>This leads to the following optimization problem:
&lt;/p>
$$
\max_{\mathbf{w}, b} \frac{1}{\|\mathbf{w}\|} \quad \text{s.t.} \quad \forall i,y_i\cdot(\mathbf{w}^\top \mathbf{x_i} + b) \geq 1
$$&lt;h2 id="conclusion">Conclusion
&lt;/h2>&lt;p>In this post, we introduced the concept of margin in SVMs. We also saw how to calculate the margin and why maximizing it leads to more robust classifiers.&lt;/p>
&lt;p>In the next post, we will discuss how to find the maximum margin classifier and how to solve the optimization problem using Lagrange multipliers. We will also handle the case of non-linearly separable data.&lt;/p>
&lt;blockquote>
&lt;p>Code and plots are available in the &lt;a class="link" href="https://github.com/RaphelWei/blog-codebase/tree/main/ml-notes/svm-series" target="_blank" rel="noopener"
>GitHub repository&lt;/a>&lt;/p>
&lt;/blockquote></description></item><item><title>Math Tricks for Machine Learning (Part I): Concentration Inequality</title><link>https://blog.sihanwei.org/p/math-tricks-for-machine-learning-part-i-concentration-inequality/</link><pubDate>Sun, 02 Mar 2025 06:52:04 -0500</pubDate><guid>https://blog.sihanwei.org/p/math-tricks-for-machine-learning-part-i-concentration-inequality/</guid><description>&lt;img src="https://blog.sihanwei.org/p/math-tricks-for-machine-learning-part-i-concentration-inequality/ci-cover.jpg" alt="Featured image of post Math Tricks for Machine Learning (Part I): Concentration Inequality" />&lt;h2 id="introduction">Introduction
&lt;/h2>&lt;p>Over the past few years, I’ve passionately studied various machine learning and statistical concepts. One thing I’ve learned is that many research papers rely on clever mathematical “tricks”—techniques that are used so routinely they often go unexplained. In this series, I plan to catalog these tricks to help demystify the math behind modern ML.&lt;/p>
&lt;p>In this first installment, we’ll focus on concentration inequalities, a key tool for understanding how random variables behave. Whether you’re analyzing generalization bounds or just trying to get a grip on how data “concentrates” around its mean, these inequalities provide a rigorous way to quantify uncertainty.&lt;/p>
&lt;h3 id="what-are-concentration-inequalities">What Are Concentration Inequalities?
&lt;/h3>&lt;p>Concentration inequalities provide bounds on the probability that a random variable deviates from some central value (often its expected value). In simpler terms, they tell us how “concentrated” a random variable is around its mean.&lt;/p>
&lt;p>For example, if you compute the average of a large number of independent samples, a concentration inequality can help you answer: How likely is it that the average is far from the true mean? This is crucial for ensuring that what we observe empirically (on our training set, say) is representative of the underlying data distribution.&lt;/p>
&lt;h3 id="why-they-matter-in-machine-learning">Why They Matter in Machine Learning
&lt;/h3>&lt;p>In machine learning, concentration inequalities are the backbone of many generalization guarantees. They help us:&lt;/p>
&lt;ul>
&lt;li>Quantify the reliability of empirical estimates: For instance, ensuring that the training error is close to the true error.&lt;/li>
&lt;li>Derive performance bounds: Many algorithms’ guarantees hinge on these inequalities.&lt;/li>
&lt;li>Analyze convergence: When using stochastic optimization methods, concentration inequalities can show how fast our estimates converge to their true values.&lt;/li>
&lt;/ul>
&lt;h3 id="key-examples-of-concentration-inequalities">Key Examples of Concentration Inequalities
&lt;/h3>&lt;p>Here are some of the most common concentration inequalities that you might encounter in ML literature:&lt;/p>
&lt;ul>
&lt;li>Hoeffding’s Inequality: Provides a bound for the sum of bounded independent random variables.&lt;/li>
&lt;li>McDiarmid’s Inequality: Useful when the function of independent random variables does not change too much when any single variable is altered.&lt;/li>
&lt;li>Chebyshev’s Inequality: Offers a more general (though often looser) bound using the variance of the random variable.&lt;/li>
&lt;li>Chernoff Bounds: Provide exponentially decreasing bounds on tail distributions of sums of independent random variables.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="a-closer-look-hoeffdings-inequality">A Closer Look: Hoeffding’s Inequality
&lt;/h2>&lt;p>To illustrate the concept, consider Hoeffding’s inequality. Suppose you have independent random variables \(X_1, X_2, \dots, X_n\) that are bounded (say, each \(X_i \in [a_i, b_i]\)). Define the empirical average:&lt;/p>
\[
\frac{1}{n} \sum_{i=1}^n X_i
\]&lt;p>Hoeffding’s inequality gives us a bound on how far this average can deviate from its expected value. Specifically, for any \(t > 0\):&lt;/p>
\[
\Pr\left( \left| \frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n X_i\right] \right| \ge t \right) \le 2\exp\left(\frac{-2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)
\]&lt;p>In plain terms: the more samples you have, the tighter the concentration around the true mean. The probability of a large deviation shrinks exponentially fast in the number of samples \(n\).&lt;/p>
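&lt;p>As a concrete (made-up) instance: with \( n = 100 \) variables each bounded in \([0, 2]\) and \( t = 0.2 \), the bound evaluates to&lt;/p>
\[
2\exp\left(\frac{-2 \cdot 100^2 \cdot 0.2^2}{100 \cdot 2^2}\right) = 2e^{-2} \approx 0.27,
\]&lt;p>so a deviation of \( 0.2 \) or more occurs in at most roughly a quarter of realizations, and this probability drops rapidly as \( n \) grows.&lt;/p>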
&lt;h3 id="common-variants-in-ml">Common Variants in ML
&lt;/h3>&lt;p>In machine learning, our data is often assumed to be i.i.d. and bounded in \([0, 1]\). In this case, Hoeffding’s inequality simplifies to:&lt;/p>
\[
\Pr\left( \left| \frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}[X_i] \right| \ge t \right)
\le 2 \exp(-2nt^2)
\]&lt;p>This is commonly used when bounding the difference between empirical risk and true risk.&lt;/p>
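&lt;p>A quick simulation makes this tangible. The sketch below (parameter values are arbitrary choices for illustration) draws Bernoulli(0.5) samples, which are bounded in \([0, 1]\), and compares the empirical deviation frequency with the bound \( 2\exp(-2nt^2) \):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 200, 0.1, 100_000

# Sample means of n i.i.d. Bernoulli(0.5) variables, repeated many times.
means = rng.binomial(n, 0.5, size=trials) / n

# Fraction of trials where the mean deviates from 0.5 by at least t,
# versus the two-sided Hoeffding bound.
empirical = np.mean(np.abs(means - 0.5) >= t)
bound = 2 * np.exp(-2 * n * t**2)
print(empirical, bound)  # the empirical frequency stays below the bound
&lt;/code>&lt;/pre>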
&lt;h3 id="one-sided-version">One-Sided Version
&lt;/h3>&lt;p>If you only care about the upper (or lower) tail—for example, bounding overestimation of the mean—you can drop the absolute value:&lt;/p>
\[
\Pr\left( \frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}[X_i] \ge t \right)
\le \exp(-2nt^2)
\]&lt;p>This is especially handy when applying a union bound across multiple events.&lt;/p>
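&lt;p>For example (a standard calculation, not specific to any particular setting): if we need the one-sided bound to hold simultaneously for \( K \) empirical estimates, the union bound gives&lt;/p>
\[
\Pr\left( \exists k : \frac{1}{n} \sum_{i=1}^n X_i^{(k)} - \mathbb{E}[X_i^{(k)}] \ge t \right) \le K \exp(-2nt^2),
\]&lt;p>which is at most \( \delta \) once \( t \ge \sqrt{\ln(K/\delta) / (2n)} \).&lt;/p>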
&lt;hr>
&lt;h2 id="a-closer-look-mcdiarmids-inequality">A Closer Look: McDiarmid’s Inequality
&lt;/h2>&lt;p>McDiarmid’s inequality is a powerful concentration result that applies to functions of independent random variables—especially when the function doesn&amp;rsquo;t change too much if a single variable is altered. It is sometimes referred to as the &lt;strong>bounded difference inequality&lt;/strong>.&lt;/p>
&lt;h3 id="setup">Setup
&lt;/h3>&lt;p>Let \( X_1, X_2, \dots, X_n \) be independent random variables taking values in arbitrary spaces. Suppose we have a function&lt;br>
&lt;/p>
\[
f : \mathcal{X}_1 \times \dots \times \mathcal{X}_n \to \mathbb{R}
\]&lt;p>&lt;br>
such that changing any one coordinate \( X_i \) (while keeping the others fixed) changes the value of \( f \) by at most \( c_i \). Formally, for all \( i \in \{1, \dots, n\} \):&lt;/p>
\[
\sup_{x_1,\dots,x_n,\,x_i'} \left| f(x_1, \dots, x_i, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n) \right| \le c_i
\]&lt;p>Then for any \( t > 0 \):&lt;/p>
\[
\Pr\left( f(X_1, \dots, X_n) - \mathbb{E}[f(X_1, \dots, X_n)] \ge t \right)
\le \exp\left( \frac{-2t^2}{\sum_{i=1}^n c_i^2} \right)
\]&lt;p>There is also a &lt;strong>two-sided version&lt;/strong>:&lt;/p>
\[
\Pr\left( \left| f(X_1, \dots, X_n) - \mathbb{E}[f(X_1, \dots, X_n)] \right| \ge t \right)
\le 2\exp\left( \frac{-2t^2}{\sum_{i=1}^n c_i^2} \right)
\]&lt;hr>
&lt;h3 id="why-it-matters-in-ml">Why It Matters in ML
&lt;/h3>&lt;p>McDiarmid’s inequality is especially useful in situations where we evaluate some function over a dataset, like the &lt;strong>empirical risk&lt;/strong>, and want to show that it concentrates around its expected value.&lt;/p>
&lt;p>Unlike Hoeffding’s inequality, which applies to sums of random variables, McDiarmid applies to more general functions—as long as &lt;strong>no single variable has too much influence&lt;/strong>. This makes it highly suitable for:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Stability analysis of algorithms&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Generalization bounds&lt;/strong> when empirical loss functions change only slightly with a single data point&lt;/li>
&lt;li>&lt;strong>Complex random processes&lt;/strong>, such as Rademacher complexities or covering number arguments&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h3 id="example">Example
&lt;/h3>&lt;p>Let’s say \( f \) is the empirical risk over a dataset of \( n \) samples:&lt;/p>
\[
f(X_1, \dots, X_n) = \frac{1}{n} \sum_{i=1}^n \ell(X_i)
\]&lt;p>If the loss function \( \ell \) is bounded in \([0, 1]\), then changing one data point changes the empirical risk by at most \( \frac{1}{n} \). So \( c_i = \frac{1}{n} \), and:&lt;/p>
\[
\sum_{i=1}^n c_i^2 = n \cdot \left(\frac{1}{n}\right)^2 = \frac{1}{n}
\]&lt;p>Plugging this into McDiarmid’s inequality gives:&lt;/p>
\[
\Pr\left( f(X_1, \dots, X_n) - \mathbb{E}[f] \ge t \right)
\le \exp(-2nt^2)
\]&lt;p>— which is exactly the same bound as Hoeffding’s inequality for i.i.d. bounded random variables, but derived in a more general framework.&lt;/p></description></item><item><title>Regret Analysis of FTRL and OMD Algorithms</title><link>https://blog.sihanwei.org/p/regret-analysis-of-ftrl-and-omd-algorithms/</link><pubDate>Fri, 18 Oct 2024 22:54:24 -0400</pubDate><guid>https://blog.sihanwei.org/p/regret-analysis-of-ftrl-and-omd-algorithms/</guid><description>&lt;img src="https://blog.sihanwei.org/p/regret-analysis-of-ftrl-and-omd-algorithms/omd-cover.jpg" alt="Featured image of post Regret Analysis of FTRL and OMD Algorithms" />&lt;h1 id="regret-analysis-of-ftrl-and-omd-algorithms">Regret Analysis of FTRL and OMD Algorithms
&lt;/h1>&lt;h2 id="introduction">Introduction
&lt;/h2>&lt;p>In this note, we&amp;rsquo;ll explore the regret analysis of both the &lt;strong>Follow-The-Regularized-Leader (FTRL)&lt;/strong> algorithm and the &lt;strong>Online Mirror Descent (OMD)&lt;/strong> algorithm. We&amp;rsquo;ll highlight their similarities and differences, and demonstrate how, under certain conditions, they are essentially equivalent. This analysis includes detailed derivations and mathematical expressions.&lt;/p>
&lt;h2 id="follow-the-regularized-leader-ftrl">Follow-The-Regularized-Leader (FTRL)
&lt;/h2>&lt;h3 id="problem-setup">Problem Setup
&lt;/h3>&lt;p>Consider an online convex optimization problem over \( T \) rounds. At each round \( t \):&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Decision Making&lt;/strong>: The learner selects \( \mathbf{x}_t \in \mathcal{X} \subseteq \mathbb{R}^n \).&lt;/li>
&lt;li>&lt;strong>Loss Revealing&lt;/strong>: An adversary reveals a convex loss function \( f_t : \mathcal{X} \rightarrow \mathbb{R} \).&lt;/li>
&lt;li>&lt;strong>Loss Incurred&lt;/strong>: The learner incurs loss \( f_t(\mathbf{x}_t) \).&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Goal&lt;/strong>: Minimize the cumulative &lt;strong>regret&lt;/strong>:&lt;/p>
\[
\text{Regret}_T = \sum_{t=1}^T f_t(\mathbf{x}_t) - \min_{\mathbf{x} \in \mathcal{X}} \sum_{t=1}^T f_t(\mathbf{x}).
\]&lt;h3 id="ftrl-algorithm">FTRL Algorithm
&lt;/h3>&lt;p>At each round \( t \), the FTRL algorithm updates the decision by solving:&lt;/p>
\[
\mathbf{x}_t = \arg\min_{\mathbf{x} \in \mathcal{X}} \left\{ \eta \sum_{s=1}^{t-1} f_s(\mathbf{x}) + R(\mathbf{x}) \right\},
\]&lt;p>where:&lt;/p>
&lt;ul>
&lt;li>\( \eta > 0 \) is the learning rate.&lt;/li>
&lt;li>\( R : \mathcal{X} \rightarrow \mathbb{R} \) is a strongly convex regularization function.&lt;/li>
&lt;/ul>
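&lt;p>As a minimal illustration (an assumption-laden special case, not the general algorithm): with linearized losses \( f_s(\mathbf{x}) = \langle \mathbf{g}_s, \mathbf{x} \rangle \), the quadratic regularizer \( R(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|^2 \), and an unconstrained domain, the argmin has the closed form \( \mathbf{x}_t = -\eta \sum_{s=1}^{t-1} \mathbf{g}_s \):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def ftrl_iterates(gradients, eta):
    """FTRL with linearized losses and R(x) = 0.5 * ||x||^2 on an
    unconstrained domain; the update reduces to the closed form
    x_t = -eta * (g_1 + ... + g_{t-1})."""
    cumulative = np.zeros_like(gradients[0])
    iterates = [cumulative.copy()]        # x_1 = argmin R = 0
    for g in gradients:
        cumulative = cumulative + g
        iterates.append(-eta * cumulative)
    return iterates

# Hypothetical gradient sequence, e.g. from linear losses.
gs = [np.array([1.0, -2.0]), np.array([0.5, 0.5]), np.array([-1.0, 1.0])]
print(ftrl_iterates(gs, eta=0.1))
&lt;/code>&lt;/pre>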
&lt;h3 id="regret-analysis">Regret Analysis
&lt;/h3>&lt;p>&lt;strong>Assumptions&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Convexity&lt;/strong>: Each loss function \( f_t \) is convex.&lt;/li>
&lt;li>&lt;strong>Lipschitz Continuity&lt;/strong>: The subgradients are bounded: \( \| \nabla f_t(\mathbf{x}) \|_* \leq G \) for all \( \mathbf{x} \in \mathcal{X} \).&lt;/li>
&lt;li>&lt;strong>Strong Convexity&lt;/strong>: The regularizer \( R \) is \( \lambda \)-strongly convex with respect to a norm \( \| \cdot \| \).&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Key Steps&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>One-Step Regret Bound&lt;/strong>&lt;/p>
&lt;p>Using the convexity of \( f_t \):&lt;/p>
\[
f_t(\mathbf{x}_t) - f_t(\mathbf{x}^*) \leq \langle \nabla f_t(\mathbf{x}_t), \mathbf{x}_t - \mathbf{x}^* \rangle,
\]&lt;p>where \( \mathbf{x}^* = \arg\min_{\mathbf{x} \in \mathcal{X}} \sum_{t=1}^T f_t(\mathbf{x}) \).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Regret Decomposition&lt;/strong>&lt;/p>
&lt;p>Summing over \( t \):&lt;/p>
\[
\text{Regret}_T \leq \sum_{t=1}^T \langle \nabla f_t(\mathbf{x}_t), \mathbf{x}_t - \mathbf{x}^* \rangle.
\]&lt;/li>
&lt;li>
&lt;p>&lt;strong>Bounding the Inner Product&lt;/strong>&lt;/p>
&lt;p>Using the properties of the regularizer and the FTRL updates, we can relate the sum to the Bregman divergence \( D_R \):&lt;/p>
\[
\sum_{t=1}^T \langle \nabla f_t(\mathbf{x}_t), \mathbf{x}_t - \mathbf{x}^* \rangle \leq \frac{R(\mathbf{x}^*) - R(\mathbf{x}_1)}{\eta} + \frac{\eta}{2 \lambda} \sum_{t=1}^T \| \nabla f_t(\mathbf{x}_t) \|_*^2 \leq \frac{R(\mathbf{x}^*) - R(\mathbf{x}_1)}{\eta} + \frac{\eta G^2 T}{2 \lambda}.
\]&lt;p>The first term is the cost of the regularization; the second is a stability term controlled by the strong convexity of \( R \). &lt;strong>Bregman Divergence Definition&lt;/strong>:&lt;/p>
\[
D_R(\mathbf{x}, \mathbf{y}) = R(\mathbf{x}) - R(\mathbf{y}) - \langle \nabla R(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle.
\]&lt;/li>
&lt;li>
&lt;p>&lt;strong>Regret Bound&lt;/strong>&lt;/p>
&lt;p>Therefore, the total regret is bounded by:&lt;/p>
\[
\text{Regret}_T \leq \frac{R(\mathbf{x}^*) - R(\mathbf{x}_1)}{\eta} + \frac{\eta G^2 T}{2 \lambda}.
\]&lt;p>By choosing \( \eta \) appropriately (e.g., \( \eta = \sqrt{\dfrac{2 \lambda [R(\mathbf{x}^*) - R(\mathbf{x}_1)]}{G^2 T}} \)), we can achieve a regret bound of:&lt;/p>
\[
\text{Regret}_T \leq G \sqrt{\dfrac{2 [R(\mathbf{x}^*) - R(\mathbf{x}_1)] T}{\lambda}}.
\]&lt;/li>
&lt;/ol>
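&lt;p>As a sanity check on the last step: a bound of the form \( A/\eta + B\eta \) is minimized at \( \eta = \sqrt{A/B} \), where it equals \( 2\sqrt{AB} \). Plugging in \( A = R(\mathbf{x}^*) - R(\mathbf{x}_1) \) and \( B = \dfrac{G^2 T}{2\lambda} \) recovers both the learning rate and the final regret bound above.&lt;/p>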
&lt;h2 id="online-mirror-descent-omd">Online Mirror Descent (OMD)
&lt;/h2>&lt;h3 id="algorithm-steps">Algorithm Steps
&lt;/h3>&lt;ol>
&lt;li>
&lt;p>&lt;strong>Initialization&lt;/strong>: Choose an initial point \( \mathbf{x}_1 \in \mathcal{X} \).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>For each round \( t = 1, \dots, T \)&lt;/strong>:&lt;/p>
&lt;p>a. &lt;strong>Compute Subgradient&lt;/strong>:&lt;/p>
\[
\mathbf{g}_t = \nabla f_t(\mathbf{x}_t).
\]&lt;p>b. &lt;strong>Dual Space Update&lt;/strong>:&lt;/p>
\[
\mathbf{z}_{t+1} = \mathbf{z}_t - \eta \mathbf{g}_t,
\]&lt;p>where \( \mathbf{z}_t = \nabla \psi(\mathbf{x}_t) \).&lt;/p>
&lt;p>c. &lt;strong>Primal Space Update&lt;/strong>:&lt;/p>
\[
\mathbf{x}_{t+1} = \nabla \psi^*(\mathbf{z}_{t+1}),
\]&lt;p>with \( \psi^* \) being the convex conjugate of \( \psi \).&lt;/p>
&lt;/li>
&lt;/ol>
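&lt;p>As one concrete instantiation (a sketch under assumptions, not the only choice of mirror map): taking the negative-entropy mirror map \( \psi(\mathbf{x}) = \sum_i x_i \log x_i \) on the probability simplex turns the dual and primal updates into a simple multiplicative rule, often called exponentiated gradient:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def omd_negative_entropy(gradients, eta, dim):
    """OMD with the negative-entropy mirror map on the probability simplex.
    The dual step z = z - eta * g followed by the primal map through the
    conjugate reduces to a multiplicative update plus normalization."""
    x = np.full(dim, 1.0 / dim)          # uniform starting point
    iterates = [x]
    for g in gradients:
        x = x * np.exp(-eta * g)         # dual-space gradient step
        x = x / x.sum()                  # map back onto the simplex
        iterates.append(x)
    return iterates

# Hypothetical loss gradients over a 3-dimensional simplex.
gs = [np.array([1.0, 0.0, -1.0]), np.array([0.2, -0.4, 0.2])]
print(omd_negative_entropy(gs, eta=0.5, dim=3))
&lt;/code>&lt;/pre>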
&lt;h3 id="regret-analysis-1">Regret Analysis
&lt;/h3>&lt;p>&lt;strong>Assumptions&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Convexity&lt;/strong>: Each \( f_t \) is convex.&lt;/li>
&lt;li>&lt;strong>Lipschitz Continuity&lt;/strong>: Subgradients are bounded: \( \| \mathbf{g}_t \|_* \leq G \).&lt;/li>
&lt;li>&lt;strong>Strong Convexity&lt;/strong>: The mirror map \( \psi \) is \( \lambda \)-strongly convex.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Key Steps&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Regret Decomposition&lt;/strong>&lt;/p>
&lt;p>The regret can be bounded by:&lt;/p>
\[
\text{Regret}_T \leq \sum_{t=1}^T \langle \mathbf{g}_t, \mathbf{x}_t - \mathbf{x}^* \rangle.
\]&lt;/li>
&lt;li>
&lt;p>&lt;strong>Using Mirror Descent Updates&lt;/strong>&lt;/p>
&lt;p>Utilizing the properties of the Bregman divergence \( D_\psi \) and the mirror descent updates:&lt;/p>
\[
\sum_{t=1}^T \langle \mathbf{g}_t, \mathbf{x}_t - \mathbf{x}^* \rangle = \frac{1}{\eta} \left[ D_\psi(\mathbf{x}^*, \mathbf{x}_1) - D_\psi(\mathbf{x}^*, \mathbf{x}_{T+1}) + \sum_{t=1}^T D_\psi(\mathbf{x}_{t+1}, \mathbf{x}_t) \right].
\]&lt;/li>
&lt;li>
&lt;p>&lt;strong>Bounding the Bregman Divergences&lt;/strong>&lt;/p>
&lt;p>Since \( D_\psi(\mathbf{x}^*, \mathbf{x}_{T+1}) \geq 0 \) and \( D_\psi(\mathbf{x}_{t+1}, \mathbf{x}_t) \leq \dfrac{\eta^2 G^2}{2 \lambda} \), we have:&lt;/p>
\[
\text{Regret}_T \leq \frac{D_\psi(\mathbf{x}^*, \mathbf{x}_1)}{\eta} + \frac{\eta G^2 T}{2 \lambda}.
\]&lt;/li>
&lt;li>
&lt;p>&lt;strong>Optimizing the Learning Rate&lt;/strong>&lt;/p>
&lt;p>Choosing:&lt;/p>
\[
\eta = \sqrt{\dfrac{2 \lambda D_\psi(\mathbf{x}^*, \mathbf{x}_1)}{G^2 T}},
\]&lt;p>yields the regret bound:&lt;/p>
\[
\text{Regret}_T \leq G \sqrt{\dfrac{2 D_\psi(\mathbf{x}^*, \mathbf{x}_1) T}{\lambda}}.
\]&lt;/li>
&lt;/ol>
&lt;h2 id="equivalence-of-ftrl-and-omd">Equivalence of FTRL and OMD
&lt;/h2>&lt;p>Under certain conditions, FTRL and OMD are equivalent algorithms.&lt;/p>
&lt;h3 id="conditions-for-equivalence">Conditions for Equivalence
&lt;/h3>&lt;ul>
&lt;li>&lt;strong>Matching Regularizers and Mirror Maps&lt;/strong>: If the regularizer \( R \) in FTRL is identical to the mirror map \( \psi \) in OMD.&lt;/li>
&lt;li>&lt;strong>Unconstrained Domain&lt;/strong>: When the feasible set \( \mathcal{X} \) is the entire space \( \mathbb{R}^n \).&lt;/li>
&lt;/ul>
&lt;h3 id="demonstration-of-equivalence">Demonstration of Equivalence
&lt;/h3>&lt;ol>
&lt;li>
&lt;p>&lt;strong>FTRL Update in Terms of Gradients&lt;/strong>&lt;/p>
&lt;p>The FTRL update can be expressed as:&lt;/p>
\[
\mathbf{x}_t = \arg\min_{\mathbf{x} \in \mathcal{X}} \left\{ \left\langle \eta \sum_{s=1}^{t-1} \mathbf{g}_s, \mathbf{x} \right\rangle + R(\mathbf{x}) \right\}.
\]&lt;/li>
&lt;li>
&lt;p>&lt;strong>Relation to Dual Variables in OMD&lt;/strong>&lt;/p>
&lt;p>In OMD, the dual variable \( \mathbf{z}_t \) is:&lt;/p>
\[
\mathbf{z}_t = \nabla \psi(\mathbf{x}_t) = \mathbf{z}_1 - \eta \sum_{s=1}^{t-1} \mathbf{g}_s.
\]&lt;/li>
&lt;li>
&lt;p>&lt;strong>Primal Update via Convex Conjugate&lt;/strong>&lt;/p>
&lt;p>The FTRL update becomes:&lt;/p>
\[
\mathbf{x}_t = \nabla R^*\left( -\eta \sum_{s=1}^{t-1} \mathbf{g}_s \right),
\]&lt;p>which matches the OMD update when \( R = \psi \) and \( \mathbf{x}_1 \) minimizes \( R \) (so that \( \nabla \psi(\mathbf{x}_1) = \mathbf{0} \)):&lt;/p>
\[
\mathbf{x}_t = \nabla \psi^*\left( \nabla \psi(\mathbf{x}_1) - \eta \sum_{s=1}^{t-1} \mathbf{g}_s \right).
\]&lt;/li>
&lt;/ol>
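&lt;p>A tiny numerical check (illustrative only; quadratic \( R = \psi \), unconstrained domain, \( \mathbf{x}_1 = \mathbf{0} \)) confirms that the two update rules trace out the same iterates:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(1)
eta, T, dim = 0.1, 50, 3
gradients = [rng.normal(size=dim) for _ in range(T)]

# With R(x) = psi(x) = 0.5 * ||x||^2, grad psi is the identity map, so the
# OMD dual step acts directly on the primal iterate, and the FTRL argmin
# has the closed form -eta * (sum of past gradients).
x_omd = np.zeros(dim)
cumulative = np.zeros(dim)
for g in gradients:
    cumulative += g
    x_ftrl = -eta * cumulative      # FTRL closed-form update
    x_omd = x_omd - eta * g         # OMD update with identity mirror map
    assert np.allclose(x_ftrl, x_omd)
print("FTRL and OMD iterates coincide")
&lt;/code>&lt;/pre>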
&lt;h3 id="conclusion">Conclusion
&lt;/h3>&lt;p>By aligning the regularization function in FTRL with the mirror map in OMD and considering the unconstrained domain, the updates of both algorithms coincide. This demonstrates that FTRL and OMD are essentially equivalent under these conditions, offering different perspectives on the same optimization process.&lt;/p></description></item><item><title>About</title><link>https://blog.sihanwei.org/about/</link><pubDate>Sun, 28 Jun 2020 08:31:57 -0500</pubDate><guid>https://blog.sihanwei.org/about/</guid><description>&lt;h2 id="about-me">About Me
&lt;/h2>&lt;p>Hi, I’m Sihan Wei &amp;mdash; a learner who documents the path, and lights it up for others.&lt;/p>
&lt;p>I write (and think) about machine learning theory, optimization, and the occasional abstract rabbit hole.&lt;/p>
&lt;p>This blog is a space for slow thoughts: the kind that start with a proof, wander through patterns, and land somewhere in probability.&lt;/p>
&lt;p>I mostly write in English, but every now and then you’ll find me posting something fun in &lt;a class="link" href="https://blog.sihanwei.org/zh-cn/" target="_blank" rel="noopener"
>Chinese&lt;/a> &amp;mdash; it’s my mother tongue, and sometimes it just captures the feeling better.&lt;/p>
&lt;p>For academic stuff: &lt;a class="link" href="https://sihanwei.org" target="_blank" rel="noopener"
>Check out my research homepage&lt;/a>.&lt;/p>
&lt;p>Hope you enjoy hanging out here on my blog!&lt;/p>
&lt;hr>
&lt;h2 id="about-this-blog">About this blog
&lt;/h2>&lt;p>I started this blog because I once didn’t understand — and now that I do, I want to help others get there faster. This is my way of passing the torch.&lt;/p>
&lt;p>This blog has three main flavors:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Research notes&lt;/strong> — mostly for myself. Stuff I’m thinking about, half-finished ideas, little technical rabbit holes.&lt;br>
If you’re into optimization or ML theory, cool — you might find a gem (or at least a weird equation) here and there.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>ML notes&lt;/strong> — for anyone trying to make sense of machine learning.&lt;br>
I write these when I finally understand something I’ve been stuck on — hoping it saves someone else a bit of headache.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Learning log&lt;/strong> — notes from when I’m learning something outside my usual lane.&lt;br>
I try to capture not just what I’ve learned, but how I got there — the questions, the patterns, the mental clicks along the way.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Basically: I write to figure things out. And sometimes, I hit “publish” in case it helps someone else too.&lt;/p>
&lt;hr>
&lt;p>&lt;strong>Proofs&lt;/strong> — because structure matters.&lt;br>
&lt;strong>Patterns&lt;/strong> — because abstraction connects everything.&lt;br>
&lt;strong>Probabilities&lt;/strong> — because uncertainty is part of all learning, and all living.&lt;/p>
&lt;p>This is the lens I bring to research — and sometimes, to writing too.&lt;/p>
&lt;hr>
&lt;h2 id="my-name">My Name
&lt;/h2>&lt;p>My Chinese name is &lt;strong>思涵&lt;/strong>, pronounced Sī Hán in Mandarin. (&lt;a class="link" href="https://translate.google.com/?sl=auto&amp;amp;tl=en&amp;amp;text=%E6%80%9D%E6%B6%B5&amp;amp;op=translate" target="_blank" rel="noopener"
>Hear it here via Google Translate&lt;/a>)&lt;/p>
&lt;p>It was chosen by my mom, and it means a lot to both of us.&lt;/p>
&lt;p>&lt;a class="link" href="https://en.wiktionary.org/wiki/%E6%80%9D" target="_blank" rel="noopener"
>&amp;ldquo;思&amp;rdquo;&lt;/a> means “to think” or “thought,” and &lt;a class="link" href="https://en.wiktionary.org/wiki/%E6%B6%B5" target="_blank" rel="noopener"
>&amp;ldquo;涵&amp;rdquo;&lt;/a> means “to forgive,” “to tolerate,” or “to be lenient.”&lt;/p>
&lt;p>My mom once told me she had lived through a lot of anger and intolerance, and she hoped I’d grow into someone who thinks before speaking, and meets the world with calm and grace.&lt;/p>
&lt;p>I still think about that often. And I hope to live up to the name.&lt;/p>
&lt;hr>
&lt;p>Thanks for stopping by!&lt;/p></description></item><item><title>Archives</title><link>https://blog.sihanwei.org/archives/</link><pubDate>Tue, 28 May 2019 00:00:00 +0000</pubDate><guid>https://blog.sihanwei.org/archives/</guid><description/></item><item><title>Links</title><link>https://blog.sihanwei.org/links/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://blog.sihanwei.org/links/</guid><description>&lt;p>Here are some links I found really helpful or just plain cool. Hope you enjoy them too!&lt;/p></description></item><item><title>Search</title><link>https://blog.sihanwei.org/search/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://blog.sihanwei.org/search/</guid><description/></item><item><title>Subscribe</title><link>https://blog.sihanwei.org/subscribe/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://blog.sihanwei.org/subscribe/</guid><description>&lt;h3 id="subscribe-to-this-blog">Subscribe to This Blog
&lt;/h3>&lt;p>Stay updated with the latest posts from this blog via RSS!&lt;/p>
&lt;ul>
&lt;li>&lt;strong>RSS Feed URL&lt;/strong>: &lt;a class="link" href="https://blog.sihanwei.org/index.xml" >https://blog.sihanwei.org/index.xml&lt;/a>&lt;/li>
&lt;li>&lt;strong>How to subscribe&lt;/strong>:&lt;br>
Copy the feed URL above and paste it into your favorite RSS reader.&lt;/li>
&lt;/ul>
&lt;h3 id="recommended-rss-readers">Recommended RSS Readers
&lt;/h3>&lt;ul>
&lt;li>&lt;a class="link" href="https://feedly.com/" target="_blank" rel="noopener"
>Feedly&lt;/a> – Clean, cloud-based, and free&lt;/li>
&lt;li>&lt;a class="link" href="https://inoreader.com/" target="_blank" rel="noopener"
>Inoreader&lt;/a> – Power-user friendly with automation&lt;/li>
&lt;li>&lt;a class="link" href="https://miniflux.app/" target="_blank" rel="noopener"
>Miniflux&lt;/a> – Minimalist and self-hostable&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>RSS is a simple, privacy-friendly way to follow blogs — no email or accounts required.&lt;br>
You’ll always be the first to know when a new post drops!&lt;/p></description></item></channel></rss>