Just something small and tiny.

Held by Ryan Ruoshui Yan.

Here is the world of a nerdy guy.


Neural Network & Machine Learning Notes

Notes for CSC321

Disclaimer: The notes below are fully/partially NOT created by myself. They are from slides and/or wikipedia and/or textbook. The purpose of this post is simply to learn and review for the course. If you think something is inappropriate, please contact me at “ryan_yrs [at] hotmail [dot] com” immediately and I will remove it as soon as possible.

Design Choices Learned

  • Task
    • Regression
    • Binary Classification
    • Multipway Classification
  • Model/Architecture
    • Linear
    • Log-linear
  • Loss Function
    • Squared Error
    • 0-1 Loss
    • Cross-Entropy
    • Hinge Loss
  • Optimization Algorithm
    • Direct Solution
    • Gradient Descent
    • Perceptron

Type of Machine Learning

  • Supervised Learning
    • Task
    • Model
    • Input
    • Target
    • Label
    • Training Set
    • Test Set
    • Image Classification
    • Object Recognition
    • Localization
    • Semantic Segmentation
    • Machine Translation
    • Sequence-to-Sequence Learning
    • Speech Recognition
    • Caption Generation
  • Reinforcement Learning
    • Reward signal
    • Agent
    • Environment
    • Action
    • State
    • Policy
    • Adversary
  • Unsupervised Learning
    • Distribution Modeling
    • generate from
    • Latent Structure (?)
    • Clustering
    • Cluster

Problem Setup

  • Model (Architecture)
  • Hypothesis
  • Weight
  • Bias
  • Data Space (Input Space)
  • Weight Space
  • Loss Function: L(y,t)
  • Squared Error: L(y,t) = (y-t)^2/2
  • Residual: y-t
  • Cost Function

Direct Solution

  • Set partial derivatives to 0, and solve for params.
  • Critical Point: A point where partial derivatives are 0.


  • Rewrite linear regression model, as well as both solution methods, in terms of operations on matrices and vectors.
  • Closed-Form Solution: An exact solution which we can express with a formula.

Feature Mapping

  • For non-linear
  • Polynomial Regression


  • Generalization
  • Underfitting
  • Overfitting
  • Hyperparameter
    • Values that one cannot include in training procedure itself, but which one needs to set using some other means.
    • e.g. Degree of polynomial.
  • Training Set
  • Validation Set
  • Test Set


  • Predictin a discrete-valued target

Binary Classification

Predicting a binary-valued target.


  • Predict a binary target t \in {0,1}
  • Positive Example: t = 1
  • Negative Example: t = 0 // Not “-1”!!!

WLOG (Without Loss of Generality 不失一般性)

Linear Separation

  • If training examples can be separated by a linear decision rule, they are LINEARLY SEPARABLE.
  • Proof Idea: Cannot be both A and B.

Feasible Region

  • Region that satisfies all constraints .
  • If this region is non-empty, problem is FEASIBLE.


  • A set S is convex if any line segment connecting points in S lies entirely within S.
    • x_1, x_2 \in S ==> λx_1 + (1-λ)x_2 \in S, 0 <= λ <=1.
  • A simple inductive argument shows that for x_1, …, x_N \in S, WEIGHTED AVERAGES, or CONVEX COMBINATIONS, lie within set.
    • λ_1 x_1 + … + λ_N x_N \in S for λ_i > 0, λ_1 + … + λ_N = 1

Proof that XOR is not linearly separable (?)

  • Half-space are obviously convex.
  • Suppose there were some feasible hypothesis. If positive examples are in positive half-space, then green line segment must be as well.
  • Similarly, red line segment must line within negative half-space.
  • But intersection cannot lie in both half-spaces. Contradiction!

0-1 Loss (Classification Error Loss)

  • L_{0-1}(y,t) = (0 if y = t, 1 if y != t) = l_{y!=t}

Surrogate Loss Function

  • Replace loss function we care about with one which is easier to optimize.
  • y = w^T x + b
  • L_SE (y,t) = (y-t)^2 / 2
  • Problem: Large residual

Logistic Activation Function

  • A kind of Sigmoidal, or S-shaped function.


  • A linear model with a logistic non-linearity.

CE (Cross Entropy) Loss

  • L_CE (y,t) = -t log y - (1-t) log (1-y)
  • Loss derivative: dL_CE/dz = y-t

Hinge Loss

  • L_H (y,t) = max(0, 1 - t * y)
  • where t \in {-1, 1}

SVM (Support Vector Machine)

  • A linear model with hinge loss.
  • y = w^T * x + b
  • L_H = max(0, 1 - ty)

Multiclass Classification

  • Targets form a discrete set {1, …, K}
  • It is often more convenient to represent as indicator vectors.
  • If a model outputs a vector of class probs, we can use CE as loss function, where log is applied elementwise.

Indicator Vector (1-of-K encoding)

  • t = (0, …, 0, 1, 0, …, 0)

Weight Matrix

  • D input dimensions
  • K output dimensions
  • K * D weights, which we arrange as a WEIGHT MATRIX W.

Softmax Function 歸一化指數函數

  • A multi-variable generalization of logistic function.
  • Outputs are positive and sum to 1.
  • if one of z_k ‘s is much larger than others, softmax(z) is approxly argmax.


  • A set S is convex if line segment connecting any 2 points in S must lie within S.
  • All critical points are minima.
  • Gradient descent finds optimal solution.
  • In data space, pos & neg regions are both convex.
    • => If inputs x^(1), …, x^(N) are all in pos region, then any weighted avg must also be in pos region. Similarly fo neg region.
  • In weight space, feasible region is convex.

Weighted Average

  • A weighted average of points x^(1), …, x^(N) is a point given by linear combination:
    • x^(avg) = λ_1 * x^(1) + … + λ_N * x^(N)
    • where 0 <= λ_i <= 1 and λ_1 + … + λ_N = 1

Finite Difference

  • Check derivatives numerically by plugging in a small value of h.

Gradient Checking


  • Input Space
  • Weight Space

Input Space (Data Space)

  • Each DATA CASE corresponds to
    • a vector.
  • A CLASSIFIER corresponds to
    • a decision boundary, or
    • a hyperplane, such that
      • positive examples lie on one side, and
      • negative examples lie on other side.

Weight Space

  • Each set of classification WEIGHTS corresponds to
    • a vector.
  • Each TRAINING CASE corresponds to
    • a constraint in this space,
  • where some regions of weight space are classified
  • and some regions are classified


  • Positive Region
    • Where points are classified as positive
  • Negative Region
    • Where points are classified as negative

Binary Linear Classifier

  • Binary
    • Distinguish between 2 categories
  • Linear Classification is done using a linear function of inputs.
  • Input Dimension (Feature) D Summarize important info for classification.
  • x_j^(i)
    • jth feature for ith training example.
  • x^(i)
    • A vector formed by concatenating all features for a given training case.
  • Target
    • What we are trying to predict.
  • Class
    • 2 psb values that a binary target taken.
    • Positive: Positive examples
    • Negative: Negative examples
  • Training Set
    • A set of N pairs (x^(i), t^(i))
    • x^(i) is input
    • t^(i) is label (binary-valued target)
  • Label
    • Binary-valued target
    • Labeled examples
  • Model
    • Determines how predictions are computed from inputs.
  • Threshold
  • Function
    • w_1 * x_1 + … + w_D * x_D + b => w^T * x + b
      • w: Weight Vector
      • b: Bias (scalar-valued)
    • z = w^T * x + b
    • y = (1 if z >= r), (0 if z < r)
    • For y = 1
      • w^T * x + b - r >= 0
      • We reset b = b - r and r = 0
      • Then
      • z = w^T * x + b
      • y = (1 if z >= 0), (0 if z < 0)

Dummy Feature

  • To eliminate bias,
  • Add another input dimension x_0, which always takes value 1.
  • w_1 * x_1 + … + w_D * x_D
    • = w_0 * x_0 + w_1 * x_1 + … + w_D * x_D
    • = w_0 + w_1 * x_1 + … + w_D * x_D
  • So w_0 effectively plays role of a bias. We can then simply write
    • z = w^T * x



Perceptron 感知器

  • Positive: 1
  • Negative: -1 // Not “0”!!!
  • Classification Model
    • z = w^T * x
    • y = (1 if z >= 0), (-1 if z < 0)

Perceptron Algorithm 感知器算法

  • Examine each of training cases on at a time.
  • For each input x^(i), we compute prediction y^(i) and see if it matches target t^(i).
  • If prediction is correct, we do nothing.
  • If it is wrong, we adjust weights in a direction that makes it more correct.
  • Criterion
    • z^(i) * t^(i) > 0
  • How to update?
    • w’ = w + αx
    • w’: New weight vector
    • w: Old weight vector
    • α: 1 for pos case and -1 for neg case

Perceptron Learning Rule

  • For each training case (x^(i), t^(i)),
z^(i) <- w^T * x^(i)
if z^(i) * t^(i) <= 0
    w <- w + t^(i) * x^(i)

Feature Representation

  • z = w^T * φ(x) + b
  • where φ(x) = φ_1(x), …, φ_D(x) is a function mapping input vectors to feature vectors.

MLP (Multilayer Perceptron)

  • Output units
    • y = f(x) = φ(Wx + b)
  • A multilayer network consisting of fully connected layers.
  • No relationship with perceptrons.
  • y = f^(L)◦…◦f^(1) (x)

DAG (Directed Acyclic Graph)

Feed-Forward Neural Network

  • Layers
    • Input layer
    • First hidden layer
    • Second hidden layer
    • Output layer
  • Any boolean circuit can be translated into a feed-forward neural net.

Recurrent Neural Network


  • Fully connected layer

Activation Function

  • Linear
    • y = z
  • ReLU
    • y = max(0,z)
  • Soft ReLU
    • y = log(1+e^z)
  • Hard Threshold
    • y = (1 if z > 0) (0 if z <= 0)
  • Logistic
    • y = 1 / (1 + e^{-z})
  • Hyperbolic Tangent (tanh)
    • y = (e^z - e^{-z}) / (e^z + e^{-z})

Feature Learning

  • Each first-layer hidden unit computes σ(w^T_i x)
  • Feature: Weight vector
  • It is reshaped into an image, with
    • gray = 0
    • white = +
    • black = -

Linear Layer

  • Any seq of linear layers can be equivalently represented with a single linear layer.
  • y = W’ x
    • where W’ Δ= W^(3) W^(2) W^(1)

ReLU (Rectified Linear Unit)

  • f(t)= mex(0,t)
  • Works well if one is careful: better than others.
  • Cheap to compute
  • CON: if t is too small: No gradient at all

Universal Approximator

  • Approximate any function arbitrarily well.
  • Limit
    • May need to represent an exponentially large network
    • May overfit
    • Need a compact representation

Univariate Chain Rule

Topological Ordering

Backpropagation 反向傳播算法

  • The algorithm repeats a two phase cycle, propagation and weight update. When an input vector is presented to the network, it is propagated forward through the network, layer by layer, until it reaches the output layer. The output of the network is then compared to the desired output, using a loss function, and an error value is calculated for each of the neurons in the output layer. The error values are then propagated backwards, starting from the output, until each neuron has an associated error value which roughly represents its contribution to the original output.
  • 用來訓練人工神經網絡的常見方法。該方法計算對網絡中所有權重計算損失函數的梯度。這個梯度會反饋給最優化方法,用來更新權值以最小化損失函數。
  • 反向傳播要求有對每個輸入值想得到的已知輸出,來計算損失函數梯度。

Error Signal

  • ybar: Denote derivative dL/dy.
  • This emphasizes that error signals are just values our program is computing.

Full Backpropagation Algorithm

  • Forward Pass
  • Backward Pass

Gradient Descent 梯度下降法

  • Gradient descent updates are opposite gradient direction.

Level Set (Contour)

  • Sets of points on which C(θ) is const


  • Vector of partial derivatives
  • ∇ _θ C = ∂C/∂θ = (∂C/∂θ_1, ∂C/∂θ_2)
  • Points in direction of max increase
  • Orthogonal to level set.

Weight Space Symmetry

  • We can permute all hidden units in a given layer and obtain an equivalent solution.

Local Minima

Saddle Point

  • ∂C/∂θ = 0
  • If we’re exactly on the saddle point, then we’re stuck.
  • If we’re slightly to the side, then we can get unstuck.


  • A flat region.
  • e.g.
    • Saturated Unit
    • 0 - 1 loss
    • Hard threshold activations
    • Logistic Activation & Least Squares.

Saturated Unit

Dead Unit

Ravine 山溝

Batch Normalization

  • Explicitly centres each hidden activation.

Momentum 動量

  • A simple and highly effective method.
  • p <- μp - α ∂C/∂θ
  • θ <- θ + p
  • α: Learning rate
  • μ: Damping Param, slightly less than 1.
    • Cannot be 1, because then the conservation of energy implies it will never settle down.
  • If gradient is const (i.e. cost surface is a plane), the params will reach a terminal velocity of
    • (α / (1 - μ)) (∂C/∂θ)
    • If one increases μ, α should be lowered to compensate.

Second-Order Optimization

  • Explicitly uses curvature info.


  • An optimization procedure which uses just a little bit of curvature info.

Batch Training

  • Summing over ALL training examples that is required by computing gradient.
  • Impractical with large dataset.

SGD (Stochastic Gradient Descent)

  • Update params based on gradient for a single training eg.
  • Can make significant progress before it has even looked at all data.
  • An unbiased estimate of batch gradient when sampling a training eg at random.
    • SGD


  • Compute gradients on a medium-sized set of training examples.
    • Its size S is a hyperparam that needs to be set.
      • Too large: Takes more memory to store activations, and longer to compute each gradient update.
      • too smal: can’t exploit vectorization.
      • A reasonable value might be S = 100.

Learning Rate (α)

  • A hyperparam
  • Typically btw 0.001 to 0.1



Set of models averages prediction on test data.


Train networks on diff subsets of training data.

Stochastic Regularization

Regularizing a network by injecting noise into computations.


  • A stochastic regularizer which randomly deactivates a subset of units.
  • i.e. Sets their activations to 0.
  • h_i = (phi(z_i) with prob 1-rho), (0 with prob rho_
    • Where rho is a hyperparam.
  • Equivalently, h_i = m_i phi(z_i)
    • Where m_i is a Bernoulli random var, independent for each hidden unit.
  • Backprop Rule    - zbar_i = hbar_i m_i phi’(z_i)

Generative Approach

Observation Model

Represented as p(a|s), which tells us how likely sentence s is to lead to acoustic signal a.


Represented as p(s), which tells us how likely a given sentence s is.

Posterior Distribution

p(s|a) = (p(s)p(a|s)) / (sum_s’ p(s’) p(a|s’))

Language Modeling

Chain Rule of Conditional Probability

p(s) = p (w_1, …, w_T) = p(w_1) p(w_2|w_1) … p(w_T|w_1, …, w_(T-1))


Distribution over next word only depends on preceding few words.

Distributed Representation

Info about a given word is distributed thruout representation.

Negative Sampling

  • Model is given pairs of words, and needs to distinguish between:
  • Real: 2 words actually occur near each other in training corpus.
  • Fake: 2 words are sampled randomly from training corpus.

Structure Sharing


  • If image or waveform is transformed slightly, classification shouldn’t change.



  • aka Kernel

Zero Padding

  • Treat signals as infly large by treaing values as 0 everywhere else.

Convolution 捲積

  • x*w
  • (x*w)[t] = \sum_τ x[t-τ] w[τ]
  • (x*w)[s,t] = \sum_{σ,τ} x[s-σ,t-τ] w[σ,τ]
  • e.g.
    • Blur
      • 0 1 0 1 4 1 0 1 0
    • Sharpen
      • 0 -1 0 -1 8 -1 0 -1 0
    • Detect Edges
      • 0 -1 0 -1 4 -1 0 -1 0
    • Vertical        - 1 0 -1 2 0 -2 1 0 -1
  • Properties
    • Commutativity
    • Associativity
    • Linearity
    • Equivariance
  • Any function computed by a convolution layer can be computed by a fully connected layer.


  • Signal x*w is composed of multiple copies of x
  • Translated and scaled by various amounts according to entries of w.


  • Generate each of entries of x*w by flipping w, shifting it, and taking dot product with x.

Sobel Filter

Linear Rectification


Convolution Layer

  • Don’t actually compute convolutions, but a closely related option called filtering.
  • (x*w)[t] = \sum_τ x[t+τ] w[τ]

Sparse Connectivity

  • Not every input unit is connected to every output unit.

Weight Sharing

  • Network’s weights are each shared between multiple connections.

Pooling Layer

  • Summarizes feature maps of prev layer by computing a simple function over small regions of image.


Pooling Group

Stride S

Receptive Field

Sequence-to-Sequence Prediction

If inputs and outputs are both sequences.


Its predictions don’t depend on anything before context window.


Artificial Neural Network 人工神經網絡

Nearest Neighbour Classification

KNN (K-Nearest Neighbors Algorithm 最近鄰居法)

Euclidean Distance

  • “L2 Distance”
  • |x^(1) - x^(2)| = sqrt(Σ_i(x_i^(1) - x_i^(2))^2)

L-Infinity Distance

  • |x^(1) - x^(2)| = max_i |x_i^(1) - x_i^(2)|

Validation Set

  • To determine K for KNN.
  • Which is separate from both training and test set.
  • Select K for best performance on validation set.
  • But report results on test set.
  • Generally, performance on validation set will be better than on test set.


  • Large K
    • Relatively simple boundary, no small islands in data.
    • Small changes in x do not generally change label.
  • Small K
    • A complex boundary btw labels.
    • Small changes in x often change label.

MAP Estimation (Maximum A Posteriori Estimation 最大後驗概率)

  • An estimate of an unknown quantity, that equals the mode of the posterior distribution.
  • 後驗概率分佈的眾數。利用最大後驗概率估計可以獲得對實驗數據中無法直接觀察到的量的點估計。它與最大似然估計中的經典方法有密切關係,但是它使用了一個增廣的優化目標,進一步考慮了被估計量的先驗概率分佈。所以最大後驗概率估計可以看作是規則化的最大似然估計。

Bayesian Inference 貝葉斯推斷

  • A method of statistical inference in which Bayes’ theorem is used to update the probability for a hypothesis as more evidence or information becomes available.
  • 其通過某些觀察的值來確定某些假設的概率,或者使這些概率更接近真實值。

Decision Boundary (Decision Surface)

  • Boundary btw pos/neg regions.
  • Set where w^T * x + b = 0
  • A hypersurface that partitions the underlying vector space into two sets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class.

Sigmoid Function S函數

  • σ(t) = 1/(1+exp(-t))
  • CON
    • σ’(t) is very small for t outside of t \in [-5, 5]
    • σ(t) is always positive


  • tanh(t) = (1 - exp(-2t))/(1 + exp(-2t))
  • = 2σ(2t) - 1
  • Not always positive
    • No problem with all weights having to move in same direction
    • Advantage over sigmoid


A group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).

Softmax Function 歸一化指數函數

  • 邏輯函數的一種推廣。它能將一個含任意實數的K維的向量z的“壓縮”到另一個K維實向量σ(z)中,使得每一個元素的範圍都在(0, 1)之間,並且所有元素的和為1。

Weight Initialization

  • Extremely important for Multiplayer Neural Networks.
  • Not all zeros
  • Small random numbers
  • Heuristic: Random numbers that depend on # incoming weights.
  • Can set biases to 0.

Hidden Layer Unit as Features

  • Once we train neural network, values units in hidden layer should be useful for computing output units.
  • Weights W^0 btw input layer and hidden layer are such that hidden units are useful.
  • Think of hidden units as features of data: summaries of data that are useful for computing outputs.
  • In networks with no hidden layer, we simply compute as many features as there are outputs. So features should look like inputs that we are looking for.


  • Training data contains info about regularities in mapping from input to output. But it also contains noise.
    • Target values may be unreliable.
    • There is SAMPLING ERROR. There will be accidental regularities just cuz of particular training cases that were chosen.
  • When we fit model, it cannot tell which regularities are real and which are caused by sampling error.
    • So it fits both kinds of regularity.
    • If model is very flexible it can model sampling error really well. This is a disaster.
  • How to prevent?
    • Use a model that has right capacity.
      • Enough to model true regularities.
      • Not enough to also model spurious regularities (Assuming they are weaker).
    • Fitting curves in 2D.
      • Only fit lines, not higher-degree polynomials.
      • Only fit quadratics, not higher degree polynomials.

Weight Decay

  • Limiting size of weights
  • Weight-decay involves adding an extra term to cost function tht penalizes squared weights.
  • Keeps weights small unless they have big error derivatives.
  • It prevents network from using weights that it does not need.
  • For L2 penalty, if network has 2 very similar inputs it prefers to put half weight on each rather than all weight on one.


  • Training Data
    • Learning params of model.
  • Validation Data
    • Not used of learning but is used for deciding what type of model and what amount of regularization works best.
  • Test Data
    • Used to get a final, unbiased estimate of how well network works. We expect this estimate to be worse than on validation data.


Resample (with replacement) from training set: abcde -> accdds


Fit model one at a time. Re-weight each training case by how badly it is predicted by models already fitted.


Dot product btw 2D array.

Receptive Field

Area of lower layer to which neuron is connected.

Convolution 捲積

  • Operation of computing feature layer from lower layer.
  • 通過兩個函數f和g生成第三個函數的一種數學算子,表徵函數f與經過翻轉和平移的g的乘積函數所圍成的的曲邊梯形的面積。

Sobel Operator 索貝爾算子

  • Used in image processing and computer vision, particularly within edge detection algorithms where it creates an image emphasising edges.
  • 圖像處理中的算子之一,主要用作邊緣檢測。在技術上,它是一離散性差分算子,用來運算圖像亮度函數的梯度之近似值。在圖像的任何一點使用此算子,將會產生對應的梯度矢量或是其法矢量。

Blob Detection

Aimed at detecting regions in a digital image that differ in properties, such as brightness or color, compared to surrounding regions.

Convolutional Neural Network 捲積神經網絡

  • A type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex.
  • 一種前饋神經網絡,它的人工神經元可以響應一部分覆蓋範圍內的周圍單元,對於大型圖像處理有出色表現。

Deep Dream

  • Want to exaggerate details in image that look a little bit like recognizable objects.
  • Pick a layer in ConvNet.
  • It will have some neurons that are highly activated.
  • There is a trade-off here: we cannot make ALL of them be even more highly activated simultaneously.
  • Idea: Make rich get richer. Change input x with most highly activated neurons influencing change in input most.

RNN (Recurrent Neural Network 遞歸神經網絡)

Likelihood Function L(θ)

Prob of observed data, as a function of θ.

Log-Likelihood Function l(θ) = log L(θ)

Maximum Likelihood Criterion

Choose param θ which maximizes l(θ)

Empirical Mean

MLE of mean of a normal distribution is simply mean of observed values, or empirical mean.

Prior Distribution p(θ)

Everything you believed about params before looking at data.

Likelihood p(D|θ)

Prob of observations given params.

Posterior Distribution p(θ|D)

Beliefs about params after observing data.

Posterior Predictive Distribution

Distribution over future observables given past observations.

Uninformative Prior

Data Overwhelm Prior

More data we observe, less uncertainty there is about param, and more likelihood comes to dominate.

Beta Distribution

Conjugate Prior


Weighted Average


Artificial Intelligence Notes

Notes for CSC384Disclaimer: The notes below are fully/partially NOT created by myself. They are from slides and/or wikipedia and/or textbook. The purpose of this post is simply to learn and review for the course. If you think something is inapprop...…


Prolog & SWI-Prolog Notes

Disclaimer: The notes below are fully/partially NOT created by myself. They are from slides and/or wikipedia and/or textbook. The purpose of this post is simply to learn and review for the course. If you think something is inappropriate, please co...…