computer vision

these notes are total confusion but hopefully someday i’ll come to understand them.


Lecture 2


Lecture 3 (1/13)

Introduction to the Fourier Transform
But what is the Fourier Transform? A visual introduction. - YouTube
signal processing - Is the convolution an invertible operation? - Mathematics Stack Exchange
The Fast Fourier Transform Algorithm - YouTube

Invertible Convolutions - Emiel Hoogeboom

Canny edge detector - Wikipedia

The Berkeley Segmentation Dataset and Benchmark - human-annotated boundaries provide an objective measure of consistency; the machine algorithm's output is compared to the human-annotated output to measure effectiveness

next time: more advanced edge detectors

Lecture 4

second moment matrix

  • look at eigenvalues of 2nd moment matrix
    • look at the 2 directions along which the intensity changes most rapidly
    • can characterize a local patch in terms of eigenvalues of M
      • in a flat region they will be small
      • in an edge, stronger variation along one direction than the other
      • at corner, both are large and of comparable magnitude
    • we have a measure of how ‘corner-like’ a local point in the image is

    • R = det(M) - alpha * trace(M)^2
      • a summary "cornerness" function
      • at an edge, R < 0; at a corner, R > 0; in a flat region, |R| is small
  • analogous to canny edge detector
  • take image derivatives
  • there is a windowing function (gaussian filter)
  • non maxima suppression
  • want difference at a PATCH, not at a particular point
  • choosing the scale of the window function, we choose the scale over which we are detecting corners
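A quick plain-Python sketch of the cornerness score above, assuming we already have the 2x2 second moment matrix M for a patch (the alpha value, the threshold eps, and the names harris_response / classify are my own, illustrative choices):

```python
def harris_response(M, alpha=0.05):
    """Cornerness score R = det(M) - alpha * trace(M)^2 for a 2x2
    second moment matrix M = [[a, b], [b, c]]."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    det = a * c - b * b
    trace = a + c
    return det - alpha * trace ** 2

def classify(R, eps=1e-3):
    if R > eps:
        return "corner"    # both eigenvalues large and comparable
    if R < -eps:
        return "edge"      # one eigenvalue dominates
    return "flat"          # both eigenvalues small

# flat region: tiny gradients everywhere
print(classify(harris_response([[0.001, 0], [0, 0.001]])))  # flat
# edge: strong variation along one direction only
print(classify(harris_response([[10.0, 0], [0, 0.01]])))    # edge
# corner: large, comparable eigenvalues
print(classify(harris_response([[10.0, 0], [0, 8.0]])))     # corner
```

In practice M is accumulated from image derivatives under the Gaussian window described above.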

invariance and covariance

  • affine intensity change
  • rotation, translation

Since the corner operator is computed at a fixed scale, the choice of interest points is not covariant to scaling. We are picking a scale; if we upsample or downsample the image, something that is a corner at one scale is not the same at the other -> lack of covariance -> run the corner detector at multiple scales and, if there is a scale change, account for those differences somehow

orientation normalization

if we want to design descriptors that are invariant to orientation, local rotation

  • suppose we have a descriptor of a patch with edge structure
  • capture a histogram of gradient orientations at this local patch
  • see what orientations are prevalent
  • index this histogram starting from the dominant orientation, so that the descriptor is invariant to local orientation
    • append our record of where we started indexing

maximally stable extremal regions

  • characterising local interest points and making local descriptors of them
  • MSER
  • etc

local features

image representations

for an interest point detector, how to build meaningful descriptors? -> histogram things by color, texture, SIFT descriptors


  • consider whether we want joint or marginal histograms
    • marginal: L, a, b color channels; textons
      • requires independent features
    • joint: ab color space; do binning over the 2d color space

lecture 5

feature descriptors

Scale-Invariant Feature Transform (SIFT)

taking a local image patch and creating a descriptor for it

  • SIFT vector formation
    • for an 8x8px image
      • bins orientation energy within a cell
      • one cell of the descriptor covers a 4x4 pixel array in the image
      • building some invariance to small local translations / orientations
    • gaussian smoothing function
    • threshold gradient magnitudes to avoid excessive influence of high gradients
  • interest point operator with this descriptor
    • add scale invariance
    • convolve with a gaussian smoothing filter
    • convolve again and again with the gaussian smoothing filter
    • take the difference between 2 neighboring smoothed images
    • look for extrema in this differenced space
    • to detect spatial locations where there is some unique structural change w scale
    • repeat for a smoothed and subsampled version of the image
    • to choose interest point locations, take the set of difference of gaussian responses in both x and y; look for xy scale locations that are local extrema
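A toy 1-D sketch of the difference-of-Gaussians step, not the full SIFT pipeline: smooth a signal at two nearby scales, subtract, and locate the extremum of the response (the helper names and sigma values are my own choices):

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian kernel."""
    radius = radius or max(1, int(3 * sigma))
    k = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def smooth(signal, sigma):
    """Convolve with a Gaussian, clamping indices at the borders."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(k):
            idx = min(max(i + j - r, 0), len(signal) - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

# a step edge: the DoG response is strongest near the transition
signal = [0.0] * 10 + [1.0] * 10
dog = [a - b for a, b in zip(smooth(signal, 1.0), smooth(signal, 1.6))]
peak = max(range(len(dog)), key=lambda i: abs(dog[i]))
print(peak)  # an index adjacent to the step between positions 9 and 10
```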

scale invariance - to downsample is to smooth with a gaussian and then subsample. if you feed in differently scaled versions of the image and apply this process to each of them, some level of the smoothing process of version 1 will approximately align with the smoothing of version 2. the hope is that you find corresponding locations as 'peaks' in the gaussian scale space and are able to pick out the same relative locations, i.e. meaningful interest points; the detection process mirrors a re-scaling process for the image

the descriptor itself is very important: the strategy of binning matters more than the interest point detector

log polar binning - the shape context descriptor. we want to describe edge energy, or any pixelwise measure, at some point in the image. in the log-polar binning strategy, bins are spaced equally in angle and logarithmically in radius: very sensitive to fine scale structure near the location being described, and robust to deformations far from the point you are describing (less so for the closer ones). this captures object shape across multiple scales and makes it robust to different deformations

what should you be histogramming? if you want to match higher level structure, use a pairwise patch similarity measure; edges are replaced with a correlation measure between local patches. for the same structure rendered in different textures, this factors out the effect of texture within an image: measure pairwise correlation between patches, and the correlation surfaces look similar

the shape context histogramming strategy gives us good correspondence between similar points on what we perceive as the same object

local descriptors

want something robust to a set of deformations, and distinctive so that it matches only to the correct corresponding descriptors in different images. sobel is a building block; also color, self-similarity


simplest approach: use descriptors to find correspondence between 2 different views of the same object. could just pick the nearest neighbor - caveat: lots of photos have repeating structures

  • considering distance ratios
  • want to find parameters of a model that best fit the data
  • and an alignment: the parameters of the transformation that best align matched points
  • must design a correspondence-to-parameter objective
    • if we have a setting of parameters for the model of interest, how well does it agree with our data (matches)?
    • an optimization method to recover parameters quickly

fitting and alignment

  • global optimization
    • least squares, other
  • hypothesize and test

global optimization

line fitting: given some datapoints, we want to put a line through them in closed form; write down a cost function that minimizes the SSD between the locations of points predicted by the model and the observed points
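The closed-form least squares line fit can be sketched directly from the normal equations (fit_line is a hypothetical helper; it minimizes vertical SSD, as described above):

```python
def fit_line(points):
    """Closed-form least squares fit of y = m*x + b, minimizing the
    sum of squared vertical distances to the points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

m, b = fit_line([(0, 1), (1, 3), (2, 5), (3, 7)])  # points exactly on y = 2x + 1
print(m, b)  # 2.0 1.0
```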

other ways to search for parameters

  • grid search
  • gradient descent

multi-stable perception? multiple candidate interpretations, for example, the necker cube

the hough transform

Hough transform - Wikipedia
Image Transforms - Hough Transform
How Hough Transform works - YouTube

take our model and create a discrete set of parameter values (discretize the parameter space). 'vote' for sets of parameters: create a histogram of votes over model parameters; for each potential feature match, a vote is contributed to a point or multiple points in the parameter grid. after the voting process, find local maxima in the grid -> these become the hypotheses for model parameters that we go back and check

a point in (x, y) space corresponds to a line in hough space: any value of m -> a value of b. discretize m and b and accumulate how many times each cell is voted for -> a concentration of votes at the correct values of m and b

if parameterizing by orientation theta, use a polar representation for the parameter space: the parameter space is bounded in (rho, theta)
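A minimal voting sketch in the (rho, theta) parameterization above, rho = x*cos(theta) + y*sin(theta) (bin sizes and names are my own choices):

```python
import math

def hough_lines(points, n_theta=180, rho_step=1.0):
    """Vote in a discretized (rho, theta) grid; each point votes for every
    line through it: rho = x*cos(theta) + y*sin(theta)."""
    votes = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            bin_ = (round(rho / rho_step), t)
            votes[bin_] = votes.get(bin_, 0) + 1
    return votes

# five points on the vertical line x = 2 (theta = 0, rho = 2)
pts = [(2, y) for y in range(0, 50, 10)]
(rho_bin, t_bin), count = max(hough_lines(pts).items(), key=lambda kv: kv[1])
print(rho_bin, t_bin, count)  # 2 0 5
```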

for circles, or shapes generally: choose a center reference point in the shape; at every edge point, consider the tangent direction and vote for candidate center locations at the appropriate distance from this tangent. you can create any parameter space and voting scheme

cost - how many data items need to cast votes, and how many votes does each cast?


random sample consensus (RANSAC): handles robustness to outliers, and multiple valid hypotheses we need to detect


  1. randomly sample the number of points required to fit the model
  2. solve for model parameters using samples
  3. score by the fraction of inliers within a preset threshold of the model
    1. how many of our remaining data points agree with this hypothesis?
  4. repeat 1-3 until best model is found with high confidence
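The four steps above, sketched for line fitting in plain Python (the threshold, iteration count, and names are illustrative):

```python
import random

def ransac_line(points, n_iters=200, thresh=0.5, seed=0):
    """Robust fit of y = m*x + b: sample 2 points, solve, score by inliers."""
    rng = random.Random(seed)
    best = (None, -1)  # (model, inlier count)
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # degenerate sample, skip
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        # step 3: count points within the preset threshold of this model
        inliers = sum(1 for x, y in points if abs(m * x + b - y) < thresh)
        if inliers > best[1]:
            best = ((m, b), inliers)
    return best

pts = [(x, 2 * x + 1) for x in range(10)] + [(3, 40), (7, -25)]  # 2 gross outliers
(m, b), n_in = ransac_line(pts)
print(round(m, 3), round(b, 3), n_in)  # recovers y = 2x + 1 with 10 inliers
```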

if we want to find model parameters that are consistent with inliers and not outliers, then step 1 will more likely sample inliers, and we get models fit mostly with inliers; other inliers will be consistent with those models, and outliers are eliminated

good - robust to outliers

cost - time grows with fraction of outliers and number of parameters; not good for getting multiple fits

image alignment

  • estimate some form of geometric transformation


affine transformations

  • combinations of linear transformations and translations. properties:
  • map lines to lines
  • parallel lines remain parallel
  • angles and lengths are not preserved (though ratios along a line are)


using least squares, given candidate feature descriptors, solve for translation

if we have outliers, wrap RANSAC around the least squares solver; subsample fewer points, only as many as we need to estimate the transformation

how do you know how many u need?

or use the hough transform: initialize a grid of parameter values, each matched pair casts a vote, find the parameters with the most votes, then solve using least squares with the inliers

alternatively: iteratively refine your model for interest points, even when no actual descriptors are present

individual instance recognition: David G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints"


  • edge detection
  • interest point operator
  • feature descriptors (SIFT)
  • feature descriptor matching via approx nearest neighbors
  • hough transform

lecture 6

high dimensional nearest neighbors.

the two very common ways to approach this.

  • tree based search structure
    • an extension of bst to high dimensional space
  • hashing search structure
  • kd-tree: SIFT descriptors go into the leaf nodes of the tree
  • internal nodes are axis-aligned subdivisions
    • along whatever dimension
  • the tree splits the 2d search space into 2 halves; continue splitting, choosing axis-aligned x or y coordinates
  • compute the variance of the data in the current bin, along each axis-aligned coordinate
  • heuristic split decision: select the axis with greatest variance; choose the median as the split value and split
  • repeat the variance computation on the left and right halves - each split corresponds with the next level of the subtree
  • stopping condition: just 1 item left
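The split heuristic above (greatest-variance axis, median split) can be sketched as follows (build_kdtree is my own name; real implementations usually stop at larger leaf sizes):

```python
def build_kdtree(points):
    """kd-tree with a heuristic split: choose the axis of greatest
    variance, split at the median; leaves hold single points."""
    if len(points) <= 1:
        return {"leaf": points}
    dims = len(points[0])

    def var(axis):
        # variance of the data in this bin along one axis-aligned coordinate
        vals = [p[axis] for p in points]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)

    axis = max(range(dims), key=var)        # axis of greatest variance
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                     # median as split value
    return {
        "axis": axis,
        "split": pts[mid][axis],
        "left": build_kdtree(pts[:mid]),
        "right": build_kdtree(pts[mid:]),
    }

tree = build_kdtree([(0, 0), (1, 10), (2, 20), (3, 30)])
print(tree["axis"])  # 1: the y-coordinate has the greater variance
```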

aim is to find nearest neighbors

the tree might contain a collection of descriptors for objects in our database

with a new query point: our data items fall into bins. the 2d query point might be mapped into a bin whose descriptor lies along an edge in space, but there's a closer neighbor just over the edge of the subdivision line. ?? so i don't want to only look at the query point's own bin ??

the 128-dimensional rectangle. 1st fix - map a query point into its leaf node, then also check descriptors in any bordering rectangular subregion. In 3d, however, there could be a lot of bordering subregions; as the dimensionality goes up, we'd have to exhaustively check a large portion of the tree to be sure we have the exact NN

give up on always retrieving the exact nearest neighbor, accepting some probability of only retrieving a close neighbor. while descending the tree, keep a priority queue that ranks unexplored branches of the tree by distance to the query point. the distance metric (just along the coordinate axis) tells us the whole subtree B is at least distance d from the query point; if there is anything in A that is less than d away, you can hold off checking subtree B - if not, pop off B and explore it


priority queue - (D, d_1); (B, d_0). descend to a leaf node, enqueuing unexplored branches along with the best distance one could achieve in each. KD tree algorithm: how it works - YouTube; best bin first kd tree descriptors - Google Search

hashing search structure

locality sensitive hashing

  • instead of finding axis-aligned splits in our dataset, we throw down random hyperplanes (in 128 dimensions these are hyperplanes; in 2d they are lines); we use these to split up our data as a hashing function.

  • for each hyperplane/line, build up a hash code that maps points into number of bits = number of hyperplanes
  • map each point to some binary code that says which side of the plane the point is on
  • for all possible combinations of length-2 binary codes (for 2 hyperplanes), have a hash table of 4 bins - 00, 01, 10, 11; sort points into them
  • look for a points neighbors in its bin
  • problem - what if query point is just on the other side of a split, from its actual closest nearest neighbor?
    • bins are constructed randomly. with high probability, we hope that they bin actual nearest neighbors with each other
    • to increase the chances of that - build two hash tables
    • in the second one, we use a totally different set of splitting hyper planes
    • so for two hyperplanes, we build 3 different hash tables
    • given a query point, we match it to each set of hyperplanes independently, check its neighbors in all of the tables
    • will give a close neighbor, if not true NN -> good for feature matching
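A minimal sketch of the random-hyperplane hashing described above, for one table (the names and the Gaussian sampling of normal vectors are my own choices):

```python
import random

def lsh_codes(points, n_planes=2, seed=1):
    """Map each point to a binary code: bit i says which side of random
    hyperplane i (through the origin) the point falls on."""
    rng = random.Random(seed)
    dims = len(points[0])
    normals = [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(n_planes)]

    def code(p):
        bits = ""
        for n in normals:
            dot = sum(a * b for a, b in zip(n, p))
            bits += "1" if dot >= 0 else "0"
        return bits

    table = {}
    for p in points:
        table.setdefault(code(p), []).append(p)  # bucket by hash code
    return table, code

# nearby points tend to land in the same bucket; antipodal points never do
table, code = lsh_codes([(1.0, 1.0), (1.1, 0.9), (-1.0, -1.0)])
print(code((1.0, 1.0)) == code((1.1, 0.9)))    # usually True: tiny angle apart
print(code((1.0, 1.0)) == code((-1.0, -1.0)))  # False: every bit flips
```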

k bits in the hash code, and some number N of tables. another caveat - how do we choose the parameters of these hyperplanes?

  • given a collection of vectors (feature descriptors) {v_i}
  • subtract out the mean to center it
  • normalize data so descriptors are of unit length -> map to a circle around origin (the unit hypersphere surrounding the origin)
  • now just consider split hyperplanes thru the origin (described by angle thru origin, or the normal vector)

  • descriptors themselves change due to viewpoint or lighting …
  • so for computer vision, it must be sufficient to do approx. matching
    • some might not be correct, or might be outliers
    • after indexing and lookup - there will be a stage after the matching process, like a model fitting process (hough, ransac, robust least squares)
  • how many hyper planes? - application dependent
    • for a point a

the history of recognition prior to deep learning

  • there are 10000 - 30000 visual object level categories in the dictionary
    • imagenet has 1000 categories
    • segmentation w 100 categories
  • we have 1/10th of complexity in labels for distinct objs

  • hierarchies of obj categories from the wordnet hierarchy

image parsing/semantic segmentation

recognition is all about modeling variability

  • why study classical approaches -
    • context
    • there are some modeling strategies that are relevant irrespective of what set of ml techniques u layer on top of it
    • sift is good at matching exact replicas. what if we want adaptations to semantic level of variability to understand categories? this approaches human level capabilities
  • for a known shape, estimate some parameters such as camera, pose, illumination, that allows u to model the appearance of that particular shape
  • simplified visual worlds -> modeling simple geometric shapes, considering junctions & how appearance of objs in 2d rendering of a 3d model depends on camera, pose, illumination
  • idea of using complex objects as assembly of subcomponents
    • “generalized cylinders”
  • to extend this to multiple objects, breaking them down into geometric primitives
  • part based recognition
    • forsyth - human body as general shape primitives
    • zisserman 1995 - geometric models
    • ponce 1989
  • subsequent/in parallel: people trying to integrate statistical models of appearance variability
    • of faces: eigenfaces (Turk & Pentland 1991)
      • getting eigenvalue decomp of the space of faces
  • color histograms
    • building up feature descriptors based on color histograms


  • 1960s - early 1990s: geometric era
  • 1990s : appearance based models
  • 1990s- present: sliding window approaches
    • for every subregion of an image, ask/ build a predictor to tell if the subregion falls into a category. ask this question for this patch, treating patch as an image itself. ask for each adjacent patch until all possible subregions are checked
  • late 1990s: local features, SIFT and local feature design
    • for object instance recognition even under occlusion or geometric deformations
    • large-scale image search
      • for correlating multiple views of 3d structure.
      • building 3d model based on camera position & photographs
  • early 2000s: part based models
    • object as a set of parts, with relative locations between parts (a valid geometric arrangement/config of those parts - via a "spring" system; parts can move a bit depending on how tightly the spring couples them)
    • some probability dist of what the spatial distance is and angular relationship between parts. modeling pairwise interactions btwn parts
    • constellation models
      • fixated on deformation; focuses on a few parts
    • pictorial structure model
      • instead of modeling all pairwise combos btwn parts, we have a factorized model of how parts interact
      • fits well with articulated objects like the body (joints that connect each part)
      • primary motivation - to simplify the process of trying to figure out the optimal configuration of part geometry from some evidence of part appearance.
      • would facilitate efficient detection algos, since we look at part interactions of subsets of tree that are connected - not ALL.
  • 2000’s : bag of features models
    • loses all concept of arrangement
    • view an image as an unordered collection of SIFT or color descriptors
    • good for texture recognition,
    • build up hists over patches
  • spatial pyramid representation, that calculates bags at multiple scales; descriptors binned over subdivisions of pyramid
    • alternatively, build pyramids in the feature space of descriptors rather than in the image, and match based on those pyramids; encode descriptor content via histograms of the descriptors themselves over spatial locations

lecture 7

challenges of recognition

  • illumination, deformation, occlusion, clutter, interclass variation

data driven approach: linear classifier

  • f(x,W) = Wx + b
  • image features: perform histogramming operations to extract some representation of image color, texture, or content
  • with multiple of these feature processing pipelines, we can make a longer feature representation vector to feed into our simple classifier
    • histograms, oriented gradients
  • How to define this classifier? what are the parameters?
      1. define a loss function that quantifies our unhappiness with the scores across the training data
        • hinge loss / support vector machine loss: sum over the set of labels, look at the score assigned to each particular label vs the correct label, and take max(0, difference of scores + 1); if the score of the correct class is some margin greater than the score for any other class, we don't incur any penalty; if that's not the case, we incur a loss penalty (linearly increasing)
      2. come up with a way of efficiently finding the parameters that minimize the loss function.
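The hinge loss can be written down directly (svm_loss is a hypothetical name; margin = 1 as in the notes):

```python
def svm_loss(scores, correct, margin=1.0):
    """Multiclass SVM (hinge) loss: sum over the incorrect classes of
    max(0, s_j - s_correct + margin)."""
    return sum(
        max(0.0, s - scores[correct] + margin)
        for j, s in enumerate(scores)
        if j != correct
    )

# correct class beats every other score by at least the margin: no penalty
print(svm_loss([5.0, 3.0, 2.0], correct=0))            # 0.0
# correct class only barely ahead: a linearly increasing penalty kicks in
print(round(svm_loss([3.2, 3.0, 1.0], correct=1), 6))  # 1.2
```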

data driven approach: regularization

L(W) = data loss + regularization. regularization - prevent the model from doing too well on the training data; prevents overfitting

why regularize?

  • express some bias over what we expect reasonable model parameters to look like for our set of problems, so that we don’t fit noise into the data
  • affects optimization when we devise an algorithmic strategy to figure out what the optimal parameter W should be

typical regularization

  • sum of squares of all of the entries of the weight matrix, amounting to preferring a setting of weights that spreads out, making multiple features important in the classification decision rather than just 1
  • prefers simpler models
  • an unregularized version might create a fxn too complicated, which fits noise into data

the softmax classifier

  • we want to engineer our scoring function as probabilities
  • need to formulate raw real valued score vector to a prob. distribution over classes
  • compute scores and renormalize them by passing thru an exponential function, ensuring they’re nonneg, then normalizing so that probabilities sum to one (logits -> probabilities)
  • we compare the probabilities predicted by model with correct probabilities by ground truth
  • think about a distance metric that measures the difference between predicted and ground truth; using a definition of divergence (Kullback-Leibler divergence) we can define an objective or loss function - so we minimize the cross entropy loss between the two probability distributions
  • min/max possible loss L_i : min 0, max infinity
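A small sketch of the logits -> probabilities -> cross-entropy pipeline described above (subtracting the max before exponentiating is a standard numerical-stability trick, not something the notes mention):

```python
import math

def softmax(scores):
    """Exponentiate (so values are nonnegative) and normalize to sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(scores, correct):
    """Loss = -log(probability assigned to the correct class)."""
    return -math.log(softmax(scores)[correct])

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099], sums to 1
print(round(cross_entropy([2.0, 1.0, 0.1], correct=0), 3))  # 0.417
```

Note the min/max behavior from the notes: if the correct class gets probability 1 the loss is 0, and it grows toward infinity as that probability goes to 0.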

gradient descent

recall the 1d derivative of a function. in multiple dimensions, the gradient is the vector of partial derivatives along each dimension. taking the gradient of the loss function wrt our parameters W gives us an idea of how the loss changes as we move the entries of our weight matrix W. so, we have an optimization strategy: compute the gradient of the loss wrt our current params, then update those params in the negative gradient direction so that we decrease the loss.

stochastic gradient descent

  • repeating this process ^
  • with a loss fxn, compute gradient of loss fxn wrt our parameters
  • with a large dataset

issue -

solution - feature transformation: transform the data first, and then classify; features go from hand-designed to learned via training

idea of new deep learning approach - separation btwn feature extraction and classifier disappears; a pipeline ends up doing both; with parameters and a loss fxn

neural networks

linear score function: f = Wx; 2 layers: f = W2 max(0, W1x); 3 layers: f = W3 max(0, W2 max(0, W1x)). if there is no nonlinearity -> we get a linear classifier again

activation functions in common use -

  • ReLU - max(0,x)
    • rectified linear unit - rectification = clipping negative inputs to zero output
  • ELU - exponential linear unit
  • Leaky ReLU
    • nonzero but small response to negative inputs

how to compute gradients?

score function s; loss function on predictions (SVM hinge loss, cross entropy); regularization term

total loss: prediction loss + regularization, with lambda as a hyperparameter trading off regularization against classification error. then compute the gradient of the loss with respect to W1 and W2 (if s = W2 max(0, W1x)) -> apply sgd to update weights W1 and W2 with a local optimization strategy

how to compute partials wrt w1 and w2?

good strategy : computational graphs + backpropagation example: scalars x,y,z f(x,y,z) = (x+y)z

intermediate value is q = x + y, and f = qz

we want partial derivatives at each stage in the data flow graph. for the first intermediate computation q = x + y, the partials are dq/dx = dq/dy = 1; also df/dq = z and df/dz = q. compute df/dx, df/dy, df/dz with the chain rule

so df/dq is an upstream gradient and dq/dy is a local gradient df/dy = df/dq*dq/dy
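The worked example above, executed by hand in plain Python:

```python
# forward pass for f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y          # intermediate node q = 3
f = q * z          # f = -12

# backward pass: chain rule, upstream gradient * local gradient
df_df = 1.0
df_dq = z * df_df   # multiply gate: local gradient wrt q is z
df_dz = q * df_df   # and wrt z is q ("swap multiplier")
df_dx = 1.0 * df_dq  # dq/dx = 1: the add gate just routes the gradient
df_dy = 1.0 * df_dq  # dq/dy = 1

print(f, df_dx, df_dy, df_dz)  # -12.0 -4.0 -4.0 3.0
```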

so we can backpropagate from outputs to the preceding hidden states of the network; eventually we can compute the gradient of the loss with respect to all activations in the network (intermediate outputs) and all the network parameters (the weights that participate in defining the function)

for any other modules or nodes in the graph, we continue the process

patterns in gradient flow

  • add gates
  • mul gates : “swap multiplier”
  • copy gate
  • gradient router

backprop: modularized implementation

backprop is modularized for you into forward and backward passes. in order to backpropagate, in the forward pass we want to stash values at a given node: if z = xy, the output going forward is computed by forward(), and we also keep the values that will participate in the backward pass

so we need to store the hidden state of the network computed in the forward pass, and use those values in the backward pass

vectors and matrices?

vector wrt scalar: its partials; vector wrt vector: the derivative is the jacobian

the upstream gradient, instead of being a scalar value, is a vector we have local gradients that are jacobian matrices backprop happens via matrix vector multiplication; upstream * local again.

the upstream gradient is the partial of the loss with respect to z: for each element of z, how much does it influence L? A Derivation of Backpropagation in Matrix Form – Sudeep Raja – Doctoral Student at Columbia University

lecture 8


motivation for sharing filter weights? there is an expectation that image statistics are translation invariant. What is a receptive field in a convolutional neural network? - Quora

pooling layers

helps us control the spatial representation; used to downsample by a factor of 2 in each spatial direction

  • treat each as an array & apply methods

max pooling

A Gentle Introduction to Pooling Layers for Convolutional Neural Networks. max pooling makes pooling more discriminative: take the maximal responses to filters in your network, to preserve important characteristics in your spatial region

  • 2x2 max pooling, stride of 2 etc
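2x2 max pooling with stride 2, sketched on a plain nested list (max_pool_2x2 is my own name):

```python
def max_pool_2x2(img):
    """2x2 max pooling with stride 2: halves each spatial dimension,
    keeping the maximal response in each window."""
    h, w = len(img), len(img[0])
    return [
        [max(img[i][j], img[i][j + 1], img[i + 1][j], img[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]

img = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(img))  # [[4, 2], [2, 8]]
```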

after enough pooling - 1x1 spatial resolution

  • can then apply a fully connected layer like last lecture
  • gradual pooling is the most natural and efficient way to implement a classification network


classic net architecture for classification: some repeating blocks of conv and ReLU, pooling to change spatial resolution, repeat a few times, then connect a fully-connected classification layer and attach a softmax

activation fxns

sigmoid - if the input to a neuron is always positive, the gradients on its weights will be all positive or all negative ?? tanh(x) - still has the gradient saturation issue. ReLU - does not saturate (for positive inputs), is computationally efficient, and converges faster than the others; however, its output is not zero-centered (?)

in practice , use reLU and don’t use sigmoid

data preprocessing

assists in learning

  • PCA and Whitening of data
  • normalization (in linear classifiers)

weight initialization

Weight Initialization in Neural Networks: A Journey From the Basics to Kaiming. activation stats cluster to zero in later stages of the network -> not much signal gets through. if activations are all saturated, the network computes nearly the same output for every input and we are in a regime where we need to move the parameters a lot


batch normalization

as we update our network parameters … we can try to dynamically enforce some property of activation statistics

insert, after every nonlinearity in a network, a renormalization layer that takes activations and dynamically maps the activation set to some desired property;

classification - What is zero mean and unit variance in terms of image data? - Cross Validated. initialize these layers to perform normalization, and then shift to a parameterized mean and variance; the initial values of these give zero mean, unit variance ???

A Gentle Introduction to Batch Normalization for Deep Neural Networks. dependence on batch size is reduced; improves performance on conv nets

layer normalization

for n examples and d feature dimensions, normalize over d.
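The axis distinction between batch norm and layer norm can be sketched as follows (at initialization only, ignoring the learned scale/shift parameters; function names are mine):

```python
def normalize_rows(X, eps=1e-5):
    """Layer-norm style: zero-mean, unit-variance normalization of each
    row, i.e. each example over its d feature dimensions."""
    out = []
    for row in X:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / (var + eps) ** 0.5 for v in row])
    return out

def normalize_cols(X, eps=1e-5):
    """Batch-norm style: normalize each feature dimension over the n
    examples in the batch (transpose, normalize rows, transpose back)."""
    cols = [list(c) for c in zip(*X)]
    return [list(r) for r in zip(*normalize_rows(cols, eps))]

X = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
ln = normalize_rows(X)  # each row now has mean ~0, variance ~1
bn = normalize_cols(X)  # each column now has mean ~0, variance ~1
print(round(sum(ln[0]), 6), round(sum(row[0] for row in bn), 6))  # 0.0 0.0
```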

instance normalization

lecture 9

optimization - problems with SGD. what if this parameter space has strange behavior? loss changing quickly in one direction and slowly in another? high fluctuation in the gradient -> jagged updates. what if the gradient is zero? gradient descent gets stuck

adding a momentum term: if you're moving in a direction, keep moving in that direction; build up velocity as a running mean of gradients (a velocity term that has some memory); rho gives friction

for the current point in parameter space, the actual step is the sum of velocity and gradient -> nesterov momentum: instead of the gradient at the current place in parameter space, look ahead to where the velocity would push you, compute the gradient there, and mix it with the velocity to get the actual update direction
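Plain SGD with momentum, as described above (the quadratic objective, learning rate, and rho are illustrative; Nesterov would evaluate the gradient at the look-ahead point instead):

```python
def sgd_momentum(grad, w, lr=0.1, rho=0.9, steps=100):
    """SGD with momentum: velocity is a running mean of gradients
    (rho acts as friction); the step is taken along the velocity."""
    v = 0.0
    for _ in range(steps):
        v = rho * v - lr * grad(w)  # accumulate velocity
        w = w + v                   # step along the velocity
    return w

# minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w = sgd_momentum(lambda w: 2 * (w - 3), w=0.0)
print(round(w, 3))  # converges near the minimum at 3
```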

CS231n Convolutional Neural Networks for Visual Recognition Nesterov Accelerated Gradient and Momentum

other update method ; Adagrad

  • keep track of historical sum of squares in each dimension
  • take larger steps when parameters are …
  • smaller when there is a history of recent gradients w widely varying values
  • adjust learning rate on per element basis
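A 1-D sketch of the Adagrad update described above (the hyperparameters and the toy quadratic objective are illustrative):

```python
def adagrad(grad, w, lr=1.0, eps=1e-8, steps=100):
    """Adagrad: keep a running sum of squared gradients and divide the
    step by its square root, adapting the learning rate per element."""
    g2 = 0.0
    for _ in range(steps):
        g = grad(w)
        g2 += g * g                          # historical sum of squares
        w = w - lr * g / (g2 ** 0.5 + eps)   # effective step shrinks over time
    return w

# minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w = adagrad(lambda w: 2 * (w - 3), w=0.0)
print(abs(w - 3) < 0.1)  # True: settles near the minimum at 3
```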

Cutout regularization for CNNs - Ombeline Lagé - Medium

regularization dropout batch normalization data augmentation …

lecture 11

  • [fast r cnn]
  • [deep watershed transform for instance segmentation]
    • generate an energy surface over the domain of the image, so every pixel has a probability assoc. w the likelihood it is the object center, or distance from obj center
    • potential energy field - low when near the center, high otherwise
    • image morphology operations -> how do i partition the image based on this energy fxn?
    • flooding grows lower minima into larger regions
    • when flooding connects, that forms a partition
  • [CNNs and spectral embedding]
    • the CNN is driving a prediction about image content, which we can reassemble in the scene as objects & segmentation of the objects
    • “We train a convolutional neural network (CNN) to directly predict the pairwise relationships that define this affinity matrix. “
    • “ Spectral embedding then resolves these predictions into a globally consistent segmentation and figure/ground organization of the scene”
    • what output are we training the nn to predict? how does it relate to reassembling the interpretation of the image?
  • [multi-person pose estimation using part affinity fields]
    • model of joint locations and connections of joint locs
    • different intermediate predictions
      • for each type of joint, predicting the probability that the joint is at a location in the scene (detecting all instances in the scene); joint localization
      • output density for elbows, knees, torsos, heads, wrists, etc
      • one set of channels that simultaneously detects relevant joints, no matter how many people are in the picture
      • rather than a loop that runs subprocesses for each candidate person
      • predicts how to connect up the joints to one another
        • another channel, in terms of output prediction, that gives an affinity field - left elbows and left shoulders; gives connection strengths in adjacent joints; reads off reassembly of full pose by predicted affinities
  • [neural image caption generation with visual attention]
    • gradually generates word by word description of the image; what words are activated by the process that selects for attention in subparts of an image

multigrid neural architectures

  • evolution of CNNs to something that looks like rainbow picture
    • 3x3 conv filters stack many layers; increasing feature channel count from 64->512; spatial pooling and subsampling
    • this design makes no sense
      • efficient algorithms in CS: coarse-to-fine classification or decision making; tree based organization
      • but the standard CNN is fine-to-coarse - building larger context as you go
      • slow receptive field growth
        • for an activation in a neuron, the receptive field is how much of the input is connected by a pathway to that particular unit in the net. the first layer has a 3x3 receptive field, then 5x5, then 7x7 -> constant growth of receptive fields as you add layers -> it takes a long time until something relies on the input in its entirety (why is this an issue??)
      • features early in the layer affect subsequent layers.
      • if we buy into this story of gradual buildup to more abstract feature reps, there is an enforced choice that later layers contain more abstraction and spatial scale (coarseness) -> coupling of this
  • instead
    • store a pyramid of activation tensors
      • every layer in the net has an extra dimension of scale space
      • different tensors at different spatial scales
      • flow of information from coarser spatial grids to finer, vice versa, everywhere in the network
      • series of layers emulates the fully connected layers
      • an info pathway that flows between spatial scales - provides shortcut pathways across the spatial dimension in the networks
      • triggering rapid receptive field growth
      • allows networks to learn tasks that standard cnns have a difficult time with
  • multigrid convolution
    • layers, instead of 1 activation tensor, have a set of them at diff spatial scales, with diff number of channels; define an operation that turns the pyramid of input into another pyramidal output; a multigrid extension of conv. built out of simple components in conv nets
      • upsampling (nearest neighbor or bilinear)
      • downsampling (max pooling)
      • pooling (communication)
    • gives us an operation where the cnn evolves representations on pyramids
    • has standard cnn embedded within it
  • key properties
    • images are multiscale - why not do something coarse-to-fine?
    • receptive field growth
      • let s = diameter of the finest-scale spatial grid
      • in O(log s) layers, there are pathways in the network connecting every unit with any other unit in the pyramid; a standard cnn, with its constant receptive field growth, needs O(s) layers
      • info propagates exponentially faster
      • (why does receptive field matter?)

        The concept of receptive field is important for understanding and diagnosing how deep CNNs work. Since anywhere in an input image outside the receptive field of a unit does not affect the value of that unit, it is necessary to carefully control the receptive field, to ensure that it covers the entire relevant image region. In many tasks, especially dense prediction tasks like semantic image segmentation, stereo and optical flow estimation, where we make a prediction for each single pixel in the input image, it is critical for each output pixel to have a big receptive field, such that no important information is left out when making the prediction
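the linear-vs-logarithmic growth described above is easy to check numerically. a quick sketch (my own illustration, not lecture code) of receptive field size for stacked 3x3, stride-1 convolutions:

```python
# Receptive field of a stack of 3x3 convolutions with stride 1 grows by
# 2 pixels per layer, i.e. only linearly in depth.
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field (in input pixels) of `num_layers` stacked convs."""
    rf = 1
    jump = 1  # spacing between adjacent units, measured in input pixels
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# 3x3, 5x5, 7x7, ... as in the notes:
print([receptive_field(n) for n in (1, 2, 3)])  # [3, 5, 7]
# covering a 224-pixel input needs ~112 layers of 3x3 convs:
print(receptive_field(112))  # 225
```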

      • internal attention
    • why didn't traditional cnns start coarse?
      • need to go up the scale space and back down
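a toy sketch of the cross-scale communication idea, assuming 2x-spaced single-channel grids, nearest-neighbor upsampling, and max-pool downsampling (my own illustration; a real multigrid layer would also convolve each combined tensor):

```python
import numpy as np

def max_pool2(x):
    """2x2 max pooling: downsample to the next-coarser grid."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbor upsampling: move to the next-finer grid."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def multigrid_exchange(pyramid):
    """Each scale receives its own tensor plus its finer/coarser neighbors."""
    out = []
    for i, x in enumerate(pyramid):
        acc = x.copy()
        if i > 0:                      # info flowing up from the finer grid
            acc += max_pool2(pyramid[i - 1])
        if i < len(pyramid) - 1:       # info flowing down from the coarser grid
            acc += upsample2(pyramid[i + 1])
        out.append(acc)                # a real layer would convolve `acc` here
    return out

pyramid = [np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))]
print([t.shape for t in multigrid_exchange(pyramid)])  # [(8, 8), (4, 4), (2, 2)]
```

the point is just that every layer touches every scale, which is what makes receptive fields grow so fast.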

        attentional tasks / learned attention

  • building cnns that undo spatial transformations [jaderberg]
    • mnist digits rendered in distorted forms - offset, noise - w goal of recovering undistorted version
    • had a net that did localization which output translation parameters
    • a sampler that, given the transformation parameters, undoes the parameterized transformation
      • if the transformer is replaced with a multigrid cnn
        • it learns it
  • building an attention map of, what portion of the input does the upper left of the output depend on?
    • as the location of the object in the input image changes, the attentional pattern of the multigrid network changes, whereas that of a u-net is static
    • humans try to hand-design modules that compute a mask over some set of spatial locations, multiply it with an activation tensor in the net, then pool the result for later use - an attention mechanism built in
    • multi grid gives an implicit mechanism
      • infers what spatial subregion of the image is important for the output
  • [dynamic routing between capsules (… hinton)]
    • hinton’s capsules
      • more complicated routing strategy from layer to layer
      • overlapped mnist digits
      • a different strategy for creating vector valued reps that are meaningful
      • each component of the activation tensor .
      • what was missing from neural nets for CV
        • vector valued reps emerging internally, where they encode things like local pose parameters
        • want to architect it with more complicated interactions between layers, so that there is a more informative decomposition
        • trying to change how layers operate internally in order to route information by “agreement”
      • information flow from layer to layer
        • detecting evidence for components that agree w a particular configuration of an object
      • multigrid learns this
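the hand-designed attention recipe mentioned above (mask over spatial locations, multiply with the activation tensor, pool) can be sketched like this (toy numpy version, my own illustration):

```python
import numpy as np

def attend(activations, scores):
    """activations: (H, W, C); scores: (H, W) unnormalized attention logits.

    Returns a single (C,) feature vector: a softmax-weighted spatial pool.
    """
    mask = np.exp(scores - scores.max())
    mask /= mask.sum()                  # softmax over all H*W locations
    return (activations * mask[..., None]).sum(axis=(0, 1))

acts = np.random.rand(4, 4, 8)
scores = np.zeros((4, 4))
scores[0, 0] = 10.0                     # strongly attend to one corner
vec = attend(acts, scores)
print(vec.shape)  # (8,)
```

with a near-one-hot mask, the pooled vector is essentially the feature vector at the attended location; multigrid nets learn this kind of selection implicitly instead of having it wired in.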

memory arch of multigrid

- learn to store a map of the maze in the internal state of its network
- capacity for internal attention for spatial locations
- if you augment the net with memory, it can attend to its own internal memory
	- this lets u update read and write ops
	- and learn attentional strategies
	- learn about the memory network
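a toy version of softmax-weighted (differentiable) memory reads and writes - my own sketch, not the lecture's actual architecture; the idea is just that attention over memory slots is itself learnable:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read(memory, key):
    """Soft read: attention over slots, then a weighted sum of contents."""
    w = softmax(memory @ key)   # (slots,) attention weights
    return w @ memory           # weighted mixture of slot contents

def write(memory, key, value, lr=1.0):
    """Soft write: nudge attended slots toward `value`."""
    w = softmax(memory @ key)
    return memory + lr * np.outer(w, value - w @ memory)

M = np.eye(4)                   # 4 slots, 4-dim contents
r = read(M, np.array([5.0, 0.0, 0.0, 0.0]))
print(r)                        # concentrated on slot 0
```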

generative adversarial networks

- ebgan
- biggan - just a few years' difference in research
  • generative tasks, formulations, and issues
    • gans are not the only option for generative tasks
  • generation - learn to sample from the distribution represented by the learning task
  • unsupervised learning
  • conditional generation
    • conditioning output to generate from a subspace of the full space
      • “an indigo bunting facing right”
  • semantic segmentation and its inverse
    • images to semantic labeling, vice versa
    • labels to street scene. day to night. bw to color. edges to photo
      • rendering and art

designing a network for generative tasks

  • architecture
    • encoder decoder; upsampling architectures for dense prediction
      • autoencoders
    • since we dont have labelled training data, how do we design loss functions?
      • want to measure how close our output is to training; instead of writing a loss formulation in analytical form - train another net to predict whether the output of the 1st net looks realistic
    • train two networks with opposing objectives
      • generator - generates samples
      • discriminator - distinguishes between generated and real samples
        • want generator to fool discriminator. want discriminator not to be fooled
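the opposing objectives can be written down numerically; a minimal sketch of the standard gan loss (my addition, not lecture code), where D outputs the probability that a sample is real:

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator loss: high scores on real, low scores on fake."""
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def g_loss(d_fake):
    """Generator loss (non-saturating form): make D score fakes as real."""
    return -np.log(d_fake).mean()

d_real = np.array([0.9, 0.8])   # D's scores on real samples
d_fake = np.array([0.1, 0.2])   # D's scores on generated samples
print(d_loss(d_real, d_fake), g_loss(d_fake))
```

here D is doing well, so its loss is small and the generator's is large; training pushes the two in opposite directions.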

lecture 12

designing a network for generative tasks

  • take an encoder/decoder or u net arch
    • consider its latter half & smaller middle part
      • ?
    • loss function
      • nn or otherwise

how to implement the loss fxn?

  • samples are generated according to a prob dist
  • training data has a prob dist
  • want p_model to match p_data
  • probability models

  • train 2 neural nets
    • one is a generator
    • one is a discriminator
  • loss - conditional log likelihood for real and generated data; high score for real, low score for fake

  • nash equilibrium occurs if the generator produces samples from the same distribution as the underlying data distribution. A Gentle Introduction to Generative Adversarial Network Loss Functions

GAN Objective Functions: GANs and Their Variations - Towards Data Science KL and JS divergence probability dists
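a quick numeric illustration of KL vs JS divergence on discrete distributions (my own sketch of the linked material):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions (assumes q > 0)."""
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """JS divergence: symmetrized KL against the mixture distribution."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl(p, q), kl(q, p))   # KL is asymmetric
print(js(p, q), js(q, p))   # JS is symmetric and bounded by log 2
```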

lecture ??

learning to produce informative image descriptions

learning to perform image descriptions

recurrent networks Word2vec - Wikipedia Recurrent Neural Networks - Towards Data Science visual semantic space