notes from CMSC25040
computer vision
these notes are total confusion but hopefully someday i’ll come to understand them.
textbook
http://szeliski.org/Book/drafts/SzeliskiBook_20100903_draft.pdf
Lecture 2
file:///Users/april/Downloads/lec01b_color_filt.pdf
Lecture 3 (1/13)
Introduction to the Fourier Transform
But what is the Fourier Transform? A visual introduction. - YouTube
Is the convolution an invertible operation? (signal processing) - Mathematics Stack Exchange
The Fast Fourier Transform Algorithm - YouTube
[Invertible Convolutions  Emiel Hoogeboom](https://ehoogeboom.github.io/post/invertible_convs/) 
Canny edge detector - Wikipedia
https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c
The Berkeley Segmentation Dataset and Benchmark - human-annotated boundaries give an objective measure of consistency; the algorithm's output is compared to the human-annotated output to measure effectiveness
next time: more advanced edge detectors
Lecture 4
second moment matrix
 look at the eigenvalues of the 2nd moment matrix M
 they give the 2 directions along which the patch changes most and least rapidly
 can characterize a local patch in terms of the eigenvalues of M
 in a flat region both will be small
 at an edge, there is stronger variation along one direction than the other
 at a corner, both are large and of comparable magnitude

we have a measure of how 'corner-like' a local point in the image is
 R = det(M) - alpha * trace(M)^2
 a summary function

at an edge R < 0; at a corner R > 0; in a flat region |R| is small - apply a threshold value (non-maxima suppression)
harris corner detector
Harris Corner Detector - Wikipedia
http://www.cse.psu.edu/~rtc12/CSE486/lecture06.pdf
 have a threshold value (non-maxima suppression)
 analogous to the Canny edge detector
 take image derivatives
 apply a windowing function (gaussian filter)
 non-maxima suppression
 want the difference over a PATCH, not at a particular point
 by choosing the scale of the window function, we choose the scale at which we detect corners (sketch below)
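A minimal sketch of this pipeline (my own numpy/scipy illustration, not the lecture's code; alpha and the relative threshold are typical but arbitrary choices):

```python
import numpy as np
from scipy import ndimage

def harris_response(img, sigma=1.0, alpha=0.05):
    # image derivatives
    Ix = ndimage.sobel(img, axis=1, output=float)
    Iy = ndimage.sobel(img, axis=0, output=float)
    # entries of the second moment matrix M, accumulated under a gaussian window
    Ixx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Iyy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Ixy = ndimage.gaussian_filter(Ix * Iy, sigma)
    # R = det(M) - alpha * trace(M)^2, computed per pixel
    return Ixx * Iyy - Ixy**2 - alpha * (Ixx + Iyy)**2

def harris_corners(img, sigma=1.0, alpha=0.05, rel_thresh=0.01):
    R = harris_response(img, sigma, alpha)
    # non-maxima suppression: keep pixels that are the maximum of their 3x3 window
    local_max = R == ndimage.maximum_filter(R, size=3)
    return np.argwhere(local_max & (R > rel_thresh * R.max()))
```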
invariance and covariance
 affine intensity change
 rotation, translation
Since the corner operator can be scaled, the choice of interest point detection is not covariant to scaling: we are picking a scale, so if we upsample or downsample the image, something that is a corner at one scale is not the same at the other > lack of covariance > run the corner detector at multiple scales and, if there is a scale change, account for those differences somehow
orientation normalization
if we want to design descriptors that are invariant to orientation, local rotation
 suppose we have a descriptor of a patch with edge structure
 capture a histogram of orientation intensity at this local patch
 see which orientations are prevalent
 index this histogram so that the descriptor is invariant to local orientation
 append a record of where we started indexing
maximally stable extremal regions
 characterising local interest points and making local descriptors of them
 MSER
 etc
local features
image representations
for an interest point detector, how do we build meaningful descriptors? > histogram things by color, texture, SIFT descriptors
histograms
 consider whether we want joint or marginal histograms
 marginal: L, a, b channels; textons
 requires independent features
 joint: ab color space; do binning over the 2d color space
Lecture 5
feature descriptors
Scale-Invariant Feature Transform (SIFT)
taking a local image patch and creating a descriptor for it
 SIFT vector formation
 for an 8x8 px patch
 bin orientation energy within each cell
 one cell of the descriptor covers a 4x4 pixel array in the image
 building some invariance to small local translations / orientations
 gaussian smoothing function
 threshold gradient magnitudes to avoid excessive influence of high gradients
 interest point operator paired with this descriptor
 add scale invariance
 convolve with a gaussian smoothing filter
 convolve again and again with the filter
 take the difference between 2 neighboring smoothed images
 look for extrema in this differenced space
 to detect spatial locations where there is some unique structural change with scale
 repeat for a smoothed and subsampled version of the image
 to choose interest point locations, take the set of difference-of-gaussian responses across x, y, and scale; look for (x, y, scale) locations that are local extrema (sketch below)
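A rough sketch of the difference-of-gaussian search (my own illustration; sigma0, k, and the level count are placeholder values, and real SIFT also repeats this per octave on subsampled images):

```python
import numpy as np
from scipy import ndimage

def dog_extrema(img, sigma0=1.6, k=2**0.5, n_levels=5):
    # repeatedly smooth with gaussians of increasing sigma
    blurred = [ndimage.gaussian_filter(img.astype(float), sigma0 * k**i)
               for i in range(n_levels)]
    # differences between neighboring smoothed images
    dogs = np.stack([b1 - b0 for b0, b1 in zip(blurred, blurred[1:])])
    # keep points that are extrema over x, y, AND scale (3x3x3 neighborhood)
    maxima = dogs == ndimage.maximum_filter(dogs, size=3)
    minima = dogs == ndimage.minimum_filter(dogs, size=3)
    scale, ys, xs = np.nonzero(maxima | minima)
    return list(zip(scale, ys, xs))
```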
scale invariance: to downsample is to smooth with a gaussian and then subsample. if you feed in differently scaled versions of the image and apply the process to each of them, some level of the smoothing process on version 1 will approximately align with the smoothing of version 2. the hope is that you find corresponding locations as 'peaks' in the gaussian scale space, picking out the same relative locations as meaningful interest points; the detection process mirrors a rescaling of the image
the descriptor itself is very important; the strategy of binning matters more than the interest point detector
log-polar binning - the shape context descriptor. we want to describe edge energy, or any pixelwise measure, at some point in the image. in the log-polar binning strategy, bins are spaced equally in angle and logarithmically in radius: very sensitive to fine-scale structure near the location being described, and robust to deformations far from that point. this captures object shape across multiple scales and makes the descriptor robust to different deformations
what should you be histogramming? if you want to match higher-level structure, use a pairwise patch similarity measure: edges are replaced with a correlation measure between local patches. the same structure in different textures yields similar correlation surfaces, which factors out the effect of texture within an image
the shape context histogramming strategy gives us good correspondence between similar points on what we perceive as the same object
local descriptors
want something robust to a set of deformations, yet distinctive, so that it matches only to the correct corresponding descriptors in other images. sobel filters are a building block; so are color and self-similarity
matching
simplest approach: use descriptors to find correspondences between 2 different views of the same object. could just pick the nearest neighbor; caveat: lots of photos have repeating structures
 consider distance ratios (nearest vs. second-nearest match)
 want to find the parameters of a model that best fit the data
 and an alignment: the parameters of the transformation that best align matched points
 must relate correspondences to model parameters
 if we have a setting of the parameters for the model of interest, how well does it agree with our data (matches)?
 want an optimization method to recover the parameters quickly
fitting and alignment
 global optimization
 least squares, other
 hypothesize and test
global optimization
line fitting: given some data points, we want to put a line through them. in closed form: write down a cost function that minimizes the SSD between the points and the locations predicted by the model
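The closed form is one call in numpy (a toy illustration with made-up points):

```python
import numpy as np

def fit_line(x, y):
    # minimize sum((m*x + b - y)^2) over m, b
    A = np.stack([x, np.ones_like(x)], axis=1)   # design matrix [x, 1]
    (m, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return m, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.9, 4.2, 5.8])
print(fit_line(x, y))   # roughly (2.0, 0.0)
```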
other ways to search for parameters
 grid search
 gradient descent
multistable perception? multiple candidate interpretations, for example, the necker cube
the hough transform
Hough transform - Wikipedia
Image Transforms - Hough Transform
How Hough Transform works - YouTube
take our model and create a discrete set of parameter values (discretize the parameter space). each potential feature match 'votes' for sets of parameters, contributing to one or more points in the parameter grid - a histogram of votes over model parameters. after the voting process, find local maxima in the grid > these become the hypotheses for model parameters that we go back and check
a point in (x, y) corresponds to a line in hough space: any value of m implies a value of b. discretize m and b and accumulate how many times each pair is voted for > a concentration of votes appears at the correct values of m and b
if voting by orientation theta, use a polar representation for the parameter space: a bounded parameter space in (rho, theta)
for circles, or shapes generally: choose a center reference point in the shape; at every edge point, consider the tangent direction and vote for candidate center locations using the distance to the center from that tangent. you can create any parameter space and voting scheme this way
cost: how many data items need to cast votes, and how many votes does each cast? (sketch below)
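A toy accumulator for lines in the (rho, theta) parameterization (my illustration; the bin counts and rho range are arbitrary):

```python
import numpy as np

def hough_lines(points, n_theta=180, n_rho=200, rho_max=100.0):
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    accumulator = np.zeros((n_rho, n_theta), dtype=int)
    for x, y in points:
        # each point votes for every (rho, theta) line passing through it
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        bins = np.round((rhos + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        ok = (bins >= 0) & (bins < n_rho)
        accumulator[bins[ok], np.arange(n_theta)[ok]] += 1
    # peaks in the accumulator are the hypotheses for line parameters
    return accumulator, thetas
```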
RANSAC
random sample consensus: handles robustness to outliers and the multiple valid hypotheses we may need to detect
algorithm
 randomly sample the number of points required to fit the model
 solve for model parameters using the samples
 score by the fraction of inliers within a preset threshold of the model
 how many of our remaining data points agree with this hypothesis?
 repeat 1-3 until the best model is found with high confidence
if we want model parameters consistent with inliers and not outliers, then step 1 will more likely sample inliers, giving models fit mostly with inliers; other inliers will be consistent with them, and outliers are eliminated
good: robust to outliers
cost: time grows with the fraction of outliers and the number of parameters; not good for finding multiple fits
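A sketch of the sample/fit/score loop for line fitting (illustrative; the iteration count and inlier threshold are assumptions to tune per problem):

```python
import numpy as np

def ransac_line(points, n_iters=100, thresh=0.5, seed=0):
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.empty((0, 2))
    for _ in range(n_iters):
        # 1. randomly sample the minimum number of points (2 for a line)
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        # 2. solve for the model from the sample: line through p1 and p2
        d = p2 - p1
        n = np.array([-d[1], d[0]]) / np.linalg.norm(d)   # unit normal
        # 3. score by the number of inliers within the threshold
        inliers = points[np.abs((points - p1) @ n) < thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (p1, n), inliers
    # optionally refit with least squares on the inliers of the best hypothesis
    return best_model, best_inliers
```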
image alignment
 estimate some form of geometric transformation
what
affine transformations
 combinations of linear transformations and translations. properties:
 map lines to lines
 parallel lines remain parallel
 ratios of lengths along a line are preserved (angles and absolute lengths are not)
translation
using least squares, given candidate feature matches, solve for the translation
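For pure translation the least-squares answer collapses to the mean displacement over matches (tiny sketch with fabricated matches):

```python
import numpy as np

src = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
dst = src + np.array([5.0, -1.0])    # matches shifted by the true translation
t = (dst - src).mean(axis=0)         # argmin_t sum ||src_i + t - dst_i||^2
print(t)                             # [ 5. -1.]
```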
if we have outliers, wrap RANSAC around the least squares solver; subsample only as many points as we need to estimate the transformation
how do you know how many you need? (enough to pin down the parameters: e.g. 1 match for a translation, 3 for an affine transform)
or use the hough transform: initialize a grid of parameter values, have each matched pair cast a vote, find the parameters with the most votes, then solve using least squares with the inliers
alternative: iteratively refine your model using interest points alone, with no actual descriptors present
individual instance recognition: David G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints"
ingredients
 edge detection
 interest point operator
 feature descriptors (SIFT)
 feature descriptor matching via approx nearest neighbors
 hough transform
Lecture 6
high dimensional nearest neighbors.
two very common ways to approach this:
 tree-based search structure
 an extension of BSTs to high-dimensional space
 hashing search structure
tree based search
 SIFT descriptors go into the leaf nodes of the tree
 internal nodes are axis-aligned subdivisions
 along whatever dimension
 the tree splits the 2d search space into 2 halves; continue splitting, choosing axis-aligned x or y coordinates
 compute the variance of the data in the current bin along each axis-aligned coordinate
 with a heuristic split decision: select the axis with greatest variance; choose the median as the split value and split
 repeat the variance computation on the left and right halves - each split corresponds to the next level of the subtree
 stopping condition: just 1 item left
aim is to find nearest neighbors
the tree might contain a collection of descriptors for objects in our database
with a new query point, our data items fall into bins. the query point might be mapped into a bin whose descriptor lies along an edge of the subdivision, while a closer neighbor sits just over the edge of the subdivision line - so we don't want to look only within the query's own cell
in high dimensions (e.g. the 128-dimensional rectangles for SIFT), a first fix is: map the query point into a leaf node, then check descriptors in any bordering rectangular subregion. in 3d, however, there could already be a lot of bordering subregions; as the dimensionality goes up, we would have to exhaustively check a large portion of the tree to be sure we have the exact NN
so we settle for retrieving the exact nearest neighbor only with some probability, otherwise a close neighbor. while descending the tree, keep a priority queue that ranks unexplored branches by their distance to the query point (distance measured along the splitting coordinate axis): the whole subtree B is at least distance d from the query point, so if anything already found in A is closer than d, you can hold off checking subtree B - if not, pop B off the queue and explore it
(figure: example kd-tree with subtrees labeled A, B, C, D, E, F)
priority queue: D at distance d_1, B at distance d_0. descend to a leaf node, enqueuing unexplored branches keyed by the best distance one could possibly achieve in each ("best bin first" kd-tree search).
KD tree algorithm: how it works - YouTube
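In practice a library kd-tree handles the median splits and the best-bin-first-style search; a sketch with scipy (assumed tooling here; the random 128-d vectors stand in for real SIFT descriptors):

```python
import numpy as np
from scipy.spatial import cKDTree

database = np.random.rand(1000, 128)   # e.g. SIFT descriptors for our database
tree = cKDTree(database)

query = np.random.rand(128)
dists, idxs = tree.query(query, k=2)   # two nearest neighbors
print(idxs, dists)                     # the pair also supports the distance-ratio test
```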
hashing search structure
locality sensitive hashing

instead of finding axis-aligned splits in our dataset, we throw down random hyperplanes (in 128 dimensions these are hyperplanes; in 2d they are lines); we use these to split up our data as a hashing function.
 for each hyperplane/line, build up a hash code that maps points into a number of bits = number of hyperplanes
 map each point to a binary code that says which side of each plane the point is on
 for all possible length-2 binary codes (for 2 hyperplanes), have a hash table of 4 bins - 00, 01, 10, 11; sort points into them
 look for a point's neighbors in its bin
 problem: what if the query point is just on the other side of a split from its actual nearest neighbor?
 bins are constructed randomly; with high probability, we hope that they bin actual nearest neighbors together
 to increase the chances of that, build two hash tables
 in the second one, use a totally different set of splitting hyperplanes
 more generally, build several hash tables, each with its own independent set of hyperplanes
 given a query point, match it to each set of hyperplanes independently, and check its neighbors in all of the tables
 will give a close neighbor, if not the true NN > good enough for feature matching
k bits in the hash code, and some number N of tables. another question: how do we choose the parameters of these hyperplanes?
 given a collection of vectors (feature descriptors) {v_i}
 subtract out the mean to center the data
 normalize the data so descriptors are of unit length > they map onto the unit hypersphere around the origin

now just consider splitting hyperplanes through the origin (each described by its angle through the origin, or equivalently its normal vector); a sketch follows
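A sketch of the whole scheme: random hyperplanes through the origin, sign patterns as hash codes, several independent tables (my illustration; n_bits and n_tables are arbitrary choices):

```python
import numpy as np
from collections import defaultdict

def build_tables(data, n_bits=8, n_tables=4, seed=0):
    rng = np.random.default_rng(seed)
    tables = []
    for _ in range(n_tables):
        planes = rng.standard_normal((n_bits, data.shape[1]))  # hyperplane normals
        codes = data @ planes.T > 0                            # side of each plane
        buckets = defaultdict(list)
        for i, code in enumerate(codes):
            buckets[code.tobytes()].append(i)
        tables.append((planes, buckets))
    return tables

def candidates(query, tables):
    # union of the query's bucket across all tables
    found = set()
    for planes, buckets in tables:
        found.update(buckets.get((planes @ query > 0).tobytes(), []))
    return found
```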
 descriptors themselves change due to viewpoint or lighting …
 so for computer vision, it is sufficient to do approximate matching
 some matches might not be correct, or might be outliers
 after indexing and lookup, there will be a stage after the matching process, like a model fitting process (hough, RANSAC, robust least squares)
 how many hyperplanes? application dependent
 for a point a
the history of recognition prior to deep learning
 there are 10,000 - 30,000 visual object-level categories in the dictionary
 ImageNet has 1000 categories
 segmentation with ~100 categories

our label sets capture maybe 1/10th of the complexity of distinct objects
 hierarchies of obj categories from the wordnet hierarchy
image parsing/semantic segmentation
recognition is all about modeling variability
 why study classical approaches?
 context
 there are some modeling strategies that are relevant irrespective of what set of ML techniques you layer on top
 SIFT is good at matching exact replicas. what if we want to adapt to semantic-level variability, to understand categories? that approaches human-level capability
 for a known shape, estimate parameters such as camera, pose, and illumination that allow you to model the appearance of that particular shape
 simplified visual worlds > modeling simple geometric shapes, considering junctions & how the appearance of objects in a 2d rendering of a 3d model depends on camera, pose, illumination
 the idea of treating complex objects as assemblies of subcomponents
 "generalized cylinders"
 to extend this to multiple objects, break them down into geometric primitives
 if we can recognize these primitives, each object can be described as an assembly of them. Recognition-by-components theory - Wikipedia
 part-based recognition
 Forsyth - human body as general shape primitives
 Zisserman 1995 - geometric models
 Ponce 1989
 subsequently/in parallel: people trying to integrate statistical models of appearance variability
 of faces: eigenfaces (Turk & Pentland 1991)
 taking the eigenvalue decomposition of the space of faces
 color histograms
 building up feature descriptors based on color histograms
timeline
 1960s - early 1990s: geometric era
 1990s: appearance-based models
 1990s - present: sliding window approaches
 for every subregion of an image, build a predictor that tells whether the subregion falls into a category; ask this question for each patch, treating the patch as an image itself, until all possible subregions are checked
 late 1990s: local features - SIFT and local feature design
 for object instance recognition even under occlusion or geometric deformation
 large-scale image search
 for correlating multiple views of 3d structure
 building 3d models based on camera positions & photographs
 early 2000s: part-based models
 object as a set of parts, with relative locations between parts (a valid geometric arrangement/configuration of those parts via a "spring" system; parts can move a bit depending on how tightly the springs couple them)
 some probability distribution over the spatial distance and angular relationship between parts; modeling pairwise interactions between parts
 constellation models
 focused on deformation; models a few parts
 pictorial structure model
 instead of modeling all pairwise combinations between parts, we have a factorized model of how parts interact
 fits well with articulated objects like the body (joints that connect each part)
 primary motivation: to simplify the process of figuring out the optimal configuration of part geometry from evidence of part appearance
 facilitates efficient detection algorithms, since we look at part interactions only along the edges of a tree - not ALL pairs. http://cs.brown.edu/people/pfelzens/papers/lsvmpami.pdf
 2000s: bag-of-features models
 loses all concept of spatial arrangement
 view an image as an unordered collection of SIFT or color descriptors
 good for texture recognition
 build up histograms over patches
 spatial pyramid representation: calculate bags at multiple scales, with descriptors binned over the subdivisions of the pyramid
 alternatively, build the pyramid in the feature space of descriptors rather than over the image, and match based on those pyramids; encode descriptor content along with histograms over spatial locations
Lecture 7
challenges of recognition
 illumination, deformation, occlusion, clutter, intra-class variation
data driven approach: linear classifier
 f(x,W) = Wx + b
 image features: perform histogramming operations to extract some representation of image color, texture, or content
 with multiple of these feature processing pipelines, we can concatenate a longer feature representation vector to feed into our simple classifier
 histograms of oriented gradients (HoG)
 how do we define this classifier? what are the parameters?

 define a loss function that quantifies our unhappiness with the scores across the training data
 hinge loss / support vector machine loss: sum over the incorrect labels, comparing each score to the correct class's score via max(0, s_j - s_{y_i} + 1); if the score of the correct class exceeds every other class's score by at least 1, we incur no penalty; otherwise we incur a linearly increasing loss penalty (sketch below)

 come up with a way of efficiently finding the parameters that minimize the loss function.
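The per-example hinge loss written out (a minimal numpy sketch; the scores are toy numbers, not course data):

```python
import numpy as np

def svm_loss(scores, y):
    # scores: raw class scores; y: index of the correct class
    margins = np.maximum(0, scores - scores[y] + 1.0)
    margins[y] = 0.0   # the correct class contributes no penalty
    return margins.sum()

print(svm_loss(np.array([3.2, 5.1, -1.7]), y=0))   # 2.9
```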

data driven approach: regularization
L(W) = data loss + regularization. regularization prevents the model from doing too well on the training data, i.e. prevents overfitting
why regularize?
 express some bias over what we expect reasonable model parameters to look like for our set of problems, so that we don’t fit noise into the data
 affects optimization when we devise an algorithmic strategy to figure out what the optimal parameter W should be
typical regularization
 L2: sum of squares of all the entries of the weight matrix, amounting to preferring weights that are spread out, making multiple features matter in the classification decision rather than just 1
 prefers simpler models
 an unregularized version might produce a function that is too complicated and fits noise in the data
the softmax classifier
 we want to engineer our scoring function to output probabilities
 need to turn the raw real-valued score vector into a probability distribution over classes
 compute scores, pass them through an exponential function to ensure they're nonnegative, then normalize so the probabilities sum to one (logits > probabilities)
 we compare the probabilities predicted by the model with the correct probabilities given by the ground truth
 think of a distance metric that measures the difference between the predicted and ground-truth distributions; using a definition of divergence (Kullback-Leibler divergence) we can define an objective or loss function - so we minimize the cross-entropy loss between the two probability distributions (sketch below)
 min/max possible loss L_i: min 0, max infinity
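Softmax plus cross-entropy for one example (a sketch; subtracting the max is a standard numerical-stability trick, not part of the math):

```python
import numpy as np

def cross_entropy_loss(scores, y):
    shifted = scores - scores.max()            # logits, stabilized
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[y])                   # 0 when p=1, -> infinity as p -> 0

print(cross_entropy_loss(np.array([3.2, 5.1, -1.7]), y=0))
```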
gradient descent
recall the 1d derivative of a function; in multiple dimensions, the gradient is the vector of partial derivatives along each dimension. taking the gradient of the loss function with respect to our parameters W gives us an idea of how the loss changes as we move the entries of the weight matrix. so, we have an optimization strategy: compute the gradient of the loss with respect to the current parameters, then update those parameters in the negative gradient direction so that we decrease the loss.
stochastic gradient descent
 repeat this process ^
 with a loss function, compute the gradient of the loss with respect to our parameters
 with a large dataset, estimate the gradient on a random minibatch at each step (sketch below)
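Skeleton of the minibatch loop (illustrative; loss_and_grad is a hypothetical placeholder for whatever model and loss are in play):

```python
import numpy as np

def sgd(W, data, labels, loss_and_grad, lr=1e-3, batch_size=64, n_steps=1000):
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        idx = rng.choice(len(data), size=batch_size, replace=False)  # sample a minibatch
        loss, grad = loss_and_grad(W, data[idx], labels[idx])
        W = W - lr * grad      # step in the negative gradient direction
    return W
```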
issue: a linear classifier can't separate classes that aren't linearly separable in the raw input space
solution: feature transformation; transform the data first, and then classify. hand-designed features > trained ones
idea of the new deep learning approach: the separation between feature extraction and classifier disappears; one pipeline ends up doing both, with parameters and a loss function
neural networks
linear score function: f = Wx
2 layers: f = W2 max(0, W1 x)
3 layers: f = W3 max(0, W2 max(0, W1 x))
if there were no nonlinearity > we would get a linear classifier again (sketch below)
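The 2-layer form written out in numpy (dimensions are arbitrary stand-ins, e.g. a flattened 32x32x3 image and 10 classes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3072)                   # flattened input image
W1 = rng.standard_normal((100, 3072)) * 0.01    # first layer weights
W2 = rng.standard_normal((10, 100)) * 0.01      # second layer weights

h = np.maximum(0, W1 @ x)    # ReLU nonlinearity between the layers
scores = W2 @ h              # one score per class
print(scores.shape)          # (10,)
```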
activation functions in common use 
 ReLU: max(0, x)
 rectified linear unit - rectification = clipping negative inputs to an output of 0
 ELU - exponential linear unit
 Leaky ReLU
 nonzero but small response to negative inputs
how to compute gradients?
score function s; loss function on the predictions (SVM hinge or cross-entropy); regularization term
total loss: prediction loss + regularization, with lambda as a hyperparameter trading off regularization against classification error. then compute the gradient of the loss with respect to W1 and W2 (if s = W2 max(0, W1 x)) > apply SGD to update the weights W1 and W2 with this local optimization strategy
how to compute partials wrt w1 and w2?
good strategy: computational graphs + backpropagation. example with scalars x, y, z: f(x, y, z) = (x + y)z
the intermediate value is q = x + y, so f = qz
we want partial derivatives at each stage in the data flow graph. for the first intermediate computation q = x + y: dq/dx = dq/dy = 1, and df/dq = z, df/dz = q. compute df/dx, df/dy, df/dz with the chain rule
so df/dq is an upstream gradient and dq/dy is a local gradient: df/dy = (df/dq)(dq/dy)
so we can backpropagate from outputs through the preceding hidden states of the network; eventually we can compute the gradient of the loss with respect to all activations in the network (intermediate outputs) and all the network parameters (the weights that participate in defining the function)
for any other modules or nodes in the graph, we continue the process
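The (x+y)z example worked numerically with sample values:

```python
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass: upstream gradient times local gradient at each node
df_dq = z            # -4
df_dz = q            #  3
df_dx = df_dq * 1.0  # dq/dx = 1  ->  -4
df_dy = df_dq * 1.0  # dq/dy = 1  ->  -4
print(df_dx, df_dy, df_dz)
```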
patterns in gradient flow
 add gates: distribute the upstream gradient to both inputs
 mul gates: "swap multiplier" - each input's gradient is the upstream gradient times the other input
 copy gates: sum the upstream gradients
 max gates: "gradient router" - the gradient flows only to the input that achieved the max
backprop: modularized implementation
a framework modularizes the forward and backward passes for you. in order to backpropagate, in the forward pass we want to stash values at a given node: if z = xy, the output going forward is forward(), and we also want to keep the input values that will participate in the backward pass
so we need to store the hidden state of the network computed in the forward pass, and use those values in the backward pass (sketch below)
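A minimal forward/backward module for a multiply node (my sketch of the pattern; real frameworks generalize it to tensors):

```python
class Multiply:
    def forward(self, x, y):
        self.x, self.y = x, y          # stash inputs for the backward pass
        return x * y

    def backward(self, dz):            # dz = upstream gradient dL/dz
        return dz * self.y, dz * self.x   # "swap multiplier": dL/dx, dL/dy

gate = Multiply()
z = gate.forward(3.0, -2.0)
print(gate.backward(1.0))              # (-2.0, 3.0)
```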
vectors and matrices?
derivative of a vector wrt a scalar: a vector of partials; derivative of a vector wrt a vector: the Jacobian
the upstream gradient, instead of being a scalar value, is a vector; the local gradients are Jacobian matrices; backprop happens via matrix-vector multiplication - upstream * local, again
the upstream gradient can itself be a matrix of partials with respect to z: for each element of z, how much does it influence L? A Derivation of Backpropagation in Matrix Form - Sudeep Raja - Doctoral Student at Columbia University
Lecture 8
ahhhhhhhhhhhhhhhhhh
motivation for sharing filter weights? there is an expectation that image statistics are translation invariant. What is a receptive field in a convolutional neural network? - Quora
pooling layers
helps us control the spatial size of the representation; used to downsample by a factor of 2 in each spatial direction
 treat each channel as an array & apply the pooling independently
max pooling
A Gentle Introduction to Pooling Layers for Convolutional Neural Networks
max pooling makes pooling more discriminative: take the maximal response to each filter within a spatial region, to preserve its important characteristics
 2x2 max pooling with a stride of 2, etc. (sketch below)
after enough pooling: 1x1 spatial resolution
 can then apply a fully connected layer like last lecture
 gradual pooling is the most natural and efficient way to implement a classification network
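2x2 max pooling with stride 2 via reshaping (a numpy sketch; assumes one channel and even spatial dimensions):

```python
import numpy as np

def max_pool_2x2(x):
    h, w = x.shape
    # group pixels into 2x2 blocks, then take the max within each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))   # [[ 5  7]
                         #  [13 15]]
```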
summary
classic net architecture for classification: repeating blocks of conv and ReLU, pooling to change spatial resolution, repeat a few times, then connect a fully connected classification layer and attach a softmax
activation functions
sigmoid: saturates; and if the input to a neuron is always positive, the gradients on its weights are all positive or all negative
tanh(x): still has the gradient saturation issue
ReLU: does not saturate (for positive inputs), is computationally efficient, and converges faster than the others; however, its output is not zero-centered
in practice, use ReLU and don't use sigmoid
data preprocessing
assists in learning
 PCA and Whitening of data
 normalization (in linear classifiers)
weight initialization
Weight Initialization in Neural Networks: A Journey From the Basics to Kaiming
with weights too small, activation statistics collapse toward zero in later stages of the network > not much signal gets through; if activations are all saturated instead, the network computes nearly the same output for every input and we are in a regime where we would need to move the parameters a lot
schemes
 Xavier initialization
 Understanding Xavier Initialization In Deep Neural Networks - PERPETUAL ENIGMA; helps us figure out what our initial weights (filters/kernels) should be
 ReLU
 for ReLU activations, adjusting by a factor of 2 yields a reasonable initialization here (Kaiming/He initialization; sketch below)
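The two scalings side by side (a sketch; fan-in scaling is shown, and the sqrt(2) factor is the ReLU correction):

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # variance ~ 1/fan_in keeps activation scale roughly constant across layers
    return np.random.randn(fan_out, fan_in) / np.sqrt(fan_in)

def kaiming_init(fan_in, fan_out):
    # extra factor of 2 compensates for ReLU zeroing half the activations
    return np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)
```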
batch normalization
as we update our network parameters, we can try to dynamically enforce some property of the activation statistics
insert, after every nonlinearity in the network, a renormalization layer that takes the activations and dynamically maps the activation set to some desired property
What is zero mean and unit variance in terms of image data? - Cross Validated. initialize these layers to perform pure normalization (zero mean, unit variance per feature), then let learned scale-and-shift parameters move away from that starting point
A Gentle Introduction to Batch Normalization for Deep Neural Networks. improves performance of conv nets; the dependence on batch size is reduced by the variants below
layer normalization
for n examples and d feature dimensions, normalize over d.
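The difference in normalization axes for an (N, D) activation matrix (sketch; the learned scale-and-shift parameters are omitted):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # normalize each feature across the batch (axis 0)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    # normalize each example across its features (axis 1)
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```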
instance normalization
Lecture 9
optimization - problems with SGD: what if the parameter space has strange behavior? loss changing quickly in one direction and slowly in another? high fluctuation in the gradient > jagged updates. what if the gradient is zero? gradient descent gets stuck
add a momentum term: if you're moving in a direction, keep moving in that direction - build up velocity as a running mean of gradients (a velocity term with some memory); rho gives friction
for a current point in parameter space, the actual step is the sum of velocity and gradient > nesterov momentum: instead of the gradient at the current place in parameter space, look ahead to where the velocity would push you, compute the gradient there, and mix it with the velocity to get the actual update direction (sketch below)
CS231n Convolutional Neural Networks for Visual Recognition
Nesterov Accelerated Gradient and Momentum
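The velocity update in code (a sketch; grad is a hypothetical gradient function, and the hyperparameters are typical defaults):

```python
import numpy as np

def sgd_momentum(w, grad, lr=1e-2, rho=0.9, n_steps=100):
    v = np.zeros_like(w)
    for _ in range(n_steps):
        v = rho * v + grad(w)   # running mean of gradients; rho acts as friction
        w = w - lr * v          # step along the accumulated direction
    return w

# nesterov variant: evaluate grad at the look-ahead point w - lr * rho * v instead
```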
other update method: Adagrad
 keep track of the historical sum of squared gradients in each dimension
 take larger steps along dimensions whose gradients have historically been small
 smaller steps where there is a history of recent gradients with widely varying values
 adjusts the learning rate on a per-element basis (sketch below)
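Adagrad's per-element scaling (same sketch conventions as above; grad is a placeholder):

```python
import numpy as np

def adagrad(w, grad, lr=1e-2, eps=1e-8, n_steps=100):
    g2 = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad(w)
        g2 += g * g                            # historical sum of squares, per dimension
        w = w - lr * g / (np.sqrt(g2) + eps)   # damped where gradients have been large
    return w
```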
Cutout regularization for CNNs  Ombeline Lagé  Medium
regularization: dropout, batch normalization, data augmentation, …
Lecture 11
 [Fast R-CNN]
 [deep watershed transform for instance segmentation]
 generate an energy surface over the domain of the image, so every pixel has a probability associated with the likelihood that it is the object center, or a distance from the object center
 potential energy field - low near the center, high otherwise
 image morphology operations > how do I partition the image based on this energy function?
 flooding grows local minima into larger regions
 where flooded regions meet, that forms a partition
 [CNNs and spectral embedding]
 the CNN is driving a prediction about image content, which we can reassemble in the scene as objects & segmentation of the objects
 “We train a convolutional neural network (CNN) to directly predict the pairwise relationships that define this affinity matrix. “
 “ Spectral embedding then resolves these predictions into a globally consistent segmentation and figure/ground organization of the scene”
 what output are we training the nn to predict? how does it relate to reassembling the interpretation of the image?
 [multiperson pose estimation using part affinity fields]
 model of joint locations and connections between joint locations
 different intermediate predictions
 for each type of joint, predict the probability that the joint is at a given location in the scene (detecting all instances in the scene); joint localization
 output density for elbows, knees, torsos, heads, wrists, etc
 one set of channels that simultaneously detects relevant joints, no matter how many people are in the picture
 rather than a loop that runs subprocesses for each candidate person
 predicts how to connect up the joints to one another
 another channel, in terms of output prediction, that gives an affinity field (e.g. left elbows to left shoulders): connection strengths between adjacent joints; read off the reassembly of a full pose from the predicted affinities
 [neural image caption generation with visual attention]
 gradually generates a word-by-word description of the image; shows which words are activated by the process that selects attention over subparts of the image
multigrid neural architectures
 evolution of CNNs into something that looks like the rainbow picture
 3x3 conv filters stacked over many layers; increasing feature channel count from 64 > 512; spatial pooling and subsampling
 this design makes no sense
 efficient algorithms in CS do coarse-to-fine classification or decision making; tree-based organization
 a CNN is instead fine-to-coarse - building larger context as you go
 slow receptive field growth
 for an activation of a neuron, the receptive field is how much of the input is connected by some pathway to that particular unit in the net. the first layer has a 3x3 receptive field, then 5x5, then 7x7 > constant growth of the receptive field as you add layers > it takes a long time until anything relies on the input in its entirety (why is this an issue? see below)
 features early in the network affect subsequent layers
 if we buy into this story of a gradual buildup to more abstract feature representations, there is an enforced choice that later layers contain more abstraction and more spatial scale (coarseness) > a coupling of the two
 instead
 store a pyramid of activation tensors
 every layer in the net has an extra dimension of scale space
 different tensors at different spatial scales
 flow of information from coarser spatial grids to finer ones, and vice versa, everywhere in the network
 a series of layers emulates the fully connected layers
 an info pathway that flows between spatial scales - provides shortcut pathways across the spatial dimension in the network
 triggering rapid receptive field growth
 allows networks to learn tasks that standard CNNs have a difficult time with
 multigrid convolution
 layers, instead of 1 activation tensor, have a set of them at different spatial scales, with different numbers of channels; define an operation that turns the pyramid of input into another pyramidal output - a multigrid extension of convolution, built out of simple components of conv nets
 upsampling (nearest neighbor or bilinear)
 downsampling (max pooling)
 pooling (communication)
 gives us an operation where the CNN evolves representations on pyramids
 has a standard CNN embedded within it
 key properties
 images are multiscale - why not do something coarse-to-fine?
 receptive field growth
 diameter of the finest-scale spatial grid = s
 in O(log s) layers, there are pathways in the network connecting every unit with any other unit in the pyramid, instead of the constant receptive field growth - O(s) layers - of a standard CNN
 information propagates exponentially faster
 (why does the receptive field matter?)
"The concept of receptive field is important for understanding and diagnosing how deep CNNs work. Since anywhere in an input image outside the receptive field of a unit does not affect the value of that unit, it is necessary to carefully control the receptive field, to ensure that it covers the entire relevant image region. In many tasks, especially dense prediction tasks like semantic image segmentation, stereo and optical flow estimation, where we make a prediction for each single pixel in the input image, it is critical for each output pixel to have a big receptive field, such that no important information is left out when making the prediction." http://www.cs.toronto.edu/~wenjie/papers/nips16/top.pdf
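The arithmetic behind the slow-growth complaint: a stack of stride-1 3x3 convolutions grows the receptive field by only 2 pixels per layer (a quick check):

```python
def receptive_field_3x3(n_layers):
    rf = 1
    for _ in range(n_layers):
        rf += 2   # a 3x3 kernel extends the field by 1 pixel on each side
    return rf     # equivalently 1 + 2 * n_layers

print([receptive_field_3x3(n) for n in (1, 2, 3, 10)])   # [3, 5, 7, 21]
```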
 internal attention
 why didn't traditional CNNs start coarse?
 you need to go up the scale space and back down
attentional tasks / learned attention
 building CNNs that undo spatial transformations [Jaderberg, spatial transformer networks]
 MNIST digits rendered in distorted forms (offset, noise) with the goal of recovering the undistorted version
 had a net that did localization, outputting translation parameters
 and a sampler that, given the transformation, undoes the parameterized transformation
 if the transformer is replaced with a multigrid CNN
 it learns the task
 building an attention map: what portion of the input does the upper left of the output depend on?
 as the location of the digit in the input image changes, the attentional pattern of the multigrid network changes, whereas that of a U-Net is static
 humans try to hand-design modules that compute a mask over some set of spatial locations, multiply it with an activation tensor in the net, then summarize the result by pooling for later use - an attention mechanism built in
 multigrid gives an implicit mechanism
 it infers what spatial subregion of the image is important for the output
 [dynamic routing between capsules (… Hinton)]
 hinton's capsules
 a more complicated routing strategy from layer to layer
 overlapped MNIST digits
 a different strategy for creating vector-valued representations that are meaningful
 each component of the activation tensor …
 what was missing from neural nets for CV:
 vector-valued representations emerging internally, encoding things like local pose parameters
 want to architect it with more complicated interactions between layers, so that there is a more informative decomposition
 trying to change how layers operate internally in order to route information by "agreement"
 information flow from layer to layer
 detecting evidence for components that agree with a particular configuration of an object
 multigrid learns this
memory architecture of multigrid
 learns to store a map of the maze in the internal state of its network
 capacity for internal attention over spatial locations
 if you augment the net with memory, it can attend to its own internal memory
 this lets it perform read and write operations
 and learn attentional strategies
 learn about the memory network
generative adversarial networks
 EBGAN
 BigGAN - just a few years' difference in research
 generative tasks, formulations, and issues
 GANs are not the only option for generative tasks
 generation: learn to sample from the distribution represented by the learning task
 unsupervised learning
 conditional generation
 conditioning the output to generate from a subspace of the full space
 "an indigo bunting facing right"
 semantic segmentation and its inverse
 images to semantic labeling, and vice versa
 labels to street scene. day to night. b&w to color. edges to photo
 rendering and art
designing a network for generative tasks
 architecture
 encoder-decoder; upsampling architectures for dense prediction
 autoencoders
 since we don't have labelled training data, how do we design loss functions?
 we want to measure how close our output is to the training data; instead of writing the loss in analytical form, train another net to predict whether the output of the first net looks realistic
 train two networks with opposing objectives
 generator: generates samples
 discriminator: distinguishes between generated and real samples
 want the generator to fool the discriminator; want the discriminator not to be fooled
Lecture 12
designing a network for generative tasks
 take an encoder/decoder or U-Net architecture
 consider its latter half & smaller middle part
 ?
 loss function
 neural net or otherwise
how to implement the loss function? samples are generated according to a probability distribution; the training data has a probability distribution; we want p_model to match p_data > probability models
 train 2 neural nets
 one is a generator
 one is a discriminator

loss: conditional log likelihood for real and generated data; high score for real, low score for fake (sketch below)
 a nash equilibrium occurs if the generator produces samples from the same distribution as the underlying data distribution. A Gentle Introduction to Generative Adversarial Network Loss Functions
GAN Objective Functions: GANs and Their Variations - Towards Data Science
KL and JS divergences between probability distributions
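The standard losses written over discriminator probabilities (a numpy sketch; the non-saturating generator loss shown is the common practical variant of the minimax objective):

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-8):
    # high score on real samples, low score on generated ones
    return -(np.log(d_real + eps).mean() + np.log(1.0 - d_fake + eps).mean())

def generator_loss(d_fake, eps=1e-8):
    # the generator wants the discriminator to call its samples real
    return -np.log(d_fake + eps).mean()
```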
Lecture ??
learning to produce informative image descriptions
recurrent networks
Word2vec - Wikipedia
Recurrent Neural Networks - Towards Data Science
visual semantic space: https://cs.stanford.edu/people/karpathy/sfmltalk.pdf