Notes:Deep Learning




Assessment

1. Presentation: during the seminar a small group of students presents a paper. You will have to present once.

2. Students will have to submit relevant questions about papers/lectures

3. Lab assignment: in a small group of students you work on a deep learning paper reproducibility project.

4. Exam about the papers and the theory.

***Disclaimer: Assessment this year may change depending on the COVID-19 virus***

Exam

The exam will be on weblab. Exam info: Closed book exam. Questions are multiple choice. The order of the questions is randomized. Every question counts equally. Wrong answers are not penalized. Select the single best fitting answer for each question. Only the answers that you save are recorded.

Practice exam: We created a test exam, please take that exam to ensure you are familiar with the setting (Go to “your submissions” to see the exams)

Form: To reduce the likelihood of fraud, individual questions are timed, and you will not be able to change your answer to a question after its time is up.

Your email: In the first question we will ask for your email address so we may contact you directly after the exam for an oral video session.

Stay online: Please check your email until the official exam end time.


Notes per lecture

Lecture 1: Feedforward neural nets and training them

Feedforward networks are simply connected networks that pass information from their input nodes forward through the layers until the output node is reached.

Feed forward network specifics

A neural network consists of initial values, weights, activation functions and biases. It is built from multiple layers that are connected to each other, each layer forwarding its values to the next in a specific way. The initial values usually come from the input to the network. Weights are the multipliers that connect two layers; they must be tuned so that the final layer arrives at a sensible answer. By tuning the weights, the loss can be minimized so that the output of the network is as close as possible to the actual model.

Between layers, an activation function is typically present. Activation functions take many forms and serve to normalize or bound the values flowing through the network. One popular activation function is the Rectified Linear Unit, ReLU for short, which is simply max{0, z}. Another is the sigmoid, a logistic function that maps its input to a value between 0 and 1 (near 0 for very large negative inputs, near 1 for very large positive inputs). For any node in a layer inside the network, the incoming weight * input products are summed to calculate the value of that node, after which the activation is applied.
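A minimal sketch of such a forward pass is below; the layer sizes, the random weights and the use of NumPy are assumptions for illustration, not something given in the lecture.

```python
# Minimal sketch: one hidden ReLU layer and a sigmoid output node.
# Layer sizes and weights are arbitrary (randomly initialized), purely for illustration.
import numpy as np

def relu(z):
    # ReLU: max{0, z}, applied elementwise
    return np.maximum(0.0, z)

def sigmoid(z):
    # Logistic function: squashes any value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # input values (3 features)

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # weights and bias: input -> hidden
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # weights and bias: hidden -> output

h = relu(W1 @ x + b1)       # each hidden node sums its weight * input values, then ReLU
y = sigmoid(W2 @ h + b2)    # output node, bounded to (0, 1) by the sigmoid
print(y)
```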

Training by loss function

To train a network to the correct parameters, a loss function is used. A popular loss function is the Mean Squared Error (MSE). Of course we want to minimize the output of the loss function. By taking the derivative of the loss with respect to the network's parameters we can see how much the loss will change when a parameter changes. Loss reduction is achieved by finding an extremum, i.e. a point where the derivative is close to 0. This is done by moving along the function, guided by the value of its derivative. Moving in the wrong direction will cause the error to increase. By repeatedly moving against the derivative, we converge to a point where the derivative is close to 0 and the error is small.

To find a minimum of the error we need the derivative with respect to ALL weights to be (close to) zero; these values are found by taking all partial derivatives. This combination of partial derivatives is the gradient of the function, and the whole process of stepping against it is gradient descent.
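A minimal sketch of gradient descent on an MSE loss is below; the one-weight linear model, the data and the learning rate are made up for illustration.

```python
# Minimal sketch: gradient descent on the MSE of a one-weight linear model y_hat = w * x.
# The derivative of the loss tells us which direction increases the error, so we step
# the other way until that derivative is close to 0.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                         # toy targets generated with true weight 2.0

w = 0.0                             # initial weight
lr = 0.05                           # learning rate (step size along the derivative)

for step in range(100):
    y_hat = w * x
    loss = np.mean((y_hat - y) ** 2)        # MSE
    grad = np.mean(2 * (y_hat - y) * x)     # d(loss)/dw, the (only) partial derivative
    w -= lr * grad                          # move against the derivative

print(w, loss)   # w converges toward 2.0 as the derivative approaches 0
```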

Lecture 2:

...

Lecture 3: Convolutional Neural Network (CNN)

End-to-end learning in deep learning means there is no separate, hand-crafted feature extraction step. Kernels are used to filter images and detect features.

We look at an example network for an image. The image is convolved, a non-linearity function is applied, spatial pooling is applied, and the results are combined for the next layer. By using multiple kernels in a convolution, we can detect different features. After convolution, we can apply a non-linearity to 'remove' all negative values from the data. By applying spatial pooling, a reduction of the image size, we create a lower-resolution image. This is done to improve the effectiveness of the feature detection: we can look at larger features and obtain a smaller model (activation maps do not all have to be stored) to train on in the backpropagation step. By stacking the resulting feature maps on top of each other after the spatial pooling, we keep the feature information for the next layer.
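Below is a minimal sketch of one such convolution / non-linearity / pooling step; the tiny 4x4 'image' and the 2x2 kernel are made up for illustration.

```python
# Minimal sketch: convolve a toy 4x4 image with a 2x2 kernel, apply ReLU, then max-pool.
import numpy as np

image = np.array([[1., 2., 0., 1.],
                  [0., 1., 3., 1.],
                  [2., 0., 1., 0.],
                  [1., 1., 0., 2.]])
kernel = np.array([[1., -1.],
                   [-1., 1.]])          # a hand-picked toy feature detector

# Valid convolution (written as cross-correlation, as deep learning libraries do)
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i+2, j:j+2] * kernel)

out = np.maximum(out, 0.0)              # non-linearity: 'remove' negative values

# 2x2 spatial max pooling (stride 1 here) shrinks the activation map further
pooled = np.array([[out[i:i+2, j:j+2].max() for j in range(2)] for i in range(2)])
print(pooled)
```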

Multiplication by a Toeplitz matrix (diagonal-constant matrix) is the same as a convolution. Some things to notice about these matrices: they are sparse, local and share parameters.
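A small sketch of this equivalence, with a made-up 1D signal and kernel: the same sliding-window filtering is computed once directly and once as multiplication by a hand-written Toeplitz matrix. The matrix rows show exactly the three properties above: sparse, local (non-zeros near the diagonal) and shared parameters (the same kernel weights in every row).

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
k = np.array([1., -1.])                             # kernel of length 2

# Sliding-window form (the cross-correlation used in CNNs)
sliding = np.array([np.dot(k, x[i:i+2]) for i in range(4)])

# Toeplitz (diagonal-constant) matrix form of the same operation
T = np.array([[1., -1., 0., 0., 0.],
              [0., 1., -1., 0., 0.],
              [0., 0., 1., -1., 0.],
              [0., 0., 0., 1., -1.]])
matrix = T @ x

print(np.allclose(sliding, matrix))                 # True: same result
```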

A CNN is by design a limited-parameter version of a feedforward network. This means we have fewer parameters and less flexibility. This is preferred, since we can learn faster, and with convolution we target the features that are expected, which makes them easier to distinguish.

Shifting and convolving are commutative (either can be done first without changing the result).
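A quick numerical check of this claim, on a made-up signal that is zero-padded at both ends so the boundary does not interfere:

```python
import numpy as np

x = np.array([0., 0., 1., 2., 3., 2., 1., 0., 0.])  # toy signal, zero-padded edges
k = np.array([1., 2., 1.])

shift = lambda v: np.roll(v, 1)                     # shift one position to the right

a = np.convolve(shift(x), k, mode="same")           # shift first, then convolve
b = shift(np.convolve(x, k, mode="same"))           # convolve first, then shift
print(np.allclose(a, b))                            # True
```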

Stochastic Gradient Descent (SGD)

Lecture 4: ..

Lecture 5: ..

Learning rate

Choosing an appropriate learning rate is important.

We use algorithms like momentum, RMSprop and Adam to make the training converge to the optimum smoothly. Momentum keeps a moving average of past gradients, RMSprop scales each step by a moving average of past squared gradients, and Adam combines momentum and RMSprop.


Noisy series can be averaged out to allow for better learning
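A minimal sketch of these update rules is below; they are the standard textbook forms, and the hyperparameter values and the toy loss are made up for illustration.

```python
# Minimal sketch: single-parameter updates for momentum, RMSprop and Adam.
# All three smooth the raw (noisy) gradients with exponential moving averages.
import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + (1 - beta) * grad        # moving average of past gradients
    return w - lr * v, v

def rmsprop_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad**2     # moving average of past squared gradients
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad            # momentum-style average of gradients
    v = b2 * v + (1 - b2) * grad**2         # RMSprop-style average of squared gradients
    m_hat = m / (1 - b1**t)                 # bias correction for the first steps
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimize the loss w**2 with Adam
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * w                            # gradient of the toy loss
    w, m, v = adam_step(w, grad, m, v, t)
print(w)                                    # approaches the minimum at 0
```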


Lecture 6: GRU, LSTM

For parsing