Microsoft deep learning interview questions
Questions
• Q1: What are autoencoders? Explain the different layers of autoencoders and mention three practical usages of them?
• Q2: What is an activation function and discuss the use of an activation function? Explain three different types of activation functions?
• Q3: You are using a deep neural network for a prediction task. After training your model, you notice that it is strongly overfitting the training set and that the performance on the test set isn’t good. What can you do to reduce overfitting?
• Q4: Why should we use Batch Normalization?
• Q5: How to know whether your model is suffering from the problem of Exploding Gradients?
• Q6: Can you name and explain a few hyperparameters used for training a neural network?
• Q7: Can you explain the parameter sharing concept in deep learning?
• Q8: Describe the architecture of a typical Convolutional Neural Network (CNN).
• Q9: What is the Vanishing Gradient Problem in Artificial Neural Networks and How to fix it?
• Q10: When it comes to training an artificial neural network, what could be the reason why the loss doesn’t decrease in a few epochs?
• Q11: Why Sigmoid or Tanh is not preferred to be used as the activation function in the hidden layer of the neural network?
• Q12: Discuss in what context it is recommended to use transfer learning and when it is not.
• Q13: Discuss the vanishing gradient in RNN and How it can be solved.
Questions & Answers
Q1: What are autoencoders? Explain the different layers of autoencoders and mention three practical usages of them.
Autoencoders are a type of neural network used for unsupervised learning. Their key layers are the input layer, encoder, bottleneck (hidden) layer, decoder, and output layer.
The three main components of an autoencoder are:
1. Encoder: compresses the input data into an encoded representation that is typically much smaller than the input data.
2. Latent space representation / bottleneck / code: a compact summary of the input containing its most important features.
3. Decoder: decompresses the encoded representation and reconstructs the data from it. A loss function is then used to compare the input with the reconstructed output (for example, input and output images). Note: the dimensionality of the input and output must be the same. Everything in the middle can be adjusted.
Autoencoders have a wide variety of usage in the real world. The following are some of the popular ones:
1. Transformers and Big Bird (autoencoder-style components are used in both): text summarization and text generation
2. Image compression
3. A nonlinear version of PCA
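To make the encoder / bottleneck / decoder structure concrete, here is a minimal sketch of a forward pass through a tiny linear autoencoder in plain Python. The layer sizes (4 → 2 → 4), the hand-picked weights, and the helper names are illustrative assumptions; a real autoencoder would learn its weights by minimizing the reconstruction loss.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Hypothetical hand-picked weights (not learned):
# the encoder maps 4 inputs to a 2-value code,
# the decoder maps the code back to 4 outputs.
ENCODER = [[0.5, 0.5, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.5]]
DECODER = [[1.0, 0.0],
           [1.0, 0.0],
           [0.0, 1.0],
           [0.0, 1.0]]

def autoencode(x):
    code = matvec(ENCODER, x)               # bottleneck: smaller than the input
    reconstruction = matvec(DECODER, code)  # same dimensionality as the input
    return code, reconstruction

x = [2.0, 2.0, 6.0, 6.0]
code, x_hat = autoencode(x)
loss = mse(x, x_hat)  # reconstruction loss compares input and output
```

This particular input is reconstructed exactly because its values repeat in pairs; an input such as [1.0, 3.0, 6.0, 6.0] would lose information in the bottleneck and give a non-zero loss.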
Q2: What is an activation function and discuss the use of an activation function? Explain three different types of activation functions?
In mathematical terms, the activation function serves as a gate between a neuron's input and its output, which is passed on to the next layer. Basically, it decides whether a neuron should be activated or not. It is used to introduce non-linearity into a model.
Activation functions are added to introduce non-linearity to the network. Without them, no matter how many layers or neurons the network has, its output is just a linear combination of the inputs. In other words, activation functions are what make a neural network different from a linear regression model. We need non-linearity to capture more complex features and to model more complex variations that simple linear models cannot capture.
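The claim that stacked layers without activations stay linear can be checked with a small sketch. The matrices W1 and W2 and the input x below are arbitrary illustrative values, not taken from any real model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def relu(v):
    """Element-wise ReLU on a vector."""
    return [max(0.0, t) for t in v]

W1 = [[1.0, 2.0], [0.0, 1.0]]  # weights of a hypothetical first layer
W2 = [[3.0, 0.0], [1.0, 1.0]]  # weights of a hypothetical second layer
x = [1.0, -2.0]

# Two layers with no activation in between...
two_layers = matvec(W2, matvec(W1, x))
# ...are equivalent to one layer whose weight matrix is W2 @ W1.
one_layer = matvec(matmul(W2, W1), x)

# With a ReLU between the layers the result can no longer be
# reproduced by that single matrix.
with_relu = matvec(W2, relu(matvec(W1, x)))
```

Here `two_layers` and `one_layer` are identical, while `with_relu` differs, which is the non-linearity the activation function introduces.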
There are a lot of activation functions:
• Sigmoid function:
f(x) = 1/(1+exp(-x))
Its output lies between 0 and 1, so we can use it for binary classification. It has some problems, such as vanishing gradients at the extremes, and it is relatively expensive to compute because it uses exp.
• ReLU:
f(x) = max(0,x)
It returns 0 if the input is negative and the input itself if it is positive. It solves the vanishing gradient problem on the positive side; however, the problem remains on the negative side, where the gradient is 0. It is fast because it is a simple piecewise linear function.
• Leaky ReLU:
f(x) = ax for x < 0
f(x) = x for x >= 0
It mitigates the vanishing gradient problem on both sides by using a small slope “a” on the negative side (returning ax rather than 0), and it behaves like ReLU on the positive side.
• Softmax: it is usually used at the last layer for a classification problem because it returns a set of probabilities, where the sum of them is 1. Moreover, it is compatible with cross-entropy loss, which is usually the loss function for classification problems.
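The four functions above can be sketched in plain Python as follows. The default slope a=0.01 for Leaky ReLU is a common choice but an assumption here, and the sigmoid and softmax are written in a numerically stable form rather than copied literally from the formulas.

```python
import math

def sigmoid(x):
    """1 / (1 + exp(-x)), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def relu(x):
    """max(0, x)."""
    return max(0.0, x)

def leaky_relu(x, a=0.01):
    """x for x >= 0, a * x otherwise (a is the small negative-side slope)."""
    return x if x >= 0 else a * x

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```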
Q3: You are using a deep neural network for a prediction task. After training your model, you notice that it is strongly overfitting the training set and that the performance on the test set isn’t good. What can you do to reduce overfitting?
To reduce overfitting in a deep neural network, changes can be made in three places: the input data to the network, the network architecture, and the training process.
1. The input data to the network:
• Check if all the features are available and reliable
• Check that the training sample distribution matches the validation and test set distributions. If they differ, the patterns learned from the training data will not carry over, and the model will predict poorly on the unseen data.
• Check for train / valid data contamination (or leakage)
• Check that the dataset is large enough; if not, try data augmentation to increase its size
• Check that the dataset is balanced
2. Network architecture:
• Overfitting could be due to model complexity. Question each component:
• Can fully connected layers be replaced with convolutional + pooling layers?
• What is the justification for the number of layers and number of neurons chosen? Given how hard it is to tune these, can a pre-trained model be used?
• Add regularization: lasso (L1), ridge (L2), elastic net (both)
• Add dropout
• Add batch normalization
3. The training process:
• Improvements in the validation loss should decide when to stop training. Use an early-stopping callback that halts training when the validation loss stops improving significantly, with restore_best_weights enabled to roll back to the best weights.
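The early-stopping rule from the training-process step can be sketched without any framework. The function name `early_stopping_epoch` and the `patience` and `min_delta` parameters are assumptions made for illustration; the returned best epoch is the one whose weights a restore_best_weights option would roll back to.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # stop early
    return best_epoch, len(val_losses) - 1  # ran through every epoch

# Made-up validation losses: improvement stalls after epoch 2.
history = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
best_epoch, stop_epoch = early_stopping_epoch(history, patience=2)
```

With this made-up history, training stops at epoch 4 and the weights from epoch 2 are kept, so the later drop to 0.50 is never seen. That is the trade-off a small patience value makes.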
Q4: Why should we use Batch Normalization?
Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch.
Usually, a dataset is fed into the network in mini-batches, and the distribution of the data differs from batch to batch. This can contribute to vanishing or exploding gradients during backpropagation. To combat these issues, we can add a BN layer, most commonly on the inputs to a layer's activation, that is, after a fully connected layer and before its activation function.
Batch Normalisation has the following effects on the Neural Network:
1. More robust training of the deeper layers of the network.
2. An architecture that is more robust to covariate shift.
3. A slight regularization effect.
4. Centered and controlled activation values.
5. Helps prevent exploding/vanishing gradients.
6. Faster training and convergence towards the minimum of the loss function.
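As a rough sketch of what "standardizes the inputs to a layer for each mini-batch" means, the function below normalizes a single feature across one mini-batch and then applies a scale and shift. The names `gamma`, `beta`, and `eps` follow common usage but are assumptions here, and the sketch covers only the training-time forward pass (no running statistics for inference).

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# Made-up pre-activation values of one neuron for a mini-batch of 4 samples.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
```

After normalization the values have a mean of about 0 and a variance of about 1; in a real network `gamma` and `beta` are learned so the layer can undo the normalization where that helps.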
Q5: How to know whether your model is suffering from the problem of Exploding Gradients?
Gradient descent aims to minimize the error by taking incremental steps towards the minimum. The weights and biases of a neural network are updated using these steps. However, at times the steps grow excessively large, producing ever larger updates to the weights and bias terms, to the point where the weights overflow (or become NaN, that is, Not a Number). This is the exploding gradient problem, and it makes training unstable.
There are some subtle signs that you may be suffering from exploding gradients during the training of your network, such as:
1. The model is unable to get traction on your training data (e.g. poor loss).
2. The model is unstable, resulting in large changes in loss from update to update.
3. The model loss goes to NaN during training.
If you have these types of problems, you can dig deeper to see if you have a problem with exploding gradients. There are some less subtle signs that you can use to confirm that you have exploding gradients:
1. The model weights quickly become very large during training.
2. The model weights go to NaN values during training.
3. The error gradient values are consistently above 1.0 for each node and layer during training.
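The confirming signs above can be checked programmatically during training. The sketch below assumes the gradients are available as one list of numbers per layer; the function names and the threshold of 100.0 are illustrative assumptions, not a standard value.

```python
import math

def global_grad_norm(gradients):
    """L2 norm over all gradient values from all layers (list of lists)."""
    return math.sqrt(sum(g * g for layer in gradients for g in layer))

def diagnose(gradients, threshold=100.0):
    """Classify one training step's gradients as ok, exploding, or NaN/inf."""
    flat = [g for layer in gradients for g in layer]
    if any(math.isnan(g) or math.isinf(g) for g in flat):
        return "nan_or_inf"
    if global_grad_norm(gradients) > threshold:
        return "exploding"
    return "ok"

healthy = [[0.1, -0.2], [0.05]]       # made-up gradients for two layers
suspicious = [[1e3, 2e3], [5e2]]
broken = [[float("nan"), 0.1]]
```

Logging `global_grad_norm` at every step and watching for a steady climb is usually more informative than a single threshold check.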
Q6: Can you name and explain a few hyperparameters used for training a neural network?
Answer:
Hyperparameters are settings that affect a model's performance but, unlike parameters (weights and biases), are not learned from the data; they have to be set by the user.
1. Number of nodes: number of neurons (units) in each layer.
2. Batch normalization: normalization/standardization of inputs in a layer.
3. Learning rate: the rate at which weights are updated.
4. Dropout rate: percent of nodes to drop temporarily during the forward pass.
5. Kernel: the matrix (filter) with which the dot product of the image array is computed; its size is chosen by the user.
6. Activation function: defines how the weighted sum of inputs is transformed into outputs (e.g. tanh, sigmoid, softmax, ReLU, etc.)
7. Number of epochs: number of complete passes over the training dataset during training
8. Batch size: number of samples passed through the network together before the weights are updated. E.g. if the dataset has 1000 records and we set a batch size of 100, the dataset is divided into 10 batches, which are propagated through the network one after another.
9. Momentum: momentum can be seen as a learning rate adaptation technique that adds a fraction of the past update vector to the current update vector. This helps damp oscillations and speeds up progress towards the minimum.
10. Optimizers: they focus on getting the learning rate right. Adagrad uses a larger learning rate for infrequent features and a smaller one for frequent features. Other optimizers, such as Adadelta, RMSProp, and Adam, further refine how the learning rate and momentum are adapted to reach good weights and biases. Getting the learning rate right is therefore key to well-trained models.
11. Learning rate: controls how much the weight and bias terms are updated after training on each batch. Several helpers, such as the optimizers above, are used to get the learning rate right.
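Two of the hyperparameters above, batch size and momentum, can be illustrated with a short sketch. The helper names, the toy loss f(w) = w^2, and the values learning_rate=0.1 and momentum=0.9 are assumptions chosen for illustration.

```python
def batches(records, batch_size):
    """Yield consecutive slices of `records` of length `batch_size`."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# 1000 records with a batch size of 100 give 10 batches per epoch.
num_batches = sum(1 for _ in batches(list(range(1000)), 100))

def sgd_momentum(grad_fn, w, learning_rate=0.1, momentum=0.9, steps=200):
    """Gradient descent where each update adds a fraction of the previous one."""
    velocity = 0.0
    for _ in range(steps):
        # Fraction of the past update plus the current gradient step.
        velocity = momentum * velocity - learning_rate * grad_fn(w)
        w = w + velocity
    return w

# Toy loss f(w) = w^2 with gradient 2w; the minimum is at w = 0.
w_final = sgd_momentum(lambda w: 2.0 * w, w=5.0)
```

Starting from w = 5.0, the weight ends up close to 0 after 200 steps; changing `learning_rate` or `momentum` changes how quickly (or whether) that happens, which is why both are tuned.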
Q7: Can you explain the parameter sharing concept in deep learning?
Answer:
Parameter sharing is the method of sharing weights across all neurons in a particular feature map. It therefore helps to reduce the number of parameters in the whole system, making it computationally cheap. It means that the same parameters are used to represent different transformations in the system, so the same matrix elements may be updated multiple times during backpropagation from different gradients. The same set of elements facilitates transformations at more than one layer, instead of at a single layer as is conventional. This is usually done in architectures like Siamese networks that tend to ...
Q12: Discuss in what context it is recommended to use transfer learning and when it is not.
Transfer learning can be used in the following cases:
1. The downstream task has very little data available. We can then reuse the pre-trained model weights and replace the last layer with new layers, which we train.
2. In some cases, such as vision-related tasks, the initial layers share a common behavior: they detect edges, then slightly more complex but still abstract features, and so on. This is common to all vision tasks, so a pre-trained model’s initial layers can be used directly. The same holds for language models; for example, a model trained on a large Hindi corpus can be transferred to other low-resource Indo-Aryan languages.
Cases when transfer learning should not be used:
1. The first and most important consideration is cost: is transfer learning cost-effective, or can we reach similar performance without it?
2. The pre-trained model has no relation to the downstream task.
3. If latency is a big constraint (mostly in NLP), transfer learning may not be the best option. However, with platforms such as TensorFlow Lite and techniques such as model distillation, latency is now much less of a problem.
Q13: Discuss the vanishing gradient in RNN and How it can be solved.
Answer:
In sequence models such as RNNs, the input sentences might have long-term dependencies. For example, we might say “The boy who was wearing a red t-shirt, blue jeans, black shoes, and a white cap and who lives at … and is 10 years old … etc., is a genius”. Here the verb (is) depends on the subject (boy); if we said “The boys, …, are geniuses”, the verb would change. When training an RNN we backpropagate both through the layers and backward through time. Without focusing too much on the mathematics, during backpropagation we repeatedly multiply gradient factors that are either > 1 or < 1. If the factors are < 1 and we go about 100 steps backward in time, multiplying 100 numbers that are < 1 gives a vanishingly small gradient, which causes almost no change in the weights as we go backward in time (0.1 * 0.1 * 0.1 * … 100 times = 10^(-100)). In our example, the word “is” then barely influences its main dependency, the word “boy”, during learning because of the long description in between.
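The arithmetic behind this can be reproduced in a few lines. The sketch treats the gradient at each time step as a single constant factor, which is a deliberate simplification of backpropagation through time; the function name is hypothetical.

```python
def gradient_through_time(factor, steps):
    """Multiply a starting gradient of 1.0 by `factor` once per time step."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

vanishing = gradient_through_time(0.1, 100)  # about 10^(-100)
shrinking = gradient_through_time(0.9, 100)  # still tiny with a milder factor
exploding = gradient_through_time(1.1, 100)  # factors > 1 blow up instead
```

Even a factor of 0.9, which looks harmless for one step, leaves almost nothing of the gradient after 100 steps, while a factor slightly above 1 grows into the exploding gradient case.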
Models such as Gated Recurrent Units (GRUs) and Long Short-Term Memory networks (LSTMs) were proposed to address this. Their main idea is to use gates that help the network decide which information to keep and which to discard during learning. Later, Transformers were proposed, which rely on the self-attention mechanism to capture the dependencies between words in a sequence.