Mini-lecture on Differentiable Neural Architecture Search (DARTS)

1,919 views

AIxplained

1 year ago

Mini-lecture in which we discuss differentiable neural architecture search. More specifically, we explain DARTS (arxiv.org/abs/1806.09055) in quite some detail, although we did not cover everything, such as the cell-based search approach, in which we search for cells and stack them in pre-defined ways rather than searching for the entire network architecture at once.
Liked the video? Share with others!
Any feedback, comments, or questions? Let me know in the comments section below!

Comments: 9
@arentsteen5452 3 days ago
Very interesting. Thank you!
@MonkkSoori 11 days ago
Does it make sense to speak of using regularization like L1 in the DARTS equations to make sure that some of the connections to some of the considered components of the model architecture (the "architecture coefficients" you mentioned) are reduced to zero to end up with a discrete architecture instead of a soft one?
@aixplained4763 10 days ago
It is definitely possible to use L1 regularization, although it should be noted that this will not immediately lead to a discrete architecture. So we will still need to retrain the model after having selected the discrete components. We can thus ask ourselves whether we gain anything from the L1 regularization if we select the top-K architectural components and have to perform retraining afterward anyway.
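For illustration, here is a minimal sketch of what adding such an L1 term to the loss used in the architecture step could look like. This is an assumption for concreteness, not something shown in the video or the DARTS paper, and `val_loss`, `arch_params`, and `l1_weight` are hypothetical names.

```python
import torch

def architecture_loss(val_loss: torch.Tensor,
                      arch_params: torch.Tensor,
                      l1_weight: float = 1e-3) -> torch.Tensor:
    # Hypothetical: validation loss plus an L1 penalty on the raw operation
    # strengths. As noted above, this does not by itself yield a discrete
    # architecture: the softmax mixture stays soft, so selecting the top-k
    # components and retraining is still required afterward.
    return val_loss + l1_weight * arch_params.abs().sum()
```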
@Ssc2969 7 months ago
Thank you so much for this lecture. I just want to clarify one question. If I understand correctly, each operation in DARTS has two kinds of parameters: alpha (the operation strength / operation coefficient) and the actual weights w of that operation (which, for a CNN, are probably matrices). Initially all of these operation weights w and alphas are set randomly and fixed at the start of the process; we are NOT manually intervening on them. My question is: in contrast to usual CNN architectures, does DARTS assign a new parameter, alpha, on top of the actual weights of these operations, and are these alphas kept the same at the start of the process? As I understand from the DARTS paper, the set of all alphas forms a vector, and the alpha associated with each operation is a scalar. Nodes are states or latent representations, and edges carry the operation choices. Each intermediate node j then receives a mixed operation: a sum over operations of {the softmax score based on each alpha, multiplied by the feature map obtained by applying that operation to x}, where x is the feature map from the predecessor node i. The operations in question are the choices that lie on the edge connecting node j to its predecessor i. DARTS then minimizes the training loss by updating the weights of the operations, i.e., it optimizes the actual weights (matrices) of the operations by gradient descent. Using those weights, it also minimizes the validation loss by gradient descent, and in this step it optimizes the alphas of the associated operation choices (the scalars that represent operation strength and are used in the softmax). So the alpha vector (the set of all individual alphas) is also being changed here, I hope? The alphas we get at the end of the process are compared, and the operation with the highest value is chosen. Thus DARTS jointly optimizes the weights (via the training loss) and the alphas / operation coefficients / operation strengths (via the validation loss).
Questions:
0) Is my understanding above correct?
1) What is the relation between alpha (the operation strength) and the actual weights of that operation (for example, the weights of a conv filter, which form a matrix that is optimized during training)?
2) How are the alphas initialized at the beginning? Are these scalars randomly assigned, or are they all the same at the start? We are not manually assigning weights to these operations, right?
3) The authors also assume that each cell has 2 input nodes and 1 output node. While the output can be a single node, in my implementation I used more than 2 input nodes, or even just 1. Since equation (1) of the DARTS paper simply sums the operations over all predecessor nodes, I don't think this loses generality. Please let me know your thoughts.
I would be really grateful if you could clarify these questions. Lectures on NAS are very rare, so there are few chances to ask questions like this. Thanks a lot for uploading your videos. It helps!
@aixplained4763 7 months ago
Great questions!
0) Yes, very good!
1) The weights, by and large, determine the behavior of the neural network: they compute the next hidden state (embedding) from the received input. Assume we go from the input X to the first hidden layer, and that we have 3 operations to choose from. Each of these 3 operations computes a certain hidden embedding. The alphas are used to mix/weigh these embeddings into a single hidden state for the next layer. The alphas are basically the output of the softmax applied to the operation strengths of these 3 operations. In the lecture, the operation strengths are unfortunately called w, but they are not equal to the weights/matrices of the neural network!
2) The alphas are usually initialized so that the operations are mixed uniformly. A common default is to set all operation strengths to 0, so that applying a softmax to them to compute the alphas yields a uniform distribution. For example, with 4 operations and every operation strength initialized to 0, the alphas (obtained by taking a softmax over the operation strengths) would all be 0.25.
3) I agree with you. You could use any number of inputs per node and the DARTS algorithm can still be applied perfectly well.
I hope this helps! Best of luck with your application :)
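To make this concrete, here is a minimal PyTorch sketch of the ideas above (not the official DARTS code): a single mixed edge whose candidate operations are mixed by softmaxed operation strengths, plus the alternating first-order update of the weights on the training loss and of the operation strengths on the validation loss. The toy search space, the random stand-in data, and names such as `MixedEdge` and `arch_params` are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedEdge(nn.Module):
    """One DARTS-style mixed edge over a toy set of candidate operations."""

    def __init__(self, channels: int):
        super().__init__()
        # Three candidate operations on this edge (a toy search space).
        self.candidate_ops = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
            nn.Identity(),  # skip connection
        ])
        # Operation strengths, initialized to zero so the softmax (the alphas)
        # starts as a uniform mixture over the candidates.
        self.arch_params = nn.Parameter(torch.zeros(len(self.candidate_ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alphas = F.softmax(self.arch_params, dim=0)  # [1/3, 1/3, 1/3] at init
        # Mixed operation: alpha-weighted sum of every candidate's output.
        return sum(a * op(x) for a, op in zip(alphas, self.candidate_ops))

edge = MixedEdge(channels=8)
weights = [p for n, p in edge.named_parameters() if n != "arch_params"]
w_opt = torch.optim.SGD(weights, lr=0.025)             # operation weights w
a_opt = torch.optim.Adam([edge.arch_params], lr=3e-4)  # operation strengths

# Toy train/validation batches (random stand-ins for real data).
x_tr, y_tr = torch.randn(4, 8, 16, 16), torch.randn(4, 8, 16, 16)
x_va, y_va = torch.randn(4, 8, 16, 16), torch.randn(4, 8, 16, 16)

for step in range(10):
    # 1) Update the operation weights w on the training loss.
    w_opt.zero_grad()
    F.mse_loss(edge(x_tr), y_tr).backward()
    w_opt.step()
    # 2) Update the operation strengths (and hence the alphas) on the validation loss.
    a_opt.zero_grad()
    F.mse_loss(edge(x_va), y_va).backward()
    a_opt.step()

# Discretization: keep the candidate with the largest strength, then retrain
# the resulting discrete architecture from scratch.
best_op = edge.arch_params.argmax().item()
```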
@aixplained4763 6 months ago
Happy that it was helpful! Unfortunately, I do not have LinkedIn, otherwise I would have loved to connect. I can give you my email though, if you want :)
@Ssc2969 6 months ago
Thank you very much, Professor. That sounds great! I can drop you an email then. Thanks a lot. @aixplained4763
@ijrodmakar 10 months ago
Unfortunately, it is inaudible.
@aixplained4763 7 months ago
Sorry for this. On a PC, it is audible!