Parameter Prediction & Training Without SGD with Prof. Graham Taylor

Feb 14, 2022

Previously on Private AI’s Speaker Series, our CEO Patricia Thaine sat down with data privacy law expert Carole Piovesan to talk about the legal ramifications ML teams should be aware of and what most people misunderstand about data governance. This past week on Private AI’s ML webinar, we sat down with Professor Graham Taylor to discuss parameter prediction and training without Stochastic Gradient Descent (SGD).

Professor Graham Taylor is a Canada Research Chair and Professor of Engineering at the University of Guelph. He co-directs the University of Guelph Centre for Advancing Responsible and Ethical AI and is also the Interim Research Director of the Vector Institute for AI. Graham has co-organized the annual CIFAR Deep Learning Summer School and trained 70+ students and researchers on AI-related projects. He was named one of 18 inaugural CIFAR Azrieli Global Scholars in 2016 and was honoured as one of Canada’s Top 40 Under 40 in 2018.

In this episode, Private AI does a deep dive into Prof. Taylor’s lab research, his inspiration behind training without SGD, the benefits of forgoing SGD, potential experiments, model accuracy improvements, and advice for aspiring graduate students.

PAI: Tell us about some of the most exciting research to come out of your lab.

G: A few themes that are really popular right now in my lab are:

Generative models, that is, models that “create” rather than predict;
How learning can aid in solving combinatorial problems; and
Architectures that can move beyond vector representations to process sets, sequences, and graphs.

Our recent research on parameter prediction challenges the long-held assumption that gradient-based optimizers are required to train deep neural networks. We show that a “meta-model”, a type of graph neural network, can take as input a computational graph that describes a unique network architecture it has never seen before and output a “good” set of parameters.

Astonishingly, the meta-model can predict parameters for almost any neural network in just one forward pass, achieving ~60% accuracy on the popular CIFAR-10 dataset without any training. Moreover, while the meta-model was training, it did not observe any network close to the ResNet-50 whose ~25 M parameters it predicted.

PAI: What inspired the idea behind your lab’s work on training without SGD?

G: The idea was originally proposed by Boris Knyazev, a PhD student in my group who had started an internship at Meta AI Research (then Facebook AI Research) in early 2020 with my long-time collaborators Adriana Romero and Michal Drozdzal. Boris was really fixated on the fact that when we optimize the parameters for a new architecture, typical optimizers disregard past experience gained by optimizing different nets.

Basically, every time you randomly initialize a net and run SGD, it’s tabula rasa. He was motivated to find a way to share experience across training sessions so that a practitioner didn’t always need to start from scratch. So, out of many weeks of discussion and refinement among the four of us, the idea to use a “meta-model” based on something called a Graph Hypernetwork was born.

PAI: Can you tell us more about the benefits of forgoing SGD?

G: The computational requirements of training large-scale architectures are widely known to be one of the downsides of deep learning, for two reasons: 1) energy usage and emissions; and 2) lesser-resourced organizations, such as universities, startups, and certain governments, are often unable to carry out large-scale experiments because they lack the necessary hardware.

Like our 2020 project with Adriana, Michal, and PhD student Terrance DeVries to reduce the computational requirements of GANs, parameter prediction democratizes deep learning by making the technology accessible to smaller players in the field. And even for well-resourced organizations, it’s an effective initializer for certain tasks. There’s also a lot you can do with the embeddings of network architectures. You can predict all kinds of things: their predictive accuracy on clean data, predictive accuracy on noisy data, inference speed, and SGD convergence speed were the ones we demonstrated.
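To make the idea of predicting properties from architecture embeddings concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the embeddings and the target property are synthetic random data, the 32-dimensional embedding size is arbitrary, and the linear least-squares regressor is a simple stand-in rather than the method used in the lab’s work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 architectures, each with a 32-dimensional embedding.
embeddings = rng.normal(size=(200, 32))
# Synthetic target property (e.g. accuracy), built as a noisy linear function.
true_weights = rng.normal(size=32)
targets = embeddings @ true_weights + rng.normal(scale=0.01, size=200)

# Hold out the last 50 architectures to check generalization.
train_x, test_x = embeddings[:150], embeddings[150:]
train_y, test_y = targets[:150], targets[150:]

# Fit a linear regressor with a bias term by ordinary least squares.
design = np.hstack([train_x, np.ones((len(train_x), 1))])
coef, *_ = np.linalg.lstsq(design, train_y, rcond=None)

# Predict the property for architectures the regressor has not seen.
test_design = np.hstack([test_x, np.ones((len(test_x), 1))])
predictions = test_design @ coef
mean_abs_error = float(np.mean(np.abs(predictions - test_y)))
```

With real embeddings, the same pattern would apply to any of the properties mentioned above, with one regressor fitted per property.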

PAI: So you’re using a graph neural network to predict the parameters for new models, with the nodes of the graph being various layer types (attention, convolution, weight normalization, etc.) and the features associated with these nodes being the hidden states of the network.

Can you tell us about how interconnected these nodes are, what else was tested, and what future experiments might test?

G: Yes, the input to the Graph Hypernetwork (or GHN) that predicts parameters is a computational graph that describes the architecture whose parameters you want to predict. The nodes of this computational graph represent operations such as convolutions, fully-connected layers, and summations, while the edges represent connectivity. At the input layer, the node attributes are “one-hot” vectors representing the type of operation.
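As a rough illustration of this input format, the Python sketch below encodes a tiny residual-style architecture as a computational graph with one-hot node attributes. The operation vocabulary, the example architecture, and the helper names `one_hot_features` and `adjacency_matrix` are all hypothetical choices made for this example; the actual GHN uses its own set of primitives.

```python
import numpy as np

# Hypothetical operation vocabulary; the real GHN defines its own primitives.
OP_TYPES = ["input", "conv", "fc", "sum"]

# A tiny residual-style architecture: input -> conv -> conv -> sum -> fc,
# with a skip connection from the first conv into the sum.
node_ops = ["input", "conv", "conv", "sum", "fc"]
edges = [(0, 1), (1, 2), (2, 3), (1, 3), (3, 4)]  # (source, target) pairs


def one_hot_features(ops, vocab):
    """Return a (num_nodes, len(vocab)) matrix with a single 1 per row."""
    features = np.zeros((len(ops), len(vocab)))
    for i, op in enumerate(ops):
        features[i, vocab.index(op)] = 1.0
    return features


def adjacency_matrix(num_nodes, edge_list):
    """Return a directed adjacency matrix with adj[s, t] = 1 for an edge s -> t."""
    adj = np.zeros((num_nodes, num_nodes))
    for src, dst in edge_list:
        adj[src, dst] = 1.0
    return adj


X = one_hot_features(node_ops, OP_TYPES)      # node attributes
A = adjacency_matrix(len(node_ops), edges)    # connectivity
```

Here `X` and `A` together describe the architecture: each row of `X` says what kind of operation a node is, and `A` says which nodes feed into which.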

But, in standard graph neural net fashion, after several rounds or “layers” of message passing, the node features represent local neighbourhood features. We use the node features at the final message passing layer to condition a decoder that predicts the parameters associated with each node. To handle different parameter dimensions per operation type, we reshape and slice the output according to the shape of parameters in each node.
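The pipeline just described (one-hot inputs, message passing, a decoder, then slicing and reshaping) can be sketched in a few lines of Python. This is a heavily simplified illustration under stated assumptions, not the GHN itself: the meta-model weights are random rather than learned, message passing is plain neighbourhood averaging with a `tanh` nonlinearity, the decoder is a single linear map, and the parameter shapes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: conv -> conv -> fc, encoded as one-hot node attributes.
op_types = ["conv", "fc"]
node_ops = ["conv", "conv", "fc"]
# Hypothetical target parameter shapes, one per node.
param_shapes = [(4, 3, 3, 3), (8, 4, 3, 3), (10, 8)]

num_nodes, hidden = len(node_ops), 16
X = np.eye(len(op_types))[[op_types.index(op) for op in node_ops]]
A = np.zeros((num_nodes, num_nodes))
A[0, 1] = A[1, 2] = 1.0
A_sym = A + A.T + np.eye(num_nodes)                # both directions, plus self-loops
A_norm = A_sym / A_sym.sum(axis=1, keepdims=True)  # average over each neighbourhood

# Randomly initialized stand-ins for the meta-model's learned weights.
W_in = rng.normal(scale=0.1, size=(len(op_types), hidden))
W_msg = rng.normal(scale=0.1, size=(hidden, hidden))
max_params = max(int(np.prod(shape)) for shape in param_shapes)
W_dec = rng.normal(scale=0.1, size=(hidden, max_params))

H = X @ W_in
for _ in range(3):  # several rounds ("layers") of message passing
    H = np.tanh(A_norm @ H @ W_msg)

# Decoder: one fixed-size vector per node, shape (num_nodes, max_params).
flat = H @ W_dec

# Slice each node's vector to the number of parameters it needs, then reshape.
predicted = [
    flat[i, : int(np.prod(shape))].reshape(shape)
    for i, shape in enumerate(param_shapes)
]
```

Note that all parameters come out of a single forward pass through the meta-model; no gradient step is taken on the target network.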

The interconnectivity will depend on the specific architecture that’s being represented. In most standard architectures, the computational graph is not extremely dense. However, one family of out-of-distribution networks has extremely dense connections.

This notion of training on a set of standard architectures, which we call the “in-distribution” architectures, and testing on different sets of “out-of-distribution” architectures is an important challenge pursued in this work. We designed and released a dataset of 1M architectures in the form I described earlier, which we use for training. But the goal is not to predict parameters for these architectures. The goal is to predict parameters for unseen architectures that may come from a practitioner or even a procedure such as neural architecture search. That’s why the paper is titled Parameter Prediction for Unseen Deep Architectures.

PAI: Tell us about how the accuracy compares to SGD. How many epochs of SGD training have you observed are needed to reach the same level of accuracy as a GNN-based initialization?

G: The analogy I like to give is if a network at initialization is a newborn baby, and a network trained by SGD is an adult, the network you get by parameter prediction is like a toddler. However, it’s a toddler that has skipped a few years of learning. You basically went from baby to toddler for free. This research is in its early days, so don’t expect an adult (i.e. an extremely performant network) yet.

To give some concrete results, the GHN can predict all 24 million parameters of a ResNet-50, achieving 60% accuracy on CIFAR-10. And it never saw ResNet-50 before; that architecture is not in the DeepNets-1M training set. On ImageNet, the top-5 accuracy of some of our networks approaches 50%.

Something else we showed was that GHNs are effective in transfer learning, particularly in the low-data regime. In one experiment, we use a GHN trained on ImageNet to transfer to CIFAR-10 using only 100 labeled examples per class or 1,000 examples total. Here, we’re about 10 percentage points better vs. Kaiming He initialization, and about the same as pre-training on ImageNet for 2,500 steps. But that amount of ImageNet pre-training takes about 1500 GPUs, and a forward pass of GHN is a fraction of a second. We also perform a second transfer learning experiment on the Penn-Fudan object detection dataset. Again, our GHN’s performance is similar to about 1,000 steps of pre-training with SGD.
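For readers who want to reproduce the shape of the low-data setting, the Python sketch below draws 100 examples per class, giving 1,000 examples across CIFAR-10’s 10 classes. The helper `sample_per_class` and the synthetic label list are assumptions for illustration; the sampling procedure actually used in the experiments is not described here.

```python
import random
from collections import defaultdict


def sample_per_class(labels, per_class, seed=0):
    """Return sorted indices containing `per_class` examples of every label."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_label):
        # Raises ValueError if a class has fewer than `per_class` examples.
        chosen.extend(rng.sample(by_label[label], per_class))
    return sorted(chosen)


# Synthetic labels shaped like CIFAR-10's training set: 10 classes x 5,000 images.
labels = [i % 10 for i in range(50_000)]
subset = sample_per_class(labels, per_class=100)
```

The resulting indices would then select the small labeled set used for fine-tuning, whichever initialization is being compared.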

Now, the bad news: although we find significant gains using our method for initialization on two low-data tasks, we didn’t find the initialization beneficial in the case of more data. So if we compare fine-tuning on top of Kaiming He’s initialization vs. fine-tuning on top of GHN initialization in the full-data CIFAR-10 setting, GHN is considerably worse.

PAI: Are you researching how to improve initial accuracy of the models whose weights are predicted by a GNN?

G: Perhaps surprisingly, we’re not moving in this direction right away, though I think improvements here are possible. We actually think there are some interesting uses for the “toddler” style networks.

PAI: Is there any other work your lab is doing that you’d like to tell us about?

G: Further on the subject of generative models and graph neural networks, we have been focusing on how to properly evaluate the output of generative models. These “creative” systems are tough to evaluate quantitatively because there’s no single correct answer as there often is in predictive systems. A couple of years ago, my former PhD student Terrance DeVries studied this topic in the context of conditional GANs for images.

Conditioning adds another layer of complexity because you’re giving the generator additional information on “what” you want to generate. So the output needs to be 1) high quality and 2) diverse, meaning you don’t always generate the same thing. Those are the two things people usually care about in image generation. But it also needs to be 3) consistent with what you’re conditioning on. If I ask for a cat, don’t generate me a very nice set of diverse dogs.

More recently, another student named Rylee Thompson has been systematically evaluating metrics for generative models that output graphs. Boris is also part of this project. The literature for graph generative model evaluations is much less mature than that of image generation. We have a paper appearing at the upcoming ICLR 2022.

PAI: If someone would like to apply to be your graduate student, what would you like them to know?

G: I think that applicants really need to walk the line between demonstrating that they are a truly exceptional candidate and demonstrating integrity. Let me deconstruct that.

AI/ML is a very popular area of study and Vector is a world-class institution (I mean top-5 or top-10 in the world). Now I am going to brag a bit: my lab in Guelph is a top training institution for AI/ML talent. I have spent a lot of time with my research manager on creating resources and workflows for graduate student development.

Our grads have gone on to careers in the majority of top tech companies in Canada and the US (Meta, NVIDIA, Layer 6). They have gone on to pursue further studies at U of T and Stanford. I’ve advised students who have gone on to form Tendermint and Clarifai. So we are getting a lot of interest, particularly from abroad. I’ve sat in Next AI interviews where Ajay Agrawal has straight up asked candidates, “So what makes you outstanding?” I am not as direct as that, but I expect to see that come out on an application.

So you need to convince the evaluator of your application that you stand out among all the other candidates who are passionate about AI/ML. But you need to do it in a way that’s sincere. Sometimes candidates don’t distinguish between conference papers and workshop papers. For example, they write a bibliography entry such that it appears they published a paper at the CVPR main conference when it was actually at one of the associated workshops. Don’t do that.

I think it’s amazing that someone’s publishing at topical workshops during undergrad or a Master’s. Be sincere and don’t hide the details in a bibliographic entry so that it seems more prestigious. Another example: I was interviewing an intern candidate recently who had listed “completed” and “in progress” courses on their CV. It was very clear: in-progress courses were written as IPR. But I noticed there were more than 10 courses listed as IPR and many of them were graduate courses. This was an undergraduate.

So I asked them: how could you be taking so many courses right now? Can’t you only take 5-6 courses at once? They eventually said that some of those courses were “planned” and not “in-progress”. Again, that’s an example of not demonstrating integrity. And in academics, integrity is so important. How can I trust someone to always do the right thing in science when I can’t trust what’s in their CV?