Parameter Prediction & Training Without SGD with Prof. Graham Taylor

Feb 14, 2022

Previously on Private AI’s Speaker Series, our CEO Patricia Thaine sat down with data privacy law expert Carol Piovesan to talk about the legal ramifications ML teams should be aware of and what most people misunderstand about data governance. This past week on Private AI's ML webinar, we sat down with Professor Graham Taylor to discuss parameter prediction and training without Stochastic Gradient Descent (SGD).

Professor Graham Taylor is a Canada Research Chair and Professor of Engineering at the University of Guelph. He co-directs the University of Guelph Centre for Advancing Responsible and Ethical AI and is also the Interim Research Director of the Vector Institute for AI. Graham co-organized the annual CIFAR Deep Learning Summer School, has trained 70+ students and researchers on AI-related projects, was named one of 18 inaugural CIFAR Azrieli Global Scholars in 2016, and was honoured as one of Canada's Top 40 Under 40 in 2018.

In this episode, Private AI does a deep dive into Prof. Taylor's lab research, his inspiration behind training without SGD, the benefits of forgoing SGD, potential experiments, model accuracy improvements, and advice for aspiring graduate students.

PAI: Tell us about some of the most exciting research to come out of your lab.

G: A few themes that are really popular right now in my lab are:

  • Generative models, that is, models that “create” rather than predict;
  • How learning can aid in solving combinatorial problems; and
  • Architectures that can move beyond vector representations to process sets, sequences, and graphs.

The parameter prediction research in particular challenges the long-held assumption that gradient-based optimizers are required to train deep neural networks. We show that a “meta-model”, a type of graph neural network, can take as input a computational graph describing a network architecture it has never seen before and output a “good” set of parameters.

Astonishingly, the meta-model can predict parameters for almost any neural network in just one forward pass, achieving ~60% accuracy on the popular CIFAR-10 dataset without any training. Moreover, while the meta-model was training, it did not observe any network close to the ResNet-50 whose ~25 M parameters it predicted.
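To make that concrete, here is a minimal sketch of what “predicting parameters in one forward pass” looks like in use. The module and helper names (GraphHyperNetwork, to_computational_graph, load_pretrained) are illustrative assumptions, not the actual API from the paper’s code release:

```python
import torch
import torchvision.models as models

# Hypothetical meta-model and graph-conversion helper; the names are
# illustrative, not the API of the published implementation.
from hypothetical_ghn import GraphHyperNetwork, to_computational_graph

meta_model = GraphHyperNetwork.load_pretrained("cifar10")  # trained once, ahead of time

# A network the meta-model never saw during its own training.
net = models.resnet50(num_classes=10)

graph = to_computational_graph(net)   # nodes = operations, edges = connectivity
predicted = meta_model(graph)         # one forward pass; no SGD on `net` itself

# Copy the predicted tensors into the target network and evaluate it directly.
with torch.no_grad():
    for name, param in net.named_parameters():
        param.copy_(predicted[name])
```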

PAI: What inspired the idea behind your lab’s work on training without SGD?

G: The idea was originally proposed by Boris Knyazev, a PhD student in my group who had started an internship at Meta AI Research (then Facebook AI Research) in early 2020, with my long-time collaborators Adriana Romero and Michal Drozdzal. Boris was really fixated on the fact that when we optimize the parameters for a new architecture, typical optimizers disregard past experience gained by optimizing different nets.

Basically every time you randomly initialize a net and run SGD, it’s tabula rasa. He was motivated to find a way to share experience across training sessions so that a practitioner didn’t always need to start from scratch. So out of many weeks of discussion and refinement among the four of us, the idea to use a “meta model” based on something called a Graph Hypernetwork was born.

PAI: Can you tell us more about the benefits of forgoing SGD?

G: The computational requirements of training large-scale architectures are widely known to be one of the downsides of deep learning, for two reasons: 1) energy usage and emissions; and 2) lesser-resourced organizations, like universities, startups, and certain governments, are often unable to carry out large-scale experiments because they lack the necessary hardware.

Like our 2020 project with Adriana, Michal, and PhD student Terrance DeVries to reduce the computational requirements of GANs, parameter prediction democratizes DL by making the technology accessible to smaller players in the field. And even for well-resourced organizations, it’s an effective initializer for certain tasks. And there’s a lot you can do with the embeddings of network architectures. You can predict all kinds of things; the ones we demonstrated were predictive accuracy on clean data, predictive accuracy on noisy data, inference speed, and SGD convergence speed.
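Once every architecture has a fixed-size embedding, predicting such properties reduces to ordinary regression on top of those embeddings. A rough sketch of the idea, assuming a hypothetical embed_architecture helper that returns a graph-level embedding (pooled node features) for an architecture, plus a small set of architectures with measured accuracies:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical helper: returns a fixed-size vector summarizing an
# architecture's computational graph (e.g., pooled node features).
from hypothetical_ghn import embed_architecture

# Assume `seen_architectures` is a list of (architecture, measured_accuracy)
# pairs for networks we have already trained and evaluated.
X = np.stack([embed_architecture(arch) for arch, _ in seen_architectures])
y = np.array([acc for _, acc in seen_architectures])

regressor = Ridge(alpha=1.0).fit(X, y)

# Estimate the accuracy of a brand-new architecture without training it;
# the same recipe applies to inference speed or convergence speed.
x_new = embed_architecture(new_architecture).reshape(1, -1)
estimated_accuracy = regressor.predict(x_new)[0]
```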

PAI: So you’re using a graph neural network to predict the parameters for new models, with the nodes of the graph being various layer types (attention, convolution, weight normalization, etc.) and the features associated with these nodes being the hidden states of the network.

Can you tell us about how interconnected these nodes are, what else was tested, and what future experiments might test?

G: Yes, the input to the Graph Hypernetwork (or GHN) that predicts parameters is a computational graph that describes the architecture whose parameters you want to predict. The nodes of this computational graph represent operations such as convolutions, fully-connected layers, and summations, while the edges represent connectivity. At the input layer, the node attributes are “one-hot” vectors representing the type of operation.

But in standard graph neural net fashion, after several rounds or “layers” of message passing, each node’s features come to represent its local neighbourhood. We use the node features at the final message-passing layer to condition a decoder that predicts the parameters associated with each node. To handle different parameter dimensions per operation type, we reshape and slice the output according to the shape of the parameters in each node.
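As a rough illustration of that pipeline (one-hot operation types in, a few rounds of message passing, then a per-node decoder whose output is sliced and reshaped), here is a minimal PyTorch-style sketch. The layer sizes, message-passing rule, and all names are simplifying assumptions, not the actual GHN implementation:

```python
import torch
import torch.nn as nn

NUM_OP_TYPES = 16   # assumption: number of distinct operation types
HIDDEN = 32         # assumption: node feature width
MAX_PARAMS = 4096   # assumption: most parameters any single node can need

class TinyGraphHypernetwork(nn.Module):
    def __init__(self, rounds=3):
        super().__init__()
        self.embed = nn.Linear(NUM_OP_TYPES, HIDDEN)   # one-hot op type -> node features
        self.update = nn.GRUCell(HIDDEN, HIDDEN)       # node update after aggregating messages
        self.decoder = nn.Linear(HIDDEN, MAX_PARAMS)   # per-node parameter decoder
        self.rounds = rounds

    def forward(self, op_types, adj, param_shapes):
        # op_types:     (N, NUM_OP_TYPES) one-hot node attributes
        # adj:          (N, N) float adjacency matrix of the computational graph
        # param_shapes: one torch.Size per node (empty for parameter-free ops)
        h = self.embed(op_types)
        for _ in range(self.rounds):          # rounds of message passing
            messages = adj @ h                # sum the features of each node's neighbours
            h = self.update(messages, h)
        flat = self.decoder(h)                # (N, MAX_PARAMS)
        params = []
        for i, shape in enumerate(param_shapes):
            if len(shape) == 0:               # parameter-free op (e.g., a summation)
                params.append(None)
                continue
            numel = int(torch.Size(shape).numel())
            # slice the decoded vector and reshape it to this node's parameter shape
            params.append(flat[i, :numel].reshape(shape))
        return params
```

The real GHN uses a more elaborate decoder (and handles parameter tensors larger than a single decoder output), but the overall flow is the same.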

The interconnectivity will depend on the specific architecture being represented. In most standard architectures, the computational graph is not extremely dense. However, one family of out-of-distribution networks has extremely dense connections.

This notion of training on a set of standard architectures, which we call the “in-distribution” architectures, and testing on different sets of “out-of-distribution” architectures is an important challenge pursued in this work. We designed and released a dataset of 1M architectures, DeepNets-1M, in the form I described earlier, which we use for training. But the goal is not to predict parameters for these architectures. The goal is to predict parameters for unseen architectures that may come from a practitioner or even a procedure such as neural architecture search. That’s why the paper is titled Parameter Prediction for Unseen Deep Architectures.

PAI: Tell us how the accuracy compares to SGD. How many epochs of training with SGD have you observed are needed to reach the same level of accuracy as a GNN-based initialization?

G: The analogy I like to give is if a network at initialization is a newborn baby, and a network trained by SGD is an adult, the network you get by parameter prediction is like a toddler. However, it’s a toddler that has skipped a few years of learning. You basically went from baby to toddler for free. This research is in its early days, so don’t expect an adult (i.e. an extremely performant network) yet.

To give some concrete results, the GHN can predict all 24 million parameters of a ResNet-50, achieving 60% accuracy on CIFAR-10. And it never saw ResNet-50 before; it’s not in the DeepNets-1M training set. On ImageNet, the top-5 accuracy of some of our networks approaches 50%.

Something else we showed was that GHNs are effective in transfer learning, particularly in the low-data regime. In one experiment, we use a GHN trained on ImageNet to transfer to CIFAR-10 using only 100 labeled examples per class, or 1,000 examples total. Here, we’re about 10 percentage points better than Kaiming He initialization, and about the same as pre-training on ImageNet for 2,500 steps. But that amount of ImageNet pre-training takes about 1,500 GPU seconds, whereas a forward pass of the GHN takes a fraction of a second. We also performed a second transfer learning experiment on the Penn-Fudan object detection dataset. Again, our GHN’s performance is similar to about 1,000 steps of pre-training with SGD.
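In code, using a predicted initialization for low-data transfer is just ordinary fine-tuning with a different starting point. A minimal sketch, assuming `net` has already had its parameters filled in by the meta-model as in the earlier sketch, and using an illustrative 1,000-example subset of CIFAR-10 rather than the paper’s exact sampling:

```python
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader, Subset

# Illustrative low-data setup: the first 1,000 training examples stand in
# for the "100 labeled examples per class" setting described above.
transform = T.Compose([T.ToTensor()])
train_set = torchvision.datasets.CIFAR10("data", train=True, download=True,
                                         transform=transform)
loader = DataLoader(Subset(train_set, range(1000)), batch_size=64, shuffle=True)

# `net` is assumed to be the GHN-initialized ResNet-50 from the earlier sketch.
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

net.train()
for epoch in range(10):               # a short fine-tuning budget
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(net(images), labels)
        loss.backward()
        optimizer.step()
```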

Now, the bad news: although we find significant gains using our method for initialization on two low-data tasks, we didn’t find the initialization beneficial in the case of more data. So if we compare fine-tuning on top of Kaiming He’s initialization vs. fine-tuning on top of GHN initialization in the full-data CIFAR-10 setting, GHN is considerably worse.

PAI: Are you researching how to improve the initial accuracy of models whose weights are predicted by a GNN?

G: Perhaps surprisingly, we’re not moving in this direction right away, though I think improvements here are possible. We actually think there are some interesting uses for the “toddler”-style networks.

PAI: Is there any other work your lab is doing that you’d like to tell us about?

G: Further on the subject of generative models and graph neural networks, we have been focusing on how to properly evaluate the output of generative models. These “creative” systems are tough to evaluate quantitatively because there’s no single correct answer as there often is in predictive systems. A couple of years ago, my former PhD student Terrance DeVries studied this topic in the context of conditional GANs for images. 

Conditioning adds another layer of complexity because you’re giving the generator additional information on “what” you want to generate. So the output needs to be 1) high quality and 2) diverse, meaning you don’t always generate the same thing. Those are the two things people usually care about in image generation. But it also needs to be 3) consistent with what you’re conditioning on. If I ask for a cat, don’t generate me a very nice set of diverse dogs.
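The consistency criterion in particular lends itself to a simple automated check: classify the generated images with an independent, pretrained classifier and measure how often its prediction agrees with the conditioning label. A hedged sketch of that idea (not the specific metrics studied in the work mentioned), with a placeholder generator standing in for a real conditional model:

```python
import torch
import torchvision.models as models

def generator(noise, labels):
    """Placeholder conditional generator; swap in a real model that maps
    (noise, class labels) to images."""
    return torch.rand(noise.size(0), 3, 224, 224)

# Independent classifier used only for evaluation.
classifier = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

labels = torch.randint(0, 1000, (64,))   # the classes we ask the generator for
noise = torch.randn(64, 128)

with torch.no_grad():
    images = generator(noise, labels)
    predictions = classifier(images).argmax(dim=1)

# Fraction of samples whose predicted class matches the requested class.
consistency = (predictions == labels).float().mean().item()
print(f"label consistency: {consistency:.2%}")
```

In a real evaluation the generated images would be preprocessed to match the classifier’s expected input statistics; this only shows the shape of the consistency check.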

More recently, another student named Rylee Thompson has been systematically evaluating metrics for generative models that output graphs. Boris is also part of this project. The literature for graph generative model evaluations is much less mature than that of image generation. We have a paper appearing at the upcoming ICLR 2022.

PAI: If someone would like to apply to be your graduate student, what would you like them to know?

G: I think that applicants really need to walk the line between demonstrating that they are a truly exceptional candidate and demonstrating integrity. Let me deconstruct that.

AI/ML is a very popular area of study, and Vector is a world-class institution (I mean top-5 or top-10 in the world). Now I am going to brag a bit: my lab in Guelph is a top training institution for AI/ML talent. I have spent a lot of time with my research manager on creating resources and workflows for graduate student development.

Our grads have gone on to careers at the majority of top tech companies in Canada and the US (Meta, NVIDIA, Layer 6). They have gone on to pursue further studies at U of T and Stanford. I’ve advised students who have gone on to found Tendermint and Clarifai. So we are getting a lot of interest, particularly from abroad. I’ve sat in Next AI interviews where Ajay Agrawal has straight up asked candidates, “So what makes you outstanding?” I am not as direct as that, but I expect to see that come out in an application.

So you need to convince the evaluator of your application that you stand out among all the other candidates who are passionate about AI/ML. But you need to do it in a way that’s sincere. Sometimes candidates don’t distinguish between conference papers and workshop papers; for example, they write a bibliography entry so that it appears they published a paper at the CVPR main conference when it was actually at one of the associated workshops. Don’t do that.

I think it’s amazing that someone’s publishing at topical workshops during undergrad or a Master’s. Be sincere and don’t hide the details in a bibliographic entry so that it seems more prestigious. Another example: I was interviewing an intern candidate recently who had listed “completed” and “in progress” courses on their CV. It was very clear: in-progress courses were written as IPR. But I noticed there were more than 10 courses listed as IPR and many of them were graduate courses. This was an undergraduate. 

So I asked them: how could you be taking so many courses right now? Can’t you only take 5-6 courses at once? They eventually said that some of those courses were “planned” and not “in progress”. Again, that’s an example of not demonstrating integrity. And in academics, integrity is so important. How can I trust someone to always do the right thing in science when I can’t trust what’s in their CV?

Watch the full session:

https://youtu.be/T4zpTc0DYfI

Click to view Graham’s GitHub or website for more information on his lab.
