ML Model Evaluations & Multimodal Learning

In the previous episode of Private AI’s ML Speaker Series, Patricia Thaine (CEO of Private AI) sat down with Dr. Aida Nematzadeh (Staff Research Scientist at DeepMind) to discuss machine learning models and multimodal learning. 

Before joining DeepMind, Dr. Nematzadeh was a postdoctoral researcher at UC Berkeley advised by Tom Griffiths and affiliated with the Computational Cognitive Science Lab and BAIR. Aida received a PhD and an MSc in Computer Science from the University of Toronto, where she was advised by Suzanne Stevenson and Afsaneh Fazly, and was a member of the Computational Linguistics group. Broadly, her research interests are in the intersection of computational linguistics, cognitive science, and machine learning. Aida’s recent work has focused on multimodal learning and on the evaluation and analysis of neural representations. During her PhD, she studied how children learn, represent, and search for semantic information through computational modeling.

If you missed this last discussion on ML model evaluation and multimodal learning, scroll down to find a recap of Patricia and Aida’s chat or watch the full session below.

Watch the full session:

PAI: First of all, in the paper that you co-authored titled “Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization”, you and your co-authors found that vision and language models trained for visual question answering don’t generalize well.

Can you expand on what a better approach would be than testing these models on data that comes from the same distribution as the training data?

Aida: Sure. One thing worth pointing out is that when we were looking at this family of models, the idea was to look at models that are pre-trained on tasks other than visual question answering. We wanted to see whether this family of pre-trained models, which typically perform very well on a range of tasks, also do well in out-of-distribution settings. So before these models are fine-tuned for visual question answering, they are trained using different losses, including one that tries to match the image and text modalities.

Then, going back to your question, there are a few things that I think we can do better. One is the idea that we implemented in this paper, which we call a transfer setting, or an out-of-distribution generalization test. We fine-tune models on one particular VQA benchmark, but we test them on the test splits of other VQA benchmarks. Visual question answering is particularly nice because we have a range of different benchmarks with different properties. So we can study how fine-tuning on one benchmark changes the performance across a range of other benchmarks.

Another interesting way to evaluate pretrained models is the zero-shot setting. If you can test them in a zero-shot way, that’s a direct test of the pretrained representations; we are not adding any confound from the fine-tuning experiment. Other than these two settings, I think it is usually nice to have a bit more control over the things we are interested in testing. When we test for visual question answering, or when we test for commonsense, we take benchmarks that have been collected and just use them. Researchers collect benchmarks with a goal in mind, and they think the benchmarks will test for certain properties. But when we take a benchmark that is designed for certain properties, it’s possible that a model does well on it without really having the capacity, or learning the specific thing, that we are interested in testing for, because models can pick up on spurious correlations. I think probing datasets, or datasets where we have a bit more control, are a nice way to evaluate models for specific properties.
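To make the transfer setting concrete, here is a minimal sketch of a cross-benchmark evaluation loop: fine-tune on each benchmark’s training split, then score on every benchmark’s test split. It is not the setup from the paper. The benchmark names, the tiny question-answer pairs, and the majority-answer “fine-tuning” stand-in are all hypothetical placeholders, chosen only so the example runs on its own.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]      # (question, answer); images are omitted in this toy sketch
Model = Callable[[str], str]   # maps a question to a predicted answer


def fine_tune_majority(train: List[Example]) -> Model:
    """Hypothetical stand-in for fine-tuning: always predict the most common training answer."""
    majority = Counter(answer for _, answer in train).most_common(1)[0][0]
    return lambda question: majority


def accuracy(model: Model, test: List[Example]) -> float:
    """Fraction of test questions answered exactly right."""
    return sum(model(q) == a for q, a in test) / len(test)


def transfer_matrix(
    benchmarks: Dict[str, Dict[str, List[Example]]],
    fine_tune: Callable[[List[Example]], Model] = fine_tune_majority,
) -> Dict[str, Dict[str, float]]:
    """Return matrix[source][target]: fine-tune on source's train split, test on target's test split."""
    matrix: Dict[str, Dict[str, float]] = {}
    for source, splits in benchmarks.items():
        model = fine_tune(splits["train"])
        matrix[source] = {
            target: accuracy(model, other["test"])
            for target, other in benchmarks.items()
        }
    return matrix


# Made-up placeholder data, not drawn from any real VQA benchmark.
toy_benchmarks = {
    "bench_a": {
        "train": [("is it red?", "yes"), ("is it big?", "yes"), ("is it old?", "no")],
        "test": [("is it new?", "yes"), ("is it wet?", "no")],
    },
    "bench_b": {
        "train": [("how many cats?", "2"), ("how many dogs?", "2")],
        "test": [("how many cups?", "2"), ("how many pens?", "3")],
    },
}

if __name__ == "__main__":
    for source, row in transfer_matrix(toy_benchmarks).items():
        print(source, row)
```

The diagonal of the matrix corresponds to the usual in-distribution evaluation, and the off-diagonal entries correspond to the transfer setting. Swapping `fine_tune_majority` for a real fine-tuning routine keeps the loop unchanged.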

PAI: So you did touch on this a bit. Is there anything else that you see on how we should test these models instead?

Aida: Another way I like to think about evaluation is being very clear about its purpose. Sometimes we are interested in visual question answering because we think it is a good proxy for an application. If you have a good visual question answering system, visually impaired or blind users can benefit from it. But the question that we typically don’t answer in our research papers is whether performance on the benchmark that we designed or collected actually correlates with performance in the real-world application. That is, if we improve on this specific benchmark, take our system, and deploy it in the real world, will the system do well in the real-world application that we were interested in? I think this is a really interesting question to think about, and it’s not easy to test, because most of the time researchers collect benchmarks and it’s hard to actually test systems in a real-world setting.

I’ve been thinking about this a bit more in the last few years in the visual question answering setting. One dataset that I found very interesting is called VizWiz. It’s a benchmark collected from visually impaired users. In this work, the authors have a setup where visually impaired users can post a question along with an image that they take. Then there’s a crowdsourcing system so that other people can answer the question given the image. If you look at the properties of this dataset and compare it to the other visual question answering datasets, it’s very different. Sometimes the image is not pointing at the right object, or it’s blurry, because of the way the users are able to take pictures. I think here we can think about what the next thing is that we can do. For example, maybe the big challenge is actually giving good enough feedback, given the image and the question, so that we can get a better image. There’s also an interesting privacy concern with this dataset that relates to your work, because a lot of the time visually impaired users want answers to questions that involve sensitive information. For example, maybe they want to know what their credit card number is and they post an image of a credit card, or they want somebody to read a document that contains some sort of personal information. Working with this dataset is particularly interesting for those reasons too.

PAI: Do you find, given these different testing environments that you propose, that some models generalize better than others? 

Aida: In the visual question answering work that we did, we compared discriminative and generative models. There is previous work showing that generative models generalize better to out-of-distribution settings because they don’t overfit as much as discriminative models. The discriminative models tend to overfit a bit more. That was the hypothesis we wanted to test in this particular setup. In our experiments, we interestingly see that generative models do better. They’re more robust in out-of-distribution generalization in most conditions. For one particular family of models that we looked at, the results were strong. For another model they were less strong, but we still see some patterns. I think one thing to study more in the future is this difference, and maybe focusing on generative models would be a better way to move forward.

PAI: In one of your papers, you discuss commonsense knowledge. First, I’d like to get to how you would define commonsense knowledge. 

Aida: Yeah, that’s a good question. It’s a hard one too. I feel like we spend a lot of time thinking about this. My very simple way of defining commonsense knowledge is that it is knowledge shared by a typically large group of people. It’s often left unsaid. It’s not explicitly stated, even in daily communication or in our text corpora. It’s typically about everyday situations, the things that people just know about everyday situations. An interesting part of it for me is also that it’s typically probabilistic. We might have some knowledge, but there’s usually a probability distribution over what we believe. Let’s say I ask you how you boil water. The most common way is that you pour water in a kettle and put the kettle on your stove, but this is true if you’re in a kitchen. If you’re in a lab environment, you’re going to boil water in a different way. So there’s some probability attached to this knowledge, and that makes it different from facts. It’s not simply true or false; it’s probabilistic.

PAI: Do you find that a model’s ability to generalize is tied to a model’s ability to learn what we think of as commonsense knowledge? 

Aida: Yeah, I’m not sure. I’m actually not even sure how to test for this. If you just think about commonsense reasoning or understanding, we can test for transfer in that domain. My intuition is that it’s going to be a harder transfer task, because the way models can use commonsense is either by having a commonsense knowledge base and retrieving from it, or by trying to extract or learn commonsense knowledge from the existing corpora that we have. If you think about it, what you’re asking in terms of generalization is, to me, whether better models are able to learn to synthesize commonsense knowledge, which can be a challenging task, and I don’t actually know if it’s directly tied to generalization. It’s hard to test for in a way. If you want your model to do well in a generalization setting, is the test of generalization the ability to synthesize new knowledge? How would you test for it? You would need manual annotations of what knowledge is produced and go from there. It’s an interesting question.

PAI: I guess one way to start might be to compare which models do better in commonsense knowledge tasks and which models generalize well and see whether they are similar. 

Aida: I guess measuring generalization is typically per task, right? So when you say models generalize well, the question is which tasks we’re going to test on. In the most recent line of work, the base models are typically language models, and we pick a language model that generalizes well in the sense that it does well on the validation or test set, if we have one; with the larger language models, this is actually also really hard to measure. So it’s not necessarily out-of-distribution generalization, it’s just a better model. And given the current commonsense benchmarks, I think we will probably see a correlation between performance on language modeling and performance on commonsense benchmarks for most of the existing benchmarks. But I don’t think that is necessarily generalization in the domain of commonsense. Does that make sense?
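The comparison discussed above, checking whether models that score well on one measure also rank highly on another, could be sketched as a rank correlation across models. The snippet below is a plain-Python Spearman correlation, written as an illustrative assumption about how such a comparison might be run; it is not taken from any of the papers discussed, and it contains no real scores. As Aida notes, a high correlation would not by itself show generalization in the commonsense domain.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all values tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: the Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items each")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)
```

In use, `xs` would hold one score per model on the first measure (for example, a language modeling metric) and `ys` the same models’ scores on a commonsense benchmark, in the same order.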

PAI: That makes sense, absolutely. You co-authored a paper called “Do Language Models Learn Commonsense Knowledge?” I was curious about what some best practices are to test for commonsense knowledge. 

Aida: Yeah, I think when we started working on that project, the dominant approach was to take existing models that are pre-trained for language modeling, either encoder-style or autoregressive, and then fine-tune them on a commonsense benchmark. When we started looking at this line of work, the interesting question for us was: if we take a very good language model, like this larger family of models, is it actually able to learn what we consider commonsense, given our existing benchmarks, from just the language modeling objective? For that question, I think zero-shot evaluation was an interesting choice, because we’re not teaching the model about commonsense at all. We’re just going to see whether the model is able to learn it.

Having said that, if the goal is to come up with a system that is really good at commonsense reasoning for a specific application, fine-tuning definitely makes sense. I think I would answer this question by adding a phrase to it: what do you want to do with the system that we’re evaluating? Identifying that will help us think about what the best practices for evaluation are. If the idea is to test a model for its ability to learn commonsense knowledge in a self-supervised or unsupervised way, then I would test it without fine-tuning. If we are interested in having systems that work very well on a specific benchmark, fine-tuning is fine.

Something that is also interesting is that some of these commonsense benchmarks are created from, or based on, a knowledge base. So it’s possible to actually use that knowledge base as a signal during training. Thinking about how general these knowledge bases are, and how well models that use each type of knowledge base perform on different commonsense benchmarks, is another interesting question.

PAI: Moving to this other paper that you co-authored, called “Probing Image-Language Transformers for Verb Understanding”, you mention that certain categories of verbs are particularly challenging for image-language transformers to learn.

Can you tell us about these categories and about what kind of datasets would help with training models to better deal with these verbs? 

Aida: Yeah, so in that paper we couldn’t really identify a particular category based on the properties that we looked at. One intuition that we have is that some verbs are known to be more visual, meaning that it’s easier to see them in an image. We had a dataset from previous work called imSitu, where the authors asked annotators to identify whether a verb is visual or not by checking whether they can find images for it. That was one specific hypothesis that we could form and test. Interestingly, across this family of models, we didn’t see a strong correlation between performance on the verbs and whether the verbs were visual or not. We also looked at a few other properties of the verbs, and again I would say there was no strong correlation.

One thing that we didn’t look into in the paper, and that I think might be important, since it’s something that I thought of afterwards and haven’t actually tested empirically, is that the difficulty of a verb might come from the types of images it appears in and how visually different those images are. If you think about a very simple verb like eat, it can happen in very diverse environments: animals eat in environments that are very different from where people typically eat, which is in a kitchen or in a restaurant. That might be one thing to check for, and I think it would be interesting to test. As for the second part of your question, what kind of datasets would help with training models to better deal with these verbs, that’s an interesting question. I think there’s a question of whether we need different datasets or whether we actually need different ways of modeling verbs. I would say we probably just need better models, and maybe we can learn from existing datasets.

Typically when we consider learning in the grounded setting with images and language, there’s a lot of work that focuses on objects, right? Objects are interesting, but they’re comparatively easy because we can model them as a classification task where we predict the object alone. Verbs are a bit more interesting because they are relations. If we want to model verbs, we need to actually learn to predict a relation. So it’s a structured prediction task, and the existing models, like the recent family of larger pretrained models that we were testing, are not specifically trying to learn relations. I think that’s an interesting direction for future work: looking at what might help these models do better on verbs.

PAI: To what extent do you think that verb comprehension can be a test of commonsense knowledge?

Aida: One thing to consider is whether verb understanding is actually commonsense. If you had asked me what types of commonsense knowledge there are, I wouldn’t call linguistic knowledge commonsense. I don’t know if I have a very good reason for that, but there are different types of knowledge that we have. I think knowledge of language is something that we learn, but I wouldn’t consider it commonsense, although it has a lot of the properties that I listed when I defined commonsense knowledge. Maybe, yeah, I don’t know why. Partly it can be probabilistic. It’s something that we use every day. It is shared by a large group of people. But it’s also something that we learn specifically and then take as knowledge that we have. I don’t think I have a good answer for this, to be honest. I have to think about it more.

PAI: I can see why that would be a difficult thing to debate. 

Aida: Yeah, I think there’s this question of what is knowledge, what is commonsense knowledge, and what is knowledge about the different fields that we learn, like chemistry, language, and even physics. There are a lot of commonsense datasets that are about physical or temporal knowledge. But maybe the way we can think about it is that commonsense knowledge is more about combining information with our everyday situations, as opposed to learning specific sets of facts or even knowledge of specific domains.

PAI: Maybe, temporally speaking, one piece of commonsense would be that if you are boiling water, you probably put the water in the kettle before you boiled it, and that a verb would be able to tell you when that water was put in the kettle.

Aida: A verb would be able to tell us that when you’re boiling water, the water is already in the kettle. Yeah, that’s actually a good example. So it’s commonsense knowledge that we have by learning the verb’s meaning, but maybe it’s something that you actually synthesize from that verb meaning and then use.

PAI: How well do you think tests for commonsense knowledge in purely language transformers transfer over to image-language transformers?

Aida: When you say tests for commonsense knowledge, are you asking whether the benchmarks that are language-only are good benchmarks? I think that’s a very interesting direction for future work too. We did some experiments where the idea was that we have all these language-only benchmarks, which people have been testing with models trained on language-only corpora. If we use images, if we ground the models in a different modality, can we do better? The assumption is that some of the things that are not said in the text might be visible in the image.

Given the current benchmarks, I think for some of them there is information that we can find in images. But it’s really hard, because a lot of these datasets are curated from specific knowledge bases that are text-only. Finding images that depict that information is not necessarily easy, particularly for some of the benchmarks that we looked at. For example, for the ones that test for social commonsense, or even the ones that test for temporal commonsense, it’s hard to find useful image datasets or useful images. My intuition was that for some of them, maybe the ones that test for physical commonsense, we can find some useful images.

PAI: Moving towards what you’re currently doing at DeepMind, is there any research, either your own or somebody else’s, that you’re most excited about at the moment?

Aida: Recently I’ve been reading a lot of papers on modeling structure. In particular, I’ve been reading a lot of papers from Yoon Kim, so I’m excited about that line of work. I’m thinking about how we can model structure in the multimodal setting. Other than that, following the line of work that I already mentioned, I’m also very interested in captioning for blind users. I think it introduces a lot of interesting problems that I hadn’t been thinking about when considering other benchmarks. One interesting challenge, for example, is that models actually need to do pragmatic reasoning. A lot of the time the language use is pragmatic. And I think that’s a very interesting problem to study.

PAI: What is it like working at DeepMind? 

Aida: Yeah, that’s a great question. If I want to give a very short answer, working at DeepMind is fun. To me, it’s really similar to working in an academic setting as an academic researcher, maybe with a different distribution of tasks. We probably have very similar tasks, but depending on where in academia you are, you might spend different amounts of time on them.

Compared to academia, at DeepMind we have a lot of focus on research. For example, I still find opportunities to teach. I enjoy teaching, but it’s not mandatory or a big component of our job. We are lucky that we also don’t have to apply for grants to do our own research, which is very helpful. So we can focus a bit more on the research part.

PAI: Is there anything you’d recommend to researchers looking at applying to DeepMind? 

Aida: One question that I get a lot when I talk to people is: what sort of research is going on at DeepMind? Now that we are a bigger company, we try to have a lot of public facing information for people who are interested. I feel like people are curious about what the opportunities are and what type of research they can do at DeepMind. My recommendation is that if you’re at a conference and you see a DeepMinder, or there’s a recruiting team at the conference, talk to people. I think that’s the best way to kind of get a sense of the different types of research that people do and the different groups at DeepMind and figure out if there is a match for your interest at the company.

PAI: What are the exciting use cases for structured data modeling? For example, do you apply them to biology and chemistry problems? 

Aida: For me, I’m mostly excited about the image-language benchmarks that I’ve been looking at. If you think about something like visual question answering, a lot of the time the structure in the language is important for the model to answer the question correctly.
