Researchers at MIT and the MIT-IBM Watson AI Lab have developed a technique to estimate the reliability of a foundation model before it is deployed on a specific task. They do this by analyzing a set of foundation models that differ slightly from one another. Their algorithm then assesses how consistent the representations are that each model learns for the same test data point. If the representations are consistent, the model is considered reliable.
Comparing their technique with state-of-the-art methods, the researchers found that their method is better at capturing the reliability of foundation models across various classification tasks.
This technique lets users decide whether a model should be applied in a specific setting without having to test it on a real-world dataset. That is especially useful when data may not be accessible because of privacy concerns, as is often the case with health data. In addition, the technique can rank models by their reliability scores, allowing users to choose the best model for their task.
“All models can make mistakes, but models that know when they are wrong are more useful. The problem of quantifying uncertainty or reliability is more challenging for these foundation models because their abstract representations are difficult to compare. Our method allows quantifying how reliable a model's representation is for any input data,” says lead author Navid Azizan, professor at MIT and a member of the Laboratory for Information and Decision Systems (LIDS).
Alongside him on the work were co-lead author Young-Jin Park, a PhD student at LIDS; Hao Wang, a research scientist at the MIT-IBM Watson AI Lab; and Shervin Ardeshir, a senior research scientist at Netflix. The work will be presented at the Conference on Uncertainty in Artificial Intelligence.
Measuring Consensus
Traditional machine learning models are trained to perform a specific task. These models typically give a concrete prediction based on the input. For example, a model might say whether a particular image contains a cat or a dog. In this case, reliability assessment can be as simple as checking the final prediction.
But foundation models are different. The model is pre-trained using general data, in an environment where its creators do not know all the tasks it will be applied to. Users adapt it to their specific tasks after it has already been trained.
To evaluate the reliability of foundation models, the researchers took an ensemble approach, training several models that share many characteristics but differ slightly from one another.
“Our idea is like measuring consensus. If all these foundation models give consistent representations for any data in our dataset, then we can say that the model is reliable,” says Park.
But they faced a problem: how to compare abstract representations?
“These models only give a vector, composed of some numbers, so we can't easily compare them,” he adds.
They solved the problem using an idea called neighborhood consistency.
For their approach, the researchers prepare a set of reliable reference points to test on the ensemble of models. Then, for each model, they examine which reference points lie near that model's representation of the test point.
By looking at the consistency of neighboring points, they can assess the model's reliability.
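The idea can be illustrated with a small Python sketch. It is not the researchers' actual scoring function: the function names, the use of cosine similarity, the choice of k, and the Jaccard overlap between neighbor sets are all assumptions made for illustration, and the vectors are made-up toy data.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_reference_ids(test_vec, reference_vecs, k):
    """Indices of the k reference points most similar to test_vec."""
    ranked = sorted(
        range(len(reference_vecs)),
        key=lambda i: cosine_similarity(test_vec, reference_vecs[i]),
        reverse=True,
    )
    return set(ranked[:k])


def neighborhood_consistency(test_embeddings, reference_embeddings, k=2):
    """Mean pairwise Jaccard overlap of neighbor sets across models.

    test_embeddings[m] is model m's vector for the test point;
    reference_embeddings[m][i] is model m's vector for reference point i.
    (Hypothetical score, chosen only to illustrate the idea.)
    """
    neighbor_sets = [
        nearest_reference_ids(test_vec, refs, k)
        for test_vec, refs in zip(test_embeddings, reference_embeddings)
    ]
    overlaps = [
        len(a & b) / len(a | b) for a, b in combinations(neighbor_sets, 2)
    ]
    return sum(overlaps) / len(overlaps)


# Toy data: four shared reference points as seen by two models.
# Model B's space is model A's space rotated by 90 degrees.
refs_a = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (-1.0, 0.0)]
refs_b = [(0.0, 1.0), (-0.1, 0.9), (-1.0, 0.0), (0.0, -1.0)]

# Both models place the test point next to reference points 0 and 1.
agreeing = neighborhood_consistency(
    [(0.95, 0.05), (-0.05, 0.95)], [refs_a, refs_b], k=2
)

# Here the second model places the test point next to points 2 and 3.
disagreeing = neighborhood_consistency(
    [(0.95, 0.05), (-0.5, 0.5)], [refs_a, refs_a], k=2
)

print(agreeing, disagreeing)
```

A score near 1 means the models agree on which reference points surround the test point; a score near 0 means they disagree, which in this sketch would flag the test point as unreliable.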
Aligning Representations
Foundation models map data points into what is known as a representation space. One way to think about this space is as a sphere. Each model maps similar data points to the same place in its sphere, so images of cats go to one place, and images of dogs to another.
But each model would map animals differently in its sphere, so while cats might be grouped near the South Pole of one sphere, another model might map cats somewhere in the Northern Hemisphere.
The researchers use the neighboring points as anchors to align these spheres so that the representations become comparable. If a data point's neighbors are consistent across multiple representations, one can be confident in the model's output for that point.
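A toy Python example shows why neighbors are a useful anchor. It assumes, purely for illustration, that two models learn the same geometry up to a rotation of a two-dimensional space; the angles, the helper names, and the 150-degree rotation are hypothetical. The raw coordinates of the test point disagree between the two models, yet its nearest reference points are the same in both.

```python
import math


def on_circle(degrees):
    """Point on the unit circle at the given angle."""
    radians = math.radians(degrees)
    return (math.cos(radians), math.sin(radians))


def rotate(vec, degrees):
    """Rotate a 2-D vector counterclockwise by the given angle."""
    x, y = vec
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return (c * x - s * y, s * x + c * y)


def k_nearest(query, points, k):
    """Indices of the k points closest to query (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return set(order[:k])


# Model A: "cat" references (indices 0-2) near the south pole,
# "dog" references (indices 3-5) near the north pole.
refs_a = [on_circle(d) for d in (260, 270, 280, 80, 90, 100)]
test_a = on_circle(268)

# Model B: the same layout, rotated by an arbitrary 150 degrees.
refs_b = [rotate(p, 150) for p in refs_a]
test_b = rotate(test_a, 150)

# Comparing raw vectors across models is misleading ...
raw_gap = math.dist(test_a, test_b)

# ... but the neighbor sets can be compared directly.
neighbors_a = k_nearest(test_a, refs_a, k=3)
neighbors_b = k_nearest(test_b, refs_b, k=3)

print(round(raw_gap, 2), neighbors_a, neighbors_b)
```

Real models do not differ by an exact rotation, so in practice the neighbor sets will only partly overlap, and the degree of overlap is what carries the reliability signal.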
When they tested this approach on a wide range of classification tasks, they found that it was much more consistent than baseline methods. It was also not thrown off by challenging test points that caused other methods to fail.
Moreover, their approach can be used to assess reliability for any input data, so it can evaluate how well the model works for a specific type of individual, such as a patient with certain characteristics.
“Even if all models have average performance, from an individual perspective, you will prefer the one that works best for that individual,” says Wang.
One limitation comes from the need to train an ensemble of foundation models, which is computationally expensive. In the future, they plan to find more efficient ways to build multiple models, possibly using small perturbations of a single model.
“With the current trend of using foundation models for their representations to support various tasks, from fine-tuning to retrieval-augmented generation, the topic of quantifying uncertainty at the representation level is becoming increasingly important but challenging, as the representations themselves lack grounding. Instead, it's about how the representations of different inputs are related to each other, an idea that this work neatly encapsulates through the proposed neighborhood consistency score,” says Marco Pavone, associate professor in the Department of Aeronautics and Astronautics at Stanford University, who was not involved in this work. “This is a promising step towards high-quality uncertainty quantification for representation models, and I am excited to see future extensions that can function without the need for model ensembling to truly enable this approach in foundation-sized models.”
This work was partially funded by the MIT-IBM Watson AI Lab, MathWorks, and Amazon.
Creation time: 17 July, 2024