Can we know where, when and why our AI Earth models fail?
Foundation models are becoming astonishingly capable. They summarise papers, write code, classify imagery, and help us make sense of a world that is changing faster than ever. But they also have a familiar flaw: they can sound confident when they’re completely wrong.
In disaster response, climate monitoring, and environmental science, that kind of misplaced certainty isn’t just inconvenient. It’s dangerous.
This summer at the Earth Systems Lab, researchers explored a simple but powerful idea: what if foundation models could acknowledge uncertainty instead of bluffing through it?
Full blog below:
Make Foundation Models say IDK!
We’ve all stood in pure awe of what AI can do today: watching it find the perfect tone for that important email, break down a complex topic so that it finally makes sense to us, or generate images and videos of perfect worlds that are sometimes almost impossible to distinguish from reality.
And what is the reality? Wildfires, floods, droughts and storms are reshaping landscapes worldwide. With our planet changing faster than ever, what if we could use this technology to tackle the consequences of climate change?
This technology is called Foundation Models (or FMs): large artificial neural network models trained on vast amounts of data, with the promise of solving many tasks. Every day, satellites beam down petabytes of geospatial data, a living record of Earth in motion, and many research labs are already working on building these Foundation Models of the Earth from that record. So we are done here, right? Well… not quite.
Who has not stood in stunned disbelief when the same technology that wrote a brilliant essay suddenly can’t count how many r’s are in strawberry, insists that 17 is larger than 42, or confidently provides directions to a café that closed in 2018? Moments that make us shake our heads and think: What on Earth?
We all know someone who speaks with too much confidence. Now imagine that person running a disaster-response system, insisting that a red roof is actually a raging fire, or that the shadow a mountain casts over a village is a landslide burying its inhabitants.
So instead of waiting for models to become perfect (which they most likely never will), we decided to focus on something far more practical: making foundation models say, "IDK!"
At the FDL Earth Systems Lab this summer, our team built SHRUG-FM, a framework that lets foundation models pause and say, “Hmm, I don’t know, maybe you want to double-check that.”
To make this possible, we developed three simple “red-flag” signals any model can raise when it’s entering risky territory:
The Input: “I’ve never seen anything like this before!”
If the satellite image comes from a region, event or condition the model has never encountered, SHRUG-FM catches that unfamiliarity.
The Embedding: “I don’t really understand what I’m looking at…”
Even if the pixels look normal to us, the model’s internal representation might be confused. SHRUG-FM checks whether the model’s “mental map” is drifting into the weird unknown.
The Task: “I can’t answer this with the information I have.”
Sometimes the model understands the image but still can’t perform the downstream task (like detecting fire damage). SHRUG-FM spots when its confidence collapses.
Together, these signals give responders and scientists something AI rarely offers: humble honesty.
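To make the three checks a little more concrete, here is a minimal sketch of how such signals could be computed. It is purely illustrative and not the SHRUG-FM code: the function names, distance measures and thresholds are assumptions chosen for readability.

```python
# Illustrative sketch of three "red-flag" signals, using plain NumPy.
# Not the SHRUG-FM implementation: thresholds and distances are assumptions.
import numpy as np

def input_flag(image_stats, train_mean, train_std, z_max=4.0):
    """'I've never seen anything like this before!': per-band statistics far outside the training range."""
    z = np.abs((image_stats - train_mean) / (train_std + 1e-8))
    return bool(np.any(z > z_max))

def embedding_flag(embedding, train_embeddings, k=10, dist_max=2.5):
    """'I don't really understand what I'm looking at': embedding far from its k nearest training embeddings."""
    dists = np.linalg.norm(train_embeddings - embedding, axis=1)
    return float(np.sort(dists)[:k].mean()) > dist_max

def task_flag(ensemble_probs, var_max=0.05):
    """'I can't answer this with the information I have': the downstream ensemble disagrees strongly."""
    return float(ensemble_probs.var(axis=0).mean()) > var_max
```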
Now that we have these signals, it’s time to put them to work, and that isn’t limited to a person manually examining each prediction.
If the input is unfamiliar, developers can expand the dataset and ensure that they include data from the unknown region or condition. If the internal representation is confused, the model or training procedure needs improvement. And if the task prediction breaks down, we may need better sensors, additional imagery or new annotations.
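As a toy illustration of that triage (an assumption-laden sketch, not part of SHRUG-FM itself), the three flags can be mapped straight to the recommended actions:

```python
# Hypothetical helper mapping the three raised flags to developer actions.
def triage(input_unfamiliar, embedding_confused, task_uncertain):
    """Turn red-flag signals into the follow-up actions described above."""
    actions = []
    if input_unfamiliar:
        actions.append("Expand the training set with data from the unfamiliar region or condition.")
    if embedding_confused:
        actions.append("Improve the model or its training procedure: the representation is confused.")
    if task_uncertain:
        actions.append("Consider better sensors, additional imagery or new annotations.")
    return actions or ["No flags raised: the prediction looks reliable."]

print(triage(input_unfamiliar=True, embedding_confused=False, task_uncertain=True))
```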
This is not a trivial task, and SHRUG-FM is just one step on a long journey. But preliminary results already show the effectiveness of the framework, and we believe that, together, we can build foundation models that can be deployed in the real world.
We want to trust our models when lives are at stake, so let’s all work together to make foundation models say, "I don’t know!"
#MakeFMsayIDK
- Kai-Hendrik Cohrs, FDL Earth Systems Lab researcher
Over the summer, the team have:
Developed SHRUG-FM, a novel framework that enables geospatial foundation models to quantify and flag their own uncertainty.
Used data-centric methods to correlate out-of-distribution signals in the training data and embeddings with poor model performance under specific environmental conditions, effectively identifying gaps and biases.
Shown that ensemble-based predictive variance can identify and filter out unreliable predictions, demonstrably improving the overall trustworthiness of the foundation model's outputs.
Integrated these signals into a reliability-aware prediction mechanism that can confidently predict, raise a warning, or "shrug" (abstain) when uncertainty is high; a sketch of such a decision rule follows this list.
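As an illustration of that predict / warn / shrug behaviour, here is a minimal sketch of a reliability-aware decision rule based on ensemble predictive variance. It is a hypothetical example, not the SHRUG-FM implementation: the thresholds, function name and output format are invented for readability.

```python
# Hypothetical sketch of a reliability-aware decision rule: predict, warn, or
# "shrug" (abstain) depending on how much an ensemble of task heads disagrees.
# Thresholds are illustrative assumptions, not published SHRUG-FM values.
import numpy as np

def reliability_aware_prediction(ensemble_probs, warn_var=0.02, shrug_var=0.08):
    """ensemble_probs: (n_members, n_classes) class probabilities for one input."""
    mean_probs = ensemble_probs.mean(axis=0)
    variance = float(ensemble_probs.var(axis=0).mean())  # predictive variance across members
    prediction = int(mean_probs.argmax())
    if variance > shrug_var:
        return {"action": "shrug", "prediction": None, "variance": variance}
    if variance > warn_var:
        return {"action": "warn", "prediction": prediction, "variance": variance}
    return {"action": "predict", "prediction": prediction, "variance": variance}

# Example: five ensemble members scoring one "burned / not burned" image patch.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])
print(reliability_aware_prediction(probs))  # members disagree, so the rule raises a warning
```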
These signals aren’t just helpful for users; they guide developers toward improving datasets, training pipelines, and sensor design. As we confront climate-driven disasters and complex planetary systems, trust will matter as much as capability. SHRUG-FM is a step toward the kind of AI the world can rely on.
More honest predictions. Safer deployment. Better decisions when lives and ecosystems are at stake.