Progress in AI relies on researchers’ ability to compare their models’ performance through open, shared benchmarks. Last year, Facebook AI built and released Dynabench, a first-of-its-kind platform that radically rethinks benchmarking in AI, starting with natural language processing (NLP) models. Rather than using static tests, as has been standard in the field, Dynabench uses both humans and NLP models together “in the loop” to create steadily evolving benchmarks that won’t saturate, may be less prone to bias and artifacts, and allow AI researchers to measure performance in ways that are closer to real-world applications.
They’re now announcing a major new capability called Dynaboard, an evaluation-as-a-service platform for conducting comprehensive, standardized evaluations of NLP models. Dynaboard makes it possible for the first time to perform apples-to-apples comparisons dynamically without common issues from bugs in evaluation code, inconsistencies in filtering test data, backwards compatibility, accessibility, and many other reproducibility issues that plague the AI field today.
To push the industry further toward more rigorous, real-world evaluation of NLP model quality, Dynaboard enables AI researchers to customize a new “Dynascore” metric based on multiple axes of evaluation, including accuracy, compute, memory, robustness, and fairness.
Importantly, there is no single correct way to rank models in AI research. Dynascore addresses this challenge by allowing AI researchers to dynamically adjust the default score by placing more or less weight on particular metrics to evaluate performance in a nuanced, comprehensive way. This capability is a key component of Dynascore, because every person who uses leaderboards has a different set of preferences and goals. AI researchers need to be able to make informed decisions about the tradeoffs of using a particular model. Even a 10x more accurate NLP model may be useless to an embedded systems engineer if it’s untenably large and slow, for example. Likewise, a very fast, accurate model shouldn’t be considered high-performing if it doesn’t work well for everyone.
Their journey is just getting started. Since launching Dynabench, they’ve collected over 400,000 examples, and they’ve released two new, challenging data sets. Now, they have adversarial benchmarks for all four of their initial official tasks within Dynabench, which initially focus on language understanding. They believe that, as the AI community continues to build on their open platform, the field will iteratively and rigorously improve the way researchers evaluate models, create data sets, and evolve toward better benchmarks that, ultimately, narrow the gap between AI models used in research versus practice.
Spurring progress with evaluation-as-a-service
Dynaboard requires minimal overhead for model creators who want to submit their NLP model for evaluation, while offering maximum flexibility for users who want to make fine-grained comparisons between models. Although other platforms have addressed subsets of current issues, like reproducibility, accessibility, and compatibility (forward and backward), Dynaboard addresses all of these issues in one single end-to-end solution.
Dynaboard: Under the hood
While they might add metrics in the future, the following NLP model evaluation metrics are currently supported in the overall “Dynascore” ranking function.
- Accuracy. The standard AI evaluation metric is some form of accuracy, e.g., the percentage of examples that the model gets right. On Dynaboard, the exact accuracy metric is task-dependent. Tasks, which are owned by AI community members, can have multiple accuracy metrics, but only one metric (decided by the task owners) is used as part of the ranking function.
- Compute. Measuring the computational efficiency of a model is important for several reasons. First, a highly accurate model that takes hours to label a single example is useless in any real-world scenario. Second, it’s important that the AI field focus on “Green AI” to minimize the negative impact on the environment. Evaluating compute resources lets the model ranking account for a model’s energy and carbon footprint. To account for computation, they measure the number of examples that a model can process per second on its instance in the evaluation cloud.
- Memory. Tracking memory usage helps AI researchers recognize that even a highly accurate model may not be useful in practice, and can be hard to reproduce, if it is untenably large. They measure the memory that a model requires in gigabytes, averaging usage over the duration that the model is running, with measurements taken every N seconds.
- Robustness. Initially, they focus mostly on typographical errors and local paraphrases. To be considered robust under challenging conditions, an NLP model should be able to capture that a “baaaad restuarant” is not a good restaurant, for instance. They evaluate the robustness of a model’s predictions by measuring how they change after such perturbations are added to the examples. They fully expect this metric to evolve over time as well, and to give a better sense of the many different aspects involved in measuring model robustness.
- Fairness. The AI community is in the early days of understanding the challenges of fairness and potential algorithmic bias. There is no single, widely agreed-upon definition of fairness, let alone a measure of model fairness. At the launch of Dynaboard, they’re starting off with an initial metric relevant to NLP tasks that they hope serves as a starting point for collaboration with the broader AI community. Similar to the measurement of robustness, as a first version, they perturb the original datasets by changing, for instance, noun phrase gender (e.g., replacing “sister” with “brother”, or “he” with “they”) and by substituting names with others that are statistically predictive of another race or ethnicity. For the purposes of Dynaboard scoring, a model is considered more “fair” if its predictions don’t change after such a perturbation. Although approaches like theirs that replace words have become a common method in NLP for measuring fairness, this metric is far from perfect. For example, heuristically replacing “his” with either “hers” or “her” makes sense given English grammar but sometimes results in mistakes (e.g., if “his” is always replaced with “her”, the sentence “this cat is his” becomes “this cat is her,” which doesn’t maintain the meaning).
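The robustness and fairness bullets above share one mechanism: perturb each example and check whether the prediction changes. A minimal Python sketch of that idea follows. The swap table, the function names, and the deliberately biased `toy_model` are all hypothetical stand-ins for illustration, not Dynaboard's actual perturbation code.

```python
import re

# Hypothetical swap table, for illustration only; a real one would be far larger.
GENDER_SWAPS = {"sister": "brother", "brother": "sister", "he": "they"}


def swap_terms(text, swaps):
    """Replace whole words found in `swaps` in a single pass.

    Capitalization of the replaced word is not preserved in this sketch.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in swaps) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: swaps[match.group(0).lower()], text)


def prediction_stability(model, examples, perturb):
    """Percentage of examples whose prediction is unchanged after perturbation."""
    unchanged = sum(model(text) == model(perturb(text)) for text in examples)
    return 100.0 * unchanged / len(examples)


def toy_model(text):
    """A deliberately biased stand-in classifier, not a real NLP model."""
    lowered = text.lower()
    if "bad" in lowered or "cold" in lowered:
        return "negative"
    return "positive" if "sister" in lowered else "neutral"


examples = [
    "My sister loved the food",
    "He said the service was bad",
    "The soup was cold",
]
score = prediction_stability(
    toy_model, examples, lambda text: swap_terms(text, GENDER_SWAPS)
)
print(f"stability: {score:.1f}%")
```

Here the toy model changes its answer on the first example once “sister” becomes “brother”, so it is stable on two of the three examples. The same `prediction_stability` scaffold would work with a typo-injecting `perturb` function for the robustness case.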
Because the initial metric leaves room for improvement, they hope that the AI community will build on Dynaboard’s highly accessible, reproducible platform and make progress on devising better metrics for specific contexts for evaluating relevant dimensions of fairness in the future.
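The compute and memory measurements in the list above reduce to simple arithmetic: examples processed per second, and the mean of periodic memory readings. The sketch below shows that arithmetic only. The function names are hypothetical, the memory readings are assumed to arrive as byte counts from some external sampler, and a gigabyte is assumed to be 10^9 bytes.

```python
import time
from statistics import mean


def examples_per_second(model, examples):
    """Time one pass over the examples and return the throughput."""
    start = time.perf_counter()
    for example in examples:
        model(example)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    return len(examples) / elapsed if elapsed > 0 else float("inf")


def mean_memory_gb(samples_bytes):
    """Average periodic memory readings (in bytes) and report gigabytes.

    Assumes the readings were taken at a fixed interval while the model ran.
    """
    return mean(samples_bytes) / 1e9


# Usage with a trivial stand-in "model" and made-up memory readings.
throughput = examples_per_second(str.upper, ["good food", "bad service"])
memory = mean_memory_gb([2e9, 4e9])
print(f"throughput: {throughput:.0f} examples/s, memory: {memory:.1f} GB")
```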
Calculating the Dynascore
Of course, many of these properties are likely to involve trade-offs. For instance, accuracy is often anticorrelated with speed, because a certain amount of compute must be expended to achieve a given level of accuracy. It’s important to combine multiple metrics into a single score to rank models, rather than rely on traditional disparate, static metrics, because a static leaderboard’s ranking cannot approximate researchers’ preferences. Those preferences often require weighing costs such as compute and memory, and such metrics are rarely even reported.
So, how do they combine disparate metrics into a single score that can be used to rank NLP models? More importantly, how can they allow Dynaboard users to customize the scoring function to better approximate their own utility function?
Their approach borrows from microeconomic theory to find the “exchange rate” between metrics, which is used to standardize units across metrics; a weighted average is then taken to calculate the “Dynascore.” As the user adjusts the weights to better approximate their utility function, the models are dynamically re-ranked in real time. To compute the rate at which these trade-offs are made, they use the marginal rate of substitution (MRS), which in economics is the amount of one good that a consumer is willing to give up for another good while keeping the same utility. To calculate the default Dynascore, whose weights are specifiable by task owners, they estimate the average rate at which users are willing to trade off each metric for a one-point gain in performance (i.e., the MRS with respect to performance) and use that rate to convert all metrics into units of performance.
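To make the conversion-then-average idea concrete, here is a minimal Python sketch. It rests on several assumptions that are not from the source: the leaderboard numbers are invented, the exchange rate is estimated as the average absolute change in a metric per one-point change in accuracy between models adjacent in the accuracy ordering, and metrics where lower is better (memory here) are subtracted. The paper's actual estimation procedure may differ.

```python
from statistics import mean

# Hypothetical leaderboard rows; every number is made up for illustration.
MODELS = {
    "small": {"accuracy": 60.0, "throughput": 40.0, "memory": 2.0},
    "medium": {"accuracy": 80.0, "throughput": 20.0, "memory": 4.0},
    "large": {"accuracy": 90.0, "throughput": 10.0, "memory": 8.0},
}
LOWER_IS_BETTER = {"memory"}  # assumption: these metrics count against a model


def estimate_rates(models, performance="accuracy"):
    """Average |change in metric| per one-point change in performance,
    taken over models that are adjacent when sorted by performance."""
    ranked = sorted(models.values(), key=lambda row: row[performance])
    pairs = list(zip(ranked, ranked[1:]))
    rates = {}
    for name in ranked[0]:
        rates[name] = mean(
            [
                abs(hi[name] - lo[name]) / abs(hi[performance] - lo[performance])
                for lo, hi in pairs
                if hi[performance] != lo[performance]
            ]
        )
    return rates


def dynascore(row, rates, weights):
    """Convert each metric into performance units, then take a weighted average.

    A metric that is constant across models would have a zero rate;
    this sketch does not handle that case.
    """
    total_weight = sum(weights.values())
    score = 0.0
    for name, weight in weights.items():
        converted = row[name] / rates[name]
        if name in LOWER_IS_BETTER:
            converted = -converted
        score += weight * converted
    return score / total_weight


def rank(models, weights):
    """Return model names ordered from highest to lowest score."""
    rates = estimate_rates(models)
    scores = {name: dynascore(row, rates, weights) for name, row in models.items()}
    return sorted(scores, key=scores.get, reverse=True)


print(rank(MODELS, {"accuracy": 0.5, "throughput": 0.25, "memory": 0.25}))
print(rank(MODELS, {"accuracy": 1.0, "throughput": 0.0, "memory": 0.0}))
print(rank(MODELS, {"accuracy": 0.2, "throughput": 0.6, "memory": 0.2}))
```

With these invented numbers, the balanced weights put the mid-sized model first, accuracy-only weights put the largest model first, and throughput-heavy weights put the smallest model first, which is the kind of re-ranking the adjustable weights are meant to expose.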
They chose this approach for two reasons. First, it’s a principled approach to user personalization that builds on the utility-based critique of static leaderboards. Second, typical normalization methods used in ML (e.g., z-scores) do not work well in practice when a metric’s values are highly skewed, as they often are. For more details on the default Dynascore, see page 7 of the paper.
State of the art AI on Dynaboard
As their first experiment, they use Dynaboard to rank current state-of-the-art NLP models, such as BERT, RoBERTa, ALBERT, T5, and DeBERTa, on the four core Dynabench tasks. This set of models roughly encompasses the top 5 models on both GLUE and SuperGLUE, the popular leaderboards in NLP, and they compare these models against clearly defined baselines. When metrics are aggregated to rank models, the canonical performance metric is given half the weight by default, and the other half is split among the other metrics. Also by default, Dynascore weights all scoring datasets equally for every task. After computing the Dynascore for each model using the default weights, they find that the SuperGLUE ranking is roughly preserved. Even when they factor in the additional axes of evaluation, DeBERTa, currently the highest-ranked open source model, still performs best.
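The default weighting described above can be written down in a few lines. This is a sketch under stated assumptions: the function names and dataset labels are hypothetical, and the remaining half of the weight is assumed to be split equally among the non-performance metrics.

```python
from statistics import mean


def default_weights(metrics, performance_metric):
    """Give the performance metric half the weight; split the rest equally."""
    others = [name for name in metrics if name != performance_metric]
    if not others:
        return {performance_metric: 1.0}
    weights = {performance_metric: 0.5}
    weights.update({name: 0.5 / len(others) for name in others})
    return weights


def average_over_datasets(scores_by_dataset):
    """Weight every scoring dataset equally by taking a plain mean."""
    return mean(scores_by_dataset.values())


weights = default_weights(
    ["accuracy", "compute", "memory", "robustness", "fairness"], "accuracy"
)
print(weights)
# Hypothetical per-dataset scores for one model.
print(average_over_datasets({"dataset_a": 40.0, "dataset_b": 50.0}))
```

With the five metrics listed in this post, accuracy receives 0.5 and each of the other four receives 0.125.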
This is a hopeful sign, because recent, more accurate NLP models also appear to perform well on the rudimentary fairness and robustness metrics. At the same time, models have been getting more compute-intensive and require more memory, pointing to possible low-hanging fruit in the “Green AI” space, something that wasn’t measured by previous leaderboards. As models keep improving in terms of accuracy, and as they keep collecting harder and harder dynamic adversarial datasets, they expect the other axes of evaluation to become more and more important.
Try it yourself. You can interact directly with state of the art models to see where they fall across their four core tasks: