Speaker: Dr. Shuran Zheng, Tsinghua University
Time: 14:00, March 5, 2024 (GMT+8)
Venue: Room 204, Courtyard No.5, Jingyuan
Abstract:
In the age of artificial intelligence (AI), data serves as the lifeblood that fuels innovation and development. A common way to evaluate a dataset in machine learning (ML) is to train a model on the dataset and assess the model's performance on a test set. However, this approach has two issues: (1) it may incentivize undesirable data manipulation in data marketplaces, as self-interested data providers seek to modify their datasets to maximize their evaluation scores; (2) it may select datasets that overfit to potentially small test sets. We propose a new data valuation method with a provable guarantee: data providers always maximize their expected score by truthfully reporting their observed data. No manipulation of the data, including but not limited to duplicating data, adding random data, removing data, or re-weighting data from different groups, can increase their expected score. Our valuation score measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset. We show that this score has useful information-theoretic properties and show how to estimate it efficiently in certain Bayesian settings. Finally, we use simulations to test the effectiveness of our data valuation method in identifying the top datasets among multiple data providers. Our method consistently outperforms the standard approach of selecting datasets based on a trained model's test performance, suggesting that our evaluation score, in addition to disincentivizing data manipulation, is also more robust to overfitting.
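For intuition, below is a minimal sketch (not the speaker's implementation) of a PMI valuation score in a conjugate Beta-Bernoulli setting, one of the Bayesian cases where the score has a closed form: PMI(D_test; D_eval) = log P(D_test | D_eval) - log P(D_test). The function names and the uniform Beta(1, 1) prior are illustrative assumptions.

# Sketch: PMI valuation score for Bernoulli data under a Beta prior.
# PMI(D_test; D_eval) = log P(D_test | D_eval) - log P(D_test),
# where both terms are Beta-Bernoulli marginal likelihoods in closed form.
from math import lgamma

def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(heads: int, tails: int, a: float, b: float) -> float:
    """Log marginal likelihood of Bernoulli observations under a Beta(a, b) prior."""
    return log_beta(a + heads, b + tails) - log_beta(a, b)

def pmi_score(eval_data: list[int], test_data: list[int],
              a: float = 1.0, b: float = 1.0) -> float:
    """Pointwise mutual information between the test set and the evaluated set."""
    h_e, t_e = sum(eval_data), len(eval_data) - sum(eval_data)
    h_t, t_t = sum(test_data), len(test_data) - sum(test_data)
    # log P(D_test | D_eval): test likelihood under the posterior Beta(a+h_e, b+t_e)
    log_cond = log_marginal(h_t, t_t, a + h_e, b + t_e)
    # log P(D_test): test likelihood under the prior Beta(a, b)
    log_prior = log_marginal(h_t, t_t, a, b)
    return log_cond - log_prior

# A dataset consistent with the test data scores higher than a misleading one.
test = [1, 1, 0, 1, 1, 0, 1, 1]
print(pmi_score([1, 1, 1, 0, 1, 1], test))  # informative provider: positive score
print(pmi_score([0, 0, 0, 0, 0, 0], test))  # misleading provider: negative score

In this toy model the score rewards datasets that raise the posterior likelihood of the test data; how truthfulness under arbitrary manipulations is guaranteed in general is the subject of the talk.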
Source: Center on Frontiers of Computing Studies, PKU