Enhancing Metadata Curation in Social Science Data Management with Generative AI

University of Michigan - Social Science Data Repository

Problem

Efficiently curating metadata with controlled terminology is a critical yet time-consuming task in social science data management. Data depositors often provide insufficient metadata, necessitating extensive enhancement by data repository staff. This process traditionally involves navigating a wide array of controlled terms, requiring substantial time, expertise, and sometimes the creation of new terms.

Audience

The AI tool was targeted at data repository staff involved in curating social science metadata. These staff are responsible for enhancing metadata provided by data depositors, ensuring accurate and comprehensive descriptions of data sets using controlled terminology.

Outcome/Impact

Caden Picard, a Graduate Assistant at the University of Michigan, introduced a model built on the U-M GPT Toolkit to address the challenges of metadata curation. The tool significantly reduced the time required for metadata curation while improving the accuracy of term matching. It rapidly analyzed text and extracted pertinent keywords from established thesauri, including the ICPSR Subject Thesaurus, the European Language Social Science Thesaurus (ELSST), and the Library of Congress Subject Headings (LCSH), and supplemented these matches with recommendations from U-M GPT, supporting precise and comprehensive metadata curation.
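
To make the term-matching idea concrete, here is a minimal sketch of how free-text descriptions might be compared against a controlled vocabulary. It is not the tool described above: the four-entry `THESAURUS` dictionary, the `suggest_terms` function, and the `cutoff` value are all hypothetical, and simple fuzzy string matching from the Python standard library stands in for the generative AI step, which is omitted here.

```python
import re
from difflib import get_close_matches

# Hypothetical sample of controlled terms mapped to a source thesaurus.
# Real thesauri contain thousands of terms; these entries are illustrative only.
THESAURUS = {
    "unemployment": "ICPSR",
    "voting behavior": "ICPSR",
    "household income": "ELSST",
    "public opinion": "LCSH",
}


def suggest_terms(description, thesaurus=THESAURUS, cutoff=0.85):
    """Return (term, source) pairs for controlled terms that closely match
    a word or two-word phrase in the description."""
    words = re.findall(r"[a-z]+", description.lower())
    # Candidate phrases: every single word and every adjacent word pair.
    candidates = set(words) | {" ".join(pair) for pair in zip(words, words[1:])}
    matched = set()
    for candidate in candidates:
        # Fuzzy matching tolerates plurals and spelling variants.
        matched.update(get_close_matches(candidate, list(thesaurus), n=1, cutoff=cutoff))
    return sorted((term, thesaurus[term]) for term in matched)


if __name__ == "__main__":
    text = "Survey of household incomes and voting behaviour in 2020."
    for term, source in suggest_terms(text):
        print(f"{term} ({source})")
```

In this sketch the fuzzy cutoff lets "household incomes" and "voting behaviour" map to their controlled forms despite the plural and the British spelling. A production tool would also need to handle synonyms and broader or narrower terms, which is where recommendations from a language model could plausibly help.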

The implementation of this AI-driven approach expedited the curation process, allowing data repository staff to focus on more complex tasks and improving overall efficiency. The tool's ability to provide accurate and relevant term matches enhanced the quality of metadata, ensuring higher precision and recall in the results. This innovation demonstrated the potential of Generative AI to streamline metadata curation processes, ultimately benefiting the management and accessibility of social science data.