Scientific literature is an important source of data for research, but reviewing and curating data from the literature is both tedious and time-consuming. Investigators must sort through many articles to find and extract relevant information. In many areas of research, seminal articles may be read and re-read by numerous investigators, each extracting similar data from the same articles. One solution is community-focused curation, which provides reference databases of scientific data extracted from journal articles. Though this lessens the burden on individual investigators, such efforts still require significant manual effort and risk falling behind the rate at which new publications appear. We are investigating natural language processing and machine learning techniques to accelerate the curation of materials-related data that appear in the scientific literature.
Our prior work on semi-automated methods for materials data curation has benefited from collaboration with the NIST Thermodynamics Research Center (TRC) in Boulder and has resulted in machine learning techniques that will speed the selection of relevant journal articles as part of the TRC’s curation process. Working with the TRC, we developed classifiers that identify scientific journal articles of interest to the TRC data curation process. Based on our results, we concluded that further improvements could only come from including domain information in the classification process. We have worked with our TRC colleagues to identify the domain information most likely to improve classification. This work has led to an article submitted to Integrating Materials and Manufacturing Innovation (IMMI).
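To make the idea of an article classifier concrete, the following is a minimal sketch of a bag-of-words naive Bayes classifier that labels article abstracts as relevant or not relevant. It is an illustration only and is not the classifier developed with the TRC; the class name `NaiveBayesArticleClassifier`, the labels, and the toy abstracts are all hypothetical, and a real system would be trained on a much larger curated corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesArticleClassifier:
    """Hypothetical multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocabulary = set()      # all tokens seen during training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for illustration, not from the TRC corpus).
train_texts = [
    "measured density and viscosity of binary liquid mixtures at several temperatures",
    "experimental vapor pressure and heat capacity data for ionic liquids",
    "a review of funding policy for university research programs",
    "molecular simulation software release notes and licensing terms",
]
train_labels = ["relevant", "relevant", "not_relevant", "not_relevant"]

classifier = NaiveBayesArticleClassifier().fit(train_texts, train_labels)
prediction = classifier.predict("new experimental viscosity data for liquid mixtures")
print(prediction)
```

A classifier of this kind relies only on word frequencies. Domain information of the sort discussed above would enter as additional features, for example whether an abstract names a measured property or a compound of interest.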
We are currently working to accelerate the metadata curation process for the NIST Materials Resource Registry (NMRR) by investigating methods to extract data from the scientific literature and the Web.
Peskin, Adele P.; Dima, Alden A. Classification of Journal Articles in a Search for New Experimental Thermophysical Property Data: A Case Study. IMMI, submitted.