The Configurable Data Curation System, developed at NIST and also known as the CDCS or Curator, provides a means for capturing, sharing, and transforming unstructured data into a structured format based on the Extensible Markup Language (XML). The CDCS can be viewed as a "loading dock" for scientific data: it enables the collection and dissemination of structured scientific data, can be applied to any domain, and is agnostic to the type of data. “Curated” data is amenable to transformation into other formats, such as those used by existing computational tools. Data are organized using community-developed templates, selected by the user and encoded in XML Schema. These templates are used to create data documents, which are saved in a non-relational (NoSQL) document database.
The CDCS is currently in use by the Materials Genome Initiative (MGI). Within the MGI, collections of data are often incompatible and represented in diverse formats, which is a challenge to the distributed research goal envisaged by the MGI. The Materials Data Curation System (MDCS) allows materials data to be curated into a repository using predefined templates. Because the MDCS's underlying XML format can be transformed into virtually any other format using standard tools, the MDCS can serve as a data source for a wide variety of existing materials informatics efforts spanning projects, groups, and organizations. Each project, group, or organization can run as many MDCS instances as needed. Individual MDCS repositories can be interconnected for federated searches and data sharing.
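As a rough illustration of transforming curated XML into other formats with standard tools, the sketch below uses only the Python standard library to flatten one XML record into JSON and CSV. The record layout, the element names (`material`, `measurement`, `property`, `value`), and the values are made up for this example and are not taken from any MDCS template.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical curated record. The element names and values are
# illustrative only and do not come from an actual MDCS template.
RECORD_XML = """\
<material>
  <name>Example alloy</name>
  <measurement>
    <property>density</property>
    <value unit="g/cm3">2.5</value>
  </measurement>
  <measurement>
    <property>hardness</property>
    <value unit="HV">100</value>
  </measurement>
</material>
"""

FIELDS = ["material", "property", "value", "unit"]


def record_to_rows(xml_text):
    """Flatten one XML record into a list of row dictionaries."""
    root = ET.fromstring(xml_text)
    name = root.findtext("name")
    rows = []
    for measurement in root.findall("measurement"):
        value = measurement.find("value")
        rows.append({
            "material": name,
            "property": measurement.findtext("property"),
            "value": float(value.text),
            "unit": value.get("unit"),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = record_to_rows(RECORD_XML)
print(json.dumps(rows, indent=2))  # JSON view of the record
print(rows_to_csv(rows))           # CSV view of the same record
```

The same flattening step could feed a spreadsheet, a plotting script, or another informatics tool. XSLT is another common option for XML-to-XML or XML-to-text conversion.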
The CDCS is implemented in Python, using the Django framework and MongoDB. It uses XML because XML is a robust, proven standard written as plain text that can be shared and converted into other formats easily. The CDCS provides a Representational State Transfer (REST) API that allows other software to interact with it directly over a network. CDCS functions are available via the API, allowing for full automation.
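To give a sense of what automation through a REST API might look like, here is a minimal client sketch using only the Python standard library. Everything specific in it is an assumption made for illustration: the base URL, the `/rest/data/` path, the JSON field names, and the use of HTTP Basic authentication are placeholders, not documented CDCS values. Consult the API documentation of the actual instance for the real endpoints and authentication scheme.

```python
import base64
import json
import urllib.request

# Placeholders, not documented CDCS values: adjust to the real instance.
BASE_URL = "http://localhost:8000"
DATA_PATH = "/rest/data/"  # hypothetical endpoint path


def build_upload_request(title, template_id, xml_content, username, password):
    """Build (but do not send) a POST request carrying one XML document.

    The JSON field names and Basic authentication are assumptions.
    """
    payload = json.dumps({
        "title": title,
        "template": template_id,
        "xml_content": xml_content,
    }).encode("utf-8")
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        BASE_URL + DATA_PATH,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def send(request, timeout=10.0):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires a reachable server, so it is left commented out):
# request = build_upload_request("sample", "template-id", "<root/>", "user", "secret")
# result = send(request)
```

Separating request construction from sending keeps the sketch testable without a running server, and makes it straightforward to loop over many documents in a batch upload script.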