
Cloud of Reproducible Records

Summary

An important part of the NIST (National Institute of Standards and Technology) MGI (Materials Genome Initiative) effort is to investigate infrastructure and tools that enable reproducible research in computational and experimental materials science. Corroborating scientific results is a cornerstone of the scientific method. Of the three forms of science and engineering, only the theoretical field can be guaranteed to be corroborative, because its mathematical foundations are sufficient for proof. The other two forms (experimental and computational) require equipment to reproduce results, so the reliability of manual steps and the uncertainty introduced by electro-mechanical components are at the heart of what makes results difficult to corroborate. This stresses the need for automated tools for documenting scientific research. A good practice in computational research is to use a dedicated simulation management tool (SMT) throughout the development process rather than creating an ad-hoc computation management scheme. Experimental research, on the other hand, requires more automation and more interoperability between research devices and control interfaces to give experiment management tools (EMT) enough leverage. Listed below are a number of requirements for an effective SMT and EMT. Capturing research results effectively is not the end goal: disseminating these records and providing collaborative schemes is just as important. We introduce CoRR (Cloud of Reproducible Records) to serve this critical purpose.

Description


CoRR is a cloud infrastructure for storing, disseminating, federating, and collaborating on reproducible record atoms. It is composed of two Python Flask REST web services (Cloud and API), a cloud storage space, a MongoDB database, and a Material Design AJAX web frontend.
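As a concrete illustration, here is a minimal sketch of what one such Flask REST service might look like, with MongoDB as the record store. The routes, field names, and port are assumptions made for illustration, not the actual CoRR API.

    # Minimal sketch of a CoRR-style record service (illustrative only).
    from bson.objectid import ObjectId
    from flask import Flask, jsonify, request
    from pymongo import MongoClient

    app = Flask(__name__)
    records = MongoClient("mongodb://localhost:27017")["corr"]["records"]

    @app.route("/api/v1/records", methods=["POST"])
    def create_record():
        # Store one reproducible-record atom (provenance metadata as JSON).
        inserted = records.insert_one(request.get_json())
        return jsonify({"id": str(inserted.inserted_id)}), 201

    @app.route("/api/v1/records/<record_id>", methods=["GET"])
    def get_record(record_id):
        record = records.find_one({"_id": ObjectId(record_id)})
        if record is None:
            return jsonify({"error": "not found"}), 404
        record["_id"] = str(record["_id"])
        return jsonify(record)

    if __name__ == "__main__":
        app.run(port=5100)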

CoRR offers two ways to access the records. A web interface lets scientists manage their projects and records and interact with others. A common API is also provided for Simulation Management Tools, Experiment Management Tools, and developers of third-party applications (evaluation, benchmarking, reproducibility assessment, machine learning frameworks, ...).
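For example, an SMT or a third-party tool might push a record over such an API roughly as follows; the base URL, authentication header, and payload fields here are hypothetical.

    # Hypothetical client-side use of a CoRR-style API by an SMT.
    import requests

    API = "https://corr.example.gov/api/v1"            # placeholder URL
    HEADERS = {"Authorization": "Bearer <user-api-token>"}

    payload = {
        "project": "phase-field-study",
        "label": "run-042",
        "reason": "vary the mobility coefficient",
        "status": "finished",
    }
    response = requests.post(f"{API}/records", json=payload, headers=HEADERS)
    response.raise_for_status()
    print("stored record:", response.json()["id"])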

The CoRR infrastructure can be fully deployed on cloud service providers such as Amazon Web Services, Google Cloud Platform, or Heroku. The CoRR source code will soon be moved to the NIST GitHub account.

More documentation can be found at http://corr.readthedocs.io

Simulation Management Tools

SMTs can be organized into three groups:

  1. Workflow based: The simulation is treated as a black box linked to other simulations through inputs and outputs; one simulation's output can be another's input, and its own output can in turn feed yet another simulation. Current implementations in this group include Taverna, Galaxy, VisTrails, ...
  2. Event based control: The simulation's interactions with the operating system (library loads, input files, output files, network communication, compilers, ...) are all accessible for the management tool to record; a toy sketch of this style of capture follows the list. The most widely used implementations are Sumatra, ReproZip, ...
  3. Library based: The simulation implements custom items from the tool's library instead of using items from standard libraries. These custom items give much more granular access to the interactions with the operating system than the event-based approach. Here we can cite Sumatra, Dexy, ...
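The following toy sketch illustrates the event-based idea under heavy simplifying assumptions: it only watches the working directory for newly created files, whereas real tools such as Sumatra and ReproZip intercept far more of the interaction with the operating system.

    # Toy event-based capture: run a command and record what it produced.
    import json, os, subprocess, sys, time

    def run_and_record(cmd, workdir="."):
        before = set(os.listdir(workdir))
        start = time.time()
        result = subprocess.run(cmd, capture_output=True, text=True, cwd=workdir)
        record = {
            "command": cmd,
            "returncode": result.returncode,
            "duration_s": round(time.time() - start, 3),
            "new_files": sorted(set(os.listdir(workdir)) - before),
            "platform": sys.platform,
        }
        # Append one provenance record per run as a JSON line.
        with open("provenance.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")
        return record

    run_and_record(["python", "simulation.py"])  # hypothetical simulation script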

One particular SMT that is currently supported is Sumatra. It is a lightweight system for recording the history and provenance data of numerical simulations. It works particularly well for scientists who are in the intermediate stage between developing a code base and using that code base for active research. This is a common scenario and often results in a mode of development that mixes branching for code development with branching for production simulations. Sumatra avoids this unintended use of the versioning system by providing a lightweight design that records provenance data independently of the version control system used for code development. This lightweight design fits well with existing ad-hoc patterns of simulation management, in contrast with more pervasive workflow tools, which can require a wholesale alteration of work patterns. Sumatra uses a straightforward Django-based data model that enables persistent data storage independent of the Sumatra installation.

CoRR will progressively support most of the SMT implementations listed above and look for more.

Experiment Management Tools

In the experimental world, various levels of automation and interoperability can be found depending on the device used by the researcher:

  1. Open API capable device: The scientist can interact with the machine and control it from software on a computer or any API-compatible device. Virtually everything can be captured. This configuration is rare, but it is becoming more common with open hardware and open-source-based devices. Developers can implement various forms of automation on top of the API.
  2. Unprotected data-dumping device: Machines in this group do not have an open API but are hooked to a computer through an appropriate medium (Ethernet, serial DB9, USB, ...). They basically dump unencrypted or proprietary data from the experiment into the computer's file system. In some cases they come with proprietary software that communicates with the machine in a closed way. Even though the communication is proprietary, there is still room for some automation and interoperability.
  3. Encrypted or protected data-dumping device: In machines like these, everything is encrypted or in a proprietary format. One can only know that data has been dumped; access to the data is through the same or another piece of software used to control the machine. In these cases there is almost no room for automation and interoperability without significant involvement from the device manufacturer.

Currently, CoRR R&D focuses on specific use cases to provide solutions and examples to scientists in materials science, bioscience engineering, etc. We are also investigating a proposal for a Minimal Machine Control Interface (MMCI) for future scientific device manufacturing, in order to ease automation and interoperability.

SMT and EMT Requirements

Automation

Ideally, the logging and recording process is entirely automated, with the researcher's only contribution being a small "commit message" that logs their thoughts, reasons, and expected outcomes for running the simulation.
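A minimal sketch of this idea, with hypothetical field names: the reason string is the only manual input, and everything else is gathered automatically (similar in spirit to Sumatra's --reason option).

    # Sketch: one hand-typed message, everything else captured automatically.
    import getpass, platform, sys

    def start_record(reason):
        return {
            "reason": reason,                      # the only manual input
            "user": getpass.getuser(),
            "host": platform.node(),
            "python": sys.version.split()[0],
        }

    record = start_record("test sensitivity to the mesh resolution")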

Integrated with Version Control

The SMT and EMT should be fully integrated with, and aware of, the common distributed version control (DVC) tools such as Git, Bazaar, and Mercurial, or other ways of versioning research investigations (simulations and experiments). The provenance data and investigation data should not be recorded by the version control system; only the SMT project data should be held in version control.
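One possible shape of this awareness, sketched for Git only (Bazaar and Mercurial would need their own equivalents); refusing to record a run against uncommitted changes is one reasonable design choice, not a CoRR requirement.

    # Sketch: tie each record to an exact, clean commit of the code base.
    import subprocess

    def code_version():
        dirty = subprocess.run(["git", "status", "--porcelain"],
                               capture_output=True, text=True).stdout.strip()
        if dirty:
            raise RuntimeError("uncommitted changes; commit before recording")
        return subprocess.run(["git", "rev-parse", "HEAD"],
                              capture_output=True, text=True).stdout.strip()

    print("recording against commit", code_version())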

REST API

The SMT and EMT clients should communicate using a REST API that is completely independent of any backend database.

Simple Local Records Storage

The SMT and EMT should have a simple local store (e.g., a dump to JSON) for when the API is unavailable.
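For instance, a client might fall back to an append-only JSON-lines file whenever the API cannot be reached, and synchronize later; the endpoint URL and file name below are illustrative.

    # Sketch: try the API first, fall back to a local JSON-lines store.
    import json
    import requests

    def store_record(record, api="https://corr.example.gov/api/v1/records"):
        try:
            requests.post(api, json=record, timeout=5).raise_for_status()
        except requests.RequestException:
            with open("pending_records.jsonl", "a") as f:
                f.write(json.dumps(record) + "\n")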

Data

Output data files should be hashed to enable effective replication as well as future regression and calibration testing with a continuous-integration-like tool.
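A sketch of the hashing step, using SHA-256 as one reasonable choice of digest:

    # Sketch: fingerprint each output file so a rerun can be verified.
    import hashlib

    def sha256sum(path, chunk=65536):
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                digest.update(block)
        return digest.hexdigest()

    hashes = {path: sha256sum(path) for path in ["output.dat"]}  # example file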

Integrate low level tests

There should be low overhead for integrating low-level regression or calibration tests with each provenance record.
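One low-overhead form such a test could take, assuming a scalar result stored with a previous record:

    # Sketch: fail loudly if a rerun drifts from a recorded reference value.
    import math

    def calibration_check(new_value, recorded_value, rel_tol=1e-6):
        if not math.isclose(new_value, recorded_value, rel_tol=rel_tol):
            raise AssertionError(
                f"calibration drift: {new_value} vs recorded {recorded_value}")

    calibration_check(new_value=0.731244, recorded_value=0.731244)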

Record Dependencies, Requirements and Specifications

All dependencies should be automatically recorded, as well as any uninstalled development repositories that the investigation depends on. This is hard to achieve across multiple language barriers, but it is one of the most important requirements.
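For a Python investigation, the installed-package side of this could be captured roughly as follows; other languages would need their own mechanisms.

    # Sketch: record every installed Python package and its exact version.
    from importlib import metadata

    def installed_packages():
        return {d.metadata["Name"]: d.version for d in metadata.distributions()}

    record = {"dependencies": installed_packages()}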

Live Inspection

The SMT and EMT should be aware of the status of live jobs and experiments and send updates via the API.
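A sketch of a heartbeat a client might send while a job is running; the endpoint and payload fields are assumptions.

    # Sketch: periodically report the status of a live job to the API.
    import time
    import requests

    def report_status(record_id, status,
                      api="https://corr.example.gov/api/v1"):
        requests.put(f"{api}/records/{record_id}/status",
                     json={"status": status, "at": time.time()}, timeout=5)

    report_status("run-042", "running")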

Parallel

The SMT and EMT (to be defined in later R&D) need to be aware of provenance data associated with parallel jobs or intermediate experiments (such as which nodes or devices are being used), as well as of the various queuing systems or experimental design and control systems.

Other Provenance Data

Every record (simulation or experiment) should have a unique ID and an associated time stamp.
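In Python, for example, both can be generated at record creation with the standard library:

    # Sketch: a globally unique ID and a UTC timestamp for every record.
    import uuid
    from datetime import datetime, timezone

    record = {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }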
