In today’s cloud computing world, considerable effort goes into bringing algorithms to the data efficiently. While this simplifies data ingest for specialized algorithms, it also makes it harder to prevent exfiltration (or data leakage) of the information the data contains, which is a particular concern when the data includes Personally Identifiable Information (PII) or other sensitive information. In many scenarios, such as medical modeling or computer vision technologies, this algorithm-to-data paradigm is avoided altogether due to the potential sensitivity of the data.
Many applications require a model that simplifies the deployment of specialized software with access to the network and data models within a custom research cloud. As algorithms run on these complex clouds, their resource usage can be aggregated so that similar approaches can be compared. Benchmarking computation on protected data in this way gives a better estimate of the actual performance of each approach. This can make the software’s performance easier to understand and help estimate how it will scale.
To address these concerns, staff from the Multimodal Information Group of ITL’s Information Access Division released two novel research software packages, Conducere and DMoni, on NIST’s GitHub. The packages aim to: facilitate computational cluster configuration and deployment; benchmark distributed applications; and monitor system performance.
Conducere [https://github.com/usnistgov/Conducere] is research software that defines an automated workflow for configuring large, ready-to-use OpenStack [https://www.openstack.org/] clusters using open-source tools and predefined templates. It has been developed by the Group’s Evaluation Management System team based on its human language technology evaluation experience and data science projects.
DMoni [https://github.com/usnistgov/DMoni] is research software that benchmarks distributed applications and monitors their performance. In a cluster, a distributed application runs on many nodes or virtual machines. DMoni measures the execution time of launched distributed applications and monitors the resource usage (CPU, memory, disk I/O and network I/O) of all associated processes. The performance data DMoni collects can be used for more complete performance analysis.
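To make the kind of measurement described above concrete, the following is a minimal single-node sketch in Python. It is not DMoni's code or API: the names `RunStats` and `run_and_measure` are hypothetical, and it uses only the Python standard library on a Unix-like system. It launches one process, measures wall-clock execution time, and reads the CPU time, peak memory and block I/O counts that the operating system recorded for that process. Network I/O and multi-node aggregation are outside what this sketch covers.

```python
import os
import sys
import time
from dataclasses import dataclass


@dataclass
class RunStats:
    """Timing and resource usage for one finished child process (hypothetical type)."""
    exit_code: int
    wall_seconds: float
    cpu_user_seconds: float
    cpu_system_seconds: float
    max_rss: int       # peak resident set size: kilobytes on Linux, bytes on macOS
    block_reads: int   # block input operations reported by the OS
    block_writes: int  # block output operations reported by the OS


def run_and_measure(argv):
    """Launch argv, wait for it to exit, and return what the OS recorded (Unix only)."""
    start = time.monotonic()
    # Start the child without waiting; spawnvp returns its process ID.
    pid = os.spawnvp(os.P_NOWAIT, argv[0], argv)
    # wait4 reaps the child and returns its resource usage in the same call.
    _, status, usage = os.wait4(pid, 0)
    wall = time.monotonic() - start
    # A child killed by a signal has no exit code; report -1 in that case.
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return RunStats(
        exit_code=exit_code,
        wall_seconds=wall,
        cpu_user_seconds=usage.ru_utime,
        cpu_system_seconds=usage.ru_stime,
        max_rss=usage.ru_maxrss,
        block_reads=usage.ru_inblock,
        block_writes=usage.ru_oublock,
    )


# Example: measure a short Python computation run as a child process.
stats = run_and_measure([sys.executable, "-c", "sum(range(1_000_000))"])
print(stats)
```

A monitor for a distributed application would need to repeat this kind of collection for every associated process on every node and gather the results centrally; the sketch only illustrates the per-process measurement.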