To improve approaches for analyzing very large quantities of data, computer scientists at the National Institute of Standards and Technology (NIST) have released broad specifications for how to build more widely useful technical tools for the job.
Following a multiyear effort, the agency has published the final version of the NIST Big Data Interoperability Framework, a collaboration between NIST and more than 800 experts from industry, academia and government. Filling nine volumes, the framework is intended to guide developers on how to deploy software tools that can analyze data using any type of computing platform, be it a single laptop or the most powerful cloud-based environment. Just as important, it can allow analysts to move their work from one platform to another and substitute a more advanced algorithm without retooling the computing environment.
“We want to enable data scientists to do effective work using whatever platform they choose or have available, and however their operation grows or changes,” said Wo Chang, a NIST computer scientist and convener of one of the collaboration’s working groups. “This framework is a reference for how to create an ‘agnostic’ environment for tool creation. If software vendors use the framework’s guidelines when developing analytical tools, then analysts’ results can flow uninterruptedly, even as their goals change and technology advances.”
The framework fills a long-standing need among data scientists, who are asked to extract meaning from ever-larger and more varied datasets while navigating a shifting technology ecosystem. Interoperability is increasingly important as these huge amounts of data pour in from a growing number of platforms, ranging from telescopes and physics experiments to the countless tiny sensors and devices we have linked into the internet of things. While several years ago the world was generating 2.5 exabytes (billion billion bytes) of data each day, that number is predicted to reach 463 exabytes daily by 2025. (That projected daily volume is more than would fit on 212 million DVDs.)
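As a rough check on the scale of these figures, the arithmetic can be sketched in a few lines. The 4.7 GB single-layer DVD capacity is an assumption added here for illustration, not a number from the article or the framework, and decimal units (1 exabyte = 10^18 bytes) are assumed throughout.

```python
# Back-of-the-envelope arithmetic for the daily data volumes quoted above.
# Assumptions (not from the article): decimal units, 4.7 GB single-layer DVD.

EXABYTE = 10**18          # "billion billion" bytes
DVD_BYTES = 47 * 10**8    # 4.7 GB per single-layer DVD (assumed capacity)

earlier_daily = 25 * EXABYTE // 10   # 2.5 exabytes per day
projected_daily = 463 * EXABYTE      # 463 exabytes per day (2025 projection)

# How many times larger the projected daily volume is than the earlier one.
growth_factor = projected_daily / earlier_daily

# How many assumed-capacity DVDs one projected day of data would fill.
dvds_per_day = projected_daily // DVD_BYTES

print(f"Growth factor: about {growth_factor:.0f}x")
print(f"Single-layer DVDs filled per day: about {dvds_per_day:.2e}")
```

Under these assumptions, the projected volume is roughly 185 times the earlier figure and would fill tens of billions of discs per day, comfortably above the 212 million mentioned above.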
Computer specialists use the term “big data analytics” to refer to the systematic approaches that draw insights from these ultra-large datasets. With the rapid growth in available tools, data scientists now have the option of scaling up their work from a single, small desktop computing setup to a large, distributed cloud-based environment with many processor nodes. But often, this shift places enormous demands on the analyst. For example, tools may have to be rebuilt from scratch in a different programming language or around a different algorithm, costing staff time and, potentially, time-critical insights.
The framework is an effort to help address these problems. As with the draft versions of the framework NIST has released previously, the final version includes consensus definitions and taxonomies to help ensure developers are on the same page when they discuss plans for new tools. It also includes key requirements for data security and privacy protections that these tools should have. What is new in the final version is a reference architecture interface specification that will guide these tools’ actual deployment.
“The reference architecture interface specification will enable vendors to build flexible environments that any tool can operate in,” Chang said. “Before, there was no specification on how to create interoperable solutions. Now they will know how.”
This interoperability could help analysts better address a number of data-intensive contemporary problems, such as weather forecasting. Meteorologists section the atmosphere into small blocks and apply analytics models to each block, using big data techniques to keep track of changes that hint at the future. As these blocks get smaller and our ability to analyze finer details grows, forecasts can improve — if our computational components can be swapped for more advanced tools.
“You model these cubes with multiple equations whose variables move in parallel,” Chang said. “It’s hard to keep track of them all. The agnostic environment of the framework means a meteorologist can swap in improvements to an existing model. It will give forecasters a lot of flexibility.”
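The idea of swapping in an improved model without retooling the surrounding environment can be illustrated with a toy sketch. Everything in it is hypothetical: the `BlockModel` interface, the two model classes, and the block values are invented for illustration and are not taken from the NIST framework or from any real forecasting system.

```python
from typing import List, Protocol, Sequence


class BlockModel(Protocol):
    """Hypothetical interface: advance one atmospheric block by one time step."""

    def step(self, value: float, neighbors: Sequence[float]) -> float: ...


class PersistenceModel:
    """Baseline: assume each block keeps its current value."""

    def step(self, value: float, neighbors: Sequence[float]) -> float:
        return value


class NeighborAveragingModel:
    """Toy 'improved' model: nudge each block toward its neighbors' mean."""

    def __init__(self, rate: float = 0.5) -> None:
        self.rate = rate

    def step(self, value: float, neighbors: Sequence[float]) -> float:
        if not neighbors:
            return value
        mean = sum(neighbors) / len(neighbors)
        return value + self.rate * (mean - value)


def forecast(blocks: Sequence[float], model: BlockModel, steps: int) -> List[float]:
    """Run any BlockModel over a 1-D row of blocks; this driver never changes."""
    state = list(blocks)
    for _ in range(steps):
        # Each block sees only its immediate left and right neighbors,
        # taken from the previous time step's state.
        state = [
            model.step(v, state[max(i - 1, 0):i] + state[i + 1:i + 2])
            for i, v in enumerate(state)
        ]
    return state


temperatures = [10.0, 20.0, 30.0]  # made-up block values
baseline = forecast(temperatures, PersistenceModel(), steps=1)
improved = forecast(temperatures, NeighborAveragingModel(rate=0.5), steps=1)
print(baseline)
print(improved)
```

The `forecast` driver depends only on the shared interface, so replacing one model with another is a one-argument change. That is the kind of flexibility Chang describes, though a real forecasting pipeline would be far more involved.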
Another potential application is drug discovery, where scientists must explore the behavior of multiple candidate drug proteins in one round of tests and then feed the results back into the next round. Unlike weather forecasting, where an analytical tool must keep track of multiple variables that change simultaneously, the drug development process generates long strings of data where the changes come in sequence. While this problem demands a different big data approach, it would still benefit from the ability to make changes easily, as drug development is already a time-consuming and expensive process.
Whether applied to one of these or other big-data-related problems — from spotting health-care fraud to identifying animals from a DNA sample — the value of the framework will be in helping analysts speak to one another and more easily apply all the data tools they need to achieve their goals.
“Performing analytics with the newest machine learning and AI techniques while still employing older statistical methods will all be possible,” Chang said. “Any of these approaches will work. The reference architecture will let you choose.”