Shaping the Future of the NIST Mass Spectral Library

When one thinks of mass spectral libraries — well, let’s be honest here, the very idea of mass spectral libraries will never cross most people’s minds, even most scientists’. But that’s true for most of the seemingly esoteric things that many of us at the National Institute of Standards and Technology (NIST) have devoted our careers to. And though many of us have often struggled to explain the nature of our work to friends and loved ones (and more often than not failed to elicit excitement), the importance of NIST’s work cannot be overstated. This is very much true for what we call the NIST Standard Reference Database 1A, more commonly known as the NIST Mass Spectral Library.

This year marks the eighth release of the library, which includes over 2 million mass spectra measured for over 350,000 chemical compounds. A mass spectrum is essentially a fingerprint generated by analyzing a chemical with an instrument called a mass spectrometer. When trying to identify an unknown substance, acquiring and comparing its mass spectrum to reference spectra in a library is one of the quickest and most accurate ways to do it. This sort of analysis is routinely performed in thousands of labs across the world in a wide range of different industries, many of which rely on the NIST Mass Spectral Library.

The library has been continuously updated for over 40 years, with three-year release cycles. A common question that’s asked whenever we release a new version of the library is “What’s new?” It is indeed a perfectly reasonable question, but a vexingly difficult one to answer considering we add spectra from upward of 7,000 new chemical compounds each year. The kind of answer that people expect also varies depending on what field they’re in, and what their interests are.

However, an even more common, and typically unasked, question in the back of people’s minds is “Why?” It’s an almost existential question, perhaps even a mildly off-putting one for those of us who have helped build the library over the years, decades and (for some of us) our entire scientific careers. Still, it’s certainly a question that merits contemplation, especially for those of us who are so close to the library’s development.

Updating the database

Having joined NIST in 2014, I am a relative newcomer to the Mass Spectrometry Data Center (MSDC), the group that is responsible for maintaining, curating and expanding the NIST Mass Spectral Library. Despite my relative inexperience, I have nonetheless been given a significant role in the library’s expansion, which is to select the aforementioned 7,000 or so chemical compounds each year that will be acquired and analyzed, and whose spectra will eventually be added to the library. With tens of millions of chemical compounds available for purchase, how does one go about selecting a mere fraction of a fraction of these for analysis? Why even bother trying to “select” compounds anyway; why can’t we just go down the list by alphabetical order and buy as many as we can handle?

The simple answer is that most of the compounds for sale are irrelevant because they are produced for large-scale drug discovery studies by biopharmaceutical companies. Most of these compounds would not be found in nature and are of no interest to anyone else unless one of them ends up being a useful drug, so there’s little reason to add them to our library. Thus, the task at hand is to select compounds that people do care about.

This is an exceptionally wide umbrella, covering fields such as forensics, the flavor and fragrance industry, wastewater treatment, food science, biomedical research and agriculture, to name a few. Being at the helm of this selection process has been a huge learning experience for me. I have come to understand not only the intricacies of our library-building process, but also the broad impact the NIST Mass Spectral Library has had across such a wide range of critical fields. It will continue to have that impact in rapidly emerging fields such as big data-driven biosciences, which focus on the analysis of biological molecules like DNA and proteins.

It is the latter that I am particularly interested in, in no small part because I (and other new members of the MSDC) feel a responsibility to uphold the legacy of the library by keeping it relevant in our rapidly changing world. It is my peculiar educational background and research interests that I rely on to guide me in this process.

Learning the trade

Given that I work in a group so steeped in the world of mass spectrometry (it’s in our name after all!), many are surprised to learn that I am not a trained chemist, much less a trained mass spectrometrist. My official job title at NIST is “statistician,” and though that’s not quite an accurate representation of my skill set, it is nonetheless the closest thing you can get to my area of expertise in the list of federal job titles. My preferred job title would be bioinformatician, someone who designs computational tools and algorithms to analyze biological data.

I got my start in the field when I was just 16 years old. Almost by pure coincidence, I happened to land a high-school internship at the Center for Advanced Research in Biotechnology (CARB), now known as the Institute for Bioscience and Biotechnology Research (IBBR). It is a research institute in Rockville, Maryland, that many NISTers are quite familiar with, as it is a joint partnership between NIST and the University of Maryland. I certainly didn’t have an inkling that I’d end up working at NIST about a decade down the road, and in the division that is most closely involved with the IBBR, too!

The lab where I chose to spend the summer before my senior year, headed by Professor John Moult, was (unbeknownst to me at the time) quite renowned in the field of computational biology. I chose it on a whim because I had never fathomed putting computers and biology together before. Fortunately, it was the right place at the right time for me because the year was 2003, and the Human Genome Project had just wrapped up, revealing over 22,300 protein-coding genes littered across a staggering 3.3 billion base pairs, the molecules that, if DNA were a spiral staircase, make up the individual steps. Processing this overwhelming amount of information, otherwise known as genomics, necessitated a whole new class of scientist, one who was knowledgeable in both molecular biology and computer science.

As an intern, my primary task was constructing diagrams of the evolutionary relationships among organisms (phylogenetic trees), and while I could only grasp the very basics of what I was doing, I knew I was hooked and sought to make a career out of it. I earned my bachelor’s degree in electrical engineering and entered graduate school to pursue a Ph.D. in bioinformatics. However, things had changed by 2009 when I was just starting my first lab rotation. There was a new “-omics” field that was emerging called metabolomics, which promised to be as transformative as genomics was, and I just so happened to be in a lab that specialized in it.

Adding metabolomic ‘genes’

While genomics focuses on analyzing the totality of information at the genetic level of an organism, metabolomics focuses on analyzing the totality of information at the metabolic level. Thus, genomics and metabolomics operate on opposite ends of Francis Crick’s central dogma of molecular biology, wherein DNA is transcribed into RNA, which is translated into proteins, many of which catalyze chemical reactions that sustain life, otherwise known as metabolism. My thesis focused on the development of new computational methods for analyzing metabolomics data, which happened to pique the interest of Stephen E. Stein, mass spectrometry guru, NIST Fellow and father of the NIST Mass Spectral Library.

At NIST I continue to pursue my interests in metabolomics, not just in the development of new algorithms, but in enhancing the mass spectral library to make it an invaluable resource for the field. By prioritizing the addition of compounds from databases including the Human Metabolome Database and Chemical Entities of Biological Interest, each year we add to the library critical puzzle pieces of the human metabolome, the metabolomic analogs of genes. In doing so, I believe the library is akin to the Human Genome Project in its foundational importance to the field of metabolomics.

About the author

Tytus Mak

Tytus Mak is a statistician in the Mass Spectrometry Data Center at the National Institute of Standards and Technology (NIST). His research focuses on developing machine learning approaches for analyzing big datasets generated by high-throughput biomolecular analysis platforms including metabolomics, glycomics and proteomics. He received his B.S. in electrical and computer engineering at Cornell University and his Ph.D. in bioinformatics and tumor biology at Georgetown University. He is also an avid martial artist and spends his free time practicing karate and kenjutsu.

Comments

valuable information

I would love to work for NIST. My name is Benjamin Garza and my email is ben1979.bg@gmail.com. Please contact me if there is an internship or job opening!

Hi Ben,

If you'd like to learn more about employment or internship opportunities at NIST, please see https://www.nist.gov/careers. Thanks!

How good is the equipment/software at determining when it's a new chemical or existing? What about when there are two or more chemicals...can they be easily differentiated or does it look like a new chemical? Is there software that can look at the different signatures and know or suggest that it's a combo of A + B or C+E+G?

Excellent questions! Though it's not uncommon to directly inject an unknown sample into your mass spectrometer (MS), the two most popular MS platforms utilize gas chromatography (GC-MS) or liquid chromatography (LC-MS) at the front end. In doing so, complex samples such as urine or blood, which contain thousands of compounds, can be separated so that multiple chemicals are not introduced into the instrument simultaneously. This enables relatively "pure" spectra to be acquired that can generally be assumed to originate from a single compound. Thousands of spectra can be acquired from a single sample, and thus many if not all of its chemical constituents can theoretically be identified. However, the fraction of compounds that can be identified depends on the size of the spectral library. Matching algorithms compare experimentally acquired spectra to the reference spectra in the NIST Mass Spectral Library, and a compound that is not in the library will receive a very low match score.
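
To make the idea of a match score concrete, here is a minimal sketch in Python. It assumes spectra are stored as simple dictionaries mapping m/z values to intensities and uses plain cosine similarity as a stand-in scoring method. The function names and the toy spectra are invented for illustration, and this is not the scoring algorithm used with the NIST Mass Spectral Library.

```python
import math


def cosine_score(query, reference):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    shared = set(query) & set(reference)
    dot = sum(query[mz] * reference[mz] for mz in shared)
    norm_q = math.sqrt(sum(v * v for v in query.values()))
    norm_r = math.sqrt(sum(v * v for v in reference.values()))
    if norm_q == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_q * norm_r)


def best_match(query, library):
    """Return (name, score) for the highest-scoring reference spectrum."""
    scored = ((name, cosine_score(query, ref)) for name, ref in library.items())
    return max(scored, key=lambda pair: pair[1])


# Made-up toy spectra (integer m/z -> relative intensity); not real library data.
library = {
    "compound_A": {41: 30.0, 43: 100.0, 58: 45.0},
    "compound_B": {77: 100.0, 105: 80.0, 182: 40.0},
}
known = {41: 28.0, 43: 100.0, 58: 50.0}      # closely resembles compound_A
unknown = {91: 100.0, 120: 35.0, 150: 60.0}  # shares no peaks with the library

name_known, score_known = best_match(known, library)
name_unknown, score_unknown = best_match(unknown, library)
print(name_known, round(score_known, 3))
print(name_unknown, round(score_unknown, 3))
```

In this toy setup the spectrum that resembles a library entry scores close to 1, while the spectrum with no peaks in common scores 0 against every entry, which is the sense in which a compound missing from the library shows up as a poor match.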

Very cool, thanks for the thoughtful reply. I've had some (limited) experience using a mass spectrometer that only looked for a handful of specific compounds in chemical samples of formaldehyde/urea formaldehyde concentrate. That work preceded the whole CSI effect and forensics from TV, where they made it seem so oversimplified and blasé. The labwork for that job was pretty fun, albeit repetitive.

Relative to your article, I think a worthy followup would be to answer the "Why?" that is asked of the work. I think the number of uses for mass spec is underestimated, and it would benefit everyone if people knew about its broad spectrum of potential uses (QA/QC/forensics/environmental/'watchdog'/etc.), as well as the reduced cost of entry compared with back in the day. Knowing that there is a substantial reference library available to check against is icing on the cake. Knowing that samples are available is important for a backyard scientist perhaps looking to see if Big Pharma company X is releasing harmful chemicals to the environment. This is a great and very relevant resource.

Taking Measure
