Just a Standard Blog
Industry and businesses have long been known to have their own specialized “languages” — words and phrases that mostly only make sense to someone in that business. This technical jargon, slang or industry lingo has largely developed as a shorthand method to convey complex or very specific ideas and directives using a minimal amount of effort.
“Peter, please get me that TSP printout for my retirement ASAP.”
“Don’t overdo it with the salt, the tsp should be enough.”
“I need to finish my white paper for the ivory tower by COB.”
“Engine one is due for a lube inspection and a re-winding. Let’s push it until next week’s line wide PM.”
Sentences like these may mean a very specific thing to you, or may mean nothing at all. Maybe you think you understand parts of it, but those same parts may mean something else to another person. Even if the letters and words are familiar to you, their context and their meaning can be lost without the specific insight into where they came from. Sometimes that context can be found in the sentence itself; other times it is more elusive.
Consider the term “TSP.” Any average English speaker could recognize it as an abbreviation for something, but depending on who is reading it, where, and when, the answer to what it means could be very different. Perhaps it stands for “teaspoon,” or “Thrift Savings Plan,” or “trisodium phosphate,” or any number of other possibilities. It is the context around it that must be interpreted to understand its intent.
People generally are very good at learning and translating context and intent with comparatively little additional information. Computers, however, are not. In the example above, words like “salt,” “retirement” or “chemical” could be added to quickly allow a computer to figure out the context. But even then, there might be confusion depending on whether the word is used in a technical setting versus a casual one. Trisodium phosphate is chemically a salt, leading to correct but confusing phrases like “ONE TSP: TSP.”
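To make the idea of adding context words concrete, here is a minimal sketch in Python. The cue-word lists and the `guess_sense` function are made up for illustration (they are not part of any NIST tool), and a real system would need far richer context than a handful of keywords.

```python
import re
from typing import Optional

# Hypothetical cue words for each sense of "TSP". A real system would learn
# or curate these per domain rather than hard-code them.
SENSES = {
    "teaspoon": {"salt", "sugar", "recipe", "stir"},
    "Thrift Savings Plan": {"retirement", "printout", "savings", "fund"},
    "trisodium phosphate": {"chemical", "cleaner", "phosphate", "degreaser"},
}

def guess_sense(sentence: str) -> Optional[str]:
    """Return the sense whose cue words overlap the sentence most.

    Returns None when no cue word appears or when two senses tie.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    best = max(scores.values())
    winners = [sense for sense, n in scores.items() if n == best]
    return winners[0] if best > 0 and len(winners) == 1 else None

print(guess_sense("Peter, please get me that TSP printout for my retirement ASAP."))
print(guess_sense("Don't overdo it with the salt, the tsp should be enough."))
print(guess_sense("ONE TSP: TSP"))
```

The first two sentences resolve because they carry a cue word. The third returns `None`, which is the situation described above: with no surrounding context, keyword matching has nothing to work with. Note also that listing "salt" under "teaspoon" is exactly the kind of shortcut that breaks in a chemistry setting.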
I lead a group at NIST that is very interested in these kinds of highly contextual coded languages. After reading “ONE TSP: TSP,” we want a computer to be able to translate that phrase to another user as “Add one teaspoon of trisodium phosphate to the mix.” My colleagues and I study and work in the area of technical language processing (TLP): the act of using computers to capture, understand and translate jargon for other users. The result can be a direct action, like controlling a robot, but often more importantly, we want computers to be able to communicate the ideas they capture back to another person.
For our purposes, technical languages can be anything written or spoken in an industrial or scientific setting, where context is especially important. In many cases, this includes words or phrases that might not even appear outside of a very small group. But clearly not all language is technical, so let’s briefly talk about the wider-known counterpart to TLP.
Natural language processing (NLP) is a formal area of study that takes communications by humans and transforms that information into something more suitable for computer use and analysis. In broad terms, this is performed by restructuring the communication into a form that allows it to be compared to “concepts” or ideas that the computer has previously learned. But where NLP focuses on the most common uses for words, TLP focuses on the less common uses, or meanings that can change based on context. For example, “running” and “jogging” are similar concepts, but may or may not work interchangeably depending on the context. An NLP tool might recognize both as means of locomotion, but a TLP tool could also know that jogging a memory has little relation to running a store and that neither is a means of locomotion. There are of course ways to get NLP to recognize these differences, but this type of problem is where TLP lives.
Some of the most common applications of NLP that you encounter in your daily life are translation tools. This can be language translations, such as English to Spanish, but it can also be voice-to-text translation. Interactive chatbots and some search engines use forms of NLP.
While NLP has started to provide real societal benefit, TLP has yet to show its full potential and remains a much harder task. Industry leaders have begun to recognize the need both to process high volumes of text and to translate information between individuals in areas where NLP struggles to perform, so they are leaning more and more toward TLP to help them.
One reason is that specialized industry lingo and technical jargon are significantly different from the way people normally communicate. NLP tools trained for “normal” speech just don’t work well in technical settings. NLP defaults to the most common way to use a word, which is often incorrect in a technical context. Also, for most factories and businesses, the number of examples needed to teach a computer technical communications just doesn’t exist. Most NLP tools need hundreds of thousands to millions of examples for training.
TLP is targeted at solving these types of problems. Part of my job is to help people teach computers contextually specialized language with the fewest possible examples. Often, the only successful way to do that is with direct human oversight and input, so teaching people is also a very real part of TLP.
Some areas, such as the medical field, have a head start on TLP because of a yearslong effort to create rigorous consistency in how terms are used, but other fields are just now realizing its potential. Misspellings, inconsistent shorthand, formatting differences and slang are all common occurrences in industrial documents. My goal is to help people teach artificial intelligence that when someone inputs “Fixxed leek,” “Leak Repaired” or “John applied sealant to drip site,” they all mean the same thing despite having zero words in common. Many cases like this exist where something obvious to a human is nearly impossible for a computer to learn on its own.
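As a rough illustration of how a person can teach a computer that these three entries match, the Python sketch below maps raw words to shared tags. The `ALIASES` table and the `tags` function are hypothetical, standing in for the kind of vocabulary a human annotator might build up over time.

```python
import re

# Hypothetical alias table built by a human annotator: each raw token,
# including misspellings and shorthand, maps to one canonical tag.
ALIASES = {
    "fixxed": "repair", "fixed": "repair", "repaired": "repair",
    "sealant": "repair",
    "leek": "leak", "leak": "leak", "drip": "leak",
}

def tags(entry: str) -> frozenset:
    """Return the set of canonical tags found in a free-text entry."""
    tokens = re.findall(r"[a-z]+", entry.lower())
    return frozenset(ALIASES[t] for t in tokens if t in ALIASES)

entries = [
    "Fixxed leek",
    "Leak Repaired",
    "John applied sealant to drip site",
]
for entry in entries:
    print(entry, "->", sorted(tags(entry)))
```

All three entries reduce to the same pair of tags even though their raw words do not overlap. The hard part is not the lookup but deciding what goes in the table, which is where the human comes in.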
Another one of the goals and challenges of TLP is to help researchers and workers in vastly different fields collaborate and search through one another’s work, despite having very different ways to talk about things. A common practice to one person might be the innovative solution needed by another, but the difference in how they speak about things keeps them separate. A sound editor in Hollywood might have found the solution to a gene sequencing problem, but would never know because she calls her method “dynamic time warping” instead of a “Levenshtein distance measure.” In another case, John may be looking to find a way to replant forests quickly after a wildfire. To ensure maximum coverage and sprouting, he needs to project a high volume of nutrient-encased seed pods a long distance without rupturing them on launch. Jim is a master paintball player and is widely known to have the longest-shooting guns with the biggest bullets. Jim may be able to help solve John’s problem … but John’s not interested in paintball and Jim couldn’t care less about ecology. So despite them both having very detailed webpages about their respective work, they never find each other. TLP could help connect them.
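For readers curious about the two methods named in the sound editor example, here is a minimal Python sketch of each. Both fill in a table of partial alignment costs row by row, which is why experts in different fields can arrive at closely related ideas under different names. These are textbook versions written for illustration, with function names of my own choosing, and are not taken from any particular tool.

```python
from typing import Sequence

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def dtw(x: Sequence[float], y: Sequence[float]) -> float:
    """Cost of the cheapest monotonic alignment, using |x_i - y_j| as the local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)
    for xi in x:
        curr = [inf]
        for j, yj in enumerate(y, start=1):
            step = min(prev[j], curr[j - 1], prev[j - 1])
            curr.append(abs(xi - yj) + step)
        prev = curr
    return prev[-1]

print(levenshtein("leek", "leak"))      # one substitution
print(dtw([1, 2, 3], [1, 2, 2, 3]))     # a held note costs nothing to align
```

The first call reports an edit distance of 1, and the second reports an alignment cost of 0.0 because dynamic time warping lets one sample stretch to match a repeated value.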
Beyond our own research, NIST is linking academic and industrial communities to help advance the development and use of TLP technologies. We helped found and continue to support an active TLP Community of Interest where everyone from researchers to users to even just the curious can come and actively participate in research and conversations on the subject. We have projects evaluating how operators assess and communicate problems with equipment, a project developing methods for analyzing technical documents, one to make diagnostic models from manuals, and more. TLP is largely about helping workers do what they already do, but making it easier, more productive and hopefully a little less tedious.
We hope the tools of TLP will soon be able to:
Whether with computers or humans, language and communication operate on ideas, intent and context. Every day that I work on TLP with my colleagues, we are pushing the boundaries of how people and computers interact through language. What could be more exciting than that?
As we like to say, Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
Great editorial! NLP for technical data/language is a massive challenge. Please continue bringing the TLP (+ adjacent) community together to identify the low-hanging fruit. We are very much interested in participating!! Standing by...
I enjoyed reading about TLP. We are trying to customize a general LLM for the accounting industry through prompting, embedding, and fine-tuning. The natural language output should be natural, intuitive, and intelligent and the jargon should be industry specific to the accounting industry. I imagine that there will be chatbots launched in every industry vertical tuned using TLP for that vertical. Do you have any tips or tricks on the most efficient way to train AI models for TLP other than by brute force of human editing?
Thank you for your comment, JJ!
Glad you enjoyed the article.
In the time since I wrote this, the world has become ablaze with generative text and LLMs.
TLP specializes in small-corpus, hyper-specialized languages that generally have some root that can relate to a more generalized LLM.
In the foreseeable future, I do not think we will be able to fully remove the human element when it comes to fine-tuning and adapting these models toward more specialized domains, such as accounting. But that's not to say everything needs to be brute-force tagging either. There is software out there, such as Nestor (https://www.nist.gov/services-resources/software/nestor), that helps with human-in-the-loop merging and masking of terms, concepts, and tags, which can then be used to accelerate the adaptation of more generalized LLMs or to develop new models from the ground up.
Although I am no expert in the accounting industry, I would assume there are copious amounts of technical documents and other sources of technical text that could be ingested into these models to produce an industry-focused language model. The nature and structure of that model would depend on your end goals, but in general I believe the starting place would be the same:
1. Build a corpus of documents relevant to your goals.
2. Pass it through standard NLP modeling processes.
3. Identify terms and concepts not well recognized by the model (this can be the tricky part in many cases).
4. Use a human-in-the-loop system to re-tag or "re-mask" under-identified words and ideas.
5. Retrain with the new target labels.
6. Repeat as needed.
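To give a feel for the identification step, here is a very small Python sketch. It uses a plain vocabulary lookup as a stand-in for "not well recognized by the model," which is an assumption on my part; the sample entries, the `known` word list and the `unrecognized_terms` function are all made up for illustration.

```python
import re
from collections import Counter

def unrecognized_terms(corpus, known_vocab, top_n=5):
    """Count tokens missing from known_vocab and return the most frequent ones.

    The result is a list of (token, count) pairs meant for human review.
    """
    counts = Counter(
        tok
        for doc in corpus
        for tok in re.findall(r"[a-z0-9]+", doc.lower())
        if tok not in known_vocab
    )
    return counts.most_common(top_n)

# Hypothetical accounting-style notes and a tiny "general English" word list.
corpus = ["Accrue AP at EOM", "Reverse AP accrual after EOM close"]
known = {"at", "after", "close", "reverse", "accrue", "accrual"}

for term, count in unrecognized_terms(corpus, known):
    print(term, count)
```

The frequent leftovers (here, the shorthand "ap" and "eom") are the candidates a person would then tag or merge before retraining.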
This is of course oversimplified, but it is about as much as I can fit into a comment-style response.
Feel free to reach out to me or other NIST researchers if you have more questions or just want to chat in general on the topic.
~MES
Yes, the Forest Products Industry has its own TLP. People outside the industry try to write about it and its history and fail miserably. Within the industry, corporations pay outside vendors to try to create software to simplify or enhance all of its work on the raw supply end. "Fixing" the TLP to meet the goal is constant. It is expensive and time consuming!