Abstract
This report describes the adaptation, composition and use of natural language processing, machine learning and other computational tools to help make explicit the implicit informational structures in very large technical corpora. The tools applied to the corpora automatically build normalized multi-word term structures, which in turn are used to build taxonomies, semantic schemas, topic models and knowledge graphs. In our cybersecurity use case, we apply these tools to help us understand the threat landscape as exhibited in the Common Vulnerabilities and Exposures (CVE; https://cve.mitre.org/ and https://nvd.nist.gov/) corpus, with the aim of proactively anticipating threats. The use case provides the context for the development, use and evaluation of the automated tools and the processes based on them. The latter are evaluated and improved incrementally and iteratively as they are developed and used; local evaluation and improvement come before global evaluation. We believe that global evaluation of text processing methodologies and processes, as currently exemplified by the Text REtrieval Conference (TREC), Document Understanding Conference (DUC) and Text Analysis Conference (TAC), is worth pursuing, and we have done so using TREC. In this report, however, we focus mostly on local development, evaluation and improvement. We articulate various aspects of these approaches by describing and showing 1) our multi-word term based process for topic modeling, which can be supported by semantic schemas, taxonomic structures and knowledge graphs built from the same multi-word terms; 2) the heuristic methods used to evaluate the performance of the multi-word term based topic modeling, including suggestions for measuring how well a topic is represented in the documents indexed to it; and 3) how these local heuristic methods might be transformed into a new, fully rigorous evaluation standard like TREC, DUC and TAC, but with emphasis on their contribution to the interpretation and understanding of very large corpora. With respect to 3), we also briefly explore which parts of the heuristic process would need to be automated and indicate the algorithms needed to do so.
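To make the idea of multi-word-term-based topic modeling concrete, the following is a minimal, hypothetical sketch, not the report's actual pipeline: it stands in for the normalized multi-word term extraction with simple bigram/trigram features from scikit-learn's CountVectorizer, fits a small LatentDirichletAllocation topic model over those features, and prints the top multi-word terms per topic. The toy CVE-style descriptions and all parameter choices are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for CVE description texts (illustrative only).
docs = [
    "buffer overflow in the image parsing library allows remote code execution",
    "sql injection in the login form allows privilege escalation",
    "cross site scripting in the admin panel allows session hijacking",
]

# Use multi-word terms (bigrams and trigrams) rather than single tokens as features.
vectorizer = CountVectorizer(ngram_range=(2, 3), stop_words="english")
term_doc = vectorizer.fit_transform(docs)

# Fit a small topic model over the multi-word term features.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(term_doc)

# Show the highest-weighted multi-word terms for each topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top_terms}")
```

The same multi-word term vocabulary produced in the first step is what would feed the taxonomies, semantic schemas and knowledge graphs discussed in the body of the report.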