Cybersecurity automated information extraction techniques: Drawbacks of current methods, and enhanced extractors

Robert A. Bridges, Kelly M.T. Huffer, Corinne L. Jones, Michael D. Iannacone, John R. Goodall

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

20 Scopus citations

Abstract

We address a crucial element of applied information extraction, the accurate identification of basic security entities in text, by evaluating previous methods and presenting new labelers. Our survey reveals that previous efforts have not been tested on documents similar to the targeted sources (news articles, blogs, tweets, etc.) and that no sufficiently large, publicly available annotated corpus of these documents exists. By assembling a representative test corpus, we perform a quantitative evaluation of previous methods in a realistic setting, revealing an overall lack of recall and giving insight into the models' beneficial and inhibiting elements. In particular, our results show that many previous efforts overfit to the non-representative test corpora in this domain. Informed by this evaluation, we present three novel cyber entity extractors, which seek to leverage the available labeled data but remain worthwhile on the more diverse documents encountered in the wild. Each new model advances the state of the art in recall, with maximal or near-maximal F1 score. Our results establish that the state of the art in cyber entity tagging is characterized by F1 = 0.61.
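For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how precision, recall, and F1 are commonly computed for entity tagging under exact-match scoring. It is not the authors' evaluation protocol: the `Entity` tuple layout, the label names, and the sample data are all hypothetical and chosen only for illustration.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start_token, end_token, label).
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match entity scoring: a prediction counts only if span and label both match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative (made-up) annotations for one document.
gold = {(0, 1, "VENDOR"), (2, 3, "PRODUCT"), (7, 8, "VERSION"), (10, 11, "VULN")}
predicted = {(0, 1, "VENDOR"), (2, 3, "PRODUCT"), (5, 6, "PRODUCT")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# prints: precision=0.667 recall=0.500 f1=0.571
```

In this toy case the tagger misses half of the gold entities, so recall (0.500) pulls F1 below precision, the same pattern of low recall that the abstract reports for previous methods.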

Original language: English
Title of host publication: Proceedings - 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017
Editors: Xuewen Chen, Bo Luo, Feng Luo, Vasile Palade, M. Arif Wani
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 437-442
Number of pages: 6
ISBN (Electronic): 9781538614174
State: Published - 2017
Event: 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017 - Cancun, Mexico
Duration: Dec 18 2017 to Dec 21 2017

Publication series

Name: Proceedings - 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017
Volume: 2017-December

Conference

Conference: 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017
Country/Territory: Mexico
City: Cancun
Period: 12/18/17 to 12/21/17

Funding

Acknowledgments: The authors wish to thank Jason Laska for helpful discussions. This material is based on research sponsored by the Department of Homeland Security Science and Technology Directorate, Cyber Security Division via BAA 11-02; the Department of National Defence of Canada, Defence Research and Development Canada; the Dutch Ministry of Security and Justice; and the Department of Energy. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of: the Department of Homeland Security; the Department of Energy; the U.S. Government; the Department of National Defence of Canada, Defence Research and Development Canada; or the Dutch Ministry of Security and Justice. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Keywords

  • cybersecurity
  • entity extraction
  • information extraction
  • machine learning
