Abstract
We address a crucial element of applied information extraction, the accurate identification of basic security entities in text, by evaluating previous methods and presenting new labelers. Our survey reveals that previous efforts have not been tested on documents similar to the targeted sources (news articles, blogs, tweets, etc.) and that no sufficiently large, publicly available annotated corpus of these documents exists. By assembling a representative test corpus, we perform a quantitative evaluation of previous methods in a realistic setting, revealing an overall lack of recall and giving insight into the models' beneficial and inhibiting elements. In particular, our results show that many previous efforts overfit to the non-representative test corpora in this domain. Informed by this evaluation, we present three novel cyber entity extractors, which seek to leverage the available labeled data but remain worthwhile on the more diverse documents encountered in the wild. Each new model advances the state of the art in recall, with maximal or near-maximal F1 score. Our results establish that the state of the art in cyber entity tagging is characterized by F1 = 0.61.
Original language | English |
---|---|
Title of host publication | Proceedings - 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017 |
Editors | Xuewen Chen, Bo Luo, Feng Luo, Vasile Palade, M. Arif Wani |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 437-442 |
Number of pages | 6 |
ISBN (Electronic) | 9781538614174 |
State | Published - 2017 |
Event | 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017 - Cancun, Mexico Duration: Dec 18 2017 → Dec 21 2017 |
Publication series
Name | Proceedings - 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017 |
---|---|
Volume | 2017-December |
Conference
Conference | 16th IEEE International Conference on Machine Learning and Applications, ICMLA 2017 |
---|---|
Country/Territory | Mexico |
City | Cancun |
Period | 12/18/17 → 12/21/17 |
Funding
ACKNOWLEDGMENTS The authors wish to thank Jason Laska for helpful discussions. This material is based on research sponsored by the Department of Homeland Security Science and Technology Directorate, Cyber Security Division via BAA 11-02; the Department of National Defence of Canada, Defence Research and Development Canada; the Dutch Ministry of Security and Justice; and the Department of Energy. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of: the Department of Homeland Security; the Department of Energy; the U.S. Government; the Department of National Defence of Canada, Defence Research and Development Canada; or the Dutch Ministry of Security and Justice. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Keywords
- cybersecurity
- entity extraction
- information extraction
- machine learning