Abstract
Road safety remains a critical issue as traffic accidents continue to rise. Analyzing crash reports is vital for understanding accident causation and implementing preventative measures. In this research, we developed an information extraction system that uses natural language processing (NLP) to enhance the interpretation of traffic crash reports. While standardized forms offer basic information, the unique details and context of each crash require more advanced techniques for comprehensive analysis. We employed a rule-based approach to extract information from unstructured natural language, emphasizing syntactic and light semantic feature recognition in traffic crash narratives. The approach focused on extracting the subjects, actions, and objects of events in crash reports. We prepared a data set of crash reports from the Michigan Office of Highway Safety Planning, 80 for training and 20 for testing. A new ruleset was developed from the training data, incorporating part-of-speech (POS) tagging and sentence-structure pattern matching to extract the target information. The extraction process employed the General Architecture for Text Engineering (GATE), using its essential NLP resources to match POS tagging features and sentence structures. Experiments on the testing data demonstrated 95.4% precision and 86.9% recall without typo and grammar correction of the data, improving to 96.7% precision and 90.16% recall with typo and grammar correction. These results outperformed the state of the art, including ChatGPT-4o, highlighting the potential of rule-based NLP techniques that rely mainly on POS tagging for extracting key information from crash report narratives. This research offers a robust tool for improving road safety analysis.
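As a rough illustration of the kind of rule described above, the following Python sketch matches a single "noun phrase, verb, noun phrase" pattern over POS-tagged tokens to pull out a subject, action, and object. It is not the authors' ruleset and does not use GATE: the tag sets, the function names `match_noun_phrase` and `extract_event`, and the hand-tagged example sentence are all assumptions made for illustration, using Penn Treebank tag names.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, Penn Treebank POS tag); tags are hand-assigned here

# Hypothetical tag groups for this sketch, not the published ruleset.
MODIFIER_TAGS = {"DT", "JJ", "PRP$"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def match_noun_phrase(tokens: List[Token], i: int) -> Optional[Tuple[str, int]]:
    """Match modifiers* (noun | number)+ starting at i; return (text, next index)."""
    j = i
    while j < len(tokens) and tokens[j][1] in MODIFIER_TAGS:
        j += 1  # skip determiners and adjectives
    head_start = j
    has_noun = False
    while j < len(tokens) and tokens[j][1] in NOUN_TAGS | {"CD"}:
        has_noun = has_noun or tokens[j][1] in NOUN_TAGS
        j += 1
    if not has_noun:
        return None  # a phrase needs at least one noun, not only numbers
    return " ".join(word for word, _ in tokens[head_start:j]), j


def extract_event(tokens: List[Token]) -> Optional[Tuple[str, str, str]]:
    """Return the first (subject, action, object) matching NP, adverbs*, verb+, NP."""
    for i in range(len(tokens)):
        subject_match = match_noun_phrase(tokens, i)
        if subject_match is None:
            continue
        subject, j = subject_match
        while j < len(tokens) and tokens[j][1] == "RB":
            j += 1  # skip adverbs between subject and verb
        verb_start = j
        while j < len(tokens) and tokens[j][1] in VERB_TAGS:
            j += 1
        if j == verb_start:
            continue  # no verb follows this noun phrase
        action = " ".join(word for word, _ in tokens[verb_start:j])
        object_match = match_noun_phrase(tokens, j)
        if object_match is None:
            continue  # no object noun phrase directly after the verb
        return subject, action, object_match[0]
    return None


# Hand-tagged example sentence: "Vehicle 1 struck Vehicle 2 in the rear."
narrative = [
    ("Vehicle", "NN"), ("1", "CD"), ("struck", "VBD"),
    ("Vehicle", "NN"), ("2", "CD"), ("in", "IN"),
    ("the", "DT"), ("rear", "NN"), (".", "."),
]
event = extract_event(narrative)
```

A real ruleset would need many more patterns (passives, prepositional objects, multiple events per sentence), and the tags would come from a POS tagger rather than being assigned by hand as they are here.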
| Original language | English |
|---|---|
| Article number | 04025105 |
| Journal | Journal of Computing in Civil Engineering |
| Volume | 40 |
| Issue number | 1 |
| State | Published - Jan 1 2026 |
| Externally published | Yes |
Funding
We would like to thank the National Science Foundation (NSF). This material is based on work supported by the NSF under Grant No. 2121967. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.