Abstract
Web spam is a serious threat to both end-users and search engines (with respect to query cost). Webgraphs can be exploited to detect spam, and several graph mining techniques have previously been applied to compute metrics for pages and hyperlinks. In this paper, we demonstrate the importance of the webgraph in distinguishing spam websites from non-spam ones, based on several graph metrics computed for a labelled dataset (WEBSPAM-UK2007), and we validate our model by testing it on the uk-2014 dataset, the most recently available dataset for the same (uk) domain. The WEBSPAM-UK2007 dataset includes 0.1 million different hosts and four kinds of feature sets: Obvious, Link, Transformed Link and Content. We use five prominent machine learning (ML) techniques (Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Logistic Regression, Naïve Bayes and Random Forest) to build an ML-based classifier. To evaluate the performance of our classifier, we compute accuracy and F-1 score and perform 10-fold cross validation. We also compare graph-based features with content-based textual features and find that the graph-based features perform as well as or better than the textual ones. We achieve above 99% training accuracy for most of our machine learning models. We test our model on the uk-2014 dataset, with 4.7 million hosts, using the graph-based feature sets and achieve accuracy between 90% and 94% for most of the models. To the best of our knowledge, prior works on web spam detection with the WEBSPAM-UK2007 dataset did not use a separate test dataset for their models. Our classifier is capable of detecting web spam for any input webgraph based on its graph metric features.
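The evaluation protocol named in the abstract (accuracy, F-1 score, 10-fold cross validation) can be illustrated with a minimal standard-library sketch. This is not the authors' code: the function names `accuracy_and_f1` and `k_fold_indices`, the convention that label 1 means spam, and the toy labels are all assumptions made for illustration, and the folds here are contiguous and unshuffled, whereas a real experiment would typically shuffle or stratify them.

```python
from typing import List, Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1), treating label 1 (assumed: spam) as the positive class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam predicted as spam
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-spam predicted as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam predicted as non-spam
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-spam predicted as non-spam
    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    # F1 is the harmonic mean of precision and recall; defined as 0 when undefined.
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def k_fold_indices(n_samples: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n_samples-1 into k contiguous folds of (train, test) indices."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


if __name__ == "__main__":
    # Toy labels (hypothetical): 2 spam hosts out of 10.
    y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    acc, f1 = accuracy_and_f1(y_true, y_pred)
    print(f"accuracy={acc:.2f} f1={f1:.2f}")
    for train_idx, test_idx in k_fold_indices(len(y_true), k=10):
        print(len(train_idx), "train /", len(test_idx), "test")
```

On the toy labels the accuracy is 0.80 while the F-1 score is only 0.50, which shows why reporting both is useful when spam hosts are a small minority of the data.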
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019 |
| Editors | Chaitanya Baru, Jun Huan, Latifur Khan, Xiaohua Tony Hu, Ronay Ak, Yuanyuan Tian, Roger Barga, Carlo Zaniolo, Kisung Lee, Yanfang Fanny Ye |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 4299-4308 |
| Number of pages | 10 |
| ISBN (Electronic) | 9781728108582 |
| State | Published - Dec 2019 |
| Externally published | Yes |
| Event | 2019 IEEE International Conference on Big Data, Big Data 2019 - Los Angeles, United States (Dec 9 2019 → Dec 12 2019) |
Publication series
| Name | Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019 |
|---|---|
Conference
| Conference | 2019 IEEE International Conference on Big Data, Big Data 2019 |
|---|---|
| Country/Territory | United States |
| City | Los Angeles |
| Period | 12/9/19 → 12/12/19 |
Funding
This work has been partially supported by the Louisiana Board of Regents RCS Grant LEQSF(2017-20)-RDA-25 and the University of New Orleans ORSP SCORE award 2019. We also thank the anonymous reviewers for their helpful comments and suggestions to improve this paper.
Keywords
- graph mining
- machine learning
- security
- web spam
- webgraphs