Detecting Web Spam in Webgraphs with Predictive Model Analysis

Naw Safrin Sattar, Shaikh Arifuzzaman, Minhaz F. Zibran, Md Mohiuddin Sakib

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

15 Scopus citations

Abstract

Web spam is a serious threat to both end-users and search engines (with respect to query cost). Webgraphs can be exploited to detect spam. In the past, several graph mining techniques have been applied to compute metrics for pages and hyperlinks. In this paper, we demonstrate the importance of the webgraph for distinguishing spam websites from non-spam ones, based on several graph metrics computed for a labelled dataset (WEBSPAM-UK2007), and we validate our model by testing it on the uk-2014 dataset, the most recent available dataset for the same (uk) domain. The WEBSPAM-UK2007 dataset includes 0.1 million different hosts and four kinds of feature sets: Obvious, Link, Transformed Link and Content. We use five prominent machine learning (ML) techniques (Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Logistic Regression, Naïve Bayes and Random Forest) to build an ML-based classifier. To evaluate the performance of our classifier, we compute accuracy and F-1 score and perform 10-fold cross validation. We also compare graph-based features with content-based textual features and find that graph properties perform similarly to or better than text properties. We achieve above 99% training accuracy for most of our machine learning models. We test our model on the uk-2014 dataset, with 4.7 million hosts, using the graph-based feature sets and achieve accuracy between 90% and 94% for most of the models. To the best of our knowledge, prior works on web spam detection with the WEBSPAM-UK2007 dataset did not use a different test dataset for their models. Our classifier is capable of detecting web spam for any input webgraph based on its graph metric features.
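The evaluation protocol named in the abstract (accuracy, F-1 score, 10-fold cross validation) can be illustrated with a minimal standard-library sketch. It is not the authors' code: the function names `accuracy_and_f1` and `k_fold_indices`, the 0/1 label encoding with 1 meaning spam, and the contiguous, unshuffled fold layout are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1), treating label 1 (assumed: spam) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam predicted as spam
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-spam predicted as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam predicted as non-spam
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def k_fold_indices(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k contiguous folds and return (train, test) index lists."""
    base, extra = divmod(n, k)  # the first `extra` folds get one additional item
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds
```

In practice the host indices would usually be shuffled (and often stratified by label) before splitting, and each of the five classifiers would be trained on `train` and scored on `test` for every fold, with the per-fold accuracy and F-1 values averaged.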

Original language: English
Title of host publication: Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019
Editors: Chaitanya Baru, Jun Huan, Latifur Khan, Xiaohua Tony Hu, Ronay Ak, Yuanyuan Tian, Roger Barga, Carlo Zaniolo, Kisung Lee, Yanfang Fanny Ye
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4299-4308
Number of pages: 10
ISBN (Electronic): 9781728108582
State: Published - Dec 2019
Externally published: Yes
Event: 2019 IEEE International Conference on Big Data, Big Data 2019 - Los Angeles, United States
Duration: Dec 9 2019 → Dec 12 2019

Publication series

Name: Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019

Conference

Conference: 2019 IEEE International Conference on Big Data, Big Data 2019
Country/Territory: United States
City: Los Angeles
Period: 12/9/19 → 12/12/19

Funding

This work has been partially supported by Louisiana Board of Regents RCS Grant LEQSF(2017-20)-RDA-25 and University of New Orleans ORSP SCORE award 2019. We also thank the anonymous reviewers for their helpful comments and suggestions to improve this paper.

Funder: Louisiana Board of Regents
Funder number: LEQSF(2017-20)-RDA-25

    Keywords

    • graph mining
    • machine learning
    • security
    • web spam
    • webgraphs
