The Role of Data Filtering in Open Source Software Ranking and Selection

Addi Malviya-Thakur, Audris Mockus

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Faced with more than 100M open source projects, a more manageable small subset is needed for most empirical investigations. More than half of the research papers in leading venues filtered projects by some measure of popularity, with explicit or implicit arguments that unpopular projects are not of interest, may not even represent "real" software projects, or are otherwise not worthy of study. However, such filtering may have enormous effects on the results of a study if and precisely because the sought-out response or prediction is in any way related to the filtering criteria. This paper exemplifies the impact of this common practice on research outcomes, specifically how filtering software projects on GitHub by their inherent characteristics affects the assessment of their popularity. Using a dataset of over 100,000 repositories, we used multiple regression to model the number of stars (a commonly used proxy for popularity) based on factors such as the number of commits, the duration of the project, the number of authors, and the number of core developers. Our control model included the entire dataset, while a second, filtered model considered only projects with ten or more authors. The results indicated that while certain repository characteristics consistently predict popularity, the filtering process significantly alters the relationships between these characteristics and the response. We found that the number of commits exhibited a positive correlation with popularity in the control sample but a negative correlation in the filtered sample. These findings highlight the potential biases introduced by data filtering and emphasize the need for careful sample selection in empirical research on mining software repositories.
We recommend that empirical work should either analyze complete datasets such as World of Code, or employ stratified random sampling from a complete dataset to ensure that filtering is not biasing the results.
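The sign flip described in the abstract can arise whenever the filtering variable depends on both the predictor and the response. The following is a minimal synthetic sketch of that mechanism, not the paper's actual model: the variable names and the data-generating assumption (that author count grows with both activity and popularity) are purely illustrative.

```python
import numpy as np

# Synthetic illustration of how filtering on a covariate can flip the
# sign of a regression coefficient. The data-generating assumptions
# (e.g., author count growing with both activity and popularity) are
# hypothetical, not taken from the paper's dataset.
rng = np.random.default_rng(42)
n = 100_000

commits = rng.normal(0.0, 1.0, n)                # project activity
stars = 0.3 * commits + rng.normal(0.0, 1.0, n)  # popularity proxy
# Author count depends on both activity and popularity (assumption).
authors = commits + stars + rng.normal(0.0, 0.5, n)

def ols_slope(x, y):
    """OLS slope of y regressed on x."""
    return np.polyfit(x, y, 1)[0]

full = ols_slope(commits, stars)
# "Filtered" sample: keep only projects with many authors,
# mimicking a filter like "ten or more authors".
mask = authors > np.quantile(authors, 0.9)
filtered = ols_slope(commits[mask], stars[mask])

print(f"full-sample slope:     {full:+.2f}")      # positive
print(f"filtered-sample slope: {filtered:+.2f}")  # negative
```

Because the filter conditions on a variable influenced by both commits and stars, the filtered subsample exhibits a negative commits-stars association even though the association is positive in the full population, which is exactly why the paper recommends complete datasets or stratified random sampling instead.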

Original language: English
Title of host publication: Proceedings - 2024 IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering, WSESE 2024
Publisher: Association for Computing Machinery, Inc
Pages: 7-12
Number of pages: 6
ISBN (Electronic): 9798400705670
State: Published - Apr 16, 2024
Event: 1st International Workshop on Methodological Issues with Empirical Studies in Software Engineering, WSESE 2024 - Lisbon, Portugal
Duration: Apr 16, 2024 → …

Publication series

Name: Proceedings - 2024 IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering, WSESE 2024

Conference

Conference: 1st International Workshop on Methodological Issues with Empirical Studies in Software Engineering, WSESE 2024
Country/Territory: Portugal
City: Lisbon
Period: 04/16/24 → …

Funding

The work was partially supported by National Science Foundation awards 1633437, 1901102, 1925615, and 22120429. This manuscript has been authored by UT-Battelle, LLC, USA under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of the manuscript, or allow others to do so, for U.S. Government purposes. The DOE will provide public access to these results in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Keywords

  • empirical software engineering
  • filtering
  • mining software repositories
  • missing data problem
  • sampling
