Awesome Empirical Software Engineering 
A curated repository of data sets and tools that can be used for conducting evidence-based, data-driven research on software systems. This research approach is often termed experimental or empirical software engineering. Many of the data sets can also be useful in research using search-based software engineering methods. The repository is named after the Mining Software Repositories (MSR) conference series. For examples of such work, see the MSR conference's Hall of Fame.
- This list requires your input for its continuous improvement. Read the contribution guide for instructions on how you can contribute. Alternatively, you can send me an email if you find the process too cumbersome or confusing.
- For more awesome lists, see awesome.
Contents
Repositories
- ESEUR - All data used in the openly available book Evidence-based Software Engineering.
- Directory of MSR Datasets
- FLOSSmole - Collaborative collection and analysis of free/libre/open source project data.
- PROMISE - About 20 datasets related to software engineering research.
- SIR - Software-artifact infrastructure repository; Java, C, C++, and C# software together with test suites and fault data.
- Zenodo - Software data collections in CERN's open-access repository.
- Software Engineering Artifacts Can Really Assist Future Tasks
- Empirical Software Engineering
- Mining Software Repositories
Data Sets
- AndroidTimeMachine - Graph-based dataset of commit history of 8,431 real-world Android apps.
- AndroZoo - Collection of Android Applications.
- Bug Prediction Dataset - Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox Framework, Lucene, Mylyn, and their histories.
- Code Reviews - Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse.
- CoREBench - Collection of 70 realistically Complex Regression Errors that were systematically extracted from the repositories and bug reports of four open-source software projects: Make, Grep, Findutils, and Coreutils.
- Cryptocurrency GitHub Activity and Market Cap Dataset - Activity such as commits, stars, prices, and market cap of over 200 cryptocurrency projects on GitHub over time. Raw, historic data is also available.
- Defects4J - Collection of 395 reproducible bugs collected with the goal of advancing software testing research.
- Eclipse AERI stacktraces - Collection of stack traces of exceptions encountered by users of the Eclipse IDE, as retrieved by the AERI reporting system.
- Enron Spreadsheets and Emails - All the spreadsheets and emails used in the paper 'Enron's Spreadsheets and Related Emails: A Dataset and Analysis'.
- Findbugs-maven - Set of FindBugs reports for the Java projects of the Maven repository.
- GHTorrent - Scalable, queryable, offline mirror of data offered through the GitHub REST API.
- GitHub Bug Dataset - Bug dataset of 15 Java open-source projects, characterized by static source code metrics.
- GitHub on Google BigQuery - GitHub data accessible through Google's BigQuery platform.
- Grammar Zoo - Collection of grammars of DSLs and GPLs, some extracted from metamodels and document schemata.
- KaVE - Developer tool interaction data.
- Linux Kernel 4.21 Call Graphs - Call graphs of the Linux kernel 4.21, produced using CScout.