Awesome Web Archiving

Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Web archivists typically employ Web crawlers for automated capture due to the massive scale of the Web. Because Web standards keep evolving, archiving tools must evolve with them to ensure reliable and meaningful capture and replay of archived web pages.

Contents

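Training/Documentation

Resources for Web Publishers

Tools & Software

Acquisition

Replay

Search & Discovery

Utilities

WARC I/O Libraries

Analysis

Quality Assurance

Curation

Community Resources

Other Awesome Lists

Blogs and Scholarship

Mailing Lists

Slack

Twitter

Web Archiving Service Providers

Self-hostable, Open Source

Hosted, Closed Source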

Training/Documentation

Introductions to web archiving concepts:

What is a web archive? - A video from the UK Web Archive YouTube Channel

Wikipedia's List of Web Archiving Initiatives

Glossary of Archive-It and Web Archiving Terms

The Web Archiving Lifecycle Model - An attempt to incorporate the technological and programmatic arms of web archiving into a framework that will be relevant to any organization seeking to archive content from the web. Archive-It, the web archiving service from the Internet Archive, developed the model based on its work with memory institutions around the world.

Retrieving and Archiving Information from Websites by Wael Eskandar and Brad Murray

Training materials:

IIPC and DPC Training materials: module for beginners (8 sessions)

UNT Web Archiving Course

Continuing Education to Advance Web Archiving (CEDWARC)

A Whirlwind Tour of Common Crawl's Datasets using Python

The WARC Standard:

The warc-specifications community HTML version of the official specification and hub for new proposals.

The official ISO 28500 WARC specification homepage.
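
As a rough orientation before reading the specification, the sketch below shows the general shape of a single WARC record: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. It uses only the Python standard library, and the function names `build_warc_record` and `parse_warc_record` are hypothetical helpers invented for this illustration, not part of any existing tool. It handles one uncompressed record and skips most of what the standard covers, so for real work a dedicated WARC library is the better choice.

```python
from __future__ import annotations

import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes,
                      content_type: str = "text/plain") -> bytes:
    """Serialize one 'resource' record (hypothetical helper, illustration only)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # Version line, header fields, then an empty line before the block.
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Every record is terminated by two CRLFs after the block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def parse_warc_record(raw: bytes) -> tuple[dict[str, str], bytes]:
    """Split one uncompressed record into its header fields and content block."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers = dict(line.split(": ", 1) for line in lines[1:])
    # Content-Length says how many bytes of the remainder belong to the block.
    length = int(headers["Content-Length"])
    return headers, rest[:length]


if __name__ == "__main__":
    record = build_warc_record("https://example.org/", b"hello")
    fields, block = parse_warc_record(record)
    print(fields["WARC-Type"], fields["Content-Length"], block)
```

Reading the block by `Content-Length` rather than searching for a delimiter matters because archived payloads can themselves contain blank lines.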

For researchers using web archives:

GLAM Workbench: Web Archives - See also this related blog post on 'Asking questions with web archives'.

Archives Unleashed Toolkit documentation

Tutorial for Humanities researchers about how to explore Arquivo.pt

Resources for Web Publishers

These resources can help when working with individuals or organisations who publish on the web, and who want to make sure their site can be archived.

Definition of Web Archivability - This describes the ease with which web content can be preserved. (Archived version from the Stanford Libraries)

The Archive Ready tool, for estimating how likely it is that a web page will be archived successfully.

Tools & Software

This list of tools and software is intended to briefly describe some of the most important and widely used tools related to web archiving. For more details, we recommend you refer to (and contribute to!) these excellent resources from other groups:

Comparison of web archiving software

Awesome Website Change Monitoring

Acquisition

ArchiveBox - A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, Chrome headless, and other methods (formerly Bookmark Archiver). (In Development)
