Awesome Web Archiving 
Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Web archivists typically employ Web crawlers for automated capture due to the massive scale of the Web. Because Web standards and technologies keep changing, archiving tools must evolve continuously to ensure reliable and meaningful capture and replay of archived web pages.
Contents
- Training/Documentation
- Resources for Web Publishers
- Tools & Software
- Acquisition
- Replay
- Search & Discovery
- Utilities
- WARC I/O Libraries
- Analysis
- Quality Assurance
- Curation
- Community Resources
- Other Awesome Lists
- Blogs and Scholarship
- Mailing Lists
- Slack
- Web Archiving Service Providers
- Self-hostable, Open Source
- Hosted, Closed Source
Training/Documentation
- Introductions to web archiving concepts:
- What is a web archive? - A video from the UK Web Archive YouTube Channel
- Wikipedia's List of Web Archiving Initiatives
- Glossary of Archive-It and Web Archiving Terms
- The Web Archiving Lifecycle Model - An attempt to incorporate the technological and programmatic arms of web archiving into a framework that will be relevant to any organization seeking to archive content from the web. Archive-It, the web archiving service from the Internet Archive, developed the model based on its work with memory institutions around the world.
- Retrieving and Archiving Information from Websites by Wael Eskandar and Brad Murray
- Training materials:
- IIPC and DPC Training materials: module for beginners (8 sessions)
- UNT Web Archiving Course
- Continuing Education to Advance Web Archiving (CEDWARC)
- A Whirlwind Tour of Common Crawl's Datasets using Python
- The WARC Standard:
- The warc-specifications community HTML version of the official specification, which also serves as a hub for new proposals.
- The official ISO 28500 WARC specification homepage.
- For researchers using web archives:
- GLAM Workbench: Web Archives - See also this related blog post on 'Asking questions with web archives'.
- Archives Unleashed Toolkit documentation
- Tutorial for Humanities researchers about how to explore Arquivo.pt
Resources for Web Publishers
These resources can help when working with individuals or organisations who publish on the web, and who want to make sure their site can be archived.
- Definition of Web Archivability - This describes the ease with which web content can be preserved. (Archived version from the Stanford Libraries)
- The Archive Ready tool, for estimating how likely it is that a web page will be archived successfully.
Tools & Software
This list of tools and software is intended to briefly describe some of the most important and widely used tools related to web archiving. For more details, we recommend you refer to (and contribute to!) these excellent resources from other groups:
Acquisition
- ArchiveBox - A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, Chrome headless, and other methods (formerly Bookmark Archiver). (In Development)