2026-02-12: How to archive web pages in bulk using the Internet Archive Google Sheets service
| Results of archiving web pages using the Wayback Machine Google Sheets service |
One of the greatest services the Internet Archive (IA) offers is the Save Page Now (SPN) service. It is easy to use and it was completely overhauled in 2019 with added features. However, the UI is limited to archiving a single URL at a time. To overcome this limitation, the IA launched the Google Sheets service, which allows you to submit a Google Sheet with the first column full of URLs (up to 5,000 URLs), and it will archive them all in the Wayback Machine.
I needed to archive 1.5 million URLs that I collected from four Arabic and English news websites published between 1999 and 2022. This is obviously not doable one URL at a time, but the IA's Google Sheets service makes it possible, because I can archive 5,000 URLs at a time and up to 30,000 URLs (six sheets) per day.
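To put those limits in perspective, here is a rough back-of-the-envelope calculation. It assumes every sheet is filled to the 5,000-URL limit and that no URLs need to be resubmitted, so the real number of days is likely higher.

```python
import math

TOTAL_URLS = 1_500_000      # size of the collection mentioned above
URLS_PER_SHEET = 5_000      # per-sheet limit of the service
CAPTURES_PER_DAY = 30_000   # per-account daily limit

sheets_needed = math.ceil(TOTAL_URLS / URLS_PER_SHEET)
sheets_per_day = CAPTURES_PER_DAY // URLS_PER_SHEET
days_needed = math.ceil(TOTAL_URLS / CAPTURES_PER_DAY)

print(sheets_needed, sheets_per_day, days_needed)  # 300 6 50
```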
The first step is to create a Google sheet and paste the URLs you need to archive in the first column (one URL in each row). To save myself from creating tens of Google Sheets and copy-pasting 5,000 URLs into each one of them, I wrote a Python script that creates and populates multiple Google Sheets from URLs saved in a text file. The script takes an input text file with one URL per row and creates sheets with 5,000 URLs each.
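The batching step can be sketched as follows. This is not the script mentioned above: it is a minimal illustration that only splits the URL list into pieces of 5,000 and writes each piece to a single-column CSV file, which could then be imported into a Google sheet by hand. The function names and the `urls_001.csv` naming pattern are made up for this example.

```python
import csv
from pathlib import Path


def read_urls(path):
    """Return the non-empty lines of a text file (one URL per line)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def chunk(items, size=5000):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def write_batches(urls, out_dir, size=5000):
    """Write one single-column CSV per batch and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, batch in enumerate(chunk(urls, size), start=1):
        path = out / f"urls_{n:03d}.csv"  # hypothetical naming scheme
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows([u] for u in batch)
        paths.append(path)
    return paths
```

For example, `write_batches(read_urls("urls.txt"), "batches")` would turn a hypothetical `urls.txt` with 12,001 lines into three files holding 5,000, 5,000, and 2,001 URLs.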
| The input Google sheet with URLs |
To begin using Save Page Now for Google Sheets, log in to your IA account, or create a free account if you don't have one (you can also sign in with your Google account).
Next, go to the Wayback-GSheets service page, log in with the Google account that you used to create your spreadsheet, and grant the service the necessary permissions.
| Wayback-GSheets service page and Google account login |
If the sheet is ‘view only’ for your account, request edit access or make a copy of it. This allows the IA to log import errors to the sheet via your Google account and to populate the sheet's columns with the results.
Once you’ve authenticated, you’ll see three large green buttons: Save Page(s) Now, Include Links to Wayback Machine archives, and Check if URLs are available in the Live Web. Click on the first button from the top (Save Page(s) Now).
| Save Pages Now in the Wayback Machine GSheets service |
- Paste the URL of your spreadsheet with links in the Google Spreadsheet URL box.
- For the options, check or uncheck the checkbox next to each one based on your needs.
- The “Capture outlinks” box tells the service to also archive the outlinks of each URL in the sheet.
- "Capture screen shot" tells the service to capture a screenshot of each page in the sheet.
- "Save results in a new sheet" tells the service to leave your Google sheet unchanged and save the results in a newly created sheet instead.
- You can keep the “Capture only if not archived within 6 hours” option enabled or change it as needed.
- You can keep the “Delay the availability of new captures for ~10 hours” option enabled.
Finally, click the green “Archive” button.
| Options page in the Wayback Machine GSheets service |
You will receive an email saying that the IA is processing your sheet, with a link to watch the progress (it is the same page you landed on when you hit the "Archive" button, minus the "Abort" button).
| Progress page in the Wayback Machine GSheets service |
You will also get another email when the job is done. The spreadsheet will be updated with info about the status of all URLs in the sheet.
The Internet Archive Google Sheets Service will populate the columns in your sheet with Wayback Machine capture information next to each URL. The columns that get added are:
B: Yes/No (indicates whether or not the URL has been archived by the IA in the past)
C: The URL of the last archived copy if column B has Yes (empty if No)
D: The URL of the archived copy if IA was able to archive it (error message if not)
E: The number of outlinks captured in the web page
F: Whether the copy was archived for the first time (the same information as in column B)
| Results of archiving web pages using the Wayback Machine Google Sheets service |
Notes:
It is important to understand that it can take several days or even weeks for new captures to show up in the Wayback Machine, so don’t worry if URLs you’ve captured aren’t available right after your sheet has finished processing. The length of the delay depends on the current load on the Wayback Machine.
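If you want to check later whether a capture has become visible, one option outside the Google Sheets service is the Wayback Machine availability endpoint at `https://archive.org/wayback/available`. The sketch below assumes the JSON response has the shape I understand it to have (an `archived_snapshots` object with an optional `closest` entry). The helper names are my own, and the endpoint may lag behind new captures just as the Wayback Machine itself does.

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the query URL for a single page."""
    return API + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(response):
    """Return the closest snapshot URL from a parsed response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(page_url, timeout=30):
    """Query the endpoint over the network for one page."""
    with urllib.request.urlopen(availability_url(page_url), timeout=timeout) as r:
        return closest_snapshot(json.load(r))
```

Calling `check("https://example.com/")` returns a snapshot URL if one is reported and `None` otherwise. For thousands of URLs you would want to add a pause between requests.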
If some URLs were not successfully archived (produced errors), copy them to a new sheet and submit that sheet with just the URLs that weren’t captured. Do not resubmit the old sheet with all the URLs after it’s done. Alternatively, you can save the failed URLs to a new tab in the same sheet, but you have to make that tab the first one, because the Wayback-GSheets service only processes the first tab of a Google sheet. Organize your tabs accordingly.
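Picking out the failed URLs can be automated once a processed sheet is downloaded as CSV. The sketch below relies on the column layout described above and on one assumption of mine: a successful capture puts a link starting with `http` in column D, and anything else (an error message or an empty cell) counts as not captured. Whether the export has a header row is also an assumption, so it is a parameter.

```python
import csv


def failed_urls(rows, has_header=False):
    """Return URLs (column A) whose column D is not an archive link."""
    failed = []
    for i, row in enumerate(rows):
        if has_header and i == 0:
            continue
        if not row or not row[0].strip():
            continue  # skip blank rows
        result = row[3].strip() if len(row) > 3 else ""
        # Assumption: a successful capture is recorded as a link.
        if not result.lower().startswith("http"):
            failed.append(row[0].strip())
    return failed


def failed_urls_from_csv(path, has_header=False):
    """Read a CSV export of a processed sheet and return the URLs to retry."""
    with open(path, newline="", encoding="utf-8") as f:
        return failed_urls(csv.reader(f), has_header=has_header)
```

The returned list can be pasted into the first column of a fresh sheet (or of a new first tab) and submitted again.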
Again, the service only allows 5,000 URLs per sheet, and each account is only allowed 30,000 captures per day.