How to Archive a Website with Scrnify (JavaScript)

4/3/2025

Hey there! Laura and Heidi here from SCRNIFY! šŸ‡¦šŸ‡¹

Have you ever needed to preserve a website exactly as it appears at a specific moment in time? Maybe you're tracking changes to a competitor's site, creating documentation, or building a historical archive of important web content. Whatever your reason, website archiving is a crucial capability for many developers and organizations.

We've been there ourselves! While building SCRNIFY, we needed to capture and preserve websites for testing, debugging, and demonstration purposes. The challenge? Most DIY solutions are complex to set up, resource-intensive to maintain, and often produce inconsistent results. šŸ˜…

In this tutorial, we'll show you how to build a complete website archiving solution using JavaScript, Node.js, and the SCRNIFY API. You'll learn to:

  1. Set up a Node.js project for website archiving
  2. Crawl websites to discover pages
  3. Capture high-quality screenshots using SCRNIFY
  4. Save and organize your archives with proper timestamps
  5. Automate the entire process

By the end, you'll have a powerful, reliable tool for archiving any website with minimal effort. Let's dive in! ā˜•

Get free access to the SCRNIFY API during our open beta!

1. Understanding Website Archiving

Before we start coding, let's clarify what we mean by "website archiving." Unlike full-content archiving services like the Internet Archive's Wayback Machine, which store HTML, CSS, and other assets, our approach focuses on visual archiving through screenshots. This method has several advantages:

  • Visual accuracy: Captures exactly what users see, including dynamic content and complex layouts
  • Simplicity: No need to store asset dependencies or replay JavaScript yourself; the page is rendered once at capture time
  • Consistency: Provides uniform results across different websites
  • Accessibility: Screenshots are universally viewable without special software
  • Reduced storage: Often requires less space than full content archives

This approach is particularly valuable for tracking visual changes, creating documentation, or maintaining compliance records.

2. Prerequisites

Before we begin, make sure you have:

  • Node.js installed (v16 or higher recommended)
  • A SCRNIFY API key (sign up here to get one for free during our beta)
  • Basic knowledge of JavaScript and async/await
  • A code editor (VS Code, Sublime Text, etc.)

3. Setting Up Your Project

Let's start by creating a new Node.js project and installing the necessary dependencies.

Create a Project Directory

mkdir website-archiver
cd website-archiver
npm init -y

Install Dependencies

We'll need several packages for our project:

npm install axios dotenv crawlee date-fns fs-extra

Here's what each package does:

  • axios: For making HTTP requests to the SCRNIFY API
  • dotenv: For managing environment variables (like our API key)
  • crawlee: A powerful web crawling library that works with various engines
  • date-fns: For easy date formatting and manipulation
  • fs-extra: Enhanced file system methods

Configure Environment Variables

Create a .env file in your project root to securely store your SCRNIFY API key:

SCRNIFY_API_KEY=your_api_key_here

Make sure to add .env to your .gitignore file if you're using version control:

echo ".env" >> .gitignore

4. Crawling a Website

The first step in our archiving process is to crawl the target website to discover all the pages we want to archive. We'll use Crawlee, a powerful library that simplifies web crawling.

Create a file named crawler.js:

// crawler.js
const { CheerioCrawler, RequestQueue } = require('crawlee');

/**
 * Crawls a website and returns all discovered URLs
 * @param {string} startUrl - The URL to start crawling from
 * @param {number} maxUrls - Maximum number of URLs to crawl (default: 100)
 * @param {string} domain - Optional domain restriction
 * @returns {Promise<string[]>} - Array of discovered URLs
 */
async function crawlWebsite(startUrl, maxUrls = 100, domain = null) {
    console.log(`Starting to crawl: ${startUrl}`);

    // Extract domain from startUrl if not provided
    if (!domain) {
        const urlObj = new URL(startUrl);
        domain = urlObj.hostname;
    }

    const requestQueue = await RequestQueue.open();
    await requestQueue.addRequest({ url: startUrl });

    const discoveredUrls = new Set();

    // Create a crawler
    const crawler = new CheerioCrawler({
        requestQueue,
        maxRequestsPerCrawl: maxUrls,
        // Record each visited page and enqueue same-domain links found on it
        requestHandler: async ({ request, $ }) => {
            const currentUrl = request.url;
            discoveredUrls.add(currentUrl);

            // Find all links on the page
            const links = $('a[href]')
                .map((_, el) => $(el).attr('href'))
                .get();

            // Process each link
            for (const link of links) {
                try {
                    // Resolve relative URLs and drop #fragments so the same
                    // page isn't queued more than once
                    let absoluteUrl;
                    try {
                        const resolved = new URL(link, currentUrl);
                        resolved.hash = '';
                        absoluteUrl = resolved.href;
                    } catch {
                        continue; // Skip invalid URLs
                    }

                    const linkDomain = new URL(absoluteUrl).hostname;

                    // Only add URLs from the same domain and that we haven't seen yet
                    if (linkDomain === domain && !discoveredUrls.has(absoluteUrl)) {
                        await requestQueue.addRequest({ url: absoluteUrl });
                    }
                } catch (error) {
                    // Skip problematic URLs
                    console.warn(`Error processing link ${link}: ${error.message}`);
                }
            }
        },
    });

    // Start the crawler
    await crawler.run();

    console.log(`Crawling complete. Discovered ${discoveredUrls.size} URLs.`);
    return Array.from(discoveredUrls);
}

module.exports = { crawlWebsite };

This crawler uses Cheerio, a fast and lightweight implementation of jQuery for the server. It's perfect for basic crawling needs and doesn't require a full browser instance. Keep in mind that Cheerio only parses the static HTML, so links injected by client-side JavaScript won't be discovered; if you need that, Crawlee's PlaywrightCrawler is a drop-in alternative.
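
If you want to sanity-check the crawler on its own before wiring up screenshots, a quick one-off script like this (the filename test-crawler.js is just a suggestion) prints the discovered URLs:

// test-crawler.js (optional, for trying out the crawler by itself)
const { crawlWebsite } = require('./crawler');

crawlWebsite('https://example.com', 10)
    .then(urls => {
        console.log(`Found ${urls.length} URLs:`);
        urls.forEach(url => console.log(` - ${url}`));
    })
    .catch(error => console.error('Crawl failed:', error.message));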

5. Taking Screenshots with SCRNIFY

Now that we can discover URLs, let's create a module to capture screenshots using SCRNIFY's API. Create a file named screenshot.js:

// screenshot.js
require('dotenv').config();
const axios = require('axios');
const fs = require('fs-extra');
const path = require('path');
const { format } = require('date-fns');

// SCRNIFY API base URL
const SCRNIFY_API_URL = 'https://api.scrnify.com/capture';
const API_KEY = process.env.SCRNIFY_API_KEY;

if (!API_KEY) {
    throw new Error('SCRNIFY_API_KEY is not defined in .env file');
}

/**
 * Captures a screenshot of a URL using SCRNIFY API
 * @param {string} url - The URL to capture
 * @param {Object} options - Screenshot options
 * @returns {Promise<Buffer>} - Screenshot data as Buffer
 */
async function captureScreenshot(url, options = {}) {
    // Default options
    const defaultOptions = {
        type: 'image',
        format: 'png',
        width: 1920,
        fullPage: true,
        waitUntil: 'networkIdle', // Wait until network is idle for best results
    };

    // Merge defaults with any caller-supplied options and build the query string.
    // URLSearchParams takes care of URL-encoding every value, including the target URL.
    const queryString = new URLSearchParams({
        key: API_KEY,
        url,
        ...defaultOptions,
        ...options,
    }).toString();

    try {
        console.log(`Capturing screenshot of: ${url}`);
        const response = await axios({
            method: 'get',
            url: `${SCRNIFY_API_URL}?${queryString}`,
            responseType: 'arraybuffer',
        });

        return Buffer.from(response.data); // response.data is already raw binary (arraybuffer)
    } catch (error) {
        if (error.response) {
            const errorData = Buffer.from(error.response.data).toString();
            console.error(`Error capturing screenshot: ${errorData}`);
        } else {
            console.error(`Error capturing screenshot: ${error.message}`);
        }
        throw error;
    }
}

/**
 * Saves a screenshot to disk with timestamp
 * @param {Buffer} screenshotData - The screenshot data
 * @param {string} url - The URL that was captured
 * @param {string} outputDir - Directory to save screenshots
 * @returns {Promise<string>} - Path to saved file
 */
async function saveScreenshot(screenshotData, url, outputDir = 'archives') {
    // Create output directory if it doesn't exist
    await fs.ensureDir(outputDir);

    // Generate a filename based on the URL and current date
    const urlObj = new URL(url);
    const hostname = urlObj.hostname;
    const pathname = urlObj.pathname.replace(/\//g, '_');
    const timestamp = format(new Date(), 'yyyy-MM-dd_HH-mm-ss');

    // Create a sanitized filename
    let filename = `${hostname}${pathname}`;
    if (filename.length > 200) {
        // Truncate if too long
        filename = filename.substring(0, 200);
    }
    filename = `${filename}_${timestamp}.png`;

    // Create a path for the file
    const filePath = path.join(outputDir, filename);

    // Save the screenshot
    await fs.writeFile(filePath, screenshotData);
    console.log(`Screenshot saved to: ${filePath}`);

    return filePath;
}

module.exports = { captureScreenshot, saveScreenshot };

This module handles both capturing screenshots via the SCRNIFY API and saving them to disk with appropriate filenames that include timestamps.
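
To try the screenshot module in isolation, a small one-off script like this (filename assumed) captures and saves a single page using the two functions above:

// test-screenshot.js (optional one-off check of the screenshot module)
const { captureScreenshot, saveScreenshot } = require('./screenshot');

(async () => {
    try {
        const data = await captureScreenshot('https://example.com');
        const filePath = await saveScreenshot(data, 'https://example.com');
        console.log(`Saved test screenshot to ${filePath}`);
    } catch (error) {
        console.error('Test capture failed:', error.message);
    }
})();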

6. Putting It All Together

Now, let's create our main application file that combines the crawler and screenshot functionality. Create a file named archiver.js:

// archiver.js
require('dotenv').config();
const path = require('path');
const fs = require('fs-extra');
const { format } = require('date-fns');
const { crawlWebsite } = require('./crawler');
const { captureScreenshot, saveScreenshot } = require('./screenshot');

/**
 * Archives a website by crawling it and taking screenshots
 * @param {string} url - The URL to archive
 * @param {Object} options - Archiving options
 */
async function archiveWebsite(url, options = {}) {
    const {
        maxUrls = 50,
        outputDir = 'archives',
        screenshotOptions = {},
        delay = 1000, // Delay between screenshots in ms
    } = options;

    try {
        // Create a timestamped directory for this archive session
        const timestamp = format(new Date(), 'yyyy-MM-dd_HH-mm-ss');
        const sessionDir = path.join(outputDir, `archive_${timestamp}`);
        await fs.ensureDir(sessionDir);

        // Create a log file
        const logPath = path.join(sessionDir, 'archive_log.txt');
        const logStream = fs.createWriteStream(logPath, { flags: 'a' });
        const log = (message) => {
            const timestampedMessage = `[${format(new Date(), 'yyyy-MM-dd HH:mm:ss')}] ${message}`;
            console.log(timestampedMessage);
            logStream.write(timestampedMessage + '\n');
        };

        log(`Starting archive of: ${url}`);
        log(`Max URLs to crawl: ${maxUrls}`);

        // Step 1: Crawl the website to discover URLs
        log('Starting website crawl...');
        const urls = await crawlWebsite(url, maxUrls);
        log(`Crawl complete. Discovered ${urls.length} URLs.`);

        // Save the list of URLs
        const urlListPath = path.join(sessionDir, 'urls.txt');
        await fs.writeFile(urlListPath, urls.join('\n'));
        log(`URL list saved to: ${urlListPath}`);

        // Step 2: Capture screenshots for each URL
        log('Starting screenshot capture...');

        let successCount = 0;
        let failureCount = 0;

        for (let i = 0; i < urls.length; i++) {
            const currentUrl = urls[i];
            try {
                log(`Capturing screenshot ${i + 1}/${urls.length}: ${currentUrl}`);

                // Capture the screenshot
                const screenshotData = await captureScreenshot(currentUrl, screenshotOptions);

                // Save the screenshot
                const filePath = await saveScreenshot(screenshotData, currentUrl, sessionDir);
                log(`Screenshot saved: ${filePath}`);

                successCount++;

                // Add a delay to avoid overwhelming the API
                if (i < urls.length - 1) {
                    await new Promise(resolve => setTimeout(resolve, delay));
                }
            } catch (error) {
                log(`Error capturing ${currentUrl}: ${error.message}`);
                failureCount++;
            }
        }

        // Log completion statistics
        log(`Archive complete!`);
        log(`Total URLs: ${urls.length}`);
        log(`Successful screenshots: ${successCount}`);
        log(`Failed screenshots: ${failureCount}`);

        logStream.end();
        return {
            totalUrls: urls.length,
            successCount,
            failureCount,
            archiveDir: sessionDir,
        };
    } catch (error) {
        console.error(`Archive failed: ${error.message}`);
        throw error;
    }
}

module.exports = { archiveWebsite };

// If this file is run directly (not imported)
if (require.main === module) {
    // Get URL from command line arguments
    const url = process.argv[2];

    if (!url) {
        console.error('Please provide a URL to archive');
        console.error('Usage: node archiver.js https://example.com');
        process.exit(1);
    }

    // Run the archiver
    archiveWebsite(url)
        .then(result => {
            console.log('Archive completed successfully!');
            console.log(`Results saved in: ${result.archiveDir}`);
        })
        .catch(error => {
            console.error('Archive failed:', error.message);
            process.exit(1);
        });
}

This file brings everything together and provides a complete archiving solution. It:

  1. Crawls the website to discover URLs
  2. Creates a timestamped directory for the archive session
  3. Captures screenshots of each discovered URL
  4. Saves detailed logs of the process
  5. Handles errors gracefully

7. Running the Archiver

Now that we have all the pieces in place, let's create a simple command-line interface to run our archiver. Create a file named index.js:

// index.js
const { archiveWebsite } = require('./archiver');

// Parse command line arguments
const args = process.argv.slice(2);
const url = args[0];
const maxUrls = parseInt(args[1], 10) || 50;

if (!url) {
    console.error('Please provide a URL to archive');
    console.error('Usage: node index.js https://example.com [maxUrls]');
    process.exit(1);
}

console.log(`šŸ”Ø Website Archiver using SCRNIFY šŸ“ø`);
console.log(`URL: ${url}`);
console.log(`Max URLs: ${maxUrls}`);
console.log('-----------------------------------');

// Run the archiver
archiveWebsite(url, { maxUrls })
    .then(result => {
        console.log('\nāœ… Archive completed successfully!');
        console.log(`šŸ“ Results saved in: ${result.archiveDir}`);
        console.log(`šŸ“Š Stats:`);
        console.log(`   - Total URLs: ${result.totalUrls}`);
        console.log(`   - Successful screenshots: ${result.successCount}`);
        console.log(`   - Failed screenshots: ${result.failureCount}`);
    })
    .catch(error => {
        console.error('\nāŒ Archive failed:', error.message);
        process.exit(1);
    });

Now you can run the archiver from the command line:

node index.js https://example.com 20

This will archive up to 20 pages from example.com.
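
After a run, the session directory will contain something along these lines (the exact filenames depend on the site and the time of the run):

archives/
└── archive_2025-04-03_12-00-00/
    ā”œā”€ā”€ archive_log.txt
    ā”œā”€ā”€ urls.txt
    ā”œā”€ā”€ example.com__2025-04-03_12-00-05.png
    └── example.com_about_2025-04-03_12-00-08.png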

8. Advanced Customization

Our basic archiver is already powerful, but there are several ways you can enhance it for your specific needs:

Custom Screenshot Options

SCRNIFY's API offers many options for customizing your screenshots. You can pass these options when calling archiveWebsite:

archiveWebsite('https://example.com', {
    maxUrls: 20,
    screenshotOptions: {
        width: 1280,
        height: 800,
        format: 'jpeg',
        quality: 90,
        fullPage: true,
        waitUntil: 'networkIdle',
    }
});

Scheduled Archiving

For regular archiving, you can set up a cron job or use a scheduling library like node-cron:

// Install with: npm install node-cron
const cron = require('node-cron');
const { archiveWebsite } = require('./archiver');

// Run every day at midnight
cron.schedule('0 0 * * *', async () => {
    try {
        console.log('Running scheduled archive...');
        await archiveWebsite('https://example.com', { maxUrls: 50 });
        console.log('Scheduled archive completed successfully!');
    } catch (error) {
        console.error('Scheduled archive failed:', error.message);
    }
});
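
If you prefer the operating system's scheduler over node-cron, an equivalent crontab entry could look like this (the project path and URL are placeholders you'd adjust):

# Run the archiver every night at midnight and append output to a log file
0 0 * * * cd /path/to/website-archiver && node index.js https://example.com 50 >> cron.log 2>&1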

Filtering URLs

You might want to exclude certain URLs or only include specific patterns:

// Add to crawler.js
function shouldCrawlUrl(url) {
    // Skip login pages, admin areas, etc.
    if (url.includes('/login') || url.includes('/admin')) {
        return false;
    }

    // Only include blog posts
    if (!url.includes('/blog/')) {
        return false;
    }

    return true;
}

Then update the crawler's requestHandler to use this function.
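
Here's a minimal sketch of that change, assuming shouldCrawlUrl lives in the same file; the extra check is applied just before a link is added to the queue:

// In crawler.js, inside requestHandler, where discovered links are enqueued:
if (linkDomain === domain && shouldCrawlUrl(absoluteUrl) && !discoveredUrls.has(absoluteUrl)) {
    await requestQueue.addRequest({ url: absoluteUrl });
}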

Parallel Processing

For faster archiving of large sites, you can process URLs in parallel:

// Add to archiver.js
const pLimit = require('p-limit'); // npm install p-limit@3 (newer versions are ESM-only)

// In the archiveWebsite function, replace the sequential for-loop with:
const limit = pLimit(5); // Process 5 URLs concurrently

// Capture and save a single URL (screenshotOptions and sessionDir come from the enclosing scope)
const processUrl = async (url) => {
    const screenshotData = await captureScreenshot(url, screenshotOptions);
    return saveScreenshot(screenshotData, url, sessionDir);
};

const promises = urls.map(url => limit(() => processUrl(url)));
const results = await Promise.allSettled(promises); // allSettled: one failed capture won't abort the batch

9. Conclusion

Congratulations! šŸŽ‰ You've built a powerful website archiving system using JavaScript and SCRNIFY's screenshot API. This solution allows you to:

  • Automatically discover and crawl website pages
  • Capture high-quality screenshots of each page
  • Save archives with proper timestamps and organization
  • Customize the archiving process to your specific needs

Website archiving has never been easier! With SCRNIFY handling the complex task of rendering and capturing web pages, you can focus on building features that matter to your specific use case.

The approach we've outlined is perfect for:

  • Tracking changes to websites over time
  • Creating visual documentation
  • Compliance and record-keeping
  • Preserving important web content
  • Building a historical archive

Get free access to the SCRNIFY API during our open beta and start generating screenshots today! Sign up here

Have you built something cool with this tutorial? We'd love to see it! Share your projects with us on Twitter @scrnify.

Cheers, Laura & Heidi šŸ‡¦šŸ‡¹

P.S. Check out our introductory post about SCRNIFY to learn more about our journey and the problems we're solving!
