Creating Custom Scraping Scripts with Ruby and Nokogiri

published on 28 February 2025

Want to extract data from websites without relying on third-party tools? This guide will show you how to create custom web scraping scripts using Ruby and Nokogiri. Here's what you'll learn:

  • Why Ruby and Nokogiri?: Ruby's simple syntax and Nokogiri's powerful HTML/XML parsing make them ideal for scraping tasks.
  • Setup Made Easy: Step-by-step instructions for installing Ruby, Nokogiri, and essential dependencies on Windows, macOS, and Linux.
  • Scraping Basics: Learn how to fetch web pages, parse HTML, and extract data using CSS selectors or XPath.
  • Advanced Techniques: Handle JavaScript-heavy sites, manage errors, and optimize performance for large-scale scraping.
  • Legal and Ethical Practices: Stay compliant with laws and avoid getting blocked by websites.

Quick Comparison: Ruby/Nokogiri vs. InstantAPI.ai

Feature | Ruby & Nokogiri | InstantAPI.ai
JavaScript Handling | Requires extra setup | Automatic
Proxy Management | Manual configuration | Built-in
CAPTCHA Solving | Additional libraries | Fully automated
Ease of Use | Full control, more effort | Quick and beginner-friendly
Cost | Free (self-managed) | Pay-per-use plans

Whether you're building a small scraper for personal projects or need a scalable solution for enterprise tasks, this guide has you covered.

Video: Scraping the Web with Ruby - Part 1

Setup Requirements

Before starting with web scraping, you need to set up your development environment correctly. Here's a step-by-step guide for different operating systems.

Ruby and Nokogiri Installation

The installation process depends on your operating system. Follow the instructions below:

Operating System | Ruby Installation | Nokogiri Installation | Notes
Windows | Use RubyInstaller with DevKit | gem install nokogiri (or gem install nokogiri --platform=ruby -- --use-system-libraries) | Ensure Ruby is added to your system PATH.
macOS | brew install ruby | gem install nokogiri (or gem install nokogiri --platform=ruby -- --use-system-libraries) | You may need to install libxml2 and libxslt using Homebrew: brew install libxml2 libxslt.
Ubuntu/Debian | sudo apt-get install ruby-full | Install dependencies: sudo apt-get install build-essential pkg-config libxml2-dev libxslt-dev, then run gem install nokogiri --platform=ruby -- --use-system-libraries. |
Fedora/Red Hat/CentOS | sudo dnf install ruby (use yum on older releases) | Install dependencies: sudo dnf install -y zlib-devel xz patch, then run gem install nokogiri --platform=ruby. |

For macOS, if you encounter issues, install the required libraries with Homebrew and retry:

brew install libxml2 libxslt
gem install nokogiri --platform=ruby -- --use-system-libraries

Setting Up Your Project Structure

After configuring your environment, organize your project directory like this:

scraper/
├── bin/
│   └── scrape        # Command-line executable
├── lib/
│   ├── scraper.rb    # Main library file
│   └── scraper/      # Core classes
├── Gemfile           # Dependencies
├── README.md         # Project overview
└── test/             # Test files

To create a new project, use Bundler:

bundle gem scraper

Add the required gems to your Gemfile:

source 'https://rubygems.org'

gem 'nokogiri'
gem 'httparty'

Then, install the dependencies and update your tools:

bundle install
gem update --system
gem install bundler
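
With the gems in place, a minimal lib/scraper.rb might look like the sketch below. The Scraper class and its fetch method are illustrative placeholders, not code generated by Bundler:

# lib/scraper.rb: a hypothetical starting point for the main library file
require 'httparty'
require 'nokogiri'

class Scraper
  # Fetch a URL and return a parsed Nokogiri document
  def fetch(url)
    response = HTTParty.get(url)
    Nokogiri::HTML(response.body)
  end
end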

For Docker users working with Alpine-based images, include the following in your Dockerfile:

FROM ruby:3.0-alpine
RUN apk add --no-cache build-base libxml2-dev libxslt-dev
RUN gem install nokogiri --platform=ruby -- --use-system-libraries

Verifying Your Installation

To confirm everything is set up correctly, run this simple script:

require 'nokogiri'
puts Nokogiri::VERSION

Once you see the Nokogiri version printed, you're ready to start building your web scraping scripts!
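
The nokogiri gem also installs a small command-line tool. Assuming its executable is on your PATH, running it with -v prints the gem version along with details about the libxml2/libxslt libraries it was built against:

nokogiri -v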

Writing Basic Scraping Scripts

Website Selection Guidelines

Before scraping a website, it's important to evaluate it thoroughly. Start by checking the robots.txt file, which outlines the site's rules for automated access. For example, you can find Amazon's policies at amazon.com/robots.txt.
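
As a quick illustration, here is a minimal sketch that downloads a robots.txt file with HTTParty and lists its Disallow rules. The naive line parsing is only for inspection; for real compliance checks, use a dedicated parser such as the robotstxt gem:

require 'httparty'

# Fetch a site's robots.txt and list the paths it disallows
robots = HTTParty.get('https://example.com/robots.txt').body
disallowed = robots.lines.grep(/^Disallow:/i).map { |line| line.split(':', 2).last.strip }
puts "Disallowed paths: #{disallowed.inspect}"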

Here are some key factors to consider when choosing a website:

  • Legal compliance: Always review the terms of service for any restrictions on scraping.
  • Technical feasibility: Check if the page structure is stable and whether it relies heavily on JavaScript.
  • Data accessibility: Ensure the data you need is publicly available.
  • Rate limiting: Look for restrictions on request frequency and set proper delays to avoid being blocked.

Whenever possible, opt for websites that provide official APIs. For example, Reddit offers an API that is far more reliable and efficient than scraping their HTML content.
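
For instance, many Reddit listing pages can be read as JSON simply by appending .json to the URL. The sketch below assumes that endpoint is available to you and sends a descriptive User-Agent, as Reddit expects; for anything beyond experimentation, use Reddit's official OAuth API:

require 'httparty'
require 'json'

response = HTTParty.get('https://www.reddit.com/r/ruby/top.json',
                        headers: { 'User-Agent' => 'ruby:example-scraper:v1.0 (educational demo)' })
posts = JSON.parse(response.body).dig('data', 'children') || []
posts.each { |post| puts post.dig('data', 'title') }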

HTML Data Collection

Once you've selected a website, you can start retrieving its content. A combination of HTTParty for HTTP requests and Nokogiri for parsing HTML works well. Here's a simple script to get started:

require 'httparty'
require 'nokogiri'

def fetch_page(url)
  headers = {
    'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept' => 'text/html,application/xhtml+xml',
    'Referer' => 'https://www.google.com'
  }

  response = HTTParty.get(url, headers: headers)
  Nokogiri::HTML(response.body)
end

# Example usage
doc = fetch_page('https://example.com')

Once the page is loaded, you can use tailored selectors to extract the data you need.

Data Selection Methods

Nokogiri provides two main ways to select elements: CSS selectors and XPath. For most tasks, CSS selectors are easier to read and maintain.

Here’s how you can extract specific elements using CSS selectors:

# Extract titles using CSS selectors
titles = doc.css('.title').map { |t| t.text.strip }

# Handle missing data gracefully
description = doc.css('.description').first&.text&.strip || 'No description available'
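
For comparison, here is roughly the same extraction using XPath. The first expression matches elements whose class attribute is exactly "title"; the second also handles elements that carry multiple classes:

# Exact class match
titles = doc.xpath('//*[@class="title"]').map { |t| t.text.strip }

# Class match that also works when elements have several classes
titles = doc.xpath('//*[contains(concat(" ", normalize-space(@class), " "), " title ")]').map { |t| t.text.strip }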

For more complex structures, you can create dedicated methods for better organization:

class WebScraper
  def initialize(url)
    @doc = fetch_page(url)
  end

  def extract_product_details
    {
      name: extract_name,
      price: extract_price,
      availability: check_availability
    }
  end

  private

  def extract_name
    @doc.css('h1.product-title').text.strip
  end

  def extract_price
    price_element = @doc.css('.price').first
    price_element ? price_element.text.gsub(/[^\d.]/, '') : nil
  end

  def check_availability
    @doc.css('.stock-status').text.downcase.include?('in stock')
  end
end

"Using Nokogiri to web scrape data from a website is an incredibly powerful and useful tool to have in your programmer's toolbox. Learning how to web scrape can allow you to automate data collection that would be tedious and time consuming if you were to do it manually!" - Brian Cheung, Software Engineer

Advanced Scraping Methods

Scraping dynamic content demands advanced techniques to ensure data is collected efficiently and accurately. Below are strategies tailored for Ruby and Nokogiri.

Handling JavaScript-Driven Content

Since Nokogiri can't process JavaScript, you'll need alternative methods. Here are two practical approaches:

Method 1: Using the Kimurai Framework
Kimurai integrates Nokogiri's parsing tools with Capybara for handling JavaScript-heavy content. Here's an example:

require 'kimurai'

class ProductScraper < Kimurai::Base
  @name = "product_scraper"
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/products"]
  @config = {
    user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    before_request: { delay: 2 }
  }

  # Kimurai passes the rendered page to parse as a Nokogiri document
  def parse(response, url:, data: {})
    response.css('.product').each do |product|
      item = {
        name: product.css('.product-title').text.strip,
        price: product.css('.price').text.strip
      }
      save_to "products.json", item, format: :json
    end
  end
end

Method 2: Extracting Data from JavaScript Variables
Some websites store data directly in JavaScript variables. You can parse this data as follows:

require 'json'

def extract_js_data(doc)
  script = doc.css('script').find { |s| s.text.include?('productData') }
  match = script&.text&.match(/productData\s*=\s*(.*?);/m)
  match ? JSON.parse(match[1]) : nil
end

Managing Errors

Effective error management is key to reliable scraping. Here's an example of implementing retries and error logging:

require 'httparty'

class ScraperWithRetry
  MAX_RETRIES = 3
  BACKOFF_FACTOR = 2

  def fetch_with_retry(url)
    retries = 0
    begin
      response = HTTParty.get(url)
      raise "Rate limit reached" if response.code == 429
      response
    rescue => e
      retries += 1
      if retries <= MAX_RETRIES
        sleep(BACKOFF_FACTOR ** retries) # exponential backoff: 2s, 4s, 8s
        retry
      else
        log_error(url, e)
        raise
      end
    end
  end

  private

  # Minimal error log; swap in a real logger in production
  def log_error(url, error)
    warn "[#{Time.now}] #{url} failed: #{error.class}: #{error.message}"
  end
end

Key strategies for error handling include:

  • Using exponential backoff for retries
  • Monitoring response times and error rates
  • Rotating IP addresses with proxies (see the sketch after this list)
  • Validating the format of extracted data
  • Keeping detailed error logs for debugging
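
As a sketch of the proxy rotation point above: HTTParty accepts a proxy host and port through its http_proxyaddr and http_proxyport options, so rotating proxies can be as simple as sampling from a pool. The proxy hostnames below are placeholders for your own provider's endpoints:

require 'httparty'

# Hypothetical proxy pool; replace with real proxy endpoints
PROXIES = [
  { addr: 'proxy1.example.com', port: 8080 },
  { addr: 'proxy2.example.com', port: 8080 }
]

def fetch_through_proxy(url)
  proxy = PROXIES.sample # pick a proxy at random for each request
  HTTParty.get(url, http_proxyaddr: proxy[:addr], http_proxyport: proxy[:port])
end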

Once your error management is solid, ensure your scraping practices comply with legal and ethical standards.

Requirement | Implementation
Rate Limiting | Add 2–5 second delays between requests
robots.txt | Use the robotstxt gem to verify rules
Data Usage | Ensure adherence to terms of service
Personal Data | Obtain explicit consent
Server Load | Use throttling to reduce server strain
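
A minimal way to implement the rate-limiting row above is to sleep for a random interval between requests. The URL list is illustrative, and fetch_page is the helper defined earlier in this guide:

urls = ['https://example.com/page1', 'https://example.com/page2']

urls.each do |url|
  doc = fetch_page(url)
  # ... extract and store data here ...
  sleep(rand(2.0..5.0)) # wait 2-5 seconds between requests to reduce server load
end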

"Web scraping is like treasure hunting - except sometimes the map leads you to a big, fat error message instead of gold." - Ize Majebi, Python developer and data enthusiast

Remember, non-compliance with regulations like GDPR can result in severe penalties, including fines up to €20 million or 4% of global revenue. The HiQ Labs v. LinkedIn case clarified that scraping public data is generally allowed, but actions like using fake accounts could lead to legal trouble.

Best practices for advanced scraping:

  • Regularly monitor performance metrics
  • Rotate user agents to avoid detection (see the sketch after this list)
  • Use headless browsers for dynamic content
  • Maintain comprehensive error logs
  • Update your scripts to adapt to website changes
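
Here is a simple sketch of the user-agent rotation point from the list above. The strings are common desktop user agents; keep your own list current and realistic:

require 'httparty'

USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]

def fetch_with_random_agent(url)
  HTTParty.get(url, headers: { 'User-Agent' => USER_AGENTS.sample })
end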

Speed and Scale Improvements

This section shows how to speed up your scripts and manage data collection efficiently at scale.

Script Performance Tips

Managing memory effectively is key for fast data scraping. Here's an example of using SAX parsing for better memory efficiency:

require 'nokogiri'

# Example of memory-efficient parsing using SAX
class MyDocHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    return unless name == 'product'
    # Process each product element without loading the entire document into memory
    process_product(attrs)
  end

  private

  # Placeholder: handle the attributes of a single <product> element
  def process_product(attrs)
    puts attrs.to_h.inspect
  end
end

parser = Nokogiri::XML::SAX::Parser.new(MyDocHandler.new)
parser.parse(File.open("large_file.xml"))

Some proven optimization strategies include:

  • Using SAX parsing for large XML files instead of loading the entire file into memory (DOM parsing).
  • Employing batch processing with targeted XPath queries to reduce overhead.
  • Caching nodes temporarily and freeing up memory as soon as they're processed.
  • Regularly profiling your code with tools like ruby-prof to identify bottlenecks (see the sketch below).
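
A minimal profiling sketch with ruby-prof might look like this, assuming the ruby-prof gem is installed; the profiled block is a stand-in for your own parsing code:

require 'ruby-prof'
require 'nokogiri'

result = RubyProf.profile do
  # Replace with the code you want to measure, e.g. parsing a saved page
  Nokogiri::HTML(File.read('sample_page.html')).css('.title').map(&:text)
end

# Print a flat report showing where time was spent
RubyProf::FlatPrinter.new(result).print(STDOUT)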

After addressing script performance, focus on storage solutions to handle growing data efficiently.

Data Storage Options

Choosing the right storage method can significantly impact scalability and speed. Here's a quick comparison of popular options:

Storage Type | Best For | Performance Impact | Scaling Capability
PostgreSQL | Structured data | High with indexing | Vertical scaling
MongoDB | Unstructured data | Fast writes | Horizontal scaling
Amazon S3 | Large files | 99.99% availability | Unlimited scaling
Local CSV/JSON | Small datasets | Quick access | Limited by disk space
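
For the local CSV option, Ruby's standard csv library is usually enough. In this short sketch, the products array stands in for whatever your scraper extracted:

require 'csv'

products = [
  { name: 'Widget', price: '19.99' },
  { name: 'Gadget', price: '24.50' }
]

# Write the extracted records to a local CSV file with a header row
CSV.open('products.csv', 'w') do |csv|
  csv << %w[name price]
  products.each { |product| csv << [product[:name], product[:price]] }
end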

For databases like PostgreSQL, using batch inserts can improve write speeds. Here's an example:

# Batch insert example with PostgreSQL via ActiveRecord (insert_all is available from Rails 6)
def batch_save(records, batch_size = 1000)
  records.each_slice(batch_size) do |batch|
    Product.insert_all(batch)
    GC.start # Trigger garbage collection after each batch
  end
end

Efficient storage and processing ensure your system can handle large-scale tasks without breaking a sweat.

Task Scheduling

Automating tasks is essential to keep data collection running smoothly. Tools like the whenever gem, which translates a Ruby schedule file into cron entries, make scheduling straightforward:

# config/schedule.rb
every 1.day, at: '2:30 am' do
  runner "ScraperJob.perform_now"
end

For concurrent requests, Typhoeus with Hydra is a great tool. It allows you to process multiple requests simultaneously:

require 'typhoeus'

# Hydra queues requests and runs them concurrently
hydra = Typhoeus::Hydra.new(max_concurrency: 10)
urls = ['https://example.com/page1', 'https://example.com/page2']

urls.each do |url|
  request = Typhoeus::Request.new(url,
    followlocation: true,
    headers: { 'User-Agent' => 'Custom Ruby Scraper 1.0' }
  )

  request.on_complete do |response|
    process_response(response) # your own parsing/storage logic
  end

  hydra.queue(request)
end

hydra.run # Execute all queued requests in parallel

These techniques help maintain consistent performance and scalability, even as your data collection grows.

InstantAPI.ai Overview

InstantAPI.ai steps in as a streamlined solution to the challenges of custom Ruby-based scraping. This AI-driven platform simplifies tasks that typically require intricate scripting. Below, we’ll dive into its features, compare it with custom scripts, and break down its pricing options.

InstantAPI.ai Main Features

InstantAPI.ai tackles the technical hurdles of web scraping, offering automation for JavaScript rendering, proxy rotation, and CAPTCHA solving. It provides both a Chrome extension and an API, giving users flexible tools for their scraping needs. Here's how it stacks up against custom Ruby scripts:

Feature | Custom Ruby Scripts | InstantAPI.ai
JavaScript Handling | Requires manual setup | Automatic rendering
Proxy Management | Needs custom configuration | Built-in premium proxies
CAPTCHA Solving | Relies on additional libraries | Fully automated
Data Structure | Requires manual parsing (e.g., Nokogiri) | AI-generated JSON schema
Maintenance | Frequent updates required | Self-maintaining system

When to Choose Each Approach

Different scenarios call for different tools. Here’s a quick comparison to help you decide:

Use Case | Best Option | Why
Small, static websites | Ruby Scripts | Simple and cost-effective
Dynamic websites with heavy JavaScript | InstantAPI.ai | Handles JS rendering automatically
Large-scale enterprise scraping | InstantAPI.ai | Scales easily with premium proxies
Highly customized data processing | Ruby Scripts | Offers full control over logic and parsing

"After trying other options, we were won over by the simplicity of InstantAPI.ai's AI Web Scraper. It's fast, easy, and allows us to focus on what matters most - our core features." - Juan, Scalista GmbH

Cost and Plan Details

Pricing often plays a key role in choosing a scraping solution. InstantAPI.ai offers flexible plans based on usage:

Plan | Cost | Features
Free Access | $0 | 500 scrapes/month, unlimited concurrency
Full Access | $10/1,000 scrapes | Unlimited pages, pay-per-use model
Enterprise | Custom (starting at $5,000) | Direct API access, dedicated account manager, SLA

For those who prefer a no-code option, the Chrome extension is available at $15 for 30 days or $120 annually. It provides unlimited scrapes in a user-friendly interface.

For teams looking to move away from Ruby scripts, the Full Access plan is a practical starting point. It offers scalability and reduces the time spent on development compared to manual Nokogiri-based solutions.

Summary and Resources

Key Takeaways

Ruby paired with Nokogiri provides a strong foundation for web scraping. Ruby's easy-to-read syntax, combined with Nokogiri's ability to parse HTML, makes it great for extracting data using CSS selectors or XPath. Here are some essential tools in the Ruby ecosystem to enhance your scraping projects:

Tool | Primary Use | Best For
Nokogiri | HTML/XML parsing | Extracting static content
Typhoeus | Advanced HTTP client | Handling parallel connections, HTTP/2
Kimurai | Scraping framework | JavaScript-heavy sites

Choosing Your Approach

Your choice between custom Ruby scripts and automated solutions depends on the project's specific needs. Here's a quick guide:

Requirement | Recommended Approach | Notes
Complex data processing | Custom Ruby scripts | Offers more control for intricate tasks
Rapid deployment | InstantAPI.ai | Ideal for quick project launches
Large-scale scraping | InstantAPI.ai + API | Efficient for handling big data volumes
Limited budget | Ruby scripts | Cost-effective but requires more effort

This breakdown helps you decide which tools and methods best suit your scraping goals.

Learning Resources

Level up your Ruby web scraping skills with these essential resources:

  • Official Documentation
    • Nokogiri Documentation: Detailed guides and API references.
    • Ruby-Doc.org: Standard library documentation for tools like net/http and open-uri.
  • Community Resources
    • Ruby Weekly Newsletter: Stay updated on the latest Ruby tools and practices.
    • RubyFlow: A community-driven hub for Ruby news and tutorials.
  • Resources for Advanced Projects
    • Kimurai Framework Documentation: Perfect for tackling JavaScript-heavy sites.
    • Ruby Toolbox: A curated list of Ruby gems tailored for web scraping.

These resources will help you sharpen your skills and handle more complex scraping tasks effectively.
