Want to extract data from websites without relying on third-party tools? This guide will show you how to create custom web scraping scripts using Ruby and Nokogiri. Here's what you'll learn:
- Why Ruby and Nokogiri?: Ruby's simple syntax and Nokogiri's powerful HTML/XML parsing make them ideal for scraping tasks.
- Setup Made Easy: Step-by-step instructions for installing Ruby, Nokogiri, and essential dependencies on Windows, macOS, and Linux.
- Scraping Basics: Learn how to fetch web pages, parse HTML, and extract data using CSS selectors or XPath.
- Advanced Techniques: Handle JavaScript-heavy sites, manage errors, and optimize performance for large-scale scraping.
- Legal and Ethical Practices: Stay compliant with laws and avoid getting blocked by websites.
Quick Comparison: Ruby/Nokogiri vs. InstantAPI.ai
Feature | Ruby & Nokogiri | InstantAPI.ai |
---|---|---|
JavaScript Handling | Requires extra setup | Automatic |
Proxy Management | Manual configuration | Built-in |
CAPTCHA Solving | Additional libraries | Fully automated |
Ease of Use | Full control, more effort | Quick and beginner-friendly |
Cost | Free (self-managed) | Pay-per-use plans |
Whether you're building a small scraper for personal projects or need a scalable solution for enterprise tasks, this guide has you covered.
Video: Scraping the Web with Ruby - Part 1
Setup Requirements
Before starting with web scraping, you need to set up your development environment correctly. Here's a step-by-step guide for different operating systems.
Ruby and Nokogiri Installation
The installation process depends on your operating system. Follow the instructions below:
Operating System | Ruby Installation | Nokogiri Installation | Notes |
---|---|---|---|
Windows | Use RubyInstaller with DevKit | `gem install nokogiri` (or `gem install nokogiri --platform=ruby -- --use-system-libraries`) | Ensure Ruby is added to your system PATH.
macOS | `brew install ruby` | `gem install nokogiri` (or `gem install nokogiri --platform=ruby -- --use-system-libraries`) | You may need to install libxml2 and libxslt first: `brew install libxml2 libxslt`.
Ubuntu/Debian | `sudo apt-get install ruby-full` | Install dependencies with `sudo apt-get install pkg-config libxml2-dev libxslt-dev`, then run `gem install nokogiri --platform=ruby -- --use-system-libraries`. | |
Fedora/Red Hat/CentOS | `sudo yum install ruby` | Install dependencies with `dnf install -y zlib-devel xz patch`, then run `gem install nokogiri --platform=ruby`. | |
For macOS, if you encounter issues, install the required libraries with Homebrew and retry:
brew install libxml2 libxslt
gem install nokogiri --platform=ruby -- --use-system-libraries
Setting Up Your Project Structure
After configuring your environment, organize your project directory like this:
scraper/
├── bin/
│ └── scrape # Command-line executable
├── lib/
│ ├── scraper.rb # Main library file
│ └── scraper/ # Core classes
├── Gemfile # Dependencies
├── README.md # Project overview
└── test/ # Test files
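For reference, the bin/scrape file in the tree above can stay very small. Here's a minimal sketch; the Scraper.run helper it calls is hypothetical and shown only for illustration:
#!/usr/bin/env ruby
# bin/scrape - minimal command-line entry point (illustrative sketch)
require_relative '../lib/scraper'

url = ARGV[0]
abort 'Usage: bin/scrape URL' unless url
puts Scraper.run(url) # Scraper.run is a hypothetical helper defined in lib/scraper.rb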
To create a new project, use Bundler:
bundle gem scraper
Add the required gems to your `Gemfile`:
source 'https://rubygems.org'
gem 'nokogiri'
gem 'httparty'
Then, install the dependencies and update your tools:
bundle install
gem update --system
gem install bundler
For Docker users working with Alpine-based images, include the following in your Dockerfile:
FROM ruby:3.0-alpine
RUN apk add --no-cache build-base libxml2-dev libxslt-dev
RUN gem install nokogiri --platform=ruby -- --use-system-libraries
Verifying Your Installation
To confirm everything is set up correctly, run this simple script:
require 'nokogiri'
puts Nokogiri::VERSION
Once you see the Nokogiri version printed, you're ready to start building your web scraping scripts!
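If you'd also like to confirm that HTML parsing works end to end, a quick smoke test against an inline fragment (the markup here is arbitrary) looks like this:
require 'nokogiri'

# Parse a small HTML fragment and query it with a CSS selector
doc = Nokogiri::HTML('<html><body><h1>Hello, Nokogiri!</h1></body></html>')
puts doc.at_css('h1').text # => "Hello, Nokogiri!"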
Writing Basic Scraping Scripts
Website Selection Guidelines
Before scraping a website, it's important to evaluate it thoroughly. Start by checking the `robots.txt` file, which outlines the site's rules for automated access. For example, you can find Amazon's policies at amazon.com/robots.txt.
Here are some key factors to consider when choosing a website:
- Legal compliance: Always review the terms of service for any restrictions on scraping.
- Technical feasibility: Check if the page structure is stable and whether it relies heavily on JavaScript.
- Data accessibility: Ensure the data you need is publicly available.
- Rate limiting: Look for restrictions on request frequency and set proper delays to avoid being blocked.
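To automate the robots.txt check described above, a minimal sketch like the one below can flag obviously disallowed paths. It deliberately ignores per-agent sections, wildcards, and Allow rules; a dedicated parser such as the robotstxt gem handles the full spec:
require 'net/http'
require 'uri'

# Fetch /robots.txt and report whether any Disallow rule matches the path.
# Simplified on purpose: it treats every rule as applying to all user agents.
def path_disallowed?(base_url, path)
  body = Net::HTTP.get(URI.join(base_url, '/robots.txt'))
  body.each_line.any? do |line|
    rule = line[/\ADisallow:\s*(\S+)/, 1]
    rule && path.start_with?(rule)
  end
end

puts path_disallowed?('https://example.com', '/private/')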
Whenever possible, opt for websites that provide official APIs. For example, Reddit offers an API that is far more reliable and efficient than scraping their HTML content.
HTML Data Collection
Once you've selected a website, you can start retrieving its content. A combination of `HTTParty` for HTTP requests and `Nokogiri` for parsing HTML works well. Here's a simple script to get started:
require 'httparty'
require 'nokogiri'
def fetch_page(url)
headers = {
'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept' => 'text/html,application/xhtml+xml',
'Referer' => 'https://www.google.com'
}
response = HTTParty.get(url, headers: headers)
Nokogiri::HTML(response.body)
end
# Example usage
doc = fetch_page('https://example.com')
Once the page is loaded, you can use tailored selectors to extract the data you need.
Data Selection Methods
Nokogiri provides two main ways to select elements: CSS selectors and XPath. For most tasks, CSS selectors are easier to read and maintain.
Here’s how you can extract specific elements using CSS selectors:
# Extract titles using CSS selectors
titles = doc.css('.title').map { |t| t.text.strip }
# Handle missing data gracefully
description = doc.css('.description').first&.text&.strip || 'No description available'
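For comparison, here are the same kinds of lookups written with XPath (the class names are placeholders carried over from the CSS example above):
# Equivalent lookups using XPath instead of CSS selectors.
# Note: @class="title" matches the exact attribute value, unlike the CSS .title selector.
titles = doc.xpath('//*[@class="title"]').map { |t| t.text.strip }

# XPath is handy for attribute-based queries, e.g. the href of the
# first link inside each element with class "description"
links = doc.xpath('//*[@class="description"]//a[1]/@href').map(&:value)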
For more complex structures, you can create dedicated methods for better organization:
class WebScraper
def initialize(url)
@doc = fetch_page(url)
end
def extract_product_details
{
name: extract_name,
price: extract_price,
availability: check_availability
}
end
private
def extract_name
@doc.css('h1.product-title').text.strip
end
def extract_price
price_element = @doc.css('.price').first
price_element ? price_element.text.gsub(/[^\d.]/, '') : nil
end
def check_availability
@doc.css('.stock-status').text.downcase.include?('in stock')
end
end
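Usage then looks like this; the URL is a placeholder, and the class assumes the fetch_page helper defined earlier is in scope:
scraper = WebScraper.new('https://example.com/products/123') # placeholder URL
puts scraper.extract_product_details
# => { name: "...", price: "...", availability: true or false }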
"Using Nokogiri to web scrape data from a website is an incredibly powerful and useful tool to have in your programmer's toolbox. Learning how to web scrape can allow you to automate data collection that would be tedious and time consuming if you were to do it manually!" - Brian Cheung, Software Engineer
Advanced Scraping Methods
Scraping dynamic content demands advanced techniques to ensure data is collected efficiently and accurately. Below are strategies tailored for Ruby and Nokogiri.
Handling JavaScript-Driven Content
Since Nokogiri can't process JavaScript, you'll need alternative methods. Here are two practical approaches:
Method 1: Using the Kimurai Framework
Kimurai integrates Nokogiri's parsing tools with Capybara for handling JavaScript-heavy content. Here's an example:
require 'kimurai'
class ProductScraper < Kimurai::Base
@engine = :selenium_chrome
@config = {
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
before_request: { delay: 2 }
}
  def parse(response, url:, data: {})
    # `response` is the Nokogiri document Kimurai builds from the rendered page
    response.css('.product-title').each do |element|
      item = {
        name: element.text.strip,
        price: element.at_css('.price')&.text
      }
      save_to "products.json", item, format: :json
    end
  end
end
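You can then start the spider with Kimurai's class-level crawl! method:
ProductScraper.crawl!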
Method 2: Extracting Data from JavaScript Variables
Some websites store data directly in JavaScript variables. You can parse this data as follows:
require 'json'

def extract_js_data(doc)
  script = doc.css('script').find { |s| s.text.include?('productData') }
  return nil unless script
  match = script.text.match(/productData\s*=\s*(.*?);/m) # /m lets the JSON span multiple lines
  match && JSON.parse(match[1])
end
Managing Errors
Effective error management is key to reliable scraping. Here's an example of implementing retries and error logging:
require 'httparty'

class ScraperWithRetry
  MAX_RETRIES = 3
  BACKOFF_FACTOR = 2

  def fetch_with_retry(url)
    retries = 0
    begin
      response = HTTParty.get(url)
      raise "Rate limit reached" if response.code == 429
      response
    rescue => e
      retries += 1
      if retries <= MAX_RETRIES
        sleep(BACKOFF_FACTOR ** retries) # exponential backoff: 2, 4, 8 seconds
        retry
      else
        log_error(url, e)
        raise
      end
    end
  end

  private

  # Simple stderr logger; swap in your preferred logging library
  def log_error(url, error)
    warn "[#{Time.now}] Failed to fetch #{url}: #{error.message}"
  end
end
Key strategies for error handling include:
- Using exponential backoff for retries
- Monitoring response times and error rates
- Rotating IP addresses with proxies (see the sketch after this list)
- Validating the format of extracted data
- Keeping detailed error logs for debugging
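For the proxy-rotation point above, HTTParty's built-in proxy options make a simple random rotation straightforward. The proxy addresses below are placeholders:
require 'httparty'

# Pool of proxies to rotate through (placeholders - use your own endpoints)
PROXIES = [
  { addr: 'proxy1.example.com', port: 8080 },
  { addr: 'proxy2.example.com', port: 8080 }
].freeze

def fetch_via_proxy(url)
  proxy = PROXIES.sample # pick a proxy at random for each request
  HTTParty.get(url,
               http_proxyaddr: proxy[:addr],
               http_proxyport: proxy[:port],
               headers: { 'User-Agent' => 'Mozilla/5.0' })
end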
Once your error management is solid, ensure your scraping practices comply with legal and ethical standards.
Legal and Ethical Guidelines
Requirement | Implementation |
---|---|
Rate Limiting | Add 2–5 second delays between requests |
robots.txt | Use the robotstxt gem to verify rules |
Data Usage | Ensure adherence to terms of service |
Personal Data | Obtain explicit consent |
Server Load | Use throttling to reduce server strain |
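To apply the rate-limiting row above in practice, a randomized pause between requests is usually enough. Here, fetch_page is the helper defined earlier and process stands in for your own handling code:
urls.each do |url|
  doc = fetch_page(url) # fetch_page as defined earlier
  process(doc)          # hypothetical processing step
  sleep(rand(2.0..5.0)) # wait 2-5 seconds between requests
end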
"Web scraping is like treasure hunting - except sometimes the map leads you to a big, fat error message instead of gold." - Ize Majebi, Python developer and data enthusiast
Remember, non-compliance with regulations like GDPR can result in severe penalties, including fines up to €20 million or 4% of global revenue. The HiQ Labs v. LinkedIn case clarified that scraping public data is generally allowed, but actions like using fake accounts could lead to legal trouble.
Best practices for advanced scraping:
- Regularly monitor performance metrics
- Rotate user agents to avoid detection
- Use headless browsers for dynamic content
- Maintain comprehensive error logs
- Update your scripts to adapt to website changes
Speed and Scale Improvements
As your scrapers grow, the next step is to improve performance and manage data collection efficiently at scale.
Script Performance Tips
Managing memory effectively is key for fast data scraping. Here's an example of using SAX parsing for better memory efficiency:
require 'nokogiri'

# Example of memory-efficient parsing using SAX: elements are handled one at a
# time as they stream in, so the whole document never sits in memory.
class MyDocHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    return unless name == 'product'
    # attrs arrives as an array of [attribute, value] pairs
    process_product(Hash[attrs])
  end

  def process_product(attrs)
    puts attrs['id'] # placeholder handling; 'id' is an assumed attribute name
  end
end

parser = Nokogiri::XML::SAX::Parser.new(MyDocHandler.new)
File.open("large_file.xml") { |f| parser.parse(f) }
Some proven optimization strategies include:
- Using SAX parsing for large XML files instead of loading the entire file into memory (DOM parsing).
- Employing batch processing with targeted XPath queries to reduce overhead (a short example follows this list).
- Caching nodes temporarily and freeing up memory as soon as they're processed.
- Regularly profiling your code with tools like ruby-prof to identify bottlenecks.
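To make the targeted-XPath point above concrete, pulling just the fields you need with a couple of focused queries is cheaper than iterating over every node in the document. The selectors here are illustrative:
# Two targeted queries instead of walking the whole tree
names  = doc.xpath('//div[@class="product"]/h2/text()').map(&:to_s)
prices = doc.xpath('//div[@class="product"]//span[@class="price"]/text()').map(&:to_s)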
After addressing script performance, focus on storage solutions to handle growing data efficiently.
Data Storage Options
Choosing the right storage method can significantly impact scalability and speed. Here's a quick comparison of popular options:
Storage Type | Best For | Performance Impact | Scaling Capability |
---|---|---|---|
PostgreSQL | Structured data | High with indexing | Vertical scaling |
MongoDB | Unstructured data | Fast writes | Horizontal scaling |
Amazon S3 | Large files | 99.99% availability | Unlimited scaling |
Local CSV/JSON | Small datasets | Quick access | Limited by disk space |
For databases like PostgreSQL, using batch inserts can improve write speeds. Here's an example:
# Batch insert example with PostgreSQL
def batch_save(records, batch_size = 1000)
records.each_slice(batch_size) do |batch|
Product.insert_all(batch)
GC.start # Trigger garbage collection after each batch
end
end
Efficient storage and processing ensure your system can handle large-scale tasks without breaking a sweat.
Task Scheduling
Automating tasks is essential to keep data collection running smoothly. Schedule your jobs effectively with tools like the `whenever` gem:
# config/schedule.rb
every 1.day, at: '2:30 am' do
runner "ScraperJob.perform_now"
end
For concurrent requests, Typhoeus with Hydra is a great tool. It allows you to process multiple requests simultaneously:
hydra = Typhoeus::Hydra.new(max_concurrency: 10)
urls.each do |url|
request = Typhoeus::Request.new(url,
followlocation: true,
headers: { 'User-Agent' => 'Custom Ruby Scraper 1.0' }
)
request.on_complete do |response|
process_response(response)
end
hydra.queue(request)
end
hydra.run # Execute all requests in parallel
These techniques help maintain consistent performance and scalability, even as your data collection grows.
InstantAPI.ai Overview
InstantAPI.ai steps in as a streamlined solution to the challenges of custom Ruby-based scraping. This AI-driven platform simplifies tasks that typically require intricate scripting. Below, we’ll dive into its features, compare it with custom scripts, and break down its pricing options.
InstantAPI.ai Main Features
InstantAPI.ai tackles the technical hurdles of web scraping, offering automation for JavaScript rendering, proxy rotation, and CAPTCHA solving. It provides both a Chrome extension and an API, giving users flexible tools for their scraping needs. Here's how it stacks up against custom Ruby scripts:
Feature | Custom Ruby Scripts | InstantAPI.ai |
---|---|---|
JavaScript Handling | Requires manual setup | Automatic rendering |
Proxy Management | Needs custom configuration | Built-in premium proxies |
CAPTCHA Solving | Relies on additional libraries | Fully automated |
Data Structure | Requires manual parsing (e.g., Nokogiri) | AI-generated JSON schema |
Maintenance | Frequent updates required | Self-maintaining system |
When to Choose Each Approach
Different scenarios call for different tools. Here’s a quick comparison to help you decide:
Use Case | Best Option | Why |
---|---|---|
Small, static websites | Ruby Scripts | Simple and cost-effective |
Dynamic websites with heavy JavaScript | InstantAPI.ai | Handles JS rendering automatically |
Large-scale enterprise scraping | InstantAPI.ai | Scales easily with premium proxies |
Highly customized data processing | Ruby Scripts | Offers full control over logic and parsing |
"After trying other options, we were won over by the simplicity of InstantAPI.ai's AI Web Scraper. It's fast, easy, and allows us to focus on what matters most - our core features." - Juan, Scalista GmbH
Cost and Plan Details
Pricing often plays a key role in choosing a scraping solution. InstantAPI.ai offers flexible plans based on usage:
Plan | Cost | Features |
---|---|---|
Free Access | $0 | 500 scrapes/month, unlimited concurrency |
Full Access | $10/1,000 scrapes | Unlimited pages, pay-per-use model |
Enterprise | Custom (starting at $5,000) | Direct API access, dedicated account manager, SLA |
For those who prefer a no-code option, the Chrome extension is available at $15 for 30 days or $120 annually. It provides unlimited scrapes in a user-friendly interface.
For teams looking to move away from Ruby scripts, the Full Access plan is a practical starting point. It offers scalability and reduces the time spent on development compared to manual Nokogiri-based solutions.
Summary and Resources
Key Takeaways
Ruby paired with Nokogiri provides a strong foundation for web scraping. Ruby's easy-to-read syntax, combined with Nokogiri's ability to parse HTML, makes it great for extracting data using CSS selectors or XPath. Here are some essential tools in the Ruby ecosystem to enhance your scraping projects:
Tool | Primary Use | Best For |
---|---|---|
Nokogiri | HTML/XML parsing | Extracting static content |
Typhoeus | Advanced HTTP client | Handling parallel connections, HTTP2 |
Kimurai | Framework | Scraping JavaScript-heavy sites |
Choosing Your Approach
Your choice between custom Ruby scripts and automated solutions depends on the project's specific needs. Here's a quick guide:
Requirement | Recommended Approach | Notes |
---|---|---|
Complex data processing | Custom Ruby scripts | Offers more control for intricate tasks |
Rapid deployment | InstantAPI.ai | Ideal for quick project launches |
Large-scale scraping | InstantAPI.ai + API | Efficient for handling big data volumes |
Limited budget | Ruby scripts | Cost-effective but requires more effort |
This breakdown helps you decide which tools and methods best suit your scraping goals.
Learning Resources
Level up your Ruby web scraping skills with these essential resources:
- Official Documentation
  - Nokogiri Documentation: Detailed guides and API references.
  - Ruby-Doc.org: Standard library documentation for tools like `net/http` and `open-uri`.
- Community Resources
  - Ruby Weekly Newsletter: Stay updated on the latest Ruby tools and practices.
  - RubyFlow: A community-driven hub for Ruby news and tutorials.
- Resources for Advanced Projects
  - Kimurai Framework Documentation: Perfect for tackling JavaScript-heavy sites.
  - Ruby Toolbox: A curated list of Ruby gems tailored for web scraping.
These resources will help you sharpen your skills and handle more complex scraping tasks effectively.