From Django to Spider: Implementing Scrapy in Your Web Application

Running a Scrapy Spider from Django Command-Line: A Step-by-Step Guide

Hazem Abbas

Sep 18, 2024 — 5 min read

Integrating Scrapy with Django creates a powerful combination for web scraping within your Django projects. This tutorial demonstrates how to run a Scrapy spider from the Django command line, enabling you to manage web scraping tasks from your existing Django application.

Whether you're gathering data for analytics, monitoring competitors, or enriching your database with external content, this method streamlines the process.

Prerequisites:

Django installed in your project (Install using pip install django)
Scrapy installed (Install using pip install scrapy)

Step 1: Create a Django Project

If you haven't set up your Django project yet, begin by creating a new one:

django-admin startproject scrapy_django_project
cd scrapy_django_project

Next, create a new app within your Django project where your Scrapy spider will be integrated:

python manage.py startapp webscraper

Step 2: Create a Scrapy Project Inside Your Django App

Navigate to the Django app folder (webscraper) and create a Scrapy project inside the app:

cd webscraper
scrapy startproject scraper

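This creates webscraper/scraper/, which holds scrapy.cfg and an inner scraper package (settings.py, spiders/, and so on). Return to the Django project root afterwards (cd ..), since the commands in the later steps are run from there.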
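As a quick sanity check, the layout produced by the first two steps can be verified with a short script. This is a minimal sketch: the helper name missing_paths is hypothetical, and the expected file list assumes the default output of startapp and scrapy startproject together with the project and app names used in this tutorial.

```python
from pathlib import Path

# Files expected relative to the Django project root after Steps 1 and 2.
# Assumes the names used in this tutorial (webscraper app, scraper Scrapy project).
EXPECTED = [
    "manage.py",
    "webscraper/models.py",
    "webscraper/scraper/scrapy.cfg",
    "webscraper/scraper/scraper/settings.py",
    "webscraper/scraper/scraper/spiders/__init__.py",
]


def missing_paths(project_root):
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the Django project root (the folder containing manage.py).
    missing = missing_paths(".")
    print("Layout looks complete." if not missing else f"Missing: {missing}")
```

An empty result means every expected file is in place; anything listed points at a step that was run from the wrong directory.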

Step 3: Set Up the Scrapy Spider

Inside the scraper directory, navigate to scraper/spiders/ and create a new spider file for your scraping logic. For example, let’s create a spider that scrapes example.com.

Create example_spider.py:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/']

    def parse(self, response):
        page_title = response.xpath('//title/text()').get()
        yield {
            'title': page_title
        }

Step 4: Create a Custom Django Command to Run the Scrapy Spider

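To run the Scrapy spider from Django's command line, we need to create a custom Django management command. Django only discovers commands in installed apps, so first add 'webscraper' to INSTALLED_APPS in scrapy_django_project/settings.py.

From the Django project root, create a management/commands/ directory inside your app (webscraper):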

mkdir -p webscraper/management/commands

Inside this folder, create a file named scrape_data.py. This file will contain the logic to run the Scrapy spider.

touch webscraper/management/commands/scrape_data.py
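Django's documentation for custom commands shows an __init__.py file in both management/ and commands/. A small script can create the whole skeleton in one go. This is only a sketch; the helper name ensure_command_module is hypothetical.

```python
from pathlib import Path


def ensure_command_module(app_dir, command_name):
    """Create <app_dir>/management/commands/<command_name>.py plus package files."""
    commands_dir = Path(app_dir) / "management" / "commands"
    commands_dir.mkdir(parents=True, exist_ok=True)
    # Package markers, as shown in Django's custom-command documentation.
    (commands_dir.parent / "__init__.py").touch()
    (commands_dir / "__init__.py").touch()
    command_file = commands_dir / f"{command_name}.py"
    command_file.touch()  # creates the file if missing; existing content is kept
    return command_file
```

Calling ensure_command_module("webscraper", "scrape_data") from the Django project root would have the same effect as the mkdir and touch commands above, with the two __init__.py files added.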

Now, add the following code to scrape_data.py to run the spider using Django’s command-line interface:

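import os
import sys
from pathlib import Path

from django.core.management.base import BaseCommand
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Assumes the Step 2 layout: the Scrapy project root is webscraper/scraper/.
# Add it to sys.path so the inner "scraper" package can be imported.
sys.path.insert(0, str(Path(__file__).resolve().parents[2] / "scraper"))
os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "scraper.settings")

from scraper.spiders.example_spider import ExampleSpider  # noqa: E402

class Command(BaseCommand):
    help = 'Run the Scrapy spider to scrape data'

    def handle(self, *args, **options):
        # start() blocks until the crawl has finished.
        process = CrawlerProcess(get_project_settings())
        process.crawl(ExampleSpider)
        process.start()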
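If you later add more spiders, the command could take the spider name as an argument. Django passes an argparse-based parser to a command's add_arguments method, so the option definitions can be tried out with plain argparse first. The sketch below is an assumption about how you might structure this, not part of the original command; the spider argument and the --loglevel option are hypothetical.

```python
import argparse


def add_arguments(parser):
    """Same body you would place in Command.add_arguments(self, parser)."""
    parser.add_argument("spider", nargs="?", default="example",
                        help="Name of the spider to run")
    parser.add_argument("--loglevel", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                        help="Log level to pass to Scrapy")


# Stand-in for the parser Django would supply.
parser = argparse.ArgumentParser(prog="manage.py scrape_data")
add_arguments(parser)

# Parse an explicit command line, then an empty one to see the defaults.
options = vars(parser.parse_args(["example", "--loglevel", "DEBUG"]))
defaults = vars(parser.parse_args([]))
print(options)
print(defaults)
```

Inside handle, the parsed values arrive in the options dictionary, so options['spider'] could be passed on to process.crawl.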

Step 5: Configure Scrapy Settings for Django

In the scraper directory, locate the settings.py file for Scrapy. Adjust it to integrate smoothly with your Django app. You may need to modify logging and database settings based on your project's needs.

For now, keep your Scrapy settings minimal and ensure they don't conflict with Django:

# scraper/settings.py

BOT_NAME = 'scraper'

SPIDER_MODULES = ['scraper.spiders']
NEWSPIDER_MODULE = 'scraper.spiders'

ROBOTSTXT_OBEY = True
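If the crawl needs to be more polite, or quieter in the Django console, a few additional standard Scrapy settings can go in the same file. The values below are illustrative assumptions, not recommendations taken from the original setup.

```python
# scraper/settings.py (additions; values are illustrative)

BOT_NAME = "scraper"
ROBOTSTXT_OBEY = True

# Wait between requests to the same site, in seconds.
DOWNLOAD_DELAY = 1.0
# Limit parallel requests per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# Reduce log noise when running through manage.py.
LOG_LEVEL = "INFO"
```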

Step 6: Run the Scrapy Spider via Django Command-Line

With everything set up, you can now run your Scrapy spider directly from Django's management command interface. Open your terminal, navigate to the Django project root, and execute this command:

python manage.py scrape_data

This will run your Scrapy spider (ExampleSpider) and scrape the data from the target website (example.com in this case).

Step 7: Processing the Scraped Data (Optional)

You can extend your setup to save the scraped data directly to Django models or a database. For instance, to store the page title scraped by the spider, you can modify your spider to interact with Django's ORM (Object-Relational Mapper).

Here’s a quick example of how you could extend the ExampleSpider to store scraped data in a model.

Define a Django model to store the scraped data in webscraper/models.py:

from django.db import models

class ScrapedData(models.Model):
    title = models.CharField(max_length=255)
    created_at = models.DateTimeField(auto_now_add=True)

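Create the table for the new model by running python manage.py makemigrations webscraper followed by python manage.py migrate. Then modify the spider’s parse method to save the data: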

import scrapy

from webscraper.models import ScrapedData

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/']

    def parse(self, response):
        page_title = response.xpath('//title/text()').get()
        # Save to Django model (skip pages without a <title>)
        if page_title:
            ScrapedData.objects.create(title=page_title)
        yield {
            'title': page_title
        }

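Now, when you run python manage.py scrape_data, the spider saves each scraped title to your database. If Django raises SynchronousOnlyOperation at this point, Scrapy is most likely running on its asyncio reactor and the ORM call is being made from an async context; Django's async-safety documentation describes the available options.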
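An alternative to writing to the database inside parse is to keep the spider free of Django imports and do the saving in a Scrapy item pipeline. The following is a minimal sketch under assumptions: the class name DjangoTitlePipeline is hypothetical, the save parameter exists only so the class can be exercised without a database, and the default path relies on the ScrapedData model above being importable when the command runs.

```python
class DjangoTitlePipeline:
    """Hypothetical item pipeline that stores each item's title."""

    def __init__(self, save=None):
        # Scrapy creates the pipeline without arguments, so the ORM saver is the default.
        self._save = save or self._save_with_orm

    @staticmethod
    def _save_with_orm(title):
        # Imported lazily so the module loads even before Django is set up.
        from webscraper.models import ScrapedData
        ScrapedData.objects.create(title=title)

    def process_item(self, item, spider):
        title = (item.get("title") or "").strip()
        if title:
            # Truncate to the model's max_length of 255.
            self._save(title[:255])
        return item


# Exercise the pipeline with an in-memory saver instead of the database.
saved = []
pipeline = DjangoTitlePipeline(save=saved.append)
pipeline.process_item({"title": "  Example Domain  "}, spider=None)
print(saved)
```

To enable it, the class would go in scraper/pipelines.py and be registered through the ITEM_PIPELINES setting in scraper/settings.py, for example {'scraper.pipelines.DjangoTitlePipeline': 300}.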

To sum up:

By integrating Scrapy with Django, you can run web scraping tasks directly from the Django command line. From here you can schedule the command, enrich your database with external data, or add spiders for product information, competitor monitoring, or analytics.

Key Benefits:

Seamless integration with Django's ecosystem
Ability to run and monitor Scrapy spiders from Django's command line
Easy storage and management of scraped data using Django models
Flexibility to scale and adapt as your scraping needs evolve

With this setup, you're ready to harness the power of web scraping directly within your Django projects!
