# Scrapy

Routing Scrapy requests through a proxy can improve the stability of data collection. A custom downloader middleware applies the proxy to every request, so you do not have to add proxy parameters to each request manually. Follow the steps below.

{% hint style="success" icon="lightbulb-exclamation-on" %}
Before configuring the proxy, you need to obtain proxy credentials. [Click here](/proxy/rotating-residential-proxies/quick-start.md) to learn how to obtain them.
{% endhint %}

***

{% stepper %}
{% step %}

### **Install Scrapy**

1. Visit the [Scrapy official website](https://www.scrapy.org/) and follow the instructions in the official documentation to complete the installation.
2. After installation, run `scrapy version` in the terminal. If a version number is printed (e.g., `Scrapy 2.14.2`), the installation was successful.
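
If you prefer to confirm the installation from Python rather than the terminal, the check above can be sketched as follows. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Scrapy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    scrapy_version = installed_version("scrapy")
    if scrapy_version is None:
        print("Scrapy is not installed in this environment")
    else:
        print(f"Scrapy {scrapy_version}")
```

Run it with the same Python interpreter (or virtual environment) that you will use for your Scrapy project.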

{% endstep %}

{% step %}

### **Custom Proxy Middleware**

{% hint style="success" icon="lightbulb-exclamation-on" %}
If you do not have a Scrapy project yet, please run `scrapy startproject myproject` to create a new project first, then enter the project directory.
{% endhint %}

1. Locate and open the `middlewares.py` file in your Scrapy project directory.

<figure><img src="/files/DJEu5n2SCPq0gY1R1uZ8" alt="" width="154"><figcaption></figcaption></figure>

2. Add the following code to the file and save it:

```python
# middlewares.py
from urllib.parse import urlsplit

from scrapy import signals


class AutoProxyMiddleware:
    """Attach the proxy configured in settings.py to every outgoing request."""

    def __init__(self, proxy):
        self.proxy = proxy

    @classmethod
    def from_crawler(cls, crawler):
        # Read the proxy URL from settings.py
        proxy = crawler.settings.get('HTTP_PROXY')
        mw = cls(proxy)
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        return mw

    def process_request(self, request, spider):
        # Set the proxy only if the request does not already define one
        if self.proxy and 'proxy' not in request.meta:
            request.meta['proxy'] = self.proxy
            # No need to manually add the 'Proxy-Authorization' header.
            # Scrapy's built-in HttpProxyMiddleware handles authentication
            # automatically because the credentials are in the proxy URL.

    def spider_opened(self, spider):
        if self.proxy:
            # Log only host:port so the credentials stay out of the logs
            host = urlsplit(self.proxy).netloc.rpartition('@')[2]
            spider.logger.info(f'AutoProxyMiddleware enabled, proxy: {host}')
        else:
            spider.logger.warning('AutoProxyMiddleware: HTTP_PROXY is not set')
```
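
To illustrate what "handling authentication automatically" means, here is a minimal sketch of how credentials embedded in a proxy URL map to a `Proxy-Authorization` header under the HTTP Basic scheme. It is not Scrapy's actual implementation; the function name `split_proxy_credentials`, the `latin-1` default encoding, and the example host are assumptions made for illustration.

```python
import base64
from urllib.parse import unquote, urlsplit


def split_proxy_credentials(proxy_url, encoding="latin-1"):
    """Return (proxy URL without credentials, Proxy-Authorization value or None)."""
    parts = urlsplit(proxy_url)
    host = parts.hostname or ""
    bare_netloc = f"{host}:{parts.port}" if parts.port else host
    bare_url = f"{parts.scheme}://{bare_netloc}"

    if parts.username is None:
        # No credentials in the URL, so no header is needed
        return bare_url, None

    # Percent-encoded characters are decoded before building the header
    user = unquote(parts.username)
    password = unquote(parts.password or "")
    token = base64.b64encode(f"{user}:{password}".encode(encoding)).decode("ascii")
    return bare_url, f"Basic {token}"


if __name__ == "__main__":
    url, header = split_proxy_credentials("http://user:pass@proxy.example.com:8000")
    print(url)     # http://proxy.example.com:8000
    print(header)  # Basic dXNlcjpwYXNz
```

Because the built-in middleware performs this step for you, the custom middleware only needs to place the full proxy URL in `request.meta['proxy']`.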

{% endstep %}

{% step %}

### **Configure the Proxy**

1. Open the `settings.py` file in your project.

<figure><img src="/files/q9RvAibmKZXWbTzwYZ5H" alt="" width="148"><figcaption></figcaption></figure>

2. Add the following configuration and save it:

```python
# Set your proxy (replace the proxy information with your actual credentials)
HTTP_PROXY = 'http://your_username:your_password@your_proxy_host:your_port'

# Configure downloader middlewares.
# Replace 'myproject' with your Scrapy project name.
DOWNLOADER_MIDDLEWARES = {
    # Custom middleware: 749 < 750, so it runs before the built-in HttpProxyMiddleware
    'myproject.middlewares.AutoProxyMiddleware': 749,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
```
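
If your username or password contains characters that are reserved in URLs (such as `@`, `:` or `/`), they generally need to be percent-encoded before being placed in the proxy URL. The following is a minimal sketch of building the `HTTP_PROXY` value that way; `build_proxy_url` is a hypothetical helper, and the credentials, host and port shown are placeholders.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port, scheme="http"):
    """Build a proxy URL, percent-encoding the username and password."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


if __name__ == "__main__":
    # Placeholder values; replace them with your actual credentials
    print(build_proxy_url("your_username", "p@ss:word", "proxy.example.com", 8000))
    # http://your_username:p%40ss%3Aword@proxy.example.com:8000
```

In `settings.py`, you could then assign the result to `HTTP_PROXY` instead of writing the URL by hand.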

{% hint style="success" icon="lightbulb-exclamation-on" %}
`AutoProxyMiddleware` is assigned priority `749`, which is lower than the built-in `HttpProxyMiddleware` value of `750`, so its `process_request` method runs first. It sets `meta['proxy']` on each request in advance, which takes precedence over proxies defined in system environment variables or global proxy settings.
{% endhint %}
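
The ordering rule can be sketched in plain Python: for outgoing requests, Scrapy calls `process_request` on enabled downloader middlewares in increasing order of their priority values, and a value of `None` disables a middleware. The helper `request_order` below is hypothetical and only mimics that sorting; it is not part of Scrapy.

```python
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.AutoProxyMiddleware': 749,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}


def request_order(middlewares):
    """Return middleware paths in the order their process_request would run."""
    # A value of None means the middleware is disabled
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


if __name__ == "__main__":
    for path in request_order(DOWNLOADER_MIDDLEWARES):
        print(path)
```

With the settings above, `AutoProxyMiddleware` is listed first, so `meta['proxy']` is already set when `HttpProxyMiddleware` processes the request.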

{% endstep %}

{% step %}

### **Verify the Proxy**

1. Create a test spider file named `test_proxy.py` in your project's `spiders` directory and add the following content:

```python
# test_proxy.py
import json

import scrapy


class TestProxySpider(scrapy.Spider):
    name = 'test_proxy'
    # No need to set meta['proxy'] here; the middleware handles it.
    start_urls = ['https://ipinfo.io/json']

    def parse(self, response):
        # Print the exit IP details reported by ipinfo.io
        data = json.loads(response.text)
        print(json.dumps(data, indent=2, ensure_ascii=False))
```

2. Run the spider (for example, `scrapy crawl test_proxy` from the project root if the file is in your project's `spiders` directory) and check that the reported IP and location belong to the proxy rather than to your own connection. Example output:

```json
{
  "ip": "67.72.110.148",
  "city": "Tampa",
  "region": "Florida",
  "country": "US",
  "loc": "27.9475,-82.4584",
  "org": "AS23089 Hotwire Communications",
  "postal": "33606",
  "timezone": "America/New_York",
  "readme": "https://ipinfo.io/missingauth"
}
```
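
One way to interpret the output is to compare the reported IP with the IP you see without the proxy. The following is a minimal sketch of that comparison; `check_exit_ip` is a hypothetical helper, and the direct IP `203.0.113.7` is a placeholder from a documentation address range.

```python
import json


def check_exit_ip(ipinfo_json, direct_ip):
    """Summarize an ipinfo.io JSON response and flag whether the exit IP differs from direct_ip."""
    data = json.loads(ipinfo_json)
    exit_ip = data.get("ip")
    location_parts = (data.get("city"), data.get("region"), data.get("country"))
    return {
        "exit_ip": exit_ip,
        "location": ", ".join(part for part in location_parts if part),
        # True when the request left through an address other than your own
        "proxied": exit_ip is not None and exit_ip != direct_ip,
    }


if __name__ == "__main__":
    sample = '{"ip": "67.72.110.148", "city": "Tampa", "region": "Florida", "country": "US"}'
    print(check_exit_ip(sample, direct_ip="203.0.113.7"))
```

If `proxied` is `False`, the request most likely bypassed the proxy, so recheck `HTTP_PROXY` and the `DOWNLOADER_MIDDLEWARES` entry in `settings.py`.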

{% endstep %}
{% endstepper %}

***


