Python Web Scraping Basics: Fetching Web Page Data from Scratch

路人甲乙丙丁

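1. Introduction: What Is a Web Crawler, and Why Learn to Write One?

A web crawler is a program that fetches web page content automatically. In a data-driven era, crawling has become an essential skill for data analysts, developers, and researchers. Whether you are collecting training data, monitoring competitor prices, or doing market research, a crawler lets you extract useful information from large numbers of pages efficiently.

What will you get from this tutorial?
✅ Understand how a crawler works and its basic workflow
✅ Send HTTP requests with the Requests library
✅ Parse HTML with BeautifulSoup
✅ Handle anti-scraping measures (request headers, delays, proxies)
✅ Hands-on project: scrape a movie ranking and save it to a file

This tutorial assumes basic Python knowledge (variables, functions, loops); no prior scraping experience is needed.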

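2. How a Crawler Works

2.1 Basic Workflow

Send request → Receive response → Parse data → Store data
     ↓                ↓                ↓            ↓
  Requests        Response       BeautifulSoup   CSV/JSON

Send request: act like a browser and ask the server for a page
Receive response: the server returns HTML, JSON, or other data
Parse data: extract the information you need from the response
Store data: save it to a file or a database

2.2 Legal and Ethical Considerations

Check the site's robots.txt file (e.g. https://example.com/robots.txt)
Respect the site's copyright; do not use scraped data commercially
Throttle your request rate so you do not put load on the server
Comply with applicable laws and regulations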
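
The first item on this list can be automated. The sketch below uses the standard library's urllib.robotparser to check whether a path may be fetched. The robots.txt rules and the user-agent name my-spider are made up for illustration; for a real site you would call set_url() and read() to load the live file instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# 'my-spider' is an assumed user-agent name for this illustration.
allowed_public = parser.can_fetch('my-spider', 'https://example.com/public/page')
allowed_private = parser.can_fetch('my-spider', 'https://example.com/private/data')
delay = parser.crawl_delay('my-spider')

print(allowed_public, allowed_private, delay)
```

If can_fetch() returns False for a URL, skip it; if crawl_delay() returns a number, treat it as the minimum pause between requests.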

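3. Setting Up the Environment

3.1 Install Python

# Check whether Python is installed
python --version

# If it is not installed, download it from https://python.org

3.2 Install the Required Libraries

# Install all dependencies at once
pip install requests beautifulsoup4 lxml

# Users in mainland China can speed things up with the Tsinghua mirror
pip install requests beautifulsoup4 lxml -i https://pypi.tuna.tsinghua.edu.cn/simple

# What each library does:
# requests: sends HTTP requests
# beautifulsoup4: parses HTML/XML
# lxml: a parser backend for BeautifulSoup (faster than the default html.parser)

3.3 Create a Project Folder

mkdir my-spider
cd my-spider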

4. The Requests Library: Sending HTTP Requests

4.1 Sending a GET Request

import requests

# Send a GET request
url = 'https://httpbin.org/get'
response = requests.get(url)

# Check the response status code (200 means success)
print(f'Status code: {response.status_code}')

# View the response body as text
print(f'Text: {response.text[:200]}')

# View the response body parsed as JSON
print(f'JSON: {response.json()}')

4.2 Adding Request Headers (Imitating a Browser)

import requests

url = 'https://httpbin.org/headers'

# Set request headers so the request looks like it comes from a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
}

response = requests.get(url, headers=headers)
print(response.text)

4.3 Sending a POST Request

import requests

url = 'https://httpbin.org/post'

# Form data to submit with the POST request
data = {
    'username': 'python_spider',
    'password': '123456'
}

response = requests.post(url, data=data)
print(response.json())

4.4 Handling Query Parameters

import requests

# Option 1: build the query string into the URL by hand
url = 'https://httpbin.org/get?page=1&size=10'

# Option 2: pass a dict via params (recommended)
params = {
    'page': 1,
    'size': 10,
    'keyword': 'python'
}
response = requests.get('https://httpbin.org/get', params=params)
print(response.url)  # Show the URL that was actually requested
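
To see what the params argument does without touching the network, one option is to build a request and prepare it locally. This is a minimal sketch; the keyword value containing a space is an assumption added to make the encoding visible.

```python
import requests

params = {'page': 1, 'size': 10, 'keyword': 'python spider'}

# Request(...).prepare() builds the final URL but does not send anything.
prepared = requests.Request('GET', 'https://httpbin.org/get', params=params).prepare()
print(prepared.url)
```

Requests URL-encodes the values for you, which is why passing a dict is safer than concatenating strings by hand when a value contains spaces or non-ASCII characters.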

5. BeautifulSoup: Parsing HTML

5.1 Basic Usage

from bs4 import BeautifulSoup

html = '''
<html>
    <head><title>My Page</title></head>
    <body>
        <div class="content">
            <h1>Article Title</h1>
            <p class="desc">This is the description text</p>
            <a href="https://example.com">Click the link</a>
            <ul>
                <li>List item 1</li>
                <li>List item 2</li>
            </ul>
        </div>
    </body>
</html>
'''

# Create the BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')

# Get the page title
title = soup.title.string
print(f'Title: {title}')

# Get the text of the h1 tag
h1_text = soup.h1.string
print(f'h1: {h1_text}')

# Get all li tags
items = soup.find_all('li')
for item in items:
    print(f'List item: {item.string}')

5.2 Common Search Methods

from bs4 import BeautifulSoup

html = '''
<div class="container">
    <div class="item" id="first">First item</div>
    <div class="item" id="second">Second item</div>
    <div class="item special" id="third">Third item</div>
    <a href="/page1">Page 1</a>
    <a href="/page2">Page 2</a>
    <img src="image.jpg" alt="Image">
</div>
'''

soup = BeautifulSoup(html, 'lxml')

# 1. find() - return the first matching element
first_item = soup.find('div', class_='item')
print(f'First item: {first_item.string}')

# 2. find_all() - return every matching element
all_items = soup.find_all('div', class_='item')
print(f'Found {len(all_items)} items')

# 3. Find by id
second = soup.find(id='second')
print(f'id is second: {second.string}')

# 4. Find by attribute (here, the class attribute)
special = soup.find(class_='special')
print(f'class contains special: {special.string}')

# 5. Find all a tags
links = soup.find_all('a')
for link in links:
    print(f'Link: {link.get("href")}, text: {link.string}')

# 6. Use CSS selectors
items_by_css = soup.select('.item')  # every element whose class includes item
first_by_css = soup.select_one('#first')  # the element with id=first
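
Beyond plain class and id selectors, select() accepts most CSS syntax. The following is a brief sketch on a trimmed, slightly altered copy of the HTML above; the extra a tag without an href is an assumption added to show attribute selectors.

```python
from bs4 import BeautifulSoup

html = '''
<div class="container">
    <div class="item" id="first">First item</div>
    <div class="item special" id="third">Third item</div>
    <a href="/page1">Page 1</a>
    <a name="anchor-only">No link</a>
</div>
'''

# 'html.parser' ships with Python, so this runs without lxml installed.
soup = BeautifulSoup(html, 'html.parser')

# Elements that carry both classes
special_ids = [tag['id'] for tag in soup.select('div.item.special')]

# Only <a> tags that actually have an href attribute
hrefs = [a['href'] for a in soup.select('a[href]')]

# <div> elements that are direct children of .container
child_count = len(soup.select('div.container > div'))

print(special_ids, hrefs, child_count)
```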

5.3 Extracting Attributes and Text

from bs4 import BeautifulSoup

html = '<a href="https://example.com" class="link">Click here</a>'
soup = BeautifulSoup(html, 'lxml')
a_tag = soup.find('a')

# Get the text content
text = a_tag.get_text()
print(f'Text: {text}')

# Get attribute values
href = a_tag.get('href')
class_name = a_tag.get('class')  # class is multi-valued, so this is a list: ['link']
print(f'href: {href}')
print(f'class: {class_name}')

# Get all attributes as a dict
attrs = a_tag.attrs
print(f'All attributes: {attrs}')
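
The examples so far use both .string and get_text(), and the two do not always agree. A small sketch of the difference follows; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Rated <b>9.7</b> by users</p>', 'html.parser')
p_tag = soup.find('p')

# .string is None here because <p> has more than one child node
string_value = p_tag.string

# get_text() joins the text of all descendants
full_text = p_tag.get_text()

# The nested <b> has a single text child, so .string works there
bold_text = p_tag.b.string

print(string_value, full_text, bold_text)
```

When a tag might contain nested markup, get_text() is usually the safer choice; .string is convenient only when the tag is known to hold a single piece of text.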

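6. Hands-On Project: Scraping the Douban Movie Top 250

6.1 Analyzing the Target Site

Target: https://movie.douban.com/top250
Fields to extract: movie title, rating, number of ratings, one-line tagline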

6.2 The Complete Scraper

import requests
from bs4 import BeautifulSoup
import time
import csv

def get_movies(url):
    """Scrape the movie data from one page."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }

    # timeout keeps a stalled connection from hanging the whole run
    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code != 200:
        print(f'Request failed, status code: {response.status_code}')
        return []

    soup = BeautifulSoup(response.text, 'lxml')
    movie_list = []

    # Find every movie entry on the page
    items = soup.find_all('div', class_='item')

    for item in items:
        # Movie title (the first title span holds the Chinese title)
        title_element = item.find('span', class_='title')
        if title_element:
            title = title_element.string
        else:
            continue

        # Rating
        rating_element = item.find('span', class_='rating_num')
        rating = rating_element.string if rating_element else 'N/A'

        # Number of ratings: the last <span> inside div.star, with the
        # trailing '人评价' ("people rated") suffix removed
        star_element = item.find('div', class_='star')
        spans = star_element.find_all('span') if star_element else []
        people = spans[-1].get_text().replace('人评价', '') if spans else '0'

        # One-line tagline
        quote_element = item.find('span', class_='inq')
        quote = quote_element.string if quote_element else 'No tagline'

        movie_list.append({
            'title': title,
            'rating': rating,
            'votes': people,
            'quote': quote
        })

    return movie_list

def save_to_csv(movies, filename='douban_top250.csv'):
    """Save the data to a CSV file."""
    if not movies:
        print('No data to save')
        return

    # utf-8-sig adds a BOM so Excel displays Chinese text correctly
    with open(filename, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'rating', 'votes', 'quote'])
        writer.writeheader()
        writer.writerows(movies)

    print(f'Saved {len(movies)} rows to {filename}')

def main():
    """Entry point: scrape every page."""
    all_movies = []
    base_url = 'https://movie.douban.com/top250'

    print('Starting to scrape the Douban Movie Top 250...')

    for page in range(10):  # 10 pages, 25 entries each
        start = page * 25
        url = f'{base_url}?start={start}'
        print(f'Scraping page {page+1}...')

        movies = get_movies(url)
        all_movies.extend(movies)

        # Polite delay so requests are not sent too quickly
        time.sleep(2)

    print(f'Done, collected {len(all_movies)} movies in total')

    # Save the results
    save_to_csv(all_movies)

    # Print the first 5 rows as a preview
    print('\nPreview:')
    for i, movie in enumerate(all_movies[:5], 1):
        print(f'{i}. {movie["title"]} - rating: {movie["rating"]} - {movie["quote"]}')

if __name__ == '__main__':
    main()
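
The workflow in section 2.1 lists JSON as a storage option alongside CSV. A minimal sketch of a JSON counterpart to save_to_csv follows; save_to_json, the temporary file path, and the sample record are assumptions for illustration, not part of the scraper above.

```python
import json
import os
import tempfile

def save_to_json(movies, filename='douban_top250.json'):
    """Write a list of dicts to a UTF-8 JSON file and return the row count."""
    with open(filename, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        json.dump(movies, f, ensure_ascii=False, indent=2)
    return len(movies)

# Hypothetical record, used only to exercise the function
sample = [{'title': '示例电影', 'rating': '9.0'}]
path = os.path.join(tempfile.gettempdir(), 'movies_sample.json')
count = save_to_json(sample, path)

# Read the file back to confirm the round trip
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)
print(count, loaded)
```

JSON keeps nested values intact, which can make it a better fit than CSV once a record contains lists such as genres or cast members.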

7. Dealing with Anti-Scraping Measures

7.1 Rotating the User-Agent

import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0 Safari/537.36',
]

# Pick a different User-Agent at random for each request
headers = {
    'User-Agent': random.choice(user_agents)
}

7.2 Adding a Delay Between Requests

import time
import random

# Sleep for a random 1 to 3 seconds
time.sleep(random.uniform(1, 3))
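
A random sleep is easy to scatter through a script, but it is also easy to forget. One possible refinement is a small helper that enforces a minimum gap between consecutive requests. The class name Throttle and the interval values are assumptions for this sketch.

```python
import random
import time

class Throttle:
    """Keep a random gap of min_delay..max_delay seconds between calls to wait()."""

    def __init__(self, min_delay=1.0, max_delay=3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._last_call = None

    def wait(self):
        """Sleep just long enough to honor the chosen gap; return seconds slept."""
        gap = random.uniform(self.min_delay, self.max_delay)
        slept = 0.0
        if self._last_call is not None:
            elapsed = time.monotonic() - self._last_call
            slept = max(0.0, gap - elapsed)
            time.sleep(slept)
        self._last_call = time.monotonic()
        return slept

# Short intervals so the demo finishes quickly
throttle = Throttle(min_delay=0.05, max_delay=0.1)
first = throttle.wait()   # the first call never sleeps
second = throttle.wait()  # the second call sleeps for the rest of the gap
print(first, second)
```

Calling throttle.wait() right before each request means time already spent parsing the previous page counts toward the gap, so the crawler does not wait longer than it needs to.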

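7.3 Using a Proxy

import requests

# Placeholder proxy address; replace it with a proxy you actually run or rent.
# Most HTTP proxies are reached over plain http://, even for HTTPS targets.
proxies = {
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888'
}

# url and headers are defined as in the earlier examples
response = requests.get(url, headers=headers, proxies=proxies)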

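7.4 Handling Cookies

import requests

# Option 1: put the cookie in the request headers
headers = {
    'Cookie': 'sessionid=xxx; userid=123'
}

# Option 2: use a Session object to keep the session alive
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Stay logged in after signing in
# (login_url, login_data, and target_url are placeholders for your own values)
response1 = session.post(login_url, data=login_data)
response2 = session.get(target_url)  # cookies are sent automatically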

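8. Exception Handling and Logging

8.1 Complete Exception Handling

import requests
from requests.exceptions import RequestException, Timeout, ConnectionError

def safe_request(url, max_retries=3):
    """Request function with a retry mechanism."""
    for i in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
            return response
        except Timeout:
            print(f'Request timed out (attempt {i+1} of {max_retries})')
        except ConnectionError:
            print(f'Connection error (attempt {i+1} of {max_retries})')
        except RequestException as e:
            # Any other request error, including HTTPError, is not retried
            print(f'Request failed: {e}')
            return None
    
    print('All retries used up; request failed')
    return None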
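
The section title mentions logging, while the function above only prints. One way to combine retries with the standard logging module and exponential backoff is sketched here. retry_with_backoff, the logger name my_spider, and the flaky stand-in function are all assumptions; in a real crawler the callable would wrap requests.get.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger('my_spider')  # hypothetical logger name

def retry_with_backoff(func, max_retries=3, base_delay=1.0, exceptions=(Exception,)):
    """Call func(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_retries):
        try:
            return func()
        except exceptions as exc:
            logger.warning('Attempt %d/%d failed: %s', attempt + 1, max_retries, exc)
            if attempt < max_retries - 1:
                time.sleep(base_delay * 2 ** attempt)
    logger.error('All %d attempts failed', max_retries)
    return None

# Stand-in for a network call: fails twice, then succeeds
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated network failure')
    return 'ok'

result = retry_with_backoff(flaky, max_retries=3, base_delay=0.01)
print(result, calls['count'])
```

With Requests, a call might look like retry_with_backoff(lambda: requests.get(url, timeout=10), exceptions=(requests.exceptions.Timeout, requests.exceptions.ConnectionError)). Doubling the wait after each failure gives a struggling server more room to recover than retrying immediately.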

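9. Where to Go Next

Dynamic pages: use Selenium to drive a real browser
Asynchronous crawling: use aiohttp for higher concurrency
Crawling frameworks: Scrapy, a production-grade crawling framework
Data storage: MySQL and MongoDB databases
CAPTCHA recognition: OCR and CAPTCHA-solving services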

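10. Summary and Recommended Resources

Congratulations on finishing this introductory Python scraping tutorial! You have now covered:
✅ How a crawler works and its basic workflow
✅ Sending HTTP requests with Requests
✅ Parsing HTML with BeautifulSoup
✅ Basic strategies for dealing with anti-scraping measures
✅ A complete, hands-on Douban movie scraper

Recommended resources:
🔹 Requests documentation: https://requests.readthedocs.io
🔹 BeautifulSoup documentation: https://beautiful-soup-4.readthedocs.io
🔹 Scrapy framework: https://scrapy.org

A final piece of advice: scraping is learned by doing. Start with simple static sites, then work up to sites that require login or use anti-scraping measures. Along the way, make a habit of honoring robots.txt and throttling your requests, and be a responsible crawler developer.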