JsonItemExporter vs. JsonLinesItemExporter: Similarities and Differences When Saving Data

Week.D.Awn 8/18/2022

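# Preface

When persisting data in a pipeline of the scrapy crawling framework, the two ItemExporter classes you will most often reach for are JsonItemExporter and JsonLinesItemExporter. How they are used, and where they differ, is described below.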

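# Using JsonItemExporter

JsonItemExporter: writes all items into a single JSON array. The result is valid JSON, but a large array is expensive in memory to load back.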

# -*- coding: utf-8 -*-

from scrapy.exporters import JsonItemExporter

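class QsbkPipeline:

    def __init__(self):
        # The exporter writes bytes, so the file is opened in binary mode ('wb').
        # The text encoding is passed to the exporter instead of to open().
        self.fp = open('test.json', 'wb')

        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        # Writes the opening '[' of the JSON array.
        self.exporter.start_exporting()
        print('start...')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Writes the closing ']' so the file is valid JSON, then closes the file.
        self.exporter.finish_exporting()
        self.fp.close()
        print('end...')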
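
A JSON array only parses once its opening and closing brackets are in place, which is what the exporter's start_exporting() and finish_exporting() calls are for. The following is a minimal standard-library sketch of that framing. `ArraySketchWriter`, `export_all`, and the `author`/`content` fields are hypothetical names invented for illustration; this imitates the shape of the output and is not Scrapy's implementation.

```python
import io
import json


class ArraySketchWriter:
    """Hypothetical stand-in that imitates array-style framing (not Scrapy code)."""

    def __init__(self, fp):
        self.fp = fp
        self.first_item = True

    def start_exporting(self):
        self.fp.write(b"[")

    def export_item(self, item):
        # Separate items with a comma, but not before the first one.
        if self.first_item:
            self.first_item = False
        else:
            self.fp.write(b",")
        self.fp.write(json.dumps(item, ensure_ascii=False).encode("utf-8"))

    def finish_exporting(self):
        self.fp.write(b"]")


def export_all(items, framed=True):
    """Return the bytes written for `items`, with or without the brackets."""
    buf = io.BytesIO()
    writer = ArraySketchWriter(buf)
    if framed:
        writer.start_exporting()
    for item in items:
        writer.export_item(item)
    if framed:
        writer.finish_exporting()
    return buf.getvalue()


# Hypothetical sample items.
items = [
    {"author": "张三", "content": "hello"},
    {"author": "李四", "content": "world"},
]

framed = export_all(items, framed=True)
unframed = export_all(items, framed=False)

# The framed output parses back into the original list.
print(json.loads(framed.decode("utf-8")) == items)

# Without the brackets, two comma-separated objects are not a JSON document.
try:
    json.loads(unframed.decode("utf-8"))
except json.JSONDecodeError:
    print("unframed output is not valid JSON")
```

If a pipeline skips the start and finish calls, the file ends up in the unframed shape and a JSON parser rejects it as soon as it holds more than one item.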


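# Using JsonLinesItemExporter

JsonLinesItemExporter: one dictionary per line, so the file as a whole is not a single valid JSON document. Each item is written straight to the file as it arrives, so memory use stays low.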

# -*- coding: utf-8 -*-

from scrapy.exporters import JsonLinesItemExporter

class QsbkPipeline(object):

    def __init__(self):
        # JsonLinesItemExporter writes bytes, so the file must be opened in binary mode ('wb').
        # The text encoding is passed to the exporter instead of to open().
        self.fp = open('test.json', 'wb')

        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        print('start...')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('end...')
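
Because every line is a complete JSON object, a file in this format can be read back one record at a time. The following is a minimal standard-library sketch of that. `iter_json_lines` and the `author`/`content` fields are hypothetical names, and the in-memory `io.BytesIO` buffer stands in for a real file opened with `open('test.json', 'rb')`.

```python
import io
import json


def iter_json_lines(fp):
    """Yield one parsed object per non-empty line of a binary file object."""
    for raw in fp:
        line = raw.strip()
        if line:
            yield json.loads(line.decode("utf-8"))


# Stand-in for the exported file; the contents are made-up sample data.
sample = io.BytesIO(
    (
        '{"author": "张三", "content": "hello"}\n'
        '{"author": "李四", "content": "world"}\n'
    ).encode("utf-8")
)

records = list(iter_json_lines(sample))
for record in records:
    print(record["author"], record["content"])
```

Since the generator holds only one line at a time, memory use does not grow with the size of the file unless the caller collects everything into a list as this demo does.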


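# Differences

JsonItemExporter: the items are written as one JSON array. The advantage is that the finished file is valid JSON that any parser can load in a single call. The drawback is that the file is only valid once finish_exporting() has written the closing bracket, and with a large data set most JSON parsers have to hold the whole array in memory to read it back.
JsonLinesItemExporter: every call to export_item writes that item to the file as its own line. The drawback is that, with one dictionary per line, the file as a whole is not a single valid JSON document. The advantage is that every line is a complete record, so the file can be processed line by line with little memory, and the data already written stays usable if the crawl is interrupted.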
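
The difference in robustness can be illustrated with a small standard-library sketch. It does not use Scrapy at all: it builds the two output shapes by hand, cuts a few characters off the end of each to simulate an interrupted crawl, and then tries to read them back. The `id` field, `recover_lines`, and the five-character cut are all assumptions chosen for the demonstration.

```python
import json

# Hypothetical sample items.
items = [{"id": 1}, {"id": 2}, {"id": 3}]

# The two output shapes: one JSON array versus one object per line.
array_text = json.dumps(items)
lines_text = "".join(json.dumps(item) + "\n" for item in items)

# Simulate an interrupted write by dropping the last five characters.
cut_array = array_text[:-5]
cut_lines = lines_text[:-5]


def recover_lines(text):
    """Parse every line that is complete JSON and skip lines that fail to parse."""
    recovered = []
    for line in text.splitlines():
        try:
            recovered.append(json.loads(line))
        except json.JSONDecodeError:
            pass
    return recovered


# The truncated array cannot be parsed as a whole.
try:
    json.loads(cut_array)
    array_ok = True
except json.JSONDecodeError:
    array_ok = False

# The truncated line-based text still yields its complete lines.
recovered = recover_lines(cut_lines)

print(array_ok)
print(recovered)
```

In this sketch the truncated array is rejected outright, while the line-based text still gives back the first two records. That is the sense in which the line-based format is the safer choice for long crawls.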
