Python web scraping: crawling a recruitment site with the Scrapy framework and storing the results in MongoDB, explained
Date: 2021-12-03 09:27:22 | Category: Python code
Create the project
scrapy startproject zhaoping
Create the spider
cd zhaoping
scrapy genspider hr zhaopingwang.com
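Directory structure (as generated by the two commands above)
zhaoping/
    scrapy.cfg
    zhaoping/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            hr.py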

items.py
import scrapy


class TencentItem(scrapy.Item):
    # Fields filled in by the spider.
    title = scrapy.Field()
    position = scrapy.Field()
    publish_date = scrapy.Field()
pipelines.py
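from pymongo import MongoClient

# Connect once at import time; adjust host and port to your MongoDB server.
mongoclient = MongoClient(host='192.168.226.150', port=27017)
# Database "zhaoping", collection "hr".
collection = mongoclient['zhaoping']['hr']


class TencentPipeline(object):
    def process_item(self, item, spider):
        print(item)
        # The Item must be converted to a dict before it is inserted.
        # insert_one() replaces Collection.insert(), which was removed in PyMongo 4.
        collection.insert_one(dict(item))
        return item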
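Scrapy only calls a pipeline that is enabled in the project settings. A minimal sketch of the settings.py entry, assuming the project was created with scrapy startproject zhaoping, so that the class path is zhaoping.pipelines.TencentPipeline. The order value 300 is an arbitrary illustrative choice; lower numbers run first.

```python
# settings.py (fragment)
# Map each pipeline's import path to an order value; lower values run first.
ITEM_PIPELINES = {
    "zhaoping.pipelines.TencentPipeline": 300,
}
```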
spiders/hr.py
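import scrapy

from zhaoping.items import TencentItem


class HrSpider(scrapy.Spider):
    name = 'hr'
    # Values generated by genspider; point them at the site actually crawled
    # (the pagination below builds hr.tencent.com URLs).
    allowed_domains = ['zhaopingwang.com']
    start_urls = ['http://zhaopingwang.com/']

    def parse(self, response):
        # Skip the first and the last row.
        tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
        for tr in tr_list:
            item = TencentItem()
            # XPath indices start at 1.
            item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
            item["position"] = tr.xpath("./td[2]/text()").extract_first()
            item["publish_date"] = tr.xpath("./td[5]/text()").extract_first()
            yield item
        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        # Build the next-page URL; an href of "javascript:;" means there is no next page.
        if next_url and next_url != "javascript:;":
            print(next_url)
            next_url = "https://hr.tencent.com/" + next_url
            yield scrapy.Request(url=next_url, callback=self.parse)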
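The pagination step in the spider above can be isolated into a small helper, which makes its two edge cases easier to see: no link at all, or the placeholder javascript:; href. This is a minimal sketch, not the author's code: the helper name build_next_url and the sample hrefs in the usage note are hypothetical, and it uses the standard library's urljoin instead of string concatenation.

```python
from urllib.parse import urljoin

# Base URL taken from the spider above.
BASE_URL = "https://hr.tencent.com/"


def build_next_url(href, base=BASE_URL):
    """Return the absolute URL of the next page, or None if there is none."""
    # extract_first() returns None when the XPath matches nothing.
    if not href or href.startswith("javascript:"):
        return None
    # urljoin handles hrefs with or without a leading slash.
    return urljoin(base, href)
```

For example, build_next_url("position.php?start=10") yields an absolute URL under BASE_URL, while build_next_url("javascript:;") and build_next_url(None) both return None. Inside a Scrapy callback, response.urljoin(next_url) does the same join against the URL of the current response.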
That is all it takes: run scrapy crawl hr from the project directory and the scraped data is stored in MongoDB.

Article URL: http://www.codeinn.net/misctech/185665.html






