Preface

I found a meme/sticker site with no anti-scraping measures at all. The site hosts roughly 180,000 images. A plain synchronous crawler was far too slow: finishing everything would have taken several hours. So I switched to coroutines, and the whole crawl finished in about 40 minutes with the CPU pegged at 100% the entire time. After a round of refactoring, the code now works like a small framework that I can reuse for whatever comes up next. Heh.

Note: in an asynchronous program, it is best to make every blocking operation asynchronous. Nothing will raise an error if you don't, but the blocking calls will silently slow the whole program down.
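
A minimal sketch of why this matters (not part of the spider itself): five coroutines that call the blocking `time.sleep()` run one after another, because each one freezes the event loop, while five that `await asyncio.sleep()` overlap and finish in roughly the time of a single sleep.

```python
import asyncio
import time

async def blocking_task():
    time.sleep(0.2)           # blocking: freezes the whole event loop

async def async_task():
    await asyncio.sleep(0.2)  # non-blocking: other coroutines keep running

async def run_all(task_factory):
    # Run five copies of the task concurrently and time them.
    start = time.perf_counter()
    await asyncio.gather(*(task_factory() for _ in range(5)))
    return time.perf_counter() - start

blocking_time = asyncio.run(run_all(blocking_task))  # ~1.0 s: sleeps serialized
async_time = asyncio.run(run_all(async_task))        # ~0.2 s: sleeps overlap
print('blocking: %.2fs, async: %.2fs' % (blocking_time, async_time))
```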

Code

Both the network I/O and the file I/O are implemented asynchronously.

#! /usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Archerx
# @time: 2019/1/19 下午 03:56

import asyncio
import aiohttp
import uuid
import logging
import aiofiles


class AsyncSpider(object):
    def __init__(self, urls):
        self.URL = 'http://image.bee-ji.com/'
        self.HEADERS = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
        }
        self.SEMAPHORENUM = 200  # reserved for capping concurrent requests
        self.result = {}
        self.urls = urls
        self.THREADS = 200  # number of concurrent coroutines
        self.log_level = logging.DEBUG

    async def OtherFunc(self):
        pass

    async def GetUrl(self,session,url):
        pass

    async def SaveImage(self, content, ename):
        # Blocking version, kept for comparison:
        # with open('F:\\pycharm\\JsSpider\\images\\' + str(uuid.uuid1())[1:6] + '.' + ename.strip().split('/')[1], 'wb') as f:
        #     f.write(content)
        async with aiofiles.open('F:\\images\\' + str(uuid.uuid1())[1:6] + '.' + ename.strip().split('/')[1], 'wb') as f:
            await f.write(content)

    async def GetImage(self,url):
        async with aiohttp.ClientSession() as session:
            async with session.get(url=url,headers=self.HEADERS) as response:
                assert response.status == 200
                content = await response.read()
                return content,response.headers

    async def HandleTask(self, queue):
        while not queue.empty():
            url = await queue.get()
            try:
                print('start url: ' + url)
                content, headers = await self.GetImage(url=url)
                await self.SaveImage(content, headers.get('Content-Type'))
                print('saved successfully')
            except Exception:
                logging.error('HandleTask error', exc_info=True)


    def EventLoop(self):
        queue = asyncio.Queue()
        for url in self.urls:
            queue.put_nowait(url)
        loop = asyncio.get_event_loop()
        tasks = [asyncio.ensure_future(self.HandleTask(queue=queue)) for _ in range(self.THREADS)]
        # for task in tasks:                      # callbacks can be attached here
        #     task.add_done_callback(callback)
        loop.run_until_complete(asyncio.wait(tasks))
        loop.close()

def GenerateUrl():
    url_list = []
    for i in range(1,2000):
        url_list.append('http://image.bee-ji.com/'+str(i))
    return url_list

def callback(future):
    print(future.result())

if __name__ == '__main__':
    spider = AsyncSpider(GenerateUrl())
    logging.basicConfig(level = spider.log_level,format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    spider.EventLoop()
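
The class defines `SEMAPHORENUM` but never actually uses it; the concurrency above is bounded only by the number of worker coroutines. As a standalone sketch (not part of the original spider), here is how an `asyncio.Semaphore` could cap the number of in-flight requests, with a counter standing in for the real downloads to show that the cap holds:

```python
import asyncio

class SemaphoreDemo:
    def __init__(self, limit):
        self.semaphore = asyncio.Semaphore(limit)
        self.active = 0  # coroutines currently inside the semaphore
        self.peak = 0    # highest concurrency observed

    async def fetch(self, i):
        async with self.semaphore:  # at most `limit` coroutines get past here
            self.active += 1
            self.peak = max(self.peak, self.active)
            await asyncio.sleep(0.01)  # stand-in for a network request
            self.active -= 1

async def main():
    demo = SemaphoreDemo(limit=5)
    # 50 tasks compete, but no more than 5 run the "request" at once.
    await asyncio.gather(*(demo.fetch(i) for i in range(50)))
    return demo.peak

peak = asyncio.run(main())
print('peak concurrency:', peak)  # never exceeds 5
```

In the spider itself, the same `async with self.semaphore:` guard would wrap the `session.get()` call inside `GetImage`.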

The result looks roughly like this:

Want a meme battle with me? I'm the one holding 180,000 memes.
