Use requests to issue GET requests, and BeautifulSoup to parse the resulting HTML. Alternatively, urllib handles basic requests and parsing, and lxml is a great tool for parsing XML.
1 Basic
1.1 packages
urllib
is Python's official module, but not as efficient as the third-party downloader requests.
- We may need cookie, proxy, HTTPS, and redirect handlers to download the target webpage.
beautifulsoup
is the ideal parser for HTML; lxml can serve as its underlying parser.
import requests
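The handlers mentioned above can be assembled into a single opener with urllib's build_opener; a minimal sketch using only the standard library:

```python
import http.cookiejar
import urllib.request

# Assemble an opener with cookie, proxy, and redirect handling.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cj),   # stores and resends cookies
    urllib.request.ProxyHandler({}),          # empty dict disables proxies
    urllib.request.HTTPRedirectHandler(),     # follows 3xx redirects
)
urllib.request.install_opener(opener)         # urlopen() now uses this chain
```

After install_opener, every urllib.request.urlopen call in the process goes through the handler chain.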
1.2 requests
regular request
res = urllib.request.urlopen(url)
request with data and header
req = urllib.request.Request(url)
request with cookiejar
cj = http.cookiejar.CookieJar()
post
from urllib import parse
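Putting the pieces above together, a POST with form data and a custom header can be built like this (the target URL and form field are placeholder assumptions):

```python
from urllib import parse, request

url = "http://httpbin.org/post"  # hypothetical target URL
data = parse.urlencode({"wd": "python"}).encode("utf-8")
req = request.Request(
    url,
    data=data,                              # presence of data implies POST
    headers={"User-Agent": "Mozilla/5.0"},  # custom header
)
# request.urlopen(req) would send it and return the response object.
```

Note that urlencode produces a str, so it must be encoded to bytes before being attached as the request body.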
1.3 random user agents
User_Agents = [
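A minimal sketch of rotating user agents per request (the agent strings below are illustrative, not a definitive list):

```python
import random

User_Agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def random_headers():
    """Pick a random agent for each request to look less like a bot."""
    return {"User-Agent": random.choice(User_Agents)}
```

Pass the result as the headers argument of urllib.request.Request or requests.get.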
2 beautifulsoup
2.1 make soup
html = urlopen(url).read().decode('utf-8')
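A sketch of making a soup from a downloaded page; here an inline HTML string stands in for `urlopen(url).read()`:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for a downloaded page.
html = "<html><body><h1>Title</h1><a href='/next'>next</a></body></html>"
soup = BeautifulSoup(html, "html.parser")  # "lxml" also works if installed
print(soup.h1.get_text())                  # tag access by attribute name
```

The second argument selects the underlying parser; "html.parser" ships with Python, while "lxml" is faster but must be installed separately.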
2.2 re
import re
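Regular expressions complement soup when filtering attribute values; a small self-contained sketch (the URLs are illustrative):

```python
import re

# Keep only links pointing at baike.baidu.com.
pattern = re.compile(r"baike\.baidu\.com")
links = [
    "https://baike.baidu.com/item/Python",
    "https://www.baidu.com/s?wd=python",
]
matched = [u for u in links if pattern.search(u)]
```

A compiled pattern like this can also be passed directly to soup.find, as shown in the next subsection.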
2.3 soup find
wiki = soup.find(href=re.compile("baike.baidu.com"))
awiki = soup.find('div', class_='result c-container ')
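find accepts a tag name, keyword filters on attributes, and compiled regex patterns; a sketch on inline HTML (the markup is an assumption modeled on a Baidu result page):

```python
import re
from bs4 import BeautifulSoup

html = (
    "<div class='result c-container'>"
    "<a href='https://baike.baidu.com/item/X'>wiki</a>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
# Keyword filter with a regex matches on the attribute value.
wiki = soup.find("a", href=re.compile("baike.baidu.com"))
# class is a reserved word, so bs4 uses the class_ keyword.
box = soup.find("div", class_="result c-container")
```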
2.4 soup tag
A Tag stores its children like a list and its attributes like a dict.
print(wiki.attrs)
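A sketch of the dict-like attribute access on a Tag (the markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a href='/x' class='wiki'>link</a>", "html.parser")
tag = soup.a
print(tag.attrs)       # full attribute dict; class is multi-valued
print(tag["href"])     # single attribute lookup, like a dict
print(tag.get_text())  # the text content of the tag
```

Note that class is a multi-valued attribute, so bs4 stores it as a list of class names rather than a plain string.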
3 Crawler
Now we are able to write a fully functional crawler with just these fundamental tools. But to code like a sophisticated programmer, we must also consider error capture.
3.1 urlopen error
try:
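urlopen raises URLError for DNS and connection failures and its subclass HTTPError for bad status codes; a sketch that catches both (the .invalid domain is reserved and never resolves):

```python
import urllib.error
import urllib.request

def fetch(url):
    """Return page bytes, or None when the download fails."""
    try:
        return urllib.request.urlopen(url, timeout=10).read()
    except urllib.error.HTTPError as e:   # subclass: catch before URLError
        print("server error:", e.code)
    except urllib.error.URLError as e:
        print("unreachable:", e.reason)
    return None

page = fetch("http://nonexistent.invalid/")
```

HTTPError must be caught first, since it is a subclass of URLError.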
3.2 soup check
try:
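find() returns None when nothing matches, so check before dereferencing; a sketch using a None check (an equivalent alternative to try/except AttributeError):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>no links here</p>", "html.parser")
tag = soup.find("a")          # no <a> tag in this page
if tag is None:
    title = "N/A"             # fall back instead of raising AttributeError
else:
    title = tag.get_text()
```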
3.3 crawler framework
- crawler_main.py
- url_manager.py
- html_downloader.py
- html_parser.py
- meta_storage.py
3.4 url_manager
class UrlManager(object):
import urllib.request
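A minimal UrlManager can track pending and crawled URLs with two sets; a sketch under that design (the method names are assumptions):

```python
class UrlManager(object):
    """Track URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()  # pending
        self.old_urls = set()  # already crawled

    def add_new_url(self, url):
        """Queue a URL unless it was seen before."""
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        """Move one URL from pending to crawled and return it."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Using sets makes the duplicate check O(1), which matters once the crawler has seen many pages.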
3.5 html_parser
from bs4 import BeautifulSoup
3.6 crawler_main
# coding:utf-8
3.7 meta_storage
class HtmlOutputer(object):
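HtmlOutputer can collect parsed records in memory and flush them at the end; a sketch (the record fields and output path are assumptions):

```python
class HtmlOutputer(object):
    """Collect crawled records and dump them to a plain-text file."""

    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        """Store one parsed record; ignore failed parses."""
        if data is not None:
            self.datas.append(data)

    def output_txt(self, path="output.txt"):  # hypothetical filename
        with open(path, "w", encoding="utf-8") as f:
            for data in self.datas:
                f.write("%s\t%s\n" % (data.get("title", ""),
                                      data.get("url", "")))
```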
4 PhantomJS
A webpage may obfuscate its data with JS code, while soup can only parse the HTML page. When the data lives in window._DATA_, use a headless browser such as CasperJS or PhantomJS to render it.
If a webpage requires login before a GET, also use a headless browser like PhantomJS or headless Chrome. Alternatively, we can write code to decode the JS ourselves, or build a light browser.
brew tap homebrew/cask
5 selenium
Selenium drives a real browser to execute the JS, which is a little slow. Here is the demo.
pip3 install selenium
6 scrapy
7 query movie info from dianying.fm
import requests
8 query movie info from movie.douban.com
from selenium import webdriver
9 query movie info from baidu baike
9.1 baike lemmaWgt
import requests
9.2 dict 2 csv/txt
import requests
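Writing the collected dicts to CSV is one csv.DictWriter call per batch; a sketch (the rows and fieldnames are illustrative assumptions):

```python
import csv
import io

rows = [
    {"title": "Python", "url": "https://baike.baidu.com/item/Python"},
    {"title": "Crawler", "url": "https://baike.baidu.com/item/Crawler"},
]

# io.StringIO stands in for open("out.csv", "w", newline="") on a real file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "url"])
writer.writeheader()
writer.writerows(rows)
```

For a plain txt dump, iterating the dicts and writing tab-separated values by hand works just as well.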