
Crawling Data From HTML-Based Pages

2020/01/17

This practice project crawls published infectious disease data from the official website of the Shanghai health department (上海市卫生与计划生育委员会, the Shanghai Municipal Commission of Health and Family Planning).

This project was finished in 2016, and I don’t know whether their website has been updated since then. The department’s name has already changed from “卫计委” (Health and Family Planning Commission) to “卫健委” (Health Commission).

However, the code should still be valuable.

Basic Processes

  • On the official site, there is a dedicated page that archives all historical reports.

[Figure 1: a screenshot of the report archive page, taken in 2016.]

From this screenshot, we can see that the report list spans four pages. So the following steps are clear:

  • Save the HTML files of the four pages.

  • Extract the report URLs from the four pages by matching the pattern in which they are stored.

Finally, all URLs are recognized and downloaded:

[Figures 2 and 3: the lists of extracted URLs.]

  • The next step is to download the HTML page of every URL; that is where the data lives.

[Figure 4: a snapshot of one of the report pages.]

[Figures 5 and 6: the HTML source of one report page.]

  • The last step is recognizing the patterns of the data and saving the extracted values.

Maybe due to staff changes, the data appears in three different patterns; a sketch of how such a case can be handled follows at the end of this section. (I found this out only after a huge number of attempts (^o^;).)

But, finally, I made it.

[Figure 7: the extracted data.]
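
The exact three patterns are not preserved here (as noted at the end of this post, the code for the 3rd pattern was lost), but the general approach is to compile one regular expression per known report format and try them in order. The pattern strings in this sketch are made-up placeholders, not the real ones from the site:

import re

# Made-up placeholder patterns -- the real three formats came from the reports themselves.
patterns = [
    re.compile(r'甲类传染病\s*(\d+)\s*例'),  # e.g. "甲类传染病5例"
    re.compile(r'甲类\s*(\d+)\s*例'),
    re.compile(r'Class A\D*(\d+)'),
]

def extract_count(html):
    # try each known report format in turn and return the first number found
    for p in patterns:
        m = p.search(html)
        if m:
            return int(m.group(1))
    return None  # the page follows none of the known formats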

Core Codes

  • To access the server and request the HTML, we should first let our program pretend to be a browser; this means sending a browser-like set of request headers:
import requests

for i in range(0, 2):
    # build the URL of the i-th report list page (adjust the range to the number of pages)
    url = 'http://www.hnwst.gov.cn/cms/showsubpage.jsp?ocid=363&ncid=363&pno=' + str(i) + '.html'

    # headers copied from a real browser session, so the server treats us as a browser
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Referer": 'http://www.wsjsw.gov.cn/wsj/n429/n426/index.html',
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36",
    }

    tempp = i + 1
    r = requests.get(url, headers=headers)

    # save the raw bytes of the page; 'wb' (binary mode) matches r.content (bytes)
    f = open("D:\\pypj\\wjwhn\\page" + str(tempp) + ".txt", 'wb')
    f.write(r.content)
    f.close()
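
A practical note on saving the pages: writing r.content in binary mode keeps the original bytes, but if you want decoded text instead, be aware that requests has to guess the encoding, and the guess is often wrong for older Chinese government pages. A minimal sketch of pinning the encoding explicitly (the 'gbk' value is my assumption, not something verified against the site):

import requests

r = requests.get('http://www.hnwst.gov.cn/cms/showsubpage.jsp?ocid=363&ncid=363&pno=0.html')
r.encoding = 'gbk'  # assumed page encoding; check the page's <meta charset> to be sure
with open("D:\\pypj\\wjwhn\\page1.txt", 'w', encoding='utf-8') as f:
    f.write(r.text)  # text decoded as GBK, saved back as UTF-8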
  • After we get the HTML files, we need to find the pattern and extract the “matched strings”.

Please note that this technique is called ‘regular expressions’, ‘regexp’, or ‘正则表达式’ (the Chinese term for regular expressions). There is a more advanced Python library for parsing HTML called ‘Beautiful Soup’.
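
For comparison, the same link extraction could be done with Beautiful Soup instead of a regular expression. A sketch, assuming the package is installed (pip install beautifulsoup4) and that report links can be recognized by the ‘u1ai’ marker in their href:

from bs4 import BeautifulSoup

with open("D://pypj//wjw//page1.txt", errors='ignore') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# keep the href of every <a> tag that looks like a report link
links = [a['href'] for a in soup.find_all('a', href=True) if 'u1ai' in a['href']]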

To understand the meaning of ‘\d’ or ‘\w’, please look up a regexp tutorial yourself.
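
As a quick toy illustration of what those two tokens match (my own example, not from the project):

import re

print(re.findall(r'\d+', 'u1ai85321.html'))  # ['1', '85321'] -- \d matches a single digit
print(re.findall(r'\w+', 'n429/n426'))       # ['n429', 'n426'] -- \w matches letters, digits, and '_'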

import re

# read one saved list page back in
f = open("D://pypj//wjw//page1.txt")
line1 = f.readlines()
f.close()

# every report URL looks like /wsj/nXXX/nXXX/u1aiNNN.html
patternu = re.compile(r'/wsj/n\d\d\d/n\d\d\d/u1ai\d+\.html')
result1u = re.findall(patternu, str(line1))
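
Extending this to all four saved list pages, and then downloading every report page that was found, might look like the sketch below. Two assumptions on my part: the matched paths are relative to www.wsjsw.gov.cn (the host in the Referer header earlier), and the list pages were saved as page1.txt through page4.txt as in the download loop above:

import re
import requests

patternu = re.compile(r'/wsj/n\d\d\d/n\d\d\d/u1ai\d+\.html')
headers = {"User-Agent": "Mozilla/5.0"}  # or reuse the full browser headers from earlier

# collect the unique report URLs from the four saved list pages
all_urls = []
for n in range(1, 5):
    with open("D://pypj//wjw//page" + str(n) + ".txt", errors='ignore') as f:
        for u in re.findall(patternu, f.read()):
            if u not in all_urls:
                all_urls.append(u)

# download every report page, one local file per report
for k, path in enumerate(all_urls):
    r = requests.get('http://www.wsjsw.gov.cn' + path, headers=headers)
    with open("D://pypj//wjw//report" + str(k) + ".txt", 'wb') as f:
        f.write(r.content)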

To see the full code of this program, please head over to my GitHub repository.


Please use get-pip.py to install pip, which you will need in order to install the requests package.

Since this is a clean-up of code I wrote 4 years ago, the code for the 3rd pattern has been lost. But luckily, there is enough code left to carry out the crawling process.

CATALOG
  1. Basic Processes
  2. Core Codes