web_crawl()


Description:

Extract data from web pages.

Syntax:

web_crawl(jsonStr)

Note:

This is an external library function; it extracts data from web pages according to a JSON-style rule string.

Parameters:

jsonStr

The string that defines rules for traversing URLs, downloading pages, and extracting and saving the desired data.

Note these details, which are prone to cause parsing errors: brackets [] under a node represented by braces {} supply a list, while braces {} under a brace-represented node describe a structure of mapping keys.
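As an illustration of that nesting convention, the sketch below builds the rule as ordinary Python structures and only then serializes it, so brackets and braces cannot end up mismatched. The key names and values are taken from the example later on this page; note that standard JSON quoting is used here, whereas the SPL example writes keys unquoted.

```python
import json

# The rule string is a list ([]) of single-key maps ({}).
# A [] value under a {} node supplies a list of items (e.g. several
# start URLs); a {} value under a {} node describes a sub-structure
# of mapping keys (e.g. the extraction settings).
rule = [
    {"web_info": {"save_path": "d:/tmp/data", "save_post": "false"}},
    {"init_url": ["http://www.aigaogao.com/tools/history.html?s=600000"]},
    {"page_url": {"extractby": "//div[@id='ctl16_contentdiv']/",
                  "class": "default"}},
]
json_str = json.dumps(rule)  # the string to pass to web_crawl()
```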

Explanation of the rule string:

web_info: Information about the website to be downloaded, including the domain name, the local storage location, user-agent information, and user-defined applications;

init_url: Specifies the initial URL, which is the entry point for the URL traversal;

help_url: Specifies rules for web pages whose content is scanned to collect URLs, but from which no data is extracted;

target_url: Specifies rules for the pages to be downloaded, from which URLs are both collected and data is extracted;

page_url: Specifies the data extraction rule according to which the data downloaded by target_url is parsed.
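How the five sections might fit together in one rule can be sketched as follows. This is a hypothetical layout, not a tested rule: only the web_info, init_url, and page_url key names appear in the example on this page, and the URL patterns given for help_url and target_url are placeholders.

```python
import json

# Hypothetical end-to-end rule. The crawl starts at init_url; pages
# matching help_url are scanned for further links only; pages matching
# target_url are downloaded and handed to the page_url rule for
# data extraction.
rule = [
    {"web_info": {"domain": "www.example.com", "save_path": "d:/tmp/data"}},
    {"init_url": ["http://www.example.com/list/1.html"]},
    {"help_url": ["/list/"]},      # collect URLs only, extract nothing
    {"target_url": ["/detail/"]},  # collect URLs and extract data
    {"page_url": {"extractby": "//div[@id='content']/"}},
]
json_str = json.dumps(rule)
```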

Return value:

A Boolean value

Example:

 	A	 
1	[{web_info:{save_path:'d:/tmp/data',save_post:'false'}},{init_url:['http://www.aigaogao.com/tools/history.html?s=600000']},{page_url:{extractby:"//div[@id='ctl16_contentdiv']/",class:'default'}}]	The JSON string defining a rule of data extraction
2	=web_crawl(A1)	Extract data from the web pages
3	=file("D:/tmp/data/600000.txt").import@cqt()	Import the extracted web data saved in the local file
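The final cell's import@cqt() reads the saved file back as a table, assuming the usual meanings of the option letters (@c comma-separated fields, @t title row, @q strip quotation marks). A rough Python equivalent of that parsing, using an inline sample in place of D:/tmp/data/600000.txt (the real file's column names depend on the crawled page and are hypothetical here):

```python
import csv
import io

# Stand-in for the saved crawl output; the real columns are unknown.
sample = '"Date","Close"\n"2020-01-02","10.50"\n'

# csv.DictReader: first row as field names, comma delimiter, quotation
# marks stripped - analogous to import@cqt() on the saved file.
rows = list(csv.DictReader(io.StringIO(sample)))
```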