web_crawl()


Description:

Extract data from websites.

Syntax:

web_crawl(jsonstr)

Parameters:

jsonstr

A string that defines rules for extracting specific data from websites.

Note:

The function visits URLs iteratively to download, extract and save specific data.

The jsonstr parameter defines the rules for data extraction in five aspects: website information, initial URLs, eligible URLs, target web pages and data extraction. Their functions and uses are explained below:
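As a rough sketch of how the five parts fit together (the site www.mysite.com and all rule values below are made up for illustration only; the real rules are explained in the rest of this section), a complete jsonstr is a list that strings the five aspects together:

[{web_info:{domain:'www.mysite.com', save_path:'D:/tmp/data', thread_size:2}},
{init_url:['http://www.mysite.com/list_1.html']},
{help_url:['list_\d+\.html']},
{target_url:{filter:'list_\d+\.html', reg_url:'item_(\d+)\.html', new_url:'http://www.mysite.com/detail_%s.html'}},
{page_url:{extractby:"//div[@id='content']/text()"}}]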

 

1. Website information: The web_info property sets the domain name, local storage path, user agent, user-defined application classes and other information for extracting data from a specific website.

The property covers the following parameters (a combined example follows the list):

domain

Domain name.

save_path

The path where the extracted data will be stored.

user_agent

User agent information. It helps the server identify the OS type and version, CPU type, browser type and version, browser rendering engine, browser language, browser add-ins, and so on.

sleep_time

The time interval between data extraction actions.

cycle_retry_times

The number of retries.

charset

Character set.

use_gzip

Whether to use Gzip or not.

time_out

Data extraction timeout.

cookie_name

Cookie information stored as key-value pairs.

thread_size

Number of threads for performing data extraction.

save_post

Whether or not to encode the file name so that an existing file with the same name is not overwritten when saving the extracted data. The default is true. For instance, if books/a.html and music/a.html are both pages to be downloaded and the parameter value is true, the extracted data is saved as a_xxxcgk.txt and a_xabcdw.txt, so existing files with the same name are not overwritten; if the value is false, both results are saved as a.txt and the existing file with the same name is overwritten.

class_name

A user-defined storage class.

class_argv

A string parameter passed to the class_name class.
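For instance, a web_info definition might look like this (the values are taken from the examples later on this page and are illustrative only):

{web_info:{domain:'www.banban.cn', save_path:'D:/tmp/data/webmagic', thread_size:2,
user_agent:'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0',
class_name:'com.web.StockPipeline', class_argv:'stock'}}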

 

2. Initial URLs: init_url provides the entry URLs for website traversal. It contains one or more URLs in a list structure.
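For example (copied from the first example later on this page):

{init_url:['https://www.banban.cn/gupiao/list_cybs.html', 'https://www.banban.cn/gupiao/list_sh.html']}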

 

3. Eligible URLs: help_url defines which URLs are eligible by filtering out unwanted URLs and adding eligible ones to the download list; no data is extracted from these pages themselves. The filtering supports regular expressions. For example:
  gupiao/list_(sh|sz|cyb)\.html matches a URL containing the string gupiao/list_sh.html, gupiao/list_sz.html or gupiao/list_cyb.html.

The property's value is a list, so multiple rules can be defined.
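For example (copied from the first example later on this page), a URL matching any of the three patterns below is added to the download list:

{help_url:['gupiao/list_(sh|sz|cyb)\.html', '/shujv/zhangting/', '/agu/$']}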

 

4. Target web pages: target_url defines the web pages from which data will be extracted. If a URL matches a rule defined in help_url, the desired URLs are collected from that page.

The syntax is as follows:
 
{target_url:{filter: pageUrl, reg_url:urlRegex, new_url:newUrl}}
On pages whose URLs match the pageUrl condition, it finds the href links that match the urlRegex condition. If newUrl is present, it is combined with the filtering result of urlRegex to form a new URL.
For example, suppose a page contains the link a_100.html, which matches the filter regular expression reg_url=a_(\d+)\.html, and newUrl=b_%s.php. The result of filtering a_100.html with urlRegex is 100, which is combined with newUrl to form the new download page b_100.php.


filter defines the rule for filtering URLs; if filter is absent, the rule matches all URLs. reg_url defines the rule for collecting URLs and must not be omitted; if it is omitted, the target_url search makes no sense.

new_url defines a new web page URL that is formed by combining new_url with the filtering result of reg_url.

 

Below are several cases:
1) Rule definition: {target_url:{filter:'gupiao/list_(sh|sz|cyb)\.html', reg_url:'gupiao/([sz|sh]?6000\d{2})/',new_url:'http://www.raqsft.com/history.html?s=%s'}}
The content of download page gupiao/list_sh.html is as follows:
<li><a href="/gupiao/600010/">Baotou Steel(600010)</a></li>

<li><a href="/gupiao/600039/">SRBG(600039)</a></li>

<li><a href="/gupiao/600048/">Poly Developments and Holdings(600048)</a></li>

A. gupiao/list_sh.html matches the filter condition;
B. The href strings match the reg_url condition and generate [600010, 600039, 600048];
C. The filtering results are combined with new_url to generate the new URLs:
http://www.raqsft.com/history.html?s=600010
http://www.raqsft.com/history.html?s=600039
http://www.raqsft.com/history.html?s=600048

The symbol %s in new_url is the placeholder for the filtering result in the combined string.

2) Rule definition: {target_url:{reg_url:'/gupiao/60001\d'}}

The content of download page gupiao/list.html is as follows:
<li><a href="/gupiao/600010/">
Baotou Steel (600010)</a></li>

<li><a href="/gupiao/600039/"> SRBG (600039)</a></li>

<li><a href="/gupiao/600048/"> Poly Developments and Holdings (600048)</a></li>

The href matching the reg_url condition is:

http://www.xxx.com/gupiao/600010/
The other two hrefs are not eligible.
By setting the filter condition, we filter away unwanted URLs with help_url and collect the desired ones from the right web pages more efficiently.

target_url can define multiple rules to satisfy different requirements.

 

5. Data extraction: The page_url property defines data extraction rules; it extracts data from the downloaded target_url pages and saves the results as files. The extraction rules are written as XPath expressions. The property only extracts a summary; detailed data is handled by the className class.
  Rule definition syntax is as follows:
 
{page_url:{filter: pageUrl, extractby: contentReg, class: className }}
  filter defines the rule for filtering URLs; if filter is absent, the rule matches all target_url pages.
  extractby defines the rule for extracting content from a web page. class means that the extraction is performed by the className class; className="default" means extraction using the current default method, that is, extracting data from the table named table. To meet specific requirements, you can define your own classes to do the extraction; relevant details are explained later.
For example, extractby:"//div[@class='news-content']/text()" extracts the text under the specified node from the web page.

page_url can define specific rules for different pages and extract data from web pages after URL filtering. This can reduce the number of URLs to be handled and increase speed, especially when there are a large number of URLs.

If extractby is absent, all content of the target_url page is extracted.
If multiple page_url rules are defined, only the first matching rule is applied.
For example, if the content of web page A matches rules R1, R2 and R3, data extraction is based on R1 rather than R2 and R3.
If target_url is absent but there is a matching page_url rule for the current web page, the content of this page is also extracted.
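As a combined sketch (the filter and XPath values are copied from the examples later on this page, and the class entry assumes the user-defined com.web.StockHistoryData class described in the user-defined application section below):

{page_url:{filter:'history.html\?s=\d{6}', extractby:"//div[@id='ctl16_contentdiv']/", class:'com.web.StockHistoryData'}}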

Note: Pay special attention to certain syntax details in these JSON strings. Brackets [] inside a node {} represent a list; braces {} inside a node {} represent a map-style key-value structure. Wrong syntax may lead to a wrong parsing result.
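For example, in the two rules below (copied, slightly abridged, from the second example later on this page), the value of help_url is a list enclosed in brackets, while the value of web_info is a key-value map enclosed in braces:

{help_url:['textFC/(hb|hn)\.shtml']}
{web_info:{domain:'www.weather.com.cn/', save_path:'D:/tmp/data/weather', thread_size:2}}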

Return value:

Boolean value

Example:

 

 

A

 

1

[{web_info:{domain:'www.banban.cn', save_path:'D:/tmp/data/webmagic', thread_size:2,

user_agent:'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0'}},

{init_url:['https://www.banban.cn/gupiao/list_cybs.html', 'https://www.banban.cn/gupiao/list_sh.html']},

{help_url:['gupiao/list_(sh|sz|cyb)\.html', '/shujv/zhangting/', '/agu/$']},

{target_url:{reg_url:'/agu/365\d'}},

{target_url:{filter:'gupiao/list_(sh|sz|cyb)\.html', reg_url:'gupiao/[sz|sh]?(60000\d)/',new_url:'http://www.aigaogao.com/tools/history.html?s=%s'}},

{page_url:{filter:'history.html\?s=\d{6}', extractby:"//div[@id='ctl16_contentdiv']/"}},

{page_url:{filter:'history.html\?s=[sz|sh]?\d{6}', extractby:"//div[@id='contentdiv']/"}},

{page_url:{extractby:"//div[@id='content_all']/"}},

{page_url:{filter:"/agu/365\d", extractby:"//div[@id='content']/"}}]

Extract historical stock data. The download site is http://www.banban.cn. From the web pages https://www.banban.cn/gupiao/list_xxx.html, help_url extracts the stock codes registered on the SSE, SZSE and second board; the codes are combined with http://www.aigaogao.com/tools/history.html?s=%s to generate the to-be-downloaded target_url pages, from which the content is extracted.

2

=web_crawl(A1)

The extraction results are stored as txt files in D:\tmp\data\webmagic\www.banban.cn:

 

 

A

 

1

[{web_info: {domain:'www.weather.com.cn/',save_path:'D:/tmp/data/weather',thread_size:2,user_agent:'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0'}},{init_url:['http://www.weather.com.cn/textFC/hb.shtml','http://www.weather.com.cn/textFC/hn.shtml','http://www.weather.com.cn/textFC/db.shtml']},{help_url:['textFC/(hb|hn)\.shtml']},{target_url:{filter:'textFC/(hb|hn)\.shtml',reg_url:'101090\d{3}',new_url:'http://www.weather.com.cn/weather/%s.shtml'}},{page_url:{filter:'/weather/1010902\d{2}', extractby:"//div[@class='curve_livezs']//*"}}]

Extract weather forecast data

2

=web_crawl(A1)

The extraction results are stored as txt files in D:\tmp\data\weather\www.weather.com.cn:

 

User-defined application interface:

 esProc only supports extracting data from tables inside an HTML page. But in real-world businesses there are various types of web pages, so users need to write their own programs to do the extraction based on the specific pages. For that, the following interfaces are provided:

 

1. Data extraction application interface

The content of a downloaded page can have various layouts, so users can define their own data extraction program.
The interface definition:

package com.web;

import us.codecraft.webmagic.Page;

public interface StandPageItem {
    // data extraction handling
    void parse(Page p);
}

You need to implement parse(Page p) of the com.web.StandPageItem interface to do the extraction.

 

2. Data storage application interface

 

There are various ways of storing the extracted data, so users can write their own storage program.
The interface definition:
package com.web;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public interface StandPipeline extends Pipeline {
    public void setArgv(String argv);
    public void process(ResultItems paramResultItems, Task paramTask);
}
You need to implement setArgv() and process() of the com.web.StandPipeline interface.

setArgv() is the interface for passing in parameters (the class_argv string); process() is the interface for handling data storage.

 

3. Data extraction program example

 

To implement parse(Page p) of the com.web.StandPageItem interface, we can use the following program:

package com.web;

import java.util.List;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.selector.Selectable;

public class StockHistoryData implements StandPageItem {
    @Override
    public void parse(Page page) {
        StringBuilder buf = new StringBuilder();
        // select the table rows from the downloaded page
        List<Selectable> nodes = page.getHtml().xpath("table/tbody/").nodes();
        for (Selectable node : nodes) {
            String day = node.xpath("//a/text()").get();
            // collect the link text and the cell values of the row
            List<String> title = node.xpath("//a/text() | tr/td/text()").all();
            if (title.size() < 5) continue;
            String line = title.toString().replaceFirst(", , ", ", ");
            buf.append(line + "\n");
        }
        // store the extracted text in the content field of the page
        page.putField("content", buf.toString());
    }
}

The extracted data is stored in the content field of page, from which it is retrieved and saved later.

 

4. Data storage program example

 

To implement setArgv() and process() of the com.web.StandPipeline interface, we can use the following program:
package com.web;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import org.apache.commons.codec.digest.DigestUtils;
import us.codecraft.webmagic.utils.FilePersistentBase;

public class StockPipeline extends FilePersistentBase implements StandPipeline {
    private Logger logger = LoggerFactory.getLogger(getClass());
    private String m_argv;
    private String m_path;
    public static String PATH_SEPERATOR = "/";

    static {
        String property = System.getProperties().getProperty("file.separator");
        if (property != null) {
            PATH_SEPERATOR = property;
        }
    }

    public StockPipeline() {
        m_path = "/data/webcrawl";
    }

    // Get the storage path and the prefix of the saved file name
    public void setArgv(String argv) {
        m_argv = argv;
        if (m_argv.indexOf("save_path=") >= 0) {
            String[] ss = m_argv.split(", ");
            m_path = ss[0].replace("save_path=", "");
            m_argv = ss[1];
        }
    }

    public void process(ResultItems resultItems, Task task) {
        String saveFile = null;
        Object o = null;
        String path = this.m_path + PATH_SEPERATOR + task.getUUID() + PATH_SEPERATOR;
        try {
            do {
                String url = resultItems.getRequest().getUrl();
                o = resultItems.get("content");
                if (o == null) {
                    break;
                }

                int start = url.lastIndexOf("/");
                int end = url.lastIndexOf("?");
                if (end < 0) {
                    end = url.length();
                }

                String link = url.substring(start + 1, end);
                if (m_argv != null && !m_argv.isEmpty()) {
                    link = m_argv + "_" + link;
                }
                if (link.indexOf(".") >= 0) {
                    link = link.replace(".", "_");
                }
                // Add md5Hex to avoid overwriting a file with the same name
                String hex = DigestUtils.md5Hex(resultItems.getRequest().getUrl());
                saveFile = path + link + "_" + hex + ".json";
            } while (false);
            if (saveFile != null) {
                PrintWriter printWriter = new PrintWriter(new FileWriter(getFile(saveFile)));
                printWriter.write(o.toString());
                printWriter.close();
            }
        } catch (IOException e) {
            logger.warn("write file error", e);
        }
    }
}

 

How to use the user-defined application

 

Compile the above interface files and Java files, package them into webStock.jar, put the jar in esProc\extlib\webcrawlCli and restart esProc. Configure the data storage application in web_info and the data extraction application in page_url. Below is an example dfx script that loads the two user-defined classes:

 

 

A

 

1

[{web_info:{domain:"www.banban.cn", save_path:"d:/tmp/data/webmagic", thread_size:2,

user_agent:"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0",

class_name:'com.web.StockPipeline',class_argv:'stock'}},

{init_url:["https://www.banban.cn/gupiao/list_cybs.html", "https://www.banban.cn/gupiao/list_sh.html"]},

{help_url:["gupiao/list_(sh|sz|cyb)\.html", "/shujv/zhangting/"]},

{target_url:{filter:"gupiao/list_(sh|sz|cyb)\.html", reg_url:'gupiao/[sz|sh]?(\d{6})/',new_url:"http://www.aigaogao.com/tools/history.html?s=%s"}},

{page_url:{filter:"history.html\?s=\d{6}", extractby:"//div[@id='ctl16_contentdiv']/",
class:'com.web.StockHistoryData'}}]

 

2

=web_crawl(A1)

Extracted data in storage: