Accessing Webcrawl

Through the interface embedded in the esProc designer, users can access the WebcrawlCli external library to extract specific data from websites. To deploy the external library, follow these steps:

 

1. Find the following jars on the web and download them into esProc’s external library directory: installation directory\esProc\extlib\WebcrawlCli. The Raqsoft core jar for this external library is WebcrawlCli.jar.

accessors-smart-1.2.jar

asm-5.0.4.jar

assertj-core-1.5.0.jar

commons-codec-1.9.jar

commons-collections-3.2.2.jar

commons-io-1.3.2.jar

commons-lang3-3.1.jar

commons-logging-1.2.jar

commons-pool2-2.4.2.jar

fastjson-1.2.28.jar

filename.bat

hamcrest-core-1.3.jar

httpclient-4.5.2.jar

httpcore-4.4.4.jar

jedis-2.9.0.jar

json-path-2.4.0.jar

json-smart-2.3.jar

jsoup-1.10.3.jar

junit-4.11.jar

log4j-1.2.17.jar

slf4j-api-1.7.6.jar

slf4j-log4j12-1.7.6.jar

webmagic-core-0.7.3.jar

webmagic-extension-0.7.3.jar

webStock.jar

xsoup-0.3.1.jar

Note: The third-party jars are provided within the package and users can choose appropriate ones for specific scenarios.
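After copying the files, it can help to confirm that nothing is missing from the directory. The following is a minimal sketch, not part of esProc: the function name missing_jars is hypothetical, the jar names are copied from the list above (only the .jar entries are checked), and the directory path must be supplied by the user.

```python
from pathlib import Path

# Jar names copied from the list above, plus the core WebcrawlCli.jar.
REQUIRED_JARS = [
    "WebcrawlCli.jar",
    "accessors-smart-1.2.jar",
    "asm-5.0.4.jar",
    "assertj-core-1.5.0.jar",
    "commons-codec-1.9.jar",
    "commons-collections-3.2.2.jar",
    "commons-io-1.3.2.jar",
    "commons-lang3-3.1.jar",
    "commons-logging-1.2.jar",
    "commons-pool2-2.4.2.jar",
    "fastjson-1.2.28.jar",
    "hamcrest-core-1.3.jar",
    "httpclient-4.5.2.jar",
    "httpcore-4.4.4.jar",
    "jedis-2.9.0.jar",
    "json-path-2.4.0.jar",
    "json-smart-2.3.jar",
    "jsoup-1.10.3.jar",
    "junit-4.11.jar",
    "log4j-1.2.17.jar",
    "slf4j-api-1.7.6.jar",
    "slf4j-log4j12-1.7.6.jar",
    "webmagic-core-0.7.3.jar",
    "webmagic-extension-0.7.3.jar",
    "webStock.jar",
    "xsoup-0.3.1.jar",
]


def missing_jars(extlib_dir, required=REQUIRED_JARS):
    """Return the required file names not found in extlib_dir, in list order.

    A directory that does not exist is treated as empty, so every
    required name is reported as missing.
    """
    directory = Path(extlib_dir)
    if directory.is_dir():
        present = {entry.name for entry in directory.iterdir()}
    else:
        present = set()
    return [name for name in required if name not in present]
```

Calling missing_jars with the path to the WebcrawlCli directory returns an empty list when every listed jar is in place. Since the note above says users can choose the jars appropriate to their scenario, a shorter list can be passed as the second argument.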

 

2. A JRE of version 1.7 or above is required. The JRE embedded in esProc is version 1.6, so users need to install a higher version and configure java_home in config.txt, located at installation directory\esProc\bin\config.txt. If a JDK of version 1.7 or above was chosen when installing esProc, skip this step.
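To check whether a given Java installation meets the 1.7 requirement before pointing java_home at it, one option is to inspect the output of java -version. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the usual version-string layout, where legacy releases report "1.x" (for example 1.6.0_45) and newer releases report the feature number first (for example 11.0.2).

```python
import re
import subprocess


def java_feature_version(version_output):
    """Extract the feature version (6, 7, 8, 11, ...) from `java -version` text."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found in the given text")
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    # Legacy scheme: "1.7.0_80" means Java 7. Newer scheme: "11.0.2" means Java 11.
    return minor if major == 1 else major


def meets_requirement(version_output, minimum=7):
    """Return True when the reported Java version is at least `minimum`."""
    return java_feature_version(version_output) >= minimum


def read_java_version(java_executable="java"):
    """Run `<java_executable> -version` and return its combined output.

    `java -version` writes to stderr, so both streams are captured.
    """
    result = subprocess.run(
        [java_executable, "-version"], capture_output=True, text=True
    )
    return result.stderr + result.stdout
```

For example, meets_requirement(read_java_version()) reports whether the java executable found on the system path satisfies the requirement. Passing the full path of another java executable to read_java_version checks that installation instead.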

 

3. esProc provides the library function web_crawl() to extract data from websites. Look it up in Help-Function reference to learn how to use it.