Documentation

  1. Overview
  2. Framework Description
    1. Data Model
    2. easIE's Architecture
  3. Configuration File Description
    1. Configuration File's elements
    2. Examples
    3. Configuration File Generator
1. Overview

easIE is an easy-to-use, open-source framework for extracting company-related data from external Web sources. It supports easy generation of Web wrappers that perform Information Extraction (IE) from numerous Web sources. As a result, users with limited programming skills can contribute more actively to data gathering, extracting data from both static and dynamic HTML pages simply by defining a configuration file. Using this framework, we have extracted financial and Corporate Social Responsibility (CSR) data about a number of companies from a number of external sources and integrated them into the WikiRate database.

2. Framework Description

easIE is an intelligent content collection framework that was developed to populate the WikiRate platform with content from selected Web sources. The goal was to extract data with minimal human assistance and supervision, requiring only the definition of a configuration file. Here, we describe the data model used by easIE and its architecture.

2.1 Data Model

CSR data model used by easIE

CSR data in WikiRate is organized around three main entities:
  1. companies: represent corporations and are associated with a name, a set of aliases, a country and a website;
  2. metrics: pieces of information (quantitative or qualitative) related to companies; they are primarily defined by their name and value, the year to which they refer, and information about the source from which they were derived;
  3. articles: online resources (posts) that refer to CSR activities and issues in relation to a company or a set of companies.
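
The three entities can be pictured as plain records. The following is a minimal sketch, not easIE's actual schema: the class and attribute names are assumptions chosen to mirror the description above, and "Example Corp" is a made-up company.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Company:
    # A corporation with its name, aliases, country and website.
    name: str
    aliases: List[str] = field(default_factory=list)
    country: Optional[str] = None
    website: Optional[str] = None


@dataclass
class Metric:
    # A quantitative or qualitative fact about one company.
    company: str                    # name of the company the metric refers to
    name: str
    value: Union[str, int, float]
    citeyear: Optional[int]         # year to which the value refers
    source_name: str                # label of the source it was derived from
    source_url: str


@dataclass
class Article:
    # An online resource about CSR activities of one or more companies.
    url: str
    companies: List[str] = field(default_factory=list)


# Hypothetical records, for illustration only.
company = Company(name="Example Corp", aliases=["Example"], country="Greece")
metric = Metric(
    company=company.name,
    name="Newsweek Global Rank 2014",
    value=17,
    citeyear=2014,
    source_name="Example Rankings",
    source_url="https://example.org/rankings",
)
article = Article(url="https://example.org/post", companies=[company.name])
```
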
2.2 easIE's Architecture

To tackle the diversity of Web sources where CSR information resides, easIE employs a semi-automatic Web Data Extraction (WDE) approach that generates a custom wrapper for a particular Web source based on a set of handcrafted data extraction rules. Users only need to define these rules and store them in a configuration file; easIE then generates and executes a custom wrapper that collects the target data from the selected Web source. Defining these rules is a simple process and makes it possible to collect CSR data from both static and dynamic HTML pages.

The picture below provides an overview of easIE. A configuration file is provided as input, then analyzed and validated by the Configuration Parser. In the second step, the Web Wrapper Generator creates a custom wrapper for the target Web source, which is then applied to the page by the Wrapper Executor. As a final step, the Data Handler performs data integration by matching records that refer to the same company and updating the CSR Database.

Overview of easIE

In particular, two types of wrapper are generated, depending on whether the HTML content is static or dynamic. Each of them executes the following processing flow: a) page fetching, b) data extraction, c) data post-processing, d) start-over, i.e. identifying the next page to fetch, crawling within the Web page. In the case of static pages, the executor applies the wrapper to a sequence of Web pages that is defined either through a pagination operation (Pagination Iterator) or through a group of URLs (Bunch URL Iterator). In the case of dynamic pages, the executor launches a browser emulator and executes a set of user events (as specified in the configuration file), namely click and scroll events, in order to fetch and extract the target data.
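
The four-step flow for static pages can be sketched as a simple loop. This is an illustrative sketch only, not easIE's implementation: `run_static_wrapper` and its callback parameters are hypothetical names, and the "pages" are in-memory dictionaries standing in for fetched HTML.

```python
from typing import Any, Callable, Iterable, List, Optional


def run_static_wrapper(
    seed_urls: Iterable[str],
    fetch: Callable[[str], Any],
    extract: Callable[[Any], Iterable[Any]],
    post_process: Callable[[Any], Any],
    find_next: Optional[Callable[[Any], Optional[str]]] = None,
) -> List[Any]:
    records: List[Any] = []
    queue = list(seed_urls)        # a group of URLs, or a single seed page
    visited = set()
    while queue:
        url = queue.pop(0)
        if url in visited:         # guard against pagination loops
            continue
        visited.add(url)
        page = fetch(url)                       # a) page fetching
        for raw in extract(page):               # b) data extraction
            records.append(post_process(raw))   # c) data post-processing
        if find_next is not None:               # d) start-over
            next_url = find_next(page)
            if next_url:
                queue.append(next_url)
    return records


# Fake two-page source used only to exercise the loop.
fake_pages = {
    "p1": {"rows": [" A ", " B "], "next": "p2"},
    "p2": {"rows": [" C "], "next": None},
}

# Pagination: start from one seed and follow the "next" pointer.
paginated = run_static_wrapper(
    ["p1"],
    fetch=fake_pages.__getitem__,
    extract=lambda page: page["rows"],
    post_process=str.strip,
    find_next=lambda page: page["next"],
)

# Group of URLs: visit an explicit list, without following "next".
grouped = run_static_wrapper(
    ["p2", "p1"],
    fetch=fake_pages.__getitem__,
    extract=lambda page: page["rows"],
    post_process=str.strip,
)
```

Passing `find_next` corresponds loosely to the pagination case, while passing several seed URLs without it corresponds to the group-of-URLs case.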

3. Configuration File Description

The configuration file is defined in JSON format and consists of a set of elements, as listed in Table 1. To define the configuration file, admin users must be familiar with the syntax of CSS selectors and CSS queries. More specifically, extraction rules need to be formulated as CSS selectors that identify the elements of an HTML document that users want to extract. Modern browsers come with built-in tools (DOM inspectors) that make it easy to obtain a CSS selector or XPath expression that addresses a specific HTML element.

3.1 Configuration File's elements

To extract and collect data from a Web source, a separate configuration file needs to be prepared for that source. This is defined in JSON format and consists of a set of elements (listed in Table 1 below) that specify a) the type of data that is to be extracted, and b) the details of the steps that comprise the wrapper processing workflow (fetching, data extraction, post-processing, start-over).

Table 1: Main elements of easIE configuration file

The entry point for a wrapper is the url field, which determines the seed Web page that the wrapper will first fetch in order to start the data collection process for the source identified by source_name (a label referring to the particular source). Then, the configuration file should specify a) the attributes that can be extracted for each company (e.g. link, country, sector etc.) through the company_info elements, and b) the metrics that will be extracted and associated with each company (these will be further described below).

Another essential configuration element is a boolean flag (dynamic_page) indicating whether the target page should be handled by the wrapper as static or dynamic. In case of static pages, the target data are available upon page load; instead, for dynamic pages the configuration file should also specify the set of user events (click or scroll) that are required in order to load the necessary HTML snippets in the browser emulator.

The company_info and metrics elements are arrays of JSON objects, as described in Table 2. Each object has three fields that need to be defined: 1) label, 2) value, 3) citeyear. All three fields can have custom values defined by users or can be extracted from the Web page. In order to extract values for the label, value and citeyear fields, users need to define the fields as described in Table 4.

Table 2: Specification of metrics and company info. For company info, users need to define at least one field with label equal to Company Name to specify to which company the extracted metric(s) refer.
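
As a rough illustration of the elements described so far, the sketch below builds a small configuration as a Python dictionary, serializes it to JSON, and checks it against the requirements stated in the text. The element names (url, source_name, dynamic_page, company_info, metrics, label, value, citeyear) come from this section; the URL, the values, the helper `find_problems`, and the exact set of elements treated as mandatory are assumptions made for the example.

```python
import json

REQUIRED_TOP_LEVEL = ("url", "source_name", "dynamic_page", "company_info", "metrics")
REQUIRED_ENTRY_FIELDS = ("label", "value", "citeyear")


def find_problems(config: dict) -> list:
    """Return human-readable problems; an empty list means none were found."""
    problems = [f"missing element: {key}" for key in REQUIRED_TOP_LEVEL if key not in config]
    for section in ("company_info", "metrics"):
        for i, entry in enumerate(config.get(section, [])):
            for name in REQUIRED_ENTRY_FIELDS:
                if name not in entry:
                    problems.append(f"{section}[{i}] lacks {name}")
    labels = [entry.get("label") for entry in config.get("company_info", [])]
    if "Company Name" not in labels:
        problems.append("company_info needs an entry labelled 'Company Name'")
    return problems


# Hypothetical configuration with user-defined (not extracted) values.
config = {
    "url": "https://example.org/rankings",
    "source_name": "Example Rankings",
    "dynamic_page": False,
    "company_info": [
        {"label": "Company Name", "value": "Example Corp", "citeyear": 2014},
    ],
    "metrics": [
        {"label": "Newsweek Global Rank 2014", "value": "17", "citeyear": 2014},
    ],
}

config_json = json.dumps(config, indent=2)   # text that would be saved as the file
problems = find_problems(json.loads(config_json))
```
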

Moreover, when the data are formatted as a table or a list, the actual data extraction process is initiated by pointing the wrapper to the DOM subtree specified by the list_selector field. This is a CSS selector that points the wrapper to the set of HTML elements that contain the target information about company attributes and metrics. Then, the wrapper iteratively visits all the elements of the selected subtree and applies the data extraction rules specified by metrics and company_info, which map parts of these elements to the respective structured fields. For instance, to select each row from a table with id list-table-body, we should declare "list_selector":"#list-table-body > tr", and in order to map the third column to a metric with the name "Newsweek Global Rank 2014", the respective metric specification would be: {"label":"Newsweek Global Rank 2014", "value":{"selector":"td:nth-child(3)", "type":"text"}, "citeyear":2014}. In addition to text, the type field can also take one of the following values:

  1. link: corresponding to the href attribute of the selected element;
  2. image: corresponding to the src attribute of the selected element;
  3. integer: the extracted value is going to be an integer;
  4. list: referring to a list of text elements;
  5. other: A string specifying the name of the attribute of interest (e.g., "type": "src").
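
One way to read the type values above is as a dispatch over what is taken from the selected element. The sketch below is an assumption-laden illustration, not easIE's code: elements are represented as plain dictionaries with "text" and "attrs" keys, `extract_value` is a hypothetical helper, and the handling of thousands separators for integer is a guess.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for a selected HTML element:
# {"text": "...", "attrs": {"href": "...", "src": "...", ...}}
Element = Dict[str, Any]


def extract_value(matches: List[Element], type_: str) -> Any:
    """Turn the elements matched by a selector into a value of the given type."""
    if not matches:
        return None
    if type_ == "list":                      # text of every matched element
        return [m["text"].strip() for m in matches]
    first = matches[0]                       # other types use the first match
    if type_ == "text":
        return first["text"].strip()
    if type_ == "integer":                   # assumes "," is a thousands separator
        return int(first["text"].strip().replace(",", ""))
    if type_ == "link":
        return first["attrs"].get("href")
    if type_ == "image":
        return first["attrs"].get("src")
    return first["attrs"].get(type_)         # any other string names an attribute


rank_cell = {"text": " 1,024 ", "attrs": {}}
link_cell = {"text": "Site", "attrs": {"href": "https://example.org", "title": "Home"}}

as_text = extract_value([rank_cell], "text")
as_int = extract_value([rank_cell], "integer")
as_link = extract_value([link_cell], "link")
as_attr = extract_value([link_cell], "title")
as_list = extract_value([link_cell, rank_cell], "list")
```
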

In several cases, it is necessary to extract only part of the selected text. To address this requirement, easIE supports processing the selected pieces of text through a simple declarative specification (replace) comprising a set of regex patterns accompanied by a set of string values (with) that replace the detected patterns. For instance, the specification "replace":{"regex":[":.*"],"with":[""]} removes the colon and everything that follows it in a string.
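
The effect of a replace specification can be reproduced with ordinary regular-expression substitution. The sketch below assumes that the regex and with arrays are paired by position and applied in order; `apply_replace` is a hypothetical helper, and Python's `re` module is used here, whose replacement syntax may differ from the regex engine easIE itself relies on.

```python
import re


def apply_replace(text: str, spec: dict) -> str:
    """Apply each regex in spec["regex"] with its replacement in spec["with"]."""
    for pattern, replacement in zip(spec["regex"], spec["with"]):
        text = re.sub(pattern, replacement, text)
    return text


# The specification from the text: drop the colon and everything after it.
spec = {"regex": [":.*"], "with": [""]}
cleaned = apply_replace("Revenue: 12 million", spec)

# Two rules applied in sequence (made-up input): strip a prefix, then commas.
chained = apply_replace("Rank #1,204", {"regex": ["^Rank #", ","], "with": ["", ""]})
```
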

Table 4: Elements that need to be defined in order to extract the desired data from the Web page

In numerous cases, a source of interest contains multiple pages with data. In such cases, instead of defining different configuration files that differ only in the url field, the user can define an array of URLs (group_of_urls) to be collected by the same wrapper. Navigation within the page is enabled if the crawl field is defined: the user needs to declare a nested configuration file that takes as input the URL or URLs obtained by the metric named crawl_to. Another multi-page data collection scenario pertains to paginated data (i.e. entries spanning multiple pages with paging controls available to browse from one to the next). To enable paginated access to such data, one needs to set the next_page_selector field to a CSS selector pointing to the element where the "next" button is located in the page.
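
To make the two multi-page options concrete, the sketch below shows how a wrapper might decide where to start and whether to paginate. The field names group_of_urls, url and next_page_selector come from the text; the helpers `seed_urls` and `uses_pagination`, the example URLs and the "a.next" selector are hypothetical, and whether easIE lets group_of_urls take the place of url in exactly this way is an assumption.

```python
from typing import List


def seed_urls(config: dict) -> List[str]:
    """Pages the wrapper starts from: the URL group if given, else the single url."""
    if "group_of_urls" in config:
        return list(config["group_of_urls"])
    return [config["url"]]


def uses_pagination(config: dict) -> bool:
    """True when a selector for the "next" button has been configured."""
    return bool(config.get("next_page_selector"))


# Hypothetical fragment: several pages handled by the same wrapper.
grouped_config = {
    "group_of_urls": [
        "https://example.org/rankings/2013",
        "https://example.org/rankings/2014",
    ],
}

# Hypothetical fragment: one seed page with a "next" button to follow.
paginated_config = {
    "url": "https://example.org/rankings",
    "next_page_selector": "a.next",
}

grouped_seeds = seed_urls(grouped_config)
paginated_seeds = seed_urls(paginated_config)
```
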

A final option of complex page processing concerns the fetching and processing of HTML elements from dynamic Web pages. In such cases, the fetching and processing workflow is specified by the events field. There are two types of event that can be executed by the browser emulator:

  1. CLICK: in this case, users must also define the selector of the element to which the event will apply, as well as the number of times that the event will be executed (times_to_repeat).
  2. SCROLL_DOWN: in this case, the wrapper will execute the event until no more data can be loaded from the page. Users may also specify the number of times to repeat the scroll-down event, since there are pages that are based on an "infinite scroll" design.

If the times_to_repeat field is specified, then one also needs to specify the extraction_type field declaring whether data extraction should be performed after the execution of each event (AFTER_EACH_EVENT) or after the execution of all events (AFTER_ALL_EVENTS).
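
The difference between the two extraction_type values can be illustrated with a small simulation. This is a sketch under stated assumptions rather than easIE's behaviour: `run_event` and `FakePager` are hypothetical, the "type" key used to name the event is a guess (the actual field names are given in Table 3), AFTER_ALL_EVENTS is assumed as the default, and extraction of the page as first loaded is not modelled.

```python
from typing import Callable, List


def run_event(event: dict, perform: Callable[[dict], bool],
              extract: Callable[[], List[str]]) -> List[str]:
    """Repeat one event and extract either after each run or once at the end."""
    mode = event.get("extraction_type", "AFTER_ALL_EVENTS")
    repeat = event.get("times_to_repeat")    # None: repeat until nothing new loads
    records: List[str] = []
    done = 0
    while repeat is None or done < repeat:
        loaded_more = perform(event)         # True if the page content changed
        done += 1
        if mode == "AFTER_EACH_EVENT":
            records.extend(extract())
        if not loaded_more:
            break
    if mode == "AFTER_ALL_EVENTS":
        records.extend(extract())
    return records


class FakePager:
    """Stands in for a browser emulator: each click replaces the visible rows."""

    def __init__(self, pages: List[List[str]]):
        self.pages = pages
        self.index = 0

    def click_next(self, event: dict) -> bool:
        if self.index + 1 < len(self.pages):
            self.index += 1
            return True
        return False

    def visible_rows(self) -> List[str]:
        return list(self.pages[self.index])


click = {"type": "CLICK", "selector": "a.next", "times_to_repeat": 2}

pager = FakePager([["a"], ["b"], ["c"]])
each = run_event(dict(click, extraction_type="AFTER_EACH_EVENT"),
                 pager.click_next, pager.visible_rows)

pager = FakePager([["a"], ["b"], ["c"]])
at_end = run_event(dict(click, extraction_type="AFTER_ALL_EVENTS"),
                   pager.click_next, pager.visible_rows)
```

In this simulation each click replaces the visible rows, so extracting after each event collects every page reached, whereas extracting after all events sees only the last one.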

Table 3: Specification of events for dynamic pages.

3.2 Examples
3.3 Configuration File Generator

We have developed an online tool that helps users generate their configuration files simply by answering a number of questions. Users do not need to be familiar with JSON, but they do need to be familiar with the basic concepts of easIE. An id is provided for each configuration file, and the generated files are stored on our servers for future use. The tool can be found here.