ReviewEssays.com - Term Papers, Book Reports, Research Papers and College Essays

Wrapper Generation for Unstructured Data

Essay by   •  February 21, 2011  •  Research Paper  •  2,784 Words (12 Pages)  •  1,058 Views

Abstract: Data on the web is highly unstructured, and sometimes it is presented without any HTML tags, which makes it difficult to query those web-sites and extract data from them. It is also difficult to merge data collected from various web-sites, because it arrives in different formats and data types. A machine cannot understand unstructured data on its own; it needs both structure and content in order to extract data from the web. We need an algorithm that can generate structure from this unstructured data automatically, without any manual intervention.

I. INTRODUCTION

Data on the web is highly unstructured, and traditional query engines cannot query it efficiently and accurately. We need an autonomous engine to which we can submit a query and which returns a result while hiding all of the complexity: the user is not concerned with which web-sites the engine visits, or with how it obtains the results, maps them, and combines them into a summarized answer. In such a system no manual intervention is needed, and the wrappers and any intermediate tools are generated at run time. Because everything is done at run time, such a dynamic system is not affected by changes to the web pages. The idea presented in this paper concerns wrapper generation, since one of the crucial problems of information extraction from the internet is generating wrappers, which are the information-extracting patterns or rules for a web page. We describe how to generate a wrapper for a web page that has few or no HTML tags. The main focus is on obtaining structure for data that is presented without tags, because a machine needs both the structure and the content of a web page in order to query it efficiently and accurately.

II. OVERVIEW

Data on web-sites is often highly unstructured and, in many web pages, presented without HTML tags. It is difficult for a machine to derive structure from such data, although humans can see that visual structures are always present. This data may be very important, yet it cannot be queried, and it is not possible to extract only specific information from it. To extract and query the data, the machine must know the schema of the unstructured data.

The web pages on these data-intensive web-sites have similar schemas: they are generated automatically from back-end databases and presented as HTML. Because a program generates these pages, there must be some schema common to the similar pages. To recover the semantics of such web pages, we therefore need to analyze and compare a group of them. The basic idea rests on two assumptions: the pages are generated by a program, so some structure must exist, and some visual structures are present.
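The idea of comparing a group of similar pages can be illustrated with a minimal Python sketch. This is not the algorithm developed in this paper; it only assumes, for illustration, that every page splits into the same number of whitespace-separated tokens and that a token which is identical across all pages belongs to the generating template. The function name `infer_template` and the `#FIELD` placeholder are hypothetical.

```python
def infer_template(pages):
    """Compare tag-free pages position by position.

    Tokens that are identical on every page are kept as template text;
    positions whose tokens vary are replaced by the placeholder '#FIELD'
    (a hypothetical marker chosen for this sketch).
    """
    token_rows = [page.split() for page in pages]
    if len({len(row) for row in token_rows}) != 1:
        # Assumption of this sketch: all pages have the same token count.
        raise ValueError("pages do not have the same number of tokens")

    template = []
    for column in zip(*token_rows):
        if len(set(column)) == 1:
            template.append(column[0])   # constant across pages: template text
        else:
            template.append("#FIELD")    # varies across pages: database value
    return template


pages = [
    "Title: Dune Price: 12",
    "Title: Emma Price: 9",
]
print(infer_template(pages))
```

Real pages rarely align this neatly (values may span several tokens), so the sketch only conveys why several pages are needed before a schema can be guessed.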

III. RELATED WORK

A large amount of work has already been done in this field. Many systems available nowadays generate wrappers for web pages, but most of them are not appropriate: they have many limitations, rely on questionable assumptions, or need some manual intervention. Some of them also need a large number of training examples in order to work well in a specific domain. A wrapper is a specified procedure designed for extracting specific data from web-sites of interest. The result of a wrapper should be a formal structure suitable for processing.[4]

We now discuss various techniques for wrapper generation and compare them with our approach. Autowrapper[5] extracts tables from web pages using the Smith-Waterman algorithm, but it fails when there are no HTML tags, nested tables, or single-row tables. Our approach works well for web data that has no HTML tags and no table structure. PickUp[3] is able to extract complex table structures, but it fails when the data is both without HTML tags and contains repeated patterns. It also focuses on a single page; to overcome this limitation, we analyze many pages before generating a wrapper for unstructured data without HTML tags. RoadRunner[1] requires HTML tags and two training documents. It tries to map the structure of one document onto the other, handling some irregularities, but it fails without HTML tags. Our approach does not need any set of training documents, and it also works for web data without HTML tags.
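For readers unfamiliar with the Smith-Waterman algorithm mentioned above, the following Python sketch shows the generic local-alignment procedure applied to two token sequences. It is not Autowrapper's implementation, which is not described here; the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the function name are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Gaps in either aligned sequence are represented by None.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cell values never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    aligned_a, aligned_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append(None)
            i -= 1
        else:
            aligned_a.append(None)
            aligned_b.append(b[j - 1])
            j -= 1
    return best, aligned_a[::-1], aligned_b[::-1]


row_a = ["<tr>", "<td>", "x", "</td>", "</tr>"]
row_b = ["<p>", "<tr>", "<td>", "y", "</td>", "</tr>"]
print(smith_waterman(row_a, row_b))
```

Because the alignment is driven by matching tag tokens, input with no tags gives the algorithm nothing stable to align on, which is consistent with the limitation noted above.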

IV. ROADRUNNER

In this paper we extend the work done on RoadRunner, which generates wrappers automatically for HTML web pages. We give a brief overview of RoadRunner here. It basically matches HTML tags. It needs two pages to parse in order to generate a generalized wrapper, and the first page is the reference page. While the tags are compared, a mismatch can occur, and each mismatch is removed by generalizing the wrapper. RoadRunner handles each kind of mismatch in its own way.

There can be two types of mismatches. The first is a string mismatch, which is due to different values coming from the database. It is solved by putting the value #PCDATA in the wrapper, meaning that the value at this position is extracted from the database.

The second is a tag mismatch, which occurs when the HTML tags differ within the sequence of tags being matched. These mismatches are overcome by determining whether the mismatching part is an iterator or an optional. RoadRunner first tries to find a repeated pattern, and if that fails, it tries to find an optional pattern. For a repeated pattern, it checks whether a group of HTML tags is repeated by matching the group, which the authors call a square, in a bottom-up manner, i.e. from the bottom of the square to the top. If the match succeeds, that group of HTML tags can occur one or more times, so it is generalized in the wrapper by adding a + sign. If this fails, the field is taken to be optional, and it is generalized by putting <I> and </I> tags in the wrapper.
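The handling of string mismatches can be sketched in Python as follows. This is a simplified illustration rather than RoadRunner's actual code: it assumes both pages have exactly the same tag sequence, so only text tokens may differ. The function names and the regular-expression tokenizer are assumptions made for this sketch.

```python
import re

# A tag token is '<...>'; a text token is any run of characters without '<'.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split a page into tag and text tokens, dropping whitespace-only text."""
    return [tok.strip() for tok in TOKEN_RE.findall(html) if tok.strip()]


def generalize_strings(reference_page, sample_page):
    """Replace every differing text token with '#PCDATA'.

    Raises ValueError on a tag mismatch, which this sketch does not handle.
    """
    wrapper = tokenize(reference_page)
    sample = tokenize(sample_page)
    if len(wrapper) != len(sample):
        raise ValueError("pages differ in structure (tag mismatch)")

    generalized = []
    for ref_tok, sample_tok in zip(wrapper, sample):
        if ref_tok == sample_tok:
            generalized.append(ref_tok)      # identical token: keep it
        elif ref_tok.startswith("<") or sample_tok.startswith("<"):
            raise ValueError("tag mismatch at %r / %r" % (ref_tok, sample_tok))
        else:
            generalized.append("#PCDATA")    # string mismatch: database value
    return generalized


page_a = "<html><b>Name:</b>John Smith</html>"
page_b = "<html><b>Name:</b>Paul Jones</html>"
print(generalize_strings(page_a, page_b))
```

In this example the label `Name:` is identical on both pages and stays in the wrapper, while the differing names collapse into a single #PCDATA field.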
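The iterator case can be illustrated with a much-simplified Python sketch. RoadRunner locates squares by comparing the wrapper against the sample page and matching bottom-up; the sketch below does not reproduce that procedure. It only assumes a single token list and collapses immediately adjacent repeats of a token group into one group marked with +. The function name and the `(` / `)+` markers are notation chosen for this illustration.

```python
def collapse_repeats(tokens):
    """Collapse adjacent repeated token groups into '(' group ')+'.

    At each position the smallest repeating group is preferred, so that
    three copies of '<li> #PCDATA </li>' become one iterator rather than
    a larger composite group.
    """
    out = []
    i = 0
    n = len(tokens)
    while i < n:
        collapsed = False
        # A group of this size must fit at least twice in what remains.
        for size in range(1, (n - i) // 2 + 1):
            block = tokens[i:i + size]
            count = 1
            # Count how many times the block repeats back to back.
            while tokens[i + count * size:i + (count + 1) * size] == block:
                count += 1
            if count > 1:
                out.append("(")
                out.extend(block)
                out.append(")+")
                i += count * size
                collapsed = True
                break
        if not collapsed:
            out.append(tokens[i])
            i += 1
    return out


tokens = [
    "<ul>",
    "<li>", "#PCDATA", "</li>",
    "<li>", "#PCDATA", "</li>",
    "<li>", "#PCDATA", "</li>",
    "</ul>",
]
print(collapse_repeats(tokens))
```

The sketch ignores optional fields entirely; in RoadRunner those are only considered after the search for a repeated pattern has failed.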

V. BACKGROUND DESCRIPTION

Data on the web is unstructured, and much of it is presented without HTML tags. In some cases a large amount of data sits between a single pair of HTML tags. In those cases the data is extracted from a database and presented to the user in HTML format within a single tag, but it usually has a relational structure that can be seen visually and is easily recognized by humans. This data can be very important and needs to be queried. Deriving a schema from it automatically is very difficult for a machine, and without knowing the schema the machine cannot query the web page and extract the information.

The approach we are taking for wrapper

...

...
