Saturday, May 29, 2010

Public Table Extraction Dataset

I am posting a copy of the table extraction dataset I created for my thesis here.

The dataset has three parts:
  • PublicTableExtractionDataset, a SQL database that tracks the HTML pages and tables and stores the manual label ('data table' or 'layout table') for each table
  • JavaCrawlerTestDump, a folder containing all the crawled HTML pages
  • TableDump, a folder containing all the tables extracted from each crawled HTML page
Practical Information
Schema: PublicTableExtractionDataset consists of two tables, HTMLPages and Table_Contents. HTMLPages contains information on where html pages are located and how to identify them, while Table_Contents contains information on each table extracted from each HTMLPage, as well as the type of table it is (a value of '1' indicates a layout table, while a value of '2' indicates a data table).

The schema for the two tables is as follows:

HTMLPages:
  • File_ID (int, not null)
  • File_Name (varchar(200), not null)
  • Page_Domain (varchar(200), not null)
  • URL (varchar(1000), not null)
  • Page_Type (int, not null)
Table_Contents:
  • File_ID (int, not null)
  • Table_ID (int, not null)
  • Table_File_Location (varchar(200), not null)
  • Table_Type (int, null)
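
As a rough sketch of how the two tables fit together, the Python snippet below recreates the schema in an in-memory SQLite database and joins the tables on File_ID to list each labelled table. SQLite is only a stand-in for SQL Server so that the example runs anywhere. That File_ID links the two tables is an assumption drawn from the shared column name, and the inserted rows (including the Page_Type value) are made up for illustration.

```python
import sqlite3

# Stand-in schema: SQLite types approximate the SQL Server column types.
SCHEMA = """
CREATE TABLE HTMLPages (
    File_ID     INTEGER NOT NULL,
    File_Name   VARCHAR(200) NOT NULL,
    Page_Domain VARCHAR(200) NOT NULL,
    URL         VARCHAR(1000) NOT NULL,
    Page_Type   INTEGER NOT NULL
);
CREATE TABLE Table_Contents (
    File_ID             INTEGER NOT NULL,
    Table_ID            INTEGER NOT NULL,
    Table_File_Location VARCHAR(200) NOT NULL,
    Table_Type          INTEGER  -- nullable, as in the original schema
);
"""

# Label values as described in the post: 1 = layout table, 2 = data table.
TABLE_TYPE_NAMES = {1: "layout table", 2: "data table"}


def build_demo_db():
    """Create an in-memory database with the two tables and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Hypothetical rows for illustration only; they are not from the dataset.
    conn.execute(
        "INSERT INTO HTMLPages VALUES "
        "(1, 'page1.html', 'example.com', 'http://example.com/', 0)"
    )
    conn.executemany(
        "INSERT INTO Table_Contents VALUES (?, ?, ?, ?)",
        [(1, 1, "page1_table1.html", 1), (1, 2, "page1_table2.html", 2)],
    )
    return conn


def labelled_tables(conn):
    """Return (page file, table file, label name) for every labelled table."""
    rows = conn.execute(
        "SELECT p.File_Name, t.Table_File_Location, t.Table_Type "
        "FROM Table_Contents AS t "
        "JOIN HTMLPages AS p ON p.File_ID = t.File_ID "
        "WHERE t.Table_Type IS NOT NULL "
        "ORDER BY t.File_ID, t.Table_ID"
    ).fetchall()
    return [
        (page, table, TABLE_TYPE_NAMES.get(kind, "unknown"))
        for page, table, kind in rows
    ]


if __name__ == "__main__":
    for row in labelled_tables(build_demo_db()):
        print(row)
```

Against the real database, the same SELECT should work once the connection is swapped for one that points at your SQL Server instance.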
Format: This database is a backup of the original SQL database I used. You will need to import it into a new database using the 'import database' wizard provided with SQL Server. I have tested this with the Express and full versions of SQL Server 2000 and 2008. Please let me know if you have any questions.

Accessing html pages and tables: I have removed the folder locations from the database, but you can easily add your own. For example, to update the HTMLPages SQL table to add the locations, you could use the following query:

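-- Prepend your folder path to every stored file name.
-- Replace 'new location' with the path, including its trailing separator.
UPDATE HTMLPages
SET File_Name = 'new location' + File_Name;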

The same query works for the Table_Contents table: change HTMLPages to Table_Contents and File_Name to Table_File_Location, the column that holds the path in that table.
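
If you would rather script the update, here is a minimal Python sketch of the same idea. It runs against an in-memory SQLite database as a stand-in for SQL Server, so it concatenates with || where the T-SQL query uses +. The helper name add_location_prefix and the folder path in the demo are hypothetical. The table and column pairs come from the schema listed earlier.

```python
import sqlite3

# The only table/column pairs that hold file locations in the schema.
ALLOWED_TARGETS = {
    ("HTMLPages", "File_Name"),
    ("Table_Contents", "Table_File_Location"),
}


def add_location_prefix(conn, table, column, prefix):
    """Prepend `prefix` to every value in table.column; return rows changed."""
    # Identifiers cannot be bound as query parameters, so check them
    # against the known pairs before building the statement.
    if (table, column) not in ALLOWED_TARGETS:
        raise ValueError(f"unexpected target: {table}.{column}")
    cur = conn.execute(
        f"UPDATE {table} SET {column} = ? || {column}", (prefix,)
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Cut-down demo table with one made-up row.
    conn.execute("CREATE TABLE HTMLPages (File_ID INTEGER, File_Name TEXT)")
    conn.execute("INSERT INTO HTMLPages VALUES (1, 'page1.html')")
    # Hypothetical folder; note the trailing separator.
    add_location_prefix(conn, "HTMLPages", "File_Name", "/data/JavaCrawlerTestDump/")
    print(conn.execute("SELECT File_Name FROM HTMLPages").fetchall())
```

Running the update twice prepends the prefix twice, so it is worth doing this on a fresh import of the backup.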

Dataset Statistics
I collected 9,365 HTML pages containing the <table> tag from 512 random domains. Each page contains at least 1 and at most 1,539 tables. 6,620 pages contain only non-data tables, while 2,745 pages contain at least one data table.

The total number of tables collected was 78,438, with 74,202 (94.6%) of these being non-data tables, and 4,236 (5.4%) being data tables.
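
The percentages can be rechecked from the raw counts with a few lines of Python. This is only an arithmetic sketch. The constant names are made up, and the page-level percentages it prints are derived here rather than quoted from the post.

```python
def share(part, total):
    """Return `part` as a percentage of `total`, rounded to one decimal."""
    return round(100.0 * part / total, 1)


# Counts quoted in the post.
TOTAL_TABLES = 78_438
NON_DATA_TABLES = 74_202
DATA_TABLES = 4_236
TOTAL_PAGES = 9_365
PAGES_ONLY_NON_DATA = 6_620
PAGES_WITH_DATA = 2_745

if __name__ == "__main__":
    # The two table counts and the two page counts should add up to the totals.
    assert NON_DATA_TABLES + DATA_TABLES == TOTAL_TABLES
    assert PAGES_ONLY_NON_DATA + PAGES_WITH_DATA == TOTAL_PAGES

    print(share(NON_DATA_TABLES, TOTAL_TABLES))    # 94.6
    print(share(DATA_TABLES, TOTAL_TABLES))        # 5.4
    print(share(PAGES_ONLY_NON_DATA, TOTAL_PAGES)) # 70.7
    print(share(PAGES_WITH_DATA, TOTAL_PAGES))     # 29.3
```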

More Details
You can read more about this data set and the experiments I used it for in my thesis.
