Documentation: PDF Table Extractor

The PDF Table Extractor tool extracts tables from PDF files into a CSV file. It offers two extraction algorithms, and we suggest choosing between them as follows:

  • Tables with rulings extraction algorithm: if the tables in the PDF file have ruling lines separating each cell and column.
  • Tables without rulings extraction algorithm: if the tables in the PDF file do not have ruling lines separating each cell and column.
Users can apply specific settings to achieve better results. More specifically:
  • In the Tables With Rulings Extraction Algorithm, users can set density thresholds for the columns and rows to be extracted. The density of a row or column is the percentage of its cells that contain data. By default, only columns and rows with density >= 50% are extracted. The thresholds eliminate whole rows/columns that carry little information (never the content of individual cells); they are needed because the algorithm sometimes extracts empty rows and columns. If you set the row/column threshold to 100, the tool extracts only the rows/columns that have data in every cell. For instance, take the following table, where the top line numbers the columns and the first entry of each line numbers the rows:

       1                            2         3           4      5  6    7
    1  COLOP                        CLOTHING  AUSTRIA     198       73%  27%
    2  ADURY APPARELS LTD           CLOTHING  BANGLADESH  3,320     30%  70%
    3  AJAX SWEATER LTD             CLOTHING  BANGLADESH  758       35%  65%
    4  AKH ECO APPARELS LTD         CLOTHING  BANGLADESH  3,050     60%  40%
    5  AKH FASHIONS LTD             CLOTHING  BANGLADESH  1,103     30%  70%
    6  AMAN GRAPHICS & DESIGNS LTD  CLOTHING  BANGLADESH  2,307     40%  60%
    7  AMAN KNITTINGS LTD           CLOTHING  BANGLADESH  2,317     60%  40%
    8  ANANTA DENIM TECH. LTD       CLOTHING  BANGLADESH  3,388     40%  60%
    9  ANANTA HUAXIANG LTD                                1,787          40%

    Columns 1, 4 and 7 have a density of 100%, columns 2, 3 and 6 have a density of 89% (8 of 9 cells filled), and column 5 has a density of 0%. Rows 1 to 8 have a density of 86% (6 of 7 cells filled), while row 9 has a density of 43% (3 of 7 cells filled). With both thresholds set to 50%, column 5 and row 9 are dropped, and the final extracted table looks like this:
    COLOP                        CLOTHING  AUSTRIA     198    73%  27%
    ADURY APPARELS LTD           CLOTHING  BANGLADESH  3,320  30%  70%
    AJAX SWEATER LTD             CLOTHING  BANGLADESH  758    35%  65%
    AKH ECO APPARELS LTD         CLOTHING  BANGLADESH  3,050  60%  40%
    AKH FASHIONS LTD             CLOTHING  BANGLADESH  1,103  30%  70%
    AMAN GRAPHICS & DESIGNS LTD  CLOTHING  BANGLADESH  2,307  40%  60%
    AMAN KNITTINGS LTD           CLOTHING  BANGLADESH  2,317  60%  40%
    ANANTA DENIM TECH. LTD       CLOTHING  BANGLADESH  3,388  40%  60%
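
The density filtering in the example above can be sketched in Python. This is only an illustration, not the tool's actual code: the names `density`, `filter_by_density`, `row_threshold` and `col_threshold` are hypothetical, and the sketch assumes that row and column densities are both measured on the table as extracted, before anything is removed, which is what the numbers in the example suggest.

```python
from typing import List, Optional

Cell = Optional[str]
Table = List[List[Cell]]


def density(cells: List[Cell]) -> float:
    """Return the percentage of cells that contain data."""
    if not cells:
        return 0.0
    filled = sum(1 for c in cells if c is not None and c.strip())
    return 100.0 * filled / len(cells)


def filter_by_density(table: Table, row_threshold: float = 50.0,
                      col_threshold: float = 50.0) -> Table:
    """Keep only rows and columns whose density meets the thresholds.

    Assumption: both densities are computed on the original table,
    not on the table left over after the other axis was filtered.
    """
    if not table:
        return []
    keep_rows = [i for i, row in enumerate(table)
                 if density(row) >= row_threshold]
    keep_cols = [j for j in range(len(table[0]))
                 if density([row[j] for row in table]) >= col_threshold]
    return [[table[i][j] for j in keep_cols] for i in keep_rows]


# The example table: 9 rows, 7 columns; "" marks an empty cell.
SAMPLE: Table = [
    ["COLOP", "CLOTHING", "AUSTRIA", "198", "", "73%", "27%"],
    ["ADURY APPARELS LTD", "CLOTHING", "BANGLADESH", "3,320", "", "30%", "70%"],
    ["AJAX SWEATER LTD", "CLOTHING", "BANGLADESH", "758", "", "35%", "65%"],
    ["AKH ECO APPARELS LTD", "CLOTHING", "BANGLADESH", "3,050", "", "60%", "40%"],
    ["AKH FASHIONS LTD", "CLOTHING", "BANGLADESH", "1,103", "", "30%", "70%"],
    ["AMAN GRAPHICS & DESIGNS LTD", "CLOTHING", "BANGLADESH", "2,307", "", "40%", "60%"],
    ["AMAN KNITTINGS LTD", "CLOTHING", "BANGLADESH", "2,317", "", "60%", "40%"],
    ["ANANTA DENIM TECH. LTD", "CLOTHING", "BANGLADESH", "3,388", "", "40%", "60%"],
    ["ANANTA HUAXIANG LTD", "", "", "1,787", "", "", "40%"],
]

# Default thresholds (50% / 50%) drop the empty column 5 and the sparse row 9.
result = filter_by_density(SAMPLE)
for row in result:
    print(row)
```

Note that with this sample a row threshold of 100 would remove every row, because the empty column 5 still counts towards each row's density.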

  • In the Tables Without Rulings Extraction Algorithm, users can merge lines into the same row. The algorithm extracts each line of the table as a separate row, but sometimes a cell spreads its content over multiple lines. To get better results in such cases, users can choose upper merging, which merges each sparse line with the line above it (into the same row), or lower merging, which merges each sparse line with the next line below it (into the same row).
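
The merging of sparse lines can be sketched as follows. This is a rough illustration under stated assumptions rather than the tool's implementation: the documentation does not define when a line counts as sparse, so the sketch assumes a line is sparse when fewer than half of its cells are filled (the hypothetical `max_fill` parameter), and it assumes all lines have the same number of cells. The names `is_sparse` and `merge_sparse_lines` are likewise hypothetical.

```python
from typing import List, Optional

Row = List[str]


def is_sparse(row: Row, max_fill: float = 0.5) -> bool:
    """Assumed rule: a line is sparse if its filled fraction is below max_fill."""
    if not row:
        return True
    filled = sum(1 for c in row if c.strip())
    return (filled / len(row)) < max_fill


def _join(top: Row, bottom: Row) -> Row:
    """Merge two lines cell by cell, putting the upper line's text first."""
    return [" ".join(p for p in (a.strip(), b.strip()) if p)
            for a, b in zip(top, bottom)]


def merge_sparse_lines(lines: List[Row], direction: str = "upper",
                       max_fill: float = 0.5) -> List[Row]:
    """Merge each sparse line into the row above ("upper") or below ("lower")."""
    if direction not in ("upper", "lower"):
        raise ValueError("direction must be 'upper' or 'lower'")
    merged: List[Row] = []
    if direction == "upper":
        for line in lines:
            if merged and is_sparse(line, max_fill):
                merged[-1] = _join(merged[-1], line)
            else:
                # A sparse first line has nothing above it and stays as is.
                merged.append(list(line))
        return merged
    pending: Optional[Row] = None  # sparse lines waiting for the next dense line
    for line in lines:
        if is_sparse(line, max_fill):
            pending = list(line) if pending is None else _join(pending, line)
        else:
            merged.append(list(line) if pending is None else _join(pending, line))
            pending = None
    if pending is not None:
        # A trailing sparse line has nothing below it and stays as is.
        merged.append(pending)
    return merged


# A company name that wraps onto a second line of the PDF table.
lines = [
    ["AMAN GRAPHICS &", "CLOTHING", "BANGLADESH", "2,307"],
    ["DESIGNS LTD", "", "", ""],
    ["AMAN KNITTINGS LTD", "CLOTHING", "BANGLADESH", "2,317"],
]
upper = merge_sparse_lines(lines, direction="upper")
lower = merge_sparse_lines(lines, direction="lower")
print(upper)
print(lower)
```

In this sample, upper merging is the right choice because the wrapped text belongs to the row above it; lower merging would attach it to the following company instead, which is why trying both settings on a given PDF is worthwhile.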

We strongly suggest experimenting with the settings of both algorithms to achieve better results and to understand exactly how they work.