PDF Parser
Given a table (a list of rows, each of which is a list of strings), returns a new table in which each row is a
dictionary mapping the header values to that row's values.
Parameters
• table – The table (a list of lists of strings).
• header (list, optional) – The header to use. If not provided, the first row of the
table will be used instead. Your header must be the same width as your table, and cannot
contain the same entry multiple times.
Raises InvalidTableHeaderError – If the width of the header does not match the width of
the table, or if the header contains duplicate entries.
Returns A list of dictionaries, one per row of the table, where each dictionary maps the header entries to
that row's values.
Return type list[dict]
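The behaviour described above can be sketched in plain Python. This is an illustration only, not the library's code: `rows_to_dicts` is a hypothetical name, the exception class is a local stand-in for the library's `InvalidTableHeaderError`, and dropping the first row from the data when it is used as the header is an assumption.

```python
from typing import Dict, List, Optional


class InvalidTableHeaderError(Exception):
    """Local stand-in for the library's exception of the same name."""


def rows_to_dicts(
    table: List[List[str]], header: Optional[List[str]] = None
) -> List[Dict[str, str]]:
    """Hypothetical helper: map each row of `table` to a dict keyed by `header`."""
    if not table:
        return []
    if header is None:
        # Assumption: the first row becomes the header and is not repeated as data.
        header, rows = table[0], table[1:]
    else:
        rows = table
    if len(set(header)) != len(header):
        raise InvalidTableHeaderError("header contains duplicate entries")
    if any(len(row) != len(header) for row in rows):
        raise InvalidTableHeaderError("header width does not match table width")
    # Pair each header entry with the value in the same column.
    return [dict(zip(header, row)) for row in rows]


table = [["Name", "Qty"], ["apple", "3"], ["pear", "5"]]
print(rows_to_dicts(table))
# [{'Name': 'apple', 'Qty': '3'}, {'Name': 'pear', 'Qty': '5'}]
```

Passing an explicit `header` keeps every row of `table` as data, which is useful when the extracted table has no header row of its own.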
py_pdf_parser.tables.extract_simple_table(elements: ElementList, as_text: bool = False, strip_text: bool = True, allow_gaps: bool = False, reference_element: Optional[PDFElement] = None, tolerance: float = 0.0, remove_duplicate_header_rows: bool = False) → List[List[T]]
Returns elements structured as a table.
Given an ElementList, tries to extract a structured table by examining which elements are aligned.
To use this function, there must be at least one full row and one full column (which we call the reference row
and column), i.e. the reference row must have an element in every column, and the reference column must have
an element in every row. To specify the reference row and column, pass the single element that lies in both of
them. By default, this is the top left element, which means the first row and column are used as the references.
Note that if you need to change the reference_element, your table has gaps, so you will also need to pass
allow_gaps=True.
Important: This function uses the elements in the reference row and column to scan horizontally and vertically
to find the rest of the table. If there are gaps in your reference row or column, this function may miss rows or
columns.
There must be a clear gap between each row and between each column which contains no elements, and a single
cell cannot contain multiple elements.
If there are no valid reference rows or columns, try extract_table() instead. If you have elements spanning
multiple rows or columns, it may be possible to fix this by using extract_table(). If you fail to satisfy any of the
other conditions listed above, that case is not yet supported.
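The reference row and column idea described above can be illustrated with a small self-contained sketch. It is not the library's implementation: `Box`, `overlaps` and `sketch_simple_table` are hypothetical names, elements are reduced to text plus a bounding box, the y axis is assumed to grow upwards, gaps are represented as `None`, and `ValueError` stands in for whatever error the library raises.

```python
from typing import List, NamedTuple, Optional


class Box(NamedTuple):
    """Hypothetical stand-in for a PDF element: text plus its bounding box."""
    text: str
    x0: float
    x1: float
    y0: float
    y1: float


def overlaps(a0: float, a1: float, b0: float, b1: float) -> bool:
    """True if the intervals (a0, a1) and (b0, b1) intersect."""
    return a0 < b1 and b0 < a1


def sketch_simple_table(
    boxes: List[Box], allow_gaps: bool = False
) -> List[List[Optional[str]]]:
    # Reference element: the top left box (highest top edge, then leftmost).
    ref = min(boxes, key=lambda b: (-b.y1, b.x0))
    # Reference row: boxes vertically aligned with ref, ordered left to right.
    ref_row = sorted(
        (b for b in boxes if overlaps(b.y0, b.y1, ref.y0, ref.y1)),
        key=lambda b: b.x0,
    )
    # Reference column: boxes horizontally aligned with ref, ordered top to bottom.
    ref_col = sorted(
        (b for b in boxes if overlaps(b.x0, b.x1, ref.x0, ref.x1)),
        key=lambda b: -b.y1,
    )
    table: List[List[Optional[str]]] = []
    for row_ref in ref_col:
        row: List[Optional[str]] = []
        for col_ref in ref_row:
            # A cell is whatever lines up with this row and this column.
            hits = [
                b for b in boxes
                if overlaps(b.y0, b.y1, row_ref.y0, row_ref.y1)
                and overlaps(b.x0, b.x1, col_ref.x0, col_ref.x1)
            ]
            if len(hits) > 1:
                raise ValueError("a cell contains multiple elements")
            if not hits and not allow_gaps:
                raise ValueError("empty cell found; pass allow_gaps=True")
            row.append(hits[0].text if hits else None)
        table.append(row)
    return table


grid = [
    Box("A", 0, 10, 90, 100), Box("B", 20, 30, 90, 100),
    Box("C", 0, 10, 70, 80), Box("D", 20, 30, 70, 80),
    Box("E", 0, 10, 50, 60),  # the cell to the right of "E" is empty
]
print(sketch_simple_table(grid, allow_gaps=True))
# [['A', 'B'], ['C', 'D'], ['E', None]]
```

Because the scan only visits rows found in the reference column and columns found in the reference row, a gap in either reference would silently drop a whole row or column, which is why both must be full.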
Parameters
• elements (ElementList) – A list of elements to extract into a table.
• as_text (bool, optional) – Whether to extract the text from each element instead
of the PDFElement itself. Default: False.
• strip_text (bool, optional) – Whether to strip the text for each element of the
table (Only relevant if as_text is True). Default: True.
• allow_gaps (bool, optional) – Whether to allow empty spaces in the table. Default: False.
• reference_element (PDFElement, optional) – An element in a full row and a
full column. Will be used to specify the reference row and column. If None, the top left
element will be used, meaning the top row and left column will be used. If there are gaps in
these, you should specify a different reference. Default: None.