Document elements and metadata
`unstructured` simplifies and streamlines the preprocessing of structured and unstructured documents for downstream tasks.
What that means is no matter where your data is and no matter what format that data is in, Unstructured’s toolkit will
transform and preprocess that data into an easily digestible and usable format that is uniform across data formats.
When you partition a document with Unstructured, the result is a list of document `Element` objects.
These element objects represent different components of the source document.
Element example
Here’s an example of what an element might look like:
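As a rough sketch of that shape, a serialized element can be modeled as a Python dictionary. Every value below is a placeholder chosen for illustration, not output from a real document, and the metadata keys shown are just a small sample of the fields described later on this page.

```python
# Illustrative only: placeholder values, not output from a real document.
example_element = {
    "type": "NarrativeText",
    "element_id": "<sha-256 hash or UUID goes here>",
    "text": "This is a sentence extracted from the source document.",
    "metadata": {
        "filename": "example.pdf",      # hypothetical file name
        "filetype": "application/pdf",
        "page_number": 1,
        "languages": ["eng"],
    },
}

# The four top-level components discussed in this section.
for key in ("type", "element_id", "text", "metadata"):
    print(key, "->", example_element[key])
```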
Every element has a `type`, an `element_id`, the extracted `text`, and some `metadata`, which may vary depending on the element type, document structure, and any additional parameters used during partitioning and/or chunking.
Let’s explore some of these document element components in more detail.
Element type
Instead of treating all documents like a wall of plain text, Unstructured preserves the semantic structure of the documents. This gives you more control and flexibility over how you further use the processed documents and allows you to take their structure into consideration. At the same time, normalizing data from various file formats to the Unstructured element type scheme lets you treat all documents the same in your downstream processing, regardless of source format. For example, if you plan to summarize a document, you may only be interested in the narrative of the document, and not care about footers and headers. You can easily filter out the elements you don’t need using their type.
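The filtering idea can be sketched in a few lines. The example below assumes elements have been serialized to dictionaries with a `type` key; the sample data and the `UNWANTED_TYPES` name are made up for illustration.

```python
# Hypothetical serialized elements; only "type" and "text" matter here.
elements = [
    {"type": "Header", "text": "ACME Quarterly Report"},
    {"type": "Title", "text": "Overview"},
    {"type": "NarrativeText", "text": "Revenue grew during the quarter."},
    {"type": "Footer", "text": "Page 1"},
]

# Drop page furniture before summarizing; keep the narrative and titles.
UNWANTED_TYPES = {"Header", "Footer"}
body = [el for el in elements if el["type"] not in UNWANTED_TYPES]

for el in body:
    print(el["type"], "|", el["text"])
```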
Here are some examples of the element types your document may contain:
Element type | Description |
---|---|
Formula | An element containing formulas in a document. |
FigureCaption | An element for capturing text associated with figure captions. |
NarrativeText | NarrativeText is an element consisting of multiple, well-formulated sentences. This excludes elements such as titles, headers, footers, and captions. |
ListItem | ListItem is a NarrativeText element that is part of a list. |
Title | A text element for capturing titles. |
Address | A text element for capturing physical addresses. |
EmailAddress | A text element for capturing email addresses. |
Image | A text element for capturing image metadata. |
PageBreak | An element for capturing page breaks. |
Table | An element for capturing tables. |
Header | An element for capturing document headers. |
Footer | An element for capturing document footers. |
CodeSnippet | An element for capturing code snippets. |
PageNumber | An element for capturing page numbers. |
UncategorizedText | Base element for capturing free text from within a document. |
If you apply chunking during partitioning of a document or later, you will also see the `CompositeElement` type.
`CompositeElement` is a chunk formed from text (non-Table) elements. It is only produced by chunking.
A composite element may be formed by combining one or more sequential elements produced by partitioning. For example,
several individual list items may be combined into a single chunk.
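To make the combining step concrete, here is a deliberately simplified greedy combiner. It is not the library's chunking algorithm: the `combine_into_chunks` function, the `"\n\n"` separator, and the dictionary shape are assumptions for illustration, and the sketch never splits an oversized element.

```python
def combine_into_chunks(elements, max_characters=100):
    """Greedily merge sequential text elements; pass Table elements through."""
    chunks = []
    current = []  # texts collected for the chunk being built

    def flush():
        if current:
            chunks.append(
                {"type": "CompositeElement", "text": "\n\n".join(current)}
            )
            current.clear()

    for el in elements:
        if el["type"] == "Table":
            # Tables are never folded into a composite element.
            flush()
            chunks.append(el)
            continue
        candidate = "\n\n".join(current + [el["text"]])
        if current and len(candidate) > max_characters:
            flush()
        current.append(el["text"])
    flush()
    return chunks


partitioned = [
    {"type": "ListItem", "text": "First item"},
    {"type": "ListItem", "text": "Second item"},
    {"type": "ListItem", "text": "Third item"},
    {"type": "Table", "text": "Name Qty Widget 3"},
]
chunks = combine_into_chunks(partitioned, max_characters=100)
```

Here the three list items fit within the limit and end up in a single composite chunk, while the table stays a separate element.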
Element ID
By default, the element ID is a SHA-256 hash of the element’s text, its position on the page, the page number it’s on, and the name of the document file. This ensures that the ID is deterministic and unique at the document level.
To obtain globally unique IDs in the output (UUIDs), you can pass `unique_element_ids=True` into any of the `partition` functions. This can be helpful if you’d like to use the IDs as a primary key in a database, for example.
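The difference between the two ID styles can be illustrated with the standard library. This is a sketch of the idea only: the `deterministic_id` helper, the way the inputs are joined, and the encoding are assumptions, so it will not reproduce the IDs that Unstructured itself emits.

```python
import hashlib
import uuid


def deterministic_id(text, position, page_number, filename):
    # Assumption: inputs are joined with "|"; the real scheme may differ.
    payload = "|".join([filename, str(page_number), str(position), text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same inputs always give the same ID.
id_a = deterministic_id("Hello", 0, 1, "report.pdf")
id_b = deterministic_id("Hello", 0, 1, "report.pdf")

# Changing any input (here, the position) gives a different ID.
id_c = deterministic_id("Hello", 1, 1, "report.pdf")

# A UUID is unique per call, regardless of content.
random_id = str(uuid.uuid4())
```

Deterministic IDs make re-processing the same file idempotent, whereas UUIDs avoid collisions when identical content appears in more than one place.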
Metadata
Unstructured tracks a variety of metadata about the elements extracted from documents. Here are a couple of examples of what element metadata enables you to do:
- filter document elements based on an element metadata value. For instance, you may want to limit your scope to elements from a certain page, or you may want to use only elements that have an email matching a regular expression in their metadata.
- map an element to the document page where it occurred so that the original page can be retrieved when that element matches search criteria.
Metadata is tracked at the element level. You can access the metadata for a given document element with `element.metadata`. For a dictionary representation, use `element.metadata.to_dict()`.
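Both filtering use cases mentioned above can be sketched against dictionary-style metadata, such as what `element.metadata.to_dict()` returns. The sample elements and the `example.com` pattern are made up for illustration.

```python
import re

# Hypothetical elements with metadata already converted to dictionaries.
elements = [
    {"text": "Intro", "metadata": {"page_number": 1}},
    {"text": "Details", "metadata": {"page_number": 2}},
    {"text": "Hello team", "metadata": {"sent_from": ["Alice <alice@example.com>"]}},
]

# Limit the scope to a single page.
page_two = [el for el in elements if el["metadata"].get("page_number") == 2]

# Keep only elements whose sender matches a regular expression.
email_pattern = re.compile(r"@example\.com\b")
from_example = [
    el
    for el in elements
    if any(email_pattern.search(s) for s in el["metadata"].get("sent_from", []))
]
```

Using `.get()` with a default keeps the filters safe for elements that lack a given field, which is common because metadata varies by document type.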
Common metadata fields
All document types return the following metadata fields when the information is available from the source file:
Metadata field name | Description |
---|---|
filename | Filename |
file_directory | File directory |
last_modified | Last modified Date |
filetype | File type |
coordinates | XY Bounding Box Coordinates. See notes below for further details about the bounding box. |
parent_id | Element Hierarchy. parent_id may be used to infer where an element resides within the overall hierarchy of a document. For instance, a NarrativeText element may have a Title element as a parent (a “sub-title”), which in turn may have another Title element as its parent (a “title”). |
category_depth | Element depth relative to other elements of the same category. It’s set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. Category depth may be set using native document hierarchies, e.g. reflecting <H1>, <H2>, or <H3> tags within an HTML document or the indentation level of a bulleted list item in a Word document. |
text_as_html | HTML representation of extracted tables. Only applicable to table elements. |
languages | Document Languages. At document level or element level. List is ordered by probability of being the primary language of the text. |
emphasized_text_contents | Emphasized text (bold or italic) in the original document. |
emphasized_text_tags | Tags on text that is emphasized in the original document. |
is_continuation | True if element is a continuation of a previous element. Only relevant for chunking, if an element was divided into two due to max_characters. |
detection_class_prob | Detection model class probabilities. From unstructured-inference, hi-res strategy. |
Notes on common metadata fields:
Metadata for document hierarchy
`parent_id` and `category_depth` enhance hierarchy detection to identify the document structure in various file formats by measuring the relative depth of an element within its category. This is especially useful in documents with native hierarchies like HTML or Word files, where elements like headings or list items inherently define structure.
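One way to use `parent_id` is to walk up the chain and rebuild the headings that enclose an element. The `ancestors` helper and the short IDs below are hypothetical; real element IDs are hashes or UUIDs.

```python
def ancestors(element_id, by_id):
    """Return the texts of an element's parents, outermost first."""
    path = []
    parent_id = by_id[element_id]["metadata"].get("parent_id")
    while parent_id is not None and parent_id in by_id:
        parent = by_id[parent_id]
        path.append(parent["text"])
        parent_id = parent["metadata"].get("parent_id")
    return list(reversed(path))


# Hypothetical elements: a title, a sub-title, and a paragraph under it.
elements = [
    {"element_id": "t1", "type": "Title", "text": "Annual Report",
     "metadata": {"category_depth": 0}},
    {"element_id": "t2", "type": "Title", "text": "Financials",
     "metadata": {"category_depth": 1, "parent_id": "t1"}},
    {"element_id": "n1", "type": "NarrativeText", "text": "Revenue grew.",
     "metadata": {"parent_id": "t2"}},
]
by_id = {el["element_id"]: el for el in elements}

breadcrumb = ancestors("n1", by_id)
```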
Element’s coordinates
Some document types support location data for the elements, usually in the form of bounding boxes.
If it exists, an element’s location data is available with `element.metadata.coordinates`. The `coordinates` property of an `ElementMetadata` stores:
- `points`: These specify the corners of the bounding box, starting from the top left corner and proceeding counter-clockwise. The points represent pixels, the origin is in the top left, and the `y` coordinate increases in the downward direction.
- `system`: The points have an associated coordinate system. A typical example of a coordinate system is `PixelSpace`, which is used for representing the coordinates of images. The coordinate system has a name, orientation, layout width, and layout height.
The Unstructured Open Source library offers a way to change the coordinates of an element to a new coordinate system by using the `Element.convert_coordinates_to_new_system` method. If the `in_place` flag is `True`, the coordinate system and points of the element are updated in place and the new coordinates are returned. If the `in_place` flag is `False`, only the altered coordinates are returned.
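The geometry behind these fields can be sketched with plain tuples. The helpers below are hypothetical stand-ins, not the library's `convert_coordinates_to_new_system` implementation, and they assume both coordinate systems share a top-left origin with `y` increasing downward, differing only in layout width and height.

```python
def bbox_size(points):
    """Width and height of the bounding box described by its corner points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return max(xs) - min(xs), max(ys) - min(ys)


def rescale_points(points, old_size, new_size):
    """Map points from one layout size to another (same origin and orientation)."""
    old_w, old_h = old_size
    new_w, new_h = new_size
    return tuple((x * new_w / old_w, y * new_h / old_h) for x, y in points)


# Corners listed from the top left, proceeding counter-clockwise:
# top left, bottom left, bottom right, top right.
points = ((10.0, 20.0), (10.0, 60.0), (110.0, 60.0), (110.0, 20.0))

width, height = bbox_size(points)

# Move from a 1000 x 2000 pixel layout to a 500 x 1000 pixel layout.
scaled = rescale_points(points, (1000, 2000), (500, 1000))
```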
Additional metadata fields by document type
Field Name | Applicable Doc Types | Description |
---|---|---|
page_number | DOCX, PDF, PPT, XLSX | Page number |
page_name | XLSX | Sheet name in an Excel document |
sent_from | EML | Email sender |
sent_to | EML | Email recipient |
subject | EML | Email subject |
attached_to_filename | MSG | Filename that the attachment file is attached to |
header_footer_type | Word Doc | Pages a header or footer applies to: “primary”, “even_only”, and “first_page” |
link_urls | HTML | The url associated with a link in a document. |
link_texts | HTML | The text associated with a link in a document. |
section | EPUB | Book section title corresponding to table of contents |
Notes on additional metadata by document type:
Email
Emails will include `sent_from`, `sent_to`, and `subject` metadata. `sent_from` is a list of strings because the RFC 822 spec for emails allows for multiple sender addresses.
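The reason a list is needed can be seen with Python's standard `email` package, independent of Unstructured. The message below is a made-up example with two addresses in its `From` header.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical raw message with two senders in the From header.
raw = (
    "From: Alice <alice@example.com>, Bob <bob@example.com>\n"
    "To: carol@example.com\n"
    "Subject: Quarterly numbers\n"
    "\n"
    "See attached.\n"
)

msg = message_from_string(raw)

# getaddresses() splits each header value into (display name, address) pairs.
senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
subject = msg["Subject"]
```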
Microsoft Excel documents
For Excel documents, `ElementMetadata` will contain a `page_name` field, which corresponds to the sheet name in the Excel document.
Microsoft Word documents
Headers and footers in Word documents include a `header_footer_type` indicating which page a header or footer applies to.
Valid values are `"primary"`, `"even_only"`, and `"first_page"`.
Table-specific metadata
For `Table` elements, the raw text of the table will be stored in the `text` attribute for the Element, and an HTML representation of the table will be available in the element metadata under `element.metadata.text_as_html`. By default, Unstructured will automatically extract all tables for all doc types unless you set the `skip_infer_table_types` parameter.
A table element therefore carries two views of the same table: the raw `text` of the element, and the `text_as_html` metadata for the same element.
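As an illustrative stand-in, the sketch below pairs a made-up `text` value with a made-up `text_as_html` value and recovers the rows from the HTML using only the standard library. The table contents and the `TableRows` helper are hypothetical; real output depends on the source document.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell texts of each <tr> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate cells that appear outside a <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


# Hypothetical table element: flat text plus an HTML view of the same table.
table_element = {
    "type": "Table",
    "text": "Name Qty Widget 3",
    "metadata": {
        "text_as_html": (
            "<table><tr><th>Name</th><th>Qty</th></tr>"
            "<tr><td>Widget</td><td>3</td></tr></table>"
        )
    },
}

parser = TableRows()
parser.feed(table_element["metadata"]["text_as_html"])
rows = parser.rows
```

The flat `text` is convenient for search and embedding, while the HTML view keeps the row and column structure that the flat text loses.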
Data connector metadata fields
Documents processed through source connectors include additional document metadata. These additional fields only ever appear if the source document was processed by a connector.
Common data connector metadata fields
- Data Source metadata (on json output):
  - url
  - version
  - date created
  - date modified
  - date processed
  - record locator
- Record locator is specific to each connector
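Reading these fields from JSON output might look like the sketch below. The key names (`data_source`, `record_locator`, `date_created`, and so on) and all values are assumptions modeled on the list above, so check them against your actual connector output.

```python
import json

# Hypothetical JSON output; key names and values are assumptions.
raw_output = """
[
  {
    "type": "Title",
    "text": "Release notes",
    "metadata": {
      "filename": "notes.pdf",
      "data_source": {
        "url": "s3://example-bucket/docs/notes.pdf",
        "version": "1",
        "date_created": "2024-01-01T00:00:00",
        "date_modified": "2024-01-02T00:00:00",
        "date_processed": "2024-01-03T00:00:00",
        "record_locator": {
          "protocol": "s3",
          "remote_file_path": "s3://example-bucket/docs/"
        }
      }
    }
  },
  {
    "type": "NarrativeText",
    "text": "Partitioned locally, so no connector metadata.",
    "metadata": {"filename": "local.txt"}
  }
]
"""

elements = json.loads(raw_output)

# Connector fields only exist for elements that came through a connector.
sources = [
    el["metadata"]["data_source"]
    for el in elements
    if "data_source" in el["metadata"]
]
locator = sources[0]["record_locator"]
```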
Additional metadata fields by connector type (via record locator)
Source connector | Additional metadata |
---|---|
airtable | base id, table id, view id |
azure (from fsspec) | protocol, remote file path |
box (from fsspec) | protocol, remote file path |
confluence | url, page id |
discord | channel |
dropbox (from fsspec) | protocol, remote file path |
elasticsearch | url, index name, document id |
fsspec | protocol, remote file path |
google drive | drive id, file id |
gcs (from fsspec) | protocol, remote file path |
jira | base url, issue key |
onedrive | user pname, server relative path |
outlook | message id, user email |
s3 (from fsspec) | protocol, remote file path |
sharepoint | server path, site url |
wikipedia | page title, page url |