Orbit Extract

Orbit Extract is a system that extracts both objective and subjective information from a range of documents.

Objective information has clearly defined attributes (for example, identifiers, company names, and invoice amounts) that can always be traced back to their exact location in the source file.

Subjective information is derived and summarised by algorithms, for example: sentiment, summarisation, and Q&A. Because absolute accuracy is not attainable here, the quality of subjective extraction is measured against manual annotations using accuracy, recall, and other chosen statistics.

Components

The key components and architecture of the system:
[Extract architecture diagram]
The system is designed to process all unstructured data types, for example: PDFs, web pages, Word documents and scanned files.

Off-the-shelf datasets

In many cases, analysis involves processing publicly available information. Orbit provides a wide range of off-the-shelf global datasets which are updated daily by the Orbit Data team (please refer to Orbit Data).

Content entitlement

For Orbit-provided content, the data is run through further processing stages with entitlement controls to manage data coverage, frequency, and latency.

Bespoke collection

Orbit offers a data collection service for bespoke requirements.

Where internal documents need to be processed together with their related internal metadata, Orbit provides a number of connectors to enable automatic uploading of the data into its secure cloud environment.

Depending on the use case, a number of different pre-processing and functional algorithms can be applied. Orbit hosts and provides a range of generic processing capabilities via API, as well as training tailored project-specific models for clients.

Orbit provides a cloud environment covering both data storage and computation and, as required, can also provide on-premises or hybrid deployment solutions.

A number of algorithms are available for both objective and subjective extraction; depending upon the task at hand, one or more of the following will be utilised on client projects.

Pre-processing

Algorithms to extract text from documents, split into sentences, etc.
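As an illustration of the kind of pre-processing involved, here is a minimal sentence splitter in Python. This is a sketch only, not Orbit's actual implementation; production pipelines must also handle abbreviations, decimals, and layout artefacts.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break on ., ! or ? followed by whitespace
    # and an upper-case letter.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in parts if p]

doc = "Revenue grew 12%. The board approved a dividend. Risks remain."
print(split_sentences(doc))
# ['Revenue grew 12%.', 'The board approved a dividend.', 'Risks remain.']
```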

Rule-based extraction

For simple objective extractions, e.g. attributes from consistent formats, rules are the most efficient approach. Rules can be explicitly defined and users can be trained to configure them, ranging from keyword matching and regular expressions to linguistic pattern-based matching.
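A hedged sketch of what a configured rule might look like; the field names and regular expressions below are illustrative examples, not Orbit's actual rule syntax. Note that each match records its character offsets, so the extracted value can be traced back to its location in the document.

```python
import re

# Illustrative rules: each maps a field name to a regular expression
# with one capture group.
RULES = {
    "invoice_number": re.compile(r"Invoice\s+No\.?\s*[:#]?\s*([A-Z0-9-]+)"),
    "amount": re.compile(r"Total\s+Due\s*:?\s*\$?([\d,]+\.\d{2})"),
}

def apply_rules(text: str) -> dict:
    results = {}
    for field, pattern in RULES.items():
        m = pattern.search(text)
        if m:
            # Store the value and its character offsets for traceability.
            results[field] = {"value": m.group(1), "span": m.span(1)}
    return results

doc = "Invoice No: INV-2041\nTotal Due: $1,250.00"
print(apply_rules(doc))
```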

Template-based extraction

For lengthy documents, e.g. prospectuses and annual reports, where extractions rely on document structure, templates are the best method.
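To illustrate the idea (a sketch under assumed heading names, not Orbit's template format): for a known report layout, a template anchors each field to the document's structure, such as the headings that open and close a section, rather than to free-text patterns.

```python
import re

# Illustrative template: each field is defined by the headings that
# bound it in a known document layout. Heading names are examples.
TEMPLATE = {
    "risk_factors": ("RISK FACTORS", "USE OF PROCEEDS"),
}

def extract_section(text: str, start: str, end: str) -> str:
    # Return the text between two headings of a known layout.
    m = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text, re.S)
    return m.group(1).strip() if m else ""

doc = "RISK FACTORS\nMarket risk applies.\nUSE OF PROCEEDS\nGeneral purposes."
print(extract_section(doc, *TEMPLATE["risk_factors"]))
# Market risk applies.
```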

OCR extraction

For more straightforward documents with huge volume and inconsistent formats, e.g. invoices and short contracts, it can be easier to convert the documents to images and run Optical Character Recognition (OCR) to extract the text.

Model-based extraction

For tasks where it’s harder to define patterns or rules, the only feasible approach is to develop bespoke models.

ESG classification model

Orbit provides a sentence-level ESG topic classification model off-the-shelf to help locate relevant content more efficiently.

Q&A model

Orbit has developed a bespoke framework that takes a natural language question and looks for candidate answers against pre-defined content to locate relevant information.

Sentence level event detection

A deep learning model that runs on public documents, for example news, to detect entities and pre-defined events at sentence level. A typical use case is scanning news for ESG controversies with a high confidence level.

Summarisation

Generates summaries from news, research reports, and more.

Sentiment

A multi-lingual sentiment model operating at article and sentence levels.

Manual review

Machine extraction can never be 100% accurate. To address this, Orbit provides user interfaces and workflows for correcting extractions manually, and also offers managed services for this work.

Task assignment

Allow team collaboration

QA

Quality assurance checks

Reporting

Monitoring progress

The same workflows can be used to generate training data via manual labelling, so users can train their own models.
How it works

To receive a clean extraction as a dataset

We decide together the:
  • Data source
  • Extraction logic
  • Extraction frequency
  • Exception handling logic (if required)
Clients will receive the clean dataset, with the ability to check extractions against the original documents.

To use as an efficiency improvement tool

We configure the solution for your specific use case.
If you need us to prepare raw data we can set this up for you, otherwise you can upload your own in-house or sourced data to the system directly.
You can then operate the system as required.

Orbit Extract for…

ESG

Document parsing

News monitoring

Quantitative analysis

Fund rebates

Research evaluation

Extraction use cases are broad. Let's discuss solutions to your challenges.