20 November 2020

AWS-Textract

  • Textract is a document analysis service that detects and extracts text, structured data and tables from images and scans of documents.
  • It's ML models have been trained on millions of documents so that virtually any document type users upload is automatically recognized & processed for text extraction.
  • It can detect Latin-script characters from the standard English alphabet and ASCII symbols.
  • It currently supports PNG, JPEG, and PDF formats.
  • It supports logging of the following actions as CloudTrail events - DetectDocumentText, AnalyzeDocument, StartDocumentTextDetection, StartDocumentAnalysis, GetDocumentTextDetection & GetDocumentAnalysis.
  • It charges users based on the number of pages and images processed.
  • Data from Textract is encrypted and stored at rest in the AWS region where users are using Textract.
  • It is compliant with SOC-1, SOC-2, SOC-3, ISO 9001, ISO 27001, ISO 27017 and ISO 27018.
  • It uses Optical Character Recognition (OCR) technology to automatically detect printed text and numbers in a scan or rendering of a document, such as a legal document or a scan of a book.
  • It enables users to detect key-value pairs in document images automatically so that they can retain the inherent context of the document without any manual intervention.
  • It preserves the composition of data stored in tables during extraction.
  • It is directly integrated with Amazon A2I so users can easily implement human review of text extracted from documents.
  • Users can easily process millions of documents using Textract's text extraction APIs.
  • With synchronous processing, Textract can analyze single-page documents for applications where latency is critical.
  • It provides asynchronous operations to extend support to multipage documents.
  • With AWS Batch, Textract is able to process multiple document images in a single operation.
  • To detect text asynchronously, use StartDocumentTextDetection to start processing an input document file.
  • To detect text synchronously, use the DetectDocumentText API operation and pass a document file as input.
  • It analyzes documents and forms for relationships between detected text.
  • It analysis operations return three categories of text extraction: text, forms and tables
  • For Textract synchronous operations, users can use input documents that are stored in S3 bucket, or they can pass base64-encoded image bytes.
  • It can detect selection elements such as option buttons and check boxes on a document page.
  • It conforms to the AWS shared responsibility model, which includes regulations and guidelines for data protection.
  • It communicates exclusively via HTTPS endpoints, which are supported in all Regions supported by It.
  • It is protected by the AWS global network security procedures.

No comments:

Post a Comment

Most views on this month