Search text from PDF files stored in an S3 bucket


Does your application allow users to upload PDFs? Maybe they upload resumes, waivers, agreements or signed documents. What if they need to search the contents of these PDFs?

As a developer, you have 3 options:

  1. Search by Filename: Lookup by key/value like filename [Native]
  2. Search by Metadata: Store the metadata in a separate database to perform queries [Database add-on]
  3. Full-Text-Search: Extract the contents into a search engine [OCR, Database, Search add-on]
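Of these, Option 1 needs the least machinery: it reduces to a substring match over object keys. A minimal sketch (the helper name and sample keys are ours; in practice the keys would come from `s3_client.list_objects_v2`):

```python
# Option 1 sketch: naive filename lookup over a list of S3 object keys.
# It can only see the key, never the contents of the PDF -- which is
# exactly the limitation full-text search removes.
def search_by_filename(keys, term):
    return [k for k in keys if term.lower() in k.lower()]

keys = ["resumes/jane-doe.pdf", "waivers/ski-trip.pdf", "agreements/nda.pdf"]
print(search_by_filename(keys, "waiver"))  # ['waivers/ski-trip.pdf']
```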

Full Text Search provides the most intuitive user experience, but it’s also the most challenging to build, maintain, and enhance.


In this tutorial, we’ll walk through best practices for PDF file upload, content extraction via OCR (Optical Character Recognition), and search, so you can add full-text PDF search to your application with ease.

Bonus: At the end, you’ll find a GitHub repository so you can import the code directly into your application.

Store the file

First we need a function to download the file locally in order to run our OCR extraction logic:

import boto3

# Replace the placeholder credentials with your own
s3_client = boto3.client(
    's3',
    aws_access_key_id='aws_access_key_id',
    aws_secret_access_key='aws_secret_access_key',
    region_name='region_name'
)

# Download the object to a local file of the same name
with open(s3_file_name, 'wb') as file:
    s3_client.download_fileobj(
        bucket_name,
        s3_file_name,
        file
    )
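A missing key or revoked credential raises an exception mid-pipeline, so it can help to wrap the download in a small guard. A sketch (`download_pdf` is a hypothetical helper; it accepts any client exposing `download_fileobj`):

```python
def download_pdf(client, bucket_name, key, dest_path):
    # Returns True on success; boto3 raises botocore's ClientError for
    # missing keys or bad credentials, which we catch broadly here.
    try:
        with open(dest_path, 'wb') as f:
            client.download_fileobj(bucket_name, key, f)
        return True
    except Exception as exc:
        print(f"download failed for {key}: {exc}")
        return False
```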

Extract the contents

We’ll use the open-source Apache Tika library, whose AutoDetectParser class performs OCR (optical character recognition). Note that the tika Python package spins up a local Tika server under the hood, which requires Java:

from tika import parser
parsed_pdf_content = parser.from_file(s3_file_name)['content']
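Tika’s `content` can be `None` for image-only PDFs that OCR fails to read, and successful parses arrive riddled with page-break whitespace, so it’s worth normalizing before indexing. A minimal sketch (`clean_content` is our helper):

```python
import re

def clean_content(raw):
    # Guard against None and collapse runs of whitespace/newlines
    if not raw:
        return ""
    return re.sub(r"\s+", " ", raw).strip()

print(clean_content("  Patient:\n\n Jane Doe \n"))  # Patient: Jane Doe
```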

Insert contents into a search engine

We’re using a self-managed OpenSearch node here, but you could use Lucene, Solr, Elasticsearch, or Atlas Search instead.

Note: if you don’t have OpenSearch locally, install it first, then run it:

brew update  
brew install opensearch  
opensearch

OpenSearch will now be accessible here: http://localhost:9200. Let’s build the index and insert the file contents:

from opensearchpy import OpenSearch

os = OpenSearch("http://localhost:9200/")  # note: shadows Python's built-in os module

index_name = "pdf-search"
doc = {
    "filename": s3_file_name,
    "parsed_pdf_content": parsed_pdf_content
}

response = os.index(
    index=index_name,
    body=doc,
    id=1,
    refresh=True
)
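By default, `os.index` auto-creates the index with dynamic mappings. If you want control over the field types, you can create the index explicitly first by passing a body like the one below to `os.indices.create(index=index_name, body=index_body)` (a sketch; the field types are our choice):

```python
# Explicit mapping: parsed_pdf_content is analyzed for full-text search,
# while filename stays an exact-match keyword field.
index_body = {
    "settings": {"index": {"number_of_shards": 1}},
    "mappings": {
        "properties": {
            "filename": {"type": "keyword"},
            "parsed_pdf_content": {"type": "text"}
        }
    }
}

# With the client from above:
# os.indices.create(index="pdf-search", body=index_body)
```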

Creating a PDF search API

We’ll use Flask to create a small microservice that exposes a search endpoint:

from flask import Flask, jsonify, request  
from opensearchpy import OpenSearch  
from config import *

app = Flask(__name__)  
os = OpenSearch("http://localhost:9200/")

@app.route('/search', methods=['GET'])  
def search_file():  
    query = request.args.get('q', default=None, type=str)

    # query payload to OpenSearch
    payload = {  
        'query': {  
            'match': {  
                'parsed_pdf_content': query  
            }  
        }  
    }  

    response = os.search(  
        body=payload,  
        index=index_name  
    )

    return jsonify(response)

if __name__ == '__main__':  
    app.run(host="localhost", port=5011, debug=True)

Now we can call the API via:

GET: http://localhost:5011/search?q=SEARCH_TERM
{  
  "_shards": {  
    "failed": 0,  
    "skipped": 0,  
    "successful": 1,  
    "total": 1  
  },  
  "hits": {  
    "hits": [  
      {  
        "_id": "1",  
        "_index": "pdf-search",  
        "_score": 0.29289162,  
        "_source": {  
          "filename": "prescription.pdf",  
          "parsed_pdf_content": "..."  
        }  
      }  
    ],  
    "max_score": 0.29289162,  
    "total": {  
      "relation": "eq",  
      "value": 1  
    }  
  },  
  "timed_out": false,  
  "took": 40  
}
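The raw response is verbose for a frontend; a small helper can flatten the hits into (filename, score) pairs. A sketch (`summarize_hits` is our name; the field names match the document shape above):

```python
def summarize_hits(response):
    # Pull the filename and relevance score out of each hit
    return [
        (hit["_source"]["filename"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]

sample = {
    "hits": {
        "hits": [
            {"_score": 0.29289162, "_source": {"filename": "prescription.pdf"}}
        ]
    }
}
print(summarize_hits(sample))  # [('prescription.pdf', 0.29289162)]
```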

Whoo, we did it! We’ve successfully created an API that offers full-text PDF search.


You can download the repo here: https://github.com/mixpeek/pdf-search-s3

So what’s next?

  • Queuing: Ensuring concurrent file uploads are not dropped
  • Security: Adding end-to-end encryption to the data pipeline
  • Enhancements: Including more features like fuzzy matching, highlighting, and autocomplete
  • Rate Limiting: Building thresholds so users don’t abuse the system
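The enhancements bullet is mostly a query-DSL change: fuzzy matching and highlighting can ride on the same match query the Flask service already sends. A sketch using standard OpenSearch query DSL ("AUTO" is one reasonable fuzziness default; tune it for your data):

```python
# Typo-tolerant search with highlighted fragments
payload = {
    "query": {
        "match": {
            "parsed_pdf_content": {
                "query": "prescripton",  # note the typo -- still matches
                "fuzziness": "AUTO"
            }
        }
    },
    "highlight": {
        "fields": {"parsed_pdf_content": {}}
    }
}

# os.search(body=payload, index=index_name) then returns a "highlight"
# section on each hit with the matched fragments wrapped in <em> tags.
```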

Everything collapsed into just 2 API calls

If this feels like too much for you to build, maintain, and enhance, Mixpeek has you covered.

Upload

import requests

url = "https://api.mixpeek.com/upload"  
files = [  
    ('file', ('FILE_NAME.pdf', open('FILE_NAME.pdf', 'rb'), 'pdf'))  
]  
response = requests.request("POST", url, files=files)

Search

import requests

url = "https://api.mixpeek.com/search?q=SEARCH_QUERY"
response = requests.request("GET", url)
print(response.text)

Corresponding Postman Collection for your convenience.

Request an API key for free, and review the docs to get started.