Semantic Wikipedia Search with Transformers and DistilBERT

Input: 1 text file with 1 sentence per line
Output: top_k number of sentences that match the input query
Jina version: 0.9.22

This is an example of using Jina's neural search framework to search through a selection of individual Wikipedia sentences downloaded from Kaggle. It's based on code generated by jina hub new --type app and uses the distilbert-base-uncased language model from Transformers.

Run in Docker

To test this example you can run a Docker image with 30,000 pre-indexed sentences:

docker run -p 45678:45678 jinahub/app.example.wikipedia-sentences-30k:0.2.8-0.9.23

You can then query by running:

curl --request POST -d '{"top_k": 10, "mode": "search",  "data": ["text:hello world"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:45678/api/search'
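
The same query can be sent from Python. The following is a minimal sketch using only the standard library; it mirrors the URL, headers, and JSON body of the curl command above. The helper names build_search_request and search are hypothetical and not part of this repo, and the response is returned as decoded JSON without assuming anything about its structure.

```python
import json
import urllib.request

# Endpoint copied from the curl command; change it if you mapped the port differently.
SEARCH_URL = "http://0.0.0.0:45678/api/search"


def build_search_request(query, top_k=10, url=SEARCH_URL):
    """Build the same POST request as the curl command (hypothetical helper)."""
    payload = {"top_k": top_k, "mode": "search", "data": [f"text:{query}"]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, top_k=10, url=SEARCH_URL):
    """Send the query and return the decoded JSON response unchanged."""
    request = build_search_request(query, top_k, url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, search("hello world") should return the server's JSON reply as a Python object.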

Setup

pip install -r requirements.txt

Index

We'll start off by indexing a small dataset of 50 sentences (data/toy-input.txt) to make sure everything is working:

python app.py -t index

To index the full dataset (almost 900 MB):

  1. Set up Kaggle
  2. Run the script: sh ./get_data.sh
  3. Set the input file: export JINA_DATA_FILE='data/input.txt'
  4. Set the number of docs to index: export JINA_MAX_DOCS=500 (or whatever number you prefer; the default is 50)
  5. Delete the old index: rm -rf workspace
  6. Index your new dataset: python app.py -t index
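
To illustrate what the two environment variables in the steps above control, here is a rough sketch of how an indexing script could read them and cap the number of input lines. It is not taken from app.py: the function names are hypothetical, and treating data/toy-input.txt as the fallback for JINA_DATA_FILE is an assumption based on the toy dataset mentioned earlier. The default of 50 docs comes from step 4.

```python
import os
from itertools import islice


def load_settings(env=os.environ):
    """Read the two variables from the steps above (hypothetical helper)."""
    # Assumption: the toy dataset is used when JINA_DATA_FILE is unset.
    data_file = env.get("JINA_DATA_FILE", "data/toy-input.txt")
    # The README states that the default number of docs is 50.
    max_docs = int(env.get("JINA_MAX_DOCS", 50))
    return data_file, max_docs


def read_sentences(path, max_docs):
    """Yield up to max_docs non-empty lines; the input has one sentence per line."""
    with open(path, encoding="utf-8") as f:
        stripped = (line.strip() for line in f)
        yield from islice((line for line in stripped if line), max_docs)
```

Because read_sentences is a generator, only the first max_docs sentences are read from disk, which matters for an input file of almost 900 MB.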

Search

With REST API

python app.py -t query_restful

Then:

curl --request POST -d '{"top_k": 10, "mode": "search",  "data": ["text:hello world"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:45678/api/search'

Or use Jinabox with endpoint http://127.0.0.1:45678/api/search

From the Terminal

python app.py -t query

Build a Docker Image

This will create a Docker image with pre-indexed data and an open port for REST queries.

  1. Run all the steps in Setup and Index first. Don't run anything in the Search section!
  2. If you want to push to Jina Hub, be sure to edit the LABELs in the Dockerfile to avoid clashing with other images
  3. Run docker build -t <your_image_name> . in the root directory of this repo
  4. Run it with docker run -p 45678:45678 <your_image_name>
  5. Search using instructions from Search above

Image name format

Use the following name format for your Docker image; otherwise it will be rejected when you try to push it to Jina Hub.

jinahub/type.kind.image-name:image-version-jina_version

For example:

jinahub/app.example.wikipedia-sentences-30k:0.2.8-0.9.23
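
If you want to check a name before building, a small sketch like the following can help. The regular expression is inferred only from the format line and the single example above, so treat the allowed characters and the three-part version numbers as assumptions rather than Jina Hub's actual validation rules. The helper parse_image_name is hypothetical.

```python
import re

# Assumption: pattern inferred from the README's format line and one example.
IMAGE_NAME = re.compile(
    r"^jinahub/(?P<type>[a-z]+)\.(?P<kind>[a-z]+)\.(?P<name>[a-z0-9-]+)"
    r":(?P<image_version>\d+\.\d+\.\d+)-(?P<jina_version>\d+\.\d+\.\d+)$"
)


def parse_image_name(name):
    """Return the parts of the name as a dict, or None if it does not fit."""
    match = IMAGE_NAME.match(name)
    return match.groupdict() if match else None
```

A name such as my-image:latest yields None, while the example above splits into its type, kind, image name, image version, and Jina version.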

Push to Jina Hub

  1. Ensure the hub extra is installed with pip install "jina[hub]" (quoted so that shells such as zsh don't try to expand the brackets)
  2. Run jina hub login and paste the code into your browser to authenticate
  3. Run jina hub push <your_image_name>