2022-02-16: Pyserini: an Information Retrieval Framework
Pyserini is an information retrieval toolkit first released in 2019. Given a query, it returns a ranked list of documents relevant to that query. Pyserini supports sparse retrieval, dense retrieval (which relies on deep learning), and hybrid retrieval that combines the two. Of these, sparse retrieval (BM25 scoring over bag-of-words representations) is enough for basic information retrieval tasks. In this post I introduce sparse retrieval in Pyserini and walk through its installation and use in detail.
Pyserini depends on Anserini, an information retrieval toolkit built on Lucene. Anserini is implemented in Java, and Pyserini is its Python wrapper. Anserini runs on the JVM, and Pyserini uses PyJNIus to interact with the JVM from Python.
Installation of Pyserini (Sparse Retrieval Mode)
Pyserini can be installed in an Anaconda virtual environment. I did this on Windows, but you can also do it on Linux. To install Pyserini for sparse retrieval, follow these steps:
Create a new environment:
$ conda create -n pyserini python=3.6
$ conda activate pyserini
Install JDK 11 via conda:
$ conda install -c conda-forge openjdk=11
Install Pyserini and other necessary packages:
$ pip install pyserini
$ pip install onnxruntime
$ conda install -c conda-forge pyjnius
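Since PyJNIus needs to find a Java runtime, it can help to confirm that the conda environment actually exposes one before going further. The following is a minimal sketch of such a check, not part of Pyserini: the function name check_java is my own, and it only reports what the `java` launcher on the PATH and the JAVA_HOME variable say.

```python
import os
import shutil
import subprocess


def check_java():
    """Return a short report on the Java runtime visible to this environment."""
    java_path = shutil.which("java")  # None if `java` is not on PATH
    version = None
    if java_path is not None:
        # `java -version` prints its banner to stderr, not stdout.
        result = subprocess.run(
            [java_path, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        banner = (result.stderr or result.stdout).strip()
        version = banner.splitlines()[0] if banner else None
    return {
        "java_path": java_path,
        "java_home": os.environ.get("JAVA_HOME"),
        "version": version,
    }


if __name__ == "__main__":
    for key, value in check_java().items():
        print(f"{key}: {value}")
```

If java_path comes back as None inside the activated environment, the JDK step above probably did not take effect.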
Lucene Index
Pyserini stores documents in an inverted index. Specifically, every term in the document collection is kept in a dictionary, and for each term the index records which documents it occurs in and at which positions. Pyserini uses the Lucene index format, so before we can retrieve anything we need to build a Lucene index from our data. The data should be pre-processed into the following JSON format:
{
"id": "doc1",
"contents": "this is the contents."
}
There are three ways to organize the JSON files:
A folder with each JSON document in its own file
A folder with files, each of which contains an array of JSON documents
A folder with files, each of which contains one JSON document per line
The keys of each document must be exactly ‘id’ and ‘contents’; otherwise Pyserini will not recognize it.
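As an illustration of the third layout, here is a minimal sketch that writes a list of documents as one JSON object per line. The helper name write_jsonl and the default file name documents.jsonl are my own choices for the example, not something Pyserini requires.

```python
import json
import os
import tempfile


def write_jsonl(docs, folder, filename="documents.jsonl"):
    """Write (doc_id, text) pairs as one JSON object per line."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, text in docs:
            # Only the two keys the indexer expects: 'id' and 'contents'.
            record = {"id": str(doc_id), "contents": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    sample = [("doc1", "this is the contents."), ("doc2", "another short document.")]
    # A temporary folder keeps the demo free of side effects.
    with tempfile.TemporaryDirectory() as folder:
        with open(write_jsonl(sample, folder), encoding="utf-8") as f:
            print(f.read())
```

The folder passed to write_jsonl is the one you would then hand to the indexer as its input directory.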
The following command is used to generate the Lucene index:
$ python -m pyserini.index \
--input integrations/resources/sample_collection_jsonl \
--collection JsonCollection \
--generator DefaultLuceneDocumentGenerator \
--index indexes/sample_collection_jsonl \
--threads 1 \
--storePositions --storeDocvectors --storeRaw
For example:
$ python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator -threads 1 -input D:/Kaggle/CORD19/CORD19/document_parses/sample_collection_jsonl3 -index indexes/sample_collection_jsonl3 -storePositions -storeDocvectors -storeRaw
The ‘-input’ option gives the directory where the JSON data files are stored. The ‘-index’ option gives the directory where the Lucene index is written. Because the path here is relative, it is resolved against the current working directory, which for me was ‘C:/Users/yourname’, so the full path is ‘C:/Users/yourname/indexes/sample_collection_jsonl3’.
The last three options, “-storePositions -storeDocvectors -storeRaw”, indicate that this is a standard positional index that also stores document vectors and the raw documents. If none of them is specified, Pyserini builds an index that only stores term frequencies.
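To make the difference between a positional index and a term-frequency-only index concrete, here is a toy sketch in plain Python. It is only an illustration of the idea: it uses naive whitespace tokenization and Python dictionaries, and has nothing to do with how Lucene actually stores its index on disk.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} using whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


def term_frequencies(index):
    """Drop positions and keep only per-document term counts."""
    return {
        term: {doc_id: len(positions) for doc_id, positions in postings.items()}
        for term, postings in index.items()
    }


if __name__ == "__main__":
    docs = {"doc1": "the cat sat on the mat", "doc2": "the dog sat"}
    index = build_positional_index(docs)
    print(index["the"])                    # {'doc1': [0, 4], 'doc2': [0]}
    print(term_frequencies(index)["the"])  # {'doc1': 2, 'doc2': 1}
```

The positional version is what makes phrase-style matching possible, while the term-frequency version is smaller and still sufficient for bag-of-words scoring.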
Retrieval
Pyserini retrieves relevant documents based on BM25 scores. The following command is for batch retrieval:
$ python -m pyserini.search \
--topics integrations/resources/sample_queries.tsv \
--index indexes/sample_collection_jsonl \
--output run.sample.txt \
--bm25
‘--topics’ specifies the query file and the extension must be .tsv.
‘--index’ specifies the directory for the Lucene index.
‘--output’ specifies the ranking file name.
For example:
$ python -m pyserini.search --topics D:/Kaggle/sample_query.tsv --index indexes/sample_collection_jsonl3 --output D:/Kaggle/run.sample.txt --bm25
$ cat run.sample.txt | head -6
1 Q0 doc2 1 0.256200 Anserini
1 Q0 doc3 2 0.231400 Anserini
2 Q0 doc1 1 0.534600 Anserini
3 Q0 doc1 1 0.256200 Anserini
3 Q0 doc2 2 0.256199 Anserini
4 Q0 doc3 1 0.483000 Anserini
By default, Pyserini retrieves up to 1000 documents for each query and saves them in the output file run.sample.txt.
A more customized way to retrieve documents is to run Python code such as the following:
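Both files on either side of batch retrieval are plain text, so they are easy to produce and consume from Python. The sketch below rests on two assumptions: that each line of the query file holds a query id and the query text separated by a tab, and that each line of the run file has the six whitespace-separated columns shown above (query id, the literal Q0, document id, rank, score, run tag). The function names write_topics and read_run are my own.

```python
import os
import tempfile
from collections import defaultdict


def write_topics(queries, path):
    """Write {query_id: query_text} as tab-separated lines (assumed layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for qid, text in queries.items():
            f.write(f"{qid}\t{text}\n")


def read_run(path):
    """Parse a run file into {query_id: [(docid, rank, score), ...]}."""
    run = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 6:
                continue  # skip blank or malformed lines
            qid, _q0, docid, rank, score, _tag = parts
            run[qid].append((docid, int(rank), float(score)))
    return dict(run)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        run_path = os.path.join(folder, "run.sample.txt")
        with open(run_path, "w", encoding="utf-8") as f:
            f.write("1 Q0 doc2 1 0.256200 Anserini\n")
            f.write("1 Q0 doc3 2 0.231400 Anserini\n")
        print(read_run(run_path))
```

Keeping the query id as a string avoids surprises if your ids are not purely numeric.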
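from pyserini.search import SimpleSearcher

# Open a Lucene index directory.
searcher = SimpleSearcher(r'C:\Users\name\indexes\sample_collection_jsonl2')
# Run the query; the top 10 hits are returned by default.
hits = searcher.search('Query content')
# Print rank, document id, and score for each hit.
for i in range(len(hits)):
    print(f'{i+1:2} {hits[i].docid:4} {hits[i].score:.5f}')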
Put the code above in a Python script and run it with the commands:
$ cd path_of_python_script
$ python3 name_of_the_script.py
1 rd0g5j55 1.68690
2 dr2c1be5 1.67700
3 kfrp45sy 1.65200
4 5ur9du2p 1.62170
5 efrv5nvf 1.58910
6 glexsajf 1.57890
7 kfs0jl3w 1.57690
8 tv7gb7zx 1.57690
9 xbrkytbs 1.57690
10 b97yqstj 1.57290
By default, search returns the top 10 hits. Pass a second argument to change this number:
hits = searcher.search('Query content', 100)
This returns up to 100 documents.
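The scores in the output are BM25 scores. To give a feel for where such numbers come from, here is a toy textbook-style BM25 scorer. It is a sketch under stated assumptions, not Lucene's implementation: tokenization is a plain whitespace split, the collection must be non-empty, and the values k1=0.9 and b=0.4 are what I understand Anserini's defaults to be. Scores from this sketch will not match Pyserini's exactly.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Rank every document in `docs` ({doc_id: text}) against `query`."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # term absent from this document
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores[doc_id] = score
    # Highest score first, like the ranked output above.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "d1": "virus particles antigen",
        "d2": "weather report today",
        "d3": "virus virus outbreak report",
    }
    for rank, (doc_id, score) in enumerate(bm25_scores("virus", docs), start=1):
        print(f"{rank:2} {doc_id:4} {score:.5f}")
```

Documents that mention a query term more often score higher, but with diminishing returns, and rare terms count for more than common ones.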
Fetch the Contents
After obtaining the query results, you may want to look at the contents of the top documents. There are two ways to fetch them. The first is to use hits[i].raw. For example:
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher(r'C:\Users\weixi\indexes\sample_collection_jsonl2')
hits = searcher.search('Query content')
# Print the raw stored JSON of the top hit only.
for i in range(0, 1):
    print(hits[i].raw)
Put the code above in a Python script and run it with the commands:
$ cd path_of_python_script
$ python3 name_of_the_script.py
{
"id" : "xbrkytbs",
"contents" : "Construction and Immunogenicity of Novel Chimeric Virus-Like Particles Bearing Antigens of Infectious Bronchitis Virus and Newcastle Disease Virus"
}
The second method is to look up a document by its id with searcher.doc() and then call doc.raw():
from pyserini.search import SimpleSearcher
import json

searcher = SimpleSearcher(r'C:\Users\weixi\indexes\sample_collection_jsonl2')
# Look up a document directly by its id.
doc = searcher.doc('xbrkytbs')
print(doc.raw())
# Parse the raw JSON and extract the 'contents' field.
json_doc = json.loads(doc.raw())
contents = json_doc['contents']
print(contents)
Put the code above in a Python script and run it with the commands:
$ cd path_of_python_script
$ python3 name_of_the_script.py
{
"id" : "xbrkytbs",
"contents" : "Construction and Immunogenicity of Novel Chimeric Virus-Like Particles Bearing Antigens of Infectious Bronchitis Virus and Newcastle Disease Virus"
}
Construction and Immunogenicity of Novel Chimeric Virus-Like Particles Bearing Antigens of Infectious Bronchitis Virus and Newcastle Disease Virus
With the above methods, you can obtain the contents of the retrieval results and use them for subsequent analysis.
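For that kind of follow-up analysis it is convenient to collect the contents of several results at once. Here is a minimal sketch, assuming each raw string is a JSON object with ‘id’ and ‘contents’ keys like the output above. The helper name contents_from_raw is my own, and it works on plain strings so it can be tried without an index.

```python
import json


def contents_from_raw(raw_docs):
    """Turn raw JSON strings into {doc_id: contents}, skipping unparsable ones."""
    contents = {}
    for raw in raw_docs:
        try:
            record = json.loads(raw)
        except (TypeError, ValueError):
            continue  # raw was None or not valid JSON
        if isinstance(record, dict) and "id" in record and "contents" in record:
            contents[record["id"]] = record["contents"]
    return contents


if __name__ == "__main__":
    raws = ['{"id": "doc1", "contents": "this is the contents."}', None]
    print(contents_from_raw(raws))
```

With a live searcher you would pass it something like (hit.raw for hit in hits).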
--Xin Wei