Posts

Showing posts from February, 2022

2022-02-25: Evaluating MAN, the Tool that Utilizes Google Translate to Normalize Arabic Names' Transliterations in Cross-Language Information Retrieval

Introduction: The growing use of Natural Language Processing (NLP) techniques is fueled by the need to process massive amounts of data, the demand for clever chatbots, and other human-computer interaction tasks. Named Entity Recognition (NER) is one of the most important techniques in NLP. The extracted named entities give computers a way to classify documents, perform semantic analysis on textual information, and more. In other words, NLP allows machines to understand human language(s). Speaking of languages, Cross-Language Information Retrieval (CLIR) has gained traction over the past two decades or so due to the unprecedented rise in globalization, transnational companies, international news outlets, social media, and internet use. CLIR requires a translation service because it deals with retrieving information written in languages different from the language of the user's query. In August 2020, I proposed an approach for extracting named entities from Arabic text using a combi

2022-02-23: One in Five arXiv Articles Reference GitHub

Starting in Fall 2021, I've had the opportunity to work on the CoSAI Project under the guidance of Dr. Martin Klein, Dr. Michael Nelson, and Dr. Michele Weigle. The CoSAI Project is working to preserve web-based scholarship, including source code. The goal of the project is to make the archival process more accessible to institutions by creating a curation workflow to facilitate the process. As part of the project, we wanted to find a set of code repository URIs that were referenced in scholarly publications. To do this, we decided to extract URIs from PDFs in the arXiv corpus, which now includes more than 2 million papers. We focused on a corpus of 1.56 million PDFs from April 2007 to November 2021. During an internship at LANL in Summer 2021, Yasith Jayawardana created code that Robustifies URIs found in PDFs. Part of the code extracts URIs found in PDFs, using PyPDFium2 and PyPDF2 to extract annotated URIs and URIs in the text, respectively. I leveraged this part
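To give a rough idea of the step that follows PDF text extraction, here is a minimal sketch of pulling URIs out of already-extracted text and keeping the ones that point to GitHub. It is not the code used in the project: the regular expression, the function names `extract_uris` and `is_github_uri`, and the sample text (including its repository URI) are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Rough pattern for http(s) URIs in plain text (an illustrative assumption,
# not the pattern used by the project's code).
URI_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_uris(text):
    """Return http(s) URIs found in text, trimming trailing punctuation."""
    return [match.rstrip(".,;:") for match in URI_PATTERN.findall(text)]


def is_github_uri(uri):
    """Return True if the URI's host is github.com (with or without www)."""
    host = (urlparse(uri).hostname or "").lower()
    return host in {"github.com", "www.github.com"}


# Hypothetical text, standing in for the text extracted from one PDF.
sample_text = (
    "Our code is available at https://github.com/example/repo. "
    "Data are hosted at (http://example.org/data)."
)

uris = extract_uris(sample_text)
github_uris = [uri for uri in uris if is_github_uri(uri)]
```

Matching on the parsed hostname rather than on a substring avoids counting URIs that merely mention "github.com" somewhere in their path or query string.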

2022-02-16: Tarannum Zaki (Computer Science PhD Student)

Hello everyone! My name is Tarannum Zaki, and I am an international student from Bangladesh. I started my PhD program in the Department of Computer Science at Old Dominion University in Spring 2022. Presently, I am a member of the Web Science and Digital Libraries (WS-DL) research group. My PhD advisor is Dr. Michele C. Weigle. I am currently working under the supervision of Dr. Michele C. Weigle and Dr. Michael L. Nelson on disinformation misattribution on different social media platforms and web archiving. I completed both my Bachelor’s and Master’s degrees in Computer Science and Engineering at the Military Institute of Science and Technology, Dhaka, Bangladesh, in December 2016 and February 2021, respectively. Before coming here, I held a faculty position as a university lecturer for more than four years, where I was involved in conducting computer science a

2022-02-16: About the Use of Amazon Rekognition and the Installation of Associated AWS CLI

Amazon Rekognition is a cloud service from Amazon for extracting text from images. It can find and recognize the text in an image, and it also outputs other information about the image, such as the location of both the image and the text. In this blog post, I'd like to share my hands-on experience installing and working with it. It can be used on both Windows and Linux.

Part 1: Prerequisites

Step 1: Sign up to AWS. Follow the instructions to sign up for an AWS account.

Step 2: Create an IAM user account. Sign in to the IAM console and set up the user and permissions. You can follow the instructions in the part titled 'Step 2: Create an IAM user account'. The IAM console is shown in the image below. You can add users, create groups, and set up access in this console.

Step 3: Create an access key ID and secret access key. The access key ID and secret access key are needed for AWS CLI (Command Line Interface) access. In the IAM console, choose "
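As a small illustration of the "text plus location" output mentioned above, the sketch below reads a dictionary shaped like a Rekognition DetectText response and keeps the detected lines together with their bounding boxes. The helper name `text_lines`, the confidence threshold, and the sample response values are assumptions made up for this example; a real response would come from the service after the setup steps are complete.

```python
def text_lines(response, min_confidence=90.0):
    """Collect (text, left, top, width, height) for each detected LINE.

    Bounding box values are ratios of the overall image width and height.
    """
    lines = []
    for detection in response.get("TextDetections", []):
        if detection.get("Type") != "LINE":
            continue  # skip individual WORD detections
        if detection.get("Confidence", 0.0) < min_confidence:
            continue  # skip low-confidence detections
        box = detection["Geometry"]["BoundingBox"]
        lines.append(
            (
                detection["DetectedText"],
                box["Left"],
                box["Top"],
                box["Width"],
                box["Height"],
            )
        )
    return lines


# Hypothetical response with made-up values, for illustration only.
sample_response = {
    "TextDetections": [
        {
            "DetectedText": "HELLO WORLD",
            "Type": "LINE",
            "Confidence": 99.1,
            "Geometry": {
                "BoundingBox": {"Left": 0.1, "Top": 0.2, "Width": 0.5, "Height": 0.1}
            },
        },
        {
            "DetectedText": "HELLO",
            "Type": "WORD",
            "Confidence": 99.3,
            "Geometry": {
                "BoundingBox": {"Left": 0.1, "Top": 0.2, "Width": 0.2, "Height": 0.1}
            },
        },
        {
            "DetectedText": "blurry",
            "Type": "LINE",
            "Confidence": 42.0,
            "Geometry": {
                "BoundingBox": {"Left": 0.3, "Top": 0.7, "Width": 0.2, "Height": 0.05}
            },
        },
    ]
}

detected = text_lines(sample_response)
```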

2022-02-16: Pyserini: an Information Retrieval Framework

Pyserini is an information retrieval toolkit initially released in 2019. Given a query, it returns a ranked list of documents relevant to that query. Pyserini supports sparse retrieval, dense retrieval (which involves deep learning), and hybrid retrieval that integrates both approaches. Among these, sparse retrieval (BM25 scoring using bag-of-words representations) can serve basic information retrieval purposes. I'd like to introduce sparse retrieval in Pyserini and talk about its installation and use in detail. Pyserini depends on Anserini, an information retrieval toolkit built on Lucene. Anserini is implemented in Java, and Pyserini is its Python wrapper. Both rely on the JVM, and PyJNIus is used to interact with it. Installation of Pyserini (Sparse Retrieval Mode): Pyserini can be installed in an Anaconda virtual environment. I did this on Windows, but you can also do this on Linux. The installation
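To make "BM25 scoring using bag-of-words representations" a little more concrete, here is a minimal, self-contained sketch of a textbook BM25 ranker over a toy corpus. It is not Pyserini's or Lucene's implementation: the whitespace tokenizer, the function name `bm25_rank`, the parameter values, and the three documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_rank(query, docs, k1=0.9, b=0.4):
    """Return (doc_index, score) pairs sorted by descending BM25 score."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for index, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)  # bag-of-words representation
        score = 0.0
        for term in tokenize(query):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append((index, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "information retrieval ranks documents for a query",
]
ranked = bm25_rank("information retrieval", docs)
top_index, top_score = ranked[0]
```

Only the third document contains the query terms, so it is the only one with a nonzero score; word order plays no role because each document is reduced to term counts.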