Projects
Open-Source Contributions
BenSim
- A Python package for measuring the semantic similarity among sentences in the Bengali language.
- Users provide a reference sentence and a list of target sentences as input, and may optionally specify the similarity assessment approach (default: cosine similarity) and the maximum sequence length (default: 512). Length is counted in tokens produced by the WordPiece tokenizer; the maximum sequence length is currently capped at 512.
- BenSim normalizes the input texts and extracts contextual embeddings of the reference and target sentences from a pre-trained BERT model. Similarity is then measured for each sentence pair using either Euclidean distance or cosine similarity, depending on the input parameter.
- Finally, it returns a list of similarity scores between the reference sentence and the target sentences. If the assessment method is cosine, higher scores denote higher similarity; the opposite holds for euclidean.
- Detailed usage can be found here.
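The scoring step can be sketched in a few lines. This is a minimal illustration that assumes sentence embeddings have already been extracted; the function names (`cosine`, `euclidean`, `score_targets`) are hypothetical, not BenSim's actual API:

```python
import math

def cosine(u, v):
    # Higher score => more similar.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    # Lower score => more similar (it is a distance, not a similarity).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def score_targets(ref_emb, target_embs, method="cosine"):
    # One score per target sentence, against the single reference embedding.
    metric = cosine if method == "cosine" else euclidean
    return [metric(ref_emb, t) for t in target_embs]
```

With identical vectors, `cosine` returns 1.0 and `euclidean` returns 0.0, which matches the inverted interpretation of the two methods noted above.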
mSentsTokenizer
- A Python package for tokenizing multilingual documents at the sentence level.
- Users provide a document to be segmented and the language of the document as input. Currently, 41 languages are supported, ranging from low- to high-resource and spanning 10 language families (i.e., Afro-Asiatic, Indo-European, Sino-Tibetan, Austronesian, Japanese, Altaic, Dravidian, Tai-Kadai, Austro-Asiatic, and Niger-Congo).
- mSentsTokenizer matches the input language against the languages supported by pre-existing packages and simply invokes the corresponding package to tokenize the input document. Finally, it returns a list of sentences as the output.
- Detailed usage can be found here.
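The dispatch pattern described above can be sketched as follows. This is a rough sketch, not mSentsTokenizer's actual code: the registry keys and the naive regex splitters below stand in for the real per-language packages the library delegates to:

```python
import re

def _naive_english_split(text):
    # Stand-in for a real sentence tokenizer: split after ., !, or ?.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def _naive_bengali_split(text):
    # Bengali uses the danda (।) as its sentence-final mark.
    return [s.strip() for s in re.split(r"(?<=[।!?])\s+", text.strip()) if s.strip()]

# Hypothetical registry: language name -> tokenizer callable.
TOKENIZERS = {
    "english": _naive_english_split,
    "bengali": _naive_bengali_split,
}

def tokenize(document, language):
    # Match the input language against the supported set, then delegate.
    if language not in TOKENIZERS:
        raise ValueError(f"Unsupported language: {language}")
    return TOKENIZERS[language](document)
```

For example, `tokenize("Hello there. How are you?", "english")` yields a two-sentence list, while an unsupported language raises an error instead of silently mis-splitting.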
Experimental Projects
Question Similarity Assessment using Transfer Learning: This experiment practiced the basic workflow of transfer learning from a general pre-trained language model (BERT) to the specific task of similarity assessment. The Quora dataset was used, which consists of question pairs labeled for whether the questions are paraphrases of each other (i.e., have the same meaning). To avoid model-level complexity and minimize expensive GPU cost, a basic transformer model (bert-base-uncased) from the HuggingFace library was used. The codebase (PyTorch) can be found here.
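One common transfer-learning recipe for this setup is to reuse the pretrained encoder and train only a small classification head on top. The sketch below illustrates that recipe with a tiny stand-in encoder in place of bert-base-uncased (loading the real checkpoint through HuggingFace follows the same pattern but is much heavier); whether the original experiment froze the encoder or fine-tuned end-to-end is not stated, so the freezing here is an assumption:

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained encoder; 30522 is BERT's WordPiece vocab size.
encoder = nn.Embedding(30522, 768)
for p in encoder.parameters():
    p.requires_grad = False  # transfer learning: keep pretrained weights fixed

class ParaphraseClassifier(nn.Module):
    """Mean-pool token representations of a question pair, then classify."""
    def __init__(self, encoder, hidden=768, n_classes=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden, n_classes)  # only this layer is trained

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        pooled = self.encoder(token_ids).mean(dim=1)
        return self.head(pooled)                  # logits: paraphrase / not

model = ParaphraseClassifier(encoder)
logits = model(torch.randint(0, 30522, (4, 32)))  # 4 encoded pairs, 32 tokens
```

Freezing the encoder keeps the trainable parameter count tiny, which is exactly the GPU-cost trade-off mentioned above.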
News Classification using Vanilla Transformer: The main goal was to explore the transformer from “Attention Is All You Need” from scratch and implement it for text classification. The AG News classification dataset was used, which is constructed by choosing the 4 largest classes from the original corpus. The transformer implementation was based on the annotated version from Harvard NLP, with some modifications for the text classification problem: only the encoder part was used, the residual connections and layer normalization were skipped, and masking was not considered either. For feature extraction, multi-head attention and a position-wise feed-forward network were incorporated, and finally a linear layer was added to produce the logits. The codebase (PyTorch) can be found here.
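The stripped-down encoder described above can be sketched roughly as follows; the dimensions are illustrative assumptions, and positional encodings are omitted for brevity, as is any training loop:

```python
import torch
import torch.nn as nn

class MiniEncoderClassifier(nn.Module):
    """Encoder-only transformer block: no residuals, layer norm, or masking."""
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, d_ff=256, n_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(                      # position-wise feed-forward
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.classify = nn.Linear(d_model, n_classes)  # 4 AG News classes

    def forward(self, token_ids):
        h = self.embed(token_ids)
        h, _ = self.attn(h, h, h)        # self-attention, no mask
        h = self.ff(h)                   # no residual connection or layer norm
        return self.classify(h.mean(dim=1))  # pool over sequence, then project

model = MiniEncoderClassifier()
logits = model(torch.randint(0, 1000, (2, 10)))  # 2 documents, 10 tokens each
```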
B2E Neural Machine Translation with Seq2Seq Model using Attention: In this experiment, machine translation was performed using a deep learning based approach with an attention mechanism. Specifically, a sequence-to-sequence model was trained for Bengali-to-English translation on a bilingual corpus of tab-delimited Bengali–English sentence pairs. A GRU was used in both the encoder and the decoder, together with an attention layer. After learning the patterns of the input language (i.e., Bengali), the encoder output and hidden state were passed to the decoder along with the start token to generate the output sequences (i.e., English). Teacher forcing was used to decide the next input to the decoder. The codebase (PyTorch) can be found here.
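The encoder-to-decoder handoff and the teacher-forcing loop can be sketched roughly as follows. The vocabulary sizes and hidden dimension are toy values, and the dot-product attention variant is an assumption, the original experiment's exact attention form is not specified here:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len)
        return self.gru(self.embed(src))     # outputs (B,S,H), hidden (1,B,H)

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.gru = nn.GRU(hidden * 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tok, hidden, enc_out):                  # tok: (batch, 1)
        emb = self.embed(tok)                                 # (B,1,H)
        scores = torch.bmm(enc_out, hidden[-1].unsqueeze(2))  # dot-product attn
        weights = torch.softmax(scores, dim=1)                # (B,S,1)
        context = torch.bmm(weights.transpose(1, 2), enc_out) # (B,1,H)
        output, hidden = self.gru(torch.cat([emb, context], dim=2), hidden)
        return self.out(output.squeeze(1)), hidden            # logits: (B,V)

SOS, BN_VOCAB, EN_VOCAB, B = 1, 500, 400, 4
encoder, decoder = Encoder(BN_VOCAB), AttnDecoder(EN_VOCAB)
src = torch.randint(0, BN_VOCAB, (B, 7))   # Bengali token ids (stand-in data)
tgt = torch.randint(0, EN_VOCAB, (B, 5))   # English token ids (stand-in data)

enc_out, hidden = encoder(src)             # handoff: outputs + hidden state
tok = torch.full((B, 1), SOS)              # start token begins generation
loss, criterion = 0.0, nn.CrossEntropyLoss()
for t in range(tgt.size(1)):
    logits, hidden = decoder(tok, hidden, enc_out)
    loss = loss + criterion(logits, tgt[:, t])
    tok = tgt[:, t:t + 1]                  # teacher forcing: feed ground truth
```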
Character-Level Name Generation using LSTM: The focus of this experiment was to train a language model to generate text character by character. An English name corpus containing a large number of baby names was chosen from Data World. Given the longer inputs, an LSTM model was used to generate the names. Here, each input and output token is a single character, and the language model outputs a conditional probability distribution over the character set. From this distribution, the next token was picked using a top-k sampler. The codebase (PyTorch) can be found here.
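The top-k sampling step can be sketched as follows; a minimal sketch that takes the model's character distribution as a plain dict (the function name and interface are illustrative, not the project's actual code):

```python
import random

def top_k_sample(char_probs, k=3, rng=random):
    """Sample the next character from the k most probable entries only."""
    # Keep the k highest-probability characters and renormalize over them.
    top = sorted(char_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    r = rng.random() * total
    for ch, p in top:
        r -= p
        if r <= 0:
            return ch
    return top[-1][0]  # guard against floating-point rounding
```

With `k=1` this reduces to greedy decoding (always the most probable character); larger `k` trades determinism for variety in the generated names.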
Undergraduate Course Projects
Bengali Handwritten Digit Recognition using Deep Neural Network: Implemented a deep neural network model for Bengali handwritten digit classification in the Soft Computing Lab. Experimented with different hyper-parameter settings. (DNN, PyTorch) [Fall 2020]
Flight Fare Prediction using Machine Learning: A project for predicting flight fares of Indian Airlines using classical machine learning algorithms in the Pattern Recognition Lab. Performance was compared across different regression algorithms, i.e., Linear, Lasso, Ridge, Elastic Net, Decision Tree, Random Forest, and Gradient Boosting. In addition, the dataset was analyzed with bar, count, box, and scatter plots. (Matplotlib, Seaborn, Scikit-Learn, XGBoost, Python) [Fall 2020]
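The regressor comparison can be sketched as a single loop over scikit-learn estimators. The synthetic features below stand in for the Indian Airlines fare data, and the hyperparameters are illustrative defaults, not the project's tuned settings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for the fare dataset: 5 features, near-linear target.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = X @ np.array([3.0, 1.0, 0.5, 2.0, 4.0]) + 0.05 * rng.standard_normal(300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Linear": LinearRegression(),
    "Lasso": Lasso(alpha=0.01),
    "Ridge": Ridge(),
    "ElasticNet": ElasticNet(alpha=0.01),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}
# Fit each model and score it on held-out data with the same metric.
scores = {name: r2_score(y_test, m.fit(X_train, y_train).predict(X_test))
          for name, m in models.items()}
```

Keeping the train/test split and metric fixed across all seven estimators is what makes the resulting R² scores directly comparable.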
House Price Prediction using PL/SQL: This project was done in the Distributed Database Systems Lab, where Linear Regression was performed in PL/SQL. Training data were stored in a distributed database, with two virtual PCs set up to form the distributed system. One user could perform CRUD operations on the database containing the training data, while another user could get predictions on unseen data through the command-line interface by applying the Linear Regression rule to that database. (PL/SQL, Linear Regression, VirtualBox) [Spring 2020]
Library Management System: A simple web application for a digital library management system was developed in the Software Development Lab. Users were able to search for books by author, genre, etc., and borrow books to read them online. (C#, MS .NET, MySQL) [Fall 2019]
Medicine Sheba: A cross-platform application for a medicine sales & delivery system was built in the Information System Design and Software Engineering Lab. There was a dual-user mode: general users could search for necessary medicines in a central database aggregating information from all the registered pharmacies' individual databases, place an order with any specific pharmacy, and track their order. Pharmacies, on the other hand, could register themselves, maintain their own database, and sell their products. (JavaScript, NoSQL, React Native, MongoDB, Express, Node.js) [Fall 2019]
Criminal Management System (CMS): A desktop application for criminal data management was built in the Database Lab. Law enforcement agencies could manage a relational database of criminals and store all sorts of information, including custody info, general diary info, criminal images, and so on. The focus was on performing advanced operations on tables to support dynamic CRUD operations. (RDBMS, Java, MySQL, MS SQL Server) [Spring 2019]
Talks of Code: A website was built with PHP in the Software Development Lab. The focus was to make a tutorial site for learning programming languages through reading documents. (PHP, HTML, CSS) [Spring 2019]
Police Assistant: An Android mobile application was built in the Software Development Lab. There was a dual-user mode: Admin and General. Law enforcement agencies could act as admins and post information about suspects or wanted persons, including their pictures. Regular citizens, on the other hand, could act as general users and respond to any post via in-app call or by leaving a private comment. There was also a feature for rewarding informants once their information was verified. (Java, Android Studio, MySQL) [Fall 2018]
Check My Trip: A desktop application was developed in the Software Development Lab. Users were able to find nearby visiting places, mosques, hospitals, etc. by locating their current latitude and longitude. Additional features included finding the distance between two places, letting local people mark a location as an attractive spot that might not appear on Google Maps as a visiting place, and getting a list of transportation costs from the current location to the destination (especially for a long drive). All features were implemented by leveraging the Google Maps API. (Java, JavaFX, Google Maps API) [Spring 2018]
The Driver: A racing game built in the Software Development Lab. Players could play as either a cop or a racer: the cop had a special bullet that could destroy the racing car, while racer mode offered an extra NOS boost. Gameplay is available on YouTube. (C++, iGraphics, Visual Studio) [Fall 2017]