Shalin Shah Home Page

Shalin Shah

Founding Scientist at Anvai AI

ORCID | Git Repositories | LinkedIn | Medium

Patents:

Link prediction using hebbian graph embeddings (Issued, 2022)
A patent that uses the hebbian graph embeddings algorithm for link prediction. Hebbian graph embeddings learns the embeddings of the nodes in a graph and uses these embeddings for link prediction and recommendations.

Similarity learning-based device attribution (Issued, 2021)
Using the probabilities outputted from a Gaussian Mixture Model as features, this patented algorithm uses a random forest classifier for cross device cookie matching.

Multi-task learning of query intent and named entities (Issued, 2024)
A patent that learns query intent and named entities using a multi-task deep neural network.

Multi-task triplet loss for Named Entity Recognition using Supplementary Text (Pending, 2021)
A patent that uses the triplet loss along with multi-task learning for better NER.

Probabilistic matching of account information in management and determination of user identity using identity graphs (Pending, 2024)
Probabilistic edge weights in an identity graph to infer connections and communities.

Publications:

Multi-Document Financial Question Answering using LLMs
This work explores financial question answering using knowledge graphs, semantic tagging and semantic RAG. The results show that financial question answering can be improved using semantic tagging and knowledge graphs. The financial questions used for evaluation are esoteric and difficult to answer, and the answers are not completely obvious.

Topic Driven Text Summarization with Defragmentation using LLMs
The paper describes a prompt based method to do text summarization by first planning to get a set of topics and then generating a final summary from the topic summaries. The work can also be used to do conditional summarization based on a given set of topics.

JCOL: A Java package for solving the graph coloring problem (Journal of Open Source Software, 2019)
JCOL is a Java graph coloring package that does a cascade of three randomized graph coloring algorithms including DSATUR, randomized Iterated Greedy and Min-Conflicts local search. Results show that it is able to outperform ColPack on some DIMACS instances.

GKNAP: A Java and C++ package for solving the multidimensional knapsack problem (Journal of Open Source Software, 2019)
GKNAP is a package to solve the multiconstraint (multidimensional) knapsack problem. It is implemented in both Java and C++. The algorithm uses Lagrangian multipliers as constraint weights in a greedy algorithm. GKNAP is able to solve almost all benchmarks to optimality.

Model Adaptation via Model Interpolation and Boosting for Web Search Ranking (EMNLP, 2009)
By interpolating between several models (from different domains, but the same target), the algorithm creates a weighted ensemble of ranking algorithms which outperform each individual algorithm for search engine ranking.

Three Proofs that the Square Root of 2 Is Irrational
This short article surveys three proofs that the square root of 2 is irrational. The first proof is a simple proof by contradiction and the other two proofs use field theory from abstract alggebra.

Genetic Algorithm for the 0/1 Multidimensional Knapsack Problem
This algorithm is a genetic algorithm to solve the multidimensional knapsack problem. A different version of the algorithm was published on the Journal of Open Source Software.

Multi-Task Triplet Loss for Named Entity Recognition using Supplementary Text
This work uses multi-task learning for named entity recognition, through a triplet loss that uses supplementary text.

Comparison of Stochastic Forecasting Models
This paper explores five stochastic forecasting methods like stochastic differential equations, arima, kalman filter, and the bayesian filter. It tries to forecast stock prices of Microsoft, Tesla and Target, 54 days into the future, and finds interesting results.

Multi-Task Learning of Query Intent and Named Entities using Transfer Learning
This paper explores multi-task learning of named entity recognition and query intent using BERT transfer learning. We find interesting results in which the NER loss and the intent loss both affect each other in interesting ways.

A Survey of Latent Factor Models for Recommender Systems and Personalization
This paper surveys four latent factor models for recommender systems. The methods include hebbian graph embeddings, bayesian personalized ranking, singular value decomposition and non-negative matrix factorization.

Comparison of Three Recent Personalization Algorithms
This article compares three recent personalization algorithms including Taxonomy Discovery for Personalized Recommendations, Multi-Matrix Factorization and Bayesian Personalized Ranking. It computes the hit rate @10 for all three algorithms on a sample of 1 million users and 200 thousand items.

Implementation of iterative local search (ILS) for the quadratic assignment problem
A population based iterative local search for the quadratic assignment problem. The algorithmis quite fast and is able to find optimum solutions to benchmark instances.

Simulated Annealing Algorithm for the Multiple Choice Multidimensional Knapsack Problem
A randomized algorithm based on simulated annealing for the multiple choice multi-constraint knapsack problem. The implementation is open source and uses new ways of making an infeasible solution valid using a randomized greedy approach.

Hebbian Graph Embeddings
Using an error-free (derivative-free) associative learning update rule (Gaussian perturbed, as in annealing), this algorithm learns the embeddings of the nodes (vertices) of a graph. Using nearest neighbors, the algorithm is applied to link prediction, reconstruction and to a recommender system at Target Corporation. The algorithm outperforms almost all current algorithms (except SEAL) in the task of link prediction and reconstruction.

Genetic Algorithm for a class of Knapsack Problems
A genetic algorithm is applied to the task of solving hard knapsack problems introducted by D. Pisinger called Spanner knapsack instances. The algorithm is able to solve all Spanner instances to optimality, and so, it can be applied to larger and harder instances for which the optimum solution is unknown.

GCLIQUE: An Open Source Genetic Algorithm for the Maximum Clique Problem
A genetic algorithm for the maximum clique problem which finds good solutions very fast to most DIMACS instances.

Analysis of Greenhouse Gases
Using counterfactuals (through extrapolation), this work tries to predict the temperature increase in the atmosphere when the CO2 and CH4 levels increase.

Randomized heuristic for the maximum clique problem
A fast random search algorithm for the maximum clique problem. The algorithm finds good large cliques in DIMACS benchmarks very fast, and these cliques can be used in further applications, like graph coloring or recommender systems.

Other Technical Papers:

Markowitz Portfolio Optimization
This is a course project on Markowitz Portfolio Optimization as part of a masters degree course. Portfolio Optimization aims to generate proportions of the stocks to invest into.

A One Pager on Bayesian Statistics in Baseball
This short one page tutorial explains how Bayesian statistics and hierarchical models can be used for analyzing the batting hits of baseball players.

Introduction to Matrix Factorization for Recommender Systems
This short expository tutorial on matrix factorization introduces the reader to PCA, SVD and general principles of matrix factorization for recommender systems including describing briefly on how to parallelize the algorithm to run on potentially millions of dimensions.

Notes on Icebreaker
This is an expository tutorial on Icebreaker, a new publication from Microsoft Research, which is able to achieve state of the art performance on tasks in which there is inherent missing data. Icebreaker is able to suggest which attributes to impute to gain the maximum benefit, using mutual information. The tutorial talks about the Bayes Rule, Autoencoders, the evidence lower bound, Variational Methods and Autoencoders, Deep Sets (set based machine learning) and finally leading to Icebreaker.

Analysis of E-Commerce Product Graphs
This work analyses the e-commerce product graph at Target Corporation using the Erdos-Renyi random graph model. Results show that the graph exhibits assortive mixing, the small world property, and a scale free degree distribution (power law).

Open Source Code:

Stock Market Forecasting Exercise (R)
This R code has implementations of five forecasting models (using R libraries) which try to predict stock prices 54 days into the future.

Bayesian Personalized Ranking (BPR) (Python)
Bayesian Personalized Ranking (BPR) is a recommender systems algorithm that can be used for personalization in movie rental services, retail stores, online book stores and so on. This is a python implementation which achieves 51% test set hit rate @ 10 on the MovieLens data set.

Randomized Heuristic for the Maximum Clique Problem (Java)
A simple randomized heuristic for the maximum clique problem in Java. The algorithm is very fast and requires little memory.

Genetic Algorithm for the Maximum Clique Problem (C++)
A genetic algorithm for the maximum clique problem that uses innovative crossovers. The algorithm is able to solve publicly available instances quite efficiently. Results are on the git page.

Genetic Algorithm - 0/1 Multi-Constraint Knapsack Problem (C++ and Java)
Implementation of a genetic algorithm for the 0/1 multiconstraint (multidimensional) knapsack problem in C++ as well as Java.

Heuristic for Graph Coloring (Java)
A very powerful Java implementation of a heuristic that requires very little computational resources. The code is able to solve several of the DIMACS instanes in a short amount of time.

Sudoku Solvers and Puzzle Generator (C++ and Java)
A backtracking Sudoku solver which is non-recursive. It is inspired by Knuth's dancing links algorithm. The backtracking solver is quite fast and solves most Sudokus in less than a second. A Java based puzzle generator generates Sudoku puzzles which have a unique solution.

Simulated Annealing Algorithm - Multiple Choice Multidimensional knapsack problem (C++)
A simulated annealing implementation for the multiple choice multidimensional knapsack problem with C++ code.

Iterative Local Search - Quadratic Assignment Problem (C++)
The QAP (quadratic assignment problem) is a strongly NP-hard problem which is difficult to solve even for very small instances. This algorithm is able to solve the nugXX (e.g. nug22) instances quite fast. More experiments are needed to see if the algorithm scales to very large instances.

Branch and Bound Algorithm for the 0/1 Knapsack Problem (Java)
Branch and bound is an exact algorithm that can solve the knapsack problem with provable optimality. This Java code implements branch and bound for the 0/1 knapsasck problem using Lagrangian relaxation to prune the search tree. It uses best first search and is able to solve moderately large instances quite fast (like 2000 variables).

Monty Hall Simulation (Java)
A simulation of the Monty hall paradox in probability theory.

Education:

M.S. Applied and Computational Mathematics, Johns Hopkins University
M.S. Computer Science, University of Southern California
Computational Linguistics, University of Washington

Graduate Courses:

Statistical Methods
Real Analysis
Computational Statistics
Stochastic Differential Equations
Matrix Theory
Monte Carlo Methods
Bayesian Statistics
Probability and Stochastic Processes I and II
Data Mining
Natural Language Processing I and II
Quantitative Social Science Methods
Software Engineering
Analysis of Algorithms
Compilers
Programming Languages

Some Links:

shah AT shalin.info