Saransh Surana

Data Science · ML · AI

About me

I'm a Data Scientist & AI Engineer with expertise in machine learning, deep learning, and large-scale data systems. My work focuses on building scalable, end-to-end ML solutions that solve real-world problems, from data preprocessing to deployment.

I enjoy working at the intersection of AI research and practical applications, turning complex data into insights and intelligent systems. I aim to contribute to cutting-edge AI innovation—LLMs, generative AI, optimization-driven ML—while driving measurable business impact.


Interests

AI · ML · Deep Learning · Data Engineering · Statistics

Education

M.S. Data Science — Stony Brook · B.E. ECE — Andhra University

Experience

Scraped and structured 50,000+ housing and social service records across multiple counties into machine-readable datasets.
Automated web data extraction using Playwright with asynchronous concurrent scraping, reducing collection time by ~70%.
Designed a deduplication framework to merge duplicate organizations while preserving unique attributes, cutting redundancy by ~35%.
Implemented a prompt-based relevance filter that raised precision on housing-related record classification to above 60%.
Delivered a final cleaned dataset for NGO partners, enabling more accurate housing service mapping and supporting advocacy for individuals with serious illness.

Python · Playwright · Gemini Pro · Pandas
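The deduplication step described above can be illustrated with a minimal sketch. It is not the framework used in the project: the matching rule (a normalized organization name), the field names, and the sample records are all assumptions made up for illustration. It only shows the general idea of merging duplicates while keeping every distinct attribute value.

```python
def normalize(name):
    """Hypothetical matching key: lowercase, drop punctuation, collapse spaces."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def merge_duplicates(records):
    """Merge records sharing a normalized name, keeping each unique attribute value."""
    merged = {}
    for rec in records:
        key = normalize(rec["name"])
        # The first spelling seen becomes the display name of the merged record.
        target = merged.setdefault(key, {"name": rec["name"]})
        for field, value in rec.items():
            if field == "name" or value in (None, ""):
                continue
            values = target.setdefault(field, [])
            if value not in values:
                values.append(value)
    return list(merged.values())


# Made-up sample records; names, phone numbers, and counties are illustrative only.
records = [
    {"name": "Hope House, Inc.", "phone": "555-0100"},
    {"name": "hope house inc", "phone": "555-0100", "county": "County A"},
    {"name": "Shelter One", "county": "County B"},
]
organizations = merge_duplicates(records)
```

Here the first two records collapse into one organization that keeps both the phone number and the county, while the third stays separate.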

Built Python pipelines to clean and preprocess unstructured data from web pages, PDFs, and other raw formats, version-controlled with Git for reproducibility and collaboration.

PySpark · NumPy · Pandas

Built scalable ETL pipelines in BigQuery and SQL on GCP to support end-to-end ML workflows for anomaly detection.
Improved early fault detection by flagging spikes and irregular patterns in manufacturing time-series data with Z-score thresholds, which made the flagged points easier to visualize and analyze.
Trained unsupervised models (Isolation Forest, One-Class SVM) to detect anomalies in manufacturing sensor data, achieving 78% recall and 73% precision and supporting early fault detection.
Explained model results to technical and non-technical teams and worked with data science experts to deepen domain knowledge, supporting fault resolution and team alignment.

Python · scikit-learn · Pandas · matplotlib · GCP · Anomaly Detection
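As a rough illustration of the Z-score thresholding mentioned above, the sketch below flags readings that sit far from the series mean. The function name, the threshold of 2.5, and the sample readings are assumptions chosen for the example, not values from the original work.

```python
from statistics import mean, stdev


def zscore_spikes(values, threshold=2.5):
    """Return indices of points whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # a constant series has no spikes
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Made-up sensor readings with one obvious spike at index 6.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 35.0, 20.1, 19.7, 20.0, 20.2, 19.9]
spikes = zscore_spikes(readings)
```

A global mean and standard deviation are the simplest choice; on drifting sensor data a rolling window would likely be a better fit, at the cost of one more parameter to tune.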

Developed real-time demand forecasting and inventory optimization by deploying XGBoost models on GCP using FastAPI and Docker, reducing forecast error by 18% across 30+ SKUs.
Designed end-to-end ML data pipelines on unstructured data and SQL-based ETL workflows using Spark, Hive, and Kafka, accelerating deployment time by 40% and supporting analysis of 1K+ events daily.
Drove 30% marketing ROI uplift by applying clustering on 10K+ customer profiles, enabling business teams to target high-value segments effectively.
Conducted A/B testing on promotional strategies and new product placements across multiple regions, identifying winning variants that increased sales conversion by 7%.

XGBoost · FastAPI · Docker · BigQuery · GCP · A/B Testing · Kafka · Spark · Hive · scikit-learn · Clustering · Marketing

Projects

Skills

Python

SQL

Java

Bash

C/C++

NoSQL

OCaml

Saransh Surana

About me

Interests

Education

Experience

AI Software Research Volunteer

Research Assistant

Data Science Intern

Data Science Intern

Projects

Data Science & ML

Deep Learning

LLMs & Generative AI

NLP

Skills

Open Source & Writing

Open Source Contributions

Haystack Core Integrations (deepset-ai)

Statsmodels

Skrub

Outlines (dottxt-ai)

Writing & Publications

Prompt engineering was a phase, prompt design is the craft!

How Text Chunking Works: The Foundation of Every RAG System

How Do You Measure an LLM’s Intelligence? A Complete Guide to Evaluation Strategies
