Silicon Valley MUG: ETL for LLMs

ETL for LLMs - Everything you need to preprocess your unstructured and structured data to make it GenAI- and LLM-ready. The use case demonstrated will be a private GenAI Q&A system using Dolly v2, Spark, MongoDB, and Dataworkz.

ChatGPT had to be trained on enormous amounts of data to make it excel at human-like, iterative content creation. But it’s only as good as the data it was trained on: ChatGPT can hallucinate (produce confident-sounding but erroneous output), especially when asked domain-specific questions.

:question: What if you could train ChatGPT in a matter of minutes to answer questions based on your data like PDF manuals, product reviews in semi-structured JSON format, internal wikis, customer conversations in a CRM and the list goes on?

:bulb: To make ChatGPT work with your data, you need to build sophisticated data pipelines. It takes a new approach to data management to create chunks for better information retrieval. In this session you will build a data pipeline to process biomedical literature available on

What to expect?

In this workshop you will learn how Dataworkz streamlines creating a high-quality curated dataset from unstructured PDF files available on PubMed, and how to use MongoDB Vector Search to store LLM-ready embeddings generated with the embedding model of your choice: OpenAI’s text-embedding-ada-002, all-mpnet-base-v2 from the MTEB benchmark, or one of your own from Hugging Face.
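For readers who want a feel for the idea before the session, here is a minimal sketch of the chunk, embed, and retrieve steps that a pipeline like this performs. It is not the Dataworkz or MongoDB implementation: `chunk_text`, `toy_embed`, `cosine`, and `retrieve` are hypothetical names, and `toy_embed` is a hashed bag-of-words stand-in for a real embedding model such as the ones named above. In the workshop, the vectors would be stored in and queried from MongoDB Atlas Vector Search rather than a Python list.

```python
import math
import zlib


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def toy_embed(text, dims=64):
    """Stand-in embedding (assumption): hash each token into a fixed-size count vector.

    A real pipeline would call an embedding model here instead.
    """
    vec = [0.0] * dims
    for token in text.lower().split():
        token = token.strip(".,;:!?()\"'")
        if token:
            vec[zlib.crc32(token.encode("utf-8")) % dims] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks whose toy embeddings are most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:k]
```

Overlapping chunks are a common choice because a sentence that straddles a chunk boundary still appears whole in at least one chunk. Chunk size and overlap are tuning parameters, and the defaults here are arbitrary.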

Come join us to experience the fastest path to building a Retrieval Augmented Generation (RAG) application: Dataworkz with MongoDB Atlas Vector Search.


Time Topic
4:00 pm Registration & Networking
4:15 pm “ETL for LLMs” Talk with Demo with Nikhil Smotra of Dataworkz
5:00 pm Food & Beverage
5:30 pm Hands-on workshop
6:30 pm Closing Remarks


Event Type: In-Person
Location: MongoDB, Inc. 499 Hamilton Ave · Palo Alto, CA


:arrow_right: To RSVP - Please click on the “ ✓ RSVP ” link at the top of this event page if you plan to attend. The link should change to a green button if you are Going. You need to be signed in to access the button. Do you use You can also RSVP there



Nikhil Smotra

CTO and Co-founder, Dataworkz

Nikhil is driven by the potential for innovation and is excited about leveraging advanced technologies such as artificial intelligence, especially LLMs, to extract valuable insights from customer data. Nikhil’s robust experience working with data management at scale led him to co-found Dataworkz. His vision is to create a self-service experience that brings together data, transformation, and AI applications for users of different skill levels.

Before Dataworkz, Nikhil worked as SVP, Head of Data Engineering at iQor, a leader in BPO and Product Support, where he led the development and management of Big Data platforms. Nikhil helped launch the enterprise data initiative and built a high-performing global data engineering team. During his tenure at iQor, Nikhil also managed QeyMetrics, a Business Intelligence and Operational Analytics SaaS offering. Nikhil spent several years at Lockheed Martin (R&D), where he harnessed the potential of NoSQL technology before it gained popularity and used it along with semantic web technologies to build a massively scalable Digital Archive with automated data preservation, curation, and classification.
Nikhil is an executive alumnus of Haas School of Business, UC Berkeley (Data Science and Analytics Program) and holds a B.E. in Computer Science from the University of Pune, India. Nikhil also served on the Advisory Board of Rutgers University’s Big Data certificate program for executives from 2018 to 2022.