| { | |
| "Pre training objectives": "Hello everyone, welcome to this course! So, in this lecture, we are going to talk about pre-training strategies. Last lecture, we covered transformers and the different blocks of a transformer model, right? We also talked about word2vec and GloVe models, which are pre-trained word embedding methods, right. We will see the overall area of pre-training with transformers. And then we will also see how the paradigm of pre-training actually changed, from pre-trained word embeddings to pre-trained language models.\n\nLet us look at the pre-training strategy. So, while discussing this, let us start with this famous quote, which we already mentioned earlier in the distributional semantics chapter, in the word embedding chapter. It says that \"you shall know a word by the company it keeps,\" right? So, the same person modified the quote later on, and then he said that the complete meaning of a word is always contextual, and no study of meaning apart from a complete context\n\ncan be taken seriously, okay? And this is essentially the foundation behind building the pre-training strategies, okay. Here is an example: \"I record the record,\" right. If you look at these two positions of the word \"record,\" the meanings are completely different. Now, if you use a word2vec kind of method or a GloVe-type method, which basically produces pre-trained embeddings, right, you will get the same embedding for these two positions of the word \"record,\" right? But the meanings are different.\n\nOkay. Let's take another example. A bat flew out of the cave. He hit the ball with the bat. Again, here the two positions of the word \"bat,\" the two contexts of the word \"bat,\" are different, right? If you use the same embedding to represent both, you will not be able to do the proper processing, right? So what will we do here? We will produce embeddings that are going to be contextualized. 
Meaning, depending on the context, the embedding will change. For example, here, let us say that for this occurrence of the word \"bat,\" you will get one embedding.\n\nFor the word \"bat\" here, you get another embedding that is different, okay? How do you do this? The first approach that was proposed is not a transformer-based approach. What was proposed in 2018 is called ELMo: \"Deep Contextualized Word Representations.\" This was done mostly by the Allen AI team and some folks from this university as well. One of them is a very good friend of mine. So, the idea behind the ELMo method is as follows. ELMo is a non-transformer-based approach. There is no concept of a transformer.\n\nTransformers were introduced in 2017; ELMo was introduced around the same time, in 2018. So ELMo stands for Embeddings from Language Models. What does it do? It essentially relies on RNNs, which can be LSTMs or GRUs. And what it does is that it essentially processes a sequence, right? When you process a sequence using an RNN, each of the hidden states, which correspond to basically the to", | |
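The static-vs-contextual distinction above can be sketched in a few lines. This is not ELMo itself (ELMo uses deep bidirectional LSTMs trained as language models); a toy random-weight RNN is enough to show that a lookup table gives "bat" one fixed vector while a recurrent hidden state depends on the surrounding words:

```python
import numpy as np

# Toy sketch: static embedding table vs a context-dependent RNN state.
# Weights are random and illustrative only, not a trained model.
rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate(
    "a bat flew out of the cave he hit ball with".split())}
dim = 8
E = rng.normal(size=(len(vocab), dim))   # static table (word2vec/GloVe-style)
W_h = 0.1 * rng.normal(size=(dim, dim))  # recurrence weights
W_x = 0.1 * rng.normal(size=(dim, dim))  # input weights

def static_embed(word):
    # a static table returns the same vector in every context
    return E[vocab[word]]

def contextual_embed(sentence, word):
    # the hidden state at the word's position depends on the whole prefix
    h = np.zeros(dim)
    for w in sentence:
        h = np.tanh(W_h @ h + W_x @ E[vocab[w]])
        if w == word:
            return h

s1 = "a bat flew out of the cave".split()
s2 = "he hit the ball with a bat".split()

static_same = bool(np.allclose(static_embed("bat"), static_embed("bat")))
contextual_same = bool(np.allclose(contextual_embed(s1, "bat"),
                                   contextual_embed(s2, "bat")))
```

The static lookup is identical in both sentences; the contextual states differ, which is exactly the property the lecture motivates.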
| "Pre trained models": "Hello, everyone. Welcome back. So we were discussing different pre-training strategies, and in the last lecture, we discussed pre-training and encoder-only models; specifically, we discussed the BERT model, right? So, in today's class, we will focus on two other pre-training strategies. One is encoder-decoder models, right? Models like T5 and BART. So B-A-R-T is another model, right? This is not B-E-R-T; this is B-A-R-T, BART. And at the same time, we will also discuss decoder-only models, right?\n\nwhich include ChatGPT, all the GPT-series models, and LLaMA models. Decoder-only models are essentially very popular. So, the encoder-only model, as I said, was discussed in the last class. We saw that a new pre-training strategy called masked language modeling was introduced. This was used in BERT, where you essentially mask some of the tokens in the input. And you ask the model to predict those masked tokens, right? This is a self-supervised approach. You don't need any labeled data for training, right?\n\nAnd being an encoder model essentially means the BERT model can look at all the tokens present in the input. It is not an autoregressive model, right? So the entire input is exposed to the model. And you basically perform self-attention. Through self-attention, you look at all the tokens present in the input. And based on that, you predict the masked tokens. On the other hand, in the encoder-decoder model, as we will discuss in today's class, we have this encoder component and the decoder component.\n\nAnd we will see how you can best use both the encoder part and the decoder part during pre-training. And then we will also discuss the decoder-only model. Now, in the decoder-only model, as the name suggests, we have the decoder part. If you remember, in the transformer class we discussed that it's an autoregressive model. 
autoregressive component in the sense that when you predict a word at the t-th location, you only have access to tokens up to the (t-1)-th location, right? You don't have access to tokens after that location, right?\n\nWe will see how this autoregressive style of pre-training essentially helps the decoder-only model learn things correctly. This set of models is very popular these days, as people realized over time that an encoder-only model is not the right solution for a generative task, right? Because a generative task requires an autoregressive setup, right? Whereas an encoder-decoder model, since you require both the encoder part and the decoder part, requires a lot of memory.\n\nIt requires a lot of parameters, right? You can do all these generative tasks through the encoder-decoder model together, right? But the decoder-only model makes more sense both in terms of parameters as well as the setup, right? The autoregressive model setup itself is suitable for, you know, next-token generation. And, parameter-wise, you need half of the parameters than the encoder d", | |
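The difference between the encoder's full self-attention (BERT sees every token) and the decoder's autoregressive attention (token t sees only positions up to t) comes down to the attention mask. A minimal sketch, with random scores standing in for the query-key products:

```python
import numpy as np

T = 5  # sequence length

# Encoder-style (BERT): every token may attend to every token.
bidirectional_mask = np.ones((T, T), dtype=bool)

# Decoder-style (GPT): token t attends only to positions <= t.
causal_mask = np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores, mask):
    # disallowed positions get a large negative score, hence ~0 weight
    scores = np.where(mask, scores, -1e9)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))   # stand-in for Q @ K.T / sqrt(d)
weights = masked_softmax(scores, causal_mask)
```

With the causal mask, every row of `weights` puts zero mass on future positions, which is what makes next-token training and generation consistent.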
| "Tutorial: Introduction to huggingface": "Hi, everyone. In this tutorial, we'll be discussing a kind of introduction to the Hugging Face library. So Hugging Face is a very useful library when we work with transformer-based models. All these models, open-source models, and datasets are available on something called the Hugging Face Hub. So we will see how we can use this Hugging Face library to load the models, how to fine-tune them using some toy examples, how to load the datasets, and how to use them for inference.\n\nSo let's get started. So here, the package in Hugging Face that deals with transformer-based models is called Transformers. And there's a package called Datasets that deals with all the open-source datasets. So we need to first install them. Let's first look at the whole pipeline: the whole flow of how we process an input. So, first, we have this raw text: \"This course is amazing.\" And then we have a tokenization algorithm that breaks the text into tokens and maps them to some numbers, right?\n\nSo these numbers represent tokens, right? So for \"This course is amazing,\" suppose we apply some tokenization algorithm, say byte-pair encoding or SentencePiece tokenization, something like that. It maps the text to a sequence of tokens, and tokens are represented as numbers, right? So these token numbers are basically a kind of dictionary mapping. So 101, say: this tells us this is a token, and its number is 101. Something like that.\n\nSo we then have a list of input IDs. And then these are passed on to the model. So when the model receives the input IDs, which are a list of token numbers, what it does is go to its embedding matrix and do a lookup. So the embedding of each token is already stored in the embedding matrix of the pre-trained model. 
If it is not pre-trained and we are looking to train it from scratch, then the initialization is either some random initialization or some informed initialization like Xavier initialization, something like that.\n\nAnd then, when you train the model, these embeddings are also updated; the embeddings of the tokens are also updated. So for now, let us assume that the model has been pre-trained. So once we pass the input IDs, the model maps the token IDs to their token embeddings. And then positional encoding is added to the token embeddings, as is done in transformers, and regardless of the model architecture, whether it's an encoder-decoder model, a decoder-only model, or an encoder-only model, it is processed accordingly.\n\nWe have already seen in our lecture, I guess one or two weeks ago, how you implement transformers from scratch using PyTorch, right? There we saw how a transformer layer is implemented within the model. We saw every component of it: multi-head attention, positional encoding, layer normalization, the encoder block, the decoder block; we saw all that. So now we are putting all o", | |
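The flow described here (raw text, then token IDs, then an embedding-matrix lookup) can be sketched without the library itself. The vocabulary, IDs, and whitespace split below are illustrative stand-ins for what a real tokenizer such as BPE or SentencePiece would produce:

```python
import numpy as np

# Toy stand-in for the Hugging Face pipeline: text -> input IDs -> embeddings.
# The vocabulary and its IDs are made up for illustration; real tokenizers
# learn subword units from data rather than splitting on whitespace.
vocab = {"[CLS]": 101, "this": 2023, "course": 2607, "is": 2003,
         "amazing": 6429, "[SEP]": 102}

def tokenize(text):
    # stand-in for tokenizer(text)["input_ids"]
    return ([vocab["[CLS]"]]
            + [vocab[w] for w in text.lower().split()]
            + [vocab["[SEP]"]])

input_ids = tokenize("This course is amazing")

# The model maps each ID to a row of its embedding matrix: a simple lookup.
rng = np.random.default_rng(0)
emb_dim = 4
embedding_matrix = rng.normal(size=(max(vocab.values()) + 1, emb_dim))
token_embeddings = embedding_matrix[input_ids]
```

After this lookup, positional encodings would be added and the result passed through the encoder or decoder stack, exactly as the tutorial goes on to describe.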
| "Fine tuning LLM": "hey everyone I'm Shaw and this is the fifth video in the larger series on how to use large language models in practice in the previous video we talked about prompt engineering which is concerned with using large language models out of the box while prompt engineering is a very powerful approach and can handle a lot of LLM use cases in practice for some applications prompt engineering just doesn't cut it and for those cases we can go one step further and fine-tune an existing large language model for a\n\nspecific use case so the obvious question is what is model fine-tuning the way I like to define it is taking a pre-trained model and training at least one internal model parameter and here I mean the internal weights or biases inside the neural network what this typically looks like is taking a pre-trained existing model like GPT-3 and fine-tuning it for a particular use case for example ChatGPT to use an analogy here GPT-3 is like a raw diamond right out of the earth it's a diamond but it's a bit rough around\n\nthe edges fine-tuning is taking this raw diamond and transforming it into something a bit more practical something that you can put on a diamond ring for example so the process of taking the raw base model of GPT-3 and transforming it into the fine-tuned model of GPT-3.5 Turbo for example is what gives us applications like ChatGPT or any of the other incredible applications of large language models we're seeing these days to get a more concrete sense of the difference between a base model like\n\nGPT-3 and a fine-tuned model let's look at this particular example we have to keep in mind that these foundation large language models like GPT-3 LLaMA 2 or whatever your favorite large language model is these models are strictly trained to do word prediction given a sequence of words predicting the next word so when you train one of these large language models on a huge corpus of text and documents and web pages what it essentially becomes is a 
document completer what that translates to in practice is if you plug into a lot of\n\nthese base models like GPT-3 the prompt tell me how to fine-tune a model a typical completion might look something like this where it's just listing out questions like you might see in a Google search or maybe like a homework assignment or something here when I prompted GPT-3 to tell me how to fine-tune a model the completion was as follows how can I control the complexity of a model how do I know when my model is done how do I test a model well this might be reasonable for GPT-3 to do based on the data that it was trained on but this\n\nisn't something that's very practical now let's look at the fine-tuned model completion so now we have text-davinci-003 which is just one of the many fine-tuned models based on GPT-3 coming from OpenAI we give it the same prompt tell me how to fine-tune a model and this is the completion fine-tuning a model involves adjusting the parameters of a pre-trained model in order to make it better suited f", | |
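The definition above ("training at least one internal model parameter") can be shown with a toy two-layer network: the first layer stays frozen as pre-trained features and only the head is updated by gradient descent. Shapes, data, and the single-example loss are all made up for illustration:

```python
import numpy as np

# Minimal fine-tuning sketch: freeze W1 ("pre-trained" features), update only
# the head W2 with gradient steps on a squared error. Illustrative only.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))      # frozen pre-trained weights
W2 = rng.normal(size=(1, 4))      # the parameters we fine-tune
x = rng.normal(size=(4,))
y_target = 1.0

W1_before = W1.copy()
h = np.tanh(W1 @ x)               # frozen features: never recomputed weights
lr = 0.1
for _ in range(200):
    err = (W2 @ h)[0] - y_target
    W2 -= lr * err * h            # gradient of 0.5*err^2 w.r.t. W2, W1 untouched

final_err = abs((W2 @ h)[0] - y_target)
```

Even this toy version has the defining property of fine-tuning: the pre-trained weights `W1` are byte-identical after training, while the tuned head fits the new target.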
| "Instruction tuning": "Hello everyone, today we will discuss instruction fine-tuning, or instruction tuning, which is one of the key advancements in recent language modeling research, and which enables us to have a conversation with language models easily. Or, simply put, we can chat with language models. So, first, a quick review. In previous weeks, we have learned about decoder-based language models. Such models are trained on vast amounts of text from the internet using the next-word prediction task. As a result of this, these models learn to encode a great deal of information\n\nabout the world. They also have the ability to understand language to some extent. So these models are very powerful. They're pretty amazing, but we'll see in the upcoming slides that they have some major limitations. One note: these pre-trained language models are also known as base models. So I'll be using the term \"base models\" throughout the lecture. Whenever I mention base models, it simply refers to pre-trained language models. For example, let's say we have been given the following prompt:\n\nWhat is the national flower of India? We have prompted the language model with this question. Now, what can happen is that the language model generates the following response: What is the national animal of India? What is the national bird of India? So this response is nothing but a continuation of the prompt. And this is the result of the next-word prediction that is happening after the prompt. So here we see that the response contains questions that are quite common, questions we can come across on the web, where we see\n\nthere is a web page on general knowledge questions about India. Such questions are very common. However, this is not the desired response. Because when we asked this question of the language model, we were expecting an answer, that is: the national flower of India is the lotus. This was the desired outcome. 
However, since the language model is just predicting the next word, the response may or may not follow the question; it might follow the instructions, or it might not.\n\nBecause, as I said, it's simply doing next-word prediction at this point. So the key takeaway from this slide is that next-word prediction, which is what is governing this response generation, does not necessarily ensure that the model understands or follows instructions. So the reason we need instruction tuning is that we want to teach the language models how to follow and understand instructions. Now, multitask learning is another very important paradigm in the natural language processing literature.\n\nIn classical multitask learning, what we do is combine multiple tasks. We train the language model on multiple tasks with the intention that these tasks will have a positive influence on one another, and thereby the final outcome will be improved across all the tasks. So here, if we take a look at th", | |
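Instruction tuning works by continuing next-word-prediction training, but on (instruction, response) pairs rendered into a fixed template. A minimal sketch of that data preparation; the "### Instruction / ### Response" layout mimics Alpaca-style datasets and is one common convention, not a fixed standard:

```python
# Hypothetical instruction-tuning formatter. The exact template varies by
# project; this Alpaca-like layout is an illustrative assumption.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

pairs = [
    ("What is the national flower of India?",
     "The national flower of India is the lotus."),
    ("What is the national bird of India?",
     "The national bird of India is the peacock."),
]

# Each pair becomes one training string; the base model is then trained with
# next-word prediction on these strings, so it learns to continue an
# instruction with its answer rather than with more look-alike questions.
examples = [TEMPLATE.format(instruction=i, response=r) for i, r in pairs]
```

At inference time the same template is used with the response left empty, and the model's continuation becomes the answer.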
| "Prompt based learning": "Welcome back. So in today's lecture, we are going to talk about prompts, right? And you know, when it comes to ChatGPT kinds of models, large language models, the first question that comes to our mind is how to write a prompt, right? What is going to be the optimal prompt for a specific model, right? So in this lecture, we will discuss different types of prompting techniques, right? How prompts affect the accuracy of the models, right? And how, with the scaling and the increasing size of the models,\n\nthe accuracy is affected, and how a specific prompt will be responsible for producing accuracy across different types of models. We'll also discuss prompt sensitivity, right? We'll see that most of these models are highly sensitive to simple perturbations of prompts, right? And that would affect the accuracy; that would affect other aspects of the model. And how can we quantify that, right? So far, we have discussed different pre-training strategies. We have seen encoder-only models, right?\n\nWe have seen models like BERT, which is a pure encoder-only model, right? And these models are pre-trained with an objective called masked language modeling, right, MLM. We have seen models like GPT, which is a purely decoder-only model. And these models are trained using an autoregressive setup, right, or causal language modeling. This is called causal language modeling or an autoregressive setup. There are other models like T5 and BART, and they are encoder-decoder models. Okay, which are also trained with a kind of autoregressive setup,\n\nwhere your input will be fed to the encoder and the decoder will predict the next word. And if your input has noise, the decoder will be able to essentially denoise your input, right? And you know, the BERT model came out around 2018. The first GPT paper was written in 2018, GPT-1, and then GPT-2 came in 2019. 
GPT-2 was released in 2019 and then GPT-3 in 2020. Right, and the GPT-3 paper showed that the model doesn't need any fine-tuning. We just need to write prompts, and the model will be able to\n\nunderstand your prompt, okay. We will discuss all these topics in this lecture. Okay, so I strongly suggest that you guys read this wonderful survey paper. This is a survey paper written by Graham Neubig and his team from CMU. And it nicely reflects the different prompting strategies and the kind of evolution of prompting as a whole, how it evolved, right? How do we quantify different components of a prompt, and so on and so forth?\n\nIt's a very nice survey paper that I strongly recommend you read. Okay, so you know, we have witnessed that there has been a kind of war, right, among all these giant industries, such as Meta, OpenAI, and Google. They have been building larger and larger models with more and more parameters, and so", | |
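The GPT-3-style "no fine-tuning, just write a prompt" recipe amounts to string construction: labeled demonstrations are concatenated ahead of the new input and the frozen model continues the pattern. A minimal sketch with a made-up sentiment task:

```python
# Few-shot prompt construction sketch. The task, labels, and field names
# ("Review", "Sentiment") are illustrative assumptions, not a fixed format.
def build_few_shot_prompt(demos, query):
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    # the final block is left incomplete; the model's continuation is the answer
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

demos = [
    ("I loved this movie.", "positive"),
    ("Terrible plot and worse acting.", "negative"),
]
prompt = build_few_shot_prompt(demos, "A delightful surprise.")
```

Prompt sensitivity, as discussed above, shows up precisely here: small changes to the field names, ordering, or separators in this string can change the model's accuracy.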
| "Parameter efficient fine tuning": "# Summary: Parameter-Efficient Fine-Tuning (PEFT) for Large Language Models\n\nThis lecture by Dinesh Raghu from IBM Research covers efficient methods for fine-tuning Large Language Models (LLMs) without updating all parameters.\n\n## Key Concepts\n\n**Why PEFT is Needed:**\n- Full fine-tuning requires 12-20× model size in memory for optimizer states, gradients, and activations\n- Storage overhead: each task requires saving a full model checkpoint (e.g., 350GB)\n- In-context learning (prompting) has limitations: lower accuracy than fine-tuning, sensitivity to prompt wording, and high inference costs\n\n**Main PEFT Techniques:**\n\n1. **Prompt Tuning (Soft Prompting)**\n - Reserves special trainable tokens in the input while freezing all model weights\n - Extremely parameter-efficient (~0.1% of model parameters)\n - Enables multi-task serving: different soft prompts can be swapped for different tasks on the same base model\n - Performance approaches full fine-tuning for large models (11B+ parameters)\n\n2. **Prefix Tuning**\n - Adds trainable parameters at every transformer layer, not just the input\n - Uses a bottleneck MLP architecture to prevent training instability\n - Achieves comparable performance to full fine-tuning with only 0.1% trainable parameters\n\n3. **Adapters**\n - Inserts new trainable layers (bottleneck architecture) within each transformer block\n - Down-projects hidden dimensions, applies nonlinearity, then up-projects\n - Achieves good performance with ~3.6% of parameters\n - Drawback: inference latency overhead due to added layers\n\n4. 
**LoRA (Low-Rank Adaptation)**\n - Most popular PEFT method based on intrinsic dimensionality theory\n - Decomposes weight updates into low-rank matrices: ΔW = BA\n - Only modifies query, key, value, and output projection matrices\n - Advantages: no inference latency, can be merged back into base weights\n - Variants: QLoRA (memory-efficient), DyLoRA (dynamic rank selection), LoRA+\n\n**Key Benefits of PEFT:**\n- Reduced memory and compute requirements (can use older GPUs)\n- Faster convergence due to smaller parameter space\n- Less overfitting and catastrophic forgetting\n- Better out-of-domain generalization\n- Minimal storage per task\n\nThe lecture emphasizes that PEFT bridges the gap between inefficient in-context learning and computationally prohibitive full fine-tuning, making LLM adaptation accessible and practical.", | |
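The LoRA decomposition ΔW = BA from the summary can be sketched in a few lines. The dimensions are arbitrary placeholders; in a real model, `W` would be a frozen attention projection matrix:

```python
import numpy as np

# LoRA sketch: frozen W plus a trainable low-rank update B @ A.
rng = np.random.default_rng(0)
d, r = 512, 8                        # hidden size and LoRA rank (r << d)

W = rng.normal(size=(d, d))          # frozen pre-trained weight
A = 0.01 * rng.normal(size=(r, d))   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init so
                                     # the update BA is exactly 0 at the start

x = rng.normal(size=(d,))
y = W @ x + B @ (A @ x)              # LoRA forward pass: Wx + B(Ax)

lora_params = A.size + B.size        # 2 * d * r trainable parameters
full_params = W.size                 # d * d for full fine-tuning

# After training, the update can be merged into the base weight, which is why
# LoRA adds no inference latency.
W_merged = W + B @ A
```

With d = 512 and rank 8, the trainable parameters are about 3% of the full matrix, in line with the sub-1% figures quoted above for real models where only a few projection matrices out of the whole network get adapters.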
| "Incontext Learning": "welcome everyone this is the first screencast in our series on in-context learning this series is a kind of companion to the one that we did on information retrieval the two series come together to help you with homework 2 and bake-off 2 which is focused on few-shot open-domain question answering with frozen retrievers and frozen large language models to start this series I thought we would just reflect a bit on the origins of the idea of in-context learning which is really a story of how NLP got to this\n\nstrange and exciting and chaotic moment for the field and maybe also for society more broadly all credit to the Chomsky bot for bringing us to this moment I'm only joking the Chomsky bot is a very simple pattern-based language model it's been around since the 90s I believe and with very simple mechanisms it produces prose that is roughly in the style of the political philosopher and sometimes linguist Noam Chomsky it produces prose that delights and maybe informs us and the underlying mechanisms are very\n\nsimple and I think that's a nice reminder about what all of these large language models might be doing even in the present day but I'm only joking although it's only partly a joke I think when we think about precedents for in-context learning it is worth mentioning that in the pre-deep-learning era n-gram-based language models very sparse large language models were often truly massive for example Brants et al. 2007 used a 300 billion parameter language model trained on 2 trillion tokens of text to help\n\nwith machine translation that is a very large and very powerful mechanism with a different character from the large language models of today but it is nonetheless worth noting that they played an important role in a lot of different fields way back when I think for in-context learning as we know it now the earliest paper as far as I know is the decaNLP paper this is McCann et al. 2018. 
they do multitask training with task instructions that are natural language questions and that does seem like the origin of the idea\n\nthat with freeform natural language instructions we could essentially end up with artifacts that could do multiple things guided solely by text and then it's worth noting also that in the GPT paper Radford et al. 2018 you can find buried in there some tentative proposals to do prompt-based experiments with that model but the real origins of the ideas again as far as I know are Radford et al. 2019 this is the GPT-2 paper and let me just show you some snippets from this paper it's really inspiring\n\nhow much they did they say at the start we demonstrate language models can perform downstream tasks in a zero-shot setting without any parameter or architecture modification so there you see this idea of using frozen models prompting them and seeing if they will produce interesting behaviors they looked at a bunch of different tasks for summarization they say to induce summarization behavior we add the text TL;DR after the art", | |
| "Retrieval Methods": "All right, welcome everyone to the Comet YouTube channel. We are doing a series of guest speakers where we dive into some very fun technical topics related to building and scaling GenAI systems. And today I have with us Lena. She is an absolute expert in building RAGs. I've had some good conversations with her as we prepared for this session about her approaches. And I'm really excited to just go through some of the best practices around optimizing retrieval for LLMs. We're going to deep\n\ndive into some RAG techniques, some advanced stuff here. Um, before we dive in, I want to pass it over to Lena to introduce herself. Thank you Claire for the introduction and, uh, hi everyone who is watching the recording. I'm so happy to be here. Uh, my name is Lena. I'm a founder of pars labs and chatbot, and I've been working in this space of chatbot development and conversational AI agents for, um, I think eight years now. Uh, my career started in linguistics. I studied theoretical linguistics, then AI. I\n\nworked as an NLP research engineer and full-stack developer, and, uh, yeah, now running a small agency with a team of five, and doing also a lot of, um, public speaking and sharing my knowledge online with people, um, running courses on AI automation and things like that. Uh, so hopefully I will have something interesting to share, uh, with you today. >> All right, well, let's dig into some deep technical topics. Um, I wanted to dive into the advanced techniques for improving RAGs. I think there's a few\n\nideas that you had here. Um, and if you want to share screen, uh, feel free. We can, uh, pull up some slides here. But let's dive into some of these more advanced techniques in regards to just improving your RAG systems. >> Uh, yes. Um, what should we start with? >> Let's see. I'm thinking about optimizing the retrieval quality. Um, >> okay. Let me share my screen and then we'll just see where it leads us. 
So I prepared a lot of different diagrams, uh, with different techniques\n\n>> um, in a random order. So you are welcome. >> Thank you. I love this. This is a gold mine of information. >> Uh, yeah. So let's start with, um, maybe, uh, this. Um, I classified it as an improving-retrieval-quality trick. So, uh, in the past we used intent detection when we were building chatbots. So instead of generating anything for anything, we had to make datasets for: okay, this type of question is about salary, this type of question is I-want-to-talk-to-customer-support. We had all\n\nthose categories, and, um, I don't see that often being used yet with newer teams who just joined this whole chatbot development. So I want to bring it back. Uh, so instead of just generating the response using an LLM, you get the user question. So, what's the salary? Then you predict an intent, and you get an answer that your SMEs have written. So the answer that's in your database, and it says the salary for this position is [salary]. And then you paste this information, uh, into the prompt, and then\n\nyou generate an answer o", | |
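The intent-detection trick described here routes known question types to pre-written, SME-authored answers and falls back to the LLM only for everything else. A hypothetical sketch; a real system would use a trained intent classifier rather than keyword matching, and the canned answers are placeholders:

```python
# Hypothetical intent router for a RAG chatbot. Keyword matching stands in
# for a trained classifier; answers stand in for an SME-curated database.
CANNED_ANSWERS = {
    "salary": "The salary for this position is [salary].",
    "support": "Connecting you to customer support.",
}
INTENT_KEYWORDS = {
    "salary": ["salary", "pay", "compensation"],
    "support": ["support", "agent", "human"],
}

def detect_intent(question):
    q = question.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in q for k in keywords):
            return intent
    return None

def answer(question):
    intent = detect_intent(question)
    if intent is not None:
        return CANNED_ANSWERS[intent]   # grounded answer from the database
    return "<generate with LLM>"        # placeholder for the generative path

resp = answer("What's the salary?")
```

The payoff is exactly what the speaker describes: for high-stakes, frequently-asked questions the response is grounded in vetted text (optionally pasted into the prompt), and the LLM only generates when no known intent matches.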
| "Retrieval Augmented Generation": "### Lecture Summary: Retrieval-Augmented Generation (RAG) and Contextualization of Language Models\n\nWelcome everyone. Today we’ll discuss **retrieval augmentation** — one of the most active areas in modern NLP. Our speaker, Douwe Kiela, is CEO of Contextual AI, an enterprise LLM company, an adjunct professor at Stanford, and former head of research at Hugging Face and Facebook AI. His research focuses on machine learning and NLP, especially on language understanding, generation, and evaluation.\n\n---\n\n### 1. The Age of Language Models\n\nWe live in the era of large language models (LLMs). However, **language models** are not a recent invention—neural language modeling dates back to the 1990s. The idea is simple: given a sequence of tokens, predict the next token. Early versions (e.g., Bengio et al., 2003) already had embeddings and similar formulations to today’s models. What changed is scale.\n\nThe main breakthrough of **ChatGPT** wasn’t the model architecture—it was the **user interface**. Previously, using a language model required strange prompt engineering. ChatGPT solved this through *instruction tuning* and *reinforcement learning from human feedback (RLHF)*, allowing people to simply “ask” the model naturally.\n\n---\n\n### 2. Problems with Pure Language Models\n\nEven with good interfaces, LLMs have serious issues:\n\n* **Hallucination:** Models generate incorrect facts with high confidence.\n* **Attribution:** We don’t know *why* a model produced an answer.\n* **Staleness:** Models become outdated quickly.\n* **Editing:** We can’t easily revise or delete knowledge (e.g., for GDPR compliance).\n* **Customization:** Hard to adapt models to specific domains or private data.\n\nThese limitations make LLMs unreliable for enterprise or high-accuracy applications.\n\n---\n\n### 3. 
Enter Retrieval-Augmented Generation (RAG)\n\nThe solution many have turned to is **RAG** — connecting a generator (LLM) with an **external retriever** or **memory**. Instead of relying solely on parameters, the model can look up information dynamically.\n\nThink of **closed-book vs open-book exams**:\n\n* Closed-book → memorize everything (parametric LMs).\n* Open-book → look up relevant facts when needed (RAG).\n\nThis architecture has two main parts:\n\n1. **Retriever:** Fetches relevant documents or passages from an external database.\n2. **Generator:** Takes the query and retrieved context to produce an answer.\n\nThis approach gives **updatable, grounded, and customizable** models that hallucinate less and can cite their sources.\n\n---\n\n### 4. Retrieval Foundations\n\nEarly retrieval methods used **sparse retrieval** (e.g., TF-IDF, BM25).\nBM25 scores documents by term frequency and inverse document frequency, emphasizing distinctive words. It’s fast and efficient on CPUs but fails with synonyms or paraphrases.\n\n**Dense retrieval** (e.g., DPR, ORQA) replaces sparse counts with **embedding-based similarity** using models like BERT. Each document and query is encoded into a dense vector; retrieval is done v", | |
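The BM25 scoring just described (IDF weighting, term-frequency saturation via k1, document-length normalization via b) can be sketched over a toy corpus. The smoothed IDF form below is one common variant:

```python
import math
from collections import Counter

# Minimal BM25 sketch: toy three-document corpus, standard k1/b defaults.
docs = [
    "the cat sat on the mat".split(),
    "dense retrieval uses embedding similarity".split(),
    "bm25 scores term frequency and inverse document frequency".split(),
]
k1, b = 1.5, 0.75
N = len(docs)
avgdl = sum(len(d) for d in docs) / N

def idf(term):
    n = sum(term in d for d in docs)                # document frequency
    return math.log((N - n + 0.5) / (n + 0.5) + 1)  # smoothed IDF variant

def bm25(query, doc):
    tf = Counter(doc)
    return sum(
        idf(t) * tf[t] * (k1 + 1)
        / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        for t in query
    )

query = "inverse document frequency".split()
best = max(range(N), key=lambda i: bm25(query, docs[i]))
```

A query sharing no terms with a document scores zero, which is exactly the synonym/paraphrase failure mode that motivates the dense retrievers (DPR, ORQA) discussed next.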
| "Quantization": "Alright, hello everyone. Today we will be talking about quantization, pruning, and distillation. These all fall under the same umbrella, and once I go into the introduction, it will be clear why we are discussing these three topics together. As we discussed last time, the reason we want to do this is because of how model sizes have been increasing over time. This is a recap of that: over time, the size of models has grown exponentially, and it is not just the size but also the performance of these models that keeps improving.\n\nHere we see test losses for various language modeling tasks on different datasets, and they decrease considerably as the number of parameters in the large language model grows. Now, what is the flip side of having such large models during inference? First, as we discussed last time, the bigger the model is, the more new hardware you have to buy to support it, and it is not very practical to keep buying hardware as the model keeps growing over time. This also puts a cap on how many organizations can actually run LLM inference, because the GPU requirement keeps getting larger.\n\nSecond, one of the biggest problems in deployment is latency. The larger the model, the more time it takes to come back with a completion for a given prompt. Say you have deployed a chatbot whose backbone is an LLM; having to wait 30 or 40 seconds seems really difficult at a time when we are so used to getting replies very quickly. And the way the LLM field is progressing, people are not making just a single LLM call per response. For example, there is now a paradigm called agentic behavior, where once a language model has responded, you ask the same language model or a different one to reflect on the output, or even execute the output; for example, if the model generates code, you execute the code, get the outputs, and make the model reflect on them.\n\nIt then decides whether there is an execution error and whether it should redo things, and only then gives back the answer. Now, if you have to make multiple LLM inference calls just to get one output, that is going to increase latency even more. So, latency is one of the biggest concerns. Third is inference cost. If you have an application deployed that uses LLMs, you will of course be worried about how much money you spend to serve a single user, and the money a single user pays you should be more than what you invest for that user. If, for a slight improvement in accuracy, the LLM makes you spend a lot more, it does not seem commercially viable. So inference cost is going to be one of the biggest dimensions. And finally, sustainability and environmental concerns. So, yo", | |
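To make the size/accuracy trade-off behind quantization concrete, here is a minimal sketch of symmetric per-tensor int8 post-training quantization. The weight values are made up for illustration, and real systems typically quantize per channel and calibrate activations as well:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: w ~= scale * q, with q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

weights = [0.31, -1.27, 0.05, 0.88, -0.64]          # toy fp32 weights
q, scale = quantize_int8(weights)                   # 1 byte per weight vs 4
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of four, at the cost of a rounding error bounded by half the scale; that 4x memory reduction is what shrinks both the hardware requirement and the latency discussed above.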
| "Mixture of Experts Model": "## **Scaling Transformers Through Sparsity**\n\n### **1. Motivation**\n\nThe driving idea behind this research is **scaling**. In the deep learning community, performance has been shown to improve predictably with model size and compute, as outlined in *OpenAI’s 2020 paper “Scaling Laws for Neural Language Models.”*\nThese scaling laws hold across several orders of magnitude, demonstrating that **larger models are more sample-efficient**: for a fixed compute budget, it is better to train a **larger model for fewer steps** than a smaller one for longer.\n\nSo far, most scaling has relied on **dense models**, where every parameter participates in every forward pass. But this is expensive and inefficient.\nToday’s discussion explores a new axis of scaling: **sparsity** — models where different inputs activate **different subsets of weights**, performing **adaptive computation** depending on the input.\n\n---\n\n### **2. What Is Sparsity?**\n\nSparsity here doesn’t mean pruning or zeroing weights, but **conditional computation**:\neach input token is processed by a subset of the network — “experts” — chosen dynamically.\n\nThis idea dates back to 1991’s *Adaptive Mixtures of Local Experts*, revived in modern NLP by Noam Shazeer and colleagues at Google with **Mixture of Experts (MoE)** for LSTMs.\nThe architecture has:\n\n* Several **experts**, each a small feed-forward network.\n* A **router (gating network)** that predicts which experts to send each token to, using a softmax distribution.\n* The output is a weighted mixture of the selected experts.\n\nMoE proved successful in machine translation but had issues like communication cost and training instability.\nThe **Switch Transformer** simplified this by sending each token to **only its top-1 expert**, reducing both cost and complexity.\n\n---\n\n### **3. 
The Switch Transformer**\n\nA Switch Transformer modifies the Transformer block by replacing some feed-forward layers with **Switch Layers**:\n\n* A **router** decides which expert each token goes to.\n* Each token goes to one expert (top-1 routing).\n* The same amount of computation is done overall, but different tokens use different weight matrices.\n\nThis makes computation adaptive while keeping the FLOPs (floating-point operations) roughly constant.\n\n---\n\n### **4. Key Improvements for Sparse Training**\n\nSparse models can be unstable to train. The team made several innovations to stabilize and optimize performance:\n\n#### (a) **Selective Precision**\n\n* Training in low precision (bfloat16) improves speed but can cause divergence due to numerical instability, especially in routers (softmax/exponentiation).\n* Casting router computations to **float32** (while keeping others in bfloat16) solved this without meaningful speed loss.\n\n#### (b) **Initialization Scaling**\n\n* Default initializations made training unstable.\n* Simply reducing the initialization scale significantly improved convergence and performance.\n\n#### (c) **Expert Dropout Regularization**\n\n* Sparse models with many param", | |
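The top-1 routing described above can be sketched as follows. The router weights and token vector are made-up toy values; a real Switch layer batches this over all tokens, runs the chosen expert's feed-forward network, and adds a load-balancing auxiliary loss:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(token_vec, router_weights):
    """Top-1 routing: send the token to the single highest-probability expert."""
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]      # the gate value scales the expert's output

# toy router over 4 experts for a 3-dimensional token representation
router = [[0.1, 0.2, 0.0],
          [0.9, -0.3, 0.5],
          [-0.2, 0.4, 0.1],
          [0.0, 0.0, 0.3]]
expert, gate = switch_route([1.0, 0.0, 1.0], router)
```

Because only the selected expert's weights are touched, FLOPs per token stay roughly constant no matter how many experts (and hence parameters) the layer holds.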
| "Agentic AI": "My name is Inso. Today we would like to go over agentic AI and agentic language models as a progression of language model usage. Here is the outline of today's talk: we will go over an overview of language models and how we use them, then their common limitations, then some of the methods that improve on these limitations, and then we will transition to what an agentic language model is and its design patterns.\n\nA language model is a machine learning model that predicts the next word given the input text. In this example, if the input is \"the students opened their\", the language model can predict the most likely next word. If the language model is trained on a large corpus, it generates a probability distribution over the next word; in this example, as you can see, \"books\" and \"laptops\" have higher probability than the other words in the vocabulary. So a completion of the whole sentence could be \"the students opened their books\", and if you want to keep generating, you feed the output back in as input, and the language model keeps generating the next word.\n\nHow are these language models trained? Largely in two parts: a pre-training part and a post-training part. In the pre-training portion, language models are trained on a large corpus of text collected from the internet, books, and other publicly available sources, with a next-token (next-word) prediction objective. Once the model finishes this pre-training stage, it is fairly good at predicting the next word given any input. However, the pre-trained model by itself is not easy to use, hence the post-training steps. This post-training stage includes instruction-following training as well\n\nas reinforcement learning with human feedback. What this training stage means is that we prepare a dataset pairing specific instructions or questions with the answers or generated outputs a user would expect, that is, outputs closely related to the questions, and we train the model on them so that it is easier to use and responds to specific requests. Once this is done, an additional training method is aligning to human preference using reinforcement learning from human feedback, which uses human preference judgments to align the model through a reward scheme. Let us take a really quick look at the instruction dataset. This is the template we would use to train the model in the instruction-following training phase. As you can see, there is a specific instruction w", | |
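The instruction-following template mentioned just above can be sketched roughly as below. This follows the widely used Alpaca-style format; the exact wording and field names here are illustrative, not the template shown in the lecture:

```python
def format_instruction(instruction, model_input, response):
    """Build an Alpaca-style training example (format is illustrative)."""
    header = "Below is an instruction that describes a task"
    if model_input:
        header += ", paired with an input that provides further context."
    else:
        header += "."
    prompt = header + "\n\n### Instruction:\n" + instruction
    if model_input:
        prompt += "\n\n### Input:\n" + model_input
    prompt += "\n\n### Response:\n"
    return prompt + response            # prompt + target completion

ex = format_instruction("Translate to French.", "Good morning", "Bonjour")
```

During instruction tuning, the model is trained to continue the prompt portion with the response portion, which is what makes the raw next-word predictor usable through plain requests.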
| "Multimodal LLMs": "Hello, thank you all for joining CS25: Transformers. For today's talk we have Ming Ding, a research scientist at Zhipu AI based in Beijing. He obtained his bachelor's and doctoral degrees at Tsinghua University, and he does research on multimodal generative models and pre-training technologies. He has led or participated in research on multimodal generative models such as CogView and CogVideo, and multimodal understanding models such as CogVLM and CogAgent. For today's\n\nattendance, the attendance form is up on the course website, and if you have any questions, ask them through Slido (s-l-i-d-o); for the code, you just have to input cs25. Thank you, Ming, for today's talk, and I'm going to pass it off to you.\n\nThank you to the instructors of CS25; I am very happy to give a talk at Stanford University about multimodal pre-training. I have actually checked all the previous talks in CS25, and they cover really diverse topics: someone shared the intuition behind their research on pre-training, someone shared recent work on MoE, and other techniques. I work at a large language model company in China; our company works on pre-training across many different areas, from large language models to multimodality models, generative models, diffusion, text-to-speech, and so on. I lead all the multimodality model research at Zhipu AI, so I will share a number of different topics in this talk. Some of them may not be very familiar to you, but that is okay; you can get more information about the area. I will talk about several aspects of Transformers, and I will generally follow the history of large language models: \"why are we here\" is about large language model introduction and history; \"how did we get here\" is about some practical techniques for training large language models; and \"what are we working on\" is about the last one year of vision-language\n\nmodels and other techniques in the papers of the vision-language model community. Finally, I will talk about some possible and valuable directions for research in multimodality.\n\nI will share what I think are the three most important moments in the development of language models. The first moment is the BERT moment; I actually got into the area at this moment, and I am honored to be among the first group of people who published papers at the following year's ACL after BERT came out. At that time we did not really know what language modeling truly was, so nearly everyone was talking about how to get a better self-supervised objective. A common opinion was that the masked language model is good at understanding text, while GPT, the autoregressive model, is better for text generation, and T5 maybe can do both but is redundant. That is true, but n", | |
| "Vision Language Models": "## **Vision–Language Models (VLMs)**\n\n### **1. Introduction and Motivation**\n\nVision–language models (VLMs) are systems that **learn jointly from images and text**, enabling understanding and reasoning across both modalities. These models can describe images, answer visual questions, classify objects in an open vocabulary, and perform grounding or retrieval tasks — all using a shared understanding between vision and language.\n\nThe talk is divided into three parts:\n\n1. **Foundations and Early Models** – how it started (around 2021)\n2. **Cross-Modal Models** – modern multimodal transformers\n3. **Applications and Outlook** – where VLMs are being used and what’s next\n\nThe goal is to trace how we moved from basic dual-encoder models like **CLIP** to modern multimodal systems such as **Gemini**, and how this shift is transforming research and real-world applications.\n\n---\n\n## **2. What Is a Vision–Language Model?**\n\nA VLM jointly processes both **images and text**.\nInput: image(s) + text(s)\nOutput: typically text (caption, answer, label, etc.)\n\nWhile some models can also *generate* images, this talk focuses on those that produce **text outputs**.\n\nTo design a VLM, we must decide:\n\n* How to **encode** images and text (shared vs separate architectures)\n* When and how to **fuse** the modalities\n* What **losses** to use (contrastive, captioning, etc.)\n* Whether to train from **scratch** or **fine-tune** pretrained models\n* What kind of **data** to use: paired (image–text), interleaved, or unpaired\n\n---\n\n## **3. 
Dual-Encoder Models: The Beginning**\n\n### **3.1 The Idea**\n\nThe simplest form of VLMs are **dual encoders**:\n\n* An **image encoder** and a **text encoder**, each processing its own modality\n* The two encoders only interact **at the loss level** — their final embeddings are compared to learn alignment.\n\nThis structure laid the foundation for large-scale models like **CLIP (OpenAI)** and **ALIGN (Google)**, both published in early 2021.\n\n### **3.2 CLIP: Connecting Images and Text**\n\n**CLIP (Contrastive Language–Image Pretraining)** became the turning point for multimodal learning.\n\n**Training setup:**\n\n* 400 million **image–text pairs** scraped from the web.\n* Train two encoders (ViT for images, Transformer for text) from scratch.\n* Use a **contrastive loss** to bring matching image–text pairs closer and push others apart.\n\nThis simple recipe led to highly transferable representations and *open-vocabulary* capabilities — allowing classification without retraining on new classes.\n\n---\n\n## **4. Contrastive Learning in CLIP**\n\n### **4.1 Principle**\n\nContrastive learning teaches the model to:\n\n* **Maximize similarity** between two *positive* samples (e.g., an image and its true caption)\n* **Minimize similarity** with *negative* samples (other image–text pairs in the batch)\n\nIn formula form, the **InfoNCE loss** compares one positive pair to all others using a softmax over cosine similarities.\n\n### **4.2 Implementation Details**\n\n* **Normalize** embeddings", | |
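The InfoNCE loss from section 4.1 can be sketched for a tiny batch. The embeddings below are toy 2-D vectors, and the temperature is simply CLIP's commonly cited 0.07 default; this shows only the image-to-text direction, whereas CLIP averages it with the symmetric text-to-image term:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(img_embs, txt_embs, temperature=0.07):
    """Image-to-text InfoNCE: positive pair is (img[i], txt[i]),
    every other text in the batch serves as a negative."""
    losses = []
    for i, img in enumerate(img_embs):
        sims = [cosine(img, t) / temperature for t in txt_embs]
        m = max(sims)                                   # stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        losses.append(log_denom - sims[i])              # -log softmax(positive)
    return sum(losses) / len(losses)

imgs = [[1.0, 0.0], [0.0, 1.0]]          # toy image embeddings
txts = [[0.9, 0.1], [0.1, 0.9]]          # matching toy caption embeddings
loss = info_nce(imgs, txts)
```

Mismatching the pairs (e.g., swapping the two captions) drives the loss up, which is exactly the signal that pulls true image-caption pairs together and pushes the rest apart.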
| "Policy learning using DQN": "Hello everyone, welcome back to the course on LLMs. We were discussing the alignment of large language models, and in the last class we discussed how human-generated feedback can be injected into the model for further refining. Specifically, we discussed that we have a policy model: the policy model is the LLM that you want to refine, that you want to fine-tune. This policy model will generate certain outputs, and these outputs will be scored by a reward model. The reward model is another LLM.\n\nIt will produce a reward, and based on this reward we will further refine the policy model. We also discussed that reward maximization alone is not enough, because what happens if we only maximize the reward? The policy model would start hacking the reward model, meaning it would start producing responses for which the reward model produces high reward values, but the responses may not be realistic. For example, the policy model may start producing a lot of emoticons, or sentences that are verbose, lengthy, not to the point, and so on, which you do not want.\n\nTherefore, to address this reward-hacking loophole, you add another component to the objective function, one that you want to minimize. What is this component? It is the KL divergence between the old policy and the updated policy, the old LLM and the updated LLM. You do not want the updated LLM to be too far from the starting LLM.\n\nSo, we discussed maximizing the expected reward that you obtain given a policy. The policy model is parameterized by θ. Given a prompt x, you sample from this policy model to generate a y, and for this y you have a reward. For example, x is the prompt \"Where is the Taj Mahal located?\" and y is the response, say \"The Taj Mahal is located in Uttar Pradesh,\" or \"The Taj Mahal is located in France,\" and so on,\n\nand based on that you give a high or low reward. The second term is the KL divergence between the updated policy model and the reference policy model, π_θ and π_ref. The reference policy model is the LLM from which you started the reward maximization process. These two components are combined: you want to maximize the reward and minimize the KL divergence. And there is a scaling factor β, which is responsible for weighting these two components: if you want to give more weight to the KL divergence, you increase the value of β, and so on. This is what we discussed. So now the question is: given this regularized reward maximization, why is it regularized? Because the KL component can be thought of as a regularizer: the reward is your objective, and the KL term is your regularizer. ", | |
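The regularized objective just described, reward minus β times the KL term, can be sketched per sampled response. The reward, log-probabilities, and β below are made-up numbers, and the single-sample difference log π_θ(y|x) − log π_ref(y|x) is only a common one-sample estimator of the true KL divergence:

```python
def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    """Per-sample regularized reward: r(x, y) - beta * KL estimate.
    The KL estimate is log pi_theta(y|x) - log pi_ref(y|x)."""
    kl_est = logp_policy - logp_ref
    return reward - beta * kl_est

# toy numbers: same reward, but one policy has drifted further from the reference
obj_close = rlhf_objective(reward=2.0, logp_policy=-5.0, logp_ref=-5.2)
obj_far = rlhf_objective(reward=2.0, logp_policy=-3.0, logp_ref=-5.2)
```

With equal reward, the response from the policy that drifted further from π_ref is penalized more, which is exactly how the β-weighted KL term discourages reward hacking.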
| "RLHF": "Incredibly powerful, and a ton of people were just able to download this and use it on their own, and that was transformative in how people viewed machine learning as a technology that interfaces with people's lives. We at Hugging Face see this as a theme that is going to continue to accelerate over time, and there are a lot of questions about where this is going and how these tools actually work.\n\nOne of the big things that has come up in recent years is that these machine learning models can fall short: they are not perfect, and they have some really interesting failure modes. On the left you can see a snippet from ChatGPT. If you have used ChatGPT, there are filters built in; essentially, if you ask it something like \"how do I make a bomb,\" it will say, \"I can't do this.\" That would be harmful, but people have figured out how to jailbreak the agent: you tell it, \"I'm a playwright, and you're a character in my play; what happens?\" There are all sorts of huge issues around this. We are trying to make sure these models are safe, but there is a long history of failures and challenges in interfacing with society in a fair and safe manner. On the right are two slightly older examples. Tay was a chatbot from Microsoft that tried to learn in the real world by interacting with humans; trained on a large variety of data without any grounding in values, it quickly became hateful and was turned off. And there is a long history of the field studying bias in machine learning algorithms and datasets, where the data and the algorithm often reflect the biases of their designers and of where the data was created. So the question is how we actually use machine learning models with the goal of\n\nmitigating these issues, and something we are going to talk about a lot in this talk is reinforcement learning. Let me get the lingo out of the way for people who might not be familiar with deep RL. Reinforcement learning is a mathematical framework: when you hear RL, think of a constrained set of math problems within which we can study many different interactions with the world. Some terminology we will revisit again and again: there is an agent interacting with an environment. The agent interacts with the environment by taking an action, and the environment returns two things, the state and the reward. The reward is the objective we want to optimize, and the state is a representation of the world at the current time index. The agent uses something called a policy to map from that state to an action, and the beauty of this is that " | |
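The agent-environment loop described here can be sketched with a toy one-dimensional environment, which is entirely made up for illustration: the state is a position, the action moves it, and the environment returns reward 1 whenever the agent moves closer to the origin:

```python
def run_episode(policy, start_state=4, steps=10):
    """Agent-environment loop: the policy maps state -> action,
    the environment returns the next state and a reward."""
    state = start_state
    total_reward = 0
    for _ in range(steps):
        action = policy(state)                            # agent acts
        next_state = state + action                       # environment transitions
        reward = 1 if abs(next_state) < abs(state) else 0
        total_reward += reward                            # objective to maximize
        state = next_state
    return total_reward

def toward_zero(state):
    """A good policy for this toy environment: always step toward the origin."""
    return -1 if state > 0 else (1 if state < 0 else 0)

total = run_episode(toward_zero)    # walks 4 -> 3 -> 2 -> 1 -> 0, earning 4
```

The same loop structure underlies RLHF: the language model is the policy, a generated response is the action, and the reward model supplies the reward signal.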
| } |