Mastering LLMs course notes – Lecture 1
Notes on Mastering LLMs Maven course
Lecture 1 #
Most common tools for fine-tuning:
- Axolotl; Hugging Face Transformers’ apply_chat_template() method — formats your fine-tuning examples into the chat templates that the LLM knows how to use (see the templating sketch after this list)
- Many issues in development are due to templating inconsistencies (e.g. the “###” special token that you put between prompt & response in the fine-tuning examples)
- (May 2024 statement) With larger context windows, bigger LLMs are able to take in more examples in the prompt itself, which makes fine-tuning less important
- Is this true?
- You should prove to yourself that you need to fine-tune — and only do so after you’ve tried using the base model, doing prompt engineering, etc.
- A lot of reasons to fine-tune are around owning your own model & data privacy (e.g. not having to get users’ opt-in to pass their prompts to OpenAI)
- Fine-tuning shines in narrow domains (e.g. translating a natural-language query into a domain-specific language)
- RAG = retrieval-augmented generation (e.g. appending the schema of the tables the user is asking about, or appending to the user prompt a domain-specific-language spec, tips, and a few examples of successful translations; the latter is somewhat competitive with fine-tuning). See the prompt-assembly sketch after this list.
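A minimal sketch of the chat templating mentioned above, using Hugging Face Transformers; the model name and example messages are placeholders, not from the lecture:

```python
# Minimal chat-templating sketch (model name and messages are illustrative assumptions).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "How many users signed up last week?"},
    {"role": "assistant", "content": "SELECT COUNT(*) FROM users WHERE signup_date >= date('now', '-7 days');"},
]

# apply_chat_template inserts the model's own delimiters/special tokens, so you
# don't hand-roll separators like "###" and risk a train/inference mismatch.
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
```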
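And a hypothetical sketch of the RAG-style prompt assembly described above for natural language → DSL translation; every name and string here is invented for illustration:

```python
# Hypothetical RAG-style prompt assembly: retrieved context (table schema, DSL spec,
# worked examples) is appended to the user's question before calling the model.
def build_prompt(question: str, schema: str, dsl_spec: str,
                 examples: list[tuple[str, str]]) -> str:
    example_block = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return (
        "You translate questions into our query DSL.\n\n"
        f"Table schema:\n{schema}\n\n"
        f"DSL reference:\n{dsl_spec}\n\n"
        f"Successful translations:\n{example_block}\n\n"
        f"Q: {question}\nA:"
    )
```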
Q&A #
- Preference optimization = a form of fine-tuning where you label the responses as “better” or “worse”
- Direct Preference Optimization (DPO) is the most common way to do this (see the preference-data sketch at the end of these notes)
- Reinforcement Learning from Human Feedback (RLHF) is a more general version of this approach
- LLaVA is a good model for multi-modal fine-tuning
- Using the most powerful model to generate synthetic data is a good idea (e.g. asking a powerful model to slightly perturb your base examples; see the perturbation sketch at the end of these notes)
- Mistral Large has permissive T&C for this
- Function calling = having the LLM decide when to call a function (and with what arguments) in order to return the correct query response; see the dispatch sketch at the end of these notes
- Llama 2/3 have been fine-tuned to be pretty good at this
- Trace = a log of the entire LLM input/output process (e.g. user query + retrieved context → any function calls & intermediate thoughts the LLM has → final user output)
- You can pass data through this pipeline, and the captured traces can be used to generate synthetic data
- Base model = the raw weights, only available for e.g. Mistral/Llama, where the weights have been made public
- Instruction-tuned models = models that have been fine-tuned for chatting with humans
- For chatbots (which are open-ended, free-form conversations with humans), you need a lot more data to successfully fine-tune
- For single prompt-response problems or narrower problem domains, you need fewer examples and fine-tuning has a better chance of working
- Guardrails for malicious prompts are (surprisingly) not a solved problem, because most solutions work by modifying your prompt (e.g. in a RAG-like way)
- Currently, only experimental users are allowed to fine-tune GPT-4 and GPT-4o
- Gradio/Streamlit — good tools for giving a human in the loop visibility into the best responses, weaker responses, & failure modes (see the review-app sketch below)
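A sketch of the preference-pair data format used for direct preference optimization, as mentioned in the Q&A; the example content is invented:

```python
# Preference-pair format for DPO: each record pairs a prompt with a preferred
# ("chosen") and a dispreferred ("rejected") response. Content is made up.
preference_data = [
    {
        "prompt": "How many users signed up last week?",
        "chosen": "SELECT COUNT(*) FROM users WHERE signup_date >= date('now', '-7 days');",
        "rejected": "SELECT * FROM users;",
    },
]
# Libraries such as TRL's DPOTrainer consume data in roughly this
# prompt/chosen/rejected shape and optimize the model to prefer "chosen".
```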
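A hedged sketch of using a strong model to perturb existing examples into synthetic variants; call_model is a hypothetical helper standing in for whichever provider client you use:

```python
# Hypothetical synthetic-data helper: ask a strong model to slightly perturb a
# base example. `call_model` is an assumed callable that returns the model's text.
def make_variants(example_question: str, call_model, n: int = 3) -> list[str]:
    prompt = (
        "Rewrite the following question in a slightly different way, keeping its meaning. "
        f"Give {n} variants, one per line.\n\n{example_question}"
    )
    response = call_model(prompt)
    return [line.strip() for line in response.splitlines() if line.strip()]
```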
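A minimal, hypothetical sketch of the function-calling loop: the model is shown tool descriptions, may emit a structured call, and your code executes it. The JSON shape here is illustrative; real APIs (OpenAI tools, Llama 3's format) define their own schemas:

```python
# Hypothetical function-calling dispatch: if the model output is a JSON tool
# call, run the named function with its arguments; otherwise treat it as text.
import json

TOOLS = {
    "run_query": lambda sql: f"(rows returned for: {sql})",  # stand-in for a real DB call
}

def handle_model_output(model_output: str) -> str:
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output                      # plain answer, no function call
    func = TOOLS[call["name"]]
    return func(**call["arguments"])

# e.g. handle_model_output('{"name": "run_query", "arguments": {"sql": "SELECT 1"}}')
```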
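Finally, a minimal sketch of a Gradio review app for that human-in-the-loop step; the ratings are just kept in a local list here, and in practice you would persist them somewhere:

```python
# Minimal human-in-the-loop review app: show a prompt/response pair and record a verdict.
import gradio as gr

ratings = []

def record_rating(prompt: str, response: str, verdict: str) -> str:
    ratings.append({"prompt": prompt, "response": response, "verdict": verdict})
    return f"Recorded ({len(ratings)} total)"

demo = gr.Interface(
    fn=record_rating,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Textbox(label="Model response"),
        gr.Radio(["good", "weak", "failure"], label="Verdict"),
    ],
    outputs=gr.Textbox(label="Status"),
)

# demo.launch()
```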