Mastering LLMs course notes – Lecture 1
Notes on Mastering LLMs Maven course
Lecture 1 #
Most common tools for fine-tuning:
- Axolotl; Hugging Face Transformers’ apply_chat_template() method — formats your fine-tuning examples into the chat templates that the LLM knows how to use (see the templating sketch after this list)
- Many issues in development are due to templating inconsistencies (e.g. the “###” special token that you put between prompt & response in the fine-tuning examples)
- (May 2024 statement) With larger context windows, bigger LLMs are able to take in more examples in the prompt itself, which makes fine-tuning less important
- Is this true?
- You should prove to yourself that you need to fine-tune — and only do so after you’ve tried using the base model, doing prompt engineering, etc.
- A lot of reasons to fine-tune are around owning your own model & data privacy (e.g. not having to get users’ opt-in to pass their prompts to OpenAI)
- Fine-tuning shines in narrow domains (e.g. translating a natural-language query into a domain-specific language)
- RAG = retrieval-augmented generation (e.g. appending the schema of the tables the user is asking about, or appending to the user prompt a domain-specific-language spec, tips, and a few examples of successful translations; the latter is somewhat competitive with fine-tuning). See the prompt-assembly sketch after this list.
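A minimal sketch of the chat templating mentioned above, using Hugging Face Transformers; the model name and example messages are placeholders, not from the lecture:

```python
# Minimal chat-templating sketch (model name and messages are illustrative assumptions).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "How many users signed up last week?"},
    {"role": "assistant", "content": "SELECT COUNT(*) FROM users WHERE signup_date >= date('now', '-7 days');"},
]

# apply_chat_template inserts the model's own delimiters/special tokens, so you
# don't hand-roll separators like "###" and risk a train/inference mismatch.
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
```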
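And a hypothetical sketch of the RAG-style prompt assembly described above for natural language → DSL translation; every name and string here is invented for illustration:

```python
# Hypothetical RAG-style prompt assembly: retrieved context (table schema, DSL spec,
# worked examples) is appended to the user's question before calling the model.
def build_prompt(question: str, schema: str, dsl_spec: str,
                 examples: list[tuple[str, str]]) -> str:
    example_block = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return (
        "You translate questions into our query DSL.\n\n"
        f"Table schema:\n{schema}\n\n"
        f"DSL reference:\n{dsl_spec}\n\n"
        f"Successful translations:\n{example_block}\n\n"
        f"Q: {question}\nA:"
    )
```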
Q&A #
- Preference optimization = a form of fine-tuning where you label the responses as “better” or “worse”
- Direct Preference Optimization (DPO) is the most common way to do this (see the preference-data sketch at the end of these notes)
- Reinforcement Learning from Human Feedback (RLHF) is a more general version of this approach
- LLaVA is a good model for multi-modal fine-tuning
- Using the most powerful model to generate synthetic data is a good idea (e.g. asking a powerful model to slightly perturb your base examples; see the perturbation sketch at the end of these notes)
- Mistral Large has permissive T&C for this
- Function calling = having the LLM decide when to call a function (and with what arguments) in order to return the correct query response; see the dispatch sketch at the end of these notes
- Llama 2/3 have been fine-tuned to be pretty good at this
- Trace = a log of the entire LLM input/output process (e.g. user query + retrieved context → any function calls & intermediate thoughts the LLM has → final user output)
- You can pass data through this pipeline, and the captured traces can be used to generate synthetic data
- Base model = the raw weights, only available for e.g. Mistral/Llama, where the weights have been made public
- Instruction-tuned models = models that have been fine-tuned for chatting with humans
- For chatbots (which are open-ended, free-form conversations with humans), you need a lot more data to successfully fine-tune
- For single prompt-response problems or narrower problem domains, you need fewer examples and fine-tuning has a better chance of working
- Guardrails for malicious prompts are (surprisingly) not a solved problem, because most solutions work by modifying your prompt (e.g. in a RAG-like way)
- Currently, only experimental users are allowed to fine-tune GPT-4 and GPT-4o
- Gradio/Streamlit — good tools for giving a human in the loop visibility into the best responses, weaker responses, & failure modes (see the review-app sketch below)
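A sketch of the preference-pair data format used for direct preference optimization, as mentioned in the Q&A; the example content is invented:

```python
# Preference-pair format for DPO: each record pairs a prompt with a preferred
# ("chosen") and a dispreferred ("rejected") response. Content is made up.
preference_data = [
    {
        "prompt": "How many users signed up last week?",
        "chosen": "SELECT COUNT(*) FROM users WHERE signup_date >= date('now', '-7 days');",
        "rejected": "SELECT * FROM users;",
    },
]
# Libraries such as TRL's DPOTrainer consume data in roughly this
# prompt/chosen/rejected shape and optimize the model to prefer "chosen".
```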
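A hedged sketch of using a strong model to perturb existing examples into synthetic variants; call_model is a hypothetical helper standing in for whichever provider client you use:

```python
# Hypothetical synthetic-data helper: ask a strong model to slightly perturb a
# base example. `call_model` is an assumed callable that returns the model's text.
def make_variants(example_question: str, call_model, n: int = 3) -> list[str]:
    prompt = (
        "Rewrite the following question in a slightly different way, keeping its meaning. "
        f"Give {n} variants, one per line.\n\n{example_question}"
    )
    response = call_model(prompt)
    return [line.strip() for line in response.splitlines() if line.strip()]
```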
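A minimal, hypothetical sketch of the function-calling loop: the model is shown tool descriptions, may emit a structured call, and your code executes it. The JSON shape here is illustrative; real APIs (OpenAI tools, Llama 3's format) define their own schemas:

```python
# Hypothetical function-calling dispatch: if the model output is a JSON tool
# call, run the named function with its arguments; otherwise treat it as text.
import json

TOOLS = {
    "run_query": lambda sql: f"(rows returned for: {sql})",  # stand-in for a real DB call
}

def handle_model_output(model_output: str) -> str:
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output                      # plain answer, no function call
    func = TOOLS[call["name"]]
    return func(**call["arguments"])

# e.g. handle_model_output('{"name": "run_query", "arguments": {"sql": "SELECT 1"}}')
```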
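Finally, a minimal sketch of a Gradio review app for that human-in-the-loop step; the ratings are just kept in a local list here, and in practice you would persist them somewhere:

```python
# Minimal human-in-the-loop review app: show a prompt/response pair and record a verdict.
import gradio as gr

ratings = []

def record_rating(prompt: str, response: str, verdict: str) -> str:
    ratings.append({"prompt": prompt, "response": response, "verdict": verdict})
    return f"Recorded ({len(ratings)} total)"

demo = gr.Interface(
    fn=record_rating,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Textbox(label="Model response"),
        gr.Radio(["good", "weak", "failure"], label="Verdict"),
    ],
    outputs=gr.Textbox(label="Status"),
)

# demo.launch()
```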