commit acb7245827b210fe24502e86ebcd85dcac598625 Author: tiffanysolomon Date: Fri Feb 28 10:39:45 2025 +0200 Add Understanding DeepSeek R1 diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md new file mode 100644 index 0000000..a2cb941 --- /dev/null +++ b/Understanding-DeepSeek-R1.md @@ -0,0 +1,92 @@ +
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.
+
What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably cheap to run, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 relies on two main ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag, before answering with a final summary.
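To make that concrete, here's a small sketch (my own, not from the paper) of how you might split an R1-style completion into its reasoning trace and final answer, assuming the `<think>...</think>` convention:

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split an R1-style completion into its chain-of-thought and final answer.

    Assumes the model wraps its reasoning in <think>...</think> tags and that
    everything after the closing tag is the user-facing summary.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()          # no explicit reasoning block
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()  # text after </think>
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>2 + 2 is 4, minus 1 is 3.</think>The answer is 3."
)
```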
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training approach: Pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multi-stage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from.
First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples (see the sketch after this list).
Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
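To illustrate the rejection-sampling step, here's a minimal sketch of the idea, assuming hypothetical `generate` and `is_correct` helpers; this is just the shape of it, not DeepSeek's actual implementation:

```python
from typing import Callable

def rejection_sample(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],   # hypothetical: returns k completions for a prompt
    is_correct: Callable[[str, str], bool],      # hypothetical: verifies a completion's final answer
    k: int = 16,
) -> list[dict]:
    """Keep only completions that pass the checker.

    The surviving (prompt, completion) pairs become SFT data for the
    next fine-tuning stage.
    """
    sft_data = []
    for prompt in prompts:
        for completion in generate(prompt, k):
            if is_correct(prompt, completion):
                sft_data.append({"prompt": prompt, "completion": completion})
    return sft_data
```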
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the response matches that of the prompt.
Not depending on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
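To give a feel for it, here's an illustrative rule-based reward in the spirit of what they describe; the specific checks and weights are mine, not DeepSeek's:

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Illustrative rule-based reward: correctness + format + language consistency.

    The rules and weights here are made up for illustration; the R1 paper only
    describes its reward rules at a high level.
    """
    reward = 0.0

    # Format: reasoning must be wrapped in <think>...</think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # Correctness: compare the text after </think> against the reference answer.
    final_answer = re.split(r"</think>", completion)[-1].strip()
    if final_answer == reference_answer.strip():
        reward += 1.0

    # Language consistency: penalize responses whose script differs from the
    # prompt's (a very crude proxy check).
    if prompt.isascii() and not completion.isascii():
        reward -= 0.5

    return reward
```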
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates multiple responses.
2. Each response receives a scalar reward based on factors like correctness, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others (see the sketch below).
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't stray too far from its original behavior.
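The core of step 3 is the group-relative advantage. Here's a compact sketch of just that normalization (the clipped policy update and KL penalty from step 4 are omitted):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each sampled response relative to the other responses for the
    same prompt: advantage_i = (r_i - mean(r)) / std(r).

    The policy is then nudged toward responses with positive advantages via a
    PPO-style clipped objective plus a KL penalty, which this sketch leaves out.
    """
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean_r) / std_r for r in rewards]

# Four sampled answers to one prompt, scored by a rule-based reward:
print(group_relative_advantages([1.5, 0.5, 0.0, 1.5]))
```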
+
A neat aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource (see the sketch below).
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
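For reference, a minimal GRPO run with TRL might look roughly like this; the argument names follow recent TRL releases and may differ slightly in your version, and the dataset and reward function are just placeholders:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any dataset with a "prompt" column works; this one is used in the TRL docs.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_short_and_tagged(completions, **kwargs):
    # Toy rule-based reward: favor completions that contain a <think> block
    # and are not overly long.
    return [1.0 * ("<think>" in c) - 0.001 * len(c) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",       # a small model, just to keep the sketch runnable
    reward_funcs=reward_short_and_tagged,
    args=GRPOConfig(output_dir="grpo-sketch"),
    train_dataset=dataset,
)
trainer.train()
```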
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the techniques they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the pretrained knowledge.
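One way to make this concrete is the standard pass@k estimator: if the base model already produces a correct answer somewhere among k samples, RL mostly has to push that answer to the top of the distribution. A small sketch with illustrative numbers:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples is correct, given c correct out of n generated."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A base model that solves a problem in only 12 of 64 samples already has a
# high pass@16, even if its single greedy answer is usually wrong:
print(pass_at_k(n=64, c=12, k=16))   # ≈ 0.98
```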
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
671B via llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this configuration.
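For reference, a roughly equivalent partial-offload setup through the llama-cpp-python bindings might look like this; the model path and context size are placeholders, and the 4-bit KV-cache option is left out of this sketch:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder path to the Unsloth 1.58-bit quant
    n_gpu_layers=29,                         # offload 29 layers to the GPU, the rest stays on CPU
    n_ctx=4096,
)

out = llm("What is 7 * 13? Think step by step.", max_tokens=256)
print(out["choices"][0]["text"])
```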
+
Performance:
+
A r/localllama user described how they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't really usable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU utilization shoots up here, as expected compared to the mostly CPU-powered run of 671B showcased above.
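For reference, querying the 70B distill through Ollama's Python client might look something like this; the model tag assumes you've pulled it with `ollama pull deepseek-r1:70b`:

```python
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "How many primes are there below 30?"}],
)
# The returned content includes the <think> block followed by the final answer.
print(response["message"]["content"])
```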
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
The Illustrated DeepSeek-R1 - by Jay Alammar
Explainer: What's R1 & Everything Else? - Tim Kellogg
DeepSeek R1 Explained to your grandmother - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper presents DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently discovered and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file