Add Understanding DeepSeek R1

2025-02-28 10:39:45 +02:00 · 2025-02-28 10:39:45 +02:00 · acb7245827
commit acb7245827
1 changed files with 92 additions and 0 deletions
--- a/Understanding-DeepSeek-R1.md
+++ b/Understanding-DeepSeek-R1.md
@@ -0,0 +1,92 @@
+<br>DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model in many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.<br>
+<br>What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
+The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).<br>
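+<br>To make the price gap concrete, here is a minimal sketch that applies the per-million-token prices quoted above to one workload. The workload size (2M input tokens, 1M output tokens) and the names `cost_usd` and `workload` are assumptions made for illustration only.<br>

```python
# Prices in USD per million tokens, as quoted in the text above.
R1_INPUT_RANGE = (0.14, 0.55)
R1_OUTPUT = 2.19
O1_INPUT = 15.0
O1_OUTPUT = 60.0


def cost_usd(input_tokens, output_tokens, input_price, output_price):
    """Cost of one workload given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 1M output tokens.
workload = (2_000_000, 1_000_000)
r1_low = cost_usd(*workload, R1_INPUT_RANGE[0], R1_OUTPUT)
r1_high = cost_usd(*workload, R1_INPUT_RANGE[1], R1_OUTPUT)
o1 = cost_usd(*workload, O1_INPUT, O1_OUTPUT)

print(f"R1: ${r1_low:.2f} to ${r1_high:.2f}, o1: ${o1:.2f}")
print(f"o1 costs {o1 / r1_high:.1f}x to {o1 / r1_low:.1f}x as much")
```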
+<br>Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.<br>
+<br>The Essentials<br>
+<br>The DeepSeek-R1 paper presented multiple models, but chief among them were R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.<br>
+<br>DeepSeek-R1 uses two major ideas:<br>
+<br>1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
+2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.<br>
+<br>R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag, before answering with a final summary.<br>
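+<br>As a rough illustration of that output shape, the sketch below separates the reasoning from the final summary. It assumes the reasoning is wrapped in `<think>...</think>` and that the summary follows the closing tag; `split_reasoning` and the example string are made up for this sketch and are not part of any DeepSeek tooling.<br>

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> and the final
# summary follows the closing tag.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(completion):
    """Return (reasoning, answer); reasoning is None if no tag is found."""
    match = THINK_PATTERN.search(completion)
    if match is None:
        return None, completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer


example = "<think>2 + 2 is 4, and 4 * 3 is 12.</think>The result is 12."
reasoning, answer = split_reasoning(example)
print("reasoning:", reasoning)
print("answer:", answer)
```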
+<br>R1-Zero vs R1<br>
+<br>R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
+R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.<br>
+<br>It is interesting how some languages may express certain ideas better, which leads the model to choose the most expressive language for the task.<br>
+<br>Training Pipeline<br>
+<br>The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each phase. This includes the problems that the resulting models from each phase have, and how they solved them in the next phase.<br>
+<br>It's interesting that their training pipeline differs from the usual:<br>
+<br>The usual training strategy: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
+R1-Zero: Pretrained → RL
+R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages<br>
+<br>1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL.
+2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
+3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
+4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
+5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
+They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.<br>
+<br>Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
+The teacher is typically a larger model than the student.<br>
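+<br>The rejection-sampling step in the pipeline above can be sketched as follows: sample several completions per prompt and keep only those that pass a check. This is a toy illustration under stated assumptions, not DeepSeek's implementation; `rejection_sample`, `toy_generate`, and `toy_check` are hypothetical stand-ins, and the "model" here just adds integers with occasional errors.<br>

```python
import random


def rejection_sample(prompts, generate, is_acceptable, samples_per_prompt=4):
    """Keep only generated completions that pass the acceptance check."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)
            if is_acceptable(prompt, completion):
                kept.append({"prompt": prompt, "completion": completion})
    return kept


# Toy stand-ins (hypothetical): a "model" that adds two integers and is
# off by one roughly a third of the time.
rng = random.Random(0)


def toy_generate(prompt):
    a, b = (int(x) for x in prompt.split("+"))
    return str(a + b + rng.choice([0, 0, 1]))


def toy_check(prompt, completion):
    a, b = (int(x) for x in prompt.split("+"))
    return completion == str(a + b)


sft_data = rejection_sample(["1+2", "10+5"], toy_generate, toy_check)
print(len(sft_data), "accepted samples out of 8 generated")
```
+<br>The same shape applies to distillation: the accepted prompt/completion pairs from a stronger model become supervised training data for a smaller one.<br>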
+<br>Group Relative Policy Optimization (GRPO)<br>
+<br>The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
+They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.<br>
+<br>In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
+Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.<br>
+<br>What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
+Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt.
+Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.<br>
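+<br>A minimal sketch of what such a rule-based reward could look like is shown below. The tag format (`<think>...</think>` followed by the answer), the exact-match grading, and the 0.5 / 1.0 weights are all assumptions chosen for illustration; the paper's actual reward rules are not reproduced here, and the language-consistency check is left out because it would need a language detector.<br>

```python
import re

# Assumed output format: <think>reasoning</think> followed by the answer.
FORMAT_PATTERN = re.compile(r"^<think>.+?</think>\s*(?P<answer>.+)$", re.DOTALL)


def rule_based_reward(completion, reference_answer):
    """Sum of a format reward and an accuracy reward (weights are illustrative)."""
    match = FORMAT_PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # wrong format: no format reward and nothing to grade
    reward = 0.5  # format reward
    if match.group("answer").strip() == reference_answer.strip():
        reward += 1.0  # accuracy reward
    return reward


print(rule_based_reward("<think>3 * 4 = 12</think>12", "12"))
print(rule_based_reward("<think>a guess</think>13", "12"))
print(rule_based_reward("12", "12"))
```
+<br>Because the reward is plain code, scoring a completion costs almost nothing compared with running a learned reward model.<br>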
+<br>GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:<br>
+<br>1. For each input prompt, the model generates different responses.
+2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
+3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
+4. The model updates its strategy slightly to favor responses with higher relative rewards. It only makes slight adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.<br>
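+<br>Steps 3 and 4 above can be sketched in a few lines. This is a simplified, sequence-level illustration rather than the paper's exact objective: it normalizes rewards within one group (using the population standard deviation, which is my choice here) and applies a PPO-style clipped ratio. The KL penalty is omitted, and `group_relative_advantages`, `clipped_term`, and the sample rewards are hypothetical.<br>

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one response (simplified)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# One scalar reward per sampled response to the same prompt.
rewards = [1.5, 0.5, 0.0, 1.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# A response whose probability doubled gets its gain capped by the clip.
print(clipped_term(math.log(2.0), 0.0, advantage=1.0))
```
+<br>Since the baseline is the group mean, no separate critic network is needed to estimate how good a response was expected to be.<br>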
+<br>A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.<br>
+<br>While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).<br>
+<br>For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
+Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.<br>
+<br>Is RL on LLMs the path to AGI?<br>
+<br>As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.<br>
+<br>These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.<br>
+<br>In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.<br>
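+<br>A toy calculation shows why these two views can diverge. Assuming independent samples, the chance that at least one of k samples is correct is 1 - (1 - p)^k, where p is the per-sample success rate. The values of p below are invented purely for illustration and are not measurements from either paper.<br>

```python
def pass_at_k(p_correct, k):
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k


# Hypothetical per-sample success rates before and after RL fine-tuning.
for label, p in [("before RL", 0.2), ("after RL", 0.6)]:
    print(f"{label}: pass@1={pass_at_k(p, 1):.2f}, pass@64={pass_at_k(p, 64):.4f}")
```
+<br>With these made-up numbers, single-sample accuracy triples while the 64-sample coverage barely moves, which is the pattern the passage describes: the correct answer was already reachable, and RL mostly makes it more likely to come out first.<br>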
+<br>This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
+Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the model's pretrained knowledge.<br>
+<br>It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!<br>
+<br>Running DeepSeek-R1<br>
+<br>I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.<br>
+<br>Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.<br>
+<br>I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
+The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.<br>
+<br>671B via Llama.cpp<br>
+<br>DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:<br>
+<br>29 layers seemed to be the sweet spot given this configuration.<br>
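+<br>A back-of-the-envelope estimate helps explain why only part of the model can sit on the GPU. The sketch below treats 1.58 bits as a uniform average over all 671B parameters and assumes an 80 GB H100; both are simplifications on my part, and it ignores the KV cache and activations entirely.<br>

```python
def weights_size_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


GPU_MEMORY_GB = 80  # assumption: an 80 GB H100 variant

r1_671b = weights_size_gb(671e9, 1.58)
print(f"671B at 1.58-bit: ~{r1_671b:.0f} GB of weights")
print(f"Share that could fit in GPU memory: at most {GPU_MEMORY_GB / r1_671b:.0%}")
```
+<br>The remainder has to live in system RAM and run on the CPU, which is why the number of offloaded layers becomes the main tuning knob.<br>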
+<br>Performance:<br>
+<br>A r/localllama user described that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU on their local gaming setup.
+Digital Spaceport wrote a full guide on how to run Deepseek R1 671b fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.<br>
+<br>As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.<br>
+<br>What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than other models, but their usefulness is also usually higher.
+We need to both maximize usefulness and minimize time-to-usefulness.<br>
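+<br>To put time-to-usefulness in numbers, here is a small sketch that converts the generation speeds mentioned above into wall-clock time. The response size (1,000 reasoning tokens plus a 200-token answer) is a made-up example, and `seconds_to_answer` is a hypothetical helper.<br>

```python
def seconds_to_answer(reasoning_tokens, answer_tokens, tokens_per_second):
    """Wall-clock time until the full answer has been generated."""
    return (reasoning_tokens + answer_tokens) / tokens_per_second


# Hypothetical response: 1,000 reasoning tokens followed by a 200-token answer.
for speed in (2.0, 4.25):
    minutes = seconds_to_answer(1000, 200, speed) / 60
    print(f"{speed} tok/s -> {minutes:.1f} min")
```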
+<br>70B via Ollama<br>
+<br>70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:<br>
+<br>GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.<br>
+<br>Resources<br>
+<br>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
+[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
+DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
+DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
+The Illustrated DeepSeek-R1 - by Jay Alammar.
+Explainer: What's R1 & Everything Else? - Tim Kellogg.
+DeepSeek R1 Explained to your grandma - YouTube<br>
+<br>DeepSeek<br>
+<br>- Try R1 at chat.deepseek.com.
+- GitHub - deepseek-ai/DeepSeek-R1.
+- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
+- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
+- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
+- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
+- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
+- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
+- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.<br>
+<br>Interesting events<br>
+<br>- Hong Kong University replicates R1 results (Jan 25, '25).
+- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
+- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1<br>