diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md
new file mode 100644
index 0000000..df69c9f
--- /dev/null
+++ b/Understanding-DeepSeek-R1.md
@@ -0,0 +1,92 @@
+
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 particularly interesting is its transparency. Unlike the more closed approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper.
+The model is also remarkably affordable, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
+
Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
The Essentials
+
The DeepSeek-R1 paper presented several models, but the main ones are R1 and R1-Zero. Alongside these is a series of distilled models that, while interesting, I won't cover here.
+
DeepSeek-R1 relies on two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
+2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they perform Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a <think> tag before responding with a final summary, e.g. `<think> ...step-by-step reasoning... </think>` followed by the answer itself.
+
R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
+R1-Zero attains excellent accuracy but often produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by adding limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting that some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
+
Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It shows how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
The usual training approach: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
+R1-Zero: Pretrained → RL
+R1: Pretrained → Multi-stage training pipeline with multiple SFT and RL stages
+
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from.
+First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were close to convergence in the RL process, they moved on to the next step. The result of this step is a strong reasoning model, but one with weak general abilities, e.g., poor formatting and language mixing.
+Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
+Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
+Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
+They also performed model distillation for several Qwen and Llama models on the 800k samples to get the distilled-R1 models.
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
+The teacher is typically a larger model than the student.
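As a rough sketch of the idea (my illustration, not DeepSeek's actual distillation code; the `generate` callable is a placeholder for whatever wraps the teacher model), distillation here boils down to sampling reasoning traces from the teacher and then running plain supervised fine-tuning on the student:

```python
from typing import Callable, Dict, List

def build_distillation_set(generate: Callable[[str], str], prompts: List[str]) -> List[Dict[str, str]]:
    """Have the teacher produce full reasoning traces + answers for each prompt.

    `generate` is whatever function wraps the teacher (e.g. an API call to R1).
    The student (a Qwen or Llama base model) is then fine-tuned with an ordinary
    next-token-prediction loss on these completions; notably, per the paper,
    no RL is applied to the distilled students.
    """
    return [{"prompt": p, "completion": generate(p)} for p in prompts]
```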
+
Group Relative Policy Optimization (GRPO)
+
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
+They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
+Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
+Instead of depending on expensive external models or human-graded examples as in standard RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt.
+Not relying on a reward model also means you don't have to spend time and effort training one, and it doesn't take memory and compute away from your main model.
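To make that concrete, here is a minimal sketch of what such a rule-based reward could look like (my own illustration, not DeepSeek's actual reward code; the exact tags, checks, and weights are assumptions):

```python
import re

def rule_based_reward(prompt: str, response: str, reference_answer: str) -> float:
    """Toy rule-based reward combining correctness, formatting, and language consistency."""
    reward = 0.0

    # Formatting: the chain-of-thought should be wrapped in <think>...</think>.
    match = re.search(r"<think>(.*?)</think>\s*(.*)", response, flags=re.DOTALL)
    if match:
        reward += 0.5
        final_answer = match.group(2)
    else:
        final_answer = response

    # Correctness: exact match against a known reference (works for math/code
    # tasks where the answer can be checked automatically).
    if final_answer.strip() == reference_answer.strip():
        reward += 1.0

    # Language consistency: crude check that the answer stays in the prompt's script.
    if final_answer.isascii() == prompt.isascii():
        reward += 0.25

    return reward
```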
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates a group of different responses.
+2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
+3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
+4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.
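Steps 3 and 4 are the core of GRPO. Here is a minimal sketch of the group-relative advantage and the clipped policy update (simplified: the KL penalty against a reference model and the per-token bookkeeping are omitted, and the clipping threshold is just the common PPO default):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: standardize rewards within the group of
    responses sampled for the same prompt, so no learned critic is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_policy_loss(logprobs_new: torch.Tensor,
                     logprobs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective: small, bounded policy updates toward
    responses with higher relative advantage."""
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```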
+
A cool aspect of GRPO is its versatility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the thinking-tag syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
+Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
+
Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
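One way to read "boosting the correct response from TopK": compare pass@K (is the right answer anywhere among K samples?) with majority-vote accuracy over the same K samples (does the right answer dominate?). A small illustration of the two metrics (mine, not code from the paper):

```python
from collections import Counter
from typing import List

def pass_at_k(samples: List[str], reference: str) -> bool:
    """True if *any* of the K sampled answers is correct: a proxy for whether
    the capability exists somewhere in the model's output distribution."""
    return any(s.strip() == reference for s in samples)

def maj_at_k(samples: List[str], reference: str) -> bool:
    """True if the most frequent sampled answer is correct: a proxy for whether
    the model concentrates probability mass on the right answer."""
    top_answer, _ = Counter(s.strip() for s in samples).most_common(1)[0]
    return top_answer == reference

# The DeepSeekMath observation, paraphrased: RL tends to improve maj@K
# (the right answer floats to the top) much more than pass@K (the right
# answer was usually already reachable by the pretrained model).
```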
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
+Consequently, while RL methods such as PPO and GRPO can produce considerable performance gains, there seems to be an inherent ceiling determined by the underlying model's pretrained knowledge.
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
Running DeepSeek-R1
+
I have used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
+The main goal was to see how the model would perform when deployed on a single H100 GPU, not to thoroughly test the model's capabilities.
+
671B via Llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this setup.
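For reference, here is roughly how such a partial-offload setup can be driven from Python via the llama-cpp-python bindings (a sketch under assumptions: my run used llama.cpp directly, and the GGUF filename below is a placeholder for the Unsloth 1.58-bit shards):

```python
from llama_cpp import Llama

# Placeholder filename for the Unsloth UD-IQ1_S (1.58-bit) GGUF shards.
llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=29,  # partial offload: 29 layers on the H100, the rest stay on CPU
    n_ctx=8192,       # context length; raise or lower depending on RAM/VRAM headroom
)

# Quick smoke test; generation speed, not answer quality, is what matters here.
out = llm("Q: What is 7 * 8? A:", max_tokens=32)
print(out["choices"][0]["text"])
```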
+
Performance:
+
An r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
+Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
+We need to both maximize usefulness and minimize time-to-usefulness.
+
70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU usage shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
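For completeness, the same model can also be driven from Python with the `ollama` client (a sketch, assuming the `deepseek-r1:70b` tag from the Ollama library and a local Ollama server with the model already pulled; my run used Ollama directly):

```python
import ollama  # pip install ollama

# Stream tokens from the 70B distilled R1 model so the thinking section is
# visible while it is being generated.
stream = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "How many prime numbers are there below 100?"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
```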
+
Resources
+
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
+[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
+DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
+DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
+The Illustrated DeepSeek-R1 - by Jay Alammar.
+Explainer: What's R1 & Everything Else? - Tim Kellogg.
+DeepSeek R1 Explained to your grandmother - YouTube
+
DeepSeek
+
- Try R1 at chat.deepseek.com.
+GitHub - deepseek-ai/DeepSeek-R1.
+deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
+DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
+DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
+DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper explores scaling laws and presents findings that facilitate the scaling of large-scale models in open-source settings. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
+DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This work introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
+DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
+DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo on code-specific tasks.
+
Interesting events
+
- Hong Kong University replicates R1 results (Jan 25, '25).
+- Hugging Face announces huggingface/open-r1, a fully open-source reproduction of DeepSeek-R1 (Jan 25, '25).
+- OpenAI researcher confirms that the DeepSeek team independently discovered and used some core ideas that the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file