nyuuzyou (PRO)
AI & ML interests: None yet
Recent Activity
liked a dataset 2 days ago: ajibawa-2023/Java-Code-Large
updated a collection 2 days ago: Code
reacted to ajibawa-2023's post with 🔥 2 days ago
ajibawa-2023/JavaScript-Code-Large
JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.
By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.
JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments.
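A minimal sketch of sampling the corpus with the Hugging Face `datasets` library in streaming mode, so nothing close to the full ~5 million files has to be downloaded; the column names are not taken from the dataset card, so inspect a record before relying on them:

```python
# Minimal sketch: stream a few records from JavaScript-Code-Large instead of
# downloading the whole corpus. Column names are not confirmed here, so the
# loop prints the keys of each record before anything else.
from datasets import load_dataset

ds = load_dataset("ajibawa-2023/JavaScript-Code-Large", split="train", streaming=True)

for i, example in enumerate(ds):
    print(sorted(example.keys()))   # inspect the actual schema
    print(str(example)[:200])       # peek at one record
    if i >= 2:                      # stop after a handful of samples
        break
```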
posted an update 3 days ago
🎰 Casino Benchmark: Dataset + Space
nyuuzyou/casino-benchmark
14 models faced 1,400 simulations of heads-up Blackjack and European Roulette. Shared seeds ensured every model saw identical cards and spins.
Key Stats:
- 14 models benchmarked
- 59,483 rows
- 35 MB compressed Parquet
- 35,000 scored decisions
- Full prompts, JSON responses, reasoning traces, latency
- Bankroll tracking from $1,000 start per run
Live leaderboard tracks bets, hits, stands, and risk management.
Gemini 3 Flash leads at +$3,396; Claude 4.5 Haiku trails at -$7,788.
Traces in the dataset. Leaderboard in the space.
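A hedged sketch of recomputing per-model results from the dataset; the column names (`model`, `run_id`, `bankroll`) are illustrative guesses, not the documented schema, so check `ds.column_names` first:

```python
# Hedged sketch: aggregate net bankroll per model from the benchmark rows.
# The "model", "run_id" and "bankroll" column names are guesses for
# illustration; inspect ds.column_names and adapt before trusting the output.
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("nyuuzyou/casino-benchmark", split="train")
print(ds.column_names)  # confirm the real schema first

# keep the last bankroll reading seen for each (model, run) pair
final_bankroll = {}
for row in ds:
    final_bankroll[(row["model"], row["run_id"])] = row["bankroll"]

net_per_model = defaultdict(float)
for (model, _run), bankroll in final_bankroll.items():
    net_per_model[model] += bankroll - 1000.0  # each run starts at $1,000

for model, net in sorted(net_per_model.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {net:+,.0f}")
```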
reacted to vikhyatk's post with 🔥 8 days ago
Just released a preview of Moondream 3!
moondream/moondream3-preview
This is a 9B-parameter, 2B-active MoE VLM with state-of-the-art visual reasoning capabilities.
More details in the release blog post: https://moondream.ai/blog/moondream-3-preview
reacted to mitkox's post 13 days ago
I just pushed Claude Code Agent Swarm with 20 coding agents on my desktop GPU workstation.
With local AI, I don't have the /fast CC switch, but I have /absurdlyfast:
- 100,499 tokens/second read, yeah 100k, not a typo | 811 tok/sec generation
- KV cache: 707,200 tokens
- Hardware: 5+ year old GPUs, 4x A6K gen1. It's not the car, it's the driver.
Qwen3 Coder Next AWQ with the cache at BF16. It scores 82.1% in C# on a codebase that has been in development for 29 years, vs Opus 4.5 at only 57.5%. When your codebase predates Stack Overflow, you don't need the biggest model; you need the one that actually remembers Windows 95.
My current bottleneck is my 27" monitor. Can't fit all 20 Theos on screen without squinting.
replied to ZennyKenny's post 16 days ago
SEO spam has also become a lot less noticeable. I'm hoping that the next step will be to crack down on storage and traffic abuse, and maybe that will mean more generous storage limits.
reacted to ZennyKenny's post 16 days ago
posted an update 19 days ago
Earlier I asked for a storage grant for some new datasets. One of those, the Google Code Archive nyuuzyou/google-code-archive, is now trending. Thanks to Hugging Face and the community for the support. 🤗
posted an update 21 days ago
Microsoft CodePlex Archive Dataset -
nyuuzyou/ms-codeplex-archive
Following the strong response to the Google Code Archive nyuuzyou/google-code-archive (thanks!), this release preserves another major historical repository: the Microsoft CodePlex Archive.
CodePlex served as Microsoft's primary open-source hosting platform from 2006 to 2017. This dataset captures the distinct .NET and Windows-centric development ecosystem that flourished before the industry standardized on GitHub.
Key Stats:
- 5,043,730 files from 38,087 repositories
- 3.6 GB compressed Parquet
- 91 programming languages (heavily featuring C#, ASP.NET, and C++)
- Cleaned of binaries, build artifacts, and vendor directories (node_modules, packages)
- Includes platform-specific license metadata (Ms-PL, Ms-RL)
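To pull out, say, only Ms-PL-licensed C# files, a streaming filter along these lines should work; the `language`, `license`, `path`, and `content` field names are assumptions based on the metadata listed above, so verify them against the dataset card:

```python
# Hedged sketch: stream Ms-PL-licensed C# files from the CodePlex archive.
# The "language", "license", "path" and "content" fields are assumed from the
# post's metadata description; check the dataset card for real field names.
from datasets import load_dataset

ds = load_dataset("nyuuzyou/ms-codeplex-archive", split="train", streaming=True)

mspl_csharp = ds.filter(
    lambda ex: ex.get("language") == "C#" and ex.get("license") == "Ms-PL"
)

for i, ex in enumerate(mspl_csharp):
    print(ex.get("path"), len(ex.get("content") or ""))
    if i >= 4:
        break
```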
reacted to raincandy-u's post 22 days ago
Introducing Rain-v2: Democratizing LLM training on gaming GPUs! ⚡
Following Rain-100M, we're scaling up. Rain-v2 features a larger training dataset.
We've published a comprehensive blog covering the end-to-end journey, from raw data collection to rigorous evaluation and safety testing.
HF Repo: 🤗 raincandy-u/Rain-v2
Blog: https://angelkawaii.xyz/2026/01/29/rain-v2/
Special thanks to the open-source community and the SmolLM2 team for their foundational work!
HuggingFaceTB
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model (2502.02737)
posted an update 27 days ago
NNTP Discussion Archives - 387M Messages from Public Newsgroups -
nyuuzyou/nntp-text-387m
Here's something different from the code datasets: 20+ years of public discussion archives from NNTP newsgroups. Clean Parquet format, but this time it's conversations instead of code.
Key Stats:
- 386,629,949 messages from 159,345 newsgroups
- 191 GB compressed Parquet storage
- Spans 2002-2026
- Multilingual: English, German, French, Italian, Dutch, Polish, Russian, and others
- Email addresses redacted for privacy
The data is messy in the way real discussions are messy. Spam wasn't filtered out - you get the advertisements, the arguments, the off-topic threads, all of it. If you want sanitized text, this isn't it. If you want to see how people actually talked online before Discord and Reddit took over, here you go.
Processing kept it simple: convert everything to UTF-8, remove exact duplicates, strip binary attachments, redact emails. Legacy character encodings were a nightmare - had to handle Windows-1252, ISO-8859 variants, KOI8-R, Shift-JIS, GBK, and others just to get readable text. At least it was fun to do, and I think the result turned out pretty well. I hope someone else will also be able to have fun or gain something useful from this project.
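The encoding cleanup described above can be approximated with a small transcoding helper. This is a generic sketch of the idea (declared charset first, then a fallback chain over the encodings mentioned), not the pipeline actually used to build the dataset:

```python
# Generic sketch of legacy-encoding normalization, not the dataset's actual
# pipeline: try the message's declared charset, then fall back through common
# legacy encodings. Note that single-byte codecs like windows-1252 decode
# almost any byte sequence, so a bare fallback chain can silently mis-decode;
# a detector such as charset_normalizer is more robust in practice.
FALLBACKS = ["utf-8", "windows-1252", "iso-8859-2", "koi8-r", "shift_jis", "gbk"]

def to_utf8(raw: bytes, declared: str | None = None) -> str:
    candidates = ([declared] if declared else []) + FALLBACKS
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # last resort: keep whatever decodes, replace the rest
    return raw.decode("utf-8", errors="replace")

# A message whose Content-Type header declared charset=koi8-r:
print(to_utf8("пример".encode("koi8-r"), declared="koi8-r"))  # -> пример
```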
reacted to raincandy-u's post with 🔥 28 days ago
Just released Rain-100M, an experimental ~97M-parameter Qwen3-style language model trained from random initialization.
Repo: raincandy-u/Rain-100M
Data: HuggingFaceFW/fineweb-edu, ~3B tokens, English only
Tokenizer: custom 16k BPE, context length 4096
Architecture: 12 Transformer layers, hidden size 768, 12 heads, MLP 2048, SiLU, bf16
Rain-100M is a raw base model (not instruction-tuned or safety-aligned), aimed at small-scale research, debugging training pipelines, and CPU/edge experiments. If you run evaluations, finetunes, or visualizations with it, I would be very interested in your results!
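A minimal loading sketch with transformers, not an official usage snippet; since Rain-100M is a raw base model, expect plain text continuation rather than instruction following:

```python
# Minimal sketch for sampling from the raw base model; not an official usage
# example. A base model continues text, it does not follow chat instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "raincandy-u/Rain-100M"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tok("The water cycle begins when", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tok.decode(out[0], skip_special_tokens=True))
```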
reacted to ZennyKenny's post about 1 month ago
My new personal website is live! Check out https://kennethhamilton.me to chat with an LLM about my professional skills and personal projects.
Think of it like a really, really vain version of ChatGPT.
posted an update about 1 month ago
Google Code Archive Dataset -
nyuuzyou/google-code-archive
Expanding beyond the modern code series, this release presents a massive historical snapshot from the Google Code Archive. This dataset captures the open-source landscape from 2006 to 2016, offering a unique time capsule of software development patterns during the era before GitHub's dominance.
Key Stats:
- 65,825,565 files from 488,618 repositories
- 47 GB compressed Parquet storage
- 454 programming languages (heavily featuring Java, PHP, and C++)
- Extensive quality filtering (excluding vendor code and build artifacts)
- Rich historical metadata: original repo names, file paths, and era-specific licenses
This is one of those releases that I'm most interested in getting feedback on. Would you like to see more old code datasets?
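If you would rather mirror a subset locally than stream the whole 47 GB, `huggingface_hub` can fetch only part of the repo; the Parquet path pattern below is a guess at the file layout, so list the repo files first and adjust:

```python
# Hedged sketch: download only a slice of the dataset repo instead of all
# 47 GB. The "data/*-00000-*.parquet" pattern is a guess at the layout; use
# the list_repo_files() output to pick real shard paths.
from huggingface_hub import list_repo_files, snapshot_download

repo_id = "nyuuzyou/google-code-archive"
print(list_repo_files(repo_id, repo_type="dataset")[:10])  # inspect layout

local_dir = snapshot_download(
    repo_id,
    repo_type="dataset",
    allow_patterns=["*.md", "data/*-00000-*.parquet"],  # README + first shard(s)
)
print("downloaded to", local_dir)
```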
reacted to ZennyKenny's post with 🔥 about 1 month ago
posted an update about 1 month ago
🇨🇳 GitCode Dataset - Continuing the Chinese Code Series
nyuuzyou/gitcode-code
Following up on the Gitee release, here's another major Chinese code dataset from GitCode (CSDN's code hosting platform). Same pipeline, same clean format, more valuable data from China's developer ecosystem.
Key Stats:
- 48,142,567 files from 85,632 repositories
- 40 GB compressed Parquet storage
- 537 programming languages
- Extensive quality filtering applied
- Rich metadata: repo names, file paths, licenses, and sizes
The final dataset in the Chinese code series is also available: nyuuzyou/jihulab-code. It's smaller in size but shares the same pipeline and formatting.
replied to Ujjwal-Tyagi's post about 1 month ago
Glad you are finding it useful! You should also check out these datasets:
https://huggingface.co/datasets/nyuuzyou/gitcode-code
https://huggingface.co/datasets/nyuuzyou/jihulab-code
They use the same data processing pipeline and format, but they are sourced from different Chinese services.
reacted to Ujjwal-Tyagi's post about 1 month ago
I am very excited to see the release of nyuuzyou/gitee-code. This is exactly what I have been looking for. Thank you to @nyuuzyou for his hard work on this.
reacted to codelion's post with 🔥 about 1 month ago
Introducing Dhara-70M: A diffusion language model that achieves 3.8x higher throughput than autoregressive models!
Key findings from our research on optimal architectures for small language models:
- Depth beats width: 32 layers outperform 12 layers at the same parameter count
- Best-in-class factuality: 47.5% on TruthfulQA
- 10x training efficiency using WSD (Warmup-Stable-Decay) conversion
- Canon layers add only 0.13% parameters but improve reasoning
We trained on 1B tokens using the optimal 50-30-20 dataset mix (PDFs + filtered web + educational content), then converted to diffusion with just 100M additional tokens.
Blog: https://huggingface.co/blog/codelion/optimal-model-architecture
Model: codelion/dhara-70m
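As background on the WSD (Warmup-Stable-Decay) schedule mentioned above: it is a learning-rate schedule with a linear warmup, a long constant plateau, and a short final decay. A sketch with illustrative phase lengths and peak LR, not Dhara-70M's actual hyperparameters:

```python
# Illustrative warmup-stable-decay (WSD) learning-rate schedule. The phase
# lengths and peak LR are made-up values for demonstration, not the
# hyperparameters used to train Dhara-70M.
def wsd_lr(step: int, peak_lr: float = 3e-4,
           warmup: int = 1_000, stable: int = 8_000, decay: int = 1_000) -> float:
    if step < warmup:                       # linear warmup to the peak LR
        return peak_lr * step / warmup
    if step < warmup + stable:              # long constant plateau
        return peak_lr
    if step < warmup + stable + decay:      # short linear decay to zero
        remaining = warmup + stable + decay - step
        return peak_lr * remaining / decay
    return 0.0

for s in (0, 500, 1_000, 5_000, 9_500, 10_000):
    print(s, f"{wsd_lr(s):.2e}")
```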
posted an update about 1 month ago
🇨🇳 Gitee Code Dataset - The Missing Piece of the Stack
nyuuzyou/gitee-code
Gitee is not included in the Software Heritage archive, meaning it is currently missing from datasets like The Stack. This release fills that massive gap, serving as the largest Chinese code dataset and one of the largest code corpora overall.
- 819,472,785 files from 3,105,923 repositories
- 536 GB compressed Parquet storage
- 554 programming languages
- Extensive quality filtering: Removed vendor code, artifacts, and generated files
- Rich Chinese language understanding: High volume of Chinese comments and docs
Huge thanks to Hugging Face for the storage grant that made hosting this (and all my other datasets) possible!
I have also already dropped several other new code datasets and rolled out QoL improvements for older ones. I will be dropping posts on those throughout the week.
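Given the 536 GB footprint, streaming is the practical way to sample it; a hedged sketch that builds a rough language histogram over the first few thousand rows (the `language` field name follows the metadata description above and should be checked against the dataset card):

```python
# Hedged sketch: rough language histogram over a small streamed sample of the
# 536 GB corpus. The "language" field name is assumed from the post's metadata
# description; verify it on the dataset card before relying on the counts.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("nyuuzyou/gitee-code", split="train", streaming=True)

counts = Counter()
for i, ex in enumerate(ds):
    counts[ex.get("language", "unknown")] += 1
    if i >= 4_999:  # sample only; the full corpus has ~819M files
        break

for lang, n in counts.most_common(10):
    print(f"{lang:>12}  {n}")
```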