Toy GPT

A family of small repositories that demonstrate how language models are trained and how model behavior changes with architecture and data.

This organization is designed for illustration, experimentation, and structural understanding; the repositories are not suitable for production use. Several corpora are intentionally neutral, serving as negative controls to show when additional context provides little or no benefit.

  • App: interactive visualization of next-word (token) prediction in GPT-style language models.
  • App repo: toy-gpt-chat

Matrix (Model – Corpus)

| Training repo | Training corpus | Notes (corpus type) |
| --- | --- | --- |
| train-100-unigram | 000_cat_dog.txt | single-token model (neutral) |
| train-200-bigram | 000_cat_dog.txt | conditional next-token with 2-token stats (neutral) |
| train-300-context-2 | 000_cat_dog.txt | canonical context-2 (neutral) |
| train-400-context-3 | 000_cat_dog.txt | canonical context-3 (neutral) |
| train-100-unigram-animals | 001_animals.txt | single-token model (structured) |
| train-200-bigram-animals | 001_animals.txt | conditional next-token with 2-token stats (structured) |
| train-300-context-2-animals | 001_animals.txt | canonical context-2 (structured) |
| train-400-context-3-animals | 001_animals.txt | canonical context-3 (structured) |
| train-301-context-2-llm-glossary | 010_llm_glossary.txt | same architecture as 300, different corpus |
| train-302-context-2-repo-tour | 020_repo_tour.txt | same architecture as 300, different corpus |
| train-500-embeddings | 030_analytics.txt | canonical embeddings demo corpus |
| train-600-attention | 030_analytics.txt | same corpus as 500, to isolate the effect of the architecture change |

Grid View

| Corpus ID | Description | 100 unigram | 200 bigram | 300 context-2 | 400 context-3 | 500 embeddings | 600 attention |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 000_cat_dog | small neutral (low-signal) corpus | X | X | X | X | | |
| 001_animals | small structured (proximity-signal) corpus | X | X | X | X | | |
| 010_llm_glossary | language model definitions | | | X (301) | | | |
| 020_repo_tour | repo files and folders | | | X (302) | | | |
| 030_analytics | technical micro-lessons | | | | | X | X |

Legend: X indicates a completed training run with committed artifacts. Blank cells indicate combinations intentionally not explored.

Model numbering

Model codes increase with architectural complexity:

  • 100/unigram: predicts the next token using only overall frequency, without considering surrounding context.
  • 200/bigram: predicts the next token conditioned on the immediately preceding token (a minimal counting sketch of the unigram and bigram cases follows this list).
  • 300/context-2: predicts the next token using the previous two tokens as context. Architecturally equivalent to a bigram model, but implemented in a way that generalizes to longer contexts.
  • 400/context-3: predicts the next token using the previous three tokens as context.
  • 500/embeddings: learns numeric vector representations of tokens so related tokens occupy nearby locations in a learned space.
  • 600/attention: dynamically weights which prior tokens matter most when predicting the next token, rather than treating all context equally (see the simplified sketch below).
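
To make the difference between the 100 and 200 models concrete, here is a minimal counting sketch. The mini-corpus and function names are invented for illustration and are not taken from any of the training repos, which use the corpora listed in the matrix above.

```python
from collections import Counter, defaultdict

# Invented mini-corpus; the real repos train on files such as 000_cat_dog.txt.
corpus = "the cat sat on the mat the dog sat on the rug".split()

# 100/unigram: overall token frequencies, no context.
unigram_counts = Counter(corpus)

def unigram_predict():
    """Most frequent token overall, regardless of what came before."""
    return unigram_counts.most_common(1)[0][0]

# 200/bigram: counts of (previous token -> next token) pairs.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def bigram_predict(prev):
    """Most frequent token that followed `prev` in the corpus."""
    return bigram_counts[prev].most_common(1)[0][0]

print(unigram_predict())      # 'the' (most common token overall)
print(bigram_predict("sat"))  # 'on'  (most common successor of 'sat')
```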

See the paper Attention Is All You Need for background on attention-based models.
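
As a rough, self-contained illustration of what the 500 and 600 models add, the sketch below assigns random placeholder vectors to a toy vocabulary (standing in for learned embeddings) and computes attention weights as a softmax over dot products. Real attention layers also use learned query/key/value projections, which are omitted here; every name in the snippet is invented for illustration and is not taken from the repos.

```python
import numpy as np

rng = np.random.default_rng(0)

# 500/embeddings: each token in a toy vocabulary gets a vector. Here the
# vectors are random placeholders; the embeddings repo learns them from data.
vocab = ["the", "cat", "sat", "on", "mat"]
embeddings = {tok: rng.normal(size=4) for tok in vocab}

# 600/attention: weight the prior tokens by how strongly each relates to the
# current token, via a softmax over dot products (no learned projections here).
def attention_weights(query_vec, context_tokens):
    scores = np.array([embeddings[t] @ query_vec for t in context_tokens])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    return weights / weights.sum()

context = ["the", "cat", "sat"]
weights = attention_weights(embeddings["on"], context)
print(dict(zip(context, weights.round(2))))  # larger weight = more relevant prior token
```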

Corpus numbering

Corpus IDs are three digits and follow a project-wide standard:

  • 000–009: canonical reference corpora
  • 010–049: curated explanatory corpora for focused demonstrations
  • 100–799: additional contributions

Each corpus file is named NNN_slug.txt. The corpus ID and slug are recorded in SE_MANIFEST.toml.
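
The NNN_slug.txt convention is mechanical enough to check in code. The helper below is a sketch of how a corpus filename could be split into its ID and slug; it is not part of SE_MANIFEST.toml or any of the repositories, and the function name is invented for illustration.

```python
import re

# NNN_slug.txt: three-digit corpus ID, underscore, slug, .txt extension.
CORPUS_NAME = re.compile(r"^(?P<id>\d{3})_(?P<slug>[a-z0-9_]+)\.txt$")

def parse_corpus_filename(filename: str) -> tuple[str, str]:
    """Return (corpus_id, slug) for a name such as '001_animals.txt'."""
    match = CORPUS_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a valid corpus filename: {filename!r}")
    return match.group("id"), match.group("slug")

print(parse_corpus_filename("030_analytics.txt"))  # ('030', 'analytics')
```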

Terms

  • neutral corpus: a corpus deliberately constructed so that adjacency carries little or no predictive signal. Increasing context length (bigram, context-2, context-3) should provide minimal benefit. Used as a negative control.

  • structured corpus: a corpus in which adjacency or proximity carries meaningful predictive signal (e.g., attributes, phrases, or semantic groupings). Increasing context length should measurably improve predictions. The sketch after this list contrasts the two.

  • unigram: a model built from single-token (uni = one) statistics; it predicts the next token from overall frequency alone, without conditioning on context.

  • bigram: a model built from two-token (bi = two) statistics; it predicts the next token conditioned on the one immediately preceding token.
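
To see what "little or no predictive signal" means in practice, the sketch below builds bigram successor tables for two invented mini-corpora. They are stand-ins, not the actual 000_cat_dog.txt or 001_animals.txt files; the point is only that an evenly split successor distribution gives added context nothing to work with, while a peaked one does.

```python
from collections import Counter, defaultdict

def successor_table(tokens):
    """Count, for each token, which tokens follow it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

# Invented stand-ins, NOT the actual 000_cat_dog.txt / 001_animals.txt corpora.
neutral = "cat cat dog dog cat cat dog dog cat".split()
structured = "the cat purrs softly the dog barks loudly the cat purrs softly".split()

print(successor_table(neutral)["cat"])     # Counter({'cat': 2, 'dog': 2}) -> no signal
print(successor_table(structured)["cat"])  # Counter({'purrs': 2})         -> strong signal
```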

Purpose

This organization emphasizes clarity, reproducibility, and explicit boundaries between models, data, corpora, and claims.

Repositories

  • train-100-unigram (Python): demonstrates, at very small scale, how a language model is trained (unigram language model).
  • train-200-bigram (Python): demonstrates, at very small scale, how a language model is trained (bigram language model).
  • train-300-context-2 (Python): demonstrates, at very small scale, how a language model is trained (two-context model).
  • train-100-unigram-animals (Python): demonstrates, at very small scale, how a language model is trained (unigram language model).
  • toy-gpt-chat (Vue): interactive visualization of next-word (token) prediction in GPT-style language models.
  • train-200-bigram-animals (Python): demonstrates, at very small scale, how a language model is trained (bigram language model).
