
We Analyzed 37,000 YouTube Videos. Here's What Spoken English Actually Looks Like

182.9 million spoken words across 37,632 English YouTube videos. Just 67 words make up half of everything said. Full word and phrase frequency lists, charts, and downloadable data.

May 4, 2026 · 12 min read · ClipPhrase Team

We took the subtitles of 37,632 English YouTube videos from our search index — late-night shows, documentaries, podcasts, science explainers, news, talk shows, vlogs — and ran a frequency analysis on every word and short phrase spoken in them. 182.9 million words in total.

This article is the full writeup. All the underlying CSVs are linked at the bottom for anyone who wants to look at the raw data.

The corpus, in numbers

Videos analyzed                                              37,632
Subtitle segments                                        26,203,765
Non-speech segments skipped ([Music], [Applause], etc.)     196,433
Total spoken tokens                                     182,933,444
Unique words (vocabulary size)                              384,132
Average words per video                                       4,861
Unique two-word sequences                                11,240,282
Unique three-word sequences                              42,037,127

Half of all spoken English is 67 words

The distribution is brutally lopsided.

How many words you need to cover X% of spoken English

You need to know…    …to recognize this much of all spoken English
        67 words     50%
       505 words     75%
       906 words     80%
     1,677 words     85%
     2,900 words     89%
     3,368 words     90%
     8,381 words     95%
    36,916 words     99%

A few things worth pulling out:

  • The single word the accounts for 4.04% of all spoken English — one in twenty-five words you hear is the.
  • The top 10 words alone cover 23.2% of all speech. Almost a quarter of every native conversation runs on ten recycled tokens.
  • Returns collapse fast. Going from 89% to 95% comprehension nearly triples the required vocabulary. From 95% to 99% multiplies it by another four.
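The coverage table above is just a running cumulative sum over a frequency list sorted in descending order. A minimal sketch with toy counts (hypothetical values, not the article's data):

```python
# Given word counts sorted descending, find how many words are needed
# to cover a target share of all tokens.
def words_for_coverage(sorted_counts, target):
    total = sum(sorted_counts)
    running = 0
    for i, count in enumerate(sorted_counts, start=1):
        running += count
        if running / total >= target:
            return i
    return len(sorted_counts)

counts = [50, 20, 10, 10, 5, 5]  # toy frequencies, descending
print(words_for_coverage(counts, 0.50))  # 1 — the top word alone covers 50/100
print(words_for_coverage(counts, 0.80))  # 3 — (50+20+10)/100 = 80%
```

Run against the real top-10,000 CSV, the same loop reproduces the 67-word and 3,368-word thresholds.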

This is a sharper version of Zipf's law than what shows up in written corpora. Books spread their probability mass across more vocabulary; speech concentrates it.

Zipf distribution of spoken English from 37K YouTube videos

The plot above is on log-log axes. A pure Zipfian language would form a straight line; spoken English very nearly does, with a slight kink at the highest frequencies and the long tail of rare words trailing off below rank 10⁵.
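That kink can be quantified without the plot: under a pure Zipf law, frequency ∝ 1/rank^s with s ≈ 1, so the log-log slope between any two ranks estimates the exponent. A quick check using three counts from this article's top-50 table:

```python
import math

# Counts for ranks 1, 10, and 50 from the top-50 table in this article.
counts = {1: 7_387_237, 10: 2_521_046, 50: 592_296}

# Under frequency ~ C / rank**s, the slope between two (rank, count)
# points on log-log axes estimates the exponent s.
def zipf_exponent(r1, c1, r2, c2):
    return math.log(c1 / c2) / math.log(r2 / r1)

print(zipf_exponent(1, counts[1], 10, counts[10]))   # ≈ 0.47: flatter than Zipf at the head
print(zipf_exponent(10, counts[10], 50, counts[50])) # ≈ 0.90: near the Zipfian slope of 1
```

The shallow slope over ranks 1–10 is the "slight kink at the highest frequencies"; further down the list the exponent settles toward 1.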

Top 50 spoken words

Rank  Word       Count      Share  Cumulative
   1  the        7,387,237  4.04%   4.04%
   2  and        5,202,156  2.84%   6.88%
   3  to         4,806,242  2.63%   9.51%
   4  i          4,324,592  2.36%  11.87%
   5  a          4,264,055  2.33%  14.20%
   6  you        4,064,555  2.22%  16.43%
   7  of         3,724,277  2.04%  18.46%
   8  that       3,492,110  1.91%  20.37%
   9  it         2,634,690  1.44%  21.81%
  10  in         2,521,046  1.38%  23.19%
  11  like       2,494,184  1.36%  24.55%
  12  is         2,369,926  1.30%  25.85%
  13  this       1,781,715  0.97%  26.82%
  14  so         1,654,633  0.90%  27.73%
  15  was        1,501,038  0.82%  28.55%
  16  it's       1,410,862  0.77%  29.32%
  17  for        1,305,470  0.71%  30.03%
  18  but        1,274,194  0.70%  30.73%
  19  we         1,248,337  0.68%  31.41%
  20  on         1,216,294  0.66%  32.08%
  21  know       1,167,908  0.64%  32.71%
  22  have       1,137,673  0.62%  33.34%
  23  just       1,134,793  0.62%  33.96%
  24  what       1,032,689  0.56%  34.52%
  25  they       1,020,670  0.56%  35.08%
  26  with       1,007,653  0.55%  35.63%
  27  yeah         962,191  0.53%  36.16%
  28  be           957,783  0.52%  36.68%
  29  are          898,626  0.49%  37.17%
  30  not          872,734  0.48%  37.65%
  31  do           870,812  0.48%  38.12%
  32  i'm          821,478  0.45%  38.57%
  33  my           804,993  0.44%  39.01%
  34  all          799,543  0.44%  39.45%
  35  if           756,360  0.41%  39.86%
  36  that's       738,851  0.40%  40.27%
  37  at           732,360  0.40%  40.67%
  38  about        717,388  0.39%  41.06%
  39  he           714,407  0.39%  41.45%
  40  your         696,636  0.38%  41.83%
  41  one          695,227  0.38%  42.21%
  42  as           684,705  0.37%  42.59%
  43  or           678,871  0.37%  42.96%
  44  can          672,388  0.37%  43.32%
  45  think        654,509  0.36%  43.68%
  46  right        647,716  0.35%  44.04%
  47  don't        637,134  0.35%  44.38%
  48  me           616,944  0.34%  44.72%
  49  there        597,279  0.33%  45.05%
  50  people       592,296  0.32%  45.37%

The full top-10,000 list is available as CSV.

What stands out about this top 50 isn't what's there — the, and, to would top any English corpus — but where things rank:

  • like at #11 is a discourse marker, not the verb. In a corpus of books it would be far below this position.
  • Four contractions break into the top 50: it's (#16), i'm (#32), that's (#36), and don't (#47), with more just below the cutoff. Written corpora split these into their full forms.
  • yeah at #27 is purely conversational glue. Books barely use it.
  • know, just, and right are mostly used here as discourse-softening words (you know, I just wanted, yeah, right), not in their dictionary senses.

The top of the list is a snapshot of how speech differs from text: contractions, fillers, and hedges sit alongside articles and pronouns as load-bearing vocabulary.

Speech is built out of chunks

When we count two-word and three-word sequences instead of single words, a different structure appears. The most frequent units of spoken English are not isolated words but short, recurring phrases.

Top 50 two-word sequences

Rank  Phrase       Count    Share
   1  you know     651,659  0.42%
   2  of the       610,473  0.39%
   3  in the       597,973  0.38%
   4  going to     391,962  0.25%
   5  and i        369,069  0.24%
   6  i think      360,605  0.23%
   7  this is      354,886  0.23%
   8  to be        349,293  0.22%
   9  i was        294,749  0.19%
  10  i don't      280,165  0.18%
  11  it was       279,492  0.18%
  12  and then     279,061  0.18%
  13  to the       271,483  0.17%
  14  on the       269,698  0.17%
  15  kind of      253,890  0.16%
  16  a lot        248,787  0.16%
  17  want to      240,129  0.15%
  18  if you       239,704  0.15%
  19  you can      214,797  0.14%
  20  and the      211,577  0.13%
  21  i mean       198,883  0.13%
  22  lot of       188,401  0.12%
  23  to do        188,301  0.12%
  24  in a         185,960  0.12%
  25  is a         183,838  0.12%
  26  like a       180,615  0.12%
  27  at the       169,424  0.11%
  28  have to      168,863  0.11%
  29  one of       161,657  0.10%
  30  have a       160,163  0.10%
  31  that i       159,887  0.10%
  32  is the       159,862  0.10%
  33  you have     158,225  0.10%
  34  do you       158,154  0.10%
  35  and you      156,410  0.10%
  36  that you     150,818  0.10%
  37  for the      147,492  0.09%
  38  a little     146,585  0.09%
  39  to get       143,031  0.09%
  40  like i       141,139  0.09%
  41  so i         140,193  0.09%
  42  it is        137,325  0.09%
  43  don't know   136,714  0.09%
  44  was like     136,396  0.09%
  45  it's a       136,095  0.09%
  46  and so       135,209  0.09%
  47  of a         134,589  0.09%
  48  with the     132,177  0.08%
  49  but i        131,380  0.08%
  50  was a        126,161  0.08%

Full list: top-bigrams.csv.

Three observations:

  1. you know beats every grammatical staple. It's the single most common pair of words in spoken English — more frequent than of the or in the.
  2. The top 50 is dense with first-person constructions: and i, i think, i was, i don't, i mean. Speech is mostly about whoever is doing the speaking.
  3. kind of, a lot, a little, like a, was like — informal hedges and quotative-like constructions are everywhere in the top 50.

Top 50 three-word sequences

Rank  Phrase           Count    Share
   1  a lot of         170,961  0.13%
   2  i don't know      96,455  0.07%
   3  one of the        82,693  0.06%
   4  going to be       72,293  0.05%
   5  a little bit      64,930  0.05%
   6  i was like        60,915  0.05%
   7  i'm going to      55,940  0.04%
   8  i want to         55,071  0.04%
   9  you want to       54,908  0.04%
  10  you know what     52,925  0.04%
  11  you have to       44,985  0.03%
  12  you know i        43,538  0.03%
  13  this is a         43,457  0.03%
  14  this is the       41,664  0.03%
  15  and i think       40,214  0.03%
  16  and i was         39,340  0.03%
  17  i feel like       38,019  0.03%
  18  we're going to    35,687  0.03%
  19  oh my god         35,203  0.03%
  20  to be a           33,229  0.03%
  21  what do you       32,747  0.02%
  22  be able to        32,263  0.02%
  23  i don't think     31,986  0.02%
  24  it was a          30,717  0.02%
  25  and you know      30,321  0.02%
  26  you're going to   29,731  0.02%
  27  like you know     29,420  0.02%
  28  don't want to     29,249  0.02%
  29  some of the       28,953  0.02%
  30  is going to       28,787  0.02%
  31  i think it's      28,719  0.02%
  32  not going to      27,406  0.02%
  33  do you think      27,196  0.02%
  34  and this is       25,763  0.02%
  35  i think that      25,762  0.02%
  36  i mean i          25,419  0.02%
  37  in the world      25,310  0.02%
  38  and it was        25,303  0.02%
  39  and then i        25,091  0.02%
  40  you have a        23,988  0.02%
  41  the end of        23,885  0.02%
  42  and then you      23,471  0.02%
  43  i think i         23,393  0.02%
  44  out of the        23,054  0.02%
  45  it was like       22,869  0.02%
  46  you know the      22,783  0.02%
  47  when i was        22,755  0.02%
  48  you got to        22,220  0.02%
  49  want to be        22,218  0.02%
  50  know what i       22,117  0.02%

Full list: top-trigrams.csv.

Of the top 15 three-word sequences, ten start with a pronoun (counting the demonstrative this). Six contain an explicit first-person i. Spoken English is overwhelmingly about who's saying what to whom in real time, and the high-frequency phrases reflect that.

A few sequences worth noticing because they don't appear in formal English: i was like (#6), you know what (#10), i feel like (#17), oh my god (#19), you got to (#48). These aren't fancy idioms — they're the connective tissue of casual speech.

What the distribution implies

Three things fall out of these numbers.

The cost-effective vocabulary is small. A learner with reliable recognition of 3,000 spoken words has the linguistic raw material to follow 89% of native English on YouTube. Stretching that to 95% takes another 5,500 words — most of which appear only a handful of times across the entire 183-million-word corpus.

Frequency calibration matters more than vocabulary size. Most courses and apps treat their vocabulary lists as roughly equal. The data says otherwise: 50% of all the work is done by the first 67 words. A study schedule that doesn't reflect that is misallocated effort.

Word-by-word translation is the wrong primitive. Half the top three-word sequences are functional chunks (a lot of, i don't know, a little bit, going to be) that work as units. Recognizing them whole is a different cognitive operation from parsing them as three separate words. In live speech, the difference shows up as the difference between keeping up and not.

Methodology

The pipeline that produced these numbers:

  1. Source. Subtitles for 37,632 English YouTube videos.
  2. Tokenization. Lowercase the text, then match the regex [a-z]+(?:'[a-z]+)?. This keeps contractions like don't, gonna, it's as single tokens, drops numbers, and ignores punctuation.
  3. Noise filtering. Segments that match ^\[.*\]$ (e.g. [Music], [Applause], [Inaudible]) are skipped before tokenization. This removed 196,433 segments.
  4. Counting. For unigrams, every token is counted. For bigrams and trigrams, every adjacent N-token sequence within a single subtitle segment is counted; sequences are not allowed to cross segment boundaries.
  5. Cumulative shares are computed by sorting by count and summing.
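Steps 2–4 above fit in a few lines of Python. This is a minimal reimplementation of the described logic, not the production pipeline, and the sample segments are made up:

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # keeps contractions as one token
NOISE_RE = re.compile(r"^\[.*\]$")            # [Music], [Applause], [Inaudible]

def count_ngrams(segments, n):
    """Count n-grams per subtitle segment; sequences never cross segments."""
    counts = Counter()
    for seg in segments:
        if NOISE_RE.match(seg.strip()):
            continue  # skip non-speech segments before tokenization
        tokens = TOKEN_RE.findall(seg.lower())
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

segments = ["[Music]", "You know, I don't know what it's like."]
print(count_ngrams(segments, 1).most_common(3))
print(count_ngrams(segments, 2)["you know"])  # 1
```

Note how the regex drops the punctuation and keeps don't and it's intact, and how the [Music] segment contributes nothing to the counts.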

What this analysis does not do:

  • No lemmatization. go, going, went, and gone are counted as separate vocabulary items. This is appropriate for measuring what a learner actually has to recognize on hearing, but it inflates the raw vocabulary count compared to a lemmatized analysis.
  • No part-of-speech tagging. like the verb and like the discourse marker are counted together.
  • No filtering of auto-generated captions. Some videos have human-edited subtitles, others have auto-generated captions; the latter introduce some transcription noise, particularly in the long tail.

What the analysis is reliable for: the shape of the distribution and the identity of the high-frequency words and phrases. The top of the list — the, and, to, I, like, it's, you know, i don't know — survives any reasonable cleanup.

Caveats on the corpus

The 37,632 videos are not a random sample of all spoken English. They are a curated set of popular English YouTube channels collected to power ClipPhrase, our search engine for phrases in real video clips. The corpus skews:

  • American English. Most channels are US-based.
  • People who speak professionally. Late-night hosts, podcasters, YouTubers, news anchors — not a representative cross-section of casual private speech.
  • Popular content. Channels were selected for view counts and broad cultural reach, not for variety of dialect or register.

These caveats narrow what the numbers strictly demonstrate. They don't change the shape of the distribution or the qualitative finding that spoken English concentrates probability mass on a tiny vocabulary of high-frequency function words and chunks.

Try it yourself

The corpus this analysis ran on is also a search index. Every word and phrase mentioned in this article exists in tens of thousands of real video clips, retrievable by query. Type I was like into ClipPhrase and you get fifty different speakers using it; type gonna and you get a few thousand. That's the underlying tool.

Downloads

If you use this data in your own writing or research, a link back to this page is appreciated.