
We Analyzed 37,000 YouTube Videos. Here's What Spoken English Actually Looks Like

182.9 million spoken words across 37,632 English YouTube videos. Just 67 words make up half of everything said. Full word and phrase frequency lists, charts, and downloadable data.

May 4, 2026 · 12 min read · ClipPhrase Team

We took the subtitles of 37,632 English YouTube videos from our search index — late-night shows, documentaries, podcasts, science explainers, news, talk shows, vlogs — and ran a frequency analysis on every word and short phrase spoken in them. 182.9 million words in total.

This article is the full writeup. All the underlying CSVs are linked at the bottom for anyone who wants to look at the raw data.

The corpus, in numbers

Videos analyzed                                              37,632
Subtitle segments                                        26,203,765
Non-speech segments skipped ([Music], [Applause], etc.)     196,433
Total spoken tokens                                     182,933,444
Unique words (vocabulary size)                              384,132
Average words per video                                       4,861
Unique two-word sequences                                11,240,282
Unique three-word sequences                              42,037,127

Half of all spoken English is 67 words

The distribution is brutally lopsided.

How many words you need to cover X% of spoken English

You need to know…    …to recognize this much of all spoken English
        67 words     50%
       505 words     75%
       906 words     80%
     1,677 words     85%
     2,900 words     89%
     3,368 words     90%
     8,381 words     95%
    36,916 words     99%

A few things worth pulling out:

  • The single word the accounts for 4.04% of all spoken English — one in twenty-five words you hear is the.
  • The top 10 words alone cover 23.2% of all speech. Almost a quarter of every native conversation runs on ten recycled tokens.
  • Returns collapse fast. Going from 89% to 95% comprehension nearly triples the required vocabulary. From 95% to 99% multiplies it by another four.
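The coverage table above is just a running cumulative sum over a frequency list sorted in descending order. A minimal sketch with toy counts (hypothetical values, not the article's data):

```python
# Given word counts sorted descending, find how many words are needed
# to cover a target share of all tokens.
def words_for_coverage(sorted_counts, target):
    total = sum(sorted_counts)
    running = 0
    for i, count in enumerate(sorted_counts, start=1):
        running += count
        if running / total >= target:
            return i
    return len(sorted_counts)

counts = [50, 20, 10, 10, 5, 5]  # toy frequencies, descending
print(words_for_coverage(counts, 0.50))  # 1 — the top word alone covers 50/100
print(words_for_coverage(counts, 0.80))  # 3 — (50+20+10)/100 = 80%
```

Run against the real top-10,000 CSV, the same loop reproduces the 67-word and 3,368-word thresholds.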

This is a sharper version of Zipf's law than what shows up in written corpora. Books spread their probability mass across more vocabulary; speech concentrates it.

Zipf distribution of spoken English from 37K YouTube videos

The plot above is on log-log axes. A pure Zipfian language would form a straight line; spoken English very nearly does, with a slight kink at the highest frequencies and the long tail of rare words trailing off below rank 10⁵.
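That kink can be quantified without the plot: under a pure Zipf law, frequency ∝ 1/rank^s with s ≈ 1, so the log-log slope between any two ranks estimates the exponent. A quick check using three counts from this article's top-50 table:

```python
import math

# Counts for ranks 1, 10, and 50 from the top-50 table in this article.
counts = {1: 7_387_237, 10: 2_521_046, 50: 592_296}

# Under frequency ~ C / rank**s, the slope between two (rank, count)
# points on log-log axes estimates the exponent s.
def zipf_exponent(r1, c1, r2, c2):
    return math.log(c1 / c2) / math.log(r2 / r1)

print(zipf_exponent(1, counts[1], 10, counts[10]))   # ≈ 0.47: flatter than Zipf at the head
print(zipf_exponent(10, counts[10], 50, counts[50])) # ≈ 0.90: near the Zipfian slope of 1
```

The shallow slope over ranks 1–10 is the "slight kink at the highest frequencies"; further down the list the exponent settles toward 1.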

Top 50 spoken words

Rank  Word       Count      Share  Cumulative
   1  the        7,387,237  4.04%   4.04%
   2  and        5,202,156  2.84%   6.88%
   3  to         4,806,242  2.63%   9.51%
   4  i          4,324,592  2.36%  11.87%
   5  a          4,264,055  2.33%  14.20%
   6  you        4,064,555  2.22%  16.43%
   7  of         3,724,277  2.04%  18.46%
   8  that       3,492,110  1.91%  20.37%
   9  it         2,634,690  1.44%  21.81%
  10  in         2,521,046  1.38%  23.19%
  11  like       2,494,184  1.36%  24.55%
  12  is         2,369,926  1.30%  25.85%
  13  this       1,781,715  0.97%  26.82%
  14  so         1,654,633  0.90%  27.73%
  15  was        1,501,038  0.82%  28.55%
  16  it's       1,410,862  0.77%  29.32%
  17  for        1,305,470  0.71%  30.03%
  18  but        1,274,194  0.70%  30.73%
  19  we         1,248,337  0.68%  31.41%
  20  on         1,216,294  0.66%  32.08%
  21  know       1,167,908  0.64%  32.71%
  22  have       1,137,673  0.62%  33.34%
  23  just       1,134,793  0.62%  33.96%
  24  what       1,032,689  0.56%  34.52%
  25  they       1,020,670  0.56%  35.08%
  26  with       1,007,653  0.55%  35.63%
  27  yeah         962,191  0.53%  36.16%
  28  be           957,783  0.52%  36.68%
  29  are          898,626  0.49%  37.17%
  30  not          872,734  0.48%  37.65%
  31  do           870,812  0.48%  38.12%
  32  i'm          821,478  0.45%  38.57%
  33  my           804,993  0.44%  39.01%
  34  all          799,543  0.44%  39.45%
  35  if           756,360  0.41%  39.86%
  36  that's       738,851  0.40%  40.27%
  37  at           732,360  0.40%  40.67%
  38  about        717,388  0.39%  41.06%
  39  he           714,407  0.39%  41.45%
  40  your         696,636  0.38%  41.83%
  41  one          695,227  0.38%  42.21%
  42  as           684,705  0.37%  42.59%
  43  or           678,871  0.37%  42.96%
  44  can          672,388  0.37%  43.32%
  45  think        654,509  0.36%  43.68%
  46  right        647,716  0.35%  44.04%
  47  don't        637,134  0.35%  44.38%
  48  me           616,944  0.34%  44.72%
  49  there        597,279  0.33%  45.05%
  50  people       592,296  0.32%  45.37%

The full top-10,000 list is available as CSV.

What stands out about this top 50 isn't what's there — the, and, to would top any English corpus — but where things rank:

  • like at #11 is a discourse marker, not the verb. In a corpus of books it would be far below this position.
  • Four contractions break into the top 50: it's (#16), i'm (#32), that's (#36), and don't (#47), with more just below the cutoff. Written corpora split these into their full forms.
  • yeah at #27 is purely conversational glue. Books barely use it.
  • know, just, and right are mostly used here as discourse-softening words (you know, I just wanted, yeah, right), not in their dictionary senses.

The top of the list is a snapshot of how speech differs from text: contractions, fillers, and hedges sit alongside articles and pronouns as load-bearing vocabulary.

Speech is built out of chunks

When we count two-word and three-word sequences instead of single words, a different structure appears. The most frequent units of spoken English are not isolated words but short, recurring phrases.

Top 50 two-word sequences

Rank  Phrase       Count    Share
   1  you know     651,659  0.42%
   2  of the       610,473  0.39%
   3  in the       597,973  0.38%
   4  going to     391,962  0.25%
   5  and i        369,069  0.24%
   6  i think      360,605  0.23%
   7  this is      354,886  0.23%
   8  to be        349,293  0.22%
   9  i was        294,749  0.19%
  10  i don't      280,165  0.18%
  11  it was       279,492  0.18%
  12  and then     279,061  0.18%
  13  to the       271,483  0.17%
  14  on the       269,698  0.17%
  15  kind of      253,890  0.16%
  16  a lot        248,787  0.16%
  17  want to      240,129  0.15%
  18  if you       239,704  0.15%
  19  you can      214,797  0.14%
  20  and the      211,577  0.13%
  21  i mean       198,883  0.13%
  22  lot of       188,401  0.12%
  23  to do        188,301  0.12%
  24  in a         185,960  0.12%
  25  is a         183,838  0.12%
  26  like a       180,615  0.12%
  27  at the       169,424  0.11%
  28  have to      168,863  0.11%
  29  one of       161,657  0.10%
  30  have a       160,163  0.10%
  31  that i       159,887  0.10%
  32  is the       159,862  0.10%
  33  you have     158,225  0.10%
  34  do you       158,154  0.10%
  35  and you      156,410  0.10%
  36  that you     150,818  0.10%
  37  for the      147,492  0.09%
  38  a little     146,585  0.09%
  39  to get       143,031  0.09%
  40  like i       141,139  0.09%
  41  so i         140,193  0.09%
  42  it is        137,325  0.09%
  43  don't know   136,714  0.09%
  44  was like     136,396  0.09%
  45  it's a       136,095  0.09%
  46  and so       135,209  0.09%
  47  of a         134,589  0.09%
  48  with the     132,177  0.08%
  49  but i        131,380  0.08%
  50  was a        126,161  0.08%

Full list: top-bigrams.csv.

Three observations:

  1. you know beats every grammatical staple. It's the single most common pair of words in spoken English — more frequent than of the or in the.
  2. The top 50 is dense with first-person constructions: and i, i think, i was, i don't, i mean. Speech is mostly about whoever is doing the speaking.
  3. kind of, a lot, a little, like a, was like — informal hedges and quotative-like constructions are everywhere in the top 50.

Top 50 three-word sequences

Rank  Phrase           Count    Share
   1  a lot of         170,961  0.13%
   2  i don't know      96,455  0.07%
   3  one of the        82,693  0.06%
   4  going to be       72,293  0.05%
   5  a little bit      64,930  0.05%
   6  i was like        60,915  0.05%
   7  i'm going to      55,940  0.04%
   8  i want to         55,071  0.04%
   9  you want to       54,908  0.04%
  10  you know what     52,925  0.04%
  11  you have to       44,985  0.03%
  12  you know i        43,538  0.03%
  13  this is a         43,457  0.03%
  14  this is the       41,664  0.03%
  15  and i think       40,214  0.03%
  16  and i was         39,340  0.03%
  17  i feel like       38,019  0.03%
  18  we're going to    35,687  0.03%
  19  oh my god         35,203  0.03%
  20  to be a           33,229  0.03%
  21  what do you       32,747  0.02%
  22  be able to        32,263  0.02%
  23  i don't think     31,986  0.02%
  24  it was a          30,717  0.02%
  25  and you know      30,321  0.02%
  26  you're going to   29,731  0.02%
  27  like you know     29,420  0.02%
  28  don't want to     29,249  0.02%
  29  some of the       28,953  0.02%
  30  is going to       28,787  0.02%
  31  i think it's      28,719  0.02%
  32  not going to      27,406  0.02%
  33  do you think      27,196  0.02%
  34  and this is       25,763  0.02%
  35  i think that      25,762  0.02%
  36  i mean i          25,419  0.02%
  37  in the world      25,310  0.02%
  38  and it was        25,303  0.02%
  39  and then i        25,091  0.02%
  40  you have a        23,988  0.02%
  41  the end of        23,885  0.02%
  42  and then you      23,471  0.02%
  43  i think i         23,393  0.02%
  44  out of the        23,054  0.02%
  45  it was like       22,869  0.02%
  46  you know the      22,783  0.02%
  47  when i was        22,755  0.02%
  48  you got to        22,220  0.02%
  49  want to be        22,218  0.02%
  50  know what i       22,117  0.02%

Full list: top-trigrams.csv.

Of the top 15 three-word sequences, ten start with a pronoun (counting the demonstrative this). Six contain an explicit first-person i. Spoken English is overwhelmingly about who's saying what to whom in real time, and the high-frequency phrases reflect that.

A few sequences worth noticing because they don't appear in formal English: i was like (#6), you know what (#10), i feel like (#17), oh my god (#19), you got to (#48). These aren't fancy idioms — they're the connective tissue of casual speech.

What the distribution implies

Three things fall out of these numbers.

The cost-effective vocabulary is small. A learner with reliable recognition of 3,000 spoken words has the linguistic raw material to follow 89% of native English on YouTube. Stretching that to 95% takes another 5,500 words — most of which appear only a handful of times across the entire 183-million-word corpus.

Frequency calibration matters more than vocabulary size. Most courses and apps treat their vocabulary lists as roughly equal. The data says otherwise: 50% of all the work is done by the first 67 words. A study schedule that doesn't reflect that is misallocated effort.

Word-by-word translation is the wrong primitive. Half the top three-word sequences are functional chunks (a lot of, i don't know, a little bit, going to be) that work as units. Recognizing them whole is a different cognitive operation from parsing them as three separate words. In live speech, the difference shows up as the difference between keeping up and not.

Methodology

The pipeline that produced these numbers:

  1. Source. Subtitles for 37,632 English YouTube videos.
  2. Tokenization. Lowercase the text, then match the regex [a-z]+(?:'[a-z]+)?. This keeps contractions like don't, gonna, it's as single tokens, drops numbers, and ignores punctuation.
  3. Noise filtering. Segments that match ^\[.*\]$ (e.g. [Music], [Applause], [Inaudible]) are skipped before tokenization. This removed 196,433 segments.
  4. Counting. For unigrams, every token is counted. For bigrams and trigrams, every adjacent N-token sequence within a single subtitle segment is counted; sequences are not allowed to cross segment boundaries.
  5. Cumulative shares are computed by sorting by count and summing.
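Steps 2–4 above fit in a few lines of Python. This is a minimal reimplementation of the described logic, not the production pipeline, and the sample segments are made up:

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # keeps contractions as one token
NOISE_RE = re.compile(r"^\[.*\]$")            # [Music], [Applause], [Inaudible]

def count_ngrams(segments, n):
    """Count n-grams per subtitle segment; sequences never cross segments."""
    counts = Counter()
    for seg in segments:
        if NOISE_RE.match(seg.strip()):
            continue  # skip non-speech segments before tokenization
        tokens = TOKEN_RE.findall(seg.lower())
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

segments = ["[Music]", "You know, I don't know what it's like."]
print(count_ngrams(segments, 1).most_common(3))
print(count_ngrams(segments, 2)["you know"])  # 1
```

Note how the regex drops the punctuation and keeps don't and it's intact, and how the [Music] segment contributes nothing to the counts.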

What this analysis does not do:

  • No lemmatization. go, going, went, and gone are counted as separate vocabulary items. This is appropriate for measuring what a learner actually has to recognize on hearing, but it inflates the raw vocabulary count compared to a lemmatized analysis.
  • No part-of-speech tagging. like the verb and like the discourse marker are counted together.
  • No filtering of auto-generated captions. Some videos have human-edited subtitles, others have auto-generated captions; the latter introduce some transcription noise, particularly in the long tail.

What the analysis is reliable for: the shape of the distribution and the identity of the high-frequency words and phrases. The top of the list — the, and, to, I, like, it's, you know, i don't know — survives any reasonable cleanup.

Caveats on the corpus

The 37,632 videos are not a random sample of all spoken English. They are a curated set of popular English YouTube channels collected to power ClipPhrase, our search engine for phrases in real video clips. The corpus skews:

  • American English. Most channels are US-based.
  • People who speak professionally. Late-night hosts, podcasters, YouTubers, news anchors — not a representative cross-section of casual private speech.
  • Popular content. Channels were selected for view counts and broad cultural reach, not for variety of dialect or register.

These caveats narrow what the numbers strictly demonstrate. They don't change the shape of the distribution or the qualitative finding that spoken English concentrates probability mass on a tiny vocabulary of high-frequency function words and chunks.

Try it yourself

The corpus this analysis ran on is also a search index. Every word and phrase mentioned in this article exists in tens of thousands of real video clips, retrievable by query. Type I was like into ClipPhrase and you get fifty different speakers using it; type gonna and you get a few thousand. That's the underlying tool.

Downloads

If you use this data in your own writing or research, a link back to this page is appreciated.