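We Analyzed 37,000 YouTube Videos. This Is What Spoken English Really Looks Like
182.9 million spoken words from 37,632 English-language YouTube videos. Just 67 words account for half of all speech. Full word and phrase frequency lists, charts, and downloadable raw data.
We pulled the subtitles of 37,632 English-language YouTube videos from our own search index (late-night talk shows, documentaries, podcasts, science explainers, news, interview shows, vlogs) and ran a frequency analysis over every word and phrase in them. That comes to 182.9 million words in total.
This post is the full analysis. The raw CSV data is linked at the end for readers who want to inspect it themselves.
The corpus at a glance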
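| Metric | Value |
|---|---|
| Videos analyzed | 37,632 |
| Subtitle fragments | 26,203,765 |
| Non-speech fragments skipped ([Music], [Applause], etc.) | 196,433 |
| Total spoken words | 182,933,444 |
| Unique words (vocabulary size) | 384,132 |
| Average words per video | 4,861 |
| Unique bigrams | 11,240,282 |
| Unique trigrams | 42,037,127 |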
Half of spoken English uses just 67 words
The distribution is extremely lopsided.

| If you know... | ...you can follow this share of spoken English |
|---|---|
| 67 words | 50% |
| 505 words | 75% |
| 906 words | 80% |
| 1,677 words | 85% |
| 2,900 words | 89% |
| 3,368 words | 90% |
| 8,381 words | 95% |
| 36,916 words | 99% |
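A few points worth calling out:
- The single word the accounts for 4.04% of all spoken English: roughly one in every 25 words you hear is the.
- The top 10 words alone cover **23.2%** of speech. Nearly a quarter of native conversation is made of these ten recurring words.
- Marginal returns collapse fast. Going from 89% to 95% comprehension takes almost three times the vocabulary (2,900 to 8,381 words). Going from 95% to 99% multiplies it by more than four again (8,381 to 36,916 words).
This is a more extreme form of Zipf's law, steeper than what is observed in written corpora. Books spread probability mass across more of the vocabulary; speech concentrates it heavily.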

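The chart above uses log-log axes. A language following a pure Zipf distribution would form a straight line; spoken English sits very close to that line, with only a slight bend at the highest-frequency end and a long tail of rare words that thins out past rank 10⁵.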
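One way to read the log-log claim: under the classic form of Zipf's law, count is proportional to 1/rank, which plots as a straight line with slope -1. The sketch below is an illustration, not the analysis code. It fits a least-squares line to log(count) against log(rank) for the ten most frequent words listed in this article's top-50 table. The function name `loglog_slope` is hypothetical, and ten points describe only the head of the curve, not the whole distribution.

```python
import math

# Counts for ranks 1-10, copied from the top-50 word table in this article.
TOP10_COUNTS = [
    7_387_237, 5_202_156, 4_806_242, 4_324_592, 4_264_055,
    4_064_555, 3_724_277, 3_492_110, 2_634_690, 2_521_046,
]


def loglog_slope(counts):
    """Least-squares slope of log(count) vs. log(rank), ranks starting at 1.

    `counts` must already be sorted from most to least frequent.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# An idealized Zipf series (count = C / rank) for comparison: slope -1.
ideal = [1000 / rank for rank in range(1, 11)]

print(f"ideal Zipf slope:        {loglog_slope(ideal):.2f}")
print(f"top-10 spoken-word slope: {loglog_slope(TOP10_COUNTS):.2f}")
```

On these ten counts the fitted slope is shallower than -1, which is consistent with the slight bend at the high-frequency end of the chart. Fitting the full 10,000-row word list would be the more meaningful check.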
The 50 most common spoken words
| Rank | Word | Count | Share | Cumulative |
|---|---|---|---|---|
| 1 | the | 7,387,237 | 4.04% | 4.04% |
| 2 | and | 5,202,156 | 2.84% | 6.88% |
| 3 | to | 4,806,242 | 2.63% | 9.51% |
| 4 | i | 4,324,592 | 2.36% | 11.87% |
| 5 | a | 4,264,055 | 2.33% | 14.20% |
| 6 | you | 4,064,555 | 2.22% | 16.43% |
| 7 | of | 3,724,277 | 2.04% | 18.46% |
| 8 | that | 3,492,110 | 1.91% | 20.37% |
| 9 | it | 2,634,690 | 1.44% | 21.81% |
| 10 | in | 2,521,046 | 1.38% | 23.19% |
| 11 | like | 2,494,184 | 1.36% | 24.55% |
| 12 | is | 2,369,926 | 1.30% | 25.85% |
| 13 | this | 1,781,715 | 0.97% | 26.82% |
| 14 | so | 1,654,633 | 0.90% | 27.73% |
| 15 | was | 1,501,038 | 0.82% | 28.55% |
| 16 | it's | 1,410,862 | 0.77% | 29.32% |
| 17 | for | 1,305,470 | 0.71% | 30.03% |
| 18 | but | 1,274,194 | 0.70% | 30.73% |
| 19 | we | 1,248,337 | 0.68% | 31.41% |
| 20 | on | 1,216,294 | 0.66% | 32.08% |
| 21 | know | 1,167,908 | 0.64% | 32.71% |
| 22 | have | 1,137,673 | 0.62% | 33.34% |
| 23 | just | 1,134,793 | 0.62% | 33.96% |
| 24 | what | 1,032,689 | 0.56% | 34.52% |
| 25 | they | 1,020,670 | 0.56% | 35.08% |
| 26 | with | 1,007,653 | 0.55% | 35.63% |
| 27 | yeah | 962,191 | 0.53% | 36.16% |
| 28 | be | 957,783 | 0.52% | 36.68% |
| 29 | are | 898,626 | 0.49% | 37.17% |
| 30 | not | 872,734 | 0.48% | 37.65% |
| 31 | do | 870,812 | 0.48% | 38.12% |
| 32 | i'm | 821,478 | 0.45% | 38.57% |
| 33 | my | 804,993 | 0.44% | 39.01% |
| 34 | all | 799,543 | 0.44% | 39.45% |
| 35 | if | 756,360 | 0.41% | 39.86% |
| 36 | that's | 738,851 | 0.40% | 40.27% |
| 37 | at | 732,360 | 0.40% | 40.67% |
| 38 | about | 717,388 | 0.39% | 41.06% |
| 39 | he | 714,407 | 0.39% | 41.45% |
| 40 | your | 696,636 | 0.38% | 41.83% |
| 41 | one | 695,227 | 0.38% | 42.21% |
| 42 | as | 684,705 | 0.37% | 42.59% |
| 43 | or | 678,871 | 0.37% | 42.96% |
| 44 | can | 672,388 | 0.37% | 43.32% |
| 45 | think | 654,509 | 0.36% | 43.68% |
| 46 | right | 647,716 | 0.35% | 44.04% |
| 47 | don't | 637,134 | 0.35% | 44.38% |
| 48 | me | 616,944 | 0.34% | 44.72% |
| 49 | there | 597,279 | 0.33% | 45.05% |
| 50 | people | 592,296 | 0.32% | 45.37% |
The full top-10,000 list is in the CSV file.
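The share and cumulative columns can be recomputed from the raw counts. The following is a minimal sketch, not the pipeline's actual code. It uses the top-10 counts from the table and the corpus total of 182,933,444 words from the summary table. The helper names `share_rows` and `rank_reaching` are made up for illustration.

```python
from itertools import accumulate

# Total spoken words, from the corpus summary table in this article.
TOTAL_WORDS = 182_933_444

# (word, count) for ranks 1-10, copied from the top-50 table.
TOP10 = [
    ("the", 7_387_237), ("and", 5_202_156), ("to", 4_806_242),
    ("i", 4_324_592), ("a", 4_264_055), ("you", 4_064_555),
    ("of", 3_724_277), ("that", 3_492_110), ("it", 2_634_690),
    ("in", 2_521_046),
]


def share_rows(rows, total):
    """Return (rank, word, share, cumulative_share) for rows sorted by count."""
    running = accumulate(count for _, count in rows)
    return [
        (rank, word, count / total, cumulative / total)
        for rank, ((word, count), cumulative)
        in enumerate(zip(rows, running), start=1)
    ]


def rank_reaching(rows, total, target):
    """Smallest rank whose cumulative share reaches `target`, or None."""
    for rank, _word, _share, cumulative in share_rows(rows, total):
        if cumulative >= target:
            return rank
    return None


for rank, word, share, cumulative in share_rows(TOP10, TOTAL_WORDS):
    print(f"{rank:>2}  {word:<5} {share:6.2%}  {cumulative:7.2%}")

# How many of these words does it take to cover 20% of the corpus?
print(rank_reaching(TOP10, TOTAL_WORDS, 0.20))
```

Run against the full word list instead of ten rows, the same `rank_reaching` lookup is what a coverage table like the one earlier in this post boils down to.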
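What stands out in this top 50 is not who is on it (the, and, to would top any English corpus) but where certain words land:
- **like** ranks 11th, here as a discourse marker rather than a verb. In a book corpus it would sit far lower.
- Four contractions make the top 50: it's (16th), i'm (32nd), that's (36th), and don't (47th), with more further down the list. Written corpora usually expand these back to their full forms.
- **yeah** ranks 27th, purely as conversational glue. Books almost never use it.
- **know**, **just**, and **right** mostly serve here as discourse softeners (you know, I just wanted, yeah, right) rather than in their dictionary senses.
The top of the list shows the gap between speech and text clearly: contractions, fillers, and hedges sit alongside articles and pronouns as the load-bearing vocabulary of spoken English.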
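Speech is built from "chunks"
When we stop counting single words and count bigrams and trigrams instead, a different structure emerges. The most frequent units of spoken English are not isolated words but recurring phrases.
The 50 most common bigrams
| Rank | Phrase | Count | Share |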
|---|---|---|---|
| 1 | you know | 651,659 | 0.42% |
| 2 | of the | 610,473 | 0.39% |
| 3 | in the | 597,973 | 0.38% |
| 4 | going to | 391,962 | 0.25% |
| 5 | and i | 369,069 | 0.24% |
| 6 | i think | 360,605 | 0.23% |
| 7 | this is | 354,886 | 0.23% |
| 8 | to be | 349,293 | 0.22% |
| 9 | i was | 294,749 | 0.19% |
| 10 | i don't | 280,165 | 0.18% |
| 11 | it was | 279,492 | 0.18% |
| 12 | and then | 279,061 | 0.18% |
| 13 | to the | 271,483 | 0.17% |
| 14 | on the | 269,698 | 0.17% |
| 15 | kind of | 253,890 | 0.16% |
| 16 | a lot | 248,787 | 0.16% |
| 17 | want to | 240,129 | 0.15% |
| 18 | if you | 239,704 | 0.15% |
| 19 | you can | 214,797 | 0.14% |
| 20 | and the | 211,577 | 0.13% |
| 21 | i mean | 198,883 | 0.13% |
| 22 | lot of | 188,401 | 0.12% |
| 23 | to do | 188,301 | 0.12% |
| 24 | in a | 185,960 | 0.12% |
| 25 | is a | 183,838 | 0.12% |
| 26 | like a | 180,615 | 0.12% |
| 27 | at the | 169,424 | 0.11% |
| 28 | have to | 168,863 | 0.11% |
| 29 | one of | 161,657 | 0.10% |
| 30 | have a | 160,163 | 0.10% |
| 31 | that i | 159,887 | 0.10% |
| 32 | is the | 159,862 | 0.10% |
| 33 | you have | 158,225 | 0.10% |
| 34 | do you | 158,154 | 0.10% |
| 35 | and you | 156,410 | 0.10% |
| 36 | that you | 150,818 | 0.10% |
| 37 | for the | 147,492 | 0.09% |
| 38 | a little | 146,585 | 0.09% |
| 39 | to get | 143,031 | 0.09% |
| 40 | like i | 141,139 | 0.09% |
| 41 | so i | 140,193 | 0.09% |
| 42 | it is | 137,325 | 0.09% |
| 43 | don't know | 136,714 | 0.09% |
| 44 | was like | 136,396 | 0.09% |
| 45 | it's a | 136,095 | 0.09% |
| 46 | and so | 135,209 | 0.09% |
| 47 | of a | 134,589 | 0.09% |
| 48 | with the | 132,177 | 0.08% |
| 49 | but i | 131,380 | 0.08% |
| 50 | was a | 126,161 | 0.08% |
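Full list: top-bigrams.csv.
Three observations:
- **you know** beats every grammatical skeleton pair. It is the most frequent word pair in spoken English, more common than of the or in the.
- First-person constructions are dense in the top 50: and i, i think, i was, i don't, i mean. Speech is mostly speakers talking about themselves.
- kind of, a lot, a little, like a, was like: informal hedges and quotative constructions run throughout the top 50.
The 50 most common trigrams
| Rank | Phrase | Count | Share |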
|---|---|---|---|
| 1 | a lot of | 170,961 | 0.13% |
| 2 | i don't know | 96,455 | 0.07% |
| 3 | one of the | 82,693 | 0.06% |
| 4 | going to be | 72,293 | 0.05% |
| 5 | a little bit | 64,930 | 0.05% |
| 6 | i was like | 60,915 | 0.05% |
| 7 | i'm going to | 55,940 | 0.04% |
| 8 | i want to | 55,071 | 0.04% |
| 9 | you want to | 54,908 | 0.04% |
| 10 | you know what | 52,925 | 0.04% |
| 11 | you have to | 44,985 | 0.03% |
| 12 | you know i | 43,538 | 0.03% |
| 13 | this is a | 43,457 | 0.03% |
| 14 | this is the | 41,664 | 0.03% |
| 15 | and i think | 40,214 | 0.03% |
| 16 | and i was | 39,340 | 0.03% |
| 17 | i feel like | 38,019 | 0.03% |
| 18 | we're going to | 35,687 | 0.03% |
| 19 | oh my god | 35,203 | 0.03% |
| 20 | to be a | 33,229 | 0.03% |
| 21 | what do you | 32,747 | 0.02% |
| 22 | be able to | 32,263 | 0.02% |
| 23 | i don't think | 31,986 | 0.02% |
| 24 | it was a | 30,717 | 0.02% |
| 25 | and you know | 30,321 | 0.02% |
| 26 | you're going to | 29,731 | 0.02% |
| 27 | like you know | 29,420 | 0.02% |
| 28 | don't want to | 29,249 | 0.02% |
| 29 | some of the | 28,953 | 0.02% |
| 30 | is going to | 28,787 | 0.02% |
| 31 | i think it's | 28,719 | 0.02% |
| 32 | not going to | 27,406 | 0.02% |
| 33 | do you think | 27,196 | 0.02% |
| 34 | and this is | 25,763 | 0.02% |
| 35 | i think that | 25,762 | 0.02% |
| 36 | i mean i | 25,419 | 0.02% |
| 37 | in the world | 25,310 | 0.02% |
| 38 | and it was | 25,303 | 0.02% |
| 39 | and then i | 25,091 | 0.02% |
| 40 | you have a | 23,988 | 0.02% |
| 41 | the end of | 23,885 | 0.02% |
| 42 | and then you | 23,471 | 0.02% |
| 43 | i think i | 23,393 | 0.02% |
| 44 | out of the | 23,054 | 0.02% |
| 45 | it was like | 22,869 | 0.02% |
| 46 | you know the | 22,783 | 0.02% |
| 47 | when i was | 22,755 | 0.02% |
| 48 | you got to | 22,220 | 0.02% |
| 49 | want to be | 22,218 | 0.02% |
| 50 | know what i | 22,117 | 0.02% |
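Full list: top-trigrams.csv.
Of the top 15 trigrams, 11 begin with a pronoun and 6 explicitly contain the first-person I. Spoken English is overwhelmingly about who is saying what to whom in real time, and these high-frequency phrases bear that out.
A few sequences deserve notice because they do not appear in formal English: i was like (6th), you know what (10th), i feel like (17th), oh my god (19th), you got to (48th). These are not fancy idioms; they are the connective tissue of everyday speech.
What this distribution tells us
Three conclusions follow from these numbers.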
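**The best-value vocabulary is not large.** A learner who can reliably recognize about 3,000 spoken words by ear already has the linguistic material to follow 89% of native English on YouTube. Pushing that to 95% means learning roughly 5,500 more words. Those words share just 6 percentage points of coverage between them, so each one accounts on average for only about 0.001% of the 183-million-word corpus.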
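**Frequency calibration matters more than vocabulary size.** Most courses and apps treat the words on their lists as roughly equally important. The data says otherwise: 50% of all the "work" is done by the top 67 words. A study plan that does not reflect this is misallocating effort.
**Word-by-word translation uses the wrong minimal unit.** Half of the top trigrams are functional chunks (a lot of, i don't know, a little bit, going to be) that operate as a single unit. Recognizing them whole and parsing them as three separate words are two different cognitive operations. In real-time speech, that difference is the difference between keeping up and falling behind.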
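Methodology
The pipeline that produced these numbers:
- **Source.** Subtitles of 37,632 English-language YouTube videos.
- **Tokenization.** Text is lowercased, then matched with the regex `[a-z]+(?:'[a-z]+)?`. This keeps contracted forms such as don't, gonna, and it's as single tokens, drops digits, and ignores punctuation.
- **Noise filtering.** Fragments matching `^\[.*\]$` (such as [Music], [Applause], [Inaudible]) are skipped before tokenization. This step removed 196,433 fragments.
- **Counting.** For unigrams, every token is counted. For bigrams and trigrams, only sequences of N adjacent tokens within the same subtitle fragment are counted; sequences may not cross fragment boundaries.
- Cumulative share is computed by sorting by count and taking a running sum.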
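The steps above can be sketched in a few lines of Python. This is a reconstruction under stated assumptions, not the original pipeline. The two regexes are the ones quoted in the list, but the function names (`tokenize`, `ngrams`, `count_corpus`, `cumulative_shares`), the whitespace stripping before the noise check, and the sample fragments are all invented for illustration.

```python
import re
from collections import Counter
from itertools import accumulate

# The two patterns quoted in the methodology list.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")
NOISE_RE = re.compile(r"^\[.*\]$")


def tokenize(fragment):
    """Lowercase, then keep letter runs with an optional 's / 't style suffix."""
    return TOKEN_RE.findall(fragment.lower())


def ngrams(tokens, n):
    """Yield tuples of n adjacent tokens; yields nothing if len(tokens) < n."""
    return zip(*(tokens[i:] for i in range(n)))


def count_corpus(fragments):
    """Count unigrams, bigrams, and trigrams, one subtitle fragment at a time.

    N-grams are built per fragment, so they never cross fragment boundaries.
    """
    unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
    skipped = 0
    for fragment in fragments:
        text = fragment.strip()  # assumption: surrounding whitespace is ignored
        if NOISE_RE.match(text):
            skipped += 1  # e.g. [Music], [Applause]
            continue
        tokens = tokenize(text)
        unigrams.update(tokens)
        bigrams.update(" ".join(gram) for gram in ngrams(tokens, 2))
        trigrams.update(" ".join(gram) for gram in ngrams(tokens, 3))
    return unigrams, bigrams, trigrams, skipped


def cumulative_shares(counter):
    """Sort by count, then return (item, count, share, cumulative_share) rows."""
    total = sum(counter.values())
    ranked = counter.most_common()
    running = accumulate(count for _, count in ranked)
    return [
        (item, count, count / total, cumulative / total)
        for (item, count), cumulative in zip(ranked, running)
    ]


# Hypothetical sample fragments, just to exercise the functions.
sample = ["[Music]", "I don't know, you know", "you know what I mean", "it's 2024!"]
unigrams, bigrams, trigrams, skipped = count_corpus(sample)
print(skipped, unigrams.most_common(3), bigrams.most_common(1))
```

Building n-grams per fragment is what keeps a pair like "mean it's" out of the counts here: the two words are adjacent in the sample list but sit in different fragments.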
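What this analysis does not do:
- **No lemmatization.** go, going, went, and gone are treated as distinct items. That is appropriate for measuring what a learner actually has to recognize by ear, but it overstates vocabulary size relative to a lemmatized analysis.
- **No part-of-speech tagging.** like as a verb and like as a discourse marker are counted together.
- **No filtering of auto-generated captions.** Some videos have human-edited captions and some have auto-generated ones; the latter introduce some transcription noise, especially in the long tail.
What is reliable in this analysis: the shape of the distribution and the identity of the high-frequency words and phrases. The items at the top (the, and, to, I, like, it's, you know, i don't know) would survive any reasonable cleaning and processing.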
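Notes on the corpus
These 37,632 videos are not a random sample of all spoken English. They are a curated set of popular English-language YouTube channels assembled for ClipPhrase, our video clip search engine. The corpus is biased in the following ways:
- **Mostly American English.** Most of the channels are based in the United States.
- **Mostly professional speakers.** Late-night hosts, podcasters, YouTubers, and news anchors are not a cross-section of everyday private speech.
- **Skewed toward popular content.** Channels were selected for view counts and broad cultural reach, not to cover a range of dialects or registers.
These limits narrow what the numbers can strictly prove. They do not change the shape of the distribution, nor do they undermine the qualitative finding: spoken English concentrates its probability mass in a very small set of high-frequency function words and chunks.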
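Try it yourself
The corpus doubles as a search index. Every word and phrase mentioned in this post appears in tens of thousands of real video clips and can be retrieved by query. Type I was like into ClipPhrase and you will see fifty different speakers using it; type gonna and you get several thousand results. That is the tool underlying this post.
Data downloads
- top-words.csv: the top 10,000 words, with rank, count, share, and cumulative share
- top-bigrams.csv: the top 5,000 bigrams
- top-trigrams.csv: the top 5,000 trigrams
If you use this data in your own writing or research, you are welcome to cite the source and link back to this page.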