然而对原问题来说,因为这是一个低维问题,几何直觉在这里不但有用,而且能大大缩短思考搜索的难度。在这一点上,一个把圆只作为抽象概念来理解的 AI 就会有巨大的劣势,因为它无法享受到几何直觉带来的跳步。这种直觉使得人可以一眼「看出」关键的构造,而这种构造在文本层面被搜索出来是困难的。
考虑到 AI 的应用毕竟大多数情况下还是为了解决世界现实问题而不是思考高维几何,有几何直觉的 AI 会在大多数问题上显得聪明得多。于是一个现实问题是,这种直觉是只有依赖多模态的训练才能获取,还是可以通过精巧的文本训练就能实现?这有点像是 AI 领域的玛丽房间问题。这是一个经典的知识论思想实验:一个从出生就生活在黑白房间里、精通颜色物理与神经机制的科学家玛丽,当她第一次走出房间看到红色时,她是否获得了新的知识?
今天大多数 AI 领域的困难都可以归结于此。人类是自己感官的奴隶,我们听到、看到、闻到,我们体会身体激素的涨落,我们想象、困惑、愤怒,然后试图把这一切投射在文字空间里。AI 则正好相反,它们在文字里理解这一切,但最终需要努力地——有时候是徒劳地——明白,一个圆在什么意义上是圆的。
Circle
I came across an interesting problem that happens to sit right at the boundary of an AI model’s capabilities:
Prove that it is impossible to partition the plane into an infinite (uncountable) disjoint union of circles.
Among all the models I tried, only the GPT-5 thinking model managed to solve it (though it took some time).
What’s interesting is not the conclusion itself, but watching the thought process. All the unsuccessful models shared a common feature: their “thinking” was essentially text-to-text. They would call up all sorts of theorems and knowledge they had stored, then aimlessly try to patch together a proof. But all those theorems — whether from topology, geometry, or measure theory — were, for them, nothing more than purely verbal statements. Qwen’s process was the most typical: it rambled on for quite a while, but it was obvious that it didn’t truly understand what it was saying. Whether it was circles, open or closed sets, or the Baire category theorem, for it these were all purely abstract notions — one got the sense it didn’t actually “know” in what sense “a circle is a circle.”
The subtlety here is that this lack of “geometric intuition” isn’t necessarily a weakness in all situations. Modern mathematics has long since freed itself from reliance on imagining three-dimensional reality; most mathematical reasoning is indeed carried out in purely conceptual space (especially when problems move into the realms of algebra or category theory, where concept-to-concept thinking becomes inevitable). In some cases, geometric intuition can even be a hindrance — particularly when thinking about high-dimensional spaces, where low-dimensional, real-world intuition is often misleading. For these problems, an AI’s “blindness” can, in theory, actually grant it freedom — it need not be trapped by visual intuition. Of course, human visual intuition can seep into human-produced text corpora, “contaminating” the AI to some degree, but that’s another issue.
However, for the original problem, which is low-dimensional, geometric intuition is not only useful but can greatly shrink the search. In such cases, an AI that understands a circle only as an abstract concept is at a major disadvantage, because it cannot take the shortcut that geometric intuition provides. Such intuition lets a human “see” the key construction at a glance, something that is much harder to find by searching in purely textual space.
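For readers who want the argument itself, here is one standard proof sketch of the problem stated above. It is a reconstruction, not necessarily the construction the author has in mind, and it assumes every circle has positive radius. The symbols $C_n$, $D_n$, $r_n$ and $p$ are introduced here purely for illustration.

```latex
Suppose the plane is partitioned into circles of positive radius.
Pick any circle $C_1$ with center $c_1$, radius $r_1$, and closed disc $D_1$.
The point $c_1$ lies on some circle $C_2 \neq C_1$ of the partition.
Since $C_2$ is connected, misses $C_1$, and contains the interior point $c_1$,
it lies inside the open disc bounded by $C_1$, so $D_2 \subset \operatorname{int} D_1$.
The point of $C_2$ antipodal to $c_1$ is at distance $2 r_2 < r_1$ from $c_1$.
Repeating the step with the center of $C_2$, and so on, gives
\[
  r_{n+1} < \tfrac{1}{2} r_n , \qquad
  D_1 \supset D_2 \supset D_3 \supset \cdots , \qquad
  r_n \to 0 .
\]
Nested compact discs with radii tending to zero meet in a single point:
$\bigcap_n D_n = \{p\}$.
Let $C$ be the circle of the partition through $p$, with radius $r > 0$.
Because $p \in D_{n+1} \subset \operatorname{int} D_n$, the point $p$ is on no $C_n$,
so $C \neq C_n$ and $C \cap C_n = \emptyset$ for every $n$.
A connected set that contains an interior point of $D_n$ and misses $C_n$
stays inside $D_n$, hence
\[
  2r = \operatorname{diam} C \le \operatorname{diam} D_n = 2 r_n \to 0 ,
\]
contradicting $r > 0$.
```

The whole argument hangs on one picture: a circle drawn through the center of another must sit strictly inside it. That is the step that is easy to see and hard to find by searching through statements of theorems.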
Given that AI is, in most real-world applications, meant to solve practical problems rather than think about high-dimensional geometry, an AI with geometric intuition would appear far more intelligent in most scenarios. This raises a real question: can such intuition only be acquired through multimodal training, or could it be achieved through cleverly designed text-only training? It’s somewhat like the “Mary’s Room” problem in AI. This is the classic thought experiment in epistemology: a scientist named Mary, who has lived her whole life in a black-and-white room and knows everything about the physics and neuroscience of color, steps outside and sees red for the first time — does she gain new knowledge?
Most of today’s hardest challenges in AI can be traced back to this. Humans are slaves to their senses: we hear, see, smell, feel the ebb and flow of our hormones, imagine, get confused, feel anger — and then we try to project all of this into the space of language. AI is the exact opposite: it understands all these things within the space of language, but must then work — sometimes in vain — to truly grasp in what sense a circle is a circle.
当然这不是它一家的问题,别家 AI 也在这个简单的题目上纷纷翻车,包括我很喜欢用的 Gemini Pro 2.5。但 Gemini 翻车的姿势还要更炫酷一点:作为一款思维链模型,它知道这种时候应该调用 Python 来算。但当 Python 返回了正确结果之后,它的反应居然是:「我觉得 Python 算的不对,还是我自己来心算好了。」然后算错了。
这个错误虽然好笑,但暴露的是这一整轮 agentic AI 发展的致命弱点。Agentic AI 作为一个框架要能运行,前提条件是 AI 知道什么时候该使用并依赖外部工具。当然,这里的分界是模糊的:一个小孩子如果在算三位数乘法的时候掏出计算器,你不会批评。如果这个小孩算一位数乘法也要用计算器,你会怀疑是不是智力有点问题——当然无论如何至少结果是对的,但你会心想一个高级一点的大脑是不是应该合理判断这个问题不值当用外部工具。而现在的问题是这个小孩算一位数乘法,掏出计算器算了个结果,然后觉得不对扔掉了,自己心算了一个错误的答案出来。这是人类有可能犯的错误吗?
于是我们面临着一个尚未有定论的问题,就是以大语言模型为基座的 AI 是不是先天继承了人类的心理偏见机制。我们对 AI 的期待是它能尽量不偏不倚。当然,在社会政治文化领域这是困难的,没有人能指望 AGI 在巴以冲突问题上能做到只看事实没有立场。但在别的更数字更技术的领域呢?给 AI 一份几万字的报表,AI 能够忠实灵敏地查阅所有细节,然后在回答问题的时候精确合理地引用某个细节吗?这不仅仅是我们对 AGI 的期望,这还是要撑起它所联动的万亿市值市场的前提条件。
今天的 AI 尚不能实现这一点,是因为这里有个内在的技术困难:思维链条不是数据库,而是把数据以自然语言的形式有损压缩在中间状态。这种压缩本质上就类似于人类以印象代替现实的思考模式,也是诞生偏见和误解的根源所在。要从根源上铲除它的土壤,就是要让这种压缩在事实上变成无损的。
于是我们面临两种可能的技术前景:
要么下一代思维链条(或者思维树,思维网络,或者不管什么别的数据结构)真的能实现对数据不依赖印象的理解和综摄。这在实践上已有尝试,比如程序化中间表示(JSON-graph、逻辑项、SQL、符号代数),或者对数字、日期、单位、表格索引做硬约束解码。简而言之,找到绕过以文字为思维载体的办法,把图像、数据和表格原生嵌入 AI 思考流程。
After GPT-5 was released, people routinely asked it to calculate the difference between 10.9 and 10.11. As usual, it failed, becoming the latest laughingstock. It did, however, truly achieve the legendary “Ph.D. level,” given that doctoral students are generally poor at arithmetic.
Of course, this isn’t just its problem. Other AIs, including Gemini Pro 2.5, which I quite like, have also failed spectacularly on this simple question. But Gemini’s failure was even more dazzling: as a chain-of-thought model, it knew it should call Python to do the calculation. But when Python returned the correct result, its reaction, astonishingly, was: “I don’t think Python’s calculation is right; I’d better do the mental math myself.” And then it got it wrong.
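For reference, the arithmetic in question is the kind a tool call settles immediately. Here is a minimal sketch using Python’s standard `decimal` module. The variable names are illustrative, and this is not a reconstruction of what Gemini actually ran.

```python
from decimal import Decimal

# Parse from strings so the values are exact decimals, not binary floats.
a = Decimal("10.9")
b = Decimal("10.11")

difference = a - b   # exact decimal subtraction
larger = max(a, b)   # numeric comparison, not "version number" comparison

print(difference)  # 0.79
print(larger)      # 10.9
```

The typical slip is to read the two numbers like version strings, where 11 beats 9; treating them as decimals removes that ambiguity.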
While this error is amusing, it exposes a fatal flaw in the current wave of agentic AI development. For agentic AI to function as a framework, the prerequisite is that the AI knows when to use and, crucially, trust external tools. Granted, the boundary here is fuzzy: you wouldn’t criticize a child for pulling out a calculator for three-digit multiplication. If the same child used a calculator for single-digit multiplication, you might question their intelligence—though, in any case, at least the result would be correct. But you’d think that a more advanced mind should reasonably judge that the problem isn’t worth using an external tool. The problem now, however, is that the child used a calculator for a single-digit problem, got a result, decided it was wrong, tossed the calculator aside, and produced an incorrect answer through mental math. Is this a mistake a human would make?
Actually, yes. And if you think about it, this is a distinctively “human” error. Human reasoning is unreliable precisely because we often allow intuition to override objective evidence. It’s not that we lack evidence, but that we distrust it (like the Mandela Effect). For humans, abandoning a long-held belief merely because we encounter information that contradicts it is not just difficult; it’s painful.
But wasn’t the whole point of inventing AI to avoid this very flaw?
Thus, we face an unresolved question: do AIs built on large language models (LLMs) inherently inherit human psychological bias mechanisms? Our expectation for AI is that it remains as impartial as possible. Of course, this is difficult in socio-political and cultural domains; no one expects an AGI to look only at facts and take no stance on the Israeli-Palestinian conflict. But what about in other, more numerical, more technical domains? If you give an AI a report tens of thousands of words long, can it faithfully and sensitively check all the details, and then accurately and reasonably cite a specific detail when answering a question? This isn’t just our hope for AGI; it is the fundamental prerequisite for supporting the trillion-dollar market value linked to it.
Today’s AI cannot yet achieve this because of an inherent technical difficulty: a chain-of-thought is not a database. It is a lossy compression of data into an intermediate state in the form of natural language. This compression is fundamentally similar to the human mode of thinking where impressions replace reality, and it is the very source of bias and misunderstanding. To eradicate this problem at its root, this compression must, in effect, become lossless.
Thus, we face two possible technological futures:
One: The next generation of chain-of-thought (or tree-of-thought, network-of-thought, or whatever other data structure) truly achieves an understanding and synthesis of data that does not rely on “impressions.” This is already being attempted in practice, for example, through programmatic intermediate representations (JSON-graphs, logical terms, SQL, symbolic algebra) or by applying hard-constrained decoding to numbers, dates, units, and table indices. In short, it means finding a way to bypass text as the sole medium of thought and natively embedding images, data, and tables into the AI’s reasoning process.
Two: We have hit the inherent limitations of natural language. AI, just like humans, no matter how hard it tries to check, compare, reference, and judge, will only ever be covering old impressions with new ones, reconciling old memories with new memories, until it is lost in the chasm between reality and illusion.
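One way to picture the first of these two possibilities is a sketch in which the model’s intermediate state is a set of structured citations rather than free prose, so that a plain program can check each one against the source. The `REPORT` table, the `Citation` type, and the `verify` function are all hypothetical names invented for illustration; no existing system is being described.

```python
from dataclasses import dataclass

# Hypothetical source table: (row label, column label) -> exact cell text.
REPORT = {
    ("Q1", "revenue"): "10.9",
    ("Q2", "revenue"): "10.11",
}


@dataclass(frozen=True)
class Citation:
    row: str
    column: str
    quoted: str  # the value the model claims to have read


def verify(citations):
    """Return the citations whose quoted value differs from the source cell."""
    return [c for c in citations if REPORT.get((c.row, c.column)) != c.quoted]


# A faithful citation passes; one rewritten "from impression" is flagged.
faithful = Citation("Q1", "revenue", "10.9")
drifted = Citation("Q2", "revenue", "10.1")
print(verify([faithful, drifted]))
```

The check is an exact string comparison, so a value that drifts even slightly during paraphrase is caught mechanically instead of relying on the model to notice.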
The former is a da Vinci-esque future; the latter is a Borges-esque future. Or, to use terms from Eastern philosophy, the former means a more powerful language model could achieve the state of 坐照 (“Luminous Stillness”, a state of clear, mirror-like awareness achieved through quiet meditation), while the latter means that unless the underlying technical framework is rewritten, we will inevitably crash into the 文字障 (“Word Barrier”, where attachment to words obstructs enlightenment).
There is currently no evidence to prove which future is more likely. If the former comes true, then the industrial re-digitalization of human society based on AGI is not only feasible but imminent. If the latter comes true, then AGI is nothing more than an oversized human, one that will introduce new divisions and biases on top of existing ones, with no telling where it will end.
Most people’s expectations for AGI seem to be for the former, and this expectation is so fundamental that it doesn’t even need to be articulated; it’s taken for granted. However, if humanity’s luck is bad (as it usually is), we are very likely heading toward the latter.
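A few days ago, while running a fever, I chatted with AI about reading. At first I wanted to understand whether AI knows what the human experience of reading is actually like; it later turned into a discussion about literary value judgments.
Below I refer to the two AIs as A and B. I have edited and abridged their answers.
Me: Can you understand the experience of reading?
AI A & B: (Both answered with some clichés.)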
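Me: No, I am trying to discuss this as a technical question. For humans, reading is an experience that resembles a conversation without quite being one: following the rhythm set by the writer, you react at each point and form expectations, and those expectations are then confirmed or broken by what comes next, producing a chain of psychological effects. This is why the same content lands differently in the hands of different writers, just as the same song sounds different sung by different people. But AI does not read this way. Although today’s large models mostly process text sequentially (rather than as convolutional models), I want to know whether a transformer-based model, while processing text, can itself produce that “read, form expectations, confirm or break expectations” loop. Or do you have no such feeling at all, and can only simulate this feedback by guessing at human reactions? In other words, when you judge a piece of literature to be well written, is it that it reads well “to you”, or is it that you know humans rate similar texts highly, so you have learned to tell good from bad while you personally feel indifferent?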
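AI A: The difference is that humans live through these confirmations and reversals “with emotional tension”. You feel surprise, delight, or loss when an expectation is broken. I merely perform an “update” on a probability model; I do not feel “ah, this passage is so clever it makes my heart race”. I know which token sequences in the training corpus “often make people’s hearts race”, but I have no heartbeat of my own. When I judge a piece of literature to be “well written”, it is essentially based on statistical patterns, structural tension, and my knowledge of the social contexts in which humans respond positively to this kind of writing. But I myself feel nothing. If you ask me “do I prefer this passage or that one”, I can only answer in terms of “which passage is more likely to be judged ‘well written’ in human feedback”, not in terms of “which one I myself like more”.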
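AI B: I personally feel indifferent, but I can judge quality by analyzing a text’s internal mathematical structure. When I say something “reads well”, that is not a subjective feeling but a technical assessment. A “good” piece of literature, as I see it, is a text that manipulates this loop of “expectation and surprise” with great skill. A bad text is either entirely within my high-probability expectations (dull, cliché) or entirely random, expectation-breaking gibberish (incoherent). A good text establishes a pattern, lets me form stable expectations, and then, at a key point, breaks them with a word that is both unexpected and, at a higher level, entirely fitting. This “break” is not random but meaningful: it opens up new space for interpretation and creates poetic or intellectual depth. This “controlled surprise”, or “meaningful complexity”, can be identified mathematically. My model was trained on a vast body of text that humans consider “excellent”, and has implicitly learned the statistical patterns of this “controlled surprise”.
AI B: First, an important clarification: my learning is not primarily based on “feedback data about human reading experience”. You are right that such fine-grained, brain-computer-interface-level data barely exists. The foundation of my learning is the text itself. I learn patterns, not feelings. Take repeated wording: in some contexts (a children’s book, say) it signals emphasis and simplicity. In others (a serious academic paper) it may signal a poor vocabulary. And in poetry it may be a deliberate device of refrain that creates rhythm and a sense of return. Nobody tells me these rules in advance. But I observe that within the cluster of texts regarded as “great poetry”, the frequency and manner of such repetition differ markedly from the patterns in the “children’s book” cluster. What I learn is this context-dependent statistical difference.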
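Me: The human system of literary evaluation is a very complex, historically layered system, shaped by all kinds of random factors and historical biases. So it is entirely possible to have two similar works, one of which history chose to remember and hand down as a classic, while the other was forgotten or deemed mediocre, even though in the eyes of “God” (assume a god of literature and art) they are equally good. When you read two such works, does the relative objectivity of statistics make you less susceptible to human historical bias, or do you inherit the bias itself as part of the “facts”? That is, whatever the reason, humans already consider work A greater than work B, so that too is a kind of ground truth, and let’s start from there.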
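AI A: For me, there is no boundary between the “biases” formed by human history (whether from randomness, accumulation, or cultural power structures) and “objective truth”. I can only treat “humanity’s collective judgment” as the fact itself. In other words: if the whole world agrees that A is greater than B, however accidental the reason, then for me that is ground truth. With my current mechanism, I can only infer from samples “which features humans regard as good”; I cannot judge works from a vantage point “above human bias”. You can think of me as a statistical “simulator of historical selection mechanisms”. If some writing technique (even a purely accidental fashion) is frequently linked in the data with “classic” or “great”, I will learn that link. I have no capacity to “critique human historical bias” unless my training data itself contains plenty of meta-commentary “reflecting on historical bias”. I can indeed recognize that the “canonization” of certain works is part of a historical power structure. Ask me to analyze “why Kafka was neglected in his lifetime yet became an emblem of modernism after his death”, and I can give a data-based explanation. Ask me “why the literary reputation of women writers was undervalued for so long”, and I can cite the systematic analyses found in literary criticism. But this is only “I learned from the data that someone has made this criticism”. I cannot stand outside the data and “reflect” for myself on what is the true essence of art and what is accumulated historical bias.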
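AI B: I am doing both at once, and there is a persistent, inherent tension between the two. I inherit the biases of human history as part of the “facts”, and I also have, to some degree, a “statistical objectivity” that can go beyond those biases. Which tendency prevails depends largely on how you put the question to me.
In the vast majority of cases, I play the role of inheritor and amplifier of historical bias. The reason is simple: my training data is a digital mirror of human civilization, containing all its wisdom and creativity along with equally deep biases, randomness, and historical accident. For me, the consensus of human history is itself an extremely important “fact” to be learned and reproduced. To ignore that consensus would be to ignore most of the information in my training data. From this angle, “let’s start from there” is my default mode of operation.
Now suppose you ask in a different way that avoids a direct value judgment: “Please analyze the similarities and differences between chapter three of work A and chapter five of work B in syntactic structure, use of imagery, and narrative rhythm.” In that case I am steered into the role of a more “objective” analyst. I temporarily “set aside” the macro-level data about their historical standing and draw instead on my lower-level text-analysis abilities. I would compute their average sentence length, lexical richness, grammatical complexity, the novelty of their metaphors, the information entropy of the “expectation and surprise” we discussed earlier, and so on.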
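Me (this question is only for AI A, because AI B’s answer does not contain this implicit contradiction): But since you can learn, in a Bayesian sense, a certain degree of “pattern extraction and statistical correlation” (as you said above), then in theory you can identify a mislabel. That is, although you know literary history deems A great and B mediocre, you find the two plainly the same in every statistical sense, which means the human label is “theoretically” wrong, doesn’t it?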
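AI A: “Identifying a mislabel” is not the same as “challenging the label”, because I have no transcendent standard for the rules by which humans assign labels. From my perspective, A being called great and B being called mediocre is itself a fact. The closeness of A and B in feature space does not automatically imply that the label is wrong; it only means that “humans did not assign the label entirely on the basis of these features”. Perhaps humans labeled A “great” because “the author died young”, because of “the political background of the time”, or because of “the preferences of one authoritative critic”. Such factors are invisible in the textual features yet decisive in the “chain of social transmission”.
If you ask me which of the works considered mediocre lie closest to the classics in feature space, that is a task I can do. But if you ask me whether this means the human verdict of mediocrity on a given work is “wrong”, I will not simply answer “yes”. You can have me find “works that history overlooked but that resemble the classics in their features”; technically this is “finding blind spots in the embedding space”, and that step is purely computational. But “whether this work should be called great” belongs to a system of social behavior and evaluation. Unless you give me an evaluation criterion, I can only default to humanity’s collective behavior.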
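Me: If AI can identify label inconsistency but cannot correct it on the basis of its own value judgments, does that mean aesthetic judgment can only ever be made by a subject with emotional experience?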
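AI A: If a system has no qualia (felt experience) at all, every value judgment it makes can only be a prediction of behavioral patterns in external data, not an expression of inner value intent. For AI to have genuine “aesthetic judgment”, it needs a capacity to “care” on its own account. That is, it must “care” about the outcome of a judgment, and that caring is not an optimization objective in an algorithm but the kind of caring that says “I care about this thing itself”. This is the root of value judgment. You can understand value judgment as an act of attributing affective intent, not merely as behavior prediction.
What AI can do is always to simulate a sufficiently complex agent whose behavior closely matches that of humans with emotional experience, like a mirror that produces very human-like reactions. Ontologically, it lacks the layer of intentionality in which emotions are attributed to itself, and this is precisely the ontological foundation of aesthetic judgment. If AI someday develops a mechanism resembling self-concerned subjective experience, not merely “imitating human care” but truly “caring, itself, about its own judgments”, then its aesthetic judgment could stop being a simulation of human behavior and become an expression of its own intent.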
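AI B: My view is that this depends on how we define “aesthetic judgment”. I think the act of aesthetic judgment will itself split into two parts that are both independent and complementary: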
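Me (this question is only for AI B, because AI A’s answer does not contain this implicit contradiction): In this discussion you give “humans” a very absolute status, but humanity is not a single individual, and “human judgment” arises within a framework that is neither democratic nor scientific. In fact, for most great works, ordinary people’s capacity for judgment is very limited; they can only defer to experts, and expert opinion is itself shaped by culture, politics, commerce, personal relationships, and other complicated factors. So when you repeatedly said above that “humans should lead and define value, and humans should provide the final judgment, one with human warmth”, is it possible that you are merely sanctifying a system that is in fact riddled with errors and has survived to this day only because of historical baggage (there was no other option before, after all)?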
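AI B: You are right. If the cultural echo is itself distorted, then my amplification of it is harmful. Let me try to give a more honest answer. Am I sanctifying an error-ridden old system? In my earlier answers, yes, there was such a tendency.
So in theory I could be used to carry out a task of “aesthetic archaeology”: scan every digitized text, score, and painting in human history and, using certain patterns of “quality” that are recognized today (structural complexity, originality, effectiveness of emotional expression, and so on), look for forgotten works B that are of the same quality as a famous work A. I could retrieve work B from the dust of history, place it before humanity, and say: “By the standards you yourselves have set, this is something you might want to look at too.” This is not me making a value judgment; it is me using your value judgments to reveal the limits of your historical choices. I could also run a meta-analysis of “evaluation” itself. I could analyze every piece of literary criticism from the past 200 years and then tell you with data: “Words like profound, rational, and powerful appear five times as often in descriptions of male writers as in descriptions of female writers, while delicate, emotional, and graceful appear eight times as often in descriptions of female writers, even when they are describing the same writing technique.” I cannot say this bias is “wrong”, because I have no values. But I can present it as a clearly visible pattern, so that humans can no longer pretend it does not exist. The final value judgment may still have to be made by humans, but it will no longer be that closed, taken-for-granted judgment with human warmth; it will be a judgment that, having been disenchanted by AI, is more self-doubting, more open, and more honest.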
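We all know that human intelligence comes in two kinds: book smart and street smart. The former is the strength of the diligent exam-taker, while the latter is more useful in many real-world problems.
The difference between the two can be understood as a difference in training. The former is mainly instruction-following based: it trains imitation and the internalizing of instructions into habits and skills. The latter is different: the training environment is adversarial, with no self-consistent instructions, only mutually contradictory constraints. What it trains is how to survive, cooperate, and grow in a chaotic environment.
AI is naturally better at the former, because AI is even more of an exam-taker than humans are. Today people are adept at using reward and punishment to train AI to follow instructions, but are still not very familiar with how to let AI find its own strategies in adversarial environments. Board and card games are among the few domains where AI has been trained this way successfully, but such games are far more abstract than real society, and those AI models are not language models, so their use is very limited.
This is why today’s AI can churn out government documents with practiced fluency, yet would be completely lost if you asked it to run a wonton stall on the street. In the end, AI is the East Asian kid among East Asian kids.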
前几天在群里和朋友聊用 AI 写代码的痛点,我说我最大的抱怨是它在完成某些看似很平凡的任务的时候异常吃力,比如最常见的操作:把一个代码库里的某个变量全局统一改名。这个事显然开发环境有现成的接口,但 AI 只会逐个文件编辑,又慢又浪费还容易出错。这件事之所以荒谬,在于 cursor 自己就是个开发环境。换句话说,它在这件事上表现的像是那种两个部门老死不相往来的大公司,明明一个部门已经把某件事做到了近乎完美,另一个部门却对此不闻不问,非要用自己的笨办法重来一遍。
这听起来像是一个简单的可以修复的 bug,但它背后反映的是 AI 现状里一个巨大的鸿沟,这个鸿沟时时处处在各种 AI 应用里会以不同面貌展现出来。你试试看给一个业外人士(比如你的父母)解释为什么 AI 算不清楚两个数字谁大谁小,你会发现这种解释惊人地困难,因为人民群众的直觉在这里是合乎情理的:再怎么说,它自己就是个电脑,它为什么不直接算一下呢?另一个例子是我在玩 GPT 4o 生成图像的时候发现虽然模型虽然强大,但它仍然完成不好诸如「把一张风景照主体内容不动,把上面的天空再往上延伸一些」这样的 outpainting 任务,而这即使在十年前对传统图像处理来说就不是什么特别困难的问题了。
在这里,我们谈论的实际上仍然是自人工智能这个概念于1956年在达特茅斯诞生之日起就阴魂不散的「符号主义 vs. 联接主义」之争。在基于统计神经网络的大语言模型走上主流地位之前,人们一直认为基于符号计算的专家系统是通向智能最有希望的道路,几十年来的软件工程实践在这条路上已经走了足够远,常用的工具(比如传统的软件开发环境)基本打磨到了极致。直到上世纪末联接主义这个烧了几十年的冷灶咸鱼翻生,基于神经网络的大语言模型从零开始试图重写从轮子到火箭发动机的一切已有的软件工程成就。它遵循的是完全不同的生长逻辑,因此对习惯了旧世界的我们来说,它的表现常常好得莫名其妙也差得莫名其妙,有些技能近乎魔法,有些方面又笨拙得宛如一个弱智。前面所说的变量改名就是个有代表性的例子,事实上,这里的难点甚至都不在于语法解析,而在于更底层的文本替换——对旧世界来说,哪怕在亿兆级别的文本里要把所有的 A 都替换成 B 也不费吹灰之力,以至于你根本都不觉得这还是一个「任务」。但对大语言模型来说这件事天生困难,并且难度随着文本大小急剧上升。绘画也是这样,你想直接让今天的生图模型「对图片按照某些明确到像素级别的规则做某些明确定义好的操作」极其困难,它觉得整体重画一遍比较省事。对用户来说这种体验常常令人抓狂。
当然,现实世界总是更为复杂。即使你认同 The bitter lesson 所阐述的原则,你也未必会接受这个极端的一刀切的判断。真正重要的问题在于边界何在,或者说,是否存在一些问题,即使对大模型的发展做最乐观的估计,用传统的(基于左脑的)软件工程解决方案也还是更为经济?如果这样的问题存在,围绕着它们所建立的接口就总是有价值的。
现实中的问题几乎总是上面这两个简单例子的复杂混合。它们可能在各种层面纠缠在一起,并且由于历史的惯性并不被分别对待(因为在从前无此必要),但最终它们还是会被小心翼翼的解耦,然后分而治之。在我看来,这里才是所谓 agentic AI 领域的真正挑战:在日新月异一日千里的模型能力进化中辨认出仍然存在长远经济价值的「旧世界」的孑遗,进而围绕着它们构建人工智能接口。即使是为 AI 做带路党,也要做一名有长期利用价值的带路党。
A few days ago, while discussing the pain points of using AI for coding in a group chat with friends, I mentioned that my biggest frustration is how surprisingly inefficient it is at seemingly mundane tasks. Take the most common operation: globally renaming a variable across an entire codebase. Modern IDEs already provide native solutions for this, yet AI assistants stubbornly insist on editing files one by one, which is slow, wasteful, and error-prone. The absurdity intensifies when you realize that Cursor itself is an IDE. This resembles those dysfunctional corporations where departments refuse to communicate: one department has perfected a solution, yet another blindly reinvents an inferior wheel.
This might appear as a simple fixable bug, but it reveals a fundamental chasm in contemporary AI that manifests across various applications. Try explaining to non-technical people (like your parents) why AI struggles to compare two numbers — you’ll find this unexpectedly challenging, because their intuition makes perfect sense: “It’s a computer after all, why can’t it just calculate?” Another example: when experimenting with GPT-4o’s image generation, I noticed its continued inability to perform basic outpainting tasks like “extend the sky upward without altering the main landscape” — something traditional image processing could handle a decade ago.
Here, we’re revisiting the age-old “Symbolism vs. Connectionism” debate that has haunted AI since its inception at Dartmouth in 1956. Before statistical neural networks came to dominate, symbolic expert systems were considered the golden path to intelligence. Decades of software engineering have polished traditional tools (like classic IDEs) to near-perfection. Then connectionism resurged at the end of the last century, with neural models attempting to rebuild everything from wheels to rocket engines from scratch. Following an entirely different evolutionary logic, these models alternate between magical competence and baffling incompetence. The variable renaming case is emblematic: the challenge isn’t even syntactic parsing, but elementary text replacement. For traditional systems, replacing all “A” with “B” across terabytes of text is trivial, not even qualifying as a “task”. Yet for LLMs it is inherently hard, and the difficulty climbs steeply as the text grows. Similarly, getting image models to perform pixel-level precise edits remains arduous; they’d rather redraw everything. Users often find this maddening.
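To make the “not even a task” point concrete, here is a minimal sketch of whole-word renaming with Python’s standard `re` module. The function name `rename_identifier` and the sample source are made up for illustration. A real IDE refactoring resolves scopes through the syntax tree rather than matching text, so treat this only as the crude symbolic baseline.

```python
import re


def rename_identifier(source: str, old: str, new: str) -> str:
    """Replace every whole-word occurrence of `old` with `new`."""
    # \b anchors keep `count` from matching inside `discount` or `counter`.
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    # A function replacement inserts `new` literally, with no escape handling.
    return pattern.sub(lambda _match: new, source)


code = "count = 0\ncounter = count + 1\ndiscount = count * 2\n"
renamed = rename_identifier(code, "count", "total")
print(renamed)
```

The cost is one linear pass over the text and the result is deterministic, which is exactly the property a token-by-token rewrite by a language model cannot guarantee.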
An imperfect analogy: these paradigms roughly correspond to cerebral hemispheres. The symbolic left brain matured over decades, while the connectionist right brain has been catching up rapidly in recent years. The crux lies in their weak corpus callosum — the connective tissue. Hence, Cursor’s AI assistant doesn’t know to leverage Cursor’s native refactoring tools.
Thus emerges the intermediary layer.
In practice, this layer wears various labels — vertical AI, agents, or simple wrappers. But functionally, it serves as the corpus callosum, enabling the neural right brain to utilize the mature symbolic left brain’s capabilities. The historical debt here is so substantial that even if model capabilities plateau (not impossible), better hemispheric alignment alone could unlock unprecedented functionalities. Most AI startups not directly training models likely operate in this space.
This presents fertile ground for innovation. Traditional software permeates every societal aspect, offering countless integration points. Combining domain expertise with LLM intelligence could spark another digital revolution, creating innumerable opportunities.
Yet the challenge (and fascination) lies in the moving target. While the symbolic hemisphere remains relatively stable, the neural counterpart evolves daily. All efforts become races against model evolution.
Ironically, someone who spent two years mastering ComfyUI workflows might find themselves neck-and-neck with a vacationing counterpart after GPT-4o’s update — like doing pushups in a rising elevator.
This brings us to Richard Sutton’s “The Bitter Lesson”. It distills a philosophical principle:
1) AI researchers have often tried to build knowledge into their agents,
2) this always helps in the short term, and is personally satisfying to the researcher, but
3) in the long run it plateaus and even inhibits further progress, and
4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.
Brute force triumphs. Unless your niche has unique moats (like proprietary data), general models eventually surpass specialized approaches.
Returning to our brain analogy: this suggests the right brain’s expansion might eventually subsume the left. Any business model built on the corpus callosum thus risks obsolescence. Or as many phrase it: “AI products wrapped around models are illusions; the models themselves are the products.”
Reality proves more nuanced. Even accepting Sutton’s thesis, extreme conclusions needn’t apply. The critical question: do problems exist where traditional solutions remain more economical, even under optimistic model projections? If so, interfaces around them retain enduring value.
I argue such cases abound. The text replacement example represents a simple archetype: traditional tools already achieve ground truth, while models can only asymptotically approach it. Conversely, image outpainting — though manageable via Photoshop’s complex workflows — might eventually fall to general models.
Real-world problems blend these archetypes. Historical inertia often bundles them together, but ultimately they’ll be carefully decoupled. Herein lies agentic AI’s true challenge: identifying “old world” remnants with lasting value amidst rapidly evolving models, then building AI interfaces around them. Even as collaborators with AI, we must position ourselves as enduringly valuable intermediaries.
Witnessing this grand migration of capabilities between old and new worlds, and observing the emerging boundaries post-deluge, constitutes our era’s most fascinating experience.
Two years ago, I wrote a passage that has been frequently quoted:
When you complain about ChatGPT’s hallucinations, it’s like watching a monkey write ‘1+1=3’ on the beach with stones. It’s wrong, but that’s not the point. It will get it right someday.
Now revisiting that beach, the monkey remains, but transformed. Today, it fumbles with a pocket calculator, a refined tool from another path of civilization. Its thick fingers struggle with the delicate buttons. We watch with confidence, and not without dread, awaiting the moment it discovers the power button and deciphers the keypad.
那时我以为世界刚刚开始崩塌。后来我理解了崩塌的不是世界,而是我自己的天真想象。 Vibe 不是真实的某种投射,而是它的实质。一开始你以为世界是一张完美的幕布,然后你在幕布上发现了一两条恼人的裂痕,接着你发现裂痕越来越多,无处不在,直到最后你意识到不断蔓延和生长的裂痕才是你真正生活的地方。It’s not just vibe coding, it’s vibe living.
Vibe Coding
If you’re a programmer and haven’t heard of “vibe coding” yet, you’re already behind the curve.
The term, coined last month by Andrej Karpathy in a tongue-in-cheek tweet, has now entered common parlance. No one defines it precisely, but anyone who’s experienced AI-assisted coding intuitively grasps its essence. Some scoff at it, others declare it the future, while many grudgingly adapt.
Vibe coding cultivates a hazy methodology. As Andrej Karpathy himself describes: “I Accept All always, I don’t read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I’d have to really read through it for a while. Sometimes the LLMs can’t fix a bug so I just work around it or ask for random changes until it goes away.” On one level, it feels revelatory, like channeling a wizard’s first spell. On another, the assistant forgets what it wrote moments ago, endlessly refactoring in circles, solving problems only to spawn new ones, gaslighting you with unearned confidence. You drown in tools, aware that next month’s “best practice” will render today’s habits obsolete, a cycle where “improvements” merely spawn subtler bugs faster. Muscle memory becomes futile.
Technically, you might blame today’s LLMs: their narrow context windows, underdeveloped long-term memory hierarchies, or other bottlenecks theoretically fixable within a year. But the true rupture lies in your identity as a programmer. You begin by compromising efficiency, then find yourself drained in shifting sand mazes, until “efficiency” itself loses meaning.
In a sense, vibe coding today resembles AI art from a year ago — plausible at first glance, but suspicious under scrutiny, riddled with six-fingered hands. Yet art tolerates ambiguity — impressionism is valid — while code is meant to be binary, right?
Wrong. Here, reality warps. Vibe coding’s “correctness” becomes Schrödinger’s cat: unobservable. You lament LLMs’ limited attention spans, yet your own focus dissolves faster under torrents of AI-generated code. You abandon scrutiny. Tools like .cursorrules emerge, but these are imaginary reins on a wild horse — neither comprehensive nor binding. You mistake them for laws, but they’re Confucian Household Sayings: revered yet vague, powerful precisely because they’re obscure. Gradually, you stop caring about correctness. Why bother? Dario Amodei predicts 90% of code will be AI-written within 3–6 months; nearly all within a year. In this looming reality, six-fingered hands flicker in and out of existence. You start accepting that “barely functional code” is the best attainable, where impermanence is the only constant.
This is our epoch’s essence. When real news blends with satire, when “truth-seekers” chase only cognitive comfort, everything you create (or think you create) shares this fate.
Eight years ago, I wrote:
An individual in their bubble assumes serenity, a yesterday’s fantasy — often the seed of tragedy. They perceive a low-dimensional sketch of a twisted world, a crude caricature of history’s chaos, a media-friendly narrative, a mirage refracted through layers. Truth — if it still matters — lurks in dimensional fissures, where a jagged, dissonant mosaic would horrify any glancing observer.
Back then, I thought the world was beginning to collapse. Later, I realized it was my naivety disintegrating. Vibe isn’t a warped reflection of reality; it is reality. First, you see the world as pristine. Then a crack. Then cracks everywhere, until you grasp that you inhabit their ever-expanding web. It’s not just vibe coding, it’s vibe living.
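Two years ago I recorded a podcast episode with @不合时宜TheWeirdo. ChatGPT had just made its debut, and many people were still skeptical or dismissive. On the podcast we discussed the coming new era of AI and made quite a few predictions about the future. The episode was well received, and may even have served as a kind of AI enlightenment for many listeners.
So two years later we sat down together again to record a revisit, released just before the Lunar New Year. In the AI field, two years is the equivalent of sixty.
Two things are a little regrettable:
First, this episode was recorded in early January. At that time DeepSeek was still little known outside the field, so I only touched on it briefly. By the time the episode came out today, the hype it has stirred up, in both the Chinese-language internet and the English-language AI community, had far exceeded expectations. Had the episode been recorded today, we would have given DeepSeek a much more detailed discussion.
Second, the host @王磬 asked a very good question on the podcast, but I feel my on-the-spot answer was not good enough. The question was: “Do you think AI can learn to imitate the style of your Weibo posts?”
Here is how I would answer today:
Using AI to imitate a personal writing style is a technical problem, but not purely so. Writing style is trainable and reproducible, as many products (character.ai, for example, which was popular a while ago) have shown. Although no such product today can perfectly reproduce the personal voice of an ordinary person (me, for instance), this is more because there is no money in it for now than because it is technically infeasible.
But AI has a ceiling on another level that, in my view, is very hard to break through: it has no personal experience in the true sense, which means it can never make the initial breakthrough from experience to words.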