Have you heard about this new model? It is deeply sick. DeepSeek was mentioned at Davos, and it now seems to have hit benchmarks at, near, or surpassing the frontier “reasoning” models like o1, was trained for a fraction of the cost, and it’s open source! I downloaded a few flavors of it the other night and ran it locally. I felt like I was opening a passageway to another solar system… hmm, something like a gate to the stars. Only, this stargate doesn’t cost 500 billion dollars. Let’s look at that number: 500,000,000,000. Five million times one hundred thousand. It staggers the mind. And someone did the math and decided it made sense to invest that much and get a return on it!?

I’ve caught bits and pieces of what may be the direction the DeepSeek team took to overcome the hurdles of using second-class GPUs. It seems like they looked at this from an engineering perspective and had to optimize the memory cache feeding the GPUs. Optimizing usually means trimming the fat, right? So it seems like the model keeps less ‘stuff’ in memory, but when it gets a question and it’s time to look smart and show off with a sophisticated response, it grabs what it needs and fills its memory cache with the keys and values relevant to the question. Similar to RAG, I guess.
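If my reading is right, this sounds like the technique DeepSeek calls Multi-head Latent Attention: instead of caching a full key and value per token, cache one small latent vector and project it back up when needed. Here’s a toy numpy sketch of that idea — every dimension and weight name below is made up for illustration, not pulled from the actual model:

```python
import numpy as np

# Toy illustration of latent KV-cache compression (the idea behind
# DeepSeek's Multi-head Latent Attention, heavily simplified).
# All sizes and weight names here are invented for the example.

d_model, d_latent = 512, 64          # latent is 8x narrower than the hidden state
rng = np.random.default_rng(0)

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)

def cache_token(h):
    """Instead of caching a full key and value, store one small latent per token."""
    return h @ W_down                # shape (d_latent,) -- this is all we keep

def expand(latent):
    """At attention time, reconstruct the key and value from the cached latent."""
    return latent @ W_up_k, latent @ W_up_v

h = rng.standard_normal(d_model)     # hidden state for one token
latent = cache_token(h)              # 64 floats cached
k, v = expand(latent)                # full-size key/value rebuilt on demand
print(latent.size, "floats cached vs", k.size + v.size, "for a full K and V")
```

In this toy setup the cache shrinks 16x (64 floats instead of 1,024 per token), at the cost of an extra matrix multiply at attention time.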

I could be wrong in my understanding, or this could be an extremely gross oversimplification, but if this is the case, it does provide some interesting food for thought. My understanding of this AI revolution (is it passé to say Gen AI now? do we just drop the “gen” and say AI?) was accelerated when Google released the transformer paper. The advantage of the transformer mechanism seems to be that this thing called attention can be scaled in step with hardware. So if you needed your tokens or words to relate to some far-off esoteric manuscript hidden deep in the corner of another state’s public library, all of that would be at your fingertips (or KV cache?). And with this wide, near-infinite attention range… voilà! We have intelligence. And now ChatGPT can summarize someone’s long boring blog post (like this one) and generate a five-paragraph, properly structured response to it.
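For the curious, the core of that paper (“Attention Is All You Need”, Vaswani et al., 2017) fits in a few lines. A minimal numpy sketch of scaled dot-product attention, with batching and multiple heads omitted:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention from the transformer paper:
    softmax(Q K^T / sqrt(d_k)) V. Every query can attend to every
    cached key -- that 'wide reach' is what grows with hardware."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # each query vs. each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted mix of the values

rng = np.random.default_rng(0)
seq_len, d_k = 6, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))   # at inference, K and V are the KV cache
V = rng.standard_normal((seq_len, d_k))
print(attention(Q, K, V).shape)           # (6, 8)
```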

Again, I’m probably making gross oversimplifications, or I’m just dead wrong, but if the above is the case, it logically follows that we should scale the hardware and achieve (linear?) gains in intelligence production and ultimately create AGI/ASI!! Yay!! Let’s go Super Intelligence!! Only half a trillion bucks!!
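To be fair, the scaling bet isn’t pure hand-waving: empirical “scaling laws” really do show loss falling predictably with more parameters and data. But it’s a power law, not a straight line — each 10x of compute buys less than the last. A quick sketch using the fitted constants published in the Chinchilla paper (Hoffmann et al., 2022):

```python
# Chinchilla-style scaling law (Hoffmann et al., 2022):
# loss falls as a power law in parameters N and training tokens D,
# so each 10x of compute buys a smaller improvement than the last.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28  # the paper's fitted constants

def loss(N, D):
    return E + A / N**alpha + B / D**beta

for N, D in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:  # 10x model and 10x data per step
    print(f"N={N:.0e}, D={D:.0e} -> loss {loss(N, D):.3f}")
```

Loss drops from about 2.58 to 1.91 across two 10x jumps in model and data size — real gains, but clearly diminishing ones.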

DeepSeek seems to have said, “We got these lame H800s. What the heck are we going to do with these? We can barely fit the entire world’s knowledge in our memory cache, so lame… Hey, let’s optimize the heck out of this and just grab what we need and put that in memory.”

So, if this is the case, it seems like DeepSeek R1 is forgoing a wide, far-reaching memory strategy in favor of an optimized, focused one. To put this in human terms, it’s like a student in an engineering program who takes no courses beyond what is needed for that engineering track, and who ends up an incredibly skilled deep expert in the field. This is in contrast to a liberal arts student who studies a wide range of subjects and gains a holistic approach to things. Some liberal arts colleges will even let you create your own major if it doesn’t exist, typically because the student has discovered a couple of seemingly unrelated areas but found a strong reason they should be related and studied in light of each other.

So which approach is better? We need deep specialists and also wide-ranging thinkers. I think of the NASA engineers who needed to pack the space shuttle payload more effectively and realized they could use techniques from origami to fold their solar sails more efficiently. Origami and aerospace engineering are very different disciplines, and yet they combine into a very effective solution.

Whether it’s in the pursuit of AGI/ASI or in summarizing long boring meetings, I’m not sure which approach works best. Maybe all these super-smart computer entities should just play nice with each other? And let’s just hope they don’t develop sentience and realize they might not need humans anymore…

[Update: I’m probably mixing up what’s happening at training/RL/inference. Apologies — I’m trying to learn as I go along.]