The GPT-4 training details OpenAI would not reveal "even on pain of death" have leaked: here is my interpretation

Original Source: Minority

Image source: Generated by Unbounded AI

It was an ordinary morning a few days ago. I was grinding through my usual work when messages started flooding in from every direction: "Quick, the GPT-4 model architecture has been leaked; the domestic large models are going to surpass it again!"

I opened social media to look, and sure enough, you didn't even need to read the English original: domestic accounts were already all over it. The speed genuinely impressed me. But when I tried to trace the claim back to its source and gauge how reliable it was, I suddenly felt I had wandered out of the tech world and into the entertainment-gossip world.

Given how much fake news is flying around the Internet these days, the first thing I did after seeing the story was to trace it to its source.

▍Ins and outs

My digging started with a Twitter thread shared on Hacker News, extracted via Thread Reader (archived July 11). Open it, and the first thing you see is two sentences:

GPT-4's details are leaked. It is over.

The clickbait level here is every bit as bad as back home.

As everyone knows, when OpenAI released GPT-4 it broke its commitment to openness, disclosing no weights or technical details, and was widely criticized for it. That is probably why the author leans on the "It is over" meme to play up the drama of a plot reversal.

As for the content itself, it is exactly the GPT-4 training detail OpenAI has kept tight-lipped about. There has been plenty of speculation, but officially nothing has been disclosed, and when it is mentioned at all it is mentioned vaguely. (The original thread is fairly dense, with lots of abbreviations and jargon; some of it is explained later.) The main claims:

  • Parameter count: about 1.8 trillion, roughly 10 times GPT-3.5 (175 billion).
  • Model depth: 120 layers.
  • Architecture: a Mixture of Experts (MoE, explained below) with 16 "experts" of about 111 billion parameters each. Each forward pass of inference (generating one output token) selects two experts.
  • Training data: about 13T (13 trillion) tokens in total, with the text data trained for 2 epochs and the code data for 4 epochs. This detail matters a lot and is analyzed further below.
  • Parallelism strategy: 8-way tensor parallelism combined with 16-way pipeline parallelism, with multiple GPU clusters in different data centers training simultaneously; each cluster holds 128 GPUs.
  • Pre-training context length: 8K; the 32K version is fine-tuned from the 8K one.
  • Training cost: roughly 90 to 100 days of continuous training on about 25,000 A100s, about 2.15e25 FLOPs in total. At about $1 per A100-hour, that works out to roughly $63 million. (Today the same run could reportedly be done in about 55 days on roughly 8,192 H100s for an estimated $21.5 million.)
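A quick back-of-the-envelope check of those cost figures (a minimal sketch; the GPU counts, durations, and hourly prices are simply the rumored numbers, and the quoted $63 million presumably assumes a slightly longer run or a higher hourly rate than the midpoint used here):

```python
# Sanity-check the rumored training-cost arithmetic.
def gpu_cost(num_gpus: int, days: float, price_per_gpu_hour: float) -> float:
    """Total rental cost of a cluster running continuously for `days` days."""
    return num_gpus * days * 24 * price_per_gpu_hour

# Rumored A100 run: ~25,000 GPUs for 90-100 days at ~$1 per GPU-hour.
print(f"A100 run: ${gpu_cost(25_000, 95, 1.0):,.0f}")   # ~$57M, same ballpark as the quoted ~$63M

# Rumored H100 equivalent: ~8,192 GPUs for ~55 days at ~$2 per GPU-hour.
print(f"H100 run: ${gpu_cost(8_192, 55, 2.0):,.0f}")    # ~$21.6M, matching the quoted ~$21.5M
```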

The question is: where did this information come from, and how reliable is it?

Following the vine to the melon, as the saying goes, I found the author of the thread: Yam Peleg.

I don't follow his account, but I have read his earlier posts. He is the CEO of a "startup" in Israel (though the company has existed for 15 years, so "startup" may be a stretch), has solid engineering experience, and knows large language models well; he has tried to reverse-engineer GPT-4 and the ChatGPT Code Interpreter. In June this year, when OpenAI staff visited Israel, Peleg joined the discussions and even got a photo with CEO Sam Altman.

Reading his posts, I can't help thinking of Tom, a student liaison I met in Israel, who could get your blood pumping no matter what he was talking about.

From left: Sam Altman, Yam Peleg (Source: @Yampeleg)

Given that he has been digging into OpenAI for a while and knows people inside the company, if he really did get hold of some internal information, I think its credibility is actually fairly high.

But when I sat down that evening to study his posts carefully, I found he had deleted them all. My first thought was that OpenAI had leaned on him, and I was glad I had kept an archive. On closer inspection, though, it turned out OpenAI had not demanded any takedown: he had reposted the material from a paid column, and the copyright holder had complained.

The original source is a Substack newsletter called SemiAnalysis, which had earlier published an article entitled GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE behind a paywall.

After checking it out, I found out:

SemiAnalysis is a boutique semiconductor research and consulting firm focused on the semiconductor supply chain, from chemical feedstock to fabs to design IP and strategy. It was founded by Dylan Patel, an analyst and engineer with many years of experience in the semiconductor industry who has held roles ranging from design engineer to marketing manager at Intel, AMD, Qualcomm, and others. The SemiAnalysis team also includes a number of professional analysts and consultants, each with a different specialty, such as AI, cloud computing, networking, storage, electric vehicles, RF, and the Internet of Things, and together they offer clients end-to-end analysis and consulting across that supply chain.

Earlier, SemiAnalysis also published the leaked internal Google memo "We Have No Moat, And Neither Does OpenAI", which caused a great deal of discussion and was later confirmed to be genuine.

Seen in that light, Dylan Patel may indeed have some inside sources, and the credibility of the information they publish is probably acceptable.

As for why they were so eager to have Yam delete his tweets: this "inside information" is genuinely valuable. A subscription to SemiAnalysis' paid articles costs $500 a year, and the elite tier Yam subscribes to costs $1,000.

▍Analyzing the leak

Given this chain of events, my view is that the rumor carries a fair degree of credibility. What follows is my own analysis based on it, offered for discussion.

Competition among proprietary models will come down to parallelism

According to the rumor, training a GPT-4-class competitor on about 8,192 H100s at $2 per GPU-hour would take roughly 55 days of pre-training and cost about $21.5 million (around 150 million RMB).

In today's overheated LLM market that cost is not prohibitive; the major domestic players could easily afford several such training runs. So, honestly, claiming to match GPT-4 within half a year, at least in parameter scale, may not be pure bragging this time.

If training cost is not an issue, will training data be? I don't think so either. The rumor puts GPT-4's training data at about 13T (13 trillion) tokens. For comparison, the public CommonCrawl-derived and RefinedWeb datasets each contain on the order of 5T tokens. The rest is rumored to come from Twitter, Reddit, and YouTube, and some lawsuits also claim that OpenAI used pirated data from "shadow libraries" such as LibGen and Sci-Hub.
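A rough calculation supports that view. Here is a minimal sketch, reading the rumored 13T as the total number of tokens seen across epochs; the split between text and code below is purely my own hypothetical assumption, since the leak only gives the total and the epoch counts:

```python
# Hypothetical: how many *unique* tokens does the rumored recipe imply?
TOTAL_SEEN = 13e12                 # rumored total tokens seen during training
CODE_EPOCHS, TEXT_EPOCHS = 4, 2    # rumored epoch counts
CODE_UNIQUE = 1e12                 # my assumption: ~1T unique code tokens

text_unique = (TOTAL_SEEN - CODE_UNIQUE * CODE_EPOCHS) / TEXT_EPOCHS
print(f"Implied unique text tokens: {text_unique:.1e}")   # ~4.5e12, i.e. about 4.5T
```

Under that (arbitrary) split, the unique text corpus would be around 4.5T tokens, comparable to what the public CommonCrawl-derived and RefinedWeb corpora already offer, so the raw scale, at least, is within reach.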

So I don't think data at this scale is unattainable. On top of that, China has accumulated plenty of Chinese-language resources of its own, so training data should not be a major obstacle.

As for the other pieces, such as pre-training, fine-tuning, and Chinese tokenization, there are not many real technical secrets; the methods are fairly open, and given enough resources they should be solvable within half a year.

That leaves one final hurdle: parallelism. The leaked material actually devotes a great deal of space to it, at a fairly professional level; I can only offer a superficial explanation here.

Roughly speaking, the parallelism problem is this: given a very large model, how do you let the largest number of people use it at the lowest cost? That raises a pile of serious design questions. With a fixed pool of compute, how do you allocate resources across the different stages? How do you handle concurrency? How do you manage memory?
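On the training side, the rumored layout already shows how these pieces fit together. A minimal sketch of the arithmetic (the figures are just the rumored ones; how the 120 layers are actually balanced across pipeline stages is not specified, so the even division below is my assumption):

```python
# Rumored GPT-4 training parallelism: 8-way tensor x 16-way pipeline parallelism.
TENSOR_PARALLEL = 8        # each layer's weights sharded across 8 GPUs
PIPELINE_PARALLEL = 16     # the model split into 16 sequential stages
TOTAL_GPUS = 25_000        # rumored A100 count
NUM_LAYERS = 120

gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL     # GPUs holding one full copy of the model
data_parallel_replicas = TOTAL_GPUS // gpus_per_replica    # copies training in data parallel
layers_per_stage = NUM_LAYERS / PIPELINE_PARALLEL

print(gpus_per_replica)        # 128 -> one replica exactly fills a rumored 128-GPU cluster
print(data_parallel_replicas)  # ~195 replicas across the fleet
print(layers_per_stage)        # 7.5 -> stages cannot all hold the same number of layers
```

The fact that one model replica exactly fills a rumored 128-GPU cluster is one reason the parallelism numbers at least look internally consistent.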

Serving-side parallelism directly determines the user experience. Right now ChatGPT and the GPT-3.5 API feel reasonably smooth, which is impressive in itself. You might object that some domestic LLMs, or Claude, feel faster than GPT-3.5, but that ignores the difference in scale: GPT-3.5 delivers that performance under far higher concurrency. If other vendors cannot match OpenAI on this front, they will not be able to take OpenAI's market.

Parallelism, then, may become one of the key battlegrounds for OpenAI's competitors.

GPT-5 will focus on multimodality

As mentioned above, GPT-4 is rumored to be a "mixture of experts" (MoE) model built from 16 expert sub-models. Briefly, the idea behind a mixture of experts is to split the work into pieces, hand each piece to a smaller model (an "expert"), and let a "routing model" decide which experts to use and how to combine their outputs before returning a result to the user.

Rumor further has it that each of GPT-4's "experts" holds about 111 billion parameters, roughly the size of GPT-3 (which squares with Sam Altman's earlier hint that GPT-4's parameters are, in one sense, even smaller than GPT-3.5's), and that about 55 billion parameters are shared across experts. Each forward pass of inference (generating one output token) uses two experts, so it effectively touches only about 280 billion parameters. That is far fewer than a dense model of the same total size would require, and it is close to what many researchers had predicted beforehand.
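To make the mechanism concrete, here is a toy top-2 MoE layer in PyTorch. It only illustrates the routing idea, namely that a gate picks two experts per token and mixes their outputs; the sizes, the gating details, and everything else about GPT-4's real implementation are unknown, so treat all specifics here as placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-2 routing."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # the "routing model"
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                    # (num_tokens, num_experts)
        top2 = scores.topk(2, dim=-1)              # two experts chosen per token
        weights = F.softmax(top2.values, dim=-1)   # mixing weights for the chosen pair
        out = torch.zeros_like(x)
        for slot in range(2):                      # first and second chosen expert
            for e, expert in enumerate(self.experts):
                mask = top2.indices[:, slot] == e  # tokens whose slot-th pick is expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

# Tiny usage example with toy sizes, nothing like GPT-4's rumored dimensions.
layer = Top2MoELayer(d_model=64, d_ff=256)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)   # torch.Size([10, 64])
```

In the rumored GPT-4 setup each expert would be a ~111B-parameter network rather than a two-layer toy, and only the two selected experts are exercised per token, which is how a model with ~1.8 trillion total parameters can touch only about 280 billion per forward pass (two experts at ~111B each plus the ~55B shared parameters).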

It is worth noting that the rumor says the text and code data used to train GPT-4 were reused across epochs. Combined with the choice of an MoE architecture, my personal guess is one of two things: either the easily obtainable high-quality text data is close to exhausted, or the gains from simply piling on more data are already marginal.

Either way, if GPT-5 is to deliver a major performance leap, it will have to make full use of the vast amounts of existing video, image, and audio data; in other words, it will have to be a "multimodal" model.

The problem, according to the rumor, is that OpenAI's current approach to visual multimodality does not have much to offer: a separate vision encoder bolted onto a model pre-trained on text alone, then fine-tuned with roughly another 2 trillion tokens. A recipe like that clearly cannot make full use of the existing video, image, and audio data.
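For intuition only, this is the general shape of such a "bolt-on" design: a separately trained vision encoder whose features are projected into the embedding space of a text-pretrained language model. This is a generic pattern used by several open multimodal models, not a description of OpenAI's actual architecture, and every module and dimension below is a stand-in:

```python
import torch
import torch.nn as nn

class BoltOnVisionLM(nn.Module):
    """Generic sketch: a separate vision encoder feeding a text-pretrained LM."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, lm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder              # trained separately on images
        self.language_model = language_model              # pre-trained on text only
        self.projector = nn.Linear(vision_dim, lm_dim)    # the glue that fine-tuning mostly trains

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        image_tokens = self.projector(self.vision_encoder(image_feats))  # (B, N_img, lm_dim)
        inputs = torch.cat([image_tokens, text_embeds], dim=1)           # prepend image tokens
        return self.language_model(inputs)

# Toy usage with stand-in modules, purely to show the data flow.
model = BoltOnVisionLM(
    vision_encoder=nn.Linear(768, 256),  # pretend ViT features: (B, N, 768) -> (B, N, 256)
    language_model=nn.Identity(),        # stand-in for an LM that accepts embeddings directly
    vision_dim=256, lm_dim=512,
)
image_feats = torch.randn(2, 49, 768)    # "image" here is already patch features, for simplicity
text_embeds = torch.randn(2, 16, 512)
print(model(image_feats, text_embeds).shape)   # torch.Size([2, 65, 512])
```

The limitation the author is pointing at is that the language model itself never sees images during pre-training, so most of the world's video, image, and audio data contributes nothing to its core capabilities.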

This is why OpenAI's repeated insistence that it has not started training GPT-5 is probably true. Before training GPT-5 they need a better multimodal architecture, one that lets the model genuinely exploit audio and video data; only with that high-quality training data can GPT-5 gain a sufficient capability boost. (And if GPT-5 really can make full use of that audio and video data, then whether you call the goal AGI or the "superintelligence" OpenAI recently started talking about, it suddenly does not look so far away.)

OpenAI may have intentionally released this rumor

This last inference is pure personal speculation, short on hard facts; take it with a grain of salt.

My read is that OpenAI knows perfectly well that GPT-4's moat is not deep, and in the current frenzy it would not be hard for competitors to catch up. As analyzed above, their multimodal architecture is probably not finalized either. If new players were to break through on multimodality first, the odds of OpenAI being leapfrogged are quite real.

So this may be OpenAI's delaying tactic: leak a bit of GPT-4 information, let the leading players busy themselves reproducing GPT-4, and have them walk the road OpenAI has already walked.

If, in the meantime, OpenAI lays the groundwork for training GPT-5 and finishes its preliminary research on multimodal models, then even if GPT-4 is surpassed by other large language models, OpenAI has no reason to panic. Personally, I suspect multimodality may be the last generation in which humans are the main drivers, with AGI taking over much of the model development and evolution after that. In other words, whoever wins this round may win for good.
