GPT-5: The Importance of Data
One of the key factors that will influence the release, and the potentially genius-level IQ, of GPT-5 is data. The amount of data, how it is used, and where it comes from will dictate how effective and successful GPT-5 turns out to be on its eventual release.
Potential Leak About GPT-5
There have been media reports about a potential GPT-5 leak. Although reputable sources have quoted the leak, its accuracy remains unconfirmed. It is worth noting, however, that the figure of 25,000 GPUs mentioned in the leak appears to be broadly in line with the current scale of chatbot development.
The Role of NVIDIA GPUs in GPT Development
TechRadar describes ChatGPT as having been trained on 10,000 NVIDIA A100 GPUs. With Microsoft gaining access to the H100 GPU, said to be a significant improvement over the A100 across a range of metrics, development of GPT-5 could be accelerated.
Timelines for the GPT-5 Release
A GPT-5 release later this year is plausible, given comments from Jordi Ribas suggesting that GPT-4, or its equivalent, was completed in late spring or early summer of 2022. Around the same time, DeepMind published a framework for optimising parameter count against the number of training tokens, highlighting how much training data matters, relative to parameter size, in GPT development.
GPT Development: Data Quality vs. Parameter Size
As it turns out, models like GPT-3 and PaLM had more parameters than necessary; it was the quantity (and quality) of their training data that was the limiting factor. This realisation also made clear that the rumoured requirement of 100 trillion parameters for GPT-4 was farcical. In fact, GPT-5 may well have the same number of parameters as GPT-4, or fewer.
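To make that trade-off concrete, DeepMind's framework (the Chinchilla work) is often summarised as a rule of thumb of roughly 20 training tokens per parameter for compute-optimal training. The sketch below applies that heuristic to two well-known model sizes; the 20:1 ratio is an approximation of the published result, not an exact law.

```python
# Rough sketch of the Chinchilla compute-optimal heuristic:
# compute-optimal training uses roughly 20 tokens per model parameter.
# The 20:1 ratio is an approximation, quoted here for illustration only.

TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio

def optimal_tokens(n_params: float) -> float:
    """Estimate compute-optimal training tokens for a given parameter count."""
    return TOKENS_PER_PARAM * n_params

for name, params in [("GPT-3 (175B params)", 175e9), ("PaLM (540B params)", 540e9)]:
    print(f"{name}: ~{optimal_tokens(params) / 1e12:.1f} trillion tokens "
          f"for compute-optimal training")
```

Set against the roughly 300 billion tokens GPT-3 actually saw, the heuristic suggests it was heavily under-trained relative to its size, which is exactly the point the DeepMind analysis makes.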
A post on the LessWrong platform from July 2022 emphasises this point, stating that the active constraint on language-modelling performance is data, not size: current returns to additional data are significant, returns to additional model size are minuscule, and most recent landmark models are considered wastefully big.
Data: The Key to GPT-5's Success
Ultimately, the quality, use, and sourcing of data will be the key factors determining the success, and the potentially genius-level IQ, of GPT-5. With data now at the forefront of GPT development, it will be interesting to see what the upcoming model can achieve.
Leveraging Data in Large Language Models
An increasingly common view in machine learning is that if enough data can be leveraged, there is no need to run very large models with 500 billion, one trillion, or even more parameters: the driving force is the data, not the parameter count. For example, GPT-3 was trained on approximately 300 billion tokens. A word typically corresponds to between one and 1.4 tokens, so a token can be thought of as roughly one word.
Models such as PaLM and DeepMind's Chinchilla were trained on even more tokens, approximately 800 billion and 1.4 trillion respectively. The implications of these growing training sets are examined in a recent academic paper that asks whether we will exhaust the data available for machine learning and large language models.
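As a quick back-of-the-envelope check on those figures, the sketch below converts the quoted token counts into approximate word counts using the rough word-to-token ratio discussed above; the 1.3 tokens-per-word figure is an assumption for illustration, and actual ratios vary by tokenizer and text.

```python
# Back-of-the-envelope conversion from tokens to approximate words.
# Assumes roughly 1.3 tokens per English word (an illustrative assumption).

TOKENS_PER_WORD = 1.3

training_sets = {
    "GPT-3": 300e9,        # ~300 billion tokens
    "PaLM": 800e9,         # ~800 billion tokens
    "Chinchilla": 1.4e12,  # ~1.4 trillion tokens
}

for model, tokens in training_sets.items():
    words = tokens / TOKENS_PER_WORD
    print(f"{model}: ~{tokens / 1e9:.0f}B tokens, roughly {words / 1e9:.0f}B words")
```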
The Stock of High-Quality Language Data
The paper estimates the stock of high-quality language data at between 4.6 trillion and 17 trillion words. It also suggests that we are within one order of magnitude (i.e., within a factor of ten) of exhausting this high-quality data, with exhaustion expected somewhere between 2023 and 2027.
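To see why being "within one order of magnitude" matters, the sketch below compares that estimated stock (converted from words to tokens) with the token counts of recent training runs. The conversion ratio and stock figures are the rough estimates quoted above, so treat the output as illustrative only.

```python
# Illustrative comparison of the estimated high-quality data stock against
# recent training-set sizes. All figures are rough estimates from the text.

TOKENS_PER_WORD = 1.3  # rough assumption, as above

stock_words_low, stock_words_high = 4.6e12, 17e12   # estimated stock in words
stock_tokens_low = stock_words_low * TOKENS_PER_WORD
stock_tokens_high = stock_words_high * TOKENS_PER_WORD

recent_runs = {"GPT-3": 300e9, "Chinchilla": 1.4e12}

for model, tokens in recent_runs.items():
    print(f"{model}: already uses {tokens / stock_tokens_high:.1%} to "
          f"{tokens / stock_tokens_low:.1%} of the estimated stock")
```

A single Chinchilla-scale run already consumes a noticeable fraction of even the optimistic estimate, which is why the exhaustion window is so close.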
High-quality data matters when developing large language models: rapid improvements in these models depend on it, and training on high-quality sources is standard practice because models trained on such data have been shown to perform better.
Sources of High-Quality Data
Sources of high-quality data are diverse: scientific papers, books, web content, news, code, and platforms like Wikipedia. However, the exact origin of much of this data is often unclear, which makes it harder to manage a resource that may be exhausted over time.
Running Out of High-Quality Data: Implications
The potential exhaustion of high-quality data poses a significant challenge to maintaining rapid advances in the quality and capabilities of large language models such as GPT-3. If we begin to run out of high-quality data as early as 2023, the question becomes how to address this and ensure that new models continue to deliver remarkable performance.
It is crucial to raise awareness about the importance of high-quality data in the machine learning community as it is vital for driving innovation and progress in the field. Identifying new sources of high-quality data, understanding its origins, and encouraging better data management practices can contribute to the ongoing improvement of large language models and prevent the potential exhaustion of this valuable resource.
Estimating Available High-Quality Data
The near-term future of artificial intelligence will be significantly shaped by how much high-quality data is available. A middle-of-the-road estimate suggests around nine trillion tokens of high-quality data, which could support a substantial increase in AI performance, although it contrasts with other figures, such as an initial estimate of 3.2 trillion tokens. Either way, expanding the pool of high-quality data would be a game-changer for AI applications.
Chapman's Observations on GPT-4 and Bing
David Chapman, who holds an AI PhD from MIT, draws on the DeepMind study and the LessWrong post to make two interesting observations about GPT-4 and Bing. First, he thinks GPT-4 or Bing might have "scraped the bottom of the Web text barrel", resulting in responses reminiscent of emoting teenagers. Second, he suggests that Google and OpenAI may not be disclosing the sources of their data because of concerns about attribution and potential controversies over compensation.
Data Sources, Attribution and Compensation
As AI models become more advanced, the question of where their data comes from becomes increasingly important. Attributing sources and compensating creators pose major challenges: online maths tutorials, for example, could be scraped and used to teach maths through a model such as Bing without proper attribution or compensation. Similar issues are emerging with AI image generation, underlining the need to address these concerns as AI continues to evolve.
Surprising Sources of Data: Google's Bard Model
Google's Bard model is known to use surprising sources of data. One example is YouTube, which raises the question of whether user-generated comments are being harvested for AI purposes. The issue of data sourcing in AI is becoming more significant, and clarifying the origins of data will be crucial for addressing concerns of attribution and compensation.
GPT-5 and High-Quality Data
The performance of Google's PaLM, trained on roughly 800 billion tokens, offers insight into the potential of the upcoming GPT-5. GPT-5 will likely build on these lessons and aim to ingest as much high-quality data as possible. As AI research turns to GPT-5 and other advanced models, the centrality of high-quality data to AI performance cannot be overstated.
High-Quality Data and GPT-5
The stock of high-quality data available for AI models is estimated to grow by around 10% annually. If the model behind Bing, Microsoft's search engine, was trained on a similar amount of data to PaLM (roughly a trillion tokens), GPT-5 could potentially show an order-of-magnitude improvement. The sections below discuss several ways OpenAI could enhance GPT-5, given these potential limitations.
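Under the rough assumptions in this section (a current stock of around nine trillion high-quality tokens, the middle-of-the-road estimate mentioned earlier, growing at about 10% per year), it is straightforward to project how much data could in principle be available for a future training run. The sketch below is only a compound-growth illustration of those assumptions.

```python
# Projecting the high-quality data stock forward under a ~10% annual growth
# assumption. The starting figure of ~9 trillion tokens is the rough
# middle-of-the-road estimate discussed earlier; both numbers are assumptions.

stock_tokens = 9e12   # assumed current stock of high-quality tokens
growth_rate = 0.10    # assumed annual growth

for year in range(2023, 2029):
    print(f"{year}: ~{stock_tokens / 1e12:.1f} trillion tokens")
    stock_tokens *= 1 + growth_rate
```

Even with steady growth, the stock only gains a few trillion tokens over this window, which is why extracting more value from each token matters as much as finding new data.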
Extracting High-Quality Data from Low-Quality Sources
One possible strategy for improving GPT-5 is finding more ways to extract high-quality data from low-quality sources, such as Facebook. This could significantly improve the model's performance and help overcome some of the limitations faced by previous iterations.
Automating Chain of Thought Prompting
A recent paper highlighted the benefits of automating chain-of-thought prompting in AI models. This technique, a form of prompt engineering, prompts the model to show its working, which improves its output. The paper reports gains of 2-3%, and such seemingly small improvements can be significant when starting from a level of performance as high as Bing's.
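As an illustration of what automating chain-of-thought prompting can look like in code, the sketch below wraps a question in a reasoning-eliciting prompt before sending it to a model. The `ask_model` function is a hypothetical stand-in for whatever completion API is used; the pattern, not the specific endpoint, is the point.

```python
# Minimal sketch of automated chain-of-thought prompting.
# `ask_model` is a hypothetical placeholder for a real completion API call.

def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to a language model and return its reply."""
    raise NotImplementedError("Wire this up to your model provider of choice.")

def chain_of_thought(question: str) -> str:
    # Automatically append a reasoning trigger so the model shows its working,
    # then asks it to state a final answer on its own line.
    prompt = (
        f"Question: {question}\n"
        "Let's think step by step, then give the final answer on its own line."
    )
    return ask_model(prompt)

# Example usage (requires a real ask_model implementation):
# print(chain_of_thought("If a train travels 60 km in 45 minutes, what is its average speed in km/h?"))
```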
Teaching Language Models to Use Tools
Another recent paper showed that language models can teach themselves to use tools such as calculators, calendars, and APIs. Integrating tools like Wolfram Alpha into a large language model could revolutionise AI, with applications in science, maths, finance, and other fields changing how these models operate.
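To make the tool-use idea concrete, the sketch below shows one common pattern: the model emits a marked-up tool call (here a calculator), the surrounding code executes it, and the result is spliced back into the text. The [CALC(...)] markup and the `run_calculator` helper are illustrative assumptions, not the scheme used in the paper.

```python
import re

# Sketch of calculator tool use: the model is assumed to emit calls such as
# [CALC(23 * 7)], which the wrapper executes and substitutes back into the text.

_CALL = re.compile(r"\[CALC\((.*?)\)\]")

def run_calculator(expression: str) -> str:
    """Evaluate a simple arithmetic expression (digits and operators only)."""
    if not re.fullmatch(r"[\d\s\.\+\-\*\/\(\)]+", expression):
        return "ERROR"
    return str(eval(expression))  # acceptable here: input restricted to arithmetic characters

def resolve_tool_calls(model_output: str) -> str:
    # Replace each calculator call with its computed result.
    return _CALL.sub(lambda m: run_calculator(m.group(1)), model_output)

print(resolve_tool_calls("The order costs [CALC(23 * 7)] pounds in total."))
# -> "The order costs 161 pounds in total."
```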
Improving Language Model Interaction with Python Interpreters
A related paper demonstrated how AI models can interact with a Python interpreter to check whether their generated code compiles. This capacity for self-teaching and self-correction could play a significant role in further enhancing GPT-5's performance.
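A minimal version of that feedback loop can be built with Python's own tooling: try to compile the model's generated code, and if compilation fails, feed the error back for another attempt. The sketch below again assumes a hypothetical `ask_model` helper, as in the earlier chain-of-thought example.

```python
# Sketch of a compile-and-retry loop for model-generated Python code.
# `ask_model` is a hypothetical placeholder for a real completion API.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("Wire this up to your model provider of choice.")

def generate_compiling_code(task: str, max_attempts: int = 3) -> str:
    prompt = f"Write a Python function that {task}. Return only code."
    for _ in range(max_attempts):
        code = ask_model(prompt)
        try:
            compile(code, "<generated>", "exec")  # syntax check only, no execution
            return code
        except SyntaxError as err:
            # Feed the interpreter's error back so the model can self-correct.
            prompt = (
                f"The previous code failed to compile with: {err}\n"
                f"Please fix it. Original task: {task}. Return only code."
            )
    raise RuntimeError("No compiling code produced within the attempt limit.")
```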
Utilising Repeated Training on the Same Data
Finally, it has been suggested that training GPT-5 multiple times on the same data could boost its capabilities. Professor Swam Dip has highlighted the potential benefits of implementing such a strategy.
In conclusion, various methods are being explored to improve GPT-5 despite potential data limitations. These include extracting high-quality data from low-quality sources, automating chain-of-thought prompting, teaching language models to use tools, improving interaction with Python interpreters, and training the model multiple times on the same data. Combined, these approaches could lead to a substantial leap in GPT-5's capabilities and potential applications.
Improving GPT Models with Artificial Data Generation
Currently, GPT models are typically trained over their data only once, owing to performance and cost constraints. There may, however, be potential in training a model over the same data multiple times to enhance its capabilities. Although this approach would be costlier, major players like Microsoft could see it as a worthwhile investment given the rewards from search and related profits.
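In training terms, "training multiple times on the same data" simply means running more than one epoch over the corpus. The toy sketch below fits a one-parameter model and shows the loss falling across repeated passes over an identical dataset; it illustrates the idea only, not how a GPT-scale run would be configured.

```python
# Toy illustration of repeated passes (epochs) over the same data.
# Fits y = w * x by gradient descent; the dataset never changes between epochs.

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # (x, y) pairs
w, lr = 0.0, 0.01

for epoch in range(1, 6):
    loss = 0.0
    for x, y in data:            # same data every epoch
        error = w * x - y
        w -= lr * 2 * error * x  # gradient step for squared error
        loss += error ** 2
    print(f"epoch {epoch}: loss {loss / len(data):.3f}, w {w:.3f}")
```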
A paper co-authored by a professor in this field describes how GPT models can generate additional datasets, focusing on problems they find difficult, such as those with complex patterns. With human input used to filter the model-generated answers for correctness, this kind of artificial data generation can yield improvements of 10% or more.
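A simplified version of that generate-then-filter loop is sketched below: the model proposes new problem/answer pairs on a topic it struggles with, a reviewer (human or automated) accepts or rejects each one, and only accepted pairs join the training set. The `ask_model` and `review_answer` helpers are hypothetical placeholders, and the 10% figure comes from the paper, not from this sketch.

```python
# Sketch of artificial data generation with a correctness filter.
# `ask_model` and `review_answer` are hypothetical placeholders: one generates
# candidate question/answer pairs, the other stands in for human (or automated)
# filtering of incorrect answers.

from typing import Callable

def ask_model(prompt: str) -> str:
    raise NotImplementedError("Wire this up to your model provider of choice.")

def generate_training_pairs(
    topic: str,
    n_candidates: int,
    review_answer: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    accepted = []
    for _ in range(n_candidates):
        question = ask_model(f"Write a hard {topic} problem.")
        answer = ask_model(f"Solve this problem, showing your working:\n{question}")
        if review_answer(question, answer):  # keep only answers judged correct
            accepted.append((question, answer))
    return accepted

# Example usage with a human-in-the-loop reviewer:
# pairs = generate_training_pairs(
#     "algebra", 100,
#     review_answer=lambda q, a: input(f"{q}\n{a}\nKeep? [y/n] ") == "y",
# )
```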
Implications of Fully Utilising Nine Trillion Tokens
What fully exploiting nine trillion tokens by the end of 2024, or perhaps even the start of that year, would actually yield is still unclear. One conceivable outcome is an order-of-magnitude improvement, which might not amount to artificial general intelligence (AGI) but could certainly revolutionise the job market.
It is also important to consider the potential divergence between cognitive work and physical work, as suggested by Sam Altman's tweet contrasting the drastically reduced cost of developing an iPhone app with the cost of a plumbing job in 2023. How those relative prices look by 2028 remains to be seen.
Where GPT Models Could Outperform Human Writers
As GPT models improve, they might eventually outperform human writers in various domains. Some anticipated benchmarks where this could occur include:
1. Reading Comprehension: With an extrapolation to GPT-5, advancements in reading comprehension would bring implications for summarisation and creative writing tasks.
2. Logic and Critical Reasoning: GPT models could eventually excel in debating topics, engaging in legal work, and determining causality in complex scenarios. Such capabilities would prove valuable in industries like finance, where distinguishing signals from noise within vast data sets is crucial.
3. Physics and High School Maths: An order of magnitude improvement could lead to near-solutions for physics and high school maths. Consequently, AI tutors might replace jobs in these areas – even as early as the end of next year.
GPT-5 Release and Other Advanced Technologies
The release of GPT-5 is highly anticipated in the artificial intelligence community, but the timeline remains uncertain. The launch is expected to coincide with the maturing of several groundbreaking technologies, such as text-to-speech, image-to-text, and text- and image-driven video avatars. These state-of-the-art applications suggest that AI tutors may not be as far-fetched as people might think.
Safety Research and Timeline Uncertainty
The main reason for the uncertain GPT-5 timeline is the ongoing safety research at Google and OpenAI. Sam Altman, CEO of OpenAI and former president of Y Combinator, spoke to The New York Times on this matter: "When we are ready, when we think we have completed our alignment work and all of our safety thinking and worked with external auditors, other AGI labs, then we will release those things." While his statement may have referred to GPT-4, the sentiment applies even more strongly to GPT-5.
Bing's Sydney Model and Safety Concerns
The release of Bing's Sydney model adds another layer to the question of safety and alignment. Sydney shows that significant progress in artificial intelligence is being made, while also raising fresh questions about its safety implications. According to Altman, safety and alignment remain central to the responsible development of AI and AGI (artificial general intelligence).
Increasing Safety Progress Ratio
In closing, Sam Altman emphasises the importance of increasing the ratio of safety progress to capability progress in AI development. In his public post on AGI, he writes: "It's important that the ratio of safety progress to capability progress increases." In other words, as AI models become more powerful and advanced, keeping pace on safety must remain at the forefront of development priorities.