GeoGPT Open-Source: Redefining the Paradigm of Geoscience Research
Generative AI is unlocking the secrets of Earth through GeoGPT, a domain-specific foundation model tailored for geoscientists. Launched globally on April 27, 2025, GeoGPT was developed under the vision of the Deep-Time Digital Earth (DDE) international scientific initiative, led by Zhejiang Lab and co-developed with 25 institutions and over 400 geoscience experts from around the world. Designed to transform geoscience research, GeoGPT integrates deep-time Earth data with intelligent algorithms, offering capabilities such as scientific literature parsing, knowledge graph construction, personalized research assistants, geological map recognition and Q&A, and AI-driven research ideation.
Unlike general-purpose models such as ChatGPT, GeoGPT is built specifically for the complexity and depth of the Earth sciences. It has already been applied in real-world scenarios, including the construction of igneous rock databases, the classification of fossil species, and automated geological map generation. These applications signal not just a new tool but a fundamental shift in research paradigms: from observation- and experiment-driven science toward computation-intensive, data-driven, and model-based discovery.
Now open source, GeoGPT has attracted over 40,000 registered users across 135 countries, with international users making up more than 25%. The platform recently gained global recognition at the 2025 AI for Good Global Summit in Geneva, where it was selected for inclusion in the International Telecommunication Union’s (ITU) AI for Good Innovate for Impact Use Cases and awarded a prize for outstanding innovation.
GeoGPT supports flexible deployment with multiple foundation model options, including Llama3, DeepSeek R1, Mixtral, Qwen2.5, and Zhejiang Lab’s proprietary 021 scientific foundation model. The team has also developed GeoGPT-R1-Preview, an optimized inference model designed for high-efficiency deployment and real-world usability.
According to Chen Hongyang, Deputy Director of Zhejiang Lab’s Scientific Data Hub Research Center, “We’ve innovatively decoupled the model architecture to leverage existing open-source models while ensuring GeoGPT maintains both broad applicability and deep domain expertise. This design allows rapid iteration, even as underlying models evolve.”
A key strength of GeoGPT lies in its scalability. The framework is being explored for adaptation to other disciplines, such as astronomy. British geoscientist Professor Mike Stephenson praised the project: “GeoGPT sets a benchmark for other scientific fields. It marks the first time Earth science has established a domain-specific foundation model.”
GeoGPT addresses three major challenges in geoscience: heterogeneous data sources, fragmented long-tail data, and disciplinary silos. To tackle these, the team extracted approximately 140 billion tokens from Common Crawl using knowledge graph methods, focusing only on open-access scientific literature with CC BY or CC BY-NC licenses. The dataset now includes 288,000 open-access papers from 15 geoscience publishers and 182 journals. A rigorous data pipeline covering mining, PDF parsing, annotation, and quality filtering ensured high-quality, domain-specific training data. Benchmarking shows that GeoGPT’s data outperforms mainstream open datasets such as FineWeb and DCLM in both relevance and accuracy.
The platform also fosters collaborative knowledge creation. Researchers are invited to co-develop scientific agents and contribute domain datasets, promoting global sharing of long-tail data. To overcome conceptual ambiguity across disciplines, GeoGPT preserves multiple definitions and synonyms for the same term, enabling seamless cross-field integration.
Since its inception in July 2023, GeoGPT has undergone seven major version iterations.
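The idea of preserving several definitions and synonyms for one term, mentioned above, can be pictured with a small data structure. The following is only an illustrative sketch, not GeoGPT's actual implementation: the names `Definition`, `TermRegistry`, `add`, and `lookup` are hypothetical, and the sample entries are simplified textbook senses chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Definition:
    discipline: str  # field in which this sense of the term is used
    text: str        # the definition as that field states it


@dataclass
class TermRegistry:
    """Hypothetical store that keeps every sense of a term side by side."""
    # canonical term -> all definitions recorded for it
    _definitions: dict = field(default_factory=lambda: defaultdict(list))
    # lower-cased name (canonical or synonym) -> canonical term
    _canonical: dict = field(default_factory=dict)

    def add(self, term, discipline, text, synonyms=()):
        """Record one more sense of `term` without overwriting earlier ones."""
        key = term.lower()
        self._canonical[key] = key
        for name in synonyms:
            self._canonical[name.lower()] = key
        self._definitions[key].append(Definition(discipline, text))

    def lookup(self, name, discipline=None):
        """Return all senses for a term or synonym, optionally for one field."""
        key = self._canonical.get(name.lower())
        if key is None:
            return []
        senses = self._definitions[key]
        if discipline is None:
            return list(senses)
        return [d for d in senses if d.discipline == discipline]


registry = TermRegistry()
registry.add("facies", "sedimentology",
             "rock body whose features reflect its depositional environment")
registry.add("facies", "metamorphic petrology",
             "mineral assemblages formed under similar pressure and temperature")
registry.add("igneous rock", "petrology",
             "rock formed by cooling and solidification of magma or lava",
             synonyms=["magmatic rock"])

for sense in registry.lookup("facies"):
    print(sense.discipline, "->", sense.text)
```

Keeping senses in a list per term, rather than a single value, means a query from one field never erases another field's meaning, and a synonym such as "magmatic rock" resolves to the same entry as "igneous rock".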
Key technical breakthroughs include solving catastrophic forgetting through multi-stage training and model fusion, developing a dual-track strategy that combines template fine-tuning and domain-specific reinforcement learning to produce high-quality instruction datasets, and creating dynamic document segmentation and adaptive slicing algorithms that improve parsing accuracy. A novel “demand decomposition and hierarchical processing” framework enables complex data extraction tasks that previously took a week to be completed in under a day with high fidelity. The system achieves parsing accuracy comparable to that of leading commercial tools while reducing inference costs by 80%.
To enhance domain reasoning, the team analyzed textbooks, monographs, and research papers to distill expert problem-solving logic. This logic was embedded into GeoGPT via instruction tuning and reinforcement learning, enabling it to emulate expert-level reasoning on complex geological problems.
GeoGPT is already transforming research workflows. In collaboration with Professor Wang Tao’s team at the Institute of Geology, Chinese Academy of Geological Sciences, it has enabled a fully automated pipeline, running from scientific question formulation through data processing, interpolation, and visualization to final map generation, that significantly accelerates studies of magmatic rock evolution and deep Earth processes.
Another landmark application involved working with Professor James Ogg of Purdue University to digitize the Treatise on Invertebrate Paleontology, a 50-volume, 100,000-species compendium whose use has long been hindered by its physical format and complex structure. Using a hybrid AI-human approach of AI batch extraction, expert validation, and iterative model refinement, the team completed the extraction of data from three volumes in just four months, reducing time costs by 75%. Ogg remarked: “GeoGPT turned what was once considered impossible into reality. It has broken the data barrier of the Treatise.”
Looking ahead, the team aims to integrate all research outputs, including hypotheses, processed data, and visualizations, into fully automated, coherent research reports. “This requires deep integration of natural language understanding, scientific demand parsing, and big data analytics,” said Chen. “Only through systemic integration can we achieve seamless, intelligent research workflows.”
GeoGPT’s success underscores a broader shift: AI is not just a tool but a catalyst for interdisciplinary collaboration. “When geoscientists and computer scientists work together, align their language, and co-design solutions, we create models with real disciplinary penetrating power, that is, true domain transformation,” Chen noted.
As generative AI reshapes global science, GeoGPT is pioneering a dual transformation: accelerating research efficiency and enabling novel scientific discovery. By combining data-driven insights with physical principles, it paves the way for next-generation Earth system modeling, where observation, simulation, and AI converge to deepen our understanding of the planet.