DeepSeek-affiliated Hangzhou DeepSeek AI Fundamental Technology Research Co., Ltd. today filed a patent for a new web data collection system designed to improve crawling efficiency and data quality. The patent outlines a method for discovering more webpage links while minimizing the traffic load placed on the websites being crawled. It assesses already-downloaded content to predict the quality of links that have not yet been fetched, prioritizing high-value data and reducing redundant downloads. Efficient web data collection is crucial for training large language models (LLMs), which power AI systems like ChatGPT. Existing techniques struggle with incomplete link retrieval, excessive downloads that can crash websites, and poor filtering of low-quality data. DeepSeek’s proposed system aims to solve these issues by optimizing data allocation and maintaining metadata accuracy. [iThome, in Chinese]
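The patent text is not reproduced here, so the following is only a rough illustration of the general idea described above, not DeepSeek's actual method. It is a minimal Python sketch that assumes a link's quality can be guessed from the quality of the page it was found on. The names `CrawlFrontier`, `add_links`, `pop`, and `page_quality`, as well as the toy scoring rule, are all hypothetical.

```python
import heapq
from urllib.parse import urldefrag


class CrawlFrontier:
    """Hypothetical priority frontier: links inherit a score from their parent page."""

    def __init__(self):
        self._heap = []      # entries are (-score, insertion_order, url)
        self._seen = set()   # URLs already queued, to avoid redundant downloads
        self._counter = 0    # tie-breaker so equal scores pop in insertion order

    def add_links(self, links, parent_quality):
        """Queue unseen links, using the parent's quality as the predicted score."""
        for link in links:
            url, _fragment = urldefrag(link)  # treat page#a and page#b as one URL
            if url in self._seen:
                continue
            self._seen.add(url)
            # heapq is a min-heap, so the score is negated to pop the best first.
            heapq.heappush(self._heap, (-parent_quality, self._counter, url))
            self._counter += 1

    def pop(self):
        """Return the URL with the highest predicted quality, or None if empty."""
        if not self._heap:
            return None
        _neg_score, _order, url = heapq.heappop(self._heap)
        return url


def page_quality(text):
    """Toy quality score in [0, 1]: the share of distinct words in the text.

    This is an assumed stand-in; a real system would use a far richer signal.
    """
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0
```

In this sketch, a crawler would score each downloaded page with `page_quality`, pass that score to `add_links` along with the links found on the page, and then call `pop` to choose what to download next. The `_seen` set is what keeps the same URL from being fetched twice.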