Getting Started with InternLM2-Chat-1.8B: Cleaning and Intelligently Analyzing Python Crawler Data

2026/3/13 22:48:23

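Have you run into this too? You scrape a big pile of data with a Python crawler, then find it contains a bit of everything: duplicates, inconsistent formatting, ads, and irrelevant content. Just tidying it up can take most of a day.

I used to be the same. I assumed the hardest part of crawling was getting past anti-scraping measures, and only later realized that the real effort goes into the grunt work after the data comes back. The workflow only got easier once I started using a large language model to help.

Today I'll walk you through using InternLM2-Chat-1.8B, a lightweight model, to give your crawler project a "smart brain". We'll skip the heavy theory and start from practical scenarios: having the model clean data automatically, extract key information, and even write summaries. The process is simple, and even if you are fairly new to Python you should be able to follow along and get it running.

1. Environment Setup and Model Deployment

Let's start with the basics: set up the environment and get the model running.

1.1 Basic Environment Configuration

InternLM2-Chat-1.8B has modest hardware requirements and runs on an ordinary laptop. If you have a GPU, even an entry-level one, it will be much faster.

    # Create and activate a virtual environment (recommended)
    conda create -n internlm-env python=3.9
    conda activate internlm-env

    # Install the base dependencies
    pip install torch torchvision torchaudio
    pip install transformers
    pip install sentencepiece
    pip install protobuf
    pip install accelerate   # needed for device_map="auto" in section 1.2

On Windows you may need to pick the matching PyTorch build from the official site. On Linux and macOS the commands above can be used directly.

1.2 Quick Deployment

InternLM2-Chat-1.8B really is lightweight, so downloading and loading it are both quick. We call it through the Hugging Face transformers library:

    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch

    # Load the model and tokenizer
    model_name = "internlm/internlm2-chat-1_8b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # half precision to reduce memory use (on CPU-only machines torch.float32 is usually safer)
        device_map="auto",          # place the model on CPU/GPU automatically
        trust_remote_code=True
    )
    print("Model loaded")

The first run downloads the model files automatically, which takes roughly 3-4 GB of disk space. After that, loading is fast. If your network is unreliable, you can download the files to a local directory first:

    # Download with huggingface-cli
    huggingface-cli download internlm/internlm2-chat-1_8b --local-dir ./internlm2-chat-1.8b

Then change model_name in the code to the local path.

1.3 Testing That the Model Works

Once the model is loaded, run a quick test:

    def chat_with_model(prompt):
        """Simple single-turn chat helper."""
        inputs = tokenizer(prompt, return_tensors="pt")
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=500,  # maximum number of generated tokens
                temperature=0.7,     # controls randomness
                do_sample=True
            )
        # outputs[0] starts with the prompt tokens, so decode only what was generated.
        # Otherwise later checks such as "does the reply contain 'yes'" match the prompt itself.
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # Test
    test_prompt = "Hello, please introduce yourself."
    response = chat_with_model(test_prompt)
    print("Model reply:", response)

If the model replies normally, the environment is set up correctly. Next we can put it to work in a crawler project.

2. Cleaning Crawler Data in Practice

Data cleaning is the most painful part of crawling. Let's look at a few common scenarios and see how the model can help.

2.1 Intelligent Deduplication and Merging

Crawlers often pick up duplicate content, but the duplicates are not always identical: the titles may match while the bodies differ, or the bodies match while the titles differ. Traditional methods handle this poorly.

Suppose we have crawled some news items:

    news_data = [
        {
            "title": "AI Helps Medical Diagnosis",
            "content": "A hospital recently introduced an AI-assisted diagnosis system with 95% accuracy.",
            "source": "Tech News Net"
        },
        {
            "title": "New Breakthrough in AI Medical Diagnosis",
            "content": "A hospital adopted AI technology, raising diagnostic accuracy to 95%.",
            "source": "Health Daily"
        },
        {
            "title": "AI Helps Medical Diagnosis",
            "content": "The medical field is seeing an AI revolution, with diagnostic efficiency rising sharply.",
            "source": "Innovation Tech"
        }
    ]

Traditional deduplication, such as computing text similarity, has limited effect here, so we let the model judge:

    def intelligent_deduplicate(data_list):
        """Deduplicate items by asking the model whether two items cover the same event."""
        deduplicated = []
        for item in data_list:
            is_duplicate = False
            # Compare against the items we have already kept
            for kept_item in deduplicated:
                # Build the judgment prompt
                prompt = f"""Do the following two news items report the same event?
    News A title: {item['title']}
    News A content: {item['content']}
    News B title: {kept_item['title']}
    News B content: {kept_item['content']}
    Answer only "yes" or "no"."""
                response = chat_with_model(prompt)
                if response.strip().lower().startswith("yes"):
                    is_duplicate = True
                    # Merge: keep the more detailed content
                    if len(item['content']) > len(kept_item['content']):
                        kept_item['content'] = item['content']
                    # Record the extra source
                    kept_item['sources'].append(item['source'])
                    break
            if not is_duplicate:
                # Copy the dict so later changes do not touch the original data
                # (a shallow copy is enough because all values are strings)
                new_item = item.copy()
                new_item['sources'] = [new_item.pop('source')]
                deduplicated.append(new_item)
        return deduplicated

    # Run the deduplication
    cleaned_news = intelligent_deduplicate(news_data)
    print(f"Original items: {len(news_data)}")
    print(f"After deduplication: {len(cleaned_news)}")
    for news in cleaned_news:
        print(f"Title: {news['title']}")
        print(f"Sources: {news['sources']}")
        print("---")

This approach is smarter than simple text matching because it can recognize similarity at the semantic level. Keep in mind that every new item is compared against every kept item, so the number of model calls grows quickly with the size of the data.

2.2 Structured Information Extraction

Crawled data is often semi-structured, and we need to pull specific fields out of it. For example, extracting specifications from product descriptions:

    import json

    product_descriptions = [
        "Apple iPhone 15 Pro Max 256GB, Space Black, unlocked 5G phone, 4422mAh battery, 20W fast charging",
        "Huawei Mate 60 Pro 512GB, Yachuan Green, flagship phone with satellite calling, 5000mAh battery, 88W super fast charging",
        "Xiaomi 14 Ultra 1TB, White, Leica imaging kit, 5300mAh battery, 90W wired and 50W wireless fast charging"
    ]

    def extract_product_info(description):
        """Extract structured fields from a product description."""
        prompt = f"""Extract information from the following product description and return it as JSON.
    Description: {description}
    Fields to extract:
    1. Brand
    2. Model
    3. Storage capacity
    4. Color
    5. Battery capacity
    6. Fast-charging power
    If a field is not present, write "unknown". Return only the JSON, with no other text."""
        response = chat_with_model(prompt)
        # Try to parse the JSON
        try:
            # Locate the JSON part; the model may add text before or after it
            start_idx = response.find("{")
            end_idx = response.rfind("}") + 1
            if start_idx != -1 and end_idx != 0:
                json_str = response[start_idx:end_idx]
                return json.loads(json_str)
        except json.JSONDecodeError:
            pass
        return {"error": "parse failed", "raw_response": response}

    # Extract information in bulk
    for desc in product_descriptions:
        info = extract_product_info(desc)
        print(f"Description: {desc[:50]}...")
        print(f"Extracted: {info}")
        print("---")

This turns messy product descriptions into tidy structured data that is easy to analyze and load into a database.

2.3 Text Cleaning and Normalization

Crawled text often has all sorts of problems: extra spaces, garbled characters, unrelated ads, and so on. We can ask the model to clean it up:

    dirty_texts = [
        "   Artificial intelligence (AI)   is one of the hottest fields today   ",
        "Python crawler tutorial-learn how to scrape web data [Free download]",
        "Machine learning,deep learning,neural networks--how the three differ and relate"
    ]

    def clean_text(text):
        """Model-assisted text cleaning."""
        prompt = f"""Clean up the following text so that it is tidy and well formed.
    Original text: {text}
    Cleaning requirements:
    1. Remove leading, trailing, and redundant spaces
    2. Remove unrelated symbols and ad phrases such as [Free download]
    3. Normalize punctuation
    4. Keep the original meaning
    Return only the cleaned text, with no explanation."""
        response = chat_with_model(prompt)
        return response.strip()

    # Check the cleaning results
    for dirty in dirty_texts:
        cleaned = clean_text(dirty)
        print(f"Before: {dirty}")
        print(f"After: {cleaned}")
        print("---")

3. Integrating into a Complete Crawler Project

Now let's integrate the model into a complete crawler project, using a technical blog as the example.

3.1 Basic Crawler Implementation

This part also needs requests and beautifulsoup4 (pip install requests beautifulsoup4).

    import json
    import time
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    class TechBlogCrawler:
        def __init__(self):
            self.session = requests.Session()
            self.session.headers.update({
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
            })

        def crawl_blog_list(self, url):
            """Crawl the blog list page."""
            try:
                response = self.session.get(url, timeout=10)
                response.raise_for_status()
                soup = BeautifulSoup(response.text, "html.parser")
                articles = []
                # Assumes article links live in divs with class "article-item"
                for item in soup.find_all("div", class_="article-item"):
                    link = item.find("a")
                    if link and link.get("href"):
                        # urljoin resolves relative links and leaves absolute ones unchanged
                        article_url = urljoin(url, link["href"])
                        title = link.text.strip() if link.text else "Untitled"
                        articles.append({
                            "url": article_url,
                            "title": title,
                            "crawled": False
                        })
                return articles
            except Exception as e:
                print(f"Failed to crawl list page: {e}")
                return []

        def crawl_article(self, article_info):
            """Crawl the content of a single article."""
            try:
                response = self.session.get(article_info["url"], timeout=10)
                response.raise_for_status()
                soup = BeautifulSoup(response.text, "html.parser")
                # Extract the article body (adjust to the actual site structure)
                content_div = soup.find("div", class_="article-content")
                if content_div:
                    content = content_div.get_text(strip=True, separator="\n")
                else:
                    content = soup.get_text(strip=True, separator="\n")
                # Extract the publish time
                time_tag = soup.find("time")
                publish_time = time_tag["datetime"] if time_tag and time_tag.get("datetime") else "unknown"
                return {
                    "url": article_info["url"],
                    "title": article_info["title"],
                    "content": content[:2000],  # limit the length
                    "publish_time": publish_time,
                    "crawl_time": time.strftime("%Y-%m-%d %H:%M:%S")
                }
            except Exception as e:
                print(f"Failed to crawl article: {article_info['url']} - {e}")
                return None

3.2 Integrating the Intelligent Processing Module

Now we bring in the model features from earlier:

    import re

    class IntelligentProcessor:
        def __init__(self):
            # Reuse the model and tokenizer loaded in section 1.2
            self.model = model
            self.tokenizer = tokenizer

        def generate_summary(self, text, max_length=100):
            """Generate an article summary."""
            # Only the first 500 characters are used, to keep the prompt short
            prompt = f"""Write a concise summary of the following technical article, in no more than {max_length} words.
    {text[:500]}...
    Summary requirements:
    1. Capture the core content
    2. Use clear, concise language
    3. Highlight the technical points
    Summary:"""
            return self._call_model(prompt)

        def extract_keywords(self, text, num_keywords=5):
            """Extract keywords."""
            prompt = f"""Extract the {num_keywords} most important keywords from the following text.
    {text[:300]}...
    Requirements:
    1. Order them by importance
    2. Separate them with commas
    3. Return only the keywords, with no other text
    Keywords:"""
            response = self._call_model(prompt)
            # Accept both ASCII and full-width commas in the reply
            return [kw.strip() for kw in re.split(r"[,，]", response) if kw.strip()]

        def categorize_article(self, title, content):
            """Classify an article."""
            prompt = f"""Classify this article based on its title and content.
    Title: {title}
    Content: {content[:200]}...
    Available categories: Python development, machine learning, front-end, back-end architecture, databases, operations and deployment, other
    Return only the category name."""
            return self._call_model(prompt)

        def _call_model(self, prompt):
            """Single entry point for model calls."""
            inputs = self.tokenizer(prompt, return_tensors="pt")
            inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=200,
                    temperature=0.3,  # low temperature for more stable output
                    do_sample=True
                )
            # Drop the prompt tokens and return only the generated text
            new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
            return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

3.3 Complete Workflow Example

    def main():
        # Initialize the crawler and the processor
        crawler = TechBlogCrawler()
        processor = IntelligentProcessor()

        # 1. Crawl the article list
        print("Crawling the article list...")
        blog_url = "https://example-tech-blog.com/articles"  # example URL
        articles = crawler.crawl_blog_list(blog_url)
        print(f"Found {len(articles)} articles")

        processed_articles = []

        # 2. Crawl and process each article
        for i, article_info in enumerate(articles[:3]):  # only 3 articles for the demo
            print(f"\nProcessing article {i+1}: {article_info['title']}")

            # Crawl the article content
            article_data = crawler.crawl_article(article_info)
            if not article_data:
                continue

            # Run the model-based processing
            print("Generating summary...")
            article_data["summary"] = processor.generate_summary(article_data["content"])
            print("Extracting keywords...")
            article_data["keywords"] = processor.extract_keywords(article_data["content"])
            print("Classifying...")
            article_data["category"] = processor.categorize_article(
                article_data["title"], article_data["content"]
            )
            processed_articles.append(article_data)

            # Avoid sending requests too quickly
            time.sleep(1)

        # 3. Save the results
        output_file = "processed_articles.json"
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(processed_articles, f, ensure_ascii=False, indent=2)
        print(f"\nDone. Results saved to {output_file}")

        # 4. Show the results
        for article in processed_articles:
            print(f"\nTitle: {article['title']}")
            print(f"Category: {article['category']}")
            print(f"Keywords: {', '.join(article['keywords'])}")
            print(f"Summary: {article['summary'][:100]}...")
            print("-" * 50)

    if __name__ == "__main__":
        main()

4. Practical Tips and Optimization Suggestions

A few techniques make a noticeable difference in real use.

4.1 Prompt Optimization

The quality of the model's output depends heavily on the prompt. For crawler data processing, you can organize prompts like this:

    # Base prompt templates
    PROMPT_TEMPLATES = {
        "clean_text": """Clean up the following crawled text.
    {text}
    Cleaning requirements:
    1. Remove leftover HTML tags
    2. Remove ads and irrelevant information
    3. Normalize punctuation and spacing
    4. Split into paragraphs if there are several
    Cleaned text:""",
        "extract_structured": """Extract structured information from the following text.
    {text}
    Information to extract:
    {fields}
    Requirements:
    1. Return JSON
    2. Use null for missing information
    3. Do not add extra explanation
    Extraction result:""",
        "summarize": """Summarize the key point of the following content in one sentence.
    {content}
    Summary requirements:
    1. No more than 50 words
    2. Highlight the most important information
    3. Use clear, concise language
    Summary:"""
    }

    def optimized_prompt(task_type, **kwargs):
        """Build a prompt from the optimized templates."""
        template = PROMPT_TEMPLATES.get(task_type)
        if not template:
            return kwargs.get("text", "")
        return template.format(**kwargs)

4.2 Batch Processing

When processing a lot of data, work through it in groups with a pause between groups:

    def batch_process_texts(texts, process_func, batch_size=5, delay=0.5):
        """Process texts in groups, pausing between groups."""
        results = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            print(f"Batch {i // batch_size + 1}/{(len(texts) + batch_size - 1) // batch_size}")
            for text in batch:
                try:
                    result = process_func(text)
                    results.append(result)
                except Exception as e:
                    print(f"Processing failed: {e}")
                    results.append(None)
            # Pause between batches to avoid overheating or triggering rate limits
            time.sleep(delay)
        return results

4.3 Error Handling and Retries

Both network requests and model calls can fail, so handle errors properly:

    def robust_model_call(prompt, max_retries=3):
        """Model call with retries."""
        for attempt in range(max_retries):
            try:
                response = chat_with_model(prompt)
                if response and len(response.strip()) > 0:
                    return response.strip()
            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # exponential backoff
                print(f"Waiting {wait_time} seconds before retrying...")
                time.sleep(wait_time)
        return ""  # all retries failed, return an empty string

4.4 Validating and Correcting Results

Model output sometimes contains small errors, so it helps to add a validation step:

    def validate_and_correct_json(json_str):
        """Validate JSON and ask the model to repair it if needed."""
        # Try parsing directly
        try:
            return json.loads(json_str)
        except json.JSONDecodeError:
            pass

        # If that fails, ask the model to fix it
        prompt = f"""The following text should be JSON, but its format may be wrong. Please correct it.
    {json_str}
    Return the corrected, valid JSON only, with no other text."""
        corrected = chat_with_model(prompt)
        try:
            # Try parsing again
            start_idx = corrected.find("{")
            end_idx = corrected.rfind("}") + 1
            if start_idx != -1 and end_idx != 0:
                return json.loads(corrected[start_idx:end_idx])
        except json.JSONDecodeError:
            pass
        return {}  # empty dict as a fallback

5. Common Problems and Solutions

You may run into the following problems in real use.

Problem 1: The model responds too slowly
Solution: adjust the generation parameters, reduce max_new_tokens, or use a quantized version.
Code adjustment:

    # Faster generation settings
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,  # shorter output
        do_sample=False,     # greedy decoding: faster, less diverse (temperature is not used in this mode)
        num_beams=1          # no beam search
    )

Problem 2: The output format does not match the requirements
Solution: state the format explicitly in the prompt and include a format example.
Improved prompt:

    prompt = """Extract the following product information.
    Product description: Apple iPhone 15 128GB Black
    Return it in the following JSON format:
    {
      "brand": "brand name",
      "model": "specific model",
      "storage": "storage capacity",
      "color": "color"
    }
    Return only the JSON, with no other text."""

Problem 3: Long texts give poor results
Solution: process the text in chunks, or extract the key parts first.

    def process_long_text(text, chunk_size=500):
        """Process a long text in chunks."""
        # generate_chunk_summary and generate_final_summary are not defined in this article;
        # supply your own, for example thin wrappers around a summary prompt.
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        results = []
        for chunk in chunks:
            # Summarize each chunk
            summary = generate_chunk_summary(chunk)
            results.append(summary)
        # Merge the chunk summaries
        combined = "\n".join(results)
        # Summarize the merged summaries once more
        final_summary = generate_final_summary(combined)
        return final_summary

Problem 4: You need to process domain-specific data
Solution: add domain knowledge to the prompt.

    def process_medical_text(text):
        """Process medical-domain text."""
        prompt = f"""You are an expert in medical information processing. Extract the key information from the following text.
    {text}
    Focus on:
    1. Disease names
    2. Symptom descriptions
    3. Treatments
    4. Drug names
    Return the result as JSON."""
        return chat_with_model(prompt)

6. Summary

After using it for a while, InternLM2-Chat-1.8B has brought a real efficiency gain to my crawler projects. The most obvious change is that data I used to handle with piles of regular expressions and complicated rules now only needs a clear instruction, and the model understands the intent and returns a decent result.

For anyone new to Python crawlers, my advice is to start with simple tasks. For example, have the model clean the text you scraped, or pull key information out of a long passage. The model's advantage is clear in these scenarios: it understands meaning rather than just matching patterns.

In practice, prompt quality matters a lot. I have found that the more specific and explicit the requirements, the better the model performs. For example, instead of just saying "extract information", say "extract the brand, model, and price, and return them as JSON".

On performance, a 1.8B model runs fine on an ordinary computer and responds quickly enough. For most crawler data-processing tasks this size is sufficient. If the data volume is very large, consider batch processing, or pre-screen the data and hand only the parts that need intelligent processing to the model.

Of course, it is not a cure-all. For scenarios with extremely strict format requirements, or ones that need 100% accuracy, you will probably still need to combine it with traditional methods. But in most cases, especially with unstructured or semi-structured web data, it can save you a lot of time.

If you are working on a crawler project, try integrating the model. Start with simple text cleaning and gradually move on to more complex tasks. You will find that a lot of tedious data processing can become smarter and more efficient.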
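A note on the JSON parsing used in `extract_product_info` (section 2.2) and `validate_and_correct_json` (section 4.4): slicing from the first `{` to the last `}` breaks when the reply contains stray braces or trailing remarks after the object. The sketch below is one possible alternative, not the article's original method. It scans for the first position where the standard library's `json.JSONDecoder.raw_decode` can parse a complete object. The helper name `extract_first_json` is hypothetical.

```python
import json


def extract_first_json(reply):
    """Return the first JSON object found in a model reply, or None.

    Hypothetical helper: tries every "{" in turn and lets the standard
    JSON decoder decide where a valid object starts and ends.
    """
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON from this brace; move on to the next one.
        start = reply.find("{", start + 1)
    return None


if __name__ == "__main__":
    reply = 'Here is the result: {"brand": "Apple", "storage": "256GB"} Hope that helps {thanks}'
    print(extract_first_json(reply))
```

Because the decoder stops at the end of the first valid object, any text the model appends afterwards no longer matters. A reply with no parseable object yields `None`, which the caller can treat the same way as the "parse failed" branch in section 2.2.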

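The summary suggests pre-screening data so that only the parts that really need the model reach it. For the pairwise comparison in section 2.1, one cheap pre-screen is to drop items whose text is identical after whitespace and case normalization, so the model only sees candidates that differ in wording. This is a minimal sketch under that assumption. The names `normalize` and `drop_exact_duplicates` are hypothetical, and the `"content"` key follows the `news_data` example.

```python
import hashlib


def normalize(text):
    """Collapse runs of whitespace and lowercase the text."""
    return " ".join(text.split()).lower()


def drop_exact_duplicates(items, key="content"):
    """Keep the first item for each distinct normalized value of `key`."""
    seen = set()
    unique = []
    for item in items:
        digest = hashlib.sha256(normalize(item[key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(item)
    return unique


if __name__ == "__main__":
    sample = [
        {"title": "a", "content": "AI  helps diagnosis."},
        {"title": "b", "content": "ai helps   diagnosis."},
        {"title": "c", "content": "Something else."},
    ]
    print([item["title"] for item in drop_exact_duplicates(sample)])
```

A possible usage is `intelligent_deduplicate(drop_exact_duplicates(news_data))`. The trade-off is that exact duplicates are discarded before the merge step, so their sources are not recorded. If you need every source, collect them in this function as well.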