PHP: extracting specific page content with regular expressions (examples)

Time: 2022-06-25 02:41:03 Editor: 袖梨 Source: 一聚教程网

The example code below is often useful for scraping pages.


1. Get the page title

// Extract the title; the pattern must start at the opening <title> tag,
// otherwise everything before </title> is captured
$title = preg_match('/<title[^>]*>(?<title>.*?)<\/title>/is', $html, $titleArr)
    ? trim($titleArr['title'])
    : null;

2. Get the body content and rewrite image paths (including background images) to another image address

/**
 * Get the content of the BODY region
 * @param $html
 * @param $urlRoot
 * @return mixed
 */
function getBody($html, $urlRoot = null){
    // Extract the body: prefer the <!--body--> comment markers, fall back to the <body> tag
    if (!preg_match('/<!--body-->(.*?)<!--body-->/is', $html, $bodyArr)) {
        preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $bodyArr);
    }
    // Neither pattern matched
    if (!$bodyArr) {
        return null;
    }
    $body = $bodyArr[1];
    // Rewrite <img> sources that point into the img/ folder.
    // '${1}' is used instead of "$1" so that a $urlRoot beginning with a digit
    // is not read as part of the group number.
    $body = preg_replace('/(<img\b[^>]*?\bsrc=[\'"])(?:\.\.\/)*(img\/[^\'"]+)/i', '${1}' . $urlRoot . '${2}', $body);
    // Rewrite CSS background images inside the HTML
    $body = preg_replace('~\b(background(?:-image)?\s*:[^;{}]*?\(\s*[\'"]?)(?:\.\.\/)*(img/.*?)\s*\)~i', '${1}' . $urlRoot . '${2})', $body);
    return $body;
}

3. Extract the page's Description content

function getDescription($html){
    // Get the 'content' attribute value in a <meta name="description" ... />
    $matches = array();

    // Search for <meta name="description" content="Buy my stuff" />
    // The backreferences (\1, \2) require the closing quote to match the opening one
    if (preg_match('/<meta\b[^>]*?\bname=(["\'])description\1[^>]*?\bcontent=(["\'])(.*?)\2/is', $html, $matches)) {
        return trim($matches[3]);
    }

    // Order of attributes could be swapped around: <meta content="Buy my stuff" name="description" />
    if (preg_match('/<meta\b[^>]*?\bcontent=(["\'])(.*?)\1[^>]*?\bname=(["\'])description\3/is', $html, $matches)) {
        return trim($matches[2]);
    }

    // No match
    return null;
}

4. Replace background image paths in a CSS file

/**
 * Get the CSS content
 * @param $cssCnt
 * @param $urlRoot
 * @return mixed
 */
function getCss($cssCnt, $urlRoot = null){
    // Match relative image paths that point into the img/ folder (absolute paths are not matched).
    // The replacement is not necessarily accurate: any number of leading ../ segments is simply
    // dropped and replaced with $urlRoot, without considering the actual directory depth.
    $css = preg_replace('~\b(background(?:-image)?\s*:[^;{}]*?\(\s*[\'"]?)(?:\.\.\/)*(img/.*?)\s*\)~i', '${1}' . $urlRoot . '${2})', $cssCnt);
    // Add a CSS prefix: ".foo {" becomes ".pat .foo {".
    // This runs on $css (not $cssCnt), so the path rewrite above is kept.
    // The pattern is approximate and only suits simple class selectors.
    $css = preg_replace('/\b.(.*?)[,{]/', 'pat .$0', $css);
    //TODO compress the CSS
    return $css;
}

As the examples above show, the approach is simple: pick tags that follow a regular pattern as the start and end markers, and whatever lies between the two markers is the content you want to extract. The only extra work is handling any stray characters or whitespace in between.
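The "start marker, end marker, take what is between" idea described above can be written once as a small helper. This is only a minimal sketch: `extractBetween()` and its parameter names are hypothetical and do not appear in the original article, and it assumes the markers should be matched literally and case-insensitively.

```php
<?php
/**
 * Return the text between the first occurrence of $start and the next
 * occurrence of $end, or null when the pair is not found.
 * Hypothetical helper for illustration; not part of the original article.
 */
function extractBetween(string $html, string $start, string $end): ?string
{
    // preg_quote() escapes regex metacharacters (and the "/" delimiter),
    // so both markers are matched literally.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/is';

    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }

    // Collapse runs of whitespace left over from the page's indentation.
    return trim(preg_replace('/\s+/', ' ', $m[1]));
}

$html = "<html><!--body-->\n  <p>Hello</p>\n  <p>World</p>\n<!--body--></html>";
echo extractBetween($html, '<!--body-->', '<!--body-->'), "\n";
// prints: <p>Hello</p> <p>World</p>
```

The `s` modifier lets `.` match newlines, so content that spans several lines is still captured; the lazy `(.*?)` stops at the first end marker rather than the last one.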