Removing links with a base URL

I wrote a Python script to extract the href value from every link on a given web page:


from BeautifulSoup import BeautifulSoup
import urllib2

html_page = urllib2.urlopen("http://kteq.in/services")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

When I run the code above, I get the following output, which includes both external and internal links:


index
index
#
solutions#internet-of-things
solutions#online-billing-and-payment-solutions
solutions#customer-relationship-management
solutions#enterprise-mobility
solutions#enterprise-content-management
solutions#artificial-intelligence
solutions#b2b-and-b2c-web-portals
solutions#robotics
solutions#augement-reality-virtual-reality
solutions#azure
solutions#omnichannel-commerce
solutions#document-management
solutions#enterprise-extranets-and-intranets
solutions#business-intelligence
solutions#enterprise-resource-planning
services
clients
contact
#
#
#
https://www.facebook.com/KTeqSolutions/
#
#
#
#
#contactform
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
index
services
#
contact
#
iOSDevelopmentServices
AndroidAppDevelopment
WindowsAppDevelopment
HybridSoftwareSolutions
CloudServices
HTML5Development
iPadAppDevelopment
services
services
services
services
services
services
contact
contact
contact
contact
contact
None
https://www.facebook.com/KTeqSolutions/
#
#
#
#

I want to remove the external links that have a full URL, such as https://www.facebook.com/KTeqSolutions/, while keeping links like solutions#internet-of-things. How can I do this efficiently?
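As an aside for readers on Python 3, where urllib2 and the old BeautifulSoup import no longer exist, the href-extraction step can be sketched with the standard library alone. The HTML snippet below is illustrative (not the real page), so the example runs without network access:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (None when absent)."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self.hrefs.append(dict(attrs).get("href"))

html = '<a href="index">Home</a> <a href="solutions#robotics">R</a> <a>no href</a>'
p = HrefCollector()
p.feed(html)
print(p.hrefs)  # the anchor without an href shows up as None, as in the output above
```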


largeQ
Viewed 144 times
2 Answers

慕神8447489

If I understand you correctly, you can try the following:

l = []
for link in soup.findAll('a'):
    print link.get('href')
    l.append(link.get('href'))
l = [x for x in l if x and "www" not in x]  # or filter on 'https'; the `x and` skips None hrefs
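Substring tests like "www" not in x are fragile (an internal link could contain "www" in its path). A more robust variant of the same filtering idea, sketched for Python 3 with the standard library only, treats any link that parses with a network location as external; the sample hrefs are taken from the output above:

```python
from urllib.parse import urlparse

hrefs = [
    "index",
    "solutions#internet-of-things",
    "https://www.facebook.com/KTeqSolutions/",
    None,            # <a> tags without an href yield None
    "#contactform",
]

# A link is internal when it is relative, i.e. urlparse finds no netloc.
internal = [h for h in hrefs if h and not urlparse(h).netloc]
print(internal)  # ['index', 'solutions#internet-of-things', '#contactform']
```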

智慧大石

You can use parse_url from the requests module:

import requests

url = 'https://www.facebook.com/KTeqSolutions/'
requests.urllib3.util.parse_url(url)

which gives you:

Url(scheme='https', auth=None, host='www.facebook.com', port=None, path='/KTeqSolutions/', query=None, fragment=None)
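One caveat with dropping every link that has a host: internal links written as absolute URLs would be dropped too. A sketch (standard library only; the sample links are illustrative) that instead compares each link's host against the host of the page being scraped:

```python
from urllib.parse import urlparse

base = "http://kteq.in/services"
base_host = urlparse(base).netloc  # 'kteq.in'

hrefs = [
    "solutions#robotics",
    "http://kteq.in/contact",                   # absolute, but same site
    "https://www.facebook.com/KTeqSolutions/",  # external
]

# Keep relative links (empty netloc) and absolute links pointing at the same host.
same_site = [h for h in hrefs if h and urlparse(h).netloc in ("", base_host)]
print(same_site)  # ['solutions#robotics', 'http://kteq.in/contact']
```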
