Removing links with a base URL

I wrote a Python script to extract the href value from every link on a given web page:


from BeautifulSoup import BeautifulSoup
import urllib2

html_page = urllib2.urlopen("http://kteq.in/services")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

When I run the code above, I get the following output, which includes both external and internal links:


index
index
#
solutions#internet-of-things
solutions#online-billing-and-payment-solutions
solutions#customer-relationship-management
solutions#enterprise-mobility
solutions#enterprise-content-management
solutions#artificial-intelligence
solutions#b2b-and-b2c-web-portals
solutions#robotics
solutions#augement-reality-virtual-reality
solutions#azure
solutions#omnichannel-commerce
solutions#document-management
solutions#enterprise-extranets-and-intranets
solutions#business-intelligence
solutions#enterprise-resource-planning
services
clients
contact
#
#
#
https://www.facebook.com/KTeqSolutions/
#
#
#
#
#contactform
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
index
services
#
contact
#
iOSDevelopmentServices
AndroidAppDevelopment
WindowsAppDevelopment
HybridSoftwareSolutions
CloudServices
HTML5Development
iPadAppDevelopment
services
services
services
services
services
services
contact
contact
contact
contact
contact
None
https://www.facebook.com/KTeqSolutions/
#
#
#
#

I want to remove the external links that have a full URL, such as https://www.facebook.com/KTeqSolutions/, while keeping links like solutions#internet-of-things. How can I do this efficiently?
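As an aside for readers on Python 3, where urllib2 and the old BeautifulSoup import no longer exist, the href-extraction step can be sketched with the standard library alone. The HTML snippet below is illustrative (not the real page), so the example runs without network access:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (None when absent)."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self.hrefs.append(dict(attrs).get("href"))

html = '<a href="index">Home</a> <a href="solutions#robotics">R</a> <a>no href</a>'
p = HrefCollector()
p.feed(html)
print(p.hrefs)  # the anchor without an href shows up as None, as in the output above
```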


largeQ
Viewed 144 times
2 Answers

慕神8447489

If I understand you correctly, you can try the following:

l = []
for link in soup.findAll('a'):
    print link.get('href')
    l.append(link.get('href'))
l = [x for x in l if x and "www" not in x]  # or filter on 'https'; the `x and` skips None hrefs
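Substring tests like "www" not in x are fragile (an internal link could contain "www" in its path). A more robust variant of the same filtering idea, sketched for Python 3 with the standard library only, treats any link that parses with a network location as external; the sample hrefs are taken from the output above:

```python
from urllib.parse import urlparse

hrefs = [
    "index",
    "solutions#internet-of-things",
    "https://www.facebook.com/KTeqSolutions/",
    None,            # <a> tags without an href yield None
    "#contactform",
]

# A link is internal when it is relative, i.e. urlparse finds no netloc.
internal = [h for h in hrefs if h and not urlparse(h).netloc]
print(internal)  # ['index', 'solutions#internet-of-things', '#contactform']
```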

智慧大石

You can use parse_url from the requests module:

import requests

url = 'https://www.facebook.com/KTeqSolutions/'
requests.urllib3.util.parse_url(url)

which gives you:

Url(scheme='https', auth=None, host='www.facebook.com', port=None, path='/KTeqSolutions/', query=None, fragment=None)
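One caveat with dropping every link that has a host: internal links written as absolute URLs would be dropped too. A sketch (standard library only; the sample links are illustrative) that instead compares each link's host against the host of the page being scraped:

```python
from urllib.parse import urlparse

base = "http://kteq.in/services"
base_host = urlparse(base).netloc  # 'kteq.in'

hrefs = [
    "solutions#robotics",
    "http://kteq.in/contact",                   # absolute, but same site
    "https://www.facebook.com/KTeqSolutions/",  # external
]

# Keep relative links (empty netloc) and absolute links pointing at the same host.
same_site = [h for h in hrefs if h and urlparse(h).netloc in ("", base_host)]
print(same_site)  # ['solutions#robotics', 'http://kteq.in/contact']
```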
