[Python] Web Scraping v1.0

README

Web Scraping Tool

Author: Patrick Dong

Modules Developed: abuseipdb.py, ipvoid.py, myIPwhois.py, sans.py, xforceIBM.py

OS Prerequisites [Ubuntu]

Python 2.7 with BeautifulSoup, requests, and pandas (sans.py additionally imports dns.resolver from dnspython)

Install Easy Install: $ sudo apt-get install python-setuptools python-dev build-essential

Install pip: $ sudo easy_install pip

Install virtualenv: $ sudo pip install --upgrade virtualenv
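With pip in place, the Python dependencies can be pulled in with one command. The PyPI package names below are assumptions (beautifulsoup4 provides the bs4 module, dnspython provides dns.resolver, and lxml is the parser the modules pass to BeautifulSoup):

```shell
# Assumed package names for the libraries the modules import
sudo pip install beautifulsoup4 requests pandas lxml dnspython
```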

Usage Example

[kali-linux]$ python webscraping.py <IP|Domain>

>>>>> Welcome to WebScraping Tool <<<<<
Now checking 183.63.206.116 ...

[.] IP Whois Result
% [whois.apnic.net]
% Whois data copyright terms http://www.apnic.net/db/dbcopyright.html

% Information related to '183.0.0.0 - 183.63.255.255'

% Abuse contact for '183.0.0.0 - 183.63.255.255' is '[email protected]'

inetnum: 183.0.0.0 - 183.63.255.255
netname: CHINANET-GD
descr: CHINANET Guangdong province network
descr: Data Communication Division
descr: China Telecom
country: CN
admin-c: IC83-AP
tech-c: IC83-AP
status: ALLOCATED PORTABLE
remarks: service provider
remarks: --------------------------------------------------------
remarks: To report network abuse, please contact mnt-irt
remarks: For troubleshooting, please contact tech-c and admin-c
remarks: Report invalid contact via www.apnic.net/invalidcontact
remarks: --------------------------------------------------------
mnt-by: APNIC-HM
mnt-lower: MAINT-CHINANET-GD
last-modified: 2016-05-04T00:19:59Z
source: APNIC
mnt-irt: IRT-CHINANET-CN

irt: IRT-CHINANET-CN
address: No.31 ,jingrong street,beijing
address: 100032
e-mail: [email protected]
abuse-mailbox: [email protected]
admin-c: CH93-AP
tech-c: CH93-AP
auth: # Filtered
mnt-by: MAINT-CHINANET
last-modified: 2010-11-15T00:31:55Z
source: APNIC

person: IPMASTER CHINANET-GD
nic-hdl: IC83-AP
e-mail: [email protected]
address: NO.18,RO. ZHONGSHANER,YUEXIU DISTRIC,GUANGZHOU
phone: +86-20-87189274
fax-no: +86-20-87189274
country: CN
mnt-by: MAINT-CHINANET-GD
remarks: IPMASTER is not for spam complaint,please send spam complaint to [email protected]
abuse-mailbox: [email protected]
last-modified: 2014-09-22T04:41:26Z
source: APNIC

% This query was served by the APNIC Whois Service version 1.88.15-43 (WHOIS-US4)



[.] IPVoid Result:
ITEM DATA
0 Analysis Date 2018-04-01 23:23:23
1 Elapsed Time 1 seconds
2 Blacklist Status BLACKLISTED 6/96
3 IP Address 183.63.206.116
4 Reverse DNS Unknown
5 ASN AS134772
6 ASN Owner CHINANET Guangdong province Dongguan MAN network
7 ISP China Telecom Guangdong
8 Continent Asia
9 Country Code (CN) China
10 Latitude / Longitude 23.1167 / 113.25
11 City Guangzhou
12 Region Guangdong


[.] SANS Result
Report Times: 119
Total Targets: 45
First Reported: 2017-11-02
Recent Report: 2018-03-30 14:14:27


[.] AbuseIPDB Result
Reported 8 times
Date Reporter Category
0 1 minute ago greensnow.co | Port Scan |
1 29 Mar 2018 greensnow.co | Port Scan |
2 18 Mar 2018 greensnow.co | Port Scan |
3 24 Feb 2018 greensnow.co | Port Scan |
4 05 Nov 2017 Anonymous | Port Scan |
5 18 Oct 2017 danielmellum.com | Port Scan |
6 21 Nov 2016 Anonymous | DDoS Attack | Exploited Host |
7 22 Nov 2015 greensnow.co | Hacking |

[.] IBM X-Force Result
Country: China
Risk Score: 1 (low)
Categorization: Unsuspicious

Main ‘webscraping.py’

#!/usr/bin/python
from __future__ import print_function
import sys, re, smtplib
import abuseipdb, ipvoid, sans, myIPwhois, xforceIBM


def webscraping():
    myIPvoidPrint1 = ''
    mySansPrint2 = ''
    myAbuseIPDBPrint3 = ''
    myXForcePrint4 = ''

    if len(sys.argv) != 2:
        print('>>>>> Welcome to my WebScraping Tool <<<<<')
        print('  Usage: python webscraping.py [x.x.x.x | domain]')
    else:
        print('>>>>> Welcome to WebScraping Tool <<<<<')

        # Regular expression for a dotted-quad IPv4 address
        re_ip = re.compile(r'^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$')

        print("Now checking " + sys.argv[1] + " ...")
        print("")

        # If the input is a valid IP address:
        if re_ip.match(sys.argv[1]):
            # Call myIPwhois.py
            myIPwhois.IPWhoisChecker("https://www.abuseipdb.com/whois/" + sys.argv[1])

            # Call ipvoid.py
            myIPvoidPrint1 = ipvoid.ipvoidChecker(sys.argv[1])

            # Call sans.py
            mySansPrint2 = sans.sansChecker(sys.argv[1])

            # Call abuseipdb.py
            myAbuseIPDBPrint3 = abuseipdb.abuseipdbChecker("https://www.abuseipdb.com/check/" + sys.argv[1])

            # Call xforceIBM.py
            myXForcePrint4 = xforceIBM.myXForceChecker("https://api.xforce.ibmcloud.com/ipr/" + sys.argv[1])

        # If the input is a domain or any other string, let the websites validate it
        else:
            # Call myIPwhois.py
            myIPwhois.IPWhoisChecker("https://www.abuseipdb.com/whois/" + sys.argv[1])

            # Call sans.py (it resolves the domain to an IP itself)
            # sans.sansChecker("https://isc.sans.edu/api/ip/" + sys.argv[1])
            mySansPrint2 = sans.sansChecker(sys.argv[1])

            # Call abuseipdb.py
            myAbuseIPDBPrint3 = abuseipdb.abuseipdbChecker("https://www.abuseipdb.com/check/" + sys.argv[1])

            # Call xforceIBM.py with the URL report endpoint to check the domain
            myXForcePrint4 = xforceIBM.myXForceChecker("https://api.xforce.ibmcloud.com/url/" + sys.argv[1])

        print("")
        message = "[.] IPVoid Result: " + myIPvoidPrint1 + '\n' + \
                  "[.] SANS Result: " + ' | '.join(mySansPrint2) + '\n' + \
                  "[.] AbuseIPDB Result: " + myAbuseIPDBPrint3 + '\n' + \
                  "[.] XForce Result: " + ' | '.join(myXForcePrint4)
        print(message)

        '''
        sendemail(from_addr='xxxx@gmail.com',
                  to_addr_list=['xxx@xx.co.nz'],
                  cc_addr_list=['xxx@xx.co.nz'],
                  subject='Some Testing Shxxt',
                  message=message,
                  login='xxxx',
                  password='xxx!')
        '''


def sendemail(from_addr, to_addr_list, cc_addr_list,
              subject, message,
              login, password,
              smtpserver='smtp.gmail.com:587'):
    header = 'From: %s\n' % from_addr
    header += 'To: %s\n' % ','.join(to_addr_list)
    header += 'Cc: %s\n' % ','.join(cc_addr_list)
    header += 'Subject: %s\n\n' % subject
    message = header + message

    server = smtplib.SMTP(smtpserver)
    server.starttls()
    server.login(login, password)
    problems = server.sendmail(from_addr, to_addr_list, message)
    server.quit()
    return problems


if __name__ == '__main__':
    webscraping()
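The command-line gate hinges on the dotted-quad regular expression: anything it matches is treated as an IP, everything else falls through to the domain branch. A minimal standalone sketch of that classification (the is_ip helper and the sample values are illustrative, not part of the tool):

```python
import re

# The same dotted-quad pattern used in webscraping.py, as a raw string
RE_IP = re.compile(
    r'^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
    r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$')

def is_ip(value):
    """Return True if value looks like a valid IPv4 address."""
    return RE_IP.match(value) is not None

print(is_ip('183.63.206.116'))  # True  -> the IP branch is taken
print(is_ip('256.1.1.1'))       # False -> octet out of range
print(is_ip('baidu.com'))       # False -> the domain branch is taken
```

Note the alternation per octet (25[0-5], 2[0-4][0-9], [01]?[0-9][0-9]?) is what rejects values above 255, which a naive \d{1,3} pattern would accept.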

Module ‘abuseipdb.py’

from bs4 import BeautifulSoup
import requests

import pandas as pd
from pandas import Series, DataFrame


def abuseipdbChecker(url):

    # e.g. url = "https://www.abuseipdb.com/check/220.191.211.7"
    #      url = "https://www.abuseipdb.com/check/baidu.com"
    # HTTP query
    myResult = requests.get(url)
    printResult = ''
    print '[.] AbuseIPDB Result:'

    # If the input value is invalid, such as 'baidu.comx', 'x.x.x.x.x', etc.,
    # the site answers '422 Unprocessable Entity'
    if myResult.status_code == 422:
        print 'Error: 422 Unprocessable Entity (e.g. http://www.com)'
        print 'We expected a valid IP address or Domain name.'
        exit()
    else:
        # If the domain resolved to an IP
        if url != myResult.url:
            print 'Your request has been resolved to ' + myResult.url
        c = myResult.content
        soup = BeautifulSoup(c, "lxml")

        # Part 1: Locate the report count that we want
        # reportTimes = soup.find_all(class_="well")
        mySoup = soup.find('div', {'class': 'col-md-6'})

        # The HTTP response code is still 200, but the page says:
        # "We can't resolve the domain www.comz! Please try your query again."
        if mySoup is None:
            print 'We expected a valid IP address or Domain name.'
        else:
            # Get the first 'p' tag in <div class="well">
            # ('find_all' can only be chained after 'find')
            pTag = mySoup.find('p')
            reportTimes = pTag.find('b')

            # Print the report count
            try:
                if reportTimes.string == "Important Note:":
                    print 'Note: You probably input a private IP. Please check again ...'
                    exit()
                else:
                    print 'Reported ' + reportTimes.string + ' times'
                    printResult = 'Reported ' + reportTimes.string + ' times'
            # If the result equals 'None'
            except Exception:
                reportTimes = 0
                print 'Reported ' + str(reportTimes) + ' times'
                printResult = 'Reported ' + str(reportTimes) + ' times'
            print ''

        # Part 2: Locate the report table
        tables = soup.find_all(class_="table table-striped responsive-table")

        if tables != []:
            rawData = []

            # Walk every row in the first matching table
            rows = tables[0].findAll('tr')

            for tr in rows:
                cols = tr.findAll('td')
                # Header rows contain <th> cells only, so skip them
                if len(cols) >= 4:
                    # data-title="Reporter"
                    rawData.append(cols[0].text)
                    # data-title="Date"
                    rawData.append(cols[1].text)
                    # data-title="Comment" (cols[2]) is ignored
                    # data-title="Categories"
                    rawData.append(cols[3].text + '\n')

            # Reshape rawData into three parallel lists
            reporter = []
            date = []
            category = []

            for i in range(0, len(rawData) - 2, 3):
                reporter.append(rawData[i].replace('\n', ''))
                date.append(rawData[i + 1].replace('\n', ''))
                category.append(rawData[i + 2].replace('\n\n', ' | ').replace('\n', ' | '))

            # Pandas Series
            reporter = Series(reporter)
            date = Series(date)
            category = Series(category)

            # Concatenate into a DataFrame
            legislative_df = pd.concat([date, reporter, category], axis=1)

            # Set up the columns
            legislative_df.columns = ['Date', 'Reporter', 'Category']

            # Drop the duplicates and reset the index (dropping the old index)
            legislative_df = legislative_df.drop_duplicates().reset_index(drop=True)

            # Show the finished DataFrame
            print legislative_df,
    print ''
    return printResult
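The reshaping step above (three parallel lists sliced out of one flat list, then de-duplicated through a DataFrame) can be sketched in isolation with hypothetical scraped data:

```python
import pandas as pd

# Hypothetical flat list in the same (reporter, date, category)
# repeating order that abuseipdbChecker builds while walking the table
raw_data = [
    'greensnow.co', '29 Mar 2018', 'Port Scan |',
    'Anonymous',    '05 Nov 2017', 'Port Scan |',
    'greensnow.co', '29 Mar 2018', 'Port Scan |',   # duplicate row
]

# Stride-3 slices pull each column back out of the flat list
reporter = raw_data[0::3]
date     = raw_data[1::3]
category = raw_data[2::3]

df = pd.DataFrame({'Date': date, 'Reporter': reporter, 'Category': category},
                  columns=['Date', 'Reporter', 'Category'])
df = df.drop_duplicates().reset_index(drop=True)
print(df)  # two rows: the duplicate report collapses away
```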

Module ‘ipvoid.py’

from bs4 import BeautifulSoup
import requests

import pandas as pd
from pandas import Series, DataFrame


def ipvoidChecker(ip):

    url = "http://www.ipvoid.com/ip-blacklist-check/"
    headers = {"Content-Type": "application/x-www-form-urlencoded",
               "Referer": "http://www.ipvoid.com/ip-blacklist-check/"}
    payload = {'ip': ip}
    # Note: using 'data' (form body) instead of 'params' (query string)
    r = requests.post(url, headers=headers, data=payload)
    returnData = r.content
    soup = BeautifulSoup(returnData, "lxml")

    # mySoup = soup.find('div', {'class': 'responsive'})
    tables = soup.find_all(class_="table table-striped table-bordered")

    column1 = []
    column2 = []
    printResult = ''

    if tables != []:
        rows = tables[0].findAll('tr')
        i = 0
        for tr in rows:
            i += 1
            cols = tr.findAll('td')
            column1.append(cols[0].text)
            column2.append(cols[1].text.
                           replace(" Find Sites | IP Whois", "").
                           replace(" Google Map", ""))
            # Get the Blacklist Status (the third row of the table)
            if i == 3:
                printResult = cols[1].text

    # Pandas Series
    column1 = Series(column1)
    column2 = Series(column2)

    # Concatenate into a DataFrame
    legislative_df = pd.concat([column1, column2], axis=1)

    # Set up the columns
    legislative_df.columns = ['ITEM', 'DATA']

    # Show the finished DataFrame
    print '[.] IPVoid Result: '
    print legislative_df, '\n\n'
    return printResult
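The two-column table walk in ipvoidChecker can be exercised offline against a trimmed, hypothetical fragment of the report page (using the stdlib html.parser so the sketch needs no lxml):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, trimmed version of the IPVoid report table
html = """
<table class="table table-striped table-bordered">
  <tr><td>IP Address</td><td>183.63.206.116</td></tr>
  <tr><td>Country Code</td><td>(CN) China</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# class_ with the full space-separated string matches the exact class value
table = soup.find_all(class_="table table-striped table-bordered")[0]

items, data = [], []
for tr in table.find_all('tr'):
    cols = tr.find_all('td')
    items.append(cols[0].text)
    data.append(cols[1].text)

df = pd.DataFrame({'ITEM': items, 'DATA': data}, columns=['ITEM', 'DATA'])
print(df)
```

The same pattern scales to the full report: every row is an ITEM/DATA pair, so no per-row special-casing is needed beyond the Blacklist Status pick-out.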

Module ‘myIPwhois.py’

from bs4 import BeautifulSoup
import requests


def IPWhoisChecker(url):

    # e.g. url = "https://www.abuseipdb.com/whois/114.200.4.207"
    myResult = requests.get(url)
    print '[.] IP Whois Result:'

    # If the input value is invalid, such as 'baidu.comx', 'x.x.x.x.x', etc.,
    # the site answers '422 Unprocessable Entity'
    if myResult.status_code == 422:
        print 'Error: 422 Unprocessable Entity (e.g. http://www.com)'
        print 'We expected a valid IP address or Domain name.'
        print 'Program will exit ...'
        exit()
    elif myResult.status_code == 404:
        print 'Response Error: 404 We expected a valid IP address or Domain name.'
        print 'Program will exit ...'
        exit()
    else:
        # If the domain resolved to an IP
        if url != myResult.url:
            print 'Your request has been resolved to ' + myResult.url
        c = myResult.content
        soup = BeautifulSoup(c, "lxml")

        # Parse https://www.abuseipdb.com/whois/x.x.x.x
        mySoup = soup.find('section', {'id': 'report-wrapper'})
        preTag = mySoup.find('pre')
        ipWhoisInfo = preTag.text

        if not ipWhoisInfo:
            print "None"
        else:
            print ipWhoisInfo
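The extraction itself is just two chained finds: the section by id, then the first <pre> inside it. A self-contained sketch against a hypothetical page fragment (the markup shape is assumed, modelled on what the module queries):

```python
from bs4 import BeautifulSoup

# Hypothetical shape of the abuseipdb.com whois page around the <pre> block
html = """
<section id="report-wrapper">
  <pre>inetnum: 183.0.0.0 - 183.63.255.255
netname: CHINANET-GD</pre>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.find('section', {'id': 'report-wrapper'})  # locate by id
pre_tag = wrapper.find('pre')                             # first <pre> inside it
print(pre_tag.text)                                       # raw whois text, newlines intact
```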

Module ‘sans.py’

import re

from bs4 import BeautifulSoup
import requests, dns.resolver


def sansChecker(IPOrDomain):

    # HTTP query
    url = "https://isc.sans.edu/api/ip/" + IPOrDomain

    # If the input value is a domain, try to resolve it first
    re_ip = re.compile(r'^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$')
    if not re_ip.match(IPOrDomain):
        aRecord = []
        my_resolver = dns.resolver.Resolver()
        my_resolver.nameservers = ['8.8.8.8']
        for rdata in my_resolver.query(IPOrDomain, "A"):
            aRecord.append(rdata.address)
        # Only use the first A record
        url = "https://isc.sans.edu/api/ip/" + aRecord[0]

    # The actual checking begins here
    printResult = []
    myResult = requests.get(url)
    c = myResult.content
    soup = BeautifulSoup(c, "lxml")
    mySoup = soup.find('error')
    print '[.] SANS Result:'

    # If the input IP has a correct format
    if mySoup is None:
        try:
            reportedTimes = soup.find('count')
            if reportedTimes.text != '':
                print 'Report Times ' + reportedTimes.text
                printResult.append('Report Times ' + reportedTimes.text)
            else:
                print 'Report Times 0'
                printResult.append("Report Times 0")
        except Exception:
            print 'Report Times 0'
            printResult.append("Report Times 0")

        try:
            targets = soup.find('attacks')
            if targets.text != '':
                print 'Total Targets ' + targets.text
                printResult.append('Total Targets ' + targets.text)
            else:
                print 'Total Targets 0'
                printResult.append('Total Targets 0')
        except Exception:
            print 'Total Targets 0'
            printResult.append('Total Targets 0')

        try:
            firstReported = soup.find('mindate')
            if firstReported.text != '':
                print 'First Reported ' + firstReported.text
                printResult.append('First Reported ' + firstReported.text)
            else:
                print 'First Reported 0'
                printResult.append('First Reported 0')
        except Exception:
            print 'First Reported 0'
            printResult.append('First Reported 0')

        try:
            latestReported = soup.find('updated')
            if latestReported.text != '':
                print 'Recent Report ' + latestReported.text
                printResult.append('Recent Report ' + latestReported.text)
            else:
                print 'Recent Report 0'
                printResult.append('Recent Report 0')
        except Exception:
            print 'Recent Report 0'
            printResult.append('Recent Report 0')

        print "\n"

    # Else the API says the input IP is wrong
    elif mySoup.text == 'bad IP address':
        print 'We expected a valid IP address.'
        exit()

    return printResult
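The four try blocks above all follow one pattern: find a tag in the API response, print its text, and fall back to 0 when the tag is empty or missing. That tag-by-tag extraction can be sketched against a hypothetical response body (the tag names count, attacks, mindate, updated are the ones the module reads; the values are made up):

```python
from bs4 import BeautifulSoup

# Hypothetical body in the shape returned by isc.sans.edu/api/ip/
xml = """
<ip>
  <count>119</count>
  <attacks>45</attacks>
  <mindate>2017-11-02</mindate>
  <updated>2018-03-30 14:14:27</updated>
</ip>
"""

soup = BeautifulSoup(xml, "html.parser")
results = []
for tag, label in [('count', 'Report Times'), ('attacks', 'Total Targets'),
                   ('mindate', 'First Reported'), ('updated', 'Recent Report')]:
    node = soup.find(tag)
    # Fall back to '0' for a missing or empty tag, as sansChecker does
    text = node.text if node is not None and node.text != '' else '0'
    results.append(label + ' ' + text)

print('\n'.join(results))
```

Driving the four lookups from one table removes the repeated try/except blocks while keeping the same output lines.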

Module ‘xforceIBM.py’

import requests
from requests.auth import HTTPBasicAuth

import json


def myXForceChecker(url):

    # API credentials (key / password pair issued by X-Force Exchange)
    XFORCE_USER = '473284ee-2c45-4719-a201-5e6c81c0253a'
    XFORCE_PASS = '8acd0774-7238-4ad7-bc09-a2003ca6e80f'

    print ''
    print '[.] IBM X-Force Result:'

    printResult = []
    # e.g. url = "https://api.xforce.ibmcloud.com/ipr/114.200.4.207"
    # Authenticate first, then fetch the IP report
    myResult1 = requests.get(url, auth=HTTPBasicAuth(XFORCE_USER, XFORCE_PASS))
    c1 = myResult1.content
    myJson1 = json.loads(c1)

    # >>>>>>>>>>> IP/Domain report check <<<<<<<<<<<<<
    # Uncomment to dump the whole response, sorted and indented:
    # sortedData = json.dumps(myJson1, sort_keys=True, indent=2)
    # print sortedData

    # ---------- These three keys are for the IP checker ----------
    # [Print] Geo information (only the country; ignore the country code)
    if "geo" in myJson1 and "country" in myJson1["geo"]:
        geo = "Country: " + str(myJson1["geo"]["country"])
        print geo
        printResult.append(geo)
    # [Print] Overall risk score
    if "score" in myJson1:
        if myJson1["score"] == 1:
            print "Risk Score: " + str(myJson1["score"]) + " (low)"
            printResult.append("Risk Score: " + str(myJson1["score"]) + " (low)")
        else:
            print "Risk Score: " + str(myJson1["score"])
            printResult.append("Risk Score: " + str(myJson1["score"]))
    # [Print] Categorization
    if "cats" in myJson1:
        if myJson1["cats"]:
            for key, value in myJson1["cats"].items():
                cat = str(key) + " (" + str(value) + "%)"
                print "Categorization: " + cat
                printResult.append("Categorization: " + cat)
        else:
            print "Categorization: Unsuspicious"
            printResult.append("Categorization: Unsuspicious")

    # ---------- These keys are for the Domain checker ----------
    if "result" in myJson1:
        myJsonResult = myJson1["result"]
        if myJsonResult["score"] == 1:
            print "Risk Score: " + str(myJsonResult["score"]) + " (low)"
            printResult.append("Risk Score: " + str(myJsonResult["score"]) + " (low)")
        else:
            print "Risk Score: " + str(myJsonResult["score"])
            printResult.append("Risk Score: " + str(myJsonResult["score"]))

        if myJsonResult["categoryDescriptions"]:
            for key, value in myJsonResult["categoryDescriptions"].items():
                cat = "<" + str(key).replace(" / ", "|") + ">: " + str(value)
                print cat
                printResult.append(cat)

    return printResult
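The key-by-key walk over the /ipr/ JSON can be sketched without touching the network, using a hypothetical trimmed response (the field names geo, score, and cats are the ones the module reads; the sample values are made up to mirror the usage example above):

```python
import json

# Hypothetical, trimmed /ipr/ response body
body = '{"geo": {"country": "China", "countrycode": "CN"}, "score": 1, "cats": {}}'
report = json.loads(body)

lines = []
# Country: read explicitly by key, ignoring the country code
if "geo" in report and "country" in report["geo"]:
    lines.append("Country: " + report["geo"]["country"])
# Risk score, with the "(low)" suffix for a score of 1
if "score" in report:
    score = report["score"]
    lines.append("Risk Score: %s%s" % (score, " (low)" if score == 1 else ""))
# Categorization: an empty cats dict means no suspicious category
if "cats" in report:
    if report["cats"]:
        for name, pct in report["cats"].items():
            lines.append("Categorization: %s (%s%%)" % (name, pct))
    else:
        lines.append("Categorization: Unsuspicious")

print('\n'.join(lines))
```

Guarding each key with an `in` test is what lets the same function accept both the /ipr/ shape (top-level keys) and the /url/ shape (everything nested under "result").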