This blog article introduces an efficient way to detect previously unknown hosting providers and their IP ranges. The detection algorithm aims to detect hosting providers from the entire Internet.
Too long; Didn’t read
This blog article demonstrates how it is possible to systematically detect new, previously unknown hosting providers from the Internet. Knowing the IP ranges of hosting providers is important for many IT security applications and yields better coverage for the is_datacenter field of the ipapi.is API.
The algorithm developed in this blog post found many new hosting providers. You can download the file classifiedHostingProviders.tsv, which contains around 96,000 IP ranges of previously unknown hosting providers. Those 96,000 hosting IP ranges belong to more than 2,000 newly detected hosting providers. At the time of writing, the algorithm is still running.
Download classifiedHostingProviders.tsv
What are hosting providers?
Hosting providers are organizations that allow third parties to purchase computing resources such as Virtual Private Servers (VPS) or bare metal servers. Such servers are often assigned a public IP address and are reachable from anywhere on the Internet. Hosting instances are used to run web servers, mail servers, file servers, SSH servers or other kinds of software that require steady uptime and a publicly reachable IP address.
The definition of hosting providers is quite lenient for the purpose of ipapi.is and includes all of the following:
- Normal hosting providers such as Hetzner.de or Leaseweb.com
- Large cloud providers such as Amazon AWS or Microsoft Azure
- Content Delivery Networks such as Cloudflare, Fastly or edg.io
- Anti-DDOS services such as qrator.net or ddos-guard.net
- IP leasing organizations such as ipxo.com or interlir.com
- Other SaaS, IaaS, or PaaS organizations such as fly.io or Heroku
Put differently: Every organization that allows anybody to quickly and anonymously obtain hosting resources or IP addresses is considered to be a hosting provider.
Why is hosting detection important?
Hosting resources can be used to run software that was developed with malicious intent in mind. Threat actors often abuse hosting infrastructure to run proxy servers or VPN servers in order to anonymously commit cyber crime. Furthermore, hosting resources are also frequently abused to run advanced bots or crawlers. For instance, many bot programmers run Headless Chrome from cloud instances.
Therefore, once an IP address is known to belong to a hosting provider, traffic from it can be treated as a potential threat. This knowledge helps to mitigate malicious traffic.
Furthermore, it is not sufficient to only know the IP ranges of the major hosting providers. Many threat actors move their operations to smaller, lesser-known hosting providers.
The attentive reader might object at this point: “Such illegal software can also be run on personal laptops or workstations. Why are hosting providers more prone to malicious activity?”
It is correct that threat actors don’t exclusively use hosting resources to commit their crimes.
However, from the perspective of a website or app operator, there is almost no good reason why organic user traffic should originate from hosting IP ranges. Or to put it differently: Almost no legitimate user browses the web from a server hosted somewhere on the Internet.
The only plausible reason why humans might have hosting IP addresses is that they are using VPN or proxy servers hosted in the cloud. In all other cases, it can be assumed that traffic originating from hosting ranges comes from bots and other malicious programs hosted in the cloud or in datacenters.
There are likely some edge cases where legitimate users have cloud IP ranges, but usually, the average human Internet user surfs with either a residential or mobile IP address (ISP IP addresses).
Goals of this Research
The goal of this article is to present a scalable algorithm that finds new hosting providers and the IP ranges that belong to them. The false positive rate – classifying organizations as hosting providers even though they are not – should be as small as possible. The newly detected hosting IP ranges will be used by the ipapi.is API to populate the is_datacenter field.
Some examples for IP addresses that belong to hosting providers:
It is not possible to find every single hosting provider that exists on the Internet. But it certainly is possible to find a substantial share of them.
The Hosting Detection Algorithm
The hosting detection algorithm is explained in the following sections. The algorithm is fully automatic and thus does not require manual human interaction. The algorithm consists of six different processing steps. Some processing steps are inherently risky and can lead to false positives (Classifying IP addresses as hosting IPs, even though they are not).
In general, the false positive rate should be minimized as much as possible, even if the false negative rate increases as a consequence. Put differently: The algorithm should rather miss some hosting providers than risk wrongly classifying an organization as a hosting provider.
Step 1: Download a List of Top 1 Million Domain Names
In a first step, a list of the top 1 million domain names is downloaded from Cloudflare Radar. This list includes the Top 1,000,000 domains of the entire Internet. If you are interested in how exactly the Domain Ranking is computed, you can read about Cloudflare’s Domain Ranking method on their blog.
Step 2: Look up the IP Address for every Domain Name
In the next step, each of the 1 million domain names must be resolved. Resolving means that the DNS name is translated into an IP address. This is a rather time-consuming process, since it involves looking up 1 million domain names. The following python3 script does exactly that:
import json
import os
import random
import socket


def flush(results, fn='res.json'):
    """Merge the given results into the JSON file on disk."""
    print(f'Flushing {len(results)} IPs to disk')
    parsed = dict()
    if os.path.exists(fn):
        with open(fn) as fd:
            parsed = json.load(fd)
    parsed.update(results)
    with open(fn, 'w') as fd:
        json.dump(parsed, fd, indent=2)


def lookup_addresses(domain_list, flush_after=200):
    """Resolve each domain to an IPv4 address, flushing to disk in batches."""
    ip_addresses = dict()
    for domain in domain_list:
        try:
            ip_addresses[domain] = socket.gethostbyname(domain)
        except (socket.gaierror, UnicodeError) as err:
            # keep the error message so failed lookups can be filtered out later
            ip_addresses[domain] = str(err)
        if len(ip_addresses) >= flush_after:
            flush(ip_addresses)
            ip_addresses = dict()
    # flush the last, incomplete batch
    if ip_addresses:
        flush(ip_addresses)


if __name__ == "__main__":
    with open('top1M.csv') as fd:
        # one domain per line, empty lines are skipped
        domain_list = [line.strip() for line in fd if line.strip()]
    random.shuffle(domain_list)
    print(f'Looking up {len(domain_list)} domains')
    # socket.gethostbyname() always uses the resolver configured at the
    # system level. To resolve via 1.1.1.1, set it as the system DNS server
    # (for example in /etc/resolv.conf); it cannot be set on the socket module.
    lookup_addresses(domain_list)
The lookup process took around 2 days and can be parallelized of course. After looking up all 1 million domain names, a JSON file is obtained that has the following structure (Only a small excerpt of the full file is shown):
{
"bonanza-play.com": "104.21.72.206",
"rhythm.cloud": "76.223.17.25",
"poryadok.ru": "104.22.65.119",
"casino-x-noq.buzz": "104.21.19.128",
"pin-up-casino64.ru": "172.67.173.230",
"nostroy.ru": "89.253.229.54",
"mostbet-ru.life": "172.67.204.37",
"artmotion.net": "104.26.6.15",
"latnoticias365.com": "51.77.14.1",
"rtkba.com": "104.21.89.225",
"transitcard.ru": "89.104.86.143",
"alsatiapolynia.com": "[Errno -5] No address associated with hostname",
"xhfu.cn": "[Errno 8] nodename nor servname provided, or not known",
"joycasino-a16.top": "172.67.179.211",
"redstarslots.ru": "176.10.250.233",
"tdspsden.com": "172.64.149.13",
"vireq.com": "13.224.103.119",
"rickhendrickdodge.com": "54.243.57.127",
"pulsure.dk": "185.31.79.5",
"loomis.com": "52.17.152.5",
"allianceservices.im": "[Errno 8] nodename nor servname provided, or not known",
"loups-garous-en-ligne.com": "172.67.72.203"
}
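Since the sequential lookup takes around two days, here is a minimal sketch of how the resolution could be parallelized with a thread pool. The names resolve and lookup_parallel and the worker count of 50 are assumptions made for illustration; they are not part of the script above.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve(domain):
    """Resolve one domain; return (domain, IPv4 address or error message)."""
    try:
        return domain, socket.gethostbyname(domain)
    except (OSError, UnicodeError) as err:
        return domain, str(err)


def lookup_parallel(domains, resolver=resolve, workers=50):
    """Resolve many domains concurrently with a pool of worker threads.

    workers=50 is an arbitrary illustrative value, not a tuned setting.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order and yields (domain, result) pairs
        return dict(pool.map(resolver, domains))
```

Threads fit here because each lookup spends most of its time waiting on the network. The resolver argument is only there so that the function can be exercised without real DNS traffic.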
Step 3: Clean the obtained IP Addresses
The next step involves removing from the JSON file those IP addresses that the ipapi.is API already detects as hosting provider IP addresses. The reason is simple: IP addresses that are known to belong to hosting providers don’t need to be detected again. Furthermore, duplicates are removed from the resulting list of IP addresses.
After those two steps, a list of IP addresses that ipapi.is API currently does not classify as hosting IP addresses is obtained.
From 1,001,400 domain names in total, 906,519 IPv4 addresses were obtained (the rest failed to resolve correctly). From those 906,519 IPv4 addresses, 721,823 were already classified as datacenter IPs by the ipapi.is API. The rest (184,696 IPs) were de-duplicated and yielded roughly 100,000 unique IP addresses that are candidates for the algorithm.
The list is called the candidate IP address list, since there is a good chance that those IP addresses belong to a hosting provider that was previously unknown to the ipapi.is API.
Why is that the case?
While some organizations that are not hosting providers might choose to host their own domain (such as universities, large organizations or government entities), most organizations do not run their own datacenters and instead rent hosting resources from a professional hosting provider. And since the list contains 1 million domains, it is very likely that a significant share of the Internet’s existing hosting providers is represented in this list.
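The cleaning in Step 3 can be sketched roughly as follows. The function names are made up for this example, and is_known_hosting is a hypothetical callable that stands in for a lookup against the is_datacenter field of the ipapi.is API.

```python
import ipaddress


def is_ipv4(value):
    """Return True if value is a valid IPv4 address (not an error message)."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


def build_candidates(resolved, is_known_hosting):
    """Drop failed lookups and known hosting IPs, then de-duplicate.

    resolved maps domain -> IP address or error string (as in res.json).
    is_known_hosting is a hypothetical stand-in for the ipapi.is lookup.
    """
    candidates = set()
    for value in resolved.values():
        if is_ipv4(value) and not is_known_hosting(value):
            candidates.add(value)
    # sort numerically so the output is stable
    return sorted(candidates, key=ipaddress.IPv4Address)
```

Failed lookups are recognized simply by the fact that the stored error message does not parse as an IPv4 address.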
This is an excerpt of the candidate IP address list:
[
"66.51.127.80",
"119.110.249.22",
"185.145.195.71",
"209.203.26.244",
"176.102.65.18",
"85.92.117.211",
"45.135.121.27",
"178.248.235.42",
"178.35.253.211",
"89.30.219.98",
"194.9.149.53",
"210.31.101.1",
"61.31.224.233",
"185.165.31.203",
"193.148.244.24",
]
There is no trivial way to infer whether the IP address belongs to a hosting provider or not without having more information about each particular IP address.
A straightforward idea is to find the organization that is the owner of this IP address and to detect whether the owning organization is a hosting provider or not. Based on the organization name alone it is (usually) not possible to answer this. Therefore, the organization’s website must be crawled.
But first, the organizations responsible for the IPs in the candidate IP address list must be found.
Step 4: Obtain WHOIS Records for every IP Address
First, the organization that owns the IP address needs to be obtained. This is possible by conducting a WHOIS lookup for each IP address of the candidate IP address list. For example, running whois 66.51.127.80 yields the following WHOIS record:
NetRange: 66.51.120.0 - 66.51.127.255
CIDR: 66.51.120.0/21
NetName: FLYIO
NetHandle: NET-66-51-120-0-1
Parent: NET66 (NET-66-0-0-0-0)
NetType: Direct Allocation
OriginAS:
Organization: Fly.io, Inc. (FLYIO)
RegDate: 2021-12-06
Updated: 2021-12-06
Ref: https://rdap.arin.net/registry/ip/66.51.120.0
OrgName: Fly.io, Inc.
OrgId: FLYIO
Address: PO Box 803338 #19104
City: Chicago
StateProv: IL
PostalCode: 60680-3338
Country: US
RegDate: 2017-01-18
Updated: 2023-07-07
Ref: https://rdap.arin.net/registry/entity/FLYIO
OrgTechHandle: SANDE663-ARIN
OrgTechName: Sanders, Scott
OrgTechPhone: +1-803-767-0060
OrgTechEmail: [email protected]
OrgTechRef: https://rdap.arin.net/registry/entity/SANDE663-ARIN
OrgAbuseHandle: ABUSE8489-ARIN
OrgAbuseName: Abuse
OrgAbusePhone: +1-312-626-4490
OrgAbuseEmail: [email protected]
OrgAbuseRef: https://rdap.arin.net/registry/entity/ABUSE8489-ARIN
OrgNOCHandle: FLYOP-ARIN
OrgNOCName: Fly Ops
OrgNOCPhone: +1-312-283-4377
OrgNOCEmail: [email protected]
OrgNOCRef: https://rdap.arin.net/registry/entity/FLYOP-ARIN
OrgTechHandle: BERRY359-ARIN
OrgTechName: Berryman, Steve
OrgTechPhone: +447886749129
OrgTechEmail: [email protected]
OrgTechRef: https://rdap.arin.net/registry/entity/BERRY359-ARIN
OrgTechHandle: FLYOP-ARIN
OrgTechName: Fly Ops
OrgTechPhone: +1-312-283-4377
OrgTechEmail: [email protected]
OrgTechRef: https://rdap.arin.net/registry/entity/FLYOP-ARIN
Limitations of conducting WHOIS lookups
Because our candidate IP list contains roughly 100,000 IP addresses, WHOIS servers may rate limit us when we query too fast. Therefore, a realistic speed that stays under the radar is maybe 20,000 WHOIS lookups per day, so the whole process takes at least 5 days (which is fine).
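To make the numbers concrete: 20,000 lookups per day means one lookup every 4.32 seconds, and roughly 100,000 candidates then take 5 days. Below is a minimal throttling sketch under these assumptions; the helper names are invented for this example and the daily budget is the estimate from the text, not a limit published by any WHOIS server.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_between_lookups(lookups_per_day):
    """Delay between two lookups that keeps us at the given daily budget."""
    return SECONDS_PER_DAY / lookups_per_day


def days_needed(total_lookups, lookups_per_day):
    """Days required to finish all lookups at the given daily budget."""
    return total_lookups / lookups_per_day


def throttled(items, lookups_per_day, sleep=time.sleep):
    """Yield items one by one, pausing between them to respect the budget."""
    delay = seconds_between_lookups(lookups_per_day)
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay)
        yield item
```

A WHOIS loop could then iterate over throttled(candidate_ips, 20_000) instead of over the raw list.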
Step 5: Parse the WHOIS record and extract the Company Name and Domain
Based on the WHOIS example above, it is still not possible to say whether the organization Fly.io, Inc. is a hosting provider or not. The next goal is to find the organization’s website URL. Two attributes from the WHOIS record can be of help:
- The organization name can be parsed from the OrgName: Fly.io, Inc. attribute. The organization name can be Googled and the first search result could be the organization’s website.
- The domain can be parsed from the OrgAbuseEmail: [email protected] attribute and the domain might be the same as the domain of the organization’s website (which is the case for fly.io).
Possible Limitations:
- The WHOIS record does not include a domain. Action: Proceed with the organization name.
- The WHOIS record does not include an organization name. Action: Proceed with the domain.
- If neither the domain nor the organization name is available, the algorithm aborts.
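Extracting the two attributes could look like the sketch below. It only handles the ARIN-style "Key: value" layout of the record shown earlier, which is an assumption of this example; other registries use different field names. The function name parse_whois is likewise made up for illustration.

```python
import re
from collections import Counter


def parse_whois(record):
    """Extract the organization name and e-mail domains from a WHOIS record.

    Returns (org_name or None, Counter of e-mail domains). Either part may
    be missing, which corresponds to the limitations listed above.
    """
    org_name = None
    match = re.search(r"^OrgName:\s*(.+)$", record, flags=re.MULTILINE)
    if match:
        org_name = match.group(1).strip()
    # count how often each e-mail domain occurs in the record
    domains = Counter(
        domain.lower()
        for domain in re.findall(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)", record)
    )
    return org_name, domains
```

If the function returns None together with an empty counter, the record offers nothing to work with and the IP address is skipped.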
If a domain is available in the WHOIS record, the following limitations apply:
- The WHOIS record includes a misleading domain that is not the domain of the organization’s website. Action: Check against a blacklist of known bad domains.
- The organization domain is not the primary website of the organization but only a technical domain. Action: This cannot be detected; a false positive is obtained.
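The blacklist check could look like the sketch below, which also orders the remaining domains by how often they occur in the WHOIS record. The entries in BAD_DOMAINS are hypothetical examples (registry and free-mail domains); the real list is not published in this article.

```python
from collections import Counter

# Hypothetical blacklist: registry and free-mail domains that often show up
# in WHOIS contact fields but are not the organization's own website.
BAD_DOMAINS = {"arin.net", "ripe.net", "apnic.net", "gmail.com", "outlook.com"}


def rank_domains(domains, bad_domains=BAD_DOMAINS):
    """Drop blacklisted domains, then order the rest by frequency."""
    counts = Counter(d.lower() for d in domains if d.lower() not in bad_domains)
    # most_common() sorts by count, descending; ties keep first-seen order
    return [domain for domain, _ in counts.most_common()]
```

The most frequent surviving domain is then the first one to be tried as the organization’s website.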
If only the organization name is available in the WHOIS record, the following limitations apply:
- The name of the organization cannot be found on Google. Action: Abort the algorithm.
- The name of the organization leads to wrong search results and a wrong organization URL is obtained. Problem: It is not easy to tell whether the search result is really the organization’s website. Action: Use a text similarity metric between the organization name and the URL.
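One possible text similarity metric is the ratio computed by difflib.SequenceMatcher from the Python standard library. The sketch below is only one option among many; the list of legal suffixes and the threshold of 0.6 are assumptions that would need tuning on real data.

```python
import re
from difflib import SequenceMatcher

# Legal suffixes that carry no signal; this list is illustrative only.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "gmbh", "corp", "co"}


def normalize(text):
    """Lowercase, split on non-alphanumerics, drop legal suffixes, and join."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return "".join(t for t in tokens if t and t not in LEGAL_SUFFIXES)


def similarity(org_name, domain):
    """Similarity ratio in [0, 1] between an org name and a domain name."""
    return SequenceMatcher(None, normalize(org_name), normalize(domain)).ratio()


def is_similar(org_name, domain, threshold=0.6):
    """threshold=0.6 is an assumed cut-off, not a value from the article."""
    return similarity(org_name, domain) >= threshold
```

For the example record, "Fly.io, Inc." and "fly.io" both normalize to "flyio" and therefore count as similar.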
Step 6: Crawl the Website and Classify the Website Text
After visiting the website fly.io, it becomes clear that Fly.io, Inc. is an infrastructure-as-a-service organization that sells specialized hosting resources. This is close enough, and the organization Fly.io, Inc. and all their known IP ranges can be classified as hosting IP ranges.
The only way to classify an organization’s website as a hosting provider or not is some kind of text classification approach. If the website includes a certain quantity of hosting keywords, the website is classified as a hosting provider. Machine learning could be used as well, but this is out of scope for this quick article.
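A keyword-based classifier of this kind can be sketched in a few lines. The keyword list and the threshold of three distinct keywords are assumptions made for this example; they are not the values used by ipapi.is.

```python
import re

# Illustrative keyword list; a real deployment would need a curated and
# probably weighted vocabulary.
HOSTING_KEYWORDS = {
    "vps", "dedicated server", "bare metal", "cloud hosting",
    "web hosting", "colocation", "data center", "virtual machine",
}


def hosting_score(text, keywords=HOSTING_KEYWORDS):
    """Count how many distinct hosting keywords occur in the page text."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return sum(
        1 for kw in keywords
        if re.search(rf"\b{re.escape(kw)}\b", normalized)
    )


def is_hosting_provider(text, min_keywords=3):
    """min_keywords=3 is an assumed threshold that would need tuning."""
    return hosting_score(text) >= min_keywords
```

Counting distinct keywords rather than total occurrences keeps a single repeated word from dominating the score.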
Limitations of Crawling a Website
- Crawling the URL results in a ban, since the website has crawling protection. Action: Abort the algorithm or try again later.
- The website rate limits requests and thus the request is blocked. Action: Abort the algorithm or try again later.
- The crawling results in some kind of error (Certificate error, 404 Not Found, 503 Server Error, or similar). Action: Abort the algorithm.
Limitations of Classifying Text
- The classification result is a false negative (not classified as a hosting provider even though the website is one). Action: This happens and does not have a negative impact.
- The classification result is a false positive (classified as a hosting provider even though the website is not one). Problem: This has a large negative impact. Action: If the scoring is weak or indecisive, put the score result on a list for human verification.
- The website might be in any language. Therefore, the text classification algorithm must support the most commonly used languages on the Internet, which is hard to implement. It is better to translate the website into English. Action: Use Google Translate automatically (Do they rate limit translation?).
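Handling weak or indecisive scores could work as in the following sketch: clear scores are classified automatically, everything in between goes to a list for human verification. Both thresholds and the function names are assumed example values, chosen only to illustrate the idea.

```python
def decide(score, accept_at=5, reject_below=2):
    """Map a classification score to a decision.

    Scores between the two thresholds are treated as indecisive and are
    queued for human verification. Both thresholds are assumptions.
    """
    if score >= accept_at:
        return "hosting"
    if score < reject_below:
        return "not_hosting"
    return "needs_review"


def triage(scores):
    """Split {organization: score} into classified results and a review list."""
    classified, review_queue = {}, []
    for org, score in scores.items():
        decision = decide(score)
        if decision == "needs_review":
            review_queue.append(org)
        else:
            classified[org] = decision == "hosting"
    return classified, review_queue
```

Widening the gap between the two thresholds trades more manual work for fewer false positives, which matches the priority stated earlier.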
Algorithm Pseudo Code
What follows is the pseudo code of the hosting detection algorithm described above. The code below implements steps 4 – 6. The first three steps were left out, since they are trivial to implement.
const hostingDetectionAlgorithm = async () => {
  const candidateIps = [
    "66.51.127.80",
    "119.110.249.22",
    "61.31.224.233",
    "193.148.244.24",
    // ... (huge list of 100k candidate IPs)
  ];
  let i = 0;
  let whoisLookupServer = 'standard';
  const failures = {};
  const classified = {};

  while (i < candidateIps.length) {
    const ip = candidateIps[i];
    const whoisRecord = await whoisLookup(ip, whoisLookupServer);

    if (whoisServerDeniedRequest(whoisRecord)) {
      // retry the same IP, either with another WHOIS server or after a pause
      if (haveUnblockedWhoisLookupServer()) {
        whoisLookupServer = getNextWhoisLookupServer();
      } else {
        await waitForFiveMinutes();
      }
      continue;
    }

    // we have a good whois lookup result
    const orgName = getOrgFromWhois(whoisRecord);
    const allDomains = getDomainsFromWhois(whoisRecord) || [];

    if (!orgName && allDomains.length === 0) {
      failures[ip] = {
        ip: ip,
        status: 'failed',
        error: 'Cannot parse org name and domain name from whois record'
      };
      i++;
      continue;
    }

    const orgUrlCandidates = [];

    if (allDomains.length > 0) {
      // remove all domains that cannot be the organization's website domain
      const filteredDomains = filterBadDomains(allDomains);
      // order the filtered domains by frequency in the WHOIS record
      const sortedByFrequency = sortByDomainFrequency(filteredDomains);
      // insert organization url candidates according to domain frequency
      for (const domain of sortedByFrequency) {
        // without an organization name, proceed with the domain alone
        if (!orgName || isSimilar(domain, orgName)) {
          orgUrlCandidates.push(`https://${domain}`);
          orgUrlCandidates.push(`https://www.${domain}`);
        }
      }
    }

    // try to get the organization's website
    for (const url of orgUrlCandidates) {
      const crawlResponse = await crawlHtml(url);
      if (isCrawlSuccessful(crawlResponse)) {
        classified[ip] = {
          isHostingProvider: classifyText(crawlResponse),
          orgName: orgName,
          url: url,
        };
        break;
      }
    }

    if (!classified[ip]) {
      // no candidate URL could be crawled, record the failure
      failures[ip] = { ip: ip, status: 'failed', error: 'No organization website could be crawled' };
    }
    i++;
  }
  return { classified, failures };
};
Conclusion
The algorithm above is rather complex. Unfortunately, there are many steps in the algorithm that can go wrong or where false assumptions can be made. Furthermore, making WHOIS lookups and crawling websites for thousands of IP addresses and organization names is rather slow, and getting blocked is an issue.
Nevertheless, the discovery speed of the hosting detection algorithm doesn’t need to be high. Furthermore, with every discovered hosting provider, the algorithm becomes more efficient, since there are fewer candidate IP addresses left.
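The shrinking candidate list can be illustrated with the standard ipaddress module: once an organization is classified, every candidate inside its known ranges can be dropped without another WHOIS lookup. This is a minimal sketch; the function name is hypothetical, and the example range 66.51.120.0/21 is the Fly.io range from the WHOIS record shown earlier.

```python
import ipaddress


def prune_candidates(candidate_ips, hosting_cidrs):
    """Remove candidates that fall inside already classified hosting ranges."""
    networks = [ipaddress.ip_network(cidr) for cidr in hosting_cidrs]
    remaining = []
    for ip in candidate_ips:
        address = ipaddress.ip_address(ip)
        if not any(address in network for network in networks):
            remaining.append(ip)
    return remaining
```

A linear scan over all networks is fine for a sketch; with tens of thousands of ranges, a sorted interval lookup would be the more sensible choice.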
To conclude, it can be said that thousands of new hosting providers could be detected by applying the hosting detection algorithm to the candidate IP address list. This makes ipapi.is one of the best hosting detection APIs that exist!
Originally published on ipapi.is: An Algorithm to Detect Hosting Providers and their IP Ranges