Python Digital Forensics: A Concise Tutorial

Python Digital Network Forensics-II

The previous chapter introduced some network forensics concepts using Python. This chapter explores network forensics with Python in more depth.

Web Page Preservation with Beautiful Soup

The World Wide Web (WWW) is a unique information resource. However, its legacy is at high risk because content is being lost at an alarming rate. A number of cultural heritage and academic institutions, non-profit organizations and private businesses have explored the issues involved and contributed to the development of technical solutions for web archiving.

Web page preservation, or web archiving, is the process of gathering data from the World Wide Web, ensuring that the data is preserved in an archive, and making it available to future researchers, historians and the public. Before going further, let us discuss some important issues related to web page preservation −

Change in Web Resources − Web resources change every day, which is a challenge for web page preservation.
Large Quantity of Resources − Another issue is the sheer quantity of resources that must be preserved.
Integrity − Web pages must be protected from unauthorized amendment, deletion or removal to protect their integrity.
Dealing with Multimedia Data − Preserved web pages often include multimedia data, which can cause problems during preservation.
Providing Access − Besides preservation, the issues of providing access to web resources and of handling ownership also need to be solved.

In this chapter, we are going to use a Python library named Beautiful Soup for web page preservation.

What is Beautiful Soup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It cannot fetch a web page itself, so it is typically used together with urllib, which retrieves the document that is then passed to Beautiful Soup to create a soup object. You can learn more about it at www.crummy.com/software/BeautifulSoup/bs4/doc/

Note that Beautiful Soup is a third-party library, so we must install it before use with the following command −

pip install bs4

Alternatively, using the Anaconda package manager, we can install Beautiful Soup as follows −

conda install -c anaconda beautifulsoup4

Python Script for Preserving Web Pages

The Python script for preserving web pages with the third-party library Beautiful Soup is discussed here −

First, import the required libraries as follows −

from __future__ import print_function
import argparse

from bs4 import BeautifulSoup, SoupStrainer
from datetime import datetime

import hashlib
import logging
import os
import ssl
import sys
from urllib.request import urlopen

import urllib.error
logger = logging.getLogger(__name__)

Note that this script takes two positional arguments, the URL to be preserved and the desired output directory, plus an optional -l argument for the log file path, as shown below −

if __name__ == "__main__":
   parser = argparse.ArgumentParser('Web Page preservation')
   parser.add_argument("DOMAIN", help="Website Domain")
   parser.add_argument("OUTPUT_DIR", help="Preservation Output Directory")
   parser.add_argument("-l", help="Log file path",
   default=__file__[:-3] + ".log")
   args = parser.parse_args()

Now, set up logging for the script by attaching a stream handler and a file handler to the logger, and record the details of the acquisition process as shown −

logger.setLevel(logging.DEBUG)
msg_fmt = logging.Formatter("%(asctime)-15s %(funcName)-10s""%(levelname)-8s %(message)s")
strhndl = logging.StreamHandler(sys.stderr)
strhndl.setFormatter(fmt=msg_fmt)
fhndl = logging.FileHandler(args.l, mode='a')
fhndl.setFormatter(fmt=msg_fmt)

logger.addHandler(strhndl)
logger.addHandler(fhndl)
logger.info("Starting BS Preservation")
logger.debug("Supplied arguments: {}".format(sys.argv[1:]))
logger.debug("System " + sys.platform)
logger.debug("Version " + sys.version)

Now, let us validate the desired output directory, creating it if it does not exist, and then call main() as follows −

if not os.path.exists(args.OUTPUT_DIR):
   os.makedirs(args.OUTPUT_DIR)
main(args.DOMAIN, args.OUTPUT_DIR)

Now, we will define the main() function, which extracts the base name of the website by removing the unnecessary elements before the actual name and performs additional validation on the input URL, as follows −

def main(website, output_dir):
   base_name = website.replace("https://", "").replace("http://", "").replace("www.", "")
   link_queue = set()

   if "http://" not in website and "https://" not in website:
      logger.error("Exiting preservation - invalid user input: {}".format(website))
      sys.exit(1)
   logger.info("Accessing {} webpage".format(website))
   context = ssl._create_unverified_context()
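
The substring checks above accept any string that merely contains "http://" somewhere in it. As a hedged alternative, the sketch below uses urllib.parse.urlparse from the standard library to validate the scheme and derive the base name. The helper name split_url is hypothetical and is not part of the original script. Unlike the replace() chain above, it returns only the host name, without any path.

```python
from urllib.parse import urlparse


def split_url(website):
    """Return (scheme, base_name) for an http(s) URL, or raise ValueError.

    Hypothetical helper: it is not part of the original script.
    """
    parts = urlparse(website)
    # Require an explicit http/https scheme and a non-empty host.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("invalid user input: {}".format(website))
    host = parts.netloc
    # Drop a leading "www." only, not every occurrence in the host name.
    if host.startswith("www."):
        host = host[len("www."):]
    return parts.scheme, host


if __name__ == "__main__":
    print(split_url("https://www.example.com/index.html"))
```

A caller could catch the ValueError, log it, and exit with status 1, mirroring the behaviour of main() above.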

Now, we need to open a connection to the URL by using the urlopen() method. Let us wrap the call in a try-except block as follows −

try:
   index = urlopen(website, context=context).read().decode("utf-8")
except urllib.error.HTTPError as e:
   logger.error("Exiting preservation - unable to access page: {}".format(website))
   sys.exit(2)
logger.debug("Successfully accessed {}".format(website))
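
The except clause above catches only urllib.error.HTTPError, which is raised when the server answers with an error status. Failures that happen before any response arrives, such as a DNS error or a refused connection, raise urllib.error.URLError instead (HTTPError is a subclass of it). The following is a minimal sketch of a wrapper that handles both cases; the function name fetch and the choice to return None on failure are assumptions made for illustration.

```python
import urllib.error
from urllib.request import urlopen


def fetch(url, context=None):
    """Return the decoded body of url, or None if it cannot be retrieved.

    Hypothetical helper: it is not part of the original script.
    """
    try:
        raw = urlopen(url, context=context).read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404.
        print("HTTP error {} for {}".format(e.code, url))
        return None
    except urllib.error.URLError as e:
        # No usable response: DNS failure, refused connection, bad scheme.
        print("Could not reach {}: {}".format(url, e.reason))
        return None
    # Replace undecodable bytes rather than aborting on non-UTF-8 pages.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # A data: URL keeps the demonstration offline.
    print(fetch("data:text/plain;charset=utf-8,hello"))
```

HTTPError must be listed before URLError; otherwise the more general clause would catch it first.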

The next lines of code call three functions, as explained below −

write_output() writes the first web page to the output directory.
find_links() identifies the links on this web page.
recurse_pages() iterates through and discovers all links on the web page.

write_output(website, index, output_dir)
link_queue = find_links(base_name, index, link_queue)
logger.info("Found {} initial links on webpage".format(len(link_queue)))
recurse_pages(website, link_queue, context, output_dir)
logger.info("Completed preservation of {}".format(website))

Now, let us define the write_output() method as follows −

def write_output(name, data, output_dir, counter=0):
   name = name.replace("http://", "").replace("https://", "").rstrip("//")
   directory = os.path.join(output_dir, os.path.dirname(name))

   if not os.path.exists(directory) and os.path.dirname(name) != "":
      os.makedirs(directory)

We need to log some details about the web page, then log the hash of the data by using the hash_data() method, write the data to the output file, and finally log the hash of the written file by using hash_file(), as follows −

logger.debug("Writing {} to {}".format(name, output_dir))
logger.debug("Data Hash: {}".format(hash_data(data)))
path = os.path.join(output_dir, name)
path = path + "_" + str(counter)
with open(path, "w") as outfile:
   outfile.write(data)
logger.debug("Output File Hash: {}".format(hash_file(path)))

Now, define the hash_data() method, which encodes the data as UTF-8 and generates its SHA-256 hash, and the hash_file() method, which does the same for the contents of a file, as follows −

def hash_data(data):
   sha256 = hashlib.sha256()
   sha256.update(data.encode("utf-8"))
   return sha256.hexdigest()

def hash_file(file):
   sha256 = hashlib.sha256()
   with open(file, "rb") as in_file:
      sha256.update(in_file.read())
   return sha256.hexdigest()
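
hash_file() reads the whole file into memory, which is fine for typical HTML pages but can be costly for large downloads. The sketch below is a hypothetical variant, hash_file_chunked(), that reads in fixed-size chunks; the chunk size is an arbitrary assumption. Note also that write_output() opens the file in text mode, so on platforms where the default encoding is not UTF-8 or where newlines are translated, the logged data hash and file hash may legitimately differ.

```python
import hashlib
import os
import tempfile


def hash_file_chunked(path, chunk_size=65536):
    """SHA-256 of a file, read in fixed-size chunks to bound memory use.

    Hypothetical variant of hash_file(); chunk_size is an arbitrary choice.
    """
    sha256 = hashlib.sha256()
    with open(path, "rb") as in_file:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: in_file.read(chunk_size), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


if __name__ == "__main__":
    # Write known bytes to a temporary file and compare both hashes.
    data = "preserved page\n".encode("utf-8")
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as out_file:
        out_file.write(data)
    print(hash_file_chunked(path) == hashlib.sha256(data).hexdigest())
    os.remove(path)
```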

Now, let us create a BeautifulSoup object from the web page data inside the find_links() method, keeping only links that contain the website name and skipping in-page anchors, as follows −

def find_links(website, page, queue):
   for link in BeautifulSoup(page, "html.parser",parse_only = SoupStrainer("a", href = True)):
      if website in link.get("href"):
         if not os.path.basename(link.get("href")).startswith("#"):
            queue.add(link.get("href"))
   return queue
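
find_links() keeps only href values that already contain the website name, so relative links such as "/about" are never queued. One possible refinement, sketched here with the standard library only, resolves each href against the URL of the page it came from and strips fragments. The helper normalize_links() and its parameters are hypothetical; the hrefs argument stands for the link.get("href") values collected above.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(page_url, hrefs, base_name):
    """Resolve hrefs against page_url and keep those on the target site.

    Hypothetical helper: it is not part of the original script.
    """
    links = set()
    for href in hrefs:
        # urljoin() turns "/about" or "team.html" into an absolute URL.
        absolute = urljoin(page_url, href)
        # urldefrag() removes "#section" so one page is not queued twice.
        absolute = urldefrag(absolute).url
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and base_name in parts.netloc:
            links.add(absolute)
    return links


if __name__ == "__main__":
    found = normalize_links(
        "https://example.com/docs/index.html",
        ["/about", "team.html#staff", "https://other.org/x"],
        "example.com")
    print(sorted(found))
```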

Now, we need to define the recurse_pages() method, which takes the website URL, the current link queue, the unverified SSL context and the output directory as inputs, as follows −

def recurse_pages(website, queue, context, output_dir):
   processed = []
   counter = 0

   while True:
      counter += 1
      if len(processed) == len(queue):
         break
      for link in queue.copy():
         # Skip links that were already fetched in an earlier pass
         if link in processed:
            continue
         processed.append(link)
         try:
            page = urlopen(link, context=context).read().decode("utf-8")
         except urllib.error.HTTPError as e:
            msg = "Error accessing webpage: {}".format(link)
            logger.error(msg)
            continue

Now, write the output of each web page accessed to a file by passing the link name, page data, output directory and the counter, and then add any newly found links to the queue, as follows −

write_output(link, page, output_dir, counter)
queue = find_links(website, page, queue)
logger.info("Identified {} links throughout website".format(
   len(queue)))
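
The traversal in recurse_pages() can also be expressed as a breadth-first crawl over a queue with a set of already seen URLs. The sketch below illustrates that idea in isolation and is not the script's own implementation. The callables fetch_page and extract_links and the max_pages safety limit are hypothetical; an in-memory dictionary stands in for real HTTP requests so that the example runs offline.

```python
from collections import deque


def crawl(start_url, fetch_page, extract_links, max_pages=100):
    """Breadth-first traversal that fetches each URL at most once.

    fetch_page(url) returns page text or None; extract_links(url, page)
    returns an iterable of URLs. Both are hypothetical callables supplied
    by the caller, as is the max_pages safety limit.
    """
    pending = deque([start_url])
    seen = {start_url}
    pages = {}
    while pending and len(pages) < max_pages:
        url = pending.popleft()
        page = fetch_page(url)
        if page is None:
            continue  # unreachable page: skip it and keep crawling
        pages[url] = page
        for link in extract_links(url, page):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return pages


if __name__ == "__main__":
    # A tiny in-memory "site": each key lists the pages it links to.
    site = {"a": ["b", "c"], "b": ["a", "c"], "c": ["d"]}
    result = crawl("a",
                   lambda url: "page " + url if url in site else None,
                   lambda url, page: site[url])
    print(sorted(result))
```

Using a set for seen URLs makes the membership test constant time, whereas the list used for processed above is searched linearly.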

Now, when we run this script by providing the URL of the website, the output directory and, optionally, a path to the log file, it saves a copy of each page it can reach, along with the logged hashes, for future reference.

Virus Hunting

Have you ever wondered how forensic analysts, security researchers, and incident responders tell the difference between useful software and malware? The answer lies in the question itself: without studying the malware that attackers are rapidly producing, it is nearly impossible for researchers and specialists to tell the two apart. In this section, let us discuss VirusShare, a resource that helps with this task.

Understanding VirusShare

VirusShare is the largest privately owned collection of malware samples, providing security researchers, incident responders, and forensic analysts with samples of live malicious code. It contains over 30 million samples.

A benefit of VirusShare is its freely available list of malware hashes. Anybody can use these hashes to create a very comprehensive hash set and use it to identify potentially malicious files. Before using VirusShare, we suggest that you visit https://virusshare.com for more details.

Creating Newline-Delimited Hash List from VirusShare using Python

A hash list from VirusShare can be used by various forensic tools such as X-Ways and EnCase. In the script discussed below, we are going to automate downloading the lists of hashes from VirusShare to create a newline-delimited hash list.

For this script, we need a third-party Python library, tqdm, which can be installed as follows −

pip install tqdm

Note that in this script, we first read the VirusShare hashes page and dynamically identify the most recent hash list. Then we initialize the progress bar and download the hash lists in the desired range.

First, import the following libraries −

from __future__ import print_function

import argparse
import os
import ssl
import sys
import tqdm

from urllib.request import urlopen
import urllib.error

This script takes one positional argument, the desired path for the hash set, and an optional --start argument giving the number of the first hash list to download −

if __name__ == '__main__':
   parser = argparse.ArgumentParser('Hash set from VirusShare')
   parser.add_argument("OUTPUT_HASH", help = "Output Hashset")
   parser.add_argument("--start", type = int, help = "Optional starting location")
   args = parser.parse_args()

Now, we will perform the standard input validation as follows −

directory = os.path.dirname(args.OUTPUT_HASH)
# dirname is empty when only a file name is given; makedirs("") would fail
if directory and not os.path.exists(directory):
   os.makedirs(directory)
if args.start:
   main(args.OUTPUT_HASH, start=args.start)
else:
   main(args.OUTPUT_HASH)

Now we need to define the main() function with **kwargs as a parameter. This collects any supplied keyword arguments, such as start, into a dictionary that we can query, as shown below −

def main(hashset, **kwargs):
   url = "https://virusshare.com/hashes.4n6"
   print("[+] Identifying hash set range from {}".format(url))
   context = ssl._create_unverified_context()
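
As a brief illustration of how **kwargs behaves, the sketch below uses a hypothetical function, describe_range(), that is not part of the script. Keyword arguments the caller supplies arrive as dictionary entries, and dict.get() provides a default for those that were omitted.

```python
def describe_range(hashset, **kwargs):
    """Show how optional keyword arguments arrive as a dictionary.

    Hypothetical function used only for illustration.
    """
    # dict.get() supplies the default when the caller omitted "start".
    start = kwargs.get("start", 0)
    return "writing {} from list {}".format(hashset, start)


if __name__ == "__main__":
    print(describe_range("hashes.txt"))
    print(describe_range("hashes.txt", start=5))
```

An explicit keyword parameter such as start=0 in the function signature would achieve the same result and is usually easier to read.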

Now, we need to open the VirusShare hashes page by using the urllib.request.urlopen() method. We will use a try-except block as follows −

try:
   index = urlopen(url, context = context).read().decode("utf-8")
except urllib.error.HTTPError as e:
   print("[-] Error accessing webpage - exiting..")
   sys.exit(1)

Now, identify the latest hash list from the downloaded page. You can do this by finding the last link (an a tag whose href attribute points to a VirusShare hash list) and reading the five-digit number that follows it. It can be done with the following lines of code −

tag = index.rfind(r'<a href="hashes/VirusShare_')
# The search string is 27 characters long; the next 5 are the list number
stop = int(index[tag + 27: tag + 27 + 5].lstrip("0"))

if "start" not in kwargs:
   start = 0
else:
   start = kwargs["start"]

if start < 0 or start > stop:
   print("[-] Supplied start argument must be greater than or equal "
      "to zero and no greater than the latest hash list, "
      "currently: {}".format(stop))
   sys.exit(2)
print("[+] Creating a hashset from hash lists {} to {}".format(start, stop))
hashes_downloaded = 0
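
The fixed character offsets above depend on the exact markup of the hashes page, and lstrip("0") would leave an empty string if the number were all zeros. The following is a hedged sketch of a regular-expression alternative; the helper latest_hash_list() is hypothetical, and the link format is assumed from the URL pattern this script downloads, not taken from VirusShare documentation.

```python
import re


def latest_hash_list(index_html):
    """Return the highest VirusShare_NNNNN.md5 number found in the HTML.

    Hypothetical helper; the file name pattern is an assumption based on
    the download URLs used by this script.
    """
    numbers = re.findall(r"VirusShare_(\d{5})\.md5", index_html)
    if not numbers:
        raise ValueError("no hash list links found")
    # int() copes with leading zeros, including the all-zero list "00000".
    return max(int(n) for n in numbers)


if __name__ == "__main__":
    sample = ('<a href="hashes/VirusShare_00000.md5">0</a> '
              '<a href="hashes/VirusShare_00012.md5">12</a>')
    print(latest_hash_list(sample))
```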

Now, we will use the tqdm.trange() method to create a loop and progress bar as follows −

for x in tqdm.trange(start, stop + 1, unit_scale=True, desc="Progress"):
   # Hash list numbers are zero-padded to five digits in the file name
   url_hash = "https://virusshare.com/hashes/VirusShare_{}.md5".format(
      str(x).zfill(5))
   try:
      hashes = urlopen(url_hash, context=context).read().decode("utf-8")
      hashes_list = hashes.split("\n")
   except urllib.error.HTTPError as e:
      print("[-] Error accessing webpage for hash list {}"
         " - continuing..".format(x))
      continue

After performing the above steps successfully, and still inside the loop, we open the hash set text file in a+ mode to append each list to the bottom of the text file, skipping comment lines and blank lines.

with open(hashset, "a+") as hashfile:
   for line in hashes_list:
      # Skip the comment header and blank lines in each list
      if not line.startswith("#") and line != "":
         hashes_downloaded += 1
         hashfile.write(line + '\n')
# Once the download loop has finished, report the total
print("[+] Finished downloading {} hashes into {}".format(
   hashes_downloaded, hashset))

After running the above script, you will have a single newline-delimited text file containing the MD5 hash values from every downloaded hash list.
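
To show how such a hash set might be used to flag potentially malicious files, here is a minimal sketch. The function names load_hashset(), md5_of_file() and is_known_malware() are hypothetical and not part of the script above. MD5 is used only because the VirusShare lists contain MD5 values; a match indicates a known sample, while the absence of a match proves nothing.

```python
import hashlib
import os
import tempfile


def load_hashset(path):
    """Read a newline-delimited hash list into a set of lowercase strings."""
    with open(path) as hashfile:
        return {line.strip().lower() for line in hashfile if line.strip()}


def md5_of_file(path, chunk_size=65536):
    """MD5 of a file, read in chunks; MD5 is used only to match the list."""
    md5 = hashlib.md5()
    with open(path, "rb") as in_file:
        for chunk in iter(lambda: in_file.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def is_known_malware(path, hashset):
    """True if the file's MD5 appears in the loaded hash set."""
    return md5_of_file(path) in hashset


if __name__ == "__main__":
    # Build a throwaway sample file and a hash set that contains its MD5.
    workdir = tempfile.mkdtemp()
    sample = os.path.join(workdir, "sample.bin")
    with open(sample, "wb") as out_file:
        out_file.write(b"harmless test bytes")
    listing = os.path.join(workdir, "hashes.txt")
    with open(listing, "w") as out_file:
        out_file.write(md5_of_file(sample) + "\n")
    print(is_known_malware(sample, load_hashset(listing)))
```

Loading the list into a set keeps each lookup constant time, which matters when the hash set holds millions of entries.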