Python Digital Forensics: A Concise Tutorial
Investigation Using Emails
The previous chapters discussed the importance and process of network forensics and the concepts involved. In this chapter, we look at the role of email in digital forensics and how to investigate it using Python.
Role of Email in Investigation
Emails play a very important role in business communication and have become one of the most important applications on the internet. They are a convenient way to send messages and documents, not only from computers but also from other electronic devices such as mobile phones and tablets.
The negative side of email is that criminals may use it to leak important information about their company. Hence, the role of email in digital forensics has grown in recent years. In digital forensics, emails are considered crucial evidence, and email header analysis has become an important way to collect evidence during the forensic process.
An investigator has the following goals while performing email forensics:
- To identify the main criminal
- To collect the necessary evidence
- To present the findings
- To build the case
Challenges in Email Forensics
Email forensics plays a very important role in investigations because most communication today relies on email. However, an email forensic investigator may face the following challenges during an investigation:
Fake Emails
The biggest challenge in email forensics is fake email created by manipulating or scripting headers. Criminals in this category also use temporary email, a service that lets a registered user receive mail at a temporary address that expires after a certain period of time.
Techniques Used in Email Forensic Investigation
Email forensics is the study of the source and content of email as evidence, in order to identify the actual sender and recipient of a message along with other information such as the date and time of transmission and the sender's intent. It involves investigating metadata, port scanning, and keyword searching.
Some common techniques that can be used for email forensic investigation are:
- Header analysis
- Server investigation
- Network device investigation
- Sender mailer fingerprints
- Software embedded identifiers
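As a small illustration of header analysis, the first technique in the list, the sketch below reads the Received headers of a message with Python's standard email package and lists the relay hops from oldest to newest. The sample message, its hosts and addresses, and the helper name received_hops() are made up for illustration; real headers vary widely and can be forged, so treat this as a starting point rather than a complete method.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical sample message; hosts and addresses are documentation examples.
SAMPLE = (
    "Received: from mx.example.net (mx.example.net [203.0.113.7])\n"
    " by inbox.example.org with ESMTP; Mon, 1 Jan 2024 10:00:05 +0000\n"
    "Received: from laptop.local ([198.51.100.23])\n"
    " by mx.example.net with ESMTP; Mon, 1 Jan 2024 10:00:01 +0000\n"
    "From: alice@example.com\n"
    "To: bob@example.org\n"
    "Subject: Hello\n"
    "\n"
    "Body text\n"
)


def received_hops(raw_message):
    """Return (route, timestamp) pairs, oldest hop first."""
    msg = message_from_string(raw_message)
    hops = []
    # Each relay prepends its own Received header, so the last one is the oldest.
    for value in reversed(msg.get_all("Received", [])):
        flat = " ".join(value.split())  # undo header folding
        route, _, stamp = flat.rpartition(";")
        try:
            when = parsedate_to_datetime(stamp.strip())
        except (TypeError, ValueError):
            when = None  # timestamp missing or malformed
        hops.append((route.strip(), when))
    return hops


if __name__ == "__main__":
    for route, when in received_hops(SAMPLE):
        print(when, "|", route)
```

Comparing the order and timestamps of the hops against the claimed sender is one simple consistency check an investigator might start with.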
In the following sections, we will learn how to fetch information using Python for the purpose of email investigation.
Extraction of Information from EML files
EML files are emails stored in file format and are widely used for storing email messages. They are structured text files that are compatible with multiple email clients such as Microsoft Outlook, Outlook Express, and Windows Live Mail.
An EML file stores email headers, body content, and attachment data as plain text. It uses base64 to encode binary data and Quoted-Printable (QP) encoding to store content information. A Python script that can be used to extract information from an EML file is given below:
First, import the following Python libraries as shown below:
from __future__ import print_function
from argparse import ArgumentParser, FileType
from email import message_from_file
import os
import quopri
import base64
Among the libraries above, quopri is used to decode QP-encoded values from EML files. Any base64-encoded data can be decoded with the help of the base64 library.
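To make the two encodings concrete, here is a minimal sketch that decodes one QP-encoded and one base64-encoded value with the same standard-library modules. The helper names decode_qp() and decode_b64() and the sample strings are illustrative only, and UTF-8 is assumed as the character set of the QP text.

```python
import base64
import quopri


def decode_qp(data, charset="utf-8"):
    """Decode Quoted-Printable bytes, then decode the result to text."""
    return quopri.decodestring(data).decode(charset)


def decode_b64(data):
    """Decode base64 text or bytes back to the raw bytes."""
    return base64.b64decode(data)


if __name__ == "__main__":
    # "=C3=A9" is the QP form of the two UTF-8 bytes for an accented e.
    print(decode_qp(b"Caf=C3=A9"))
    # A trailing "=" before a newline is a soft line break and is dropped.
    print(decode_qp(b"long=\nline"))
    print(decode_b64("aGVsbG8="))
```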
Next, let us provide an argument for the command-line handler. Note that it accepts only one argument, the path to the EML file, as shown below:
if __name__ == '__main__':
    parser = ArgumentParser('Extracting information from EML file')
    parser.add_argument("EML_FILE", help="Path to EML File", type=FileType('r'))
    args = parser.parse_args()
    main(args.EML_FILE)
Now, we need to define the main() function, in which we use the message_from_file() method from the email library to read the file-like object. We then access the headers, body content, attachments, and other payload information through the resulting variable named emlfile, as shown in the code given below:
def main(input_file):
    emlfile = message_from_file(input_file)
    # Print every header in the order it appears in the file
    for key, value in emlfile.items():
        print("{}: {}".format(key, value))
    print("\nBody\n")
    if emlfile.is_multipart():
        for part in emlfile.get_payload():
            process_payload(part)
    else:
        # A single-part message is its own payload
        process_payload(emlfile)
Now, we need to define the process_payload() method, in which we extract the message body content using the get_payload() method. We decode QP-encoded data using the quopri.decodestring() function. We also check the content MIME type so that the content of the email is stored properly. Observe the code given below:
def process_payload(payload):
    print(payload.get_content_type() + "\n" + "=" * len(payload.get_content_type()))
    body = quopri.decodestring(payload.get_payload())
    # get_content_charset() returns the charset named in Content-Type, or None
    charset = payload.get_content_charset()
    if charset:
        body = body.decode(charset)
    else:
        try:
            body = body.decode()
        except UnicodeDecodeError:
            body = body.decode('cp1252')
    if payload.get_content_type() == "text/html":
        # args is the module-level namespace created in the __main__ block
        outfile = os.path.basename(args.EML_FILE.name) + ".html"
        with open(outfile, 'w') as html_file:
            html_file.write(body)
    elif payload.get_content_type().startswith('application'):
        outfile = open(payload.get_filename(), 'wb')
        body = base64.b64decode(payload.get_payload())
        outfile.write(body)
        outfile.close()
        print("Exported: {}\n".format(outfile.name))
    else:
        print(body)
After executing the above script, we get the header information along with the various payloads on the console.
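As an alternative sketch, not the script above, the standard email package can do the transfer decoding itself: Message.walk() visits every part, including nested ones, and get_payload(decode=True) returns the bytes after undoing base64 or QP according to the part's Content-Transfer-Encoding header. The sample message built by build_sample() and the helper name summarize_parts() are assumptions made for illustration.

```python
from email import message_from_string
from email.message import EmailMessage


def build_sample():
    """Build a small two-part message (hypothetical data) and return its text."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Quarterly report"
    msg.set_content("See the attached file.\n")
    msg.add_attachment(b"\x00\x01binary", maintype="application",
                       subtype="octet-stream", filename="report.bin")
    return msg.as_string()


def summarize_parts(raw_message):
    """Return (content type, file name, decoded bytes) for every leaf part."""
    msg = message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no data of their own
        data = part.get_payload(decode=True)  # undoes base64 / QP
        parts.append((part.get_content_type(), part.get_filename(), data))
    return parts


if __name__ == "__main__":
    for content_type, name, data in summarize_parts(build_sample()):
        print(content_type, name, len(data))
```

Because walk() descends into nested multipart containers, this approach also covers messages where an attachment sits inside another multipart section.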
Analyzing MSG Files using Python
Email messages come in many different formats. MSG is one such format, used by Microsoft Outlook and Exchange. Files with the MSG extension may contain plain ASCII text for the headers and the main message body, as well as hyperlinks and attachments.
In this section, we will learn how to extract information from an MSG file using the Outlook API. Note that the following Python script works only on Windows. For this, we need to install the third-party Python library pywin32 as follows:
pip install pywin32
Now, import the following libraries using the commands shown:
from __future__ import print_function
from argparse import ArgumentParser
import os
import win32com.client
import pywintypes
Now, let us provide the arguments for the command-line handler. It accepts two arguments: the path to the MSG file and the desired output folder, as follows:
if __name__ == '__main__':
    parser = ArgumentParser('Extracting information from MSG file')
    parser.add_argument("MSG_FILE", help="Path to MSG file")
    parser.add_argument("OUTPUT_DIR", help="Path to output folder")
    args = parser.parse_args()
    out_dir = args.OUTPUT_DIR
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    main(args.MSG_FILE, args.OUTPUT_DIR)
Now, we need to define the main() function, in which we call the win32com library to set up the Outlook API, which in turn gives access to the MAPI namespace.
def main(msg_file, output_dir):
    mapi = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
    # Use the function argument rather than the global args namespace
    msg = mapi.OpenSharedItem(os.path.abspath(msg_file))
    display_msg_attribs(msg)
    display_msg_recipients(msg)
    extract_msg_body(msg, output_dir)
    extract_attachments(msg, output_dir)
Now, define the functions used in this script. The code below defines the display_msg_attribs() function, which displays various attributes of a message such as Subject, To, BCC, CC, Size, SenderName, and Sent.
def display_msg_attribs(msg):
    attribs = [
        'Application', 'AutoForwarded', 'BCC', 'CC', 'Class',
        'ConversationID', 'ConversationTopic', 'CreationTime',
        'ExpiryTime', 'Importance', 'InternetCodePage', 'IsMarkedAsTask',
        'LastModificationTime', 'Links', 'ReceivedTime', 'ReminderSet',
        'ReminderTime', 'ReplyRecipientNames', 'Saved', 'Sender',
        'SenderEmailAddress', 'SenderEmailType', 'SenderName', 'Sent',
        'SentOn', 'SentOnBehalfOfName', 'Size', 'Subject',
        'TaskCompletedDate', 'TaskDueDate', 'To', 'UnRead'
    ]
    print("\nMessage Attributes")
    for entry in attribs:
        print("{}: {}".format(entry, getattr(msg, entry, 'N/A')))
Now, define the display_msg_recipients() function, which iterates through the recipients of the message and displays their details.
def display_msg_recipients(msg):
    recipient_attrib = ['Address', 'AutoResponse', 'Name', 'Resolved', 'Sendable']
    i = 1  # Recipients start at 1
    while True:
        try:
            recipient = msg.Recipients(i)
        except pywintypes.com_error:
            break
        print("\nRecipient {}".format(i))
        print("=" * 15)
        for entry in recipient_attrib:
            print("{}: {}".format(entry, getattr(recipient, entry, 'N/A')))
        i += 1
Next, we define the extract_msg_body() function, which extracts the body content, both HTML and plain text, from the message.
def extract_msg_body(msg, out_dir):
    html_data = msg.HTMLBody.encode('cp1252')
    outfile = os.path.join(out_dir, os.path.basename(args.MSG_FILE))
    open(outfile + ".body.html", 'wb').write(html_data)
    print("Exported: {}".format(outfile + ".body.html"))
    body_data = msg.Body.encode('cp1252')
    open(outfile + ".body.txt", 'wb').write(body_data)
    print("Exported: {}".format(outfile + ".body.txt"))
Next, we define the extract_attachments() function, which exports attachment data into the desired output directory.
def extract_attachments(msg, out_dir):
    attachment_attribs = ['DisplayName', 'FileName', 'PathName', 'Position', 'Size']
    i = 1  # Attachments start at 1
    while True:
        try:
            attachment = msg.Attachments(i)
        except pywintypes.com_error:
            break
The remaining lines continue the body of the while loop in extract_attachments(): they print each attachment's attributes to the console and save the attachment into the output directory:
        print("\nAttachment {}".format(i))
        print("=" * 15)
        for entry in attachment_attribs:
            print('{}: {}'.format(entry, getattr(attachment, entry, "N/A")))
        outfile = os.path.join(os.path.abspath(out_dir), os.path.split(args.MSG_FILE)[-1])
        if not os.path.exists(outfile):
            os.makedirs(outfile)
        outfile = os.path.join(outfile, attachment.FileName)
        attachment.SaveAsFile(outfile)
        print("Exported: {}".format(outfile))
        i += 1
After running the above script, we get the attributes of the message and its attachments in the console window, along with several files in the output directory.
Structuring MBOX files from Google Takeout using Python
MBOX files are text files with special formatting that separates the messages stored within them. They are often found in association with UNIX systems, Thunderbird, and Google Takeout.
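The sketch below, offered only as an illustration, shows that formatting with the standard mailbox module: it writes two made-up messages into a temporary MBOX file, reads them back, and prints the first line of the file, which is the "From " separator that marks the start of each message. The helper names and the sample messages are hypothetical.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def make_message(subject, body):
    """Create a small sample message (addresses are placeholders)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def write_sample_mbox(path):
    """Create an MBOX file at path containing two messages."""
    box = mailbox.mbox(path)
    try:
        box.add(make_message("First", "Hello\n"))
        box.add(make_message("Second", "World\n"))
        box.flush()
    finally:
        box.close()


def read_subjects(path):
    """Return the Subject header of every message in the MBOX file."""
    box = mailbox.mbox(path)
    try:
        return [message["Subject"] for message in box]
    finally:
        box.close()


def first_line(path):
    """Return the first raw line of the file (the message separator)."""
    with open(path, "r") as handle:
        return handle.readline()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.mbox")
        write_sample_mbox(sample)
        print(read_subjects(sample))
        print(first_line(sample).rstrip())
```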
In this section, you will see a Python script that structures MBOX files obtained from Google Takeout. Before that, we need to know how to generate these MBOX files from a Google or Gmail account.
Acquiring Google Account Mailbox into MBOX Format
Acquiring a Google account mailbox means taking a backup of our Gmail account. A backup can be taken for various personal or professional reasons. Note that Google itself provides a way to back up Gmail data. To acquire our Google account mailbox in MBOX format, follow the steps given below:
- Open the My Account dashboard.
- Go to the Personal info & privacy section and select the Control your content link.
- Create a new archive or manage an existing one. Clicking the CREATE ARCHIVE link shows a check box for each Google product that can be included.
- After selecting the products, choose the file type and maximum size for the archive, and pick a delivery method from the list.
- Finally, the backup is delivered in MBOX format.
Python Code
Now, the MBOX file discussed above can be structured using Python as shown below:
First, we need to import the Python libraries as follows:
from __future__ import print_function
from argparse import ArgumentParser
import mailbox
import os
import time
import csv
from tqdm import tqdm
import base64
All of these libraries have been used and explained in earlier scripts, except the mailbox library, which is used to parse MBOX files.
Now, provide the arguments for the command-line handler. It accepts two arguments: the path to the MBOX file and the desired output folder.
if __name__ == '__main__':
    parser = ArgumentParser('Parsing MBOX files')
    parser.add_argument("MBOX", help="Path to mbox file")
    parser.add_argument(
        "OUTPUT_DIR",
        help="Path to output directory to write report and exported content")
    args = parser.parse_args()
    main(args.MBOX, args.OUTPUT_DIR)
Now, define the main() function and call the mbox class of the mailbox library, which parses an MBOX file given its path:
def main(mbox_file, output_dir):
    print("Reading mbox file")
    mbox = mailbox.mbox(mbox_file, factory=custom_reader)
    print("{} messages to parse".format(len(mbox)))
Now, define a reader method for the mailbox library as follows:
def custom_reader(data_stream):
    data = data_stream.read()
    try:
        content = data.decode("ascii")
    except (UnicodeDecodeError, UnicodeEncodeError) as e:
        content = data.decode("cp1252", errors="replace")
    return mailbox.mboxMessage(content)
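To see what this decoding fallback does to message bytes, the short sketch below isolates the same try/except logic in a hypothetical helper named decode_with_fallback() and applies it to a few hand-made byte strings. Bytes that are not valid ASCII are read as cp1252, and bytes that cp1252 does not define are replaced with the Unicode replacement character instead of raising an error.

```python
def decode_with_fallback(data):
    """Decode bytes as ASCII, falling back to cp1252 with replacement."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return data.decode("cp1252", errors="replace")


if __name__ == "__main__":
    print(decode_with_fallback(b"plain text"))
    # 0xE9 is not ASCII; in cp1252 it is a lowercase e with an acute accent.
    print(decode_with_fallback(b"caf\xe9"))
    # 0x81 is undefined in cp1252, so it becomes U+FFFD.
    print(repr(decode_with_fallback(b"bad\x81byte")))
```

The trade-off is that replacement loses the original byte, so an investigator who needs the exact content should keep the raw file alongside the parsed output.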
Now, back inside main(), create some variables for further processing as follows:
    parsed_data = []
    attachments_dir = os.path.join(output_dir, "attachments")
    if not os.path.exists(attachments_dir):
        os.makedirs(attachments_dir)
    columns = [
        "Date", "From", "To", "Subject", "X-Gmail-Labels", "Return-Path", "Received",
        "Content-Type", "Message-ID", "X-GM-THRID", "num_attachments_exported", "export_path"]
Next, use tqdm to generate a progress bar and track the iteration process as follows:
    for message in tqdm(mbox):
        msg_data = dict()
        header_data = dict(message._headers)
        for hdr in columns:
            msg_data[hdr] = header_data.get(hdr, "N/A")
Now, check whether the message has payloads. If it does, call the write_payload() method, which is defined further below, as follows:
        if len(message.get_payload()):
            export_path = write_payload(message, attachments_dir)
            msg_data['num_attachments_exported'] = len(export_path)
            msg_data['export_path'] = ", ".join(export_path)
Now, the data needs to be appended to the list. Once the loop finishes, we call the create_report() method as follows:
        parsed_data.append(msg_data)
    create_report(
        parsed_data, os.path.join(output_dir, "mbox_report.csv"), columns)
The write_payload() method called above walks the message parts recursively and exports each attachment it finds:
def write_payload(msg, out_dir):
    pyld = msg.get_payload()
    export_path = []
    if msg.is_multipart():
        for entry in pyld:
            export_path += write_payload(entry, out_dir)
    else:
        content_type = msg.get_content_type()
        if "application/" in content_type.lower():
            content = base64.b64decode(msg.get_payload())
            export_path.append(export_content(msg, out_dir, content))
        elif "image/" in content_type.lower():
            content = base64.b64decode(msg.get_payload())
            export_path.append(export_content(msg, out_dir, content))
        elif "video/" in content_type.lower():
            content = base64.b64decode(msg.get_payload())
            export_path.append(export_content(msg, out_dir, content))
        elif "audio/" in content_type.lower():
            content = base64.b64decode(msg.get_payload())
            export_path.append(export_content(msg, out_dir, content))
        elif "text/csv" in content_type.lower():
            content = base64.b64decode(msg.get_payload())
            export_path.append(export_content(msg, out_dir, content))
        elif "info/" in content_type.lower():
            export_path.append(export_content(msg, out_dir,
                                              msg.get_payload()))
        elif "text/calendar" in content_type.lower():
            export_path.append(export_content(msg, out_dir,
                                              msg.get_payload()))
        elif "text/rtf" in content_type.lower():
            export_path.append(export_content(msg, out_dir,
                                              msg.get_payload()))
        else:
            if "name=" in msg.get('Content-Disposition', "N/A"):
                content = base64.b64decode(msg.get_payload())
                export_path.append(export_content(msg, out_dir, content))
            elif "name=" in msg.get('Content-Type', "N/A"):
                content = base64.b64decode(msg.get_payload())
                export_path.append(export_content(msg, out_dir, content))
    return export_path
The if-else statements above simply choose how to decode each part based on its content type. Now, we need to define a method that builds a unique output file name for the content of the msg object as follows:
def export_content(msg, out_dir, content_data):
    file_name = get_filename(msg)
    file_ext = "FILE"
    if "." in file_name:
        file_ext = file_name.rsplit(".", 1)[-1]
    file_name = "{}_{:.4f}.{}".format(file_name.rsplit(".", 1)[0], time.time(), file_ext)
    file_name = os.path.join(out_dir, file_name)
Now, the following lines of code, which complete export_content(), actually export the file:
    # Text content is written in text mode, decoded bytes in binary mode
    mode = 'w' if isinstance(content_data, str) else 'wb'
    with open(file_name, mode) as out_file:
        out_file.write(content_data)
    return file_name
Now, let us define a function that extracts file names from the message so that the exported files are named accurately, as follows:
def get_filename(msg):
    if 'name=' in msg.get("Content-Disposition", "N/A"):
        fname_data = msg["Content-Disposition"].replace("\r\n", " ")
        fname = [x for x in fname_data.split("; ") if 'name=' in x]
        file_name = fname[0].split("=", 1)[-1]
    elif 'name=' in msg.get("Content-Type", "N/A"):
        fname_data = msg["Content-Type"].replace("\r\n", " ")
        fname = [x for x in fname_data.split("; ") if 'name=' in x]
        file_name = fname[0].split("=", 1)[-1]
    else:
        file_name = "NO_FILENAME"
    fchars = [x for x in file_name if x.isalnum() or x.isspace() or x == "."]
    return "".join(fchars)
Now, we can write a CSV file by defining the create_report() function as follows:
def create_report(output_data, output_file, columns):
    with open(output_file, 'w', newline="") as outfile:
        csvfile = csv.DictWriter(outfile, columns)
        csvfile.writeheader()
        csvfile.writerows(output_data)
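For a quick look at what csv.DictWriter produces, the sketch below writes a report into an in-memory buffer instead of a file. The helper name render_report() and the sample row are hypothetical. It also sets two optional DictWriter parameters that create_report() does not use: restval fills in columns missing from a row, and extrasaction="ignore" skips keys that are not listed in columns.

```python
import csv
import io


def render_report(rows, columns):
    """Return the CSV text for rows, using columns as the header order."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, columns, restval="N/A",
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample_rows = [
        {"Date": "2024-01-01", "From": "alice@example.com", "X-Extra": "ignored"},
    ]
    print(render_report(sample_rows, ["Date", "From", "Subject"]))
```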
Once we run the script given above, we get the CSV report and a directory full of attachments.