Python Digital Forensics: A Concise Tutorial
Investigating Embedded Metadata
In this chapter, we look in detail at how to investigate embedded metadata with Python.
Introduction
Embedded metadata is information about a digital asset that is stored inside the same file as the asset it describes. Because it lives in the file itself, it travels with the file wherever the file is copied, unless someone deliberately strips it out.
In digital forensics, file system attributes alone rarely tell us everything about a file. Embedded metadata can supply details that are critical to an investigation. For example, a document's metadata may record the author, the document length, the date it was written and even a short summary. A digital image may record its dimensions, the shutter speed and similar capture settings.
Artifacts Containing Metadata Attributes and their Extraction
In this section, we look at several kinds of artifacts that carry metadata attributes and at how to extract those attributes with Python.
Audio and Video
Audio and video files are two very common artifacts that carry embedded metadata, and this metadata can be extracted for an investigation.
The following Python script extracts common attributes from an audio (MP3) file or a video (MP4) file.
This script needs the third-party Python library mutagen, which reads metadata from audio and video files. It can be installed with the following command −
pip install mutagen
The script imports the following libraries −
from __future__ import print_function
import argparse
import json
import mutagen
The command line handler takes one argument, the path to the MP3 or MP4 file. We then call mutagen.File() to open a handle to the file and dispatch on the file extension. This block belongs at the bottom of the script, after the two handler functions it calls −
if __name__ == '__main__':
    parser = argparse.ArgumentParser('Python Metadata Extractor')
    parser.add_argument("AV_FILE", help="File to extract metadata from")
    args = parser.parse_args()
    # mutagen.File() returns None when it cannot identify the file type
    av_file = mutagen.File(args.AV_FILE)
    if av_file is None:
        parser.error("Unsupported or unreadable file: {}".format(args.AV_FILE))
    file_ext = args.AV_FILE.rsplit('.', 1)[-1]
    if file_ext.lower() == 'mp3':
        handle_id3(av_file)
    elif file_ext.lower() == 'mp4':
        handle_mp4(av_file)
Now we need two handler functions, one that extracts data from MP3 files and one that extracts data from MP4 files. We can define them as follows −
def handle_id3(id3_file):
    # Map common ID3 frame IDs to readable names
    id3_frames = {
        'TIT2': 'Title', 'TPE1': 'Artist', 'TALB': 'Album',
        'TXXX': 'Custom', 'TCON': 'Content Type', 'TDRL': 'Date released',
        'COMM': 'Comments', 'TDRC': 'Recording Date'}
    print("{:15} | {:15} | {:38} | {}".format(
        "Frame", "Description", "Text", "Value"))
    print("-" * 85)
    for frames in id3_file.tags.values():
        frame_name = id3_frames.get(frames.FrameID, frames.FrameID)
        desc = getattr(frames, 'desc', "N/A")
        text = getattr(frames, 'text', ["N/A"])[0]
        value = getattr(frames, 'value', "N/A")
        # Date frames hold timestamp objects; convert them to strings
        # so the width specifier in the format string applies
        if "date" in frame_name.lower():
            text = str(text)
        print("{:15} | {:15} | {:38} | {}".format(
            frame_name, desc, text, value))
def handle_mp4(mp4_file):
    cp_sym = u"\u00A9"  # copyright sign that prefixes several atom names
    # Map MP4 atom names to readable names
    qt_tag = {
        cp_sym + 'nam': 'Title', cp_sym + 'ART': 'Artist',
        cp_sym + 'alb': 'Album', cp_sym + 'gen': 'Genre',
        'cpil': 'Compilation', cp_sym + 'day': 'Creation Date',
        'cnID': 'Apple Store Content ID', 'atID': 'Album Title ID',
        'plID': 'Playlist ID', 'geID': 'Genre ID', 'pcst': 'Podcast',
        'purl': 'Podcast URL', 'egid': 'Episode Global ID',
        'cmID': 'Camera ID', 'sfID': 'Apple Store Country',
        'desc': 'Description', 'ldes': 'Long Description'}
    # Genre lookup table, read from the current working directory
    with open('apple_genres.json') as genre_file:
        genre_ids = json.load(genre_file)
Still inside handle_mp4(), we iterate over the tags of the MP4 file and print each one as follows −
    print("{:22} | {}".format('Name', 'Value'))
    print("-" * 40)
    for name, value in mp4_file.tags.items():
        tag_name = qt_tag.get(name, name)
        if isinstance(value, list):
            value = "; ".join([str(x) for x in value])
        if name == 'geID':
            value = "{}: {}".format(
                value, genre_ids[str(value)].replace("|", " - "))
        print("{:22} | {}".format(tag_name, value))
The above script prints the embedded tags of an MP3 or MP4 file, such as title, artist, album and dates.
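To make it clearer what a library such as mutagen reads from an MP3 file, the sketch below parses the 10-byte ID3v2 header by hand using only the standard library. The function name parse_id3v2_header and the sample bytes are made up for illustration; this is not how mutagen is implemented, only a minimal reading of the header layout (the "ID3" marker, two version bytes, one flags byte and a four-byte synchsafe size).

```python
import struct


def parse_id3v2_header(data):
    """Return (version, flags, tag_size) from the first 10 bytes, or None."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None  # no ID3v2 tag at the start of the data
    major, revision, flags = struct.unpack("3B", data[3:6])
    # The size is a "synchsafe" integer: four bytes with 7 useful bits each
    size = 0
    for byte in bytearray(data[6:10]):
        size = (size << 7) | (byte & 0x7F)
    return "2.{}.{}".format(major, revision), flags, size


# Hypothetical header bytes for illustration (not taken from a real file)
sample = b"ID3\x04\x00\x00\x00\x00\x02\x01"
print(parse_id3v2_header(sample))
```

The returned size tells an examiner how many bytes of tag data follow the header, which is useful when carving tags out of a partially recovered file.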
Images
Images can contain different kinds of metadata depending on their file format. Photos taken with smartphones and GPS-enabled cameras often embed GPS information, which we can extract with third-party Python libraries. The following Python script does this −
First, install Pillow, the maintained fork of the Python Imaging Library (PIL), as follows −
pip install pillow
This library lets us read the metadata embedded in images.
We can also write the GPS details embedded in an image to a KML file. For this we need the third-party Python library simplekml, installed as follows −
pip install simplekml
The script starts by importing the following libraries −
from __future__ import print_function
import argparse
from PIL import Image
from PIL.ExifTags import TAGS
import simplekml
import sys
The command line handler accepts one positional argument, the file path of the photo.
parser = argparse.ArgumentParser('Metadata from images')
parser.add_argument('PICTURE_FILE', help = "Path to picture")
args = parser.parse_args()
Next we define two URL templates, gmaps and open_maps, which will be filled in with the coordinates. We also need a function that converts the degrees, minutes and seconds (DMS) tuple provided by the PIL library into decimal degrees. It can be done as follows −
gmaps = "https://www.google.com/maps?q={},{}"
open_maps = "http://www.openstreetmap.org/?mlat={}&mlon={}"
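def process_coords(coord):
    coord_deg = 0
    for count, values in enumerate(coord):
        # Older Pillow releases return (numerator, denominator) tuples;
        # newer ones return rational objects that float() accepts directly
        if isinstance(values, tuple):
            values = float(values[0]) / values[1]
        coord_deg += float(values) / 60**count
    return coord_deg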
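As a worked example of the DMS conversion, the sketch below applies the same arithmetic (degrees + minutes/60 + seconds/3600) to hand-written rational tuples and also applies the hemisphere sign. The helper name dms_to_decimal and the sample coordinates are illustrative assumptions, not values from a real photo.

```python
def dms_to_decimal(dms, ref):
    """Convert three (numerator, denominator) pairs to signed decimal degrees."""
    degrees = 0.0
    for power, (num, den) in enumerate(dms):
        # power 0 = degrees, 1 = minutes, 2 = seconds
        degrees += (float(num) / den) / 60 ** power
    # Southern and western hemispheres are negative
    return -degrees if ref in ("S", "W") else degrees


# Hypothetical coordinates: 40 deg 26' 46.20" N, 79 deg 58' 56.00" W
lat = dms_to_decimal(((40, 1), (26, 1), (4620, 100)), "N")
lon = dms_to_decimal(((79, 1), (58, 1), (5600, 100)), "W")
print(round(lat, 6), round(lon, 6))
```

Storing each component as a numerator and denominator pair is how EXIF represents rational numbers, which is why the seconds value 46.20 appears as (4620, 100).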
Now we use the Image.open() function to open the file as a PIL object, read its EXIF data and skip every tag except GPSInfo.
img_file = Image.open(args.PICTURE_FILE)
exif_data = img_file._getexif()
if exif_data is None:
    print("No EXIF data found")
    sys.exit()
for name, value in exif_data.items():
    gps_tag = TAGS.get(name, name)
    # Compare by value; "is not" tests object identity, not string equality
    if gps_tag != 'GPSInfo':
        continue
Once the GPSInfo tag is found, still inside the loop, we read the hemisphere references and convert the coordinates with the process_coords() function. Southern latitudes and western longitudes become negative.
    # GPS keys: 1 = latitude ref, 2 = latitude, 3 = longitude ref, 4 = longitude
    lat_ref = value[1] == u'N'
    lat = process_coords(value[2])
    if not lat_ref:
        lat = lat * -1
    lon_ref = value[3] == u'E'
    lon = process_coords(value[4])
    if not lon_ref:
        lon = lon * -1
Now create a KML object with the simplekml library, add the point and save it next to the picture. Note that KML expects longitude before latitude −
    kml = simplekml.Kml()
    kml.newpoint(name=args.PICTURE_FILE, coords=[(lon, lat)])
    kml.save(args.PICTURE_FILE + ".kml")
Finally, we print the processed coordinates and the map links as follows −
    print("GPS Coordinates: {}, {}".format(lat, lon))
    print("Google Maps URL: {}".format(gmaps.format(lat, lon)))
    print("OpenStreetMap URL: {}".format(open_maps.format(lat, lon)))
    print("KML File {} created".format(args.PICTURE_FILE + ".kml"))
PDF Documents
PDF documents can hold a wide variety of media, including images, text and forms. The metadata embedded in a PDF document may be stored in a format called Extensible Metadata Platform (XMP). We can extract it with the following Python code −
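First, install the third-party Python library PyPDF2, which reads metadata stored in XMP format. The names used below (PdfFileReader and getXmpMetadata()) come from older PyPDF2 releases; PyPDF2 3.0 and later replace them with PdfReader and the xmp_metadata property, so adapt the names if you run a newer version. The library can be installed as follows −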
pip install PyPDF2
Now, import the following libraries to extract the metadata from PDF files −
from __future__ import print_function
from argparse import ArgumentParser, FileType
import datetime
from PyPDF2 import PdfFileReader
import sys
The command line handler accepts one positional argument, the file path of the PDF file. FileType('rb') makes argparse open the file in binary mode for us.
# ArgumentParser is imported by name above, so it is called without the module prefix
parser = ArgumentParser('Metadata from PDF')
parser.add_argument('PDF_FILE', help='Path to PDF file', type=FileType('rb'))
args = parser.parse_args()
Now we can call the getXmpMetadata() method, which returns an object containing the available metadata, or None if the document has no XMP packet −
pdf_file = PdfFileReader(args.PDF_FILE)
xmpm = pdf_file.getXmpMetadata()
if xmpm is None:
    print("No XMP metadata found in document.")
    sys.exit()
We use a helper function, custom_print(), to print relevant values such as the title, creator and contributors as follows −
custom_print("Title: {}", xmpm.dc_title)
custom_print("Creator(s): {}", xmpm.dc_creator)
custom_print("Contributors: {}", xmpm.dc_contributor)
custom_print("Subject: {}", xmpm.dc_subject)
custom_print("Description: {}", xmpm.dc_description)
custom_print("Created: {}", xmpm.xmp_createDate)
custom_print("Modified: {}", xmpm.xmp_modifyDate)
custom_print("Event Dates: {}", xmpm.dc_date)
Different PDF producers store these properties as different types (lists, dictionaries, strings, dates), so custom_print() formats each type appropriately. In the script it must be defined before the calls shown above −
def custom_print(fmt_str, value):
    if isinstance(value, list):
        print(fmt_str.format(", ".join(value)))
    elif isinstance(value, dict):
        fmt_value = [":".join((k, v)) for k, v in value.items()]
        # Print the formatted key:value pairs, not just the dictionary keys
        print(fmt_str.format(", ".join(fmt_value)))
    elif isinstance(value, (str, bool)):
        print(fmt_str.format(value))
    elif isinstance(value, bytes):
        print(fmt_str.format(value.decode()))
    elif isinstance(value, datetime.datetime):
        print(fmt_str.format(value.isoformat()))
    elif value is None:
        print(fmt_str.format("N/A"))
    else:
        print("warn: unhandled type {} found".format(type(value)))
We can also print any custom properties saved by the producing software as follows −
if xmpm.custom_properties:
    print("Custom Properties:")
    for k, v in xmpm.custom_properties.items():
        print("\t{}: {}".format(k, v))
The above script reads the PDF document and prints the metadata stored in XMP format, including any custom properties written by the software that produced the PDF.
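An XMP packet is RDF/XML, so its structure can be inspected with the standard library alone. The sketch below parses a small hand-written packet and pulls out two Dublin Core properties. The packet contents and the helper name xmp_values are invented for illustration; real packets are larger, and PyPDF2 remains the more convenient way to read them from a PDF.

```python
from xml.etree import ElementTree as etree

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical XMP packet, trimmed to two Dublin Core properties
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Quarterly Report</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>A. Author</rdf:li><rdf:li>B. Editor</rdf:li></rdf:Seq></dc:creator>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def xmp_values(packet, prop):
    """Return the text of every rdf:li item under the given dc: property."""
    root = etree.fromstring(packet)
    path = ".//dc:{}//rdf:li".format(prop)
    return [li.text for li in root.findall(path, NS)]


print(xmp_values(SAMPLE_XMP, "title"))
print(xmp_values(SAMPLE_XMP, "creator"))
```

Properties such as dc:creator are stored as ordered lists, which is why the script's custom_print() helper has to handle list values as well as plain strings.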
Windows Executable Files
Sometimes we encounter a suspicious or unauthorized executable file. Its embedded metadata can be useful to the investigation, because it may reveal the file's purpose and attributes such as the manufacturer and the compilation date. With the following Python script we can get the compilation date, useful data from the headers, and the imported and exported symbols.
For this purpose, first install the third-party Python library pefile as follows −
pip install pefile
Once it is installed, import the following libraries −
from __future__ import print_function
import argparse
from datetime import datetime
from pefile import PE
The command line handler accepts one positional argument, the file path of the executable. An optional -v/--verbose flag selects detailed output instead of the simplified default −
parser = argparse.ArgumentParser('Metadata from executable file')
parser.add_argument("EXE_FILE", help="Path to exe file")
parser.add_argument("-v", "--verbose", help="Increase verbosity of output",
                    action='store_true', default=False)
args = parser.parse_args()
Now we load the input executable with the PE class and dump its parsed structures into a dictionary with the dump_dict() method.
pe = PE(args.EXE_FILE)
ped = pe.dump_dict()
We can extract basic file metadata, such as embedded authorship, version and compilation time, using the code shown below −
file_info = {}
for entry in getattr(pe, 'FileInfo', []):
    # Some pefile releases wrap the structures in an extra list
    structures = entry if isinstance(entry, list) else [entry]
    for structure in structures:
        if structure.Key != b'StringFileInfo':
            continue
        for s_table in structure.StringTable:
            for key, value in s_table.entries.items():
                if value is None or len(value) == 0:
                    value = "Unknown"
                file_info[key] = value
print("File Information: ")
print("==================")
for k, v in file_info.items():
    if isinstance(k, bytes):
        k = k.decode()
    if isinstance(v, bytes):
        v = v.decode()
    print("{}: {}".format(k, v))
# The dumped value carries a readable date in square brackets; isolate it
comp_time = ped['FILE_HEADER']['TimeDateStamp']['Value']
comp_time = comp_time.split("[")[-1].strip("]")
time_stamp, timezone = comp_time.rsplit(" ", 1)
comp_time = datetime.strptime(time_stamp, "%a %b %d %H:%M:%S %Y")
print("Compiled on {} {}".format(comp_time, timezone.strip()))
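For readers who want to see where the compile timestamp lives, the sketch below reads it directly from the raw bytes with the standard library instead of pefile. It follows the PE layout: the 4-byte value at offset 0x3C points to the "PE\0\0" signature, and TimeDateStamp sits 8 bytes after that signature. The function name pe_compile_time and the synthetic buffer are assumptions made for illustration; the buffer is not a loadable executable.

```python
import struct
from datetime import datetime, timezone


def pe_compile_time(data):
    """Read TimeDateStamp from the COFF header of a PE image held in bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C holds the offset of the PE signature
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # After the signature: Machine (2 bytes), NumberOfSections (2), TimeDateStamp (4)
    (stamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)


# Synthetic buffer for illustration only
fake = bytearray(0x58)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0x40)
fake[0x40:0x44] = b"PE\x00\x00"
struct.pack_into("<I", fake, 0x48, 1500000000)
print(pe_compile_time(bytes(fake)))
```

Keep in mind that this field is written by the linker and can be altered afterwards, so it is best treated as one indicator among several rather than as proof of when a file was built.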
We can extract useful data from the section headers (name, virtual address, virtual size, raw size and MD5 hash) as follows −
for section in ped['PE Sections']:
    print("Section '{}' at {}: {}/{} {}".format(
        section['Name']['Value'], hex(section['VirtualAddress']['Value']),
        section['Misc_VirtualSize']['Value'],
        section['SizeOfRawData']['Value'], section['MD5']))
Now, list the imports of the executable as shown below. Without the verbose flag only the DLL names are printed; with it, each imported symbol and its address are listed −
if hasattr(pe, 'DIRECTORY_ENTRY_IMPORT'):
    print("\nImports: ")
    print("=========")
    for dir_entry in pe.DIRECTORY_ENTRY_IMPORT:
        dll = dir_entry.dll
        if not args.verbose:
            print(dll.decode(), end=", ")
            continue
        name_list = []
        for impts in dir_entry.imports:
            # Imports by ordinal have no name; fall back to a placeholder
            name = getattr(impts, "name", None) or b"Unknown"
            name_list.append([name.decode(), hex(impts.address)])
        name_fmt = ["{} ({})".format(x[0], x[1]) for x in name_list]
        print('- {}: {}'.format(dll.decode(), ", ".join(name_fmt)))
    if not args.verbose:
        print()
Now, print the names and addresses of the exported symbols using the code shown below −
if hasattr(pe, 'DIRECTORY_ENTRY_EXPORT'):
    print("\nExports: ")
    print("=========")
    for sym in pe.DIRECTORY_ENTRY_EXPORT.symbols:
        # Symbols exported by ordinal only have no name
        name = sym.name.decode() if sym.name else "Unknown"
        print('- {}: {}'.format(name, hex(sym.address)))
The above script extracts the basic metadata, the header information and the import and export listings from Windows executable files.
Office Document Metadata
A great deal of everyday computer work is done in three MS Office applications: Word, PowerPoint and Excel. Their files carry extensive metadata, which can expose interesting information about authorship and editing history.
Note that the Office 2007 formats of Word (.docx), Excel (.xlsx) and PowerPoint (.pptx) are ZIP archives that store their metadata in XML files. We can process these XML files with the following Python script −
First, import the required libraries and set up the command line handler, which takes the path to the Office file, as shown below −
from __future__ import print_function
from argparse import ArgumentParser
from datetime import datetime as dt
from xml.etree import ElementTree as etree
import zipfile
parser = ArgumentParser('Office Document Metadata')
parser.add_argument("Office_File", help="Path to office file to read")
args = parser.parse_args()
Now, check whether the file is a ZIP file and raise an error if it is not. Then open the archive and read the two XML files that hold the metadata using the following code −
if not zipfile.is_zipfile(args.Office_File):
    raise ValueError("{} is not a ZIP-based Office file".format(args.Office_File))
zfile = zipfile.ZipFile(args.Office_File)
core_xml = etree.fromstring(zfile.read('docProps/core.xml'))
app_xml = etree.fromstring(zfile.read('docProps/app.xml'))
Now, create a dictionary that maps the tag names in core.xml to readable labels −
core_mapping = {
    'title': 'Title',
    'subject': 'Subject',
    'creator': 'Author(s)',
    'keywords': 'Keywords',
    'description': 'Description',
    'lastModifiedBy': 'Last Modified By',
    'modified': 'Modified Date',
    'created': 'Created Date',
    'category': 'Category',
    'contentStatus': 'Status',
    'revision': 'Revision'
}
Iterate over the child elements of the XML root to visit each tag in the file, converting date values into datetime objects −
# Iterate the element directly; getchildren() was removed in Python 3.9
for element in core_xml:
    for key, title in core_mapping.items():
        if key in element.tag:
            if 'date' in title.lower():
                text = dt.strptime(element.text, "%Y-%m-%dT%H:%M:%SZ")
            else:
                text = element.text
            print("{}: {}".format(title, text))
Similarly, do this for the app.xml file, which contains statistical information about the contents of the document −
app_mapping = {
    'TotalTime': 'Edit Time (minutes)',
    'Pages': 'Page Count',
    'Words': 'Word Count',
    'Characters': 'Character Count',
    'Lines': 'Line Count',
    'Paragraphs': 'Paragraph Count',
    'Company': 'Company',
    'HyperlinkBase': 'Hyperlink Base',
    'Slides': 'Slide count',
    'Notes': 'Note Count',
    'HiddenSlides': 'Hidden Slide Count',
}
# Iterate the element directly; getchildren() was removed in Python 3.9
for element in app_xml:
    for key, title in app_mapping.items():
        if key in element.tag:
            if 'date' in title.lower():
                text = dt.strptime(element.text, "%Y-%m-%dT%H:%M:%SZ")
            else:
                text = element.text
            print("{}: {}".format(title, text))
Running the above script prints the authorship, history and statistics recorded in the document. Note that the script works only on documents in the Office 2007 or later formats.
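The substring test key in element.tag used in both loops works because ElementTree stores each tag with its namespace URI in braces in front of the local name. The sketch below builds a tiny stand-in for a .docx in memory and shows both the raw tag and a namespace-aware lookup. The property values are invented for illustration, and the sample omits attributes that real core.xml files carry.

```python
import io
import zipfile
from xml.etree import ElementTree as etree

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical core.xml content for illustration
CORE_XML = (
    '<cp:coreProperties xmlns:cp="{cp}" xmlns:dc="{dc}" xmlns:dcterms="{dcterms}">'
    "<dc:creator>J. Doe</dc:creator>"
    "<cp:lastModifiedBy>R. Roe</cp:lastModifiedBy>"
    "<dcterms:created>2019-03-01T10:15:00Z</dcterms:created>"
    "</cp:coreProperties>"
).format(**NS)

# Build the archive entirely in memory instead of reading a real document
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("docProps/core.xml", CORE_XML)

with zipfile.ZipFile(buffer) as zf:
    root = etree.fromstring(zf.read("docProps/core.xml"))

first = root[0]
print(first.tag)  # namespace URI in braces, followed by the local name
print(root.findtext("dc:creator", namespaces=NS))
print(root.findtext("dcterms:created", namespaces=NS))
```

Looking tags up by namespace is stricter than substring matching, which can matter when one key is contained in another tag name.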