Machine Learning: A Concise Tutorial

Machine Learning - Data Loading

Suppose you want to start an ML project. What is the first and most important thing you need? The data: every ML project begins by loading it.

In machine learning, data loading refers to the process of importing or reading data from external sources and converting it into a format that can be used by the machine learning algorithm. The data is then preprocessed to remove any inconsistencies, missing values, or outliers. Once the data is preprocessed, it is split into training and testing sets, which are then used for model training and evaluation.
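
The load, preprocess, and split sequence described above can be sketched with the standard library alone. This is a minimal illustration, not a recommended pipeline: the inline data, the 80/20 ratio, the fixed seed, and the rule "drop any row with an empty field" are all assumptions made for the example.

```python
import csv
import io
import random

# Made-up sample data; the third record is missing its "y" value.
raw = "x,y\n1,0\n2,1\n3,\n4,1\n5,0\n6,1\n"

# Load: parse the text into a list of dictionaries keyed by column name.
rows = list(csv.DictReader(io.StringIO(raw)))

# Preprocess: keep only rows in which every field is non-empty.
complete = [r for r in rows if all(v != "" for v in r.values())]

# Split: shuffle with a fixed seed, then cut 80% / 20%.
random.Random(0).shuffle(complete)
cut = int(0.8 * len(complete))
train_rows, test_rows = complete[:cut], complete[cut:]

print(len(rows), len(complete), len(train_rows), len(test_rows))
```

In a real project the preprocessing step is usually richer (imputation, outlier handling), but the order of the steps stays the same.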

The data can come from various sources such as CSV files, databases, web APIs, and cloud storage. The most common file format for machine learning projects is CSV (Comma Separated Values).

Considerations While Loading CSV Data

CSV is a plain text format that stores tabular data, where each row represents a record, and each column represents a field or attribute. It is widely used because it is simple, lightweight, and can be easily read and processed by programming languages such as Python, R, and Java.

In Python, we can load CSV data into ML projects in several ways, but before loading it we need to keep a few considerations in mind.

In this chapter, let's look at the main parts of a CSV file, how they can affect the loading and analysis of data, and what to check before loading CSV data into an ML project.

File Header

This is the first row of the CSV file, and it typically contains the names of the columns in the table. When loading CSV data into an ML project, the file header (also known as column headers or variable names) can play an important role in data analysis and model training. Here are some considerations to keep in mind regarding the file header −

Consistency − The header row should match the data rows: every row should have the same number of fields as the header, in the same column order. Inconsistencies can cause issues with parsing and analysis.
Meaningful names − Column names should be meaningful and descriptive. This can help with understanding the data and building more accurate models. Avoid using generic names like "column1", "column2", etc.
Case sensitivity − Depending on the tool or library being used to load the CSV file, the column names may be case sensitive. It’s important to ensure that the case of the header row matches the expected case sensitivity of the tool or library being used.
Special characters − Column names should not contain any special characters, such as spaces, commas, or quotation marks. These characters can cause issues with parsing and analysis. Instead, use underscores or camelCase to separate words.
Missing header − If the CSV file does not have a header row, it’s important to specify the column names manually or provide a separate file or documentation that includes the column names.
Encoding − The encoding of the header row can affect its interpretation when loading the CSV file. It’s important to ensure that the encoding of the header row is compatible with the tool or library being used to read the file.
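
As a rough illustration of the header considerations above, the sketch below cleans up column names and supplies names for a file that has no header row. The helper normalize_header() and the sample column names are hypothetical, introduced only for this example; the lower-case, underscore-separated style is one reasonable convention, not a requirement.

```python
import csv
import io
import re

def normalize_header(names):
    """Hypothetical helper: lower-case each name and replace runs of
    non-alphanumeric characters (spaces, brackets, ...) with one underscore."""
    cleaned = []
    for name in names:
        name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_")
        cleaned.append(name.lower())
    return cleaned

# Case 1: the file has a header row with spaces and special characters.
raw = 'Sepal Length,"Petal Width (cm)",Species\n5.1,0.2,setosa\n'
reader = csv.reader(io.StringIO(raw))
header = normalize_header(next(reader))          # first row is the header
rows = [dict(zip(header, row)) for row in reader]

# Case 2: the file has no header row, so the names are supplied manually.
no_header = "5.1,0.2,setosa\n"
names = ["sepal_length", "petal_width_cm", "species"]
rows_named = list(csv.DictReader(io.StringIO(no_header), fieldnames=names))

print(header)
print(rows_named[0])
```

When reading from a real file rather than a string, the encoding can be stated explicitly, for example open(path, encoding='utf-8', newline='').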

Comments

These are optional lines that begin with a specified character, such as "#" or "//". The CSV format itself has no standard comment syntax, so a reader skips these lines only if it supports a comment marker and is told which one to use. Comments can be used to provide additional information or context about the data in the file.

Comments in a CSV file are not typically used to represent data that would be used in a machine learning project. However, if comments are present in a CSV file, it’s important to consider how they might affect the loading and analysis of the data. Here are some considerations −

Comment markers − In a CSV file, comments can be indicated using a specific marker, such as "#" or "//". It’s important to know what marker is being used, so that the loading process can ignore comments properly.
Placement − Comments should be placed on a separate line from the actual data. If a comment shares a line with actual data, it may cause issues with parsing and analysis.
Consistency − If comments are used in a CSV file, it’s important to ensure that the comment marker is used consistently throughout the entire file. Inconsistencies can cause issues with parsing and analysis.
Handling comments − Depending on the tool or library being used to load the CSV file, comments may be ignored by default or may require a specific parameter to be set. It’s important to understand how comments are handled by the tool or library being used.
Effect on analysis − If comments contain important information about the data, it may be necessary to process them separately from the data itself. This can add complexity to the loading and analysis process.
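
Python's built-in csv module does not skip comment lines on its own, so one simple approach is to filter them out before parsing. The sketch below assumes "#" as the marker and uses made-up sample data; skip_comments() is a hypothetical helper, not part of any library.

```python
import csv
import io

# Made-up sample: two comment lines mixed in with the data.
raw = (
    "# source: hypothetical sensor export\n"
    "id,temp\n"
    "1,20.5\n"
    "# recalibrated after this row\n"
    "2,21.0\n"
)

def skip_comments(lines, marker="#"):
    """Yield only the lines that do not start with the comment marker."""
    for line in lines:
        if not line.lstrip().startswith(marker):
            yield line

# csv.DictReader accepts any iterable of strings, including a generator.
rows = list(csv.DictReader(skip_comments(io.StringIO(raw))))
print(rows)
```

This line-based filter is only safe when no quoted field spans several lines; a continuation line that happens to start with the marker would be dropped as well.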

Delimiter

This is the character that separates the fields in each row. While the name suggests that a comma is used as the delimiter, other characters such as tabs, semicolons, or pipes can also be used depending on the file.

Reading a file with the wrong delimiter splits fields incorrectly, and a model trained on such data inherits those errors, so it is important to consider the following while loading data into an ML project −

Delimiter choice − The delimiter used in a CSV file should be carefully chosen based on the data being used. For example, if the data contains commas within the values (e.g. "New York, NY"), then using a comma as a delimiter may cause issues. In this case, a different delimiter, such as a tab or semicolon, may be more appropriate.
Consistency − The delimiter used in the CSV file should be consistent throughout the entire file. Mixing different delimiters or using whitespace inconsistently can lead to errors and make it difficult to parse the data accurately.
Encoding − The delimiter can also be affected by the encoding of the CSV file. For example, if the CSV file uses a non-ASCII delimiter and is encoded in UTF-8, it may not be correctly read by some machine learning libraries or tools. It is important to ensure that the encoding and delimiter are compatible with the machine learning tools being used.
Other considerations − In some cases, the delimiter may need to be customized based on the machine learning tool being used. For example, some libraries may require a specific delimiter or may not support certain delimiters. It is important to check the documentation of the machine learning tool being used and customize the delimiter as needed.
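
The effect of a wrong delimiter is easy to see with the built-in csv module. The sketch below parses the same made-up, semicolon-separated text twice: once with the default comma and once with delimiter=";". Counting the distinct row lengths is a cheap consistency check.

```python
import csv
import io

# Made-up sample: semicolon-separated, with a comma inside one value.
raw = "name;city\nAlice;New York, NY\nBob;Berlin\n"

# Default delimiter (comma): no error is raised, but the fields are wrong.
wrong = list(csv.reader(io.StringIO(raw)))

# Correct delimiter: every row splits into the expected two fields.
right = list(csv.reader(io.StringIO(raw), delimiter=";"))

# A consistent file yields exactly one distinct row length.
print({len(row) for row in wrong})
print({len(row) for row in right})
print(right[1])
```

Because the wrong delimiter fails silently, checking the number of columns right after loading is a useful habit.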

Quotes

These are optional characters that can be used to enclose fields that contain the delimiter character or newlines. For example, if a field contains a comma, enclosing the field in quotes ensures that the comma is treated as part of the field and not as a delimiter. When loading CSV data into an ML project, there are several considerations to keep in mind regarding the use of quotes −

Quote character − The quote character used in a CSV file should be consistent throughout the file. The most commonly used quote character is the double quote (") but some files may use single quotes or other characters. It’s important to make sure that the quote character used is consistent with the tool or library being used to read the CSV file.
Quoted values − In some cases, values in a CSV file may be enclosed in quotes to differentiate them from other values. For example, if a field contains a comma, it may be enclosed in quotes to prevent it from being interpreted as a new field. It’s important to make sure that quoted values are properly handled when loading the data into an ML project.
Escaping quotes − If a field contains the quote character used to enclose values, it must be escaped. This is typically done by doubling the quote character. For example, if the quote character is double quote (") and a field contains the value "John "the Hammer" Smith", it would be enclosed in quotes and the internal quotes would be escaped like this: "John ""the Hammer"" Smith".
Use of quotes − The use of quotes in CSV files can vary depending on the tool or library being used to generate the file. Some tools may use quotes around every field, while others may only use quotes around fields that contain special characters. It’s important to make sure that the quote usage is consistent with the tool or library being used to read the file.
Encoding − The use of quotes can also be affected by the encoding of the CSV file. If the file is encoded in a non-standard way, it may cause issues when loading the data into an ML project. It’s important to make sure that the encoding of the CSV file is compatible with the tool or library being used to read the file.
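
The quoting and escaping rules above can be checked with a small round trip through the built-in csv module. The sample values are the ones used in the text; the single-quote case at the end is an extra illustration with made-up data.

```python
import csv
import io

row = ['John "the Hammer" Smith', "New York, NY"]

# Write the row: by default the writer quotes only fields that need it
# and escapes an embedded quote character by doubling it.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line)

# Read it back: the parser undoes the quoting and the doubling.
parsed = next(csv.reader(io.StringIO(line)))

# A file that uses single quotes needs the quote character stated explicitly.
single = next(csv.reader(io.StringIO("'a,b',c\n"), quotechar="'"))
print(parsed, single)
```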

Various Methods of Loading a CSV Data File

When working on an ML project, one of the most crucial tasks is loading the data correctly. As noted earlier, the most common data format for ML projects is CSV, which comes in various flavors that differ in how hard they are to parse.

In this section, we discuss some common approaches in Python for loading a CSV data file into a machine learning project −

Using the CSV Module

This is a built-in module in Python that provides functionality for reading and writing CSV files. You can use it to read a CSV file into a list or dictionary object. The following example shows its use −

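import csv

# newline='' lets the csv module handle line endings inside quoted fields
with open('mydata.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # each row is a list of strings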

This code reads a CSV file called mydata.csv and prints each row in the file.
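
To get dictionary objects instead of lists, the same module offers csv.DictReader, which uses the header row as keys. The sketch below reads from an in-memory string so that it runs without a file; the column names and values are made up for the example.

```python
import csv
import io

# Made-up sample standing in for the contents of a CSV file.
raw = "sepal_length,species\n5.1,setosa\n7.0,versicolor\n"

with io.StringIO(raw) as file:
    # Each record is a dict keyed by the column names from the header row.
    records = list(csv.DictReader(file))

# The csv module returns every value as a string, so numeric columns
# have to be converted explicitly before they are used in a model.
lengths = [float(r["sepal_length"]) for r in records]
print(records[0], lengths)
```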

Using the Pandas Library

This is a popular data manipulation library in Python that provides a read_csv() function for reading CSV files into a pandas DataFrame object. This is a very convenient way to load data and perform various data manipulation tasks. The following example shows its use −

import pandas as pd

data = pd.read_csv('mydata.csv')

This code reads a CSV file called mydata.csv and loads it into a pandas DataFrame object called data.
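
read_csv() also takes parameters that cover several of the considerations discussed earlier in this chapter. The sketch below uses sep for the delimiter and comment for the comment marker; the inline data is made up, and it is read from a string rather than a file so that the example is self-contained.

```python
import io

import pandas as pd

# Made-up sample: semicolon-delimited, with a leading comment line.
raw = (
    "# source: example export\n"
    "sepal_length;species\n"
    "5.1;setosa\n"
    "4.9;setosa\n"
)

# sep sets the delimiter; comment tells pandas which lines to skip.
df = pd.read_csv(io.StringIO(raw), sep=";", comment="#")

print(df.shape)
print(df.dtypes)
```

Unlike the csv module, pandas infers column types, so sepal_length arrives as a floating-point column rather than as strings.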

Using the Numpy Library

This is a numerical computing library in Python that provides a genfromtxt() function for loading CSV files into a numpy array. The following example shows its use −

import numpy as np

data = np.genfromtxt('mydata.csv', delimiter=',')

This code reads a CSV file called mydata.csv and loads it into a numpy array called 'data'.
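
If the file has a header row, genfromtxt() tries to parse it as numbers and fills that row with nan unless it is skipped. A minimal sketch, using made-up inline data and the skip_header parameter:

```python
import io

import numpy as np

# Made-up sample with a header row followed by two numeric rows.
raw = "a,b\n1,2\n3,4\n"

# skip_header=1 drops the first line so that only numeric rows are parsed.
data = np.genfromtxt(io.StringIO(raw), delimiter=",", skip_header=1)

print(data.shape)
print(data)
```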

Using the NumPy loadtxt() Function

loadtxt() is a NumPy function for loading text files, including CSV files, into a numpy array. Older SciPy releases re-exported it as scipy.loadtxt, but that alias was deprecated and recent SciPy releases may no longer provide it, so import it from NumPy directly. loadtxt() is intended for simple, complete files; genfromtxt() is the better fit when values are missing. The following example shows its use −

import numpy as np

data = np.loadtxt('mydata.csv', delimiter=',')

This code reads a CSV file called mydata.csv and loads it into a numpy array called 'data'.

Using the Sklearn Library

This is a popular machine learning library in Python. Rather than reading a CSV file of your own, it can supply bundled sample datasets: for example, its load_iris() function loads the iris dataset, which is commonly used for classification tasks. The following example shows its use −

from sklearn.datasets import load_iris

data = load_iris().data

This code loads the iris dataset, which ships with the sklearn library, and stores its feature matrix in a numpy array called data.