Ch2: Preliminaries
Colab: ch2-1 (the output of all programs is shown in the Colab notebook)
Lecture notes: Heptabase home page
Reference book: Dive into Deep Learning
Class date: 2024/3/5, week 3
# Data Manipulation
# Tensors: Arrays of Data
- What are tensors?
- Tensors are multidimensional arrays (of any number of dimensions) used for storing and manipulating data in deep learning.
- They are the fundamental data structure in every deep learning framework (e.g., PyTorch, TensorFlow, MXNet).
- Creating Tensors in PyTorch
import torch
x = torch.arange(12)  # optionally: torch.arange(12, dtype=torch.float32)
print(x.numel())
print(x.shape)
- torch.arange: Creates a 1D tensor with values 0 through 11.
- tensor.numel(): Returns the number of elements in the tensor.
- tensor.shape: Gives the dimensions (shape) of the tensor.
- Special tensor initializations
zero = torch.zeros((2, 3, 4)) # or [2,3,4]
print(zero)
one = torch.ones((2, 3, 4))
print(one)
rnd = torch.randn(3, 4)
print(rnd)
t = torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
print(t)  # without the nested brackets, a flat list like torch.tensor([1, 2, 3]) gives a 1D tensor with shape [3]
- torch.zeros: Creates a tensor filled with 0s.
- torch.ones: Creates a tensor filled with 1s.
- torch.randn: Creates a tensor whose entries are drawn from the standard Gaussian distribution (mean 0, standard deviation 1); the n stands for normal. There is also rand(), which samples uniformly from [0, 1) — see the short sketch below.
- Tensors can also be created from Python lists, as in the example above.
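- A quick sketch contrasting the two random initializers:
u = torch.rand(3, 4)   # uniform samples from [0, 1)
n = torch.randn(3, 4)  # standard normal samples (mean 0, std 1)
print(u, n, sep='\n')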
- Reshaping Tensors
X = t.reshape(2, 6)
print(X)
Y = t.reshape(-1, 3)
print(Y)
Z = t.reshape(4, -1)
print(Z)
- A -1 in reshape tells PyTorch to compute the appropriate size for that dimension automatically.
- Note: the requested shape must be compatible with the total number of elements, or an error is raised (e.g., 12 elements cannot be reshaped to 2×4 — see the sketch below).
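- A minimal sketch of the incompatibility error (the try/except is only for illustration):
try:
    t.reshape(2, 4)  # t has 12 elements, but 2*4 = 8, so this raises an error
except RuntimeError as e:
    print(e)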
# Indexing and Slicing
- Indexing Tensors
- Tensors can be accessed using indices (索引), similar to Python lists.
- Indices start at 0, and -n refers to the n-th element from the end.
X = torch.arange(12)
print(X[0], X[1])
print(X[-3])
- Slicing Tensors
- Used to access sub-sections of a tensor.
- Use start:stop:step as the syntax for indexing a subset of elements: the selected indices are start, start+step, start+2*step, ..., stopping before stop. By default start is 0, stop is the length of the axis, and step is 1; since the upper bound is exclusive, start:stop selects stop − start elements.
- Forward (positive-index) examples:
X1 = X[1:3]
print(X1)
X2 = X[:3]
print(X2)
X3 = X[2:8:2]
print(X3)
X4 = X[:]
print(X4)
- Reverse (negative-index) examples
print(X[-1])
print(X[:-1])
# print(y[::-1])  # reversing with a negative step is a Python-list feature; PyTorch tensors do not support it
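- Since tensors reject a negative step, torch.flip is the tensor-side way to reverse; a minimal sketch:
print(torch.flip(X, dims=[0]))  # reverses X along axis 0, like a list's [::-1]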
- When slicing, if only one index or range is specified, it defaults to operating along the first axis (axis 0).
Y = X.reshape(4,3)
print(Y)
print(Y[0])
- Multiple elements can be assigned the same value at once.
Y[:2, :] = 12
print(Y)
# Operations
- Elementwise Operations (applied entry by entry)
x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
print(x + y, x - y, x * y, x / y, x ** y, sep='\n')
print(torch.exp(x))  # elementwise exponential
- Concatenation of Tensors
torch.cat((X, Y), dim=<0 or 1>)
- dim=0 (axis 0) stacks the tensors vertically, one below the other (concatenating along the rows); dim=1 places them side by side (concatenating along the columns).
X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
torch.cat((X, Y), dim=0), torch.cat((X, Y), dim=1)
- Logical Operations
- Tensors can be compared elementwise with X == Y, and likewise with < and > (shown after the example below), producing boolean tensors.
X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
X == Y
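- The other comparison operators behave the same way, returning a boolean tensor elementwise:
print(X < Y)
print(X > Y)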
- Aggregation Functions
- Include operations like max(), min(), sum(), mean(), which allow you to aggregate the values of a tensor along a specified dimension.
print(X)
print('-'*10)
print(X.sum(), X.mean(), sep='\n')
print('-'*10)
print(X.sum(dim=0), X.sum(dim=1), sep='\n')
print('-'*10)
print(X.mean(dim=0), X.mean(dim=1), sep='\n')
print('-'*10)
print(X.min(dim=0), X.min(dim=1), sep='\n')
print('-'*10)
print(X.max(dim=0), X.max(dim=1), sep='\n')
# Broadcasting
- Broadcasting automatically expands tensor dimensions during an operation (elementwise operations require matching shapes).
- Two-Step Process
- Expand dimensions.
- Perform elementwise operation.
a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
c = a + b
print(c)
- Rules of Broadcasting (PyTorch applies these automatically)
- Tensor Dimensions Alignment: Compare size of each dimension from last to first.
- Compatibility of Dimensions: Dimensions are compatible if they are equal or one is 1.
- Expansion of Fewer Dimensions: Add dimensions of size 1 at the beginning if necessary.
- Size-1 Dimension Stretching: Dimensions of size 1 are stretched to match the other tensor (any size-1 dimension can be expanded).
- No Stretching for Non-1 Dimensions: If two sizes differ and neither is 1, an error is raised.
- Example of broadcasting steps
A with shape (8, 1, 6, 1) and B with shape (7, 1, 5):
- Align shapes: (8, 1, 6, 1) and (1, 7, 1, 5).
- Stretch the dimensions of size 1.
- Final shapes: both become (8, 7, 6, 5), as the sketch below verifies.
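- The shape arithmetic above can be checked directly in PyTorch; a small sketch using random tensors:
A = torch.randn(8, 1, 6, 1)
B = torch.randn(7, 1, 5)
print((A + B).shape)  # torch.Size([8, 7, 6, 5])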
- Benefits
- Memory Efficiency: Reduces memory usage by avoiding explicit copies of the data.
- Code Simplification: Removes the need to manually match tensor shapes.
# Memory-Efficient Operations in Python
- Inefficient Memory Allocation
- Memory Allocation for Operations: Y = Y + X allocates new memory for the result of Y + X (in Python, assignment simply binds a name to an object in memory).
- The old memory does not need to be freed by hand, since Python has garbage collection; but repeatedly allocating new memory and reclaiming the old wastes time.
X = torch.rand(2,3)
Y = torch.rand(2,3)
print(id(X), id(Y))
Y = Y + X
print(id(Y))
- Inefficiency Reasons:
- Unnecessary Memory Allocation: New memory is frequently allocated for operations.
- Multiple References Issue: Potential memory leaks, or references pointing to stale memory.
- In-Place Operations in PyTorch
- Use slice notation Y[:] = ... to update in place.
Z = torch.zeros_like(Y)  # a tensor of zeros with the same shape as Y
print('id(Z):', id(Z))
Z[:] = X + Y # In-place update
print('id(Z):', id(Z))
- Reducing memory usage
- Use operations like X[:] = X + Y or X += Y for variables that are not reused later.
X = torch.rand(2,3)
Y = torch.rand(2,3)
before = id(X)
X += Y
id(X) == before
# Type Conversion (between classic NumPy and PyTorch)
- Tensor-NumPy Conversion
- PyTorch tensors and NumPy arrays share the same underlying memory (demonstrated in the sketch below).
- When printed, a Python list shows commas between its elements, while a NumPy array does not.
- Tensor to NumPy:
A = X.numpy()
print(A)
- NumPy to Tensor:
B = torch.from_numpy(A)
print(B)
- Types Confirmation:
type(A), type(B)
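- Because the memory is shared, an in-place change to the tensor is visible through both A and B; a short sketch:
X += 1    # in-place update of the tensor X
print(A)  # the NumPy array reflects the change
print(B)  # so does the tensor created via from_numpy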
- Size-1 Tensor to Scalar Conversion
- .item(): Direct conversion to Python scalar.
a = torch.tensor([3.5])
a, a.item(), float(a), int(a)
# Data Preprocessing
# Reading a Dataset with Pandas
- Creating and Writing to a CSV File
import os
# The 'exist_ok=True' parameter allows the function to continue without raising an error if the directory already exists.
os.makedirs(os.path.join('.', 'data'), exist_ok=True)  # os.path.join('.', 'data') -> './data'
# Define the path for the data file to be created. This uses 'os.path.join' for compatibility across different OS.
data_file = os.path.join('.', 'data', 'house_tiny.csv')
# Write mode will create the file if it does not exist or overwrite it if it does.
with open(data_file, 'w') as f:
# 'NA' is used to represent missing values.
f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')
- Loading Data with Pandas
- Importing pandas:
import pandas as pd
- Reads the CSV file into a pandas DataFrame.
import pandas as pd
# 'pd.read_csv()' is a function in pandas used to read a CSV file and convert it into a DataFrame.
# A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
data = pd.read_csv(data_file)
# This will display the contents of the CSV file as a table with indexed rows and named columns.
print(data)
# Data Preparation
- Input-target separation: split the features from the labels (for supervised learning).
- Handling Missing Values
- Missing values are represented as NaN.
- Neural networks can only process numbers (not strings), so NaN and categorical values must be encoded.
- One-Hot Encoding with pd.get_dummies():
- Converts categorical variables to numerical format.
- dummy_na=True includes a column for NaN values.
# Splitting the DataFrame 'data' into inputs and targets for machine learning or data analysis purposes.
# 'data.iloc[:, 0:2]' selects all rows (:) and the first two columns (0:2) from the DataFrame. These are the input features.
# 'data.iloc[:, 2]' selects all rows (:) and the third column (2) which is assumed to be the target variable.
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
print(inputs)
# Convert categorical data into dummy/indicator variables.
# 'pd.get_dummies()' is a function that converts categorical variable(s) into dummy/indicator variables.
# 'dummy_na=True' includes an additional column for missing values (NA) which are present in the dataset.
inputs = pd.get_dummies(inputs, dummy_na=True)  # encodes RoofType while keeping the numeric NumRooms column (needed for the mean fill below)
# The DataFrame now keeps the numeric NumRooms column and adds binary columns
# for each RoofType category, plus a column for missing values (NaN).
print(inputs)
import pandas as pd
# Create a sample DataFrame
data = {'Category': ['A', 'B', 'A', 'C', 'B'], 'Name':['John','Mary','Joe','Tom','Harry']}
df = pd.DataFrame(data)
print(df)
# Use pd.get_dummies to convert the 'Category' column into dummy variables
dummy_df = pd.get_dummies(df, columns=['Category'], prefix='Category')  # encode only the 'Category' column
# Optionally, concatenate the dummy variables with the original DataFrame:
# df = pd.concat([df, dummy_df], axis=1)
print(dummy_df)
- Converting NaN Data
- Strategy: Replace missing values with the mean or median (a median sketch follows the code below).
# Filling missing values in the DataFrame 'inputs' with the mean of each column.
# 'inputs.fillna()' is a function that fills NA/NaN values using the specified method.
# 'inputs.mean()' calculates the mean of each column in the DataFrame, ignoring NaN values.
# This method is often used to handle missing data in machine learning and data analysis.
inputs = inputs.fillna(inputs.mean())
# This DataFrame now has missing values replaced by the mean of their respective columns.
print(inputs)
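- If the median is preferred (e.g., to be more robust to outliers), the call is analogous; a minimal sketch, meant to replace the fillna line above:
inputs = inputs.fillna(inputs.median())  # fill NaN with each column's median instead of its mean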
- Conversion to Tensor Format
- Framework compatibility: For use in deep learning frameworks like PyTorch.
- Conversion process: Convert DataFrame to NumPy array then to PyTorch tensor.
import torch
# Convert the pandas DataFrame 'inputs' to a numpy array and then to a PyTorch tensor.
# 'inputs.to_numpy(dtype=float)' converts the DataFrame to a numpy array of type float.
# 'torch.tensor()' converts the numpy array into a PyTorch tensor, which is used for computations in PyTorch.
X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
# 'X' is the tensor containing input features, and 'y' is the tensor containing target values.
print(X, y)
# Bonus 1 (Problem)
- Convert each sample (row) into numeric data by performing one-hot encoding on non-numeric columns (Name and Gender).
- Produce a tensor to store the BMIs for the 20 persons.
- Compute the average BMI for the 20 persons.
- Find the student who has the highest BMI.
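- A possible approach, sketched below with a hypothetical 5-row sample (the real dataset has 20 persons; the Height/Weight column names and units are assumptions, since the actual data lives in the Colab):
import pandas as pd
import torch

# Hypothetical sample data; the actual dataset (in the Colab) has 20 rows.
df = pd.DataFrame({
    'Name': ['John', 'Mary', 'Joe', 'Tom', 'Ann'],
    'Gender': ['M', 'F', 'M', 'M', 'F'],
    'Height': [1.75, 1.62, 1.80, 1.70, 1.58],  # assumed to be in meters
    'Weight': [70.0, 52.0, 85.0, 64.0, 50.0],  # assumed to be in kilograms
})

# One-hot encode the non-numeric columns (Name and Gender).
encoded = pd.get_dummies(df, columns=['Name', 'Gender'])
print(encoded)

# BMI = weight / height^2, stored as a tensor.
bmi = torch.tensor((df['Weight'] / df['Height'] ** 2).to_numpy(dtype=float))
print(bmi)
print(bmi.mean())                            # average BMI
print(df['Name'].iloc[bmi.argmax().item()])  # person with the highest BMI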
Solution notebook: Colab