ETL

ETL：Extract, Transform, Load之簡稱，即代表將提取檔案中的資料，轉換成需要的形式，並且儲存檔案。
- Transform：清理不需要的資料、結合資料或做normalize。
可用於處理檔案格式的常見modules
- pandas：A full data analysis library can be used to visualize and analyze data.
- csv
- openpyxl：用於處理excel檔案的
- Google sheet Python API

CSV

csv file中的data通常用逗號( , )分隔，但也可以使用其他符號，如 | 、*。
在csv file中如要加入empty string，即標示””即可。

在python中開啟csv file

use open() function to open file
read data in the csv file with csv.reader() , it will create a reader object

使用python開啟csv檔時要在open() function中設定keyword argument來指定換行的格式，如以下data=open(’myfile.csv’, …, newline=”) 。

python open函数newline参数_python open可以换行吗-CSDN博客

example的14-1file.csv內容如下圖

Untitled

import csv

with open("14-1file.csv", newline="", encoding="cp950") as f:
    csv_data = csv.reader(f)
    print(csv_data)
    for row in csv_data:
        print(row)
        
# <_csv.reader object at 0x0000026E9CD97280> # a reader object
# ['id', 'name', 'year of birth']
# ['1', 'albert einstein', '1879']
# ['2', 'isaac newton', '1643']
# ['3', 'marie curie', '1867']
# ['4', 'galilée', '1564']

	    for row in csv_data:
        print(row[1].title())
# Name
# Albert Einstein
# Isaac Newton
# Marie Curie
# GaliléE

在python寫一個csv檔

use open() function with write mode or append mode
use csv.writer() function to create a writer object

writer.writerow(list) can write a row of data
writer.writerows(list of list) will write several rows of data

import csv

with open("newfile.csv", mode="w", newline="", encoding="cp950") as f:
    csv_writer = csv.writer(f, delimiter=",")
    csv_writer.writerow(['a', 'b', 'c'])
    
# in newfile.csv
# a,b,c

import csv

with open("newfile.csv", mode="a", newline="", encoding="cp950") as f:
    csv_writer = csv.writer(f, delimiter=",")
    csv_writer.writerows([['1', '2', '3'], ['d', 'e', 'f']])
    
# in newfile.csv
# a,b,c
# 1,2,3
# d,e,f

Excel

在python中會將excel file叫做workbook，而其中每個工作表則稱為worksheet。

在python中使用excel file

load the workbook
get the specific worksheet
iterate through each row in that worksheet
iterate through each cell in the row

example中的14-3Dodgers.xlsx內容如下

Untitled

from openpyxl import load_workbook

wb = load_workbook("14-3Dodgers.xlsx")
result = []

ws = wb.worksheets[0]
for row in ws.iter_rows():
    result.append([cell.value for cell in row])

print(result)

# [['Rk', 'Pos', 'Name', 'Age', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB'], 
# [1, 'C', 'Will Smith', 26, 130, 501, 414, 71, 107, 19, 2, 25, 76, 3, 0, 58, 101, 0.258, 0.365, 0.495, 0.86, 130, 205, 11, 18, 0, 11, 4], [2, '1B', 'Max Muncy', 30, 144, 592, 497, 95, 124, 26, 2, 36, 94, 2, 1, 83, 120, 0.249, 0.368, 0.527, 0.895, 138, 262, 7, 11, 0, 1, 5], 
# [3, '2B', 'Trea Turner', 28, 52, 226, 207, 41, 70, 17, 0, 10, 28, 11, 2, 15, 33, 0.338, 0.385, 0.565, 0.95, 152, 117, 5, 2, 0, 2, 2], [4, 'SS', 'Corey Seager', 27, 95, 409, 353, 54, 108, 22, 3, 16, 57, 1, 1, 48, 66, 0.306, 0.394, 0.521, 0.915, 145, 184, 8, 5, 0, 3, 2], 
# [5, '3B', 'Justin Turner', 36, 151, 612, 533, 87, 148, 22, 0, 27, 87, 3, 0, 61, 98, 0.278, 0.361, 0.471, 0.832, 123, 251, 12, 12, 0, 6, 0], [6, 'LF', 'AJ Pollock', 33, 117, 422, 384, 53, 114, 27, 1, 21, 69, 9, 1, 30, 80, 0.297, 0.355, 0.536, 0.892, 137, 206, 4, 6, 0, 2, 4], 
# [7, 'CF', 'Cody Bellinger', 25, 95, 350, 315, 39, 52, 9, 2, 10, 36, 3, 1, 31, 94, 0.165, 0.24, 0.302, 0.542, 45, 95, 2, 1, 0, 3, 2], [8, 'RF', 'Mookie Betts', 28, 122, 550, 466, 93, 123, 29, 3, 23, 58, 10, 5, 68, 86, 0.264, 0.367, 0.487, 0.854, 128, 227, 5, 11, 0, 5, 2]]

sum = 0
for r in result[1:]:  # 從2nd row到最後
    sum += int(r[11])

print(sum) # 168

在python中寫一個excel file

Create one excel workbook (file).
A workbook is always created with at least one worksheet.
Set the worksheet’s title.
Append each data row to the worksheet.
Save the workbook.

from openpyxl import load_workbook
import csv

data_rows = [fields for fields in csv.reader(
    open("14-1file.csv", newline=""))]
print(data_rows)

# [['id', 'name', 'year of birth'], 
# ['1', 'albert einstein', '1879'], 
# ['2', 'isaac newton', '1643'], 
# ['3', 'marie curie', '1867'], 
# ['4', 'galilée', '1564']]

wb = Workbook()
ws = wb.active
ws.title = "ws1"
ws.sheet_properties.tabColor = "6E388C"
for row in data_rows:
    ws.append(row)

wb.save("Myfile.xlsx")

In Myfile.xlsx

內容

worksheet的名字及顏色

Intro to Databases

2 types of databases
- relational (SQL)：資料庫中的內容，數據與數據之間是有相關的。市面上常見的有SQLite, PostgreSQL, MySQL, SQL Server
- non-relational (NoSQL)：資料與資料間是不相關的，常見的有MongoDB及Redis (Remote Dictionary Server)。