Effective Python

Python Chunks

Here are several ways to split a list into chunks.

yield

    def chunks1(input_list, n):
        for i in range(0, len(input_list), n):
            yield input_list[i:i + n]

    input_list = [i for i in range(0, 15)]
    print(list(chunks1(input_list, 4)))
    ## [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14]]

One-line for loop

    input_list = [i for i in range(0, 15)]
    n = 3
    output_list = [input_list[i:i + n] for i in range(0, len(input_list), n)]
    print(output_list)
    ## [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14]]

iterable

This version works for any iterable.

    from itertools import islice

    def chunks2(input_iter, n):
        iterator = iter(input_iter)
        # Keep calling the lambda until it returns the sentinel, an empty tuple
        return iter(lambda: tuple(islice(iterator, n)), ())

    input_list = [i for i in range(0, 15)]
    n = 4
    print(list(chunks2(input_list, n)))
    ## [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14)]

Numpy

Note that the second argument of np.array_split is the number of sections, not the chunk size: 15 items in 5 sections gives chunks of 3.

    import numpy as np

    input_list = [i for i in range(0, 15)]
    np.array_split(input_list, 5)
    ## [array([0, 1, 2]),
    ##  array([3, 4, 5]),
    ##  array([6, 7, 8]),
    ##  array([ 9, 10, 11]),
    ##  array([12, 13, 14])]

Any of the simple approaches above gets the job done ...
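The iterator-based version is the only one of these that does not need len() or slicing, so it also works on generators. The sketch below illustrates that with a helper named chunked (a hypothetical name, equivalent in spirit to chunks2 above) applied to a generator whose length is not a multiple of the chunk size.

```python
from itertools import islice


def chunked(iterable, size):
    """Return an iterator of tuples of up to `size` items (hypothetical helper)."""
    iterator = iter(iterable)
    # iter(callable, sentinel) keeps calling the lambda until it returns the
    # sentinel, here the empty tuple produced once the iterator is exhausted.
    return iter(lambda: tuple(islice(iterator, size)), ())


squares = (i * i for i in range(7))  # a generator: no len(), no slicing
chunks = list(chunked(squares, 3))
print(chunks)  # [(0, 1, 4), (9, 16, 25), (36,)]
```

The last chunk is simply shorter; nothing is padded or dropped.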

Before Data processing: ELT

Before ELT: ETL

ETL stands for Extract, Transform, and Load. Historically, ETL has been the best and most reliable way to migrate data from one database to another. In addition to moving data between databases, it converts data from different sources into a single format that the destination can use.

Extract: Collect data from the different source databases, sometimes by way of a staging table.
Transform: This is the critical step. Convert the freshly extracted data into the correct form so that it can be placed into another database. Other types of transformation are sometimes involved in this step as well.
Load: Load the data into the target database or storage. ...
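The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source lists, their field names, and the users table are all hypothetical, and an in-memory SQLite database stands in for the target.

```python
import sqlite3

# Hypothetical source rows, standing in for data pulled from two systems.
SOURCE_A = [{"name": " Alice ", "joined": "2021/03/01"}]
SOURCE_B = [{"full_name": "BOB", "signup_date": "2021-04-15"}]


def extract():
    """Extract: collect raw rows from every source, tagged with their origin."""
    return [("a", row) for row in SOURCE_A] + [("b", row) for row in SOURCE_B]


def transform(raw_rows):
    """Transform: bring every row to one schema (name, joined as YYYY-MM-DD)."""
    cleaned = []
    for source, row in raw_rows:
        if source == "a":
            name, joined = row["name"], row["joined"].replace("/", "-")
        else:
            name, joined = row["full_name"], row["signup_date"]
        cleaned.append((name.strip().title(), joined))
    return cleaned


def load(rows, connection):
    """Load: write the unified rows into the target table."""
    connection.execute("CREATE TABLE IF NOT EXISTS users (name TEXT, joined TEXT)")
    connection.executemany("INSERT INTO users VALUES (?, ?)", rows)
    connection.commit()


target = sqlite3.connect(":memory:")
load(transform(extract()), target)
loaded = target.execute("SELECT name, joined FROM users ORDER BY name").fetchall()
print(loaded)  # [('Alice', '2021-03-01'), ('Bob', '2021-04-15')]
```

Note that the transformation happens before anything reaches the target, which is the defining trait of ETL.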

Python f-string

Python 3.6 added a new way to format strings: PEP 498 – Literal String Interpolation.

The examples below compare f-strings with the %-formatting and str.format() syntax we commonly used before.

    >>> # %-formatting
    ...
    >>> text = "Hello"
    >>> number1 = 10
    >>> number2 = 20
    >>> print("%s, test numbers are %s and %s" % (text, number1, number2))
    Hello, test numbers are 10 and 20

    >>> # str.format()
    ...
    >>> text = "Hello"
    >>> number1 = 10
    >>> number2 = 20
    >>> print("{}, test numbers are {} and {}".format(text, number1, number2))
    Hello, test numbers are 10 and 20
    >>> print("{0}, test numbers are {2} and {1}".format(text, number1, number2))
    Hello, test numbers are 20 and 10

    >>> # f-string
    ...
    >>> text = "Hello"
    >>> number1 = 10
    >>> number2 = 20
    >>> print(f"{text}, test numbers are {number1} and {number2}")
    Hello, test numbers are 10 and 20

F-strings look more Pythonic and also solve problems we used to run into, such as the argument limitations of %. They stay readable and easy to change as the number of variables grows. Let's try a few more operations:

    >>> f"{3 + 8}"
    '11'
    >>> text = "Literal String Interpolation"
    >>> f"{text.upper()}"
    'LITERAL STRING INTERPOLATION'
    >>> f"{1/3:.2f}"
    '0.33'

Lambda expressions can be embedded as well. ...
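One detail worth knowing when embedding a lambda: it has to be wrapped in parentheses, because a bare colon inside the braces is read as the start of the format spec. A minimal sketch (the variable names are only for illustration):

```python
value = 3

# The lambda is parenthesized and called in place; without the parentheses
# the ":" of the lambda would be parsed as the start of a format spec.
doubled = f"{(lambda x: x * 2)(value)}"
print(doubled)  # 6

# A ":" outside the parentheses still introduces the format spec as usual.
padded = f"{(lambda x: x / 3)(value):.2f}"
print(padded)  # 1.00
```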

Python lambda

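Basic syntax of a lambda function (anonymous function):

    lambda arg1, arg2,... : expression

    fun = lambda x: x + 1
    print(fun(5))
    ## 6

A lambda function can be thought of as a simple function: it can take several inputs but can contain only one expression.

When to use it

There are a few situations where a lambda function fits well:

Not reused: following "don't repeat yourself", if you know the logic is simple and will not be repeated in similar places, this is a good occasion for a lambda.
Not wanting to think of a name: when implementing a feature, a name should tell the reader what the thing is, rather than being x, y, i, j, or something else that is meaningless or easy to confuse. Judge the situation, though: most of the time it is still better to come up with a proper name.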
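A typical "simple and not reused" case is a sort key. The sketch below uses a made-up list of (name, age) tuples; the lambdas exist only for the single sorted() call that needs them.

```python
people = [("Carol", 35), ("alice", 28), ("Bob", 41)]

# Sort by the second field (age).
by_age = sorted(people, key=lambda person: person[1])
print(by_age)  # [('alice', 28), ('Carol', 35), ('Bob', 41)]

# Sort by name, ignoring case.
by_name = sorted(people, key=lambda person: person[0].lower())
print(by_name)  # [('alice', 28), ('Bob', 41), ('Carol', 35)]
```

If the same key were needed in several places, a named function would be the better choice.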

Python context manager

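Context manager

Python's with statement makes resource management easier, for example when handling data, opening files, or doing anything that acquires a lock. The goal is to guarantee that the resource is released once the related work is done. In the simple case, we would open a file like this:

    test_file = open('test.txt', 'w')
    try:
        test_file.write('line one')
    finally:
        test_file.close()

Besides being non-idiomatic, this becomes hard to maintain when the logic inside try-finally gets complex. Here is the simple version using with:

    with open('test.txt', 'w') as test_file:
        test_file.write('line one')

In the code above, the resource is closed automatically once the statements inside the with block finish. The name test_file is still bound afterwards, but it refers to a closed file.

Implementing a context manager

To implement a context manager, define two methods, __enter__ and __exit__, which handle entering and leaving the with block respectively.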
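A minimal sketch of those two methods follows. ManagedResource is a hypothetical class that only tracks an is_open flag, so the effect of __enter__ and __exit__ can be observed without touching the file system.

```python
class ManagedResource:
    """Hypothetical resource that records whether it is currently open."""

    def __init__(self, name):
        self.name = name
        self.is_open = False

    def __enter__(self):
        # Runs when the with block is entered; the return value is bound by "as".
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the block is left, normally or because of an exception.
        self.is_open = False
        return False  # returning False lets any exception propagate


with ManagedResource("demo") as resource:
    state_inside = resource.is_open

print(state_inside)      # True
print(resource.is_open)  # False: released after the block

try:
    with ManagedResource("failing") as failing:
        raise ValueError("something went wrong")
except ValueError:
    print(failing.is_open)  # False: __exit__ ran before the exception propagated
```

For simple cases, the standard library's contextlib.contextmanager decorator offers a generator-based alternative to writing the class by hand.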

Python Iterable

To understand which Python objects can be iterated over, start with two similar terms:

Iterable
Iterator

Iterable

An object that can be iterated over (looped through) is called an iterable. According to the official documentation, implementing either __iter__ or __getitem__ is enough. This includes the common list, tuple, set, dict, str, and range.

    >>> dir(str())
    ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', ......]

However, when checking with collections.abc, an object that implements only __getitem__ can be iterated over but is not reported as an Iterable.

Iterator

https://docs.python.org/3.7/c-api/iter.html

According to the official documentation, an object that has both __iter__ and __next__ is an iterator. Iterators are a subset of iterables: the types mentioned above are iterables, but none of them is an iterator. You can confirm this with isinstance or dir:

    >>> from collections.abc import Iterable, Iterator
    >>> for i in ([1,2,3], "123", (1,2,3)):
    ...     print(f"{i} is iterable: {isinstance(i, Iterable)}")
    ...     print(f"{i} is iterator: {isinstance(i, Iterator)}")
    ...
    [1, 2, 3] is iterable: True
    [1, 2, 3] is iterator: False
    123 is iterable: True
    123 is iterator: False
    (1, 2, 3) is iterable: True
    (1, 2, 3) is iterator: False

A file object, on the other hand, is an iterator:

    >>> import os
    >>> file_path = os.path.abspath("test.py")
    >>> with open(file_path) as ifile:
    ...     isinstance(ifile, Iterator)
    ...
    True

Conclusion

Now that we understand iterables and iterators, we know which basic methods an object must provide when we want to create something that can be iterated over, or an iterator, in future development. So what about generators?
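Both points can be checked with two small classes. The names OnlyGetItem and Countdown are hypothetical, written just for this sketch: the first implements only __getitem__, the second implements __iter__ and __next__.

```python
from collections.abc import Iterable, Iterator


class OnlyGetItem:
    """Hypothetical sequence-like class that implements only __getitem__."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        # The IndexError raised past the end is what stops the iteration.
        return self._items[index]


class Countdown:
    """Hypothetical iterator: implements both __iter__ and __next__."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1


legacy = OnlyGetItem("abc")
print(list(legacy))                  # ['a', 'b', 'c']
print(isinstance(legacy, Iterable))  # False

countdown = Countdown(3)
print(isinstance(countdown, Iterable), isinstance(countdown, Iterator))  # True True
print(list(countdown))               # [3, 2, 1]
```

The isinstance check only looks for __iter__, which is why calling iter() on the object is the more reliable test of whether something can be looped over.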

python pdb

pdb — The Python Debugger

A simple piece of code:

    print(f'file = {__file__}')

Common ways to use pdb

1. Run it directly

    python -m pdb file.py

Running the command above puts the whole file under pdb control:

    > /home/src/test.py(1)<module>()
    -> file = __file__
    (Pdb)

2. Set a breakpoint

Change the code above to:

    file = __file__
    import pdb; pdb.set_trace()
    print(f'file = {file}')

When it runs, pdb is entered from the set_trace() call on the second line and stops at the next statement:

    > /home/src/test.py(3)<module>()
    -> print(f'file = {file}')
    (Pdb)

As above, you are now at the pdb prompt. In Python 3.7 and later you can use breakpoint() instead; see PEP 553 for details.

    file = __file__
    breakpoint()
    print(f'file = {file}')

3. Use it in the Python shell

If a function raises an error while you are working in the shell, you can debug it as follows. Create a test function:

    def test():
        a = 1
        b = 'b'
        return a + b

This function raises an error because it adds a number and a string. Now start Python:

    >>> from test import test
    >>> import pdb
    >>> test()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/src/test.py", line 4, in test
        return a + b
    TypeError: unsupported operand type(s) for +: 'int' and 'str'
    >>> pdb.pm()
    > /home/src/test.py(4)test()
    -> return a + b
    (Pdb)

This also drops you into pdb. ...

OpenCV Face Recognition

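Face detection is usually a preprocessing step in a face recognition pipeline. Here we implement it with Haar features. During training, the algorithm uses AdaBoost, chaining multiple "weak classifiers" into a cascade to make the decision. Each stage extracts a feature value to judge whether the region is a face:

If the answer is "yes", the region moves on to the next, stronger classifier stage.
If the answer is "no", the region is rejected immediately.

Broadly speaking, this is like letting all the weak classifiers vote and combining the votes weighted by each classifier's accuracy. The resulting classifier structure is called a cascade and resembles a simple multi-level decision tree.

Practical use and tuning

In practice, the main challenge with Haar cascades is parameter tuning, especially scaleFactor and minNeighbors:

scaleFactor: controls how much the image is scaled between passes. With a larger value, fewer scales are checked, which is faster but more likely to miss smaller targets.
minNeighbors: the number of neighboring detections a candidate region must have before it is declared a face.

Because resolution and scene differ from image to image, the parameters often have to be adjusted by hand to get the best result, which makes automated processing difficult. In the future I will try more robust approaches such as deep learning.

Reference: Face Recognition with Python

The figure below shows the face and eye detection results; the source image is a USA Volleyball National Team group photo:

My Github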
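The two ideas above, early rejection in a cascade and the effect of scaleFactor on how many scales get checked, can be illustrated without OpenCV. This is a simplified model, not OpenCV's implementation: the stage functions, feature names, thresholds, and the window_sizes helper are all hypothetical.

```python
def cascade_accepts(window, stages):
    """Return True only if every stage accepts the window (early rejection)."""
    for stage in stages:
        if not stage(window):
            return False  # rejected: later, more expensive stages never run
    return True


def window_sizes(image_side, min_window, scale_factor):
    """List the window sizes scanned when each pass grows by scale_factor."""
    sizes = []
    size = float(min_window)
    while size <= image_side:
        sizes.append(round(size))
        size *= scale_factor
    return sizes


# Hypothetical stages operating on a dict of precomputed feature values.
stages = [
    lambda w: w["eye_band_contrast"] > 0.2,
    lambda w: w["nose_bridge_contrast"] > 0.1,
]

face_like = {"eye_band_contrast": 0.5, "nose_bridge_contrast": 0.3}
background = {"eye_band_contrast": 0.1, "nose_bridge_contrast": 0.3}
print(cascade_accepts(face_like, stages))   # True
print(cascade_accepts(background, stages))  # False, rejected at the first stage

fine = window_sizes(image_side=240, min_window=24, scale_factor=1.1)
coarse = window_sizes(image_side=240, min_window=24, scale_factor=1.5)
print(len(fine), len(coarse))  # 25 6
```

With a factor of 1.5 only a handful of scales are tried, so a face whose size falls between two of them is easier to miss; that is the speed versus recall trade-off described for scaleFactor.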

ML KNN

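k-nearest neighbors (k-NN)

k-NN is a supervised learning method, and the name says it all: find the "K nearest neighbors". In practice, the algorithm finds the K closest points and decides which class a sample belongs to based on the classes of those neighbors. Although it is supervised learning, it does not actually train any model parameters; it stores all the training data and compares against it at prediction time. We can adjust K to increase the algorithm's noise margin. However, the algorithm needs a lot of storage (high space complexity) and is easily affected by imbalanced data.

In the implementation, the core is computing the distance between points. I used Scipy functions for this. To make the behavior easy to observe, I first took K=1 and compared the result with sklearn's KNN. The approach is to use a for loop to compute the distance between each test sample and all training samples, and take the class of the nearest one as the prediction.

Accuracy comparison:

sklearn knn : 0.9733
Hand-written knn : 0.9467

My Github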
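The approach described above can be sketched in plain Python. This is not the author's code: it uses math.dist (Python 3.8+) instead of Scipy, and the knn_predict function and the toy data are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two small clusters in 2-D.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.2, 1.4)))       # a
print(knn_predict(train_points, train_labels, (6.4, 6.2), k=3))  # b
```

Every prediction scans the whole training set, which is where the storage and lookup cost mentioned above comes from.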

Python Machine Learning Foundations LS-PLA

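Perceptron Learning Algorithm (PLA)

Following Professor Hsuan-Tien Lin's Machine Learning Foundations course, let's implement this basic machine learning algorithm. We are looking at the binary classification (YES/NO) problem within the broader framework of supervised learning.

Perceptron ⇔ linear (binary) classifiers

We have a training set D containing the data Xn and the corresponding labels Yn (here 1 or -1). The hypothesis set H represents all possible solutions (infinitely many lines). Through the algorithm A, we pick from H a g that is close to our target function f. The algorithm has two main steps: find a misclassified point, then correct the weight vector. See the professor's lectures for the details! The naive cycle is a commonly used variant. This method only works when the data is linearly separable. It also cannot be used when the data contains noise; for linear problems, the better option at the moment is Pocket PLA.

Linear separable PLA

First, tidy up the data. Convert the original format, such as ['x0\ty0\tz0\nx1\ty1\tz1\nx2\ty2\tz2\n....'], into the form array([[(x0, y0), z0], [(x1, y1), z1], [(x2, y2), z2].....]).

Naive PLA implementation; the line is drawn using ax + by = 0. ...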
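The two steps, finding a mistake and correcting the weight vector, can be sketched as follows. This is a minimal illustration rather than the course's or the author's code: the toy data, the parse_samples helper, the extra bias weight, and the max_updates safety cap are all assumptions.

```python
def parse_samples(raw):
    """Turn 'x\\ty\\tz\\n...' text into [((x, y), z), ...]."""
    samples = []
    for line in raw.strip().split("\n"):
        x, y, z = line.split("\t")
        samples.append(((float(x), float(y)), int(z)))
    return samples


def sign(value):
    return 1 if value > 0 else -1


def naive_cycle_pla(samples, max_updates=1000):
    """Cycle through the samples, correcting the weights on every mistake."""
    weights = [0.0, 0.0, 0.0]  # [bias, w1, w2]; the bias term is an assumption
    updates = 0
    while updates < max_updates:
        mistake_found = False
        for (x1, x2), label in samples:
            features = (1.0, x1, x2)
            score = sum(w * f for w, f in zip(weights, features))
            if sign(score) != label:
                # Correction step: w <- w + y * x
                weights = [w + label * f for w, f in zip(weights, features)]
                updates += 1
                mistake_found = True
        if not mistake_found:
            break  # a full pass without mistakes: every point is classified
    return weights, updates


# Hypothetical, linearly separable toy data in the tab-separated format.
raw = "2\t3\t1\n3\t4\t1\n3\t3\t1\n-1\t-1\t-1\n-2\t-1\t-1\n0\t-2\t-1\n"
samples = parse_samples(raw)
weights, updates = naive_cycle_pla(samples)
print(weights, updates)  # [1.0, 2.0, 3.0] 1
```

On data that is not linearly separable the loop would never reach a mistake-free pass, which is why the sketch caps the number of updates.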