---
title: refinery-section4
tags: DAS
GA: UA-155999456-1
---

{%hackmd @docsharedstyle/default %}

# 3.4 Data Operations

This section uses the [airline-data.csv file (1.5 MB)](https://cos.twcc.ai/cp4d/das4_0/refinery/data/airline-data.csv) as an example. Suppose we want to compute the average delay per day for United Airlines (UA) and sort the days by that delay. For this scenario, the steps are:

- Filter the rows for United Airlines: `UniqueCarrier` = UA (United Airlines).
- Compute the delay time = arrival delay + departure delay: define a new column `TotalDelay` = sum of (`ArrDelay`, `DepDelay`).
- Select the columns `Year`, `Month`, `DayofMonth`, `TotalDelay`.
- Group by (`Year`, `Month`, `DayofMonth`).
- Aggregate `TotalDelay` by mean.
- Sort the data in descending order by `mean_Delay`.

**Because the interface is localized, each step below lists the operation path for both the Traditional Chinese interface and the English interface, so the two can be compared.**

---

Below we carry out these steps in order with Data Refinery:

## Filter the United Airlines data

> filter `UniqueCarrier` = UA (United Airlines)

**Steps (Traditional Chinese interface):** 作業 > 過濾器 > 直欄: `UniqueCarrier` > 運算子: `是等於` > 值: `UA` > 套用

**Steps (English interface):** +Operation > Filter > Column: `UniqueCarrier` > Operator: `is equal to` > Value: `UA` > Apply

![refinery_data-1](https://cos.twcc.ai/cp4d/das4_0/refinery/image/refinery_data-1.png =350)

---

## Compute the delay time = departure delay + arrival delay

> define a new column `TotalDelay` = sum of (`ArrDelay`, `DepDelay`)

**Steps (Traditional Chinese interface):** 作業 > 計算 > 直欄: `ArrDelay` > 下一步 > choose 「新增」 (addition) > 直欄: `DepDelay` > check 建立新直欄存放結果: `TotalDelay` > 套用

**Steps (English interface):** +Operation > Calculate > Select: `ArrDelay` > Next > Addition > Select Column: `DepDelay` > Create new column for result: `TotalDelay` > Apply

![refinery_data-2](https://cos.twcc.ai/cp4d/das4_0/refinery/image/refinery_data-2.png =350)

---

## Select the columns Year, Month, DayofMonth, and TotalDelay

> select columns: `Year`, `Month`, `DayofMonth`, `TotalDelay`

**Steps (Traditional Chinese interface):** 互動式程式碼範本 > click `select` > enter `Year, Month, DayofMonth, TotalDelay` > 套用

**Steps (English interface):** Code Operation > type `select` > enter `Year, Month, DayofMonth, TotalDelay` > Apply

P.S. You can also pick the operation by clicking it instead of typing; the code syntax is then completed automatically.

![refinery_data-3](https://cos.twcc.ai/cp4d/das4_0/refinery/image/refinery_data-3.png =948)

---

## Group by Year, Month, and DayofMonth

> Group by (`Year`, `Month`, `DayofMonth`)

**Steps (Traditional Chinese interface):** 互動式程式碼範本 > click `group_by` > enter `Year, Month, DayofMonth` > 套用

**Steps (English interface):** Code Operation > click `group_by` > enter `Year, Month, DayofMonth` > Apply

![refinery_data-4](https://cos.twcc.ai/cp4d/das4_0/refinery/image/refinery_data-4.png =948)

---

## Compute the average delay time

> Aggregate `TotalDelay` by mean

**Steps (Traditional Chinese interface):** 作業 > 聚集 > 欄位選擇: `TotalDelay` > 下一步 > choose 「平均值」 > 聚集直欄的名稱: `mean_Delay` > 套用

**Steps (English interface):** +Operation > Aggregate > Select Column: `TotalDelay` > Next > choose Mean Aggregation > Name of the aggregated column: `mean_Delay` > Apply

![refinery_data-5](https://cos.twcc.ai/cp4d/das4_0/refinery/image/refinery_data-5.png)

---

## Sort the average delay from longest to shortest

> Sort data in descending order by `mean_Delay`

To find out on which day United Airlines flights had the worst average delay, we sort the result in descending order by the column `mean_Delay`.

![refinery_data-6](https://cos.twcc.ai/cp4d/das4_0/refinery/image/refinery_data-6.png)

---

## Data processing result

After the steps above, the data processing result looks as follows, with seven steps in total.

![refinery_data-7](https://cos.twcc.ai/cp4d/das4_0/refinery/image/refinery_data-7.png)

To set the name of the processed data, go to the information pane (Detail) on the right and click Edit to change it. In addition, if you are curious about the shape of the data during processing, you can use the Profiles or Visualizations features to explore (a subset\* of) the data.

> \***Note**:
> According to IBM's official documentation, to keep Data Refinery responsive,
> the Data, Profiles, and Visualizations views show only a subset of the data, not the full dataset.
>
> Once the steps of the Data Refinery flow are finalized and the flow runs as a scheduled job, the operations are applied to the full dataset.

---

Once the data processing flow is confirmed, we can schedule this Data Refinery flow. The next section explains how to set up the schedule.

## END
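
For readers who want to check the logic of the flow outside Data Refinery, the six steps (filter, calculate, select, group, aggregate, sort) can be sketched in plain Python. This is only an illustration, not what Data Refinery runs internally: the sample rows are made-up values that reuse the column names from airline-data.csv, the function name `mean_delay_by_day` is hypothetical, and the sketch assumes no delay values are missing.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; the column names follow airline-data.csv,
# but the values are invented for illustration.
rows = [
    {"Year": 2008, "Month": 1, "DayofMonth": 3, "UniqueCarrier": "UA", "ArrDelay": 10, "DepDelay": 20},
    {"Year": 2008, "Month": 1, "DayofMonth": 3, "UniqueCarrier": "UA", "ArrDelay": 30, "DepDelay": 0},
    {"Year": 2008, "Month": 1, "DayofMonth": 4, "UniqueCarrier": "UA", "ArrDelay": 5, "DepDelay": 5},
    {"Year": 2008, "Month": 1, "DayofMonth": 4, "UniqueCarrier": "WN", "ArrDelay": 90, "DepDelay": 90},
]


def mean_delay_by_day(rows, carrier="UA"):
    """Mirror the six flow steps on a list of row dictionaries."""
    # Step 1: filter UniqueCarrier = carrier.
    filtered = [r for r in rows if r["UniqueCarrier"] == carrier]

    # Steps 2-4: compute TotalDelay = ArrDelay + DepDelay, keep only the
    # date columns, and group the totals by (Year, Month, DayofMonth).
    groups = defaultdict(list)
    for r in filtered:
        total_delay = r["ArrDelay"] + r["DepDelay"]
        groups[(r["Year"], r["Month"], r["DayofMonth"])].append(total_delay)

    # Step 5: aggregate each group by mean into mean_Delay.
    result = [
        {"Year": y, "Month": m, "DayofMonth": d, "mean_Delay": mean(delays)}
        for (y, m, d), delays in groups.items()
    ]

    # Step 6: sort in descending order by mean_Delay.
    result.sort(key=lambda r: r["mean_Delay"], reverse=True)
    return result


result = mean_delay_by_day(rows)
for row in result:
    print(row)
```

The first row of `result` is the day with the largest average delay for the chosen carrier, which is the question the flow is meant to answer.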