Autonomous Driving Sensor Antenna Performance Prediction AI Competition

Streamlit EDA

2022.08.06 16:09 2,290 views

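There are so many variables that writing code to inspect each feature one by one was impractical, so I tried streamlit. (It only offers a simple histogram and a line plot, so if you need other plots, customize it for your own use.)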

This uses streamlit, so install the library before running it.

(It can be installed simply with pip install streamlit.)

app.py source code

import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


st.title("Data viewer")


train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")


x_feature_info_df = pd.read_csv("../data/meta/x_feature_info.csv")
y_feature_info_df = pd.read_csv("../data/meta/y_feature_info.csv")


# Map each feature name to its description (the "설명" column of the meta CSVs).
x_feature_info_dict = x_feature_info_df.set_index("Feature").to_dict()["설명"]
y_feature_info_dict = y_feature_info_df.set_index("Feature").to_dict()["설명"]



# Column 0 is skipped (presumably a row ID); the last 14 columns are the targets.
X_feature_name = train_df.columns[1:-14]
y_feature_name = train_df.columns[-14:]



selected_x_feature_name = st.multiselect("Select feature name", X_feature_name)
selected_y_label_name = st.multiselect("Select label name", y_feature_name)



def feature_describe(name_list: list, info_dict: dict, option: str):
    """Write each feature's description and, unless option is "null", plot it."""
    for name in name_list:
        st.write(f"{name} ({info_dict[name]})")
        if option != "null":
            feature_plot(name, option)



def feature_plot(feature_name: str, option: str):
    st.subheader("Train")
    fig, ax = plt.subplots(figsize=(30, 5))
    if option == "dist":
        sns.histplot(train_df[feature_name], label=feature_name, ax=ax, kde=True)
    else:
        ax.plot(train_df[feature_name], label=feature_name)


    ax.legend()
    st.pyplot(fig)
    plt.close(fig)  # free the figure so reruns do not accumulate open figures


    # Only non-target columns get a Test plot; names starting with "Y" are treated as targets.
    if feature_name[0] != "Y":
        st.subheader("Test")
        fig, ax = plt.subplots(figsize=(30, 5))
        if option == "dist":
            sns.histplot(test_df[feature_name], label=feature_name, ax=ax, kde=True)
        else:
            ax.plot(test_df[feature_name], label=feature_name)
        ax.legend()
        st.pyplot(fig)
        plt.close(fig)  # free the figure so reruns do not accumulate open figures



if len(selected_x_feature_name) > 0:
    st.subheader("X feature information")
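    # A unique key avoids Streamlit's duplicate-widget error when both "Plot Type" selectboxes are shown.
    option = st.selectbox("Plot Type", ("null", "dist", "plot"), key="x_plot_type")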
    feature_describe(selected_x_feature_name, x_feature_info_dict, option)



if len(selected_y_label_name) > 0:
    st.subheader("Target feature information")
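    # A unique key avoids Streamlit's duplicate-widget error when both "Plot Type" selectboxes are shown.
    option = st.selectbox("Plot Type", ("null", "dist", "plot"), key="y_plot_type")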
    feature_describe(selected_y_label_name, y_feature_info_dict, option)
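
The positional slices train_df.columns[1:-14] and train_df.columns[-14:] only work while the column order stays fixed. Below is a minimal sketch of a prefix-based alternative, assuming feature columns start with "X" and target columns start with "Y" (the app's own feature_name[0] != "Y" check hints at this). split_columns is a hypothetical helper that is not part of the original app, and the column names in the example are illustrative only.

```python
from typing import Iterable, List, Tuple


def split_columns(columns: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split column names into (features, targets) by prefix.

    Assumption: features start with "X" and targets start with "Y";
    anything else (for example an ID column) is ignored.
    """
    features, targets = [], []
    for name in columns:
        if name.startswith("X"):
            features.append(name)
        elif name.startswith("Y"):
            targets.append(name)
    return features, targets


# Illustrative column names only; the real train.csv may differ.
example_columns = ["ID", "X_01", "X_02", "X_03", "Y_01", "Y_02"]
x_names, y_names = split_columns(example_columns)
print(x_names)  # ['X_01', 'X_02', 'X_03']
print(y_names)  # ['Y_01', 'Y_02']
```

In app.py, something like X_feature_name, y_feature_name = split_columns(train_df.columns) could stand in for the two slices, if the prefix assumption holds for your data.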


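The directory layout is as follows. The ../data paths in app.py are relative, so launch the app from inside the app directory (streamlit run app.py).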

.
├── app
│   └── app.py
└── data
    ├── meta
    │   ├── x_feature_info.csv
    │   ├── y_feature_info.csv
    │   └── y_feature_spec_info.csv
    ├── open.zip
    ├── sample_submission.csv
    ├── test.csv
    └── train.csv
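
Because the CSV paths in app.py are relative, they resolve against whichever directory the app is launched from. A minimal sketch of one way to make them independent of the launch directory, assuming the layout above (app/app.py next to data/). data_dir_for is a hypothetical helper, and the project path in the example is made up for illustration.

```python
from pathlib import Path


def data_dir_for(app_file: str) -> Path:
    """Return the data directory that sits beside the app directory.

    Assumes the layout <root>/app/app.py and <root>/data/.
    """
    app_dir = Path(app_file).resolve().parent  # <root>/app
    return app_dir.parent / "data"             # <root>/data


# Hypothetical location, used only to show how the paths are built.
data_dir = data_dir_for("/home/user/project/app/app.py")
print(data_dir / "train.csv")
print(data_dir / "meta" / "x_feature_info.csv")
```

Inside app.py, calling data_dir_for(__file__) and then pd.read_csv(data_dir / "train.csv") should work no matter where streamlit run is invoked from.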


황윤태호랑이
2022.08.06 16:19

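Feature engineering clearly seems important, but with only the distributions to go on, the LB score feels like it depends quite a bit on luck....
Would it be different for people who have domain knowledge about features like these..?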

1Gb
2022.08.06 16:27

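Having people with domain knowledge would surely help the analysis a lot, but I think it will be hard to find anyone with this particular domain knowledge.
Even for someone who manufactures Lidar sensors, there are many different production methods, so it would still be difficult.
There is also a small chance that this data is not real-world data.
I strongly agree that luck plays a big part in the LB score. But this is still only the public LB, so I think trusting your own CV strategy and counting on a shake-up is a matter of skill.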

planting
2022.08.09 18:52

Nice to see streamlit here. I tried Tableau this time but couldn't get the hang of it, so I should build a dashboard with this too.