Understanding data format, structuring data

Olivia-BlackCherry 2023. 7. 7. 16:21

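During the discovering process, ask yourself questions like these.

- How can I group and break this large dataset into smaller pieces so that I understand it more deeply?
- How can I verify the hypothesis I have formed?
- In its current format, can the data give me a correct answer?

By questioning the data, forming sound hypotheses, and testing them, you can make meaningful discoveries. Asking questions and forming hypotheses takes a lot of time and effort, but it plays a decisive role in uncovering the hidden story.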

Organize or alter data

Try manipulating the data: group it, combine it, split it, or change its format.

(Examples)

- Regroup entries into months/years or age ranges

- group customer ages into age ranges

- combine or split data columns

- change data formats or time zones
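
Two of the items above, grouping ages into ranges and splitting a column, can be sketched with pandas as follows. The customer data, column names, and bin edges are all made up for illustration.

```python
import pandas as pd

# Hypothetical customer data, invented for this example.
customers = pd.DataFrame({
    "full_name": ["Ana Lee", "Ben Cho", "Cara Diaz"],
    "age": [23, 37, 61],
})

# Group ages into ranges. Bins are right-inclusive: (0, 29], (29, 49], (49, 120].
customers["age_range"] = pd.cut(
    customers["age"],
    bins=[0, 29, 49, 120],
    labels=["under 30", "30-49", "50+"],
)

# Split one column into two at the first space.
customers[["first_name", "last_name"]] = (
    customers["full_name"].str.split(" ", n=1, expand=True)
)

print(customers)
```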

1. Data formatting

Manipulating datetime strings in Python

The code below manipulates datetime strings.

from datetime import datetime

Examples:

Code	Input type	Input	Output type	Output
datetime.strptime("25/11/2022", "%d/%m/%Y")	string	"25/11/2022"	DateTime	"2022-11-25 00:00:00"
datetime.strftime(dt_object, "%d/%m/%Y")	DateTime	"2022-11-25 00:00:00"	string	"25/11/2022"
dt_object = datetime.strptime("25/11/2022", "%d/%m/%Y"); datetime.timestamp(dt_object)	string	"25/11/2022"	float (UTC timestamp in seconds)	1669334400.0
datetime.strptime("25/11/2022", "%d/%m/%Y").strftime("%Y-%m-%d")	string	"25/11/2022"	string	"2022-11-25"
datetime.fromtimestamp(1669334400.0)	float (UTC timestamp in seconds)	1669334400.0	DateTime	"2022-11-25 00:00:00"
datetime.fromtimestamp(1669334400.0).strftime("%d/%m/%Y")	float (UTC timestamp in seconds)	1669334400.0	string	"25/11/2022"
from pytz import timezone; ny_time = datetime.strptime("25-11-2022 09:34:00-0700", "%d-%m-%Y %H:%M:%S%z"); Tokyo_time = ny_time.astimezone(timezone('Asia/Tokyo'))	string	UTC-07:00 offset "25-11-2022 09:34:00-0700"	DateTime	Tokyo time zone datetime.datetime(2022, 11, 26, 1, 34, tzinfo=<DstTzInfo 'Asia/Tokyo' JST+9:00:00 STD>)
datetime.strptime("20:00", "%H:%M").strftime("%I:%M %p")	string	"20:00"	string	"08:00 PM"
datetime.strptime("08:00 PM", "%I:%M %p").strftime("%H:%M")	string	"08:00 PM"	string	"20:00"

Note: the timestamp rows assume the machine's local time zone is UTC, because timestamp() and fromtimestamp() interpret naive datetimes in local time; other time zones give different numbers. Also, the -0700 offset in the time zone row is UTC-07:00, which is not New York's offset in November (UTC-05:00), even though the variable is named ny_time.
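
A few of the conversions in the table can be checked with a short script. This is a minimal sketch using only the standard library: it attaches UTC explicitly so the timestamp does not depend on the machine's local time zone, and it uses a fixed +09:00 offset for Tokyo instead of pytz, which is a substitution for illustration (Japan does not observe daylight saving time).

```python
from datetime import datetime, timedelta, timezone

# string -> datetime -> string in another format
dt_object = datetime.strptime("25/11/2022", "%d/%m/%Y")
iso_text = dt_object.strftime("%Y-%m-%d")

# Attach UTC so the timestamp is the same on every machine.
utc_dt = dt_object.replace(tzinfo=timezone.utc)
ts = utc_dt.timestamp()
back = datetime.fromtimestamp(ts, tz=timezone.utc)

# Parse a string that carries a UTC offset, then convert it to UTC+09:00.
src = datetime.strptime("25-11-2022 09:34:00-0700", "%d-%m-%Y %H:%M:%S%z")
tokyo = src.astimezone(timezone(timedelta(hours=9)))

# 24-hour clock -> 12-hour clock
twelve_hour = datetime.strptime("20:00", "%H:%M").strftime("%I:%M %p")

print(iso_text, ts, back, tokyo, twelve_hour)
```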

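Data is manipulated in many ways for analysis. Of the many approaches, the first to consider is breaking the data into several smaller groups: for example, splitting a date into year, month, and day, or a time into hours, minutes, and seconds.

You can also take the split data and regroup it in new ways, for example by week or by quarter. Look at the data from many angles, much as you would turn an item over before buying it to check its purpose, use, and design.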

# Create four new columns.
df['week'] = df['date'].dt.strftime('%Y-W%V')
df['month'] = df['date'].dt.strftime('%Y-%m')
df['quarter'] = df['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')
df['year'] = df['date'].dt.strftime('%Y')

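These lines derive four new columns from the 'date' column of df. strftime() formats a datetime as a string; the argument in the parentheses is the requested format.

For the quarter, to_period('Q') first converts each date to a quarterly period, and strftime() then formats that period.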
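
To see what these columns look like, here is a small sketch on three made-up dates. It also adds a hypothetical iso_week column using %G (the ISO year), because %V is the ISO week number and can disagree with the calendar year %Y around New Year. The %G and %V codes are passed to the platform's strftime, so this assumes a platform that supports them (for example Linux).

```python
import pandas as pd

# Hypothetical dates, chosen so that the first one falls in ISO week 53 of 2020.
df = pd.DataFrame({"date": pd.to_datetime(["2021-01-01", "2021-05-17", "2021-11-30"])})

df["week"] = df["date"].dt.strftime("%Y-W%V")      # calendar year + ISO week
df["month"] = df["date"].dt.strftime("%Y-%m")
df["quarter"] = df["date"].dt.to_period("Q").dt.strftime("%Y-Q%q")
df["year"] = df["date"].dt.strftime("%Y")

# Variant that pairs the ISO week with the ISO year.
df["iso_week"] = df["date"].dt.strftime("%G-W%V")

print(df)
```

For 2021-01-01 the week column reads 2021-W53 while iso_week reads 2020-W53, so the %G variant may be the safer choice when weeks are used as grouping keys.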

isocalendar()

Returns the ISO year, ISO week number, and day of the week, so you can pick the piece you need.

>>> import pandas as pd
>>> ser = pd.to_datetime(pd.Series(["2010-01-01", pd.NaT]))
>>> ser.dt.isocalendar()
   year  week  day
0  2009    53     5
1  <NA>  <NA>  <NA>
>>> ser.dt.isocalendar().week
0      53
1    <NA>
Name: week, dtype: UInt32

day_name()

Returns the name of the day of the week.

>>> s = pd.Series(pd.date_range(start='2018-01-01', freq='D', periods=3))
>>> s
0   2018-01-01
1   2018-01-02
2   2018-01-03
dtype: datetime64[ns]
>>> s.dt.day_name()
0       Monday
1      Tuesday
2    Wednesday
dtype: object

2. Structuring

1) Sorting

The process of arranging data into meaningful order for analysis

Order the rows according to some criterion.
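
A minimal sorting sketch with pandas, using made-up monthly counts (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical monthly strike counts, invented for this example.
df = pd.DataFrame({
    "month": ["2018-01", "2018-02", "2018-03"],
    "strikes": [120, 450, 300],
})

# Sort from the largest count to the smallest and renumber the rows.
sorted_df = df.sort_values("strikes", ascending=False).reset_index(drop=True)
print(sorted_df)
```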

2) Extracting

The process of retrieving data from a dataset or source for further processing

Pull out the columns needed for a specific purpose, such as comparison or visualization.
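
In pandas, extracting columns can be as simple as passing a list of column names. The data and names below are hypothetical:

```python
import pandas as pd

# Hypothetical dataset with more columns than a plot would need.
df = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "strikes": [120, 450],
    "sensor_id": ["a1", "b7"],
})

# Keep only the two columns needed for a time-series chart.
plot_data = df[["date", "strikes"]]
print(plot_data)
```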

3) Filtering

The process of selecting a smaller part of your dataset based on specified parameters and using it for viewing or analysis

Use it to keep only the rows that meet a given condition.
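
One common way to filter in pandas is a boolean mask. The threshold of 200 and the data are made up for illustration:

```python
import pandas as pd

# Hypothetical monthly strike counts.
df = pd.DataFrame({
    "month": ["2018-01", "2018-02", "2018-03"],
    "strikes": [120, 450, 300],
})

# The comparison yields True/False per row; indexing keeps the True rows.
busy_months = df[df["strikes"] > 200]
print(busy_months)
```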

4) Slicing

A method for breaking information down into smaller parts to facilitate efficient examination and analysis from different viewpoints.

Break something large into smaller pieces.
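
As a sketch, pandas offers position-based slicing with iloc and label-based slicing with loc. The data is hypothetical:

```python
import pandas as pd

# Hypothetical monthly strike counts.
df = pd.DataFrame({
    "month": ["2018-01", "2018-02", "2018-03"],
    "strikes": [120, 450, 300],
})

# iloc slices by position; the end position is excluded.
first_two = df.iloc[0:2]

# loc slices by label; the end label is included.
last_two = df.loc[1:2, ["month", "strikes"]]

print(first_two)
print(last_two)
```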

5) Grouping

Aggregating individual observations of a variable into groups

Also known as bucketizing.
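
A minimal grouping sketch: monthly rows are aggregated into one total per year with groupby. The numbers and column names are invented for illustration:

```python
import pandas as pd

# Hypothetical monthly strike counts for two years.
df = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "strikes": [100, 150, 200, 100],
})

# One row per year, with the monthly counts summed.
by_year = df.groupby("year", as_index=False)["strikes"].sum()
print(by_year)
```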

6) Merging

Method to combine two different data frames along a specified starting column

Combine two dataframes into one.

# Combine `lightning_by_month` and `lightning_by_year` dataframes into single dataframe
percentage_lightning = lightning_by_month.merge(lightning_by_year,on='year')
percentage_lightning.head()
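
The merge above only shows the final call, so here is a self-contained sketch of how such a table might be built. The strike counts and the strikes, year_strikes, and percentage column names are assumptions; only the merge on 'year' comes from the snippet above.

```python
import pandas as pd

# Hypothetical monthly totals.
lightning_by_month = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "strikes": [100, 300, 50, 150],
})

# Yearly totals derived from the monthly table.
lightning_by_year = (
    lightning_by_month.groupby("year", as_index=False)["strikes"].sum()
    .rename(columns={"strikes": "year_strikes"})
)

# Attach each year's total to every month of that year.
percentage_lightning = lightning_by_month.merge(lightning_by_year, on="year")

# Share of the yearly total contributed by each month.
percentage_lightning["percentage"] = (
    percentage_lightning["strikes"] / percentage_lightning["year_strikes"] * 100
)
print(percentage_lightning.head())
```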


License: Attribution, NonCommercial, NoDerivatives
