Web scraping, BeautifulSoup, getText, get, strip, isdigit, fromkeys, extract


Olivia-BlackCherry 2023. 5. 29. 10:56

BeautifulSoup

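A website is made up of enormously complex code, and Beautiful Soup is a Python module that helps developers make sense of it. With Beautiful Soup you can pull the HTML elements you want out of complex HTML code accurately and quickly.

In other words, it picks out just the information you need from everything on the page.

Suppose we want to read a website using Beautiful Soup.

1. Read the HTML file.

- Add encoding="UTF-8" to avoid the error saying the cp949 codec cannot decode the file. This error can appear when the system default encoding (for example cp949 on Korean Windows) does not match the file's encoding.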

with open("website.html", encoding="UTF-8") as file:
	contents=file.read()

Or

fetch the page from the web with the requests library.

import requests
response = requests.get(url="PUT_THE_URL_HERE")
contents = response.text
print(contents)

2. Create a BeautifulSoup object

from bs4 import BeautifulSoup
soup= BeautifulSoup(contents, "html.parser")

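Import BeautifulSoup and create an object.

Two arguments are passed when creating the object.

The first is contents, the raw material. It should be markup, that is, HTML or XML.

The second is the parser, which says how the contents should be read. The parser helps Beautiful Soup understand the contents: you name it so that Beautiful Soup knows which kind of markup language it is parsing.

3. Select a specific element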

soup.title
soup.a
soup.title.string

Wrap each expression in print() to display it.
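As a minimal sketch of the attribute-style access above, the snippet below parses a small inline page instead of a real website. The HTML string and the variable names (sample_html, sample_soup, and so on) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
sample_html = """
<html>
  <head><title>My Page</title></head>
  <body>
    <a href="https://example.com/first">First link</a>
    <a href="https://example.com/second">Second link</a>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title          # the whole <title> element
first_anchor = sample_soup.a           # only the FIRST <a> in the document
title_text = sample_soup.title.string  # the text inside <title>

print(title_tag)     # <title>My Page</title>
print(first_anchor)  # <a href="https://example.com/first">First link</a>
print(title_text)    # My Page
```

Note that soup.a returns only the first matching tag, which is why find_all() is needed when every match is wanted.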

4. Find all instances of a specific tag

a_all= soup.find_all(name="a")
print(a_all)

5. Find a tag with a specific id/class

li_one=soup.find(name="li", id="lesson1")
class_one=soup.find(name="li", class_="lesson")

Write class_, not class, because class is a reserved keyword in Python.
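The following sketch shows how find() and find_all() behave with id and class_ filters. It assumes a small made-up list of lessons; the ids and class names simply mirror the ones used above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample_html = """
<ul>
  <li id="lesson1" class="lesson">Intro</li>
  <li id="lesson2" class="lesson">Parsing</li>
  <li id="extra">Appendix</li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first match (or None when nothing matches).
li_one = sample_soup.find(name="li", id="lesson1")
first_lesson = sample_soup.find(name="li", class_="lesson")
missing = sample_soup.find(name="li", id="no-such-id")

# find_all() returns every match as a list-like result.
all_lessons = sample_soup.find_all(name="li", class_="lesson")

print(li_one)            # <li class="lesson" id="lesson1">Intro</li>
print(len(all_lessons))  # 2
print(missing)           # None
```

Because find() can return None, it is usually worth checking the result before accessing attributes on it.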

6. Drill down with selectors

drill_down= soup.select_one("p em strong")
id_selector= soup.select(selector="#lesson1")
class_selector=soup.select(selector=".lesson")
# Same as: class_selector=soup.select(".lesson")
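To make the difference between select_one() and select() concrete, here is a small sketch on made-up markup. select_one() gives back a single tag (or None), while select() always gives back a list, even for an id selector.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample_html = """
<p>Some <em>nested <strong>bold text</strong></em> here.</p>
<ul>
  <li id="lesson1" class="lesson">Intro</li>
  <li id="lesson2" class="lesson">Parsing</li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Descendant selector: a <strong> inside an <em> inside a <p>.
drill_down = sample_soup.select_one("p em strong")

# CSS id and class selectors; both calls return lists.
by_id = sample_soup.select("#lesson1")
by_class = sample_soup.select(".lesson")

print(drill_down.get_text())                # bold text
print(len(by_id))                           # 1
print([li.get_text() for li in by_class])   # ['Intro', 'Parsing']
```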

7. Get only the text: getText()

anchors= soup.find_all(name="a")
for anchor in anchors:
	print(anchor.getText())

8. Get only a specific attribute value: get()

for anchor in anchors:
	print(anchor.get("href"))
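Steps 7 and 8 can be tried together on a small made-up snippet. The sketch below assumes three anchors, one of which has no href, to show that get() returns None for a missing attribute instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample_html = """
<a href="https://example.com/a">Page A</a>
<a href="https://example.com/b">Page B</a>
<a>No link here</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

anchors = sample_soup.find_all(name="a")

# getText() returns only the text between the opening and closing tags.
texts = [anchor.getText() for anchor in anchors]

# get("href") returns the attribute value, or None if the tag has no href.
hrefs = [anchor.get("href") for anchor in anchors]

print(texts)  # ['Page A', 'Page B', 'No link here']
print(hrefs)  # ['https://example.com/a', 'https://example.com/b', None]
```

getText() is the same method as get_text(), which is the snake_case spelling used in the Beautiful Soup documentation.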

9. strip()

Removes whitespace (spaces, tabs, newlines, and so on) from both ends of a string.

lstrip removes whitespace on the left.

rstrip removes whitespace on the right.
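A short sketch of the three methods on one made-up string. The last line also shows that strip() accepts a set of characters to remove instead of whitespace, which can be handy for scraped values such as prices.

```python
# Hypothetical raw value, as it might come out of a scraped page.
raw = "  \t Hello, world \n"

both = raw.strip()    # whitespace removed from both ends
left = raw.lstrip()   # whitespace removed from the left only
right = raw.rstrip()  # whitespace removed from the right only

print(repr(both))   # 'Hello, world'
print(repr(left))   # 'Hello, world \n'
print(repr(right))  # '  \t Hello, world'

# With an argument, strip() removes those characters instead of whitespace.
price = "$1,200$".strip("$")
print(price)        # 1,200
```

Whitespace inside the string (here the space after the comma) is never touched; only the ends are trimmed.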

10. isdigit

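Checks whether a string consists entirely of digits.

Returns True if the string is non-empty and made up only of digits such as 0-9; otherwise it returns False.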

string1 = "12345"
print(string1.isdigit())  # Output: True

string2 = "Hello"
print(string2.isdigit())  # Output: False

string3 = "42 is the answer"
print(string3.isdigit())  # Output: False
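When cleaning scraped text, strip() and isdigit() are often combined. The helper below, to_int_or_none, is a hypothetical name introduced only for this sketch; it also adds an isascii() check (Python 3.7+) because isdigit() accepts some non-ASCII digit characters, such as superscript two, that int() cannot parse.

```python
def to_int_or_none(text):
    """Return text as an int if it is a plain run of ASCII digits, else None."""
    cleaned = text.strip()
    # isascii() guards against characters like "²", for which isdigit() is True.
    if cleaned.isascii() and cleaned.isdigit():
        return int(cleaned)
    return None


print(to_int_or_none(" 42\n"))  # 42
print(to_int_or_none("-5"))     # None (the minus sign is not a digit)
print(to_int_or_none("3.14"))   # None (the dot is not a digit)
print(to_int_or_none(""))       # None (empty string)
```

Negative numbers and decimals need a different check, since isdigit() only looks at individual characters.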

11. fromkeys()

A method that creates a dictionary.

It builds a dictionary from the given keys and sets every key to the same default value.

keys = ['a', 'b', 'c']
default_value = 0

dictionary = dict.fromkeys(keys, default_value)
print(dictionary)

#{'a': 0, 'b': 0, 'c': 0}
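Two details of fromkeys() are worth a quick sketch: the default value is None when omitted, and a mutable default such as a list is shared by every key. The variable names below are made up for illustration.

```python
keys = ["a", "b", "c"]

# Without a second argument, every value is None.
no_default = dict.fromkeys(keys)
print(no_default)   # {'a': None, 'b': None, 'c': None}

# Pitfall: one list object is shared by all keys.
shared = dict.fromkeys(keys, [])
shared["a"].append(1)
print(shared)       # {'a': [1], 'b': [1], 'c': [1]}

# A dict comprehension creates a separate list per key.
independent = {key: [] for key in keys}
independent["a"].append(1)
print(independent)  # {'a': [1], 'b': [], 'c': []}

# fromkeys() is also a common way to drop duplicates while keeping order.
unique = list(dict.fromkeys(["x", "y", "x", "z", "y"]))
print(unique)       # ['x', 'y', 'z']
```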

12. del

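del is a Python statement (a keyword, not a built-in function) used to delete objects.

It can delete variables, lists, list elements, dictionary keys, slices, object attributes, and so on.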

my_list = [1, 2, 3, 4, 5]
del my_list[2]

my_dict = {'a': 1, 'b': 2, 'c': 3}
del my_dict['b']

class MyClass:
    def __init__(self):
        self.x = 10
        self.y = 20

obj = MyClass()
del obj.x

13. extract()

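Removes a specific element from a Beautiful Soup object.

Take row.br.extract() as an example:

row.br is the Beautiful Soup element for the br tag inside the row object,

and calling extract() on it removes that br tag from the tree and returns it.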

from bs4 import BeautifulSoup

html_ex = '<p>This is a <br>paragraph with a line break.</p>'
soup_ex = BeautifulSoup(html_ex, 'html.parser')

p_tag = soup_ex.find('p')
print(p_tag)

if p_tag.br:
    p_tag.br.extract()
    print(p_tag)

updated_html = str(soup_ex)
print(updated_html)
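As a further sketch, extract() also hands back the element it removed, so the removed tag can be inspected or reused. The table cell below is made up for illustration, imagining a footnote marker that should be dropped before reading the text.

```python
from bs4 import BeautifulSoup

# Hypothetical table cell with a footnote marker, used only for illustration.
row_html = "<td>Seoul<sup>[1]</sup></td>"
row_soup = BeautifulSoup(row_html, "html.parser")
row = row_soup.td

# extract() detaches the <sup> tag from the tree and returns it.
removed = row.sup.extract()

print(removed)         # <sup>[1]</sup>
print(row)             # <td>Seoul</td>
print(row.get_text())  # Seoul
```

If the removed element is not needed afterwards, Beautiful Soup's decompose() is an alternative that removes the tag and discards it.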
