Notes on scraping problem text from poj.org

2021-04-22


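Before we start

I am learning web scraping, and this post records the code I used to scrape the problem text, along with the problems I ran into.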

Reading a txt file

import os
import re

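def read_file_as_str(file_path):
    # Make sure the file exists
    if not os.path.isfile(file_path):
        raise FileNotFoundError(file_path + " does not exist")

    # The with block closes the file even if reading fails
    with open(file_path) as f:
        all_the_text = f.read()
    return all_the_text

file = read_file_as_str('test.txt')
# Split on the "<id>#" prefix; drop the first piece, which precedes the first marker
re.split(r'\d+#', file)[1:]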

['Timeout\n',
 'Timeout\n',
 'Timeout\n',
 'Timeout\n',
 'Timeout\n',
 'Timeout\n',
 'Timeout\n',
 'Timeout\n',
 'Calculate a+b\n',
 'Problems involving the computation of exact values of very large magnitude and precision are common. For example, the computation of the national debt is a taxing experience for many computer systems.\n\nThis problem requires that you write a program to compute the exact value of Rn where R is a real number ( 0.0 < R < 99.999 ) and n is an integer such that 0 < n <= 25.\n',
 "Businesses like to have memorable telephone numbers. One way to make a telephone number memorable is to have it spell a memorable word or phrase. For example, you can call the University of Waterloo by dialing the memorable TUT-GLOP. Sometimes only part of the number is used to spell a word. When you get back to your hotel tonight you can order a pizza from Gino's by dialing 310-GINO. Another way to make a telephone number memorable is to group the digits in a memorable way. You could order your pizza from Pizza Hut by calling their ``three tens'' number 3-10-10-10.\n\nThe standard form of a telephone number is seven decimal digits with a hyphen between the third and fourth digits (e.g. 888-1200). The keypad of a phone supplies the mapping of letters to numbers, as follows:\n\nA, B, and C map to 2\nD, E, and F map to 3\nG, H, and I map to 4\nJ, K, and L map to 5\nM, N, and O map to 6\nP, R, and S map to 7\nT, U, and V map to 8\nW, X, and Y map to 9\n\nThere is no mapping for Q or Z. Hyphens are not dialed, and can be added and removed as necessary. The standard form of TUT-GLOP is 888-4567, the standard form of 310-GINO is 310-4466, and the standard form of 3-10-10-10 is 310-1010.\n\nTwo telephone numbers are equivalent if they have the same standard form. (They dial the same number.)\n\nYour company is compiling a directory of telephone numbers from local businesses. As part of the quality control process you want to check that no two (or more) businesses in the directory have the same telephone number.\n",
 "How far can you make a stack of cards overhang a table? If you have one card, you can create a maximum overhang of half a card length. (We're assuming that the cards must be perpendicular to the table.) With two cards you can make the top card overhang the bottom one by half a card length, and the bottom one overhang the table by a third of a card length, for a total maximum overhang of 1/2 + 1/3 = 5/6 card lengths. In general you can make n cards overhang by 1/2 + 1/3 + 1/4 + ... + 1/(n + 1) card lengths, where the top card overhangs the second by 1/2, the second overhangs tha third by 1/3, the third overhangs the fourth by 1/4, etc., and the bottom card overhangs the table by 1/(n + 1). This is illustrated in the figure below.\n",
 "Larry graduated this year and finally has a job. He's making a lot of money, but somehow never seems to have enough. Larry has decided that he needs to grab hold of his financial portfolio and solve his financing problems. The first step is to figure out what's been going on with his money. Larry has his bank account statements and wants to see how much money he has. Help Larry by writing a program to take his closing balance from each of the past twelve months and calculate his average account balance.\n",
 'Fred Mapper is considering purchasing some land in Louisiana to build his house on. In the process of investigating the land, he learned that the state of Louisiana is actually shrinking by 50 square miles each year, due to erosion caused by the Mississippi River. Since Fred is hoping to live in this house the rest of his life, he needs to know if his land is going to be lost to erosion.\n\nAfter doing more research, Fred has learned that the land that is being lost forms a semicircle. This semicircle is part of a circle centered at (0,0), with the line that bisects the circle being the X axis. Locations below the X axis are in the water. The semicircle has an area of 0 at the beginning of year 1. (Semicircle illustrated in the Figure.)\n',
 "Some people believe that there are three cycles in a person's life that start the day he or she is born. These three cycles are the physical, emotional, and intellectual cycles, and they have periods of lengths 23, 28, and 33 days, respectively. There is one peak in each period of a cycle. At the peak of a cycle, a person performs at his or her best in the corresponding field (physical, emotional or mental). For example, if it is the mental curve, thought processes will be sharper and concentration will be easier.\nSince the three cycles have different periods, the peaks of the three cycles generally occur at different times. We would like to determine when a triple peak occurs (the peaks of all three cycles occur in the same day) for any person. For each cycle, you will be given the number of days from the beginning of the current year at which one of its peaks (not necessarily the first) occurs. You will also be given a date expressed as the number of days from the beginning of the current year. You task is to determine the number of days from the given date to the next triple peak. The given date is not counted. For example, if the given date is 10 and the next triple peak occurs on day 12, the answer is 2, not 3. If a triple peak occurs on the given date, you should give the number of days to the next occurrence of a triple peak.\n"]

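The split above can be illustrated on a small in-memory string. This is only a sketch: `split_records` is a hypothetical helper, not part of the original code, and it assumes every record starts at the beginning of a line with `<id>#`. Anchoring the pattern at line starts keeps a stray digit followed by `#` inside a problem statement from being treated as a record boundary, and the capture group keeps the problem ID next to its text.

```python
import re

def split_records(text):
    """Split 'ID#text' records into (id, text) pairs."""
    # With a capture group, re.split returns
    # ['', id1, body1, id2, body2, ...]
    parts = re.split(r'^(\d+)#', text, flags=re.MULTILINE)
    return [(int(parts[i]), parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]

sample = "1000#Calculate a+b\n1001#Line one\n\nLine two\n"
print(split_records(sample))
# [(1000, 'Calculate a+b'), (1001, 'Line one\n\nLine two')]
```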
Scraping the problem text from the POJ website

import requests  # library for fetching web pages
from bs4 import BeautifulSoup  # library for parsing web pages
headers = {
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36",
    }  # request headers
url = 'http://tjj.changsha.gov.cn/tjxx/tjsj/tjgb/202010/t20201016_9060722.html'
response = requests.request("GET", url, headers=headers)  # fetch the page
response.encoding = response.apparent_encoding  # add this when the fetched page shows garbled characters
soup = BeautifulSoup(response.text, 'html.parser')
bf = soup.find('div', class_='view TRS_UEDITOR trs_paper_default trs_web')
bf

import os
import random  # needed for the random delay between requests
from time import sleep

import bs4
import requests
from bs4 import BeautifulSoup

url_list = []
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'}  # request headers

def url_all():
    # Build the URL of every problem page (IDs 1000 to 4054)
    for i in range(1000, 4055):
        url = 'http://poj.org/problem?id=' + str(i)
        url_list.append(url)

def save_path():
    # Create the output directory if it does not exist yet
    s_path = './text/'
    if not os.path.isdir(s_path):
        os.mkdir(s_path)
    return s_path

def save_text(urls, s_path):  # find the text description of every problem
    num = 1000
    for url in urls:
        try:
            response = requests.get(url, headers=headers, timeout=10)  # fetch the page
            response.encoding = response.apparent_encoding  # add this when the fetched page shows garbled characters
            soup = BeautifulSoup(response.text, 'html.parser')
            bf = soup.find('div', class_='ptx')
            if isinstance(bf, bs4.element.Tag):
                text = str(num) + '#' + str(bf.text) + '\n'
                with open(s_path + 'poj_question_text.txt', 'a') as file:
                    file.write(text)
                print('第' + str(num) + '题已爬！')  # "Problem <num> scraped!"
        except Exception as e:
            # One failed page should not stop the whole crawl
            print(e)
        num += 1
        sleep(random.randint(0, 3))
    print('---------------所有页面遍历完成----------------')  # "All pages traversed"

url_all()
save_text(url_list,save_path())

1000#Calculate a+b

 This problem requires that you write a program to compute the exact value of Rn where R is a real number ( 0.0 < R < 99.999 ) and n is an integer such that 0 < n <= 25. perience for many computer systems.

 Your company is compiling a directory of telephone numbers from local businesses. As part of the quality control process you want to check that no two (or more) businesses in the directory have the same telephone number. is 310-1010.LOP. Sometimes only part of the number is used to spell a word. When you get back to your hotel tonight you can order a pizza from Gino's by dialing 310-GINO. Another way to make a telephone number memorable is to group the digits in a memorable way. You could order your pizza from Pizza Hut by calling their ``three tens'' number 3-10-10-10.

1003#How far can you make a stack of cards overhang a table? If you have one card, you can create a maximum overhang of half a card length. (We're assuming that the cards must be perpendicular to the table.) With two cards you can make the top card overhang the bottom one by half a card length, and the bottom one overhang the table by a third of a card length, for a total maximum overhang of 1/2 + 1/3 = 5/6 card lengths. In general you can make n cards overhang by 1/2 + 1/3 + 1/4 + ... + 1/(n + 1) card lengths, where the top card overhangs the second by 1/2, the second overhangs tha third by 1/3, the third overhangs the fourth by 1/4, etc., and the bottom card overhangs the table by 1/(n + 1). This is illustrated in the figure below.

1004#Larry graduated this year and finally has a job. He's making a lot of money, but somehow never seems to have enough. Larry has decided that he needs to grab hold of his financial portfolio and solve his financing problems. The first step is to figure out what's been going on with his money. Larry has his bank account statements and wants to see how much money he has. Help Larry by writing a program to take his closing balance from each of the past twelve months and calculate his average account balance.

 After doing more research, Fred has learned that the land that is being lost forms a semicircle. This semicircle is part of a circle centered at (0,0), with the line that bisects the circle being the X axis. Locations below the X axis are in the water. The semicircle has an area of 0 at the beginning of year 1. (Semicircle illustrated in the Figure.)f his land is going to be lost to erosion.

Since the three cycles have different periods, the peaks of the three cycles generally occur at different times. We would like to determine when a triple peak occurs (the peaks of all three cycles occur in the same day) for any person. For each cycle, you will be given the number of days from the beginning of the current year at which one of its peaks (not necessarily the first) occurs. You will also be given a date expressed as the number of days from the beginning of the current year. You task is to determine the number of days from the given date to the next triple peak. The given date is not counted. For example, if the given date is 10 and the next triple peak occurs on day 12, the answer is 2, not 3. If a triple peak occurs on the given date, you should give the number of days to the next occurrence of a triple peak.

 You are responsible for cataloguing a sequence of DNA strings (sequences containing only the four letters A, C, G, and T). However, you want to catalog them, not in alphabetical order, but rather in order of ``sortedness'', from ``most sorted'' to ``least sorted''. All the strings are of the same length.easure is called the number of inversions in the sequence. The sequence ``AACEDGG'' has only one inversion (E and D)---it is nearly sorted---while the sequence ``ZWQM'' has 6 inversions (it is as unsorted as can be---exactly the reverse of sorted).

Help professor M. A. Ya and write a program for him to convert the dates from the Haab calendar to the Tzolkin calendar. hus, the first day was: b, 6 canac, 7 ahau, and again in the next period 8 imix, 9 ik, 10 akbal . . .r and the name of the day. They used 20 names: imix, ik, akbal, kan, chicchan, cimi, manik, lamat, muluk, ok, chuen, eb, ben, ix, mem, cib, caban, eznab, canac, ahau and 13 numbers; both in cycles.  cumhu. Instead of having names, the days of the months were denoted by numbers starting from 0 to 19. The last month of Haab was called uayet and had 5 days denoted by numbers 0, 1, 2, 3, 4. The Maya believed that this month was unlucky, the court of justice was not in session, the trade stopped, people did not even sweep the floor.

Images contain 2 to 1,000,000,000 (109) pixels. All images are encoded using run length encoding (RLE). This is a sequence of pairs, containing pixel value (0-255) and run length (1-109). Input images have at most 1,000 of these pairs. Successive pairs have different pixel values. All lines in an image contain the same number of pixels.

 To save money, the RPS would like to issue as few duplicate stamps as possible (given the constraint that they want to issue as many different types).  Further, the RPS won't sell more than four stamps at a time.e RPS has been known to issue several stamps of the same denomination in order to please customers (these count as different types, even though they are the same denomination).  The maximum number of different types of stamps issued at any time is twenty-five.

---------------所有页面遍历完成----------------
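The dump above is hard to read: several records have lost their `ID#` prefix and paragraphs look spliced into one another. One plausible cause, which is an assumption on my part and not something confirmed here, is that the page text contains carriage returns (`\r`), so a terminal or notebook draws later text over earlier text on the same line. Under that assumption, normalizing line endings before writing would keep each record intact. `normalize_text` is a hypothetical helper, not part of the original script.

```python
import re

def normalize_text(raw):
    """Convert CRLF and bare CR line endings to LF and tidy blank lines."""
    text = raw.replace('\r\n', '\n').replace('\r', '\n')
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()

raw = "First paragraph.\r\n\r\nSecond paragraph.\r\n"
print(repr(normalize_text(raw)))
# 'First paragraph.\n\nSecond paragraph.'
```

In the scraper it could be applied to `bf.text` just before the record string is built.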

Appending text

# Open the file in append mode; the with block closes it automatically
with open("test.txt", "a") as fo:
    fo.write("www.runoob.com!\nVery good site!\n")

Checking which IP you appear as

import requests

url = 'http://httpbin.org/get'  # this URL echoes back the visitor's IP
headers = {'User-Agent': 'Mozilla/5.0'}
# use a proxy IP
proxies = {'http':'http://122.4.48.145:9999','https':'https://122.4.48.145:9999'}
html = requests.get(url, headers=headers,proxies=proxies, timeout=5).text
print(html)
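The response from `http://httpbin.org/get` is JSON, and the caller's address is in its `origin` field, so the IP can be pulled out directly instead of reading the whole body. The sketch below parses a hand-written sample body (the address is a documentation placeholder, not a real result); `extract_origin` is a hypothetical helper name.

```python
import json

def extract_origin(body):
    """Return the 'origin' field, i.e. the caller's IP as seen by httpbin."""
    return json.loads(body).get('origin')

# Illustrative payload in the shape httpbin returns; values are placeholders.
sample_body = (
    '{"args": {}, "headers": {"User-Agent": "Mozilla/5.0"}, '
    '"origin": "203.0.113.7", "url": "http://httpbin.org/get"}'
)
print(extract_origin(sample_body))  # 203.0.113.7
```

If the proxy is working, the value printed should be the proxy's address and not your own.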

Generating a random User-Agent with fake_useragent

from fake_useragent import UserAgent
ua = UserAgent()
ua.ie

'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; Media Center PC 6.0; InfoPath.3; MS-RTC LM 8; Zune 4.7'

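Building your own IP pool

Scrape IPs from Kuaidaili, test each one to see whether it works, and build your own proxy IP pool that can be refreshed at any time and used when scraping sites.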

import requests
from lxml import etree
import time
import random
from fake_useragent import UserAgent

class GetProxyIP(object):
    def __init__(self):
        self.url = 'https://www.kuaidaili.com/free/inha/{}/' # 'https://www.xicidaili.com/nn/'
        self.proxies = {
            'http': 'http://163.204.247.219:9999',
            'https': 'https://163.204.247.219:9999'}

    # Generate a random User-Agent
    def get_random_ua(self):
        ua = UserAgent()        # create the User-Agent object
        useragent = ua.random
        return useragent

    # Fetch proxy IPs from one page of the proxy listing site
    def get_ip_file(self, url):
        headers = {'User-Agent': self.get_random_ua()}
        # Request the domestic high-anonymity proxy page and find all tr nodes
        html = requests.get(url=url, headers=headers, timeout=5).content.decode('utf-8', 'ignore')
        parse_html = etree.HTML(html)
        # Base xpath: matches the list of nodes, one per proxy IP
        tr_list = parse_html.xpath('//*[@id="list"]/table/tbody/tr')
        for tr in tr_list:
            ip = tr.xpath('./td[1]/text()')[0]
            port = tr.xpath('./td[2]/text()')[0]
            # Test whether ip:port is usable
            self.test_proxy_ip(ip, port)

    # Test whether a scraped proxy IP is usable
    def test_proxy_ip(self, ip, port):
        proxies = {
            'http': 'http://{}:{}'.format(ip, port),
            'https': 'https://{}:{}'.format(ip, port), }
        test_url = 'http://www.baidu.com/'
        try:
            res = requests.get(url=test_url, proxies=proxies, timeout=8)
            if res.status_code == 200:
                print(ip, ":", port, 'Success')
                # Append every working proxy to the pool file
                with open('proxies.txt', 'a') as f:
                    f.write(ip + ':' + port + '\n')
        except Exception:
            print(ip, port, 'Failed')

    # Entry point: walk listing pages 2000 to 2049 with a random pause between pages
    def main(self):
        for i in range(2000, 2050):
            url = self.url.format(i)
            self.get_ip_file(url)
            time.sleep(random.randint(5, 10))



spider = GetProxyIP()
spider.main()
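The pool-building idea (collect candidates, test each one, keep only the ones that work) can be separated from the network code so it is easy to try out. The sketch below is not the class above: `build_pool` and `fake_check` are hypothetical names, the checker is injected as a function, and the addresses are documentation placeholders. It also skips duplicates so the same proxy is not tested twice.

```python
def build_pool(candidates, is_alive):
    """Return the unique 'ip:port' strings for which is_alive(ip, port) is true."""
    pool = []
    seen = set()
    for entry in candidates:
        entry = entry.strip()
        if not entry or entry in seen:
            continue  # skip blank lines and duplicates
        seen.add(entry)
        ip, _, port = entry.partition(':')
        if port and is_alive(ip, port):
            pool.append(entry)
    return pool

# Stand-in checker for the demo; a real one would send a request through the proxy.
def fake_check(ip, port):
    return port == '9999'

found = build_pool(
    ['192.0.2.1:9999', '192.0.2.2:8080', '192.0.2.1:9999\n', ''],
    fake_check,
)
print(found)  # ['192.0.2.1:9999']
```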

# import requests
# from lxml import etree
# import time
# import random
# from fake_useragent import UserAgent
# url='https://www.kuaidaili.com/free/inha/{}/'
# url
# headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'}
# # Request the domestic high-anonymity proxy page and find all tbody nodes
# url = url.format(1)
# url
# html = requests.get(url=url, headers=headers, timeout=5).content.decode('utf-8', 'ignore')
# parse_html = etree.HTML(html)
# # Base xpath: matches the list of nodes, one per proxy IP
# tr_list = parse_html.xpath('//*[@id="list"]/table/tbody/tr')
# print(type(tr_list))
# ip = tr_list[0].xpath('./td[1]/text()')[0]
# ip
# port = tr_list[0].xpath('./td[2]/text()')[0]
# port

Testing whether a proxy IP is usable

import requests

headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36",}  # request headers
proxies = {'http':'http://10.10.10.10:8765','https':'https://10.10.10.10:8765'}
url='http://www.baidu.com/'

def test_proxy():
    # Return True if the test URL can be fetched through the proxy
    res = requests.get(url, headers=headers, proxies=proxies, timeout=5)
    if res.status_code == 200:
        print("OK")
        return True
    else:
        print(res.status_code)
        print("Error")
        return False

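Taking an IP from the pool

Pick a proxy IP at random from the file for each crawl, so that a single IP does not get banned for making too many requests.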

import random
import requests

class BaiduSpider(object):
    def __init__(self):
        self.url = 'http://www.baidu.com/'
        self.headers = {'User-Agent': 'Mozilla/5.0'}
        self.flag = 1

    def get_proxies(self):
        with open('proxies.txt', 'r') as f:
            result = f.readlines()  # read all lines into a list
        proxy_ip = random.choice(result).strip()  # pick one proxy at random and drop the trailing newline
        L = proxy_ip.split(':')
        proxy_ip = {
            'http': 'http://{}:{}'.format(L[0], L[1]),
            'https': 'https://{}:{}'.format(L[0], L[1])
        }
        return proxy_ip

    def get_html(self):
        proxies = self.get_proxies()
        if self.flag <= 3:
            try:
                html = requests.get(url=self.url, proxies=proxies, headers=self.headers, timeout=5).text
                print(html)
            except Exception as e:
                print('Retry')
                self.flag += 1
                self.get_html()


spider = BaiduSpider()
spider.get_html()
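The same retry idea can be written as a bounded loop that never reuses a proxy that has just failed. This is a sketch under assumptions: `fetch_with_rotation` and `demo_fetch` are hypothetical names, the fetch step is passed in as a function so the example runs without a network, and the addresses are documentation placeholders.

```python
import random

def fetch_with_rotation(fetch, proxy_lines, max_attempts=3):
    """Call fetch(proxies) with randomly ordered proxies, at most max_attempts times."""
    candidates = [line.strip() for line in proxy_lines if line.strip()]
    random.shuffle(candidates)
    for entry in candidates[:max_attempts]:
        ip, port = entry.split(':')
        proxies = {
            'http': 'http://{}:{}'.format(ip, port),
            'https': 'https://{}:{}'.format(ip, port),
        }
        try:
            return fetch(proxies)
        except Exception:
            print('Retry')  # move on to the next proxy
    return None  # every attempt failed

# Stand-in for requests.get(...).text; one of the two proxies is "dead".
def demo_fetch(proxies):
    if '192.0.2.1:' in proxies['http']:
        raise ConnectionError('dead proxy')
    return 'page via ' + proxies['http']

lines = ['192.0.2.1:9999\n', '192.0.2.2:9999\n']
print(fetch_with_rotation(demo_fetch, lines))
# page via http://192.0.2.2:9999
```

Returning `None` after the last attempt lets the caller decide what to do when the whole pool is unusable.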

Getting paid proxy IPs

Write an interface that fetches proxies from a paid open-proxy API.

# Interface for fetching open proxies
import requests
from fake_useragent import UserAgent

ua = UserAgent()  # create the User-Agent object
useragent = ua.random
headers = {'User-Agent': useragent}


def ip_test(ip):
    # Return True if a request through the proxy 'ip:port' succeeds
    url = 'http://www.baidu.com/'
    ip_port = ip.split(':')
    if len(ip_port) != 2:
        return False  # skip blank or malformed lines
    proxies = {
        'http': 'http://{}:{}'.format(ip_port[0], ip_port[1]),
        'https': 'https://{}:{}'.format(ip_port[0], ip_port[1]),
    }
    try:
        res = requests.get(url=url, headers=headers, proxies=proxies, timeout=5)
    except requests.RequestException:
        return False  # a dead proxy should not abort the whole run
    return res.status_code == 200


# Fetch the proxy IPs
def get_ip_list():
    # Kuaidaili: https://www.kuaidaili.com/doc/product/dps/
    api_url = 'http://dev.kdlapi.com/api/getproxy/?orderid=946562662041898&num=100&protocol=1&method=2&an_an=1&an_ha=1&sep=2'
    html = requests.get(api_url).content.decode('utf-8', 'ignore')
    ip_port_list = html.split('\n')

    # Open the output file once and append every proxy that passes the test
    with open('proxy_ip.txt', 'a') as f:
        for ip in ip_port_list:
            ip = ip.strip()
            if ip_test(ip):
                f.write(ip + '\n')


if __name__ == '__main__':
    get_ip_list()
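The API body is split on newlines, so blank lines or an error message from the provider would be passed on to the proxy test as if they were addresses. A small validation step, sketched below, filters the body down to well-formed `ip:port` entries first. `parse_ip_list` is a hypothetical helper, and the sample body, including the error line, is made up for illustration.

```python
import ipaddress

def parse_ip_list(body):
    """Return the well-formed 'ip:port' entries from a newline-separated body."""
    entries = []
    for line in body.splitlines():  # handles both \n and \r\n
        line = line.strip()
        ip, sep, port = line.partition(':')
        if not sep or not port.isdecimal() or not 0 < int(port) < 65536:
            continue
        try:
            ipaddress.ip_address(ip)  # raises ValueError for non-addresses
        except ValueError:
            continue
        entries.append(line)
    return entries

body = "192.0.2.10:8080\r\n192.0.2.11:3128\n\nERROR: order expired\n"
print(parse_ip_list(body))
# ['192.0.2.10:8080', '192.0.2.11:3128']
```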

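Getting private proxy IPs

The username and password are issued to you together with the API_URL. They are not your account name and account password. The format is:

proxies = {'protocol': 'protocol://username:password@IP:port'}

proxies = {'http': 'http://username:password@IP:port', 'https': 'https://username:password@IP:port'}

proxies = {'http': 'http://309435365:szayclhp@106.75.71.140:16816', 'https': 'https://309435365:szayclhp@106.75.71.140:16816'}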
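The format above can be wrapped in a small helper. The sketch below is illustrative only: `build_auth_proxies` is a hypothetical name and the credentials and address are placeholders. It percent-encodes the username and password with `urllib.parse.quote`, which matters when a password contains characters such as `@` or `:` that would otherwise break the URL. Both keys point at the same `http://` proxy URL, which is also what the `ip_test` function in this section does.

```python
from urllib.parse import quote

def build_auth_proxies(username, password, ip, port):
    """Build a requests-style proxies dict with credentials embedded in the URL."""
    user = quote(username, safe='')
    pwd = quote(password, safe='')
    proxy_url = 'http://{}:{}@{}:{}'.format(user, pwd, ip, port)
    return {'http': proxy_url, 'https': proxy_url}

proxies = build_auth_proxies('user', 'p@ss:word', '192.0.2.20', 16816)
print(proxies['http'])
# http://user:p%40ss%3Aword@192.0.2.20:16816
```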

# Interface for fetching private proxies
import requests
from fake_useragent import UserAgent

ua = UserAgent()  # create the User-Agent object
useragent = ua.random
headers = {'User-Agent': useragent}


def ip_test(ip):
    # Return True if a request through the authenticated proxy 'ip:port' succeeds
    url = 'https://blog.csdn.net/qq_34218078/article/details/90901602/'
    ip_port = ip.split(':')
    if len(ip_port) != 2:
        return False  # skip blank or malformed lines
    proxies = {
        'http': 'http://1786088386:b95djiha@{}:{}'.format(ip_port[0], ip_port[1]),
        'https': 'http://1786088386:b95djiha@{}:{}'.format(ip_port[0], ip_port[1]),
    }

    try:
        res = requests.get(url=url, headers=headers, proxies=proxies, timeout=5)
    except requests.RequestException as e:
        print(e)
        return False  # a dead proxy should not abort the whole run
    if res.status_code == 200:
        print("OK")
        return True
    else:
        print(res.status_code)
        print("Error")
        return False


# Fetch the proxy IPs
def get_ip_list():
    # Kuaidaili: https://www.kuaidaili.com/doc/product/dps/
    api_url = 'http://dps.kdlapi.com/api/getdps/?orderid=986603271748760&num=1000&signature=z4a5b2rpt062iejd6h7wvox16si0f7ct&pt=1&sep=2'
    html = requests.get(api_url).content.decode('utf-8', 'ignore')
    ip_port_list = html.split('\n')

    # Open the output file once and append every proxy that passes the test
    with open('proxy_ip.txt', 'a') as f:
        for ip in ip_port_list:
            ip = ip.strip()
            if ip_test(ip):
                f.write(ip + '\n')


if __name__ == '__main__':
    get_ip_list()

Author: YuT
Link: https://ytno1.github.io/archives/dcab6b34.html
Copyright: Unless otherwise stated, all posts on this blog are licensed under the MIT License. Please credit the source when reposting!