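Usage Guide

This code scrapes news articles from the homepage of the Liberty Times (自由时报) news site, collecting three fields per article: headline, link, and publication date.
Run the code under Python 3, and make sure the requests and pandas libraries are installed in that environment.
Results are saved to a file named "自由时报.csv" in the current working directory, which is the script's own directory when you run it from there (change the value of filename in the code if you need a different file name or storage path).
Before using this scraper, make sure your network can reach the Liberty Times site; otherwise the request fails and the script exits with an error.
This scraper is for learning and exchange only; do not use it for commercial purposes.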
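
Because the script writes its CSV without a header row, reading the results back means supplying the column names yourself. The following is a minimal sketch, assuming the date, title, link column order used by the script; the helper name loadNews and the dictionary keys are illustrative and not part of the original code. It uses only the standard csv module, so it runs without pandas.

```python
import csv

def loadNews(filename="自由时报.csv"):
    """Read the scraper's output back as a list of dicts (hypothetical helper)."""
    rows = []
    # newline="" is what the csv module expects when opening files
    with open(filename, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            # Skip blank or malformed lines instead of failing on them
            if len(row) != 3:
                continue
            date, title, link = row
            rows.append({"date": date, "title": title, "link": link})
    return rows
```

Calling loadNews() with no argument reads the default output file from the current working directory.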

Source Code

import requests
import json
import pandas as pd
import time

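def fetchUrl(url):
    # Browser-like headers so the request resembles an ordinary page visit
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
    }

    # The timeout keeps the script from hanging forever on a stalled connection
    r = requests.get(url, headers=header, timeout=10)
    # Let requests guess the text encoding from the response body
    r.encoding = r.apparent_encoding
    return r.text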

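def parseHtml(html, page):
    # The response body is JSON; the articles sit under the 'data' key
    jsObj = json.loads(html)
    tempList = jsObj['data']

    newsList = []

    for item in tempList:

        if page == 1:
            # Page 1: each item is the article record itself
            title = item['title']
            link = item['url']
            date = item['time']
        else:
            # Page 2 onward: iteration yields keys, so look each record up by key
            title = tempList[item]['title']
            link = tempList[item]['url']
            date = tempList[item]['time']

        # Strip carriage returns from the title
        title = title.replace("\r", "")

        # A value without "/" is a bare time, so prepend today's date
        if "/" not in date:
            today = time.strftime("%Y/%m/%d ", time.localtime())
            date = today + date

        print([date, title, link])
        newsList.append([date, title, link])

    return newsList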

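def saveData(data, filename):
    # Append rows (date, title, link) with no header; re-running the script
    # adds to the existing file rather than overwriting it
    dataframe = pd.DataFrame(data)
    dataframe.to_csv(filename, mode='a', index=False, sep=',', header=False)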

def spiderManager(TotalPage, filename):

    if TotalPage < 1:
        # No page limit: keep going until a page comes back short
        page = 1
        while True:
            url = "https://news.ltn.com.tw/ajax/breakingnews/all/%d" % page
            html = fetchUrl(url)
            data = parseHtml(html, page)
            saveData(data, filename)
            print("----" * 20)
            # Fewer than 20 items is treated as the last page
            if len(data) < 20:
                break
            page += 1

    else:
        # Fixed range: scrape pages 1 through TotalPage
        for page in range(1, TotalPage + 1):
            url = "https://news.ltn.com.tw/ajax/breakingnews/all/%d" % page
            html = fetchUrl(url)
            data = parseHtml(html, page)
            saveData(data, filename)
            print("----" * 20)


if __name__ == "__main__":

    # Output file name
    filename = "自由时报.csv"

    # Page range to scrape: pages 1 through totalPage; if totalPage is 0, scrape everything
    totalPage = 0

    spiderManager(totalPage, filename)
    print("Done")
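
parseHtml above picks its parsing branch by page number. One possible alternative is to branch on the shape of the data field instead. The sketch below assumes that later pages return an object keyed by index, as the tempList[item] lookup implies; the names normalizeItems and fullDate and the two sample payloads are made up for illustration, and nothing here touches the network.

```python
import json
import time

def normalizeItems(payload):
    """Return the article records as a list, whether 'data' is a list or a dict."""
    data = json.loads(payload)["data"]
    return list(data.values()) if isinstance(data, dict) else list(data)

def fullDate(date, now=None):
    """Prefix a date when the value is a bare time such as '10:30'."""
    if "/" in date:
        return date
    # 'now' is an optional struct_time, mainly so the function can be tested
    return time.strftime("%Y/%m/%d ", now or time.localtime()) + date

# Made-up sample payloads mimicking the two shapes (not real site responses)
page1 = '{"data": [{"title": "A", "url": "u1", "time": "10:30"}]}'
page2 = '{"data": {"20": {"title": "B", "url": "u2", "time": "2020/12/14 09:00"}}}'

for payload in (page1, page2):
    for item in normalizeItems(payload):
        print([fullDate(item["time"]), item["title"], item["url"]])
```

With this shape check the page argument is no longer needed for parsing, at the cost of relying on the assumption about the response format stated above.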

Python Source Code | Scraping the Liberty Times News Site

机灵鹤 • December 15, 2020

