
Python crawler - scraping hot news from China News Network and Toutiao (Today's Headlines)

Nickel Titanium Alloy News · 2021-08-09 10:35



In the browser's developer tools you can see the relevant data interface; it contains the news headlines and the URL addresses of the news detail pages.

How to extract the URL addresses

 1. Convert the response to JSON and take the value from its key-value pairs;
 2. Use a regular expression to match the URL addresses (both are sketched below).
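A minimal sketch of both options, assuming the interface returns JSON with "url" fields as described above (the sample string and its "docs" key are illustrative, not real interface output):

import json
import re

sample = '{"docs":[{"title":"Example headline","url":"https://www.chinanews.com/gj/2021/08-09/100001.shtml"}]}'

# Option 1: parse as JSON and read the value by key
urls_from_json = [doc['url'] for doc in json.loads(sample)['docs']]

# Option 2: match the URL addresses with a regular expression
urls_from_re = re.findall('"url":"(.*?)"', sample)

print(urls_from_json)
print(urls_from_re)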

Pages are turned via the pager parameter in the interface data link, which corresponds to the page number (a small sketch follows).
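A minimal sketch of building the paginated interface URLs, using the link format from the full code further below:

base = 'https://channel.chinanews.com/cns/cjs/gj.shtml?pager={}&pagenum=9&t=5_58'
for page in range(1, 4):
    print(base.format(page))  # pager=1, pager=2, pager=3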

On the details page, the news content sits in p tags inside a div tag, so parsing the page in the usual way yields the news content (sketched below).
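A minimal sketch of that parse with parsel, assuming the detail-page layout used in the full code below (paragraph text under div.left_zw; the HTML string here is illustrative):

import parsel

html = '<div class="left_zw"><p>First paragraph.</p><p>Second paragraph.</p></div>'
selector = parsel.Selector(text=html)
paragraphs = selector.css('div.left_zw p::text').getall()
print(''.join(paragraphs))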

 Save mode

TXT text format / PDF format (a PDF sketch follows)
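The full code below saves TXT only. A minimal sketch of the PDF option, assuming the fpdf package (an assumption; the article does not name a PDF library). Note that fpdf's built-in fonts are Latin-1 only, so a Unicode TTF font would have to be registered before writing real Chinese news content:

from fpdf import FPDF

def download_pdf(content, title):
    # Sketch only: built-in fonts don't cover Chinese; register a Unicode font for real use.
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font('helvetica', size=12)
    pdf.multi_cell(0, 10, content)
    pdf.output(title + '.pdf')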

 Summary of overall crawling ideas
On the column list page, click "more news content" to get the interface data URL
From the data returned by the interface URL, match the URLs of the news detail pages
Parse the pages with the usual tools (re, CSS selectors, etc.) to extract title and content
Save the data
import parsel
import requests
import re


# Get web page source code
def get_html(html_url):
    """
    Get the web page source code response
    :param html_url: web page URL address
    :return: web page source code
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    }
    response = requests.get(url=html_url, headers=headers)
    return response


# Get the URL address of each news article
def get_page_url(html_data):
    """
    Get the URL address of each news article
    :param html_data: response.text
    :return: URL addresses of the news articles
    """
    page_url_list = re.findall('"url":"(.*?)"', html_data)
    return page_url_list


# File names cannot contain special characters, so news titles need cleaning
def file_name(name):
    """
    File naming cannot carry special characters
    :param name: news title
    :return: title without special characters
    """
    replace = re.compile(r'[\\/:*?"<>|]')
    new_name = re.sub(replace, '_', name)
    return new_name


# Save data
def download(content, title):
    """
    Save the news content to a TXT file with open()
    :param content: news content
    :param title: news title
    :return:
    """
    path = 'news\\' + title + '.txt'  # assumes a "news" folder already exists
    with open(path, mode='a', encoding='utf-8') as f:
        f.write(content)
        print('Saving', title)


# Main function
def main(url):
    """
    Main function
    :param url: URL address of the news list page
    :return:
    """
    html_data = get_html(url).text  # interface data, response.text
    lis = get_page_url(html_data)  # list of news URL addresses
    for li in lis:
        page_data = get_html(li).content.decode('utf-8', 'ignore')  # news detail page
        selector = parsel.Selector(page_data)
        title = re.findall('<title>(.*?)</title>', page_data, re.S)[0]  # news headline
        new_title = file_name(title)
        new_data = selector.css('#cont_1_1_2 div.left_zw p::text').getall()
        content = ''.join(new_data)
        download(content, new_title)


if __name__ == '__main__':
    for page in range(1, 101):
        url_1 = 'https://channel.chinanews.com/cns/cjs/gj.shtml?pager={}&pagenum=9&t=5_58'.format(page)
        main(url_1)

In the browser developer mode (Network tab), you can quickly find a request whose URL contains '?category=news_hot'.

As long as you find the request URL of this file, you can crawl the page with Python requests.

Viewing the URL of the request, the link is: https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao&widen=1&max_behot_time=0&max_behot_time_tmp=0&tadrequire=true&as=A1B5AC16548E0FA&cp=5C647E601F9AEE1&_signature=F09fyaaaszbjisc9ouu9mxdpx3, where max_behot_time is obtained from the returned JSON data (a small sketch follows):
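A minimal sketch of reading max_behot_time from the feed response for the next request; the 'next' key is an assumption about the response layout, since the article only says the value comes from the returned JSON:

import requests

api = 'https://www.toutiao.com/api/pc/feed/'
params = {'category': 'news_hot', 'utm_source': 'toutiao', 'widen': 1,
          'max_behot_time': 0, 'max_behot_time_tmp': 0, 'tadrequire': 'true'}
headers = {'user-agent': 'Mozilla/5.0'}
resp = requests.get(api, params=params, headers=headers)
next_time = resp.json().get('next', {}).get('max_behot_time', 0)  # assumed key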

Analyses of the as and cp algorithms shared online show that both parameters are generated in the JS file home_4abea46.js; the specific algorithm is as follows:

!function(t) {
    var e = {};
    e.getHoney = function() {
        var t = Math.floor((new Date).getTime() / 1e3)
          , e = t.toString(16).toUpperCase()
          , i = md5(t).toString().toUpperCase();
        if (8 != e.length)
            return {
                as: "479BB4B7254C150",
                cp: "7E0AC8874BB0985"
            };
        for (var n = i.slice(0, 5), a = i.slice(-5), s = "", o = 0; 5 > o; o++)
            s += n[o] + e[o];
        for (var r = "", c = 0; 5 > c; c++)
            r += e[c + 3] + a[c];
        return {
            as: "A1" + s + e.slice(-3),
            cp: e.slice(0, 3) + r + "E1"
        }
    },
    t.ascp = e
}(window, document)

The Python code to obtain the as and cp values is as follows:

import hashlib
import time


def get_as_cp():  # obtains the as and cp parameters; ported from Toutiao's encrypted JS file home_4abea46.js
    zz = {}
    now = round(time.time())
    print(now)  # the current computer time (Unix timestamp)
    e = hex(int(now)).upper()[2:]  # hex() converts an integer to its hexadecimal string representation
    print('e:', e)
    a = hashlib.md5()  # hashlib.md5().hexdigest() creates a hash object and returns the hexadecimal result
    print('a:', a)
    a.update(str(int(now)).encode('utf-8'))
    i = a.hexdigest().upper()
    print('i:', i)
    if len(e) != 8:
        zz = {'as': '479BB4B7254C150',
              'cp': '7E0AC8874BB0985'}
        return zz
    n = i[:5]
    a = i[-5:]
    r = ''
    s = ''
    for i in range(5):
        s = s + n[i] + e[i]
    for j in range(5):
        r = r + e[j + 3] + a[j]
    zz = {
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }
    print('zz:', zz)
    return zz
In this way the complete link is constructed. One more point: the JSON data can also be obtained after removing the _signature parameter (the sketch at the end requests the feed without it).
import requests
import json
from openpyxl import Workbook
import time
import hashlib
import os
import datetime

start_url = 'https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao&widen=1&max_behot_time='
url = 'https://www.toutiao.com'
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
cookies = {'tt_webid': '6649949084894053895'}  # the cookie can be copied from the browser, to avoid being banned by Toutiao
max_behot_time = '0'  # link parameter
title = []        # stores news headlines
source_url = []   # stores links to the news
s_url = []        # stores full links to the news
source = []       # stores the accounts that published the news
media_url = {}    # stores full links to the accounts


def get_as_cp():  # obtains the as and cp parameters; ported from Toutiao's encrypted JS file home_4abea46.js
    zz = {}
    now = round(time.time())
    print(now)  # the current computer time (Unix timestamp)
    e = hex(int(now)).upper()[2:]  # hexadecimal string representation of the timestamp
    print('e:', e)
    a = hashlib.md5()  # md5 hash object; hexdigest() returns the hexadecimal result
    print('a:', a)
    a.update(str(int(now)).encode('utf-8'))
    i = a.hexdigest().upper()
    print('i:', i)
    if len(e) != 8:
        zz = {'as': '479BB4B7254C150',
              'cp': '7E0AC8874BB0985'}
        return zz
    n = i[:5]
    a = i[-5:]
    r = ''
    s = ''
    for i in range(5):
        s = s + n[i] + e[i]
    for j in range(5):
        r = r + e[j + 3] + a[j]
    zz = {
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }
    print('zz:', zz)
    return zz
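A minimal sketch of how the pieces above could be combined to request one page of hot news without the _signature parameter; the item keys ('title', 'source_url') and the 'next' key used for paging are assumptions about the response layout, not confirmed by the article:

def get_feed_page(behot_time='0'):
    zz = get_as_cp()
    full_url = (start_url + behot_time
                + '&max_behot_time_tmp=' + behot_time
                + '&tadrequire=true&as=' + zz['as'] + '&cp=' + zz['cp'])
    resp = requests.get(full_url, headers=headers, cookies=cookies)
    data = resp.json()
    for item in data.get('data', []):   # assumed key for the article list
        title.append(item.get('title'))
        source_url.append(item.get('source_url'))
    return str(data.get('next', {}).get('max_behot_time', '0'))  # assumed paging key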

		   