original reddit post: https://www.reddit.com/r/webscraping/comments/1kpwou5/login_form_questions/
So, they wanted to use an API to log in but received a non-200 response.
target website: https://www.costar.com/
Click the login button at the top right. The website redirects to another URL:
https://secure.costargroup.com/login?signin=8e10875e6eeb2ea3856ae6da5659d78c
Click Log In and we get a POST request.
Look at the response. If we enter a correct username and password, I guess it will redirect us to an after-login page. Without valid credentials, though, we can still use this message to check that the request itself was sent correctly: "Invalid username/password combination".
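The check above can be sketched as a small helper. This is only an illustration: the function name and the returned labels are hypothetical, and the only string taken from the site is the error message quoted above.

```python
# The error text the login page shows when the form was accepted
# but the credentials were wrong (quoted from the response above).
INVALID_MARKER = "Invalid username/password combination"


def classify_login_response(html: str) -> str:
    """Classify a login response body by looking for the known error text."""
    if INVALID_MARKER in html:
        # The server processed the form and rejected the credentials,
        # so the request itself (cookies and token) was well-formed.
        return "bad_credentials"
    # Marker absent: either the login worked or the request was malformed.
    # The body alone cannot tell these two cases apart.
    return "unknown"
```

With a requests response this would be called as classify_login_response(response.text).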
If I delete all cookies and send a POST request to this API, that sentence does not show up. I also noticed that if I delete idsrv.xsrf from the params, the returned page does not show that sentence either, which means my request is not correct. After several tries, I confirmed that two cookie values are required: SignInMessage.8e10875e6eeb2ea3856ae6da5659d78c and idsrv.xsrf.
It's not a necessary step, btw XD
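If you want to verify the cookies before sending the login POST, a minimal sketch could look like this. It assumes the SignInMessage cookie always carries a per-login suffix (as in the name above), so it is matched by prefix; the helper name and constants are hypothetical.

```python
# Cookie that must be present under exactly this name.
REQUIRED_EXACT = {"idsrv.xsrf"}
# Cookie whose name ends in a per-login id, so only the prefix is checked
# (assumption based on the single example in this post).
REQUIRED_PREFIXES = ("SignInMessage.",)


def missing_cookies(cookie_names):
    """Return the required cookies that are absent from cookie_names."""
    names = set(cookie_names)
    missing = sorted(REQUIRED_EXACT - names)
    for prefix in REQUIRED_PREFIXES:
        if not any(name.startswith(prefix) for name in names):
            missing.append(prefix + "*")
    return missing
```

With a requests session, the names can be passed as missing_cookies(session.cookies.keys()); an empty list means both cookies are in the jar.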
signin and idsrv.xsrf
There are two idsrv.xsrf values: one in the request params (the POST form data), one in the cookies. Here, we are talking about the one in the params.
Search for the value of signin. It's in a redirect link, which means that if we send a GET request to this URL, response.text will be the login page.
Note: You can't retrieve the location key from the response headers, because the request has already been redirected to the login page.
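Since the redirect is followed automatically, the final URL is still available: in requests it is response.url, and the intermediate hops are kept in response.history. Assuming the final URL has the same shape as the login URL shown earlier, the signin value could be read from it with a sketch like this (signin_from_url is a hypothetical helper name):

```python
from urllib.parse import urlsplit, parse_qs


def signin_from_url(url: str) -> str:
    """Return the signin query parameter from a full or relative login URL."""
    query = urlsplit(url).query
    values = parse_qs(query).get("signin", [])
    if not values:
        raise ValueError(f"no signin parameter in {url!r}")
    return values[0]
```

The same helper also works on a relative form action such as /login?signin=..., which is a bit sturdier than splitting the string on "=".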
We can retrieve the values of both signin and idsrv.xsrf from the login page's response text, which is the redirected response of the authorize API.
Use XPath to extract them.
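from lxml import etree

def extract(resp):
    # Parse the login page HTML returned after the authorize redirect
    tree = etree.HTML(resp)
    # The form action carries the signin id
    signinform = tree.xpath('//form[@id="signinform"]/@action')[0]
    print(signinform)  # /login?signin=11.......d736d93e8c1b15ee
    # Hidden input holding the anti-forgery token sent in the POST body
    xsrf_token = tree.xpath('//input[@name="idsrv.xsrf"]/@value')[0]
    print(xsrf_token)
    return signinform, xsrf_token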
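If lxml is not available, the same two values can be pulled out with the standard library's html.parser instead of XPath. This is an alternative sketch, not the method used in this post; the class and function names are hypothetical, and it assumes the same form id and input name as the XPath queries above.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect the signin form action and the idsrv.xsrf hidden input."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.xsrf_token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == "signinform":
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("name") == "idsrv.xsrf":
            self.xsrf_token = attrs.get("value")


def extract_stdlib(html):
    """Return (form action, xsrf token), or raise if either is missing."""
    parser = LoginFormParser()
    parser.feed(html)
    if parser.action is None or parser.xsrf_token is None:
        raise ValueError("login form or idsrv.xsrf input not found")
    return parser.action, parser.xsrf_token
```

Unlike indexing an XPath result with [0], this raises a descriptive error when the page is not the login form, for example when the request was blocked.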
cookies
Search for the value of SignInMessage.8e10875e6eeb2ea3856ae6da5659d78c.
Search for the value of idsrv.xsrf.
Use a requests session to keep state between requests. It manages cookies automatically.
session = requests.session()
code
from lxml import etree
import requests
def authorize():
    # Browser-like headers copied from the original request
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-language': 'en',
        'cache-control': 'no-cache',
        'pragma': 'no-cache',
        'priority': 'u=0, i',
        'referer': 'https://www.costar.com/',
        'sec-ch-ua': '"Chromium";v="136", "Google Chrome";v="136", "Not.A/Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'cross-site',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36',
    }
    params = {
        'client_id': 'costar',
        'nonce': '7e746f9b-9ca8-467a-25d1-71e3fb9ba889',
        'response_type': 'code',
        'response_mode': 'form_post',
        'scope': 'openid profile email address phone offline_access product_user session',
        'redirect_uri': 'https://product.costar.com/home/auth-callback',
        'acr_values': '',
        'locale': 'en-US',
    }
    # Uses the module-level session so the cookies set here are reused by the login POST.
    # The redirect to the login page is followed automatically.
    response = session.get('https://secure.costargroup.com/connect/authorize', params=params, cookies={}, headers=headers)
    # print(response.headers)
    return response.text
def extract(resp):
    tree = etree.HTML(resp)
    signinform = tree.xpath('//form[@id="signinform"]/@action')[0]
    print(signinform)  # /login?signin=11.......d736d93e8c1b15ee
    xsrf_token = tree.xpath('//input[@name="idsrv.xsrf"]/@value')[0]
    print(xsrf_token)
    return signinform, xsrf_token
if __name__ == '__main__':
    # One session for both requests so the cookies from authorize() are sent with the POST
    session = requests.session()
    resp = authorize()
    signinform, xsrf_token = extract(resp)
    # The signin id is the value after '=' in the form action
    signin = signinform.split('=')[-1]
    params = {
        'signin': signin,
    }
    referer = 'https://secure.costargroup.com' + signinform
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-language': 'en',
        'cache-control': 'no-cache',
        'content-type': 'application/x-www-form-urlencoded',
        'origin': 'https://secure.costargroup.com',
        'pragma': 'no-cache',
        'priority': 'u=0, i',
        'referer': referer,
        'sec-ch-ua': '"Chromium";v="136", "Google Chrome";v="136", "Not.A/Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36',
    }
    # Form body: the xsrf token from the login page plus the credentials
    data = {
        'idsrv.xsrf': xsrf_token,
        'sessionId': '',
        'username': 'your username',
        'password': 'your password',
    }
    response = session.post("https://secure.costargroup.com/login", params=params, headers=headers, data=data)
    print(response.text)