Commit 8745834

Updates
No longer depends on the external driver. Updated to dismiss the cookie banner.
1 parent 11c91fc commit 8745834

1 file changed

Lines changed: 183 additions & 68 deletions

src/Cap2/scrap.ipynb

@@ -32,7 +32,7 @@
3232
"\n",
3333
"2.- Install the selenium library (pip install selenium)\n",
3434
"\n",
35-
"3.- Have a driver file named *chromedriver.exe*. In our case we assume it is located in the folder c:/hlocal/tdm. The driver matching our version of Chrome can be downloaded from https://sites.google.com/a/chromium.org/chromedriver/downloads"
35+
"3.- Have a driver file named *chromedriver.exe*. We will obtain it through the chromedriver_autoinstaller library"
3636
]
3737
},
3838
{
@@ -44,81 +44,83 @@
4444
"We begin by opening a Chrome session automatically\n",
4545
"\n",
4646
"\n",
47-
"**Common errors**\n",
48-
"\n",
49-
"When running this code we may get an error such as:\n",
50-
" \n",
51-
" SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 87. Current browser version is 90.0.4430.212 with binary path.....\n",
52-
" \n",
53-
" or similar. This means that our driver version does not match our browser version. Although the message reports the Chrome version, we can also check it in the browser itself, under *Settings* (in the menu that opens when clicking the three vertical dots at the top right) + *About Chrome*.\n",
54-
" \n",
55-
" With this information we go to https://sites.google.com/a/chromium.org/chromedriver/downloads, download and unzip the matching chromedriver file, and set its path in the chromedriver variable\n",
56-
" \n",
57-
" \n",
58-
" Another possible error is one of type *FileNotFound*, which almost certainly means we need to change the path stored in the chromedriver variable in the code below\n",
59-
"\n",
60-
"\n",
6147
"**Important**\n",
62-
"Once we manage to get the browser to open we must not type anything in it, nor close it the control will be done from the Python program"
48+
"Once the browser opens we must not type anything in it or close it; we will control it from the Python program"
6349
]
6450
},
6551
{
6652
"cell_type": "code",
67-
"execution_count": 6,
53+
"execution_count": 8,
6854
"metadata": {},
6955
"outputs": [
7056
{
7157
"name": "stdout",
7258
"output_type": "stream",
7359
"text": [
74-
"Requirement already up-to-date: selenium in c:\\users\\rafa\\appdata\\roaming\\python\\python38\\site-packages (4.4.3)\n",
75-
"Requirement already satisfied, skipping upgrade: trio~=0.17 in c:\\users\\rafa\\appdata\\roaming\\python\\python38\\site-packages (from selenium) (0.21.0)\n",
76-
"Requirement already satisfied, skipping upgrade: certifi>=2021.10.8 in c:\\users\\rafa\\appdata\\roaming\\python\\python38\\site-packages (from selenium) (2022.9.14)\n",
77-
"Requirement already satisfied, skipping upgrade: urllib3[socks]~=1.26 in c:\\users\\rafa\\appdata\\roaming\\python\\python38\\site-packages (from selenium) (1.26.9)\n",
78-
"Requirement already satisfied, skipping upgrade: trio-websocket~=0.9 in c:\\users\\rafa\\appdata\\roaming\\python\\python38\\site-packages (from selenium) (0.9.2)\n",
79-
"Requirement already satisfied, skipping upgrade: idna in d:\\instalado\\anacondainstalado\\lib\\site-packages (from trio~=0.17->selenium) (2.10)\n",
80-
"Requirement already satisfied, skipping upgrade: sniffio in c:\\users\\rafa\\appdata\\roaming\\python\\python38\\site-packages (from trio~=0.17->selenium) (1.3.0)\n",
81-
"Requirement already satisfied, skipping upgrade: cffi>=1.14; os_name == \"nt\" and implementation_name != \"pypy\" in d:\\instalado\\anacondainstalado\\lib\\site-packages (from trio~=0.17->selenium) (1.14.3)\n",
82-
"Requirement already satisfied, skipping upgrade: sortedcontainers in d:\\instalado\\anacondainstalado\\lib\\site-packages (from trio~=0.17->selenium) (2.2.2)\n",
83-
"Requirement already satisfied, skipping upgrade: outcome in c:\\users\\rafa\\appdata\\roaming\\python\\python38\\site-packages (from trio~=0.17->selenium) (1.2.0)\n",
84-
"Requirement already satisfied, skipping upgrade: attrs>=19.2.0 in d:\\instalado\\anacondainstalado\\lib\\site-packages (from trio~=0.17->selenium) (20.3.0)\n",
85-
"Requirement already satisfied, skipping upgrade: async-generator>=1.9 in d:\\instalado\\anacondainstalado\\lib\\site-packages (from trio~=0.17->selenium) (1.10)\n",
86-
"Requirement already satisfied, skipping upgrade: PySocks!=1.5.7,<2.0,>=1.5.6; extra == \"socks\" in d:\\instalado\\anacondainstalado\\lib\\site-packages (from urllib3[socks]~=1.26->selenium) (1.7.1)\n",
87-
"Requirement already satisfied, skipping upgrade: wsproto>=0.14 in c:\\users\\rafa\\appdata\\roaming\\python\\python38\\site-packages (from trio-websocket~=0.9->selenium) (1.2.0)\n",
88-
"Requirement already satisfied, skipping upgrade: pycparser in d:\\instalado\\anacondainstalado\\lib\\site-packages (from cffi>=1.14; os_name == \"nt\" and implementation_name != \"pypy\"->trio~=0.17->selenium) (2.20)\n",
89-
"Requirement already satisfied, skipping upgrade: h11<1,>=0.9.0 in c:\\users\\rafa\\appdata\\roaming\\python\\python38\\site-packages (from wsproto>=0.14->trio-websocket~=0.9->selenium) (0.13.0)\n"
60+
"Instalando módulos\n",
61+
"selenium encontrado\n",
62+
"chromedriver_autoinstaller encontrado\n",
63+
"¡Terminado!\n"
9064
]
9165
}
9266
],
9367
"source": [
94-
"# run this cell only if we want to install or upgrade selenium; \n",
95-
"# after doing so we often need to do \"Kernel->restart\" for the changes to take effect\n",
68+
"modules = [\"selenium\",\"chromedriver_autoinstaller\"]\n",
69+
"\n",
70+
"\n",
9671
"import sys\n",
97-
"!{sys.executable} -m pip install --upgrade --user selenium"
72+
"import os.path\n",
73+
"from subprocess import check_call\n",
74+
"import importlib.util\n",
75+
"import os\n",
76+
"\n",
77+
"def instala(modules):\n",
78+
"    print(\"Instalando módulos\")\n",
79+
"    for m in modules:\n",
80+
"        # for the import, strip any [...] extras and ==... version pin\n",
81+
"        p = m.find(\"[\")\n",
82+
"        mi = m if p==-1 else m[:p]\n",
83+
"        p = mi.find(\"==\")\n",
84+
"        mi = mi if p==-1 else mi[:p]\n",
85+
"        spec = importlib.util.find_spec(mi)\n",
86+
"        if spec is not None:\n",
87+
"            print(m,\" encontrado\")\n",
88+
"        else:\n",
89+
"            print(m,\" No encontrado, instalando...\",end=\"\") \n",
90+
"            try: \n",
91+
"                r = check_call([sys.executable, \"-m\", \"pip\", \"install\", \"--user\", m])\n",
92+
"                print(\"¡hecho!\")\n",
93+
"            except Exception:\n",
94+
"                print(\"¡Problema al instalar \",m,\"! ¿seguro que el módulo existe?\",sep=\"\")\n",
95+
"\n",
96+
"    print(\"¡Terminado!\")\n",
97+
"\n",
98+
"instala(modules) "
9899
]
99100
},
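Review note: the name clean-up done inside `instala` (dropping a `[...]` extras suffix and a `==...` version pin before checking whether the module can be imported) can be isolated and checked on its own. The following is a minimal sketch, not the notebook's code: `import_name` and `is_installed` are hypothetical names introduced here, and it assumes, as the notebook does, that the importable module name equals the pip name once those suffixes are removed.

```python
import importlib.util


def import_name(requirement: str) -> str:
    """Return the part of a pip requirement before any '[extras]' or '==version'."""
    name = requirement.split("[", 1)[0]   # drop extras such as [socks]
    name = name.split("==", 1)[0]         # drop an exact version pin
    return name.strip()


def is_installed(requirement: str) -> bool:
    """True if a top-level module with the cleaned-up name can be located."""
    return importlib.util.find_spec(import_name(requirement)) is not None


# Hypothetical requirement strings, used only to exercise the helpers.
for req in ["selenium", "urllib3[socks]==1.26.9", "json"]:
    print(req, "->", import_name(req), is_installed(req))
```

Packages whose import name differs from their pip name (for example `beautifulsoup4`, imported as `bs4`) would need an explicit mapping; this sketch does not handle that case.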
100101
{
101102
"cell_type": "code",
102-
"execution_count": 1,
103+
"execution_count": 9,
103104
"metadata": {},
104-
"outputs": [
105-
{
106-
"name": "stderr",
107-
"output_type": "stream",
108-
"text": [
109-
"<ipython-input-1-cfdeb6085767>:7: DeprecationWarning: executable_path has been deprecated, please pass in a Service object\n",
110-
" driver = webdriver.Chrome(executable_path=chromedriver,options=chrome_options)\n"
111-
]
112-
}
113-
],
105+
"outputs": [],
114106
"source": [
115-
"chromedriver = \"./chromedriver.exe\" # change this variable to the path to our chromedriver\n",
116-
"import os\n",
117-
"from selenium import webdriver # if this fails, run pip install --user selenium from the Anaconda prompt\n",
118-
"os.environ[\"webdriver.chrome.driver\"] = chromedriver\n",
107+
"import pandas as pd\n",
108+
"from bs4 import BeautifulSoup\n",
109+
"from selenium import webdriver\n",
110+
"import chromedriver_autoinstaller\n",
111+
"\n",
112+
"# setup chrome options\n",
119113
"chrome_options = webdriver.ChromeOptions()\n",
120-
"chrome_options.add_argument('--no-sandbox')\n",
121-
"driver = webdriver.Chrome(executable_path=chromedriver,options=chrome_options)"
114+
"#chrome_options.add_argument('--headless') # ensure GUI is off\n",
115+
"#chrome_options.add_argument('--no-sandbox')\n",
116+
"#chrome_options.add_argument('--disable-dev-shm-usage')\n",
117+
"\n",
118+
"# download a chromedriver matching the installed Chrome and add it to PATH\n",
119+
"chromedriver_autoinstaller.install()\n",
120+
"\n",
121+
"\n",
122+
"# set up the webdriver\n",
123+
"driver = webdriver.Chrome(options=chrome_options)"
122124
]
123125
},
124126
{
@@ -131,14 +133,48 @@
131133
},
132134
{
133135
"cell_type": "code",
134-
"execution_count": 2,
136+
"execution_count": 17,
135137
"metadata": {},
136138
"outputs": [],
137139
"source": [
138140
"url = 'https://www1.sedecatastro.gob.es/CYCBienInmueble/OVCBusqueda.aspx'\n",
139141
"driver.get(url)"
140142
]
141143
},
144+
{
145+
"cell_type": "markdown",
146+
"metadata": {},
147+
"source": [
148+
"It is increasingly common that, when the page loads, we have to click \"Aceptar\" (accept) to dismiss the initial cookie banner"
149+
]
150+
},
151+
{
152+
"cell_type": "code",
153+
"execution_count": 18,
154+
"metadata": {},
155+
"outputs": [
156+
{
157+
"name": "stdout",
158+
"output_type": "stream",
159+
"text": [
160+
"Cookies aceptadas.\n"
161+
]
162+
}
163+
],
164+
"source": [
165+
"from selenium.webdriver.common.by import By\n",
166+
"from selenium.webdriver.chrome.options import Options\n",
167+
"import time\n",
168+
"# accept the cookies\n",
169+
"\n",
170+
"try:\n",
171+
"    cookies = driver.find_element(By.LINK_TEXT, \"Aceptar cookies\")\n",
172+
"    cookies.click()\n",
173+
"    print(\"Cookies aceptadas.\")\n",
174+
"except Exception as e:\n",
175+
"    print(\"No se encontró el botón de aceptar cookies o ocurrió un error:\", e)\n"
176+
]
177+
},
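Review note: `find_element` raises immediately if the banner has not been rendered yet, so the cell above can report a failure on a slow page load. One common workaround is to retry for a few seconds before giving up. The sketch below is an illustration under stated assumptions, not part of this commit: `click_when_present` is a hypothetical helper, the 5-second timeout is arbitrary, and it relies only on the driver exposing `find_element(by, value)` and the returned element exposing `click()`. Selenium also ships its own explicit-wait helper, `WebDriverWait`, which is the more usual choice in real code.

```python
import time


def click_when_present(driver, by, value, timeout=5.0, poll=0.25):
    """Retry driver.find_element(by, value) until it succeeds, then click.

    Returns True if the element was clicked, False if it could not be
    found and clicked within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            driver.find_element(by, value).click()
            return True
        except Exception:
            # Element missing (or not clickable) so far: give up once the
            # deadline has passed, otherwise wait briefly and try again.
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)
```

With the notebook's objects this would be called as `click_when_present(driver, By.LINK_TEXT, "Aceptar cookies")`, printing the success or failure message depending on the returned value.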
142178
{
143179
"cell_type": "markdown",
144180
"metadata": {},
@@ -150,7 +186,7 @@
150186
},
151187
{
152188
"cell_type": "code",
153-
"execution_count": 3,
189+
"execution_count": 19,
154190
"metadata": {},
155191
"outputs": [],
156192
"source": [
@@ -168,7 +204,7 @@
168204
},
169205
{
170206
"cell_type": "code",
171-
"execution_count": 4,
207+
"execution_count": 20,
172208
"metadata": {},
173209
"outputs": [],
174210
"source": [
@@ -179,8 +215,8 @@
179215
"lat.send_keys(latitud)\n",
180216
"lon.send_keys(longitud)\n",
181217
"\n",
182-
"datos = driver.find_element(By.ID,\"ctl00_Contenido_btnDatos\")\n",
183-
"datos.click()"
218+
"datos = driver.find_element(By.NAME, \"ctl00$Contenido$btnDatos\")\n",
219+
"datos.click()\n"
184220
]
185221
},
186222
{
@@ -192,7 +228,7 @@
192228
},
193229
{
194230
"cell_type": "code",
195-
"execution_count": 6,
231+
"execution_count": 21,
196232
"metadata": {},
197233
"outputs": [
198234
{
@@ -229,9 +265,70 @@
229265
},
230266
{
231267
"cell_type": "code",
232-
"execution_count": null,
268+
"execution_count": 22,
233269
"metadata": {},
234-
"outputs": [],
270+
"outputs": [
271+
{
272+
"name": "stdout",
273+
"output_type": "stream",
274+
"text": [
275+
"Formulario master\n",
276+
"Castellano\n",
277+
"ICONO CORREO ELECTRÓNICO\n",
278+
"CONTÁCTENOS\n",
279+
"Icono página de inicio\n",
280+
"Consulta y certificación de Bien Inmueble\n",
281+
"Volver\n",
282+
"CARTOGRAFÍA\n",
283+
"CONSULTA DESCRIPTIVA Y GRÁFICA\n",
284+
"IMPRIMIR DATOS\n",
285+
"VISOR 3D\n",
286+
"DATOS DESCRIPTIVOS DEL INMUEBLE\n",
287+
"Referencia catastral\n",
288+
"7801701DF0070S0001QY \n",
289+
"Localización\n",
290+
"PZ NOVA 20\n",
291+
"08640 OLESA DE MONTSERRAT (BARCELONA)\n",
292+
"Clase\n",
293+
"Urbano\n",
294+
"Uso principal\n",
295+
"Religioso\n",
296+
"Superficie construida\n",
297+
"3.221 m2\n",
298+
"Año construcción\n",
299+
"1400\n",
300+
"PARCELA CATASTRAL\n",
301+
"\n",
302+
"Parcela construida sin división horizontal\n",
303+
"Localización\n",
304+
"PZ NOVA 20\n",
305+
"OLESA DE MONTSERRAT (BARCELONA)\n",
306+
"Superficie gráfica\n",
307+
"1.884 m2\n",
308+
"CONSTRUCCIÓN\n",
309+
"Uso principal Escalera Planta Puerta Superficie m2 Tipo Reforma Fecha Reforma\n",
310+
"RELIGIOSO T OD OS 2.117 I Reforma mínima 1.960\n",
311+
"RELIGIOSO T OD OS 249 I Reforma mínima 1.960\n",
312+
"ALMACEN T OD OS 765 I Reforma mínima 1.960\n",
313+
"RELIGIOSO 1 00 01 13 O Reforma total 2.002\n",
314+
"RELIGIOSO 1 01 01 26 O Reforma total 2.002\n",
315+
"RELIGIOSO 1 01 01 51\n",
316+
"¿Cómo se pueden obtener datos protegidos (titularidad y valor catastral) de los inmuebles y certificados telemáticos de los mismos?\n",
317+
"\n",
318+
"\n",
319+
"\n",
320+
"\n",
321+
"\n",
322+
"\n",
323+
"\n",
324+
"\n",
325+
"Normativa reguladora\n",
326+
"Política de privacidad\n",
327+
"Accesibilidad\n",
328+
"Mapa web\n"
329+
]
330+
}
331+
],
235332
"source": [
236333
"html = driver.find_element(By.XPATH,\"/html\")\n",
237334
"print(html.text)"
@@ -246,7 +343,7 @@
246343
},
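Review note: the page text printed by `print(html.text)` in the previous hunk puts each label on one line and its value on the next, so individual fields can be picked out with plain string handling. This is a minimal sketch under that assumption: `fields_after_labels` is a hypothetical helper, and `sample` is a fragment copied from the output recorded in this commit. The layout of the live page may change, in which case locating elements with Selenium is more robust.

```python
def fields_after_labels(text, labels):
    """Map each label to the first non-empty line that follows it in `text`."""
    lines = [ln.strip() for ln in text.splitlines()]
    result = {}
    for i, line in enumerate(lines):
        if line in labels and line not in result:
            # The value is assumed to be the next non-empty line.
            for candidate in lines[i + 1:]:
                if candidate:
                    result[line] = candidate
                    break
    return result


# Fragment of the recorded cell output, used as test data.
sample = """DATOS DESCRIPTIVOS DEL INMUEBLE
Referencia catastral
7801701DF0070S0001QY
Localización
PZ NOVA 20
08640 OLESA DE MONTSERRAT (BARCELONA)
Clase
Urbano
Uso principal
Religioso
Superficie construida
3.221 m2
Año construcción
1400"""

wanted = {"Referencia catastral", "Clase", "Uso principal", "Año construcción"}
info = fields_after_labels(sample, wanted)
print(info)
```

Only the first occurrence of each label is kept, which matters here because labels such as "Localización" and "Uso principal" appear more than once in the full page text.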
247344
{
248345
"cell_type": "code",
249-
"execution_count": null,
346+
"execution_count": 23,
250347
"metadata": {},
251348
"outputs": [],
252349
"source": [
@@ -264,13 +361,31 @@
264361
},
265362
{
266363
"cell_type": "code",
267-
"execution_count": null,
364+
"execution_count": 24,
268365
"metadata": {},
269-
"outputs": [],
366+
"outputs": [
367+
{
368+
"name": "stdout",
369+
"output_type": "stream",
370+
"text": [
371+
"div\n",
372+
"div\n",
373+
"form\n",
374+
"div\n",
375+
"script\n",
376+
"a\n",
377+
"script\n",
378+
"link\n",
379+
"script\n",
380+
"script\n",
381+
"script\n"
382+
]
383+
}
384+
],
270385
"source": [
271386
"hijos = driver.find_elements(By.XPATH,\"/html/body/*\")\n",
272387
"for element in hijos:\n",
273-
" print(element.tag_name)"
388+
" print(element.tag_name)"
274389
]
275390
},
276391
{
@@ -404,7 +519,7 @@
404519
"cell_type": "markdown",
405520
"metadata": {},
406521
"source": [
407-
"By Rafael Caballero. From the book \"Big data con Python\""
522+
"By Rafael Caballero. From the book \"Big data con Python\". Thanks to José Ramón Guerra for the updates"
408523
]
409524
},
410525
{
@@ -417,7 +532,7 @@
417532
],
418533
"metadata": {
419534
"kernelspec": {
420-
"display_name": "Python 3",
535+
"display_name": "Python 3 (ipykernel)",
421536
"language": "python",
422537
"name": "python3"
423538
},
@@ -431,7 +546,7 @@
431546
"name": "python",
432547
"nbconvert_exporter": "python",
433548
"pygments_lexer": "ipython3",
434-
"version": "3.8.5"
549+
"version": "3.11.4"
435550
}
436551
},
437552
"nbformat": 4,
