Scrapy is an open source framework for creating web crawlers (AKA spiders). A common roadblock when developing Scrapy spiders, and web scraping in general, is dealing with sites that use a heavy amount of JavaScript. Since many modern websites are built on JavaScript, they require scripts to be run in order for the page to render properly.
In many cases, pages also present modals and other dialogues that need to be interacted with to show the full page. So we developed Splash, an open source tool to help you get structured data from the web. In this post we’re going to show you how you can use Splash to handle JavaScript in your Scrapy projects.
Splash is Scrapinghub’s in-house solution for JavaScript rendering, implemented in Python using Twisted and Qt. Splash is a lightweight web browser which is capable of processing multiple pages in parallel, executing custom JavaScript in the page context, and much more.
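The easiest way to set up Splash is through Docker: pull the image with ‘docker pull scrapinghub/splash’, then start it with ‘docker run -p 8050:8050 scrapinghub/splash’.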
Splash will now be running on localhost:8050. If you’re using a Docker Machine on OS X or Windows, it will be running on the IP address of Docker’s virtual machine.
If you would like to install Splash without using Docker, please refer to the documentation.
Now that Splash is running, you can test it in your browser by visiting http://localhost:8050/ (or port 8050 on the Docker virtual machine’s IP address).
On the right enter a URL (e.g. http://amazon.com) and click ‘Render me!’. Splash will display a screenshot of the page as well as charts and a list of requests with their timings. At the bottom you should see a text box containing the rendered HTML.
You can use Request to send links to Splash:
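A minimal sketch of what such a request carries, using only the standard library. The helper name, the choice of the `html` and `har` options, and the local Splash address are illustrative assumptions rather than part of the original post.

```python
import json

SPLASH_URL = "http://localhost:8050"  # assumed: Splash running locally via Docker


def build_render_json_request(url, splash_url=SPLASH_URL):
    """Return (endpoint, body, headers) for a POST to Splash's render.json."""
    endpoint = splash_url.rstrip("/") + "/render.json"
    body = json.dumps({
        "url": url,  # page Splash should load and render
        "html": 1,   # include the rendered HTML in the JSON response
        "har": 1,    # include request timings in HAR format
    })
    headers = {"Content-Type": "application/json"}
    return endpoint, body, headers


endpoint, body, headers = build_render_json_request("http://example.com")
```

Inside a spider, these values would be handed to Scrapy as `scrapy.Request(endpoint, callback=self.parse_result, method="POST", body=body, headers=headers)`, where `parse_result` is a hypothetical callback name.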
If you’re using CrawlSpider, the easiest way is to override the process_links function in your spider to replace links with their Splash equivalents:
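The link rewriting itself can be sketched as a plain function, assuming Splash is reachable at the default local address. In a real CrawlSpider this would be a method taking `self`, or a callable passed to a Rule's `process_links` argument; the `SimpleNamespace` object below is only a stand-in for the link objects Scrapy would supply.

```python
from types import SimpleNamespace
from urllib.parse import urlencode

SPLASH_RENDER_URL = "http://localhost:8050/render.html"  # assumed local Splash


def process_links(links, splash_render_url=SPLASH_RENDER_URL):
    """Rewrite each extracted link so that it is fetched through Splash."""
    for link in links:
        # Pass the original address to Splash as the 'url' query argument.
        link.url = splash_render_url + "?" + urlencode({"url": link.url})
    return links


# Stand-in for the link objects a link extractor would produce.
links = process_links([SimpleNamespace(url="http://example.com/page?id=1")])
print(links[0].url)
```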
The preferred way to integrate Splash with Scrapy is the scrapy-splash middleware; its documentation explains why it is recommended over calling Splash manually. You can install scrapy-splash using pip with ‘pip install scrapy-splash’.
To use scrapy-splash (formerly known as ScrapyJS) in your project, you first need to enable the middleware:
The middleware needs to take precedence over HttpProxyMiddleware, which by default is at position 750, so we set the middleware positions to numbers below 750.
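A settings.py fragment along these lines enables the middleware. The class paths and positions follow the scrapy-splash README as best recalled, so check them against the version you install.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Both Splash middlewares sit below 750 so they run before HttpProxyMiddleware.
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    # Re-registered at 810 so it is ordered correctly relative to Splash.
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
```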
You then need to set the SPLASH_URL setting in your project’s settings.py:
Don’t forget, if you’re using a Docker Machine on OS X or Windows, you will need to set this to the IP address of Docker’s virtual machine, e.g.:
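For illustration, the setting might look like this. The Docker Machine address in the comment is a placeholder; substitute the IP of your own virtual machine.

```python
# settings.py (fragment)
# Splash running on the local machine:
SPLASH_URL = "http://localhost:8050"

# Splash running inside a Docker Machine VM (placeholder IP, adjust to yours):
# SPLASH_URL = "http://192.168.59.103:8050"
```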
Enable SplashDeduplicateArgsMiddleware to support the cache_args feature: it saves disk space by not storing duplicate Splash arguments multiple times in a disk request queue. If Splash 2.1+ is used, the middleware also saves network traffic by not sending these duplicate arguments to the Splash server multiple times.
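This is a spider middleware, so it goes in a separate setting. The position shown is the one suggested in the scrapy-splash README as best recalled.

```python
# settings.py (fragment)
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
```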
Scrapy currently doesn’t provide a way to override request fingerprints calculation globally, so you will also have to set a custom DUPEFILTER_CLASS and a custom cache storage backend:
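Both settings point at classes shipped with scrapy-splash. The fragment below assumes the filesystem cache storage is the one in use.

```python
# settings.py (fragment)
# Duplicate filter that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
# Filesystem HTTP cache storage that uses the Splash-aware fingerprint.
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```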
If you already use another cache storage backend, you will need to subclass it and replace all calls to scrapy.utils.request.request_fingerprint with scrapy_splash.splash_request_fingerprint.
Now that the Splash middleware is enabled, you can use SplashRequest in place of scrapy.Request to render pages with Splash.
For example, if we wanted to retrieve the rendered HTML for a page, we could do something like this:
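A minimal sketch of the arguments such a request would carry, kept free of third-party imports so it can be read in isolation. The helper name and the 0.5 second wait are illustrative assumptions.

```python
def render_html_kwargs(wait=0.5):
    """Build keyword arguments for scrapy_splash.SplashRequest.

    Intended use inside a spider (requires scrapy-splash):
        yield SplashRequest(url, self.parse, **render_html_kwargs())
    """
    return {
        # Ask Splash for the rendered HTML of the page.
        "endpoint": "render.html",
        # Arguments forwarded to Splash; 'wait' is the number of seconds
        # to wait after the page has loaded before rendering it.
        "args": {"wait": wait},
    }


kwargs = render_html_kwargs()
```

In the spider's callback, `response.body` would then hold the HTML as processed by Splash's browser rather than the raw page source.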
The ‘args’ dict contains arguments to send to Splash. You can find a full list of available arguments in the HTTP API documentation. SplashRequest uses the ‘render.html’ endpoint by default, while a plain scrapy.Request routed through the middleware defaults to ‘render.json’; setting the endpoint to ‘render.html’ explicitly makes it clear that the response will be HTML.
Sometimes you may need to press a button or close a modal to view the page properly. Splash lets you run your own JavaScript code within the context of the web page you’re requesting. There are several ways you can accomplish this:
You can use the js_source parameter to send the JavaScript you want to execute. The JavaScript code is executed after the page has finished loading but before the page is rendered. This allows you to use the JavaScript code to modify the page being rendered. For example, you can do it with scrapy-splash:
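As a sketch, the JavaScript travels in the `args` dict alongside any other Splash arguments. The snippet that changes the page title and the helper name are illustrative only.

```python
# JavaScript to run in the page after loading and before rendering (illustrative).
JS_SET_TITLE = 'document.title = "My Title";'


def js_source_kwargs(js_source=JS_SET_TITLE):
    """Build keyword arguments for scrapy_splash.SplashRequest.

    Intended use inside a spider (requires scrapy-splash):
        yield SplashRequest("http://example.com", self.parse, **js_source_kwargs())
    """
    return {
        "endpoint": "render.html",
        # js_source is executed by Splash within the page context.
        "args": {"js_source": js_source},
    }


kwargs = js_source_kwargs()
```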
Splash supports Lua scripts through its execute endpoint. This is the preferred way to execute JavaScript as you can preload libraries, choose when to execute the JavaScript, and retrieve the output.
An example script would visit the URL with splash:go, read document.title with splash:evaljs, and return it in a Lua table.
You need to send that script to the execute endpoint, in the lua_source argument.
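A sketch of such a script and of the request that delivers it, using only the standard library and assuming Splash runs at the default local address. The helper names are hypothetical. `build_execute_request` only constructs the request, while `fetch_title` actually sends it and therefore needs a running Splash instance.

```python
import json
import urllib.request

SPLASH_URL = "http://localhost:8050"  # assumed local Splash instance

# Lua script: load the page, then read document.title inside the page context.
LUA_TITLE_SCRIPT = """
function main(splash)
    assert(splash:go(splash.args.url))
    local title = splash:evaljs("document.title")
    return {title = title}
end
"""


def build_execute_request(url, lua_source=LUA_TITLE_SCRIPT, splash_url=SPLASH_URL):
    """Build (without sending) a POST request for Splash's execute endpoint."""
    payload = {"lua_source": lua_source, "url": url}
    return urllib.request.Request(
        splash_url + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_title(url):
    """Send the script to Splash and return the title from its JSON reply."""
    with urllib.request.urlopen(build_execute_request(url)) as response:
        return json.loads(response.read().decode("utf-8"))["title"]


request = build_execute_request("http://example.com")
```

With scrapy-splash, the same script would instead be passed as `SplashRequest(url, self.parse, endpoint="execute", args={"lua_source": LUA_TITLE_SCRIPT})`.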
This will return a JSON object containing the title, of the form {"title": "<page title>"}.
Every script requires a main function to act as the entry point. You can return a Lua table which will be rendered as JSON, which is what we have done here. We use the splash:go function to tell Splash to visit the URL. The splash:evaljs function lets you execute JavaScript within the page context and returns the result; if you don’t need the result, you should use splash:runjs instead.
You can test your Splash scripts in your browser by visiting your Splash instance’s index page (e.g. http://localhost:8050/). It’s also possible to use Splash with an IPython notebook as an interactive web-based development environment; see the Splash documentation for more details.
It’s often the case that you need to click a button before the page is displayed. We can do that using the splash:mouse_click function:
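A sketch of a click script, wrapped in a Python string so it can be passed to scrapy-splash. The element id `button`, the short waits, and the helper name are assumptions for illustration; adapt the selector to the page you are scraping.

```python
# Lua script: locate an element, click it, then return the rendered HTML.
# The element id 'button' is a placeholder.
LUA_CLICK_SCRIPT = """
function main(splash)
    assert(splash:go(splash.args.url))
    local get_coordinates = splash:jsfunc([[
        function () {
            var rect = document.getElementById('button').getClientRects()[0];
            return {"x": rect.left, "y": rect.top};
        }
    ]])
    splash:set_viewport_full()
    splash:wait(0.1)
    local coordinates = get_coordinates()
    splash:mouse_click(coordinates.x, coordinates.y)
    -- Give the click event a moment to propagate.
    splash:wait(0.1)
    return splash:html()
end
"""


def click_request_kwargs(lua_source=LUA_CLICK_SCRIPT):
    """Build keyword arguments for scrapy_splash.SplashRequest.

    Intended use inside a spider (requires scrapy-splash):
        yield SplashRequest(url, self.parse, **click_request_kwargs())
    """
    return {"endpoint": "execute", "args": {"lua_source": lua_source}}


kwargs = click_request_kwargs()
```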
Here we use splash:jsfunc to define a function that will return the element coordinates, then make sure the element is visible with splash:set_viewport_full, and click on the element. Splash then returns the rendered HTML.
You can find more info on running JavaScript with Splash in the docs, and for a more in-depth tutorial, check out the Splash Scripts Tutorial.
We hope this tutorial gave you a nice introduction to Splash, and please let us know if you have any questions or comments!
This post was written by Richard Dowinton, a former Software Developer at Scrapinghub.