Blog
home > blog > ImportXML, GoogleLookup alternatives for Excel

Tag: Excel, lab, scraping, ideativi

ImportXML, GoogleLookup alternatives for Excel

ImportXML and GoogleLookup are usefull function in Google Spreadsheet to scrape information from internet. We developed an alternative for these and other functions in Excel.
di Maurizio Ceravolo
This is our first english blog post. I hope it's clear enough for you. :-) For the italian people, we have a different post: Alternative a ImportXML e GoogleLookup per Excel.

ImportXML is a handy feature of Google Spreadsheet through which it is possible to parse xml files and web pages to extract informations.
GoogleLookup is an equally useful function always in Google Spreadsheet through which it is possible to query Google to obtain the response to simple questions, such as: the height of a mountain, the population of a country or the chemical formula of a compound.

GoogleLookup function
GoogleLookup

There are a couple of problems, however, which limit their use. ImportXML can only be used 50 times within a spreadsheet. While for GoogleLookup, the limitation is even more stringent. The function has been deprecated and is no longer possible to use it.

I followed an interesting discussion held on Google+, where I felt a great interest in these functions. So I thought to replicate them in Excel, trying to overcome the limitations described above.

I took advantage of the experience that we have inscraping informations and the few memories of VBA that I have left. :-) By combining with a few hours of work we created importxml.xls, an excel file where they are defined our alternative functions ready for use. And, in addition to those, we have created some other useful function.

Important! To be able to use them, you have to grant the use of macros in Excel. In order to release the new features, of course, we had to write a some code.

Currently have the following functions:
  • GetHTMLFromUrl(Url, Truncate)
  • GetHTMLAndApplyRegEx(Url, Regex, NumberOfOccurences, Separator)
  • GetHTMLAndApplyGroupRegEx(Url, Regex, NumberOfOccurences, Separator)
  • GoogleLookup(TextToQuery)
  • BingLookup(TextToQuery)
  • GoogleFinanceLookup(TextToQuery)
  • GetTitleFromUrl(Url)
  • GetMetaDescriptionFromUrl(Url)
  • GetBodyTextFromUrl(Url)
Let's look them in detail.


GetHTMLFromUrl(Url, Truncate)

This function allows you to download the contents of a url inside a cell.
Parameters:
  • Url: String. Any url you want to download the content. Ex:  http://www.ideativi.it
  • Truncate (optional): Boolean. It truncates the downloaded text to 32,767 characters. This field is optional and has a default value true.
The function can be used without the optional parameter, it will truncate the text to 32,767. This is a limitation imposed by Excel. In fact, by checking the documentation, we can see that this value is the maximum number of allowed characters in a cell. So to overcome the limitation (and the parameter value to false Truncate) and be able to manipulate pages bigger than 32KB you have to use function exclusively within a VBA function.

Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.


GetHTMLAndApplyRegEx(Url, Regex, NumberOfOccurencesSeparator)

This function allows you to download the contents of a url and apply the regular expression to HTML downloaded.
Parameters:
  • Url: String. Any url you want to download the content. Ex:  http://www.ideativi.it
  • Regex: String. Any regular expressione. Ex: http[^"]*
  • NumberOfOccurences (optional): integer. The maximum number of occurrences that the regex can return. This field is optional and has, as default value, -1. -1 permits not to have a maximum number of results.
  • Separator (optional): boolean. It specifies the separator to be applied to separate the results of the regex in case they are more than one. This field is optional and has, as default value, vbCrLf. vbCrLf is the equivalent of \r\n and corresponds to a carriage return.
This is our alternative to ImportXML, although it is formally very different. Our one used, as a mechanism for parsing, regular expressions, while the function of Google Spreadsheet is using XPath.
The difference has two motivations:
  • Excel has a native function for the application of XPath, FILTERXML, which is available from only Excel 2013 and has the limitation that only accepts valid and well-formed XML input. Which is not the HTML of most of the pages that you find on the web.
  • We find more versatile using of regular expressions then XPath.
In any case, we are thinking of an alternative to also use XPath in a future release of this project.

Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.


GetHTMLAndApplyGroupRegEx(Url, Regex, NumberOfOccurencesSeparator)

This function allows to download the contents of a url and apply a regular expression containing groups to html downloaded.
Parameters:
  • Url: String. Any url you want to download the content. Ex:  http://www.ideativi.it
  • Regex: String. Any regular expressione. Ex: http[^"]*
  • NumberOfOccurences (optional): integer. The maximum number of occurrences that the regex can return. This field is optional and has, as default value, -1. -1 permits not to have a maximum number of results.
  • Separator (optional): boolean. It specifies the separator to be applied to separate the results of the regex in case they are more than one. This field is optional and has, as default value, vbCrLf. vbCrLf is the equivalent of \r\n and corresponds to a carriage return.
This function is similar to the previous GetHTMLAndApplyRegEx and shares goals and objectives. The only difference is that instead of extracting the whole pattern, extracts only the groups in it.

Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.


GoogleLookup(TextToQuery)

This function allows to query Google and grab the contents of the One Box, if present for the query.
Parameters:
  • TextToQuery: String. Any text for which Google has an answer in the One Box. Ex: Italy Population
Google One Box
Google OneBox

All the queries, that generate a box of this kind, are analyzed and the data downloaded and returned by this function. I do not have a complete list of what you can ask Google, but in the file that you can download at the bottom of this page, you can find several working examples. You can find other examples on the page where Google announced GoogleLookup, now abandoned.

Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.


BingLookup(TextToQuery)

This function allows to query Bing grab the contents of the One Box, if present for the query.
Parameters:
  • TextToQuery: String. Any text for which Bing has an answer in the One Box. Ex: Italy Population

Bing OneBox
Bing OneBox


The function is similar to the previous GoogleLookup, with the difference that uses, as a supplier of information, Bing. All queries that generate a box. as one of the photo. are analyzed and the data downloaded and returned by this function. Again, the examples are in the excel file that you can download at the bottom of this page. For now it would seem that Bing returns less results then Google, but it is plausible that they will increase since the supplier of the information is Freebase.

Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.


GoogleFinanceLookup(TextToQuery)

This function allows to query Google Finance and withdraw the value of the searched company's stock.
Parameters:
  • TextToQuery: String. The name of any company or ticker of the share. Ex: Microsoft
The function relies on Google Finance, and returns the value of shares of a company. This function is partly redundant since you can get the same information through the GoogleLookup. The problem of GoogleLookup is that it doesn't return the value of each share, while if you use the stock ticker (ex BIT:F, for Fiat) in this function there is a wider range of stocks to be queried.

Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.


GetTitleFromUrl(Url)

This function allows to query any URL and extract the Title tag if present.
Parameters:
  • Url: String. Any url you want to download the content. Ex:  http://www.ideativi.it

The function is a simple application of GetHTMLAndApplyRegEx with a regular expression that extracts the contents of the Title tag if present.

Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.


GetMetaDescriptionFromUrl(Url)

This function allows to query any URL and extract the contents of the META Description if present.
Parameters:
  • Url: String. Any url you want to download the content. Ex:  http://www.ideativi.it

The function is a simple application of GetHTMLAndApplyRegEx with a regular expression that extracts the contents of the META Description if present.

Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.


GetBodyTextFromUrl(Url)

This function allows you to query any URL and extract the text within the body.
Parameters:
  • Url: String. Any url you want to download the content. Ex:  http://www.ideativi.it

The function downloads the HTML of the URL parameter passed by and lextracts the text from the Body tag, eliminating any tag so far.

Exceptions. If there's a network issue or the URL can not be reached, the function returns # VALUE.


Limitations

After so much good news, we must speak of the limitations that are present in the use of these functions, which depend on the same Excel limitations and where we collect the information.

Almost all the function, we have created, start downloading a page from an URL. VBA, the language with which you can extend the capabilities of Excel has the feature of being single threaded. Which means you can only do one thing at a time, and the worst thing is that Excel shares the same thread between programming in VBA and  the user interface. This means that while it is running one of the custom functions, the interface is blocked.

We tried to minimize this problem to the maximum, but it is still quite appreciable. Which means that if you are running a few hundred functions simultaneously the interface might have blocked for a few minutes, and the processing speed also depends on the speed of the sites you are downloading. My advice is once you have downloaded the information you need, copy it to a different excel file. The reason is that Excel may decide at a random time while you work to rerun all the functions, and when it happens you find the sheet blocked. Much better to take the data and place them on a different sheet where to work them.

Another useful tip is to not apply more than 100 functions on the sheet at a time, for two reasons:
  • you find the sheet stuck for a while
  • if you are using GoogleLookup or BingLookup, know that these do not rely to any public API but recall the same page that you would use in your browser. This means that if there are too many calls, the server would block them to save resources. Google, when there are too many requests, shows a captcha to make sure that they came from a human operator and not from asoftware.
If you want to know how much you can push, I called up to 2000 functions at once, it worked, but I had to wait several minutes before I got control back of the excel sheet. However, you can get the status of the calculation, because when applied several heavy function, Excel provides some information on the status bar.



From the status bar we know that we are using 8 processors, the calculation is at 2% and we applied 1850 functions

We also have another limitation regarding GoogleLookup and BingLookup. Since the information that is returned only come from the American versions of Google and Bing, queries and their results are in English only.

Last but not least, these functions work only on Microsoft Excel for Windows. It doesn't work on Windows RT and on Mac OS.

Credits

A special thanks to Enrico Altavilla (http://www.lowlevel.it/). This project was born from his post. And reached this form thanks to his ideas, his advices and the time he devoted to beta-testing.
Thanks Enrico. :-)


Download

Inside the file that contains the functions described in this document there's a sheet with several examples of use.
Post correlati
2013
17
Settembre
Alternative a ImportXML e GoogleLookup per Excel
Maurizio Ceravolo
ImportXML e GoogleLookup sono due funzioni per acquisire informazioni da web di Google Spreadsheet. Abbiamo sviluppato funzioni equivalenti per Excel. E qualcos'altro.
2011
24
Dicembre
Auguri ideativi di buon Natale
ideativi
Tutto lo staff di ideativi augura a tutti i nostri amici, partner, clienti, visitatori gli auguri di un buon Natale e di un sereno periodo di feste.
2011
3
Novembre
Microsoft Codename "Social Analytics"
Maurizio Ceravolo
I laboratori di SQL Azure, rilasciano un servizio cloud sperimentale per analizzare gli andamenti social su social network, blog e forum. Con le API per gli sviluppatori.
2011
13
Gennaio
Eliminare la restrizione ip per gli ftp di Aruba
Maurizio Ceravolo
Da un mese circa sul principale hoster italiano sono state introdotte restrizioni sugli ip che accedono agli ftp dei domini ospitati. Ecco la guida per rimuoverle.
2011
18
Gennaio
Open Source contro Software a pagamento
Maurizio Ceravolo
Cosa è meglio fra Open Source e Software a pagamento? Sui forum si fronteggiano sempre due fazioni, come se fossero ultrà sulle curve degli stadi. Facciamo un'analisi.


Archivio Mensile Blog
Tag Cloud
Apple drone Facebook finanza geek Google Google+ Groupon humor innovazione iPad iPhone Microsoft Programmazione social network SQL Server Star Wars torrent video videogiochi
Ultimi Post
I migliori di oggi
I più popolari del mese
Top post