Stream: helpdesk (published)

Topic: Scrape Julia source code


Adrian Hill (Aug 11 2021 at 10:12):

I want to scrape the source code for the Top 50 repos on JuliaHub. My goal is to look at common bigrams and optimize a keyboard symbol layer for Julia. What is the best way to download source files and put substrings into a DataFrame? Does Pkg provide a mechanism that could be used?

Andrey Oskin (Aug 11 2021 at 11:02):

You can probably just use the usual git clone, as a regular shell command?

Andrey Oskin (Aug 11 2021 at 11:02):

Just create a workspace in /tmp (or its analogue on Windows, if that is what you are using) and do whatever you want.
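
As a rough illustration of the suggestion above, here is a minimal sketch that shells out to git from Julia and clones into a temporary workspace. The helper names `repo_dirname` and `clone_repos` are hypothetical, the sketch assumes `git` is on the PATH, and the URL in the commented usage line is only an example.

```julia
# Hypothetical helper: derive a folder name from a repository URL,
# e.g. ".../DataFrames.jl.git" -> "DataFrames.jl".
repo_dirname(url::AbstractString) = replace(basename(url), r"\.git$" => "")

# Hypothetical helper: shallow-clone each URL into a workspace directory.
# Assumes the `git` executable is available on the PATH.
function clone_repos(urls::Vector{String}; workspace::String = mktempdir())
    paths = String[]
    for url in urls
        dest = joinpath(workspace, repo_dirname(url))
        # --depth 1 skips the history, which is not needed to scan source text
        run(`git clone --depth 1 $url $dest`)
        push!(paths, dest)
    end
    return paths
end

# Example usage (needs network access, so it is left commented out):
# clone_repos(["https://github.com/JuliaData/DataFrames.jl.git"])
```

`mktempdir()` picks a platform-appropriate temporary location, so the same sketch should also cover the Windows case mentioned above.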

Adrian Hill (Aug 11 2021 at 11:51):

Yeah, that would also do. I was just wondering whether this could be done in 100% Julia. But I guess Pkg also just calls Downloads.jl, which ends up using Curl.

Andrey Oskin (Aug 11 2021 at 12:41):

You can use UrlDownload.jl, which uses HTTP.jl.
But there is nothing wrong with Curl, and I think it works fine on Windows too.

Sebastian Pfitzner (Aug 11 2021 at 13:41):

Just Pkg.add everything, no? That should be faster than cloning.

Andrey Oskin (Aug 11 2021 at 13:47):

How can one get the path to the source code that the add command installs?

Andrey Oskin (Aug 11 2021 at 13:47):

I suppose it's something simple, but it is hidden somewhere inside Pkg.

Danny (Aug 11 2021 at 13:49):

pathof(Foo) gives you the path of the Foo package (more precisely, of the src/Foo.jl file that was used to import it).

Danny (Aug 11 2021 at 13:52):

Additionally, Base.find_package("Foo") doesn't require you to import the package.
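
To connect this back to the original bigram question, here is a minimal sketch that starts from the path returned by `Base.find_package`, walks the package directory for `.jl` files, and counts adjacent character pairs. The helper names `julia_files`, `count_bigrams!`, and `package_bigrams` are hypothetical, and the sketch assumes the package root is two directory levels above the entry file (the usual `Foo/src/Foo.jl` layout). It counts all character pairs; filtering down to symbols is left out.

```julia
# Hypothetical helper: collect every .jl file below a directory.
function julia_files(root::AbstractString)
    files = String[]
    for (dir, _, names) in walkdir(root)
        for name in names
            endswith(name, ".jl") && push!(files, joinpath(dir, name))
        end
    end
    return files
end

# Hypothetical helper: add the adjacent character pairs of `text` to `counts`.
function count_bigrams!(counts::Dict{String,Int}, text::AbstractString)
    prev = nothing
    for c in text
        if prev !== nothing
            key = string(prev, c)
            counts[key] = get(counts, key, 0) + 1
        end
        prev = c
    end
    return counts
end

# Hypothetical helper: bigram counts over all source files of an installed package.
function package_bigrams(pkgname::AbstractString)
    entry = Base.find_package(pkgname)  # path to the entry file, or nothing
    entry === nothing && error("package $pkgname not found in the current environment")
    root = dirname(dirname(entry))      # assumes the Foo/src/Foo.jl layout
    counts = Dict{String,Int}()
    for file in julia_files(root)
        count_bigrams!(counts, read(file, String))
    end
    return counts
end
```

Something like `package_bigrams("Dates")` should work for a standard library package without importing it. If DataFrames.jl is installed, the resulting dictionary could then be turned into a table with `DataFrame(bigram = collect(keys(counts)), count = collect(values(counts)))`.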

Eric Hanson (Aug 11 2021 at 22:17):

PackageAnalyzer can clone a bunch of packages to a directory with analyze!. It's threaded and does a shallow clone, so it can be quite fast.

Adrian Hill (Aug 12 2021 at 07:13):

Thanks everyone! :smiley:


Last updated: Oct 02 2023 at 04:34 UTC