
Using Parsoid to convert wikitext to HTML gives Template:<template name> in the output HTML; can I avoid that?


srg9000

Question

I am trying to convert the wikitext data dump of English Wikipedia to HTML, using "php parsoid/bin/parse.php --standalone --wt2html --inputfile <file>" for the conversion. I am extracting the tables, but many of them contain Template:<template name> strings instead of the actual values. For example, all citations are converted to Template:cite. I used the Bitnami MediaWiki container available on AWS Marketplace and called the above command as a subprocess from inside a Python script. What am I doing wrong, or is there something I can do to replace those strings with the actual values?
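For reference, the subprocess call described above looks roughly like the sketch below. The command and flags are the ones quoted in this post; the function names, the default script path, and the regex used to count leftover Template:<name> strings are only illustrative assumptions, not part of Parsoid.

```python
import re
import subprocess
from collections import Counter


def build_parse_command(input_file, parse_script="parsoid/bin/parse.php"):
    """Build the Parsoid command quoted in the post for one wikitext file."""
    return [
        "php", str(parse_script),
        "--standalone", "--wt2html",
        "--inputfile", str(input_file),
    ]


def wikitext_to_html(input_file, parse_script="parsoid/bin/parse.php"):
    """Run Parsoid as a subprocess and return the HTML it writes to stdout.

    Requires php and a Parsoid checkout; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        build_parse_command(input_file, parse_script),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_unexpanded_templates(html):
    """Count 'Template:<name>' strings left in the output (assumed pattern)."""
    return Counter(re.findall(r"Template:([\w-]+)", html))
```

Running find_unexpanded_templates over each converted page gives a quick list of which template names are showing up unexpanded.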

I have attached a sample PDF converted from the HTML (after some preprocessing). The table is from https://en.wikipedia.org/wiki/Hungarian_language

Thank you

90_Hungarian language_79_1_chrome.pdf


5 answers to this question

23 hours ago, Skizzerz said:

You are likely missing the text of those templates (and all nested dependencies) from your input file/wiki. You may be missing some extensions as well, but that's something you'll just need to find out as you go along.

Thank you for your response. Would you happen to know whether those templates are part of the MediaWiki repo or Docker image? A lot of them are for citations, web links, etc. In this case I ran Parsoid with the --standalone flag; would it help to run it without that flag? (Sorry, I am not experienced in PHP.)

Thank you

Edited by srg9000 (see edit history)
16 hours ago, Skizzerz said:

No on-wiki content is part of any repositories or docker images. You'll need to export them from the source wiki in question and import them into your wiki.

Thanks for clarifying. I am working only with the Wikipedia data dump. Is there a separate dump for the template sources? Special:Export requires me to pass in every template manually. If I get the sources of all the templates, how can I add them? I am running MediaWiki inside a container over SSH, so I might not be able to use the GUI (Special:Import) to import pages individually.
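To make the question concrete: if the dump follows the usual MediaWiki XML export layout, where each <page> carries a <title>, an <ns> (10 being the Template namespace) and a <revision><text>, then template sources could in principle be filtered out of it with something like the sketch below. Whether the dump I have actually contains the template pages is an assumption that has not been confirmed in this thread, and the function names and sample XML are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

TEMPLATE_NS = "10"  # namespace number of "Template:" pages in MediaWiki


def local_name(tag):
    """Strip the '{xmlns}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_template_pages(xml_source):
    """Yield (title, wikitext) for every Template-namespace page in an export file.

    xml_source may be a file path or a binary file object; pages are streamed
    so a large dump does not have to fit in memory.
    """
    title = ns = text = None
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if ns == TEMPLATE_NS and title is not None:
                yield title, text or ""
            title = ns = text = None
            elem.clear()  # free the finished page


# Tiny made-up sample in the export layout, only to show the call.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Hungarian language</title><ns>0</ns>
    <revision><text>Article body</text></revision></page>
  <page><title>Template:Cite</title><ns>10</ns>
    <revision><text>{{{1}}}</text></revision></page>
</mediawiki>"""

templates = dict(iter_template_pages(io.BytesIO(SAMPLE)))
```

This only collects the template text; getting it into the wiki that Parsoid reads from is the part I still need help with.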
 

Thank You
