Question
srg9000
I am trying to convert the wikitext data dump of English Wikipedia to HTML, and I used "php parsoid/bin/parse.php --standalone --wt2html --inputfile <file>" for the conversion. I am extracting the tables, but many of them contain "Template:<template name>" strings instead of the actual values. For example, all citations are converted to Template:cite. I used the Bitnami MediaWiki container available on the AWS Marketplace and called the command above as a subprocess from inside a Python script. What am I doing wrong, or is there something I can do to replace those strings with the actual values?
I have attached a sample PDF that was converted from the HTML output (after some preprocessing). The table is from https://en.wikipedia.org/wiki/Hungarian_language
Thank you
90_Hungarian language_79_1_chrome.pdf
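For anyone trying to reproduce this, here is a minimal Python sketch of the setup described above. The command is copied from the question. The function names (`wikitext_to_html`, `unexpanded_templates`) are hypothetical, and the regular expression is only an assumption about what the leftover "Template:<template name>" strings look like in the output. It does not fix the problem; it only reports which templates were left unexpanded and how often.

```python
import re
import subprocess
from collections import Counter

# Command copied from the question; the script path is relative to
# wherever Parsoid lives inside the container.
PARSOID_CMD = ["php", "parsoid/bin/parse.php", "--standalone", "--wt2html", "--inputfile"]

# Assumption: unexpanded templates show up as literal "Template:<name>" text.
# The name is taken to end at the next markup-like character or newline.
TEMPLATE_RE = re.compile(r'Template:([^<>"|\[\]{}#\n]+)')


def wikitext_to_html(path, cwd=None):
    """Run the Parsoid CLI on one wikitext file and return the HTML it prints."""
    result = subprocess.run(
        PARSOID_CMD + [path],
        cwd=cwd,              # directory that contains parsoid/ (hypothetical parameter)
        capture_output=True,
        text=True,
        check=True,           # raise CalledProcessError if php exits non-zero
    )
    return result.stdout


def unexpanded_templates(html):
    """Count leftover 'Template:<name>' strings, ignoring case and outer whitespace."""
    return Counter(name.strip().lower() for name in TEMPLATE_RE.findall(html))
```

Calling `unexpanded_templates(wikitext_to_html("page.wikitext"))` (the file name is a placeholder) and looking at `.most_common(10)` should show whether only citation templates are affected or every template on the page.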
5 answers to this question