From: Benjamin Mako Hill Date: Mon, 22 Jan 2018 01:15:51 +0000 (-0800) Subject: initial import of material for public archive into git X-Git-Url: https://code.communitydata.science/social-media-chapter.git/commitdiff_plain/dd420c77de075d148e5fc3d705401d7f3004660d?ds=sidebyside initial import of material for public archive into git We're creating a fresh archive because the history for our old chapter includes API keys, data files, and other material we can't share. --- dd420c77de075d148e5fc3d705401d7f3004660d diff --git a/COPYING b/COPYING new file mode 100644 index 0000000..3bd24bd --- /dev/null +++ b/COPYING @@ -0,0 +1,35 @@ +Code and paper for "A Computational Analysis of Social Media Scholarship" + +This work is copyright (c) 2016-2018 Jeremy Foote, Aaron Shaw, and +Benjamin Mako Hill. + +This archive contains both code and text for a document that was +published as a book chapter in the "SAGE Handbook of Social +Media." They are licensed differently: + +1. Code for analysis + +This code included here is all free software: you can redistribute it +and/or modify it under the terms of the GNU General Public License as +published by the Free Software Foundation, either version 3 of the +License, or (at your option) any later version. + +2. Text of the final paper + +The text of the paper is distributed under the Creative Commons +Attribution-NonCommercial-ShareAlike license: + +- Attribution — You must give appropriate credit, provide a link to + the license, and indicate if changes were made. You may do so in any + reasonable manner, but not in any way that suggests the licensor + endorses you or your use. + +- NonCommercial — You may not use the material for commercial purposes. + +- ShareAlike — If you remix, transform, or build upon the material, + you must distribute your contributions under the same license as the + original. + +No additional restrictions — You may not apply legal terms or +technological measures that legally restrict others from doing +anything the license permits. diff --git a/GPL-3 b/GPL-3 new file mode 100644 index 0000000..f288702 --- /dev/null +++ b/GPL-3 @@ -0,0 +1,674 @@ + GNU GENERAL PUBLIC LICENSE + Version 3, 29 June 2007 + + Copyright (C) 2007 Free Software Foundation, Inc. + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The GNU General Public License is a free, copyleft license for +software and other kinds of works. + + The licenses for most software and other practical works are designed +to take away your freedom to share and change the works. By contrast, +the GNU General Public License is intended to guarantee your freedom to +share and change all versions of a program--to make sure it remains free +software for all its users. We, the Free Software Foundation, use the +GNU General Public License for most of our software; it applies also to +any other work released this way by its authors. You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +them if you wish), that you receive source code or can get it if you +want it, that you can change the software or use pieces of it in new +free programs, and that you know you can do these things. + + To protect your rights, we need to prevent others from denying you +these rights or asking you to surrender the rights. 
Therefore, you have +certain responsibilities if you distribute copies of the software, or if +you modify it: responsibilities to respect the freedom of others. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must pass on to the recipients the same +freedoms that you received. You must make sure that they, too, receive +or can get the source code. And you must show them these terms so they +know their rights. + + Developers that use the GNU GPL protect your rights with two steps: +(1) assert copyright on the software, and (2) offer you this License +giving you legal permission to copy, distribute and/or modify it. + + For the developers' and authors' protection, the GPL clearly explains +that there is no warranty for this free software. For both users' and +authors' sake, the GPL requires that modified versions be marked as +changed, so that their problems will not be attributed erroneously to +authors of previous versions. + + Some devices are designed to deny users access to install or run +modified versions of the software inside them, although the manufacturer +can do so. This is fundamentally incompatible with the aim of +protecting users' freedom to change the software. The systematic +pattern of such abuse occurs in the area of products for individuals to +use, which is precisely where it is most unacceptable. Therefore, we +have designed this version of the GPL to prohibit the practice for those +products. If such problems arise substantially in other domains, we +stand ready to extend this provision to those domains in future versions +of the GPL, as needed to protect the freedom of users. + + Finally, every program is threatened constantly by software patents. +States should not allow patents to restrict development and use of +software on general-purpose computers, but in those that do, we wish to +avoid the special danger that patents applied to a free program could +make it effectively proprietary. To prevent this, the GPL assures that +patents cannot be used to render the program non-free. + + The precise terms and conditions for copying, distribution and +modification follow. + + TERMS AND CONDITIONS + + 0. Definitions. + + "This License" refers to version 3 of the GNU General Public License. + + "Copyright" also means copyright-like laws that apply to other kinds of +works, such as semiconductor masks. + + "The Program" refers to any copyrightable work licensed under this +License. Each licensee is addressed as "you". "Licensees" and +"recipients" may be individuals or organizations. + + To "modify" a work means to copy from or adapt all or part of the work +in a fashion requiring copyright permission, other than the making of an +exact copy. The resulting work is called a "modified version" of the +earlier work or a work "based on" the earlier work. + + A "covered work" means either the unmodified Program or a work based +on the Program. + + To "propagate" a work means to do anything with it that, without +permission, would make you directly or secondarily liable for +infringement under applicable copyright law, except executing it on a +computer or modifying a private copy. Propagation includes copying, +distribution (with or without modification), making available to the +public, and in some countries other activities as well. + + To "convey" a work means any kind of propagation that enables other +parties to make or receive copies. 
Mere interaction with a user through +a computer network, with no transfer of a copy, is not conveying. + + An interactive user interface displays "Appropriate Legal Notices" +to the extent that it includes a convenient and prominently visible +feature that (1) displays an appropriate copyright notice, and (2) +tells the user that there is no warranty for the work (except to the +extent that warranties are provided), that licensees may convey the +work under this License, and how to view a copy of this License. If +the interface presents a list of user commands or options, such as a +menu, a prominent item in the list meets this criterion. + + 1. Source Code. + + The "source code" for a work means the preferred form of the work +for making modifications to it. "Object code" means any non-source +form of a work. + + A "Standard Interface" means an interface that either is an official +standard defined by a recognized standards body, or, in the case of +interfaces specified for a particular programming language, one that +is widely used among developers working in that language. + + The "System Libraries" of an executable work include anything, other +than the work as a whole, that (a) is included in the normal form of +packaging a Major Component, but which is not part of that Major +Component, and (b) serves only to enable use of the work with that +Major Component, or to implement a Standard Interface for which an +implementation is available to the public in source code form. A +"Major Component", in this context, means a major essential component +(kernel, window system, and so on) of the specific operating system +(if any) on which the executable work runs, or a compiler used to +produce the work, or an object code interpreter used to run it. + + The "Corresponding Source" for a work in object code form means all +the source code needed to generate, install, and (for an executable +work) run the object code and to modify the work, including scripts to +control those activities. However, it does not include the work's +System Libraries, or general-purpose tools or generally available free +programs which are used unmodified in performing those activities but +which are not part of the work. For example, Corresponding Source +includes interface definition files associated with source files for +the work, and the source code for shared libraries and dynamically +linked subprograms that the work is specifically designed to require, +such as by intimate data communication or control flow between those +subprograms and other parts of the work. + + The Corresponding Source need not include anything that users +can regenerate automatically from other parts of the Corresponding +Source. + + The Corresponding Source for a work in source code form is that +same work. + + 2. Basic Permissions. + + All rights granted under this License are granted for the term of +copyright on the Program, and are irrevocable provided the stated +conditions are met. This License explicitly affirms your unlimited +permission to run the unmodified Program. The output from running a +covered work is covered by this License only if the output, given its +content, constitutes a covered work. This License acknowledges your +rights of fair use or other equivalent, as provided by copyright law. + + You may make, run and propagate covered works that you do not +convey, without conditions so long as your license otherwise remains +in force. 
You may convey covered works to others for the sole purpose +of having them make modifications exclusively for you, or provide you +with facilities for running those works, provided that you comply with +the terms of this License in conveying all material for which you do +not control copyright. Those thus making or running the covered works +for you must do so exclusively on your behalf, under your direction +and control, on terms that prohibit them from making any copies of +your copyrighted material outside their relationship with you. + + Conveying under any other circumstances is permitted solely under +the conditions stated below. Sublicensing is not allowed; section 10 +makes it unnecessary. + + 3. Protecting Users' Legal Rights From Anti-Circumvention Law. + + No covered work shall be deemed part of an effective technological +measure under any applicable law fulfilling obligations under article +11 of the WIPO copyright treaty adopted on 20 December 1996, or +similar laws prohibiting or restricting circumvention of such +measures. + + When you convey a covered work, you waive any legal power to forbid +circumvention of technological measures to the extent such circumvention +is effected by exercising rights under this License with respect to +the covered work, and you disclaim any intention to limit operation or +modification of the work as a means of enforcing, against the work's +users, your or third parties' legal rights to forbid circumvention of +technological measures. + + 4. Conveying Verbatim Copies. + + You may convey verbatim copies of the Program's source code as you +receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy an appropriate copyright notice; +keep intact all notices stating that this License and any +non-permissive terms added in accord with section 7 apply to the code; +keep intact all notices of the absence of any warranty; and give all +recipients a copy of this License along with the Program. + + You may charge any price or no price for each copy that you convey, +and you may offer support or warranty protection for a fee. + + 5. Conveying Modified Source Versions. + + You may convey a work based on the Program, or the modifications to +produce it from the Program, in the form of source code under the +terms of section 4, provided that you also meet all of these conditions: + + a) The work must carry prominent notices stating that you modified + it, and giving a relevant date. + + b) The work must carry prominent notices stating that it is + released under this License and any conditions added under section + 7. This requirement modifies the requirement in section 4 to + "keep intact all notices". + + c) You must license the entire work, as a whole, under this + License to anyone who comes into possession of a copy. This + License will therefore apply, along with any applicable section 7 + additional terms, to the whole of the work, and all its parts, + regardless of how they are packaged. This License gives no + permission to license the work in any other way, but it does not + invalidate such permission if you have separately received it. + + d) If the work has interactive user interfaces, each must display + Appropriate Legal Notices; however, if the Program has interactive + interfaces that do not display Appropriate Legal Notices, your + work need not make them do so. 
+ + A compilation of a covered work with other separate and independent +works, which are not by their nature extensions of the covered work, +and which are not combined with it such as to form a larger program, +in or on a volume of a storage or distribution medium, is called an +"aggregate" if the compilation and its resulting copyright are not +used to limit the access or legal rights of the compilation's users +beyond what the individual works permit. Inclusion of a covered work +in an aggregate does not cause this License to apply to the other +parts of the aggregate. + + 6. Conveying Non-Source Forms. + + You may convey a covered work in object code form under the terms +of sections 4 and 5, provided that you also convey the +machine-readable Corresponding Source under the terms of this License, +in one of these ways: + + a) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by the + Corresponding Source fixed on a durable physical medium + customarily used for software interchange. + + b) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by a + written offer, valid for at least three years and valid for as + long as you offer spare parts or customer support for that product + model, to give anyone who possesses the object code either (1) a + copy of the Corresponding Source for all the software in the + product that is covered by this License, on a durable physical + medium customarily used for software interchange, for a price no + more than your reasonable cost of physically performing this + conveying of source, or (2) access to copy the + Corresponding Source from a network server at no charge. + + c) Convey individual copies of the object code with a copy of the + written offer to provide the Corresponding Source. This + alternative is allowed only occasionally and noncommercially, and + only if you received the object code with such an offer, in accord + with subsection 6b. + + d) Convey the object code by offering access from a designated + place (gratis or for a charge), and offer equivalent access to the + Corresponding Source in the same way through the same place at no + further charge. You need not require recipients to copy the + Corresponding Source along with the object code. If the place to + copy the object code is a network server, the Corresponding Source + may be on a different server (operated by you or a third party) + that supports equivalent copying facilities, provided you maintain + clear directions next to the object code saying where to find the + Corresponding Source. Regardless of what server hosts the + Corresponding Source, you remain obligated to ensure that it is + available for as long as needed to satisfy these requirements. + + e) Convey the object code using peer-to-peer transmission, provided + you inform other peers where the object code and Corresponding + Source of the work are being offered to the general public at no + charge under subsection 6d. + + A separable portion of the object code, whose source code is excluded +from the Corresponding Source as a System Library, need not be +included in conveying the object code work. + + A "User Product" is either (1) a "consumer product", which means any +tangible personal property which is normally used for personal, family, +or household purposes, or (2) anything designed or sold for incorporation +into a dwelling. 
In determining whether a product is a consumer product, +doubtful cases shall be resolved in favor of coverage. For a particular +product received by a particular user, "normally used" refers to a +typical or common use of that class of product, regardless of the status +of the particular user or of the way in which the particular user +actually uses, or expects or is expected to use, the product. A product +is a consumer product regardless of whether the product has substantial +commercial, industrial or non-consumer uses, unless such uses represent +the only significant mode of use of the product. + + "Installation Information" for a User Product means any methods, +procedures, authorization keys, or other information required to install +and execute modified versions of a covered work in that User Product from +a modified version of its Corresponding Source. The information must +suffice to ensure that the continued functioning of the modified object +code is in no case prevented or interfered with solely because +modification has been made. + + If you convey an object code work under this section in, or with, or +specifically for use in, a User Product, and the conveying occurs as +part of a transaction in which the right of possession and use of the +User Product is transferred to the recipient in perpetuity or for a +fixed term (regardless of how the transaction is characterized), the +Corresponding Source conveyed under this section must be accompanied +by the Installation Information. But this requirement does not apply +if neither you nor any third party retains the ability to install +modified object code on the User Product (for example, the work has +been installed in ROM). + + The requirement to provide Installation Information does not include a +requirement to continue to provide support service, warranty, or updates +for a work that has been modified or installed by the recipient, or for +the User Product in which it has been modified or installed. Access to a +network may be denied when the modification itself materially and +adversely affects the operation of the network or violates the rules and +protocols for communication across the network. + + Corresponding Source conveyed, and Installation Information provided, +in accord with this section must be in a format that is publicly +documented (and with an implementation available to the public in +source code form), and must require no special password or key for +unpacking, reading or copying. + + 7. Additional Terms. + + "Additional permissions" are terms that supplement the terms of this +License by making exceptions from one or more of its conditions. +Additional permissions that are applicable to the entire Program shall +be treated as though they were included in this License, to the extent +that they are valid under applicable law. If additional permissions +apply only to part of the Program, that part may be used separately +under those permissions, but the entire Program remains governed by +this License without regard to the additional permissions. + + When you convey a copy of a covered work, you may at your option +remove any additional permissions from that copy, or from any part of +it. (Additional permissions may be written to require their own +removal in certain cases when you modify the work.) You may place +additional permissions on material, added by you to a covered work, +for which you have or can give appropriate copyright permission. 
+ + Notwithstanding any other provision of this License, for material you +add to a covered work, you may (if authorized by the copyright holders of +that material) supplement the terms of this License with terms: + + a) Disclaiming warranty or limiting liability differently from the + terms of sections 15 and 16 of this License; or + + b) Requiring preservation of specified reasonable legal notices or + author attributions in that material or in the Appropriate Legal + Notices displayed by works containing it; or + + c) Prohibiting misrepresentation of the origin of that material, or + requiring that modified versions of such material be marked in + reasonable ways as different from the original version; or + + d) Limiting the use for publicity purposes of names of licensors or + authors of the material; or + + e) Declining to grant rights under trademark law for use of some + trade names, trademarks, or service marks; or + + f) Requiring indemnification of licensors and authors of that + material by anyone who conveys the material (or modified versions of + it) with contractual assumptions of liability to the recipient, for + any liability that these contractual assumptions directly impose on + those licensors and authors. + + All other non-permissive additional terms are considered "further +restrictions" within the meaning of section 10. If the Program as you +received it, or any part of it, contains a notice stating that it is +governed by this License along with a term that is a further +restriction, you may remove that term. If a license document contains +a further restriction but permits relicensing or conveying under this +License, you may add to a covered work material governed by the terms +of that license document, provided that the further restriction does +not survive such relicensing or conveying. + + If you add terms to a covered work in accord with this section, you +must place, in the relevant source files, a statement of the +additional terms that apply to those files, or a notice indicating +where to find the applicable terms. + + Additional terms, permissive or non-permissive, may be stated in the +form of a separately written license, or stated as exceptions; +the above requirements apply either way. + + 8. Termination. + + You may not propagate or modify a covered work except as expressly +provided under this License. Any attempt otherwise to propagate or +modify it is void, and will automatically terminate your rights under +this License (including any patent licenses granted under the third +paragraph of section 11). + + However, if you cease all violation of this License, then your +license from a particular copyright holder is reinstated (a) +provisionally, unless and until the copyright holder explicitly and +finally terminates your license, and (b) permanently, if the copyright +holder fails to notify you of the violation by some reasonable means +prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is +reinstated permanently if the copyright holder notifies you of the +violation by some reasonable means, this is the first time you have +received notice of violation of this License (for any work) from that +copyright holder, and you cure the violation prior to 30 days after +your receipt of the notice. + + Termination of your rights under this section does not terminate the +licenses of parties who have received copies or rights from you under +this License. 
If your rights have been terminated and not permanently +reinstated, you do not qualify to receive new licenses for the same +material under section 10. + + 9. Acceptance Not Required for Having Copies. + + You are not required to accept this License in order to receive or +run a copy of the Program. Ancillary propagation of a covered work +occurring solely as a consequence of using peer-to-peer transmission +to receive a copy likewise does not require acceptance. However, +nothing other than this License grants you permission to propagate or +modify any covered work. These actions infringe copyright if you do +not accept this License. Therefore, by modifying or propagating a +covered work, you indicate your acceptance of this License to do so. + + 10. Automatic Licensing of Downstream Recipients. + + Each time you convey a covered work, the recipient automatically +receives a license from the original licensors, to run, modify and +propagate that work, subject to this License. You are not responsible +for enforcing compliance by third parties with this License. + + An "entity transaction" is a transaction transferring control of an +organization, or substantially all assets of one, or subdividing an +organization, or merging organizations. If propagation of a covered +work results from an entity transaction, each party to that +transaction who receives a copy of the work also receives whatever +licenses to the work the party's predecessor in interest had or could +give under the previous paragraph, plus a right to possession of the +Corresponding Source of the work from the predecessor in interest, if +the predecessor has it or can get it with reasonable efforts. + + You may not impose any further restrictions on the exercise of the +rights granted or affirmed under this License. For example, you may +not impose a license fee, royalty, or other charge for exercise of +rights granted under this License, and you may not initiate litigation +(including a cross-claim or counterclaim in a lawsuit) alleging that +any patent claim is infringed by making, using, selling, offering for +sale, or importing the Program or any portion of it. + + 11. Patents. + + A "contributor" is a copyright holder who authorizes use under this +License of the Program or a work on which the Program is based. The +work thus licensed is called the contributor's "contributor version". + + A contributor's "essential patent claims" are all patent claims +owned or controlled by the contributor, whether already acquired or +hereafter acquired, that would be infringed by some manner, permitted +by this License, of making, using, or selling its contributor version, +but do not include claims that would be infringed only as a +consequence of further modification of the contributor version. For +purposes of this definition, "control" includes the right to grant +patent sublicenses in a manner consistent with the requirements of +this License. + + Each contributor grants you a non-exclusive, worldwide, royalty-free +patent license under the contributor's essential patent claims, to +make, use, sell, offer for sale, import and otherwise run, modify and +propagate the contents of its contributor version. + + In the following three paragraphs, a "patent license" is any express +agreement or commitment, however denominated, not to enforce a patent +(such as an express permission to practice a patent or covenant not to +sue for patent infringement). 
To "grant" such a patent license to a +party means to make such an agreement or commitment not to enforce a +patent against the party. + + If you convey a covered work, knowingly relying on a patent license, +and the Corresponding Source of the work is not available for anyone +to copy, free of charge and under the terms of this License, through a +publicly available network server or other readily accessible means, +then you must either (1) cause the Corresponding Source to be so +available, or (2) arrange to deprive yourself of the benefit of the +patent license for this particular work, or (3) arrange, in a manner +consistent with the requirements of this License, to extend the patent +license to downstream recipients. "Knowingly relying" means you have +actual knowledge that, but for the patent license, your conveying the +covered work in a country, or your recipient's use of the covered work +in a country, would infringe one or more identifiable patents in that +country that you have reason to believe are valid. + + If, pursuant to or in connection with a single transaction or +arrangement, you convey, or propagate by procuring conveyance of, a +covered work, and grant a patent license to some of the parties +receiving the covered work authorizing them to use, propagate, modify +or convey a specific copy of the covered work, then the patent license +you grant is automatically extended to all recipients of the covered +work and works based on it. + + A patent license is "discriminatory" if it does not include within +the scope of its coverage, prohibits the exercise of, or is +conditioned on the non-exercise of one or more of the rights that are +specifically granted under this License. You may not convey a covered +work if you are a party to an arrangement with a third party that is +in the business of distributing software, under which you make payment +to the third party based on the extent of your activity of conveying +the work, and under which the third party grants, to any of the +parties who would receive the covered work from you, a discriminatory +patent license (a) in connection with copies of the covered work +conveyed by you (or copies made from those copies), or (b) primarily +for and in connection with specific products or compilations that +contain the covered work, unless you entered into that arrangement, +or that patent license was granted, prior to 28 March 2007. + + Nothing in this License shall be construed as excluding or limiting +any implied license or other defenses to infringement that may +otherwise be available to you under applicable patent law. + + 12. No Surrender of Others' Freedom. + + If conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot convey a +covered work so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you may +not convey it at all. For example, if you agree to terms that obligate you +to collect a royalty for further conveying from those to whom you convey +the Program, the only way you could satisfy both those terms and this +License would be to refrain entirely from conveying the Program. + + 13. Use with the GNU Affero General Public License. 
+ + Notwithstanding any other provision of this License, you have +permission to link or combine any covered work with a work licensed +under version 3 of the GNU Affero General Public License into a single +combined work, and to convey the resulting work. The terms of this +License will continue to apply to the part which is the covered work, +but the special requirements of the GNU Affero General Public License, +section 13, concerning interaction through a network will apply to the +combination as such. + + 14. Revised Versions of this License. + + The Free Software Foundation may publish revised and/or new versions of +the GNU General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + + Each version is given a distinguishing version number. If the +Program specifies that a certain numbered version of the GNU General +Public License "or any later version" applies to it, you have the +option of following the terms and conditions either of that numbered +version or of any later version published by the Free Software +Foundation. If the Program does not specify a version number of the +GNU General Public License, you may choose any version ever published +by the Free Software Foundation. + + If the Program specifies that a proxy can decide which future +versions of the GNU General Public License can be used, that proxy's +public statement of acceptance of a version permanently authorizes you +to choose that version for the Program. + + Later license versions may give you additional or different +permissions. However, no additional obligations are imposed on any +author or copyright holder as a result of your choosing to follow a +later version. + + 15. Disclaimer of Warranty. + + THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY +APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT +HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY +OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, +THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM +IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF +ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. Limitation of Liability. + + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS +THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY +GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF +DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD +PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), +EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF +SUCH DAMAGES. + + 17. Interpretation of Sections 15 and 16. + + If the disclaimer of warranty and limitation of liability provided +above cannot be given local legal effect according to their terms, +reviewing courts shall apply local law that most closely approximates +an absolute waiver of all civil liability in connection with the +Program, unless a warranty or assumption of liability accompanies a +copy of the Program in return for a fee. 
+ + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +state the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see . + +Also add information on how to contact you by electronic and paper mail. + + If the program does terminal interaction, make it output a short +notice like this when it starts in an interactive mode: + + Copyright (C) + This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, your program's commands +might be different; for a GUI interface, you would use an "about box". + + You should also get your employer (if you work as a programmer) or school, +if any, to sign a "copyright disclaimer" for the program, if necessary. +For more information on this, and how to apply and follow the GNU GPL, see +. + + The GNU General Public License does not permit incorporating your program +into proprietary programs. If your program is a subroutine library, you +may consider it more useful to permit linking proprietary applications with +the library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. But first, please read +. 
diff --git a/code/bibliometrics/00_citation_network_analysis.ipynb b/code/bibliometrics/00_citation_network_analysis.ipynb new file mode 100644 index 0000000..792cb0e --- /dev/null +++ b/code/bibliometrics/00_citation_network_analysis.ipynb @@ -0,0 +1,2493 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Import data and get things setup" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "random.seed(9001)" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Populating the interactive namespace from numpy and matplotlib\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/lib/python3/dist-packages/IPython/core/magics/pylab.py:161: UserWarning: pylab import has clobbered these variables: ['sin', 'pi', 'median', 'random', 'percentile', 'save', 'deprecated', 'Rectangle', 'load', 'mean', 'plot', 'cos']\n", + "`%matplotlib` prevents importing * from pylab and numpy\n", + " \"\\n`%matplotlib` prevents importing * from pylab and numpy\"\n" + ] + } + ], + "source": [ + "# turn on the magic so we have inline figures\n", + "%pylab inline\n", + "import matplotlib\n", + "matplotlib.style.use('ggplot')\n", + "from IPython.display import display" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [], + "source": [ + "# import code to write r modules and create our variable we'll write to\n", + "import rpy2.robjects as robjects\n", + "from rpy2.robjects import pandas2ri\n", + "pandas2ri.activate()\n", + "\n", + "r = {}\n", + "def remember(name, x):\n", + " r[name] = x\n", + " display(x)" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [], + "source": [ + "# load in modules we'll need for analysis\n", + "import subprocess\n", + "import csv\n", + "from igraph import *\n", + "import pandas as pd\n", + "import numpy as np\n", + "import re" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# grab the largest connected compontent with a little function\n", + "def get_largest_component(g):\n", + " g_components = g.components(mode=\"WEAK\")\n", + " max_size = max(g_components.sizes())\n", + " for g_tmp in g_components.subgraphs():\n", + " if g_tmp.vcount() == max_size:\n", + " return(g_tmp)" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": {}, + "outputs": [], + "source": [ + "# look the full edgelist into igraph\n", + "def edge_list_iter(df):\n", + " for i, row in df.iterrows():\n", + " yield (row['from'], row['to'])" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [], + "source": [ + "# list top 5 journals for each of the clusters\n", + "def top_journals_for_clusters(clu):\n", + " articles_tmp = pd.merge(clu, articles[['eid', 'source_title']])\n", + " \n", + " output = pd.DataFrame()\n", + " for cid in articles_tmp['cluster'].unique():\n", + " journal_counts = articles_tmp['source_title'][articles_tmp['cluster'] == cid].value_counts().head(5)\n", + " tmp = pd.DataFrame({'cluster' : cid, 'count' : journal_counts }) \n", + " output = output.append(tmp)\n", + "\n", + " output = output.reset_index()\n", + " output = output.rename(columns = {'index' : \"journal\"})\n", + " return(output)" + ] + }, + { + 
"cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [], + "source": [ + "def infomap_edgelist(g, edgelist_filename, directed=True):\n", + " nodes_tmp = pd.DataFrame([ {'node_infomap' : v.index, \n", + " 'eid' : v['name']} for v in g.vs ])\n", + "\n", + " # write out the edgelist to an external file so we can call infomap on it\n", + " with open(edgelist_filename + \".txt\", 'w') as f:\n", + " for e in g.es:\n", + " if e.source != e.target:\n", + " if 'weight' in e.attributes():\n", + " print(\"{}\\t{}\\t{}\".format(e.source, e.target, e['weight']), file=f)\n", + " else:\n", + " print(\"{}\\t{}\".format(e.source, e.target), file=f)\n", + "\n", + " \n", + " # run the external program to generate the infomap clustering\n", + " infomap_cmdline = [\"infomap/Infomap\", edgelist_filename + \".txt\", \"output_dir -z --map --clu --tree\"]\n", + " if directed:\n", + " infomap_cmdline.append(\"-d\")\n", + " subprocess.call(infomap_cmdline)\n", + "\n", + " # load up the clu data\n", + " clu = pd.read_csv(\"output_dir/\" + edgelist_filename + \".clu\",\n", + " header=None, comment=\"#\", delim_whitespace=True)\n", + " clu.columns = ['node_infomap', 'cluster', 'flow']\n", + " \n", + " return pd.merge(clu, nodes_tmp, on=\"node_infomap\")" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [], + "source": [ + "def write_graphml(g, clu, graphml_filename):\n", + " clu = clu[['node_infomap', 'cluster']].sort_values('node_infomap')\n", + " g.vs[\"cluster\"] = clu[\"cluster\"].tolist()\n", + " g.write_graphml(graphml_filename)" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": {}, + "outputs": [], + "source": [ + "# load article data\n", + "articles = pd.read_csv(\"../../processed_data/abstracts.tsv\", delimiter=\"\\t\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# network for just the central \"social media\" set" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": {}, + "outputs": [], + "source": [ + "# this contains the list of all INCOMING citations to for paper in the original set\n", + "raw_edgelist = pd.read_csv(\"../../processed_data/social_media_edgelist.txt\", delimiter=\"\\t\")" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": {}, + "outputs": [], + "source": [ + "g_sm_all = Graph.TupleList([i for i in edge_list_iter(raw_edgelist)], directed=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [], + "source": [ + "g_sm = get_largest_component(g_sm_all)\n", + "g_sm = g_sm.simplify()" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [], + "source": [ + "g_sm_clu = infomap_edgelist(g_sm, \"sm_edgelist_infomap\", directed=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2 1817\n", + "1 1748\n", + "3 1088\n", + "4 653\n", + "6 355\n", + "10 114\n", + "5 104\n", + "9 90\n", + "8 59\n", + "7 44\n", + "12 27\n", + "11 19\n", + "13 10\n", + "14 5\n", + "15 3\n", + "16 2\n", + "18 1\n", + "17 1\n", + "Name: cluster, dtype: int64" + ] + }, + "execution_count": 66, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "g_sm_clu['cluster'].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
journalclustercount
40Lecture Notes in Computer Science (including s...94
41WSDM 2013 - Proceedings of the 6th ACM Interna...94
42Conference on Human Factors in Computing Syste...92
43WWW 2013 Companion - Proceedings of the 22nd I...92
44PLoS ONE92
\n", + "
" + ], + "text/plain": [ + " journal cluster count\n", + "40 Lecture Notes in Computer Science (including s... 9 4\n", + "41 WSDM 2013 - Proceedings of the 6th ACM Interna... 9 4\n", + "42 Conference on Human Factors in Computing Syste... 9 2\n", + "43 WWW 2013 Companion - Proceedings of the 22nd I... 9 2\n", + "44 PLoS ONE 9 2" + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tmp = top_journals_for_clusters(g_sm_clu)\n", + "tmp[tmp.cluster == 9]" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": {}, + "outputs": [], + "source": [ + "write_graphml(g_sm, g_sm_clu, \"g_sm.graphml\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# larger network that contains the incoming cites to citing articles" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": {}, + "outputs": [], + "source": [ + "# this contains the list of all INCOMING citations to everything in the original set\n", + "# plus every INCOMING citation to every paper that cites one of those papers\n", + "raw_edgelist_files = [\"../../processed_data/citation_edgelist.txt\",\n", + " \"../../processed_data/social_media_edgelist.txt\"]\n", + "combo_raw_edgelist = pd.concat([pd.read_csv(x, delimiter=\"\\t\") for x in raw_edgelist_files])" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [], + "source": [ + "g_full_all = Graph.TupleList([i for i in edge_list_iter(combo_raw_edgelist)], directed=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [], + "source": [ + "g_full = get_largest_component(g_full_all)\n", + "g_full = g_full.simplify()" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [], + "source": [ + "g_full_clu = infomap_edgelist(g_full, \"citation_edglist_infomap\", directed=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1 9243\n", + "2 8225\n", + "3 6826\n", + "4 3227\n", + "6 2835\n", + "5 2704\n", + "7 1911\n", + "9 810\n", + "8 803\n", + "10 589\n", + "11 520\n", + "12 491\n", + "13 336\n", + "14 219\n", + "15 175\n", + "17 162\n", + "16 153\n", + "22 139\n", + "18 135\n", + "19 118\n", + "25 117\n", + "23 106\n", + "21 93\n", + "24 88\n", + "30 84\n", + "28 79\n", + "27 78\n", + "32 76\n", + "26 73\n", + "20 71\n", + " ... \n", + "54 26\n", + "56 25\n", + "52 23\n", + "49 23\n", + "55 22\n", + "58 19\n", + "62 18\n", + "61 18\n", + "63 18\n", + "60 17\n", + "66 15\n", + "59 15\n", + "57 15\n", + "65 14\n", + "68 13\n", + "53 7\n", + "64 6\n", + "73 6\n", + "71 4\n", + "70 4\n", + "74 3\n", + "67 3\n", + "72 3\n", + "69 3\n", + "75 2\n", + "78 1\n", + "79 1\n", + "77 1\n", + "80 1\n", + "76 1\n", + "Name: cluster, Length: 80, dtype: int64" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "g_full_clu['cluster'].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
journalclustercount
0Public Relations Review1119
1Lecture Notes in Computer Science (including s...181
2Computers in Human Behavior171
3Proceedings of the Annual Hawaii International...149
4Government Information Quarterly140
5Journal of Medical Internet Research2149
6PLoS ONE243
7Studies in Health Technology and Informatics241
8Lecture Notes in Computer Science (including s...232
9Annals of Emergency Medicine217
10Lecture Notes in Computer Science (including s...3180
11ACM International Conference Proceeding Series351
12International Conference on Information and Kn...338
13CEUR Workshop Proceedings337
14PLoS ONE336
15Information Communication and Society470
16New Media and Society434
17First Monday424
18Lecture Notes in Computer Science (including s...423
19Computers in Human Behavior421
20Computers in Human Behavior542
21Cyberpsychology, Behavior, and Social Networking542
22Personality and Individual Differences511
23Journal of Medical Internet Research511
24Journal of Adolescent Health511
25Computers in Human Behavior638
26Lecture Notes in Computer Science (including s...624
27Computers and Education616
28Conference on Human Factors in Computing Syste...611
29Journal of Marketing Education611
............
286Medical Journal of Australia631
287Nicotine and Tobacco Research631
28835th International Conference on Information S...641
289First Monday641
290Cyberpsychology, Behavior, and Social Networking641
291HT'12 - Proceedings of 23rd ACM Conference on ...651
292IEEE/ACM Transactions on Networking651
293Journal of Healthcare Engineering651
294International Journal of Information Management662
295Journal of Theoretical and Applied Electronic ...661
296Journal of Experimental and Theoretical Artifi...661
297McKinsey Quarterly661
298Lecture Notes in Computer Science (including s...661
299Science (New York, N.Y.)671
300International Conference on Information and Kn...681
301Lecture Notes in Computer Science (including s...681
30216th Americas Conference on Information System...681
303Procedia Engineering681
304International Journal of Virtual and Personal ...681
305Scientometrics691
306Conference on Human Factors in Computing Syste...702
307NyS712
308Aslib Proceedings: New Information Perspectives711
309WWW 2013 Companion - Proceedings of the 22nd I...721
310Cyberpsychology, Behavior, and Social Networking721
311PACIS 2011 - 15th Pacific Asia Conference on I...731
312Proceedings of the International Conference on...731
313Online (Wilton, Connecticut)741
314Catalan Journal of Communication and Cultural ...751
315Proceedings - Pacific Asia Conference on Infor...751
\n", + "

316 rows × 3 columns

\n", + "
" + ], + "text/plain": [ + " journal cluster count\n", + "0 Public Relations Review 1 119\n", + "1 Lecture Notes in Computer Science (including s... 1 81\n", + "2 Computers in Human Behavior 1 71\n", + "3 Proceedings of the Annual Hawaii International... 1 49\n", + "4 Government Information Quarterly 1 40\n", + "5 Journal of Medical Internet Research 2 149\n", + "6 PLoS ONE 2 43\n", + "7 Studies in Health Technology and Informatics 2 41\n", + "8 Lecture Notes in Computer Science (including s... 2 32\n", + "9 Annals of Emergency Medicine 2 17\n", + "10 Lecture Notes in Computer Science (including s... 3 180\n", + "11 ACM International Conference Proceeding Series 3 51\n", + "12 International Conference on Information and Kn... 3 38\n", + "13 CEUR Workshop Proceedings 3 37\n", + "14 PLoS ONE 3 36\n", + "15 Information Communication and Society 4 70\n", + "16 New Media and Society 4 34\n", + "17 First Monday 4 24\n", + "18 Lecture Notes in Computer Science (including s... 4 23\n", + "19 Computers in Human Behavior 4 21\n", + "20 Computers in Human Behavior 5 42\n", + "21 Cyberpsychology, Behavior, and Social Networking 5 42\n", + "22 Personality and Individual Differences 5 11\n", + "23 Journal of Medical Internet Research 5 11\n", + "24 Journal of Adolescent Health 5 11\n", + "25 Computers in Human Behavior 6 38\n", + "26 Lecture Notes in Computer Science (including s... 6 24\n", + "27 Computers and Education 6 16\n", + "28 Conference on Human Factors in Computing Syste... 6 11\n", + "29 Journal of Marketing Education 6 11\n", + ".. ... ... ...\n", + "286 Medical Journal of Australia 63 1\n", + "287 Nicotine and Tobacco Research 63 1\n", + "288 35th International Conference on Information S... 64 1\n", + "289 First Monday 64 1\n", + "290 Cyberpsychology, Behavior, and Social Networking 64 1\n", + "291 HT'12 - Proceedings of 23rd ACM Conference on ... 65 1\n", + "292 IEEE/ACM Transactions on Networking 65 1\n", + "293 Journal of Healthcare Engineering 65 1\n", + "294 International Journal of Information Management 66 2\n", + "295 Journal of Theoretical and Applied Electronic ... 66 1\n", + "296 Journal of Experimental and Theoretical Artifi... 66 1\n", + "297 McKinsey Quarterly 66 1\n", + "298 Lecture Notes in Computer Science (including s... 66 1\n", + "299 Science (New York, N.Y.) 67 1\n", + "300 International Conference on Information and Kn... 68 1\n", + "301 Lecture Notes in Computer Science (including s... 68 1\n", + "302 16th Americas Conference on Information System... 68 1\n", + "303 Procedia Engineering 68 1\n", + "304 International Journal of Virtual and Personal ... 68 1\n", + "305 Scientometrics 69 1\n", + "306 Conference on Human Factors in Computing Syste... 70 2\n", + "307 NyS 71 2\n", + "308 Aslib Proceedings: New Information Perspectives 71 1\n", + "309 WWW 2013 Companion - Proceedings of the 22nd I... 72 1\n", + "310 Cyberpsychology, Behavior, and Social Networking 72 1\n", + "311 PACIS 2011 - 15th Pacific Asia Conference on I... 73 1\n", + "312 Proceedings of the International Conference on... 73 1\n", + "313 Online (Wilton, Connecticut) 74 1\n", + "314 Catalan Journal of Communication and Cultural ... 75 1\n", + "315 Proceedings - Pacific Asia Conference on Infor... 
75 1\n", + "\n", + "[316 rows x 3 columns]" + ] + }, + "execution_count": 74, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "top_journals_for_clusters(g_full_clu)" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": {}, + "outputs": [], + "source": [ + "write_graphml(g_full, g_full_clu, \"g_full.graphml\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# create the meta-network of connections between clusters" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
to_clusterfrom_clustervalue
121396
231278
341233
451171
56185
67157
78186
89125
910129
1011112
111210
121313
1312412
1532117
1642126
1752187
1862104
1972175
208268
219216
221024
231123
241220
251324
2613184
2723150
2943174
3053345
316311
327399
............
20410160
20511160
20612160
20713161
2081170
2092170
2103170
2114173
2125174
2136170
2147170
2158172
2169170
21710170
21811170
21912170
22013170
2211183
2222180
2233180
2244182
2255182
2266180
2277180
2288180
2299180
23010180
23111180
23212180
23313180
\n", + "

221 rows × 3 columns

\n", + "
" + ], + "text/plain": [ + " to_cluster from_cluster value\n", + "1 2 1 396\n", + "2 3 1 278\n", + "3 4 1 233\n", + "4 5 1 171\n", + "5 6 1 85\n", + "6 7 1 57\n", + "7 8 1 86\n", + "8 9 1 25\n", + "9 10 1 29\n", + "10 11 1 12\n", + "11 12 1 0\n", + "12 13 1 3\n", + "13 1 2 412\n", + "15 3 2 117\n", + "16 4 2 126\n", + "17 5 2 187\n", + "18 6 2 104\n", + "19 7 2 175\n", + "20 8 2 68\n", + "21 9 2 16\n", + "22 10 2 4\n", + "23 11 2 3\n", + "24 12 2 0\n", + "25 13 2 4\n", + "26 1 3 184\n", + "27 2 3 150\n", + "29 4 3 174\n", + "30 5 3 345\n", + "31 6 3 11\n", + "32 7 3 99\n", + ".. ... ... ...\n", + "204 10 16 0\n", + "205 11 16 0\n", + "206 12 16 0\n", + "207 13 16 1\n", + "208 1 17 0\n", + "209 2 17 0\n", + "210 3 17 0\n", + "211 4 17 3\n", + "212 5 17 4\n", + "213 6 17 0\n", + "214 7 17 0\n", + "215 8 17 2\n", + "216 9 17 0\n", + "217 10 17 0\n", + "218 11 17 0\n", + "219 12 17 0\n", + "220 13 17 0\n", + "221 1 18 3\n", + "222 2 18 0\n", + "223 3 18 0\n", + "224 4 18 2\n", + "225 5 18 2\n", + "226 6 18 0\n", + "227 7 18 0\n", + "228 8 18 0\n", + "229 9 18 0\n", + "230 10 18 0\n", + "231 11 18 0\n", + "232 12 18 0\n", + "233 13 18 0\n", + "\n", + "[221 rows x 3 columns]" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "edgelist_tmp = pd.merge(raw_edgelist, g_sm_clu[[\"eid\", \"cluster\"]], how=\"inner\", left_on=\"to\", right_on=\"eid\")\n", + "edgelist_tmp = edgelist_tmp.rename(columns={'cluster' : 'to_cluster'})\n", + "edgelist_tmp.drop('eid', 1, inplace=True)\n", + " \n", + "edgelist_tmp = pd.merge(edgelist_tmp, g_sm_clu[[\"eid\", \"cluster\"]], how=\"inner\", left_on=\"from\", right_on=\"eid\")\n", + "edgelist_tmp = edgelist_tmp.rename(columns={\"cluster\" : 'from_cluster'})\n", + "edgelist_tmp.drop('eid', 1, inplace=True)\n", + "\n", + "edgelist_tmp = edgelist_tmp[[\"to_cluster\", \"from_cluster\"]]\n", + "edgelist_tmp = edgelist_tmp[edgelist_tmp[\"to_cluster\"] != edgelist_tmp[\"from_cluster\"]]\n", + "\n", + "cluster_edgelist = pd.crosstab(edgelist_tmp[\"to_cluster\"], edgelist_tmp[\"from_cluster\"])\n", + "cluster_edgelist[\"to_cluster\"] = cluster_edgelist.index\n", + "\n", + "cluster_edgelist = pd.melt(cluster_edgelist, id_vars=[\"to_cluster\"])\n", + "cluster_edgelist = cluster_edgelist[cluster_edgelist['to_cluster'] != cluster_edgelist['from_cluster']]\n", + "\n", + "remember(\"cluster_edgelist\", cluster_edgelist)" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "metadata": {}, + "outputs": [], + "source": [ + "top_clusters = g_sm_clu[\"cluster\"].value_counts().head(6).index\n", + "\n", + "# write the edgelist for the total number of clusters (currently 1-6)\n", + "cluster_edgelist_output = cluster_edgelist[(cluster_edgelist[\"to_cluster\"].isin(top_clusters)) &\n", + " (cluster_edgelist[\"from_cluster\"].isin(top_clusters))]\n", + "\n", + "cluster_edgelist_output = cluster_edgelist_output[cluster_edgelist_output[\"value\"] > 0]\n", + "\n", + "g_cluster = Graph.TupleList([tuple(x) for x in cluster_edgelist_output[[\"from_cluster\", \"to_cluster\"]].values], directed=True)\n", + "g_cluster.es[\"weight\"] = cluster_edgelist_output[\"value\"].tolist()\n", + "\n", + "# assign the number of total articles as an attribute for each node\n", + "g_cluster.vs[\"papers\"] = g_sm_clu[\"cluster\"].value_counts()[[x[\"name\"] for x in g_cluster.vs]].tolist()\n", + "\n", + "g_cluster.write_graphml(\"clusters.graphml\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# create network stats for tables (overall and 
within clusters)" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [], + "source": [ + "def create_network_stats(g):\n", + " network_stats = pd.DataFrame({'eid' : g.vs['name'],\n", + " 'eig_cent' : g.eigenvector_centrality(),\n", + " 'indegree' : g.indegree(),\n", + " 'betweenness' : g.betweenness()})\n", + "\n", + " network_stats = pd.merge(network_stats,\n", + " articles[['eid', 'title', 'source_title']],\n", + " how=\"inner\")\n", + " return network_stats" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "metadata": {}, + "outputs": [], + "source": [ + "network_stats = create_network_stats(g_full)" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
betweennesseideig_centindegreetitlesource_title
22756393.5604982-s2.0-711490889871.000000e+001876Users of the world, unite! The challenges and ...Business Horizons
1790.0000002-s2.0-434491350336.899762e-15645Why we twitter: Understanding microblogging us...Joint Ninth WebKDD and First SNA-KDD 2007 Work...
5120669.6253972-s2.0-799537117117.271520e-02468Social media? Get serious! Understanding the f...Business Horizons
18550.0000002-s2.0-673492681242.974873e-01450Social media: The new hybrid element of the pr...Business Horizons
\n", + "
" + ], + "text/plain": [ + " betweenness eid eig_cent indegree \\\n", + "2275 6393.560498 2-s2.0-71149088987 1.000000e+00 1876 \n", + "179 0.000000 2-s2.0-43449135033 6.899762e-15 645 \n", + "5120 669.625397 2-s2.0-79953711711 7.271520e-02 468 \n", + "1855 0.000000 2-s2.0-67349268124 2.974873e-01 450 \n", + "\n", + " title \\\n", + "2275 Users of the world, unite! The challenges and ... \n", + "179 Why we twitter: Understanding microblogging us... \n", + "5120 Social media? Get serious! Understanding the f... \n", + "1855 Social media: The new hybrid element of the pr... \n", + "\n", + " source_title \n", + "2275 Business Horizons \n", + "179 Joint Ninth WebKDD and First SNA-KDD 2007 Work... \n", + "5120 Business Horizons \n", + "1855 Business Horizons " + ] + }, + "execution_count": 80, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "network_stats.sort_values(\"indegree\", ascending=False).head(4)" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
betweennesseideig_centindegreetitlesource_title
22756393.5604982-s2.0-711490889871.0000001876Users of the world, unite! The challenges and ...Business Horizons
22590.0000002-s2.0-703498168880.60527970The fairyland of Second Life: Virtual social w...Business Horizons
36120.0000002-s2.0-779495225960.563979335Networked narratives: Understanding word-of-mo...Journal of Marketing
70880.0000002-s2.0-795515820370.43295136Online Personal Branding: Processes, Challenge...Journal of Interactive Marketing
\n", + "
" + ], + "text/plain": [ + " betweenness eid eig_cent indegree \\\n", + "2275 6393.560498 2-s2.0-71149088987 1.000000 1876 \n", + "2259 0.000000 2-s2.0-70349816888 0.605279 70 \n", + "3612 0.000000 2-s2.0-77949522596 0.563979 335 \n", + "7088 0.000000 2-s2.0-79551582037 0.432951 36 \n", + "\n", + " title \\\n", + "2275 Users of the world, unite! The challenges and ... \n", + "2259 The fairyland of Second Life: Virtual social w... \n", + "3612 Networked narratives: Understanding word-of-mo... \n", + "7088 Online Personal Branding: Processes, Challenge... \n", + "\n", + " source_title \n", + "2275 Business Horizons \n", + "2259 Business Horizons \n", + "3612 Journal of Marketing \n", + "7088 Journal of Interactive Marketing " + ] + }, + "execution_count": 81, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "network_stats.sort_values(\"eig_cent\", ascending=False).head(4)" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
betweennesseideig_centindegreetitlesource_title
22756393.5604982-s2.0-711490889871.000000e+001876Users of the world, unite! The challenges and ...Business Horizons
4016220.2500002-s2.0-703504918893.749870e-16103Crisis in a networked world: Features of compu...Social Science Computer Review
27815131.8246392-s2.0-848880473001.310283e-0131Social media metrics - A framework and guideli...Journal of Interactive Marketing
38214319.7475612-s2.0-849101362353.045168e-188What are health-related users tweeting? A qual...Journal of Medical Internet Research
\n", + "
" + ], + "text/plain": [ + " betweenness eid eig_cent indegree \\\n", + "2275 6393.560498 2-s2.0-71149088987 1.000000e+00 1876 \n", + "401 6220.250000 2-s2.0-70350491889 3.749870e-16 103 \n", + "2781 5131.824639 2-s2.0-84888047300 1.310283e-01 31 \n", + "3821 4319.747561 2-s2.0-84910136235 3.045168e-18 8 \n", + "\n", + " title \\\n", + "2275 Users of the world, unite! The challenges and ... \n", + "401 Crisis in a networked world: Features of compu... \n", + "2781 Social media metrics - A framework and guideli... \n", + "3821 What are health-related users tweeting? A qual... \n", + "\n", + " source_title \n", + "2275 Business Horizons \n", + "401 Social Science Computer Review \n", + "2781 Journal of Interactive Marketing \n", + "3821 Journal of Medical Internet Research " + ] + }, + "execution_count": 82, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "network_stats.sort_values(\"betweenness\", ascending=False).head(4)" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 83, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAFKFJREFUeJzt3W9sW2fdxvHLifenJSVN7eKQLZNo1kqk2tYad0CgW/6YTlQIdRVEjBdoC6PJsjGyMbHhF9MkFinSiBwJGoEgRKNIaENKKEggJFO6ogRo4ixd1bAt6ZjUqFlM7NK667LO8XleVPPT0KS1XZ/45Ob7eVWf2T5X7sTXnJ9PznFZlmUJAGCskmIHAADYi6IHAMNR9ABgOIoeAAxH0QOA4Sh6ADAcRQ8AhqPoAcBwFD0AGI6iBwDDuYu589HRUUWjUbW2tur06dN5PYfX69Xc3FyBkxWW0zM6PZ9ExkJwej7J+Rmdlq+qqiqr+xW16AOBgAKBQDEjAIDxGN0AgOEoegAwHEUPAIaj6AHAcBQ9ABiOogcAw1H0AGC4oh5HXwiz99cVbd+lP/td0fYNANniHT0AGM6Wop+fn9fTTz+taDRqx9MDAHKQ1eimt7dXY2NjKi8vV3d3d2b7+Pi4+vv7lU6n1dTUpD179kiSDh48qM9+9rP2JAYA5CSrd/T19fUKhUKLtqXTafX19SkUCikcDmtoaEjT09N67bXXdOutt2r9+vW2BAYA5Card/S1tbWKxWKLtk1NTamyslI+n0+SVFdXp5GREc3Pz+v999/X9PS0brzxRm3fvl0lJXwUAADFkvdRN4lEQh6PJ3Pb4/FocnJS3/zmNyVJhw8f1rp165Yt+UgkokgkIknq6uqS1+vNK8dsXo8qjGwzu93uvL++leD0fBIZC8Hp+STnZ3R6vuXkXfSWZV2xzeVyZf5dX19/1ccHg0EFg8HMbSed4zlb2WZ22jms/5vT80lkLASn55Ocn9Fp+bI9H33eMxWPx6N4PJ65HY/HVVFRkdNzjI6O6qc//Wm+EQAAWci76GtqajQzM6NYLKZUKqXh4eGcLyISCATU2tqabwQAQBayGt309PRoYmJCyWRSbW1tam5uVmNjo1paWtTZ2al0Oq2GhgZVV1fntPPLLyUIALBHVkXf0dGx5Ha/3y+/35/3zrmUIADYj+MeAcBwRS16PowFAPsV9eyVjG4AwH6MbgDAcIxuAMBwjG4AwHCMbgDAcBQ9ABiOGT0AGI4ZPQAYjtENABiOogcAw1H0AGA4PowFAMPxYSwAGI7RDQAYjqIHAMNR9ABgOIoeAAzHUTcAYDiOugEAwzG6AQDDUfQAYDiKHgAMR9EDgOEoegAwHEUPAIbjOHoAMBzH0QOA4RjdAIDhKHoAMBxFDwCGo+gBwHAUPQAYjqIHAMNR9ABgOIoeAAxH0QOA4Qr+l7HT09P6wx/+oGQyqTvuuEO7du0q9C4AADnIquh7e3s1Njam8vJydXd3Z7aPj4+rv79f6XRaTU1N2rNnj2699Vbt27dP6XSa89gAgANkNbqpr69XKBRatC2dTquvr0+hUEjhcFhDQ0Oanp6WdOlkZc8++6zuuOOOwicGAOQkq6Kvra1VWVnZom1TU1OqrKyUz+eT2+1WXV2dRkZGJF06Wdnzzz+vv/71r4VPDADISd4z+kQiIY/Hk7nt8Xg0OTmpEydO6B//+IdSqZS2b9++7OMjkYgikYgkqaurS16vN68cs3k9qjCyzex2u/P++laC0/NJZCwEp+eTnJ/R6fmWk3fRW5Z1xTaXy6WtW7dq69at13x8MBhUMBjM3J6bm8s3StFkm9nr9Tr663N6PomMheD0fJLzMzotX1VVVVb3y/vwSo/Ho3g8nrkdj8dVUVGR03Nw4REAsF/eRV9TU6OZmRnFYjGlUikNDw/nfBGRQCCg1tbWfCMAALKQ1eimp6dHExMTSiaTamtrU3NzsxobG9XS0qLOzk6l02k1NDSouro6p52Pjo4qGo1S9gBgo6yKvqOjY8ntfr9ffr8/751zKUEAsB+nQAAAwxW16PkwFgDsV/Bz3eSC0Q0A2I/RDQAYjtENABiO0Q0AGI7RDQAYjqIHAMMxowcAwzGjBwDDMboBAMNR9ABgOIoeAAzHh7EAYDg+jAUAwzG6AQDDUfQAYDiKHgAMR9EDgOE46gYADMdRNwBgOEY3AGA4ih4ADEfRA4DhKHoAMBxFDwCGo+gBwHAcRw8AhuM4egAwHKMbADAcRQ8AhqPoAcBwFD0AGI6iBwDDUfQAYDiKHgAMR9EDgOEoegAwnC1/GXv06FGNjY3p3Llzuu+++3TXXXfZsRsAQBayLvre3l6NjY2pvLxc3d3dme3j4+Pq7+9XOp1WU1OT9uzZ
o7vvvlt33323zp8/rwMHDlD0AFBEWY9u6uvrFQqFFm1Lp9Pq6+tTKBRSOBzW0NCQpqenM/99YGBA9913X+HSAgBylnXR19bWqqysbNG2qakpVVZWyufzye12q66uTiMjI7IsS7/61a+0bds2bdq0qeChAQDZu64ZfSKRkMfjydz2eDyanJzUH//4Rx0/flwXLlzQO++8o127dl3x2EgkokgkIknq6uqS1+vNK8NsftELItvMbrc7769vJTg9n0TGQnB6Psn5GZ2ebznXVfSWZV2xzeVyaffu3dq9e/dVHxsMBhUMBjO35+bmridKUWSb2ev1Ovrrc3o+iYyF4PR8kvMzOi1fVVVVVve7rsMrPR6P4vF45nY8HldFRUXWj+fCIwBgv+sq+pqaGs3MzCgWiymVSml4eDinC4kEAgG1trZeTwQAwDVkPbrp6enRxMSEksmk2tra1NzcrMbGRrW0tKizs1PpdFoNDQ2qrq7Oeuejo6OKRqOUPQDYKOui7+joWHK73++X3+/Pa+dcShAA7McpEADAcEUtej6MBQD72XKum2wxugEA+zG6AQDDMboBAMMxugEAwzG6AQDDUfQAYDhm9ABgOGb0AGA4RjcAYDiKHgAMR9EDgOH4MBYADMeHsQBgOEY3AGA4ih4ADEfRA4DhKHoAMBxH3QCA4TjqBgAMx+gGAAxH0QOA4Sh6ADAcRQ8AhqPoAcBwRT3qZrVb+NaXs7rfbIH3W/qz3xX4GQGYjOPoAcBwHEcPAIZjRg8AhqPoAcBwFD0AGI6iBwDDUfQAYDiKHgAMR9EDgOEoegAwHEUPAIYr+F/Gzs7OamBgQBcuXNB3v/vdQj89ACBHWb2j7+3t1cMPP3xFcY+Pj+s73/mOvv3tb+u3v/2tJMnn8+mRRx4pfFIAQF6yKvr6+nqFQqFF29LptPr6+hQKhRQOhzU0NKTp6WlbQgIA8pdV0dfW1qqsrGzRtqmpKVVWVsrn88ntdquurk4jIyO2hAQA5C/vGX0ikZDH48nc9ng8mpycVDKZ1K9//Wu9/fbbGhwc1P3337/k4yORiCKRiCSpq6tLXq83rxyFPtf7apDvWi3H7XYX/DkLjYzXz+n5JOdndHq+5eRd9JZlXbHN5XJp3bp12rdv3zUfHwwGFQwGM7fn5ubyjfI/p9Br5fV6Hb/+ZLx+Ts8nOT+j0/JVVVVldb+8D6/0eDyKx+OZ2/F4XBUVFTk9BxceAQD75V30NTU1mpmZUSwWUyqV0vDwcM4XEQkEAmptbc03AgAgC1mNbnp6ejQxMaFkMqm2tjY1NzersbFRLS0t6uzsVDqdVkNDg6qrq3Pa+ejoqKLRKGUPADbKqug7OjqW3O73++X3+/PeOZcSBAD7cQoEADBcUYueD2MBwH4FP9dNLhjdAID9GN0AgOEY3QCA4RjdAIDhGN0AgOEoegAwHDN6ADAcM3oAMByjGwAwHEUPAIaj6AHAcHwYCwCG48NYADAcoxsAMBxFDwCGo+gBwHAUPQAYrqgfxnJx8PwsfOvLBX2+2RzuW/qz3xV03wDsx1E3AGA4RjcAYDiKHgAMR9EDgOEoegAwHEUPAIaj6AHAcJy9EgAMx3H0AGA4RjcAYDiKHgAMR9EDgOEoegAwHEUPAIaj6AHAcBQ9ABiOogcAw1H0AGC4gv9l7Pz8vH7+85/L7XZr69at2rlzZ6F3AQDIQVZF39vbq7GxMZWXl6u7uzuzfXx8XP39/Uqn02pqatKePXt09OhRfeYzn1EgEFA4HKboAaDIshrd1NfXKxQKLdqWTqfV19enUCikcDisoaEhTU9PKx6Py+v1XnryEiZDAFBsWTVxbW2tysrKFm2bmppSZWWlfD6f3G636urqNDIyIo/Ho3g8LkmyLKvwiQEAOcl7Rp9IJOTxeDK3PR6PJicn9cUvflG/+MUvNDY2pk996lPLPj4SiSgSiUiSurq6Mr8F5Go2r0chXwvf+nJR9uv+/dG8f0ZWitvtdnRGp+eTipdx9v667O5nw759g8M2POtieRf9Uu/WXS6Xbr75ZrW3t1/z8cFgUMFgMHN7bm4u3yj4H5BKpRz/M+L1eh2d0en5pNWRsdCu5+utqqrK6n55D9EvH9FIUjweV0VFRU7PwYVHAMB+eRd9TU2NZmZmFIvFlEqlNDw8nPNFRAKBgFpbW/ONAADIQlajm56eHk1MTCiZTKqtrU3Nzc1qbGxUS0uLOjs7lU6n1dDQoOrqarvzAgBylFXRd3R0LLnd7/fL7/fnvfPR0VFFo1He1QOAjbhmLAAYrqh/0cSHsQBgP97RA4DhOEcBABjOZXGeAgAw2qp/R//MM88UO8I1OT2j0/NJZCwEp+eTnJ/R6fmWs+qLHgBwdRQ9ABiu9Lnnnnuu2CGu16ZNm4od4ZqcntHp+SQyFoLT80nOz+j0fEvhw1gAMByjGwAwXFH/YOp6LXXN2pU2Nzen/fv36z//+Y9cLpeCwaB2796tl19+WX/+85/10Y9+VJL0wAMPZM4LNDg4qEOHDqmkpEQPPfSQtm3bZnvORx99VDfffLNKSkpUWlqqrq4unT9/XuFwWP/+97+1ceNGPfHEEyorK5NlWerv79err76qm266Se3t7bb+unr69GmFw+HM7VgspubmZr377rtFXcOlrpWcz5odPnxYAwMDkqS9e/eqvr7e1owHDhxQNBqV2+2Wz+dTe3u7PvKRjygWi+mJJ57InMN88+bN2rdvnyTprbfe0v79+3Xx4kVt375dDz30kFwuly358nlt2PlaXypjOBzW6dOnJUkXLlzQ2rVr9cILLxRlDQvCWqUWFhasxx57zHrnnXesDz74wHrqqaesU6dOrXiORCJhnTx50rIsy7pw4YL1+OOPW6dOnbJeeukl6+DBg1fc/9SpU9ZTTz1lXbx40ZqdnbUee+wxa2Fhwfac7e3t1tmzZxdtO3DggDU4OGhZlmUNDg5aBw4csCzLsqLRqNXZ2Wml02nrjTfesL7//e/bnu9DCwsL1sMPP2zFYrGir+GJEyeskydPWk8++WRmW65rlkwmrUcffdRKJpOL/m1nxvHxcSuVSmXyfphxdnZ20f0u98wzz1hvvPGGlU6nrc7OTmtsbMy2fLl+X+1+rS+V8XIvvvii9Zvf/MayrOKsYSGs2tHNctesXWkVFRWZd25r1qzRLbfcokQisez9R0ZGVFdXpxtuuEEf+9jHVFlZqampqZWKe0WWe++9V5J07733ZtZvdHRU99xzj1wul7Zs2aJ3331XZ86cWZFMx48fV2VlpTZu3HjV3CuxhktdKznXNRsfH9edd96psrIylZWV6c4779T4+LitGe+66y6VlpZKkrZs2XLVn0dJOnPmjN577z1t2bJFLpdL99xzT8FeS0vlW85y31e7X+tXy2hZlv72t7/pc5/73FWfw841LIRVO7pZ7pq1xRSLxfSvf/1Lt99+u15//XX96U9/0pEjR7Rp0yZ94xvfUFlZmRKJhDZv3px5zIYNG675QiyUzs5OSdIXvvAFBYNBnT17NnNVsIqKCp07d07SpbW9/LqdHo9HiUQ
i5yuI5WNoaGjRi8ppa5jrmv33z+lKZpWkQ4cOqa7u/6+HGovF9L3vfU9r1qzR1772NX3yk59c8rVkd8Zcv6/Feq3/85//VHl5uT7+8Y9ntjllDXOxaoveWuaatcUyPz+v7u5uPfjgg1q7dq127dqlr3zlK5Kkl156Sb/85S/V3t6+ZO6V8IMf/EAbNmzQ2bNn9fzzz1/1WpPFWttUKqVoNKqvf/3rkuS4NbyaXNZspX5OBwYGVFpaqp07d0q69D+m3t5erVu3Tm+99ZZeeOEFdXd3r/h65vp9LeZr/b/feDhlDXO1akc3hbhmbaGkUil1d3dr586d+vSnPy1JWr9+vUpKSlRSUqKmpiadPHlyydyJREIbNmywPeOH+ygvL9eOHTs0NTWl8vLyzEjmzJkzmQ/HPB7PogsWr9Tavvrqq/rEJz6h9evXS3LeGkrKec02bNhwRdaVWMvDhw8rGo3q8ccfz5TiDTfcoHXr1km6dCy4z+fTzMzMkq8lO9cz1+9rsV7rCwsLOnr06KLfiJyyhrlatUVfiGvWFoJlWfrJT36iW265RV/60pcy2y+faR89ejRzmcVAIKDh4WF98MEHisVimpmZ0e23325rxvn5eb333nuZf7/22mu67bbbFAgE9Morr0iSXnnlFe3YsSOT8ciRI7IsS2+++abWrl1blLGNk9bwQ7mu2bZt23Ts2DGdP39e58+f17Fjx2w/ymp8fFwHDx7U008/rZtuuimz/dy5c0qn05Kk2dlZzczMyOfzqaKiQmvWrNGbb74py7J05MgRW19LuX5fi/VaP378uKqqqhaNZJyyhrla1X8wNTY2phdffDFzzdq9e/eueIbXX39dzz77rG677bbMO6cHHnhAQ0NDevvtt+VyubRx40bt27cvU5YDAwP6y1/+opKSEj344IPavn27rRlnZ2f1wx/+UNKldymf//zntXfvXiWTSYXDYc3Nzcnr9erJJ5/MHCrY19enY8eO6cYbb1R7e7tqampszfj+++/rkUce0Y9//GOtXbtWkvSjH/2oqGt4+bWSy8vL1dzcrB07duS8ZocOHdLg4KCkS4dXNjQ02JpxcHBQqVQq8wHjh4cA/v3vf9fLL7+s0tJSlZSU6Ktf/WqmjE6ePKne3l5dvHhR27ZtU0tLS0HGI0vlO3HiRM7fVztf60tlbGxs1P79+7V582bt2rUrc99irGEhrOqiBwBc26od3QAAskPRA4DhKHoAMBxFDwCGo+gBwHAUPQAYjqIHAMNR9ABguP8DaoV4MSni/p8AAAAASUVORK5CYII=\n", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "network_stats['indegree'].hist(log = True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# things to store" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "23131" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "remember('total_articles', articles.shape[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "35620" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "4807" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "3864" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# total number of citations in the sm dataset\n", + "remember('sm_citations', raw_edgelist.shape[0])\n", + "\n", + "remember('sm_citing', len(raw_edgelist[\"from\"].unique()))\n", + "\n", + "# the number of articles in the original dataset that have any INCOMING citations\n", + "remember('sm_cited', len(raw_edgelist[\"to\"].unique()))" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "212773" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "42935" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "9710" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# total number of citations in the sm dataset\n", + "remember('all_citations', combo_raw_edgelist.shape[0])\n", + "\n", + "remember('all_citing', len(combo_raw_edgelist[\"from\"].unique()))\n", + "\n", + "# the number of articles in the original dataset that have any INCOMING citations\n", + "remember('all_cited', len(combo_raw_edgelist[\"to\"].unique()))" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
eidcluster
02-s2.0-711490889871
12-s2.0-703498168881
22-s2.0-799537117111
32-s2.0-795516307511
42-s2.0-800514691031
52-s2.0-848667188511
62-s2.0-848776855511
72-s2.0-848644425471
82-s2.0-848614208641
92-s2.0-848874834871
102-s2.0-809551448471
112-s2.0-848850383091
122-s2.0-848860995691
132-s2.0-848633797831
142-s2.0-848990936631
152-s2.0-848791098591
162-s2.0-830551683091
172-s2.0-848763043221
182-s2.0-848661681471
192-s2.0-848778174281
202-s2.0-848734812561
212-s2.0-848617948971
222-s2.0-848995082981
232-s2.0-848980824651
242-s2.0-848790217741
252-s2.0-800549880411
262-s2.0-849443941181
272-s2.0-848705723011
282-s2.0-849071673201
292-s2.0-849146757211
.........
61102-s2.0-8485608683912
61112-s2.0-8485951012212
61122-s2.0-8490512120912
61132-s2.0-8488375861312
61142-s2.0-8487795310012
61152-s2.0-8490437676612
61162-s2.0-8490583718212
61172-s2.0-8490046121812
61182-s2.0-8375522878513
61192-s2.0-8488679597513
61202-s2.0-8487613278513
61212-s2.0-8490312133413
61222-s2.0-8486372040013
61232-s2.0-8487318093813
61242-s2.0-8491411283813
61252-s2.0-8487879574813
61262-s2.0-8488801166613
61272-s2.0-8494210121813
61282-s2.0-8005275211314
61292-s2.0-8487407470714
61302-s2.0-8494258223514
61312-s2.0-7084913036014
61322-s2.0-8486415263014
61332-s2.0-8486870916115
61342-s2.0-8489635001515
61352-s2.0-8494410493315
61362-s2.0-8487553950616
61372-s2.0-8490226295416
61382-s2.0-8490995448117
61392-s2.0-8492146967818
\n", + "

6140 rows × 2 columns

\n", + "
" + ], + "text/plain": [ + " eid cluster\n", + "0 2-s2.0-71149088987 1\n", + "1 2-s2.0-70349816888 1\n", + "2 2-s2.0-79953711711 1\n", + "3 2-s2.0-79551630751 1\n", + "4 2-s2.0-80051469103 1\n", + "5 2-s2.0-84866718851 1\n", + "6 2-s2.0-84877685551 1\n", + "7 2-s2.0-84864442547 1\n", + "8 2-s2.0-84861420864 1\n", + "9 2-s2.0-84887483487 1\n", + "10 2-s2.0-80955144847 1\n", + "11 2-s2.0-84885038309 1\n", + "12 2-s2.0-84886099569 1\n", + "13 2-s2.0-84863379783 1\n", + "14 2-s2.0-84899093663 1\n", + "15 2-s2.0-84879109859 1\n", + "16 2-s2.0-83055168309 1\n", + "17 2-s2.0-84876304322 1\n", + "18 2-s2.0-84866168147 1\n", + "19 2-s2.0-84877817428 1\n", + "20 2-s2.0-84873481256 1\n", + "21 2-s2.0-84861794897 1\n", + "22 2-s2.0-84899508298 1\n", + "23 2-s2.0-84898082465 1\n", + "24 2-s2.0-84879021774 1\n", + "25 2-s2.0-80054988041 1\n", + "26 2-s2.0-84944394118 1\n", + "27 2-s2.0-84870572301 1\n", + "28 2-s2.0-84907167320 1\n", + "29 2-s2.0-84914675721 1\n", + "... ... ...\n", + "6110 2-s2.0-84856086839 12\n", + "6111 2-s2.0-84859510122 12\n", + "6112 2-s2.0-84905121209 12\n", + "6113 2-s2.0-84883758613 12\n", + "6114 2-s2.0-84877953100 12\n", + "6115 2-s2.0-84904376766 12\n", + "6116 2-s2.0-84905837182 12\n", + "6117 2-s2.0-84900461218 12\n", + "6118 2-s2.0-83755228785 13\n", + "6119 2-s2.0-84886795975 13\n", + "6120 2-s2.0-84876132785 13\n", + "6121 2-s2.0-84903121334 13\n", + "6122 2-s2.0-84863720400 13\n", + "6123 2-s2.0-84873180938 13\n", + "6124 2-s2.0-84914112838 13\n", + "6125 2-s2.0-84878795748 13\n", + "6126 2-s2.0-84888011666 13\n", + "6127 2-s2.0-84942101218 13\n", + "6128 2-s2.0-80052752113 14\n", + "6129 2-s2.0-84874074707 14\n", + "6130 2-s2.0-84942582235 14\n", + "6131 2-s2.0-70849130360 14\n", + "6132 2-s2.0-84864152630 14\n", + "6133 2-s2.0-84868709161 15\n", + "6134 2-s2.0-84896350015 15\n", + "6135 2-s2.0-84944104933 15\n", + "6136 2-s2.0-84875539506 16\n", + "6137 2-s2.0-84902262954 16\n", + "6138 2-s2.0-84909954481 17\n", + "6139 2-s2.0-84921469678 18\n", + "\n", + "[6140 rows x 2 columns]" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "remember('g_sm_clusters', g_sm_clu[[\"eid\", \"cluster\"]])" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['all_citations',\n", + " 'all_cited',\n", + " 'all_citing',\n", + " 'cluster_edgelist',\n", + " 'g_sm_clusters',\n", + " 'sm_citations',\n", + " 'sm_cited',\n", + " 'sm_citing',\n", + " 'total_articles']" + ] + }, + "execution_count": 88, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sorted(r.keys())" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "metadata": {}, + "outputs": [], + "source": [ + "#save the r function to rdata file\n", + "def save_to_r(r_dict, filename=\"output.RData\"):\n", + " for var_name, x in r.items():\n", + " var_name = var_name.replace('_', '.')\n", + " if type(x) == np.int64:\n", + " x = np.asscalar(x)\n", + " \n", + " if type(x) == pd.DataFrame:\n", + " rx = pandas2ri.py2ri(x)\n", + " else:\n", + " rx = x\n", + " \n", + " robjects.r.assign(var_name, x)\n", + "\n", + " # create a new variable called in R\n", + " robjects.r(\"r <- sapply(ls(), function (x) {eval(parse(text=x))})\")\n", + " robjects.r('save(\"r\", file=\"{}\")'.format(filename))\n", + " robjects.r(\"rm(list=ls())\")\n", + " \n", + "save_to_r(r, \"../../paper/data/network_data.RData\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": 
"python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/code/bibliometrics/00_citation_network_analysis.py b/code/bibliometrics/00_citation_network_analysis.py new file mode 100644 index 0000000..037fdfd --- /dev/null +++ b/code/bibliometrics/00_citation_network_analysis.py @@ -0,0 +1,232 @@ +# coding: utf-8 +# # Import data and get things setup + +import random +random.seed(9001) + +# import code to write r modules and create our variable we'll write to +import rpy2.robjects as robjects +from rpy2.robjects import pandas2ri +pandas2ri.activate() + +r = {} +def remember(name, x): + r[name] = x + +# load in modules we'll need for analysis +import subprocess +import csv +from igraph import * +import pandas as pd +import numpy as np +import re + +# grab the largest connected compontent with a little function +def get_largest_component(g): + g_components = g.components(mode="WEAK") + max_size = max(g_components.sizes()) + for g_tmp in g_components.subgraphs(): + if g_tmp.vcount() == max_size: + return(g_tmp) + +# look the full edgelist into igraph +def edge_list_iter(df): + for i, row in df.iterrows(): + yield (row['from'], row['to']) + +# list top 5 journals for each of the clusters +def top_journals_for_clusters(clu): + articles_tmp = pd.merge(clu, articles[['eid', 'source_title']]) + + output = pd.DataFrame() + for cid in articles_tmp['cluster'].unique(): + journal_counts = articles_tmp['source_title'][articles_tmp['cluster'] == cid].value_counts().head(5) + tmp = pd.DataFrame({'cluster' : cid, 'count' : journal_counts }) + output = output.append(tmp) + + output = output.reset_index() + output = output.rename(columns = {'index' : "journal"}) + return(output) + +def infomap_edgelist(g, edgelist_filename, directed=True): + nodes_tmp = pd.DataFrame([ {'node_infomap' : v.index, + 'eid' : v['name']} for v in g.vs ]) + + # write out the edgelist to an external file so we can call infomap on it + with open("code/bibliometrics/" + edgelist_filename + ".txt", 'w') as f: + for e in g.es: + if e.source != e.target: + if 'weight' in e.attributes(): + print("{}\t{}\t{}".format(e.source, e.target, e['weight']), file=f) + else: + print("{}\t{}".format(e.source, e.target), file=f) + + + # run the external program to generate the infomap clustering + infomap_cmdline = ["code/bibliometrics/infomap/Infomap", "code/bibliometrics/" + edgelist_filename + ".txt", "code/bibliometrics/output_dir -z --map --clu --tree"] + if directed: + infomap_cmdline.append("-d") + subprocess.call(infomap_cmdline) + + # load up the clu data + clu = pd.read_csv("code/bibliometrics/output_dir/" + edgelist_filename + ".clu", + header=None, comment="#", delim_whitespace=True) + clu.columns = ['node_infomap', 'cluster', 'flow'] + + return pd.merge(clu, nodes_tmp, on="node_infomap") + + +def write_graphml(g, clu, graphml_filename): + clu = clu[['node_infomap', 'cluster']].sort_values('node_infomap') + g.vs["cluster"] = clu["cluster"].tolist() + g.write_graphml("code/bibliometrics/" + graphml_filename) + + +# load article data +articles = pd.read_csv("processed_data/abstracts.tsv", delimiter="\t") + +# # network for just the central "social media" set + +# this contains the list of all INCOMING citations to for paper in the original set +raw_edgelist 
= pd.read_csv("processed_data/social_media_edgelist.txt", delimiter="\t") + +g_sm_all = Graph.TupleList([i for i in edge_list_iter(raw_edgelist)], directed=True) + + +g_sm = get_largest_component(g_sm_all) +g_sm = g_sm.simplify() + +g_sm_clu = infomap_edgelist(g_sm, "sm_edgelist_infomap", directed=True) + +g_sm_clu['cluster'].value_counts() + +write_graphml(g_sm, g_sm_clu, "g_sm.graphml") + + +# # larger network that contains the incoming cites to citing articles + +# this contains the list of all INCOMING citations to everything in the original set +# plus every INCOMING citation to every paper that cites one of those papers +raw_edgelist_files = ["processed_data/citation_edgelist.txt", + "processed_data/social_media_edgelist.txt"] +combo_raw_edgelist = pd.concat([pd.read_csv(x, delimiter="\t") for x in raw_edgelist_files]) + + +g_full_all = Graph.TupleList([i for i in edge_list_iter(combo_raw_edgelist)], directed=True) + +g_full = get_largest_component(g_full_all) +g_full = g_full.simplify() + + +g_full_clu = infomap_edgelist(g_full, "citation_edglist_infomap", directed=True) + + +g_full_clu['cluster'].value_counts() + +top_journals_for_clusters(g_full_clu) + +write_graphml(g_full, g_full_clu, "g_full.graphml") + + +# # create the meta-network of connections between clusters + +edgelist_tmp = pd.merge(raw_edgelist, g_sm_clu[["eid", "cluster"]], how="inner", left_on="to", right_on="eid") +edgelist_tmp = edgelist_tmp.rename(columns={'cluster' : 'to_cluster'}) +edgelist_tmp.drop('eid', 1, inplace=True) + +edgelist_tmp = pd.merge(edgelist_tmp, g_sm_clu[["eid", "cluster"]], how="inner", left_on="from", right_on="eid") +edgelist_tmp = edgelist_tmp.rename(columns={"cluster" : 'from_cluster'}) +edgelist_tmp.drop('eid', 1, inplace=True) + +edgelist_tmp = edgelist_tmp[["to_cluster", "from_cluster"]] +edgelist_tmp = edgelist_tmp[edgelist_tmp["to_cluster"] != edgelist_tmp["from_cluster"]] + +cluster_edgelist = pd.crosstab(edgelist_tmp["to_cluster"], edgelist_tmp["from_cluster"]) +cluster_edgelist["to_cluster"] = cluster_edgelist.index + +cluster_edgelist = pd.melt(cluster_edgelist, id_vars=["to_cluster"]) +cluster_edgelist = cluster_edgelist[cluster_edgelist['to_cluster'] != cluster_edgelist['from_cluster']] + +remember("cluster_edgelist", cluster_edgelist) + +top_clusters = g_sm_clu["cluster"].value_counts().head(6).index + +# write the edgelist for the total number of clusters (currently 1-6) +cluster_edgelist_output = cluster_edgelist[(cluster_edgelist["to_cluster"].isin(top_clusters)) & + (cluster_edgelist["from_cluster"].isin(top_clusters))] + +cluster_edgelist_output = cluster_edgelist_output[cluster_edgelist_output["value"] > 0] + +g_cluster = Graph.TupleList([tuple(x) for x in cluster_edgelist_output[["from_cluster", "to_cluster"]].values], directed=True) +g_cluster.es["weight"] = cluster_edgelist_output["value"].tolist() + +# assign the number of total articles as an attribute for each node +g_cluster.vs["papers"] = g_sm_clu["cluster"].value_counts()[[x["name"] for x in g_cluster.vs]].tolist() + +g_cluster.write_graphml("code/bibliometrics/clusters.graphml") + +# # create network stats for tables (overall and within clusters) + +def create_network_stats(g): + network_stats = pd.DataFrame({'eid' : g.vs['name'], + 'eig_cent' : g.eigenvector_centrality(), + 'indegree' : g.indegree(), + 'betweenness' : g.betweenness()}) + + network_stats = pd.merge(network_stats, + articles[['eid', 'title', 'source_title']], + how="inner") + return network_stats + +network_stats = create_network_stats(g_full) + 
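+# Quick sanity check: the next three calls preview the four most prominent
+# articles by indegree, eigenvector centrality, and betweenness.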
+network_stats.sort_values("indegree", ascending=False).head(4) + +network_stats.sort_values("eig_cent", ascending=False).head(4) + +network_stats.sort_values("betweenness", ascending=False).head(4) + +# # things to store +remember('total_articles', articles.shape[0]) + +# total number of citations in the sm dataset +remember('sm_citations', raw_edgelist.shape[0]) + +remember('sm_citing', len(raw_edgelist["from"].unique())) + +# the number of articles in the original dataset that have any INCOMING citations +remember('sm_cited', len(raw_edgelist["to"].unique())) + +# total number of citations in the sm dataset +remember('all_citations', combo_raw_edgelist.shape[0]) + +remember('all_citing', len(combo_raw_edgelist["from"].unique())) + +# the number of articles in the original dataset that have any INCOMING citations +remember('all_cited', len(combo_raw_edgelist["to"].unique())) + +remember('g_sm_clusters', g_sm_clu[["eid", "cluster"]]) + +sorted(r.keys()) + +#save the r function to rdata file +def save_to_r(r_dict, filename="output.RData"): + for var_name, x in r.items(): + var_name = var_name.replace('_', '.') + if type(x) == np.int64: + x = np.asscalar(x) + + if type(x) == pd.DataFrame: + rx = pandas2ri.py2ri(x) + else: + rx = x + + robjects.r.assign(var_name, x) + + # create a new variable called in R + robjects.r("r <- sapply(ls(), function (x) {eval(parse(text=x))})") + robjects.r('save("r", file="{}")'.format(filename)) + robjects.r("rm(list=ls())") + +save_to_r(r, "paper/data/network_data.RData") + diff --git a/code/bibliometrics/clusters.gephi b/code/bibliometrics/clusters.gephi new file mode 100644 index 0000000..207a267 Binary files /dev/null and b/code/bibliometrics/clusters.gephi differ diff --git a/code/bibliometrics/g_sm.gephi b/code/bibliometrics/g_sm.gephi new file mode 100644 index 0000000..2d965a9 Binary files /dev/null and b/code/bibliometrics/g_sm.gephi differ diff --git a/code/data_collection/00_get_search_results.py b/code/data_collection/00_get_search_results.py new file mode 100644 index 0000000..be4b5cd --- /dev/null +++ b/code/data_collection/00_get_search_results.py @@ -0,0 +1,24 @@ +import argparse +from request_functions import * + +''' +This script takes in a search query and an output file. It queries the scopus API to find all papers that match the search query, and saves them to the output file. + +Unlike some of the other scripts in this directory, it does not try to determine the state - if you restart the script, it will start over and blow away whatever you had saved before. 
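+
+Example invocation (the query string and output path below are illustrative
+only, not necessarily the exact ones used for the chapter):
+
+    python code/data_collection/00_get_search_results.py -q 'TITLE-ABS-KEY("social media")' -o raw_data/search_results.json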
+''' + +years = range(2004, 2017) + +def main(): + + parser = argparse.ArgumentParser(description='Output JSON of all articles matching search query') + parser.add_argument('-q', help='Search query', required=True) + parser.add_argument('-o', help='Where to append JSON results') + args = parser.parse_args() + + with open(args.o, 'w') as out_file: + for year in years: + get_search_results(args.q, out_file, year=year) + +if __name__ == '__main__': + main() diff --git a/code/data_collection/01_get_abstracts.py b/code/data_collection/01_get_abstracts.py new file mode 100644 index 0000000..0548fe1 --- /dev/null +++ b/code/data_collection/01_get_abstracts.py @@ -0,0 +1,56 @@ +from request_functions import * +import argparse +import json +import subprocess + + +def main(): + + parser = argparse.ArgumentParser(description='Output JSON of abstracts and bibliography of all articles passed in.') + parser.add_argument('-i', help='JSON file which includes eids') + parser.add_argument('--eid', '-e', help='Single eid') + parser.add_argument('-o', help='Where to append JSON results') + args = parser.parse_args() + + if args.eid: + eids = [args.eid] + elif args.i: + with open(args.i, 'r') as f: + eids = [json.loads(line)['eid'] for line in f] + else: + print('Need to either pass in an eid or a json file with eids') + + # If the script gets interrupted, we need to start where we left off + try: + errors = [] + with open(args.o, 'r') as f: + completed_eids = [] + for line in f: + try: + result = json.loads(line) + completed_eids.append(result['abstracts-retrieval-response']['coredata']['eid']) + except ValueError: + errors.append(line) + except IOError as e: + completed_eids = [] + + + print('{} completed eids'.format(len(completed_eids))) + with open(args.o, 'a') as out_file: + for eid in eids: + if eid not in completed_eids: + result = get_abstract(eid) + if result: + out_file.write(result) + out_file.write('\n') + else: + errors.append(eid) + + if len(errors) > 0: + with open('raw_data/missing_eids.json', 'a') as l: + # Add the bad lines from the output file + (l.write(e) for e in errors) + + +if __name__ == '__main__': + main() diff --git a/code/data_collection/02_get_cited_by.py b/code/data_collection/02_get_cited_by.py new file mode 100644 index 0000000..3f38c66 --- /dev/null +++ b/code/data_collection/02_get_cited_by.py @@ -0,0 +1,43 @@ +from request_functions import * +import argparse +import json +import subprocess +from os import remove + +def main(): + + parser = argparse.ArgumentParser(description='Output JSON of all articles which cite the articles passed in') + parser.add_argument('-i', help='JSON file which includes eids and citedby-count') + parser.add_argument('-o', help='Where to append JSON results') + args = parser.parse_args() + + with open(args.i, 'r') as f: + # Make a dictionary of eid:citation count for each line in the file + eids = {} + for line in f: + l = json.loads(line) + eids[l['eid']] = l['citedby-count'] + + # If the script gets interrupted, we need to start where we left off + try: + # Open the output file, and grab all of the eids which are already completed + with open(args.o, 'r') as f: + completed_eids = [json.loads(l)['parent_eid'] for l in f] + # Remove those which came from the last id (since we may have missed some) + if len(completed_eids) > 0: + last_eid = completed_eids.pop() + # Remove all of the lines which came from the last eid + subprocess.call(['sed', '-i.bak', '/parent_eid": "{}/d'.format(last_eid), args.o]) + # Hopefully everything has worked out, because 
here we blow away the backup + remove('{}.bak'.format(args.o)) + except IOError: + # If the file doesn't exist, then there aren't any completed eids + completed_eids = [] + + with open(args.o, 'a') as out_file: + for eid, citation_count in eids.items(): + if citation_count != '0' and eid not in completed_eids: + get_cited_by(eid, out_file) + +if __name__ == '__main__': + main() diff --git a/code/data_collection/request_functions.py b/code/data_collection/request_functions.py new file mode 100644 index 0000000..0d98760 --- /dev/null +++ b/code/data_collection/request_functions.py @@ -0,0 +1,166 @@ +import requests +from datetime import datetime +from scopus_api import key as API_KEY +import json +import os +import logging +import re + +logging.basicConfig(level=logging.DEBUG) + +RETRY_COUNT = 5 +TIMEOUT_SECS = 10 + +# Initialize a global session object +s = requests.Session() +s.headers.update({'X-ELS-APIKey' : API_KEY, + 'X-ELS-ResourceVersion' : 'XOCS', + 'Accept' : 'application/json'}) + +def get_token(location_id = None): + '''Given a location_id, gets an authentication token''' + print('Getting a token') + api_resource = 'http://api.elsevier.com/authenticate' + # Parameters + payload = {'platform':'SCOPUS', + 'choice': location_id} + r = s.get(api_resource, params = payload) + r.raise_for_status() + s.headers['X-ELS-AuthToken'] = r.json()['authenticate-response']['authtoken'] + +def get_search_results(query, output_file, results_per_call = 200, + tot_results=None, year=None, sort='+title', citation_call=False): + '''Handles getting search results. Takes a query and an output + file. Writes as many of the search results as possible to the + output file as JSON dictionaries, one per line.''' + result_set = [] + results_added = 0 + def curr_call(start=0, count=results_per_call): + '''Shorthand for the current call: DRY''' + return make_search_call(query, start=start, + count=count, year=year, sort=sort) + if tot_results == None: + # Call the API initially to figure out how many results there are, and write the results + initial_results = curr_call(count=results_per_call) + tot_results = int(initial_results['search-results']['opensearch:totalResults']) + result_set.append((initial_results, sort)) + results_added += results_per_call + logging.debug("Total results: {}".format(tot_results)) + + if tot_results == 0: + return None + if tot_results > 5000: + # If this is just one year, we can't get any more granular, and + # we need to return what we can. + if tot_results > 10000: + print("{} results for {}. We can only retrieve 10,000".format(tot_results, year)) + first_half = last_half = 5000 + else: + # Get half, and correct for odd # of results + first_half = tot_results//2 + tot_results % 2 + last_half = tot_results//2 + # Break the search into the first half and the bottom half of results. + get_search_results(query, output_file, + year = year, + tot_results=first_half) + # Get the other half + get_search_results(query, output_file, + year = year, + tot_results = last_half, sort='-title') +# If there are 5000 or fewer to retrieve, then get them + else: + logging.debug('Retrieving {} results'.format(tot_results)) + # As long as there are more citations to retrieve, then do it, and write + # them to the file + while results_added < tot_results: + # If we are near the end, then only get as many results as are left. 
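+            # (For example, with 450 results still to fetch and 200 per call,
+            # the final call requests only the remaining 50.)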
+ to_retrieve = min(results_per_call, (tot_results - results_added)) + curr_results = curr_call(start=results_added, count=to_retrieve) + result_set.append((curr_results, sort)) + results_added += results_per_call + # This is hacky, but I'm doing it + # If this is a citation call, then construct metadata to be written with the result + if citation_call: + metadata = {'parent_eid': re.match(r'refeid\((.*)\)', query).group(1)} + else: + metadata = {} + write_results(result_set, output_file, metadata) + +def write_results(result_set, output_file, metadata={}): + for x in result_set: + search_json = x[0] + to_reverse = x[1].startswith('-') + try: + results = [x for x in search_json['search-results']['entry']] + except KeyError: + raise + if to_reverse: + results = results[::-1] + for x in results: + for k, v in metadata.items(): + x[k] = v + json.dump(x, output_file) + output_file.write('\n') + + +def make_search_call(query, start=0, count=200, + sort='+title', year=None, + retry_limit = RETRY_COUNT, + timeout_secs = TIMEOUT_SECS): + api_resource = "https://api.elsevier.com/content/search/scopus" + # Parameters + payload = {'query':query, + 'count':count, + 'start':start, + 'sort': sort, + 'date': year} + for _ in range(retry_limit): + try: + r = s.get(api_resource, + params = payload, + timeout = timeout_secs) + logging.debug(r.url) + if r.status_code == 401: + get_token() + continue + if r.status_code == 400: + raise requests.exceptions.HTTPError('Bad request; possibly you aren\'t connected to an institution with Scopus acces?') + break + except requests.exceptions.Timeout: + pass + else: + raise requests.exceptions.Timeout('Timeout Error') + + r.raise_for_status() + return r.json() + + +def get_cited_by(eid, output_file): + return get_search_results('refeid({})'.format(eid), output_file, results_per_call=200, + citation_call = True) + + +def get_abstract(eid, retry_limit = RETRY_COUNT, + timeout_secs = TIMEOUT_SECS): + api_resource = "http://api.elsevier.com/content/abstract/eid/{}".format(eid) + # Parameters + payload = {} + for _ in range(retry_limit): + try: + r = s.get(api_resource, + params = payload, + timeout = timeout_secs) + if r.status_code == 401: + get_token() + continue + if r.status_code == 400: + raise requests.exceptions.HTTPError('Bad request; possibly you aren\'t connected to an institution with Scopus acces?') + break + except requests.exceptions.Timeout: + pass + else: + raise requests.exceptions.Timeout('Timeout Error') + if r.status_code == 404: + return None + r.raise_for_status() + return r.content.decode('utf-8') diff --git a/code/data_collection/scopus_api.py b/code/data_collection/scopus_api.py new file mode 100644 index 0000000..2b29810 --- /dev/null +++ b/code/data_collection/scopus_api.py @@ -0,0 +1 @@ +key = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' diff --git a/code/data_processing/00_abstracts_to_tsv.py b/code/data_processing/00_abstracts_to_tsv.py new file mode 100644 index 0000000..6a936b7 --- /dev/null +++ b/code/data_processing/00_abstracts_to_tsv.py @@ -0,0 +1,177 @@ +from collections import Counter +from datetime import datetime +import json +import argparse +import csv +import random + +random.seed(2017) + +def main(): + + parser = argparse.ArgumentParser(description='Change a big ugly abstract file to a nice CSV') + parser.add_argument('-i', help='Abstract file') + parser.add_argument('-o', help='TSV output file') + args = parser.parse_args() + + with open(args.i, 'r') as i: + with open(args.o, 'w') as o: + # Have to get the field names + first_line = 
clean_abstract(json.loads(next(i))) + fieldnames = first_line.keys() + output = csv.DictWriter(o, fieldnames, delimiter='\t') + output.writeheader() + output.writerow(first_line) + for line in i: + output.writerow(clean_abstract(json.loads(line))) + + +def clean_abstract(json_response): + result = json_response['abstracts-retrieval-response'] + head = result['item']['bibrecord']['head'] + try: + attributes = { + 'modal_country': get_country(head), + 'abstract' : get_abstract(result), + 'title' : get_title(result), + 'source_title': get_source_title(head), + 'language': result['language']['@xml:lang'], + 'first_ASJC_subject_area': get_subject(result, '$'), + 'first_ASJC_classification': get_subject(result, '@code'), + 'first_CPX_class': get_CPX_class(head, 'classification-description'), + 'date': to_date(result['coredata']['prism:coverDate']), + 'aggregation_type' : if_exists('prism:aggregationType',result['coredata'],else_val='NA'), + 'eid' : result['coredata']['eid'], + 'cited_by_count': result['coredata']['citedby-count'], + 'num_citations': get_citation_count(result) + } + except KeyError: + raise + except TypeError: + # print(result) + raise + return attributes + +def get_citation_count(result): + try: + return result['item']['bibrecord']['tail']['bibliography']['@refcount'] + except TypeError: + return None + +def get_title(result): + try: + return result['coredata']['dc:title'] + except KeyError: + raise + + +def get_source_title(head): + try: + return head['source']['sourcetitle'] + except KeyError: + raise + +def get_abstract(result): + try: + abstract = result['coredata']['dc:description'] + abstract = abstract.replace('\n',' ') + return abstract + except KeyError: + return None + +def get_auth_names(head): + try: + auth_info = [x['author'] for x in make_list(head['author-group'])] + except KeyError: + print(head) + auth_names = [] + for auth_group in auth_info: + for auth in make_list(auth_group): + auth_names.append('{} {}'.format( + auth['preferred-name']['ce:given-name'], + auth['preferred-name']['ce:surname'])) + return auth_names + +def get_country(head): + all_countries = get_aff_info(head, 'country') + if all_countries: + # Find the mode. If there's more than one, choose randomly + modes = Counter + s = set(all_countries) + max_count = max([all_countries.count(x) for x in s]) + modes = [x for x in s if all_countries.count(x) == max_count] + return random.choice(modes) + +def get_aff_info(head, affiliation_key): + aff_info = [] + try: + authors = make_list(head['author-group']) + except KeyError: + return None + for x in authors: + try: + num_auth = len(make_list(x['author'])) + except KeyError: + # Apparently there are things called "collaborations", which don't have affiliation info. + # I'm just skipping them + continue + except TypeError: + # And apparently "None" appears in the author list for no reason. :) + continue + try: + curr_inst = x['affiliation'][affiliation_key] + # Add one instance for each author from this institution + aff_info += [curr_inst] * num_auth + except KeyError: + # If there isn't affiliation info for these authors, return empty str + aff_info += [''] * num_auth + return aff_info + +def get_keywords(head): + cite_info = head['citation-info'] + try: + keywords = [x for x in + make_list(cite_info['author-keywords']['author-keyword'])] + # When there's only one keyword, it's a string. 
Otherwise, we will + # have a list of dictionaries + if len(keywords) == 1: + return keywords + else: + return [x['$'] for x in keywords] + except KeyError: + return None + +def get_subject(result, key): + try: + return [x[key] for x in make_list(result['subject-areas']['subject-area'])][0] + except KeyError: + print(result) + raise + +def get_CPX_class(head, class_key): + try: + for x in head['enhancement']['classificationgroup']['classifications']: + if x['@type'] == 'CPXCLASS': + try: + return [y[class_key] for y in make_list(x['classification'])][0] + except (KeyError, TypeError): + return None + except KeyError: + print(head['enhancement']['classificationgroup']) + raise + +def to_date(date_string): + return datetime.strptime(date_string, '%Y-%m-%d') + + +def if_exists(key, dictionary, else_val = None): + try: + return dictionary[key] + except KeyError: + return else_val + +def make_list(list_or_dict): + return list_or_dict if isinstance(list_or_dict, list) else [list_or_dict] + +if __name__ == '__main__': + main() diff --git a/code/data_processing/01_cited_by_to_edgelist.py b/code/data_processing/01_cited_by_to_edgelist.py new file mode 100644 index 0000000..51e5b5c --- /dev/null +++ b/code/data_processing/01_cited_by_to_edgelist.py @@ -0,0 +1,25 @@ +from datetime import datetime +import json +import argparse +import csv + + +def main(): + + parser = argparse.ArgumentParser(description='Make a citation network from the cited_by json') + parser.add_argument('-i', help='Cited_by file') + parser.add_argument('-o', help='TSV output file') + args = parser.parse_args() + + with open(args.i, 'r') as i: + with open(args.o, 'w') as o: + output = csv.writer(o, delimiter = '\t') + output.writerow(['to','from', 'date']) + for line in i: + line = json.loads(line) + output.writerow([line['parent_eid'], line['eid'], line['prism:coverDate']]) + + +if __name__ == '__main__': + main() + diff --git a/code/data_processing/02_filter_edgelist.py b/code/data_processing/02_filter_edgelist.py new file mode 100644 index 0000000..b1a64b2 --- /dev/null +++ b/code/data_processing/02_filter_edgelist.py @@ -0,0 +1,29 @@ +import argparse +import csv + + +def main(): + + parser = argparse.ArgumentParser(description='Take the edgelist, and reduce it to just the papers which are in our search') + parser.add_argument('-i', help='Full edgelist file') + parser.add_argument('-o', help='Edgelist output file') + args = parser.parse_args() + + with open(args.i, 'r') as in_file: + i = csv.reader(in_file, delimiter= '\t') + next(i) # Discard header + # Get the list of nodes to keep + nodes = set([x[0] for x in i]) + in_file.seek(0) # Start over at the beginning + with open(args.o, 'w') as o: + output = csv.writer(o, delimiter = '\t') + output.writerow(['to','from', 'date']) + for line in i: + # If the both items are in nodes, then keep the line + if line[1] in nodes: + output.writerow(line) + + +if __name__ == '__main__': + main() + diff --git a/code/data_processing/03_make_paper_aff_table.py b/code/data_processing/03_make_paper_aff_table.py new file mode 100644 index 0000000..dd26a55 --- /dev/null +++ b/code/data_processing/03_make_paper_aff_table.py @@ -0,0 +1,62 @@ +import json +import argparse +import csv + +def main(): + + parser = argparse.ArgumentParser(description='Generate paper to affiliation mapping file from abstracts file') + parser.add_argument('-i', help='Abstract file') + parser.add_argument('-o', help='TSV output file') + args = parser.parse_args() + + with open(args.i, 'r') as i: + with open(args.o, 'w') as 
o: + output = csv.writer(o, delimiter='\t') + output.writerow(['paper_eid','affiliation_id', + 'organization','country']) + for line in i: + entries = get_entries(line) + for entry in entries: + output.writerow(entry) + + +def get_entries(l): + json_response = json.loads(l) + full = json_response['abstracts-retrieval-response'] + head = full['item']['bibrecord']['head'] + eid = full['coredata']['eid'] + countries = get_aff_info(head, 'country') + affiliation_ids = get_aff_info(head, '@afid') + org_names = get_aff_info(head, 'organization') + if countries: + result = [[eid, affiliation_ids[i], org_names[i], countries[i]] + for i in range(len(countries))] + return result + return [] + +def get_aff_info(head, affiliation_key): + aff_info = [] + try: + affiliations = make_list(head['author-group']) + except KeyError: + return None + for x in affiliations: + if x is None: + continue + try: + curr_inst = x['affiliation'][affiliation_key] + # May return a string or a list. If it's a list, then + # return the final value of that list (This is the base organization) + if isinstance(curr_inst, list): + curr_inst = [x['$'] for x in curr_inst][-1] + aff_info.append(curr_inst) + except KeyError: + # If there isn't affiliation info for these authors, return empty str + aff_info.append('') + return aff_info + +def make_list(list_or_dict): + return list_or_dict if isinstance(list_or_dict, list) else [list_or_dict] + +if __name__ == '__main__': + main() diff --git a/code/data_processing/04_make_paper_subject_table.py b/code/data_processing/04_make_paper_subject_table.py new file mode 100644 index 0000000..25b9c99 --- /dev/null +++ b/code/data_processing/04_make_paper_subject_table.py @@ -0,0 +1,50 @@ +import json +import argparse +import csv + +def main(): + + parser = argparse.ArgumentParser(description='Generate paper to subject mapping file from abstracts file') + parser.add_argument('-i', help='Abstract file') + parser.add_argument('-o', help='TSV output file') + args = parser.parse_args() + + with open(args.i, 'r') as i: + with open(args.o, 'w') as o: + output = csv.writer(o, delimiter='\t') + output.writerow(['paper_eid','subject', + 'subject_code']) + for line in i: + entries = get_entries(line) + for entry in entries: + output.writerow(entry) + + +def get_entries(l): + json_response = json.loads(l) + full = json_response['abstracts-retrieval-response'] + eid = full['coredata']['eid'] + subjects = get_subjects(full) + # Prepend the eid, and return the subjects + return [[eid,s[0],s[1]] for s in subjects] + return [] + + +def get_subjects(abstract_response): + try: + subject_info = make_list(abstract_response['subject-areas']['subject-area']) + except KeyError: + print(result) + raise + result = [] + for s in subject_info: + # Get the subject name and code, and append them + result.append([s['$'],s['@code']]) + return result + + +def make_list(list_or_dict): + return list_or_dict if isinstance(list_or_dict, list) else [list_or_dict] + +if __name__ == '__main__': + main() diff --git a/code/data_processing/05_save_descriptives.R b/code/data_processing/05_save_descriptives.R new file mode 100644 index 0000000..0202cf1 --- /dev/null +++ b/code/data_processing/05_save_descriptives.R @@ -0,0 +1,17 @@ +df = read.csv('processed_data/abstracts.tsv',sep='\t', strip.white=TRUE) +df['date'] = as.Date(df$date) +df$modal_country[df['modal_country'] == ''] <- NA +df['year'] = format(df['date'],'%Y') + +abstracts <- df[df['abstract'] != '',c('eid','abstract')] +# Creates a vector of word counts, based on counting 
all of the groups of alphanumeric characters +word_count <- apply(abstracts, 1, function(x) sapply(gregexpr("[[:alnum:]]+", x['abstract']), function(x) sum(x > 0))) + +s = read.csv('processed_data/paper_subject_table.tsv', sep='\t') +full <- merge(df,s, by.x = 'eid', by.y = 'paper_eid') + +# zero these out before we save them so we don't save all of the abstracts. +full['abstract'] <- NULL +df['abstract'] <- NULL + +save(df, abstracts, s, full, word_count, file="paper/data/orig_data_sets.RData") diff --git a/code/data_processing/make_network.py b/code/data_processing/make_network.py new file mode 100644 index 0000000..1dcbd54 --- /dev/null +++ b/code/data_processing/make_network.py @@ -0,0 +1,26 @@ +'''Takes a CSV of retrieved articles, and creates an igraph +network from them (not even close to done)''' + +class CitationNetwork(igraph.Graph): + def __init__(self, network_type): + super().__init__(directed=True) + self.temp_edges = [] + self.temp_vertices = [] + self.network_type = network_type + + def add_vertices(self, to_node, from_nodes): + self.temp_vertices += [[from_node, to_node] for from_node in from_nodes] + + def make_network(self): + # Get the unique set of nodes, and add them. + nodes = set([v for v in self.temp_vertices if v['eid'] not in self.vs['name']]) + nodes = sorted(nodes) + self.add_vertices(nodes) + self.add_edges(self.temp_edges) + self.es['weight'] = 1 + + def collapse_weights(self): + self.simplify(combine_edges={"weight": "sum"}) + + def add_citations(eid, citations): + self.retrieved_eids.append(eid) diff --git a/code/prediction/00_ngram_extraction.py b/code/prediction/00_ngram_extraction.py new file mode 100644 index 0000000..4be1db4 --- /dev/null +++ b/code/prediction/00_ngram_extraction.py @@ -0,0 +1,89 @@ +from time import time + +from sklearn.feature_extraction.text import CountVectorizer +import csv +import argparse + +n_features = 100000 # Gets the top n_features terms +n_samples = None # Enter an integer here for testing, so it doesn't take so long + +def main(): + + parser = argparse.ArgumentParser(description='Take in abstracts, output CSV of n-gram counts') + parser.add_argument('-i', help='Location of the abstracts file', + default='processed_data/abstracts.tsv') + parser.add_argument('-o', help='Location of the output file', + default='processed_data/ngram_table.csv') + parser.add_argument('-n', type=int, help='Gets from 1 to n ngrams', + default=3) + + args = parser.parse_args() + + print("Loading dataset...") + t0 = time() + doc_ids, data_samples = get_ids_and_abstracts(args.i, n_samples) + print("done in %0.3fs." % (time() - t0)) + + # Write the header + write_header(args.o) + + bags_o_words = get_counts(data_samples, n_features, args.n) + write_output(doc_ids, bags_o_words, args.o) + +def get_counts(abstracts, n_features, ngram_max): + tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, + max_features=n_features, + stop_words='english', + ngram_range = (1,ngram_max)) + t0 = time() + tf = tf_vectorizer.fit_transform(abstracts) + print("done in %0.3fs." % (time() - t0)) + + terms = tf_vectorizer.get_feature_names() + freqs = tf.toarray() + bags_o_words = to_bags_o_words(terms, freqs) + return bags_o_words + + +def write_header(out_file): + with open(out_file, 'w') as o_f: + out = csv.writer(o_f) + out.writerow(['document_id','term','frequency']) + +def to_bags_o_words(terms, freqs): + '''Takes in the vectorizer stuff, and returns a list of dictionaries, one for each document. + The format of the dictionaries is term:count within that document. 
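+    For example, one entry might look like {'social': 3, 'network analysis': 1} (hypothetical counts).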
+ ''' + result = [] + for d in freqs: + curr_result = {terms[i]:val for i,val in enumerate(d) if val > 0 } + result.append(curr_result) + return result + +def write_output(ids, bags_o_words, out_file): + with open(out_file, 'a') as o_f: + out = csv.writer(o_f) + for i, doc in enumerate(bags_o_words): + for k,v in doc.items(): + # For each term and count, output a row, together with the document id + out.writerow([ids[i],k,v]) + +def get_ids_and_abstracts(fn, length_limit): + with open(fn, 'r') as f: + in_csv = csv.DictReader(f, delimiter='\t') + abstracts = [] + ids = [] + i = 1 + for r in in_csv: + try: + abstracts.append(r['abstract']) + ids.append(r['eid']) + except KeyError: + print(r) + if length_limit and i > length_limit: + break + i += 1 + return ids, abstracts + +if __name__ == '__main__': + main() diff --git a/code/prediction/01-build_control_variables.R b/code/prediction/01-build_control_variables.R new file mode 100644 index 0000000..0127cdf --- /dev/null +++ b/code/prediction/01-build_control_variables.R @@ -0,0 +1,89 @@ +source("code/prediction/utils.R") + +# use this to store things for use in the paper +pred.descrip <- NULL + +abstracts <- read.delim("processed_data/abstracts.tsv", header=TRUE, + stringsAsFactors=FALSE, sep="\t") + +abstracts <- subset(abstracts, select = -abstract) + +abstracts <- abstracts[abstracts$aggregation_type != "Trade Journal" & + is.na(abstracts$aggregation_type) == FALSE, ] + +names(abstracts)[names(abstracts) == 'num_citations'] <- 'works_cited' +abstracts$works_cited[is.na(abstracts$works_cited) == TRUE] <- 0 + +# affiliations +affiliations <- read.delim("processed_data/paper_aff_table.tsv", + header=TRUE, stringsAsFactors=FALSE, + sep="\t") + +# eliminate missing values +affiliations <- affiliations[!is.na(affiliations$affiliation_id) & + affiliations$organization != "", ] + + +remap.affiliations <- function(aff.id, + aff.df = affiliations){ + org.modal <- names(tail(sort(table(affiliations$organization[ + affiliations$affiliation_id == aff.id])),1)) + return(org.modal) +} + +affiliations$organization <- sapply(affiliations$affiliation_id, remap.affiliations) + +affiliations <- subset(affiliations, select = c(paper_eid, + organization)) +names(affiliations) <- c("eid", "affiliation") + +# need to remove repeat affiliations +affiliations <- affiliations[duplicated(affiliations$eid) == FALSE,] + + +###################################### +d <- abstracts[, c("eid", "language", "modal_country", + "source_title", "works_cited")] + +# dichotomous dependent variable +d$cited <- abstracts$cited_by_count > 0 + + +# store this here for use in the paper before we run any restrictions: +pred.descrip$cited <- d$cited +pred.descrip$cites <- abstracts$cited_by_count + + +# We want these to be categorical variables +d$modal_country <- factor(d$modal_country) +d$language <- factor(d$language) +d$subject <- factor(abstracts$first_ASJC_subject_area) +d$source_title <- factor(d$source_title) +d$month <- factor(strftime(abstracts$date, format= "%m")) +# except for pub year - keep that continuous +d$year <- as.numeric(strftime(abstracts$date, format="%Y")) + +# bring in org affiliations +d <- merge(d, affiliations, by="eid") # note that this drops papers + # w/out org info + +d$affiliation <- factor(d$affiliation) + +##### Restrictions: + +### do this explicitly so that changes are easy: +d <- restrict(d, d$affiliation, 1) +d <- restrict(d, d$subject, 1) +d <- restrict(d, d$source_title, 1) +d <- restrict(d, d$language, 1) +d <- restrict(d, d$modal_country, 1) + +# 
n.authors +# per author prior citations + +pred.descrip$covars <- d +save(pred.descrip, file = "paper/data/prediction_descriptives.RData") + + +rm(d, abstracts, affiliations) + diff --git a/code/prediction/02-build_textual_features.R b/code/prediction/02-build_textual_features.R new file mode 100644 index 0000000..7347af5 --- /dev/null +++ b/code/prediction/02-build_textual_features.R @@ -0,0 +1,56 @@ +library(data.table) + + +# import ngram data +# note that the file is not pushed to repository, but is available on +# hyak at: /com/users/jdfoote/css_chapter/ngram_table.csv + +# Top 100,000 ngrams (?) +ngrams <- read.delim("processed_data/ngram_table.csv", sep=",", + header=TRUE, stringsAsFactors=FALSE)[,-3] +names(ngrams)[1] <- "eid" + +subjects <- read.delim("processed_data/abstracts.tsv", header=TRUE, + stringsAsFactors=FALSE, sep="\t")[,c("eid", + "first_ASJC_subject_area")] +names(subjects)[2] <- "subject" + +# takes a couple of minutes: +ngrams <- merge(ngrams, subjects, by="eid", all.x=TRUE) + +# only use ngrams that occur accross all (many?) subject areas +subject.by.ngram <- tapply(ngrams$subject, ngrams$term, function(x) + length(unique(x))) + +# summary(subject.by.ngram) +# +# library(txtplot) +# txtdensity(log(subject.by.ngram)) + +# Note: +# The median number of subject areas per term is five. We'll cut it +# off at terms that occur across at least 30 subject areas. + +top.ngrams <- ngrams[ngrams$term %in% + names(subject.by.ngram[subject.by.ngram > + 30]),c("eid", "term")] + +rm(ngrams, subject.by.ngram, subjects) + +# convert to a wide format matrix of dichotomous variables +library(reshape2) +library(data.table) + +top.ngrams <- data.table(top.ngrams) +setkey(top.ngrams, eid) + +top.ngrams[,vv:= TRUE] + +# took more than 20 minutes on hyak +top.ngram.matrix <- dcast(top.ngrams, eid ~ term, length, + value.var = "vv") + +rm(top.ngrams) + +save(top.ngram.matrix, file="processed_data/top.ngram.matrix.RData") +#load("processed_data/top.ngram.matrix.RData") diff --git a/code/prediction/03-prediction_analysis.R b/code/prediction/03-prediction_analysis.R new file mode 100644 index 0000000..f040263 --- /dev/null +++ b/code/prediction/03-prediction_analysis.R @@ -0,0 +1,221 @@ +library(data.table) +library(Matrix) +library(glmnet) +library(xtable) +library(methods) + +predict.list <- NULL + +if(!exists("top.ngram.matrix")){ + load("processed_data/top.ngram.matrix.RData") +} + +if(!exists("pred.descrip")){ + load("paper/data/prediction_descriptives.RData") + covars <- pred.descrip$covars +} + +top.ngram.matrix <- data.table(top.ngram.matrix) +setkey(top.ngram.matrix, eid) +covars <- data.table(pred.descrip$covars) +setkey(covars,eid) + +# restrict to the overlap of the two datasets +covars <- covars[covars$eid %in% top.ngram.matrix$eid,] + +top.ngram.matrix <- top.ngram.matrix[top.ngram.matrix$eid %in% + covars$eid,] + +# rename the cited column in case it doesn't appear +names(covars)[names(covars) == 'cited'] <- 'cited.x' + +# then merge also to facilitate some manipulations below +d <- merge(covars, top.ngram.matrix, by="eid", all=FALSE) + +# Note that this duplicates some column names so X gets appended in a +# few cases. 
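+# The design matrices below are built up cumulatively -- controls only,
+# then + affiliation, + subject, + venue, and finally + n-gram terms --
+# so that each successive cv.glmnet fit (LASSO-penalized logistic
+# regression: alpha = 1, family = "binomial", tuned by cross-validated
+# misclassification error) adds one block of features to the previous model.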
+ +# construct model matrices +x.controls <- sparse.model.matrix(cited.x ~ language.x + + modal_country + month.x, + data=d)[,-1] + +x.aff <- sparse.model.matrix(cited.x ~ affiliation, data=d)[,-1] +x.subj <- sparse.model.matrix(cited.x ~ subject.x, data=d)[,-1] +x.venue <- sparse.model.matrix(cited.x ~ source_title, data=d)[,-1] + +x.ngrams <- as.matrix(subset(top.ngram.matrix, select=-eid)) +x.ngrams <- as(x.ngrams, "sparseMatrix") + +X <- cBind(x.controls, covars$year.x, covars$works.cited) +X.aff <- cBind(X, x.aff) +X.subj <- cBind(X.aff, x.subj) +X.venue <- cBind(X.subj, x.venue) +X.terms <- cBind(X.venue, x.ngrams) + +Y <- covars$cited + +### Hold-back sample for testing model performance later on: +set.seed(20160719) +holdback.index <- sample(nrow(X), round(nrow(X)*.1)) + +X.hold <- X[holdback.index,] +X.hold.aff <- X.aff[holdback.index,] +X.hold.subj <- X.subj[holdback.index,] +X.hold.venue <- X.venue[holdback.index,] +X.hold.terms <- X.terms[holdback.index,] +Y.hold <- Y[holdback.index] + +X.test <- X[-holdback.index,] +X.test.aff <- X.aff[-holdback.index,] +X.test.subj <- X.subj[-holdback.index,] +X.test.venue <- X.venue[-holdback.index,] +X.test.terms <- X.terms[-holdback.index,] +Y.test <- Y[-holdback.index] + +############### Models and prediction + +set.seed(20160719) + +m.con <- cv.glmnet(X.test, Y.test, alpha=1, family="binomial", + type.measure="class") +con.pred = predict(m.con, type="class", s="lambda.min", + newx=X.hold) + +m.aff <- cv.glmnet(X.test.aff, Y.test, alpha=1, family="binomial", + type.measure="class") +aff.pred = predict(m.aff, type="class", s="lambda.min", + newx=X.hold.aff) + +m.subj <- cv.glmnet(X.test.subj, Y.test, alpha=1, family="binomial", + type.measure="class") +subj.pred = predict(m.subj, type="class", s="lambda.min", + newx=X.hold.subj) + +m.venue <- cv.glmnet(X.test.venue, Y.test, alpha=1, family="binomial", + type.measure="class") +venue.pred = predict(m.venue, type="class", s="lambda.min", + newx=X.hold.venue) + +m.terms <- cv.glmnet(X.test.terms, Y.test, alpha=1, family="binomial", + type.measure="class") +terms.pred = predict(m.terms, type="class", s="lambda.min", + newx=X.hold.terms) + +########## +# Compare test set predictions against held-back sample: + +pred.df <- data.frame(cbind(con.pred, aff.pred, subj.pred, + venue.pred, terms.pred)) +names(pred.df) <- c("Controls", "+ Affiliation", "+ Subject", "+ Venue", + "+ Terms") + +m.list <- list(m.con, m.aff, m.subj, m.venue, m.terms) + +# collect: +# df +# percent.deviance +# nonzero coefficients +# prediction error + +gen.m.summ.info <- function(model){ + df <- round(tail(model$glmnet.fit$df, 1),0) + percent.dev <- round(tail(model$glmnet.fit$dev.ratio, 1),2)*100 + cv.error <- round(tail(model$cvm,1),2)*100 +# null.dev <- round(tail(model$glmnet.fit$nulldev),0) + out <- c(df, percent.dev, cv.error) + return(out) +} + +gen.class.err <- function(pred, test){ + props <- prop.table(table(pred, test)) + err.sum <- round(sum(props[1,2], props[2,1]),2)*100 + return(err.sum) +} + + +results.tab <- cbind(names(pred.df),data.frame(matrix(unlist(lapply(m.list, + gen.m.summ.info)), + byrow=T, nrow=5))) + +results.tab$class.err <- sapply(pred.df, function(x) gen.class.err(x, + Y.hold)) + +results.tab <- data.frame(lapply(results.tab, as.character)) + + + +names(results.tab) <- c("Model", "N features", "Deviance (%)", + "CV error (%)", "Hold-back error (%)") + + +print(xtable(results.tab, + caption= + "Summary of fitted models predicting any citations. 
The ``Model'' column describes which features were included. The N features column shows the number of features included in the prediction. ``Deviance'' summarizes the goodness of fit as a percentage of the total deviance accounted for by the model. ``CV error'' (cross-validation error) reports the prediction error rates of each model in the cross-validation procedure conducted as part of the parameter estimation process. ``Hold-back error'' shows the prediction error on a random 10 percent subset of the original dataset not included in any of the model estimation procedures.", + label='tab:predict_models', align='llrrrr'), + include.rownames=FALSE) + +# Store the results: +predict.list$results.tab <- results.tab + + + + +############# Generate most salient coefficients +nz.coefs <- data.frame( coef = + colnames(X.test.terms)[which( + coef(m.terms, s="lambda.min") + != 0)], + type = "term", + beta = + coef(m.terms, + s="lambda.min")[which(coef(m.terms, + s="lambda.min") + != 0)]) + +nz.coefs$coef <- as.character(nz.coefs$coef) +nz.coefs$type <- as.character(nz.coefs$type) +nz.coefs <- nz.coefs[order(-abs(nz.coefs$beta)),] + +# comparison: + +#nz.coefs$type <- "terms" +nz.coefs$type[grepl("(Intercept)", nz.coefs$coef)] <- NA +nz.coefs$type[grepl("source_title", nz.coefs$coef)] <- "venue" +nz.coefs$type[grepl("subject.x", nz.coefs$coef)] <- "subject" +nz.coefs$type[grepl("affiliation", nz.coefs$coef)] <- "affiliation" +nz.coefs$type[grepl("month.x", nz.coefs$coef)] <- "month" +nz.coefs$type[grepl("modal_country", nz.coefs$coef)] <- "country" +nz.coefs$type[grepl("language", nz.coefs$coef)] <- "language" +nz.coefs$type[grepl("^20[0-9]{2}$", nz.coefs$coef)] <- "year" + + +# cleanup +nz.coefs$coef <- gsub("source_title", "", nz.coefs$coef) +nz.coefs$coef <- gsub("subject.x", "", nz.coefs$coef) +nz.coefs$coef <- gsub("affiliation","", nz.coefs$coef) +nz.coefs$beta <- round(nz.coefs$beta, 3) +names(nz.coefs) <- c("Feature", "Type", "Coefficient") + +predict.list$nz.coefs <- nz.coefs + +# table for all +round(prop.table(table(nz.coefs$Type))*100, 2) + +# for top subsets +round(prop.table(table(nz.coefs$Type[1:700]))*100, 2) +round(prop.table(table(nz.coefs$Type[1:200]))*100, 2) +round(prop.table(table(nz.coefs$Type[1:100]))*100, 2) + +print(xtable( + as.matrix(head(nz.coefs, 10)), + label='tab:nzcoefs', + caption='Feature, variable type, and beta value for top 100 non-zero coefficients estimated by the best fitting model with all features included.', + align='lllr' +), include.rownames=FALSE) + + +# output +save(predict.list, file="paper/data/prediction.RData") + + diff --git a/code/prediction/utils.R b/code/prediction/utils.R new file mode 100644 index 0000000..b0207e3 --- /dev/null +++ b/code/prediction/utils.R @@ -0,0 +1,13 @@ + +# Use this to check for underpopulated cells +gen.counts <- function(df, c.var){ + tapply(df[,"eid"], c.var, function(x) length(unique(x))) +} + +# use this to remove underpopulated cells +restrict <- function(df, c.var, c.min){ + var.counts <- gen.counts(df, c.var) + out.df <- df[c.var %in% names(var.counts[var.counts > + c.min]),] + return(out.df) +} diff --git a/code/topic_modeling/00_topics_extraction.py b/code/topic_modeling/00_topics_extraction.py new file mode 100644 index 0000000..5e3450f --- /dev/null +++ b/code/topic_modeling/00_topics_extraction.py @@ -0,0 +1,126 @@ + +from time import time + +from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer +from sklearn.decomposition import NMF, LatentDirichletAllocation +import sys +import csv 
+import pandas as pd +import argparse + +""" +This code was inspired/copied from http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html. + +It takes in an abstract file, and creates two outputs: The abstracts together with their topic distribution and a set of topics and the top words associated with each. +""" + +n_samples = None # Enter an integer here for testing. +n_features = 20000 +n_topics = 12 + +def main(): + + parser = argparse.ArgumentParser(description='Program to use LDA to create topics and topic distributions from a set of abstracts.') + parser.add_argument('-i', help='Abstracts file', + default='processed_data/abstracts.tsv') + parser.add_argument('-o', help='Where to output results', + default='processed_data/abstracts_LDA.csv') + parser.add_argument('-t', help='Where to output topics and top words associated with them', + default='processed_data/top_words.csv') + args = parser.parse_args() + + print("Loading dataset...") + t0 = time() + dataset, doc_data = get_abstracts(args.i) + data_samples = dataset[:n_samples] + doc_data = doc_data[:n_samples] + print("done in %0.3fs." % (time() - t0)) + + # Use tf (raw term count) features for LDA. + print("Extracting tf features for LDA...") + tf_vectorizer = CountVectorizer(max_df=0.95, # Terms that show up in > max_df of documents are ignored + min_df=2, # Terms that show up in < min_df of documents are ignored + max_features=n_features, # Only use the top max_features + stop_words='english', + ngram_range=(1,2)) + t0 = time() + tf = tf_vectorizer.fit_transform(data_samples) + print("done in %0.3fs." % (time() - t0)) + + + print("Fitting LDA models with tf features, " + "n_samples=%d and n_features=%d..." + % (len(data_samples), n_features)) + lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5, + learning_method='online', + learning_offset=50., + random_state=2017, + n_jobs=2) + t0 = time() + model = lda.fit(tf) + transformed_model = lda.fit_transform(tf) + print("done in %0.3fs." % (time() - t0)) + + + # Change the values into a probability distribution for each abstract + topic_dist = [[topic/sum(abstract_topics) for topic in abstract_topics] + for abstract_topics in transformed_model] + + # Make the topic distribution into a dataframe + td = pd.DataFrame(topic_dist) + # Get the feature names (i.e., the words/terms) + tf_feature_names = tf_vectorizer.get_feature_names() + + + # Get the top words by topic + topic_words = get_top_words(lda, tf_feature_names, 20) + # Sort by how often topic is used + topic_words = topic_words.reindex_axis(sorted(topic_words.columns, key = lambda x: td[x].sum(), reverse=True),axis=1) + + # Rearrange the columns by how often each topic is used + td = td.reindex_axis(sorted(td.columns, key = lambda x: td[x].sum(), reverse=True),axis=1) + + topic_words.to_csv(args.t, index=False) + + df = pd.DataFrame(doc_data) + df = df.join(td) + + df.to_csv(args.o, index=False) + +def get_abstracts(fn): + with open(fn, 'r') as f: + in_csv = csv.DictReader(f, delimiter='\t') + abstracts = [] + doc_data = [] + for r in in_csv: + try: + curr_abstract = r['abstract'] + # If this isn't really an abstract, then don't add it + if len(curr_abstract) > 5: + # Add the abstracts to the corpus, and save the data + abstracts.append(r['abstract']) + doc_data.append(r) + except KeyError: + print(r) + return abstracts, doc_data + +def get_top_words(model, feature_names, n_top_words): + '''Takes the model, the words used, and the number of words requested. 
+ Returns a dataframe of the top n_top_words for each topic''' + r = pd.DataFrame() + # For each topic + for i, topic in enumerate(model.components_): + # Get the top feature names, and put them in that column + r[i] = [add_quotes(feature_names[i]) + for i in topic.argsort()[:-n_top_words - 1:-1]] + return r + +def add_quotes(s): + '''Adds quotes around multiple term phrases''' + if " " in s: + s = '"{}"'.format(s) + return s + + +if __name__ == '__main__': + main() diff --git a/code/topic_modeling/01_make_paper_files.py b/code/topic_modeling/01_make_paper_files.py new file mode 100644 index 0000000..e928e4d --- /dev/null +++ b/code/topic_modeling/01_make_paper_files.py @@ -0,0 +1,103 @@ +'''Creates the figures and tables for LaTeX''' + +import pandas as pd +import numpy as np +import datetime +import argparse +import os + +topic_names = [ + 'Media Use', + 'Social Network Analysis', + 'Consumer Analsyis', + 'Education', + 'Quantitative Analysis', + 'Information Spread', + 'Health', + 'Sentiment Analysis', + 'News', + 'HCI', + 'Influence', + 'Methodology' +] + +def main(): + + parser = argparse.ArgumentParser(description='Takes the LDA info and top words and creates an RData file with summary statistics') + parser.add_argument('-a', help='Abstracts LDA file', + default='processed_data/abstracts_LDA.csv') + parser.add_argument('-w', help='Top words file', + default='processed_data/top_words.csv') + parser.add_argument('-t', help='Topic tables directory', + default='paper/tables/') + parser.add_argument('-o', help = 'RData output file location', + default = 'paper/data/topic_model_data.RData') + + args = parser.parse_args() + + # Make the top_words tables + tw = pd.read_csv(args.w) + # Add names + tw.columns = topic_names + # Save as 2 different tables, because they are too long + if not os.path.exists(args.t): + os.makedirs(args.t) + tw.to_latex(args.t + 'topic_words1.tex',index=False, columns=tw.columns[:6]) + tw.to_latex(args.t + 'topic_words2.tex',index=False, columns=tw.columns[6:]) + + # Load the abstracts and topics data + df = pd.read_csv(args.a) + n_topics = len(tw.columns) + # Change to datetime + df.date = pd.to_datetime(df.date) + + # Remove papers from 2016 since we don't have the entire year, so graphs are misleading + df = df[df.date <= pd.to_datetime('2015-12-31')] + df = df.set_index('date') + # Rename the last columns as the topic names + df.columns = list(df.columns[:-n_topics]) + topic_names + # Group by year, and get only the LDA columns + topics_by_year = df.groupby(lambda x: x.year)[df.columns[-n_topics:]] + # Get summary statistics for each topic + # Total amount published in each topic by year + topic_sums = topics_by_year.sum() + # Mean amount published in each topic + topic_means = topics_by_year.mean() + # Now, we weight the contributions by how much a paper has been cited. + # Remember, each document has a distribution of topics that it belongs to, so a given document might look like: + # T1: .5 + # T2: .3 + # T3: 0 + # T4: .2 + # To account for how influential a paper is, we take all of the topic columns for a document + # and multiplies their weights by the logged citations the paper has received. 
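+    # Worked example (hypothetical numbers): a paper with topic weights
+    # [.5, .3, 0, .2] and 20 citations contributes roughly
+    # [.5, .3, 0, .2] * log(1 + 20) ~= [1.52, 0.91, 0.00, 0.61]
+    # to its publication year's weighted sums.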
+ citation_weighted_topics = df[df.columns[-n_topics:]] + citation_weighted_topics = citation_weighted_topics.apply(lambda x: x * np.log1p(df.cited_by_count), axis=0) + weighted_sums = citation_weighted_topics.groupby(lambda x: x.year).sum() + + ## write data to R + # import code to write r modules and create our variable we'll write to + import rpy2.robjects as robjects + from rpy2.robjects import pandas2ri + pandas2ri.activate() + + + r = {'weighted_sums' : weighted_sums, + 'topic_sums' : topic_sums, + 'topic_means' : topic_means } + + for var_name, x in r.items(): + robjects.r.assign(var_name.replace("_", "."), x) + + if not os.path.exists(os.path.dirname(args.o)): + os.makedirs(os.path.dirname(args.o)) + + robjects.r('save({},file = "{}")'.format( + ",".join([k.replace("_", ".") for k in r.keys()]), + args.o + )) + robjects.r("rm(list=ls())") + + +if __name__ == '__main__': + main() diff --git a/paper/Makefile b/paper/Makefile new file mode 100644 index 0000000..958f0ab --- /dev/null +++ b/paper/Makefile @@ -0,0 +1,34 @@ +#!/usr/bin/make + +all: $(patsubst %.Rnw,%.pdf,$(wildcard *.Rnw)) +pdf: all + +%.tex: %.Rnw + #python3 ../code/LDA/make_latex_files.py + Rscript -e "library(knitr); knit('$<')" + +%.pdf: %.tex vc + grep -v -e '^.usepackage.longtable.$$' $<| sponge $< + latexmk -f -pdf $< + +clean: + latexmk -C *.tex + rm -f *.tmp + rm -f *.tex + rm -rf figure/ + rm -f *.bbl + rm -rf cache/ + rm -rf *.run.xml + rm -f vc + +viewpdf: all + evince *.pdf + +spell: + aspell -c -t --tex-check-comments -b text.tex + +vc: + resources/vc-git + +.PHONY: clean all +.PRECIOUS: %.tex diff --git a/paper/figures/cluster_connections-grey-numbered.pdf b/paper/figures/cluster_connections-grey-numbered.pdf new file mode 100644 index 0000000..b4ba39d Binary files /dev/null and b/paper/figures/cluster_connections-grey-numbered.pdf differ diff --git a/paper/figures/cluster_connections-grey.pdf b/paper/figures/cluster_connections-grey.pdf new file mode 100644 index 0000000..1bf3f78 Binary files /dev/null and b/paper/figures/cluster_connections-grey.pdf differ diff --git a/paper/figures/cluster_connections-grey.svg b/paper/figures/cluster_connections-grey.svg new file mode 100644 index 0000000..8401406 --- /dev/null +++ b/paper/figures/cluster_connections-grey.svg @@ -0,0 +1,275 @@ + + + +image/svg+xml1 +2 +3 +4 +5 +6 + \ No newline at end of file diff --git a/paper/figures/cluster_connections.pdf b/paper/figures/cluster_connections.pdf new file mode 100644 index 0000000..9784a54 Binary files /dev/null and b/paper/figures/cluster_connections.pdf differ diff --git a/paper/figures/g_sm_hairball-grey.pdf b/paper/figures/g_sm_hairball-grey.pdf new file mode 100644 index 0000000..e7bb8b1 Binary files /dev/null and b/paper/figures/g_sm_hairball-grey.pdf differ diff --git a/paper/figures/g_sm_hairball.pdf b/paper/figures/g_sm_hairball.pdf new file mode 100644 index 0000000..e458ead Binary files /dev/null and b/paper/figures/g_sm_hairball.pdf differ diff --git a/paper/foote_shaw_hill-computational_analysis_of_social_media.Rnw b/paper/foote_shaw_hill-computational_analysis_of_social_media.Rnw new file mode 100644 index 0000000..bbf9a4e --- /dev/null +++ b/paper/foote_shaw_hill-computational_analysis_of_social_media.Rnw @@ -0,0 +1,575 @@ +\documentclass[12pt]{memoir} + +% % article-1 and article-2 styles were originally based on kieran healy's +% templates +\usepackage{mako-mem} +\chapterstyle{article-2} + +% with article-3 \chapterstyle, change to: \pagestyle{memo} +\pagestyle{mako-mem} + + 
+\usepackage[utf8]{inputenc} + +\usepackage[T1]{fontenc} +\usepackage{textcomp} +\usepackage[garamond]{mathdesign} + +\usepackage[letterpaper,left=1.65in,right=1.65in,top=1.3in,bottom=1.2in]{geometry} + +<>= +knit_hooks$set(document = function(x) { + sub('\\usepackage[]{color}', +'\\usepackage[usenames,dvipsnames]{color}', x, fixed = TRUE) +}) +#$ +@ + +% packages I use in essentially every document +\usepackage{graphicx} +\usepackage{enumerate} + +% packages i use in many documents but leave off by default +% \usepackage{amsmath, amsthm, amssymb} +% \usepackage{dcolumn} +% \usepackage{endfloat} + +% import and customize urls +\usepackage[usenames,dvipsnames]{xcolor} +\usepackage[breaklinks]{hyperref} + +\hypersetup{colorlinks=true, linkcolor=Black, citecolor=Black, filecolor=Blue, + urlcolor=Blue, unicode=true} + +% list of footnote symbols for \thanks{} +\makeatletter +\renewcommand*{\@fnsymbol}[1]{\ensuremath{\ifcase#1\or *\or \dagger\or \ddagger\or + \mathsection\or \mathparagraph\or \|\or **\or \dagger\dagger + \or \ddagger\ddagger \else\@ctrerr\fi}} +\makeatother +\newcommand*\samethanks[1][\value{footnote}]{\footnotemark[#1]} + +% add bibliographic stuff +\usepackage[american]{babel} +\usepackage{csquotes} +\usepackage[natbib=true, style=apa, backend=biber]{biblatex} +\addbibresource{refs.bib} +\DeclareLanguageMapping{american}{american-apa} + +\defbibheading{secbib}[\bibname]{% + \section*{#1}% + \markboth{#1}{#1}% + \baselineskip 14.2pt% + \prebibhook} + +\def\citepos#1{\citeauthor{#1}'s (\citeyear{#1})} +\def\citespos#1{\citeauthor{#1}' (\citeyear{#1})} + +% memoir function to take out of the space out of the whitespace lists +\firmlists + + +% LATEX NOTE: these lines will import vc stuff after running `make vc` which +% will add version control information to the bottom of each page. This can be +% useful for keeping track of which version of a document somebody has: +\input{vc} +\pagestyle{mako-mem-git} + +% LATEX NOTE: this alternative line will just input a timestamp at the +% build process, useful for sharelatex +%\pagestyle{mako-mem-sharelatex} + +\setlength{\parskip}{4.5pt} +% LATEX NOTE: Ideal linespacing is usually said to be between 120-140% the +% typeface size. So, for 12pt (default in this document, we're looking for +% somewhere between a 14.4-17.4pt \baselineskip. Single; 1.5 lines; and Double +% in MSWord are equivalent to ~117%, 175%, and 233%. 
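+% (For example, this document sets \baselineskip 16pt -- roughly 133% of the
+% 12pt type size -- after \begin{document}.)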
+ +% packages specific to /this/ to document +\usepackage{adjustbox} +\usepackage{longtable} +% \usepackage{lmodern} + +\definecolor{snagrey1}{HTML}{D9D9D9} +\definecolor{snagrey2}{HTML}{BDBDBD} +\definecolor{snagrey3}{HTML}{969696} +\definecolor{snagrey4}{HTML}{737373} +\definecolor{snagrey5}{HTML}{525252} +\definecolor{snagrey6}{HTML}{252525} + +% Prefix table and figure numbers with our chapter number +\renewcommand{\thetable}{6.\arabic{table}} +\renewcommand{\thefigure}{6.\arabic{figure}} + +\begin{document} + +<>= +library(dplyr) +library(ggplot2) +library(xtable) +library(data.table) +library(reshape2) +library(plyr) + +bold <- function(x) {paste('{\\textbf{',x,'}}', sep ='')} +gray <- function(x) {paste('{\\textcolor{gray}{',x,'}}', sep ='')} +wrapify <- function (x) {paste("{", x, "}", sep="")} + +f <- function (x) {formatC(x, format="d", big.mark=',')} + +load("data/orig_data_sets.RData") +load("data/topic_model_data.RData") +load("data/prediction_descriptives.RData") +load("data/prediction.RData") + +load("data/network_data.RData") +attach(r) +@ + +\baselineskip 16pt + +\title{A Computational Analysis of Social Media Scholarship} + +\author{Jeremy Foote (\href{mailto:jdfoote@u.northwestern.edu}{jdfoote@u.northwestern.edu}) + + \smallskip + + Aaron Shaw (\href{mailto:aaronshaw@northwestern.edu}{aaronshaw@northwestern.edu}) + + \smallskip + + Benjamin Mako Hill (\href{mailto:makohill@uw.edu}{makohill@uw.edu}) +} +\date{} + +\published{\textcolor{BrickRed}{\textsc{This is the final authors' version of a published book chapter. When citing this work, please cite the published version:} Foote, Jeremy D., Aaron Shaw, and Benjamin Mako Hill. 2017. “A Computational Analysis of Social Media Scholarship.” In \textit{The SAGE Handbook of Social Media}, edited by Jean Burgess, Alice Marwick, and Thomas Poell, 111–34. London, UK: SAGE.}} + +\maketitle + +\vspace{-2.5em} + +\begin{abstract} +Data from social media platforms and online communities have fueled the growth of computational social science. In this chapter, we use computational analysis to characterize the state of research on social media and demonstrate the utility of such methods. First, we discuss how to obtain datasets from the APIs published by many social media platforms. Then, we perform some of the most widely used computational analyses on a dataset of social media scholarship we extract from the Scopus bibliographic database's API. We apply three methods: network analysis, topic modeling using latent Dirichlet allocation, and statistical prediction using machine learning. For each technique, we explain the method and demonstrate how it can be used to draw insights from our dataset. Our analyses reveal overlapping scholarly communities studying social media. We find that early social media research applied social network analysis and quantitative methods, but the most cited and influential work has come from marketing and medical research. We also find that publication venue and, to a lesser degree, textual features of papers explain the largest variation in incoming citations. We conclude with some consideration of the limitations of computational research and future directions. + +\end{abstract} + +\section{Introduction} + +The combination of large-scale trace data generated through social media with a series of advances in computing and statistics have enabled the growth of `computational social science' \citep{lazer_computational_2009}. 
This turn presents an unprecedented opportunity for researchers who can now test social theories using massive datasets of fine-grained, unobtrusively collected behavioral data. In this chapter, we aim to introduce non-technical readers to the promise of these computational social science techniques by applying three of the most common approaches to a bibliographic dataset of social media scholarship. We use our analyses as a context for discussing the benefits of each approach as well as some of the common pitfalls and dangers of computational approaches. + +The chapter walks through the entire process of computational analysis, beginning with data collection. We explain how we gather a large-scale dataset about social media research from the \emph{Scopus} website's application programming interface. In particular, our dataset contains data about every article in the Scopus database that includes the term `social media' in its title, abstract, or keywords. Using this dataset, we perform multiple computational analyses. First, we use network analysis \citep{wasserman_social_1994} on article citation metadata to understand the structure of references between the articles. Second, we use topic models \citep{blei_probabilistic_2012}, an unsupervised natural language processing technique, to describe the distribution of topics within the sample of articles included in our analysis. Third, we perform statistical prediction \citep{james_introduction_2013} in order to understand what characteristics of articles best predict subsequent citations. For each analysis, we describe the method we use in detail and discuss some of its benefits and limitations. + +Our results reveal several patterns in social media scholarship. Bibliometric network data reveals disparities in the degree that disciplines cite each other and illustrate that marketing and medical research each enjoy surprisingly large influence. Through descriptive analysis and topic modeling, we find evidence of the early influence of social network research. When we use papers' characteristics to predict which work gets cited, we find that publication venues and linguistic features provide the most explanatory power. + +In carrying out our work in this chapter, we seek to exemplify several current best practices in computational research. We use data collected in a manner consistent with the expectations of privacy and access held by the subjects of our analysis as well as the publishers of the data source. We also make our analysis fully reproducible from start to finish. In an online supplement, we provide the full source code for all aspects of this project -- from the beginning of data collection to the creation of the figures and the chapter text itself -- as a resource for future researchers. + +\section{Collecting and describing data from the web} + +A major part of computational research consists of obtaining data, preparing it for analysis, and generating initial descriptions that can help guide subsequent inquiry. Social media datasets vary in how they make it into researchers' hands. There are several sources of social media data which are provided in a form that is pre-processed and ready for analysis. For example, The Stanford Large Network Dataset Collection \citep{leskovec_snap_2014} contains pre-formatted and processed data from a variety of social media platforms. 
Typically, prepared datasets come formatted as `flat files' such as comma-separated value (CSV) tables, which many types of statistical software and programming tools can import directly. + +More typically, researchers retrieve data directly from social media platforms or other web-based sources. These `primary' sources provide more extensive, dynamic, and up-to-date datasets, but also require much more work to prepare the data for analysis. Typically, researchers retrieve these data from social media sites through application programming interfaces (APIs). Web sites and platforms use APIs to provide external programmers with limited access to their servers and databases. Unfortunately, APIs are rarely designed with research in mind and are often inconvenient and limited for social scientists as a result. For example, Twitter's search API returns a small, non-random sample of tweets by default (what a user might want to read), rather than all of the tweets that match a given query (what a researcher building a sample would want). In addition, APIs typically limit how much data they will provide for each query and how many queries can be submitted within a given time period. + +APIs provide raw data in formats like XML or JSON, which are poorly suited to most data analysis tasks. As a result, researchers must take the intermediate step of converting data into more appropriate formats and structures. Typically, researchers must also construct measures from the raw data, such as user-level statistics (e.g., number of retweets) or metadata (e.g., post length). A number of tools, such as NodeXL \citep{hansen_analyzing_2010}, exist to make the process of obtaining and preparing digital trace data easier. However, off-the-shelf tools tend to come with their own limitations and, in our experience, gathering data amenable to computational analysis usually involves some programming work. + +Compared to some traditional forms of data collection, obtaining and preparing social media data has high initial costs. It frequently involves writing and debugging custom software, reading documentation about APIs, learning new software libraries, and testing datasets for completeness and accuracy. However, computational methods scale very well and gathering additional data often simply means expanding the date range in a programming script. Contrast this with interviews, surveys, or experiments, where recruitment is often labor-intensive, expensive, and slow. Such scalability, paired with the massive participation on many social media platforms, can support the collection of very large samples. + +\subsection{Our application: The Scopus Bibliographic Database} + +We used a series of Scopus Bibliographic Database APIs to retrieve data about all of the publications in their database that contained the phrase `social media' in their abstract, title, or keywords. We used the Python programming language to write custom software to download data. First, we wrote a program to query the Scopus Search API to retrieve a list of the articles that matched our criteria. We stored the resulting list of \Sexpr{f(total.articles)} articles in a file. We used this list of articles as input to a second program, which used the Scopus Citations Overview API to retrieve information about all of the articles that cited these \Sexpr{f(total.articles)} articles. 
Finally, we wrote a third program that used the Scopus Abstract Retrieval API to download abstracts and additional metadata about the original \Sexpr{f(total.articles)} articles. Due to rate limits and the process of trial and error involved in writing, testing, and debugging these custom programs, it took a few weeks to obtain the complete dataset. + +Like many social media APIs, the Scopus APIs returns data in JSON format. Although not suitable for analysis without processing, we stored this JSON data in the form it was given to us. Retaining the `raw' data directly from APIs allows researchers to construct new measures they might not have believed were relevant in the early stages of their research and to fix any bugs that they find in their data processing and reduction code without having to re-download raw data. Once we obtained the raw data, we wrote additional Python scripts to turn the downloaded JSON files into CSV tables which could be imported into Python and R, the programming languages we used to complete our analyses. + +\subsection{Results} + +The Scopus dataset contains a wide variety of data, and many different descriptive statistics could speak to various research questions. Here we present a sample of the sorts of summary data that computational researchers might explore. We begin by looking at where social media research is produced. Table \ref{tab:country} shows the number of papers produced by authors located in each of the six most frequently seen countries in our dataset.\footnote{Technically, each paper is assigned to the modal (i.e.,~most frequent) country among its authors. For example, if a paper has three authors with two authors located in Canada and one in Japan, the modal country for the paper would be Canada. Any ties (i.e., if more than one country is tied for most frequent location among a paper's authors) were broken by randomly selecting among the tied countries.} We can immediately see that the English-language world produces much of the research on social media (which is perhaps unsurprising given that our search term was in English), but that a large amount of research comes from authors in China and Germany. 
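+To make the tie-breaking rule concrete, the procedure can be sketched in a few lines of R (our processing pipeline applies the same logic in Python to the Scopus author records; the country values below are hypothetical):
+
+<<modal_country_sketch, eval=FALSE>>=
+# One (hypothetical) country per author of a single paper.
+author.countries <- c("Canada", "Canada", "Japan")
+counts <- table(author.countries)
+modes <- names(counts[counts == max(counts)])
+# Break any tie among modal countries by choosing at random.
+modes[sample(length(modes), 1)]
+@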
+ +<>= +country.table <- sort(table(df[['modal_country']]), decreasing = T, index.return=T ) +country.table <- as.data.frame(head(country.table)) +colnames(country.table) <- c('Country', 'Number of Papers') +print(xtable(country.table, + caption="Top author countries by number of social media papers.", + align='lp{.5\\linewidth}r', + label='tab:country'), + include.rownames = FALSE) +@ + +\begin{figure} + \centering +<>= + +top.subjects <- head(names(rev(sort(table(s$subject)))), 10) +reduced <- full[full$subject %in% top.subjects, c("year", "subject")] +reduced$subject <- droplevels(reduced$subject) + +subj.by.year <- melt(table(reduced)) +subj.by.year$year <- as.Date(paste(subj.by.year$year, "-06-01", sep="")) +subj.by.year <- subj.by.year[subj.by.year$year <= as.Date("2016-01-01"),] +colnames(subj.by.year) <- c("year", "subject", "papers") +subj.by.year$subject <- as.factor(subj.by.year$subject) + +ggplot(subj.by.year) + + aes(x=year, y=papers, fill = subject) + + geom_density(stat="identity", position="stack") + + xlab('Year') + ylab('Number of papers') + + guides(fill = guide_legend(reverse = TRUE)) + + # scale_fill_grey(name="Discipline") + theme_bw() +#$ +@ + + +\caption{Social media papers published in the top ten disciplines (as categorized by Scopus), over time.} +\label{fig:pubstime} +\end{figure} + +Next we look at the disciplines that publish social media research. Figure \ref{fig:pubstime} shows the number of papers containing the term `social media' over time. The plot illustrates that the quantity of published research on social media has increased rapidly over time. The growth appears to slow down more recently, but this may be due to the speed at which the Scopus database imports data about new articles. + +Figure \ref{fig:pubstime} shows the top ten disciplines, as categorized by Scopus. We see that the field started off dominated by computer science publications, with additional disciplines increasing their activity in recent years. This story is also reflected in the top venues, listed in Table \ref{tab:source}, where we see that computer science venues have published social media research most frequently. 
+ +\begin{table} + \begin{adjustbox}{center} +<>= +# Change long titles +df$source_title <- revalue(df$source_title, c('Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)' = 'Lecture Notes in Computer Science', +'Proceedings of the Annual Hawaii International Conference on System Sciences' = 'Proceedings of the Hawaii International Conference on System Sciences', +'Joint Ninth WebKDD and First SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis' = 'Proceedings of WebKDD / SNA-KDD 2007')) + +source.table <- sort(table(df[['source_title']]), decreasing = T, index.return=T ) +source.table <- as.data.frame(head(source.table)) +colnames(source.table) <- c('Publication Venue', 'Papers') +print(xtable(source.table, + align='lp{.8\\linewidth}r'), + include.rownames = FALSE, + floating=FALSE) +@ +\end{adjustbox} +\caption{Venues with the most social media papers.} +\label{tab:source} +\end{table} + +\begin{table} + \begin{adjustbox}{center} +<>= +results <- head(df[ order(-df[['cited_by_count']]), ][,c('title','source_title','cited_by_count')]) +colnames(results) <- c('Title', 'Publication Venue', 'Cited by') + +print(xtable(results, + align='lp{.48\\linewidth}p{.28\\linewidth}r'), + include.rownames = FALSE, + floating=FALSE) +@ + \end{adjustbox} + \caption{Most cited social media papers.} + \label{tab:citedby} +\end{table} + +We then consider the impact of this set of papers as measured by the citations they have received. Like many phenomena in social systems, citation counts follow a highly skewed distribution with a few papers receiving many citations and most papers receiving very few. Table \ref{tab:citedby} provides a list of the most cited papers. These sorts of distributions suggest the presence of `preferential attachment' \citep{barabasi_emergence_1999} or `Matthew effects' \citep{merton_matthew_1968}, where success leads to greater success. + +\subsection{Discussion} + +The summary statistics and exploratory visualizations presented above provide an overview of the scope and trajectory of social media research. We find that social media research is growing -- both overall and within many disciplines. We find evidence that computer scientists laid the groundwork for the study of social media, but that social scientists, learning scientists, and medical researchers have increasingly been referring to social media in their published work. We also find several business and marketing papers among the most cited pieces of social media research even though neither these disciplines nor their journals appear among the most prevalent in the dataset. + +These results are interesting and believable because they come from a comprehensive database of academic work. In most social science contexts, researchers have to sample from a population and that sampling is often biased. For example, the people willing to come to a lab to participate in a study or take a phone survey may have different attributes from those unwilling to participate. This makes generalizing to the entire population problematic. When using trace data, on the other hand, we often have data from all members of a community including those who would not have chosen to participate. One of the primary benefits of collecting data from a comprehensive resource like Scopus is that it limits the impact of researcher biases or assumptions. 
For example, we do not have backgrounds in education or medical research; had we tried to summarize the state of social media research by identifying articles and journals manually, we might have overlooked these disciplines. + +That said, this apparent benefit can also become a liability when we seek to generalize our results beyond the community that we have data for. The large \textit{N} of big data studies using social media traces may make results appear more valid, precise, or certain, but a biased sample does not become less biased just because it is larger \citep{hargittai_is_2015}. For example, a sample of 100 million Twitter users might be a worse predictor of election results than a truly random sample of only 1,000 likely voters because Twitter users likely have different attributes and opinions than the voting public. Another risk comes from the danger that data providers collect or filter data in ways that aren't apparent. Researchers should think carefully about the relationship of their data to the population they wish to study and find ways to estimate bias empirically. + +Overall, we view the ease of obtaining and analyzing digital traces as one of the most exciting developments in social science. Although the hurdles involved represent a real challenge to many scholars of social media today, learning the technical skills required to obtain online trace data is no more challenging than the statistics training that is part of many PhD programs and opens opportunities for important, large-scale studies. Below, we present examples of a few computational analyses that can be done with this sort of data. + +\section{Network analysis} + +Social network analysis encompasses the most established set of computational methods in the social sciences \citep{wasserman_social_1994}. At its core, network analysis revolves around a `graph' representation of data that tries to capture relationships (called edges) between discrete objects (called nodes). Graphs can represent any type of object and relationship, such as roads connecting a group of cities or shared ingredients across a set of recipes. Graph representations of data, and the network analytic methods built to reason using these data, are widely used across the social sciences as well as other fields including physics, genomics, computer science, and philosophy. `Social network analysis' constitutes a specialized branch of network analysis in which nodes represent people (or other social entities) and edges represent social relationships like friendship, interaction, or communication. + +The power of network analysis stems from its capacity to reduce a very large and complex dataset to a relatively simple set of relations that possess enormous explanatory power. For example, \citet{hausmann_atlas_2014} use network data on the presence or absence of trading relationships between countries to build a series of extremely accurate predictions about countries' relative wealth and economic performance over time. By reasoning over a set of relationships in a network, Hausmann and his colleagues show that details of the nature or amount of goods exchanged are not necessary to arrive at accurate economic conclusions. + +Network analysis has flourished in studies of citation patterns within scholarly literature, called `bibliometrics' or `scientometrics.' Bibliometric scholars have developed and applied network analytic tools for more than a half-century \citep{kessler_bibliographic_1963, hood_literature_2001}. 
As a result, bibliometric analysis provides an obvious jumping-off point for our tour of computational methods. Because network methods reflect a whole family of statistics, algorithms, and applications, we focus on approaches that are both well-suited to bibliometric analysis and representative of network analyses used in computational social science more broadly. + +\subsection{Our application: Citation networks} + +Our network analysis begins by representing citation information we collected from the Scopus APIs as a graph. In our representation, each node represents a paper and each edge represents a citation. Scopus provides data on incoming citations for each article. Our full dataset includes \Sexpr{f(sm.citations)} incoming citations to the \Sexpr{f(total.articles)} articles in Scopus with `social media' in their titles, abstracts, or keywords. \Sexpr{f(total.articles - sm.cited)} of these articles (\Sexpr{round(((total.articles - sm.cited)/total.articles)*100)}\%) have not been cited even once by another article in Scopus and \Sexpr{f(total.articles - sm.citing)} (\Sexpr{round(((total.articles - sm.citing)/total.articles)*100)}\%) do not cite any other article in our sample. The recent development of social media and the rapid growth of the field depicted in Figure \ref{fig:pubstime} might help explain the sparseness (i.e.~lack of connections) of the graph. As a result, and as is often the case in network analysis, a majority of our dataset plays no role in our analysis described in the rest of this section. + +Once we create our citation graph, there are many potential ways to analyze it. One important application, common to bibliometrics, is the computational identification of communities or clusters within networks. In network studies, the term `community' is used to refer to groups of nodes that are densely connected to each other but relatively less connected to other groups. In bibliometric analyses, communities can describe fields or sub-fields of articles which cite each other, but are much less likely to cite or be cited by papers in other groups. Although there are many statistical approaches to community detection in network science, we use a technique from Rosvall and Bergstrom (\citeyear{rosvall_maps_2008}) that has been identified as appropriate for the study of bibliometric networks \citep{subelj_clustering_2016}. By looking at the most frequently occurring journals and publication venues in each community, we are able to identify and name sub-fields of social media research as distinct communities. + +A citation graph is only one possible network representation of the relationships between articles. For example, the use of common topics or terminology might constitute another type of edge. Alternatively, journals or individual authors (rather than articles) might constitute an alternative source of nodes. In bibliometric analyses, for example, it is common for edges to represent `co-citations' between articles or authors. Using this approach, papers are said to be tied together by a co-citation if they have both been cited in a third document \citep{small_co-citation_1973}. Due to limited space, we only present the simplest case of direct citations. + +\begin{figure} + \includegraphics[width=\textwidth]{figures/g_sm_hairball.pdf} + \caption{Network visualization of the citation network in our dataset. 
The layout is `force directed,' meaning that nodes (papers) with more edges (citations) appear closer to the center of the figure.}
+ \label{fig:hairball}
+ \end{figure}
+
+\begin{table}
+\begin{adjustbox}{center}
+ \begin{tabular}{cm{0.3\textwidth}m{0.5\textwidth}}
+ \hline
+ Community & Description & Journals \\
+ \hline
+ \colorbox{CarnationPink}{\color{black}Community 1} & biomedicine; bioinformatics & Journal of Medical Internet Research; PLoS ONE; Studies in Health Technology and Informatics \\
+ \colorbox{Green}{\color{black}Community 2} & information technology; management & Computers in Human Behavior; Business Horizons; Journal of Interactive Marketing \\
+ \colorbox{Black}{\color{white}Community 3} & communication & Information Communication and Society; New Media and Society; Journal of Communication \\
+ \colorbox{Cyan}{\color{black}Community 4} & computer science; network science & Lecture Notes in Computer Science; PLoS ONE; WWW; KDD \\
+ \colorbox{Orange}{\color{black}Community 5} & psychology; psychometrics & Computers in Human Behavior; Cyberpsychology, Behavior, and Social Networking; Computers and Education\\
+ \colorbox{Red}{\color{black}Community 6} & multimedia & IEEE Transactions on Multimedia; Lecture Notes in Computer Science; ACM Multimedia \\
+ \hline
+ \end{tabular}
+ \end{adjustbox}
+ \caption{Description of each of the citation network clusters identified by the community detection algorithm, together with a list of the three most common journals in each community.}
+ \label{tab:clusters}
+\end{table}
+
+\subsection{Results}
+
+As is common in social networks, the large majority of articles with any citations connect to each other in one large sub-network called a `component'. Figure \ref{fig:hairball} shows a visualization of this large component. The optimal way to represent network data in two-dimensional space is a topic of research and debate. Figure \ref{fig:hairball} uses a force-directed drawing technique \citep{fruchterman_graph_1991}, the most widely used algorithm in network visualization, as implemented in the free/open source software package Gephi \citep{bastian_gephi:_2009}. The basic idea behind the algorithm is that nodes naturally push away from each other, but are pulled together by edges between them. Shades in each graph in this section reflect the communities of documents identified by Rosvall and colleagues' `map' algorithm \citep{rosvall_maps_2008, rosvall_map_2010}. Although the algorithm identified several dozen communities, most are extremely small, so we have shown only the largest six communities in Figure \ref{fig:hairball}. Each of these communities is summarized in Table \ref{tab:clusters}, where the right-most column lists the three most common journals for the articles included in each community.
+
+At this point, we can look in more depth at the attributes of the different communities. For example, in a bibliometric analysis published in the journal \emph{Scientometrics}, \citet{kovacs_exploring_2015} reported summary statistics for articles in each of the major communities identified (e.g., the average number of citations) as well as qualitative descriptions of the nodes in each community. We can see from looking at Table \ref{tab:clusters} that the communities point to the existence of coherent thematic groups. For example, \colorbox{CarnationPink}{\color{black}Community 1} includes biomedical research while \colorbox{black}{\color{white}Community 3} contains papers published in communication journals.
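+
+Readers who want to try this kind of analysis themselves can do so entirely with free software. The sketch below uses the R \emph{igraph} package, whose Infomap routine implements the map equation approach of Rosvall and Bergstrom described above. It assumes a simple two-column edge list of citing and cited paper identifiers; the object and column names are hypothetical stand-ins rather than the actual code behind our figures, which is available in the online supplement:
+
+<<network-sketch, echo=TRUE, eval=FALSE>>=
+library(igraph)
+
+# `citations' is assumed to be a data frame with two columns,
+# `citing' and `cited', holding paper identifiers
+g <- graph_from_data_frame(citations, directed = TRUE)
+
+# keep only the largest weakly connected component
+comp <- components(g, mode = "weak")
+g.main <- induced_subgraph(g, which(comp$membership == which.max(comp$csize)))
+
+# community detection using the map equation (Infomap)
+comms <- cluster_infomap(g.main)
+sort(table(membership(comms)), decreasing = TRUE)[1:6]  # largest communities
+
+# force-directed (Fruchterman-Reingold) layout, similar in spirit to Gephi
+plot(g.main, layout = layout_with_fr(g.main),
+     vertex.size = 2, vertex.label = NA, edge.arrow.size = 0.1)
+@
+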
Earlier, we relied on an existing category scheme applied to journals to create Figure \ref{fig:pubstime}; all articles published in particular journals were treated as being within one field. Network analysis, however, can identify groups and categories of articles in terms of who is citing whom and, as a result, can reveal groups that cross journal boundaries. PLoS ONE, for example, is a `megajournal' that publishes articles from all scientific fields \citep{binfield_plos_2012}. As a result, PLoS ONE is one of the most frequently included journals in both \colorbox{CarnationPink}{\color{black}Community 1} and \colorbox{Cyan}{\color{black}Community 4}. In a journal-based categorization system, articles may be misclassified or not classified at all.
+
+\begin{figure}
+ \centering
+ \includegraphics[width=0.4\textwidth]{figures/cluster_connections.pdf}
+ \caption{Graphical representation of citations between communities using the same color mapping described in Table \ref{tab:clusters}. The size of the nodes reflects the total number of papers in each community. The thickness of each edge reflects the number of outgoing citations. Edges are directional, and share the color of their source (i.e., citing) community.}
+ \label{fig:cluscon}
+\end{figure}
+
+Network analysis can also reveal details about the connections between fields. Figure \ref{fig:cluscon} shows a second network we have created in which our communities are represented as nodes and citations from articles in one community to articles in the other communities are represented as edges. The thickness of each edge represents the number of citations and the graph shows the directional strength of the relative connections between communities. For example, the graph suggests that the communication studies community (\colorbox{Black}{\color{white}Community 3}) cites many papers in information technology and management (\colorbox{Green}{\color{black}Community 2}) but that this relationship is not reciprocated.
+
+\subsection{Discussion}
+
+Like many computational methods, the power of network techniques comes from the representation of complex relationships in simplified forms. Although elegant and powerful, the network analysis approach is inherently reductive and limited in many ways. What we gain in our ability to analyze millions or billions of individuals comes at the cost of our ability to speak about particular individuals and sub-groups. A second limitation stems from the many different kinds of relationships that can be represented in graphs. A citation network and a co-citation network, for example, represent different types of connections and these differences might lead an algorithm to identify different communities. As a result, choices about the way that edges and nodes are defined can lead to very different conclusions about the structure of a network or the influence of particular nodes. Network analyses often treat all connections and all nodes as similar in ways that mask important variation.
+
+Network analysis is built on the assumption that knowing about the relationships between individuals in a system is often as important, and sometimes more important, than knowing about the individuals themselves. It inherently recognizes interdependence and the importance of social structures. This perspective comes with a cost, however.
The relational structure and interdependence of social networks make it impossible to use traditional statistical methods, and SNA practitioners have had to move to more complex modeling strategies and simulations to test hypotheses.
+
+\section{Text analysis}
+
+Social media produces an incredible amount of text, and social media researchers often analyze the content of this text. For example, researchers use ethnographic approaches \citep{kozinets_field_2002} or content analysis \citep{chew_pandemics_2010} to study the texts of interactions online. Because the amount of text available for analysis is far beyond the ability of any set of researchers to analyze by hand, scholars increasingly turn to computational approaches. Some of these analyses are fairly simple, such as tracking the occurrence of terms related to a topic or psychological construct \citep{tausczik_psychological_2010}. Others are more complicated, using tools from natural language processing (NLP). NLP includes a range of approaches in which algorithms are applied to texts, such as machine translation, optical character recognition, and part-of-speech tagging. Perhaps the most common use in the social sciences is sentiment analysis, in which the affect of a piece of text is inferred from the words that are used \citep{asur_predicting_2010}. Many of these techniques have applications for social media research.
+
+One natural language processing technique -- topic modeling -- has become increasingly common in computational social science research. Topic modeling seeks to identify topics automatically within a set of documents. In this sense, topic modeling is analogous to content analysis or other manual forms of document coding and labeling. However, topic models are a completely automated, unsupervised computational method -- i.e., topic modeling algorithms do not require any sort of human intervention, such as hand-coded training data or dictionaries of terms. Topic modeling scales well to even very large datasets, and is most usefully applied to large corpora of text where labor-intensive methods like manual coding are simply not an option.
+
+When using the technique, a researcher begins by feeding topic modeling software the texts that she would like to find topics for and by specifying the number of topics to be returned. There are multiple algorithms for identifying topics, but we focus on the most common: \emph{latent Dirichlet allocation} or LDA \citep{blei_latent_2003}. The nuts and bolts of how LDA works are complex and beyond the scope of this chapter, but the basic goal is fairly simple: LDA identifies sets of words that are likely to be used together and calls these sets `topics.' For example, a computer science paper is likely to use words like `algorithm', `memory', and `network.' While a communication article might also use `network,' it would be much less likely to use `algorithm' and more likely to use words like `media' and `influence.' The other key feature of LDA is that it does not treat documents as belonging to only one topic, but as consisting of a mixture of multiple topics with different degrees of emphasis. For example, an LDA analysis might characterize this chapter as a mixture of computer science and communication (among other topics).
+
+LDA identifies topics inductively from the observed distributions of words in documents.
The LDA algorithm looks at all of the words that co-occur within a corpus of documents and assumes that words used in the same document are more likely to be from the same topic. The algorithm then looks across all of the documents and finds the set of topics and topic distributions that would be, in a statistical sense, most likely to produce the observed documents. LDA's output is the set of topics: ranked lists of words likely to be used in documents about each topic, as well as the distribution of topics in each document. \citet{dimaggio_exploiting_2013} argue that while many aspects of topic modeling are simplistic, many of the assumptions have parallels in sociological and communication theory. Perhaps more importantly, the topics created by LDA frequently correspond to human intuition about how documents should be grouped or classified.
+
+The results of topic models can be used in many ways. Our dataset includes \Sexpr{nrow(abstracts[grep('LDA',abstracts[['abstract']]),])} publications with the term `LDA' in their abstracts. Some of these papers use topic models to conduct large-scale content analysis, such as looking at the topics used around health on Twitter \citep{prier_identifying_2011,ghosh_what_2013}. Researchers commonly use topic modeling for prediction and machine learning tasks, such as identifying how topics vary by demographic characteristics and personality types \citep{schwartz_personality_2013}. In our dataset, papers use LDA to predict transitions between topics \citep{wang_tm-lda:_2012}, to recommend friends based on similar topic use \citep{pennacchiotti_investigating_2011}, and to identify interesting tweets on Twitter \citep{yang_identifying_2014}.
+
+
+\subsection{Our application: Identifying topics in social media research}
+
+We apply LDA to the texts of abstracts in our dataset in order to identify prevalent topics in social media research. We show how topics are extracted and labeled and then use data on topic distributions to show how the focus of social media research has changed over time. We begin by collecting each of the abstracts for the papers in our sample. Scopus does not include abstract text for \Sexpr{f(total.articles - nrow(abstracts))} of the \Sexpr{f(total.articles)} articles in our sample. We examined a random sample of the entries with missing abstracts by hand, and found that abstracts for many simply never existed (e.g., articles published in trade journals or books). Other articles had published abstracts, but the text of these abstracts, for reasons that are not clear, was not available through Scopus.\footnote{This provides one example of how the details of missing data can be invisible or opaque. It is easy to see how missing data like this could impact research results. For example, if certain disciplines or topics are systematically less likely to include abstracts in Scopus, we will have a skewed representation of the field.}
+We proceed with the \Sexpr{f(nrow(abstracts))} articles in our sample for which abstract data was available.
+The average abstract in this dataset is \Sexpr{f(round(mean(word_count),2))} words long, with a maximum of \Sexpr{f(max(word_count))} words and a minimum of \Sexpr{f(min(word_count))} (``\Sexpr{abstracts[which.min(word_count),'abstract']}'').
+
+We then remove `stop words' (common words like `the,' `of,' etc.) and tokenize the documents by breaking them into unigrams and bigrams (one-word and two-word terms).
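+
+Before turning to our own pipeline, it may help to see how compact this workflow can be in code. The sketch below is a simplified stand-in, not our actual analysis: it uses the R \emph{tm} and \emph{topicmodels} packages rather than the Python tools described next, builds unigram counts only, and assumes a hypothetical character vector of abstract texts:
+
+<<lda-sketch, echo=TRUE, eval=FALSE>>=
+library(tm)           # corpus handling and preprocessing
+library(topicmodels)  # LDA implementation
+
+# `abstract.text' is assumed to be a character vector of abstracts
+corpus <- VCorpus(VectorSource(abstract.text))
+corpus <- tm_map(corpus, content_transformer(tolower))
+corpus <- tm_map(corpus, removePunctuation)
+corpus <- tm_map(corpus, removeWords, stopwords("english"))
+
+# document-term matrix of unigram counts (bigrams omitted for brevity)
+dtm <- DocumentTermMatrix(corpus)
+
+# fit a twelve-topic model; the number of topics is the researcher's choice
+lda.fit <- LDA(dtm, k = 12, control = list(seed = 42))
+
+terms(lda.fit, 10)                       # ten most probable words per topic
+doc.topics <- posterior(lda.fit)$topics  # per-document topic proportions
+@
+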
We analyze the data using the Python \emph{LatentDirichletAllocation} module from the \emph{scikit-learn} library \citep{pedregosa_scikit-learn:_2011}. Choosing the appropriate number of topics to be returned (typically referred to as \textit{k}) is a matter of some debate and research \citep[e.g.,][]{arun_finding_2010}. After experimenting with different values of \textit{k}, plotting the distribution of topics each time in a way similar to the graph shown in Figure \ref{fig:ldaplots}, we ultimately set \textit{k} to twelve. At higher values of \textit{k}, additional topics only rarely appeared in the abstracts.
+
+\subsection{Results}
+
+Table \ref{topic_table} shows the top words for each of the topics discovered by the LDA model, sorted by how common each topic is in our dataset.
+At this point, researchers typically evaluate the lists of words for coherence and give names to each of the topics. For example, we look at the words associated with Topic 1 and give it the name `Media Use.' Of course, many other names for this topic could be chosen. We might call it `Facebook research' because it is the only topic which includes the term `facebook.' Researchers often validate these names by looking at some of the texts which score highest for each topic and subjectively evaluating the appropriateness of the chosen name as a label for those texts. For example, we examined the abstracts of the five papers with the highest value for the `Media Use' topic and confirmed that we were comfortable claiming that they were examples of research about media use. In this way, topic modeling requires a mixture of both quantitative and qualitative interpretation. The computer provides results, but making sense of those results requires familiarity with the data.
+
+\begin{table}
+ \tiny
+ \begin{adjustbox}{center}
+ \input{tables/topic_words1.tex}
+ \end{adjustbox}
+
+ \vspace{1em}
+
+ \begin{adjustbox}{center}
+ \input{tables/topic_words2.tex}
+ \end{adjustbox}
+ \caption{Top 20 terms for each topic. Topics are presented in the order of their frequency in the corpus of abstracts.}
+\label{topic_table}
+\end{table}
+
+
+\begin{figure}
+\centering
+<>=
+data.sets <- c("topic.sums", "topic.means", "weighted.sums")
+
+tmp <- lapply(data.sets, function (x) {
+ d <- eval(as.name(x))
+ d$year <- rownames(d)
+ d <- melt(d, id.vars="year")
+ colnames(d) <- c("year", "Category", "value")
+ d[["variable"]] <- x
+ d
+})
+
+grid.tmp <- do.call("rbind", tmp)
+
+grid.tmp$Category <- gsub("\\." , " ", grid.tmp$Category)
+grid.tmp$Category <- factor(grid.tmp$Category, levels=unique(grid.tmp$Category))
+
+grid.tmp$year <- as.Date(paste(grid.tmp$year, "-07-21", sep=""))
+
+grid.tmp$variable <- revalue(grid.tmp$variable, c("weighted.sums"="Weighted Sums", "topic.sums"="Topic Sums", "topic.means"="Topic Means"))
+grid.tmp$variable <- factor(grid.tmp$variable, levels=c("Topic Sums", "Topic Means", "Weighted Sums"))
+
+# drop 2016
+grid.tmp <- grid.tmp[grid.tmp$year <= as.Date("2015-01-01"),]
+
+ggplot(grid.tmp) + aes(x=year, y=value, group=Category,
+ color=Category
+ ) + geom_line(aes(linetype=Category)) +
+ facet_grid(variable ~ ., scale="free_y") +
+ labs(x="Year", y="") + # scale_colour_grey() +
+ theme_bw()
+#$
+@
+\caption{Statistics from our LDA analysis, over time. The top panel shows topic sums which capture the amount that each topic is used in abstracts, by year. The middle panel shows topic means which are the average amount that each topic is used in a given abstract.
The bottom panel shows the amount that each topic is used in abstracts, by year, weighted by citation count.}
+\label{fig:ldaplots}
+\end{figure}
+
+The top panel of Figure \ref{fig:ldaplots} shows how the distribution of topics identified by LDA in our analysis has changed over time. The LDA algorithm gives each abstract a probability distribution over the topics that sums to 1 (e.g., a given abstract may be 80\% `Social Network Analysis,' 20\% `Education,' and 0\% everything else). To construct Figure \ref{fig:ldaplots}, we sum these percentages for all of the documents published in each year and plot the resulting prevalence of each topic over time.\footnote{More complex approaches such as dynamic LDA \citep{blei_dynamic_2006} are often better suited to identify the temporal evolution of topics.}
+
+The figures provide insight into the history and trajectory of social media research. The top panel suggests that the `Social Network Analysis' topic was the early leader in publishing on social media, but was overtaken by the `Media Use' topic around 2012. This pattern is even more apparent when we look at the mean amount that each topic was used each year (the middle panel of Figure \ref{fig:ldaplots}). In the bottom panel, we take a third look at this data by weighting the topics used in each paper by the log of the number of citations that the paper received. This final statistic characterizes how influential each topic has been. The overall story is similar, although we see that the `Health' topic and the `Media Use' topic are more influential than the non-weighted figures suggest.
+
+\subsection{Discussion}
+
+Some of the strengths of topic modeling become apparent when we compare these LDA-based analyses with the distribution of papers by discipline that we created earlier (Figure \ref{fig:pubstime}). In this earlier attempt, we relied on the categories that Scopus provided and found that early interest in social media was driven by computer science and information systems researchers. Through topic modeling, we learn that these researchers engaged in social network analysis (rather than interface design, for example). While some of our topics match up well with the disciplines identified by Scopus, a few are broader (e.g., `Media Use') and most are narrower (e.g., `Sentiment Analysis'). This analysis might provide a richer sense of the topics of interest to social media researchers. Finally, these topics emerged inductively without any need for explicit coding, such as classifying journals into disciplines. This final feature is a major benefit in social media research where text is rarely categorized for researchers ahead of time.
+
+Topic modeling provides an intuitive, approachable way of doing large-scale text analysis. Its outputs can be understandable and theory-generating. The inductive creation of topics has advantages over traditional content analysis or `supervised' computational methods that require researchers to define labels or categories of interest ahead of time. While the results of topic models clearly lack the nuance and depth of understanding that human coders bring to texts, the method allows researchers to analyze datasets at a scale and granularity that would take a huge amount of resources to code manually.
+
+There are, of course, limitations to topic modeling. Many of LDA's limitations have analogues in manual coding.
One we have already mentioned is that researchers must choose the number of topics without any clear rules about how to do so. A similar problem exists in content analysis, but the merging and splitting of topics can be done more intuitively and intentionally. +An additional limitation is that topic modeling tends to work best with many long documents. This can represent a stumbling block for researchers with datasets of short social media posts or comments; in these cases posts can be aggregated by user or by page to produce meaningful topics. The scope of documents can also affect the results of topic models. If, in addition to using abstracts about `social media,' we had also included abstracts containing the term `gene splicing,' our twelve topics would be divided between the two fields and each topic would be less granular. To recover topics similar to those we report here, we would have to increase the number of topics created. + +As with network analysis, a goal of LDA is to distill large, messy, and noisy data down to much simpler representations in order to find patterns. Such simplification will always entail ignoring some part of what is going on. Luckily, human coders and LDA have complementary advantages and disadvantages in this regard. Computational methods do not understand which text is more or less important. Humans are good at seeing the meaning and importance of topics, but may suffer from cognitive biases and miss out on topics that are less salient \citep{dimaggio_exploiting_2013}. Topic models work best when they are interpreted by researchers with a rich understanding of the texts and contexts under investigation. + +\section{Predicting citation} + +A final computational technique is statistical prediction. Statistical prediction can come in handy in situations where researchers have a great deal of data, including measures of an important, well-defined outcome they care about, but little in the way of prior literature or theory to guide analysis. Prediction has become a mainstream computational approach that encompasses a number of specific statistical techniques including classification, cross validation, and machine learning (also known as statistical learning) methods \citep{tibshirani_regression_1996}. Arguably made most famous by Nate \citet{silver_signal_2015}, who uses the technique to predict elections and sporting event outcomes, prediction increasingly colors public discourse about current events \citep{domingos_master_2015}. + +There are many approaches to prediction. We focus on regression-based prediction because it offers a reasonably straightforward workflow. Begin by breaking a dataset into two random subsets: a large subset used as `training' data and a small subset as `holdout' or `test' data. Next, use the training data to construct a regression model of a given outcome (dependent variable) that incorporates a set of features (independent variables) that might explain variations in the outcome. Apply statistical model selection techniques to determine the best weights (coefficients) to apply to the variables. Evaluate the performance of the model by seeing how accurately it can predict the outcome on the test data. After selecting an appropriate model, assess and interpret the items that most strongly predict the outcome. One can even compare the performance of different or nested sets of features by repeating these steps with multiple groups of independent variables. + +Interpreting the results of statistical prediction can be less clear-cut. 
The term `prediction' suggests a deep knowledge of a complex social process and the factors that determine a particular outcome. However, statistical prediction often proves more suitable for exploratory analysis where causal mechanisms and processes are poorly understood. We demonstrate this in the following example that predicts whether or not papers in our dataset get cited during the period of data collection. In particular, we try to find out whether textual features of the abstracts can help explain citation outcomes. Our approach follows that used by \citet{mitra_language_2014}, who sought to understand what textual features of Kickstarter projects predicted whether or not projects were funded. + +\subsection{Our application: Predicting paper citation} + +<>= +attach(pred.descrip) +@ + +We use multiple attributes of the papers in our dataset, including text of their abstracts, to predict citations. About \Sexpr{f(round(table(cited)[["TRUE"]] / total.articles * 100, 0))}\% of the papers (\Sexpr{f(length(cited[cited]))} out of \Sexpr{f(total.articles)}) received one or more citations ($\mu = \Sexpr{f(round(mean(cites),2))}$; $\sigma = \Sexpr{f(round(sd(cites), 2))}$). Can textual features of the abstracts explain which papers receive citations? What about other attributes, such as the publication venue or subject area? A prediction analysis can help evaluate these competing alternatives. + +To begin, we generate a large set of features for each paper from the Scopus data. Our measures include the year, month, and language of publication as well as the number of citations each paper contains to prior work. We also include the modal country of origin of the authors as well as the affiliation of the first author. Finally, we include the publication venue and publication subject as provided by Scopus. Then, we build the textual features by taking all of the abstracts and moving them through the following sequence of steps similar to those we took when performing LDA: we lowercase all the words; remove all stop words; and create uni-, bi-, and tri-grams. + +We also apply some inclusion criteria to both papers and features. To avoid subject-specific jargon, we draw features only from those terms that appear across at least 30 different subject areas. To avoid spurious results, we also exclude papers that fall into unique categories. For example, we require that there be more than one paper published in any language, journal, or subject area that we include as a feature. These sorts of unique cases can cause problems in the context of prediction tasks because they may predict certain outcomes perfectly. As a result, it is often better to focus on datasets and measures that are less `sparse' (i.e.,~characterized by rare, one-off observations). Once we drop the \Sexpr{ f( length(cited) - length(covars['cited']))} papers that do not meet these criteria, we are left with \Sexpr{f(length(covars['cited']))} papers. + +We predict the dichotomous outcome variable \emph{cited}, which indicates whether a paper received any citations during the period covered by our dataset (2004-2016). We use a method of \emph{penalized logistic regression} called the least absolute shrinkage and selection operator (also known as the \emph{Lasso}) to do the prediction work. 
Although the technical details of Lasso models lie beyond the scope of this chapter, the Lasso and other penalized regression models work well on data where many of the variables have nearly identical values (sometimes called collinear variables because they would sit right around the same line if you plotted them) and/or many zero values (this is also called `sparse' data) \citep{friedman_regularization_2010, james_introduction_2013}. In both of these situations, some measures are redundant; the Lasso uses clever math to pick which of those measures should go into your final model and which ones should be, in effect, left out.\footnote{To put things a little more technically, a fitted Lasso model \emph{selects} the optimal set of variables that should have coefficient values different from zero and \emph{shrinks} the rest of the coefficients to zero without sacrificing goodness of fit \citep{tibshirani_regression_1996}.} The results of a Lasso model are thus more computationally tractable and easier to interpret.
+
+We use a common statistical technique called cross-validation to validate our models. Cross-validation helps solve another statistical problem that can undermine the results of predictive analysis. Imagine fitting an ordinary least squares regression model on a dataset to generate a set of parameter estimates reflecting the relationships between a set of independent variables and some outcome. The model results provide a set of weights (the coefficients) that represent the strength of the relationships between each predictor and the outcome. Because of the way regression works (and because this is a hypothetical example and we can assume data that does not violate the assumptions of our model), the model weights are the best linear unbiased estimators of those relationships. In other words, the regression model fits the data as well as possible. However, nothing about fitting this one model ensures that the same regression weights will provide the best fit for some new data from the same population that the model has not seen. A model is overfit if it predicts the dataset it was fitted on extremely well but predicts new data poorly. Overfitting in this way is a common concern in statistical prediction. Cross-validation addresses this overfitting problem. First, the training data is split into equal-sized groups (typically 10). Different model specifications are tested by iteratively training them on all but one of the groups, and testing how well they predict the final group. The specification that has the lowest average error is then used on the full training data to estimate coefficients.\footnote{For our Lasso models, cross-validation was used to select $\lambda$, a parameter that tells the model how aggressively to shrink variable coefficients. We include this information for those of you who want to try this on your own or figure out the details of our statistical code.} This approach ensures that the resulting models not only fit the data that we have, but that they are likely to predict the outcomes for new, unobserved cases. For each model, we report the mean error rate from the cross-validation run which produced the best fit.
+
+Our analysis proceeds in multiple stages corresponding to the different types of measures we use to predict citation outcomes. We start by estimating a model that includes only the features that correspond to paper and author-level attributes (year, month, and language of publication, modal author country).
We then add information about the first author's affiliation. Next, we include predictors that have more to do with research topic and field-level variations (publication venue and subject area). Finally, we include the textual features (terms) from the abstracts.
+
+\subsection{Results}
+
+Table \ref{tab:predict_models} summarizes the results of our prediction models. We include goodness-of-fit statistics and prediction error rates for each model as we add more features. A `better' model will fit the data more closely (i.e.,~it will explain a larger percentage of the deviance) and produce a lower error rate. We also include a supplementary error rate calculated against the `holdout' data, created from a random subset of 10\% of the original dataset that was not used in any of our models. An intuitive way to think about the error rate is to imagine it as the percentage of unobserved papers for which the model will incorrectly predict whether or not they receive any citations. The two error rate statistics are just this same percentage calculated on different sets of unobserved papers. Unlike a normal regression analysis, we do not report or interpret the full battery of coefficients, standard errors, t-statistics, or p-values. In part, we do not report this information because the results of these models are unwieldy -- each model has over 2,000 predictors and most of those predictors have coefficients of zero! Additionally, unlike traditional regression results, coefficient interpretation and null hypothesis testing with predictive models remain challenging (for reasons that lie beyond the scope of this chapter). Instead, we focus on interpreting the relative performance of each set of features. After we have done this, we refer to the largest coefficients to help add nuance to our interpretation.\\
+
+\begin{table}
+\begin{adjustbox}{center}
+<>=
+detach(pred.descrip)
+attach(predict.list)
+
+print(xtable(results.tab,
+ align='llrrrr'),
+ include.rownames=FALSE,
+ floating=FALSE)
+@
+\end{adjustbox}
+
+\caption{Summary of fitted models predicting citation. The `Model' column describes which features were included. The `N features' column shows the number of features included in the prediction. `Deviance' summarizes the goodness of fit as a percentage of the total deviance accounted for by the model. `CV error' (cross-validation error) reports the prediction error rates of each model in the cross-validation procedure conducted as part of the parameter estimation process. `Holdout error' shows the prediction error on a random 10\% subset of the original dataset not included in any of the model estimation procedures.}
+\label{tab:predict_models}
+\end{table}
+
+The results reveal that all of the features improve the goodness of fit, but not necessarily the predictive performance of the models. As a baseline, our controls-only model has a \Sexpr{f(as.numeric(as.character(results.tab[1,5])))}\% classification error on the holdout sample. This performance barely improves with the addition of both the author affiliation and subject area features. We observe substantial improvements in the prediction performance when the models include the publication venue features and the abstract text terms. When it comes to research about social media, it appears that venue and textual content are the most informative features for predicting whether or not articles get cited.
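+
+Readers who want to experiment with this approach can reproduce the core of the estimation workflow described above with the R \emph{glmnet} package, which fits Lasso models and selects $\lambda$ by cross-validation. The matrix and vector names below are hypothetical stand-ins for our real data, and our actual code (available in the online supplement) differs in its details:
+
+<<lasso-sketch, echo=TRUE, eval=FALSE>>=
+library(glmnet)
+
+# `x' is assumed to be a (sparse) matrix with one row per paper and one
+# column per feature; `cited' is a 0/1 vector marking whether it was cited
+set.seed(42)
+holdout <- sample(nrow(x), size = round(0.1 * nrow(x)))
+
+# ten-fold cross-validation chooses the penalty parameter lambda
+cv.fit <- cv.glmnet(x[-holdout, ], cited[-holdout], family = "binomial",
+                    alpha = 1, nfolds = 10, type.measure = "class")
+
+# classification error on the 10 percent holdout sample
+preds <- predict(cv.fit, newx = x[holdout, ], s = "lambda.min", type = "class")
+mean(preds != cited[holdout])
+@
+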
+ +To understand these results more deeply, we explore the non-zero coefficient estimates for the best-fitting iteration of the full model. Recall that the Lasso estimation procedure returns coefficients for a subset of the parameters that produce the best fit and shrinks the other coefficients to zero. While it does not make sense to interpret the coefficients in the same way as traditional regression, the non-zero coefficients indicate what features the model identified as the most important predictors of the outcome. First, we note that among the \Sexpr{f(nrow(nz.coefs))} features with non-zero coefficients, only \Sexpr{f(sum(round(prop.table(table(nz.coefs['Type']))*100, 2)[c("country", "language", "month", "year")]))}\% are control measures (country, language, month, and year of publication). Similarly, \Sexpr{f(round(prop.table(table(nz.coefs$Type))*100,2)["subject"])}\% are subject features. In contrast, \Sexpr{f(round(prop.table(table(nz.coefs$Type))*100, 2)["affiliation"])}\% are affiliation features, \Sexpr{f(round(prop.table(table(nz.coefs$Type))*100, 2)["venue"])}\% are venue features, and a whopping \Sexpr{f(round(prop.table(table(nz.coefs$Type))*100, 2)["term"])}\% are textual terms. Once again, we find that publication venue and textual terms do the most to explain which works receive citations. + +Closer scrutiny of the features with the largest coefficients adds further nuance to this interpretation. Table \ref{tab:nzcoefs} shows the ten features with the largest coefficients in terms of absolute value. The Lasso model identified these coefficients as the most informative predictors of whether or not papers in our dataset get cited. Here we see that the majority of these most predictive features are publication venues. The pattern holds across the 100 features with the largest coefficients, of which \Sexpr{f(round(prop.table(table(nz.coefs$Type[1:100]))*100, 2)["venue"])} are publication venues and only \Sexpr{f(round(prop.table(table(nz.coefs$Type[1:100]))*100, 2)["term"])} are textual terms from the abstracts. In other words, variations in publication venue predict which work gets cited more than any other type of feature. + +\begin{table} + \begin{adjustbox}{center} +<>= +print(xtable(as.matrix(head(nz.coefs, 10)), + align='lllr' + ), + include.rownames=FALSE, + floating=FALSE) +@ +\end{adjustbox} +\caption{Feature, variable type, and beta value for top 10 non-zero coefficients estimated by the best fitting model with all features included. Note that the outcome is coded such that positive coefficients indicate features that positively predict the observed outcome of interest (getting cited) while negative coefficients indicate features that negatively predict the outcome.} +\label{tab:nzcoefs} +\end{table} + +\subsection{Discussion} + +The results of our prediction models suggest that two types of features -- publication venue and textual terms -- do the most to explain whether or not papers on social media get cited. Both types of features substantially improve model fit and reduce predictive error in ten-fold cross-validation as well as on a holdout 10\% sub-sample of the original dataset. However, the venue features appear to have a much stronger relationship to our outcome (citation), with the vast majority of the most influential features in the model coming from the venue data (\Sexpr{f(round(prop.table(table(nz.coefs$Type[1:100]))*100, 2)["venue"])} of the 100 largest coefficients). 
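+
+In the same hypothetical sketch, the coefficient inspection behind Table \ref{tab:nzcoefs} amounts to extracting the non-zero coefficients from the fitted model and ranking them by absolute size:
+
+<<lasso-coefs-sketch, echo=TRUE, eval=FALSE>>=
+# coefficients at the cross-validated lambda; most are exactly zero
+betas <- as.matrix(coef(cv.fit, s = "lambda.min"))
+nz <- data.frame(feature = rownames(betas), beta = betas[, 1])
+nz <- nz[nz$beta != 0 & nz$feature != "(Intercept)", ]
+
+# rank by absolute magnitude to see the most informative predictors
+head(nz[order(abs(nz$beta), decreasing = TRUE), ], 10)
+@
+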
+
+As we said at the outset of this section, statistical prediction offers an exploratory, data-driven, and inductive approach. Based on these findings, we conclude that the venue where research on social media gets published better predicts whether that work gets cited than any of the other features in our dataset. Textual terms used in abstracts help to explain citation outcomes across the dataset, but the relationship between textual terms and citation only becomes salient in aggregate. On their own, hardly any of the textual terms approach the predictive power of the venue features. Author affiliation and paper-level features like language or authors' country provide less explanatory power overall.
+
+The approach has several important limitations. Most important, statistical prediction only generates `predictions' in a fairly narrow, statistical sense. The language of prediction often sounds like the language of causality and causal inference, but these methods do not guarantee that anything being studied is causal or explanatory in terms of mechanisms. We do not claim that a paper's publication venue or the phrases in its abstract \emph{cause} people to cite it. Rather, we think these attributes of a paper likely index specific qualities of an article that are linked to citation outcomes. Just because something is predictive does not mean it is deterministic or causal.
+We also note that the sort of machine learning approach we demonstrate here does not support the types of inferences commonly made with frequentist null hypothesis tests (the sort that lead to p-values and stars next to `significant' variables in a regression model). Instead, the interpretation of learning models rests on looking closely at model summary statistics, objective performance metrics (such as error rates), and qualitative exploration of model results.
+
+\section{Conclusion}
+
+In this chapter, we have described computational social scientific analysis of social media by walking through a series of example analyses. We began with the process of collecting a dataset of bibliographic information on social media scholarship using a web API similar to those provided by most social media platforms. We then subjected this dataset to three of the most widely used computational techniques: network analysis, topic modeling, and statistical prediction. Most empirical studies would employ a single, theoretically-motivated analytic approach, but we compromised depth in order to illustrate the diversity of computational research methodologies available. As we have shown, each approach has distinct strengths and limitations.
+
+We believe our examples paint a realistic picture of what is involved in typical computational social media research. However, these analyses remain limited in scope and idiosyncratic in nature. For example, there are popular computational methods we did not cover in this chapter. Obvious omissions include other forms of machine learning, such as decision trees and collaborative filtering \citep{resnick_grouplens:_1994}, as well as simulation-based techniques such as agent-based modeling \citep{macy_factors_2002, wilensky_introduction_2015}.
+
+Despite our diffuse approach, we report interesting substantive findings about the history and state of social media research. We discovered a number of diverse communities studying social media.
We used different tools to identify these communities, including the categories provided by Scopus, the results of a community detection algorithm applied to the citation network, and the topics identified by topic modeling. Each analysis provided a slightly different picture of social media research. We learned that the study of social media related to media use and medical research is on the rise. We also learned that social network research was influential at the early stages of social media research, but that it is not as influential in the citation network. All of these findings are complicated by our final finding that subject area is not as good a predictor of whether a paper will receive a citation as the publication venue and the terms used in the abstract. + +In the process of describing our analyses, we tried to point to many of the limitations of computational research methods. +Although computational methods and the promise of `big data' elicit excitement, this hype can obscure the fact that large datasets and fast computers do nothing to obviate the fundamentals of high quality social science: researchers must understand their empirical settings, design studies with care, operationalize concepts in ways that are valid and honest, take steps to ensure that their findings generalize, and ask tough questions about the substantive impacts of observed relationships. These tenets extend to computational research as well. + +Other challenges go beyond methodological limitations. Researchers working with passively collected data generated by social media can face complex issues around the ethics of privacy and consent as well as the technical and legal restrictions on automated data collection. Computational analyses of social media often involve datasets gathered without the sort of active consent considered standard in other arenas of social scientific inquiry. In some cases, data is not public and researchers access it through private agreements (or employment arrangements) with companies that own platforms or proprietary databases. In others, researchers obtain social media data from public or semi-public sources, but the individuals creating the data may not consider their words or actions public and may not even be aware that their participation generates durable digital traces \citep{boyd_critical_2012}. A number of studies have been criticized for releasing information that researchers considered public, but which users did not \citep{zimmer_okcupid_2016}. In other cases, researchers pursuing legitimate social inquiry have become the target of companies or state prosecutors who selectively seek to enforce terms of service agreements or invoke broad laws such as the federal Computer Fraud and Abuse Act (CFAA).\footnote{See \citepos{sandvig_why_2016} blogpost, ``Why I am Suing the Government,'' for a thoughtful argument against the incredibly vague and broad scope of the CFAA as well as a cautionary tale for those who write software to conduct bulk downloads of public website data for research purposes.} + +We advise computational researchers to take a cautious and adaptive approach to these issues. Existing mechanisms such as Institutional Review Boards and federal laws have been slow to adjust to the realities of online research. In many cases, the authority and resources to anticipate, monitor, or police irresponsible behaviors threaten to impose unduly cumbersome restrictions. In other cases, review boards' policies greenlight research that seems deeply problematic. 
We believe researchers must think carefully about the specific implications of releasing specific datasets. In particular, we encourage abundant caution and public consultation before disseminating anything resembling personal information about individual social media system users. Irresponsible scholarship harms both subjects and reviewers and undermines the public trust scholars need to pursue their work. + +At the same time, we remain excited and optimistic about the future of computational studies of social media. As we have shown, the potential benefits of computational methods are numerous. Trace data can capture behaviors that are often difficult to observe in labs and that went unrecorded in offline interactions. Large datasets allow researchers to measure real effects obscured by large amounts of variation, and to make excellent predictions using relatively simple models. These new tools and new datasets provide a real opportunity to advance our understanding of the world. Such opportunities should not be undermined by overly-broad laws or alarmist concerns. + +Finally, much of computational social science, including this chapter, is data-focused rather than theory-focused. We would encourage others to do as we say, and not as we do. The great promise of computational social science is the opportunity to test and advance social science theory. We hope that readers of this chapter will think about whether there are theories they are interested in which might benefit from a computational approach. We urge readers with a stronger background in theory to consider learning the tools to conduct these types of analyses and to collaborate with technically minded colleagues. + +\subsection{Reproducible research} + +Computational research methods also have the important benefit of being extraordinarily reproducible and replicable \citep{stodden_toward_2013}. Unlike many other forms of social research, a computational researcher can theoretically use web APIs to collect a dataset identical to one used in a previous study. +Even when API limits or other factors prohibit creating an identical dataset, researchers can work to make data available alongside the code they use for their analysis, allowing others to re-run the study and assess the results directly. +Making code and data available also means that others can analyze and critique it. This can create uncomfortable situations, but we feel that such situations serve the long-term interests of society and scholarly integrity. Although not every computational researcher shares their code \citep{stodden_toward_2013} there are movements to encourage or require this \citep{leveque_reproducible_2012, stodden_toward_2013, bollen_social_2015}. + +We have tried to follow emerging best practices with regards to reproducibility in this chapter. We have released an online copy of all of the code that we used to create this chapter. By making our code available, we hope to make our unstated assumptions and decisions visible. By looking at our code, you might find errors or omissions which can be corrected in subsequent work. By releasing our code and data, we also hope that others can learn from and build on our work. +For example, a reader with access to a server and some knowledge of the Python and R programming languages should be able to build a more up-to-date version of our dataset years from now. Another reader might create a similar bibliographic analysis of another field. 
By using our code, this reader should able to produce results, tables, and figures like those in this chapter. Data repositories, such as the Harvard Dataverse, make storing and sharing data simple and inexpensive. +When thinking of the opportunities for openness, transparency, and collaboration, we are inspired by the potential of computational research methods for social media. We hope that our overview, data, and code can facilitate more of this type of work. + +\section{Online supplements} +\label{sec-supp} + +All of the code used to generate our dataset, to complete our analyses, and even to produce the text of this chapter, is available for download on the following public website: \url{https://communitydata.cc/social-media-chapter/} + +Because the Scopus dataset is constantly being updated and changed, reproducing the precise numbers and graphs in this chapter requires access to a copy of the dataset we collected from Scopus in 2016. Unfortunately, like many social media websites, the terms of use for the Scopus APIs prohibit the re-publication of data collected from their database. However, they did allow us to create a private, access-controlled, replication dataset in the Harvard Dataverse archive at the following URL: \url{http://dx.doi.org/10.7910/DVN/W31PH5}. Upon request, we will grant access to this dataset to any researchers interested in reproducing our analyses. + +% bibliography here +\printbibliography[title = {References}, heading=secbib] + +\end{document} + +% LocalWords: Foote BrickRed lazer Scopus wasserman blei james neef +% LocalWords: geo howison flossmole leskovec programmatically JSON +% LocalWords: NodeXL hansen droplevels sep ggplot aes xlab ylab bw +% LocalWords: df cbind xtable lp linewidth Weibo eigenvector welles +% LocalWords: hausmann bibliometrics scientometrics kessler sm ccm +% LocalWords: rosvall subelj subfields biomedicine tm dplyr plyr nz +% LocalWords: bioinformatics Informatics KDD psychometrics Gephi sd +% LocalWords: Cyberpsychology fruchterman bastian gephi kovacs plos +% LocalWords: megajournal binfield kozinets tausczik asur LDA ghosh +% LocalWords: dimaggio relationality heteroglossia copresence LIWC +% LocalWords: schwartz regressors olds wang lda pennacchiotti pdf +% LocalWords: facebook textwidth tmp lapply eval rbind gsub mitra +% LocalWords: Kickstarter tri collinear stodden leveque bollen SNA +% LocalWords: secbib textbf textcolor wrapify formatC WebKDD merton +% LocalWords: citedby barabasi matthew hargittai Bergstrom snagrey +% LocalWords: misclassified grayscale tokenize unigrams bigrams +% LocalWords: LatentDirichletAllocation scikit linetype colour pred +% LocalWords: dimensionality tibshirani domingos descrip covars +% LocalWords: friedman llrrrr coefs nzcoefs lllr macy LDA's +% LocalWords: frequentist resnick grouplens wilensky boyd zimmer +% LocalWords: okcupid CFAA Sandvig's blogpost replicable Dataverse +% LocalWords: reproducibility greenlight diff --git a/paper/mako-mem.sty b/paper/mako-mem.sty new file mode 100644 index 0000000..95c0b7f --- /dev/null +++ b/paper/mako-mem.sty @@ -0,0 +1,254 @@ +% Some article styles and page layout tweaks for the LaTeX Memoir class. 
+% +% Copyright 2009 Benjamin Mako Hill +% Copyright 2008-2009 Kieran Healy + +% Distributed as free software under the GNU GPL v3 + +% This file is heavily based on one by Kieran Healy +% available here: http://github.com/kjhealy/latex-custom-kjh/ + +\usepackage{lastpage} +\usepackage{datetime} + +% blank footnote +% Use \symbolfootnote[0]{Footnote text} for a blank footnote. +% Useful for initial acknowledgment note. +\long\def\symbolfootnote[#1]#2{\begingroup% +\def\thefootnote{\fnsymbol{footnote}}\footnote[#1]{#2}\endgroup} + +% put a period after the section numbers +\setsecnumformat{\csname the#1\endcsname.\enspace} + +% >> article-1 << +\makechapterstyle{article-1}{ + \renewcommand{\rmdefault}{ugm} + \renewcommand{\sfdefault}{phv} + + \setsecheadstyle{\large\scshape} + \setsubsecheadstyle{\normalsize\itshape} + \renewcommand{\printchaptername}{} + \renewcommand{\chapternamenum}{} + \renewcommand{\chapnumfont}{\chaptitlefont} + \renewcommand{\printchapternum}{\chapnumfont \thechapter\space} + \renewcommand{\afterchapternum}{} + \renewcommand{\printchaptername}{\secheadstyle} + \renewcommand{\cftchapterfont}{\normalfont} + \renewcommand{\cftchapterpagefont}{\normalfont\scshape} + \renewcommand{\cftchapterpresnum}{\scshape} + \captiontitlefont{\small} + + % turn off chapter numbering + \counterwithout{section}{chapter} + \counterwithout{figure}{chapter} + \counterwithout{table}{chapter} + + % reduce skip after section heading + \setaftersecskip{1.2ex} + + \pretitle{\newline\centering \LARGE\scshape \MakeLowercase } + \posttitle{\par\vskip 1em} + \predate{\footnotesize \centering} + \postdate{\par\vskip 1em} + + % 'abstract' title, bigger skip from title + \renewcommand{\abstractname}{} + \abstractrunin + +% set name of bibliography to 'references' +\renewcommand{\bibname}{References} +} + +% >> article-2 << +\makechapterstyle{article-2}{ + \renewcommand{\rmdefault}{ugm} + \renewcommand{\sfdefault}{phv} + + \setsecheadstyle{\large\scshape} + \setsubsecheadstyle{\normalsize\itshape} + \setaftersubsubsecskip{-1em} + \setsubsubsecheadstyle{\bfseries} + \renewcommand{\printchaptername}{} + \renewcommand{\chapternamenum}{} + \renewcommand{\chapnumfont}{\chaptitlefont} + \renewcommand{\printchapternum}{\chapnumfont \thechapter\space} + \renewcommand{\afterchapternum}{} + \renewcommand{\printchaptername}{\secheadstyle} + \renewcommand{\cftchapterfont}{\normalfont} + \renewcommand{\cftchapterpagefont}{\normalfont\scshape} + \renewcommand{\cftchapterpresnum}{\scshape} + \captiontitlefont{\small} + + % turn off chapter numbering + \counterwithout{section}{chapter} + \counterwithout{figure}{chapter} + \counterwithout{table}{chapter} + + % supress chapter numbers + \maxsecnumdepth{chapter} + \setsecnumdepth{chapter} + + % for numbered sections and subsections: + % (a) comment out the above stanza; (b) uncomment the one below + % \maxsecnumdepth{subsection} + % \setsecnumdepth{subsection} + + % reduce skip after section heading + \setaftersecskip{1.7ex} + + % Title flush left + \pretitle{\flushleft\LARGE \itshape} + \posttitle{\par\vskip 0.5em} + \preauthor{\flushleft \large \lineskip 1em} + \postauthor{\par\lineskip 1em} + \predate{\flushleft\footnotesize\vspace{0.65em}} + \postdate{\par\vskip 1em} + + % 'abstract' title, bigger skip from title + \renewcommand{\abstractname}{Abstract:} + \renewcommand{\abstractnamefont}{\normalfont\small\bfseries} + \renewcommand{\abstracttextfont}{\normalfont\small} + \setlength{\absparindent}{0em} + \setlength{\abstitleskip}{-1.5em} + \abstractrunin + + % set 
name of bibliography to 'references' + \renewcommand{\bibname}{References} +} + + +% >> article-3 << +\makechapterstyle{article-3}{ + \renewcommand{\rmdefault}{ugm} + \renewcommand{\sfdefault}{phv} + + \setsecheadstyle{\large\sffamily\bfseries\MakeUppercase} + \setsubsecheadstyle{\normalsize\itshape} + \setaftersubsubsecskip{-1em} + \setsubsubsecheadstyle{\small\bfseries} + \renewcommand{\printchaptername}{} + \renewcommand{\chapternamenum}{} + \renewcommand{\chapnumfont}{\chaptitlefont} + \renewcommand{\printchapternum}{\chapnumfont \thechapter\space} + \renewcommand{\afterchapternum}{} + \renewcommand{\printchaptername}{\secheadstyle} + \renewcommand{\cftchapterfont}{\normalfont} + \renewcommand{\cftchapterpagefont}{\normalfont\scshape} + \renewcommand{\cftchapterpresnum}{\scshape} + \captiontitlefont{\small} + + % turn off chapter numbering + \counterwithout{section}{chapter} + \counterwithout{figure}{chapter} + \counterwithout{table}{chapter} + + % supress chapter numbers + \maxsecnumdepth{chapter} + \setsecnumdepth{chapter} + + % reduce skip after section heading + \setaftersecskip{1pt} + \setbeforesecskip{-1em} + + % 'abstract' title, bigger skip from title + % \renewcommand{\maketitle}{\{\preauthor \theauthor\} \hfill \thetitle} + \renewcommand{\maketitle}{ + {\Large\sffamily\bfseries\MakeUppercase\thetitle} \hfill + {\Large\sffamily\MakeUppercase\theauthor} + \vskip 0.7em} + \renewcommand{\abstractname}{\normalfont\scriptsize\noindent} + \renewcommand{\abstracttextfont}{\normalfont\scriptsize} + \abstractrunin + + % set name of bibliography to 'references' + \renewcommand{\bibname}{References} + + \parindent 0pt + +} + +%%% Custom styles for headers and footers +%%% Basic +\makepagestyle{mako-mem} +%\makeevenfoot{mako-mem}{\thepage}{}{} +%\makeoddfoot{mako-mem}{}{}{\thepage} +%\makeheadrule{mako-mem}{\textwidth}{\normalrulethickness} +\newcommand{\@makomarks}{% + \let\@mkboth\markboth + \def\chaptermark##1{% + \markboth{% + \ifnum \c@secnumdepth >\m@ne + \if@mainmatter + \thechapter. \ % + \fi + \fi + ##1}{}} + \def\sectionmark##1{% + \markright{##1}} +} +\makepsmarks{mako-mem}{\@makomarks} +\makepsmarks{mako-mem}{} +\makeevenhead{mako-mem}{}{}{\scshape\thepage} +\makeoddhead{mako-mem}{}{}{\scshape\thepage} + +%%% version control info in footers; requires vc package +% Make the style for vc-git revision control headers and footers +\makepagestyle{mako-mem-git} +\newcommand{\@gitmarks}{% + \let\@mkboth\markboth + \def\chaptermark##1{% + \markboth{% + \ifnum \c@secnumdepth >\m@ne + \if@mainmatter + \thechapter. \ % + \fi + \fi + ##1}{}} + \def\sectionmark##1{% + \markright{##1}} +} +\makepsmarks{mako-mem-git}{\@gitmarks} +\makeevenhead{mako-mem-git}{}{}{\scshape\thepage} +\makeoddhead{mako-mem-git}{}{}{\scshape\thepage} +\makeevenfoot{mako-mem-git}{}{\texttt{\footnotesize{\textcolor{BrickRed}{git revision \VCRevision\ on \VCDateTEX}}}}{} +\makeoddfoot{mako-mem-git}{}{\texttt{\footnotesize \textcolor{BrickRed}{git revision \VCRevision\ on \VCDateTEX}}}{} + +%%% print a datestamp from ShareLaTeX +\makepagestyle{mako-mem-sharelatex} +\newcommand{\@slmarks}{% + \let\@mkboth\markboth + \def\chaptermark##1{% + \markboth{% + \ifnum \c@secnumdepth >\m@ne + \if@mainmatter + \thechapter. 
\ % + \fi + \fi + ##1}{}} + \def\sectionmark##1{% + \markright{##1}} +} +\makepsmarks{mako-mem-sharelatex}{\@slmarks} +\makeevenhead{mako-mem-sharelatex}{}{}{\scshape\thepage} +\makeoddhead{mako-mem-sharelatex}{}{}{\scshape\thepage} +\makeevenfoot{mako-mem-sharelatex}{}{\texttt{\footnotesize{\textcolor{BrickRed}{Buildstamp/Version:~\pdfdate}}}}{} +\makeoddfoot{mako-mem-sharelatex}{}{\texttt{\footnotesize{\textcolor{BrickRed}{Buildstamp/Version:~\pdfdate}}}}{} + +%% Create a command to make a note at the top of the first page describing the +%% publication status of the paper. +\newcommand{\published}[1]{% + \gdef\puB{#1}} + \newcommand{\puB}{} + \renewcommand{\maketitlehooka}{% + \par\noindent\footnotesize \puB} + +\makepagestyle{memo} +\makeevenhead{memo}{}{}{} +\makeoddhead{memo}{}{}{} + +\makeevenfoot{memo}{}{\scshape \thepage/\pageref{LastPage}}{} +\makeoddfoot{memo}{}{\scshape \thepage/\pageref{LastPage}}{} + + +\endinput + diff --git a/paper/refs.bib b/paper/refs.bib new file mode 100644 index 0000000..bded4ca --- /dev/null +++ b/paper/refs.bib @@ -0,0 +1,1011 @@ + +@article{howison_flossmole_2006, + title = {{FLOSSmole}}, + volume = {1}, + issn = {1554-1045, 1554-1053}, + url = {http://www.igi-global.com/article/international-journal-information-technology-web/2610}, + doi = {10.4018/jitwe.2006070102}, + pages = {17--26}, + number = {3}, + journaltitle = {International Journal of Information Technology and Web Engineering}, + author = {Howison, James and Conklin, Megan and Crowston, Kevin}, + urldate = {2013-06-15}, + date = {2006} +} + +@article{bohannon_google_2011, + title = {Google Books, Wikipedia, and the Future of Culturomics}, + volume = {331}, + issn = {0036-8075, 1095-9203}, + url = {http://www.sciencemag.org/content/331/6014/135}, + doi = {10.1126/science.331.6014.135}, + abstract = {As a follow-up to the quantitative analysis of data obtained from Google Books published online in Science on 16 December 2010 and in this week's issue on page 176, one of the study's authors has been using Wikipedia to analyze the fame of scientists whose names appear in books over the centuries. But his effort has been hampered by the online encyclopedia's shortcomings, from the reliability of its information to the organization of its content. 
Several efforts are under way to improve Wikipedia as a teaching and research tool, including one by the Association for Psychological Science that seeks to create a more complete and accurate representation of its field.}, + pages = {135--135}, + number = {6014}, + journaltitle = {Science}, + author = {Bohannon, John}, + urldate = {2014-02-14}, + date = {2011-01}, + langid = {english}, + pmid = {21233356} +} + +@article{welles_visualizing_2015, + title = {Visualizing Computational Social Science The Multiple Lives of a Complex Image}, + volume = {37}, + url = {http://scx.sagepub.com/content/37/1/34.short}, + pages = {34--58}, + number = {1}, + journaltitle = {Science Communication}, + author = {Welles, Brooke Foucault and Meirelles, Isabel}, + urldate = {2015-08-05}, + date = {2015}, + file = {[PDF] from sagepub.com:/home/jeremy/Zotero/storage/AMRMRGNB/Welles and Meirelles - 2015 - Visualizing Computational Social Science The Multi.pdf:application/pdf} +} + +@article{van_noorden_interdisciplinary_2015, + title = {Interdisciplinary research by the numbers}, + volume = {525}, + issn = {0028-0836, 1476-4687}, + url = {http://www.nature.com/doifinder/10.1038/525306a}, + doi = {10.1038/525306a}, + pages = {306--307}, + number = {7569}, + journaltitle = {Nature}, + author = {Van Noorden, Richard}, + urldate = {2015-09-21}, + date = {2015-09-16} +} + +@article{mcfarland_sociology_2015, + title = {Sociology in the Era of Big Data: The Ascent of Forensic Social Science}, + issn = {0003-1232, 1936-4784}, + url = {http://link.springer.com/article/10.1007/s12108-015-9291-8}, + doi = {10.1007/s12108-015-9291-8}, + shorttitle = {Sociology in the Era of Big Data}, + pages = {1--24}, + journaltitle = {The American Sociologist}, + shortjournal = {Am Soc}, + author = {{McFarland}, Daniel A. and Lewis, Kevin and Goldberg, Amir}, + urldate = {2015-09-25}, + date = {2015-09-17}, + langid = {english}, + keywords = {Forensic social science, Social Sciences, general, Sociology of science, Sociology, general, Computational social science, Big data}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/F66XW8K7/McFarland et al. - 2015 - Sociology in the Era of Big Data The Ascent of Fo.pdf:application/pdf} +} + +@article{hargittai_is_2015, + title = {Is Bigger Always Better? Potential Biases of Big Data Derived from Social Network Sites}, + volume = {659}, + issn = {0002-7162, 1552-3349}, + url = {http://ann.sagepub.com/content/659/1/63}, + doi = {10.1177/0002716215570866}, + shorttitle = {Is Bigger Always Better?}, + abstract = {This article discusses methodological challenges of using big data that rely on specific sites and services as their sampling frames, focusing on social network sites in particular. It draws on survey data to show that people do not select into the use of such sites randomly. Instead, use is biased in certain ways yielding samples that limit the generalizability of findings. Results show that age, gender, race/ethnicity, socioeconomic status, online experiences, and Internet skills all influence the social network sites people use and thus where traces of their behavior show up. This has implications for the types of conclusions one can draw from data derived from users of specific sites. 
The article ends by noting how big data studies can address the shortcomings that result from biased sampling frames.}, + pages = {63--76}, + number = {1}, + journaltitle = {The {ANNALS} of the American Academy of Political and Social Science}, + shortjournal = {The {ANNALS} of the American Academy of Political and Social Science}, + author = {Hargittai, Eszter}, + urldate = {2015-10-19}, + date = {2015-05-01}, + langid = {english}, + keywords = {digital inequality, social network sites, sampling, Internet skills, sampling frame, biased sample, Big data} +} + +@article{lazer_computational_2009, + title = {Computational Social Science}, + volume = {323}, + url = {http://www.sciencemag.org}, + doi = {10.1126/science.1167742}, + shorttitle = {{SOCIAL} {SCIENCE}}, + pages = {721--723}, + number = {5915}, + journaltitle = {Science}, + author = {Lazer, David and Pentland, Alex and Adamic, Lada and Aral, Sinan and Barabasi, Albert-Laszlo and Brewer, Devon and Christakis, Nicholas and Contractor, Noshir and Fowler, James and Gutmann, Myron and Jebara, Tony and King, Gary and Macy, Michael and Roy, Deb and Van Alstyne, Marshall}, + urldate = {2009-03-06}, + date = {2009-02-06}, + file = {HighWire Snapshot:/home/jeremy/Zotero/storage/C939DFAS/721.html:text/html;PubMed Central Full Text PDF:/home/jeremy/Zotero/storage/RPX8A4ID/Lazer et al. - 2009 - Life in the network the coming age of computation.pdf:application/pdf} +} + +@article{mann_bibliometric_2006, + title = {Bibliometric impact measures leveraging topic analysis}, + abstract = {Measurements of the impact and history of research literature provide a useful complement to scientific digital library collections. Bibliometric indicators have been extensively studied, mostly in the context of journals. However, journal-based metrics poorly capture topical distinctions in fast-moving fields, and are increasingly problematic with the rise of open-access publishing. Recent developments in latent topic models have produced promising results for automatic sub-field discovery. The fine-grained, faceted topics produced by such models provide a clearer view of the topical divisions of a body of research literature and the interactions between those divisions. We demonstrate the usefulness of topic models in measuring impact by applying a new phrase-based topic discovery model to a collection of 300,000 computer science publications, collected by the Rexa automatic citation indexing system}, + pages = {65--74}, + author = {Mann, G.S and Mimno, D and McCallum, A and {2006 IEEE/ACM 6th Joint Conference on Digital Libraries}}, + date = {2006}, + note = {00083}, + file = {Mann et al. - 2006 - Bibliometric impact measures leveraging topic anal.pdf:/home/jeremy/Zotero/storage/RHR8REID/Mann et al. - 2006 - Bibliometric impact measures leveraging topic anal.pdf:application/pdf} +} + +@article{reid_mapping_2007, + title = {Mapping the contemporary terrorism research domain}, + volume = {65}, + issn = {1071-5819}, + abstract = {A systematic view of terrorism research to reveal the intellectual structure of the field and empirically discern the distinct set of core researchers, institutional affiliations, publications, and conceptual areas can help us gain a deeper understanding of approaches to terrorism. This paper responds to this need by using an integrated knowledge-mapping framework that we developed to identify the core researchers and knowledge creation approaches in terrorism. 
The framework uses three types of analysis: (a) basic analysis of scientific output using citation, bibliometric, and social network analyses, (b) content map analysis of large corpora of literature, and (c) co-citation analysis to analyse linkages among pairs of researchers. We applied domain visualization techniques such as content map analysis, block-modeling, and co-citation analysis to the literature and author citation data from the years 1965 to 2003. The data were gathered from ten databases such as the {ISI} Web of Science. The results reveal: (1) the names of the top 42 core terrorism researchers (e.g., Brian Jenkins, Bruce Hoffman, and Paul Wilkinson) as well as their institutional affiliations; (2) their influential publications; (3) clusters of terrorism researchers who work in similar areas; and (4) that the research focus has shifted from terrorism as a low-intensity conflict to a strategic threat to world powers with increased focus on Osama Bin Laden.}, + pages = {42--56}, + number = {1}, + journaltitle = {International Journal of Human-Computer Studies}, + author = {Reid, Edna F and Chen, Hsinchun}, + date = {2007}, + note = {00091}, + file = {Reid and Chen - 2007 - Mapping the contemporary terrorism research domain.pdf:/home/jeremy/Zotero/storage/DAN5ATFN/Reid and Chen - 2007 - Mapping the contemporary terrorism research domain.pdf:application/pdf} +} + +@article{blei_probabilistic_2012, + title = {Probabilistic Topic Models}, + volume = {55}, + issn = {0001-0782}, + url = {http://doi.acm.org/10.1145/2133806.2133826}, + doi = {10.1145/2133806.2133826}, + abstract = {Surveying a suite of algorithms that offer a solution to managing large document archives.}, + pages = {77--84}, + number = {4}, + journaltitle = {Commun. {ACM}}, + author = {Blei, David M.}, + urldate = {2016-03-07}, + date = {2012-04}, + file = {Blei - 2012 - Probabilistic Topic Models.pdf:/home/jeremy/Zotero/storage/5HZENWNZ/Blei - 2012 - Probabilistic Topic Models.pdf:application/pdf} +} + +@article{schwartz_personality_2013, + title = {Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach}, + volume = {8}, + issn = {1932-6203}, + url = {http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0073791}, + doi = {10.1371/journal.pone.0073791}, + shorttitle = {Personality, Gender, and Age in the Language of Social Media}, + abstract = {We analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests, and found striking variations in language with personality, gender, and age. In our open-vocabulary technique, the data itself drives a comprehensive exploration of language that distinguishes people, finding connections that are not captured with traditional closed-vocabulary word-category analyses. Our analyses shed new light on psychosocial processes yielding results that are face valid (e.g., subjects living in high elevations talk about the mountains), tie in with other research (e.g., neurotic people disproportionately use the phrase ‘sick of’ and the word ‘depressed’), suggest new hypotheses (e.g., an active life implies emotional stability), and give detailed insights (males use the possessive ‘my’ when mentioning their ‘wife’ or ‘girlfriend’ more often than females use ‘my’ with ‘husband’ or ‘boyfriend’).
To date, this represents the largest study, by an order of magnitude, of language and personality.}, + pages = {e73791}, + number = {9}, + journaltitle = {{PLOS} {ONE}}, + shortjournal = {{PLOS} {ONE}}, + author = {Schwartz, H. Andrew and Eichstaedt, Johannes C. and Kern, Margaret L. and Dziurzynski, Lukasz and Ramones, Stephanie M. and Agrawal, Megha and Shah, Achal and Kosinski, Michal and Stillwell, David and Seligman, Martin E. P. and Ungar, Lyle H.}, + urldate = {2016-03-07}, + date = {2013-09-25}, + keywords = {Social Media, Facebook, Personality, Psychology, language, Psycholinguistics, Forecasting, Vocabulary}, + file = {Schwartz et al. - 2013 - Personality, Gender, and Age in the Language of So.pdf:/home/jeremy/Zotero/storage/CKR7EZ5S/Schwartz et al. - 2013 - Personality, Gender, and Age in the Language of So.pdf:application/pdf} +} + +@article{kovacs_exploring_2015, + title = {Exploring the scope of open innovation: a bibliometric review of a decade of research}, + volume = {104}, + issn = {0138-9130, 1588-2861}, + url = {http://link.springer.com/article/10.1007/s11192-015-1628-0}, + doi = {10.1007/s11192-015-1628-0}, + shorttitle = {Exploring the scope of open innovation}, + abstract = {The concept of open innovation has attracted considerable attention since Henry Chesbrough first coined it to capture the increasing reliance of firms on external sources of innovation. Although open innovation has flourished as a topic within innovation management research, it has also triggered debates about the coherence of the research endeavors pursued under this umbrella, including its theoretical foundations. In this paper, we aim to contribute to these debates through a bibliometric review of the first decade of open innovation research. We combine two techniques—bibliographic coupling and co-citation analysis—to (1) visualize the network of publications that explicitly use the label ‘open innovation’ and (2) to arrive at distinct clusters of thematically related publications. Our findings illustrate that open innovation research builds principally on four related streams of prior research, whilst the bibliographic network of open innovation research reveals that seven thematic clusters have been pursued persistently. While such persistence is undoubtedly useful to arrive at in-depth and robust insights, the observed patterns also signal the absence of new, emerging, themes. As such, ‘open innovation’ might benefit from applying its own ideas: sourcing concepts and models from a broader range of theoretical perspectives as well as pursuing a broader range of topics might introduce dynamics resulting in more impact and proliferation.}, + pages = {951--983}, + number = {3}, + journaltitle = {Scientometrics}, + shortjournal = {Scientometrics}, + author = {Kovács, Adrián and Looy, Bart Van and Cassiman, Bruno}, + urldate = {2016-04-20}, + date = {2015-06-20}, + langid = {english}, + keywords = {open innovation, Library Science, Information Storage and Retrieval, 91-02, Co-citation analysis, Bibliographic coupling, O32, Q55, Interdisciplinary Studies, openness}, + file = {Kovács et al. - 2015 - Exploring the scope of open innovation a bibliome.pdf:/home/jeremy/Zotero/storage/MFDEMAFC/Kovács et al. 
- 2015 - Exploring the scope of open innovation a bibliome.pdf:application/pdf;Snapshot:/home/jeremy/Zotero/storage/AITBH9EK/s11192-015-1628-0.html:text/html} +} + +@inproceedings{blei_dynamic_2006, + title = {Dynamic topic models}, + url = {http://dl.acm.org/citation.cfm?id=1143859}, + pages = {113--120}, + booktitle = {Proceedings of the 23rd international conference on Machine learning}, + publisher = {{ACM}}, + author = {Blei, David M. and Lafferty, John D.}, + urldate = {2016-04-21}, + date = {2006}, + file = {[PDF] from cmu.edu:/home/jeremy/Zotero/storage/UBSD9KNT/Blei and Lafferty - 2006 - Dynamic topic models.pdf:application/pdf;Snapshot:/home/jeremy/Zotero/storage/MR3H4FSU/citation.html:text/html} +} + +@inproceedings{hall_studying_2008, + location = {Stroudsburg, {PA}, {USA}}, + title = {Studying the History of Ideas Using Topic Models}, + url = {http://dl.acm.org/citation.cfm?id=1613715.1613763}, + series = {{EMNLP} '08}, + abstract = {How can the development of ideas in a scientific field be studied over time? We apply unsupervised topic modeling to the {ACL} Anthology to analyze historical trends in the field of Computational Linguistics from 1978 to 2006. We induce topic clusters using Latent Dirichlet Allocation, and examine the strength of each topic over time. Our methods find trends in the field including the rise of probabilistic methods starting in 1988, a steady increase in applications, and a sharp decline of research in semantics and understanding between 1978 and 2001, possibly rising again after 2001. We also introduce a model of the diversity of ideas, topic entropy, using it to show that {COLING} is a more diverse conference than {ACL}, but that both conferences as well as {EMNLP} are becoming broader over time. Finally, we apply Jensen-Shannon divergence of topic distributions to show that all three conferences are converging in the topics they cover.}, + pages = {363--371}, + booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing}, + publisher = {Association for Computational Linguistics}, + author = {Hall, David and Jurafsky, Daniel and Manning, Christopher D.}, + urldate = {2016-04-21}, + date = {2008}, + file = {ACM Full Text PDF:/home/jeremy/Zotero/storage/UZV4H35G/Hall et al. - 2008 - Studying the History of Ideas Using Topic Models.pdf:application/pdf} +} + +@inproceedings{mitra_language_2014, + location = {New York, {NY}, {USA}}, + title = {The Language That Gets People to Give: Phrases That Predict Success on Kickstarter}, + isbn = {978-1-4503-2540-0}, + url = {http://doi.acm.org/10.1145/2531602.2531656}, + doi = {10.1145/2531602.2531656}, + series = {{CSCW} '14}, + shorttitle = {The Language That Gets People to Give}, + abstract = {Crowdfunding sites like Kickstarter--where entrepreneurs and artists look to the internet for funding--have quickly risen to prominence. However, we know very little about the factors driving the 'crowd' to take projects to their funding goal. In this paper we explore the factors which lead to successfully funding a crowdfunding project. We study a corpus of 45K crowdfunded projects, analyzing 9M phrases and 59 other variables commonly present on crowdfunding sites. The language used in the project has surprising predictive power accounting for 58.56\% of the variance around successful funding. A closer look at the phrases shows they exhibit general persuasion principles. 
For example, also receive two reflects the principle of Reciprocity and is one of the top predictors of successful funding. We conclude this paper by announcing the release of the predictive phrases along with the control variables as a public dataset, hoping that our work can enable new features on crowdfunding sites--tools to help both backers and project creators make the best use of their time and money.}, + pages = {49--61}, + booktitle = {Proceedings of the 17th {ACM} Conference on Computer Supported Cooperative Work \& Social Computing}, + publisher = {{ACM}}, + author = {Mitra, Tanushree and Gilbert, Eric}, + urldate = {2016-04-29}, + date = {2014}, + keywords = {crowdfunding, natural language processing (nlp), {CMC}} +} + +@book{wasserman_social_1994, + title = {Social Network Analysis: Methods And Applications}, + publisher = {Cambridge University Press}, + author = {Wasserman, Stanley and Faust, Katherine}, + date = {1994} +} + +@article{tausczik_psychological_2010, + title = {The Psychological Meaning of Words: {LIWC} and Computerized Text Analysis Methods}, + volume = {29}, + issn = {0261-927X, 1552-6526}, + url = {http://jls.sagepub.com/content/29/1/24}, + doi = {10.1177/0261927X09351676}, + shorttitle = {The Psychological Meaning of Words}, + abstract = {We are in the midst of a technological revolution whereby, for the first time, researchers can link daily word use to a broad array of real-world behaviors. This article reviews several computerized text analysis methods and describes how Linguistic Inquiry and Word Count ({LIWC}) was created and validated. {LIWC} is a transparent text analysis program that counts words in psychologically meaningful categories. Empirical results using {LIWC} demonstrate its ability to detect meaning in a wide variety of experimental settings, including to show attentional focus, emotionality, social relationships, thinking styles, and individual differences.}, + pages = {24--54}, + number = {1}, + journaltitle = {Journal of Language and Social Psychology}, + shortjournal = {Journal of Language and Social Psychology}, + author = {Tausczik, Yla R. and Pennebaker, James W.}, + urldate = {2016-07-12}, + date = {2010-03-01}, + langid = {english}, + keywords = {attention, {LIWC}, deception, dominance, relationships, pronouns, computerized text analysis}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/G6TIZD38/Tausczik and Pennebaker - 2010 - The Psychological Meaning of Words LIWC and Compu.pdf:application/pdf} +} + +@book{smith_general_2014, + title = {General social surveys, 1972-2014}, + shorttitle = {General social surveys, 1972-2014}, + publisher = {National Opinion Research Center ({NORC})}, + author = {Smith, Tom William and Marsden, Peter and Hout, Michael and Kim, Jibum}, + date = {2014} +} + +@book{leskovec_snap_2014, + title = {{SNAP} Datasets: Stanford Large Network Dataset Collection}, + url = {http://snap.stanford.edu/data}, + author = {Leskovec, Jure and Krevl, Andrej}, + date = {2014-06} +} + +@article{kozinets_field_2002, + title = {The Field Behind the Screen: Using Netnography for Marketing Research in Online Communities}, + volume = {39}, + issn = {0022-2437}, + url = {http://journals.ama.org/doi/abs/10.1509/jmkr.39.1.61.18935}, + doi = {10.1509/jmkr.39.1.61.18935}, + shorttitle = {The Field Behind the Screen}, + abstract = {The author develops “netnography” as an online marketing research technique for providing consumer insight. Netnography is ethnography adapted to the study of online communities. 
As a method, netnography is faster, simpler, and less expensive than traditional ethnography and more naturalistic and unobtrusive than focus groups or interviews. It provides information on the symbolism, meanings, and consumption patterns of online consumer groups. The author provides guidelines that acknowledge the online environment, respect the inherent flexibility and openness of ethnography, and provide rigor and ethics in the conduct of marketing research. As an illustrative example, the author provides a netnography of an online coffee newsgroup and discusses its marketing implications.}, + pages = {61--72}, + number = {1}, + journaltitle = {Journal of Marketing Research}, + shortjournal = {Journal of Marketing Research}, + author = {Kozinets, Robert V.}, + urldate = {2016-07-18}, + date = {2002-02-01} +} + +@article{chew_pandemics_2010, + title = {Pandemics in the Age of Twitter: Content Analysis of Tweets during the 2009 H1N1 Outbreak}, + volume = {5}, + issn = {1932-6203}, + url = {http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014118}, + doi = {10.1371/journal.pone.0014118}, + shorttitle = {Pandemics in the Age of Twitter}, + abstract = {Background +Surveys are popular methods to measure public perceptions in emergencies but can be costly and time consuming. We suggest and evaluate a complementary “infoveillance” approach using Twitter during the 2009 H1N1 pandemic. Our study aimed to: 1) monitor the use of the terms “H1N1” versus “swine flu” over time; 2) conduct a content analysis of “tweets”; and 3) validate Twitter as a real-time content, sentiment, and public attention trend-tracking tool. + + Methodology/Principal Findings + Between May 1 and December 31, 2009, we archived over 2 million Twitter posts containing keywords “swine flu,” “swineflu,” and/or “H1N1.” using Infovigil, an infoveillance system. Tweets using “H1N1” increased from 8.8\% to 40.5\% ( R 2  = .788; p \<.001), indicating a gradual adoption of World Health Organization-recommended terminology. 5,395 tweets were randomly selected from 9 days, 4 weeks apart and coded using a tri-axial coding scheme. To track tweet content and to test the feasibility of automated coding, we created database queries for keywords and correlated these results with manual coding. Content analysis indicated resource-related posts were most commonly shared (52.6\%). 4.5\% of cases were identified as misinformation. News websites were the most popular sources (23.2\%), while government and health agencies were linked only 1.5\% of the time. 7/10 automated queries correlated with manual coding. Several Twitter activity peaks coincided with major news stories. Our results correlated well with H1N1 incidence data. + + Conclusions + This study illustrates the potential of using social media to conduct “infodemiology” studies for public health. 2009 H1N1-related tweets were primarily used to disseminate information from credible sources, but were also a source of opinions and experiences. 
Tweets can be used for real-time content analysis and knowledge translation research, allowing health authorities to respond to public concerns.}, + pages = {e14118}, + number = {11}, + journaltitle = {{PLOS} {ONE}}, + shortjournal = {{PLOS} {ONE}}, + author = {Chew, Cynthia and Eysenbach, Gunther}, + urldate = {2016-07-18}, + date = {2010-11-29}, + keywords = {Chi square tests, Public and occupational health, Data Mining, H1N1, Swine influenza, twitter, Swine, Internet}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/KV2JGXGC/Chew and Eysenbach - 2010 - Pandemics in the Age of Twitter Content Analysis .pdf:application/pdf} +} + +@inproceedings{agichtein_finding_2008, + location = {New York, {NY}, {USA}}, + title = {Finding High-quality Content in Social Media}, + isbn = {978-1-59593-927-2}, + url = {http://doi.acm.org/10.1145/1341531.1341557}, + doi = {10.1145/1341531.1341557}, + series = {{WSDM} '08}, + abstract = {The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content sites based on user contributions --social media sites -- becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high quality content. As a test case, we focus on Yahoo! Answers, a large community question/answering portal that is particularly rich in the amount and types of content and social interactions available in it. We introduce a general classification framework for combining the evidence from different sources of information, that can be tuned automatically for a given social media type and quality definition. In particular, for the community question/answering domain, we show that our system is able to separate high-quality items from the rest with an accuracy close to that of humans}, + pages = {183--194}, + booktitle = {Proceedings of the 2008 International Conference on Web Search and Data Mining}, + publisher = {{ACM}}, + author = {Agichtein, Eugene and Castillo, Carlos and Donato, Debora and Gionis, Aristides and Mishne, Gilad}, + urldate = {2016-07-19}, + date = {2008}, + keywords = {media, user interactions, community question answering}, + file = {ACM Full Text PDF:/home/jeremy/Zotero/storage/CNFWMINP/Agichtein et al. - 2008 - Finding High-quality Content in Social Media.pdf:application/pdf;ACM Full Text PDF:/home/jeremy/Zotero/storage/9BDZK58M/Agichtein et al. - 2008 - Finding High-quality Content in Social Media.pdf:application/pdf} +} + +@inproceedings{resnick_grouplens:_1994, + location = {New York, {NY}, {USA}}, + title = {{GroupLens}: An Open Architecture for Collaborative Filtering of Netnews}, + isbn = {978-0-89791-689-9}, + url = {http://doi.acm.org/10.1145/192844.192905}, + doi = {10.1145/192844.192905}, + series = {{CSCW} '94}, + shorttitle = {{GroupLens}}, + abstract = {Collaborative filters help people make choices based on the opinions of other people. {GroupLens} is a system for collaborative filtering of netnews, to help people find articles they will like in the huge stream of available articles. News reader clients display predicted scores and make it easy for users to rate articles after they read them. 
Rating servers, called Better Bit Bureaus, gather and disseminate the ratings. The rating servers predict scores based on the heuristic that people who agreed in the past will probably agree again. Users can protect their privacy by entering ratings under pseudonyms, without reducing the effectiveness of the score prediction. The entire architecture is open: alternative software for news clients and Better Bit Bureaus can be developed independently and can interoperate with the components we have developed.}, + pages = {175--186}, + booktitle = {Proceedings of the 1994 {ACM} Conference on Computer Supported Cooperative Work}, + publisher = {{ACM}}, + author = {Resnick, Paul and Iacovou, Neophytos and Suchak, Mitesh and Bergstrom, Peter and Riedl, John}, + urldate = {2016-07-19}, + date = {1994}, + keywords = {collaborative filtering, selective dissemination of information, user model, social filtering, electronic bulletin boards, netnews, information filtering, Usenet}, + file = {ACM Full Text PDF:/home/jeremy/Zotero/storage/JPUR4MA4/Resnick et al. - 1994 - GroupLens An Open Architecture for Collaborative .pdf:application/pdf} +} + +@inproceedings{wang_tm-lda:_2012, + title = {{TM}-{LDA}: efficient online modeling of latent topic transitions in social media}, + isbn = {978-1-4503-1462-6}, + url = {http://dl.acm.org/citation.cfm?doid=2339530.2339552}, + doi = {10.1145/2339530.2339552}, + shorttitle = {{TM}-{LDA}}, + pages = {123}, + publisher = {{ACM} Press}, + author = {Wang, Yu and Agichtein, Eugene and Benzi, Michele}, + urldate = {2016-07-19}, + date = {2012}, + langid = {english} +} + +@inproceedings{prier_identifying_2011, + location = {Berlin, Heidelberg}, + title = {Identifying Health-related Topics on Twitter: An Exploration of Tobacco-related Tweets As a Test Topic}, + isbn = {978-3-642-19655-3}, + url = {http://dl.acm.org/citation.cfm?id=1964698.1964702}, + series = {{SBP}'11}, + shorttitle = {Identifying Health-related Topics on Twitter}, + abstract = {Public health-related topics are difficult to identify in large conversational datasets like Twitter. This study examines how to model and discover public health topics and themes in tweets. Tobacco use is chosen as a test case to demonstrate the effectiveness of topic modeling via {LDA} across a large, representational dataset from the United States, as well as across a smaller subset that was seeded by tobacco-related queries. Topic modeling across the large dataset uncovers several public health-related topics, although tobacco is not detected by this method. However, topic modeling across the tobacco subset provides valuable insight about tobacco use in the United States. The methods used in this paper provide a possible toolset for public health researchers and practitioners to better understand public health problems through large datasets of conversational data.}, + pages = {18--25}, + booktitle = {Proceedings of the 4th International Conference on Social Computing, Behavioral-cultural Modeling and Prediction}, + publisher = {Springer-Verlag}, + author = {Prier, Kyle W. and Smith, Matthew S. 
and Giraud-Carrier, Christophe and Hanson, Carl L.}, + urldate = {2016-07-19}, + date = {2011}, + keywords = {Social Media, tobacco use, {LDA}, Data Mining, topic modeling, Social networks, public health} +} + +@inproceedings{pennacchiotti_investigating_2011, + location = {New York, {NY}, {USA}}, + title = {Investigating Topic Models for Social Media User Recommendation}, + isbn = {978-1-4503-0637-9}, + url = {http://doi.acm.org/10.1145/1963192.1963244}, + doi = {10.1145/1963192.1963244}, + series = {{WWW} '11}, + abstract = {This paper presents a user recommendation system that recommends to a user new friends having similar interests. We automatically discover users' interests using Latent Dirichlet Allocation ({LDA}), a linguistic topic model that represents users as mixtures of topics. Our system is able to recommend friends for 4 million users with high recall, outperforming existing strategies based on graph analysis.}, + pages = {101--102}, + booktitle = {Proceedings of the 20th International Conference Companion on World Wide Web}, + publisher = {{ACM}}, + author = {Pennacchiotti, Marco and Gurumurthy, Siva}, + urldate = {2016-07-19}, + date = {2011}, + keywords = {Social Media, {LDA}, user recommendation, Topic models}, + file = {ACM Full Text PDF:/home/jeremy/Zotero/storage/R389CKQJ/Pennacchiotti and Gurumurthy - 2011 - Investigating Topic Models for Social Media User R.pdf:application/pdf} +} + +@article{yang_identifying_2014, + title = {Identifying Interesting Twitter Contents Using Topical Analysis}, + volume = {41}, + issn = {0957-4174}, + url = {http://dx.doi.org/10.1016/j.eswa.2013.12.051}, + doi = {10.1016/j.eswa.2013.12.051}, + abstract = {Social media platforms such as Twitter are becoming increasingly mainstream which provides valuable user-generated information by publishing and sharing contents. Identifying interesting and useful contents from large text-streams is a crucial issue in social media because many users struggle with information overload. Retweeting as a forwarding function plays an important role in information propagation where the retweet counts simply reflect a tweet's popularity. However, the main reason for retweets may be limited to personal interests and satisfactions. In this paper, we use a topic identification as a proxy to understand a large number of tweets and to score the interestingness of an individual tweet based on its latent topics. Our assumption is that fascinating topics generate contents that may be of potential interest to a wide audience. We propose a novel topic model called Trend Sensitive-Latent Dirichlet Allocation ({TS}-{LDA}) that can efficiently extract latent topics from contents by modeling temporal trends on Twitter over time. The experimental results on real world data from Twitter demonstrate that our proposed method outperforms several other baseline methods.}, + pages = {4330--4336}, + number = {9}, + journaltitle = {Expert Syst. 
Appl.}, + author = {Yang, Min-Chul and Rim, Hae-Chang}, + urldate = {2016-07-19}, + date = {2014-07}, + keywords = {Social Media, Interesting content, {LDA}, Topic model, twitter} +} + +@article{fruchterman_graph_1991, + title = {Graph drawing by force-directed placement}, + volume = {21}, + rights = {Copyright © 1991 John Wiley \& Sons, Ltd}, + issn = {1097-024X}, + url = {http://onlinelibrary.wiley.com/doi/10.1002/spe.4380211102/abstract}, + doi = {10.1002/spe.4380211102}, + abstract = {We present a modification of the spring-embedder model of Eades [Congressus Numerantium, 42, 149–160, (1984)] for drawing undirected graphs with straight edges. Our heuristic strives for uniform edge lengths, and we develop it in analogy to forces in natural systems, for a simple, elegant, conceptually-intuitive, and efficient algorithm.}, + pages = {1129--1164}, + number = {11}, + journaltitle = {Software: Practice and Experience}, + shortjournal = {Softw: Pract. Exper.}, + author = {Fruchterman, Thomas M. J. and Reingold, Edward M.}, + urldate = {2016-07-20}, + date = {1991-11-01}, + langid = {english}, + keywords = {Multi-level techniques, Force-directed placement, Graph drawing, Simulated annealing}, + file = {Snapshot:/home/jeremy/Zotero/storage/SR6JA3QW/abstract.html:text/html} +} + +@article{bastian_gephi:_2009, + title = {Gephi: an open source software for exploring and manipulating networks.}, + volume = {8}, + url = {http://www.aaai.org/ocs/index.php/ICWSM/09/paper/viewFile/154/1009/}, + shorttitle = {Gephi}, + pages = {361--362}, + journaltitle = {{ICWSM}}, + author = {Bastian, Mathieu and Heymann, Sebastien and Jacomy, Mathieu and {others}}, + urldate = {2016-07-20}, + date = {2009}, + file = {Bastian et al. - 2009 - Gephi an open source software for exploring and m.pdf:/home/jeremy/Zotero/storage/Q82CV3RM/Bastian et al. - 2009 - Gephi an open source software for exploring and m.pdf:application/pdf} +} + +@unpublished{binfield_plos_2012, + location = {National Institute for Informatics}, + title = {{PLoS} {ONE} and the rise of the Open Access {MegaJournal}}, + url = {http://www.nii.ac.jp/sparc/en/event/2011/pdf/20120229_doc3_binfield.pdf}, + note = {The 5th {SPARC} Japan Seminar 2011}, + author = {Binfield, Peter}, + urldate = {2016-07-20}, + date = {2012-02-29}, + file = {[PDF] from nii.ac.jp:/home/jeremy/Zotero/storage/DU86MXEM/Binfield - 2003 - PLoS ONE and the rise of the Open Access MegaJourn.pdf:application/pdf} +} + +@article{subelj_clustering_2016, + title = {Clustering Scientific Publications Based on Citation Relations: A Systematic Comparison of Different Methods}, + volume = {11}, + issn = {1932-6203}, + url = {http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0154404}, + doi = {10.1371/journal.pone.0154404}, + shorttitle = {Clustering Scientific Publications Based on Citation Relations}, + abstract = {Clustering methods are applied regularly in the bibliometric literature to identify research areas or scientific fields. These methods are for instance used to group publications into clusters based on their relations in a citation network. In the network science literature, many clustering methods, often referred to as graph partitioning or community detection techniques, have been developed. Focusing on the problem of clustering the publications in a citation network, we present a systematic comparison of the performance of a large number of these clustering methods. 
Using a number of different citation networks, some of them relatively small and others very large, we extensively study the statistical properties of the results provided by different methods. In addition, we also carry out an expert-based assessment of the results produced by different methods. The expert-based assessment focuses on publications in the field of scientometrics. Our findings seem to indicate that there is a trade-off between different properties that may be considered desirable for a good clustering of publications. Overall, map equation methods appear to perform best in our analysis, suggesting that these methods deserve more attention from the bibliometric community.}, + pages = {e0154404}, + number = {4}, + journaltitle = {{PLOS} {ONE}}, + shortjournal = {{PLOS} {ONE}}, + author = {Šubelj, Lovro and Eck, Nees Jan van and Waltman, Ludo}, + urldate = {2016-07-20}, + date = {2016-04-28}, + keywords = {Library Science, Bibliometrics, Graphs, Algorithms, Statistical methods, Optimization, Computer and information sciences, Scientometrics}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/UQJHZF6X/Šubelj et al. - 2016 - Clustering Scientific Publications Based on Citati.pdf:application/pdf;Snapshot:/home/jeremy/Zotero/storage/7T77BK72/article.html:text/html} +} + +@article{small_co-citation_1973, + title = {Co-citation in the scientific literature: A new measure of the relationship between two documents}, + volume = {24}, + rights = {Copyright © 1973 Wiley Periodicals, Inc., A Wiley Company}, + issn = {1097-4571}, + url = {http://onlinelibrary.wiley.com/doi/10.1002/asi.4630240406/abstract}, + doi = {10.1002/asi.4630240406}, + shorttitle = {Co-citation in the scientific literature}, + abstract = {A new form of document coupling called co-citation is defined as the frequency with which two documents are cited together. The co-citation frequency of two scientific papers can be determined by comparing lists of citing documents in the Science Citation Index and counting identical entries. Networks of co-cited papers can be generated for specific scientific specialties, and an example is drawn from the literature of particle physics. Co-citation patterns are found to differ significantly from bibliographic coupling patterns, but to agree generally with patterns of direct citation. Clusters of co-cited papers provide a new way to study the specialty structure of science. They may provide a new approach to indexing and to the creation of {SDI} profiles.}, + pages = {265--269}, + number = {4}, + journaltitle = {Journal of the American Society for Information Science}, + shortjournal = {J. Am. Soc. Inf. Sci.}, + author = {Small, Henry}, + urldate = {2016-07-20}, + date = {1973-07-01}, + langid = {english}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/9HF57A4X/Small - 1973 - Co-citation in the scientific literature A new me.pdf:application/pdf;Snapshot:/home/jeremy/Zotero/storage/NF4S7SJ4/abstract.html:text/html} +} + +@article{rosvall_map_2010, + title = {The map equation}, + volume = {178}, + issn = {1951-6355, 1951-6401}, + url = {http://link.springer.com/article/10.1140/epjst/e2010-01179-1}, + doi = {10.1140/epjst/e2010-01179-1}, + abstract = {Many real-world networks are so large that we must simplify their structure before we can extract useful information about the systems they represent. 
As the tools for doing these simplifications proliferate within the network literature, researchers would benefit from some guidelines about which of the so-called community detection algorithms are most appropriate for the structures they are studying and the questions they are asking. Here we show that different methods highlight different aspects of a network's structure and that the the sort of information that we seek to extract about the system must guide us in our decision. For example, many community detection algorithms, including the popular modularity maximization approach, infer module assignments from an underlying model of the network formation process. However, we are not always as interested in how a system's network structure was formed, as we are in how a network's extant structure influences the system's behavior. To see how structure influences current behavior, we will recognize that links in a network induce movement across the network and result in system-wide interdependence. In doing so, we explicitly acknowledge that most networks carry flow. To highlight and simplify the network structure with respect to this flow, we use the map equation. We present an intuitive derivation of this flow-based and information-theoretic method and provide an interactive on-line application that anyone can use to explore the mechanics of the map equation. The differences between the map equation and the modularity maximization approach are not merely conceptual. Because the map equation attends to patterns of flow on the network and the modularity maximization approach does not, the two methods can yield dramatically different results for some network structures. To illustrate this and build our understanding of each method, we partition several sample networks. We also describe an algorithm and provide source code to efficiently decompose large weighted and directed networks based on the map equation.}, + pages = {13--23}, + number = {1}, + journaltitle = {The European Physical Journal Special Topics}, + shortjournal = {Eur. Phys. J. Spec. Top.}, + author = {Rosvall, M. and Axelsson, D. and Bergstrom, C. T.}, + urldate = {2016-07-20}, + date = {2010-04-17}, + langid = {english}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/SP7AM2FW/Rosvall et al. - 2010 - The map equation.pdf:application/pdf;Snapshot:/home/jeremy/Zotero/storage/36S24FS9/e2010-01179-1.html:text/html} +} + +@article{rosvall_maps_2008, + title = {Maps of random walks on complex networks reveal community structure}, + volume = {105}, + issn = {0027-8424, 1091-6490}, + url = {http://www.pnas.org/content/105/4/1118}, + doi = {10.1073/pnas.0706851105}, + abstract = {To comprehend the multipartite organization of large-scale biological and social systems, we introduce an information theoretic approach that reveals community structure in weighted and directed networks. We use the probability flow of random walks on a network as a proxy for information flows in the real system and decompose the network into modules by compressing a description of the probability flow. The result is a map that both simplifies and highlights the regularities in the structure and their relationships. We illustrate the method by making a map of scientific communication as captured in the citation patterns of {\textgreater}6,000 journals. We discover a multicentric organization with fields that vary dramatically in size and degree of integration into the network of science. 
Along the backbone of the network—including physics, chemistry, molecular biology, and medicine—information flows bidirectionally, but the map reveals a directional pattern of citation from the applied fields to the basic sciences.}, + pages = {1118--1123}, + number = {4}, + journaltitle = {Proceedings of the National Academy of Sciences}, + shortjournal = {{PNAS}}, + author = {Rosvall, Martin and Bergstrom, Carl T.}, + urldate = {2016-07-20}, + date = {2008-01-29}, + langid = {english}, + pmid = {18216267}, + keywords = {compression, clustering, information theory, map of science, bibiometrics}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/3HQG7TS3/Rosvall and Bergstrom - 2008 - Maps of random walks on complex networks reveal co.pdf:application/pdf;Snapshot:/home/jeremy/Zotero/storage/TG6S96XS/1118.html:text/html} +} + +@article{ghosh_what_2013, + title = {What are we `tweeting' about obesity? Mapping tweets with topic modeling and Geographic Information System}, + volume = {40}, + issn = {1523-0406}, + url = {http://dx.doi.org/10.1080/15230406.2013.776210}, + doi = {10.1080/15230406.2013.776210}, + shorttitle = {What are we `tweeting' about obesity?}, + abstract = {Public health related tweets are difficult to identify in large conversational datasets like Twitter.com. Even more challenging is the visualization and analyses of the spatial patterns encoded in tweets. This study has the following objectives: how can topic modeling be used to identify relevant public health topics such as obesity on Twitter.com? What are the common obesity related themes? What is the spatial pattern of the themes? What are the research challenges of using large conversational datasets from social networking sites? Obesity is chosen as a test theme to demonstrate the effectiveness of topic modeling using Latent Dirichlet Allocation ({LDA}) and spatial analysis using Geographic Information System ({GIS}). The dataset is constructed from tweets (originating from the United States) extracted from Twitter.com on obesity-related queries. Examples of such queries are ‘food deserts’, ‘fast food’, and ‘childhood obesity’. The tweets are also georeferenced and time stamped. Three cohesive and meaningful themes such as ‘childhood obesity and schools’, ‘obesity prevention’, and ‘obesity and food habits’ are extracted from the {LDA} model. The {GIS} analysis of the extracted themes show distinct spatial pattern between rural and urban areas, northern and southern states, and between coasts and inland states. Further, relating the themes with ancillary datasets such as {US} census and locations of fast food restaurants based upon the location of the tweets in a {GIS} environment opened new avenues for spatial analyses and mapping. 
Therefore the techniques used in this study provide a possible toolset for computational social scientists in general, and health researchers in specific, to better understand health problems from large conversational datasets.}, + pages = {90--102}, + number = {2}, + journaltitle = {Cartography and Geographic Information Science}, + author = {Ghosh, Debarchana (Debs) and Guha, Rajarshi}, + urldate = {2016-07-19}, + date = {2013-03-01}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/S3WJGXET/Ghosh and Guha - 2013 - What are we ‘tweeting’ about obesity Mapping twee.pdf:application/pdf} +} + +@article{hidalgo_building_2009, + title = {The building blocks of economic complexity}, + volume = {106}, + issn = {0027-8424, 1091-6490}, + url = {http://www.pnas.org/content/106/26/10570}, + doi = {10.1073/pnas.0900943106}, + abstract = {For Adam Smith, wealth was related to the division of labor. As people and firms specialize in different activities, economic efficiency increases, suggesting that development is associated with an increase in the number of individual activities and with the complexity that emerges from the interactions between them. Here we develop a view of economic growth and development that gives a central role to the complexity of a country's economy by interpreting trade data as a bipartite network in which countries are connected to the products they export, and show that it is possible to quantify the complexity of a country's economy by characterizing the structure of this network. Furthermore, we show that the measures of complexity we derive are correlated with a country's level of income, and that deviations from this relationship are predictive of future growth. This suggests that countries tend to converge to the level of income dictated by the complexity of their productive structures, indicating that development efforts should focus on generating the conditions that would allow complexity to emerge to generate sustained growth and prosperity.}, + pages = {10570--10575}, + number = {26}, + journaltitle = {Proceedings of the National Academy of Sciences}, + shortjournal = {{PNAS}}, + author = {Hidalgo, César A. and Hausmann, Ricardo}, + urldate = {2016-07-20}, + date = {2009-06-30}, + langid = {english}, + pmid = {19549871}, + keywords = {networks, economic development}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/BSD98SD2/Hidalgo and Hausmann - 2009 - The building blocks of economic complexity.pdf:application/pdf;Snapshot:/home/jeremy/Zotero/storage/EXMG4VVB/10570.html:text/html} +} + +@book{hausmann_atlas_2014, + title = {The Atlas of Economic Complexity: Mapping Paths to Prosperity}, + isbn = {978-0-262-31773-3}, + shorttitle = {The Atlas of Economic Complexity}, + abstract = {Why do some countries grow and others do not? The authors of The Atlas of Economic Complexity offer readers an explanation based on "Economic Complexity," a measure of a society's productive knowledge. Prosperous societies are those that have the knowledge to make a larger variety of more complex products. The Atlas of Economic Complexity attempts to measure the amount of productive knowledge countries hold and how they can move to accumulate more of it by making more complex products.Through the graphical representation of the "Product Space," the authors are able to identify each country's "adjacent possible," or potential new products, making it easier to find paths to economic diversification and growth. 
In addition, they argue that a country's economic complexity and its position in the product space are better predictors of economic growth than many other well-known development indicators, including measures of competitiveness, governance, finance, and schooling.Using innovative visualizations, the book locates each country in the product space, provides complexity and growth potential rankings for 128 countries, and offers individual country pages with detailed information about a country's current capabilities and its diversification options. The maps and visualizations included in the Atlas can be used to find more viable paths to greater productive knowledge and prosperity.}, + pagetotal = {369}, + publisher = {{MIT} Press}, + author = {Hausmann, Ricardo and Hidalgo, César A. and Bustos, Sebastián and Coscia, Michele and Simoes, Alexander and Yildirim, Muhammed A.}, + date = {2014-01-17}, + langid = {english}, + keywords = {Business \& Economics / International / Economics, Business \& Economics / Economics / Macroeconomics} +} + +@article{hood_literature_2001, + title = {The Literature of Bibliometrics, Scientometrics, and Informetrics}, + volume = {52}, + issn = {01389130}, + url = {http://link.springer.com/10.1023/A:1017919924342}, + doi = {10.1023/A:1017919924342}, + pages = {291--314}, + number = {2}, + journaltitle = {Scientometrics}, + author = {Hood, William W. and Wilson, Concepción S.}, + urldate = {2016-07-20}, + date = {2001} +} + +@article{kessler_bibliographic_1963, + title = {Bibliographic coupling between scientific papers}, + volume = {14}, + rights = {Copyright © 1963 Wiley Periodicals, Inc., A Wiley Company}, + issn = {1936-6108}, + url = {http://onlinelibrary.wiley.com/doi/10.1002/asi.5090140103/abstract}, + doi = {10.1002/asi.5090140103}, + abstract = {This report describes the results of automatic processing of a large number of scientific papers according to a rigorously defined criterion of coupling. The population of papers under study was ordered into groups that satisfy the stated criterion of interrelation. An examination of the papers that constitute the groups shows a high degree of logical correlation.}, + pages = {10--25}, + number = {1}, + journaltitle = {American Documentation}, + shortjournal = {Amer. Doc.}, + author = {Kessler, M. M.}, + urldate = {2016-04-20}, + date = {1963-01-01}, + langid = {english}, + file = {Kessler - 1963 - Bibliographic coupling between scientific papers.pdf:/home/jeremy/Zotero/storage/SSZX4B3K/Kessler - 1963 - Bibliographic coupling between scientific papers.pdf:application/pdf} +} + +@article{macy_factors_2002, + title = {From Factors to Actors: Computational Sociology and Agent-Based Modeling}, + volume = {28}, + issn = {0360-0572}, + url = {http://www.jstor.org/stable/3069238}, + shorttitle = {From Factors to Actors}, + abstract = {Sociologists often model social processes as interactions among variables. We review an alternative approach that models social life as interactions among adaptive agents who influence one another in response to the influence they receive. These agent-based models ({ABMs}) show how simple and predictable local interactions can generate familiar but enigmatic global patterns, such as the diffusion of information, emergence of norms, coordination of conventions, or participation in collective action. Emergent social patterns can also appear unexpectedly and then just as dramatically transform or disappear, as happens in revolutions, market crashes, fads, and feeding frenzies. 
{ABMs} provide theoretical leverage where the global patterns of interest are more than the aggregation of individual attributes, but at the same time, the emergent pattern cannot be understood without a bottom up dynamical model of the microfoundations at the relational level. We begin with a brief historical sketch of the shift from "factors" to "actors" in computational sociology that shows how agent-based modeling differs fundamentally from earlier sociological uses of computer simulation. We then review recent contributions focused on the emergence of social structure and social order out of local interaction. Although sociology has lagged behind other social sciences in appreciating this new methodology, a distinctive sociological contribution is evident in the papers we review. First, theoretical interest focuses on dynamic social networks that shape and are shaped by agent interaction. Second, {ABMs} are used to perform virtual experiments that test macrosociological theories by manipulating structural factors like network topology, social stratification, or spatial mobility. We conclude our review with a series of recommendations for realizing the rich sociological potential of this approach.}, + pages = {143--166}, + journaltitle = {Annual Review of Sociology}, + shortjournal = {Annual Review of Sociology}, + author = {Macy, Michael W. and Willer, Robert}, + urldate = {2016-07-20}, + date = {2002} +} + +@book{neef_digital_2014, + location = {Indianapolis, {IN}}, + edition = {1 edition}, + title = {Digital Exhaust: What Everyone Should Know About Big Data, Digitization and Digitally Driven Innovation}, + isbn = {978-0-13-383796-4}, + shorttitle = {Digital Exhaust}, + abstract = {Will "Big Data" supercharge the economy, tyrannize us, or both? Data Exhaust is the definitive primer for everyone who wants to understand all the implications of Big Data, digitally driven innovation, and the accelerating Internet Economy. Renowned digital expert Dale Neef clearly explains: What Big Data really is, and what's new and different about it How Big Data works, and what you need to know about Big Data technologies Where the data is coming from: how Big Data integrates sources ranging from social media to machine sensors, smartphones to financial transactions How companies use Big Data analytics to gain a more nuanced, accurate picture of their customers, their own performance, and the newest trends How governments and individual citizens can also benefit from Big Data How to overcome obstacles to success with Big Data – including poor data that can magnify human error A realistic assessment of Big Data threats to employment and personal privacy, now and in the future Neef places the Big Data phenomenon where it belongs: in the context of the broader global shift to the Internet economy, with all that implies. By doing so, he helps businesses plan Big Data strategy more effectively – and helps citizens and policymakers identify sensible policies for preventing its misuse.   By conservative estimate, the global Big Data market will soar past \$50 billion by 2018. But those direct expenses represent just the "tip of the iceberg" when it comes to Big Data's impact. Big Data is now of acute strategic interest for every organization that aims to succeed – and it is equally important to everyone else. 
Whoever you are, Data Exhaust tells you exactly what you need to know about Big Data – and what to do about it, too.}, + pagetotal = {320}, + publisher = {Pearson {FT} Press}, + author = {Neef, Dale}, + date = {2014-12-01} +} + +@article{friedman_regularization_2010, + title = {Regularization Paths for Generalized Linear Models via Coordinate Descent}, + volume = {33}, + issn = {1548-7660}, + url = {http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2929880/}, + abstract = {We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.}, + pages = {1--22}, + number = {1}, + journaltitle = {Journal of statistical software}, + shortjournal = {J Stat Softw}, + author = {Friedman, Jerome and Hastie, Trevor and Tibshirani, Rob}, + urldate = {2016-07-20}, + date = {2010}, + pmid = {20808728}, + pmcid = {PMC2929880} +} + +@book{james_introduction_2013, + location = {New York}, + title = {An introduction to statistical learning: with applications in R}, + isbn = {978-1-4614-7137-0}, + shorttitle = {An introduction to statistical learning}, + abstract = {"An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra. Provides tools for Statistical Learning that are essential for practitioners in science, industry and other fields. Analyses and methods are presented in R. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, and clustering. 
Extensive use of color graphics assist the reader"--Publisher description.}, + publisher = {Springer}, + author = {James, Gareth and Witten, Daniela and Hastie, Trevor and Tibshirani, Robert}, + date = {2013} +} + +@article{tibshirani_regression_1996, + title = {Regression Shrinkage and Selection via the Lasso}, + volume = {58}, + issn = {0035-9246}, + url = {http://www.jstor.org/stable/2346178}, + abstract = {We propose a new method for estimation in linear models. The `lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.}, + pages = {267--288}, + number = {1}, + journaltitle = {Journal of the Royal Statistical Society. Series B (Methodological)}, + shortjournal = {Journal of the Royal Statistical Society. Series B (Methodological)}, + author = {Tibshirani, Robert}, + urldate = {2016-07-20}, + date = {1996} +} + +@report{bollen_social_2015, + title = {Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science}, + url = {http://www.nsf.gov/sbe/AC_Materials/SBE_Robust_and_Reliable_Research_Report.pdf}, + institution = {National Science Foundation}, + author = {Bollen, Kenneth and Cacioppo, John T. and Kaplan, Robert M. and Krosnick, Jon A. and Olds, James L. and Dean, Heather}, + date = {2015-05} +} + +@article{stodden_toward_2013, + title = {Toward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals}, + volume = {8}, + issn = {1932-6203}, + url = {http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0067111}, + doi = {10.1371/journal.pone.0067111}, + shorttitle = {Toward Reproducible Computational Research}, + abstract = {Journal policy on research data and code availability is an important part of the ongoing shift toward publishing reproducible computational science. This article extends the literature by studying journal data sharing policies by year (for both 2011 and 2012) for a referent set of 170 journals. We make a further contribution by evaluating code sharing policies, supplemental materials policies, and open access status for these 170 journals for each of 2011 and 2012. We build a predictive model of open data and code policy adoption as a function of impact factor and publisher and find higher impact journals more likely to have open data and code policies and scientific societies more likely to have open data and code policies than commercial publishers. We also find open data policies tend to lead open code policies, and we find no relationship between open data and code policies and either supplemental material policies or open access journal status. Of the journals in this study, 38\% had a data policy, 22\% had a code policy, and 66\% had a supplemental materials policy as of June 2012. 
This reflects a striking one year increase of 16\% in the number of data policies, a 30\% increase in code policies, and a 7\% increase in the number of supplemental materials policies. We introduce a new dataset to the community that categorizes data and code sharing, supplemental materials, and open access policies in 2011 and 2012 for these 170 journals.}, + pages = {e67111}, + number = {6}, + journaltitle = {{PLOS} {ONE}}, + shortjournal = {{PLOS} {ONE}}, + author = {Stodden, Victoria and Guo, Peixuan and Ma, Zhaokun}, + urldate = {2016-07-22}, + date = {2013-06-21}, + keywords = {Reproducibility, science policy, computational biology, Open access, Scientific publishing, open data, Computer and information sciences, Data management}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/PIC8KFJE/Stodden et al. - 2013 - Toward Reproducible Computational Research An Emp.pdf:application/pdf;Snapshot:/home/jeremy/Zotero/storage/NTS2JK5S/article.html:text/html} +} + +@article{leveque_reproducible_2012, + title = {Reproducible research for scientific computing: Tools and strategies for changing the culture}, + volume = {14}, + issn = {1521-9615}, + shorttitle = {Reproducible research for scientific computing}, + pages = {13--17}, + number = {4}, + journaltitle = {Computing in Science and Engineering}, + author = {{LeVeque}, Randall J. and Mitchell, Ian M. and Stodden, Victoria}, + date = {2012}, + file = {LeVeque et al. - 2012 - Reproducible research for scientific computing To.pdf:/home/jeremy/Zotero/storage/2FHZTG9Q/LeVeque et al. - 2012 - Reproducible research for scientific computing To.pdf:application/pdf} +} + +@book{wilensky_introduction_2015, + location = {Cambridge, Massachusetts}, + title = {An introduction to agent-based modeling: modeling natural, social, and engineered complex systems with {NetLogo}}, + shorttitle = {An introduction to agent-based modeling}, + publisher = {{MIT} Press}, + author = {Wilensky, Uri and Rand, William}, + urldate = {2016-07-19}, + date = {2015} +} + +@article{welles_minorities_2014, + title = {On minorities and outliers: The case for making Big Data small}, + volume = {1}, + issn = {2053-9517}, + url = {http://bds.sagepub.com/content/1/1/2053951714540613}, + doi = {10.1177/2053951714540613}, + shorttitle = {On minorities and outliers}, + abstract = {In this essay, I make the case for choosing to examine small subsets of Big Data datasets—making big data small. Big Data allows us to produce summaries of human behavior at a scale never before possible. But in the push to produce these summaries, we risk losing sight of a secondary but equally important advantage of Big Data—the plentiful representation of minorities. Women, minorities and statistical outliers have historically been omitted from the scientific record, with problematic consequences. Big Data affords the opportunity to remedy those omissions. However, to do so, Big Data researchers must choose to examine very small subsets of otherwise large datasets. 
I encourage researchers to embrace an ethical, empirical and epistemological stance on Big Data that includes minorities and outliers as reference categories, rather than the exceptions to statistical norms.}, + pages = {2053951714540613}, + number = {1}, + journaltitle = {Big Data \& Society}, + author = {Welles, Brooke Foucault}, + urldate = {2016-07-23}, + date = {2014-04-01}, + langid = {english}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/SS8P2JN4/Welles - 2014 - On minorities and outliers The case for making Bi.pdf:application/pdf;Snapshot:/home/jeremy/Zotero/storage/M2HTAVP2/2053951714540613.html:text/html} +} + +@book{hansen_analyzing_2010, + location = {Burlington, Massachusetts}, + title = {Analyzing social media networks with {NodeXL}: Insights from a connected world}, + shorttitle = {Analyzing social media networks with {NodeXL}}, + publisher = {Morgan Kaufmann}, + author = {Hansen, Derek and Shneiderman, Ben and Smith, Marc A.}, + urldate = {2016-07-18}, + date = {2010} +} + +@inproceedings{asur_predicting_2010, + title = {Predicting the Future with Social Media}, + volume = {1}, + doi = {10.1109/WI-IAT.2010.63}, + abstract = {In recent years, social media has become ubiquitous and important for social networking and content sharing. And yet, the content that is generated from these websites remains largely untapped. In this paper, we demonstrate how social media content can be used to predict real-world outcomes. In particular, we use the chatter from Twitter.com to forecast box-office revenues for movies. We show that a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. We further demonstrate how sentiments extracted from Twitter can be utilized to improve the forecasting power of social media.}, + eventtitle = {2010 {IEEE}/{WIC}/{ACM} International Conference on Web Intelligence and Intelligent Agent Technology ({WI}-{IAT})}, + pages = {492--499}, + booktitle = {2010 {IEEE}/{WIC}/{ACM} International Conference on Web Intelligence and Intelligent Agent Technology ({WI}-{IAT})}, + author = {Asur, S. and Huberman, B. A.}, + date = {2010-08}, + keywords = {Web sites, Social Media, attention, prediction, social media content, content sharing, social networking (online), market-based predictors, Twitter.com, social networking}, + file = {IEEE Xplore Abstract Record:/home/jeremy/Zotero/storage/AT38MBGW/articleDetails.html:text/html;IEEE Xplore Abstract Record:/home/jeremy/Zotero/storage/NAPSZ9F4/login.html:text/html;IEEE Xplore Full Text PDF:/home/jeremy/Zotero/storage/5XINGQC4/Asur and Huberman - 2010 - Predicting the Future with Social Media.pdf:application/pdf} +} + +@article{blei_latent_2003, + title = {Latent dirichlet allocation}, + volume = {3}, + url = {http://dl.acm.org/citation.cfm?id=944937}, + pages = {993--1022}, + journaltitle = {The Journal of Machine Learning Research}, + author = {Blei, David M. and Ng, Andrew Y. and Jordan, Michael I.}, + urldate = {2015-12-03}, + date = {2003}, + file = {Blei et al. - 2003 - Latent dirichlet allocation.pdf:/home/jeremy/Zotero/storage/2K3E7TJH/Blei et al. - 2003 - Latent dirichlet allocation.pdf:application/pdf} +} + +@article{dimaggio_exploiting_2013, + title = {Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. 
government arts funding}, + volume = {41}, + issn = {0304422X}, + url = {http://linkinghub.elsevier.com/retrieve/pii/S0304422X13000661}, + doi = {10.1016/j.poetic.2013.08.004}, + shorttitle = {Exploiting affinities between topic modeling and the sociological perspective on culture}, + pages = {570--606}, + number = {6}, + journaltitle = {Poetics}, + author = {{DiMaggio}, Paul and Nag, Manish and Blei, David}, + urldate = {2016-01-02}, + date = {2013-12}, + langid = {english}, + file = {exploiting-affinities.pdf:/home/jeremy/Zotero/storage/7D8NAGNB/exploiting-affinities.pdf:application/pdf} +} + +@inproceedings{cheng_can_2014, + location = {New York, {NY}, {USA}}, + title = {Can cascades be predicted?}, + isbn = {978-1-4503-2744-2}, + url = {http://doi.acm.org/10.1145/2566486.2567997}, + doi = {10.1145/2566486.2567997}, + series = {{WWW} '14}, + abstract = {On many social networking web sites such as Facebook and Twitter, resharing or reposting functionality allows users to share others' content with their own friends or followers. As content is reshared from user to user, large cascades of reshares can form. While a growing body of research has focused on analyzing and characterizing such cascades, a recent, parallel line of work has argued that the future trajectory of a cascade may be inherently unpredictable. In this work, we develop a framework for addressing cascade prediction problems. On a large sample of photo reshare cascades on Facebook, we find strong performance in predicting whether a cascade will continue to grow in the future. We find that the relative growth of a cascade becomes more predictable as we observe more of its reshares, that temporal and structural features are key predictors of cascade size, and that initially, breadth, rather than depth in a cascade is a better indicator of larger cascades. This prediction performance is robust in the sense that multiple distinct classes of features all achieve similar performance. We also discover that temporal features are predictive of a cascade's eventual shape. Observing independent cascades of the same content, we find that while these cascades differ greatly in size, we are still able to predict which ends up the largest.}, + pages = {925--936}, + booktitle = {Proceedings of the 23rd International Conference on World Wide Web}, + publisher = {{ACM}}, + author = {Cheng, Justin and Adamic, Lada and Dow, P. Alex and Kleinberg, Jon Michael and Leskovec, Jure}, + urldate = {2015-04-06}, + date = {2014}, + keywords = {cascade prediction, contagion, information diffusion}, + file = {Cheng et al. - 2014 - Can Cascades Be Predicted.pdf:/home/jeremy/Zotero/storage/KPPCCRXU/Cheng et al. - 2014 - Can Cascades Be Predicted.pdf:application/pdf} +} + +@article{pedregosa_scikit-learn:_2011, + title = {Scikit-learn: Machine learning in python}, + volume = {12}, + url = {http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html}, + shorttitle = {Scikit-learn}, + abstract = {Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and {API} consistency. It has minimal dependencies and is distributed under the simplified {BSD} license, encouraging its use in both academic and commercial settings. 
Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.}, + pages = {2825--2830}, + journaltitle = {Journal of Machine Learning Research}, + author = {Pedregosa, Fabian and Varoquaux, Gaël and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and Vanderplas, Jake and Passos, Alexandre and Cournapeau, David and Brucher, Matthieu and Perrot, Matthieu and Duchesnay, Édouard}, + urldate = {2016-06-07}, + date = {2011-10}, + note = {bibtex: pedregosa\_scikit-learn:\_2011}, + file = {Scikit-learn\: Machine Learning in Python:/home/jeremy/Zotero/storage/6XS2PM2P/Pedregosa et al. - 2011 - Scikit-learn Machine Learning in Python.pdf:application/pdf} +} + +@article{zimmer_okcupid_2016, + title = {{OkCupid} Study Reveals the Perils of Big-Data Science}, + url = {https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/}, + abstract = {The data of 70,000 {OKCupid} users is now searchable in a database. Ethicist Michael T Zimmer explains why it doesn't matter that it was "already public."}, + journaltitle = {{WIRED}}, + author = {Zimmer, Michael}, + urldate = {2016-08-31}, + date = {2016-05-14}, + file = {Snapshot:/home/jeremy/Zotero/storage/KV5P4IA9/okcupid-study-reveals-perils-big-data-science.html:text/html} +} + +@article{merton_matthew_1968, + title = {The Matthew effect in science}, + volume = {159}, + url = {http://www.unc.edu/~fbaum/teaching/PLSC541_Fall06/Merton_Science_1968.pdf}, + pages = {56--63}, + number = {3810}, + journaltitle = {Science}, + author = {Merton, Robert K.}, + urldate = {2014-09-27}, + date = {1968}, + file = {[PDF] from unc.edu:/home/jeremy/Zotero/storage/B3H2PG6R/Merton - 1968 - The Matthew effect in science.pdf:application/pdf} +} + +@article{barabasi_emergence_1999, + title = {Emergence of Scaling in Random Networks}, + volume = {286}, + issn = {0036-8075, 1095-9203}, + url = {http://science.sciencemag.org/content/286/5439/509}, + doi = {10.1126/science.286.5439.509}, + abstract = {Systems as diverse as genetic networks or the World Wide Web are best described as networks with complex topology. A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature was found to be a consequence of two generic mechanisms: (i) networks expand continuously by the addition of new vertices, and (ii) new vertices attach preferentially to sites that are already well connected. 
A model based on these two ingredients reproduces the observed stationary scale-free distributions, which indicates that the development of large networks is governed by robust self-organizing phenomena that go beyond the particulars of the individual systems.}, + pages = {509--512}, + number = {5439}, + journaltitle = {Science}, + author = {Barabási, Albert-László and Albert, Réka}, + urldate = {2016-10-06}, + date = {1999-10-15}, + langid = {english}, + pmid = {10521342}, + file = {Barabási and Albert - 1999 - Emergence of Scaling in Random Networks.pdf:/home/jeremy/Zotero/storage/D4DAX5XA/Barabási and Albert - 1999 - Emergence of Scaling in Random Networks.pdf:application/pdf;Snapshot:/home/jeremy/Zotero/storage/JETSMGUZ/509.html:text/html} +} + +@article{rosvall_mapping_2010, + title = {Mapping Change in Large Networks}, + volume = {5}, + issn = {1932-6203}, + url = {http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0008694}, + doi = {10.1371/journal.pone.0008694}, + abstract = {Change is a fundamental ingredient of interaction patterns in biology, technology, the economy, and science itself: Interactions within and between organisms change; transportation patterns by air, land, and sea all change; the global financial flow changes; and the frontiers of scientific research change. Networks and clustering methods have become important tools to comprehend instances of these large-scale structures, but without methods to distinguish between real trends and noisy data, these approaches are not useful for studying how networks change. Only if we can assign significance to the partitioning of single networks can we distinguish meaningful structural changes from random fluctuations. Here we show that bootstrap resampling accompanied by significance clustering provides a solution to this problem. To connect changing structures with the changing function of networks, we highlight and summarize the significant structural changes with alluvial diagrams and realize de Solla Price's vision of mapping change in science: studying the citation pattern between about 7000 scientific journals over the past decade, we find that neuroscience has transformed from an interdisciplinary specialty to a mature and stand-alone discipline.}, + pages = {e8694}, + number = {1}, + journaltitle = {{PLOS} {ONE}}, + shortjournal = {{PLOS} {ONE}}, + author = {Rosvall, Martin and Bergstrom, Carl T.}, + urldate = {2016-07-08}, + date = {2010-01-27}, + keywords = {Medicine and health sciences, Behavioral neuroscience, neuroscience, Algorithms, Structure of markets, Molecular neuroscience, Simulated annealing, Cellular neuroscience}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/79Q8AFD4/Rosvall and Bergstrom - 2010 - Mapping Change in Large Networks.pdf:application/pdf;Snapshot:/home/jeremy/Zotero/storage/7Z6NMBHX/article.html:text/html} +} + +@inproceedings{tufekci_big_2014, + title = {Big Questions for social media big data: Representativeness, validity and other methodological pitfalls}, + isbn = {978-1-57735-657-8}, + shorttitle = {Big Questions for social media big data}, + abstract = {Large-scale databases of human activity in social media have captured scientific and policy attention, producing a flood of research and discussion. This paper considers methodological and conceptual challenges for this emergent field, with special attention to the validity and representativeness of social media big data analyses. 
Persistent issues include the over-emphasis of a single platform, Twitter, sampling biases arising from selection by hashtags, and vague and unrepresentative sampling frames. The sociocultural complexity of user behavior aimed at algorithmic invisibility (such as subtweeting, mock-retweeting, use of "screen captures" for text, etc.) further complicate interpretation of big data social media. Other challenges include accounting for field effects, i.e. broadly consequential events that do not diffuse only through the network under study but affect the whole society. The application of network methods from other fields to the study of human social activity may not always be appropriate. The paper concludes with a call to action on practical steps to improve our analytic capacity in this promising, rapidly-growing field.. Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.}, + eventtitle = {Proceedings of the 8th International Conference on Weblogs and Social Media, {ICWSM} 2014}, + pages = {505--514}, + author = {Tufekci, Z.}, + date = {2014} +} + +@article{lazer_parable_2014, + title = {The Parable of Google Flu: Traps in Big Data Analysis}, + volume = {343}, + rights = {Copyright © 2014, American Association for the Advancement of Science}, + issn = {0036-8075, 1095-9203}, + url = {http://science.sciencemag.org/content/343/6176/1203}, + doi = {10.1126/science.1248506}, + shorttitle = {The Parable of Google Flu}, + abstract = {In February 2013, Google Flu Trends ({GFT}) made headlines but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that {GFT} was predicting more than double the proportion of doctor visits for influenza-like illness ({ILI}) than the Centers for Disease Control and Prevention ({CDC}), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened despite the fact that {GFT} was built to predict {CDC} reports. Given that {GFT} is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error? +Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data. +Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data.}, + pages = {1203--1205}, + number = {6176}, + journaltitle = {Science}, + author = {Lazer, David and Kennedy, Ryan and King, Gary and Vespignani, Alessandro}, + urldate = {2016-10-06}, + date = {2014-03-14}, + langid = {english}, + pmid = {24626916}, + file = {Full Text PDF:/home/jeremy/Zotero/storage/UFHNQF8W/Lazer et al. - 2014 - The Parable of Google Flu Traps in Big Data Analy.pdf:application/pdf} +} + +@article{boyd_critical_2012, + title = {Critical questions for big data}, + volume = {15}, + issn = {1369-118X}, + url = {http://dx.doi.org/10.1080/1369118X.2012.678878}, + doi = {10.1080/1369118X.2012.678878}, + abstract = {The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Diverse groups argue about the potential benefits and costs of analyzing genetic sequences, social media interactions, health records, phone logs, government records, and other digital traces left by people. Significant questions emerge. 
Will large-scale search data help us create better tools, services, and public goods? Or will it usher in a new wave of privacy incursions and invasive marketing? Will data analytics help us understand online communities and political movements? Or will it be used to track protesters and suppress speech? Will it transform how we study human communication and culture, or narrow the palette of research options and alter what ‘research’ means? Given the rise of Big Data as a socio-technical phenomenon, we argue that it is necessary to critically interrogate its assumptions and biases. In this article, we offer six provocations to spark conversations about the issues of Big Data: a cultural, technological, and scholarly phenomenon that rests on the interplay of technology, analysis, and mythology that provokes extensive utopian and dystopian rhetoric.}, + pages = {662--679}, + number = {5}, + journaltitle = {Information, Communication \& Society}, + author = {boyd, danah and Crawford, Kate}, + urldate = {2016-08-09}, + date = {2012}, + file = {boyd and Crawford - 2012 - Critical Questions for Big Data.pdf:/home/jeremy/Zotero/storage/XEM23ZJG/boyd and Crawford - 2012 - Critical Questions for Big Data.pdf:application/pdf} +} + +@book{silver_signal_2015, + location = {New York, New York}, + title = {The Signal and the Noise: Why So Many Predictions Fail--but Some Don't}, + isbn = {978-0-14-312508-2}, + shorttitle = {The Signal and the Noise}, + abstract = {One of Wall Street Journal's Best Ten Works of Nonfiction in 2012   New York Times Bestseller “Not so different in spirit from the way public intellectuals like John Kenneth Galbraith once shaped discussions of economic policy and public figures like Walter Cronkite helped sway opinion on the Vietnam War…could turn out to be one of the more momentous books of the decade.” —New York Times Book Review   "Nate Silver's The Signal and the Noise is The Soul of a New Machine for the 21st century." —Rachel Maddow, author of Drift "A serious treatise about the craft of prediction—without academic mathematics—cheerily aimed at lay readers. Silver's coverage is polymathic, ranging from poker and earthquakes to climate change and terrorism." —New York Review of Books Nate Silver built an innovative system for predicting baseball performance, predicted the 2008 election within a hair’s breadth, and became a national sensation as a blogger—all by the time he was thirty. He solidified his standing as the nation's foremost political forecaster with his near perfect prediction of the 2012 election. Silver is the founder and editor in chief of {FiveThirtyEight}.com.  Drawing on his own groundbreaking work, Silver examines the world of prediction, investigating how we can distinguish a true signal from a universe of noisy data. Most predictions fail, often at great cost to society, because most of us have a poor understanding of probability and uncertainty. Both experts and laypeople mistake more confident predictions for more accurate ones. But overconfidence is often the reason for failure. If our appreciation of uncertainty improves, our predictions can get better too. This is the “prediction paradox”: The more humility we have about our ability to make predictions, the more successful we can be in planning for the future.In keeping with his own aim to seek truth from data, Silver visits the most successful forecasters in a range of areas, from hurricanes to baseball, from the poker table to the stock market, from Capitol Hill to the {NBA}. 
He explains and evaluates how these forecasters think and what bonds they share. What lies behind their success? Are they good—or just lucky? What patterns have they unraveled? And are their forecasts really right? He explores unanticipated commonalities and exposes unexpected juxtapositions. And sometimes, it is not so much how good a prediction is in an absolute sense that matters but how good it is relative to the competition. In other cases, prediction is still a very rudimentary—and dangerous—science.Silver observes that the most accurate forecasters tend to have a superior command of probability, and they tend to be both humble and hardworking. They distinguish the predictable from the unpredictable, and they notice a thousand little details that lead them closer to the truth. Because of their appreciation of probability, they can distinguish the signal from the noise.With everything from the health of the global economy to our ability to fight terrorism dependent on the quality of our predictions, Nate Silver’s insights are an essential read.}, + pagetotal = {560}, + publisher = {Penguin Books}, + author = {Silver, Nate}, + date = {2015} +} + +@online{sandvig_why_2016, + title = {Why I Am Suing the Government}, + url = {https://socialmediacollective.org/2016/07/01/why-i-am-suing-the-government/}, + titleaddon = {Social Media Collective Research Blog}, + type = {Web Log}, + author = {Sandvig, Christian}, + urldate = {2016-10-23}, + date = {2016-07-01}, + file = {Snapshot:/home/jeremy/Zotero/storage/9USUHHJB/why-i-am-suing-the-government.html:text/html} +} + +@book{domingos_master_2015, + location = {New York, New York}, + title = {The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World}, + shorttitle = {The Master Algorithm}, + abstract = {Algorithms increasingly run our lives. They find books, movies, jobs, and dates for us, manage our investments, and discover new drugs. More and more, these algorithms work by learning from the trails of data we leave in our newly digital world. Like curious children, they observe us, imitate, and experiment. And in the world’s top research labs and universities, the race is on to invent the ultimate learning algorithm: one capable of discovering any knowledge from data, and doing anything we want, before we even ask.Machine learning is the automation of discovery—the scientific method on steroids—that enables intelligent robots and computers to program themselves. No field of science today is more important yet more shrouded in mystery. Pedro Domingos, one of the field’s leading lights, lifts the veil for the first time to give us a peek inside the learning machines that power Google, Amazon, and your smartphone. He charts a course through machine learning’s five major schools of thought, showing how they turn ideas from neuroscience, evolution, psychology, physics, and statistics into algorithms ready to serve you. Step by step, he assembles a blueprint for the future universal learner—the Master Algorithm—and discusses what it means for you, and for the future of business, science, and society.If data-ism is today’s rising philosophy, this book will be its bible. The quest for universal learning is one of the most significant, fascinating, and revolutionary intellectual developments of all time. 
A groundbreaking book, The Master Algorithm is the essential guide for anyone and everyone wanting to understand not just how the revolution will happen, but how to be at its forefront.}, + pagetotal = {354}, + publisher = {Basic Books}, + author = {Domingos, Pedro}, + date = {2015} +} + +@inproceedings{arun_finding_2010, + title = {On Finding the Natural Number of Topics with Latent Dirichlet Allocation: Some Observations}, + isbn = {978-3-642-13656-6}, + url = {https://link.springer.com/chapter/10.1007/978-3-642-13657-3_43}, + doi = {10.1007/978-3-642-13657-3_43}, + series = {Lecture Notes in Computer Science}, + shorttitle = {On Finding the Natural Number of Topics with Latent Dirichlet Allocation}, + abstract = {It is important to identify the “correct” number of topics in mechanisms like Latent Dirichlet Allocation({LDA}) as they determine the quality of features that are presented as features for classifiers like {SVM}. In this work we propose a measure to identify the correct number of topics and offer empirical evidence in its favor in terms of classification accuracy and the number of topics that are naturally present in the corpus. We show the merit of the measure by applying it on real-world as well as synthetic data sets(both text and images). In proposing this measure, we view {LDA} as a matrix factorization mechanism, wherein a given corpus C is split into two matrix factors M1 and M2 as given by Cd*w = M1d*t x Qt*w. Where d is the number of documents present in the corpus and w is the size of the vocabulary. The quality of the split depends on “t”, the right number of topics chosen. The measure is computed in terms of symmetric {KL}-Divergence of salient distributions that are derived from these matrix factors. We observe that the divergence values are higher for non-optimal number of topics – this is shown by a ’dip’ at the right value for ’t’.}, + eventtitle = {Pacific-Asia Conference on Knowledge Discovery and Data Mining}, + pages = {391--402}, + booktitle = {Advances in Knowledge Discovery and Data Mining}, + publisher = {Springer, Berlin, Heidelberg}, + author = {Arun, R. and Suresh, V. and Madhavan, C. E. Veni and Murthy, M. N. Narasimha}, + urldate = {2017-07-06}, + date = {2010-06-21}, + langid = {english}, + file = {Arun et al. - 2010 - On Finding the Natural Number of Topics with Laten.pdf:/home/jeremy/Zotero/storage/EMMCNH7F/Arun et al. - 2010 - On Finding the Natural Number of Topics with Laten.pdf:application/pdf} +} \ No newline at end of file diff --git a/paper/resources/vc-git b/paper/resources/vc-git new file mode 100755 index 0000000..557a573 --- /dev/null +++ b/paper/resources/vc-git @@ -0,0 +1,24 @@ +#!/bin/sh +# This is file 'vc' from the vc bundle for TeX. +# The original file can be found at CTAN:support/vc. +# This file is Public Domain. + +# Parse command line options. +full=0 +mod=0 +while [ -n "$(echo $1 | grep '-')" ]; do + case $1 in + -f ) full=1 ;; + -m ) mod=1 ;; + * ) echo 'usage: vc [-f] [-m]' + exit 1 + esac + shift +done +# English locale. +LC_ALL=C +git --no-pager log -1 HEAD --pretty=format:"Hash: %H%nAbr. Hash: %h%nParent Hashes: %P%nAbr. 
Parent Hashes: %p%nAuthor Name: %an%nAuthor Email: %ae%nAuthor Date: %ai%nCommitter Name: %cn%nCommitter Email: %ce%nCommitter Date: %ci%n" |gawk -v script=log -v full=$full -f ~/bin/vc-git.awk > vc +if [ "$mod" = 1 ] +then + git status |gawk -v script=status -f ~/bin/vc-git.awk >> vc +fi diff --git a/paper/resources/vc-git.awk b/paper/resources/vc-git.awk new file mode 100644 index 0000000..66b3526 --- /dev/null +++ b/paper/resources/vc-git.awk @@ -0,0 +1,89 @@ +# This is file 'vc-git.awk' from the vc bundle for TeX. +# The original file can be found at CTAN:support/vc. +# This file is Public Domain. +BEGIN { + +### Process output of "git status". + if (script=="status") { + modified = 0 + } + +} + + + +### Process output of "git log". +script=="log" && /^Hash:/ { Hash = substr($0, 2+match($0, ":")) } +script=="log" && /^Abr. Hash:/ { AbrHash = substr($0, 2+match($0, ":")) } +script=="log" && /^Parent Hashes:/ { ParentHashes = substr($0, 2+match($0, ":")) } +script=="log" && /^Abr. Parent Hashes:/ { AbrParentHashes = substr($0, 2+match($0, ":")) } +script=="log" && /^Author Name:/ { AuthorName = substr($0, 2+match($0, ":")) } +script=="log" && /^Author Email:/ { AuthorEmail = substr($0, 2+match($0, ":")) } +script=="log" && /^Author Date:/ { AuthorDate = substr($0, 2+match($0, ":")) } +script=="log" && /^Committer Name:/ { CommitterName = substr($0, 2+match($0, ":")) } +script=="log" && /^Committer Email:/ { CommitterEmail = substr($0, 2+match($0, ":")) } +script=="log" && /^Committer Date:/ { CommitterDate = substr($0, 2+match($0, ":")) } + +### Process output of "git status". +### Changed index? +script=="status" && /^# Changes to be committed:/ { modified = 1 } +### Unstaged modifications? +script=="status" && /^# Changed but not updated:/ { modified = 2 } + + + +END { + +### Process output of "git log". + if (script=="log") { +### Standard encoding is UTF-8. + if (Encoding == "") Encoding = "UTF-8" +### Extract relevant information from variables. + LongDate = substr(AuthorDate, 1, 25) + DateRAW = substr(LongDate, 1, 10) + DateISO = DateRAW + DateTEX = DateISO + gsub("-", "/", DateTEX) + Time = substr(LongDate, 12, 14) +### Write file identification to vc.tex. + print "%%% This file has been generated by the vc bundle for TeX." + print "%%% Do not edit this file!" + print "%%%" +### Write Git specific macros. + print "%%% Define Git specific macros." + print "\\gdef\\GITHash{" Hash "}%" + print "\\gdef\\GITAbrHash{" AbrHash "}%" + print "\\gdef\\GITParentHashes{" ParentHashes "}%" + print "\\gdef\\GITAbrParentHashes{" AbrParentHashes "}%" + print "\\gdef\\GITAuthorName{" AuthorName "}%" + print "\\gdef\\GITAuthorEmail{" AuthorEmail "}%" + print "\\gdef\\GITAuthorDate{" AuthorDate "}%" + print "\\gdef\\GITCommitterName{" CommitterName "}%" + print "\\gdef\\GITCommitterEmail{" CommitterEmail "}%" + print "\\gdef\\GITCommitterDate{" CommitterDate "}%" +### Write generic version control macros. + print "%%% Define generic version control macros." + print "\\gdef\\VCRevision{\\GITAbrHash}%" + print "\\gdef\\VCAuthor{\\GITAuthorName}%" + print "\\gdef\\VCDateRAW{" DateRAW "}%" + print "\\gdef\\VCDateISO{" DateISO "}%" + print "\\gdef\\VCDateTEX{" DateTEX "}%" + print "\\gdef\\VCTime{" Time "}%" + print "\\gdef\\VCModifiedText{\\textcolor{red}{with local modifications!}}%" + print "%%% Assume clean working copy." + print "\\gdef\\VCModified{0}%" + print "\\gdef\\VCRevisionMod{\\VCRevision}%" + } + +### Process output of "git status". 
+ if (script=="status") { + print "%%% Is working copy modified?" + print "\\gdef\\VCModified{" modified "}%" + if (modified==0) { + print "\\gdef\\VCRevisionMod{\\VCRevision}%" + } else { + print "\\gdef\\VCRevisionMod{\\VCRevision~\\VCModifiedText}%" + } + } + +} diff --git a/paper/ugmm8a.pfb b/paper/ugmm8a.pfb new file mode 100644 index 0000000..79a60e6 Binary files /dev/null and b/paper/ugmm8a.pfb differ diff --git a/paper/ugmmi8a.pfb b/paper/ugmmi8a.pfb new file mode 100644 index 0000000..19bb435 Binary files /dev/null and b/paper/ugmmi8a.pfb differ diff --git a/paper/ugmr8a.pfb b/paper/ugmr8a.pfb new file mode 100644 index 0000000..0017029 Binary files /dev/null and b/paper/ugmr8a.pfb differ diff --git a/paper/ugmri8a.pfb b/paper/ugmri8a.pfb new file mode 100644 index 0000000..9b90d31 Binary files /dev/null and b/paper/ugmri8a.pfb differ
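
Note on the two resources/vc-git* files above: the shell script runs "git log" (and, with -m, "git status") at build time, pipes the output through vc-git.awk, and writes a file named "vc" that defines the \GIT...* and \VC...* macros printed in the awk source. How the chapter's own preamble consumes that file is not part of this commit, so the lines below are only a minimal, hypothetical sketch of one way the generated macros could be used; the document class, the xcolor package, and the \input{vc} call are assumptions made for illustration, not anything taken from this repository.

    % hypothetical consumer of the "vc" file written by resources/vc-git
    \documentclass{article}
    \usepackage{xcolor}   % needed only because \VCModifiedText expands to \textcolor{red}{...}
    \input{vc}            % picks up \VCRevision, \VCDateISO, \VCTime, \VCRevisionMod, ...
    \begin{document}
    Draft built from revision \VCRevisionMod, committed on \VCDateISO\ at \VCTime.
    \end{document}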